Why Can't We Reverse Engineer .DOC?
Good question.
I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete on the Office market, I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC, it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft is placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the everpresent Office monopoly.
So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?
Aaargh... everything going wrong again (Score:1)
The first posts from the last 136 stories:
.DOC not exactly proprietary (Score:5)
Uh duh (Score:5)
The thing is, DOC is a compound file format, meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean; it'll just pass them on to Visio, Excel, Photoshop, etc. to read and understand.
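For anyone who wants to poke at this, here is a minimal sketch of recognizing the compound-file container from its header. It only identifies the wrapper and makes no attempt to interpret the streams inside, which is the hard part. The function names are made up for illustration; the 8-byte signature and the sector-shift field at offset 30 are as I understand the published compound file header layout.

```python
import struct

# Signature at the start of an OLE compound (structured storage) file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def looks_like_compound_file(data: bytes) -> bool:
    """Return True if the buffer starts with the compound-file signature."""
    return data[:8] == OLE_MAGIC

def sector_size(data: bytes) -> int:
    """Read the sector shift (little-endian uint16 at offset 30) and
    return the sector size in bytes. Assumes a valid header."""
    if not looks_like_compound_file(data) or len(data) < 32:
        raise ValueError("not a compound file header")
    (shift,) = struct.unpack_from("<H", data, 30)
    return 1 << shift

# Hypothetical hand-built header fragment, just enough for the two helpers:
# signature, 16-byte CLSID, minor version, major version, byte order, sector shift.
sample = OLE_MAGIC + bytes(16) + b"\x3e\x00" + b"\x03\x00" + b"\xfe\xff" + b"\x09\x00"
```

Knowing a file is a compound file tells you nothing about what each embedded stream means, which is exactly the point being made above.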
DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything... especially Windows applications.
And no, that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.
If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc. to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed some Python code or an xpaint image in their document.
Re:.DOC not exactly proprietary (Score:2)
Or you could register at developer.microsoft.com (yes, I know, that hurts
But since not even m$ can get different versions of Word to read each other's files, there is only a slim chance that someone else will get it right. So far I haven't seen anyone who has got it right.
Ok, here we go again... (Score:2)
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer?
Well, considering DOC can store ANYTHING, including the description of 3D objects, yes.
I'd guess this depends more on the design behind said file format. If one of the main goals of the
I see, Microsoft == Evil, so DOC must have been created to obfuscate. Very smart of you.
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
but I wouldn't say that it's impossible
Yes, those poor, poor companies like SUN with their open software like Java and Corel Office need to band together and blow up Microsoft. Resistance is not futile!
Please.
DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything. Serialization of COM services will be XML based rather than binary based as they are today as well. Just don't complain when your documents are 100MB.
.DOC (Score:1)
How do you explain the various programs for Linux that all read and MS Word
This article seems to be just FUD.
A Site About File Formats (Score:5)
http://www.wotsit.org [wotsit.org]
Re:Ok, here we go again... (Score:5)
Well I can't imagine why but Microsoft, on the other hand, has a strong profit motive. Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments. If they could just use a filter then they wouldn't have to upgrade from Word 6 or whatever was the last version that actually offered them new features they needed.
An obfuscated format means that filters are hard to write, so such people are forced to upgrade, which == cash for Bill. In fact, according to M$ this is their single biggest source of revenue.
I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
Undoubtedly, if they're ever forced to release it. In fact, since you mention it, the release of the source code would be useful almost exclusively for the .h files with the data structures in them. Frankly, who gives a damn about the rest of the code? I can write my own bugs, thanks.
DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything.
Which means that at some point they'll start changing the definition of XML to close out competitors. They've always taken this approach, why do you think they won't this time?
When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"
TWW
DMCA? (Score:1)
OK, OK, you are going to say that
Oh, I forgot one minor detail. The government is in bed with the movie industry, not the software industry. So it's ok to bypass the encryption on anything except mp3s and dvd.
nuclear cia fbi spy password code encrypt president bomb
Two responses predicted (Score:2)
Response2: Yeah! Let's just do it!
This question misses the whole point. The problem (from following the AbiWord list for a while) is not that the .doc file format needs to be reverse engineered, it's that the format is such a piece of crap that you can't implement the spec.
Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial. :)
--
Re:.DOC not exactly proprietary (Score:2)
"Documented" part of format... (Score:2)
When I used StarOffice I saw horribly broken formatting that was magically cured when I installed Microsoft/Monotype fonts on my Linux box. This suggests that Word formatting is very inflexible with regard to changes in the parameters of the media (as opposed to, say, TeX, which will adapt to any size of anything as long as it makes sense), and every slight difference in algorithms (the never-documented ones, not the "packaging") can cause horrendous miscalculation of the formatting.
hmm (Score:1)
LAOLA is GPL (Score:1)
___
Re:Ok, here we go again... (Score:1)
Re:Ok, here we go again... (Score:3)
You mean like this one? [microsoft.com]
When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"
When twits like you attack M$ for the wrong reasons it makes it harder to get the unobsessed to listen to the valid complaints against M$.
DataViz has been doing this for years (Score:1)
Apple used to bundle MacLinkPlus with MacOS, so any Mac user could open any file from any program -- PC or Mac. (I used to annoy PC users by using my Mac PowerBook to translate files for them that they couldn't open, from programs that they didn't have and that weren't even available for the Mac, e.g., Lotus AmiPro. The stuff works.) Apple doesn't bundle it any more (?!) for their own inscrutable reasons.
There is no Linux version (yet) of DataViz's translator package, but they do offer translation packages for Palm users, so there's some indication that they're open to addressing "non-traditional" platforms if they see a market. I have hope.
Re:Two responses predicted (Score:3)
Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial.
Care to point out these 65000 bugs that relate to DOC formats?
Re:.DOC not exactly proprietary (Score:2)
The problem there is that only part of
Re:.DOC (Score:3)
BTW remember when Office 97 came out and could not save to an Office 95 .doc format? It actually saved to RTF but gave the file a .doc extension. Corel's WP could save to the real Office 95 .doc, which made it more MS-compatible than MS was.
Personally I think MS is using its illegal monopolistic practices to make calls to secret Windows APIs to give it an advantage.
Re:Ok, here we go again... (Score:1)
Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments.
The last time the DOC format changed was 1997.
2000 - 1997 is 3, which is not less than 1 last time I checked.
Re:Ok, here we go again... (Score:1)
>>rich format that deals with the description of 3D
>>objects. Could decoding a file format that deals
>>with text and its presentation really be that
>>much more difficult to reverse engineer?
>Well considering DOC can store ANYTHING -
>including the description of 3D objects yes
OK, about a minute ago you said in an earlier post that
The Problem is MS's doc (Score:1)
The problem with MS Word is that the way to see how a command or comment would work is to try it on the screen. Will inserting this graphic cause the rest of the page to lose alignment? Not sure? Try it out!
This is fine for people using the software, but a nightmare for other people trying to write compatible software. Try it! Take out your old copy of MS 5.0 and write a fairly complex document (a 5-6 page research paper with graphs and annotation is a good example). Take that and use MS 6.0 to read it. Even MS themselves can't maintain consistency of conversion. That's because they basically made the document format up as they went along - no formal software engineering specs were ever written. If they were, then they obviously weren't detailed enough.
Contrast that with TeX. You don't have to copy a single line of TeX source to create your own TeX compiler. All you have to do is examine the picky formatting tests, and ensure that your code reproduces the desired results. And the specs were designed sensibly, if a little idiosyncratically.
The binary format of the .doc file is hardly the issue!!
Word for Lawyers (Score:1)
Re:DMCA? (Score:1)
Similarly, they could XOR-obfuscate the released code, and any attempt at REing the .doc format would be considered a violation of the DMCA.
Of course, MS would never do that.
Why Corel, Lotus, et al haven't banded together... (Score:2)
The answer for why the big office suite vendors haven't banded together in the same manner as the OpenDWG Alliance seems pretty self-evident to me. I'm sure that each of these software manufacturers has at one time or another signed an NDA with regard to the MS Office file formats. Once they did that, they were precluded from sharing that information amongst themselves. End of question.
As for why they signed those NDAs? Again self-evident: early access. If Corel or Lotus wanted to be able to support the new file formats in a timely fashion, they needed to know what the spec was well in advance -- TechNet doesn't get that sort of new information fast enough. For that matter, when you subscribe to TechNet, you're signing a limited NDA with Microsoft; I'd check the fine print before I depended upon TechNet information...
Are you moderating this down because you disagree with it,
Re:A Site About File Formats (Score:1)
http://www.halyava.ru/document/ind_form
It's Russian-based, but much of the info is in English.
the real opendwg url... (Score:1)
there should be a policy of the minimum number of cups of coffee the poster has to drink before posting...
On The Subject Of AutoCAD Compatibility (Score:2)
For my purposes, and the purposes of the company for which I work: what good are reverse-engineered DWG file formats if you still can't get a good, affordable CAD package on anything but Ms-Win? My company is presently standardizing on applications. And (I'm sure MS would be overjoyed to hear this) it looks like the Unix boxen are on their way out. Why? One of the reasons is AutoCrap. It's available only on Ms-Win. Our customers and vendors demand files in AutoCrap format. There are no price-competitive CAD packages available for Unix anymore. (Bentley has dropped support for MicroStation on Unix--in case you didn't know. Note to Bentley: you screwed up! By dropping MicroStation for Unix you removed any incentive for us to consider your product.) So bye-bye to our reliable, low-TCO Unix workstations and X-terminals :-(.
So even though we're evaluating StarOffice to use instead of MS Office, and even though we're evaluating non-MS email clients and other non-MS client apps: even if these pan out the Unix environment is still probably doomed because of AutoCrap :-(. (Then there's Visio and other stuff.)
IMO many vendors, by not making their apps available on non-MS platforms, are missing the boat by failing to differentiate themselves from the run-of-the-mill "Me too! I do Microsoft" crowd. With things happening like the surge of interest in Linux as potentially a viable workstation platform, Solaris for free and Sun hardware getting quite affordable: this seems to me to be narrow-minded. Particularly wrt to vendors like Bentley--who already had Unix versions of their products.
Sigh...
Re:Ok, here we go again... (Score:1)
You don't need to pay $$$ to read MS word documents, you can download a free reader for most of the MS Office formats from microsoft.
Only if you are a Windows user.
The vast majority of Windows users don't know about this, and wouldn't try it if they did. Closed source software discourages attempts to understand the system you're using and makes simple solutions like this daunting to their users. Oops, off-topic.
TWW
About OpenDWG (Score:3)
As far as reverse engineering the file format goes, it's all but impossible. Now that UCITA is here it will get even tougher. I just hope AutoCAD knows not to shoot itself in the foot by suing its own users. If the problem ever amounted to a threat to AutoCAD's market share there would probably be quite a backlash.
Re:Two responses predicted (Score:1)
Maybe.
Re:Ok, here we go again... (Score:1)
The last time the DOC format changed was 1997.
The copy of Word 97 in our office chokes on Word 2000 files. Perhaps there is another reason for this. I no longer use it so I didn't look too deeply into the subject. What other reason do you know of that this would happen?
TWW
Re:.DOC not exactly proprietary (Score:5)
1. Physically reading a storage file is not the problem. Making sense out of the streams in the file much more so ...
2. The Word 97 *was* on the MSDN CDs. Microsoft has pulled it about two years ago. (So much for keeping hundreds of old MSDN CDs around ...)
3. The Word 2000 additions have never been documented in public.
4. The MSDN documentation is vague and sometimes plain wrong.
You get about 85% of a Word converter from coding along the Microsoft docs. It's the remaining 15% that's the hard thing.
-Martin
Re:.DOC (Score:2)
As a figment of your imagination. I've tried them all and they all fail almost as soon as you leave the area of text only docs. In fact none of them print even text only docs well enough for professional use.
TWW
For those interested... (Score:1)
The source code for MFC comes with Visual Studio and the Windows Platform SDK.
You can also download the complete source code for Wordpad [microsoft.com] from MSDN (Under sample applications).
Re:Word for Lawyers (Score:1)
Ever tried using Microsoft Equation? (It comes with Word/Office.)
Re:Two responses predicted (Score:1)
Re:.DOC (Score:2)
they are good at importing simple
Oh, sure, it's "documented" and "open" (Score:4)
So, to properly read and write
1) run Windows and Word
2) run MacOS and Word
3) port COM to another OS and write a Word-alike
Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside,
If he did it on Unix or BeOS or something, he should speak up.
Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.
Re:For those interested... (Score:2)
WordPad calls upon the text import filters installed by Windows and Microsoft Office to convert .DOC files to RTF and then reads the RTF file.
-Martin
Re:.DOC not exactly proprietary (Score:2)
If that's true (and I'm not saying it is--I don't believe everything I read on Slashdot) then that spec doesn't help much. Sure, you could use it to write the converter but it might land you in jail (and living in Norway apparently wouldn't protect you).
--
You are all both wrong and right. (Score:3)
The truth is that those specifications are inaccurate and incomplete with regard to Word 97 and 2K. Every person who has tried to implement an import filter has run into that problem. The end result is that you sit down and create Word documents on one PC (or a virtual PC with VMware), then go through them with a hex editor to figure out what symbol does what.
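The hex-editor routine described above (save a document, change one thing, save again, compare) can be partly automated. This is a rough sketch of the comparison step only, assuming you already have the two saved files as bytes; the helper name is hypothetical and nothing here knows anything about Word.

```python
def diff_runs(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return contiguous (start, end) byte ranges where the two buffers differ.

    Only the overlapping prefix is compared; if the files have different
    lengths, the extra tail of the longer one is not reported.
    """
    runs: list[tuple[int, int]] = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            continue
        if runs and runs[-1][1] == i:
            # Extend the current run of differing bytes.
            runs[-1] = (runs[-1][0], i + 1)
        else:
            runs.append((i, i + 1))
    return runs

# Stand-in buffers; in practice these would be two saves of the same document
# with a single formatting change between them.
before = b"hello world"
after = b"heLLo worlD"
changed = diff_runs(before, after)
```

A plain byte diff like this only helps when the change does not shift everything after it, so it is a starting point for staring at offsets, not a decoder.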
To put that all in perspective the two paragraphs above save to 1 KB ( minimum displayed file size on Win98 ) in HTM or text format. In MSWord
Everybody who does this reverse engineering has to start from scratch. Every company that tries to read *.doc files has to put people to work doing it. Combining efforts would be very prudent. Let's start by getting the Open Source teams together on this; then we can invite IBM, Corel, Sun, etc. to join.
We need someone to advocate the benefits of an LGPL or even BSD licensed library set to corps who must otherwise do it all themselves. This is what ESR is useful for, so go and call him.
If .doc is so well documented, why can't I use... (Score:3)
1. .DOC is documented, this question is lame FUD. Quit bashing Microsoft.
Well, if it's so well documented, then why can't I open a Word document in WordPerfect? And please don't tell me it's because the Word document can contain embedded things like Excel and Access parts. I'm just talking about a regular word processing document with text and a little formatting. Our MIS guys tell me it does work, but they apparently got this information from the WordPerfect 8 packaging rather than from experimentation, because it doesn't work on my computer and they have been unable to show me where it works on theirs.
2. Why are you picking on poor Microsoft? Do you really think they would purposely obfuscate their own code and make it difficult not only on the rest of the world, but themselves as well? Do you really think they're purposely trying to make it difficult for other companies to use the .DOC format?
Um, well yes, that's exactly what I think. What planet have you people been living on for the last 20 years? Of course Microsoft wants to make it difficult for other word processors to use its format. They pretty much have a monopoly in the Office arena and they want to keep it. If you could go out and buy WordPerfect for $100 less than Word and still be able to use the .DOC format perfectly, how would that help Microsoft? They have done things like this in the past and they will continue to do them as long as they can.
On a more positive note, I'll say that I do think that Microsoft Office is a good product. I mean it works and it does a lot of cool stuff (even though that makes it bloated). The problem is in the way in which Microsoft has used the power that Office has given them, not in the product itself. And I'm not just bashing Microsoft. I fully believe that if Sun or Corel were in their place they'd be doing the same thing. The bottom line is that consumers are suffering because of proprietary formats. This is one of the big reasons why computers have not made us more productive (or at least as productive as we could be). I can't count the number of hours I've spent simply trying to convert documents from one format to another.
Re:Two responses predicted (Score:2)
Can you please explain why MS can read the document graphics, but can't maintain format consistency? They seem to have improved a lot in this regard, but so what? All the other guys trying to write a compatible editor are exactly in the position MS itself was a few years ago.
The point is that MS's .doc is a joke specification, if it ever was at all. Sure you could read the files, but the specification is NOT COMPLETE. That is why many people are having a hard time converting.
Re:'Everpresent Office monopoly'? (Score:2)
So because MS wants to keep out competitors, it is entitled to make you find another job simply because you wanted to exercise your choice in software. In my book, hurting innocent people is EVIL!
Why can't we reverse engineer HTML? (Score:5)
The
The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.
Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
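To make the parse-versus-display split concrete, here is a small sketch using Python's standard `html.parser`. The class and function names are made up for the example. It shows that parsing gives you tags and text and nothing else: no fonts, no line breaks, no positions. Everything a reader actually sees is decided later by the renderer.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collect the structural events in a document, ignoring blank text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))

def parse_outline(markup: str):
    """Parse markup and return the list of (kind, value) events."""
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.events

events = parse_outline("<p>Hello <b>world</b></p>")
```

Two programs can agree on this event list exactly and still put "world" in different places on the page, which is the same situation a .DOC importer ends up in.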
Proprietary need not be easy to reverse-engineer (Score:2)
Not necessarily true. A format can be encrypted with PGP and a connection to the Internet may be required to read a document encoded with this format. Try and reverse engineer that.
Re:.DOC (Score:3)
Microsoft Must Have Specs. (Score:2)
The obfuscation isn't actually in the
Yet, this readability has been maintained from Office 95 thru Office 97 to Office 2000. (Let's not even talk about Word for Mac!)
This just isn't possible unless Microsoft has internal conformance specifications that they follow from revision to revision.
We know the specs exist because it literally would have been impossible for Microsoft to have functioned without them.
98% of Word documents don't use any advanced Word features. In fact, 98% of Word documents should be saved in RTF format, and lose nothing of value in the translation. With these specifications, the #1 thing companies could do would be to implement a DOC->RTF filter *at the mail gateway* and be done with 98% of macro viruses.
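A gateway filter like that has two halves: spotting the attachments and converting them. The sketch below covers only the first half, using Python's standard `email` package and the compound-file signature; the conversion to RTF is the part that needs the specs and is not attempted. The function names and the sample message are hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage

# Signature at the start of an OLE compound file (the container Word uses).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def ole_attachments(raw_message: bytes) -> list[str]:
    """Return filenames of message parts whose decoded payload starts with
    the compound-file signature. Detection only; nothing is converted."""
    msg = message_from_bytes(raw_message)
    flagged = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if payload and payload[:8] == OLE_MAGIC:
            flagged.append(part.get_filename() or "(unnamed)")
    return flagged

def build_sample() -> bytes:
    """Build a made-up message with one fake .doc and one RTF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "quarterly report"
    msg.set_content("See attached.")
    msg.add_attachment(OLE_MAGIC + bytes(16), maintype="application",
                       subtype="msword", filename="report.doc")
    msg.add_attachment(b"{\\rtf1 hello}", maintype="application",
                       subtype="rtf", filename="notes.rtf")
    return msg.as_bytes()

flagged = ole_attachments(build_sample())
```

Checking the payload bytes rather than the file extension matters here, since a .doc extension says little about what is actually inside the file.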
Will it happen? Nah. The Word Monopoly is just too critical to Microsoft's success. It really is.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
Re:Ok, here we go again... (Score:4)
I think the point is that you have to pay Microsoft the full price of the office suite for the 'privilege' of using newer document formats. That effectively limits the life of your software purchase so that you have to buy a completely new copy whenever there is a document format change - at that point, why not just use it as your primary version?
THAT is where the rub lies - at that point, you start sending out copies that can only be read in the newer version, and your colleagues begin upgrading as well. It's an endless cycle.
People want to break that cycle so that they can either use a competing Office program based only on its merits or stick with a previous version which they feel was better than the next version (ie. Mac users who upgraded to Word 6, but wish they had stuck with the previous version) .
I believe this is a worthy goal.
- Jeff A. Campbell
- VelociNews (http://www.velocinews.com [velocinews.com])
Re:Ok, here we go again... (Score:2)
- Jeff A. Campbell
- VelociNews (http://www.velocinews.com [velocinews.com])
Re:.DOC not exactly proprietary (Score:3)
This is true, although many people and projects have done a fairly good job - I wasn't trying to say that the format is totally freely available, more of a "What is this question doing here except trying to flame Microsoft?".
2. The Word 97 *was* on the MSDN CDs. Microsoft has pulled it about two years ago. (So much for keeping hundreds of old MSDN CDs around ...)
I wasn't aware it had dropped off. But then a lot of information has dropped off the MSDN CD's in favour of a link to www.microsoft.com. I'm willing to bet the Word 97 format is still on there somewhere.
3. The Word 2000 additions have never been documented in public.
I'm of the understanding that there weren't any (or at least very few). From what I heard, the additions were a few minor features, certainly nothing that would cause interoperability issues. But then what I've heard could be wrong...
4. The MSDN documentation is vague and sometimes plain wrong.
As is the GNU documentation, and sometimes the Perl documentation (actually it's a lot better lately) and... well I could go on. Developers hate documenting internals. I don't blame the Microsofties for that. Documenting things is boring. I'd rather add an animated paperclip ;-)
Re:About OpenDWG (Score:2)
Re:On The Subject Of AutoCAD Compatibility (Score:2)
Try LinuxCAD [linuxcad.com]. At $99.00 it's quite competitive with AutoCAD for Windows.
If you only do 2d work you can get off even cheaper with QCAD [qcad.org]
It's very simple really... (Score:2)
It's very simple really. Unlike Autodesk, which uses some form of logic to create their file formats, Microsoft uses heavy encryption seeded with a semi-random number.
This number is based on the millions of dollars Bill Gates is worth at the year of release. In fact, the file formats for Office 95, 97, and 2000 are identical - it's just that Bill Gates has been worth more at the time of their release, so the file was encrypted differently.
This is why it's so important for the Microsoft stock price to jump around: if it stood still then the file formats wouldn't change, which means people wouldn't buy the latest version of Office, which means the stock price wouldn't change, and so forth in an infinite loop.
Re:Proposal: An alternative posibility. (Score:2)
This is the only answer to the problem and W3C is a great example. It takes the control away from Microsoft, a company that uses the spec as a means of driving upgrade sales and maintaining their monopoly (the real purpose of
The idea for a common doc format could be marketed successfully based on two points.
First, a common doc format would let companies and individuals save large amounts of money by not having to upgrade to the latest version of Word every two years. This would impact a company's bottom line.
Second, a common doc format would provide companies and individuals with a level of "insurance" that older document types that hold important data would not at some point in the future become obsolete.
Neither of these points even brings up the obvious benefit to the rest of us that use non-MS systems. It would increase competition in the Word Processing arena and would probably move us towards a world where
sure. (Score:2)
--
Re:.DOC not exactly proprietary (Score:2)
Re:Two responses predicted (Score:3)
"Those things that you think of as bugs? Those are not bugs. They are actually hot grits. Which are in my pants."
would have been considerably less lame than the actual post made. Just my two cents.
--
-jacob
Re:'Everpresent Office monopoly'? (Score:2)
Mmmm...send them files in TeX format?
Seriously though, knowing what the person can read on the other end and sending them that is courteous. Unfortunately, too many people think that everyone uses what they do.
Re:Ok, here we go again... (Score:2)
From Halloween [opensource.org]: "OSS projects have been able to gain a foothold in many server applications because of the wide utility of highly commoditized, simple protocols. By extending these protocols and developing new protocols, we can deny OSS projects entry into the market."
We *KNOW* that MS is specifically making protocols, not to enhance the experience of the user or add capabilities (although these may also be done sometimes), but to decrease the ability of free software to interoperate.
Re:Why can't we reverse engineer HTML? (Score:2)
And arguably in both cases this is because people are asking the program/format to do more than it was ever intended to. Both html browsers and word processors were originally intended to format documents dynamically and squish them into shape using some fairly general parameters of window/page size, font, etc. The problem is that people are now turning around and trying to use both as detailed page description formats that place each letter or object precisely on the screen. Given the underlying assumptions of the renderer, it shouldn't be surprising that this doesn't work right. If you really want to fix the words onto the page, use a desktop publishing program or convert to PDF.
Re:Oh, sure, it's "documented" and "open" (Score:4)
Yes. It's Microsoft's Component Object Model. A formalized descendant of Object Linking and Embedding, which was originally a method of making compound documents with Word and Excel.
.DOC is an OLE Structured Storage [desaware.com] format which can store data streams meant for other programs, like Visio. Those programs also do not have open formats.
The practice of passing around Word documents in Email because "everyone must be able to read them, right?" is a problem. If someone sends you a document in their favorite proprietary format, you should send them back a document in your favorite proprietary format. Maybe then people will start to understand the need for open, well-documented formats.
And isn't that a tragedy.
Re:Ok, here we go again... (Score:2)
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
It's not harder for them to read, because all they have to do when they make mods to the format is make changes to the DLL that handles parsing DOC files at the same time. Only the people who work on the file format itself have to work with the contents of a DOC file directly (at microsoft, that is), and everyone else just deals with the data after it's been parsed out.
Incidentally, back in the days of the Amiga (Oh no, here we go down nostalgia lane again) lots of people used IFF format for their files, which is an extensible hunk-based file format where you can include any kind of data, so a palette file can be an IFF with only the palette hunk, whereas an image has both the palette and the image hunks. Even Amiga binaries seemed to be either IFF or something closely akin to it, which is an observation I made purely by watching powerpacker go off on my binaries back in the days before I had a hard disk, and may be purely BS.
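For comparison, here is roughly what walking an IFF file looks like. This is a sketch under the usual description of the format: a `FORM` header, a big-endian length, a 4-byte form type, then a sequence of pieces (usually called chunks in the IFF docs) each with a 4-byte ID, a big-endian length, and a pad byte when the length is odd. The function name and the sample data are made up, and nested FORMs are not handled.

```python
import struct

def iff_chunks(data: bytes):
    """Return (form_type, [(chunk_id, chunk_bytes), ...]) for a buffer
    holding a single top-level IFF FORM."""
    if len(data) < 12 or data[:4] != b"FORM":
        raise ValueError("not an IFF FORM")
    (form_len,) = struct.unpack_from(">I", data, 4)
    form_type = data[8:12].decode("ascii")
    end = min(8 + form_len, len(data))
    pos = 12
    chunks = []
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4].decode("ascii")
        (size,) = struct.unpack_from(">I", data, pos + 4)
        chunks.append((chunk_id, data[pos + 8:pos + 8 + size]))
        # Each chunk is padded to an even length.
        pos += 8 + size + (size & 1)
    return form_type, chunks

# Hypothetical tiny file: a 3-byte palette chunk (plus pad) and a 2-byte body.
content = (b"ILBM"
           + b"CMAP" + struct.pack(">I", 3) + b"\x01\x02\x03" + b"\x00"
           + b"BODY" + struct.pack(">I", 2) + b"\xaa\xbb")
sample = b"FORM" + struct.pack(">I", len(content)) + content
form_type, chunks = iff_chunks(sample)
```

A reader that does not recognize a chunk ID can skip it using the length field alone, which is what lets a palette-only file and a full image share one container. As far as I know, Amiga executables use a separate "hunk" format rather than IFF proper.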
Re:Ok, here we go again... (Score:2)
I think it's an expedient combination: using object serialization for I/O makes it easy for Microsoft to read/write data, makes it difficult for competitors to do anything with the format on other platforms, and forces users to upgrade their copies of Office with every new release.
This is, in fact, at the heart of what people complain about with Microsoft: Microsoft adopts strategies that give them a quick time-to-market, lock users into upgrade paths, and that are also effectively exclusionary. I wouldn't necessarily call that deliberately "evil". I'm sure many people at Microsoft view it as the natural way of doing software development, and they view everybody else in the industry who bothers with standardized or well-documented formats as people who foolishly waste time and money.
DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything. Serialization of COM services will be XML based rather than binary based as they are today as well.
While it may help a little, serializing objects in XML format will not necessarily result in formats that are significantly more readable, accessible, or backwards compatible. To make sense of a big and complex XML model, you still need a formal definition of what it is.
This is really an issue for users and customers: users should insist that their data is in well-documented formats that remain constant and compatible across releases. That's why many government offices have insisted on using SGML in the past.
Using serialization for document storage is simply poor engineering, whether it is done by Sun or by Microsoft or by anybody else. Skipping the step of formally defining a storage format is expedient to the company but harmful to users. In the long run, users have too much invested in their content to store it in such an ephemeral format.
Re:Uh duh (Score:2)
I really hope I get a response, Slashdot blows for how it handles the user info screen... you need to remember how many replies you had to a given comment instead of having it know that you clicked on it when it was at 3 replies and give you some kind of visual cue that there are now 5 or 7 or 23 :-)
The thing is DOC is a compound file format. Meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean, it'll just pass them on to Visio, Excel, Photoshop etc to read and understand.
If the spec is right then, I should be able to import my .doc files, creating tables, lists, text, all formatting and most graphics (.gif, .jpg, .png, etc.) without any trouble, as Word doesn't need any part of any other program to do this. Why can't I?
I agree fully with you that Word doesn't handle most of the complex streams (excel data, powerpoint data, visio data, etc.) but in my documents I don't have any of these, it's all text and a lot of formatting, which Word would have to handle on its own.
Well, I'll be darned.... (Score:4)
At one time, you could download the specs for the binary file format. Now, according to:
http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP [microsoft.com]
You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.
For what it's worth.
Re:Microsoft Must Have Specs. (Score:2)
Actually, Microsoft doesn't need to have specs at all. It just needs to carry along a bunch of legacy code that gets glued into successive versions, perhaps with some API modifications or a compatibility layer. "Conformance" to such a non-spec can be determined by regression testing.
A surprising number of software projects cook along for many years and through many revisions without ever having complete specs. And though the lack of specs may be bad, code re-use is usually a Good Thing, specs or no.
MS Word File Format is here (Score:3)
Specification != Documentation(Samba-like problem) (Score:3)
1) As I think several people have touched on, the problem here isn't the documentation, since Microsoft through MSDN etc. has documented the Word file format. The problem is that the only specifications on how to correctly render the Word documents are the Word rendering engine itself. Without the ability to see the exact logic that Word uses to render certain formatting codes (read: source code), it is impossible to reverse-engineer a 100%-compatible converter/viewer. It is a similar situation to what the Samba team faces: the SMB/CIFS protocols have been documented by Microsoft, but the only implementation of those protocols is Windows NT/2000, so Samba in reality must be coded to re-implement NT, not implement the CIFS specifications. The difference here, of course, is that CIFS apparently has a complete spec that Microsoft simply ignores, rather than the Word situation where they purposefully keep people in the dark on how things should be done.
2) The reason that you can't just watch what the Word rendering engine does and duplicate it is because it's stupid. From my experience working with Word itself and wvWare to convert Word files to HTML, it's obvious that Word just throws odd formatting codes wherever it pleases, and never bothers to clean them up. Often tags to end bold formatting (converted to </b> by wvWare) are just randomly placed in the document, nowhere near where any bolding is supposed to occur. The same goes for font sizing/coloring: Word seems to place odd, irrelevant font codes in places, only to override them with the correct codes a few lines later (often without canceling the first codes). In other words, it's a mess. With the Word source code, one may be able to figure out the (supposed) logic behind the mess; without it, I fear anyone is simply grasping at straws, especially since MS's continuous changes to Office keep everyone guessing about what Word is actually doing underneath it all.
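To make the stray end-bold problem concrete, here is a minimal sketch of a cleanup pass over converted HTML. It is not how wvWare or Word handles the codes; the function name and the idea of post-processing the output are assumptions made purely for illustration. It drops any </b> that arrives when no <b> is open.

```python
import re

# Split on bold tags while keeping the tags themselves as tokens.
TAG = re.compile(r"(</?b>)")

def drop_stray_bold_closers(html: str) -> str:
    """Remove </b> tags that have no matching open <b> before them."""
    out = []
    depth = 0
    for token in TAG.split(html):
        if token == "<b>":
            depth += 1
        elif token == "</b>":
            if depth == 0:
                continue  # stray closer: nothing is open, so skip it
            depth -= 1
        out.append(token)
    return "".join(out)

print(drop_stray_bold_closers("plain</b> text <b>bold</b> more</b>"))
```

A pass like this only hides the symptom. It cannot tell whether the stray code was meant to end some earlier formatting that the converter failed to pick up, which is the deeper complaint here.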
My US$0.02 of course.
Re:why would you want to? (Score:2)
Now I have tried WordPerfect 8 for Linux, and the Word filter does not work on more than half the documents that I have. StarOffice 5.1 does a pretty good job of this, and from what I hear it is getting better. However, I know that if you start doing some complex things in Word then StarOffice may not read all of the document. They are working on this though. Apparently StarOffice 5.2 is supposed to have pretty good support for Word files.
On another note, there are several open source projects working on reading these formats, one of which I believe is called AbiWord. Although its native output will not be Word, last time I talked with them they were working on a Word filter.
send flames > /dev/null
More info on .DOC format (Score:2)
There is a lot of confusion here about whether or not the .DOC format has been documented, because there are two layers to the file format. First, there is the Word document format itself, which Microsoft has published in some MSDN CD versions. It is also available from places like www.wotsit.org. This specification is inaccurate in places but close enough to make Word document conversion possible. Caolan McNamara has a very good start on a Word-to-HTML converter at www.wvware.com. The Word document format changed in the transition from Word 6 to Word 97, and is the same in Word 2000.
However, Word documents since version 6 are wrapped in OLE Compound Documents, which Microsoft also uses for .XLS files. The Compound Document format is not officially documented anywhere in Microsoft documentation, as far as I can tell. (But see below for a patent that might disclose this structure...) The MSDN library samples invariably use Windows system calls to access data in Compound Documents, and reveal nothing about the file format.
There have been some efforts to reverse-engineer this format:
http://arturo.directmail.org/filtersweb/ [directmail.org] and
http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/guide.html [tu-berlin.de].
A Compound Document contains a tree structure of data streams, which seems like a simple enough structure but it is implemented using a very complex file format. The lack of complete documentation of this format is a major impediment to development of robust open-source code that will access the Microsoft Office file formats.
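As a small taste of what reading the container involves, the sketch below checks the 8-byte signature at the start of a compound file and pulls a few fixed header fields. The offsets follow the compound file layout as it was later documented publicly, so treat them as an assumption relative to what was available when this was posted. The function name is made up for the example, and the hard part (walking the sector allocation tables and the directory tree) is not attempted.

```python
import struct

# Signature bytes found at the start of OLE compound files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def sniff_compound_header(data: bytes):
    """Return basic header fields, or None if this is not a compound file."""
    if len(data) < 32 or data[:8] != OLE_MAGIC:
        return None
    # Offsets 24..31: minor version, major version, byte-order mark, sector shift.
    minor, major, byte_order, sector_shift = struct.unpack_from("<HHHH", data, 24)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
    }

# Synthetic 32-byte header prefix for demonstration (not a real document).
fake = OLE_MAGIC + bytes(16) + struct.pack("<HHHH", 0x3E, 3, 0xFFFE, 9)
print(sniff_compound_header(fake))
print(sniff_compound_header(b"{\\rtf1 plain RTF, not a compound file}"))
```

Everything past this point (finding the directory, following sector chains for each stream) is where the complexity described above lives.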
A second potential impediment is a nest of patents that Microsoft has built around the Compound Document format. These are just a few:
US5467472: Method and system for generating and maintaining property sets with unique format identifiers
US5715441: Method and system for storing and accessing data in a compound document using object linking
US5506983: Method and system for transactioning of modifications to a tree structured file
US5706504: Method and system for storing data objects using a small object data stream
There are a fair number of patents (IBM seems to have some possibly related ones as well). You can find them here: http://patent.womplex.ibm.com/home [ibm.com]. A search for "((compound document) and microsoft)" lists 24 patents. It would not be surprising if a serious effort to provide open-source access to Microsoft Office documents ran into legal threats because of these patents.
Interestingly, the last one looks like it might disclose the Compound Document format, which Microsoft would have to disclose to satisfy the patent office. The description looks right, but the diagrams do not seem to be available from the IBM site. Looks like I'll have to dig some more -- anyone know how to get the full text and images for U.S. Patent 5,706,504?
Re:.DOC (Score:2)
Several versions of the
I was somewhat involved in the reverse engineering of one of the
It turned out that there were some very small similarities, but not enough to be very helpful to us.
Reverse engineering a
The best way to do it is to start with small files: Start with a file with 1 letter, then two letters, then three.
Then make one of the letters bold, then make one italic, then make one bold italics. Then put each letter in a cell in a table, and so on ad infinitum.
Between each step, do a hex dump and compare the files. Eventually every thing starts to fall into place.
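The compare step can be automated rather than eyeballed. Below is a minimal sketch using Python's standard difflib to list the byte regions that changed between two saves; the sample byte strings are invented stand-ins for real files, and the function name is an assumption for this example.

```python
from difflib import SequenceMatcher

def changed_regions(old: bytes, new: bytes):
    """List (tag, offset_in_old, old_hex, new_hex) for every non-matching region."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, i1, old[i1:i2].hex(), new[j1:j2].hex())
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Stand-ins for "file with three letters" and "same file with one more letter".
before = b"HDR\x00\x01abc\x00END"
after = b"HDR\x00\x02abcd\x00END"

for region in changed_regions(before, after):
    print(region)
```

In practice you would read the two saved files with open(path, "rb").read() and look for fields that move in step with the edit you made, such as a length counter that grows by one when a letter is added.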
After that's done, then write a converter or dumper for a
Depending on how diligent you are, you can probably get 99% of it.
Personally, I've done about all the reverse-engineering that I want to do, so I'm not going to do it, but if someone wants to follow these instructions, it's probably the easiest way to go. Also, I'd keep the Word 97 specs handy so you can see any similarities that have been carried over from that version to the latest.
Good luck.
Re:Better idea (Score:2)
Re:Ok, here we go again... (Score:2)
This comment is meaningless. Any file can store anything. BFD. Does DOC have predefined data structures to store a 3D database? No. It does have the ability to 'serialize' (is this a Java only term?) OLE/whatever objects. Not at all the same.
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?
I don't know, maybe to make more money? I try to stay relatively sane when it comes to MS bashing, but doesn't it only make sense that if the file format is what is locking your product into the market, you will do everything you can to keep it a secret? Autodesk did it with DWG (I had to muck around with reading DWGs a couple years ago and there was incredibly little info out there).
Microsoft are moving to XML based everything. Serialization of com services will be XML based rather binary based as they are today as well. Just don't complain when your documents are 100MB.
I won't be complaining since all my XML docs will be gzip'ed and all my apps will automagically decompress them before reading them. XML is a standard data markup format, but just wait until MS goes crazy with its DTD. Just because you can parse a file doesn't mean you have a clue as to how it works.
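The gzip point is easy to check with the standard library. This is a rough sketch with an invented, highly repetitive document; real documents will compress less dramatically, and the helper names are assumptions for the example.

```python
import gzip

def pack_xml(xml_text: str) -> bytes:
    """Compress an XML document for storage."""
    return gzip.compress(xml_text.encode("utf-8"))

def unpack_xml(blob: bytes) -> str:
    """Transparently decompress before handing the text to a parser."""
    return gzip.decompress(blob).decode("utf-8")

# Invented sample: markup-heavy XML with lots of repeated tag names.
doc = "<doc>" + "<p style='normal'>Hello, world.</p>" * 1000 + "</doc>"
packed = pack_xml(doc)

print(len(doc.encode("utf-8")), "bytes raw,", len(packed), "bytes gzipped")
```

Compression takes care of the size objection, but it does nothing for the second point: without the DTD and a description of what the elements mean, the decompressed file is still opaque.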
-1 (Offtopic) (Score:5)
Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power, but which they don't want you to know about. Consider:
Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.
The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.
Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As a large corporation, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:
It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!
Re:'Everpresent Office monpoly'? (Score:2)
--
Re:.DOC not exactly proprietary (Score:2)
That's because there aren't any "additions." Word 2000 is 100% backward-compatible with the Word 97 format.
--
Re:Ok, here we go again... (Score:2)
Bento in OpenDoc Re:Stimpy - Yee-u eediot!) (Score:2)
I worked on Bento. I was not the designer. Jed Harris was the designer (Ira Reuben the coder). Jed said Bento was an experimental first cut prototype that was pushed into production, and I agree with this view.
The design goal was only rather similar to *.doc. Unfortunately, since Bento was a version one prototype, it never had a redesign for ease in reading and writing until I designed one.
Gat1024: And that was designed from the get-go to be cross platform.
It was technically cross platform, but Bento was very unfriendly as a clearly understandable format. Its big mistake was to use physical stream embedding instead of logical embedding, so the recursive flow of control was a nightmare to analyze. The format had physically discontiguous streams embedded inside other physically discontiguous streams, which would give almost anyone the shudders.
Gat1024: Think of it as component hell. And it is unavoidable no matter who does it. This goes for KOffice as well. Complexity is a run away train. I should say entropy. Since we're tending towards chaos here.
You are correct that every open format can embed opaque content that cannot be understood, so all component systems suffer from the risk of component hell.
I would not accept any amount of money to reverse engineer the Office doc format as a regular job, because it would tend to be too hard and frustrating to deal with the complexity under ongoing changes.
Furthermore, I would not trust any junior engineer who did accept such a job, so I would avoid the product based on such work, under the theory it would be fragile and buggy. Am I a pessimist, or what?
David McCusker, former Bento guy
Restating the question, one more time. (Score:2)
I really think this was my entire point. They both do the same thing. The only difference is that one (TeX) is an open format dating to the early 80s, while the other (.doc) is a proprietary format that changes every 2-3 years. They both do the same thing, so what possible justification could there be for using the second? Assuming, for a moment, that M$ is, as they claim, concerned with producing real benefits to their customers, I don't see any point to .doc. Do you? If so, please explain it.
Re:Microsoft Must Have Specs. (Score:2)
Let's face it. Most people just don't need all that shit in their document. Bulleted lists and tables satisfy most people's needs. Whatever happened to optimize for the common case?
Re:'Everpresent Office monpoly'? (Score:2)
Because their customers expect them to make decisions that make their software better for the user, particularly when those decisions would come at little or no (or negative, in the case of maintaining a consistent document format) cost to Microsoft. The fact that Microsoft repeatedly changed the Word format costs themselves and their competitors money for additional programming work on filter and import/export code, and costs their users money for repeated unnecessary upgrades and incompatibility hassles with other programs. Looking at the Microsoft+competitors+users system as a whole, there is no benefit to anyone for Microsoft to use a poorly documented, convoluted format without an accurate public specification.
However, looking at MS, competitors, and users independently, it's obvious that while the value of the system as a whole is reduced by Microsoft's decisions, the handicap that it gives to competitors and the additional revenues it generates from users causes more of that value to end up as cash in Microsoft's hands.
This isn't the way a free market is supposed to work. If someone makes an inferior product, I'm supposed to be able to switch to a different producer and not be adversely impacted by said product. (And as a side effect, my readily available choices encourage all producers not to produce inferior products.) Unfortunately, when you add network effects, i.e. the requirement that my new product be compatible with the old, suddenly Microsoft has the ability to use an existing large marketshare as its own "benefit", to make it self-sustaining, to reduce or eliminate that choice.
I'm not saying that, after thinking about it, it doesn't make sense for Microsoft to do just that. I'm just saying that, to consumers used to having a wide selection of companies competing solely based on price and quality for their purchasing dollars, it certainly counts as "unexpected".
Re:"Documented" part of format... (Score:2)
Re:Uh duh (Score:2)
That's actually not necessarily the case; OLE has a mechanism known as "View Caching" which keeps a snapshot of the embedded data in a form which can be displayed or printed in the document without the generating application/control needing to be present on the machine that's viewing the document. You can't edit it, but you can view it or print it - and unless you need to mess with the document (most people just read it), that's enough.
Simon
Re:non-conformant XML (Score:2)
Office output is fully XML/XSL transform compliant - which is why Opera can handle it perfectly fine.
Also, a lot of the stuff in there is for round-tripping; it doesn't get used by a browser for display - the XSL transform just deletes it to all intents and purposes.
Simon
Re:Microsoft Must Have Specs. (Score:2)
--dan
Re:Read the Halloween Documents and come back (Score:2)
The facts being that Vinod Valloppillil wasn't a Microsoft VP, nor even anywhere near that. He was a grunt.
Now if you'd said he was a Microsoft V V, then I'd have to agree with you.
I can sit down right here and now and write a document that claims the best way for Microsoft to make money is to take Linus, strap him to a chair with electrodes on his testicles, and fry him like a bug on a hotplate.
This document would get leaked.
Does this mean that this is happening in real life? Well, goddamnit YES! Linus is strapped to a chair! Right now! With electrodes on his testicles!
Funny how you never saw the leaked document which says that Microsoft would be better off if they gave all the lower-level peter-principle'd management a good kicking, stopped the infighting, and stopped the use of brute force in their development practices.
Simon
Re:Two responses predicted (Score:3)
Can you please explain why MS can read the document graphics, but can't maintain format consistency? They seem to have improved a lot in this regard, but so what? All the other guys trying to write a compatible editor are exactly in the position MS itself was a few years ago.
The point is that MS's
This is because Word 6.0 rendered according to screen metrics, Word 7.0 rendered according to printer metrics for better quality output at low font sizes, and Word 8.0 now renders according to *font design* metrics, which means that while it'll look reasonably like what you get on the printer, and obey margins, it will squish the fonts a pixel or two together at times to get the best fit.
It's nothing to do with the
Simon
Re:Diagrams/line art/etc (Score:2)
Actually, MS did it the way you described above as the way it should be done. It's called View Caching. And most converters don't bother because they haven't implemented all of OLE Structured Storage (or at least enough of it to be able to *use* that part).
Simon
Re:.DOC not exactly proprietary (Score:2)
documentation: the act or an instance of furnishing or authenticating with documents.
The product is not its documentation.
Objective critique of .doc (Score:2)
Word is a COM object and uses COM extensively. OLE was at the roots of COM but these days OLE is just another set of COM objects that one either implements or uses. So.. from the get go, one needs to implement COM, and also IStorage/IStream, on Linux, to get at Word. This would be ok if COM were an open standard, but it ain't. Where in MSDN is the VTABLE format for COM? It isn't there.
Strike 1 against Microsoft. By strike I mean that they are doing the usual evil empire thing by not opening up COM.
Philosophically, IStorage / IStream are a set of COM objects (read Libraries), for divving up a file into its own directory mechanism. The rationale for doing this is that end users want to copy documents as entire entities, and not deal with 200 or even 2000 subdirectories or small files that might comprise a total document. In Microsoft speak, a document must be a moveable entity, and in that regard, COM library based documents are entirely defensible. However, what goes into each of those subdirectory entries, or streams, is free to remain largely undocumented. It is the design intent of COM to ensure interoperability between closed interfaces. At this, COM does stunningly well. You can script against COM in any language... but the medium for interchange is an application that you must always have in order to view the document.
Strike 2 against Microsoft. COM IS an excellent piece of software engineering, but it is engineered to do the hypocritical thing. The easiest way to make things interoperable is to post the source...
Much ado has been made about Word changing file formats. The critics of Word say that it is unnecessary to change file formats between releases. This is nonsensical hogwash. New features mean new data requirements, and new data requirements mean new file formats. Every other application on the planet has versions of formats and downward compatibility problems. Have you tried looking at a style sheet page in Netscape 2.0? That Word changes file formats is reasonable.
Hit: Microsoft.
Some criticism has been made about how a Word document changes appearance based on the display or print device. This is in keeping with the philosophy of Windows - which is to enable software features only if the hardware is present to support them. This is radically different from Unix, but this hardware-centric approach of Windows IS defensible on many merits.
Hit: Microsoft.
Word has, in effect, an autoexec scripting mechanism with no sandbox and no security besides that which the user security context of the OS offers. Since Windows 98 effectively runs everyone as root, the vast majority of Windows Word users are flying blind into a cliff.
Strike Three: Microsoft.
The bottom line is this. The
The bottom line is this:
If Microsoft had opened the Word file format, then Word files would have been the de facto web page of the Internet, not HTML. That we are doing HTML and HTML rendering engines is testimony to how badly Microsoft missed a golden opportunity with Word. To protect their Word Processing IP, they made sure a non-Word file format (HTML) would become the lingua franca of the Internet. That by itself is a compelling argument in favor of open file formats.
A Brief History of IntelliCAD (Score:3)
By now we know that both the DWG and DOC format have been reverse engineered. We also know that it really does not matter. Autodesk/MS control the data formats. Their rendering of the data is the reference implementation -- and they both change the format at will. They both exploit run-time and new version peculiarities in their rendering of the data.
When it comes time for a company to decide which product to invest in, when it's time to choose if they want to use the proprietary product or some wannabe cheap-o competitor, the answer is always the same. Go with the standard bearer. And that really is the correct answer. The price differential is completely and totally irrelevant. Corporations invest a lot more in labor and data than they invest in any one version of a software product. The "open source" factor is -- if not irrelevant -- not appreciated. It is secondary at best.
Look at IntelliCAD. They attempted to commoditize R12 AutoCAD. Supposedly nobody wanted any of the features crammed into post-R12, post-multiplatform AutoCAD. R13 was a bitter pill for AutoCAD customers and loyalists. Supposedly IntelliCAD would allow drafter/designers to draw basic 2D engineering drawings just as well as R13++ for half the price. More importantly, they thought they had given companies that had huge investments in DWG data a viable alternative -- a way out. They could jump from the ship they were supposedly dissatisfied with and seek alternatives.
But you know what? Nobody took the offer.
Not before IntelliCAD was "open source" and not after.
It turns out that Autodesk was able to pull off R14 and salvage their reputation. Turns out customers were not all that dissatisfied with Autodesk -- which they correctly saw as a well entrenched, healthy (==rich) partner, committed to investing in both AutoCAD and other forward looking design products and technologies. Turns out AutoCAD is very capable of getting the drafting job done. Besides, IntelliCAD was for shit. Still is. And when Visio sacked the original IntelliCAD development team -- a very idealistic and motivated group -- because ICAD was released prematurely with bugs and feature gaps -- any idealism or customer loyalty went out the window. ICAD was exposed for what it had become -- a cheap knock off with no future. The so-called open sourcing of IntelliCAD was just window dressing. The fact was that Visio had interred its mistake in preparation for acquisition by MS. (It also parted ways with the folks that had inspired IntelliCAD, FWIW.)
So what does this have to do with .DOC?
You could come out with a .DOC compatible word processor without a super-human effort. But without the VBA, without the quirky rendering, without all the nuances and endless litany of features of Word it would be nothing more than a knock-off. It would have to beat Word on functional terms in order to be attractive. That would be a very tall order. Like it or not, Word and AutoCAD are very mature products. Maybe they attempt to do too much. Maybe they are bloated with features that any one customer does not want or need. But a whole lot of customers are well served by these products. They get the job done for a broad spectrum of customers.
They are both going to be very, very hard to dislodge.
It's their game to lose.
Beating them on the merits will be damned hard, and possibly not enough.
And, just to goad anyone still reading, being "open source" or not has nothing to do with it.
If open source is a strategic advantage, it will have to do with stamina and longevity. Eventually MS/Autodesk will find it hard to keep milking their cash cows. Eventually they will find it harder and harder to justify continued investment in these products. Eventually the WinX platforms both products are married to will fade. At that point, when Word and AutoCAD stagnate, they may be vulnerable to an open source community that can run endlessly on no cash, that can build bridges to newer, more current technologies.
I'm not holding my breath.
In fact, I've changed jobs to get out of the CAD industry. The action is elsewhere. I may not live long enough to see AutoCAD take a fall. It may never happen.
PS: In the CAD space, the most interesting open source activity is not IntelliCAD. The Matra folks have a more interesting offering. IntelliCAD is a corpse. OpenDWG may prove useful if and when the action moves beyond AutoCAD. If that future is to involve open source, it will more likely be centered on Matra than OpenDWG.
Intellisense an innovation? (Score:2)
I myself developed a syntax directed editor in 1985 called ALICE -- see this page [templetons.com] to download it for DOS or Linux -- which still 15 years later does more than Intellisense.
There are some MS innovations but this is also 20 year old stuff.
Re:Two responses predicted (Score:2)
Because of this I maintain that MS has a joke of a document specification.
It's bloody well pretty much done... (Score:3)
It's GPLed; granted, it needs work. So scoot onto the abiword mailing list and cvs down the latest version, get hacking on it and sort it out.
ole2 is fully sorted out with libole2, and Excel is being handled by gnumeric.
What is not handled by wv is not for lack of documentation or design; it's simply a matter of spending some time at it. Easy peasy. Info on the MSDN docs can be got from here [wvware.com]. They can be gotten off the MSDN 1998 July cd, or you can get some of them from wotsit.org [wotsit.org]. I even wrote ivt2html [csn.ul.ie] for you to convert the office.ivt file into html. Like, what else do you need?
90% of all the hard work has been done, wv can parse fast and simple with no bother to it, which was a nightmare to do, it can construct the correct PAP (paragraph properties) and CHP (character properties) for a given run of text. Feed you the correct characters and charset and font, the TAP (table properties), graphic properties and handle to graphics. The correct OLE handle for embedded objects. Document properties etc. There is an example html conversion program included for reference (wvHtml).
I put together libwmf [csn.ul.ie] to convert wmf files into something useful as well. There's a half done implementation of an Escher (the graphics for Office) importer floating around in there as well.
There's also an implementation of a Summary Stream displayer for all ole2 documents.
I even bust my ass and dragged together the right bunch of motivated people to help implement the decryption [csn.ul.ie] module for Word 97, 95 and 6, and that was not fun at all, to say the least.
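As a rough illustration of where a reader has to start before any decryption can happen, the sketch below reads the first few fields of the FIB (the header at the start of the WordDocument stream) and reports the version number and the encrypted flag. This is not wv's code. The name `read_fib_base` is hypothetical, the stream bytes are assumed to have been pulled out of the OLE2 container already, and the layout (magic 0xA5EC at offset 0, nFib at offset 2, a flag word at offset 10 with the encrypted bit at 0x0100) is assumed from the Word 97 format documentation.

```python
import struct

FIB_MAGIC = 0xA5EC      # wIdent: marks a Word binary file
F_ENCRYPTED = 0x0100    # bit in the flag word at offset 10 (assumed layout)


def read_fib_base(stream: bytes) -> dict:
    """Read the version and encrypted flag from a WordDocument stream.

    Hypothetical helper for illustration; `stream` is the raw
    WordDocument stream, already extracted from the OLE2 container.
    """
    if len(stream) < 12:
        raise ValueError("stream too short to hold a FIB")

    # Offset 0: wIdent, offset 2: nFib (both little-endian uint16).
    w_ident, n_fib = struct.unpack_from("<HH", stream, 0)
    if w_ident != FIB_MAGIC:
        raise ValueError("not a Word binary document stream")

    # Offset 10: 16-bit flag word.
    (flags,) = struct.unpack_from("<H", stream, 10)

    return {
        "nFib": n_fib,                             # format version number
        "encrypted": bool(flags & F_ENCRYPTED),
    }
```

A real importer would branch on nFib to pick the right structure layouts for each Word version, and hand encrypted documents to the decryption code before parsing anything else.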
The hard work is done; if you want something improved you have a very very solid base to work from. Yes the spec is confusing, yes it's not a great format, yeah it sort of moves over time, but in a fairly rational way that can be supported with some work. There are any number of equally crap formats with weak documentation supported in various tools.
There is just this false myth that the Microsoft formats are impenetrable and/or not available. Just download wv. Fair enough, there might be problem documents; if there are, just debug wv and get onto the abiword list and work it out with them. If something fails it can be fixed and improved; it's not a case of "ah well, it's a MS format, nothing can be done". If you truly want to handle Microsoft formats there are a number of people working on it that you can help.
So it's right there for the right bunch of motivated people to work on. C.
Re:MS Word File Format is here (Score:2)
--
Re:'Everpresent Office monopoly'? (Score:2)
This is the exact kind of attitude that should turn people away from MS. Why? Because it is Bill Gates's explicit goal (and he goes on TV to say this) that MS wants to bring computing to the masses.
Pretend you are him, and you want to achieve this goal. What means should you use? Closed file formats with lousy specifications? How does that bring computing to the masses when they are prevented from speaking to the Unix Priesthood?
If you, as a MS lackey and worshipper, believe that this is not MS's responsibility, then please go take it up with Bill, your prophet. He has stated publicly and many times that this is his goal. Remind him that MS's duty is to the stockholders and they should make as much money as possible. Please tell him that, and also tell him to STOP LYING to the American public.
Why? (Score:2)
Because the formats suck...?