Can XML Replace Proprietary Document Formats?
Pauly asks: "My former profession of Technical Writer was made very painful by my customers' requirement to have their documents delivered in MS Office formats. PDF/FrameMaker was not acceptable, as they needed to be able to edit the documents as well. Let me tell you, it is painful watching a 3,000+ page Word97 manuscript, the fruit of weeks of hard labor, rendered into rubbish by my customer's Word95. I've missed deadlines, lost money, and will never forgive Microsoft for their abuse of me and my kind. My question: is it possible that XML-based standard file formats suitable for word processors, spreadsheets, etc. could be created that forever do away with proprietary binary formats and inadequate file conversion routines? This notion seems to be working for the graphics crowd in the form of SVG. The benefits are obvious; what are the drawbacks?"
Re:Proprietary Formats = Never Confused (Score:1)
XML needs to be integrated into Linux (Score:1)
As a professional IT consultant working for one of the top names in the software industry, I am working on a detailed report on the "open source" phenomenon (thanks to various people for pointing out that it is not freeware per se) as started by Linus Torvalds with his Linux operating system some six years ago.
Anyway, let me tell you that XML is truly the wave of the future as far as the major players in the corporate domain are concerned. A lot of companies have rejected custom-built solutions in favour of the perceived "openness" that XML provides, and once the SOAP protocol is approved, XML will be the definitive technology for software to include in this day of interoperability.
Since XML is going to be such a huge part of the industry within a year or two, what I think that Linux (and indeed, every operating system that wants to compete) needs to do is to integrate XML into the kernel as an intrinsic part of the system architecture. Then software vendors, looking for a rock-solid platform with which to write mission-critical apps in the internet-enabled domain, will choose Linux as their platform of choice, since having XML integrated into the kernel will provide stability and performance attributes.
Anyway, once all interapplication communication is achieved through the use of XML and the SOAP protocol, any operating system fully supporting these in a native environment will benefit hugely in the perception of tech-savvy CTOs. This is why I think that Linus and his kernel team should make this their number one priority if Linux is to succeed.
OT: First Post Generator (Score:1)
According to the old book "Hackers", an old MIT OS had a HALT (or CRASH) command accessible to any user. It would bring the computer to a stop, requiring a reboot. The idea was to make crashing the computer trivial, so that the current game of "can I write a program to crash the computer?" would stop.
Of course, this command depends on an ethic to *not* run the command (i.e., the "brotherhood of man" instead of the "depravity of mankind").
Reading comprehension? (Was: This is beneath /.) (Score:1)
Did you read the intro? The document wasn't lost, it was mangled.
DonkPunch asked earlier, "How was this [the foisting of inferior products on an uninformed populace] allowed to happen?"
This is your answer. It was allowed to happen because some of the loudest, most visible advocates of alternate products were the least polite, the least diplomatic, and, in this case, the least thoughtful.
I would find this hilarious if it weren't true. A lot of Slashdotters are fond of throwing around "RTFM!" or "You deserved to get hosed because you made a mistake!", but they can't be bothered to read (or take to heart) the advocacy HOWTO. For pity's sake, hair-trigger flamage is so common in the anti-MS community that somebody actually felt the need to write a document telling people what ought to be obvious: mindless flamage does not change someone's mind.
www.alarmist.org [alarmist.org]
Re:Why not even html (Score:1)
HTML is a crazy choice (at least as a _writing_ format; it might be a good choice for _presentation_), but LaTeX is pretty good for this.
Not to say that LaTeX is perfect-- I have tons of gripes with it, but it works, it's here now, and has a large user base that continuously extends it.
Seriously, name one feature in Word97, StarOffice, or WordPerfect that couldn't have been done in a nice GUI HTML editor? Just name one, one example.
Multiple column text. Footnotes and endnotes. Automatic section numbering. Automatic generation of tables of contents and indexes. Extensibility with macros. Precise control over how your document will be formatted on the printed page. Need I continue?
Not that Word does any of these particularly well, BTW ;)
Agreed. (Score:1)
You've hit the nail on the head. This is very closely related to my major gripe with TeX and LaTeX.
The fact that you extend TeX by writing programs in TeX is absolutely horrendous. I find it to be completely dense and impenetrable. TeX might be a reasonable page description language, but it absolutely sucks as a programming language.
If I could design the LaTeX replacement of my dreams, I'd go with a system with 2 or 3 languages:
And did I mention that TeX's syntax is just horrible? "\command{...}"-- it's just begging for users to mismatch braces when stuff gets nested too deep!
Re:Why not even html (Score:1)
read "TABLE" tag.
Ugly hack; basically, you have to explicitly lay out your text in the two columns. And switching back and forth between multicolumn and normal mode requires a lot of editing.
Footnotes and endnotes.
Read "FONT SIZE=1" and "I"
Yeah, right. Has it occurred to you that there's more to footnotes than just small script?
Automatic section numbering.
Uh, what? An HTML editor would be able to automatically add page numbering; not sure what section numbering is.
You don't know what section numbering is, yet you want to claim HTML is adequate for serious document preparation? Jeez.
Automatic generation of tables of contents and indexes.
Read "TABLE" again
Yeah, right. Which is the HTML tag whose semantics consist of looking at all the sections and subsections in my document, figuring out their numbering and on which page they occur, and automatically generating a table of contents?
Which is the HTML tag that allows me to mark points in the document I want indexed, and which is the tag that, when encountered, will cause an index of all the parts I marked, with page numbers, to be generated automatically?
Extensibility with macros.
Any decent GUI HTML editor would be able to add this feature.
But again, you sorely miss the point. I don't mean a keyboard macro in an editor to insert some preset text; I mean a macro facility to extend the document language itself. In HTML, this would be, for example, something like a tag that lets you define a new tag.
When you are repeating a particular pattern over and over, you want a new language idiom to represent the pattern (which makes your document more legible), not just a keyboard macro to insert it over and over again. I do this all the time in LaTeX-- when I find I'm repeating some commands too often in a certain way, I define a new command to encapsulate the pattern.
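A small sketch of what that looks like in LaTeX (the command name \term here is made up for illustration; any undefined name would do):

```latex
% Define the repeated pattern once, as a new idiom...
\newcommand{\term}[1]{\emph{#1}\index{#1}}

% ...then every use is one legible command instead of a
% repeated run of formatting commands:
A \term{closure} captures its defining environment.
```

The document source now says what the pattern *means* (a defined term), and the formatting can be changed later in one place.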
Need I continue?
More than likely, I still don't see your point. Ok, let's assume Word97 is the greatest word processor of all time; say it has the perfect user interface.
I refuse to make that assumption ;-)
Ok, now on the back end, rip, tear, and pull out that proprietary format. Ok, so you just have the front end now, right? Ok, when it tries to save the file, have it push everything out into HTML instead of the Word97 DTD. Ok, so you have an HTML file pushed out, but the user doesn't know it is an HTML document. Pretty neat, huh?
It's one thing to use HTML as the file format for a particular application-- the application can do a lot of things that HTML _itself_ doesn't have. But the original post was talking about editing HTML with vi or HotDog Pro.
My point was that HTML is a terrible language for _authoring_ documents. If you want to use some portable, application independent language for _writing_ serious documents, HTML doesn't cut it-- it's underfeatured. A good language for these purposes is LaTeX; it's free, extensible (and has tons of extensions), featureful, and can be used to output a document into many formats: dvi, ps, pdf and html, among others. It has professional typesetting capability. It has facilities for automating tables of contents, indexes, bibliographies, and many other things.
Really, take a look at LaTeX and its feature set. This is the kind of thing you want for a document authoring language.
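A minimal sketch of a few of those facilities in one source file (standard LaTeX, nothing exotic):

```latex
\documentclass{article}
\begin{document}
\tableofcontents          % generated automatically from the sections below

\section{Introduction}    % sections are numbered automatically
This one source can be rendered to dvi, ps, pdf, or html.\footnote{Footnotes
are numbered and placed automatically, too.}

\section{Conclusion}
\end{document}
```

Bibliographies and indexes work the same way: mark the content, and the numbering and placement are computed for you.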
Re:Typical slashdot ignorance (Score:1)
Incompatibility between versions of Word
Binary file formats that work only in their software
Brutish business practices that remove the benefits of competition in this marketplace
My customer is as much a sucker in this scenario as you and I.
Re:Already Happened (Score:1)
Then I tried opening up the
Am I missing something here?
Re:Ever hear of OpenDoc? (Score:1)
Re:Sabotage (Score:1)
Is that: Vive la France!
My French is rather rusty though
Re:Hey genius... (Score:1)
Re:Already Happened (Score:1)
Unless, of course, they publish it on the web as a 'trade secret'.
Then anyone who parses it can theoretically be sued for violation (because, well, you *could* have looked at the spec. Remember that - the posting of the trade secret on the web is probably just preparing a club to use on anyone who implements a version. Not legal, but, hey, how many open source programmers can legally defend themselves against MS?)
Re:An open question (Score:2)
1. Because they don't know any better.
2. Because people, as a general rule, resist change, even when it's for the better.
The first point cannot be addressed by marketing or advocacy alone. MS has a hold over the market that is incredible, both in terms of penetration and mindshare. Every business executive has heard of MS. How many have heard of StarOffice? Or know that WordPerfect is still around? How many know what XML is, let alone that it exists?
The second point is basic human psychology. We're comfortable with predictable patterns, with habits, with the devil that we know.
Why do they blindly pay for new versions every few years when their current versions do everything they need and more?
Marketing and habit. Marketing because everybody recognizes MS Office and dancing paperclips, not StarOffice or XML. Habit because everybody has been trained to believe that the latest version must be the best because it's the most recent. (Argumentum ad novum? I forget which logical fallacy it is.)
People are afraid of change, to some extent or another.
I wonder what we can do about that. Certainly crying and moaning about MS won't stop it. It's time to do something besides whine.
www.alarmist.org [alarmist.org]
Re:Nope (Score:2)
This has been a huge win for people using SGML for quite some time.
Re:XML needs to be integrated into Linux (Score:2)
Since XML is going to be such a huge part of the industry within a year or two, what I think that Linux (and indeed, every operating system that wants to compete) needs to do is to integrate XML into the kernel as an intrinsic part of the system architecture. Then software vendors, looking for a rock-solid platform with which to write mission-critical apps in the internet-enabled domain, will choose Linux as their platform of choice, since having XML integrated into the kernel will provide stability and performance attributes.
I say I've never heard such horse s**t before. Integrating an XML parser into the kernel will do exactly zero for the performance of XML parsing. This is a userland task. It's fairly obvious that you know nothing about XML, nothing about kernels, and I'm guessing these "top names" didn't do enough research about your knowledge before asking you to do their report. I look forward to seeing it headline on ZDNet.
Want to deliver XML with Apache to different media in varying styles? Get AxKit [sergeant.org]
WP formats (Score:2)
TeX/LaTeX is a system which supports macros, total system independence (both as source and compiled), has a wide range of fonts and has a degree of respect in the publishing industry.
But it has ONE WYSIWYG editor with any popularity - KLyX - and is utterly unsupported by any common word processor on the market today.
Then there's RTF. Supported (somewhat) on the PC, but it keeps changing. The format's too unstable and too primitive to be usable on any real scale.
ASCII, itself, is supposedly a standard for information interchange (hence the acronym). But it doesn't have the range to be useful for WP and DTP any more. Wide ASCII has the range, but isn't widespread and is still far and away too limited.
What's needed is a standard that will take the market by storm. It's no use simply being good, or even "the best, so far". That doesn't shift users, or software houses. What's needed is something so outrageous, so crazy, that it captures mindshare by force. Unfortunately, the only people who come up with ideas like that are all in sales, and the idea is a technological disaster, all round.
Re:XML only benefits the priviledged few (Score:2)
No, the problem was caused by new functionality and changes in existing functionality. Even if the doc was based on an open format, that would not help with how Word renders the document.
Before agreeing to any contracts it should have been clear what exactly was being delivered to his clients. If they had agreed to Word95 then he deserved to lose money. If they had agreed to Word97, then it was his client's fault. Somebody screwed up on the business side from the very beginning - it's not a technical issue.
Don't forget, HTML is an open standard but no browser whether open source or not renders pages identically.
Re:Learning XML, XHTML (Score:2)
XML fixes a low-level syntax for identifying what is a tag, which is the matching close tag, etc. (like HTML without the meanings for the tags, and with some obvious (in hindsight) stupidities removed). By actually defining meanings for the tags, and specifying what tags are allowed in what contexts, you can then construct a representation format for some type of data.
So what is the point, I hear you cry! Well, the point is that:
1) The XML rules encourage you to be sensible in defining your format.
2) Applications handling different XML-based formats can share large chunks of parser code.
3) The common underpinning makes it much easier to work with hybrid documents that include data in multiple XML-based representation formats.
4) Some limited processing and checking can be done just with the raw XML and perhaps the formal format specification (the DTD).
Steve
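Point 2 above can be sketched in a few lines: one generic XML parser (Python's stdlib here; the two "formats" are invented for illustration) handles two unrelated document types with no format-specific parsing code.

```python
# Two hypothetical XML-based formats that share nothing but XML syntax.
import xml.dom.minidom as minidom

invoice = "<invoice><total currency='USD'>42.50</total></invoice>"
recipe = "<recipe><ingredient qty='2'>eggs</ingredient></recipe>"

def first_text(doc_src, tag):
    """Shared parser code: return the text of the first element named `tag`."""
    doc = minidom.parseString(doc_src)
    node = doc.getElementsByTagName(tag)[0]
    return node.firstChild.data

print(first_text(invoice, "total"))      # 42.50
print(first_text(recipe, "ingredient"))  # eggs
```

The same generic machinery would also let a hybrid document embed an invoice inside a recipe (point 3) without a new parser.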
Re:An open question (Score:2)
'Broken'? You are missing the point. PostScript is intended as a programming language for telling the printer where to put ink on the page; if it does that then it is not 'broken'. A tool which formats PS files as 2up might make certain assumptions about the format of the file, but if those assumptions turn out to be wrong then the tool is broken, not the PS file.
To make an analogy: suppose you ran the PostScript through a 'grep' program, which dumped core because it couldn't handle lines longer than 80 characters.
Re:Biased rant from co-inventor of XML (Score:2)
You can use HTML Tidy [w3.org] to correct the HTML output by Word.
Re:Bloatware (Score:2)
Two years ago, my two roommates and I were all finishing off our EE Master's theses. Two of us had gone with LaTeX. I went with LaTeX because, well, why would you ever want to write a thesis in anything else (especially when your school is cool enough to provide its own LaTeX thesis style)? Roommate #1 went with LaTeX 'cause he had a PC at home and a Mac in the lab, and had already seen the damage that the Word 97/Word 98 switcheroo could do.
Roommate #2, however, chose to go with Word 97, which provided much amusement for the rest of us, as he spent the last three days before his thesis due date moving pictures and text around trying to get his thesis to look as good as those produced by LaTeX.
Re:(-1: Offtopic) (Score:2)
One would think a Slashdot reader, of all people, would realize how dated this concept is. Land may be the ultimate resource in terms of overall human welfare (though I contest that), but that does not make it a meaningful measure of power or social stratification. I am a member of the American elite in a number of ways (white, male, upper-class, well-educated), and by any sane measure I have a disproportionate amount of power in this country (or will, once I get out of college :-)), and yet I am likely to never own any more land than my house covers, if that. Land may be where food comes from, but it sure isn't where influence or power comes from. 3% of America probably does hold most of the power in this country, but it's not the 3% that owns 95% of the land, it's the 3% that has 95% (or however much) of the money. And believe me, they're not the same 3%. Bill Gates isn't a big landowner. Neither is Larry Ellison. The only man to get notably rich off land lately is Donald Trump, and he did that in real estate, which isn't really the same thing.
As for the demand issue, demand for land grows at an exponential rate only within the confines of simple-minded mathematical models (trust me: the population of this planet is not a first-order DE in one variable) and scare headlines in the media. It is now very clear that the world population will stabilize within 40-60 years. Even the most panicky and irrational of the estimates for the world's eventual maximum population permit that entire population to fit on the eastern seaboard of the United States, at the population density of San Francisco, with plenty of farmland left over to feed them. Moreover, the world's food supply is currently growing faster than its population. Overpopulation is bunk.
Re:Good Idea but counter gratuitous complexity (Score:2)
Re:Already Happened (Score:2)
Byte has a short description [byte.com] of the Word format that you might want to look at.
I've looked a little at the Excel format. One thing that seems clear is that the O2000 formats are almost human-readable. It shouldn't be that difficult for someone to whip up a converter -- well, it should be easier than parsing the binary formats.
--
CSS3 (Score:2)
However, one problem is that there currently isn't a sufficient standard to support printed output, along with things like margins, page numbering, headers/footers, foot/end notes, and so on.
An alert slashdotter pointed this out to me just the other day - the proposed CSS3 Page Media Properties [w3.org] spec addresses most of these issues. However, it's not done yet, and has not been implemented anywhere that I know of. So, it might be a year or more before we have a truly open format that could be used for word processing programs.
--
Re:XML needs to be integrated into Linux (Score:2)
It doesn't matter one bit that it was troll bait. Adding irrelevant stuff to the kernel is a bad idea, and even a troll could get other people thinking that maybe it's a good idea. The last thing we need is thousands of people clamoring to put every application into the kernel. (Sure, they wouldn't be heeded, but it would be a distraction nonetheless.)
Or is the user "Deven" actually a sophisticated troll herself?
Bzzt. Wrong on both counts: (1) I'm not trolling, and (2) I'm male.
On \., who can tell?
"On the Internet, nobody knows you're a dog." [unc.edu] (Or a program that can pass a Turing test?)
Re:Why not even html (Score:2)
KOffice uses XML! (Score:2)
MSFT sometimes "forgets" that feature (Score:2)
Here is your answer...File...save As...Word 95...end of question.
If only it were that simple.
Today, that shouldn't be a problem, but when Office97 first came out, Word95 was not an available "Save As" format. So, Word97 could read Word95 files, but couldn't output them. This "oversight" was fixed in the first Office97 service pack, but still ...
Re:An open question (Score:2)
Hmm...I think you're right. There's even an O'Reilly book about this if memory serves.
So I guess it must really be Microsoft, then. :-)
Re:Already Happened (Score:2)
Well, except if the Byte article is accurate about the fact that the "XML islands" aren't really quite XML either. :-(
Good God.
This is so absurd, it...it has to be an accident. I mean, why in the world would FrontPage want to screw around with comments of any kind? Please tell me that it doesn't screw over script tags within comments, for example.
I mean, that's almost as lame as patenting style sheets out from under the W3C, right?
Re:A common standard (Score:2)
Well, I think there are two answers here. The first answer is that some of the proprietary tags were disgustingly useful where appropriate. Remember that things like tables were first seen as proprietary extensions before they were ever blessed by the W3C. And there were a few other things like "center" that looked easier to use than waiting for somebody (anybody!) to come up with decent style sheets. (No, it wasn't a new technology, but I would claim that general style sheets are hard for on-the-CRT display formats.)
The second part of the answer was that it wasn't the specialized tags so much that ruined things (ignoring "blink" and "font" for the moment) but the real dorkiness of relying on parsing quirks in html to get layout effects. You know, bulletless list elements to get indents and such.
Re:An open question (Score:2)
Gosh, now I wonder what kind of certain source would generate PostScript that was so broken that a simple filter would be unable to do a 2-up transformation on it. Could it be the same outfit that can't make its word processor use the same format in consecutive versions? The same outfit that gave us the gratuitously extended character set known as Windows-1252? The same company that was a guiding force in W3C stylesheet discussions and then tried to patent the use of stylesheets? The same people who now claim to be on the XML bandwagon except that they fall off every half-mile when their stuff still doesn't parse as anything?
Could it be... SATAN?
Re:An open question (Score:2)
Baloney. On at least two counts.
Postscript was designed as a page description language that was purposely abstracted away from the act of putting ink or toner (or photons) on the "page". The appropriate use of postscript should allow a wide range of transformations on the graphics or pages so described; this is the whole beauty of the concept. Why on earth would you bother describing fonts as outlines if you weren't going to do interesting transformations on them?
The second issue here is that Adobe has, for several years, documented a standard for the overall structure of PostScript documents that allows utilities like psnup to do all kinds of cool and useful things with PostScript files. But to get the goodies, you have to follow the (fairly trivial) guidelines. (In brief: you have to have BoundingBoxes specified and use the special %%Page* comments correctly.) PostScript produced by many companies works great with things like psutils. But not Microsoft. And this isn't anything new; this crap has been going down for years. I'm willing to accept the possibility that it's just a stupid bug that nobody in Redmond wants to fix. But just because they don't fix it doesn't mean it ain't broken.
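For reference, a minimal file following those guidelines might look like this (a hand-written sketch, not the output of any particular driver). Utilities like psnup key on exactly these %%BoundingBox and %%Page comments to find and rearrange page boundaries:

```postscript
%!PS-Adobe-3.0
%%BoundingBox: 0 0 612 792
%%Pages: 2
%%EndComments
%%Page: 1 1
/Helvetica findfont 12 scalefont setfont
72 720 moveto (Page one) show
showpage
%%Page: 2 2
/Helvetica findfont 12 scalefont setfont
72 720 moveto (Page two) show
showpage
%%EOF
```

Note that each page is self-contained, so a filter can safely reorder or scale pages without understanding the drawing code between the comments.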
Uh, and your point? Really, I must be missing something interesting about your world. In my world, if a program dumps core, the program is broken. Really, there's no grey zone here. Crashing == Broken. No matter what you think the proper role of PostScript is.
Re:A common standard (Score:2)
Wrong. Work on the table specification for HTML started in 1993, a year before Netscape was founded. Netscape wasn't even one of the first 3 browsers to implement tables. However, Netscape was the first to not follow the proposal, and invent something much poorer.
Hmm...you've got a point there. Thanks for whacking me upside the head. :-\
I now can't remember whether or not Mosaic had tables; Emacs-w3 did, but I'm not sure when. I'm presuming Arena did, although I never did use Arena very much.
In my (limited) defense, though, I did say "blessed" by the W3C. Do correct me if I'm wrong, (please! I want to remember this stuff right!) but tables weren't in the HTML2.0 spec (which was RFC 1866). Whatever else you have to say good or bad about the HTML+ (later HTML3) spec, it ended up being canned. Worse, it languished for what seemed like forever at the time, and that's where I remember the floodgates opening up wide.
No, that doesn't excuse the crappiness that Netscape unleashed; I do now remember how much it pissed me off. Thanks for the memories. :-/
Oh yes: you're completely right about "center" and what I remember as the Great Alignment War. Dunno where my brain was when I wrote the post you were responding to...
Re:MS Word and Linux Alternatives? (Score:2)
The problem with Word is that it is not very suitable for writing structured documents. It has all the necessary features, but the implementation is crappy. FrameMaker, however, is all about structure. It suits all my needs and provides the comfort of a word processor, whereas LaTeX does not. In addition, it has nice graphical features and you can embed objects from other programs. LaTeX would force me to convert all my images to EPS (not supported by the programs I use).
Having experienced LaTeX, FrameMaker, and Word, I can say that I consider both Word and LaTeX a step backwards. LaTeX delivers nice results but sends you back to the stone age from a usability perspective. Word is exactly the opposite: the interface is very nice but the result sucks. FrameMaker is in the middle: decent results and a fairly good UI (not perfect, though).
Re:MS Word and Linux Alternatives? (Score:2)
I then installed FrameMaker 5.5. It took me a week to convert my document and erase all traces of MS Word from my thesis. It was a good move, though. FrameMaker is excellent for creating structured documents (such as a thesis). I haven't looked back since. I now work as a Ph.D. student and write all my papers in FrameMaker. I have not run into any serious problems yet.
I particularly like how FrameMaker forces you to work (structure your document using paragraph and character tags). This is also the way I used Word in the past. Unfortunately, Word automagically fucks up your document structure if you don't pay attention.
I wouldn't consider any word processor other than FrameMaker at the moment. Word is nice if you know how to work with it; however, it is too buggy to do any graphics in it. Basically, anybody I know who ever tried to do anything serious involving graphics and Word has had to deal with all sorts of bugs in Word. FrameMaker doesn't have this problem. It's rock solid and available on many platforms (including Linux). It's also very suitable for scientific documents, since most conferences and journals have templates for FrameMaker (if they haven't, it's usually easy to create them yourself given a detailed description of how the document should look).
Of course, you don't have much interoperability with Word. You can import Word, but the result is usually not pretty. You can also export Word/RTF, but there are some problems here as well (especially with graphics).
Forward compatibility (Score:2)
Well, if your customer runs Word95, shouldn't you have checked this before spending all these weeks in Word97? Especially since you clearly spent some time rehashing in exactly what format the document should have been presented.
And I can't really see this as a Word failing, either. You are asking for forward compatibility -- kinda hard to realize. It's unreasonable to expect a piece of software to be able to read file formats from its future versions (unless it's plain-vanilla or tagged text).
Kaa
Re:Hey genius... (Score:2)
Now how did this get moderated up??
This guy is working in Word 97. Maybe he upgraded because he wanted new features, maybe he was forced to upgrade, whatever. Now he works in Word and wants to save a file; the native format is now Word 97, and being Microsoft, they make sure that Word 97 files are not backwards compatible with Word 95. You can "save as Word 95," but that actually converts it to Word 95. Any time you convert something, it approximates how to do the thing you want in the new format. If you've spent lots of time positioning things and making it look good, it will now most likely look like crap.
The AskSlashdot question was "Can XML Replace Proprietary Document Formats?", the problem with Word was the reason for the question. But hey, if Word works fine for you, feel free to keep on using it, dumbass.
Tools, and Data vs. Presentation (Score:2)
Now, having said that, tools become the next major piece. There is only 1 HTML -- but there are as many XMLs as there are DTDs. This is very intimidating. Nobody wants to write XML tags directly. They expect a tool to do it. Therefore, if you want to have your news department crank out press releases in XML format, you're going to need to supply them with a tool that specifies press releases in XML format. That means telling them where to type the title, the date, the byline, and so on....NOT giving them the same old word processor. And they're not going to like it.
I'm dealing with this problem right now at work. We want all the departments to start sharing content. I convinced them that the first step in doing that is to get rid of the HTML and Word formats, and store things in raw XML, and then everybody can pick and choose what they want and slap their own look and feel on it. They all agreed this was a wonderful idea. Then somebody pointed out that they'd have to start creating their content using this new format, and they said "Oh..uh...that sounds like a lot of work.....no."
-d
(P.S. - And why the hell doesn't plain text formatted messaging work?? Do you know how much of a royal pain it is to talk about XML without being able to use angle brackets?!)
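(Angle brackets restored.) A hypothetical press-release format of the kind described above might look something like this; the tag names are invented for illustration:

```xml
<press-release>
  <title>Widgets Inc. Ships Widget 2.0</title>
  <date>1999-11-30</date>
  <byline>Jane Doe, News Department</byline>
  <body>The authoring tool fills in these fields; nobody types the tags by hand.</body>
</press-release>
```

Each department can then apply its own stylesheet to this one source to get its own look and feel.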
Re:An open question (Score:2)
Content-free language (Score:2)
I must say I really don't understand the hype about XML. It's certainly progress wrt SGML, because it's free (in the sense that the standard is free, as in speech, from the W3 Consortium, whereas SGML is an ISO standard). But technically I find its usefulness questionable. It is a completely content-free language, for one thing. Not that this is a major defect, but it's certainly not something to be enthusiastic about. And despite its claim to simplicity, it seems like we still don't have one free (as in speech), validating XML parser library that also doesn't fsck everything up in its handling of Unicode (last time I checked, the libxml from the W3 Consortium was completely broken in this respect and expat was hardly better; even the SGMLtools don't really validate XML, since they only validate it as SGML).
All right, let's admit it's a general-purpose, content-free, easy to parse markup language. And so what? Didn't LISP sexps exist long before this? They are exactly that, and they're far simpler than XML. I still don't see it.
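To illustrate the comparison (a made-up record, purely for illustration), the same data in XML and as a sexp:

```
<person><name>Ada Lovelace</name><born>1815</born></person>

(person (name "Ada Lovelace") (born 1815))
```

The sexp carries the same nesting and labeling with far less syntactic overhead, which is exactly the poster's point.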
Re:Reinventing TeX (Score:2)
TeX is not a format, it's a language. Or, if you want, it is a format that has only one implementation and that is defined by it, and that's bad. Furthermore, TeX is a very old standard, and it's quite painfully apparent. Try to write a context-free grammar for TeX, for example: you can't (and Knuth deserves serious blame for this, since he is the one who suggested the name "Backus-Naur form"). TeX is not semantic, it's presentational: another bad point (LaTeX does not have that disadvantage, at least). TeX is Turing-complete, which for a description language is a bad thing (for an input language it's good, obviously). Last but not least, TeX does not support Unicode (the specially modified version of TeX which does, Omega, is progress, but I don't think it's good enough yet).
Sorry, no cigar. There are plenty of reasons to prefer anythingML over TeX.
DocBook (Score:2)
http://www.oreilly.com/catalog/docbook/chapter/
http://www.oasis-open.org/docbook/
http://www.docbook.org/
Re:LaTeX GUI (Score:2)
Seeking XML alternatives? (Score:2)
As long as the DTDs remain locked up in the software, you're fscked.
I'm presently quite happy with Framemaker (proprietariness be damned, at least I can count on a Frame doc written under Solaris to render the same way on NT or Mac, whereas with M$-Turd, I can't even depend on the same goddamn file to page-break the same way on two NT boxen!) to generate PDF and WebWorks Publisher for batch HTML conversion, but am becoming increasingly open to alternatives.
Fsck Microslut's half-baked excuse for XML. They're not interested in anything more than lining their own pockets and reinforcing the Orifice monopoly. Interoperability is not in their vocabulary. Scalability never was. The only good use for M$Turd is for writing one-page memos. (If you're a technical writer, I point out that at least this is long enough to write a letter of acceptance for a new job, and a letter of resignation to any manager dumb enough to use "but Office is the corporate standard, and we've already paid for it" as an excuse to take away professional authoring tools.)
Rant off. Where the hell was I going with this? OH yeah...
I may soon have the budget for a pure XML solution, does anyone know anything about ArborText [arbortext.com]? Looks bloody promising, and appears to offer easy integration with the DocBook DTD as a sweet bonus.
Re:Absolutely! (Score:2)
What is really needed is a schema which indicates data types, and much more information.
The point of XML is to "document" a proprietary format in a non-proprietary way. There is very little specified in the XML standard other than how to tag something. The tags are up to the vendor, and the hope is that the tags are human-readable (and understandable).
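A hypothetical illustration (the tag names below are invented, not taken from any real vendor): even without documentation, the structure stays readable to a human or to a competitor's import filter.

```python
import xml.etree.ElementTree as ET

# A made-up vendor format: the XML standard only dictates how tags nest;
# what "heading" or "para" mean is entirely up to the vendor.
doc = """
<document>
  <heading level="2">Installation</heading>
  <para>Insert the disk and run setup.</para>
</document>
"""
root = ET.fromstring(doc)
heading = root.find("heading")
print(heading.get("level"), heading.text)  # 2 Installation
```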
--Mark
Re:Biased rant from co-inventor of XML (Score:2)
Amen to that. I wouldn't trust Word for anything much over 20 pages.
I know several other people have mentioned it, but LaTeX is seriously the way to go. I know it has somewhat of a learning curve, and there aren't any really good/stable GUI front ends for it. But it works. It doesn't eat your data, it makes elegant formatting easy, and as long as you use your definitions properly, it's very easy to change an entire document's format. Added benefits are the available dvi-to-HTML converters, so you can make your document available in an easy-to-read, web-accessible format, and also have a hard copy with nice PostScript fonts.
I used LaTeX to document a year-long project for a CS course that I worked on. It took all of a week to learn everything I needed and become very proficient with the system. I have nothing but good things to say about it.
Also, several very large and respectable publishing companies (Addison-Wesley, for one) use LaTeX almost exclusively for their typesetting. In fact, Addison-Wesley's "LaTeX User's Guide and Reference Manual" is a fine resource for LaTeX. Trust me, it's worth learning a system that works NOW. Sure, XML has some benefits, and hopefully we'll see some systems that really take advantage of XML formatting, but for now there just isn't much out there. Trust your 3,000+ page documents to a system that's been in use since the '80s; you can't go wrong.
Oh, and the added benefit: it's free.
Spyky
Re:XML needs to be integrated into Linux (Score:2)
Re:Absolutely! (Score:2)
That would of course be silly and pointless. That's like saying "ah, now that we have lex and yacc, let's hope there will only be one programming language, supported by all compilers".
One DTD to do it all will lead to bloatware. I don't think anyone is waiting for that.
-- Abigail
Re:A common standard (Score:2)
That is something I've never understood. The only reason Netscape could ruin the web with the proprietary tags was because so-called web developers embraced the proprietary tags and used them all over.
If you didn't like the proprietary tags, why did you use them?
-- Abigail
Re:A common standard (Score:2)
Wrong. Work on the table specification for HTML started in 1993, a year before Netscape was founded. Netscape wasn't even one of the first 3 browsers to implement tables. However, Netscape was the first to not follow the proposal, and invent something much poorer.
And there were a few other things like "center" that looked easier to use than waiting for somebody (anybody!) to come up with decent style sheets.
I worked with browsers that could use stylesheets before Netscape came up with center (Arena, Emacs-WWW). Furthermore, before Netscape came up with center, much work was done on HTML 3.0, which had the align attribute (and DIV). But no, Netscape didn't look at the draft; they invented the less flexible, and proprietary, center. Microsoft tactics.
Well, you can't blame browser vendors for the fact that so-called web-developers had no clue what the web was about.
-- Abigail
Re:Drawback of XML (Score:2)
Stylesheets have been part of HTML from its first standard, of spring 1994, before Netscape existed, before Bill Gates was aware of the internet, and before anyone talked about XML. Stylesheet-capable HTML browsers were available 5 years ago. Stylesheets actually predate HTML; they come from the SGML world. It's quite old technology, actually; it has only recently become a buzzword.
-- Abigail
Re:Yes and No (Score:2)
No, it's not, for the same reason BNF isn't a programming language. XML is a way of formalizing content marking languages. XML is a meta-language.
-- Abigail
The value of human/script readable data (Score:2)
The assumption that you will use a particular application to manipulate data is a poor one. It limits you to the capabilities that it provides. Word processors generally provide limited or non-existent scripting capabilities. So when you want to automatically generate tables in a document from some other files, you are stuck doing it manually. That is a recipe for documentation that is out-of-date and full of errors.
Reinventing TeX (Score:2)
If you have any documents generated with early word processors, can you still read them with anything?
I don't mean to say that SGML, HTML, XML or FOOML is a bad thing. But they are simply another way of giving us what we've had with TeX for years, with a few enhancements. Let's remember TeX's strengths and not allow them to be lost with newer tools.
Page Layout vs. Content Description vs. hybrids (Score:2)
Content Description Languages (CDLs)
HTML, XML, and their parent SGML are content description languages - they describe document content entities such as paragraphs, headings, lists and tables, but don't describe how to make black marks on paper or RGB marks on CRTs.
They're well-suited to automated layout programs, letting the document reader determine how to lay out the visuals, which may be different depending on whether the target reading environment is hi-res dead-tree phototypesetters, medium-res CRTs, text-to-speech readers, braillewriters, cell-phone microscreens, PDA mini-screens, dumb terminals, browsers with images turned off for speed or with the font set really large for low-sight people or really small for pocket-sized printouts, etc. SGML was the original flexible metalanguage; HTML is a simplified static instantiation, as are the cell-phone variants, and XML is a newer SGML variant that has learned from 15 years of real-world experience.
They're also well-suited for automated content handling, such as the XML developments that are replacing EDI for applications like purchase orders. Editing CDL documents is easy, as long as you stick to the defined structures.
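To sketch the EDI-replacement idea (element names invented for illustration): a purchase order marked up as content, not layout, can be processed by any program that can read XML.

```python
import xml.etree.ElementTree as ET

# A toy purchase order of the kind XML vocabularies were replacing
# EDI for. Element and attribute names here are invented.
po = """
<purchaseOrder number="1042">
  <item sku="A1" qty="3" unitPrice="9.50"/>
  <item sku="B7" qty="1" unitPrice="120.00"/>
</purchaseOrder>
"""
root = ET.fromstring(po)
total = sum(int(i.get("qty")) * float(i.get("unitPrice"))
            for i in root.findall("item"))
print(root.get("number"), total)  # 1042 148.5
```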
Page Description Languages
PDLs let authors tell the computer how to make documents look the way they want them to, and programs that process them make various compromises to support different presentation media, such as paper or CRTs. They range from things like Postscript, which forces the display to do the best job it can rendering an image that the authoring program specifies, to things like MSWord, which let the display device determine layout, and reprocess the entire input document any time you change printers. They also range from higher-level systems which know a lot about the document structure to lower-level systems that know about output but know next to nothing about structure - "systems" includes the application programs as well as the representation language.
All of them have the problem that if you want to edit the contents, the output looks different so they have to cope with what you've done. Depending on the document structure and app design, they may have to repaginate the entire document, or only up to a chapter/section break, or they may be crude and only patch the current page and force you to renumber the rest if you want.
Hybrids
Lots of authors want to specify the output appearance, regardless of whether this constrains the readers' choices - you can see hybridization like this crunched into HTML, with commands for fonts, font sizes, colors, and the newer cascading style sheet stuff. It's possible to do this in ways that preserve content - the language represents that this is a "Heading Type 2", and instructs that "Heading Type 2" be represented in Palatino Bold Blinking with a full line-break after the heading text. It's also depressingly common to lose content structure information, especially during translations, either because the target language doesn't have a mechanism for representing the content tags, or because the translator writer JUST DOESN'T GET IT. An example of the former problem is rendering for constant-width ASCII or for GIFs or Faxes. A depressingly stupid example of the latter is saving MSWord documents in HTML - MSWord knows about objects like headings and paragraphs, and knows that the current user's settings for a "Heading 2" object and "Normal Paragraph" object are 14-point Arial Bold followed by a single blank line and 10-point Times Roman followed by a single blank line - but instead of outputting an H2 heading object, a P paragraph marker, the text, and another P, the depressingly stupid program outputs a request for 14-point Arial Bold font, the header text, a couple of BR spaces, a request for 10-point Times Roman font, the paragraph text, and some more spaces.
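The difference between the two exports can be sketched directly (the styles and strings below are invented): the same internal structure rendered once with content tags preserved, and once the "depressingly stupid" way, as bare font requests.

```python
# One document structure rendered two ways: the content-preserving way
# (H2/P tags) and the font-request way that loses the structure.
doc = [("Heading 2", "Installation"), ("Normal Paragraph", "Run setup.")]

semantic = "".join(
    f"<h2>{text}</h2>" if style == "Heading 2" else f"<p>{text}</p>"
    for style, text in doc)

presentational = "".join(
    (f'<font size="4"><b>{text}</b></font><br><br>'
     if style == "Heading 2"
     else f'<font size="2">{text}</font><br><br>')
    for style, text in doc)

print(semantic)
# <h2>Installation</h2><p>Run setup.</p>
```

Both versions make the same black marks on a CRT, but only the first one still knows which part is a heading.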
Application Programs can do lots of different things with PDLs or CDLs. For instance, you can use comments to put non-printed document structure information into PDLs, or to put layout information into CDLs, and programs that know about it will use it, while programs that don't know about it will ignore it. That doesn't mean there are any industry standards about doing this, so of course one editor may stomp all over another's markings, or may leave them in place while adding things the first program doesn't notice because the comments weren't updated. Postscript is an egregious contributor to this - it's an extremely general-purpose programming language, and there are lots of different ways to get the same set of black marks on paper, ranging from bitmaps to format-annotated CDLs, and almost no two applications can read each other's Postscript.
Production Software
While PDFs are both liked and disliked because they are designed not to be editable, I'm surprised your customer couldn't accept FrameMaker. It's one of the best WYSIWYG large-document production systems I've seen over the last decade, and if the customer wants to export pieces into MSFoo, they can; but if they edit the entire document, they probably should have enough control over the process to buy a few copies of Frame, which is basically a trivial addition to what they've already paid for producing 3000 pages of documentation. Also, to do big documents, you need tools that can cope with multiple authors working simultaneously, and Word isn't really designed for that.
If you're doing 3000 pages of documentation, or for that matter 300, and you're not a graphic arts shop or something, it's probably going to be mostly text, or text with user-interface illustrations, and you're going to use a uniform formatting style for the whole thing, modulo an occasional tweak for illustrations that need to be placed on a page to fit together. I'd think the best approaches these days are either to build the thing in hypertext to start with, or else use some of the tools from the GNU / Emacs world, or else use a batch production system like LaTeX or troff with an appropriate macro set.
Learning HTML was so easy, since it looked a lot like the troff -mm macros.
It's been a long time since I've been part of anything over a hundred pages or so; the last time I was on a very large RFP response project, the boss had some kinky troff macros and basic shell mungers that let us keep the entire document in a database, so we could track which of the N thousand requirements were being handled by which authors, in what files, patch up the figure and page-number references (really a two-pass process), and build the indexes to the document and the cross-references to the RFP requirements document it was a response to.
Arrrgh - Slashdot doesn't let me use H1 and H2
XML is the way we should go! (Score:2)
TeX coolness (Score:2)
Re:TeX coolness (Score:2)
Why XHTML? (Score:2)
Hmm, I'm not amazingly well read up on these things myself, but it was my view that XHTML is a well-formed version of HTML. It's really the follow on to HTML 4, but instead of being HTML 5 it will be XHTML 1.0. It is designed to be portable across all kinds of applications and platforms, and the fact that it is well-formed means that the applications required to view it will be simpler than today's browsers. Also, since it is XML-based, new elements can simply be added to the existing DTD rather than having to rewrite it from scratch.
Anyway, what is XHTML and how does it differ from HTML? Well, the following points apply:
For an overview of all things XHTML, see here [about.com].
Yes and No (Score:2)
It depends what you mean by using application independent formats. XML is not a formatting language: it's a content marking language. Formatting is then left to XSL or some equivalent language. Certainly MSWord could manipulate documents in an XML format; in order to display them and make them pretty, however, it would need to translate them, perhaps to its proprietary binary format.
The question then arises as to whether some standardized DTDs will appear, which then word processors could recognize and have their own XSL templates for. I'd argue this won't happen; more likely, MS would come up with a DTD, and others would follow.
On one hand, professional work would certainly stay within the ideas a DTD presents (this is a paragraph: see, it's indented). On the other, random users wouldn't; they want a final look, not a semantically consistent document. Someone who wants to indent a paragraph might use the wrong word-processor feature, so that what looks like an indented paragraph is really an ordered list of one element with no bullets. Put it into another program, and it might look entirely different.
Re:Yes and No (Score:2)
Actually, I fail to see why this would be so terrible.
The first reason for its non-terribleness is that it's what people do now... end users use spreadsheets for databases, databases for mail merges, mailing label files for databases. We can encourage them in better habits, but we would be unwise to declare "Oh, no, we can't give you powerful tools, since you'd just use them wrong."
The second reason actually extends the first reason: sometimes doing it wrong is actually right. Namely, because doing it 'wrong' can produce something acceptable, fast, while doing it 'right' requires more time and care and installed base of expertise, all going into a product that may be used once and thrown away.
The third reason is to correct a strange blind spot that may originate from the perception of what the Microsoft Way is and how the Right Way must be diametrically opposed. Namely -- if the end user is interested in a final look, why is that an inappropriate use of XML?? Isn't the point of XML to be the mark-up language that can be adapted to any domain? I don't see why we should insist that 'the final look' is an exception, for which the only correct answer is XML plus XSL...
What's the best programming language? Trick question; a bondage-and-discipline language may be the best for writing code that someone else will have to maintain, but it's probably not best for a quick-and-dirty one-shot or for prototyping. There's no one right language. And just the same, there's no stone-graven law that says an XML editor will always be better than a word processor (or a word processor than a text editor). Having a choice would be better than what we have now... and having a XML-based word processor format would be better than a proprietary binary word processor format.
XML is not for humans (Score:2)
Re:Why not even html (Score:2)
Multiple column text.
Read the "TABLE" tag.
Footnotes and endnotes.
Read "FONT SIZE=1" and "I".
Automatic section numbering.
Uh, what? An HTML editor would be able to add page numbering automatically; I'm not sure what section numbering is.
Automatic generation of tables of contents and indexes.
Read "TABLE" again.
Extensibility with macros.
Any decent GUI HTML editor would be able to add this feature.
Precise control over how your document will be formatted on the printed page.
Any decent GUI HTML editor would be able to do this.
Need I continue?
More than likely. I still don't see your point. OK, let's assume Word97 is the greatest word processor of all time; say it has the perfect user interface. Now, on the back end, rip, tear, and pull out that proprietary format. So you just have the front end, right? When it tries to save the file, have it push everything out into HTML instead of the Word97 format. So you have an HTML file pushed out, but the user doesn't know it is an HTML document. Pretty neat, huh?
OK, so HTML is a little bloated; let's use a freely available compression program to compress that sucker. Have it save to disk in HTML and compress it, without the user even knowing about it.
Then what do we have here? Bob has a file called `Resume.gz`: smaller than a Word document, more portable than a Word document (you can find gzip on almost any machine), but it looks exactly like a Word document, and the user doesn't even know he is saving out into an HTML format. What are the disadvantages here? It makes the admin's life easier: when Bob says "Oh damn, on a Mac/OpenVMS/Unix/DOS I need to find a Word97 machine," his friend can say "Wait Bob, don't be silly, this machine can read your document; the days of proprietary formats are gone, my friend."
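The save-as-HTML-then-compress pipeline is trivially sketched (file contents invented; gzip exists on nearly every platform mentioned):

```python
import gzip

# The poster's pipeline: save as HTML, compress transparently, and
# decompress anywhere gzip exists. The round trip is lossless.
html = "<html><body><h1>Resume</h1><p>Bob, technical writer.</p></body></html>"

compressed = gzip.compress(html.encode("ascii"))
restored = gzip.decompress(compressed).decode("ascii")

print(restored == html)  # True
```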
Why not even html (Score:2)
Why not even HTML 4.0 as a standard format? Seriously, how much stuff do we REALLY need shoved into our bloated word processors/documents anyway? HTML 4.0 can do tables, images, highlights, bullet points, etc., etc. Seriously, give me a standard Word97, Star Office, Applix, or WordPerfect document and tell me that something in it can't be done in straight HTML. Give me any document and I bet I could convert it into straight HTML and have it look exactly like the output of the $250 office suite.
How much useless crap do we REALLY need in documents? Seriously: a paper typed on a $15 typewriter from eBay and one from a $200 Microsoft Office suite; tell me, can you really tell the difference between them? We need to write papers and have these papers look professional, and that is it. And you are telling me this can't be done in straight HTML?
Hell, HTML is everywhere. Take your document, edit it in Hot Dog Pro, and view it in Netscape under Windows. On a Unix machine? Use vi or bluefish, with Netscape or kfm as the viewer. Got a Mac? I know for a fact there is Mac HTML editing software; I just don't know its name at the moment. See where this is going?
Create your documents, upload your documents to a web server, or put them on a floppy disk, and I guarantee that everyone on any decent platform will be able to not only view but edit these documents.
Does anyone else remember the KISS theory from "Intro to CS"? Keep It Simple, Stupid. Everyone can make good use of straight HTML and it looks damn good; what is the problem?
If you want to test this out, the next time you have to write a document (school paper, resume, etc.), do it in straight HTML. Then open up your favorite graphical browser, click on "print", and hand in the paper. Does the other person notice? If no, this is a good idea; if yes, this idea sucks.
Seriously, name one feature in Word97, Star Office, or WordPerfect that couldn't be done in a nice GUI HTML editor. Just name one. One example.
When I had to start writing papers, my friend told me "No, you can't write documents in vi; here, use this." I loaded up Word95, and after it took fifteen minutes to load, grinding the hard disk the entire time (lack of memory, swapping a lot), it basically gave me the ability to add "bullets" and "underline". Yeah, why the hell can't I do that in vi and HTML, and print it in Netscape? It will look the exact same. The differences, though:
1) the document is portable
2) the document can be uploaded to the web without any modifications
3) I can use any standard ASCII or HTML editor on any platform to edit it
4) the document size is smaller
5) I have HTML bragging rights
6) the document looks just as good as any non-portable proprietary document
7) I don't have to pay $200-500 for a word processor
I have thought about this for a long time and haven't really seen any advantage in using a word processor over HTML. You may argue, "they are 100% easier to use," but you could make a GUI HTML editor just as easy to use: just 'hide' all the tags from the user and have all the same buttons. Granted, most (or some) HTML editors don't have spell checkers, etc. But how hard would it really be to build those into a GUI HTML editor? Very simple.
I don't mean to be flaming here, but I honestly don't see any advantages word processors have over HTML (hell, you could use one as the front end and just have it "spit" out HTML in the background when the user is not looking).
It sounds so simple, please enlighten me if I am wrong here.
You have it backwards. (Score:2)
The issue is a failure of the global populace to standardize their computing platforms.
If you people would just keep your software updated and standardize on Microsoft products, this would be a non-issue.
If all computer users would just vote republican in November, we could get a decent man in the White House who will put a stop to the DOJ madness, and show that twerp RMS the true meaning of freedom -- the freedom to use a single product and vendor.
If you could just put your personal issues aside and trust Bill Gates, everything will run smoother, and people will love and smile again.
Hey genius... (Score:2)
That is not Microsoft's fault. That is your fault.
This has to be one of the dumbest AskSlashdot questions ever.
Here is your answer...File...save As...Word 95...end of question.
Re:Drawback of XML (Score:2)
For example, slashdot puts their headlines in an XML format, making it easy for other sites to create a "slashdot headline box", adding some interesting content. Another good use would be a search engine for your site that output XML. If you rely on comparison sites like mySimon to drive traffic to your store, having the data in XML rather than HTML can make the communication between sites far more reliable.
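Building such a headline box takes only a few lines (the feed below is illustrative, roughly in the spirit of what Slashdot exports, not the exact format):

```python
import xml.etree.ElementTree as ET

# An illustrative headline feed: each story carries a title and a url,
# which is all a "headline box" on another site needs.
feed = """
<backslash>
  <story>
    <title>Can XML Replace Proprietary Formats?</title>
    <url>http://slashdot.org/askslashdot/</url>
  </story>
  <story>
    <title>SVG Viewer Released</title>
    <url>http://slashdot.org/developers/</url>
  </story>
</backslash>
"""
root = ET.fromstring(feed)
box = [(s.findtext("title"), s.findtext("url"))
       for s in root.findall("story")]
print(len(box))  # 2
```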
To me, XML is about data, not documents. Also, I can recommend "XML Bible" by Elliotte Rusty Harold or "Professional XML" from Wrox as a couple of interesting books if you ever decide to revisit the subject.
Re:Absolutely! (Score:2)
I agree entirely that XML conceivably allows us to focus on the matter at hand: writing well.
I'm a technical writer and I've said all along that media means nothing, the tools are irrelevant, and technology is merely a practical means to get a job done.
(Refer to prior posts about the horrors of MS Word and the glories of LaTeX. Which allowed success? LaTeX painlessly produced more thesis papers, correct?)
My professors didn't like to hear my stand on media, but I haven't been proven wrong yet. They were mystified by media because people who use online media are less forgiving than people who read hardcopy. The problem? They didn't realize that to write well takes discipline, hard work, and diligence. Sadly, they were my professors.
If we use XML, we focus on writing well. Presentation becomes secondary, as it should, with the use of XSL, DTDs, or whatever other mechanism is available to output our words.
"Think well, speak well, write well."
Perhaps my dissenting professor forgets, but she defeats her own argument with her mantra.
Tim Somero
What you ask for has existed for years... (Score:2)
...It is just now that the general public is becoming aware of it, in the form of XML. Just visit any IBM documentation shop. They've done all their documentation in SGML for years; in this problem space, there is no difference between SGML and XML.
Four years ago, I faced exactly the same problem as you: several thousand pages of product documentation formatted in Word. In our case, we had just lost five tech writers, leaving me holding the bag. So, I cobbled together an SGML publishing system with a colleague, for about $1000, including an on-line collaborative editing system, full on-line browsing of all docs, and semi-automatic translation of the SGML into LaTeX, thence into postscript or PDF. These days, all you need is a copy of WordPerfect 9 for the writer, a Linux box running the Xalan XSL formatter and a copy of PDFLaTeX, and you can single-source web and print versions of all your docs. The cost would be the cost of WP9, but of course you could just use a text editor.
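The semi-automatic SGML/XML-to-LaTeX step in such a pipeline can be sketched in a few lines (tag names and the mapping are invented here; a real system would use DocBook or a custom DTD plus a full stylesheet):

```python
import xml.etree.ElementTree as ET

# Map semantic elements onto LaTeX commands: the essence of the
# translation step in a single-source publishing pipeline.
TEMPLATES = {"chapter": "\\chapter{%s}", "para": "%s"}

doc = ET.fromstring(
    "<book><chapter>Introduction</chapter><para>Hello, world.</para></book>")

latex = "\n".join(TEMPLATES[el.tag] % (el.text or "") for el in doc)
print(latex)
# \chapter{Introduction}
# Hello, world.
```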
This is not new technology. It's quite mature, as institutions like IBM and the U.S. military will attest.
As to myself, I create new DTDs as I need them for writing projects. One language per project, often.
By the way, you do not want to standardize on one "word processor" language for XML--that would be to miss the whole point of XML/SGML.
SGML has been doing this for years. (Score:3)
Of course, SGML is pretty complex, so XML was born to simplify it. XML is now being used to accomplish the same thing.
Re:DTDs are not that important (Score:3)
Seriously, having a DTD is VERY helpful, because it allows you to edit a document using ANY SGML- (or nowadays XML-) compliant editor and ensure that you will produce something which can be loaded back into the original editor 100% cleanly (and without blowing away half of the structure that the original editor had set up). This is the specific functionality the question was referring to.
DTDs do indeed have their semantics documented; indeed, most of the more common ones have their semantics documented MUCH more extensively than ANY proprietary format out there. Easy and obvious examples would be HTML and XHTML, indeed just about anything produced by the W3C. Better examples would include DocBook, TEI, MIL-STD-38784, and ISO 12083. I would argue that these are all documented much more extensively than most proprietary file formats. Certainly, being proprietary doesn't mean that the file format defines semantics any better than something with a DTD.
Sure, the semantics aren't enforced by the DTD, but they can be enforced by the end user, something which is typically not true when you're editing a proprietary format using a foreign tool.
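The distinction worth keeping straight here is well-formedness versus validity. A non-validating parser only checks the former; checking a document against a DTD needs a validating parser (which, as a sketch, Python's standard library does not provide — the example below shows only the well-formedness half):

```python
import xml.etree.ElementTree as ET

# Well-formedness is what any XML parser checks: every open tag has a
# matching close. Validity against a DTD is a separate, stronger check.
def well_formed(doc):
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<chapter><title>Intro</title></chapter>"))  # True
print(well_formed("<chapter><title>Intro</chapter>"))          # False
```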
This kind of stuff is done by the US government on a daily basis.
Barriers exist right now... (Score:3)
In short, I expect to see this sort of tool become a reality in this season's software releases.
Other barriers to this also include decent formatting. We have reasonable XSLT styles for DocBook, but completely modifying these to make a custom look and feel is still pretty hard. Someone is going to release an XSLT WYSIWYG editor real soon now and make another killing in the market.
So in summary, I think yes, XML can and will replace proprietary formats. And ultimately be easier to work with.
Want to deliver XML with Apache to varying media devices in different styles? Get AxKit [sergeant.org]
Re:An open question (Score:3)
Just one nit: PostScript is actually a pretty bad example of this, because while it's reasonably easy to generate, it's horrendously hard to extract any useful information from.
Tools that take PostScript as input tend to be fairly fragile if they're trying to do anything beyond just rendering the document. "2up" converters often fail on PostScript generated from certain sources. Many graphics packages that allow insertion of EPS simply can't render the EPS on-screen unless there's an embedded TIFF "preview". PostScript to text converters rarely, if ever, work.
PostScript is a nice language for talking to printers. It isn't a good language for talking to software, though. The fact that it's Turing-complete means a lot of the analyses that would be useful to do on documents simply can't be done with PostScript without actually executing it, and there's no way you can tell if it'll ever halt. PostScript documents also tend to be filled with low-level rendering information, not the high-level semantic information required for things like searching, translation, conversion into other formats, etc.
XML is far superior in this respect. XML documents can encode semantic information, and they're easy to analyze. They're also a heck of a lot easier to parse. There are many XML parsers available; I can only think of one PostScript parser that isn't built into a printer (GhostScript). XML isn't a panacea, though. Even if every application vendor switched to XML, they'd probably all use different DTDs. That's still better than unreadable binary formats, though, because it's a lot easier to reverse-engineer the file format if it isn't published.
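The reverse-engineering point can be made concrete (the "mystery" document below is invented): even with an unpublished DTD, an XML file can simply be walked and its vocabulary listed, which is the first step of any format archaeology.

```python
import xml.etree.ElementTree as ET

# A file in an undocumented vendor format: no DTD in hand, but the
# element names a vendor used are right there to enumerate.
mystery = "<doc><meta app='FooWriter'/><body><p>hi</p><p>bye</p></body></doc>"
tags = sorted({el.tag for el in ET.fromstring(mystery).iter()})
print(tags)  # ['body', 'doc', 'meta', 'p']
```

Try that with a binary Word file and you get a hex dump instead of a vocabulary.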
Absolutely! (Score:3)
/ZL
Re:An open question (Score:3)
Word is not the worst case here; Excel is even worse: its format has changed in almost every new release of MSOffice.
As for why this happens: peer pressure, and that's exactly what Pauly talks about. If your client uses it, so will you (or at least you will have to convert to your customer's format before exchanging documents). In the recent past it was not even so much a question of tolerance as of no choice. Look at any of the office productivity suite reviews at ZDNet [zdnet.com] or C|Net [cnet.com]: MS is almost always a clear-cut winner, even though most of the bells and whistles an average consumer will NEVER use (as a side note, wouldn't you think that most users could happily live with the functionality of Word 2.0?).
As for what could be done to resolve it, I think that trying (whenever possible) to exchange HTML docs could be one solution, but you lose some control over the layout and won't be able to do any sort of document automation. And when it comes to a 3000+ page document, you just gotta convince that customer not to use Word for it.
A few people had mentioned TeX and LaTeX, as well as SGML here, but I guess this is not the answer for Pauly, as his customers are not happy with it. OTOH, slowly educating them could help a lot. FrameMaker [adobe.com] would be the best choice then: you don't need UNIX to run it (unless you'd want to try to convert your customer completely), get great documents, can convert them into SGML (with FrameMaker-SGML).
Binary file formats vs XML (Score:3)
With products such as XMill it is even possible to compress XML documents very well, so that the additional markup won't result in bloated files.
Binary proprietary formats are only good for keeping the structure secret and competitors out of the race. I wonder why Microsoft opened up theirs... Maybe it has become complicated enough so that nobody tries to create a filter! Or the descriptions do not contain 100 percent of the file format or wrong information... Yes, that's a bit paranoid, I know. Anyone from the KOffice team here to give us some insight?!
Already Happened (Score:3)
XMLBabel-Logic gone Awry (Score:3)
I'm usually very sceptical of new buzzword technologies. When I first heard about XML and did a little research I was floored by the elegant simplicity of the model. XML at its most basic is a set of parser-readable tags that allow the representation of structured data. Much like simple HTML tables are constructed of tags like <table>, <tr> and <td>, XML extends this to define more complex structures.
For instance, a simple dataset containing hair color, eye color, and name for a group of people could be represented with tags like <person>, <name>, <haircolor> and <eyecolor>. This idea is not only a boon for people trying to move complex information across the web, but it also allows for greater complexity in documents viewed on the web.
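A minimal sketch of that dataset, with made-up tag names, parsed using Python's standard `xml.etree.ElementTree`:

```python
# Structured data as XML: each <person> groups a name, hair color
# and eye color, and any XML parser can walk the tree back out.
import xml.etree.ElementTree as ET

doc = """
<people>
  <person>
    <name>Alice</name>
    <haircolor>red</haircolor>
    <eyecolor>green</eyecolor>
  </person>
  <person>
    <name>Bob</name>
    <haircolor>black</haircolor>
    <eyecolor>brown</eyecolor>
  </person>
</people>
"""

root = ET.fromstring(doc)
names = [p.findtext("name") for p in root.iter("person")]
print(names)  # ['Alice', 'Bob']
```

Any other program that agrees on the tag names can read the same file, which is the interchange idea in a nutshell.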
Here's the part where things went awry. The W3C (World Wide Web Consortium [w3c.org]) is the official standards organization for the web. Their biggest problem is that, as a standards body, they are trying to maintain stability and conformity of standards. This makes them rather slow at approving and implementing new ones. In the past this resulted in companies like Netscape and Microsoft integrating new technologies into their browsers long before they became standards. JavaScript and ActiveX are just two examples. Can't really blame them; they have to compete in the marketplace, and he who gives the consumer what they want soonest usually wins and gets to set the de facto standards. In a nutshell, the W3C has become little more than an R&D organization.
So, then we get to XML. Initially proposed over six years ago, it was at first rejected by the W3C. Many outside the W3C liked the proposal, though, so many groups started developing and testing different variations of XML. XML and related technologies like XSL (Extensible Stylesheet Language), SVG (Scalable Vector Graphics) and a plethora of others began to appear. See Oasis [oasis-open.org] to get an idea of how far this has gone.
Today there are so many standards for XML variants that there are actually groups with competing standards for XML formats as specific as data exchange between banks. Kind of like a modern-day Tower of Babel.
So, to answer your question: yes, XML holds a lot of promise for document and data interchangeability among different software products, but between here and that goal is one huge civil war among competing groups and technologies. Giants of the software industry like IBM and Microsoft have already staked out their ground. Recent patent-rule changes and the passage of UCITA in several states have complicated matters by allowing companies to patent abstract things like database structures and parsing rules. Hopefully, as with the war between VHS and Beta, a clear winner emerges quickly; more importantly, the winner must be an open standard.
Re:Drawback of XML (Score:3)
Well, yes. With 1996 standards even. It might of course be hard to find a popular browser that is even remotely up to date, but you can't blame HTML or stylesheets for that. And XML isn't a magic wand that suddenly makes browser authors do something "advanced" instead of going for the mass market appeal.
-- Abigail
New file formats are not new! (Score:3)
I'm not going to get into the debate about "open" standards, XML vs proprietary format, or whether Microsoft is somehow evil.
I will say that if you prepared 3000 pages in a format that your client wasn't able to use, it's your fault. Stand up and do the legwork to understand your client's needs. If your client had the same version of Word, or you started with a copy of their version of Word, it wouldn't have mangled your "weeks of hard work." If you need critical compatibility, preview using exactly the same set of operating system, software, fonts, video drivers, printer drivers, paper and ink cartridges that they will use.
Applications extend their format all the time. I can't load a Photoshop 5.0 document into version 1.0 without problems. I can't load an HTML 2.0 compliant page into an HTML 1.0 compliant browser without problems.
The same thing would happen even if Microsoft was 100% XML 1.0 compliant, as soon as people made XML 2.0 documents.
It's your responsibility to provide the results for your client; stop blaming the tools. Get tools that will provide the results your clients want. "Gee, my hammer's left-handed, that's why I need to start your kitchen cabinets all over again."
(New file formats are not new. =anagram>
Lament, now refine software.)
XML easily subverted (Score:3)
XML offers quite a few benefits, not least of which being that it forces the author to think of a document in terms of a tree. It by no means will enable everyone to just start talking overnight, magically.
Re:Hey genius... (Score:3)
Matthew Miller, [50megs.com]
Good Idea but counter gratuitous complexity (Score:3)
Re:You have it backwards. (Score:3)
1) Recent lobotomy (credit for ECT)
2) Totally humorless (credit for cluelessness)
3) Blind
4) Stupid
5) Poke at keyboard with cane.
it was soooo obviously a joke...oh well. knowing my luck i'll probably end up trying to teach one of these chumps to program one day...take a deep breath...start over at the beginning...keep trying to break through...arrrrgh.
XML does everything - whatever. (Score:3)
The only problem is that the applications have to be created to support the XML standard. So, unless you have a word processor that supports XML, and the people you're sending your documents to can read XML with their software ('cause you know MS will make an MSXML bastardisation...) you might still be out of luck.
Re:Already Happened (Score:3)
XML is used in HTML files created by Office 2000 applications (except for FrontPage) for "round-tripping" of HTML files from source app, to HTML app, back to source app. You see, there is a small application that reads the XML tags in the HTML file and sends it off to the source app for further editing when you Edit the document. Some information is also stored in the HTML file in XML-compliant tags to provide the source app with further information about the original source document -- to make it appear seamless.
FrontPage strips these XML tags out of the HTML files and breaks round-tripping.
Re:XML does everything - whatever. (Score:3)
Regarding the Applications. It's true. They're not quite there yet, but they're coming. My company is putting together quite a decent one, and many other vendors are trying to do it right as well. The Windows support, unfortunately, will generally be there before the UNIX support, but UNIX support is not far off.
Regarding Microsoft and XML. Microsoft, though I hate to admit it, is one of the more influential catalysts for the development and standardization of XML specifications at the W3C. Their stance on XML, for the most part, is one of driving the standard forward, as opposed to their past approach to other Internet technologies: embrace and extend. Let's give them some credit in that sense, then.
Tom Bradford (CTO) The dbXML Group
Why? (Score:4)
Now all the businesses that want to do business with the government switch to Word. So what happens next? The businesses that do business with those businesses switch to Word. It's recursive.
Personally, I think the government could do much to open up the playing field by making it so all documents sent to the government had to be in some openly documented file format (XML based if you like to pretend that XML solves all problems, or just some random binary format or what not.)
This simple move would smack Microsoft far harder, and more fairly than most any DoJ action.
----------------------------
Learning XML, XHTML (Score:4)
I spend my working time (and then some) as a Web Designer and have recently been trying to read up on XML and XHTML. (is that slightly redundant?)
It is turning out to be quite a difficult task. While everything I read tells me that it will be replacing all those proprietary document formats, it doesn't tell me exactly how that is supposed to work in a real-world scenario. I believe that it does have that potential, but I am stuck in exactly the same place as the poster... not being able to find the answer to what seems to me a rather basic and obvious question. Is it worth my time to learn XML for future use, or is it just another wild dream of a select few people?
XML Wouldn't Help (Score:4)
In addition, XML only affects how the file is stored on disk. Internally, Microsoft Word will represent your document the same way regardless of whether it's stored as XML or in a binary format. If it wants to create a binary version of your document, Word will simply write the document's raw internal data structures to disk; if it wants to create an XML version, it will first convert its internal binary representation to XML and then write that to disk. The only case where an XML-based file format is better is for third parties who don't know the internal structure of Word's file formats but still want to read its files. Microsoft has intimate knowledge of its own file formats, so storing them as XML gives no advantage to Microsoft applications.
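The point can be sketched with a toy document model (all names here are invented for illustration; `pickle` stands in for a proprietary binary dump):

```python
# The in-memory model stays the same; only the serializer differs.
import pickle
import xml.etree.ElementTree as ET

# the application's internal representation
document = {"title": "Report", "paragraphs": ["Intro", "Body"]}

# "binary save": dump the internal structures more or less as-is
binary_bytes = pickle.dumps(document)

# "XML save": convert the same structures into markup first
root = ET.Element("doc", title=document["title"])
for text in document["paragraphs"]:
    ET.SubElement(root, "p").text = text
xml_bytes = ET.tostring(root)

# either file round-trips back to the identical internal model
assert pickle.loads(binary_bytes) == document
reread = ET.fromstring(xml_bytes)
assert [p.text for p in reread.iter("p")] == document["paragraphs"]
```

The XML path costs a conversion step but buys nothing for the vendor who already knows the binary layout; the win goes entirely to third-party readers.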
Re:Nope (Score:4)
"Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes."
You're overlooking a fundamental feature of XML. If Word21 needs to add additional elements or attributes to support new features, they simply create new tags. If the document is loaded in Word20 (ignorant of those tags) it won't look quite right (whatever feature was implemented with those tags will be skipped), but it will still display. If M$ wanted to try to maintain its current upgrade-4-compatibility approach, they could change all the tags with every version, but such obvious and outlandish behavior would only serve to destroy whatever fragment of reputation they still have.
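That graceful-degradation behavior is easy to sketch. Here a hypothetical newer version wraps content in a `<glow>` tag the old reader has never heard of; the old reader just ignores the wrapper and still shows the text (all tag names invented):

```python
# An old reader extracts the paragraphs it understands, silently
# ignoring wrapper tags introduced by a newer version.
import xml.etree.ElementTree as ET

# hypothetical "Word21" file with a new <glow> feature tag
new_doc = "<doc><p>Hello</p><glow level='3'><p>World</p></glow></doc>"

root = ET.fromstring(new_doc)
# pull out every <p> anywhere in the tree, known wrappers or not
text = " ".join(p.text for p in root.iter("p"))
print(text)  # Hello World
```

The glow effect is lost on the old reader, but the content survives -- unlike a binary format, where one unknown record can make the whole file unreadable.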
"XML can't replace proprietary document formats. That's like asking if ASCII could replace proprietary document formats."
I must not be understanding what you mean when you refer to ASCII, since simple text formats replace proprietary document formats all the time. TeX, CSV, RTF, HTML, PS -- all are human-readable text files. Certainly XML is only part of the solution: it stores the content while the formatting is handled elsewhere. In that sense it differs from the traditional mixed approach.
The most important thing about the transition from mixed formatting/content to clearly delineated content vs. formatting is that the author isn't (ultimately) going to have any control over formatting. Relax and give it a little thought. The format of a document should be determined (or at least be determinable) by the person reading it. If I counted the times I've had to read the source of someone's HTML because their background was obnoxious, I'd know just how much time has been wasted.
Not Already Happened (Score:5)
Microsoft wants you to believe that they are buzzword compliant, but in reality the output from Microsoft's "Save As HTML" looks like XML, smells like XML, but isn't. Try parsing it.
See the recent Byte article "The cup is half full" for more details. I'm surprised you haven't heard about this. MS is using its proprietary XML islands inside an HTML document. That means you have to get an HTML parser before you can even parse it. The content of the XML is just as proprietary: it's basically a conversion of their OLE Document objects into XML.
Re:XML needs to be integrated into Linux (Score:5)
Comments such as "XML should be in the kernel" betray a lack of understanding as to the proper function of the kernel. Worse yet, (unlike, say, khttpd), putting an XML parser in the kernel wouldn't provide any benefit. All you're doing is encouraging the kind of useless feature bloat that Microsoft is rightly loathed for. That's why people get upset about remarks like this; they don't want this attitude to spread further than it already has.
"Anyway, what you are clearly unaware of is that the perception of performance and stability is far more important in the corporate domain than the actuality of the situation. By integrating XML into the kernel, you have provided Linux with a major marketing point for the people who are actually in charge of what their company uses."
You won't be able to maintain the perception of performance and stability if the actuality is the opposite. Even Microsoft, with its legendary marketing might, has begun to pick up on this fact a little. (Note how stability has become a marketing point for them; why would it need to be, but for the constant crashing of their existing products?)
The exact breakdown of an operating system varies from one OS to another. In general, the purpose of any "operating system" is to arbitrate and manage hardware resources. Anything else is basically fluff. XML parsing is an application support issue, and detracts from the core function of managing hardware resources. Occasionally, an application function may be put in the kernel for good reasons, usually related to huge performance advantages gained by an in-kernel implementation. (khttpd is an example of this.) Even this is resisted strongly, because it "pollutes" the most critical code in the entire system, and poses an inherent risk to the stability, integrity and maintainability of the system as a whole.
Basically, to add an application-specific function to the kernel, you had better have a really good reason to be suggesting it, one that can be justified (and defended) on a technological basis. If Linus were to allow marketing considerations (such as this) to drive kernel development, not only would he lose the respect of most of his supporters, but the end result would be just as crappy as Windows, sooner or later.
"Given that Linus himself has talked about 'world domination', doesn't it seem short-sighted to ignore a major selling point in favour of your petty-minded arguments?"
Keep in mind that "world domination" remarks are somewhat tongue-in-cheek. Yes, he's half-serious, but only half. He wants people to use Linux over Windows because it's a better system. It wouldn't remain better if this approach to kernel development were adopted. Keeping the kernel pure isn't a "petty-minded argument"; it's a critical element of good design.
All that said, you would have received a much different response had you suggested that Linux systems (as a whole) start integrating XML support, use XML for system configuration, and provide XML services for applications. There's a good argument to be made for that, and the marketing value should be similar. There are also technological arguments in its favor. The distinction here is that this support would all be in "user space" rather than the kernel, even though it might be an integral part of the operation of the system as a whole. The kernel is the core of the system, and the idea of integrating XML into Linux does not imply that it belongs in the kernel.
An open question (Score:5)
The compatibility breaking between different versions of Word is well-known and oft-maligned. I have a hard time seeing it as anything more than a forced upgrade cycle, where Word users MUST buy the latest version in order to exchange documents.
There are other document formats which deliver the same power, have been around longer, have not *radically* changed, and are open to implementation by other vendors. HTML and XML-based grammars are only one example of this. PostScript would be an even better example.
So why have business environments settled on a standard which seems clearly to not be in their best interests? Why do they blindly pay for new versions every few years when their current versions do everything they need and more?
I'm all for letting the free market determine the best product, but Word strikes me as a solid example of the free market failing in this regard. Perhaps poor consumer education is preventing software from being a truly free market. The feature set of Word is nice, but the upgrade-insuring file format should cause people to run away. I would be skeptical of a car that used non-standard gasoline and forced me to buy an engine upgrade each year to handle new gas.
How has this been allowed to happen?
Sabotage (Score:5)
Assuming that any open document standard emerges, you can pretty well bet that saving from the market leader to that format will be an ugly process (have you looked at the HTML that that turkey produces? Blech!) You can also bet that imports from it will be better but still a pain. For real fun, try repetitive translations between the native format and the portable one and compare the starting and end results.
The sad fact is that monopolists have a huge stake in incompatibility (read the Halloween Documents) and every reason to maintain it. The rest of us will just have to survive in that environment until it changes. Changing it is another topic entirely, but for once I'll say, Vive la France!
Nope (Score:5)
If you have ever used lex [gnu.org] or yacc [gnu.org], then you'll know what I mean when I say that XML parsers essentially do the job of lex, but not of yacc. An XML parser is little more than a scanner which breaks a file into chunks to simplify the next level of processing. The XML parser gives the illusion of hierarchical processing that lex can't do, but it's an illusion nonetheless.
Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes.
So no, XML will not replace proprietary file formats. XML + proprietary DTD specifications + proprietary semantics could replace proprietary file formats. Is this an improvement? Probably. Will it make backward (or forward, or sideways) compatibility problems go away? Nope.
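The lex-vs-yacc point above can be made concrete. Both documents below are well-formed XML, so any XML parser accepts both; but a reader written against one hypothetical vocabulary finds nothing meaningful in the other (both vocabularies here are invented):

```python
# An XML parser only checks well-formedness -- the "lex" part.
# The vocabulary and its meaning (the "yacc" part and beyond) live
# in each vendor's DTD and application code.
import xml.etree.ElementTree as ET

word95 = "<doc><para style='Heading1'>Title</para></doc>"
word97 = ("<w:wordDocument xmlns:w='urn:ms'>"
          "<w:h1>Title</w:h1></w:wordDocument>")

for doc in (word95, word97):
    ET.fromstring(doc)          # both parse fine: well-formed

# but a reader built for the first vocabulary sees nothing
# it recognizes in the second
paras = ET.fromstring(word97).findall("para")
print(len(paras))  # 0
```

Same parser, same standard, zero interoperability: that gap is exactly what shared DTDs and agreed semantics have to fill.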
--
Patrick Doyle
Biased rant from co-inventor of XML (Score:5)
First off, while there's a place for MS Word, a 3000-page document ain't it. In my experience it tends toward severe breakage in this situation.
Office2K will already save docs in a kind of bastardized HTML++ format, which truly sucks because it is neither rule-following HTML nor well-formed XML, and it could have been without much trouble. A little bird has told me that a not-too-distant future release of Office will have a *real* XML save format, which would be cool. I mean, a lot of the tags will still be proprietary MS gibberish, but at least you can parse 'em, and it'll be way less susceptible to inter-version breakage.
A basic part of the XML dream was the notion that proprietary per-application data formats are just as silly as the 80's notion that computer networks should have proprietary per-wire data formats (remember DECnet, Wangnet, SNA?). So what pauly wants is exactly what XML is trying to do.
Having said that, a lot of the infrastructure we need to make it easy to author and deliver XML isn't here yet.
What I'm doing these days for complex documents is writing them in HTML++, by which I mean mostly well-formed HTML to which I add my own tags whenever I need to. You can display what you've written in old browsers, which helpfully ignore the non-HTML tags; you can write perl scripts or use XSL to turn it into RTF if you want to publish on paper; and with Mozilla you can write a CSS stylesheet and dress up your own tags the way you want.
Cheers, Tim Bray