Which XML Parser Do You Recommend?
tshieh asks: "I'm trying to add XML-configurability to a Java application, and I'm trying to figure out which XML parser I should use. Any thoughts on whether I would be better off using Xerces, expat, XML4J, or JDOM (or any others)? So far, I've decided to use DOM rather than SAX since I've heard that DOM is easier to use and I don't anticipate my configuration file becoming so large that the slower and more memory-intensive DOM parsing becomes an issue."
Re:i recommend (Score:1)
Using SAX you simply write code that accepts events and does the right thing (writes into your data structures, does the work, whatever), or traverses the data structures to write out XML. Using DOM the parser generates a DOM data structure that I then need to traverse in order to copy data into my internal data structures (or build a DOM structure so that I can write out XML).
The short version is that I don't see a significant advantage to using the DOM data structures in parallel with my application's data structures, and it seems odd and inefficient to use DOM as the internal data structure for applications. Beyond that, SAX makes it easy to code in a robust manner that (properly) ignores everything in an XML file that it doesn't care about.
To an extent the preference between DOM and SAX comes down to programming model. I find event-driven programming natural, and the data-structure-crawling code that DOM requires a waste of time and a pile of extra code to debug. Other people seem to find event-driven programming confusing, and like writing loops.
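To make the event-driven style concrete, here is a minimal sketch using the standard JAXP and SAX classes. The `property` element, its `name` and `value` attributes, and the `SaxConfigReader` class are made up for illustration; they are not from any poster's code. The handler writes straight into the application's own map and silently skips everything it doesn't care about.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxConfigReader {

    /** Collects <property name="..." value="..."/> elements (a hypothetical format). */
    static class ConfigHandler extends DefaultHandler {
        final Map<String, String> settings = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // React only to the one element this application cares about;
            // any other markup in the file is ignored.
            if ("property".equals(qName)) {
                String name = attrs.getValue("name");
                String value = attrs.getValue("value");
                if (name != null && value != null) {
                    settings.put(name, value);
                }
            }
        }
    }

    /** Parses the XML text and returns the collected settings. */
    static Map<String, String> read(String xml) throws Exception {
        ConfigHandler handler = new ConfigHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.settings;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><comment>ignored</comment>"
                + "<property name=\"host\" value=\"localhost\"/>"
                + "<extras><property name=\"port\" value=\"8080\"/></extras></config>";
        System.out.println(read(xml));
    }
}
```

There is no tree to walk afterwards: by the time `read` returns, the data is already in the structure the application wants.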
XML/XSLT parsers (Score:2)
There are several good XML parsers, some free, some commercial. Have a look at the following URLs for more info on free versions:
xml.apache.org [apache.org]
users.iclway.co.uk/mhkay/saxon [iclway.co.uk]
www.jclark.com/xml [jclark.com]
I hope this is of some use to you.
Re:SAX or DOM (Score:2)
For example, given this:
Bob
admin
I'm going to hit the beginning of the "object" tag, and create an intermediate data structure (let's call it ObjectNode). As I continue to hit the start of each param tag, I'll pass that information on to the current object node. Once I hit the end tag of object, I pop that node off the stack and transform it into the final object I'm interested in (in this case, I finally end up creating a com.razorsys.User object from my little XML-object-serialization example).
What this shows is a way to do the sort of "remember where I was" stuff DOM gives you, without creating an entire tree. I believe some parsers out there actually do this for DOM parsing -- they really underneath have a pseudo-DOM, and build full branches only when they are requested. Much quicker and less memory intensive.
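A rough sketch of that stack technique, using only the standard SAX classes. The markup of the example didn't survive in the post, so the `class` attribute on `object` and the `name` attribute on `param` are assumptions, and this `ObjectNode` is a guess at the idea rather than the poster's actual class. The final conversion into a real application object is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ObjectStackHandler extends DefaultHandler {

    /** Intermediate structure that lives only while its <object> element is open. */
    static class ObjectNode {
        final String className;
        final Map<String, String> params = new LinkedHashMap<>();
        ObjectNode(String className) { this.className = className; }
    }

    final List<ObjectNode> finished = new ArrayList<>();
    private final Deque<ObjectNode> stack = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private String currentParam; // name of the <param> being read, or null

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("object".equals(qName)) {
            stack.push(new ObjectNode(attrs.getValue("class")));
        } else if ("param".equals(qName)) {
            currentParam = attrs.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several calls, so accumulate.
        if (currentParam != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("param".equals(qName)) {
            if (currentParam != null && !stack.isEmpty()) {
                stack.peek().params.put(currentParam, text.toString().trim());
            }
            currentParam = null;
        } else if ("object".equals(qName)) {
            // The object is complete: pop it and hand it on.
            finished.add(stack.pop());
        }
    }

    static List<ObjectNode> parse(String xml) throws Exception {
        ObjectStackHandler handler = new ObjectStackHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.finished;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<objects><object class=\"com.razorsys.User\">"
                + "<param name=\"name\">Bob</param>"
                + "<param name=\"role\">admin</param></object></objects>";
        for (ObjectNode node : parse(xml)) {
            System.out.println(node.className + " " + node.params);
        }
    }
}
```

Only the currently open objects are held in memory at any point, which is where the saving over a full tree comes from.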
Anyway... I endorse use of SAX. I'd stay away from JDOM or JAXP, but that's only because I like to stay language-neutral when it makes sense. Using just SAX means I can probably move my parsing code to a variety of languages and keep most of the logic...
Crimson (Score:2)
I'm surprised nobody mentioned Crimson [apache.org] (scroll down to the bottom of the directory).
This Java parser is open-source, under the Apache project. It was developed by Sun under the name of JAXP and is far more lightweight than IBM's Xerces, mainly because it uses built-in Java I/O instead of reinventing the wheel as Xerces does (with its ChunkyByteArray etc.)
For whatever reason Sun plans to adopt Xerces as the official JAXP implementation. I dunno why... I think Crimson is much better for a lot of applications.
Do NOT use JDOM... (Score:1)
As it is, parsing text is one of the things that Java does the absolute worst, as its string management is abysmal. Your best bet is to make your choice based on what your goals for your documents are.
1. Are you going to need to evaluate information in your XML documents linearly, or with random access? If you are just reading through the document, and don't need to keep a tree, use SAX. Very simple, very straightforward interface...
2. Are you going to need to access the same document several times? Use DOM, and cache the document in memory.
3. Are your documents LARGE?
4. Read, read, read. Download several parsers, try them out, _understand_ the way this technology works, and then benchmark your typical use of the technology with several parsers and methods. How you use this technology, and how you implement it, is going to be as important as the flavor of parser. Do you need to validate your documents? How many are you going through? How fast does it need to be? These questions change the answer for 'which parser to use' drastically.
And remember: XML is NOT the magic bullet. For pete's sake, it's bloody text parsing... It's not designed to be ultra fast, it is designed to be flexible...
A few words from me.
SAX or DOM (Score:4)
SAX is really simple to use. You write event handlers, where an event is something like "start of document", "end of foobar tag", etc.
If you can write your application using the SAX event model, then you'd find DOM too complex. DOM is good when you need the entire document as a tree structure, e.g. to do some analysis.
If you use SAX, and you find yourself writing a lot of code to remember what you've seen, then you should probably consider DOM. If you can look at each tag or data item once and then forget about it, SAX is the way to go.
JDOM is by far the easiest (Score:2)
As your goal is to build a Java application, I'd definitely recommend JDOM because it is lightweight and the data is stored in normal Java constructs.
/J
Another anti-DOM voice. (Score:2)
- If you write your code using the JAXP API you should be independent of your choice of parser (it lets you get hold of one, which you then use SAX or DOM with, without explicitly mentioning the other package)
- Beware of non-standard features. For example, some XML parsers (like Oracle's) let you reparent a DOM node to a different document. This is not supposed to be legal. (because the underlying representations of the nodes may differ; eg you could have just attached part of a (text) document to a one-row-per-element database representation)
- The DOM is chock full of gotchas. Like, you MUST remember to normalize an element before grabbing the first child node and expecting it to be text (you see this mistake made *a lot*). Some parsers will e.g. split lines of text into separate text nodes, each of which must be retrieved separately if you haven't normalized the element.
- The upshot is that SAX is very easy to use in comparison. You can also very easily build an application in terms of SAX filters; with the DOM you can find yourself spending all your time writing stuff to traverse bits of the tree.
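The normalize gotcha above is easy to reproduce. Whether a given parser actually splits text is parser-dependent, so this sketch builds the split by hand instead of relying on one; the `title` element and the `NormalizeDemo` class are just illustrative names. It also only touches `javax.xml.parsers` and `org.w3c.dom`, so no particular parser package is named, which is the JAXP point from the first bullet.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {

    /** Builds <title> with two adjacent text nodes, as a parser might produce. */
    static Element splitTextElement() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element title = doc.createElement("title");
        doc.appendChild(title);
        title.appendChild(doc.createTextNode("Hello, "));
        title.appendChild(doc.createTextNode("world"));
        return title;
    }

    public static void main(String[] args) throws Exception {
        Element title = splitTextElement();

        // The common mistake: assume the first child holds all of the text.
        System.out.println(title.getFirstChild().getNodeValue()); // prints "Hello, "

        // normalize() merges adjacent text nodes into one.
        title.normalize();
        System.out.println(title.getFirstChild().getNodeValue()); // prints "Hello, world"
    }
}
```

Later DOM versions (Level 3) also added `getTextContent()`, which sidesteps the problem for simple cases, assuming your parser supports it.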
So what was the bad XML parser? Consider the following scenario: an XML document gets stored as a string in a database. An application reads the string back, and tries to parse the document. It's supposed to infer the character encoding by looking at the chars or the explicit 'encoding=' attribute of the XML PI. However, in VB, the msxml parser will only respect encodings when reading from _files_. Strings are stored internally in MS-land as DBCS, and it will silently convert the characters in your XML as it reads from the DB. Thus, as soon as it sees an encoding of (e.g.) ISO-8859-1 in the XML string, it barfs and dies. The original data was stored by the DB in the correct charset; it seems to be the internal conversion to DBCS that screws up. Neat. Not.
Xerces (Score:3)
Pull-based parsers (Score:1)
http://www-ai.cs.uni-dortmund.de/SOFTWARE/KXML/
http://www.simonstl.com/articles/cxmlspec.txt
JDOM (Score:4)
We've been using it at work and are happy with it so far.
DOM sucks (Score:3)
It also takes a whole lot less memory and time to parse than the DOM approach.
We process XML files that can be up to 10MB in size, and DOM parsing these files brings my 500MHz P3 to its knees (yes, I have increased the JVM heap size).
Parsing with SAX, however, has proved simple, clean and easy with no performance problems at all.
If you're definitely wanting to use the DOM, try NanoXML. It's much smaller than Xerces and the like, and is perfect for parsing config files etc., as well as being small enough (6KB or so) for client-side and embedded use.
However, for an all-round, no compromise XML parsing solution, then Xerces is pretty good.
DOM vs SAX (Score:2)
Neither DOM nor SAX really specify a parsing mechanism. Rather they specify the means by which the parser exposes the parsed content to the client app. SAX is events, DOM is the whole thing, in a bucket.
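A small sketch of the "events vs. bucket" distinction: the same question (how many elements with a given tag name?) answered both ways through JAXP. The `TwoViews` class and its method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TwoViews {

    /** SAX: the parser pushes events at us; we keep only a running count. */
    static int countWithSax(String xml, String tag) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (tag.equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    /** DOM: the parser hands back the whole document, which we then query. */
    static int countWithDom(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<list><item/><group><item/><item/></group></list>";
        System.out.println(countWithSax(xml, "item")); // prints 3
        System.out.println(countWithDom(xml, "item")); // prints 3
    }
}
```

Both give the same answer; the difference is that the DOM version holds the entire document in memory while it answers, and the SAX version never holds more than one event.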
DOM is nearly always easier to work with, except when it isn't. Then it becomes completely unworkable (usually due to either huge documents, or long time delays) and you use SAX because you've simply no other choice.
JDOM? I've never felt the slightest need for it myself. Maybe it's because I first hit XML DOMs through M$oft's (where there is no JDOM). Once you've got the knack of using the DOM, I don't see what JDOM offers long-term.
i recommend (Score:1)
Incidentally, I could be wrong but I believe that SAX was easier to use than DOM. That's just my opinion though... You'd be better served finding someone who knows both and asking them...
--
Peace,
Lord Omlette
ICQ# 77863057
Re:Xerces (Score:2)
Also bear in mind that Xerces is a validating parser. I have no idea if the other parsers mentioned are or not, but Xerces certainly is. Using a DTD and letting Xerces take care of the majority (all) of the validation of your XML data is great.
I started using Xerces for a Java project, and eventually required a C++ element to it. The Xerces XML parser is also available in a C++ version and the API is practically identical to that of the Java version (incl. smart pointer based access to allow Java-style memory management and parameter passing). This made my job one hell of a lot easier: rigorous use of Ctrl+C, Ctrl+V.
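For anyone wondering what DTD validation looks like in code, here is a hedged sketch through the generic JAXP interface rather than Xerces-specific classes, so it should work with whichever validating parser is installed. The `config`/`property` DTD and the `DtdValidation` class are made up for the example. Note that validity errors go to the handler's `error` callback, which does nothing by default, so you have to override it or they pass silently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidation {

    // Hypothetical DTD for a tiny config format, kept inline for the example.
    static final String DTD =
          "<!DOCTYPE config [\n"
        + "  <!ELEMENT config (property*)>\n"
        + "  <!ELEMENT property EMPTY>\n"
        + "  <!ATTLIST property name CDATA #REQUIRED value CDATA #REQUIRED>\n"
        + "]>\n";

    /** Parses DTD + body with validation on and returns the validity errors found. */
    static List<String> validate(String body) throws Exception {
        List<String> problems = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // ask for a DTD-validating parser

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity violations arrive here; collect instead of ignoring.
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        };
        factory.newSAXParser()
                .parse(new InputSource(new StringReader(DTD + body)), handler);
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<config><property name=\"a\" value=\"1\"/></config>"));
        System.out.println(validate("<config><property name=\"a\"/><bogus/></config>"));
    }
}
```

The first call should report no problems; the second should report the missing `value` attribute and the undeclared `bogus` element, with wording that depends on the parser.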
Steve.
Re:XML/XSLT parsers (Score:2)
Re:JDOM (Score:1)