XML Compression Options?

ergo98 asks: "About a year ago I had the need to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, and there was an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill; however, even then it seemed to be an abandoned project with no more updates and little market presence, and it was only available as source for a command-line utility that would need reworking into library form. I'm curious whether there has been any progress in the XML compression arena in the past year: If you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made available as a module for Apache?"
This discussion has been archived. No new comments can be posted.

  • GZIP (Score:4, Informative)

    by shemnon ( 77367 ) on Friday January 18, 2002 @12:52PM (#2862870) Journal
    How about simply using text compression on the XML? Since gzip uses backward references to repeated strings, it compresses XML quite nicely. It is available in Java as well as in C-based implementations. If Windows is your platform of choice, you can get at it via Cygwin.
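    For the record, here's a minimal sketch of that approach in Java using the standard java.util.zip classes (the XML string is just a placeholder; error handling omitted):

    import java.io.*;
    import java.util.zip.*;

    public class XmlGzipExample {
        public static void main(String[] args) throws IOException {
            String xml = "<order><item sku=\"1234\" qty=\"2\"/></order>"; // placeholder document

            // Compress the XML text with gzip.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(buffer);
            gz.write(xml.getBytes("UTF-8"));
            gz.finish();
            byte[] compressed = buffer.toByteArray();
            System.out.println("original: " + xml.length() + " bytes, gzipped: " + compressed.length + " bytes");

            // Decompress again to show the round trip.
            GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            for (int n; (n = in.read(chunk)) > 0; ) restored.write(chunk, 0, n);
            System.out.println("round trip ok: " + xml.equals(restored.toString("UTF-8")));
        }
    }

    (On a document this tiny, gzip's header overhead can actually make the output larger; the gains only show up on realistically sized documents, which ties in with the packet-count point made further down the thread.)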
    • Have any XML-specific compression algorithms been made available as a module for Apache?

      And there is mod_gzip [remotecommunications.com] for Apache, if you go the gzip route.
    • That's what we do to store a bunch of XML documents in our database. Our applications which access them use business logic classes which transparently compress and decompress the data. We use gzip and get an 80-90% reduction in size, with no noticeable effect on performance.
    • Re:GZIP (Score:5, Informative)

      by ergo98 ( 9391 ) on Friday January 18, 2002 @06:03PM (#2864902) Homepage Journal

      The problem with GZIP is quite simply that it doesn't take advantage of XML specifics (i.e. there are domain attributes of XML that make certain algorithms much more efficient than others): XMill, as an example, manages to achieve almost twice the compression of GZip with approximately the same CPU usage.

      In this case, adding a module that would do XMill on the server side (obeying, of course, the Accept-Encoding that the HTTP client passes in, so a client that didn't handle the custom compression would not be thwarted) would allow us to use a custom HTTP client that could handle the decompression and achieve twice the throughput on the limited pipe. As I mentioned in the submission, we have more CPU power than bandwidth, so that 2x compression improvement is very significant (there is a world of low-bandwidth vertical uses out there: satellite, frame relay, CDPD, etc.). For those who cringe at the idea of using XML to begin with, please realize that with the proper compression, XML-encoded data is smaller than any proprietary packaging because of its high degree of predictability.
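      As a sketch of what that negotiation might look like on the wire (the x-xmill content-coding token here is purely hypothetical; gzip, compress and deflate are the codings HTTP/1.1 actually registers):

      GET /feed.xml HTTP/1.1
      Host: example.com
      Accept-Encoding: x-xmill, gzip

      HTTP/1.1 200 OK
      Content-Type: text/xml
      Content-Encoding: x-xmill
      Content-Length: 4312

      ...XMill-compressed octets...

      A stock client that only advertises gzip would simply get a gzip-encoded (or identity) response from the same URL, so nothing breaks for clients that don't know the custom coding.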

    • How about simply using text compression on the XML? Since gzip uses backward references to repeated strings, it compresses XML quite nicely. It is available in Java as well as in C-based implementations.

      The problem with the gzip approach is that it is best on long files rather than short ones. The problem with XML is that SOAP messages that should take one or two packets end up taking 5. Gzip tends to make 5 packets 4, but it *will* make 50 packets 25.

      The other problem with GZIP is that the implementation is fairly computationally intensive. It is not something that you can implement as a simple filter with almost no memory overhead.

      GZIP is a great general purpose compression scheme for largeish quantities of data, but a scheme that is highly domain specific can usually do better.

      The main mistake we made with the Web was not including a simple compression scheme optimized for HTML at an early stage. The Content-Encoding header was a good idea, but we should have shipped a lightweight compressor as well.

      The approach in XMill is a good one, but I am not too happy with the table-switching codes, and I think that an extra 25% compression can be wrung out at low cost. The big problem with XMill is that it was never progressed as a standard anywhere. I would give more details on the scheme; however, there has been a recent rash of patent trolls filing patents on ideas they find on mailing lists. I call that type of behavior fraud; the courts call it a cause of action that will cost me $2 million to defend against. Congress calls this situation an opportunity to collect campaign bribes.

      XML Compression is something that I am working on, but I am already the point person in two major XML working groups and this is not a high priority. I really need an intern to crank through measuring efficiency on various document sets.

  • ummm... (Score:1, Redundant)

    by bprotas ( 28569 )
    ...gzip? Sure, it's not state of the art, but everything supports it, and it's a decent compressor...
  • by disappear ( 21915 ) on Friday January 18, 2002 @12:52PM (#2862880) Homepage
    Why not reduce the information to transmit, using rsync or the equivalent? Or batch the data and use gzip. Or use ssh's -C option for compression, which won't do as good a job. But mucking with the XML is likely to be the least-understood way of doing it when it comes time to admin the system later, or update things.
  • XML Compression... (Score:3, Informative)

    by bje2 ( 533276 ) on Friday January 18, 2002 @12:56PM (#2862903)
    These guys [ictcompress.com] claim to be able to compress XML at a 34:1 ratio...
    • by ergo98 ( 9391 )

      This is exactly the sort of information I'm looking for: Thank you. I've downloaded their demo and am going to give it a try on some sample data to see how it performs (comparing it, of course, against the ubiquitous and infamous GZip).

  • gzip? (Score:4, Insightful)

    by Howie ( 4244 ) <howie@thi[ ].com ['ngy' in gap]> on Friday January 18, 2002 @01:00PM (#2862927) Homepage Journal
    There is a content-encoding plugin for Apache called mod_gzip [remotecommunications.com] that will handle the server end for any output, including dynamic content. I've not tried it, but at face value it's a standards-based way of getting what you want.

    I think, although I can't find it for sure, that LWP supports gzip content-encoding too, which would mean that things like SOAP::Lite and XML-RPC would benefit too.

    more about the content-encoding thing [webreference.com]
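    For reference, a rough sketch of what the relevant httpd.conf stanza might look like (directive names recalled from the mod_gzip docs, so double-check them against the version you actually install):

    # Apache 1.3.x -- load mod_gzip and enable it for XML/HTML responses
    LoadModule gzip_module modules/mod_gzip.so
    <IfModule mod_gzip.c>
        mod_gzip_on                Yes
        mod_gzip_dechunk           Yes
        mod_gzip_minimum_file_size 500
        mod_gzip_item_include      mime ^text/xml$
        mod_gzip_item_include      mime ^text/html$
        mod_gzip_item_include      handler ^cgi-script$
    </IfModule>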
    • I've used mod_gzip for the last year or so on a number of sites, one of which gets millions of hits a day. Trading idle CPU time in for drastically reduced bandwidth is a sweet bargain indeed. As a side benefit, those on slow links get a much faster experience.
  • dot gz (Score:1, Redundant)

    by forehead ( 1874 )
    It's text. Gzip/Bzip2/Compress/(pk)zip will all do a decent job. If you are using C++, the following library may be useful. http://www.cs.unc.edu/Research/compgeom/gzstream/. Basically, it implements an igzstream and ogzstream which look and act like normal C++ IO streams, but automagically gzip (or gunzip) the data.
  • by dmorin ( 25609 ) <dmorin@@@gmail...com> on Friday January 18, 2002 @01:47PM (#2863246) Homepage Journal
    If you are both the producer and consumer of the XML messages, then you could get away with whatever compression scheme you liked (I always envisioned something like mapping the DTD to numeric values, since the DTD describes all the tags and attributes you will use), and then converting down that way. Figure that for every tag that takes up something like 14 characters, you replace it with an integer (or even a byte) representing which tag it is, and perhaps something for the length of the field. My point is that you could roll whatever you like if you know who is on the other end (a rough sketch of this idea follows at the end of this comment).

    The problem, of course, is that if you control both the producer and consumer you're greatly limiting the applicability of XML in the first place. Just yesterday I explained to my boss that one of the advantages of XML is for cases when you have 10 people who want your data, but you can't dictate what software they use... AND, 6 months from now, 10 other people who you haven't even met yet are going to want your data too. If you're in that boat, and you create any sort of compression scheme, then you're in trouble. If you're not, then you may not need XML at all (at least, not for moving your data around).

    Perhaps you're hoping that there will be some compression module that becomes a standard part of XML, so that you can safely say "Anybody who is able to parse my XML message would also be able to decompress it"? Good luck. Even if that did happen, it would take ages for all of the parsers out there to get up to date.

    What you'll probably find is that something like SOAP or WSDL will have a compression component. But in that case it's OK, because both the client and server sides of WSDL that do the marshalling/unmarshalling will be provided for you by your tools (such as BEA WebLogic). Think about what CORBA IDL was like -- you just write the interface, and then both client and server stubs are automatically generated for you. In that case, it's perfectly reasonable to expect that some compression/decompression code could be written into the code automatically.
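    Here is that rough sketch of the tag-to-byte idea in Java, assuming both ends have agreed on the tag table up front (the tag names and codes are made up for illustration; a real version would generate the table from the DTD):

    import java.io.*;

    // Toy "DTD-derived" encoder: both ends share the tag table, so only a
    // one-byte tag code plus the length-prefixed text content crosses the wire.
    public class TagTokenizer {
        static final String[] TAGS = { "order", "item", "quantity", "price" }; // agreed in advance

        static void writeElement(DataOutputStream out, int tagCode, String text) throws IOException {
            out.writeByte(tagCode); // which element this is
            out.writeUTF(text);     // 2-byte length prefix + content
        }

        static String readElement(DataInputStream in) throws IOException {
            int tagCode = in.readUnsignedByte();
            String text = in.readUTF();
            return "<" + TAGS[tagCode] + ">" + text + "</" + TAGS[tagCode] + ">";
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeElement(out, 2, "3");     // <quantity>3</quantity> in 4 bytes instead of 22
            writeElement(out, 3, "19.95"); // <price>19.95</price>

            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            System.out.println(readElement(in));
            System.out.println(readElement(in));
        }
    }

    The same trick extends to attributes and nesting, but the moment someone who doesn't have your table wants the data, you're back to shipping either the table or plain XML -- which is the point above about limiting applicability.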

    • I do indeed control both sides: Basically, imagine it as a black hole on both sides; I want to stuff XML in one side and have it pop out the other, but unfortunately the connection in between is a limited pipe. While GZip does a nice job, other XML-specific algorithms offer dramatically more compression (and 2x the throughput, for example, is pretty worthwhile).

      Even having said that: thankfully, when they made HTTP they realized that GZip isn't the be-all and end-all, so if I implemented ErgoCompression, my custom clients that sent the ErgoCompression Accept-Encoding header value could take advantage of it, while others could use GZip or compress or deflate, or whatever their client supported. To be honest I'm very surprised to see some of the responses here on Slashdot, which amount to "GZip is the monopoly: Use it!", a sort of "GZip is good enough". Well, when it comes down to it, getting 2x or more XML pushed through that limited pipe seems pretty worthwhile to me, especially because, as you mentioned, I control both ends of the pipe.

      • I suppose a more apt metaphor would be that there's a wormhole on either side, rather than a black hole. Black-hole compression would be pretty good, though: 1MB in, 0 bytes out!

      • Don't worry, you've just run into a well-known Slashdot phenomenon: people reply not for the usefulness of their comments, but rather to whore for karma. Therefore, you get a lot of people who are talking completely out of their asses and are far out of their depth on the issue. Of course, I see that you have found a thoughtful reply that seemed to help you, so all is not lost, but by and large I would take anything posted on Slashdot with a grain of salt.
      • You could start with pre-loaded tables if your data is mostly static. You start out creating a subset of your data that includes the bits you know will be repeated, and you feed that to gzip and let it build a table. You then build a custom version of gzip that starts with that table instead of an empty one. The result is that your compression will be better and it will be much faster. Years ago I looked at this for use in Usenet news: if you compress headers and text separately, you get better compression than just treating it all as one stream.
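        The "pre-loaded table" idea maps directly onto zlib's preset-dictionary feature, which is reachable from Java as well; here's a minimal sketch (the dictionary string is a made-up sample of markup you expect to recur, and both ends must use the identical dictionary):

        import java.util.zip.*;

        public class PresetDictionaryExample {
            // Markup you expect to repeat in every document.
            static final byte[] DICT =
                "<order><item><sku></sku><quantity></quantity><price></price></item></order>".getBytes();

            public static void main(String[] args) throws Exception {
                byte[] input = "<order><item><sku>1234</sku><quantity>2</quantity></item></order>".getBytes();

                Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
                def.setDictionary(DICT); // prime the compressor before feeding it data
                def.setInput(input);
                def.finish();
                byte[] compressed = new byte[256];
                int clen = def.deflate(compressed);
                System.out.println(input.length + " bytes in, " + clen + " bytes out");

                Inflater inf = new Inflater();
                inf.setInput(compressed, 0, clen);
                byte[] output = new byte[256];
                int olen = inf.inflate(output);
                if (olen == 0 && inf.needsDictionary()) { // the stream asks for the dictionary
                    inf.setDictionary(DICT);
                    olen = inf.inflate(output);
                }
                System.out.println(new String(output, 0, olen));
            }
        }

        Since the dictionary never travels with the data, the stream is only readable by a receiver holding the same dictionary -- which is exactly the both-ends-under-your-control situation being discussed.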
  • Define "best," as in "best" option.

    Are you looking for the tightest possible compression, or are you looking for "good enough" compression that's fairly standard? Or do you want "good enough" compression that can be quickly implemented?

    If you're looking for tightest possible compression, that requires a good statistical model of your data and is far beyond the scope of any answer here. Depending on your data, a good encoder could require far less bandwidth than any generic compressor, but it's highly nonstandard.

    If you're looking for something that's "good enough" and standard, there's absolutely no doubt that you should use zlib (gzip) and call it a day.

    If you're willing to stray from the standards for better compression, then bzlib generally offers better compression than zlib.

    Bottom line: figure out what you really need, then pick the tool. Don't just grab the first thing that comes to mind, or a tool that others swear by but which doesn't meet your needs.
  • by fm6 ( 162816 ) on Friday January 18, 2002 @03:35PM (#2863907) Homepage Journal
    This question really needs more specifics. There's a good reason why you hear so little about XML compression: it's usually not worth the trouble. People assume such a verbose format is fundamentally inefficient, but when they get down to cases they find that the XML "pipe" is just not a bottleneck.

    Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.

    Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format [hypermart.net]. This is 100 megs of data, so Mat wanted to minimize his bandwidth costs. He found he got the best results with bzip2 [redhat.com], which reduced the file size to a mere 7.6 megs, or twice the compression of gzip.

    This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?

    Unless I did something wrong, the answer is no. A bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file. Perhaps I did something wrong, but it does make sense. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform [dogma.net] with old-fashioned Huffman encoding. And BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.

    Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.

  • I think that if you use something like WBXML you will keep some of the structure in the file (i.e., your parser will be simpler/faster). http://www.devx.com/xmlzone/articles/bs120101/Part1/bs120101p1-1.asp
    (Well, I never tried it, so maybe I should just shut up.)
  • There is some interesting work going on with XML/ASN.1 translation. Essentially, one can create a 1-1 mapping between an ASN.1 module and an XML schema, and use whatever encoding you're comfortable with (i.e., XML encoding, or ASN.1 [DBP]ER). The advantage here is that the encoding rules for ASN.1 (especially the PER (Packed Encoding Rules)) are very space efficient. Thus, by re-encoding the XML data as ASN.1, one can achieve remarkable levels of compression.

    This site [elibel.tm.fr] has some useful information on the proposed standardization of this effort. There was also a /. post [slashdot.org] about a recent EE Times article [eetimes.com] on the subject.
    • There is some interesting work going on with XML/ASN.1 translation

      No, there is not. ASN.1 is an utter crock. The original idea was good, but the implementation sucketh, especially the DER encoding.

      ASN.1 is not particularly efficient unless you use something like PER, which, when it is supported at all, requires a support library that comes on two CD-ROMs.

      The XML-in-ASN.1 proposal is simply an attempt by the die-hard OSI/EDI types to try to preserve the work they have done.

      The problem with XML is the overly complex schema notation; the problem with ASN.1 is the overly complex encoding. Put them together and you have a perfect platform for information-engineering consultants to suck an IT budget dry. If that is your objective, then go for XML in ASN.1.

  • by SuiteSisterMary ( 123932 ) <slebrunNO@SPAMgmail.com> on Friday January 18, 2002 @04:03PM (#2864104) Journal
    Replace any run of more than one space with a single space. You'll cut your average XML document down by a fair bit. :-) In other words, you can have it concise and efficient, or you can have it human-readable and pleasant to look at. :-)
    • Removing excess whitespace need not make it unreadable if you have a tool that can easily parse XML. jEdit [jedit.org], for instance, has nice XML support via a plugin (hell, what doesn't jEdit have a plugin for?). It will display a nice, collapsible tree for you; I'm pretty sure another plugin will format XML files nicely. I'm sure there are myriad other Free/free tools available to do the same.

      So anyway, I'd agree with removing whitespace, as long as your XML compresses significantly from this. If it's machine-generated, there's a decent chance whitespace is already minimized. If not, you may want to strip the whitespace and see if that improves performance adequately. If not, you could always pipe the output through a compression filter, too.
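      If you do go that route, here's a crude sketch of the stripping step (it only collapses whitespace that sits between two tags, and as the follow-up below notes, even that is only safe if your documents don't rely on mixed content or xml:space):

      public class StripInterTagWhitespace {
          public static void main(String[] args) {
              String xml = "<order>\n  <item>\n    <sku>1234</sku>\n  </item>\n</order>"; // pretty-printed input
              // Collapse whitespace-only text nodes: runs of whitespace between '>' and '<'.
              String stripped = xml.replaceAll(">\\s+<", "><");
              System.out.println(stripped); // <order><item><sku>1234</sku></item></order>
          }
      }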
    • Oh, this is not a good idea :)

      Every space is relevant in XML. It does not follow the same rules as HTML.

      <foo> bar </foo> does not equal <foo>bar</foo>

      • Well, XML requires processors to preserve spaces by default, but there is allowance in the spec for stripping whitespace-only text nodes (that is, spaces that are between two tags) for some/all content.
  • XMill [att.com] by AT&T is free (not quite GPL, but the source is there too), and it takes advantage of the redundancy in XML data so that it's super efficient. Here [att.com] are some comparisons to plain old gzip compression (it blows it away). It'd be horrible on random data, but it squishes XML like you wouldn't believe.
  • libXML supports a built-in compression method using zlib. It may not be the most efficient method of compression, but it will get the job done and it will work seamlessly with any app that uses libXML.

    I've compressed a 1.5MB XML document down to the tens of KB range. That was definitely good enough for me :)
  • Search google (Score:5, Insightful)

    by Hard_Code ( 49548 ) on Saturday January 19, 2002 @12:51PM (#2868509)
    really, how hard is it:

    http://www.google.com/search?q=xml+compression

    The very first thing that comes up is a project on SourceForge with an in-depth explanation of the algorithm.
    • Uh, thanks for the help. Do you follow up every Ask Slashdot with a lame "search Google"? Guess what, genius: I already did, and found most of the links insufficient, hence the open discussion about it.

      There isn't "an algorithm" that defines XML compression, and that's the whole point. Additionally, SourceForge most certainly is not the be-all and end-all (the fact that you jumped to that link and think it answers ANYTHING [it doesn't, and just as I said in my Ask Slashdot, most of that work was abandoned long ago, which of course is the status of most SourceForge projects] shows that you just wanted to jump in with your great cynic's wisdom). As has been shown in this discussion, there is an awful lot of ignorance out there in regards to XML compression, but thankfully a few people see through it and actually understand the domain specifics, and I have gotten help.

      In any case, the Slashdot posting filter should automatically filter out ridiculous "Duh, search Google!" posts.

  • So, you control both ends of the communication, and you can implement any compression scheme you want? Why not choose IIOP (the CORBA protocol)? I bet IIOP would give you the compression that you want because it is a very efficient protocol. It is also widely supported (there are many ORB vendors for various programming languages, and Java 2 Standard Edition even comes with a free one).
    • What do you mean, efficient? I know a bit about IIOP (and its generalized cousin, GIOP), and I've always thought that it was rather wasteful of bits. It requires extensive (and not very simple) padding and has to deal with little- vs. big-endian issues. For its purpose (primarily transmitting remote method invocation parameters and return values in a platform-independent way) it's probably as good as binary gets while retaining platform independence, but that doesn't make it any good at compressing. The padding in IIOP makes an IIOP package less predictable than most XML, usually leading to inferior compression performance (previous posters have already explained that a lot of communication hardware does compression transparently). Besides, IIOP does not easily transfer deep hierarchies, and most certainly does not carry its own documentation (as most XML does by way of the tagging) - you need access to the IDL file at both ends, whereas you can do lots of stuff with XML without having real access to its DTD or Schema. In fact, one may not even exist...


      There are no requirements in IIOP (or GIOP, for that matter) that packages be compressed. In fact, to ensure cross-ORB functionality as intended in CORBA 2.x+ (not sure of the version - I'm not in the office right now) you'd probably have to forego compression unless it is explicitly stated in the CORBA standard that it is mandatory to implement it (it might say that, but I don't recall reading about it).


      If you need to tunnel your requests over HTTP to circumvent firewall issues, I'd stick with the encodings supported by HTTP/1.1 (gzip and deflate) for nice and easy operation. If you need no tunneling, I'd go for XML-specific compression, and previous posters have already demonstrated that they know more about the specific products available for this purpose than I do, so I'll leave it at that.

      • I'm sorry I wasn't really using the correct terminology. I was talking about using Corba valuetypes to transmit the information. By using valuetypes you can transmit structured information across the wire. The metadata is contained in the IDL, which is a major space savings: the metadata exists on either side of the network and is never transmitted over the wire. In XML a large amount of information that is being transmitted is just metadata. I never noticed padding to be an issue with this kind of system.

        If you go with a "pure valuetypes" system, then you can easily change your system from one that passes datums around to one that is fully distributed.

        What I would like to know is why the OP insists on using XML in the first place if he isn't doing any kind of integration with legacy/third-party systems.
        • It is correct that XML contains its own metadata, which is (sort of) wasteful in terms of transmission size. But that metadata is highly predictable and therefore compresses very well. This is exactly the reason why XML-specific compression works so well (as other posters pointed out).


          You are right that valuetypes can (and indeed should be) used to transmit structures in an efficient manner. They do, however, still have to pad the structures to comply with the word boundary specified in IIOP. This padding is done automatically by the ORB to comply with the package standard specified in IIOP, but if you look at the specs you'll see a lot of padding with zeros to align at 4 (or is it 8?) byte boundaries. Transmitting lots of, say, 16 bit values this way is highly wasteful of resources. There are issues concerning arrays and sequences that remedy this, but these are not always preferable.


          In some ways, you can think of XML as a textual (and somewhat verbose) marshaling of parameters and return values. Look at XIOP [sourceforge.net] for a GIOP implementation that does exactly this.


          I agree that the reasons for using XML in this particular project do not seem clear, but (in my experience) many projects tend to use XML even when not entirely sure that it's the best choice right now, just to open their system to future integration with third-party systems in an easy way. Most projects have a tendency to grow after being deployed, and "future-proofing" your app is a valid argument for using XML (IMHO, anyway).

  • The binary representation of ASN.1 [elibel.tm.fr] is supposed to be quite good at representing XML.

    If you have a schema for your data, then it can simplify the data by making assumptions about the order and format of the data... at least that is my understanding... and of course it's no longer verbose and human-readable.
  • WBXML is by the WAP Forum [wapforum.org], but you'll need to register to get the specifications.

    Basically, it's a serialisation of XML using a tokenisation system - tags, attributes and even values become tokens - and it has extensions for unknown items.

    It also has extensions for string tables, where commonly used words or phrases can be given, with a lookup into the string table index used.

    Although it doesn't actually compress XML per se, it does do a fairly good job (unfortunately, I don't have any figures to hand).
