XML Compression Options?
ergo98 asks: "About a year ago I had the need to evaluate XML compression technologies (for a project where two machines had to communicate via XML documents, and there was an excess of CPU power and a dearth of bandwidth). At the time the best option seemed to be a research project called XMill; however, even then it seemed to be an abandoned project with no more updates and little market presence, and it was only available as source for a command-line utility that would require reworking into library form. I'm curious whether there's been any progress in the XML compression arena in the past year: If you have more CPU power than bandwidth, what is the best option for XML document compression? Have any XML-specific compression algorithms been made available as a module for Apache?"
GZIP (Score:4, Informative)
Re:GZIP (Score:1)
And there is mod_gzip [remotecommunications.com] for Apache, if you go the gzip route.
Re:GZIP (Score:1)
Re:GZIP (Score:5, Informative)
The problem with GZIP is quite simply that it doesn't take advantage of XML specifics (i.e. there are domain attributes of XML that make certain algorithms much more efficient than others): XMill, as an example, manages to achieve almost twice the compression of GZip with approximately the same CPU usage.
In this case adding in a module that would do XMill on the server side (of course obeying the Accept-Encoding that the HTTP client passes in, so a client that didn't handle the custom compression would not be thwarted) would allow us to use a custom HTTP client that could do the compression and achieve twice the throughput on the limited pipe. As I mentioned in the submission: We have more CPU power than bandwidth, so that 2x compression improvement is very significant (i.e. there is a world of low bandwidth vertical uses out there: satellite, frame relay, CDPD, etc). For those who cringe at the idea of using XML to begin with, please realize that with the proper compression XML encoded data is smaller than any proprietary packaging because of its high degree of predictability.
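That negotiation is easy to sketch. Here's a minimal Python version; the "x-xmill" token and the ENCODERS registry are invented for illustration (gzip is the only real, registered content-coding shown), so treat it as a shape, not an implementation:

```python
import gzip

# Hypothetical registry mapping Accept-Encoding tokens to compressors.
# A custom client could advertise an invented token like "x-xmill";
# stock clients only advertise standard codings such as gzip.
ENCODERS = {
    "gzip": gzip.compress,
    "identity": lambda data: data,
}

def negotiate_encoding(accept_encoding, body):
    """Pick the first encoding the client advertises that we support,
    falling back to an uncompressed (identity) response."""
    for token in accept_encoding.split(","):
        token = token.split(";")[0].strip().lower()  # drop any q= weight
        if token in ENCODERS:
            return token, ENCODERS[token](body)
    return "identity", body

encoding, payload = negotiate_encoding("x-xmill, gzip", b"<doc><a>1</a></doc>")
# "x-xmill" isn't registered here, so a stock gzip-capable client still
# gets a response it can decode; nobody is thwarted.
```

The point is exactly the one above: clients that don't speak the custom coding silently fall through to something standard.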
Re:GZIP (Score:2)
The problem with the gzip approach is that it works best on long files rather than short ones. The problem with XML is that SOAP messages that should take one or two packets end up taking five. Gzip tends to make 5 packets into 4, but it *will* make 50 packets into 25.
The other problem with GZIP is that the implementation is fairly computation intensive. It is not something that you can implement as a simple filter with almost no memory overhead.
GZIP is a great general purpose compression scheme for largeish quantities of data, but a scheme that is highly domain specific can usually do better.
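The short-message overhead is easy to demonstrate with Python's zlib (the sample messages are invented, and exact byte counts will vary):

```python
import zlib

short = b"<env><id>42</id><ok>true</ok></env>"  # a SOAP-sized message
bulk = b"<row><a>1</a><b>2</b></row>" * 2000    # a large, repetitive document

short_z = zlib.compress(short, 9)
bulk_z = zlib.compress(bulk, 9)
print("short:", len(short), "->", len(short_z))
print("bulk :", len(bulk), "->", len(bulk_z))
# The short message barely shrinks (stream header and checksum overhead
# dominate), while the large repetitive document collapses dramatically.
```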
The main mistake we made with the Web was not including a simple compression scheme optimized for HTML at an early stage. The Content-Encoding header was a good idea, but we should have shipped a lightweight compressor as well.
The approach in XMill is a good one, but I am not too happy with the table-switching codes, and I think an extra 25% compression can be wrung out at low cost. The big problem with XMill is that it was never progressed as a standard anywhere. I would give more details on the scheme, but there has been a recent rash of patent trolls filing patents on ideas they find on mailing lists. I call that type of behavior fraud; the courts call it a cause of action that will cost me $2 million to defend against. Congress calls this situation an opportunity to collect campaign bribes.
XML Compression is something that I am working on, but I am already the point person in two major XML working groups and this is not a high priority. I really need an intern to crank through measuring efficiency on various document sets.
ummm... (Score:1, Redundant)
don't compress the XML (Score:3, Interesting)
XML Compression... (Score:3, Informative)
Re:XML Compression... (Score:1)
This [cornell.edu] was the first result Google gave me for the search "xml compression", and it took approximately 6 seconds.
ictcompress was number 7.
Re:XML Compression... (Score:3, Interesting)
This is exactly the sort of information I'm looking for: Thank you. I've downloaded their demo and am going to try it on some sample data to see how it performs (of course comparing it against the ubiquitous GZip).
gzip? (Score:4, Insightful)
I think, although I can't find it for sure, that LWP supports gzip content-encoding too, which would mean that things like SOAP::Lite and XML-RPC would benefit too.
more about the content-encoding thing [webreference.com]
mod_gzip (Score:2)
dot gz (Score:1, Redundant)
Do you control both sides? (Score:4, Insightful)
The problem of course is that if you control both the producer and consumer you're greatly limiting the applicability of XML in the first place. Just yesterday I explained to my boss that one of the advantages of XML is for cases when you have 10 people who want your data, but you can't dictate what software they use... AND, 6 months from now, 10 other people whom you haven't even met yet are going to want your data too. If you're in that boat, and you create any sort of custom compression scheme, then you're in trouble. If you're not, then you may not need XML at all (at least, not for moving your data around).
Perhaps you're hoping that there will be some compression module that becomes a standard part of XML, so that you can safely say "Anybody who is able to parse my XML message would also be able to decompress it"? Good luck. Even if that did happen, it would take ages for all of the parsers out there to get up to date.
What you'll probably find is that something like SOAP or WSDL will have a compression component. But in that case it's ok, because both the client and server sides of WSDL that do the marshalling/unmarshalling will be provided for you by your tools (such as BEA WebLogic). Think about what CORBA IDL was like -- you just write the interface, and then both client and server stubs are automatically generated for you. In that case, it's perfectly reasonable to expect that some compression/decompression code could be written in to the code automatically.
Re:Do you control both sides? (Score:2)
I do indeed control both sides: Basically imagine it as a blackhole on both sides and I want to stuff XML in one side and have it pop out the other, but unfortunately the connection in between is a limited pipe. While GZip does a nice job, other XML specific algorithms offer dramatically more compression (and 2x the throughput, for example, is pretty worthwhile).
Even having said that: Thankfully when they made HTTP they realized that GZip isn't the be-all and end-all, so if I implemented ErgoCompression, my custom clients that sent the ErgoCompression Accept-Encoding header value could take advantage of it, while others could use GZip or compress or deflate, or whatever their client supported. To be honest I'm very surprised at some of the responses I've seen here on Slashdot, which amount to "GZip is the monopoly: Use it!", a sort of "GZip is good enough." Well, when it comes down to it, getting 2x or more XML pushed through that limited pipe seems pretty worthwhile to me, especially because, as you mentioned, I control both ends of the pipe.
Re:Do you control both sides? (Score:1)
I suppose a more apt metaphor would be that there's a wormhole on either side, rather than a blackhole. Blackhole compression would be pretty good though : 1MB in, 0 bytes out!
Re:Do you control both sides? (Score:1)
Re:Do you control both sides? (Score:1)
"best" option? (Score:2)
Are you looking for the tightest possible compression, or are you looking for "good enough" compression that's fairly standard? Or do you want "good enough" compression that can be quickly implemented?
If you're looking for tightest possible compression, that requires a good statistical model of your data and is far beyond the scope of any answer here. Depending on your data, a good encoder could require far less bandwidth than any generic compressor, but it's highly nonstandard.
If you're looking for something that's "good enough" and standard, there's absolutely no doubt that you should use zlib (gzip) and call it a day.
If you're willing to stray from the standards for better compression, then bzlib generally offers better compression than zlib.
Bottom line: figure out what you really need, then pick the tool. Don't just grab the first thing that comes to mind, or a tool that others swear by but which doesn't meet your needs.
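To get a rough feel for the zlib-vs-bzip2 trade-off, here's a quick Python comparison on made-up repetitive XML (real documents will compress differently, so measure on your own data):

```python
import bz2
import zlib

# A repetitive XML-ish sample; your mileage will vary with real data.
doc = b"<item><name>widget</name><price>9.99</price></item>\n" * 5000

gz = zlib.compress(doc, 9)
bz = bz2.compress(doc, 9)
print("original:", len(doc))
print("zlib    :", len(gz))
print("bz2     :", len(bz))
# On large, repetitive input bzip2 usually wins on size; zlib is faster
# and universally supported, which is why it remains the default choice.
```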
Specifics, Algorithms (Score:5, Interesting)
Of course, there are exceptions. Obviously it doesn't take much to saturate a modem connection. But modern modems have data compression built in, so there's not a lot to be gained by compressing the data beforehand.
Another example is one I had to deal with recently. Mat Ballard started distributing Kylix help files in HTML format [hypermart.net]. This is 100 meg of data, so Mat was concerned with minimizing his bandwidth costs. He found he got the best result with bzip2 [redhat.com], which reduced his file size to a mere 7.6 meg, or twice the compression of gzip.
This caught my interest. I'd like to distribute a similar collection of documentation. But my app needs to be able to read individual files on the fly. Would bzip2 compression work equally well applied to small blocks of data?
Unless I did something wrong, the answer is no: a bzip-in-tar file doesn't seem to come out any smaller than the equivalent zip file. That does make sense, though. Bzip2 gains its superior compression by combining the Burrows-Wheeler transform [dogma.net] with old-fashioned Huffman encoding, and BW transforms are drastically more effective on big data sets. So I might as well stick with Zip format. Oh well.
Bottom line: no magic bullet for minimizing bandwidth costs through compression. You need to analyze your specific application and find out what's most effective -- and whether it's worth the trouble.
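The small-blocks effect described above is easy to reproduce with Python's bz2 module (the sample "documentation set" is invented):

```python
import bz2

# Simulate a documentation set: many small, similar files (invented data).
pages = [b"<html><body><h1>Page %d</h1><p>shared boilerplate</p></body></html>" % i
         for i in range(200)]

separate = sum(len(bz2.compress(p)) for p in pages)  # each file compressed alone
together = len(bz2.compress(b"".join(pages)))        # one big stream
print("per-file total:", separate, " whole-set:", together)
# Compressed one file at a time, every tiny file pays the full stream
# overhead and the Burrows-Wheeler transform never sees the cross-file
# redundancy; one big stream comes out far smaller. That size penalty is
# the price of on-the-fly access to individual files.
```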
Re:Specifics, Algorithms (Score:1)
Re:Specifics, Algorithms (Score:2)
gzip will work, but... (Score:1)
(Well, I never tried it, so maybe I should just shut up.)
XML/ASN.1 Translation (Score:1)
This site [elibel.tm.fr] has some useful information on the proposed standardization of this effort. There was also a
Re:XML/ASN.1 Translation (Score:2)
No there is not. ASN.1 is an utter crock. The original idea was good, the implementation sucketh, especially the DER encoding.
ASN.1 is not particularly efficient unless you use something like PER, which, if it is supported at all, requires a support library that comes on 2 CDROMs.
The XML in ASN.1 proposal is simply an attempt by the die hard OSI/EDI types to try to preserve the work they have done.
The problem with XML is the over complex schema notation, the problem with ASN.1 is the over complex encoding. So put them together and you have a perfect platform for information engineering consultants to suck an IT budget dry. If that is your objective then go for XML in ASN.1.
take out the spaces (Score:3, Insightful)
Re:take out the spaces (Score:1)
So anyway, I'd agree with removing whitespace, as long as your XML actually shrinks significantly from it. If it's machine-generated, there's a decent chance the whitespace is already minimized. If not, strip the whitespace and see whether that improves performance adequately; if it doesn't, you could always pipe the output through a compression filter too.
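A minimal whitespace-stripping sketch with Python's ElementTree. Note the assumption baked in here: whitespace-only text nodes are treated as formatting, which is NOT safe for mixed-content documents (see the comments below about significant whitespace):

```python
import xml.etree.ElementTree as ET

pretty = """<catalog>
    <item>
        <name>widget</name>
    </item>
</catalog>"""

root = ET.fromstring(pretty)
# Drop indentation-only text and tail nodes. Assumption: text that is
# pure whitespace is formatting, which does not hold for mixed content.
for elem in root.iter():
    if elem.text and not elem.text.strip():
        elem.text = None
    if elem.tail and not elem.tail.strip():
        elem.tail = None

compact = ET.tostring(root)
print(compact)  # b'<catalog><item><name>widget</name></item></catalog>'
```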
Re:take out the spaces (Score:2)
Re:take out the spaces (Score:1)
Every space is relevant in XML; it does not follow the same rules as HTML. Two documents that differ only in whitespace are not equal.
Re:take out the spaces (Score:1)
XMill is designed specifically for XML (Score:1)
Just use the built in compression methods. (Score:2)
I've compressed a 1.5MB XML document down to the tens-of-KB range. That was definitely good enough for me.
Search google (Score:5, Insightful)
http://www.google.com/search?q=xml+compression
The very first thing that comes up is a project on SourceForge with an in-depth explanation of the algorithm.
Re:Search google (Score:1)
Uh, thanks for the help. Do you follow up every Ask Slashdot with a lame "search Google"? Guess what, genius: I already did, and found most of the links insufficient, hence the open discussion about it.
There isn't "an algorithm" that defines XML compression, and that's the whole point. Additionally, SourceForge most certainly is not the be-all and end-all (i.e. the fact that you jumped to that link and think it answers ANYTHING [it doesn't, and just as I said in my Ask Slashdot, most of it was abandoned long ago. Of course that's the status of most SourceForge projects] shows that you just wanted to jump in with your great cynic's wisdom). As has been shown in this discussion: There is an awful lot of ignorance out there regarding XML compression, but thankfully a few people see through it and actually understand the domain specifics, and I have gotten help.
In any case, the Slashdot posting filter should automatically filter ridiculous "Duh, search Google!" posts.
IIOP: The ultimate in XML message compression (Score:1)
Re:IIOP: The ultimate in XML message compression (Score:1)
What do you mean, efficient? I know a bit about IIOP (and its generalized cousin, GIOP), and I've always thought that it was sort of wasteful on bits. It requires extensive (and not very simple) padding and deals with little- vs. big-endian issues. For its purpose (primarily transmitting remote method invocation parameters and return values in a platform-independent way) it's probably as good as binary gets while retaining platform independence, but that doesn't make it any good at compressing. The padding in IIOP makes an IIOP package less predictable than most XML, usually leading to inferior compression performance (previous posters have already explained that a lot of communication hardware does the compression transparently). Besides, IIOP does not easily transfer deep hierarchies, and most certainly does not carry its own documentation (as most XML does by way of the tagging) - you need access to the IDL file at both ends, whereas you can do lots of stuff with XML without having real access to its DTD or Schema. In fact, one may not even exist...
There are no requirements in IIOP (or GIOP, for that matter) that packages be compressed. In fact, to ensure cross-ORB functionality as intended in CORBA 2.x+ (not sure of the version - I'm not in the office right now) you'd probably have to forego compression unless it is explicitly stated in the CORBA standard that it is mandatory to implement it (it might say that, but I don't recall reading about it).
If you need to tunnel your request over HTTP to circumvent firewall issues, I'd stick with the encodings supported by HTTP/1.1 (gzip and deflate) for nice and easy operation. If you need no tunneling I'd go for XML-specific compression, and previous posters have already demonstrated that they know more about the specific products available for this purpose than I do, so I'll leave it at that.
Re:IIOP: The ultimate in XML message compression (Score:1)
If you go with a "pure valuetypes" system then you can easily change your system from one that passes datums around to one that is fully distributed.
What I would like to know is why the OP insists on using XML in the first place if he isn't doing any kind of integration with legacy/third-party systems.
Re:IIOP: The ultimate in XML message compression (Score:1)
It is correct that XML carries its own metadata, which is (sort of) wasteful in terms of transmission size. But that metadata is highly predictable and therefore compresses very well. This is exactly why XML-specific compression works so well (as other posters pointed out).
You are right that valuetypes can (and indeed should be) used to transmit structures in an efficient manner. They do, however, still have to pad the structures to comply with the word boundary specified in IIOP. This padding is done automatically by the ORB to comply with the package standard specified in IIOP, but if you look at the specs you'll see a lot of padding with zeros to align at 4 (or is it 8?) byte boundaries. Transmitting lots of, say, 16 bit values this way is highly wasteful of resources. There are issues concerning arrays and sequences that remedy this, but these are not always preferable.
In some ways, you can think of XML as a textual (and somewhat verbose) marshaling of parameters and return values. Look at XIOP [sourceforge.net] for a GIOP implementation that does exactly this.
I agree that the reasons for using XML in this particular project do not seem clear, but (in my experience) many projects tend to use XML even when not entirely sure that it's the best choice right now, just to open their system to future integration with third-party systems in an easy way. Most projects have a tendency to grow after being deployed, and "future-proofing" your app is a valid argument for using XML (IMHO, anyway).
ASN.1 (Score:1)
If you have a schema for your data, then it can simplify the data by making assumptions about the order and format of the data... at least that is my understanding... and of course it's no longer verbose and human-readable...
You could look at the WBXML standard... (Score:1)
Basically, it's a serialisation of XML using a tokenisation system - tags, attributes and even values become tokens - and it has extensions for unknown items.
It also has extensions for string tables, where commonly used words or phrases can be given, with a lookup into the string table index used.
Although it doesn't actually compress XML per se, it does do a fairly good job (unfortunately, I don't have any figures to hand).
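A toy sketch of the WBXML idea in Python; the token values and tag set here are invented for illustration (real WBXML assigns tokens per document type in its code pages, and also supports attribute tokens and string tables):

```python
# Invented token assignments -- real WBXML defines these per code page.
TAG_TOKENS = {"msg": 0x05, "to": 0x06, "from": 0x07, "body": 0x08}
STR_I = 0x03  # inline-string token (null-terminated string follows)
END = 0x01    # close-tag token

def encode(events):
    """events: a list of ('start', tag), ('text', s), or ('end',) tuples."""
    out = bytearray()
    for ev in events:
        if ev[0] == "start":
            out.append(TAG_TOKENS[ev[1]])   # one byte replaces the tag name
        elif ev[0] == "text":
            out.append(STR_I)
            out += ev[1].encode() + b"\x00"  # null-terminated, WBXML-style
        else:
            out.append(END)
    return bytes(out)

binary = encode([("start", "msg"), ("start", "to"),
                 ("text", "alice"), ("end",), ("end",)])
print(len(binary), "bytes vs", len(b"<msg><to>alice</to></msg>"), "as text")
# Tag names collapse to single bytes, so the binary form is well under
# half the size of the textual markup even before any real compression.
```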