Apache Software

Gzip Encoding of Web Pages?

Both Brendan Quinn and msim were curious about the ability to send gzip-encoded Web pages. Brendan asks: "It's possible to make Apache detect the "Accept-encoding: gzip" field sent by NS 4.7+, IE 4+ and Lynx, and send a gzip-encoded page, thus saving lots of bandwidth all over the place. So why don't people do it? Here is a module written by the Mozilla guys a couple of years ago that -almost- does what I want, and I could change it pretty easily... but I thought someone else would have done it by now? eXcite do it; does anyone know of any other large-scale sites that use gzip encoding?"

"If you have LWP installed, you can check with:

GET -p '<my proxy>' -H 'Accept-encoding: gzip' -e http://www.site.com/ | less

Try that with 'www.excite.com' and you'll get binary (gzipped) data. That's what I want to do."
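
If you would rather script the check than use the GET command-line tool, the same request is only a few lines with LWP::UserAgent (a quick sketch; www.excite.com is just the example site from the question):

use LWP::UserAgent;
use HTTP::Request;

# Send the request with "Accept-encoding: gzip" and report what comes back.
my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://www.excite.com/');
$req->header('Accept-Encoding' => 'gzip');

my $res = $ua->request($req);
print "Status: ", $res->status_line, "\n";
print "Content-Encoding: ", ($res->header('Content-Encoding') || 'none'), "\n";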

This discussion has been archived. No new comments can be posted.

  • bleh, dead web site. Try axkit.org.
  • Why bother compressing data? Face it, 99% of all web pages out there consist of the following:

    • Text - the easiest to compress, but for most sites it's also the quickest element to load...
    • Graphics - already in compressed (.GIF, .JPG, .PNG) format, so gzip won't (shouldn't be able to) compress them any further - and these are usually the bulk of most page downloads...
    There is also presumably some level of traffic you would need before the bandwidth saving compensates for the extra load placed on the server by gzipping everything dynamically.

    Of course, for high-text, heavy traffic sites (for example, right here on /.), this may make some sense. But for the majority of sites, it doesn't seem to make sense to me.

    On the other hand, I might just be a grumpy old man who can't understand all these new-fangled things... :=]
    ________________________

  • This could actually be used to get around content-based 'net access filters. You could use this method of requesting compressed text to thwart any keyword-scanning filter. Of course, if this became very popular, it would only be a matter of time before the filtering programs added the capability to decode gzip or whatever other compression people were using.
  • But you can do that pretty easily with mod_rewrite and PHP.

    Have the PHP script write an HTML and a gzipped copy of its output whenever it is called (there's a bunch of ob_* functions in PHP that can help you do that). Then use mod_rewrite to have the server serve:
    the gzipped copy if available and supported by the client
    the HTML copy if available and gzip not supported
    the PHP script if neither file is available

    You can refresh the content by deleting the HTML and gzipped copies... that way you get minimal load on the server and a bandwidth-friendly site.
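
    The mod_rewrite half looks roughly like this (a sketch from memory, so check the directive syntax against the mod_rewrite docs before trusting it):

    RewriteEngine On
    # Client accepts gzip and a pre-built .gz copy exists: serve the .gz
    RewriteCond %{HTTP:Accept-Encoding} gzip
    RewriteCond %{REQUEST_FILENAME}.gz -f
    RewriteRule ^(.+)$ $1.gz [L]
    # Otherwise serve the pre-built .html copy if it exists
    RewriteCond %{REQUEST_FILENAME}.html -f
    RewriteRule ^(.+)$ $1.html [L]
    # Anything left falls through to the PHP script, which rebuilds both copies

    # Label the .gz copies so browsers know to gunzip them
    AddEncoding x-gzip .gz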
  • Of course, with rapidly updating pages like current comment pages, this would not be practical. But what if all "old" comment pages were pre-compressed when they were mothballed?

    Of course, I could be incorrect about Slashdot archives, in which case I will look like a complete idiot.
  • Would you dynamically gzip the whole page before sending it? On some sites the content goes on for many (screen) pages, and the browser lays out what it has and renders more as more arrives. Think about the ramifications of waiting for the entire page to be gzip'd and then sent, and then you still have to unzip it...

    What about parts and pieces??

    Here is a tar command I use to move files around from system to system occasionally:

    tar -cf - ./$1 | rsh destination "cd /export/home1; tar -xBpf -"

    it goes in chunks - not the whole thing - maybe you should think about incorporating this type of duck movement...
  • Doesn't the typical modem connection try to compress stuff anyway? I regularly get 12 kbyte/s download speeds over a 56kbps modem when downloading uncompressed text.

  • Why not just cache the zipped files? Seems like you would save space, bandwidth, and CPU clocks. Especially for non-dynamic, large pages (such as slashdot's archived sites).
  • If Netscape 4.7+ and IE 4+ claim that they can accept gzipped data, they had better know how to handle it.
    So IE's claims are true? ;)

    I was thinking more along the lines that NS for *nix might be able to handle it, but the Win version might not. I just went from a T1 in my college dorm to a 56k on my dad's computer at home, and anything to speed up the downloads would help keep the hair on my head. So my original question stands:

    Does anyone know if gzip downloads are even theoretically possible for a Windows machine?
    This is something I don't know how to test, and I don't know where to start an intelligent search, so if anyone has a good place for me to start looking, I would be grateful. Thanks.

    BTW, my criteria for a new place to live just grew to include DSL/cable modem access. How do people live on 56k?

    Louis Wu

    "Where do you want to go ...

  • Since I'm stuck with Windows for a couple more months, I'm wondering if this will work on Netscape 4.7+ for Windows. Or even IE 4+ for Windows. Does Opera do this?

    Does this trick need gzip installed already, or is it included in the huge download of NS?

    Louis Wu

    "Where do you want to go ...

  • I realised a long long time ago that I could save space on my Linux box's hard drive by going into the html documentation directories and doing a gzip -9 `find . -name "*.html"` .

    Since I was opening these files through the file system, not via http, Netscape had no problem whatsoever opening and displaying them.

    I just tried this using Netscape on an SGI with http, ( like this http://server/path/page.html.gz ) and it still works... I seem to remember that when I tried this at home with Linux, it didn't work...

    I'm running a server, dishing up static HTML batch generated from source files once per month. The saving can be enormous... two HTML files of 25kB and 13kB were reduced to just 2kB each! Admittedly, the whole body of files only takes up 100MB, so I'm not going to run out of space anytime soon...

    Now surely the server would fetch a small file off the disc faster than it could fetch a bigger file. And since I'm not compressing these files on the fly, there's no overhead on the server side. The LAN should get some benefit, too, since there is less data being whizzed around. There's going to be some overhead on the client side, as Netscape needs to gunzip the data at some point...

    However, I was under the impression that analog modems already had some dedicated data compression hardware... so if you have a server grabbing gzipped data off its discs, pushing that out to an analog modem, then the hardware of the modem won't be able to compress it (much) further in any case. And if your server is generating the HTML on the fly, maybe it would be better to just push uncompressed data to the modem, and let the hardware compression take care of things.
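
    In case anyone wants to try the same .html.gz trick, the Apache side is just an encoding mapping (from memory; many default configs ship with it already, which might be why it worked on one of my boxes and not the other):

    # ".gz" marks the content-encoding; the ".html" part of "page.html.gz"
    # still supplies the text/html content type, so capable browsers
    # gunzip the body and render it as usual.
    AddEncoding x-gzip .gz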

    To err is human, but to really foul things up you need a machine.

  • I don't know about IE5 but in IE4 the HTTP/1.1 on/off switch wasn't connected to anything.
  • Finally someone who knows what they're talking about.

    Will you have my babies?

  • I thought we had something, you and I, but was I ever wrong.
  • Well if you're not going to take my babies at least fix that CV.rtf link on your website.
  • GET / HTTP/1.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*

    Slightly off topic, but interesting: ever notice that it claims to accept these whether or not the applications to handle them are actually installed? It's probably the same with gzip.

  • >>If Netscape 4.7+ and IE 4+ claim that they can accept gzipped data, they had better know how to handle it.

    Ah... but that's the rub. IE *says* it accepts gzip, but it can't handle it. And with their market share, that means that the % of hits that can accept gzip is probably not big enough to make it worthwhile.

  • This could make it more difficult for prying eyes (e.g. ISPs, CIA, CSIS, MI6..) to search passing packets for keywords. It wouldn't be secure, but you'd need to look at the entire application-layer packet to know it's gzipped, and have enough contextual information to decompress it properly.
  • Would this really make much of a difference for web pages? It's stuff like images, sound files, MP3s, pr0n jpegs, etc that make up the bulk of web transfers. HTML text files, gzipped or not, make up such a tiny fraction of web traffic that I don't see how it'd matter if they were zipped or not. Perhaps that's the reason nobody uses the gzip module?

    - A.P.

    --
    * CmdrTaco is an idiot.

  • The CPU usage on the server scares a lot of people away from this - it's not a big deal for static content (zip once and cache), but for dynamic sites (say, /.) gzipping 5-700K of text each time would kill a loaded server pretty quickly...
    If you have enough bandwidth to waste on putting 80% redundant (compressible) data over it, but you don't have enough computing power to run gzip on that data, your resource allocation is seriously messed up. Fast computers to gzip the data are a lot cheaper than fat pipes to send it.

    The only place where it might not make sense is in an academic environment where (for artificial reasons) the bandwidth is very cheap, and the servers might still be overwhelmed.
    --

  • I'm not talking about outgoing bandwidth!!!

    I'm talking about people on slow links USING your web site. People with modems. I don't care how fast your pipe is outgoing - these people on slow modems can effectively crush your site, shocking as it may sound, because they end up spawning more httpd's, eventually either forcing you to your httpd limit (if you've taken the time to set it sensibly), or forcing your server into swap. And you don't want that to happen.

    Please go and read some real quality information on people who have worked with these high end solutions before thinking about replying again. Such as the mod_perl guide, at http://perl.apache.org/guide/
  • Will you have my babies?

    Not if you don't want them! ;-)

  • This is a bit of a plug, but I found a really big win for the server side (not the client side) when I added this feature to AxKit (link in .sig). I'm behind a 64Kb line, and some of the AxKit pages are pure documentation. This feature reduced the outgoing page size by about 80% for many pages, which seriously helps me deliver more content to my users. And the gzipped content is cached, so it's just as fast as the non-gzipped content when using cacheable pages.

    Yes, it's not much help for images, but then you just shouldn't enable this for images.

    Apache::GzipChain can also provide this option for people working with static pages on mod_perl enabled servers, but it has a serious memory leak in it that I found last week (and posted details of to the mod_perl mailing list).
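
    For reference, wiring GzipChain up is only a few lines of httpd.conf (quoting roughly from memory, so check the mod_perl guide for the exact handler names before copying it):

    # Compress static .html responses on the fly via the OutputChain mechanism
    <Files *.html>
        SetHandler perl-script
        PerlHandler Apache::OutputChain Apache::GzipChain Apache::PassFile
    </Files>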
  • It all depends on your architecture. Sure, if you have a caching proxy front end it might not be worth it. But if you don't, and have a slow client connecting (say a 56K modem), the time taken to gzip a 700K file (assuming this is mostly text) vs. the time it takes to actually download that file makes the benefit definitely worthwhile.

    People easily forget that, and assume that their bandwidth is big enough that the file will just instantly disappear down the pipe. Your server will get overloaded an awful lot quicker if every httpd is waiting on a slow client to download 700K when they could be downloading 100K.
  • Whenever I try to open a file that's been gzipped, Netscape (4.75 on Linux) automatically prompts me with a file dialog box. This happens even if I'm reading it straight from the file system. Thanks

  • S: Here is foo.html.your.pak

    You just made it so that pages can't incrementally load any more. The browser would have to wait until the whole .pak was downloaded before it could start laying out the page.

  • I like to think that I know a bit about the practical side of compression, so I'll jump in here.

    Yes, there are many places along the transmission lines where compression is attempted, but like the standard setting in most disk compression packages it's a little simple and typically does the worst job of compression in the system. Since compression in a modem is handled independently of any CPU, if you can do better somewhere else then it doesn't really matter if the modem's efforts are wasted.

    In addition, people have been saying it isn't worth compressing .gif or .jpg files. While that's typically true with .gif files, .jpgs can usually have 10-15% of their bulk squeezed out even with the humble zip program.

    I'm a huge fan of compression and I strongly believe that transmission of compressed HTML files will have a major positive impact on the 'Net. Don't just think of the lower serving overhead on the servers, think of all the (caching) proxies and other routers and gateways. HTML files seriously lose 80% of their bulk when compressed.

    But we need to go further. We need to start bringing in a new highly compressed image format now so it's in popular use before 2005. There are a couple of nice fractal formats around that result in smaller files than the equivalent zipped .jpg -- we need to get at least one into the standard installation of the next IE or NS.

  • Actually, you can display the files in the order they're packed; you just can't download them in parallel, so some of the multilink systems might be disadvantaged...
  • Compressing one file at a time, without reference to data that has gone before, can only do so well. There needs to be some way to quickly determine which files of an impending page the client doesn't have, then package them up into a single compressed wad. Obviously, the gains would need to exceed the negotiation overhead in terms of both time and size, but I believe that's the next step after every individual file is compressed.

    Something like:

    • Client: I want http://blah.com/foo.html
    • Server: That has files; foo.html, foopic1.gif, foopic2.jpg/foopic2.fractal, fooflash & adiframe10111.html
    • C: I have adiframe10111.html and I support .fractal
    • S: Here is foo.html.your.pak
    Make any sense?
  • Doesn't Keep-Alive in HTTP/1.1 take care of the problem of sending multiple resources for one page?

    Though I definitely agree with you about the whole multiple-version of a single resource thing (foopic2.jpg/foopic2.fractal)

  • Actually, I built GZIP compression into the core product at the company I'm working for (a web application) about a year ago. All HTML content coming out of our application passes through a layer which examines the browser and compresses it. The programmers never need to think about it. All the compression is done in real time, though, so there is a minute CPU overhead associated with it. We average about 4% extra CPU time because of GZIP. However, we've been averaging about 75% compression of our HTML. That -triples- the speed of page loads on modems. It's really noticeable when I'm doing work from home. GZIP is a stream compression, so if the page load stalls halfway through, whatever has arrived still renders perfectly fine.

    GZIP compression is supported in NS4.5 and higher, IE4.01 and higher, and all versions of Mozilla. We have, in the past year, never had a reported problem with the GZIP compression. There are some known bugs if you try to compress MIME types other than HTML.

    On a side note, in probably about a month or so, I will be releasing as open source a Java servlet web application framework. Included, among other goodies, is a layer which can automatically do GZIP encoding if the browser supports it. So anybody writing a web application using this automatically gets the benefits. Eventually coming to http://www.projectapollo.org
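
    Our layer is Java, but the basic idea fits in a handful of lines of Perl CGI if anyone wants to play with it (a sketch only, assuming Compress::Zlib is installed; it is not our production code):

    #!/usr/bin/perl -w
    use strict;
    use Compress::Zlib;

    # Some page body; in real life this comes out of the application.
    my $html = "<html><body>" . ("Hello, world. " x 500) . "</body></html>";

    if (($ENV{HTTP_ACCEPT_ENCODING} || '') =~ /\bgzip\b/) {
        # Browser advertised gzip support: compress and label the response.
        my $gz = Compress::Zlib::memGzip($html);
        print "Content-Type: text/html\r\n";
        print "Content-Encoding: gzip\r\n";
        print "Content-Length: ", length($gz), "\r\n\r\n";
        binmode STDOUT;
        print $gz;
    } else {
        # Fall back to plain HTML for everyone else.
        print "Content-Type: text/html\r\n\r\n";
        print $html;
    }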
  • >I'm not talking about outgoing bandwidth!!!
    I wasn't either...
    ...that was just sort of an extra thought I tacked on at the end, the rest of it wasn't directed in that fashion...

    I agree with your points here. As I had said, my previous posts were coming from a viewpoint where everyone had fairly high bandwidth, especially considering the increasing availability of DSL/cable modems. I know there are a lot of lower-bandwidth links out there, and from a time perspective, they spend vastly more time connected to each httpd. If you can afford the hardware to throw at it for gzipping and large dynamic generation, that's fine. I've found that you fill an (even large) pipe faster than you run out of CPU time on a fairly powerful system (which agrees with your assessments more than mine). I was trying to provide a different viewpoint (since almost none of the people who use my site are on anything slower than 256k DSL, and it runs with a small amount of memory/CPU reserve). If you have a quad Xeon with a couple of gigs of memory, or an S80, then by all means go right ahead - it can and will save the slow people time. I haven't run anything to the scale that would need any of this (mostly since I have 80% static content), and my end-user demographic is much more bandwidth-enabled than the typical cross-section.

    >Please go and read some real quality information on people who have worked with these high end solutions before thinking about replying again.

    Thanks for the kind comment... I have read the mod_perl guide... relax a little, will ya?

    --
  • Yeah, but I'm biased 8^) I spent 4 years on the campus LAN (with a few T3s), cable modem in the last year, and I'll probably be switching to DSL soon... Bandwidth spoiled... There are always tradeoffs, and yes, if you are going to be running with a slow endpoint, there can be savings, but I still think that the overhead of gzipping all the files (from memory to CPU time) outweighs another httpd that is waiting for the client (since it is now waiting for the gzip before it waits for the client). Of course, if your server is behind a slow pipe, and you have static pages, it will save a bundle.

    I'll have to see if I can get one of those modules, and give something a shot with webbench.

    --
  • The CPU usage on the server scares a lot of people away from this - it's not a big deal for static content (zip once and cache), but for dynamic sites (say, /.) gzipping 5-700K of text each time would kill a loaded server pretty quickly...
    --
  • >Of course, for high-text, heavy traffic sites (for example, right here on /.), this may make some sense.

    Ah, but (like I mentioned in another comment) when you have a page that is say 500k of text (a hundred or so comments), dynamically generated for each hit, the overhead of compression is rather dangerous, and if a server is already somewhat near capacity, it could slow it dramatically... if you can't cache it, and have high traffic, it's a big problem.

    [Insert your own joke about Jon Katz wasting even more time with compression]
    --
  • Actually, the provisions needed to render those are there. For owners of 95(c), 98, 98SE, Millennium and W2K, the needed .dlls come with the OS. For MacOS, 95(a), and (b), they were supplied when you installed Internet Explorer 4+.

    Also, IE4+ does work correctly with gzipped pages.
  • I know for a fact that Netscape 4.75 can handle gzip-compressed data.

    I set up a program to listen on port 80 and told NS to browse to localhost. It sent the "Accept-encoding: gzip". I then telnetted to www.excite.com:80 and sent that data. I got gzipped data in return. I then browsed the site using Netscape, and it loaded properly; therefore, Netscape 4.75 can handle gzipped downloads.

    I then tricked IE 5.5 into sending the same HTTP request: I connected to a proxy (127.0.0.1) which would transparently forward to excite.com, filtered out IE's HTTP request, and pasted in Netscape's; the page also loaded properly.

    So yes, gzip downloads work fine under Windows systems using Netscape 4.75 or IE5.5 (not sure about older versions, though), though IE5.5 sends an odd "Accept-encoding: gzip, deflate" which results in some sites not compressing it at all.
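
    (If anyone wants to repeat the experiment, the "program listening on port 80" part is only a few lines of Perl; this is an illustrative sketch, not the exact program I used:)

    # Dump whatever request headers a browser sends to localhost:80
    # (needs root on Unix to bind to port 80)
    use IO::Socket::INET;
    my $server = IO::Socket::INET->new(LocalPort => 80, Listen => 1, Reuse => 1)
        or die "can't listen on port 80: $!";
    my $client = $server->accept;
    while (defined(my $line = <$client>)) {
        last if $line =~ /^\r?\n$/;    # headers end at the blank line
        print $line;
    }
    close $client;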

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.
  • Since I'm stuck with Windows for a couple more months, I'm wondering if this will work on Netscape 4.7+ for Windows. Or even IE 4+ for Windows. Does Opera do this?
    Let me quote part of the original question:
    ...the "Accept-encoding: gzip" field sent by NS 4.7+, IE 4+...
    If Netscape 4.7+ and IE 4+ claim that they can accept gzipped data, they had better know how to handle it.

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.
  • Why bother compressing data?

    For conventional web pages, I agree. The slowness of most web sites is due either to graphics or to some slow CGI on the server side. Compression of HTML wouldn't help them much.

    There are also cases where the HTML is just plain resource-intensive for the browser to render (lots of nested tables, for example). Adding in the extra step of de-compressing wouldn't help there either.

    However, I could see clients (not necessarily browsers) sucking down large chunks of XML in a gzipped form. It could be used for things like sending thousands of raw database records to a client application for further processing and presentation to the end user.

  • by kevin42 ( 161303 ) on Thursday September 21, 2000 @09:31AM (#763622)
    http://perl.apache.org/guide/modules.html#Apache_GzipChain_compress_HTM
  • by Quietust ( 205670 ) on Thursday September 21, 2000 @12:50PM (#763623) Homepage
    Here's what IE5.5 gives when I go to http://127.0.0.1/:

    GET / HTTP/1.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
    Accept-Language: en-us
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
    Host: 127.0.0.1
    Connection: Keep-Alive


    In comparison, Netscape 4.75:

    GET / HTTP/1.0
    Connection: Keep-Alive
    User-Agent: Mozilla/4.75 [en] (Win98; U)
    Host: 127.0.0.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en
    Accept-Charset: iso-8859-1,*,utf-8


    The main points of interest are that IE5.5 speaks HTTP/1.1 while Netscape only requests HTTP/1.0, and that IE5.5 also claims to handle both gzip AND deflate encoding, even though they're nearly the same thing (gzip is essentially the deflate algorithm with an extra header and checksum).

    I also tried sending the IE5.5 HTTP request via telnet to www.excite.com; it returned plain text, whereas Netscape's HTTP request returned gzipped data.

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.
  • by AT ( 21754 ) on Thursday September 21, 2000 @11:54AM (#763624)
    The page [mozilla.org] quoted in the article shows it's a pretty big win for some "typical use" sites on slower modems.

    Incidentally, no extra load would be necessary on the server for static content if it were pre-compressed.
