The Internet

Look-Ahead Caching For HTTP Proxies?

ryandlugosz asks: "Why can't I find an easy way to do look-ahead web caching/proxying for my network? I'm running the Squid web proxy/cache right now and it's great; however, I want to take it a step further. Why doesn't Squid look at the HTML I download, parse out the 'A HREF' and 'IMG SRC' tags, and go get those documents while I'm reading the page that I just loaded? That way, when I click on any of those first-level links on the page, there is a good chance they're already sitting in the cache waiting for me. There would have to be a way to prioritize these look-aheads, so that explicit page requests are always loaded before look-ahead pages are downloaded. I've got lots of bandwidth and disk space at my disposal... why can't I do this easily in Linux? (BTW: there used to be a couple of Windows apps that would do this for you, such as Peak's Net.Jet, but they appear to be gone now.) I'd appreciate any recommendations other /.ers can provide."
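For concreteness, here is a minimal sketch of what such a look-ahead fetcher could do, in modern Python. The proxy address, the same-site filter, the worker count and the link cap are all assumptions for illustration, not anything Squid ships with: it parses the A HREF and IMG SRC targets out of a page and quietly GETs them through the proxy so the responses land in its cache.

    # Sketch of a look-ahead prefetcher that warms an HTTP proxy's cache.
    # PROXY, the worker count and the link cap are assumed values, not Squid settings.
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    PROXY = {"http": "http://localhost:3128"}   # assume Squid is listening here

    class LinkExtractor(HTMLParser):
        """Collect A HREF and IMG SRC targets from a page."""
        def __init__(self, base):
            super().__init__()
            self.base, self.links = base, []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and attrs.get("href"):
                self.links.append(urljoin(self.base, attrs["href"]))
            elif tag == "img" and attrs.get("src"):
                self.links.append(urljoin(self.base, attrs["src"]))

    def fetch(url):
        """GET a URL through the proxy so the response ends up in its cache."""
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
        try:
            with opener.open(url, timeout=10) as resp:
                return resp.read()
        except OSError:
            return b""                          # prefetch failures are not fatal

    def prefetch_links(page_url, max_links=20):
        """Fetch a page, then pull its first-level, same-site links in the background."""
        parser = LinkExtractor(page_url)
        parser.feed(fetch(page_url).decode("utf-8", errors="replace"))
        same_site = [u for u in parser.links
                     if urlparse(u).netloc == urlparse(page_url).netloc]
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(fetch, same_site[:max_links]))

    prefetch_links("http://slashdot.org/")

Prioritizing explicit requests over these background fetches (and throttling them) would still have to happen in the proxy itself, which is what much of the discussion below turns on.
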
  • You could avoid the ad problem and the "porn pop-ups" by not having the proxy server follow links that go offsite, and by making it ignore JavaScript. Of course it would still follow these links when clicked on, but it wouldn't follow them pre-emptively. Sounds like someone should just get the code for Squid and start hacking.
  • Just imagine if your look-ahead proxy decides to prefetch all the one-click buy links on Amazon's (and nobody else's) site.

  • But this sounds like a quick way to saturate your link. Imagine how much content it would bring back that you would end up throwing away. And when do you check for the freshness of the cached content?
  • by goldmeer ( 65554 ) on Tuesday February 06, 2001 @07:37AM (#454207)
    This is a bad idea for 2 reasons:

    1. It's not very net-friendly.
    Imagine, if you will, that you have a web site. Let's say it's a simple site with 50 static pages of text and graphics. You pay to have it hosted with certain bandwidth restrictions. Now let's say your site is linked to from another site. If some yokel with a look-ahead caching proxy pulls up that other site and is reading the info there, in the background your site is getting hit. Possibly very hard, if the yokel has broadband. Now, this yokel never wanted to see your 50 pages at all, but you know what? You just paid for that yokel to saturate your web server for a period of time even though he never saw your site.

    2. You might not want what's out there.
    Let's say you are looking for a tidbit of information -- say, some information on the book "Little Women". You go to your favorite search engine, type in "little women" and click GO. The results page may indeed have pertinent information, but it may also have links to websites containing stuff you don't want to see, like kiddie porn. Despite you not wanting to download the kiddie porn site, your computer sees the link, goes to the site and downloads away into your cache. "No problem, I delete my cache regularly," you might say. With the current state of law enforcement, how sure can you be that you are not being monitored for illegal activity such as possession of child pornography? According to all the logs, you visited the kiddie porn site. If the Feds bust down your door *before* you delete your cache, you are busted! (And if you *do* delete your cache, they still may be able to recover the information anyway.) Even if you can convince the jury that you didn't mean to download kiddie porn (like they'd believe you), you are still looking at several months of the police holding your computer(s) "for evidence".

    If you have your look-ahead proxy looking more than 1 link deep, you are really asking for some bad juju. Even 1 link deep can be really bad.

    Is it worth the damage? Is it worth the risk?

    -Joe

  • I think it would be better for people to focus on things like ... on-the-fly gzipping of webpages before sending them to the host, and expanding HTTP to handle these sorts of things...

    HTTP 1.1 already has transfer encoding [w3.org] as an optional parameter that can be negotiated between the client and server. I'll bet we can all guess which side of the communication is lacking support for it.
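    As a rough illustration of that negotiation (this shows the Accept-Encoding / Content-Encoding content-coding side of it, with example.com as a placeholder host):

    # Ask the server for a gzipped response and decompress it only if the server agrees.
    import gzip
    import http.client

    conn = http.client.HTTPConnection("example.com")        # placeholder host
    conn.request("GET", "/", headers={"Accept-Encoding": "gzip"})
    resp = conn.getresponse()
    body = resp.read()
    if resp.getheader("Content-Encoding") == "gzip":         # server chose to compress
        body = gzip.decompress(body)
    print(resp.status, len(body), "bytes after decoding")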

  • So give Squid a memory of what has been read in the past, and prefetch the one link which tends to get read after the one just fetched. Now you get to figure out "after" algorithms: 24th link from the top on this site, while after the CNN.com frontpage the BBC Science page is most likely...
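    A crude sketch of that kind of "after" bookkeeping (purely hypothetical, nothing Squid actually keeps): count which link each page's readers follow next, and prefetch only the historical favourite.

    # Track which link tends to be followed from each page; prefetch the favourite.
    from collections import Counter, defaultdict

    follows = defaultdict(Counter)      # page URL -> counts of pages visited next

    def record_click(from_page, to_page):
        """Call whenever a request's Referer shows the user followed a link."""
        follows[from_page][to_page] += 1

    def best_prefetch(from_page):
        """Return the link most often followed from this page, if any has been seen."""
        ranked = follows[from_page].most_common(1)
        return ranked[0][0] if ranked else None

    # example URLs only
    record_click("http://www.cnn.com/", "http://news.bbc.co.uk/sci/tech/")
    print(best_prefetch("http://www.cnn.com/"))
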
  • by Pilchie ( 869 ) on Tuesday February 06, 2001 @08:30AM (#454210) Homepage
    I think the WWWoffle cache that comes with Debian can do things like this. I remember when I first installed Debian, I was playing with it and told it to go and fetch /., AND everything it linked to, to a depth of 5 -- or was it 15, I can't remember. What I do remember is waking up the next morning to find that my 13GB drive had 20K of free space. OOPS!!!
  • It seems that people are forgetting something:
    Normally, the browser sends the request to the server (possibly through the proxy), but in look-ahead proxying the proxy has to generate the request. Since it is normally caching for many users, it would have to leave out information such as cookies and user auth - in other words, huge bunches of the pages the proxy grabs may have the same URL, but would not at all be what the user actually wants when he clicks on the link!
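    A toy illustration of that mismatch: if the cache key has to include the personal headers (as it must for such pages), a prefetch issued without them can never satisfy the user's later click. The URL and header values here are made up.

    # A cache key that includes the user's Cookie/Authorization headers will never
    # match an entry the proxy prefetched without them.
    def cache_key(url, headers, vary=("Cookie", "Authorization")):
        return (url,) + tuple(headers.get(h, "") for h in vary)

    prefetched = cache_key("http://example.com/inbox", {})                        # proxy's look-ahead GET
    user_click = cache_key("http://example.com/inbox", {"Cookie": "session=42"})  # the real request
    print(prefetched == user_click)   # False: the prefetched copy isn't the user's page
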
  • Link saturation seems to be the point -- on the principle that "I've already paid for all this bandwidth, so why can't I use it to the fullest extent? If I'm not getting something right now, then I could be getting something that I might be needing."

    It actually seems like a neat idea to me, and one that I've thought of as being useful (although if it became commonplace, I'll bet ISPs would be failing left and right).

    Another useful thing would be the ability to "queue" up downloads explicitly, instead of implicitly by reference (I think you can with some Napster clients, although not in the ideal way), so that when the link is not busy it will start sucking down data.

    The hardest part to implement I think would be to ensure that your link isn't fully saturated when you want to use it. Then you would see a delayed response as your clever "look-ahead" backed down -- would you really see a net positive speed gain then?

    The algorithm would have to be something like:

    Is there any traffic currently, or has there been in the last X minutes? If so, then only use half or 3/4 of what is currently available (i.e. left over from what is currently being used, if any).
    Else, if during non-standard usage hours, use 9/10ths of the link.
    Else, if during standard usage hours, use 3/4 of the link.

    Obviously there would be a CPU price to pay, but if you've already got a much more powerful firewall/proxy than you need, why not?
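    Something along those lines, as a rough sketch -- the hour boundaries, the "X minutes" window and the fractions are all placeholder values:

    # Decide what fraction of the spare link capacity the prefetcher may use,
    # following the policy sketched above. All thresholds are placeholders.
    import time

    QUIET_WINDOW = 5 * 60           # the "last X minutes", in seconds
    PEAK_HOURS = range(8, 18)       # assumed "standard usage hours"

    def prefetch_budget_kbps(link_kbps, in_use_kbps, last_user_traffic):
        spare = max(link_kbps - in_use_kbps, 0)
        recently_busy = (time.time() - last_user_traffic) < QUIET_WINDOW
        if in_use_kbps > 0 or recently_busy:
            return spare * 0.5                      # or 0.75; stay out of the user's way
        if time.localtime().tm_hour in PEAK_HOURS:
            return spare * 0.75
        return spare * 0.9                          # off-hours: use 9/10ths of the link

    print(prefetch_budget_kbps(1500, 0, last_user_traffic=time.time() - 3600))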

  • The point being that caches are designed to maximize the efficiency of networks of computers, not individuals. The basic idea is that if I look at /., someone else on my network is more likely to be looking at /. than the Barney [barney.com] website (at least I hope that's true). By caching the pages most often requested by people on the network, bandwidth usage and latency are both reduced.

    Caches and caching algorithms are not optimized for individual behavior. Yes, you could make them do what you want, but others have quite accurately pointed out that the cost of a little reduced latency is high in terms of consumed bandwidth. Or alternatively, the algorithm for look-ahead would have to be well-tuned to your preferences - something that has proved notoriously difficult to teach computers.
  • by crazney ( 194622 ) on Monday February 05, 2001 @09:43PM (#454214) Homepage Journal
    OK, I think all you people rambling about "what about 1-click shopping" are missing the point...

    because the thing would have to be INTELLIGENT!

    For example, say I'm on slashdot.org and there is an ad for ThinkGeek, pointing to the "thinkgeek.com" domain -- of course it wouldn't get that, as that is on a different domain.

    If it looks like a big message board, with lots of links, of course it wouldn't follow them.

    Also, it'd have to have a selectable read-ahead level -- the default would be only 1 level -- hence if I did it on Slashdot, it would go to the "Read More"s and a few others, but wouldn't KEEP GOING.

    There is so much more intelligence it could have -- e.g., if it discovered that on Slashdot I never clicked the "Rob's page" link, it would stop getting that after a while.

    And of course, to save space, these links would have a low expiry time (30 min).

    Shouldn't be too hard to implement in Squid; maybe there's a plugin somewhere.

    crazney

    "Who is General Failure and why is he reading my hard disk ?"

  • by pjrc ( 134994 ) <paul@pjrc.com> on Tuesday February 06, 2001 @08:55AM (#454215) Homepage Journal
    My little web site suffers regularly from people running archiver programs: Teleport Pro, WebZIP, WebReaper, WebCopier, HTTrack, Wget, WebSymmetrix, Xyro, and many others. Some people set these things to run late at night, which helps a bit, but it still impacts responsiveness for overseas visitors. Oftentimes people running these things have dialup... but when a DSL user runs Teleport Pro on my little site, it really hurts performance for everyone. My site offers free technical resources, hardware and firmware development tools, and other similar stuff, so many people want to download the whole thing to their hard drives. I have years of experience trying to deal with the problems from this sort of activity.

    It's easy to get into the "gimme gimme gimme" mode of thinking when surfing the web. After all, "information wants to be free", right? The sad truth is that it costs real money to host a web site. Sure, there are some crappy free hosting services [freewebsites.com], but their performance is dismal AND your page gets served with their adverts. User-account websites at ISPs (www.some-isp.com/~username) come at no extra charge, but only allow a few megs of data and rather limited transfer each month.

    Web-unfriendly software raises the cost of hosting a web site. Low-cost hosting usually seems to be billed on the number of bytes transferred, and software like what you're proposing will needlessly increase the site owner's costs. High-end hosting tends to be billed on bandwidth (not total transfer), so this software doesn't hit the site's owner directly in the wallet; instead it just makes the site less responsive for other users.

    This idea is even worse than archivers, as most of the bytes will sit in a cache and get expired, instead of in an archive directory where they _might_ someday be seen or used. In the case of an archiver, the user went out of their way to obtain a complete local copy of the web site, presumably because they are interested in the material and might actually read it offline. With a predictive caching proxy, there's no indication from the user that they will ever make any use of the material downloaded. The vast majority will sit in the cache, which will in all likelihood rapidly need to remove the least recently obtained pages. In normal caching terminology, one would say "least recently used", but in this caching scheme, the vast majority of pages in the cache will never have been viewed. Utter waste.

    This sort of net-unfriendly behavior is analogous to pollution. If just one person pollutes the environment, there is some small harm to a small number of people, perhaps significant harm to a couple if the pollution is severe. If a large company pollutes recklessly, perhaps a community or two is badly affected. If pollution becomes widespread, it's a global problem and almost everyone is harmed.

    Likewise, if you hack together a predictive look-ahead caching proxy and use it amongst yourself and your friends, you're impacting the net similarly to taking your used motor oil and other waste and dumping it directly into a local stream. If a large company or two replaces their bandwidth-conserving Squid proxy with your bandwidth-abusing look-ahead cache, a lot of sites will suffer increased costs. If its use becomes widespread, it would significantly increase overall bandwidth usage on the net, in all likelihood raising costs enough to be passed all the way back down to end users, and it would raise the cost of hosting web sites, which would need to be made up somehow. Perhaps large websites could absorb the cost of more bandwidth? Smaller sites, like mine, would be in a world of hurt. For quite some time, I paid a couple hundred dollars a month out of pocket to keep the site up. Now we're making some small sales from the site... getting close to covering the costs.

    Well, that's been a long rant. I hope you'll take a moment to consider that web site operators pay real dollars to make their sites available to you, and keep that in mind when you consider designing networking software.

  • by whydna ( 9312 ) <.whydna. .at. .hotmail.com.> on Monday February 05, 2001 @10:28PM (#454216)
    This is not the purpose of Squid. Squid is designed to help a large number of users share a not-so-large amount of bandwidth. If Squid performed the actions that you mention in your post, it would be defeating this purpose. Imagine if you have 200 users sharing a DSL line. Squid would cache all those pages that everybody constantly loads up (Yahoo, MSN, etc.). If it also grabbed all the extra links, it would be wasting bandwidth because the cache miss rate would be so incredibly large.

    What should really happen is that the first user that goes to a site should have to wait a bit. Squid will then have a cache and it'll be fast after that. But you know that.

    Squid just isn't the right tool... perhaps if you rewrite it/make your own proxy you can do what you're looking for...

    -Andy
  • On the other hand, these are programs that go downloading content by following links on the pages you read. Wouldn't most people who run sites love that? It lets you increase the number of hits you can tell advertisers you get, as well as the click-through rate for ads you run. While these things may not be important for your (non-ad-supported) site, for many others that's cash in the bank.

  • But wouldn't a user spidering your site (partially or entirely) and then browsing it offline end up reducing traffic? I mean, they're not flipping back and forth and repeatedly requesting pages (especially dynamically generated ones). If you are complaining because this loses you ad revenue, I think the whole banner ad system is pretty much a failed endeavour. In any case, it's nothing as drastic as throwing toxic chemicals in a stream.
  • Use Google's cache. http://www.google.com/search?q=cache:www.domain.com will bring up the cached version of the page (with the Google header). They can probably tweak this to avoid abuse, but for now it's open.

    I think a lot of time is wasted waiting for pages to appear. I routinely use Google's cached version of a page to avoid indefinite waits (and in fact the "stale" version is often better than a page that's gone dead -- after all, it's more likely to match my query). Large-scale caching a la Akamai is probably the way to go for a lot of sites.

    It would be interesting to build a distributed network of cache sites to decrease latency. My thought is to use a proxy of some sort to get requested pages, and then stream them out on IP multicast to other proxies. Each client proxy could then maintain a certain number of recently-broadcast pages as a cache. I think this has a lot of similarity to Gnutella, so maybe a gateway could be built around this program instead.
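    For what it's worth, building that kind of cache: request is a one-liner (this just reproduces the query format mentioned above; whether Google keeps allowing it is another matter):

    # Build the Google-cache lookup URL for a page, per the query format above.
    from urllib.parse import quote

    def google_cache_url(page_url):
        return "http://www.google.com/search?q=" + quote("cache:" + page_url)

    print(google_cache_url("www.domain.com"))
    # http://www.google.com/search?q=cache%3Awww.domain.com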

  • That could do really sad things to sites that employ a system to prevent replay attacks. Not to mention consume bandwidth and increase the chances of you getting to read perfectly cached, out-of-date data...

    Good idea, but I think it would be better for people to focus on things like hardware-driven compressors for on-the-fly gzipping of webpages before sending them to the host, and expanding HTTP to handle these sorts of things...
    I think a partially compressed format would be interesting... e.g.
    <COMP method="gzip" size="14842">
    --- The Data ---
    </COMP>
    Nesting these would not be a problem...

    But what the hell, no one would vote for me anyway.....
  • Sounds to me like a bad idea. A few friends and I set up a tiny website where we could exchange ideas and store tidbits of useful information for posterity. One of us (not me!) used a browser which had the "feature" (bug) that it could cache an entire website. All the articles also had a "Delete article" link. You do the math.

    Luckily, we also had a calendar in there, in which the aggressive cacher got lost. ;-)

