Apache Software

Throttle Apache Bandwidth Based on IP Address?

BigBlockMopar asks: "A friend of mine runs a web site which offers a very large archive of files. He wishes to continue to offer free and unrestricted access to his archive, but his bandwidth consumption has been through the roof because of people using wget (and similar) to download his entire site. Current traffic is around 200 gigabytes per month with over 50% of that being clients who are downloading every document on the site. The server space is donated by a hosting provider who is understandably starting to become impatient with the traffic. I've checked out mod_throttle and mod_bandwidth, neither appears to do exactly what is desired. Does anyone have any suggestions?"

"Eventually, he plans to set up mirrors, but he'd like to get the greedy users under control first. Alternatives are adding a (free) log-in authentication system, or a text-in-image system like Network Solutions uses to weed out automated whois queries. But I think the best solution is to allow a given client IP address full-speed downloading for the first $WHATEVER megabytes, and then automatically reduce the speed of the transfer to that IP address. This would probably deter most leeches but continue to allow legitimate users to transfer more than an arbitrary limit."

This discussion has been archived. No new comments can be posted.

  • BitTorrent (Score:3, Interesting)

    by Directrix1 ( 157787 ) on Saturday January 31, 2004 @04:53PM (#8145847)
    Well, you could always zip the whole site up, and put that up as a bittorrent link.
    • A zip? Ugh!

      Aside from the format itself (a .tar.bz2 is better for most, I imagine), a single zipfile of the whole site is huge, and it does nothing to help him with ordinary users. I suggest that all links but bittorrent be removed, and have the torrents be one per file, so that for most users the download is just as before, except now they need bittorrent installed.

      This even allows the recursive get people to work. They use wget to mirror all the html, which downloads torrents in the process. Then they spin off a one-liner to find all
      • I know bz2 usually gives better compression ratios (well, chances are if he's zipping JPGs or MPGs it isn't going to compress down much anyways), but zip is much more prevalent.

        As for a torrent for every file: please, he'd waste more traffic on the tracker. It would have to be one large file; otherwise there wouldn't be enough concurrent downloads to justify the overhead of the tracker for each individual file.
        • .tar.gz might not save as much (or even any) over zip as .bz2 would, but Winzip supports tar and gzip (albeit separately), and WinRAR supports all that plus .bz2.

          For that matter, Stuffit Expander supports all of the above. But this isn't worth arguing about -- I'd probably never distribute a zip, but I can see why people do and I don't have a huge problem with that, especially since the format is standardized and works with opensource tools.

          As for the one gigantic torrent, how big a site are we talking a
          • A couple big chunks would work. I was just saying that a torrent for every file would definitely not work to save bandwidth, because there is tracker overhead and chances are no two people will be downloading the file at the same time.

  • I've been searching for this for over a year now, and so far have come up with nothing. I'm afraid it's not around, but if it is, let me know ;^)
  • by nudelding ( 101942 ) on Saturday January 31, 2004 @05:03PM (#8145905)
    If you just slow down the connections you will have a lot of nearly idle Apache processes running, and after a while you won't be able to accept any more clients.
    Either drop the connections outright, or use a single-process proxy with the required throttling ability in front, which then forwards requests to Apache.
    But restricting by IP can be dangerous if users are sitting behind an ISP proxy (very common, at least in Germany).
  • Javascript? (Score:2, Interesting)

    by zcat_NZ ( 267672 )
    Set up javascript links, which wget can't follow.

    Or set up a 'captcha' for each download, so that a human has to confirm each file one at a time.
    • and, as a follow-up to my above post, a captcha is a horrible solution to anything. In fact, as has been shown before, a captcha can easily be defeated by having the spammer offer a service (probably porn) which uses someone else's captcha to verify the user, and in so doing registers (say) a hotmail account. If users are really determined to automate something, human labor can be cheaper than machine labor. And for the rest of us, it's just a pain in the ass.
    • > Set up javascript links, which wget can't follow.

      It's possible for clueful scripters to get around this using WWW::Mechanize,
      as I had to do in my script that retrieves database usage stats for the
      commercial databases we subscribe to at work. Ebsco's usage stats thingydoo
      uses script links that call functions in an external, linked .js file. The
      functions then change certain form elements and submit the form. So I just
      told my WWW::Mechanize browser object to change the form elements in question
      and sub
  • by Loualbano2 ( 98133 ) on Saturday January 31, 2004 @05:07PM (#8145927)
    If you enable referer checking, this will stop most wget-type programs. Wget has a --referer=URL option, but I find that it doesn't work. Also, there are a lot of Windows clients that will spider a website and pull files based on extension, but again these don't usually have an option to set the referer, or if they do, most people aren't smart enough to turn it on.

    One exception to this is Pavuk, which does referer spoofing pretty well. This program is about 4 times harder to use than wget, and isn't very popular (you don't see it included in distros too often).

    Of course this won't completely fix your problem, but it will probably stop about 90% of the people doing it now. It's an easy fix that you can implement quickly until you get something to throttle bandwidth properly.
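
    A minimal sketch of the server-side check, assuming Apache with mod_setenvif (the site URL, the env variable name, and the extensions are placeholders):

    SetEnvIfNoCase Referer "^http://www\.example\.org/" local_referer=1

    <FilesMatch "\.(zip|pdf|mp3)$">
    Order Deny,Allow
    Deny from all
    Allow from env=local_referer
    </FilesMatch>

    Note that this also blocks people who paste the URL straight into their browser, since they send no referer at all.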

    -ft
    • Wget has a --referer=URL option, but I find that it doesn't work.

      Which version of wget are you using? The referrer option works fine for me--for one website when I don't use it, I get redirected to the main page. With the referrer option I can download the file. Although something that sets the referrer automatically would be best.
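
      For what it's worth, a line like this does the trick for me (the URLs are placeholders):

      wget --referer=http://www.example.org/archive/ http://www.example.org/archive/file.zip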

      • You may be correct about wget; the last time I tried it was about 6-7 months ago and I didn't get it to work. I didn't try all that hard though, because I already knew of pavuk and I knew it did a better job.

        http://www.idata.sk/~ondrej/pavuk/pavuk.1.html

        That is the man page to Pavuk. If you search for referer on that page, it will tell you how to use the -auto_referer option.

        -Fran
        • If you search for referer on that page, it will tell you how to use the -auto_referer option.

          That will not work in all situations. A lot of servers just check to see that you come from one of their pages. A correct implementation would be to allow only clients whose referer is a page that actually points to the file, which is why I suspect no one has added it to curl or wget.

    • > If you enable referer checking, this will stop most wget type programs.

      This won't faze WWW::Mechanize one iota. When it follows a link, it sends
      the referer just like any other browser, unless you tell it otherwise.

      You really are going to have to throttle based on client IP. They can't fake
      that if they want to get the information back, so you can positively identify,
      "this is the same host that has already requested n pages in the last thirty
      seconds".

      If you throttle overall, without throttling base
  • by DDumitru ( 692803 ) <doug @ e a s y c o .com> on Saturday January 31, 2004 @05:08PM (#8145933) Homepage
    If you want to limit BW by IP address, this might be doable depending on what the server is. If the server is a Linux or "virtual Linux" box, you can probably use 'tc' (Traffic Control) in the kernel to meter bandwidth by subnet or address. This works pretty well. Look at the advanced routing HOWTO for info. It is a bear to set up, but actually works quite well.

    The problem is that if these are bots grabbing your whole site, slowing them down to 10K/sec won't actually reduce the amount of traffic they pull from you. They may take all day to get the pages, but the bytes will still move.

    Some options that you have:

    * If the user really doesn't need the data, block their address entirely.
    * Consider blocking the bots' "client" signature. You can do this in Apache. "Respected" bots don't lie about who they are. If a bot does lie, then it is a DoS attack in disguise.
    * Contact the users, if you can.
    * If you want the user to get a mirror, set up something that actually does the mirroring effectively. I would recommend running rsync.
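
    A rough sketch of the rsync idea (the module name and path are made up); export the archive read-only from rsyncd.conf and let mirrors pull it:

    # /etc/rsyncd.conf
    [archive]
    path = /var/www/archive
    comment = full site archive -- please mirror this instead of wget'ing the site
    read only = yes
    max connections = 5

    Mirrors then just run something like "rsync -av rsync://www.example.org/archive/ /local/mirror/".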

    • Also, wget respects robots.txt; just specify what is and isn't allowed for wget to grab. Granted, this can be circumvented, but it should help with most of the users who aren't smart enough to get around it.

      This link [robotstxt.org] should help you out.
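
      A minimal robots.txt along those lines (the path is a placeholder); wget checks robots.txt on recursive downloads and matches its own name:

      # robots.txt
      User-agent: wget
      Disallow: /archive/

      User-agent: *
      Disallow: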
      • I was gonna recommend robots.txt as well. Also, if your site is really that popular and you don't mind people grabbing mirrors of it, why not post a link to a heavily compressed archive, maybe even 'torrented like another poster suggested.
    • consider blocking the 'bots' "client" signature. You can do this in Apache. "Respected" bots don't lie about who they are.

      Unless you have servers which look at the client and then exhort you to get the "supported" browser (read IE).

  • Wrong solution (Score:3, Informative)

    by insensitive claude ( 645770 ) on Saturday January 31, 2004 @05:11PM (#8145964) Journal
    I don't think a bandwidth limitation is going to be effective for this situation. They're still going to consume the same amount of bandwidth, just over a longer term. It's not like people usually sit and wait for the site sucker to do its thing. Bandwidth limiters like you suggested are usually used to reduce the effects of slashdotting and the like.

    What you need is an anti-leech mod to limit the amount of data that can be downloaded from a specific IP. I know they're out there. Just do a bit of googling.
    • Actually it should work just fine. The problem isn't the total amount of data being transferred, just that it transfers so fast. If the transfers are done at a slower rate then that leaves that much more capacity for others at any given time.

      So what you say is true and a good thing. People don't sit and wait so they won't mind if the transfer takes longer. Meanwhile the provider has more bandwidth available for other customers at any given time.

  • If this is a Linux server he could look into QoS/fair queuing; however, this would require a lot of research and testing to find the right setup, and maybe even a kernel recompile depending on the distro.

    If the hosting service had a problem with excessive bandwidth usage, I don't know why they didn't throttle port 80 at the routers.

    • QoS is kinda unlikely to be of much use.

      Besides, it sounds like they have enough bandwidth, and if they have enough bandwidth to actually serve the stuff, then slowing the downloads down won't really decrease the amount of bytes transferred at all. So what he seems to really want is to limit the number of visitors that wget the whole site.

      looks like he would like to make the wgetting of the site impossible, but this is actually quite hard because basically wget is for just that and even if he makes some nice c
  • Unfortunately, likely the best recommendation you'll receive.
  • BT (Score:4, Insightful)

    by Gadzinka ( 256729 ) <rrw@hell.pl> on Saturday January 31, 2004 @05:20PM (#8146020) Journal
    Does anyone have any suggestions?

    Yes, use the frelling BitTorrent, that's exactly what it was written for!

    Add to this some way of limiting bandwidth per connection (so people are mainly downloading from other BT clients, not from you) and you have a perfect means of distribution.

    Leave the possibility to download via HTTP, but limit it with QoS or some other means to a tiny little stream, plus advertise all over the site that people can achieve unlimited download speeds using BT.

    Publishing documents and then limiting access to them in every possible way (like all the game file servers do) is unwise, to say the least. Especially if you don't have to.

    Robert
  • Compile a CBQ module for your kernel,

    find a way to parse the log and find out (in as close to real time as possible) who is slurping the whole site, and

    restrict that IP's bandwidth to the machine entirely, not just Apache.

    The Linux kernel can handle these things very easily. Why bother with Apache? Use the Linux machine itself to do it.

    This, of course, assumes you are using Linux.
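
    For the log-parsing step, a crude sketch assuming the common log format (field 1 is the client IP, field 10 the bytes sent); feed the worst offenders into your tc or iptables rules from cron:

    # top 10 clients by bytes transferred
    awk '{bytes[$1] += $10} END {for (ip in bytes) print bytes[ip], ip}' \
    /var/log/apache/access_log | sort -rn | head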
  • by zcat_NZ ( 267672 ) <zcat@wired.net.nz> on Saturday January 31, 2004 @06:03PM (#8146255) Homepage
    Don't bother trying to rate limit downloads; you'll get exactly the same number of people downloading everything, except that instead of doing it quickly they'll leave wget running all week and tying up your server's resources.

    Have a page "download.php?filename=foo.txt" that all your links point to, and have that page return <meta http-equiv="Refresh" content="1;URL=files/$filename">

    (pseudocode; my php scripting is not great, but you get the idea..)

    This totally breaks wget, although it's not too hard to script around. You'll cut spider traffic back by probably 95%, all the casual 'grab everything we can' downloaders, but people who really want to get all your files will still figure out how to.

    Or if you totally want to stop automated downloads, put each file behind a 'captcha'.
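
    A fleshed-out version of that pseudocode might look roughly like this (the file layout is hypothetical; note the basename() call so people can't request arbitrary paths):

    <?php
    // download.php -- return a meta-refresh page instead of the file itself
    $filename = basename($_GET['filename']); // strip any directory components
    if (!is_file("files/$filename")) {
        header('HTTP/1.0 404 Not Found');
        exit;
    }
    echo '<html><head>';
    echo '<meta http-equiv="Refresh" content="1;URL=files/' . rawurlencode($filename) . '">';
    echo '</head><body>Your download should start in a moment...</body></html>';
    ?>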

    • It might break wget, but HTTRACK will get around that.

      I would use PHP and log users and their IPs, and only allow so many files per user. You do want return traffic, right?

      And as others have said, use bittorrent.
        • Good idea... but how do you handle people behind proxies and NATs? IP address != user anymore. I suppose overall it might not be a problem, but I'd be hesitant to limit access on IP address alone.

        Yeah, in this situation, BT would probably shine.
    • The proper code would be (IIRC)

      Make a file with this code in it; call it downloader.php

      <meta http-equiv="Refresh" content="1;URL=<?php echo $_GET['file']; ?>">

      then the url to download would be

      http://site.com/downloader.php?file=http://www.site.com/file.zip
    • > Don't bother trying to rate limit downloads; you'll get exactly the same
      > number of people downloading everything, except that instead of doing it
      > quickly they'll leave wget running all week and tying up your server's
      > resources.

      Rather than slowing the transfer, I'd send a 503 response to clients that
      have pulled too much in the last hour/day/week/whatever. If you have to
      send the same client too many 503s in too short a time, switch to 403.
      If you're concerned about really persistent people c
    • Have a page "download.php?filename=foo.txt" that all your links point to, and have that page return <meta http-equiv="Refresh" content="1;URL=files/$filename">

      This sucks. The biggest problem: I right click on the link in my browser and select "save to disk". I'm not spidering the site, I just want to save one lousy document. However, I don't get the document, I get the stupid referrer page. This occasionally bites me when I download software from SourceForge and the link suggests that it's a dir

  • by Etyenne ( 4915 ) on Saturday January 31, 2004 @06:21PM (#8146353)
    I strongly second the idea of offering your files via BitTorrent only. If, however, you must continue to offer them via plain HTTP, you should be able to cook up something with a custom Apache module. I suggest having a look at http://www.oreilly.com/catalog/wrapmod/
  • by stevey ( 64018 ) on Saturday January 31, 2004 @06:22PM (#8146356) Homepage

    I wrote an apache module which I call mod_curb [steve.org.uk] (for Apache 1.3)

    This doesn't do exactly what you want, but I'm sure if you were to ask me or somebody else we could code something for you.

    The basic idea I have for your problem is to have a database of currently active clients, be it MySQL or flat files, so you can keep track of all data transferred by each address.

    Once a threshold has been reached you can either stop everything, or start throttling.

    However, throttling alone won't help you: they'll still mirror you, just more slowly.

  • If you just want to prevent people from automatically slurping files off your server with wget or the like, try this in an .htaccess file:

    BrowserMatchNoCase wget blocked_agent=1

    <FilesMatch "\.(MP3|mp3|avi|mpeg)$">
    Order Allow,Deny
    Allow from all
    Deny from env=blocked_agent
    </FilesMatch>

    Of course, people can pass the -U flag to wget and get around this, but it'd work while you get a real solution in place.

    -B

  • by lorcha ( 464930 ) on Saturday January 31, 2004 @06:50PM (#8146511)
    If you are trying to limit the actual amount of downloaded bytes, how would throttling by IP help? If Larry the leech types
    wget -r -l99 http://your.site.org/
    he's just gonna walk away from the machine and check back when it's done. Whether you serve up all those files in 1 minute, 1 day, or 1 week doesn't matter: he's still downloaded exactly the same amount of data from you. Your solution only works if you're trying to limit transfer rates, which you should be able to do with your mod_throttles of the world.

    If you're just trying to discourage people from downloading so much from you, you need to set up mirrors, bittorrents, or some other way of protecting your site. Maybe you could reduce the size of your site? Is it an archive of pictures? If so, maybe your friend could reduce the size of them? I mean, if he's offering 100,000 pictures at 100k each, reducing the size/quality so they're only 30k each will really cut your bandwidth.

    If it's all text, maybe you could use some kind of compression. If it's video, maybe use a lower bitrate. You get the idea.

    But just limiting transfer rates by IP is probably not gonna help.

  • Well, some people may not know it, but the firewall (iptables) in Linux is very neat when it comes to doing "tricks" with incoming connections.

    First, start by creating a chain that all incoming SYN packets to port 80 should jump to.

    Next, use some sort of PHP script that has sudo permission to add an IP address (DAMN WELL make sure you know how to properly validate those addresses).

    Give this new chain a default REJECT or DROP rule at the end (your call).

    On each new session incoming to the server, call a scr
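
    Roughly what the chain mechanics look like (the chain name and address are made up; user-defined chains can't have a real default policy, so a catch-all rule at the end plays that role):

    # create the chain and send all new port-80 connections through it
    iptables -N WEBCHECK
    iptables -A INPUT -p tcp --dport 80 --syn -j WEBCHECK

    # per-IP rules added by the sudo-wrapped script, e.g. cut off a known leecher
    iptables -A WEBCHECK -s 192.0.2.45 -j DROP

    # everyone else falls through and is handled as usual
    iptables -A WEBCHECK -j RETURN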
    • There are ISPs who give a single user access to up to 255 IP addresses at a time. Every new connection initiated by the client goes through a different IP (kind of round-robin).

      This is of course a direct result of the shortage of IPv4 addresses and the intelligence of the ISPs' admins. Let's hope IPv6 comes soon.
  • by spitzak ( 4019 ) on Saturday January 31, 2004 @08:13PM (#8146906) Homepage
    As several people here have said, if you just slow it down the wget will just take all week, and perhaps use more resources (you will have to keep track of it to slow it down).

    Instead, when they pass the bandwidth limit (or more likely a number-of-requests limit) you should deliver a dead-end page from which there are no links to go anywhere else. Then when they wget it you will get a lot of these dead-end pages instead of the data they want. If a normal user hits it, it can tell them to wait a few minutes and then reload the page.

    If the owner of the material does not mind, it does sound like a bittorrent download would help a lot too. Have the dead-end page give instructions on how to retrieve the bittorrent.

    Anyway these are just my ideas, I really have zero experience in web sites so feel free to dismiss them as stupid.
  • Could you post a link to your friend's site? :)
  • by dj.delorie ( 3368 ) on Saturday January 31, 2004 @09:46PM (#8147453) Homepage
    What I do is have a hidden link at the top of every page that links to a specially-named missing HTML file in that directory. The missing file handler checks for this special name and, if found, adds the client's IP to the .htaccess deny list. The access denied handler checks the .htaccess list and, if their IP is found, explains the acceptable use policy to them. A cron job expires the .htaccess entries quickly once they stop trying to bulk download.
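
    A rough sketch of that trap, assuming a PHP ErrorDocument handler and a writable .htaccess (the file names are invented, and you still need the cron job to expire the entries):

    <?php
    // trap.php -- 404 handler; enable with: ErrorDocument 404 /trap.php
    // Bans anyone who requests the hidden, nonexistent link that every page carries.
    $ip        = $_SERVER['REMOTE_ADDR'];
    $requested = isset($_SERVER['REDIRECT_URL']) ? $_SERVER['REDIRECT_URL'] : '';

    if ($requested == '/never-follow-this.html') {
        // only bulk downloaders ever fetch the hidden link
        $fh = fopen('/var/www/site/.htaccess', 'a');
        fwrite($fh, "Deny from $ip\n");
        fclose($fh);
    }
    header('HTTP/1.0 404 Not Found');
    echo 'Not found.';
    ?>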
    • What I do is have a hidden link at the top of every page that links to a specially-named missing HTML file in that directory.

      Elegant. That takes advantage of the difference between a human and wget. And even better, the human can understand the AUP; wget can't, and just keeps banging its head. Elegant.
  • Plan for them (Score:3, Insightful)

    by DynaSoar ( 714234 ) on Saturday January 31, 2004 @10:01PM (#8147555) Journal
    If they're going to suck down the whole thing, plan for it.

    Offer it pre-zipped. This would reduce the bandwidth and download time. A plus for everyone.

    Make it easy for people who do this to obtain updates/additions by date.

    As part of accessing the zipped version, ask people to mirror it. If they're going to carry it all, offer it all. Arrange dynamic mirror updating with those willing.

    Find one or more secondary storage site for the archive. Ask people to use these (put them highest on the list).

    If people persist on sucking down the whole thing and don't go for the archive, arrange a throttle with the sysadmin, and advertise it. Let people know that if they try to wget everything, things will start going real slow for them.

    Set up a small version without the files, in parallel to the real one, with a note saying "files temporarily unavailable". Allow the system owner to switch to the small version during times of high traffic so as not to bog down his other users, or alternatively, switch it yourself according to the owner's estimates of his traffic and times.
  • A friend of mine runs a web site which offers a very large archive of files.....his bandwidth consumption has been through the roof because of people using wget...50% of that being clients who are downloading every document on the site

    That friend wouldn't be SCO, by any chance?
  • I think the Apache SpeedLimit perl module [modperl.com] accomplishes the effect that you want, but not in the same way as you asked for.

    SpeedLimit works by limiting the request rate of each IP address. If your web site consists of many small files (which sounds like the case), then curbing the request rate is enough to cure the most abusive bots. A determined adversary can still circumvent request rate limits with wget --random-wait, but it will be more frustrating for them, and a large percentage of clients can be exp

  • I once came across a site that used HTML forms submitted via the HTTP POST method to download files. The site was still easy to use; the only visible part of the form was a button you clicked to start the download. However, that couldn't be automated without some coding.

    I still got what I wanted but I'm sure the vast majority of people wouldn't do it. Most wouldn't know how to and some wouldn't feel like going through the trouble.

    The Javascript method mentioned earlier could work in a simi

    • Actually that is a very clever solution, provided the script checks that the form was actually POSTed and not loaded using GET, because it is easy to write a script that extracts form elements.
      Checking the REFERER in the same process would add a second layer of security.
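
      Something like this, as a sketch (the names are invented):

      <?php
      // fetch.php -- only hand out the file when the form was actually POSTed
      if ($_SERVER['REQUEST_METHOD'] != 'POST') {
          header('HTTP/1.0 403 Forbidden');
          exit;
      }
      // second layer: require a referer from one of our own pages
      if (!isset($_SERVER['HTTP_REFERER'])
          || strpos($_SERVER['HTTP_REFERER'], 'http://www.example.org/') !== 0) {
          header('HTTP/1.0 403 Forbidden');
          exit;
      }
      $file = basename($_POST['file']); // no directory tricks
      header('Content-Type: application/octet-stream');
      header("Content-Disposition: attachment; filename=\"$file\"");
      readfile("/var/www/files/$file");
      ?>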
  • KISS (Score:1, Informative)

    by borgdows ( 599861 )
    Just put Squid (http://www.squid-cache.org/) in front of your Apache.

    Squid's config is very easy for bandwidth throttling by IP.
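
    For reference, the relevant knobs are Squid's delay pools; a class-3 pool can even do exactly what the original question asked for, full speed for the first couple of megabytes per client IP and a trickle after that (the numbers below are made up):

    # squid.conf sketch
    delay_pools 1
    delay_class 1 3
    delay_access 1 allow all
    # aggregate and per-network buckets unlimited; the per-client-IP bucket refills
    # at 16 KB/s with a 2 MB cap, so the first ~2 MB go out at full speed
    delay_parameters 1 -1/-1 -1/-1 16000/2000000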
  • I would use a round robin anonymous proxy and then I could bust your IP based nonsense in a jiffy.

    I already have a module for python to do this that took about half an hour to write.

    HTTP is an open protocol, there is no true way to filter one set of users from another.

    You could always use passworded accounts and use micropayments for bandwidth.

    that's the way to do your usenet porn archive
  • Totally Possible (Score:5, Informative)

    by yancey ( 136972 ) on Sunday February 01, 2004 @11:40AM (#8150841)

    Don't you hate it when everyone tells you something is impossible? It would be much more useful if they kept quiet, so that the people who actually post solutions were easier to find.

    This is absolutely possible and not that hard. It is just that most people don't take the time to learn how. The poster who mentioned Quality of Service (QOS) was correct. You will certainly want to read about traffic control and queueing disciplines.

    Under Linux, use the traffic control (tc) command to configure bandwidth limits by adding or chaining queueing disciplines to your network interface. tc may not come pre-installed with your distribution, so you might have to find it.

    At the end of this post is a script I wrote to limit bandwidth from my website, which limits anything going out of port 8000 to 2 Mbps, but can "borrow" up to 2 Mbps more when bandwidth is available (almost always on a 100 Mbps connection).

    Since you can accidentally limit yourself to near nothing, you'll want a quick way to disable traffic control. The line below removes the "root" queueing discipline from the network interface, which removes all the queueing disciplines that are chained from it.

    tc qdisc del dev eth0 root

    By modifying the u32 queueing discipline parameters, you can quite easily limit based upon IP addresses/networks.

    This should get you started, but you really should read the traffic control documentation and understand how to configure this stuff. Don't just think you can tweak a few parameters in the script and get what you want. I'm not ashamed to admit that it took me a few hours to get a beginning grasp on it.

    OK, here is the script...

    # Add HTB queuing discipline to root of eth0 with handle 1:0
    # unclassified traffic goes to class 1:99
    tc qdisc add \
    dev eth0 \
    root \
    handle 1: \
    htb \
    default 99

    # Add a single class that will limit all bandwidth on this interface
    # This is done so that we can borrow between the classes below
    tc class add \
    dev eth0 \
    parent 1: \
    classid 1:1 \
    htb \
    rate 100mbit

    # Class 1:10 is limited to 2mbit/s but can borrow up to 2mbit/s more when the parent
    # class has spare capacity; in practice that extra 2mbit/s should almost always be available
    tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:10 \
    htb \
    rate 2mbit \
    ceil 4mbit

    # Class 1:99 is limited to 90mbit/s and can not borrow any more
    tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:99 \
    htb \
    rate 90mbit \
    ceil 90mbit

    # Use SFQ to load balance the connections within class 1:10
    tc qdisc add \
    dev eth0 \
    parent 1:10 \
    handle 10: \
    sfq

    # Use SFQ to load balance the connections within class 1:99
    tc qdisc add \
    dev eth0 \
    parent 1:99 \
    handle 99: \
    sfq

    # This filter selects all traffic from port 8000 as belonging to class 1:10
    tc filter add \
    dev eth0 \
    protocol ip \
    parent 1: \
    prio 1 \
    u32 match ip sport 8000 0xffff \
    flowid 1:10
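
    And for the per-IP case mentioned above, the u32 filter can match on an address instead of (or as well as) the port; for example, to push one greedy client (address made up) into the limited class:

    # classify all traffic going to 192.0.2.45 into class 1:10
    tc filter add \
    dev eth0 \
    protocol ip \
    parent 1: \
    prio 2 \
    u32 match ip dst 192.0.2.45/32 \
    flowid 1:10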
  • Instead of treating the leeches as enemies, treat them as friends.

    Perhaps you could have a special section of the site just for leeches. Explain your problems: you have too many people downloading everything.

    Ask for solutions. Maybe they'd all be happy with BitTorrent. Maybe they could help set up mirrors. Maybe they'd voluntarily restrict their leeching to lower-traffic times like the middle of the night. If you do some kind of throttling, maybe they'd rather buy a CD containing all your stuff than wait
  • I use a great product called Bandwidth Arbitrator, as I'm sure a lot of you are familiar with it. I put it in my DMZ, with a farm of machines behind it, and it throttles whatever traffic I want based on port, traffic type, IP address or MAC address. http://www.bandwidtharbitrator.com/ Check it out; I think you'll like what it offers. Rob
  • Have your 'friend' require subscriptions to his porn^H^H^H^H 'large archive of files'.

    Problem solved! ;)

  • If he does mind people downloading his entire site, why not box the whole lot up and offer it as a bittorrent file? Of course then he has the problem that he may have to run the torrent when no-one else is seeding it, but that's an easy way to limit bandwidth uses.

    To stop wget, edit your robots.txt and forbid it. Hopefully people will obey...
  • Consider running thttpd instead of Apache for the static downloadable content. It supports this type of throttling: http://www.acme.com/software/thttpd
  • I run a message board, and at one time I had a ton of programs and videos and flash files that people were doing the same thing to. So what I did was write a PHP/MySQL program that makes you authenticate and then creates a link that you can click (or it will automatically download after a set amount of time), and it's only good for one click. No wget programs can get these files. If you want to try it out, go to http://downloads.tusclan.com and try it out. For instance, all of the superbowl commerical
  • Throw in a couple of PHP scripts that generate millions of links to nonexistent pages.

    No "um's" please.
