Throttle Apache Bandwidth Based on IP Address?
BigBlockMopar asks: "A friend of mine runs a web site which offers a very large archive of files. He wishes to continue to offer free and unrestricted access to his archive, but his bandwidth consumption has been through the roof because of people using wget (and similar) to download his entire site. Current traffic is around 200 gigabytes per month with over 50% of that being clients who are downloading every document on the site. The server space is donated by a hosting provider who is understandably starting to become impatient with the traffic. I've checked out mod_throttle and mod_bandwidth, neither appears to do exactly what is desired. Does anyone have any suggestions?"
"Eventually, he plans to set up mirrors, but he'd like to get the greedy users under control first. Alternatives are adding a (free) log-in authentication system, or a text-in-image system like Network Solutions uses to weed out automated whois queries. But I think the best solution is to allow a given client IP address full-speed downloading for the first $WHATEVER megabytes, and then automatically reduce the speed of the transfer to that IP address. This would probably deter most leeches but continue to allow legitimate users to transfer more than an arbitrary limit."
BitTorrent (Score:3, Interesting)
Re:BitTorrent (Score:1)
Aside from just the format (a
This even allows the recursive get people to work. They use wget to mirror all the html, which downloads torrents in the process. Then they spin off a one-liner to find all
Re:BitTorrent (Score:2)
As for a torrent for every file: please, he'd waste more traffic on the tracker. It would have to be one large file; otherwise there wouldn't be enough concurrent downloads to justify the overhead of the tracker for each individual file.
Re:BitTorrent (Score:1)
For that matter, Stuffit Expander supports all of the above. But this isn't worth arguing about -- I'd probably never distribute a zip, but I can see why people do and I don't have a huge problem with that, especially since the format is standardized and works with opensource tools.
As for the one gigantic torrent, how big a site are we talking a
Re:BitTorrent (Score:2)
Re:BitTorrent (Score:1)
WANTED (Score:2)
I've been searching for this for over a year now, and so far have come up with nothing. I'm afraid it's not around, but if it is, let me know
Re:WANTED (Score:2, Informative)
See my other comment about mod_curb [slashdot.org] which comes close to doing the right thing.
You could hack it, or find somebody else to do so for you.
Bad idea: Apache uses one connection per process (Score:4, Interesting)
Either just drop connections, or use a single-process proxy with the required throttling ability, which then forwards the requests to Apache.
But restricting by IP can be dangerous if users are sitting behind a proxy from the ISP (very common, at least in Germany).
Javascript? (Score:2, Interesting)
Or set up a 'captcha' for each download, so that a human has to confirm each file one at a time.
Re:Javascript? (Score:1)
Re:Javascript? (Score:1)
It's possible for clueful scripters to get around this using WWW::Mechanize,
as I had to do in my script that retrieves database usage stats for the
commercial databases we subscribe to at work. Ebsco's usage stats thingydoo
uses script links that call functions in an external, linked JavaScript file. These
functions then change certain form elements and submit the form. So I just
told my WWW::Mechanize browser object to change the form elements in question
and sub
Another solution might be referer checking (Score:5, Insightful)
One exception to this is Pavuk, which does referer spoofing pretty well. This program is about four times harder to use than wget, and isn't very popular (you don't see it included in distros too often).
Of course this won't completely fix your problem, but it will probably stop about 90% of the people doing it now. It's an easy fix that you can implement quickly until you get something to throttle bandwidth properly.
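For the curious, a minimal sketch of what such a referer check might look like in Apache, assuming mod_rewrite is loaded; the hostname and file extensions are placeholders, not anything from the original site:

```apache
# Hypothetical mod_rewrite referer check (example.com is a placeholder).
# Requests for archive files whose Referer is missing or from another
# host get a 403, which breaks plain recursive wget.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(zip|tar\.gz|pdf)$ - [F]
```

Note this also turns away legitimate users whose browsers or proxies strip the Referer header, which is one reason it only works as a stopgap.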
-ft
Re:Another solution might be referer checking (Score:2, Informative)
Which version of wget are you using? The referer option works fine for me: for one website, when I don't use it, I get redirected to the main page; with the referer option I can download the file. Although something that sets the referer automatically would be best.
Re:Another solution might be referer checking (Score:1)
http://www.idata.sk/~ondrej/pavuk/pavuk.1.html
That is the man page to Pavuk. If you search for referer on that page, it will tell you how to use the -auto_referer option.
-Fran
Re:Another solution might be referer checking (Score:1)
That will not work in all situations. A lot of servers just check that you come from one of their own pages. A correct implementation would only allow clients whose referer is a page that actually links to the file, which is why I suspect no one has added it to curl or wget.
Re:Another solution might be referer checking (Score:1)
This won't faze WWW::Mechanize one iota. When it follows a link, it sends
the referer just like any other browser, unless you tell it otherwise.
You really are going to have to throttle based on client IP. They can't fake
that if they want to get the information back, so you can positively identify,
"this is the same host that has already requested n pages in the last thirty
seconds".
If you throttle overall, without throttling base
Even if it works, it might not. (Score:5, Interesting)
The problem is that if these are bots grabbing your whole site, slowing them down to 10K/sec won't actually reduce the amount of traffic they pull from you. They may take all day to get the pages, but the bytes will still move.
Some options that you have:
* If the user really doesn't need the data, block their address entirely.
* Consider blocking the bot's "client" (User-Agent) signature. You can do this in Apache. "Respected" bots don't lie about who they are; if a bot does lie, then it is a DOS attack in disguise.
* Contact the users, if you can.
* If you want the user to get a mirror, set up something that actually does the mirror effectively. I would recommend running rsync.
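The User-Agent blocking idea above might look something like this in Apache; the directory path is an assumption, and the agent strings are just the obvious offenders:

```apache
# Hypothetical fragment: refuse known bulk-download User-Agents.
# (Honest bots identify themselves; liars are a different problem.)
SetEnvIfNoCase User-Agent "wget"    bulk_client
SetEnvIfNoCase User-Agent "HTTrack" bulk_client
<Location "/archive">
    Order Allow,Deny
    Allow from all
    Deny from env=bulk_client
</Location>
```

For the rsync suggestion, mirror operators could then pull with something like `rsync -avz rsync://example.com/archive/ ./archive/`, assuming an rsync daemon exporting a module named `archive` (again, placeholder names).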
Re:Even if it works, it might not. (Score:2, Informative)
This link [robotstxt.org] should help you out.
Re:Even if it works, it might not. (Score:2)
Re:Even if it works, it might not. (Score:1)
Unless you have servers which look at the client and then exhort you to get the "supported" browser (read IE).
Wrong solution (Score:3, Informative)
What you need is an anti-leech mod to limit the amount of data that can be downloaded from a specific IP. I know they're out there. Just do a bit of googling.
Re:Wrong solution (Score:2)
So what you say is true and a good thing. People don't sit and wait so they won't mind if the transfer takes longer. Meanwhile the provider has more bandwidth available for other customers at any given time.
QOS (Score:2)
If the hosting service had a problem with excessive bandwidth usage, I don't know why they didn't throttle port 80 at the routers.
Re:QOS (Score:2)
Besides, it sounds like they have enough bandwidth, and if they have enough bandwidth to actually serve the stuff, then slowing the downloads down won't really decrease the number of bytes transferred at all. So what he seems to really want is to limit the number of visitors who wget the whole site.
It looks like he would like to make wgetting the site impossible, but this is actually quite hard, because wget is built for exactly that, and even if he makes some nice c
Don't post the URL at Slashdot (Score:2)
BT (Score:4, Insightful)
Yes, use the frelling BitTorrent, that's exactly what it was written for!
Add to this some way of limiting bandwidth per connection (so people are mainly downloading from other BT clients, not from you) and you have a perfect distribution mechanism.
Leave the possibility to download via HTTP, but limit it with QoS or some other way to a tiny little stream, and advertise all over the site that people can achieve unlimited download speeds using BT.
Publishing documents only to limit in every possible way access to them (like all the game files servers do) is unwise, to say the least. Especially if you don't have to.
Robert
Class Based Queueing (Score:2)
Find a way to parse the log and find out (in as close to real time as possible) who is slurping the whole site.
Then restrict that IP's bandwidth to the machine entirely, not just to Apache.
The Linux kernel can handle these things very easily. Why bother with Apache when the Linux machine itself can do it?
This, of course, assumes you are using Linux.
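The log-parsing step above could be as simple as an awk one-liner. A rough sketch, assuming the common log format (client IP is the first field, response size the last); a tiny inline sample stands in for the real access log here:

```shell
# Build a small sample log in place of /var/log/apache/access_log
cat > sample_access_log <<'EOF'
203.0.113.5 - - [01/Jan/2004:00:00:01 +0000] "GET /big.iso HTTP/1.0" 200 209715200
198.51.100.7 - - [01/Jan/2004:00:00:02 +0000] "GET /index.html HTTP/1.0" 200 1024
EOF

# Tally bytes per client IP and print anyone over 100 MB; feed the
# output to your firewall or traffic-control tooling of choice.
awk '{ bytes[$1] += $NF }
     END { for (ip in bytes) if (bytes[ip] > 100 * 1024 * 1024) print ip }' sample_access_log
```

Run from cron every few minutes, this gets you "close to real time" without touching Apache at all.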
Seriously.. use Javascript! (Score:3, Interesting)
Have a page "download.php?filename=foo.txt" that all your links point to, and have that page return <meta http-equiv="Refresh" content="1;URL=files/$filename">
(pseudocode; my php scripting is not great, but you get the idea..)
This totally breaks wget, although it's not too hard to script around. You'll cut spider traffic back by probably 95% (all the casual "grab everything we can" downloaders), but people who really want to get all your files will still figure out how.
Or if you totally want to stop automated downloads, put each file behind a 'captcha'.
Re:Seriously.. use Javascript! (Score:3)
I would use PHP and log users and their IPs, and only allow so many files per user. You do want return traffic, right?
And as others have said, use bittorrent.
Re:Seriously.. use Javascript! (Score:2)
Yeah, in this situation, BT would probably shine.
Re:Seriously.. use Javascript! (Score:1)
Make a file with this code in it: call it downloader.php
then the url to download would be
http://site.com/downloader.php?file=http://www.
Re:Seriously.. use Javascript! (Score:1)
> number of people downloading everything, except that instead of doing it
> quickly they'll leave wget running all week and tying up your server's
> resources.
Rather than slowing the transfer, I'd send a 503 response to clients that
have pulled too much in the last hour/day/week/whatever. If you have to
send the same client too many 503s in too short a time, switch to 403.
If you're concerned about really persistent people c
Re:Seriously.. use Javascript! (Score:2)
This sucks. The biggest problem: I right click on the link in my browser and select "save to disk". I'm not spidering the site, I just want to save one lousy document. However, I don't get the document, I get the stupid referrer page. This occasionally bites me when I download software from SourceForge and the link suggests that it's a dir
mod_perl is your friend (Score:3, Informative)
Not quite what you want .. (Score:3, Informative)
I wrote an apache module which I call mod_curb [steve.org.uk] (for Apache 1.3)
This doesn't do exactly what you want, but I'm sure if you were to ask me or somebody else we could code something for you.
The basic idea I have for your problem is to have a database of currently active clients, be it MySQL or flat files, so you can keep track of all data transferred by each address.
Once a threshold has been reached you can either stop everything, or start throttling.
However, throttling alone won't help you out; they'll still mirror you, just slowly.
Use this to disallow wget (Score:1)
Of course, people can pass the -U flag to wget and get around this, but it'd work while you get a real solution in place.
-B
How would this be a solution? (Score:3, Insightful)
If you're just trying to discourage people from downloading so much from you, you need to set up mirrors, bittorrents, or some other protection of your site. Maybe you could reduce the size of your site? Is it an archive of pictures? If so, maybe your friend could reduce the size of them? I mean, if he's offering 100,000 pictures that are 100k in size each, then if he reduces the size/quality so they're only 30k each, then you'll really reduce your bandwidth.
If it's all text, maybe you could use some kind of compression. If it's video, maybe use a lower bitrate. You get the idea.
But just limiting transfer rates by IP is probably not gonna help.
One thing you may try... (Score:1)
First, start by creating a chain that all incoming SYN packets to port 80 should jump to.
Next, use some sort of PHP script that has sudo permissions to add an IP address (DAMN WELL make sure you know how to properly validate those addresses).
Set the default policy for this new chain to REJECT or DROP (your call).
On each new session incoming to the server, call a scr
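A hedged sketch of the firewall side of this; the chain name and address are placeholders, it requires root, and note that user-defined iptables chains can't have a policy as such, so a final REJECT/DROP rule plays that role if you want default-deny:

```shell
# Create a chain and route new port-80 connections through it
iptables -N WEB_GUARD
iptables -A INPUT -p tcp --dport 80 --syn -j WEB_GUARD
iptables -A WEB_GUARD -j RETURN          # default: fall through and accept

# What the (hypothetical) sudo-enabled script would run to cut someone off:
iptables -I WEB_GUARD -s 203.0.113.5 -j DROP
```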
Re:One thing you may try... (Score:1)
This is of course the direct impact of the lack of IPs and the intelligence of the ISPs' admins. Let's hope IPv6 comes soon.
Instead of slowing down, try stopping it entirely (Score:3, Insightful)
Instead, when they pass the bandwidth limit (or more likely a number-of-requests limit) you should deliver a dead-end page from which there are no links to go anywhere else. Then when they wget it you will get a lot of these dead-end pages instead of the data they want. If a normal user hits it, it can tell them to wait a few minutes and then reload the page.
If the owner of the material does not mind, it does sound like a bittorrent download would help a lot too. Have the dead-end page give instructions on how to retrieve the bittorrent.
Anyway these are just my ideas, I really have zero experience in web sites so feel free to dismiss them as stupid.
link (Score:1)
auto-block bulk downloads (Score:5, Insightful)
Re:auto-block bulk downloads (Score:2)
Elegant. That takes advantage of the difference between a human and wget. And even better, the human can understand his AUP; wget can't, and keeps banging its head. Elegant.
Re:auto-block bulk downloads (Score:1)
between the <a> and </a> tags, so the user doesn't
see anything. Heck, check out my website and
see for yourself - it's on every page.
Plan for them (Score:3, Insightful)
Offer it pre-zipped. This would reduce the bandwidth and download time. A plus for everyone.
Make it easy for people who do this to obtain updates/additions by date.
As part of accessing the zipped version, ask people to mirror it. If they're going to carry it all, offer it all. Arrange dynamic mirror updating with those willing.
Find one or more secondary storage site for the archive. Ask people to use these (put them highest on the list).
If people persist on sucking down the whole thing and don't go for the archive, arrange a throttle with the sysadmin, and advertise it. Let people know that if they try to wget everything, things will start going real slow for them.
Set up a small version without the files, in parallel to the real one, with a note saying "files temporarily unavailable". Allow the system owner to switch to the small version during times of high traffic so as not to bog down his other users, or alternatively, switch it yourself according to the owner's estimates of his traffic and times.
A "friend" of yours? (Score:2)
That friend wouldn't be SCO , by any chance?
Apache::SpeedLimit and mod_limitipconn (Score:2)
SpeedLimit works by limiting the request rate of each IP address. If your web site consists of many small files (which sounds like the case), then curbing the request rate is enough to cure the most abusive bots. A determined adversary can still circumvent request rate limits with wget --random-wait, but it will be more frustrating for them, and a large percentage of clients can be exp
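For the mod_limitipconn half, a minimal sketch; the `/archive` path is an assumption, and note this caps simultaneous connections per address, not total bytes:

```apache
# Hypothetical fragment for mod_limitipconn
ExtendedStatus On                 # required by the module
<IfModule mod_limitipconn.c>
    <Location "/archive">
        MaxConnPerIP 2            # at most two concurrent connections per client IP
    </Location>
</IfModule>
```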
How about forms and the POST method? (Score:2)
I still got what I wanted but I'm sure the vast majority of people wouldn't do it. Most wouldn't know how to and some wouldn't feel like going through the trouble.
The Javascript method mentioned earlier could work in a simi
Re:How about forms and the POST method? (Score:1)
Checking the REFERER in the same process would add a second layer of security
KISS (Score:1, Informative)
Squid's config is very easy for bandwidth throttling by IP.
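Squid's delay pools are the relevant knob here. A hedged squid.conf sketch, where the ACL name and rates are assumptions; class 2 gives each client IP its own bucket:

```conf
# Hypothetical squid.conf fragment: per-client-IP throttling via delay pools
acl archive_clients src 0.0.0.0/0.0.0.0    # match everyone; tighten as needed
delay_pools 1
delay_class 1 2                             # class 2: aggregate + per-host buckets
delay_access 1 allow archive_clients
# -1/-1 = no aggregate cap; each host gets ~16 KB/s after a 64 KB burst
delay_parameters 1 -1/-1 16000/65536
```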
If I was the leech (Score:2)
I already have a module for python to do this that took about half an hour to write.
HTTP is an open protocol, there is no true way to filter one set of users from another.
You could always use passworded accounts and use micropayments for bandwidth.
that's the way to do your usenet porn archive
Totally Possible (Score:5, Informative)
Don't you hate it when everyone tells you something is impossible? It would be much more useful if they didn't, so that the people who actually post solutions were easier to find.
This is absolutely possible and not that hard. It is just that most people don't take the time to learn how. The poster who mentioned Quality of Service (QOS) was correct. You will certainly want to read about traffic control and queueing disciplines.
Under Linux, use the traffic control (tc) command to configure bandwidth limits by adding or chaining queueing disciplines to your network interface. tc may not come pre-installed with your distribution, so you might have to find it.
At the end of this post is a script I wrote to limit bandwidth from my website, which limits anything going out of port 8000 to 2 Mbps, but can "borrow" up to 2 Mbps more when bandwidth is available (almost always on a 100 Mbps connection).
Since you can accidentally limit yourself to nearly nothing, you'll want a quick way to disable traffic control. The line below removes the "root" queueing discipline from the network interface, which removes all the queueing disciplines chained from it.
tc qdisc del dev eth0 root
By modifying the u32 queueing discipline parameters, you can quite easily limit based upon IP addresses/networks.
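For example, an extra filter like this (the address is a placeholder) would steer traffic bound for one abusive client into the limited 1:10 class defined in the script below:

```shell
# Hedged example: classify traffic *to* one client into the slow class.
# Assumes the HTB classes from the script below already exist on eth0.
tc filter add \
    dev eth0 \
    protocol ip \
    parent 1: \
    prio 2 \
    u32 match ip dst 203.0.113.5/32 \
    flowid 1:10
```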
This should get you started, but you really should read the traffic control documentation and understand how to configure this stuff. Don't just think you can tweak a few parameters in the script and get what you want. I'm not ashamed to admit that it took me a few hours to get a beginning grasp on it.
OK, here is the script...
# Add HTB queuing discipline to root of eth0 with handle 1:0
# unclassified traffic goes to class 1:99
tc qdisc add \
dev eth0 \
root \
handle 1: \
htb \
default 99
# Add a single class that will limit all bandwidth on this interface
# This is done so that we can borrow between the classes below
tc class add \
dev eth0 \
parent 1: \
classid 1:1 \
htb \
rate 100mbit
# Class 1:10 is limited to 2mbit/s but can borrow up to 2mbit/s more from 1:99
# in practice the other 2mbit/s should almost always be available
tc class add \
dev eth0 \
parent 1:1 \
classid 1:10 \
htb \
rate 2mbit \
ceil 4mbit
# Class 1:99 is limited to 90mbit/s and can not borrow any more
tc class add \
dev eth0 \
parent 1:1 \
classid 1:99 \
htb \
rate 90mbit \
ceil 90mbit
# Use SFQ to load balance the connections within class 1:10
tc qdisc add \
dev eth0 \
parent 1:10 \
handle 10: \
sfq
# Use SFQ to load balance the connections within class 1:99
tc qdisc add \
dev eth0 \
parent 1:99 \
handle 99: \
sfq
# This filter selects all traffic from port 8000 as belonging to class 1:10
tc filter add \
dev eth0 \
protocol ip \
parent 1: \
prio 1 \
u32 match ip sport 8000 0xffff \
flowid 1:10
Ask the leeches (Score:2)
Perhaps you could have a special section of the site just for leeches. Explain your problems: you have too many people downloading everything.
Ask for solutions. Maybe they'd all be happy with BitTorrent. Maybe they could help set up mirrors. Maybe they'd voluntarily restrict their leeching to lower-traffic times like the middle of the night. If you do some kind of throttling, maybe they'd rather buy a CD containing all your stuff than wait
Re:Ask the leeches (Score:3, Interesting)
Move it over to FTP, and allow only X number of simultaneous logins.
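If the FTP daemon were, say, vsftpd (an assumption; wu-ftpd and proftpd have equivalent knobs), those caps are two config lines:

```conf
# Hypothetical vsftpd.conf fragment
max_clients=50    # total simultaneous sessions
max_per_ip=2      # simultaneous sessions per client address
```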
Linux Arbitrator works well (Score:1)
Solution.. (Score:2)
Problem solved! ;)
Offer a torrent? (Score:2)
To stop wget, edit your robots.txt and forbid it. Hopefully people will obey...
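For what it's worth, wget's recursive mode does honor robots.txt by default, so a minimal entry like this would turn away the polite ones ("Wget" is the token wget sends in its User-Agent string):

```conf
# robots.txt -- politely refuse recursive wget (cooperation not guaranteed)
User-agent: Wget
Disallow: /
```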
thttpd (Score:1)
Stopping people from downloading all your files. (Score:1)
Booby trap your site. (Score:1)
No "um's" please.