Who Isn't Paying Attention to ROBOTS.TXT?
Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this, or of which servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
zerg (Score:4, Interesting)
Re:zerg (Score:4, Informative)
Re:zerg (Score:4, Informative)
Got Zerg Source? (Score:3, Informative)
WPoison is actually better from a technical standpoint, as it generates a random page each time, not just a fixed block of pages you download.
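For the curious, the guts of a WPoison-style generator fit on a page; here is a rough sketch in Python (the word list and URL scheme are invented for illustration):

#!/usr/bin/env python3
# Minimal sketch of a WPoison-style poison page: every request gets a
# fresh page of random text and random links back into the same trap.
# Purely illustrative; word list and URL scheme are invented.
import random
import string

WORDS = ["widget", "gadget", "sprocket", "flange", "grommet", "doohickey"]

def random_word():
    return random.choice(WORDS)

def random_link():
    # Random path so the crawler never sees the same URL twice.
    token = "".join(random.choices(string.ascii_lowercase, k=8))
    return f'<a href="/trap/{token}.html">{random_word()}</a>'

def poison_page(n_links=20, n_words=200):
    body = " ".join(random_word() for _ in range(n_words))
    links = " ".join(random_link() for _ in range(n_links))
    return f"<html><body><p>{body}</p><p>{links}</p></body></html>"

if __name__ == "__main__":
    print("Content-Type: text/html\n")
    print(poison_page())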
Re:Got Zerg Source? (Score:2, Interesting)
Re:Got Zerg Source? (Score:3, Interesting)
If you're *really* serious about non-detection, then you should vary the amount of poison in the pages, so that some will be merely annoying or almost innocent, with links to others that are completely lethal.
If I were a Perl hacker (instead of merely playing a sysadmin at work), I'd write this idea out, so if anyone
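In case anyone wants a starting point, a rough sketch of the idea in Python (the tier names and weights are invented):

#!/usr/bin/env python3
# Sketch of the parent's idea: vary how poisonous each generated page is,
# so a crawler operator sampling a few pages sees mostly harmless content.
# Tiers and weights are invented for illustration.
import random

TIERS = {
    "innocent": {"trap_links": 1,   "weight": 6},  # almost clean
    "annoying": {"trap_links": 10,  "weight": 3},
    "lethal":   {"trap_links": 200, "weight": 1},  # link bomb
}

def pick_tier():
    names = list(TIERS)
    weights = [TIERS[n]["weight"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

def render(tier):
    n = TIERS[tier]["trap_links"]
    links = "\n".join(f'<a href="/trap/{random.getrandbits(32):08x}">more</a>'
                      for _ in range(n))
    return f"<html><body>\n{links}\n</body></html>"

if __name__ == "__main__":
    print(render(pick_tier()))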
Re:Got Zerg Source? (Score:2)
Re:Got Zerg Source? (Score:1)
Re:zerg (Score:5, Interesting)
Start returning 500 errors... Or 302s that redirect them back to themselves...
Eric

PS: Is there some kind of bot storm going on? I'm getting all kinds of weird accesses to my site today; they're all fetching just the home page and leaving, and the referrer tag is null for every one... They may be committing click fraud through my site, which makes me mad...
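For what it's worth, the parent's redirect trick as a tiny WSGI sketch (the bad-agent list is a placeholder):

#!/usr/bin/env python3
# Sketch of the suggestion above as a WSGI app: known-bad clients get a
# 302 pointing right back at the URL they just requested (a naive bot
# will chase its own tail). The BAD_AGENTS list is a placeholder.
BAD_AGENTS = ("EvilSpider", "ScraperBot")

def app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "")
    if any(bad in agent for bad in BAD_AGENTS):
        # Send the client straight back to the URL it just asked for.
        url = ("http://" + environ.get("HTTP_HOST", "localhost")
               + environ.get("PATH_INFO", "/"))
        start_response("302 Found", [("Location", url)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, polite visitor.\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, app).serve_forever()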
Re:zerg (Score:2)
Re:zerg (Score:2)
Re:zerg (Score:3, Informative)
See this discussion on SecurityFocus
http://www.securityfocus.com/archive/75/401729/30/0/threaded [securityfocus.com]
Making them Pay (Score:4, Interesting)
Re:zerg (Score:3, Informative)
robots.txt:
It's trivial to write a script that will link back to itself to make millions of bogus pages. If you include address rewriting, it won't even appear to be a script.
The only downside is that while you are wasting their CPU and bandwidth, you are also wasting your own resources. If your CPU is mostly idle, then it's mostly a waste of bandwidth.
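Something like this sketch, perhaps (Python; the path scheme and fan-out are arbitrary choices):

#!/usr/bin/env python3
# Sketch of the self-linking trap described above: whatever "page" is
# requested, respond with links to a few more pages under the same
# handler, so the tree grows without bound. With URL rewriting (e.g. a
# RewriteRule sending /archive/* here) the links look like static files.
# Deriving child names from the current path keeps pages stable, so
# repeat visits see identical content, like a real site would.
import hashlib

def children(path, fanout=5):
    return [hashlib.md5(f"{path}/{i}".encode()).hexdigest()[:10]
            for i in range(fanout)]

def page(path):
    links = "\n".join(f'<a href="/archive/{c}.html">{c}</a>'
                      for c in children(path))
    return f"<html><body>\n{links}\n</body></html>"

if __name__ == "__main__":
    print(page("/archive/index.html"))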
Re:zerg (Score:2)
1. Study your firewall logs, and try to determine some baseline criteria that identifies a spider. While you're at it, note the domains the spiders are coming from.
2. Create a small Perl script that, when fed the IP address of a questionable domain, automatically does an add (creating a new "DROP" rule for that domain) and tacks the command onto your existing firewall script. This is your manual tool. Of course, you should debug it using your spider list from step 1, above.
3. On
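Step 2's helper really is just a few lines; a rough sketch in Python rather than Perl (assumes Linux iptables, root privileges, and a placeholder script path):

#!/usr/bin/env python3
# Sketch of step 2: given an IP address, add a DROP rule now and append
# the same command to the persistent firewall script. Assumes Linux
# iptables and root; the script path below is an invented location.
import subprocess
import sys

FIREWALL_SCRIPT = "/etc/firewall.sh"  # placeholder path

def block(ip):
    cmd = ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]
    subprocess.run(cmd, check=True)        # apply immediately
    with open(FIREWALL_SCRIPT, "a") as f:  # survive a reboot
        f.write(" ".join(cmd) + "\n")

if __name__ == "__main__":
    block(sys.argv[1])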
Re:zerg (Score:2, Interesting)
Tarpit [wikipedia.org] them! Bonus points if you feed them bogus data at the same time.
Tarpitting unwelcome spiders not only limits the damage (in terms of bandwidth) they can do to you, but also the damage they can do to everyone else.
Software for this is available, for example Peachpit [devin.com].
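The core of a tarpit is tiny; a toy single-connection sketch in Python (the port and delay are arbitrary):

#!/usr/bin/env python3
# Toy tarpit: accept the connection, then dribble out one byte of a
# bogus HTTP response every few seconds, tying up the client's socket.
# Single-threaded for brevity; a real tarpit handles many sockets at once.
import socket
import time

RESPONSE = (b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>"
            + b"x" * 10000)

def tarpit(port=8081, delay=5):
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        try:
            for i in range(len(RESPONSE)):
                conn.sendall(RESPONSE[i:i + 1])
                time.sleep(delay)  # one byte every `delay` seconds
        except OSError:
            pass                   # client gave up
        finally:
            conn.close()

if __name__ == "__main__":
    tarpit()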
Oh, they'll regret it if you try this. (Score:2)
Alternately, use it to your advantage. Have a page of text that is nothing other than porn-related words, and have Apache return that when the bot comes looking. You're guaranteed to get a lot more
Spammers are bad (of course) (Score:4, Insightful)
Most crawlers will obey. Spambot email harvesters usually will not. Generate a huge page of crap with loads of fake email addresses, list it in your robots.txt as uncrawlable, and watch the spammers grab it.
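A sketch of such a generator in Python (names and domains invented; the matching robots.txt entry would be something like "Disallow: /honeypot/"):

#!/usr/bin/env python3
# Sketch of the honeypot above: generate a page full of plausible but
# fake addresses, publish it somewhere disallowed by robots.txt, and let
# the harvesters pollute their own lists. Names and domains are invented.
import random

NAMES = ["alice", "bob", "carol", "dave", "erin", "frank"]
DOMAINS = ["example.com", "example.net", "example.org"]

def fake_address():
    return (f"{random.choice(NAMES)}{random.randint(1, 9999)}"
            f"@{random.choice(DOMAINS)}")

def honeypot_page(count=500):
    rows = "<br>\n".join(f'<a href="mailto:{a}">{a}</a>'
                         for a in (fake_address() for _ in range(count)))
    return f"<html><body>\n{rows}\n</body></html>"

if __name__ == "__main__":
    print(honeypot_page())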
Re:Spammers are bad (of course) (Score:1)
You'd think with all the sites clamouring to be in the search engine results, they'd honor requests to be out.
Re:Spammers are bad (of course) (Score:2)
--
Evan
Re:Spammers are bad (of course) (Score:2, Funny)
Re:Spammers are bad (of course) (Score:3, Funny)
Re:Spammers are bad (of course) (Score:2)
Re:Spammers are bad (of course) (Score:2)
anyone save the contents?
Re:Spammers are bad (of course) (Score:1)
Re:Spammers are bad (of course) (Score:1)
Re:Spammers are bad (of course) (Score:2)
Which law makes it illegal for a search engine to ignore robots.txt files? And would you even have a valid claim against them for non-compliance with a web standard?
Re:ok, let's get this out of the way... (Score:1)
Hey I've got an idea (Score:1)
Seriously. If you don't want it to get crawled, don't make it accessible from the outside. If you can't figure out how to do that, you get what you deserve.
I've got a better idea (Score:3, Insightful)
Re:Hey I've got an idea (Score:2, Insightful)
I could just delete it all. But I'm trying to avoid deleting any posts.
Re:Hey I've got an idea (Score:2)
Re:Hey I've got an idea (Score:4, Insightful)
Re:Hey I've got an idea (Score:1)
GET should only do just that, and a user agent should be allowed to reload a page that is the result of a GET request without fear of side effects.
You'll have trouble from more than bots if you've got an app written like that: you'll have users hitting the back and forward buttons on their browsers, causing multiple entries.
[OT]
Slow Down Cowboy!
Slashdot requires you to wait 2 minutes between each successful posting of a commen
whitehouse.gov/robots.txt (Score:5, Interesting)
The White House seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by Google), they seem to cover all the bases in their 92KB robots.txt file [whitehouse.gov].
My personal favorites:
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq [sic.]
There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.
Re:whitehouse.gov/robots.txt (Score:1)
Re:whitehouse.gov/robots.txt (Score:2)
Disallow:
Here is your problem: (Score:5, Funny)
Re:Here is your problem: (Score:2)
I was going to say this (but in a better way, of course!) but suspected that somebody else might beat me to it, and indeed they did ...
However, there is a bit more to it. If he has a web server on his Windows or Mac OS X box, the odds are that the filesystem in use is case-insensitive, so either robots.txt or ROBOTS.TXT will work, because either would be served up by the web server when one reques
Re:Here is your problem: (Score:2, Funny)
Re:Here is your problem: (Score:2)
Ideal solution... (Score:2)
1. Create robots.txt, including references to the spam spider trap. Make sure that the legitimate references to normal pages are outnumbered by a large margin.
2. When pages that could only be referenced in the spam spider trap are accessed, note the IP address.
3. Respond slowly to, or outright block, connections from the originating IP address.
Bad guys are punished. Good guys are not. Low impact on system resources.
There's got to be a dozen filters out there that already do this.
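For anyone rolling their own anyway, steps 2 and 3 boil down to something like this sketch (the trap path and log location are placeholders):

#!/usr/bin/env python3
# Sketch of steps 2-3: scan an Apache combined-format access log for
# hits on the trap path (which only robots.txt ever mentions) and emit
# block rules for the offending IPs. Trap path and log location are
# placeholders.
import re

TRAP = "/spider-trap/"
LOG = "/var/log/apache2/access.log"  # placeholder path

def offenders(log_path):
    ips = set()
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)')
    with open(log_path) as f:
        for line in f:
            m = line_re.match(line)
            if m and m.group(2).startswith(TRAP):
                ips.add(m.group(1))
    return ips

if __name__ == "__main__":
    for ip in sorted(offenders(LOG)):
        print(f"iptables -I INPUT -s {ip} -j DROP")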
Re:Ideal solution... (Score:2)
Re:Ideal solution... (Score:2)
No, that's the point. When the spider ignores robots.txt, they pick up the poisoned pages. Then, because they are doing something wrong, punish them.
Re:Ideal solution... (Score:2)
If the poisoned pages are only findable from robots.txt, then if they ignore robots.txt, they won't be punished.
If they're findable via links, or whatnot, then you're punishing more than the robots (that means your users). We typically frown on people (**AA) that do that sort of thing, yes?
-9mm-
Re:Ideal solution... (Score:2)
Not at all. The AC has it right:
You make a robots.txt with directories/pages they should not enter. You include links in pages that only a crawler would normally see. Only a crawler that ignored robots.txt (or the exclusion in it) would follow those trap pages and get banned.
To make this very clear: the links on the legitimate pages are not normally visible, or say things like "." or "," or "This is a trap for sp@mmer$"... whatever. Color the text white
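Concretely, such a hidden link might be injected like this sketch (the styling and trap path are invented):

#!/usr/bin/env python3
# Sketch of the hidden trap link described above: invisible (or nearly
# so) to humans, but followed by a crawler that ignores robots.txt.
# Styling choices are just examples; white-on-white is one option.
TRAP_LINK = (
    '<a href="/spider-trap/page1.html" '
    'style="color: white; text-decoration: none;">.</a>'
)

def with_trap(page_html):
    # Tack the invisible link onto the end of the page body.
    return page_html.replace("</body>", TRAP_LINK + "</body>")

if __name__ == "__main__":
    print(with_trap("<html><body>Normal content.</body></html>"))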
Re:Ideal solution... (Score:2)
Re:Ideal solution... (Score:2)
If someone is that inattentive, they get banned.
If you want implementation details -- and it looks like you indeed do -- I'll be glad to provide them to you for a fee. Are you that curious, or can you figur
Re:Ideal solution... (Score:3, Interesting)
If you do go there, it initiates a countdown.... I've never stuck around long enough to see what happens when the countdown finishes... I like my internet connection just a little too much for that...
Re:Ideal solution... (Score:2)
Judging by the fact that I can still post this after trying it, I'd say little or nothing.
Re:Ideal solution... (Score:2)
Re:Ideal solution... (Score:2)
Tried it, nothing happened, could still access the site. Since the page is from 1996 or so, it might be sending a ping of death to your IP. Back then you could crash a Windows computer by sending it a nonstandard ping.
Big name != "real" (Score:5, Informative)
I see that there appear to be real, legitimate search engines that do not follow robots.txt rules.
No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt [subsume.com] file. I'm personally most offended by idiots running this shit [grub.org], since there is no single IP block to blacklist.
Re:Big name != "real" (Score:2)
Re:Big name != "real" (Score:2)
Heh. I think that your attitude is part of the problem here. If you try to piss people off, they will try to piss you off too.
I don't see your logic. By default, everyone gets to see the web site. In order to be singled out and disallowed, they must make the first effort to piss me off. If they go further and ignore robots.txt, I go further and ban by IP. At no time am I responsible for any escalation. If you think my "colorful" language is disturbing, I would say it is better for them to be able
Re:Big name != "real" (Score:2)
# Another bot that ignores * disallows, even though they claim they follow the protocol.
# And what the hell is with Yahoo-VerticalCrawler-FormerWebCrawler in the agent? Pick a name!
# This may be the same bot that was listed as FAST above, but it gets a special list.
# Dirty, dirty bot. I kind of hope this is ignored so I get to block by IP.
# Update: It is! I do!
User
Blackhole them at the border routers (Score:3, Interesting)
After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.
Nobody would dare to blackhole Google, but there are hundreds of Google wannabes, and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.
the AC
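On most gear a blackhole is just a static route to a null interface; a sketch that turns a prefix list into Cisco-style null routes (the prefixes are RFC 5737 documentation examples, not a real blocklist):

#!/usr/bin/env python3
# Sketch: turn a list of bad-spider prefixes into Cisco-style null
# routes. The prefixes are invented examples, not a real blocklist.
import ipaddress

BAD_PREFIXES = ["192.0.2.0/24", "198.51.100.0/24"]  # RFC 5737 examples

for prefix in BAD_PREFIXES:
    net = ipaddress.ip_network(prefix)
    print(f"ip route {net.network_address} {net.netmask} Null0")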
Simple Solution (Score:1)
I'd just block the bad bots, though; if they don't listen to you, don't give them contact.
Or, contact the owner of the domain and get mad at them for spidering without following proper spider rules. They are wasting *your* resources in exchange for *their* profit; get mad, get even!
How to keep bad robots away (Score:3, Informative)
I found this site through some slashdotter's website a long time back... I've forgotten where and when, but it lends itself nicely to the topic...
Also good is the way arxiv.org fights back [arxiv.org].
Known Bad Bots (Score:4, Informative)
easy solution (Score:1)
Re:Non-solution (Score:1)
On a similar note... (Score:3, Interesting)
66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt HTTP/1.0" 404 653 "-"
66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 651 "-"
66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 648 "-"
66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt HTTP/1.0" 404 652 "-"
I know the article is about bad spiders, but why is Slashdot doing this?
Re:On a similar note... (Score:4, Interesting)
Re:On a similar note... (Score:2)