The Internet / Privacy

Who Isn't Paying Attention to ROBOTS.TXT? 85

Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this, or of which servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
This discussion has been archived. No new comments can be posted.

  • zerg (Score:4, Interesting)

    by Lord Omlette ( 124579 ) on Thursday June 09, 2005 @06:12PM (#12774468) Homepage
    The next question should be, "How do we make them regret their non-compliance?"
    • Re:zerg (Score:4, Informative)

      by Intron ( 870560 ) on Thursday June 09, 2005 @06:16PM (#12774505)
      zerg? [webpoison.org]
      • Re:zerg (Score:4, Informative)

        by BrynM ( 217883 ) * on Thursday June 09, 2005 @08:44PM (#12775808) Homepage Journal
        From the WebPoison site:
        "WebPoison.org is an open source project... (at the bottom of the page) *Technically speaking, webpoison.org is not "open source" because the source code may never be made public- doing so would undermine the project's central goal.
        Sorry, but it rubs me wrong when a project claims to be OSS on the first line of their about page only to tell me they lied in the fine print at the bottom. They may be doing a good thing, but they should be blunt and honest about it.
        • Got Zerg Source? (Score:3, Informative)

          by Kalak ( 260968 )
          WPoison [monkeys.com] is a Perl script, so it comes as source (naturally).

          WPoison is actually better from a technical standpoint, as it generates a random page each time, rather than a fixed block of pages you download.
          • Re:Got Zerg Source? (Score:2, Interesting)

            by paulatz ( 744216 )
            It is better from a technical standpoint, but it could be worse from a practical one, especially if WPoison-generated pages can be automatically detected.
            • Re:Got Zerg Source? (Score:3, Interesting)

              by Kalak ( 260968 )
              I hadn't considered that until this morning, but you can add to the source to do things like randomize meta tags, include text from other pages at random, etc., to make it less likely that a pattern can be detected.

              If you're *really* serious about non-detection, then you should vary the amount of poison in the pages, so that some are merely annoying or almost innocent, while others carry links that are completely lethal.

              If I was a perl hacker (instead of merely playing a sysadmin at work), I'd write this idea out, so if anyone
            • The Wpoison copyright [monkeys.com] requires you to put their logo on your website, which would be kind of a tipoff, right there. If I wrote a spider that did look at robots.txt I might not crawl a site with that logo. Some people just don't like spiders.
    • Re:zerg (Score:5, Interesting)

      by Eric Giguere ( 42863 ) on Thursday June 09, 2005 @06:21PM (#12774558) Homepage Journal

      Start returning 500 errors... Or 302s that redirect them back to themselves...

      Eric
      PS: Is there some kind of bot storm going on? I'm getting all kinds of weird accesses to my site today; they're all fetching just the home page and leaving, and the referrer is null for every one of them... They may be committing click fraud through my site, which makes me mad...
      • If your site uses PHP, you may be able to adapt Bad Behavior [ioerror.us]. The script was originally developed for WordPress and has already been ported to MediaWiki and Geeklog. It identifies known "bad" robots and robots that imitate real browsers based on the HTTP headers, then sends an access denied response.
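
        For comparison, a minimal sketch of the general idea in Perl (not the actual Bad Behavior code; the user-agent patterns here are made-up placeholders): inspect the headers the client sent and refuse service when they match a known-bad pattern.

          #!/usr/bin/perl
          # Hypothetical sketch, not Bad Behavior itself: deny requests whose
          # User-Agent matches a (made-up) blocklist or is missing entirely.
          use strict;
          use warnings;

          my $ua = $ENV{HTTP_USER_AGENT} || '';

          # Illustrative placeholder patterns, not Bad Behavior's real rules.
          my @bad_patterns = (qr/EmailSiphon/i, qr/WebCopier/i, qr/^$/);

          if (grep { $ua =~ $_ } @bad_patterns) {
              print "Status: 403 Forbidden\r\nContent-Type: text/plain\r\n\r\n";
              print "Access denied.\n";
              exit;
          }

          print "Content-Type: text/html\r\n\r\n";
          print "<html><body>Normal page goes here.</body></html>\n";
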
        • Actually, I was more wondering if there was some zombie going around right now. Nobody's really accessing my blog, just the home page. Also, I see a lot of accesses from the CoDeeN project, so I wonder what's up....
      • Re:zerg (Score:3, Informative)

        PS: Is there some kind of bot storm going on? I'm getting all kinds of weird accesses to my site today; they're all fetching just the home page and leaving, and the referrer is null for every one of them... They may be committing click fraud through my site, which makes me mad...

        See this discussion on SecurityFocus

      http://www.securityfocus.com/archive/75/401729/30/0/threaded [securityfocus.com]
    • Making them Pay (Score:4, Interesting)

      by Kelson ( 129150 ) * on Thursday June 09, 2005 @06:30PM (#12774631) Homepage Journal
      How about Stopping Spambots [neilgunton.com]?
    • Re:zerg (Score:3, Informative)

      by dasunt ( 249686 )

      The next question should be, "How do we make them regret their non-compliance?"

      robots.txt:

      User-agent: *
      Disallow: /the-site-that-never-ends/

      It's trivial to write a script that links back to itself to generate millions of bogus pages. If you include URL rewriting, it won't even appear to be a script.

      The only downside is that while you are wasting their CPU and bandwidth, you are also wasting your own resources. If your CPU is mostly idle, then it's mostly just a waste of bandwidth.
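
      As a rough illustration (the path and script name are invented, and this assumes an Apache RewriteRule mapping /the-site-that-never-ends/ onto the script so the URLs don't look like a script):

        #!/usr/bin/perl
        # never-ends.pl -- hypothetical sketch of a page that only links back
        # to itself under different URLs, so a robots.txt-ignoring crawler
        # wanders forever.
        use strict;
        use warnings;

        print "Content-Type: text/html\r\n\r\n";
        print "<html><head><title>Archive index</title></head><body>\n";

        # Twenty random-looking links, all of which the rewrite rule would
        # map straight back to this same script.
        for (1 .. 20) {
            my $token = join '', map { ('a' .. 'z')[int rand 26] } 1 .. 8;
            print qq{<p><a href="/the-site-that-never-ends/$token.html">$token</a></p>\n};
        }

        print "</body></html>\n";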

    • Here's the programmer solution:

      1. Study your firewall logs, and try to determine some baseline criteria that identifies a spider. While you're at it, note the domains the spiders are coming from.

      2. Create a small perl script that, when fed the IP address of a questionable host, automatically adds a new "DROP" rule for that address and tacks the command onto your existing firewall script. This is your manual tool. Of course, you should debug it using your spider list from step 1, above.

      3. On
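
      Step 2 might look roughly like this sketch (the script name, the firewall-script path, and the use of iptables are my assumptions, not the parent's):

        #!/usr/bin/perl
        # blockip.pl -- hypothetical manual tool: given an IP address, add an
        # iptables DROP rule and append the same command to a firewall script
        # so the block survives a reload.
        use strict;
        use warnings;

        my $ip = shift or die "usage: blockip.pl <ip-address>\n";
        die "not an IPv4 address: $ip\n" unless $ip =~ /^\d{1,3}(\.\d{1,3}){3}$/;

        my $rule = "iptables -A INPUT -s $ip -j DROP";
        system($rule) == 0 or die "iptables failed: $?\n";

        # Record the rule so it can be replayed at boot.
        open my $fh, '>>', '/etc/firewall.local' or die "cannot append: $!\n";
        print {$fh} "$rule\n";
        close $fh;
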
    • Re:zerg (Score:2, Interesting)

      by Nagus ( 146351 )
      The next question should be, "How do we make them regret their non-compliance?"

      Tarpit [wikipedia.org] them! Bonus points if you feed them bogus data at the same time.

      Tarpitting unwelcome spiders not only limits the damage (in terms of bandwidth) they can do to you, but also the damage they can do to everyone else.

      Software for this is available, for example Peachpit [devin.com].
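
      For the curious, a crude sketch of what an HTTP-level tarpit can look like, assuming the web server hands identified bad clients to a CGI like this (this is not the linked software, just an illustration):

        #!/usr/bin/perl
        # tarpit.pl -- dribble a response out a little at a time so the bot's
        # connection stays tied up for as long as it is willing to wait.
        use strict;
        use warnings;

        $| = 1;  # unbuffered, so each print goes out immediately
        print "Content-Type: text/html\r\n\r\n";
        print "<html><body>\n";
        for (1 .. 300) {
            print "<p>please hold...</p>\n";
            sleep 10;   # roughly 50 minutes for one page
        }
        print "</body></html>\n";
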
    • Use Apache's SSI to detect browser type. If it is a known bot type, have Apache return the results of a PHP script that creates a valid header, then pipes /dev/urandom through uuencode (trimming off anything that makes it clear that it's UUencoded), so that their database then has to process a bunch of garbage.

      Alternately, use it to your advantage. Have a page of text that is nothing other than porn-related words, and have Apache return that when the bot comes looking. You're guaranteed to get a lot more
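
      Something along these lines could do the /dev/urandom trick without shelling out to uuencode; Perl's pack 'u' does the encoding, and stripping each line's leading length byte hides the most obvious giveaway (a sketch, assuming the server only routes detected bots here):

        #!/usr/bin/perl
        # garbage.pl -- serve a valid header followed by uuencoded noise from
        # /dev/urandom so a rude bot's indexer has to chew on garbage.
        use strict;
        use warnings;

        print "Content-Type: text/html\r\n\r\n";
        print "<html><body><pre>\n";

        open my $rand, '<', '/dev/urandom' or die "cannot open /dev/urandom: $!\n";
        for (1 .. 200) {
            read $rand, my $chunk, 45;    # 45 raw bytes per uuencoded line
            my $line = pack 'u', $chunk;  # uuencode the chunk...
            $line =~ s/^.//;              # ...then trim the telltale length byte
            print $line;
        }
        close $rand;
        print "</pre></body></html>\n";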

  • by grub ( 11606 ) <slashdot@grub.net> on Thursday June 09, 2005 @06:14PM (#12774493) Homepage Journal
    Does anyone have information on who follows this standard and who doesn't?

    Most crawlers will obey. Spambot email harvesters usually will not. Generate a huge page of crap with loads of fake email addresses, list it in your robots.txt as uncrawlable, and watch the spammers grab it.
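
    A quick sketch of such a generator (the placeholder domains and the idea that this URL is the one you Disallow in robots.txt are my assumptions):

      #!/usr/bin/perl
      # fake-emails.pl -- hypothetical honeypot page: hundreds of bogus
      # addresses for harvesters that ignore the robots.txt Disallow line
      # covering this URL.
      use strict;
      use warnings;

      print "Content-Type: text/html\r\n\r\n";
      print "<html><body>\n";

      my @domains = qw(example.com example.net example.org);  # placeholders
      for (1 .. 500) {
          my $user   = join '', map { ('a' .. 'z')[int rand 26] } 1 .. 10;
          my $domain = $domains[int rand @domains];
          print "<p>$user\@$domain</p>\n";
      }

      print "</body></html>\n";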

  • by Anonymous Coward
    Why don't you play a different game? Rather than playing "whine about unenforceable standards," why not play "don't put stuff on the internet you don't want people to see"?

    Seriously. If you don't want it to get crawled, don't make it accessible by the outside. If you can't figure out how to do that, you get what you deserve.

    • by Kelson ( 129150 ) *
      RTFA and realize he's not talking about the loss of "sensitive" data, but rather the DoS effect of extra traffic from rude robots.
    • There are good reasons for robots.txt. I use it to keep crawlers from hitting the "spam" forums on my website, which is where all solicitations go. That way no Google (or other search engine) rank is gained by spamming the site.

      I could just delete it all. But I'm trying to avoid deleting any posts.
    • You don't want a spider crawling over a webapp and creating, deleting, or updating data it knows nothing about. If there's a new wave of robots.txt-ignoring spiders, there are going to be a lot of ugly side effects.
      • by jbplou ( 732414 ) on Thursday June 09, 2005 @08:11PM (#12775550)
        Well, you've got a poor app if a spider can run right through it without authenticating and start inserting/updating/deleting your data.
      • why the hell are you making GET requests modify data? That's what POST is for.

        GET should only do just that, and a user agent should be allowed to reload a page that is the result of a GET request without fear of side effects.

        You'll have trouble from more than bots if you've got an app written like that - you'll have users hitting the back and forward buttons on their browsers and causing duplicate entries.
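
        A tiny sketch of the rule being described, for a hypothetical CGI delete handler (names are made up): GET may render a confirmation page, but only a POST is allowed to change anything.

          #!/usr/bin/perl
          # delete-item.pl -- illustrative only: reloading or crawling the GET
          # form has no side effects; only a POST performs the deletion.
          use strict;
          use warnings;

          my $method = $ENV{REQUEST_METHOD} || 'GET';

          if ($method ne 'POST') {
              # Safe for spiders and for the back button: nothing changes on GET.
              print "Content-Type: text/html\r\n\r\n";
              print qq{<form method="post" action=""><button>Really delete?</button></form>\n};
              exit;
          }

          # Only reached on POST: the actual deletion would happen here.
          print "Content-Type: text/plain\r\n\r\n";
          print "Item deleted.\n";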

    • by CommandoB ( 584587 ) on Friday June 10, 2005 @03:26PM (#12782951) Journal

      The White House seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by Google), they seem to cover all the bases in their 92KB robots.txt file [whitehouse.gov].

      My personal favorites:
      Disallow: /911/iraq
      Disallow: /911/patriotism/iraq
      Disallow: /911/patriotism2/iraq
      Disallow: /911/sept112002/iraq [sic.]

      There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.

  • by Neil Blender ( 555885 ) <neilblender@gmail.com> on Thursday June 09, 2005 @06:45PM (#12774777)
    All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.
    • All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.

      I was going to say this (but in a better way, of course!) but suspected that somebody else might beat me to it, and indeed they did ...

      However, there is a bit more to it. If he has a web server on his Windows or Mac OSX box, the odds are that the filesystem in use is case insensitive, so either robots.txt or ROBOTS.TXT will work, because either would be served up by the web server when one reques

    • What a lot of sites need is a slashdot.txt file.
  • Here's what would seem to work;

    1. Create robots.txt, including references to the spam spider trap. Make sure that the legitimate references to normal pages are outnumbered by a large margin.

    2. When pages that could only be referenced in the spam spider trap are accessed, note the IP address.

    3. Respond slowly to, or block, connections from the originating IP address.

    Bad guys are punished. Good guys are not. Low impact on system resources.

    There's got to be a dozen filters out there that already do this.
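
    For steps 2 and 3, something as small as this could do the bookkeeping (the trap handler, log path, and firewall command are all assumptions, not part of the parent's setup):

      #!/usr/bin/perl
      # trap.pl -- hypothetical handler for a URL that appears only in
      # robots.txt and in hidden links: any client that reaches it is noted
      # and then blocked.
      use strict;
      use warnings;

      my $ip = $ENV{REMOTE_ADDR} || 'unknown';

      # Record the offender so a cron job (or a human) can review the list.
      open my $log, '>>', '/var/log/spider-trap.log' or die "cannot log: $!\n";
      print {$log} scalar(localtime), " $ip\n";
      close $log;

      # Drop further connections from that address. In practice the web user
      # would need a privileged helper or sudo rule to do this.
      system("iptables -A INPUT -s $ip -j DROP") == 0
          or warn "could not add firewall rule for $ip\n";

      print "Content-Type: text/plain\r\n\r\n";
      print "Goodbye.\n";
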
    • The problem with this approach is this: If the spider doesn't bother to even read the robots.txt file, nothing gets trapped.
      • The problem with this approach is this: If the spider doesn't bother to even read the robots.txt file, nothing gets trapped.

        No, that's the point. When the spider ignores robots.txt, they pick up the poisoned pages. Then, because they are doing something wrong, punish them.

        • I think that you're overlooking the point.

          If the poisoned pages are only findable from robots.txt, then if they ignore robots.txt, they won't be punished.

          If they're findable via links, or whatnot, then you're punishing more than the robots (that means your users). We typically frown on people (**AA) that do that sort of thing, yes?

          -9mm-
          • I think that you're overlooking the point.

            Not at all. The AC has it right;

             You make a robots.txt listing the directories/pages they should not enter. You include links in your pages that only a crawler would normally see. Only a crawler that ignored robots.txt, or the exclusion in it, would go to those trap pages and get banned.

            To make this very clear; the links on the legitimate pages are not normally visible or say things like "." or "," or "This is a trap for sp@mmer$"...whatever. Color the text white

              • All that is fine and dandy until you get a curious user, like me, who either sees the ghost link (hover over it and the cursor and status bar will show there's a link there) or views the HTML for some reason and sees it. But then again, I can just pull out Tor or something if I get banned.
                All that is fine and dandy until you get a curious user, like me, who either sees the ghost link (hover over it and the cursor and status bar will show there's a link there) or views the HTML for some reason and sees it. But then again, I can just pull out Tor or something if I get banned.

                If someone is that inattentive, they get banned.

                If you want implementation details -- and it looks like you indeed do -- I'll be glad to provide them to you for a fee. Are you that curious, or can you figur

              • I always liked the way that arxiv.org dealt with this matter [arxiv.org]. It clearly says that it will initiate a seek-and-destroy against your site if you visit a certain link.

                If you do go there, it initiates a countdown.... I've never stuck around long enough to see what happens when the countdown finishes... I like my internet connection just a little too much for that... :-)
                • I've never stuck around long enough to see what happens when the countdown finishes...

                  Judging by the fact that I can still post this after trying it, I'd say little or nothing.
                • I've never stuck around long enough to see what happens when the countdown finishes...

                  Tried it, nothing happened, could still access the site. Since the page is from 1996 or so, it might be sending a ping of death to your IP. Back then you could crash a Windows computer by sending it a nonstandard ping.

  • Big name != "real" (Score:5, Informative)

    by droleary ( 47999 ) on Thursday June 09, 2005 @07:59PM (#12775440) Homepage

    I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

    No, what you see are some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this heavily documented robots.txt [subsume.com] file. I'm personally most offended by idiots running this shit [grub.org], since there is no single IP block to blacklist.

    • Heh. I think that your attitude is part of the problem here. If you try to piss people off, they will try to piss you off too.
      • Heh. I think that your attitude is part of the problem here. If you try to piss people off, they will try to piss you off too.

        I don't see your logic. By default, everyone gets to see the web site. In order to be singled out and disallowed, they must make the first effort to piss me off. If they go further and ignore robots.txt, I go further and ban by IP. At no time am I responsible for any escalation. If you think my "colorful" language is disturbing, I would say it is better for them to be able

        • For what it's worth (not much) I think your approach is perfect. And very funny. My favourite example for people who didn't follow the link:

          # Another bot that ignores * disallows, even though they claim they follow the protocol.
          # And what the hell is with Yahoo-VerticalCrawler-FormerWebCrawler in the agent? Pick a name!
          # This may be the same bot that was listed as FAST above, but it gets a special list.
          # Dirty, dirty bot. I kind of hope this is ignored so I get to block by IP.
          # Update: It is! I do!
          User
  • by anticypher ( 48312 ) <[moc.liamg] [ta] [rehpycitna]> on Thursday June 09, 2005 @09:21PM (#12776088) Homepage
    There was a bunch of fsckwits called dir.com who had a real nasty spider crawling all over the place a few months ago. It blatantly ignored robots.txt, tried dictionary attacks to detect unlinked parts of the website, and may have been trying exploits to crack systems to discover secrets normally protected by passwords or logins. Honeypot email addresses fed to the spider would be spammed within days.

    After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.

    Nobody would dare to blackhole Google, but there are hundreds of Google wannabes, and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.

    the AC
  • Just block the bot from your site, or write some simple PHP to restrict which pages it can query, and how frequently...

    I'd just block the "bad bots", though; if they don't listen to you, don't give them content.

    Or, contact the owner of the domain and get mad at them for spidering without following proper spider rules. They are wasting *your* resources in exchange for *their* profit; get mad, get even!
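
    The frequency part could be sketched like this for a CGI setup (the limits and the counter-file path are arbitrary choices of mine):

      #!/usr/bin/perl
      # ratelimit.pl -- rough per-IP limiter: more than 30 hits in a
      # one-minute window gets a 403 instead of the page.
      use strict;
      use warnings;
      use Fcntl;
      use SDBM_File;

      my $ip  = $ENV{REMOTE_ADDR} || 'unknown';
      my $key = $ip . ':' . int(time() / 60);   # one-minute buckets

      tie my %hits, 'SDBM_File', '/tmp/hit-counts', O_RDWR | O_CREAT, 0644
          or die "cannot open counter db: $!\n";
      $hits{$key} = ($hits{$key} || 0) + 1;

      if ($hits{$key} > 30) {
          print "Status: 403 Forbidden\r\nContent-Type: text/plain\r\n\r\n";
          print "Too many requests; slow down.\n";
          exit;
      }

      print "Content-Type: text/html\r\n\r\n";
      print "<html><body>Normal page goes here.</body></html>\n";
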
  • by stoborrobots ( 577882 ) on Friday June 10, 2005 @04:15AM (#12778083)
    http://www.fleiner.com/bots/ [fleiner.com]

    I found this site through some slashdotter's website a while back... I've forgotten where and when, but it lends itself nicely to the topic...

    Also good is the way arxiv.org fights back [arxiv.org].

  • I once had to deliver a solution to that problem for a friend. I made him a PHP script that detects the content directory and generates a JavaScript website which links into the content directory with an encrypted JavaScript link that cannot be followed by spiders. The content directory is renamed to some random name every hour. The 404 error handler leads people back to the entry page, in case they surf the content dir while it is being renamed.
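
    The hourly-rename part of a scheme like that might look roughly like this, run from cron (the webroot and naming convention are invented for the sketch):

      #!/usr/bin/perl
      # rotate-dir.pl -- find the current content directory and rename it to
      # a fresh random name; whatever generates the JavaScript entry link
      # would then be pointed at the new name.
      use strict;
      use warnings;

      my $webroot = '/var/www/html';
      opendir my $dh, $webroot or die "cannot read $webroot: $!\n";
      my ($current) = grep { /^content-[0-9a-f]{8}$/ && -d "$webroot/$_" } readdir $dh;
      closedir $dh;
      die "no content directory found\n" unless $current;

      my $new = sprintf 'content-%08x', int rand 0xffffffff;
      rename "$webroot/$current", "$webroot/$new"
          or die "rename failed: $!\n";
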
  • On a similar note... (Score:3, Interesting)

    by Transcendent ( 204992 ) on Sunday June 12, 2005 @12:32AM (#12792960)
    What is with requests for http://xxx.slashdot.org/ok.txt [slashdot.org] coming through on my webserver as if someone (Slashdot if you trace the IP) is trying to use it as a proxy?

    66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 650 "-"
    66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 647 "-"
    66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 647 "-"
    66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 650 "-"
    66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 647 "-"
    66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 653 "-"
    66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 651 "-"
    66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 648 "-"
    66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt [slashdot.org] HTTP/1.0" 404 652 "-"
    ...(continues, of course)

    I know the article is about bad spiders, but why is slashdot doing this?

"If it ain't broke, don't fix it." - Bert Lantz

Working...