Cloaking Detection?

Catch up on stories from the past week (and beyond) at the Slashdot story archive

Cloaking Detection? 42

Posted by Cliff on Thursday April 18, 2002 @11:33PM from the use-your-cloaking-device-hahaha!-I-didn't-think-so dept.

drcrja asks: "I am conducting some academic research on the use of cloaking and how it affects search engine rankings (cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user). I am currently using Alta Vista's Babel Fish to retrieve pages and compare those pages to the pages on the actual web sites but I am trying to find other methods of detecting cloaking. I am wondering if any members of the /. community have any experience with this?"

This discussion has been archived. No new comments can be posted.

Cloaking Detection?

Load All Comments

Search 42 Comments Log In/Create an Account

Comments Filter:

user agent. (Score:2, Funny)

by QuodEratDemonstratum ( 569501 ) writes:
1. Get a browser which lets you alter the useragent, or use a proxy.
2. Set it to the name the search engine's spider uses.
3. Browse away.
4. ???
5. profit.
- Problem (Score:3, Informative)
  
  by epsalon ( 518482 ) writes:
  
  Some sites may apply cloaking based on the IP addess of the spider.
  
  I suggest using Google's cache [google.com] as a method to detect cloaking. The advantage is that the page cached is exactly the same page used for indexing, and google is the most popular search engine, and thus you win.
- Re:user agent. Who want's to vew optimized pages (Score:1)
  
  by A55M0NKEY ( 554964 ) writes:
  
  The optimized pages are probably just tons of keywords that you would not be interested in viewing anyway. Every concievable synonym for 'Nude' for instance.
- Re:Just follow the ion emissions (Score:4, Informative)
  
  by irony nazi ( 197301 ) writes: on Friday April 19, 2002 @12:00AM (#3370961)
  
  Well, that is the way to detect Vulcan cloaking, but in order to detect website cloaking, try some variation of the following:
  Ingredients:
  Computer
  Perl
  Internet Connection
  LibWWW, UserAgent, and all the dependencies, I forgot which
  Optional: Perl Cookbook, by Christianson and Torkingham.
  Directions:
  Start with the Perl Cookbook to give you a quick background of how to design an autonomous www agent that will crawl around gathering webpages. You can have them visit links or read from a list of links or whatever you want.
  Read the documentation from the Perl UserAgent libs and figure out how to change the http headers to spoof various browsers. I've done this before. I think that I ended up going into the UserAgent code and doing this manually. I don't remember exactly how I accomplished this, I just remember that it was easy.
  Now have to agents to crawl websites and compare output from one website using the "Spider" http headers with the output from spoofing the "IE" http headers. Websites would sometimes still think the IE headers were a robot. The key is to pause the request so that it is as though a human is reading the page/clicking the links/etc.
  Keep track of the sites that are different or keep track of whatever stats that you need.
  Mix, Stir, Burn, Enjoy.
  I've actually done this type of thing before in order to test various IE only websites on non-IE browsers (non-MS computers). My results were that all of the pages the *require* IE render perfectly in Mozilla and most render fine in Opera. I still don't understand why businesses would *turn-away* potential customers only for having different http headers!
  
  Parent Share
  twitter facebook
  - Re:Just follow the ion emissions (Score:3, Informative)
    
    by penguinboy ( 35085 ) writes:
    
    Read the documentation from the Perl UserAgent libs and figure out how to change the http headers to spoof various browsers. I've done this before. I think that I ended up going into the UserAgent code and doing this manually. I don't remember exactly how I accomplished this, I just remember that it was easy.
    It's quite straightforward:
    my $ua = LWP::UserAgent->new;
    $ua->agent("Whatever 3.11/sun4u");
I'd make or modify an existing program to do this (Score:5, Informative)

by stienman ( 51024 ) writes: <adavis@@@ubasics...com> on Friday April 19, 2002 @12:02AM (#3370979) Homepage Journal

I imagine wget or another HTTP client can be coaxed to spit out the spider and browser type strings associated with search engine spiders. It would be a simple, straightforward hack to make a script that would request a page twice, once reporting itself as a search engine (and requesting the robots.txt file for good measure) and secondly as a regular browser. Then do a simple compare.

You could give it a list of sites and it could go through dozens or hundreds of sites a minute, rather than you doing it by hand. You could have it save pages that show differences, or at least give you the URLs so you could load them later and study the differences (if that is a goal).

You could use PHP, perl, java, etc to do this very simply as well. I imagine a simple PHP script could well be less than 50 lines, and could even call your browser and load the two pages side by side each time it found a difference.

-Adam

Share
twitter facebook
- Re:I'd make or modify an existing program to do th (Score:2)
  
  by Martin Spamer ( 244245 ) writes:
  
  Then do a simple compare.
  
  It's not nessecarry to compare the pages, calc a CRC or HASH for each version and test/store those.
  - Re:I'd make or modify an existing program to do th (Score:2)
    
    by stienman ( 51024 ) writes:
    
    Many of the site he'll likely be testing have some dynamic content in them, such as a date and time, counter, etc. A simple compare will show those lines that are different - if a lot is different then you know the page has changed significantly. If the compare only spits out a few lines you can log them and look at them later, or simply assume it's giving a different page to a different user and modifying some date/time/ads/cookie/etc info.
    
    A CRC or HASH will be a bit faster, but I suspect they'll turn up more false positives and create more post-processing than necessary.
    
    -Adam
    - Re:I'd make or modify an existing program to do th (Score:1)
      
      by majorero ( 572108 ) writes:
      
      So use a CRC/Hash to filter out the obvious and then a simple compare to try to narrow down the false positives.
- Re:I'd make or modify an existing program to do th (Score:1)
  
  by Mad Marlin ( 96929 ) writes:
  
  I imagine wget or another HTTP client can be coaxed to spit out the spider and browser type strings associated with search engine spiders. It would be a simple, straightforward hack to make a script that would request a page twice, once reporting itself as a search engine (and requesting the robots.txt file for good measure) and secondly as a regular browser. Then do a simple compare.
  For wget, --user-agent= AgentString will determine what user agent it reports. A list of user agent strings may be found here [d2g.com]. The file robots.txt is retrieved by default in wget.
Use Konqueror (Score:3, Informative)

by JabberWokky ( 19442 ) writes: <slashdot.com@timewarp.org> on Friday April 19, 2002 @12:05AM (#3370993) Homepage Journal

Use Konqueror and set your UserAgent to that of a search engine spider. I'm sure this can be done with most any web browser - in Konqueror, just go to Tools/Change Brower Identification. You'll probably have to add a new one (for some reason a spider userid string isn't common enough to be a standard one) - you can do that in Settings/Configure Konqueror/Browser Identification.
--
Evan

Share
twitter facebook
- Re:Use Konqueror (Score:2, Informative)
  
  by QuodEratDemonstratum ( 569501 ) writes:
  
  Some lists of useragent strings are here [mrsneeze.com] and here [html.ne.jp].
  - Re:Use Konqueror (Score:1, Funny)
    
    by Anonymous Coward writes:
    
    Your nickname is wrong. It should be quod erat demonstrandum, not demonstratum. As you have written it, you have taken the gerundive and made it a perfect passive participle.
- Re:Use Konqueror (Score:1)
  
  by neophase ( 97476 ) writes:
  
  Mozilla (recent releases, I'm currently running RC1 of 1.0) can send arbitrary UA strings by using David Illsley's uabar, available at http://uabar.mozdev.org/ . Enjoy :)
If cloaking becomes a problem... (Score:3, Interesting)

by tswinzig ( 210999 ) writes: on Friday April 19, 2002 @12:22AM (#3371094) Journal

...why don't the search engines just play the game? Cloak themselves to look like regular users.

Download the robots.txt file through one set of IP addresses, with your normal user-agent header. Then request the actual pages using a Mozilla or MSIE user-agent ID, and using new IP addresses that cannot be traced back to google (or whoever) using DNS. Queue up URL's to be downloaded in a random order so that a really clever website can't detect your robot by examining traffic patterns. (I.e. maybe take a full day or week to download all the pages from a site.)

If they did all this, could someone still detect it's a search engine robot and use cloaking?

Share
twitter facebook
- Re:If cloaking becomes a problem... (Score:1)
  
  by mclearn ( 86140 ) writes:
  
  Sure, base the decision on those IPs requesting /robots.txt.
  - Re:If cloaking becomes a problem... (Score:2, Interesting)
    
    by mclearn ( 86140 ) writes:
    
    Well, I gues that's what happens when you don't hit preview, eh? :-)
    
    I was going to qualify that by saying that statistically-speaking, one could deliver the false page to the set of requestors following closely behind the IP that grabbed /robots.txt. Of course, you go on to say that why don't the search engine companies spread the requests out?
    
    Well, how does the search engines know who is a cloaker and who isn't? Search engines *should* be good netizens, and abide by rules of conduct. Hence /robots.txt, throttling, some form of search that doesn't kill a server (eg. breadth-first), etc.
- Re:If cloaking becomes a problem... (Score:2)
  
  by dattaway ( 3088 ) writes:
  
  If they did all this, could someone still detect it's a search engine robot and use cloaking?
  
  I am seeing "cable customers" with very interesting browsing patterns that suggests this is already happening. They appear to be normal IE98 browsers, but pick up one page per minute in no particular order. Something like that is more likely done from something like a search engine script list than a human.
  
  I have thought of setting up Junkbuster info string to make my browser appear as a robot when I view sites. ...just to be clever, checking out robots.txt first. I'd imagine the pr0n sites would give me the keys to the place...
  - MSIECrawler is part of IE 4 (Score:2)
    
    by yerricde ( 125198 ) writes:
    
    I am seeing "cable customers" with very interesting browsing patterns that suggests this is already happening. They appear to be normal IE98 browsers, but pick up one page per minute in no particular order. Something like that is more likely done from something like a search engine script list than a human.
    
    If the user agent says MSIECrawler [google.com], then it's a browser feature. Microsoft Internet Explorer 4 (packaged with Windows 98) and later have a feature to mirror web sites locally for offline browsing.
  - Re:If cloaking becomes a problem... (Score:1)
    
    by Mr Z ( 6791 ) writes:
    
    It could be spammers harvesting addresses, as this page indicates. [dslreports.com]
    
    On 6.26am the morning of May 13th, 2001, the link is hit from IP 24.1.197.144 - a residential cable modem in Arizona, then on the @home network, now Cox. The browser is identified to the web server as a generic windows 95 version of Netscape. Of course it isn't, its an email harvesting robot that goes from web page to web page.
    
    Tada.
    --Joe
- Re:If cloaking becomes a problem... (Score:1)
  
  by SplatFileGoo ( 458668 ) writes:
  
  >why don't the search engines just ...Cloak
  >themselves to look like regular users.
  
  They do. Altaviast, Inktomi have been know to in the past, and some suspect Google does too.
  
  You don't think all those generic browsers coming from Exodus are real users do you? They all use Exodus and can sniff out cloaking at a whim.
  
  The problem?
  
  Having spidered millions of pages, it's pretty obvious that some form of cloaking is at work on a very high percentage of top sites.
  
  - Agent cloaking for browser support. (all the se's, and major sites do this).
  - IP based cloaking to feed custom languages. Sites such as Google auto redirect from cloaked setups to the local language (eg: google.com becomes google.fr in french for someone from france).
  
  Have an intelligent robot sort through that, and determine if a page is cloaked for se promotion purposes or just user purposes, is next to impossible without a brain behind the keyboard.
- Re:If cloaking becomes a problem... (Score:1)
  
  by Quadrature ( 524139 ) writes:
  
  Sure, just throw something into robots.txt that isn't linked to anywhere on the site. When some supposed "browser" loads it you know where it came from... You could even have some script change the bait URL each time robots.txt is requested and log the IP with that bait URL so you could track multiple abuses.
How to detect cloaking (Score:4, Funny)

by gnovos ( 447128 ) writes: <gnovos@ c h i p p e d . net> on Friday April 19, 2002 @02:03AM (#3371466) Homepage Journal

Use your long range sensors to detect search-space anomolies.

Share
twitter facebook
- Re:How to detect cloaking - Star Trek style... (Score:2, Funny)
  
  by AtariDatacenter ( 31657 ) writes:
  
  I'm trying to remember the details of the episode where they had the Klingon civil war, those wonderful Duras daughters (and coward son), and the daughter of Tasha Yar. They made some sort of tachyon network between a large number of starships to try to find a cloaked Romulan convoy crossing into Klingon space. Of course, in the end, Lt. Commander Data used a modified photon torpedo to illuminate a cloaked Romulan ship.
  
  Therefore, I would suggest that a photon torpedo would be the method best used against cloaked sites. Just one should do the trick.
google's cache; search engine cloak=bad (Score:5, Insightful)

by Craig Ringer ( 302899 ) writes: on Friday April 19, 2002 @02:05AM (#3371477) Homepage Journal

Hi
First: sometimes google cached copies of pages might be informative.

Changing your browser's User-agent str won't always detect the cloaking, as it is quite likely to be configured to work by ip addr block too (googlebot!). Similarly, babelfish may not show cloaked pages because it comes from a different IP than altavista'a index bots and this can be checked for in the cloaking server's config.

Second: it is *imperitave* that search engines keep unique user-agent strings that identify them. P'haps none of you who suggested the engine change user-agent str runs a website? It would remove a great tool from log analasys, and in the end make no difference to cloakers as they'd just do engine detection by IP anyway.

Share
twitter facebook
I thought (Score:1)

by 1Oman ( 308666 ) writes:

I thought google ranked websites by the number of pages that linked to a site. How do you fool the spider into thinking that more sites link to your site than really do?
- Re:I thought (Score:1)
  
  by majorero ( 572108 ) writes:
  
  You register a bunch of unrelated domain names, create dummy sites and interconnect them all. There's an actual name for this, I just can't remember it.
  - Re:I thought (Score:1)
    
    by praktike ( 543776 ) writes:
    
    google-bombing. usually done through blogs as well.
I do this for one of my sites (Score:2, Informative)

by Karora ( 214807 ) writes:
I make fairly extensive changes to one of my sites for search engines.
Things I do are:
- Make the content more easily indexible.
- Reduce the numbers of links to the same content.
- Present the variable content first
I have had a number of problems with badly behaved search engines basically DoSing the site as well.
Many search engines are easily identifiable by looking at the HTTP_REFERER, but for some of the stealthier ones you have to identify them by source IP, and of course the technique is only ever going to be 95%.
I really limit the options down and make the site look much like one of those old hierarchical sites of old. After all, the search engine is going to see the whole lot of it and I'm sure it is easier for them to navigate a tree without redundant processing since most of the site navigation is about providing multiple content categorisations of what is basically the same content.
To the poster of the question. (Score:3, Informative)

by perlyking ( 198166 ) writes: on Friday April 19, 2002 @05:00AM (#3371943) Homepage

I recommend you look at Webmasterworld [webmasterworld.com] there is a massive amount of knowledge there.
One of the guys from google even posts there on occasion.

Share
twitter facebook
Google API (Score:1)

by mini me ( 132455 ) writes:

You can retrieve cached pages from Google using their new SOAP API.
This would allow you to automate the process significantly.

This limits you to just Google, but it is a start.
Good reasons to vary response (Score:3, Informative)

by Martin Spamer ( 244245 ) writes: on Friday April 19, 2002 @01:45PM (#3374720) Homepage Journal

(cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user).

There are good reasons for a site to respond differently to different clients. Indeed responding to the capabilities of the client should be considered 'best practice'.

There are a host of client types out there other than just PC Browsers and Robots, IDTV STB's, 3G & WAP Phones, Convergent devices. The range is set to explode.

This is the whole reason for the Http 'Accept' header, which is provided to allow a server to handler clients with different capabilities.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec1 4. html

Share
twitter facebook
I've seen this work (Score:2, Funny)

by xrayspx ( 13127 ) writes:

If you vent your nacelles the resultant plasma trail should make any cloaked vessels which may be following you visible to the naked eye long enough to shoot at them.
Cloaking (Score:2, Informative)

by SplatFileGoo ( 458668 ) writes:

> sometimes google cached copies of pages might be informative.

Look at all the trouble Google is finding itself in with the "cache" (term used loosely since Google doesn't "cache" pages - they "page jack" them and put their own self advertisments on them). It's a ticking time bomb.

Tick, Scientologists,
Tick, German railroads,
Tick, Ok who's next?

> cloaked pages because it comes from
> a different IP than altavista'a index bots

Alta spiders from the Babelfish ip too. They switched the ip last year just to rat out cloakers.

> You can retrieve cached pages from
> Google using their new SOAP API.

Not if the webmaster has used the NOARCHIVE tag (which itself is probably cloaked if done properly).

> Indeed responding to the capabilities of the
> client should be considered 'best practice'.

Absolutly. Language differences, display differences, and various levels of css/dom/scripting support are all quality reasons to cloak. I have a site that deliver 8 different versions of a page based on ip and agent.

There are also cloaking programs sponsored by the search engines themselves. Inktomi's index connect, and Altavista's "trusted feed" programs encourage the cloaking of pages to protect them.

Did you know every major search engine cloaks their own site? Here's one that is agent cloaked by Google themselves: http://wap.google.com . If you don't have the right secret decoder ring, all you will see is stock Google.

--

>LibWWW

lol. No cloaker worth his salt would agent cloak for se purposes today. It's all 100% IP based detection. Unless you are parked on a searchengine ip address, you won't know what you are looking at.

With se's moving to off-the-page criteria (links and contextual themes such as google, teoma, and wisenut) this whole discussion is moot.
Cloaking for search engine purposes is rather rare anymore.

The hay day of cloaking was 99 when their was so much page jacking on Altavista. If you had a top ranking page, it was sure to be ripped off by afternoon and your rankings destroyed in the next update. In that environment, cloaking skyrocketed.

Now that Alta is a dead search engine walking, Inktomi requires fees, and all that is left is Google - it just doesn't make economic sense to cloak. Even if you can cloak, it does very little good and you really, really have to know what you are doing.
How I've seen it done (Score:2, Informative)

by The Bean ( 23214 ) writes:

I know of a company that used to make a living doing exactly this type of thing. The biggest thing for them was checking for requests for robots.txt. Once it detected a request for this the IP was placed in a database for future reference. They owned a large number of domains, and made up many pages containing the keywords that the client wanted their site to "show up under," with links between them all. These pages were served to any client that had previously requested the robots.txt file. But when a request for one of the pages was received from an IP that hadn't requested robots.txt, they were bounced to the clients site.
detecting cloaking... (Score:3, Funny)

by dousette ( 562546 ) writes: <`gro.ettesuod' `ta' `evad'> on Saturday April 20, 2002 @10:39AM (#3379058) Homepage

Didn't that require some sort of inverse tackyon pulse from the main deflector?

Share
twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

user agent. (Score:2, Funny)

Problem (Score:3, Informative)

Re:user agent. Who want's to vew optimized pages (Score:1)

Re:Just follow the ion emissions (Score:4, Informative)

Re:Just follow the ion emissions (Score:3, Informative)

I'd make or modify an existing program to do this (Score:5, Informative)

Re:I'd make or modify an existing program to do th (Score:2)

Re:I'd make or modify an existing program to do th (Score:2)

Re:I'd make or modify an existing program to do th (Score:1)

Re:I'd make or modify an existing program to do th (Score:1)

Use Konqueror (Score:3, Informative)

Re:Use Konqueror (Score:2, Informative)

Re:Use Konqueror (Score:1, Funny)

Re:Use Konqueror (Score:1)

If cloaking becomes a problem... (Score:3, Interesting)

Re:If cloaking becomes a problem... (Score:1)

Re:If cloaking becomes a problem... (Score:2, Interesting)

Re:If cloaking becomes a problem... (Score:2)

MSIECrawler is part of IE 4 (Score:2)

Re:If cloaking becomes a problem... (Score:1)

Re:If cloaking becomes a problem... (Score:1)

Re:If cloaking becomes a problem... (Score:1)

How to detect cloaking (Score:4, Funny)

Re:How to detect cloaking - Star Trek style... (Score:2, Funny)

google's cache; search engine cloak=bad (Score:5, Insightful)

I thought (Score:1)

Re:I thought (Score:1)

Re:I thought (Score:1)

I do this for one of my sites (Score:2, Informative)

To the poster of the question. (Score:3, Informative)

Google API (Score:1)

Good reasons to vary response (Score:3, Informative)

I've seen this work (Score:2, Funny)

Cloaking (Score:2, Informative)

How I've seen it done (Score:2, Informative)

detecting cloaking... (Score:3, Funny)

Related Links Top of the: day, week, month.

Slashdot Top Deals