
 



The Internet

Is Spidering Content from the Web Illegal?

Lysol asks: "I'm setting up a new site that I want to have spider some industry sites and grab their news. The problem is that no one has been able to tell me if this is illegal or not. I'd be fine with it, but I could see if a site was touchy about this and they were monitoring their logs and found a particular server on my site grabbing pages every few hours or so and then proceeding to sue me or something." As I understand it, it all depends on the fine print. Some sites have notices that allow you to freely copy the content of a web page as long as credit is given. Many commercial sites are much more restrictive.
This discussion has been archived. No new comments can be posted.


  • Look for a copyright on the website. If there is one then you can't just copy the news. If there isn't a copyright notice then you can due to copyright law. But you should probably just ask.

    Updating your site with news you read, but not an identical copy that you just grabbed off the site, would be ok. You're not copying anything. Just using information from the site to make your own version.

  • Spidering suggests to me that you're copying everything word for word, which is bad. Linking to someone is one thing, but copying their content is asking for trouble.

    Your safest method of automation would be to parse their headlines, and link to their content for the details, making it clearly known that they're the original source.
  • Updating your site with news you read, but not an identical copy that you just grabbed off the site, would be ok. You're not copying anything. Just using information from the site to make your own version.

    Oh, you mean like plagiarizing?

  • by Anonymous Coward

    There is an informal but generally accepted standard you should take a look at called "A Standard for Robot Exclusion"

    Take a look at http://info.webcrawler.com/mak/projects/robots/robots.html [webcrawler.com] and http://info.webcrawler.com/mak/projects/robots/norobots.html [webcrawler.com]

    This does not address copyright issues, which have become even murkier with the recent revisions to the copyright law restricting fair use.

    You should also take a look at the XML syndication format (aka RSS [RDF Site Summary]). It's based on RDF and is becoming supported by a lot of larger news sites, even /. Here are some links: http://www.edventure.com/release1/abstracts/syndication.html [edventure.com] for background info, http://www.w3.org/RDF/ [w3.org] for the low level info, and http://my.netscape.com/publish/help/quickstart.html [netscape.com] for the RSS implementation.
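The two suggestions above (check the robot exclusion rules first, then read headlines from a syndication feed rather than scraping pages) can be sketched roughly as follows. This is only an illustration using Python's standard library: the bot name `ExampleNewsBot`, the `news.example.com` URLs, and both sample documents are made up, and a real spider would download the site's `robots.txt` and feed instead of using inline strings.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def headlines(rss_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_xml)
    pairs = []
    for element in root.iter():
        # Compare local names so namespaced RDF-style feeds also match.
        if element.tag.rsplit("}", 1)[-1] != "item":
            continue
        fields = {child.tag.rsplit("}", 1)[-1]: (child.text or "").strip()
                  for child in element}
        if fields.get("title") and fields.get("link"):
            pairs.append((fields["title"], fields["link"]))
    return pairs


# Hypothetical sample data, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

SAMPLE_RSS = """\
<rss version="0.91">
  <channel>
    <title>Example Industry News</title>
    <link>http://news.example.com/</link>
    <item>
      <title>Widget prices fall</title>
      <link>http://news.example.com/stories/1.html</link>
    </item>
    <item>
      <title>New gadget announced</title>
      <link>http://news.example.com/stories/2.html</link>
    </item>
  </channel>
</rss>
"""

if __name__ == "__main__":
    agent = "ExampleNewsBot"  # hypothetical bot name
    feed_url = "http://news.example.com/index.rss"  # hypothetical feed URL
    if is_allowed(SAMPLE_ROBOTS, agent, feed_url):
        # Publish only the headline and a link back to the source.
        for title, link in headlines(SAMPLE_RSS):
            print(f"{title} -> {link}")
```

As noted above, passing the robots.txt check says nothing about copyright; it only shows that the site has not asked robots to stay away from that path.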

  • If you're doing something like pulling headlines and providing links to the full story where it was found, I don't think anybody could reasonably object. If you summarize the story yourself (like slashdot) that's cool; if you quote some of the page it's a little more iffy. Search engines do it. If you're pulling text and serving it yourself, more folks are going to have a problem with that. I'd imagine most sane outfits would ask you (politely first) to desist before going to legal action. My understanding of the concept of "fair use" means that you could summarize and provide links, and even quote chunks directly, even if the original source objected.

    What others get away with is probably the standard to test against. Google caches a lot of pages and serves them; is that republication without permission legal? I think they'll stop offering pages cached from your site if you ask; haven't heard of 'em being sued yet. LinuxToday consists mostly of quotes from stories from other sites; have they gotten any complaints? Granted, their process probably isn't automated.

    Then there are things like the Internet Archive [archive.org]:

    "A digital library for the future. We have already started to build it - we have already collected much of the Web as well as other public Internet materials including Netnews and downloadable software."


    They claim to have near 20 terabytes collected, although AFAIK they don't make any of it available to the public. At least their spider is polite, it seems to honor robots.txt, and it's throttled. What's the legality there?

    Then again, there's eWatch [ewatch.com], who runs a spider over sites being monitored daily, with never a query for robots.txt and a deceptive User-Agent string. They sell "Internet Monitoring":

    "Safeguard stakeholder value, improve customer service, protect corporate reputation, monitor competition, identify trends, and pinpoint corporate activism... "


    While not on topic, this quote from the propaganda for their new Cybersleuth(R; TM, & FU) service is also likely of interest:

    "eWatch CyberSleuth will attempt to identify the entity or entities behind the screen name(s) which have targeted your organization."

    ...and...

    "...then containment is the next step. Containment is a two part endeavor focusing on (1.) Neutralizing the information appearing online, and; (2.) Identifying the perpetrators behind the postings, rogue website, hack, etc. Neutralizing information posted online, if appropriate, is the removal of the offending messages from where ever they appear in cyberspace. This may mean something as simple as removing a posting from a web message board on Yahoo! to the shuttering of a terrorist web site. The objective is to not only stop the spread of incorrect information, but ensure that what has already spread is also eliminated."


    Go check it out on their site; I've lost some context in quoting. That'll probably get me a free target's eye view of the service.

    ...but I tend to ramble. To get back to the point, I'd say go ahead and do it, openly and above board, and see what kind of feedback you get from your vict... sources.
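For what "polite" and "above board" might look like in practice (throttled requests and a User-Agent that says who you are, in contrast to the deceptive one mentioned above), here is a minimal sketch. The class name `HostThrottle`, the helper `build_request`, and the `USER_AGENT` string are all hypothetical, and the right delay between requests is a judgment call, not something any standard dictates.

```python
import time
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical identity; point the URL at a page explaining what the bot does.
USER_AGENT = "ExampleNewsBot/0.1 (+http://www.example.com/bot.html)"


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if this host was hit too recently; return seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay


def build_request(url):
    """Build a request that identifies the spider honestly."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A spider would call `throttle.wait(url)` before each `urllib.request.urlopen(build_request(url))`, so that a site owner reading the logs sees infrequent hits from a client that names itself.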

  • If this is a commercial venture and you're looking at offering news kind of like a 'portal' does (god i hate buzzwords) then there are a couple companies that eliminate the need to do-it-yourself and will also eliminate the legal hassles for delivering news content... if it fits your bill, that is.

    Here are the two that I know of. I have used both services and they have been fairly decent content providers.

    1) Screaming Media http://www.screamingmedia.net/ [screamingmedia.net] - These guys let you pick and choose what goes on your site. Articles appear as if they came from your site. Good hands-on approach. Good content, too.

    2) iSyndicate http://www.isyndicate.com/ [isyndicate.com] - These guys let you publish your content back to their network as well as letting you use their syndicated content. I really like the way it works. I don't know if you have as much say as to what comes down the pipes from them though, e.g. not wanting to put your competitor's press release on your site.

    Anyway that's just a glimpse of the content providers out there. I have seen tens upon tens of them; these are the two that I had settled on before. I hope that you might find some use with them. iSyndicate used to have a free program too. I don't know what (if anything) happened to that; possibly it's still there.

    ~GoRK
  • by copito ( 1846 ) on Tuesday November 09, 1999 @09:40PM (#1547718)
    #include "disclaimer.h"

    www.freerepublic.com [freerepublic.com], a conservative news discussion site, is being sued by the LA Times and the Washington Post for copying news stories for discussion. Sort of like Slashdot without the links. The judge ruled [yahoo.com] that this was not fair use. A final ruling in the case has not been reached.

    Linking to content is likely to be much safer than copying it or framing it [zdnet.com], although copying headlines might be safe.

    Remember it is not whether someone can sue you that is the important question. Anyone can sue you. It is a question of whether you will piss them off enough they might sue you, and whether it is easy to get the lawsuit dismissed.


    --
  • Provided you indicate your source (and it's usually best to ask them first anyway) then it may be copyright infringement (depending on the country) but not plagiarism.
  • i thought that plagiarizing was repeating someone else's statements word for word as your own. but, i think that if you take all the information from a given source and draw your own conclusions (or whatever), even if it comes out as the same opinions but in different wording it's perfectly legal.
  • Any work is automatically copyrighted to its author and subject to copyright laws at the moment of creation, EVEN IF there is no explicit copyright statement. In the US a notice has not been required since the Berne Convention took effect there in 1989, and this applies to web pages. Therefore, you can't copy stuff from a page just because it doesn't say you can't. The only time you can is if it says you can.

