Archiving Web Pages - Legal or Illegal?
Dyer asks: "I used to run several high-traffic anonymous surfing sites, and if I wasn't getting emailed by a lawyer telling me to block someone's site from being accessed, I was being woken up at 2am by a telephone call from a crazy person yelling, sometimes swearing at me, under the impression that my site had copied theirs and that it resided on my server, when in actuality it was being accessed by my server at that instant and relayed to the user. This is my point: how do services like Archive.org and Google's cache get away with what they're doing? You can call their services whatever you like, but it doesn't change the fact that they are copying people's websites and saving them onto their servers for everyone to access."
It SHOULD be legal (Score:4, Interesting)
Everything should go, except for things like malicious alteration and theft (taking stuff and claiming it is yours)
Re:It SHOULD be legal (Score:5, Interesting)
You know, I've been wondering about Java/Shockwave games. Certainly most kids would love a CD full of those games, and many companies have many different games online which mostly disappear a few months later.
Is anybody archiving these? Do we need to start?
Would the companies object?
You can play The Hitchhiker's Guide to the Galaxy [douglasadams.com] on Douglas Adams' web site. As it happens, if you know what you're doing you can also download the
This ties in to today's "is ROM collecting wrong" story, except in this case you're actually offered the games, under mostly unclear terms.
Not as bad as me (Score:1)
Rachael
Re:Archive sites are valid and important resources (Score:1)
RTFF (Score:5, Informative)
Archive.org FAQ [archive.org]
How can I remove my site's pages from the Wayback Machine?
The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.
See our exclusion policy.
You can find exclusion directions at exclude.php. If you cannot place the robots.txt file, opt not to, or have further questions, email wayback2@archive.org.
In other words, by NOT including a robots.txt file, you are implicitly granting them permission to cache your content. Also, the content is cached as it was published, complete with the appropriate markings, and is only publicly accessible content, so you'd be hard-pressed to argue there is any economic harm from the caching, which means there would likely be no damages from a successful copyright suit, which means a copyright suit would be pretty damned unlikely.
IANAL.
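For anyone wondering what the robots.txt opt-out looks like from the crawler's side, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URLs and the "ia_archiver" user-agent string are assumptions for illustration only; check the Archive's own exclusion directions for the name their crawler actually answers to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one crawler entirely and keeps
# everyone else out of /private/ only.
# "ia_archiver" as the archive crawler's name is an assumption here.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeBrowser"):
        print(agent, may_fetch(ROBOTS_TXT, agent, "http://example.com/index.html"))
```

A well-behaved crawler runs a check like this before every fetch. Nothing forces it to, which is part of what the thread is arguing about.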
Re:RTFF (Score:2, Insightful)
Re:RTFF (Score:2)
People are free to drive by your house and take a picture of it, or anything else out in public view.
Not if that thing out in public view is copyrighted.
But is it still caching once the original is gone? (Score:2)
That may be true in some places, I don't know. Regardless, if the archive continues after the original site is taken down, it is no longer a cache, it is an outright copy.
And yes, this could be damaging. To give a close-to-home example, consider a case where a site gets /.ed so only a few people can see the real content. If that site is then updated in some critical way, the numerous caches all over
Re:But is it still caching once the original is go (Score:2)
Regardless, if the archive continues after the original site is taken down, it is no longer a cache, it is an outright copy.
I'm not sure what you mean by "an outright copy." It's always an outright copy. But if the archive continues after the original site is taken down, it's still a cache.
If that site is then updated in some critical way, the numerous caches all over the web won't be (at least not immediately, and it is clearly unreasonable to expect anyone publishing a website to notify them all).
Re:But is it still caching once the original is go (Score:2)
My point is that the term "cache", as commonly used in computing, carries an implication that the cached material is identical to the original but faster to access. If the original is no longer there, or has changed, then you are no longer caching it, you are simply keeping a copy of the old data.
Re:But is it still caching once the original is go (Score:1)
Re:But is it still caching once the original is go (Score:2)
OK, I've read the relevant parts of that. I fail to see how a web cache "...complies with rules concerning the refreshing, reloading, or other updating of the material when specified by the person making the material available online in accordance with a generally accepted industry standard data communications protocol for the system or network..."
The industry standard is that when you request information from a web site, you get the current version. (As I noted elsewhere, browser caching is quite differe
Re:But is it still caching once the original is go (Score:1)
The industry standard is that when you request information from a web site, you get the current version.
Sometimes. If the current version isn't available (for instance because you're offline), then you get whatever is in the cache.
I'm mainly thinking of Google, here. Google isn't intentionally displaying old content, and they take it down after a rather short period of time. Presumably they adhere to the "Expires" header and other relevant information. Certainly they adhere to the robots exclusion a
Re:But is it still caching once the original is go (Score:2)
Sorry, but I don't think this is reasonable. That caching is part of the browser software, and as I've noted repeatedly, that is a different issue to the web caches we're discussing here.
Nothing in the HTTP spec, or in any other relevant Internet standards, provides for any caching of old content and supplying it when a straightforward HTTP request for a file is sent. You get the current
Re:But is it still caching once the original is go (Score:2)
Nothing in the HTTP spec, or in any other relevant Internet standards, provides for any caching of old content and supplying it when a straightforward HTTP request for a file is sent.
I don't think that's what the spec says. It says that you have to explicitly warn the end-user when semantic transparency is relaxed by cache. It only says that the request must be explicit when relaxed by client or origin server.
You get the current file, or a defined error code.
Or you get an old file and a warning.
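To make the "old file and a warning" case concrete, here is a toy model of a cache deciding what to hand back. It is a sketch, not a reading of the spec: the class, the function and the "cache.example" agent name are all made up for illustration, and the only detail taken from HTTP/1.1 (RFC 2616) is that Warning code 110 means "Response is stale".

```python
from dataclasses import dataclass, field

@dataclass
class CachedResponse:
    body: str
    expires_at: float  # absolute time in seconds after which the copy is stale
    headers: dict = field(default_factory=dict)

def serve_from_cache(entry: CachedResponse, now: float, origin_reachable: bool):
    """Return (body, headers), or None when the caller should ask the origin."""
    if now <= entry.expires_at:
        return entry.body, dict(entry.headers)  # still fresh: serve it silently
    if origin_reachable:
        return None  # stale but the origin is up: go and revalidate
    # Stale and the origin is unreachable: serve the old copy, but flag it.
    headers = dict(entry.headers)
    headers["Warning"] = '110 cache.example "Response is stale"'
    return entry.body, headers

if __name__ == "__main__":
    entry = CachedResponse(body="old page", expires_at=100.0)
    print(serve_from_cache(entry, now=200.0, origin_reachable=False))
```

Whether a long-lived archive counts as this kind of cache at all is exactly the point in dispute above; the sketch only shows the mechanism being referred to.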
R
Re:But is it still caching once the original is go (Score:2)
I'll have to take your word about the HTTP spec; I don't recall ever seeing what you describe, but I can't say I've read it in that much detail recently.
Regarding your other points...
Re:But is it still caching once the original is go (Score:1)
Because if you're smart, you respect reasonable laws of other countries with which yours deals regularly, and not just your own.
I don't think copyright is a reasonable law. I only respect it to the extent I think I might get caught.
Why do I give a shit about the US? Many of these sites aren't based there any more than they are in the UK.
I never said you had to give a shit about the US.
Unfortunately, it won't pay my rent.
So why don't you get a job?
There comes a point, however, when I feel reaso
Re:But is it still caching once the original is go (Score:2)
Ah, I see. You're one of the people who, instead of discussing the issue on merits, decides unilaterally that he is above the law. So much for your credibility in any discussions around here, then.
Re:But is it still caching once the original is go (Score:2)
Ah, I see. You're one of the people who, instead of discussing the issue on merits, decides unilaterally that he is above the law. So much for your credibility in any discussions around here, then.
We're all above the law. The government derives its power from the people, not the other way around.
That remark was pretty crass considering the state of the industry and the number of good people who currently aren't in employment in spite of having useful skills.
I'm unemployed despite having useful skill
Re:But is it still caching once the original is go (Score:2)
Those two statements aren't in any way equivalent.
Just to be clear, I do not volunteer my time just to toot my horn or make a profit. I have given thousands of hours over the past 5-6 years helping out in very technical forums (not just writing amusing anecdotes on Slashdot for my own entertainment, which is hardly the same thing) and never made a penny from it. I post here anonymously, and do you see m
Re:But is it still caching once the original is go (Score:1)
Those two statements aren't in any way equivalent.
Ah, but they are. When the government tries to force an unjust law on the people, the people are under no obligation to accept it. Now sometimes it's in our own best interests to follow it anyway, just as we would sometimes give up certain freedoms under other gunpoint situations, but it's not always the best idea.
Just to be clear, I do not volunteer my time just to toot my horn or make a profit. I have given thousands of hours over the past 5-6 years
Re:But is it still caching once the original is go (Score:2)
And who is to say that it is unjust? You? Copyright is a well-established legal principle, and there are very good reasons for it. The fact that you don't like it doesn't make it unjust.
Perhaps you have a better idea for how to make laws? Or should we dispense with them altogether, since no doubt someone thinks every illegal thing should be legal, typically those who want to break the law an
Re:But is it still caching once the original is go (Score:2)
And who is to say that it is unjust? You?
Yep, me. We each have to decide for ourselves what is moral and immoral.
Copyright is a well-established legal principle
So was slavery.
and there are very good reasons for it.
I disagree.
The fact that you don't like it doesn't make it unjust.
You're right. But the fact that it is unjust makes me not like it.
Perhaps you have a better idea for how to make laws?
Yes. You should never go to jail for breaking a law which does not cause direct physical har
Re:But is it still caching once the original is go (Score:2)
OK, I give up. You're worse than RMS. You persistently ignore the positives of things you don't like, you exaggerate the negatives, you put words into people's mouths, you ignore the wording of the law or just dismiss it outright when you happen not to agree with it, and your arguments are illogical, emotional and utterly without objective merit. The best you can do is attack figures of speech and twist what I've written to give it meanings I did not, so as to set up a range of straw men at which you can sh
Re:But is it still caching once the original is go (Score:2)
OK, I give up. You're worse than RMS.
Wow. Thank you for the compliment. I wish I really could compare myself to RMS. I don't agree with him all the time, but his combination of practicality with stubbornness has been inspirational :).
Re:RTFF (Score:1)
Re:RTFF (Score:2)
So, I'd be breaking the law if I went to Stephen King's next book-signing, and snapped a shot of the cover art of the books on the table?
Re:RTFF (Score:2)
Re:RTFF (Score:2)
It's called 'publishing'.
"do you know that they are publishing on a "public" network? How fucking infeasible would it be to send only to IP addresses which they wanted to?"
Yeah, they wouldn't want to password-protect their files or anything.
I can see that you're no more 'enlightening' here than you are in our thread about your inability to see your own idiocy.
Re:RTFF (Score:3, Insightful)
In a copyright case, the courts first establish whether infringement has taken place, and this is determined irrespective of economic issues. It is determined purely on issues of subsistence, ownership, duration, etc - in terms of the statutory provisions and the existing case law. It is only then that exceptions (such as fair use, and specific exemptions - say - for public archives and libraries) are considered.
Then, finally, when remedies are considered (e.
Re:RTFF (Score:2)
Re:RTFF (Score:3, Interesting)
Yes, but it places the burden in the wrong place and so is not likely to be considered an adequate remedy by the courts. More properly, the violator should be seeking permission prior to re-distributing the content, rather than essentially saying to the copyright holder "Stop me before I copy again!"
I'm not sure I think that caching sites should be subject to traditional copyright law--it has some nasty implications for anyone who cu
Not so black and white as most here are saying! (Score:3, Interesting)
Riiiiight. See you in court.
As I've just posted elsewhere, it is quite feasible that a site owner could be damaged if caches maintain information after the original site has been changed or taken down. For example, if updated information is placed on the original, this leaves the "cached" versions out of date and misleading anyone who reads them thinking they're seeing a perfect c
Re:Not so black and white as what you are saying! (Score:1)
As I've just posted elsewhere, it is quite feasible that a site owner could be damaged if caches maintain information after the original site has been changed or taken down.
Damaged in what way? Aren't there archives of newspapers, journals, and magazines? And if time-sensitive information is present on a website, does the public have a right to see what was previously there? Websites can get away with a lot of instant censorship that way - you can check out this site [thememoryhole.org] for an archive designed in response
Re:Not so black and white as what you are saying! (Score:3, Interesting)
If I put up information on a web site, for free, as a volunteer, then the public has no rights whatsoever, either legally or morally. Why the hell should they? They didn't do anything to earn them.
I'll give you a coupl
Re:Not so black and white as what you are saying! (Score:1)
If I put up information on a web site, for free, as a volunteer, then the public has no rights whatsoever, either legally or morally. Why the hell should they? They didn't do anything to earn them.
The fact that the public has a right to anything you produce is the reason that the public domain exists. Copyright is instituted by governments to keep creative people in a position to keep creating - but when you're dead, the information should go somewhere to enhance the public good. If the human race is to
Re:RTFF (Score:2)
I think you're sort-of-right. The mere fact that a search engine gives you this facility to opt out does not create an implicit licence to use content by itself: there is an old principle of law that silence does not mean consent. If this were not the case, I could, e.g., write to you offering an opportunity to engage in a Nigerian money-laundering scam with the rider that "if you don't reply to this I will take it to mean you have accepted my offer" and then enforce that through contract law if you didn't
Re:RTFF (Score:2)
Bullshit.
Your argument is like saying "If you leave your front door unlocked you are giving your neighbors implicit permission to loot your house."
Many website creators don't even know about archive.org, so how will they know to go read the document? You cannot assume permission from the lack of a robots.txt file.
Now, if archive.org only copied your site if you DID
My 9/11 Archive (Score:5, Interesting)
Over the next week I archived some 4,600 blogs. They've kind of been sitting around waiting for me to weed through and organize. I've also been wgetting 30 or so large news sites' front page every 15 minutes or so on the hunch that I'll grab something emerging even if I'm AFK. Well
The answer(s) to this question will definitely be of use to me. Thanks for asking it. Slash, thanks for posting it.
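For anyone who wants to try the same kind of grab-the-front-page-every-15-minutes archiving without wget, here is a minimal Python sketch. The function names, the filename scheme and the output layout are my own assumptions, not what the poster actually ran.

```python
import time
from pathlib import Path
from urllib.request import urlopen

def snapshot_name(site: str, when: float) -> str:
    """Build a filesystem-safe filename from a URL and a UTC timestamp."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(when))
    safe = "".join(c if c.isalnum() else "_" for c in site)
    return f"{safe}-{stamp}.html"

def fetch(url: str) -> bytes:
    """Download one page; network errors surface as OSError subclasses."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()

def snapshot_all(urls, out_dir: Path, now: float, fetch=fetch):
    """Save one timestamped copy of each URL and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError:
            continue  # one site being down should not stop the whole run
        path = out_dir / snapshot_name(url, now)
        path.write_bytes(data)
        saved.append(path)
    return saved
```

To repeat it, call snapshot_all(urls, Path("snapshots"), time.time()) from cron, or from a loop with time.sleep(15 * 60) between runs. This only keeps the raw HTML of each page, not images or stylesheets.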
Re:My 9/11 Archive (Score:2)
1) Burn it onto DVD. But I don't know which format is likely to survive the longest!
2) Hand it over in whatever form you can to your nearest major University and let them work out how to archive it. If they can find a way to do so reliably, it will be very valuable to their Faculty of History in a hundred years or so!
If you can do both, then great - you could distribute it to several Universities. Be sure to include a few European Unis that have already been around for at
Re:My 9/11 Archive (Score:2)
Re:My 9/11 Archive (Score:1)
Paper will last far, far, far longer than any electronic media. I can still read the master's thesis my dad wrote in the 70s, but the box of punch cards to go along with it is utterly useless.
Re:My 9/11 Archive (Score:1)
That's not what I've read. (Score:2)
CDs will last at least as long as the average paper archive
Paper can easily last a hundred years (I have a number of books from the late 1800's & early 1900's); IIRC the typical MTF for CDs is on the order of 20 years, and can be as low as 5 [fcw.com].
-- MarkusQ
Re:That's not what I've read. (Score:2, Insightful)
could you send that to me? (Score:2)
email me with details!, use my public key!
Re:could you send that to me? (Score:2)
Re:could you send that to me? (Score:2)
Re:could you send that to me? (Score:2)
Re:could you send that to me? (Score:2)
Re:could you send that to me? (Score:2)
"me too!"
Re:could you send that to me? (Score:2)
"Sure. Ummm. I'm going in for surgery today so it'll have to wait until tomorrow, but drop me an email at slash@php.us and I'll get you a URL where you can just dl it. If you have dialup or some such, let me know and I'll snailmail you a CD with the data."
Re:My 9/11 Archive (Score:2)
An idea (Score:5, Insightful)
DON'T POST THINGS YOU DON'T WANT PEOPLE TO SEE ON A PUBLIC NETWORK.
It's quite simple really.
Re:An idea (Score:2)
I think you misunderstood his question.
No, my post was in response to people that get angry when their sites are mirrored. They seem to feel that even though they are distributing the content on an international network with millions of users, they can still control the information, solely through litigation, and that is not how things SHOULD be.
It might be useful to note... (Score:4, Informative)
Google, on the other hand, is gathering data for its search engine, and, of necessity, must have what essentially amounts to a copy of each web page in its stores in order to provide this service. If one does not want to have their data in Google, they simply use robots.txt, and Google does not spider, cache, or store any data from that site if robots.txt is filled out. However, the site owner also denies themselves the ability to be listed, for 'free', in Google's search pages. This could be thought of as the cost of being listed.
So I don't think either of those two situations have any problems defending themselves. An anonymizer could also be seen as providing a useful, protected service. An anonymizer is nothing more than a proxy service, and many ISPs use proxies now, not to mention caches and many other tools that store website information or meta information without notifying or requesting explicit permission to do so - they request implicit permission by sending a GET command.
-Adam
Re:It might be useful to note... (Score:3, Informative)
In any case, the Archive's work with the Library Of Congress and, increasingly, national libraries who want to archive the Web content of their countries, proves that the establishment also thinks Web a
Re:It might be useful to note... (Score:1)
Email? (Score:2, Funny)
Email from lawyers is
As for the waking up in the middle of the night...
Um, turn off the ringer? Stop sleeping in the NOC? Maybe invest in a second phone line for your business instead of using mom's POTS line.
Be Happy (Score:3, Insightful)
Re:Off topic, but... (Score:1)
Honestly... (Score:2, Informative)
Thus, never discovering that their site has been archived somewhere else. That, and Google has a rather chunky disclaimer-type-deal at the top--I'm sure it's in response to just that behaviour.
*copy* right (Score:5, Interesting)
(FWIW, IANAL) Web site content is copyrighted. Therefore, you have a right to make your own personal copy, and backup copies, but it is not legal to redistribute those copies without the site owner's permission. I cannot imagine that the Wayback machine or the Google cache is legal. They are blatantly disregarding the site owners' copyright.
That said, I think the law should be changed or at least clarified, because it is patently (pun intended) obvious that those services are doing a vast social good, and should be encouraged.
Re:*copy* right (Score:3, Interesting)
Web site content is copyrighted. Therefore, you have a right to make your own personal copy, and backup copies, but it is not legal to redistribute those copies without the site owner's permission. I cannot imagine that the Wayback machine or the Google cache is legal. They are blatantly disregarding the site owners' copyright.
That would imply that every ISP running a public squid cache is breaking the law, and Akamai's entire business model is based on illegal content-smuggling. I really don't th
Re:*copy* right (Score:2)
Re:*copy* right (Score:3, Informative)
"...and Akamai's entire business model is based on illegal content-smuggling. I really don't think so!"
Akamai caches sites of people who pay them to cache them, so that would be one hell of a lawsuit. I know this because I worked for them for a few years.
Re:*copy* right (Score:2, Informative)
(FWIW, IANAL)
Obviously [cornell.edu].
Re:*copy* right (Score:5, Informative)
Heck, it's so useful that I'm going to quote some of it here:
TITLE 17 > CHAPTER 5 > Sec. 512. Prev | Next
Sec. 512. - Limitations on liability relating to material online
(a) Transitory Digital Network Communications. -
(b) System Caching. -
Interesting (Score:2)
Interestingly, the law cited makes explicit provision for several of the concerns I expressed in earlier posts in this thread, notably the issues of keeping the data up-to-date and of the information provider getting information from those visiting their site directly.
The normal Internet convention is that when I update my site, changes are immediately visible to everyone. (NB: browser caching is not equivalent to web caching here for several reasons.) Also, visitors to my site normally leave information
Re:Interesting (Score:1)
All in all, if that is the exemption I was referred to earlier in this thread, it looks as though the web caches are skating on very thin ice. If they did something like cloning material on a web site that was later removed in order to publish it in a book, I imagine they could wind up having a serious dispute with the publisher, or perhaps the author himself, either of whom might have a strong case that they suffered financially because of the actions of the caching site.
I'm not sure you're talking abou
Re:Interesting (Score:2)
I was referring to the post where someone said there was an exemption under copyright law for web caches. I assumed the parts of the DMCA that were cited here were that exemption. In that case the validity of the original claim appears to be less clear than was suggested.
Re:*copy* right (Score:1)
(FWIW, IANAL) Web site content is copyrighted. Therefore, you have a right to make your own personal copy, and backup copies, but it is not legal to redistribute those copies without the site owner's permission. I cannot imagine that the Wayback machine or the Google cache is legal. They are blatantly disregarding the site owners' copyright.
This confuses fair use on purchased items that you own with what you are allowed to do with temporary copies for viewing. By the same logic, you could legally tak
Archiving is important (Score:1)
Actually it's the DMCA (Score:2)
ON A RELATED NOTE (Score:2)
legality (Score:3, Informative)
Firstly, in the general case of search engines providing indexing of content, this is legal and there are legal cases to back it up (in the UK: antiquesportfolio) so long as the indexes are not copies.
Secondly, in the case of USENET groups and mailing lists, then in the process of submitting a message to the mailing list or group, you have given an implicit license for the message to be reproduced within the nature of the particular technology at hand. This means if at a later date you object to a message in a mailing list that you wrote in the past, you don't really have the ability to retract it. In all cases, anyone deciding to use the material in another way (e.g. creating a commercial CDROM of USENET material for a marked-up price) would be violating your (and others') copyright. However, if they were providing that CDROM as a distribution service for USENET itself (e.g. "get your monthly USENET CDROM") then this is probably within the bounds of legality as it is still transfer via the USENET system, and the cost is likely to be that to reflect media/distribution costs rather than some specific aim to make a commercial product out of your material.
Finally, in the specific case of copies of websites, yes this is a violation of copyright - but as far as I know this has not been tested in a court of law. The use of the Robots Exclusion Protocol and the NOARCHIVE, NOINDEX and NOFOLLOW elements allows a weasel argument suggesting that it is inherent in the WWW itself (as a new form of media / technology) that search engine indexing and archiving / caching is legal unless you specifically disallow it with this mechanism. It may also be the case that if this archiving / caching was carried out for profit or at a price greater than fair for distribution/media then a party is making an economic gain out of your material and this suggests an inequitable violation of your economic rights.
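For reference, the NOARCHIVE / NOINDEX / NOFOLLOW mechanism is just a meta tag in the page, and checking for it is easy. Below is a rough sketch using Python's standard html.parser; the class and function names are made up for illustration, and real crawlers may apply extra rules that are not modelled here.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            if token.strip():
                self.directives.add(token.strip().lower())

def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

def may_archive(html: str) -> bool:
    """True unless the page explicitly asks not to be archived or cached."""
    return "noarchive" not in robots_directives(html)

if __name__ == "__main__":
    page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOARCHIVE"></head></html>'
    print(sorted(robots_directives(page)), may_archive(page))
```

Note the default: a page with no tag at all comes back as archivable, which is the opt-out arrangement the argument above turns on.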
Another point to remember is that in WTO treaties that resulted in DMCA provisions, as enacted in the UK and EU, there are specific fair use allowances for intermediate copies of a copyright work as necessary for the telecommunications medium itself (this would seem to allow things like store-and-forward systems, and caching).
shrinkwrap/acceptable use policy? (Score:2, Interesting)
the court case that a company *lost* because it copied the data of a
competitor site and set its prices lower.
This is equivalent to Kroger hiring a few clerks to go down each day and
take prices of various objects on their wifi-equipped phones/handhelds in
a store so Safeway can undercut prices.
What, you didn't read the fine print on the Safeway door that says no price
comparisons or making up price lists? Or what...were they supposed to