ShaunC asks:
"As the webmaster of numerous sites, I'm curious how others feel about the Wayback Machine. What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998. I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies? I certainly didn't provide either. Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews."
This site last made an appearance on Slashdot earlier this year. Internet archival sites are right smack in the crosshairs of copyright, but they are useful. Anyone who has ever used Google's cache (and there are plenty of those links on Slashdot) can attest to this. Of course, the issue that may bug many content providers is how to opt out of such services, since some see archiving as a copyright violation. Is it possible to balance the issues of copyright and history, or will these two Internet resources find themselves in legal trouble in the future?
"The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out? I manage a number of domains and the process of refining robots.txt files and submitting myself to the Wayback Machine for removal seems to be intrusive. Worse, domains I've abandoned (which have lapsed or been re-registered by someone else) are forever archived in the Machine and I have no way to exclude them. Why should I have to deliberately remove my copyrighted material from an archive which was never granted permission to replicate that material in the first place?"
Yummy (Score:2, Informative)
"The Wayback Machine" (Score:3, Informative)
And yes, it obeys the Robots Exclusion Standard.
"Ask Google" strikes again; I would hope that you could find all of this information by searching, or reading an "About" page, or something. Fortunately, these abortions to journalism don't appear on the Front Page very often.
Robots.txt (Score:5, Informative)
Legally you can stop them, but why? (Score:3, Informative)
Of course, in practice you have to pursue this and ask them to remove it.
If you really object, I suggest sending them a list of every site you have or have had, with dates, and a request to remove everything. Then you only need to notify them when you put up a new site that should also be excluded. That would not be such a nuisance, would it?
That said, I think they are providing an interesting service, so unless you are harmed by it, why object?
I am interested in knowing how they had such old versions of your site though. Do search engines keep archives?
I love it. (Score:3, Informative)
What's the problem?
If you do something illegal on your website, you won't be held responsible more than once just because the data persists on the Wayback Machine. If you remove the offensive material from your site, that's all you can do. The Wayback Machine can deal with their own lawsuit threats. And I'm sure they'll remove material if you are the site owner and ask nicely.
As for outdated information, anyone reading pages on the Wayback Machine and expecting them to be current would have to be crazy. It's an archive, after all.
It's easy to opt out. Google provides instructions in their webmaster FAQ [google.com], which points out "There is a standard for robot exclusion at http://www.robotstxt.org/wc/norobots.html [robotstxt.org]."
Re:Erm (Score:2, Informative)
Re:Erm (Score:2, Informative)
User-agent: *
Disallow: /
Library archives are given broader copyright uses (Score:5, Informative)
Strange that such a complaint would appear within a group espousing that "information wants to be free."
Re:For what it's worth... (Score:2, Informative)
Re:Erm (Score:2, Informative)
It looks like the page has been removed! My guess is that if you request removal of a page and it doesn't exist anymore, they probably will remove it for you. This web page revealed me as the pothead and pro-marijuana person that I was (and still am, though in private) back in college. I was afraid my employers were going to find my old web page, but they're probably potheads too. But still, it's good to be able to cover up the silliness of my past.
Someone hasn't done their research (Score:4, Informative)
1) They've been archiving since 1998, but they've only recently had the horsepower to provide a live connection to it.
2) It is very easy to not have your stuff indexed. The directions are here. [archive.org]
Re:court evidence? (Score:2, Informative)
Re:Erm (Score:5, Informative)
User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE
... uploaded it, went back to the Wayback Machine, and got:
Robots.txt Query Exclusion.
We're sorry, access to [site] has been blocked by the site owner via robots.txt.
Read more about robots.txt
See the site's robots.txt file.
Try another request or click here to search for all pages on [site]
So, yeah, they seem to check the site for the most current robots.txt file before they show the archive. And if the robots.txt disallows archiving the site, ALL the entries are marked unavailable, not just the current ones.
So, it's pretty easy to solve the problem of the Wayback Machine -- and probably without going balls-out with the "disallow everything everywhere" like I did.
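For anyone copying a robots.txt like this, one detail is worth testing before uploading: under the exclusion standard, a `Disallow:` line with no path blocks nothing, while `Disallow: /` blocks the whole site. Below is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `can_crawl` and the `example.com` URL are made up for illustration, and this only shows how Python's parser reads the rules, not necessarily how the Wayback Machine's own crawler does.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "Disallow: /" shuts every robot out of every path.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# "Disallow:" with an empty value is the opposite: nothing is off limits.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

URL = "http://example.com/index.html"  # placeholder URL, not a real site of mine

print(can_crawl(BLOCK_ALL, "ia_archiver", URL))  # False
print(can_crawl(ALLOW_ALL, "ia_archiver", URL))  # True
```

So a missing slash quietly turns "keep out" into "come on in."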
Re:Like it or not, it's the law (Score:1, Informative)
law gives libraries and archives special fair use powers.
Re:Erm (Score:3, Informative)
User-agent: ia_archiver
Disallow: /
Most (all?) search engines provide information on how to specifically exclude their spiders (while allowing everyone else). Just go to the engine's site and search for info on how they treat robots.txt.
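If you want to confirm that a targeted rule shuts out only the one spider, you can run the file through a parser before uploading it. This is a rough sketch with Python's standard `urllib.robotparser`. The helper `blocked_agents` and the `example.com` URL are hypothetical, `Googlebot` is just used as an example of "everyone else", and a given engine's real crawler may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The targeted rule from above: only the ia_archiver user-agent is excluded.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def blocked_agents(robots_txt, agents, url):
    """Return the subset of `agents` that may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Placeholder URL; swap in a page from your own site.
print(blocked_agents(ROBOTS_TXT, ["ia_archiver", "Googlebot"], "http://example.com/"))
# ['ia_archiver']
```

Robots with no matching `User-agent` section (and no `*` section) are allowed by default, which is why the other spider gets through.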
Websites from 1998.. (Score:2, Informative)
I think if you use the Wayback Machine to go back to their own site in 1998/1999 their front page tells you this.