The Internet

The Wayback Machine, Friend or Foe? 508

ShaunC asks: "As the webmaster of numerous sites, I'm curious how others feel about the Wayback Machine. What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998. I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies? I certainly didn't provide either. Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews." This site last made an appearance on Slashdot earlier this year. Internet archival sites are right smack in the crosshairs of copyright, but they are useful. Anyone who has ever used Google's cache (and there are plenty of those links on Slashdot) can attest to this. Of course, the issue that may bug many content providers is how to opt out of such services, since some see archiving as a copyright violation. Is it possible to balance the issues of copyright and history, or will these two Internet resources find themselves in legal trouble in the future?

"The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out? I manage a number of domains and the process of refining robots.txt files and submitting myself to the Wayback Machine for removal seems to be intrusive. Worse, domains I've abandoned (which have lapsed or been re-registered by someone else) are forever archived in the Machine and I have no way to exclude them. Why should I have to deliberately remove my copyrighted material from an archive which was never granted permission to replicate that material in the first place?"

This discussion has been archived. No new comments can be posted.

  • Erm (Score:3, Insightful)

    by adamwright ( 536224 ) on Wednesday June 19, 2002 @05:37PM (#3732271) Homepage
    Isn't this exactly the point of robots.txt? Google won't cache content it doesn't spider, and it won't spider content forbidden by your robots.txt. Does the WayBack Machine obey the robots rules?
    • Re:Erm (Score:2, Informative)

      by JebusIsLord ( 566856 )
      Yes, it does follow the robots.txt protocol. Therefore there really isn't a problem now, is there?
    • Why do webmasters have to "opt-out" rather than "opt-in" to be cached?

      Shouldn't the default be "don't allow spiders and caching" ? And if I want it then I should specifically allow it.
      • Re:Erm (Score:5, Insightful)

        by kevinank ( 87560 ) on Wednesday June 19, 2002 @06:34PM (#3732672) Homepage
        The goal of the person who started archive.org was to record the history of the world wide web. The assumption was that whatever anyone thinks about the archive, there will never be another chance to go back and get that data once it is lost.

        The copies that they have archived in their databases are individual copies served from the original web requests, so they have the right to keep them. They became their copy when they were originally downloaded. Whether they have the right to make new copies and redistribute them depends on how you think fair use applies to that content.

        Ultimately if a lot of people start suing them they will probably shut down the archive to public access and only allow researchers to view their original copies on site. And if you'd prefer that, well, you'll end up with the world you deserve.

      • by account_deleted ( 4530225 ) on Wednesday June 19, 2002 @06:42PM (#3732710)
        Comment removed based on user account deletion
    • Re:Erm (Score:2, Informative)

      by zootread ( 569199 )
      Yeah, you can add a robots.txt file and ask them to remove your site [archive.org] and it'll be wiped from their records. The problem is, if you don't have access to the site anymore, you can't throw in the robots.txt file. But I just checked on a web page I requested they remove. It no longer existed, so I couldn't put up a robots.txt file, but I made the request anyway.

      It looks like the page has been removed! My guess is if you request to remove a page and it doesn't exist anymore, they probably will remove it for you. This web page revealed me as the pothead and pro-marijuana person that I was (and still am, though in private) back in college. I was afraid my employers were going to find my old web page, but they're probably potheads too. But still, it's good to be able to cover up the silliness of my past.
    • Re:Erm (Score:5, Informative)

      by dswensen ( 252552 ) on Wednesday June 19, 2002 @06:59PM (#3732837) Homepage
      Yes it does, and how. In fact, immediately upon reading this story, I went to the Wayback Machine and checked out my personal website archive. There it was, material dating back to 1996 ("Oh God, no, not the digging man GIF!"). I made a new robots.txt file:

      User-agent: *
      Disallow: /
      # BITE ME WAYBACK MACHINE

      ... uploaded it, went back to the Wayback Machine, and got:

      Robots.txt Query Exclusion.

      We're sorry, access to [site] has been blocked by the site owner via robots.txt.
      Read more about robots.txt
      See the site's robots.txt file.
      Try another request or click here to search for all pages on [site]

      So, yeah, they seem to check the site for the most current robots.txt file before they show the archive. And if the robots.txt disallows archiving the site, ALL the entries are marked unavailable, not just the current ones.

      So, it's pretty easy to solve the problem of the Wayback Machine -- and probably without going balls-out with the "disallow everything everywhere" like I did.
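The behaviour observed above (the archive consults the site's current robots.txt before showing anything, and a blanket disallow hides every snapshot, old or new) can be sketched roughly as follows. This is only an illustration built on Python's standard `urllib.robotparser`; the function name `visible_snapshots`, the snapshot labels, and the `example.com` URL are hypothetical, and nothing here is the Wayback Machine's actual code.

```python
from urllib.robotparser import RobotFileParser


def visible_snapshots(snapshots, current_robots_txt, site_url, agent="ia_archiver"):
    """Return the snapshots that may be shown, judged by the CURRENT robots.txt.

    Assumption for this sketch: one check against today's robots.txt gates
    every stored snapshot, regardless of when it was captured.
    """
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    if parser.can_fetch(agent, site_url):
        return list(snapshots)
    return []  # blocked now, so every past capture is hidden too


# The robots.txt from the comment above, comment line included.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE
"""

snapshots = ["1996-12-01", "1999-05-17", "2002-03-30"]  # hypothetical capture dates
hidden = visible_snapshots(snapshots, BLOCK_EVERYTHING, "http://example.com/")
shown = visible_snapshots(snapshots, "", "http://example.com/")
print(hidden)
print(shown)
```

With the blanket disallow in place the first call returns an empty list; with an empty robots.txt all three hypothetical captures are returned.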
      • Re:Erm (Score:3, Informative)

        by guttentag ( 313541 )
        A number of people who don't want their content archived by the Internet Archiver may still want search engines to direct traffic to their sites (The Washington Post [washingtonpost.com] does this). If that's the case, use this in your robots.txt file:

        User-agent: ia_archiver
        Disallow: /

        Most (all?) search engines provide information on how to specifically exclude their spiders (while allowing everyone else). Just go to the engine's site and search for info on how they treat robots.txt.
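To see that the two-line rule above shuts out only the archiver while leaving other crawlers alone, it can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch: the helper name `allowed_agents` and the `example.com` URL are made up for illustration, and "Googlebot" stands in for any search-engine user agent.

```python
from urllib.robotparser import RobotFileParser

# The rule from the comment above: block ia_archiver, say nothing about others.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user agent to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["ia_archiver", "Googlebot"],
    "http://example.com/index.html",
)
print(result)
```

Because there is no `User-agent: *` section, any crawler not named in the file falls through to "allowed", which is the effect the comment describes.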

  • Yummy (Score:2, Informative)

    by sheepab ( 461960 )
    Slashdot from 1997 [archive.org].
    • Very nice. And it's good to know they were using the same careful journalism back then. I like this headline:

      Judge Uninstalls IE in 90 seconds.
  • by pb ( 1020 ) on Wednesday June 19, 2002 @05:40PM (#3732287)
    "The Wayback Machine" has been a pet project for a long time, and we're only now seeing results. I know for a fact that they have pages back at least as far as 1996, and it's a damn shame they don't have anything that much earlier...

    And yes, it obeys the Robot Exclusion Principle.

    "Ask Google" strikes again; I would hope that you could find all of this information by searching, or reading an "About" page, or something. Fortunately, these abortions to journalism don't appear on the Front Page very often.
    • by Disevidence ( 576586 ) on Wednesday June 19, 2002 @05:44PM (#3732329) Homepage Journal
      I think the question is not about its being publicly available, but rather about it archiving web pages that were taken down at later dates for various reasons.

      It's legally grey, and all it really takes is for some paranoid person to sue, and then the fireworks start.

      IANAL.
      • by martyn s ( 444964 ) on Wednesday June 19, 2002 @05:58PM (#3732455)
        So I suppose libraries should just stop carrying books because the author doesn't like what he wrote anymore? I mean, what the fuck?
        • I'm not saying what's right or wrong; I'm saying he could possibly sue, however.

          Person A puts up a website about X. Wayback Machine archives this website. Later, unbeknownst (sp?) to the Wayback Machine, Person A is sued by someone who controls or has copyright on X, and it's taken down. Yet that copy is still on the Wayback Machine.

          What if the Wayback Machine archives a link to DeCSS? They do archive forums; I've checked numerous old, old forum posts through the Wayback Machine.

          I just think it's a bit iffy about archiving stuff...
          • Possibly sue for what? It's not libellous to (truthfully) say "n years ago, so and so said 'whatever'".

            It's always been the case that "if you don't want a future potential employer to read it, don't put it out in public". If a newspaper prints a libellous story, they issue a retraction, they don't seek out and destroy all copies of the paper.
        • So I suppose libraries should just stop carrying books because the author doesn't like what he wrote anymore? I mean, what the fuck?

          The issue is more akin to a library making a copy of a book and giving out copies of that copy to anyone who asks.

          • by Rick the Red ( 307103 ) <Rick,The,Red&gmail,com> on Wednesday June 19, 2002 @06:34PM (#3732676) Journal
            No, the issue is more akin to a library carrying newspapers and magazines for years, and their publishers suddenly telling the libraries "those copies are out of date, stop letting people read them." Why? If you didn't want anyone to read it, why did you put it out on the web?

            Are you ashamed of what you did back then, when you were young and foolish? Grow up -- we're all ashamed of what we did when we were young and foolish, and years from now you'll be ashamed of what you're doing today. Get over it.

            Personally, I think archives are great. Whenever I design an application I always ask about archiving, because inevitably they're gonna want it and it's easier to design in from the start. Oh, you want to know what your top 10 customers ordered last Christmas? Now you tell me! Geeze, we flushed that data last February, 'cause you said once the credit card cleared you didn't care to pay for the storage. But I digress.

            Someday your next client will want examples of your previous work, then you'll go crawling on your hands and knees to the Wayback Machine, begging them to show you what your pages looked like. And they'll honor your robots.txt file and tell you to get lost.

            • by pjrc ( 134994 )
              Are you ashamed of what you did back then, when you were young and foolish?

              I am. Well, sorta anyway. My site has all of the pages that have ever appeared, all the way back to 1995. For example, this circuit board schematic page got a lot of hits in 1995 [pjrc.com]. For years, I got emails from people who attempted to build it... a few were successes but most were failures. So, in 1997 I redesigned the board/schematic [pjrc.com] so that it would be much easier to build and troubleshoot, and then I made another new rev in 1999 [pjrc.com] (because the flash ROM chip became obsolete).

              Based on lots of user feedback, I redesigned it yet again in 2001 [pjrc.com], mainly to increase the speed, add more memory to be C compiler friendly, and I added the most user-requested feature, a port to plug in a standard LCD.

              Today, those old pages (well, still need to update the '99 ones) have a message at the top of the page that tells the visitor they're viewing obsolete material and strongly suggests they follow a link to the new version of the circuit board, which is easier to build (added in 1997), uses parts that are currently available on the market (added in 1999), and has more features (added in 2001).

              An archive of the original 1995 page, even archived in 1996, isn't going to warn the poor user about the usability improvements added in 1997, the part that became obsolete in 1999, and the nice new features that were added in 2001. At the very least, it'd be proper for archive.org to link to the current version of the page (if it's on-line)... but even that would be difficult since the site moved from a university to its permanent domain name in 1999 (the old site kept a redirect for a couple of years, but even that is gone now).

              So, while it sucks that someone might find that old material and suffer through all the problems that have been corrected and miss out on the improvements of the last several years, it doesn't suck enough that I'd hire a lawyer, or even bother to tell them to exclude my material.

              But I can understand how a large company would not want its old products displayed with the then-current literature in a way that might confuse potential customers.

        • Actually, if a book is declared obscene or libellous, a library may well stop carrying it, and the Wayback machine has the same problem.

          And while it is sometimes delightful that it preserves things that, eg, Big Companies may prefer we didn't see, it's less delightful that the ramblings of a 17 year old's blog may come back to haunt them years later...
  • Robots.txt (Score:5, Informative)

    by mshowman ( 542844 ) on Wednesday June 19, 2002 @05:41PM (#3732291)
    I had recently placed a restricted robots.txt file on my site and when trying to access any of the past revisions, I get a message saying that the owner has restricted access to the site via robots.txt. They seem to have that aspect under control.
    • Re:Robots.txt (Score:3, Interesting)

      by ShaunC ( 203807 )
      Sigh. This, I suppose, is what happens when Slashdot keeps stories in the queue too long:
      2002-03-30 10:12:57 The Wayback Machine, friend or foe? (askslashdot,news) (accepted)
      At the time, I was having severe problems getting in touch with anyone at The Wayback Machine. Yes, their site makes it quite clear how to have your site removed. Yes, I placed the appropriate entry in my robots.txt files. Yes, I submitted my sites for exclusion. Then nothing happened. After emailing them several times with a list of domains I'd prefer to have removed from the archive, I got a reply back saying they should disappear by the end of the following day. No go.

      That's all changed. They've got the kinks worked out, as best I can tell, and have begun obeying robots.txt files. They weren't so diligent about it three months ago, or I wouldn't have gotten ticked at 'em.

      BTW, my submission was edited in at least one place: I don't capitalize the word "SPAM," as the capitalized version is Hormel's trademark. (Maybe my submission was combined with someone else's; hard to remember what I wrote 3 months ago.)

      Everything else I'd say has already been said, I wish I'd noticed the story sooner.

      Shaun
  • by Anonymous Coward on Wednesday June 19, 2002 @05:41PM (#3732294)
    It's a scary thought that things kids are saying on message boards when they're teenagers are going to be back to haunt them when they apply for jobs in their mid 40s...

    I mean, if everything I posted on BBSes in the 1980s were still attributable to me... yikes.

    Remember kids. Use a nickname, and change it frequently if you ever want to run for any kind of office.

    • by TheMonkeyDepartment ( 413269 ) on Wednesday June 19, 2002 @05:44PM (#3732323)
      Well, that's a great point, and it's a good illustration of the double-edged sword of free speech. You are free to say whatever dumbshit, ridiculous things you want. But you are also free to deal with the social consequences.
    • by rhaig ( 24891 ) <rhaig@acm.org> on Wednesday June 19, 2002 @06:19PM (#3732591) Homepage
      dejanews was my best tool to weed out resumes

      Before I scheduled even a phone interview, I'd always search dejanews for the person in question. Sometimes I'd come up with a definite hit (first and last name as well as email and mentioning the local area or some work that was on their resume) and I'd be able to see what kind of person I was really dealing with. That's when I started looking at what I'd posted.
      • by madmancarman ( 100642 ) on Wednesday June 19, 2002 @11:38PM (#3733974)
        dejanews was my best tool to weed out resumes

        Before I scheduled even a phone interview, I'd always search dejanews for the person in question. Sometimes I'd come up with a definite hit (first and last name as well as email and mentioning the local area or some work that was on their resume) and I'd be able to see what kind of person I was really dealing with. That's when I started looking at what I'd posted.

        This kind of freaked me out when I started teaching in 1998 - I'd been running a large fan web site devoted to one of my favorite bands, and being heavily into the band, I posted a lot in their newsgroup and participated in more than one flame war. Of course, I was in college and in my very early 20's and late teens, but it's all archived on DejaNews now, with no way to remove it. I really doubt any public school districts are going to wise up to this (or even care, considering the national teacher shortage), but I wouldn't be surprised if it came back to haunt me in some way some day. As a previous poster mentioned, such is the burden of free speech.

        An interesting thing did happen to me at the beginning of this school year. I teach high school computer classes, and I was talking about managing that fan web site when one of my students (a junior) opened his eyes really big and pointed at me with his jaw dropped, sort of aghast. I paused and asked him what was wrong, and he exclaimed that he downloaded and used the guitar tabs I'd written years earlier when he was in junior high. I found that kind of amusing!

        I think the archiving of the internet is particularly scary when I can still find a lousy guitar tab of Pearl Jam's "Footsteps" [guitaretab.com] that I did back in 1992, when I was a senior in high school piggybacking off an account at the nearby university, on my parents' Apple //e, while I was still learning how to play guitar. Obviously, the internet can have a much longer shelf life than a ProDOS 5.25" floppy (excluding news sites [msnbc.com] that "expire" their articles after limited availability).

        First they ignore you, then they laugh at you, then they fight you, then you win. -- Gandhi

    • The problem with this is that as the copyright owner you cannot convince Google to remove your old usenet posts unless you still have that 1998 email address. It's a ridiculous requirement for Google to ask people to still have their college email accounts when they want the posts they own and wrote removed from Google's system. A simple proof of ID should be enough for them.

      An inaccessible copyright policy is like having no policy at all. Expect fallout sooner or later.
  • by TheMonkeyDepartment ( 413269 ) on Wednesday June 19, 2002 @05:41PM (#3732296)
    When you publish something on the web, it is publicly available via HTTP. End of story. Responsible netizens can observe the requests of "robots.txt" but they don't have to. If you want something more controlled, create a VPN or intranet or some other kind of non-public data server.

    Your argument is similar to that of newspaper publishers who didn't like "deep linking." What they couldn't (or didn't want to) understand is that the nature of an HTTP web server is quite simple. A client asks for a file, the server gives it back. Using that protocol implies that you are OK with that. If you're not, I suggest you look into different technologies, instead of complaining about lack of control, in a medium that was never intended to provide it.

    • Exactly what I wanted to say. Of course, when you put something on the Internet you don't expect it to be archived forever, but you have to keep in mind that anyone can download it and do what they want.
    • by KillerCow ( 213458 ) on Wednesday June 19, 2002 @05:56PM (#3732444)
      When you publish something on the web, it is publicly available via HTTP. End of story.

      I don't think that that is a good enough standard. When a television show is broadcast, or when a book is published, it is publicly available, but we don't think that the publisher loses their right to copyright protection in these cases. Publishing on the web is similar. The creator wants people to see his/her creation, but does not automatically give visitors the right to archive and retransmit the works.
      • If the creator wants people to see his/her creation but not give them the right to archive and retransmit it, they can, just like always, put a (C) at the bottom of their webpage stating that redistribution of the work without express permission of blah is prohibited. Obviously that raises the question of how many people the creator would want to see his/her work in the first place. If they want to be selective, then be selective: write a webapp that only allows registered users who have agreed to a non-disclosure or no-redistribution license, etc. There are many ways to go about it so long as the creator understands "When you publish something on the web, it is publicly available via HTTP".
      • Actually, in Canada (I am an American, but I'm married to a Canuck) anybody can rebroadcast anything. The deal is, though, that they cannot change it: can't remove advertisements, can't shorten or lengthen it, can't add commentary over it or put up their own logo. Kind of a neat idea.
    • When you publish something on the web, it is publicly available via HTTP. End of story.

      Exactly. By publishing online and publicly you've already opted-in.

      This is just another example of how incompatible copyright is with any kind of normalcy vis-a-vis individual freedom and, in this particular case, the freedom to archive information and hold someone accountable if they try to change it retroactively (and on the sly). Unless we want Orwellian-style changing of the facts post facto, copyright must lose to the right of archivists to preserve information from being lost. Any other policy would be disastrous.
      • Archiving is one thing, rebroadcasting (or rehosting as is the case here) is another. By copyrighting my site, I reserve the sole right to host a server that distributes that content. Nobody else is given a right, express or implied, to 'mirror' my site, regardless of whether it's for archival purposes or not. That's the consideration that needs to be understood here. Archives are great - I often make use of Google's cache, but only if the *real* content I'm trying to reach is behind a slow connection or down entirely. Technically, Google cache should reserve copyright, too - a concept that would certainly kill off the practice. Is it worth it to lose the convenience? Possibly ... think about the ramifications and think hard. If it's permissible to archive and host a site's content, why wouldn't it be permissible to archive and broadcast the ST:TNG episodes you've so faithfully taped, w/o paying the royalties to whoever holds the rights to that? Seems like it's pretty much the same thing to me, eh?

        Now - if you yourself want to archive a site that you're interested in, or if you want to contact the maintainer of a site for something you're looking for that used to be on his/her site, that's perfectly legit and respectable. I personally have all the previous sites for my company archived - if someone wants something that's no longer on our current site, they can certainly ask and we'll try to fulfill their request.

        I dunno ... I'm up in the air on this, but I entirely understand the copyright ramifications of the situation.
    • When you publish something on the web, it is publicly available via HTTP. End of story.

      Ah...as a previous post pointed out, I don't think kids should have their remarks recorded forever. I doubt I would have made it as far as I have if my BBS quotes were still around...
      • I doubt I would have made it as far as I have if my BBS quotes were still around...

        I don't see why not. I have lots of fun quotes floating around in google, deja* and slashdot, but my employers have never called me on them. Do you think HR has nothing better to do than find ways to embarrass potential employees? Perhaps some places, but I wouldn't want to work there....

        Public office is a different matter, but honestly, I don't see how embarrassing things said on the net are any worse than embarrassing things done in the public record (like the famous Newt G. divorce-your-wife-while-she's-in-the-hospital-with-cancer debacle).

    • by krypto246 ( 567378 ) on Wednesday June 19, 2002 @06:45PM (#3732725)
      People are just pissed about this archiving because they like the internet to be a 100% responsibility-free zone: no matter what you say or do, you can always change, edit or delete it later. How about standing behind your comments and opinions, instead of just deleting them when they can be held against you? Yes, use nicknames and aliases, but don't expect the things you put out there to be temporary. You put something out onto the internet, it stays there, and it can be found later. That's the power of the net, and the price you pay for it.
  • by wompser ( 165008 ) on Wednesday June 19, 2002 @05:42PM (#3732300)
    Went back and looked at the site for the .com I used to work for, very nostalgic. The wayback machine is a good resource for people who create content on someone's site (a.k.a. me), and then lose access to it because the company goes under. Now I'm able to add my old content to my portfolio, now that the company who once owned it is gone.
  • Permission... (Score:3, Insightful)

    by gorf ( 182301 ) on Wednesday June 19, 2002 @05:42PM (#3732307)

    who gave them permission to make those copies?

    The way I see it, you implicitly give people some limited form of permission by putting it up on the internet freely available to download in the first place. You put it up for people to download, print out and so forth (which amounts to copying), and therefore you've implied that people may do so.

    Sure, you own copyright, and blatant plagiarism is something that clearly is wrong. But I see nothing wrong with taking an article that you published on the web and reproducing it, as long as it is taken in context and is clearly attributed (and it is made obvious that the copy isn't the original, but proper attribution would do this and therefore suffice).

    Of course, this is republication and so the issue is not so clear and obviously subjective. That's just my opinion.

  • by the_womble ( 580291 ) on Wednesday June 19, 2002 @05:43PM (#3732310) Homepage Journal
    If you own the copyright they cannot archive it without your permission, legally; that is all there is to it.

    Of course in practice you have to pursue this and ask them to remove it.

    If you really object I suggest a list of every site you have or have had, with dates, and a request to remove everything. Then you only need to notify them when you put up a new site that should also be excluded. That would not be such a nuisance, would it?

    That said I think they are providing a service that is interesting so unless you are harmed by it, why object?

    I am interested in knowing how they had such old versions of your site though. Do search engines keep archives?

  • www.cisco.com, 1 page (1996)
    www.microsoft.com, 5 pages (1996)
    www.ibm.com, 7 pages (1996)

    This is in the FAQ [archive.org].
  • As someone who makes lots of free sellable [furinkan.net] and unsellable [furinkan.net] content, I think The Wayback Machine is an invaluable resource. I can look back and see how big a dork I was and still am. I've also found stuff of mine that I've lost over time, amazed that anyone ever bothered to hold on to it.
    • I've also found stuff of mine that I've lost over time, amazed that anyone ever bothered to hold on to it.

      Yes, I've used it for this too. I'm a volunteer webmaster for a site (www.fudgerpg.com) where we have a "monthly spotlight", but foolishly I wasn't keeping track of past spotlights. Eventually I wanted to put together a list of past spotlights, and realized that I hadn't kept that list. I felt stupid. The Wayback Machine (mostly) came to my rescue there.

      -Rob

  • Ah, Gee! (Score:2, Funny)

    Sherman: Mr. Peabody, I want to go back in time!

    Mr. Peabody: Be quiet, Sherman. This new Wayback Machine is now accessible via a browser. Be happy with that.

    Sherman: But I wanted to go back in time and watch Cleopatra taking one of those milk baths again.

    Mr. Peabody: .... Damn it, boy, fire up the Wayback machine. And fetch me my chew toy.
  • Do I have permission to copy the content of your site to my browser history directory, and if so, how long do I have permission to keep it? Can I show a copy of an html document that is stored in my browser history to my mother? What about my neighbor? Or the dude in another country I happen to be chatting with online?

    IANAL blah blah blah, but once you open your files up to being downloaded and stored by a browser, you've pretty much given up the right to tell people they can't be re-distributed--I would think the best you could hope for is that people would re-distribute them, in whole, the way you originally released them.

  • I like it but... (Score:4, Insightful)

    by rknop ( 240417 ) on Wednesday June 19, 2002 @05:44PM (#3732330) Homepage

    When I first discovered it, it was a lot of fun. Much nostalgia; it was fun seeing earlier versions of my webpages. Some go back quite a number of years.

    On the other hand, I was horrified when I realized that there was full archiving of www.dramex.org. If you visit that site, you will see that there are a large number of scripts (as in plays), many of which have restrictions on use. Over the years, we've had people request that scripts be removed from the site; of course, we did so. However, they weren't necessarily removed from the archive, and an archive keeps them forever. Specifically with the wayback machine, I was able to submit stuff that removed the specific directories I was worried about (they don't archive the scripts from www.dramex.org, just the "front page" stuff which is all part of the fun), and keep them from doing it again.

    I like the idea of archives; it preserves history. The web is a transient medium, but not entirely. Yes, much of the content is dynamic and should only be dynamic. Some of it, though, is like the front page of a newspaper. Each day, what's on "today's front page" is different-- but there is value and use in seeing what was on the front page in any day in history.

    But sometimes you need to delete something and make sure it really is no longer available. When you don't completely control your site (i.e. somebody else archives it, rather than just mirrors it), that becomes impossible.

    (Incremental backups can have a similar issue. If you only back up files which are "newer than the last backup", your backup doesn't have the information about files which have been *deleted* since the last backup. When you restore, you might find some files there you thought shouldn't exist any more.)
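The incremental-backup pitfall described in the parenthetical above can be sketched in a few lines. This is a toy model, not any real backup tool: files are dictionary entries mapping a name to a (modification time, content) pair, and the function names and file names are invented for the illustration.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, Tuple[int, str]]  # file name -> (modification time, content)


def incremental_backup(archive: Snapshot, live: Snapshot, last_backup: int) -> None:
    """Copy only files modified after last_backup; deletions are never recorded."""
    for name, (mtime, content) in live.items():
        if mtime > last_backup:
            archive[name] = (mtime, content)


def incremental_backup_with_deletions(archive: Snapshot, live: Snapshot, last_backup: int) -> None:
    """Same as above, but also drop archived files that no longer exist live."""
    incremental_backup(archive, live, last_backup)
    for name in list(archive):
        if name not in live:
            del archive[name]


# Time 1: full backup of two files.
live: Snapshot = {"index.html": (1, "front page v1"), "script.txt": (1, "a play")}
naive_archive = dict(live)
tracked_archive = dict(live)

# Time 2: the script is withdrawn and the front page is edited.
del live["script.txt"]
live["index.html"] = (2, "front page v2")

incremental_backup(naive_archive, live, last_backup=1)
incremental_backup_with_deletions(tracked_archive, live, last_backup=1)

print(sorted(naive_archive))    # the withdrawn script is still in the archive
print(sorted(tracked_archive))  # only the file that still exists remains
```

Restoring from the first archive would bring the withdrawn script back, which is the same effect a web archive has on material the site owner has since removed.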

    (Dramex.org has changed so that it's not straightforward to get directly to the scripts any more. META tags tell the search engines to leave the actual scripts alone, and you can only get the text itself via CGI. Yes, it's easy to subvert if you put your mind to it, but at least you do have to put your mind to it, and automated search engines or archivers won't. 90% of the security for 1% of the effort.)

    -Rob

  • I love it. (Score:3, Informative)

    by gripdamage ( 529664 ) on Wednesday June 19, 2002 @05:45PM (#3732334)

    What's the problem?

    If you do something illegal on your website, you won't be held responsible more than once just because the data persists on the Wayback machine. If you remove the offensive material from your site, that's all you can do. The Wayback machine can deal with their own lawsuit threats. And I'm sure they'll remove material if you are the site owner and ask nicely.

    As far as outdated information, anyone reading pages on the wayback machine and expecting them to be current would have to be crazy. It's an archive after all.

    It's easy to opt out. Google provides instructions in their webmaster FAQ [google.com], which points out "There is a standard for robot exclusion at http://www.robotstxt.org/wc/norobots.html [robotstxt.org]."

  • by schon ( 31600 ) on Wednesday June 19, 2002 @05:45PM (#3732338)
    As a webmaster of various sites, I have no problem with archives.. if I didn't want people to see my stuff, I wouldn't have put it on the internet in the first place.

    where did they get such old copies of my websites, and who gave them permission to make those copies?

    They probably got the copies the same way everybody else did - by surfing. You (implicitly) gave them permission to cache your sites by not including an appropriate entry in your robots.txt.

    The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out?

    Archives are nothing like spam. Spam is primarily harassment. These guys aren't harassing you. They did ask your permission (by way of checking your robots.txt). If you've since changed your mind, it's your responsibility to notify them.

    Google caches material too - do you consider them to be spam as well?

    Archive sites provide a valuable resource to the rest of the 'net. If you don't like it, put an appropriate entry in your robots.txt file, and be done with it.
    • The parent post said almost everything I was going to say. One thing I wanted to add: if a spider is able to get to a page at all (even a spider that doesn't follow the robots protocol, which the Wayback Machine does), it is only seeing a public page that anyone with an internet connection can get to.

      Otherwise you have bad control over your content and need to update your web server to not serve that content. If you don't want people to be able to copy your information then don't give it to them. Or only give it to them in a signed format that cannot be easily duplicated.

      It's like being surprised that someone has forwarded an email that you sent them.
  • The submitter states that he never gave the Internet Archive permission to replicate his work. He is wrong.

    By placing material on the web, one is implicitly granting permission for it to be read. If I put a poster up in my window, I lose the right to complain if someone walking by on the street reads it.

    Equally, I lose the right to complain if someone walks by and takes a photograph of the front of my house, including the poster. The fact that someone might then be able to read the poster ten years from now is irrelevant.

    If the Internet Archive were required to seek permission before archiving freely and publicly available material, then the same argument would require libraries to seek permission prior to archiving (free) newspapers.

    Timeshifting is fair use, and it applies to web pages just as well as TV signals.
    • A person may take a picture of the front of your house and of you and your painting for personal use.

      Now, when that person redistributes it, then it becomes an issue of fair use, copyright and license.

  • I would never have visited countless sites I regularly surf to. Google has definitely been a major gateway to the internet for me.

    I think making an issue of the caching is a moot point, as about 99% of the time I go to the website itself for the content, since the source is always better than the cache. I use the cache only when the content has disappeared or, in some cases, when the website itself is gone.

    This is a valuable service Google is providing-- and webmasters get it for free.
  • by Chiasmus_ ( 171285 ) on Wednesday June 19, 2002 @05:46PM (#3732344) Journal
    I doubt that I'm alone in my belief that it is always tragic when any piece of information--no matter how trivial--is lost forever.

    If a person has offered that information for free at any point, to the extent that an automated script could access it, then I believe that information can be safely considered public domain. I doubt that there's any mechanism by which Richard M. Stallman could lose his mind and "rein in" all copies of GNU, or by which Stephen King could recall all his novels and refund the purchase price; once something is offered to the public, it no longer belongs exclusively to the publisher.

    In my opinion, the value of archives in the future immeasurably outweighs occasional inconveniences of having information stick around longer than the author would have wished.
    • Throwing out so much dated information would mean discarding a critical part of our written history. Did you notice how the multitude of Y2K disaster sites changed from 1999 to 2000? That is history.

      If the courts are going to outlaw archives of the Internet, I suggest they do a complete job of suppression and order the burning of all books, newspapers, and magazines more than a year old.

      Then authors will be free to rewrite history as they wish.
  • "I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies?"

    You sound like Television broadcasters when you say something like that. "We'll broadcast content over the airwaves, but you better not capture it!"

    Well, let me make it simple for you: When you make something public you cannot expect to bottle it up later. That's the whole reason that the internet is in existence: Extreme redundancy so that data is never lost. The original idea was to build a data network that could survive a nuclear attack.

    I don't think anybody should ever post stuff on the web without expecting it to last forever in some form or another, regardless of whether permission is granted.
  • "Of course, the issue that may bug many content providers is how to opt-out of such services, since some see it as a copyright violation."

    So I need to burn all my old comics? Or perhaps I don't need to ever allow anybody to look at them?

    Caches aren't republishing information, they are archiving it. That's what libraries do too. Hell, they can even charge for the service if they want and still be in the moral right.
  • I browsed all of your sites (even the abandoned ones) and since my browser cache is set to 782TB (and I'm still running Netscape 1.0N), your sites are still there. And my cache is publicly accessible via my webserver. Yet another way you're being violated. Ah, the risks and perils of publishing on a public network.
  • by Waffle Iron ( 339739 ) on Wednesday June 19, 2002 @05:48PM (#3732366)
    If the courts determine that it is technically illegal to make archives of electronic content, then the copyright laws should be changed to explicitly allow archiving. Otherwise, we could eventually lose track of history. The only written record of large portions of our civilization would be relegated to a few rusting web server hard drives buried in landfills.

    If you read 1984, you might remember that the government tightly controlled all old copies of documents so that they could manipulate history as they wished. We might get into a similar situation by accident if we don't allow independent archives of electronic information.

    With traditional media, you publish something on paper, but you don't get to control who puts the paper copies in which archives. That has served us well for keeping track of history, and an equivalent system needs to be maintained for electronic content.

  • Do libraries have to get permission to save and allow browsing of copies of newspapers (both physical and microfiche)?
  • For the most part I don't have a problem with them archiving my sites (after all, they can show me what a site used to look like faster than I can dig out my backups), but recently one of my customers told me to remove all traces of a product from their site (something about nasty litigation). I pulled the info off our servers quickly, but three hours later I got a nasty phone call from the customer saying he could still see the product on the site. It seems it was hung up in some proxy server between here and there.

    Back to the point: how do you deal with an archive when you need to get rid of information that is a liability to you now? Maybe we are better off without them in some cases.
  • by Da J Rob ( 469571 ) on Wednesday June 19, 2002 @05:50PM (#3732387) Homepage
    I was talking to this guy who works for a web hosting company [hostirian.com], and he says a fourth of his sales calls are people calling him up cause they're pissed that their last hosting company 'lost' their site. (In reality, most of the time it's later found out that the guy deleted it himself or renamed index.html to index2.html, etc.) He says that for 90% of the sites he can find a copy on the Wayback Machine. He'll then start to quote the website's contents to the guy on the phone and usually will have the amazed (and dumbfounded) customer signing a hosting contract by the end of the day.
  • Use robots.txt, stupido. It lets you prevent search engines from indexing and archiving your property. However, if you're that concerned about people copying your pages, you might try avoiding the internet.

    I personally love the internet archive and google's cache.
  • For such a "webMASTER" this guy doesn't seem to know a lot about the Internet. He seems more concerned with keeping his "Intellectual Property" safe than with actually understanding the way things work.

    People like this ruin the concept of the Internet, the free exchange of knowledge. I hope other people on /. feel the same.

  • by www.sorehands.com ( 142825 ) on Wednesday June 19, 2002 @05:57PM (#3732449) Homepage
    It could be argued that the site is publicly available and thus anyone can copy it. There is also the issue of fair use. That is why many people place terms of use and robots.txt files on their sites. It could even be a DMCA violation where an IP (or range) has been blocked, so people from that IP use the Google cache to bypass the block.


    I don't mind that my site is being added to indexes that the public have use of for free. I have a problem where a company uses my site to make a profit, with no public benefit.


    There is case law where unauthorized access to a website is a copyright violation.


    I am trying to use copyright law against some of the spammers who scrape my site for email addresses. Then, go after the spam software companies for contributory infringement (let the napster rulings serve some good).

  • I understand the concerns, but I think it's a part of the net, a good part, that we have to wrap our minds around.

    Especially when you mention Usenet archives, which are (ok, get ready to laugh) historically important. I'm not kidding! There is a little signal in there, it's a cultural brain dump, and that's of historic interest.

    I think the rub is, if the archive presents the data exactly as you presented it (that is, it doesn't play with your content, present it in a frame or otherwise embed it as their own content), then it is a fair archive, a ghost of your site still walking the internet. There is no taking it back once you post it.
  • TV Broadcast analogy (Score:4, Interesting)

    by rknop ( 240417 ) on Wednesday June 19, 2002 @05:58PM (#3732456) Homepage

    Some have already drawn analogies to TV broadcasts, saying hey, it was broadcast, you get to keep a copy. You can't bitch now if people still have that copy, unless you're Jack Valenti.

    You can spin this how you want. Here's one valid way to think about it though: a TV network broadcasts a show. You make a private copy on a VCR tape. Jack Valenti aside, you can watch that copy again as often as you like, and it's no big deal. However, you do *not* have the right to rebroadcast your copy of that show to the public without the permission of the original copyright holder. (I have my B5 tapes. I'm watching them through again now, showing them to my wife. I'm sure nobody is upset about this. But I'd be in deep doo-doo if I managed to broadcast them on a local access station, or uploaded them to a public website.)

    If you are inclined to be negative about the Wayback Machine, you could view it this way. While the page existed on the original site, it was broadcast to the public. If somebody made a personal copy, they have it and will always have it, even if the site goes down. However, when the site goes down, individuals do not necessarily have the right to then "rebroadcast" (i.e. post) themselves the content they downloaded and kept. This, however, is what the WayBack machine is doing.

    Mind you, except for the issue with www.dramex.org that I noted above (and which I fixed long ago), I like the WayBack machine, and am happy that they archived the content which was implicitly copyrighted to me. I would have opted in if I had wanted to. But, of course, I didn't know about it back in 1996 to opt in.

    I don't have a good answer to the questions. Just thoughts.

    -Rob

  • There is nothing worse than revisionist history. I can't stand seeing sites that post something that a bit later vanishes forever, or that alter it and remove the very thing I was interested in.
    There are several GPL'ed Open Source software packages that I have copies of that have vanished, along with all references to them, and are no longer available on the net. Also a number of great sites that have come and gone for lack of either cash or time. I think if someone open sources something it should stay that way.

    Also if it's open on the net for public viewing, then it should be fair game. Especially if the original author is credited and it is in the original context, like the Wayback Machine is. I know there are always special cases where something was put up that the webmaster was not entitled to like a copyrighted book or something, but for most stuff this is invaluable and a great service to humanity.

    Also think of all those users whose web site was lost without a backup. Now they can get that data back.

    The Wayback Machine is one of the few web services I'd be willing to pay for.

    John
  • by tiltowait ( 306189 ) on Wednesday June 19, 2002 @06:02PM (#3732486) Homepage Journal
    .... and wayback is sponsored, amongst others, by the library of congress. The archive itself a 501(c)(3) public nonprofit. See 17 U.S.C. SECTION 108(a)(3) [cornell.edu] for more information.

    Strange that such a complaint would appear within a group espousing that "information wants to be free." :)
  • What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998.

    Interestingly, if you look at Slashdot's earliest entry [archive.org] (man, that page was ugly back then!), and then look at the bottom of the page, it shows the domain that was used to pull the page: "Welcome User From firestone.alexa.com".

    Alexa.com appears to be some web search ("powered by Google") toolbar thingy. I can't determine if they are the same people as the wayback machine or not.

  • Purist? Pure what? (Score:5, Insightful)

    by American AC in Paris ( 230456 ) on Wednesday June 19, 2002 @06:03PM (#3732492) Homepage
    Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews.

    I'd say it makes you more of a control freak than a purist, personally.

    Seriously, how did you ever get it into your head that a medium that serves documents to the general public on demand would be somehow exempt from archiving?

    Would it bother you if John Q. Savant could recite the contents of your web pages from memory ten years after you'd taken them down?

    Would it bother you to learn that stock prices, perhaps the most "ever-changing" thing out there, are permanently archived by a variety of services?

    Or are you just jittery at the thought that your spouse/boss/Friendly Neighborhood Representative of The Man/kids may be able to someday look at the shite you plastered all over the web in your younger days? ("Ech, that stupid Netscape 2 animated title hack--honey, you actually -did- that?")

  • http://web.archive.org/web/19961020014044/http://www.microsoft.com/ [archive.org]

    Well, back in 1996 you really could win a million dollars from Bill Gates... well, at least a cruise.

    See all the exciting things happening on the Internet in Latin America, and win big prizes at the same time! Register for the first Latin American Internet Explorer Race. You'll have a great time, and perhaps even win a Caribbean cruise!
  • robots.txt

    User-agent: ia_archiver
    Disallow: /
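
    For anyone who wants to check what those two lines actually permit, here is a rough sketch using Python's standard `urllib.robotparser`. The URL and the `SomeOtherBot` name are placeholders, and this only shows how a well-behaved crawler would read the rules; it says nothing about what any particular archive really does.

```python
from urllib.robotparser import RobotFileParser

# The same two-line robots.txt as above.
rules = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URL, used only for illustration.
url = "http://www.example.com/scripts/play.html"

# The named archiver is shut out of the whole site...
archiver_allowed = parser.can_fetch("ia_archiver", url)

# ...while a crawler with no matching entry is still allowed.
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(archiver_allowed, other_allowed)
```

    Since the file names only `ia_archiver`, every other robot is unaffected; a `User-agent: *` entry would be needed to turn them all away.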
  • by MrResistor ( 120588 ) <peterahoff&gmail,com> on Wednesday June 19, 2002 @06:08PM (#3732524) Homepage
    By the very act of posting your site on the web you have given permission to make copies of it. Otherwise, how would anyone view it? And if no one is supposed to view it, why have you published it in a publicly accessible space?

    If I went to your website 2 years ago and never closed or refreshed that browser window, would I now be violating your copyright? What if I saved the page so I could view it later offline? What if I never erased that file, would that mean that I'm violating your copyright? I have several floppies of web sites I saved at school for viewing at home from the days when I was stuck on a crappy dial-up service. Does that make me a pirate? What about all the copies of sites held in my browsers cache?

    Don't get me wrong, I understand where the sentiment is coming from, even if I disagree with it. I'm just trying to point out how incongruous it is with the basic nature of computers and the internet and how they work.

    These questions aside, though, I have to come down in favor of the historians. People here are always whining about old movies/books/music being lost because their owners refuse to let them go, even if they aren't using them, why should the web suffer the same fate? The rate of destruction is far faster on the internet, and since it isn't a physical media, the information has to be actively archived if it is to be preserved.

  • Could you imagine if there was the equivalent of the Wayback Machine for everything published in 5th century Athens? We'd know an incredible amount more about where the human race had already been intellectually and where it's going.

    I publish several websites and I don't mind this a bit - If someone wants to host my content for free and offer my customers a way to get at older versions of the site for whatever reason (maybe they want to know what prices were 2 years ago), then they've done me a service. Cool.
  • Historical Records (Score:2, Interesting)

    by JonBuck ( 112195 )
    As a historian and future librarian, one thing has always bothered me about the Internet. Because change is a constant, it's very difficult to keep records. It isn't like newspapers, pamphlets, books, or any other form of written record of the past five thousand years. Unless they're printed out, our writings here leave no physical evidence of their existence. Because I feel that the Internet is as significant as the printing press five centuries ago, the prospect of having no records from its early days is frightening.

    We have books from five centuries ago. Will anything here still exist in a readable form five centuries from now? Unless something is done to preserve it, I feel there will be a massive gap in history.

    And this is why I do not object to web archives. They are a half step to printed and more permanent storage mediums, but preferable to nothing at all.
    • Just to play devil's advocate here: is there anything on the web that is historically significant that is not also in a more permanent (say, dead tree) format? I'll agree that the Internet is important, but in the scope of history I would think that the structure would be of more interest than the content.
  • libel? (Score:3, Interesting)

    by sckeener ( 137243 ) on Wednesday June 19, 2002 @06:13PM (#3732557)
    I didn't know that the wayback machine went that far back. I wonder if anyone is going to go to jail from posts they made in the past....
  • opting out (Score:3, Interesting)

    by josepha48 ( 13953 ) on Wednesday June 19, 2002 @06:14PM (#3732571) Journal
    To opt out of Google's service, at least, add the following tag in the "head" of your web page:
    <meta NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    This will tell Google not to cache your pages. If you don't want them to index your page and include the page in the search engine use:
    <meta NAME="ROBOTS" CONTENT="NOINDEX">
    Now I am not sure about this other site that is caching old pages, and right now I cannot get through but if they are caching any of my pages I will tell them to take them off as ALL my pages are just that MY pages. I think you can sue them, I'd imagine with all the other internet lawsuits it would be valid. They are stealing your pages.
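
    As a rough illustration of how a crawler might read tags like the ones above, here is a small sketch using Python's standard `html.parser`. The class and function names are invented for this example, and whether a given robot honors any particular directive is entirely up to that robot. For what it's worth, NOARCHIVE is the value usually documented for suppressing cached copies, while NOINDEX keeps a page out of the index.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values keep their original case.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of lowercased robots directives found in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = ('<html><head><meta NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
        '</head><body>my page</body></html>')
found = robots_directives(page)
print(sorted(found))
```

    A crawler built this way would look for `noindex`, `nofollow` or `noarchive` in the returned set before deciding what to keep.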
  • I was just digging through a few hundred pages of information in the wayback machine when the site became sluggish. I jokingly told my friends (you know, the kind that live in my head?) it seemed I was singlehandedly slashdotting the site.
    *sighs* Seems I had some help...

    Anyway, I love the Wayback Machine. Besides being an extremely useful tool, it proves that Zindell was right. Information is never lost, only ever created.
  • I used to run a Half-life map review site, and a TFC map review site called "radium". I took my sites down a couple of years ago, and recently some friends pointed out that they showed up on one of these archival sites. I took my sites down for a reason, and didn't appreciate them hovering about on someone else's server without my permission. Say what you will, but I just don't like it. I emailed them and had my property removed from their servers. It took a bit of badgering, but it finally got done.
  • The Wayback Machine started as Brewster Kahle's project. He also did Alexa, which provided some of the old archive tapes to start up the Wayback Machine.

    The long-term plan is to have a copy of the history of the Internet, beyond the power of any single government to censor. To this end, there are copies of the archive at multiple locations around the world.

    One of them is in the Bibliotheca Alexandrina [bibalex.gov.eg], in Egypt. They too have a Wayback Machine [bibalex.gov.eg]. It's jointly operated by the Government of Egypt and the United Nations Educational, Scientific and Cultural Organization. While they will usually honor removal requests, they don't have to do so.

    There are plans for two more archive sites around the world, affiliated with major national libraries.

  • Wayback machines should function exactly like search engines. If there's a robots.txt file, check it. If it tells you to get lost, do so. A search engine is going to cache at least the text part of your site, and you know it. And you can prevent it if you wish. And depending on the engine, it can take months or years to update.

    Besides, wayback machines will run into the same snags that search engines do. They can't replicate cgi scripts any better than search engines can, so denying them access to those resources makes sense, for their sake as well as the server's.

    I don't know how wayback works. At the very least they SHOULD read the robots file. If they do, then I consider most of the copyright issues to be a moot point.

    -Restil
  • by quantaman ( 517394 ) on Wednesday June 19, 2002 @06:28PM (#3732632)
    Anyone else find it mildly disturbing that 1998 is considered to be distant history?

  • It is suspected by many that archive.org also removes archives based on content.

    For instance, try accessing news sites back in the days immediately before and after 9/11. It is a very spotty record.

    I have seen this for myself as well, as a web site I am struggling to find the time to build, and which has controversial content, was at one time retrievable under archive.org, but no longer is.

    For that matter, it seems impossible to get Google to index it anymore either (though they too once included the site.)

    Presenting themselves as having a complete record of the Internet's web sites, and then selectively deleting or restricting access to sites based on content, is a very pernicious form of censorship. It isn't a First Amendment issue perhaps, since dotgov presumably isn't the one restricting content, but it is worrisome nonetheless.
  • You can't unregister a copyright.

    You give a copy of your work to the Library of Congress, and there the evidence sits for eternity, free to be accessed by anyone with a request slip.

    The price you pay for copyright protection is public availability and persistence of your old rantings.

    --Blair
  • Once you're on the Internet you can never get out. It's a simple fact. Someone will always have a copy of that e-mail you sent professing your love to Missy Gringlebach or the nntp post about how brilliant Hitler was or your web site dedicated to New Kids on The Block.

    Trying to get that stuff off is futile at best. A professor of mine once said that there is not a nanosecond when some computer isn't processing or storing something about you somewhere. And that was in 1991. I've got to side with McNealy on this. There is no such thing as privacy anymore.
  • by mfos.org ( 471768 ) on Wednesday June 19, 2002 @06:32PM (#3732659)
    A few things

    1) They've been archiving since 1998, but they've only recently had the horsepower to provide a live connection to it.

    2) It is very easy to not have your stuff indexed. the directions are here. [archive.org]
  • by kcbrown ( 7426 ) <slashdot@sysexperts.com> on Wednesday June 19, 2002 @07:06PM (#3732887)
    The Congress shall have Power To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries

    -- United States Constitution

    The purpose of copyright is to promote progress, to entice authors and inventors to release their works and discoveries to the public.

    But that is not an end unto itself. The true end is the benefit to society that the release of such works brings.

    Now, remember that the whole incentive here, the entire reason for granting the monopoly privilege of copyright, is to allow the originators of works to make money from their works, which in turn (theoretically) gives them incentives to release their works to the public.

    When you publish something on the web, you're publishing your works for free, unless you go to the extra trouble of implementing some kind of access control. The Wayback Machine won't work on a site that has access control, so all it ends up archiving is stuff that was published for free public consumption.

    So the real question is: if a work has already been released for free to the general public, how would letting authors restrict the republication of that work after the fact bring greater benefit to society than not letting the author impose such restrictions?

    My opinion is that it is much more beneficial to society as a whole if the release of a work for free public consumption automatically implied that members of the public have the right to redistribute that work. So if an author doesn't want people in the general public to be able to redistribute his work, he has to control who receives the work and who doesn't. Certainly requiring payment for the work in question is sufficient to meet the requirement of controlling access. But whatever method the author chooses, it should be one that makes it clear that the work in question is not being released for free to the public.

  • by fdiaz5583 ( 531839 ) on Wednesday June 19, 2002 @07:47PM (#3733104)
    If anyone has ever heard of the Library of Alexandria, it was supposedly the most impressive knowledge base the world had ever assembled. Some crazy guy came by and burnt it to the ground, setting the entire industrialized planet back hundreds, perhaps thousands, of years. We are now in the process of surpassing this great library, and are making it even easier for people to have access to knowledge. That knowledge may be porn, may be the morning news, or sports scores, it may even be how to construct a nuclear bomb. Nevertheless it is knowledge and EVERY person who is alive has the God (and any other higher power) given right to knowledge, despite what any government agency, or copyright may say. 21st century libraries such as the WayBack Machine are providing the tools necessary for researchers to go "back to the future." This is a great service to mankind, and its overall importance should not be outweighed by greedy and/or overparanoid privacy rights activists. If you do not wish to be known, please do not post any information on the web, and move to the jungles of Africa and step away from a time and place known as the PRESENT.
