The Internet

Obtaining Archives of USENET?

Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups will not allow automated access (and even so, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Comment removed (Score:4, Informative)

    by account_deleted ( 4530225 ) on Friday July 25, 2003 @08:34PM (#6537152)
    Comment removed based on user account deletion
    • by /dev/trash ( 182850 ) on Saturday July 26, 2003 @12:31PM (#6539943) Homepage Journal
      Actually, according to this: http://news.com.com/2100-1023-252449.html?legacy=cnet

      eBay bought the consumer website and Google bought everything else, including the deja.com name.
    • Technically, Google did not buy Deja, they just bought a bunch of data.

      That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

      Now maybe somebody will notice the flaw in the "archives are great, you and your copyright can get stuffed" arguments. It's OK for Deja to sell my copyrighted material, and for Google to make it available but only under restrictive agreements? Not by me it's not.

      • by Anonymous Coward
        Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

        http://groups.google.com/googlegroups/help.html#9

        • Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

          Sure, if you jump through their hoops and do it for each of the thousands of posts you might have made that are stored in their archive.

          This isn't really the point. Personally, I don't dramatically object to having my Usenet posts archived, since I don't post anything there I wouldn't want others to see in future. However, I do object to having my material sold for profit, or to having restric

          • Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself. This is the kind of right that anyone who compiles a telephone directory will have. When you post to usenet, the sensible legal view has to be that the post remains your copyright but that you grant an implied license to anyone anywhere in the world to read or ar
            • Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself.

              I appreciate the distinction you're making here, but aren't the rights to a database usually relevant in cases where the data itself is trivial, such as the phone directory you mentioned? Phone numbers aren't themselves subject to copyright. In this case, I fai

              • Anonymous Brave Guy (457657) writes on 22:50 Monday 28 July 2003 (#6556019):

                Archiving is a somewhat dicey proposition, since it was neither the original purpose of Usenet nor something provided for within its normal protocols, but you could possibly make a case for it on public interest grounds. I'm not sure what effect (or otherwise) I think the "x-no-archive" header should have here.

                The intention of the user's use of the x-no-archive tag cannot be fully determined; however, the original purpose f

                • Thanks for a thoughtful reply.

                  To address your book analogy quickly first: the difference there is that in the case of a book, you've presumably acquired the right to have that book, by buying it, borrowing it from a library or otherwise. When you paid for it, you bought the right to keep it. So did the library when they bought their copy (and they can lend out the copy they bought, but not copy it again themselves and lend out multiple copies from one original). The whole issue isn't really directly compara

              • Copyrights in the database are quite separate from copyrights in the content of the database. An analogy: if you cover a Britney Spears song then she retains copyright in her song (and you would need a license from her), but you have copyright in your recording of the song. Having multiple copyrights in one work is very common.

                The question here is whether the implied license granted by someone posting to Usenet is wide enough to let Google operate their Groups service.

                I'm an English IP lawyer (well, I'm S
                • Copyrights in the database are quite separate from copyrights in the content of the database.

                  Exactly my point; whatever rights Google may have relating to the former Deja database, are independent of my rights over my own work.

                  However, courts have to take a wide view of the implied license granted by people who post information on Usenet (and indeed the internet), or else they would be making a wide variety of internet use unlawful.

                  I don't see how that necessarily follows...

                  Implied licenses tend

      • That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

        Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

        • Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

          It's rather unlikely that anyone would, or could, put the whole of a Usenet archive onto Kazaa. It's even more unlikely that anyone would deliberately single out my own posts. If they did, it's even more unlikely still that those posts would get widely distributed, as most of what I have to say will appeal to a fairly small audience reading the specific newsgroups to which I post. And of course,

  • Sure! (Score:5, Funny)

    by Sevn ( 12012 ) on Friday July 25, 2003 @08:36PM (#6537163) Homepage Journal
    http://groups.google.com

    Search: *

    Then save to file.

    DONE!

  • Google (Score:4, Informative)

    by rmohr02 ( 208447 ) <mohr.42NO@SPAMosu.edu> on Friday July 25, 2003 @08:53PM (#6537242)
    You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices [google.com]. There can't be many other people with the data to talk with.

  • You're going to want to talk with Jim Gray, from this article:
    http://slashdot.org/article.pl?sid=03/07/10/2352251
  • Couldn't you write a script to first get a list of all the newsgroups on a server (here I am assuming that the server I use, news.tbaytel.net, is at convergence with the rest of usenet), then get all the headers in each, then get all the messages in each? If you use the server provided by your ISP over a fast connection, you can easily max out your connection, as it is more or less a direct line.
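
For what it's worth, that crawl could be sketched in Python with the standard-library nntplib module. This is a rough sketch under stated assumptions: the server name is a placeholder, and retention limits mean you would only ever capture what the server still carries.

```python
# Sketch of the crawl described above: list groups, then batch-fetch
# headers and articles. news.example.net is a placeholder server name.
def article_ranges(first, last, chunk=500):
    """Split an article-number span into (lo, hi) chunks for batched OVER."""
    start = first
    while start <= last:
        end = min(start + chunk - 1, last)
        yield (start, end)
        start = end + 1

def crawl(host="news.example.net"):
    # nntplib is imported lazily: it was removed from the stdlib in
    # Python 3.13 but is still available on PyPI for newer versions.
    import nntplib
    with nntplib.NNTP(host) as server:
        _, groups = server.list()                  # every group carried
        for g in groups:
            _, count, first, last, name = server.group(g.group)
            for lo, hi in article_ranges(first, last):
                _, overviews = server.over((lo, hi))   # headers in bulk
                for art_num, _headers in overviews:
                    _, info = server.article(art_num)  # full message
                    # ... write info.lines to local storage here ...
```

Even with a fast ISP server, this only yields a current snapshot of whatever the server retains, not a historical archive.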
    • Re:it occurs to me (Score:2, Informative)

      by damien_kane ( 519267 )
      The problem with this is data retention.
      I don't personally know of any USENET servers that have more than 10 days' worth of posts.
      Some of the groups (like alt.binaries.*) will only have 1-3 days' retention.

      Nope, this guy needs an archive, not a current snapshot.
      • by RGRistroph ( 86936 ) <rgristroph@gmail.com> on Friday July 25, 2003 @09:42PM (#6537465) Homepage
        Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.

        Also, why does it have to be usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml lists both go back quite a ways, just to mention a couple off the top of my head. If you go down to your university's sysadmin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.

        In fact, your university's sys admins may have usenet archives also.

        You may find this helpful in scraping web-served mailing list archives into a form you can use:

        http://www.linpro.no/lwp/

        Also, there is a perl script out there that will download the archives of a Yahoo group into an mbox.
      • I don't personally know of any USENET servers that have more than 10 days' worth of posts.
        Some of the groups (like alt.binaries.*) will only have 1-3 days' retention.


        You need a better news server. You're never gonna get a huge collection of whacking material from a news server like that. From Supernews:

        alt.binaries.multimedia.erotica has 190466 articles, and 23.6 days of retention.

        And non-pr0n:

        comp.os.linux.misc has 34277 articles, and 281.5 days of retention.

        This hasn't always been the case for Su
      • Easynews has retention of at least several weeks - even in the binaries groups. The discussion groups go back considerably longer.

        news.cis.dfn.de (I believe that's it - you can find it mentioned a lot in the "free news servers" newsgroup) has retention in the discussion groups (at least the ones I've looked in) all the way back to (at least) December of last year.

        But if you need years of research data, it seems google would be the place to go. If your needs are limited to only a few groups, many of them h

    • The problem with this approach is the server's data retention policy. You're not even going to come close to having an archive that extends back a year, let alone longer.

      I know of some servers where you're lucky if they still have a post a week later.
  • by shoppa ( 464619 ) on Friday July 25, 2003 @09:35PM (#6537432)
    In the early 1990s, a company in Vancouver, BC proposed a monthly distribution of Usenet via CD. This sparked extensive discussions in the newsgroups (at the time almost exclusively dominated by academic people, of course), with a lot of resentment that some company was going to be making money by selling *their* posts. (Do a Google Groups search for "usenet on CD" to see some of these. They also mention Walnut Creek.)

    In any event, the massive number of binary posts (porn, movies, warez, etc.) on usenet in the past few years would make the "full" archive of the past few years run to tens of thousands of CDs. A "full" usenet feed surpassed the bandwidth of a T1 around 1998, IIRC.

    Some individuals archive individual usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This, IMHO, is more manageable for research.

    The announcement of Google Groups with a 20 year archive [google.com] acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.

    • by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Friday July 25, 2003 @11:54PM (#6538017) Homepage
      In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CD's. A "full" usenet feed passed up the bandwidth of a T1 about 1998 IIRC.
      NOBODY has a full archive of the posts made to alt.binaries.* -- I doubt that even the NSA has that much storage to spare for it.

      I believe that a full feed is around 300 GB per day now, with 99+% of that being alt.binaries.*.

      • 300 GB/day -- that's two new hard drives each day (Pricewatch.com: 160 GB for $114 x 2 = $228/day), plus the cost of the bandwidth, plus the cost of the RAID backup (I hope). Expensive, but well within reason for a large organization.
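
A quick sanity check on those numbers, using only the 2003-era figures quoted in the thread (feed size, drive capacity, and price are the thread's own assumptions):

```python
# Back-of-the-envelope cost of archiving a full Usenet feed on commodity
# drives, using the figures quoted above (circa 2003).
FEED_GB_PER_DAY = 300
DRIVE_GB = 160
DRIVE_PRICE_USD = 114

drives_per_day = -(-FEED_GB_PER_DAY // DRIVE_GB)  # ceiling division
cost_per_day = drives_per_day * DRIVE_PRICE_USD
drives_per_year = drives_per_day * 365
feed_tb_per_year = FEED_GB_PER_DAY * 365 / 1000

print(drives_per_day)   # 2 drives/day
print(cost_per_day)     # $228/day
print(drives_per_year)  # 730 drives/year
print(feed_tb_per_year) # 109.5 TB of raw feed per year
```

So on the order of a hundred terabytes a year of raw feed, before any redundancy.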
        • OK, so say we keep this archive for a year. That's 730 x 160gb hard drives. Forget internet bandwidth; forget LAN bandwidth; where the fuck are you going to get enough hard drive controller bandwidth to be able to search such a monster?

          "OK, archive, give me the md5 hashes of every article posted between the hours of 10:00 and 11:00 (except Wednesdays) during 2002. Don't forget the porn. Count how many of the hashes contain both the hex strings 'DEAD' and 'BEEF'."

          "Hmm, I'll have to get back to you ... I
          • Hmm, I'll have to get back to you ... I should have an answer by 2012.

            Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this), this sort of request could be satisfied with simple SQL in a few seconds. The database wouldn't even be that big (not compared to some corporate databases out there.) But the actual images/movies/warez/etc., that's another matter.

            I'm not saying it's n
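
A minimal sketch of that metadata-first design, using SQLite. The schema and the example query are purely illustrative assumptions, not how any real agency or archive is actually built:

```python
# Sketch: index each article's hash and headers as it arrives, so later
# queries hit a small metadata table instead of the raw terabytes.
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real archive
conn.execute("""
    CREATE TABLE articles (
        md5       TEXT,
        newsgroup TEXT,
        posted_at TEXT  -- ISO-8601 timestamp from the Date: header
    )
""")

def index_article(body: bytes, newsgroup: str, posted_at: str) -> None:
    conn.execute(
        "INSERT INTO articles VALUES (?, ?, ?)",
        (hashlib.md5(body).hexdigest(), newsgroup, posted_at),
    )

index_article(b"example article body", "comp.os.linux.misc",
              "2002-05-14T10:30:00")

# A simplified version of the query above: articles posted between
# 10:00 and 11:00 whose digest contains both hex strings -- plain SQL,
# answered in seconds once the metadata is indexed.
rows = conn.execute("""
    SELECT md5 FROM articles
    WHERE strftime('%H', posted_at) = '10'
      AND md5 LIKE '%dead%' AND md5 LIKE '%beef%'
""").fetchall()
```

The point stands either way: the metadata database stays small; it's storing the binaries themselves that is the hard part.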

            • by Raul654 ( 453029 ) on Sunday July 27, 2003 @11:18PM (#6548236) Homepage
              It's just not that useful. 106 TB/year, mostly porn and warez.

              I don't see how you can reconcile those two sentences.
            • Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this),


              You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.
              • You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

                and then realize that, if you are looking for information in/about the binaries, you have to assemble the many pieces of the binary, and then decode them.
              • You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

                Are there more usenet posts than web pages? I only ask because Google manages to index over three billion pages in the manner you refer to.
    • > They also mention Walnut Creek

      Hey, I know those guys... They took some of my copyrighted wallpaper [gdargaud.net] images, removed my logos, and resold them on CDs. Assholes.

  • Did you just send off emails, or did you actually pick up the phone and try to talk to somebody? You might have much better luck with the latter, Google must get a billion random emails a day.
    • I sent them an e-mail reporting a minor problem and it took them months to get back to me.

      Tim
      • Strange, I sent an email off and got a response within a couple of days.

        I suppose that if you sent it to the wrong department then it would take *much* longer.
        • Re:Yeah (Score:5, Funny)

          by orthogonal ( 588627 ) on Saturday July 26, 2003 @03:36AM (#6538588) Journal
          I suppose that if you sent it to the wrong department then it would take *much* longer.

          Well, it's not the department actually.

          What you need to do is fill in the X-Meta tag in your email (or if you use Microsoft email products, the <HEAD> <META> tags) with keywords describing your email. "Hot anime poon-tang" is often a useful meta tag.

          Then, you need to get a lot of people to link to your email (that is, refer to it in their email) with a X-References tag.

          You can do this just by getting it popular on a mailing list, but even more effective is to post it to a number of bloggers. Bloggers, being how they are (self-important but afraid of not being noticed), will endlessly refer to it, and will probably get into interminable "blog spats" about it -- well, to be honest, about something completely tangential to it, but what do you care? It pushes your "Mail-Rank" score up regardless.

          If you can't do that, try contacting "MailKing", a commercial service that sends out a lot of email, purportedly from unrelated individuals, which refers to your email. They'll make it look like your email is really important.

          When your "Mail-Rank" is high enough, Google will have no choice but to notice it and reply.

          Best of luck getting your email noticed!
  • by AtariDatacenter ( 31657 ) on Friday July 25, 2003 @11:23PM (#6537899)
    > For an academic project, I need to analyse a
    > lot of USENET data.

    Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.
  • Try http://www.nsa.gov [nsa.gov].

    Seriously, I'm sure that a number of intelligence agencies have archives of this stuff. I believe the FBI used to get a USENET feed on 9-track 1/2" tape.

  • SURE..... (Score:3, Funny)

    by JDWTopGuy ( 209256 ) on Friday July 25, 2003 @11:41PM (#6537972) Homepage Journal
    You REALLY just want to get all that hot porno and wares and MP3z!!! We can see right through your story! I bet it's "seminal research" too.....
  • Translation (Score:5, Funny)

    by truffle ( 37924 ) on Saturday July 26, 2003 @12:10AM (#6538057) Homepage

    how can I obtain historical data for research purposes ?


    Translation:

    How can I build the largest pr0n collection ever?
  • Dejanews had a rival for a year or so. They got bought out long ago. Sorry, I can't remember their name. But if Google really won't co-operate, look up those other guys.
  • google api (Score:2, Informative)

    by krismon ( 205376 )
    Download Google's free API (for non-commercial use); it would make your scripting/programming much easier...
  • Follow up (Score:5, Informative)

    by Anonymous Coward on Saturday July 26, 2003 @08:40AM (#6539152)
    I'm the original poster of the question. Just to answer a few points raised by other people:

    1. I am only interested in USENET data up until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).

    2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and getting blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.

    [1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.

  • by Anonymous Cowdog ( 154277 ) on Saturday July 26, 2003 @08:46AM (#6539162) Journal
    I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.

    Go to this link:

    http://www.archive.org/web/researcher/proposal.php [archive.org]

    Create an account, log in, write a proposal, submit it, then wait.

    Yes they have Usenet data, not just web data.

    After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.

    But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.
  • I think, for practical purposes, Google owns most of USENET now, including its complete history; there will probably not be any other institutions (well, maybe the NSA) that keep a history of USENET postings.
    • I don't believe Google owns Usenet in any way. I mean, Usenet is NNTP, a P2P protocol, that is actually a method to distribute the postings made to a Usenet server. As to those postings that Google has collected and archived, each individual posting is really copyright of the poster (if not stolen), and I certainly don't recall receiving any compensation from Google for having the use of my copyrighted postings. So, IMHO, Google should provide access to all Usenet postings in the archive, with the access m
      • Of course, Google doesn't own USENET legally; in fact, I think Google's use of USENET is of questionable legality, since postings traditionally have been made with expiration dates in mind and Google's use far exceeds the implicit license posters to USENET have given people for using their postings.

        What I'm saying is that the institution of USENET these days is to a large degree controlled by Google; Google has changed the way people post, Google archives everything, and most people, looking for an article
  • Practically no one is lining up to give researchers data! This is a hot-button issue of mine. I find it particularly outrageous that the research excluded by law is usually that which is in the public interest, while many private and commercial studies are well funded and legal.

    Many areas of the law are purposefully made to protect commercial interests and "stimulate the economy." Politicians are still at it! That this also seems in corporate interests is no accident, since a great deal of commerce is conduct

  • He provided much of the early USENET archives. See his page here [csd.uwo.ca] for details of his contribution and email address.
