The Internet

Obtaining Archives of USENET?

Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups will not allow automated access (and even so, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Comment removed (Score:4, Informative)

    by account_deleted ( 4530225 ) on Friday July 25, 2003 @08:34PM (#6537152)
    Comment removed based on user account deletion
    • by /dev/trash ( 182850 ) on Saturday July 26, 2003 @12:31PM (#6539943) Homepage Journal
      Actually, according to this: http://news.com.com/2100-1023-252449.html?legacy=cnet

      eBay bought the consumer website and Google bought everything else, including the deja.com name.
    • Technically, Google did not buy Deja, they just bought a bunch of data.

      That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

      Now maybe somebody will notice the flaw in the "archives are great, you and your copyright can get stuffed" arguments. It's OK for Deja to sell my copyrighted material, and for Google to make it available but only under restrictive agreements? Not by me it's not.

      • by Anonymous Coward
        Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

        http://groups.google.com/googlegroups/help.html#9

        • Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

          Sure, if you jump through their hoops and do it for each of the thousands of posts you might have made that are stored in their archive.

          This isn't really the point. Personally, I don't dramatically object to having my Usenet posts archived, since I don't post anything there I wouldn't want others to see in future. However, I do object to having my material sold for profit, or to having restric

          • Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself. This is the kind of right that anyone who compiles a telephone directory will have. When you post to usenet, the sensible legal view has to be that the post remains your copyright but that you grant an implied license to anyone anywhere in the world to read or ar
            • Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself.

              I appreciate the distinction you're making here, but aren't the rights to a database usually relevant in cases where the data itself is trivial, such as the phone directory you mentioned? Phone numbers aren't themselves subject to copyright. In this case, I fai

              • Anonymous Brave Guy (457657) writes on 22:50 Monday 28 July 2003 (#6556019):

                Archiving is a somewhat dicey proposition, since it was neither the original purpose of Usenet nor something provided for within its normal protocols, but you could possibly make a case for it on public interest grounds. I'm not sure what effect (or otherwise) I think the "x-no-archive" header should have here.

                The intention of the user's use of the x-no-archive tag cannot be fully determined; however, the original purpose f

                • Thanks for a thoughtful reply.

                  To address your book analogy quickly first: the difference there is that in the case of a book, you've presumably acquired the right to have that book, by buying it, borrowing it from a library or otherwise. When you paid for it, you bought the right to keep it. So did the library when they bought their copy (and they can lend out the copy they bought, but not copy it again themselves and lend out multiple copies from one original). The whole issue isn't really directly compara

              • Copyrights in the database are quite separate from copyrights in the content of the database. An analogy: if you cover a Britney Spears song then she retains copyright in her song (and you would need a license from her), but you have copyright in your recording of the song. Having multiple copyrights in one work is very common.

                The question here is whether the implied license granted by someone posting to Usenet is wide enough to let Google operate their Groups service.

                I'm an English IP lawyer (well, I'm S
                • Copyrights in the database are quite separate from copyrights in the content of the database.

                  Exactly my point; whatever rights Google may have relating to the former Deja database, are independent of my rights over my own work.

                  However, courts have to take a wide view of the implied license granted by people who post information on Usenet (and indeed the internet), or else they would be making a wide variety of internet use unlawful.

                  I don't see how that necessarily follows...

                  Implied licenses tend

      • That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

        Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

        • Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

          It's rather unlikely that anyone would, or could, put the whole of a Usenet archive onto Kazaa. It's even more unlikely that anyone would deliberately single out my own posts. If they did, it's even more unlikely still that those posts would get widely distributed, as most of what I have to say will appeal to a fairly small audience reading the specific newsgroups to which I post. And of course,

  • Sure! (Score:5, Funny)

    by Sevn ( 12012 ) on Friday July 25, 2003 @08:36PM (#6537163) Homepage Journal
    http://groups.google.com

    Search: *

    Then save to file.

    DONE!

  • Google (Score:4, Informative)

    by rmohr02 ( 208447 ) <mohr.42NO@SPAMosu.edu> on Friday July 25, 2003 @08:53PM (#6537242)
    You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices [google.com]. There can't be many other people with the data to talk with.

  • You're going to want to talk with Jim Gray, from this article:
    http://slashdot.org/article.pl?sid=03/07/10/2352251
  • Couldn't you write a script to first get a list of all the newsgroups on a server (here I am assuming that the server I use, news.tbaytel.net, is at convergence with the rest of usenet), then get all the headers in each, then get all the messages in each? If you use the server provided by your ISP over a fast connection, you can easily max out your connection, as it is more or less a direct line.
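
For what it's worth, that crawl could be sketched in Python with the standard-library nntplib module. This is a rough sketch under stated assumptions: the server name is a placeholder, and retention limits mean you would only ever capture what the server still carries.

```python
# Sketch of the crawl described above: list groups, then batch-fetch
# headers and articles. news.example.net is a placeholder server name.
def article_ranges(first, last, chunk=500):
    """Split an article-number span into (lo, hi) chunks for batched OVER."""
    start = first
    while start <= last:
        end = min(start + chunk - 1, last)
        yield (start, end)
        start = end + 1

def crawl(host="news.example.net"):
    # nntplib is imported lazily: it was removed from the stdlib in
    # Python 3.13 but is still available on PyPI for newer versions.
    import nntplib
    with nntplib.NNTP(host) as server:
        _, groups = server.list()                  # every group carried
        for g in groups:
            _, count, first, last, name = server.group(g.group)
            for lo, hi in article_ranges(first, last):
                _, overviews = server.over((lo, hi))   # headers in bulk
                for art_num, _headers in overviews:
                    _, info = server.article(art_num)  # full message
                    # ... write info.lines to local storage here ...
```

Even with a fast ISP server, this only yields a current snapshot of whatever the server retains, not a historical archive.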
    • Re:it occurs to me (Score:2, Informative)

      by damien_kane ( 519267 )
      The problem with this is data retention.
      I don't personally know of any USENET servers that have more than 10 days' worth of posts.
      Some of the groups (like alt.binaries.*) will only have 1-3 days' retention.

      Nope, this guy needs an archive, not a current snapshot.
      • by RGRistroph ( 86936 ) <rgristroph@gmail.com> on Friday July 25, 2003 @09:42PM (#6537465) Homepage
        Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.

        Also, why does it have to be usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml lists both go back quite a ways, just to mention a couple off the top of my head. If you go down to your university's sysadmin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.

        In fact, your university's sys admins may have usenet archives also.

        You may find this helpful in scraping web-served mailing list archives into a form you can use:

        http://www.linpro.no/lwp/

        Also, there is a perl script out there that will download the archives of a Yahoo group into an mbox.
      • I don't personally know of any USENET servers that have more than 10 days' worth of posts.
        Some of the groups (like alt.binaries.*) will only have 1-3 days' retention.


        You need a better news server. You're never gonna get a huge collection of whacking material from a news server like that. From Supernews:

        alt.binaries.multimedia.erotica has 190466 articles, and 23.6 days of retention.

        And non-pr0n:

        comp.os.linux.misc has 34277 articles, and 281.5 days of retention.

        This hasn't always been the case for Su
      • Easynews has retention of at least several weeks - even in the binaries groups. The discussion groups go back considerably longer.

        news.cis.dfn.de (I believe that's it - you can find it mentioned a lot in the "free news servers" newsgroup) has retention in the discussion groups (at least the ones I've looked in) all the way back to (at least) December of last year.

        But if you need years of research data, it seems google would be the place to go. If your needs are limited to only a few groups, many of them h

    • The problem with this approach is the server's data retention policy. You're not even going to come close to having an archive that extends back a year, let alone longer.

      I know of some servers where you're lucky if they still have a post a week later.
  • by shoppa ( 464619 ) on Friday July 25, 2003 @09:35PM (#6537432)
    In the early 1990s, a company in Vancouver, BC proposed a monthly distribution of Usenet via CD. This sparked extensive discussions in the newsgroups (at the time almost exclusively dominated by academic people, of course), with a lot of resentment that some company was going to be making money by selling *their* posts. (Do a Google Groups search for "usenet on CD" to see some of these. They also mention Walnut Creek.)

    In any event, the massive number of binary posts (porn, movies, warez, etc.) on usenet in the past few years would make the "full" archive of the past few years run to tens of thousands of CDs. A "full" usenet feed surpassed the bandwidth of a T1 around 1998, IIRC.

    Some individuals archive individual usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This, IMHO, is more manageable for research.

    The announcement of Google Groups with a 20 year archive [google.com] acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.

    • by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Friday July 25, 2003 @11:54PM (#6538017) Homepage
      In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CD's. A "full" usenet feed passed up the bandwidth of a T1 about 1998 IIRC.
      NOBODY has a full archive of the posts made to alt.binaries.* -- I doubt that even the NSA has that much storage to spare for it.

      I believe that a full feed is around 300 GB per day now, with 99+% of that being alt.binaries.*.

      • 300 GB/day -- that's two new hard drives each day (Pricewatch.com: 160 GB for $114 x 2 = $228/day), plus the cost of the bandwidth, plus the cost of the RAID backup (I hope). Expensive, but well within reason for a large organization.
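
A quick sanity check on those numbers, using only the 2003-era figures quoted in the thread (feed size, drive capacity, and price are the thread's own assumptions):

```python
# Back-of-the-envelope cost of archiving a full Usenet feed on commodity
# drives, using the figures quoted above (circa 2003).
FEED_GB_PER_DAY = 300
DRIVE_GB = 160
DRIVE_PRICE_USD = 114

drives_per_day = -(-FEED_GB_PER_DAY // DRIVE_GB)  # ceiling division
cost_per_day = drives_per_day * DRIVE_PRICE_USD
drives_per_year = drives_per_day * 365
feed_tb_per_year = FEED_GB_PER_DAY * 365 / 1000

print(drives_per_day)   # 2 drives/day
print(cost_per_day)     # $228/day
print(drives_per_year)  # 730 drives/year
print(feed_tb_per_year) # 109.5 TB of raw feed per year
```

So on the order of a hundred terabytes a year of raw feed, before any redundancy.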
        • OK, so say we keep this archive for a year. That's 730 x 160gb hard drives. Forget internet bandwidth; forget LAN bandwidth; where the fuck are you going to get enough hard drive controller bandwidth to be able to search such a monster?

          "OK, archive, give me the md5 hashes of every article posted between the hours of 10:00 and 11:00 (except Wednesdays) during 2002. Don't forget the porn. Count how many of the hashes contain both the hex strings 'DEAD' and 'BEEF'."

          "Hmm, I'll have to get back to you ... I
          • Hmm, I'll have to get back to you ... I should have an answer by 2012.

            Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this), this sort of request could be satisfied with simple SQL in a few seconds. The database wouldn't even be that big (not compared to some corporate databases out there.) But the actual images/movies/warez/etc., that's another matter.

            I'm not saying it's n
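
A minimal sketch of that metadata-first design, using SQLite. The schema and the example query are purely illustrative assumptions, not how any real agency or archive is actually built:

```python
# Sketch: index each article's hash and headers as it arrives, so later
# queries hit a small metadata table instead of the raw terabytes.
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real archive
conn.execute("""
    CREATE TABLE articles (
        md5       TEXT,
        newsgroup TEXT,
        posted_at TEXT  -- ISO-8601 timestamp from the Date: header
    )
""")

def index_article(body: bytes, newsgroup: str, posted_at: str) -> None:
    conn.execute(
        "INSERT INTO articles VALUES (?, ?, ?)",
        (hashlib.md5(body).hexdigest(), newsgroup, posted_at),
    )

index_article(b"example article body", "comp.os.linux.misc",
              "2002-05-14T10:30:00")

# A simplified version of the query above: articles posted between
# 10:00 and 11:00 whose digest contains both hex strings -- plain SQL,
# answered in seconds once the metadata is indexed.
rows = conn.execute("""
    SELECT md5 FROM articles
    WHERE strftime('%H', posted_at) = '10'
      AND md5 LIKE '%dead%' AND md5 LIKE '%beef%'
""").fetchall()
```

The point stands either way: the metadata database stays small; it's storing the binaries themselves that is the hard part.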

            • by Raul654 ( 453029 ) on Sunday July 27, 2003 @11:18PM (#6548236) Homepage
              It's just not that useful. 106 TB/year, mostly porn and warez.

              I don't see how you can reconcile those two sentences.
            • Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this),


              You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.
              • You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

                and then realize that, if you are looking for information in/about the binaries, you have to assemble the many pieces of the binary, and then decode them.
              • You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

                Are there more usenet posts than web pages? I only ask because Google manages to index over three billion pages in the manner you refer to.
    • > They also mention Walnut Creek

      Hey, I know those guys... They took some of my copyrighted wallpaper [gdargaud.net] images, removed my logos, and resold them on CDs. Assholes.

  • Did you just send off emails, or did you actually pick up the phone and try to talk to somebody? You might have much better luck with the latter, Google must get a billion random emails a day.
    • I sent them an e-mail reporting a minor problem and it took them months to get back to me.

      Tim
      • Strange, I sent an email off and got a response within a couple of days.

        I suppose that if you sent it to the wrong department then it would take *much* longer.
        • Re:Yeah (Score:5, Funny)

          by orthogonal ( 588627 ) on Saturday July 26, 2003 @03:36AM (#6538588) Journal
          I suppose that if you sent it to the wrong department then it would take *much* longer.

          Well, it's not the department actually.

          What you need to do is fill in the X-Meta tag in your email (or if you use Microsoft email products, the <HEAD> <META> tags) with keywords describing your email. "Hot anime poon-tang" is often a useful meta tag.

          Then, you need to get a lot of people to link to your email (that is, refer to it in their email) with a X-References tag.

          You can do this just by getting it popular on a mailing list, but even more effective is to post it to a number of bloggers. Bloggers, being how they are (self-important but afraid of not being noticed), will endlessly refer to it, and will probably get into interminable "blog spats" about it -- well, to be honest, about something completely tangential to it, but what do you care? It pushes your "Mail-Rank" score up regardless.

          If you can't do that, try contacting "MailKing", a commercial service that sends out a lot of email, purportedly from unrelated individuals, which refers to your email. They'll make it look like your email is really important.

          When your "Mail-Rank" is high enough, Google will have no choice but to notice it and reply.

          Best of luck getting your email noticed!
  • by AtariDatacenter ( 31657 ) on Friday July 25, 2003 @11:23PM (#6537899)
    > For an academic project, I need to analyse a
    > lot of USENET data.

    Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.
  • Try http://www.nsa.gov [nsa.gov].

    Seriously, I'm sure that a number of intelligence agencies have archives of this stuff. I believe the FBI used to get a USENET feed on 9-track 1/2" tape.

  • SURE..... (Score:3, Funny)

    by JDWTopGuy ( 209256 ) on Friday July 25, 2003 @11:41PM (#6537972) Homepage Journal
    You REALLY just want to get all that hot porno and wares and MP3z!!! We can see right through your story! I bet it's "seminal research" too.....
  • Translation (Score:5, Funny)

    by truffle ( 37924 ) on Saturday July 26, 2003 @12:10AM (#6538057) Homepage

    how can I obtain historical data for research purposes ?


    Translation:

    How can I build the largest pr0n collection ever?
  • Dejanews had a rival for a year or so. They got bought out long ago. Sorry, I can't remember their name. But if Google really won't co-operate, look up those other guys.
  • google api (Score:2, Informative)

    by krismon ( 205376 )
    Download Google's free API (for non-commercial use); it would make your scripting/programming much easier...
  • Follow up (Score:5, Informative)

    by Anonymous Coward on Saturday July 26, 2003 @08:40AM (#6539152)
    I'm the original poster of the question. Just to answer a few points raised by other people:

    1. I am only interested in USENET data up until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).

    2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and getting blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.

    [1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.

  • by Anonymous Cowdog ( 154277 ) on Saturday July 26, 2003 @08:46AM (#6539162) Journal
    I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.

    Go to this link:

    http://www.archive.org/web/researcher/proposal.php [archive.org]

    Create an account, log in, write a proposal, submit it, then wait.

    Yes they have Usenet data, not just web data.

    After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.

    But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.
  • I think, for practical purposes, Google owns most of USENET now, including its complete history; there will probably not be any other institutions (well, maybe the NSA) that keep a history of USENET postings.
    • I don't believe Google owns Usenet in any way. I mean, Usenet is NNTP, a P2P protocol, that is actually a method to distribute the postings made to a Usenet server. As to those postings that Google has collected and archived, each individual posting is really copyright of the poster (if not stolen), and I certainly don't recall receiving any compensation from Google for having the use of my copyrighted postings. So, IMHO, Google should provide access to all Usenet postings in the archive, with the access m
      • Of course, Google doesn't own USENET legally; in fact, I think Google's use of USENET is of questionable legality, since postings traditionally have been made with expiration dates in mind and Google's use far exceeds the implicit license posters to USENET have given people for using their postings.

        What I'm saying is that the institution of USENET these days is to a large degree controlled by Google; Google has changed the way people post, Google archives everything, and most people, looking for an article
  • Practically no one is lining up to give researchers data! This is a hot-button issue of mine. I find it particularly outrageous that the research excluded by law is usually that which is in the public interest, while many private and commercial studies are well funded and legal.

    Many areas of the law are purposefully made to protect commercial interests and "stimulate the economy." Politicians are still at it! That this also seems in corporate interests is no accident, since a great deal of commerce is conduct

  • He provided much of the early USENET archives. See his page here [csd.uwo.ca] for details of his contribution and email address.
