Media Providers And Short Online Retention?
delfstrom asks: "Retention time for online reference material is decreasing. First it was Deja moving archives offline. Now try to find the AP story you saw on Yahoo from earlier this year about a judge's order against a CyberPatrol decryption tool. You can't, because anything older than 30 days is canned from news.yahoo.com. Likewise, certain online newspapers (not to mention any names) are removing content after a mere 7 days, though for $25 per retrieved article you can go back to 1977. This certainly goes against the philosophy of not breaking links. What responsibility do information providers have in maintaining articles that they post? In this era of electronic publishing, academic papers are beginning to contain URLs in the references. To what extent can we keep copies of such information and provide it to others?"
Re:uh huh (Score:1)
Exactly my point! (Score:3)
This is why I've been hoarding data since about 1992 or so. Anything that I deem worth keeping I keep a local copy of, whether it be my old Bluemail .qwk archives, newsgroup postings, HTML pages, Adobe Acrobat files from wherever and whenever, old .mod and .stm/s3ms, you name it. I've got .zip files I'll probably never use again, but I've kept them specifically because I got sick and tired of so-called "permanent" sites taking them off.
Whenever my hard drive gets full, I do a couple categorization passes (I try to keep them categorized as I go but it's never quite perfect; there's always too many files in my /data/dump directory) and then make an .iso. Two copies are burned, one for my bookshelf and one for work or safe storage.
As Signal11 once had in his .sig (ripped from somewhere, I'm not sure where, though I've seen it in the old taglines of yore): I don't have a solution but I admire your problem.
No sense (Score:2)
How does this compare with "offline" news? (Score:2)
I know newsstands tend not to keep even yesterday's papers; it's up to organisations like libraries to do that.
Do we have any comparable organisations who specifically archive things like online news?
How do they deal with copyright issues?
- Muggins the Mad
This is where Freenet/etc. can help! (Score:1)
I was just thinking!
This is where Freenet and other p2p and distributed sharing programs can and will help!
Thnx,
Fuller
ps. http://freenet.sourceforge.net
http://www.mirc.com
http://www.forteinc.com
http://www.deja.com
http://www.google.com (their cached pages are wonderful!)
Re: No sense -- In The Real World... (Score:3)
Storage keeps getting cheaper,
There are three issues here. The first is that storage isn't as cheap as you think. The second is that indexes are hard to maintain. Finally, you forget that old text is a good revenue stream.
Storage
You are correct that space is cheap for small amounts of storage. If you go to your local computer store, you can buy a 60-gig drive for less than I paid for my first five-meg drive. I have no contention there.
However, people who archive data for a living don't buy bare 60-gig IDE drives and string them together. It ain't that simple.
I work for a newspaper. We have every text we have published since 1985 and every picture since 1996 (don't quote me on that last date). They are both inside IBM RS/6000s. The text archive is under 15 gig. The photo archive clocks in at 230 gig (and growing by nearly 600 meg a day).
Initially, the data lived in a $100,000 HP optical jukebox. When that got too small, we scrapped it and bought IBM 7133 disk arrays. Bare, before you put the first drive in the box, they cost $36,000. Each nine-gig drive is $2,000. (Yes, I know you can get them cheaper. But not hot-swap, not with an IBM warranty, etc.) When you hit 144 gig (9 gig by 16 drives), you've got to buy another 7133. In order to get good performance, you can't just RAID-5 everything in one big SSA loop. You have got to have multiple paths. Each enhanced SSA card is a few thousand dollars.
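To make the arithmetic above concrete, here is a rough sketch that prices out the photo archive using only the figures quoted in this post. It is an illustration, not the newspaper's actual budget: it counts raw capacity only, ignores RAID overhead, mirroring, and the SSA cards (whose price is given only as "a few thousand dollars"), and assumes 1 gig = 1000 meg. The function names are made up for the example.

```python
import math

# Figures quoted in the post above
ENCLOSURE_COST = 36_000       # bare IBM 7133, in dollars
DRIVE_COST = 2_000            # one 9-gig hot-swap drive, in dollars
DRIVE_GIG = 9
DRIVES_PER_ENCLOSURE = 16
PHOTO_ARCHIVE_GIG = 230
GROWTH_MEG_PER_DAY = 600


def array_cost(total_gig):
    """Return (enclosures, drives, dollars) for total_gig of raw capacity."""
    drives = math.ceil(total_gig / DRIVE_GIG)
    enclosures = math.ceil(drives / DRIVES_PER_ENCLOSURE)
    dollars = enclosures * ENCLOSURE_COST + drives * DRIVE_COST
    return enclosures, drives, dollars


def days_until_full(enclosures, used_gig):
    """Days until fully populated enclosures fill up at the quoted growth rate."""
    capacity_gig = enclosures * DRIVES_PER_ENCLOSURE * DRIVE_GIG
    remaining_meg = (capacity_gig - used_gig) * 1000  # assumes 1 gig = 1000 meg
    return remaining_meg / GROWTH_MEG_PER_DAY


enclosures, drives, dollars = array_cost(PHOTO_ARCHIVE_GIG)
days_left = days_until_full(enclosures, PHOTO_ARCHIVE_GIG)
print(f"{enclosures} enclosures, {drives} drives, ${dollars:,}")
print(f"about {days_left:.0f} days until another 7133 is needed")
```

With the quoted numbers, the 230-gig photo archive alone needs two enclosures and 26 drives, roughly $124,000 before any redundancy or SSA cards, and two full enclosures would last only about three months at 600 meg a day.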
Indexing
Keeping the raw images isn't that difficult in the grand scheme of things. Indexing and searching for content, however, is far from trivial. Keeping the database well-groomed is hard work. You do want all the stuff these web sites keep online to be searchable, right?
Storing photographs is especially difficult. For a quick discussion on archiving images, see this post [slashdot.org] from a week or so ago.
Revenue
Newspapers sell you a hundred stories with pictures and comics a day for, generally, 50 cents. However, if you want a story that was in last year's newspaper, they can charge you five dollars for that story and you will pay it.
Why on earth would newspapers give you content for free that they spent money to create and archive? Yeah, yeah, information wants to be free and all that, but they still have to make a profit; otherwise there will be no information to be made free.
Solution?
The obvious solution is for these media outlets to charge for old stories. That way the links don't break and they have a way to support the archive and indexing costs. Folks here won't like that idea.
Summary
It's easy to say that the media should keep everything online all the time. In the real world, however, there are problems with doing just that. The problems are both technical and financial. Information may want to be free but 'wanting' doesn't pay the bills.
InitZero
Re: No sense -- In The Real World... (Score:1)
Initially, the data lived in a $100,000 HP optical jukebox. When that got too small, we scrapped it and bought IBM 7133 disk arrays. Bare, before you put the first drive in the box, they cost $36,000. Each nine gig drive is $2,000. (Yes, I know you can get them cheaper. But not hot-swap, not with an IBM warranty, etc.) When you hit 144 gig (9 gig by 16 drives), you've got to buy another 7133. In order to get good performance, you can't just RAID-5 everything in one big SSA loop. You have got to have multiple paths. Each enhanced SSA card is a few thousand dollars.
I disagree. Online secure storage is as cheap as we think. We just installed a 500 Gig RAID for US$20,000 for storing huge (and critical) medical images. Are you saying that if you were to provide all of the text of articles of a single daily newspaper back to the late '70s, it would require anything more than 500 Gig?
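For a side-by-side view of the two price points being argued over, here is a small sketch that reduces each to dollars per gig. It uses only numbers quoted in this thread and compares raw capacity; it says nothing about redundancy, interconnect, or backup support, which are not specified for the 500 Gig RAID. The function name is hypothetical.

```python
def dollars_per_gig(total_dollars, total_gig):
    """Cost of raw capacity in dollars per gig."""
    return total_dollars / total_gig


# 500 Gig RAID quoted in this reply
raid_quote = dollars_per_gig(20_000, 500)

# One fully populated IBM 7133 from the quoted post:
# $36,000 bare enclosure plus 16 drives at $2,000 each, 144 gig total
ibm_7133_full = dollars_per_gig(36_000 + 16 * 2_000, 144)

print(f"500 Gig RAID: ${raid_quote:.2f} per gig")
print(f"Full IBM 7133: ${ibm_7133_full:.2f} per gig")
```

On these figures the RAID comes to $40 per gig and a full 7133 to about $472 per gig, so the disagreement is less about what disks cost than about what else the storage has to do.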
Sure, I'd go for charging for old stories. Possibly micropayments. As long as the links stay the same! You work for a newspaper. How can we convince newspapers and other media to do this?
Re:This is where Freenet/etc. can help! (Score:1)
From the FreeNet FAQ [sourceforge.net]: Documents that are never requested are eventually removed through disuse.
On the other hand, as the price of storage media drops, we'll probably see somebody (Google?) attempt to cache the entire Internet.
hmmmm... (Score:1)
"Your Honor, here is a copy of a news article from May 2001 proving that the MPAA willfully and illegally spanked a room full of children."
"But how can we be certain that you did not fabricate or alter that article? Where is the original?"
"Well, your Honor, as is the custom nowadays, all news is removed from a site just 7 days after it is posted...."
"I'm sorry, but I cannot allow that in as evidence."
D'oh!
-----
My problem with it... (Score:1)
Well, a lot of that problem was the teacher, but if the sources would realize that what they say can sometimes be used for scholarly work and keep it around for a bit, life would be nicer. I'm all for getting rid of things that no one has accessed in over a year, but when you operate a news site you should at least think about putting old articles somewhere without pictures and such, just the text. How much space would a year's worth of CNN news stories in plain text take up?
Re: No sense -- In The Real World... (Score:2)
We just installed a 500 Gig RAID for US$20,000 for storing huge (and critical) medical images.
Does that storage have a single point of failure? Is it mirrored? Is it SSA? Will it work on an RS/6000? Can it be backed up to ADSM/TSM?
All of these are critical questions for us. There are many solutions that will hold a lot of data for little cost. Take the 1U Maxtor box [slashdot.org] for example. At under $5,000 for 320 gig, it sounds good. However, it only has one NIC and doesn't support an SSA connection so we can't use it. It doesn't scale well within our application environment.
InitZero
Re:This is where Freenet/etc. can help! (Score:1)
Also, I read in an AJC technology article that there is a group archiving the Internet (something like 3TB and counting when I last saw it almost 6 months ago). Sorry, I don't have a link.
Also, check out www.archivists.org and
http://www.loc.gov/ead/ead.html
for Encoded Archival Description format.
Thnx,
Fuller
Re:This is where Freenet/etc. can help! (Score:1)
Not that I doubt you, but could you give some details?