Bulk Data Storage For The Common Man?
Vigyaan writes "Lately, I have been looking into different bulk data storage options available to a common man. My work depends on generating, storing and analyzing a large amount of data -- averaging about 1 TB per month. I would like to have a storage system which is automated, fast, reliable and, most importantly, does not cost the price of an eye. Right now, I have a 4 node Linux cluster with 10 large hard disks (total capacity 1.6 TB); data storage costs roughly $0.60/GB (excluding the cost of PC hardware). But long term storage is painful -- DVDs cost about $0.10-$0.15/GB but take too much human time, and leaving data on hard disks makes me nervous because of possible failures. RAID is a possibility, but it increases the cost significantly. I was wondering if Slashdot readers have any recommendations for a cheap automated way to store and retrieve data."
Hard disks (Score:5, Informative)
If you buy them in bulk you can save.
Burning DVDs is going to take you forever and drive you nuts.
Find a hotswappable set of drives and use that for your offline backups. Use a raid for your current backups.
!RAID (Score:1, Informative)
Rev Drives (Score:1, Informative)
Buy an older tape drive (Score:5, Informative)
Capacities are (for the cost of a sub $50 tape):
- LTO1: 100 GB uncompressed
- LTO2: 200 GB uncompressed
- SDLT220: 110 GB uncompressed
- SDLT320: 160 GB uncompressed
If your data is particularly amenable to compression (e.g. database data) you could easily get 3 or 4 to 1 compression with these drives without sinking your CPU utilization.
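As a rough sketch of what those capacities mean for the 1 TB/month workload in the story (taking 1 TB as 1000 GB), the number of tapes per month works out as follows. The function name is made up for illustration, and the 3:1 ratio is only an assumed figure that depends entirely on the data.

```python
import math

# Native (uncompressed) capacities in GB, as listed above.
TAPE_CAPACITY_GB = {
    "LTO1": 100,
    "LTO2": 200,
    "SDLT220": 110,
    "SDLT320": 160,
}

def tapes_per_month(data_gb, tape_gb, compression_ratio=1.0):
    """Tapes needed to hold data_gb, given an assumed compression ratio."""
    effective_gb = tape_gb * compression_ratio
    return math.ceil(data_gb / effective_gb)

for name, cap in TAPE_CAPACITY_GB.items():
    raw = tapes_per_month(1000, cap)
    packed = tapes_per_month(1000, cap, compression_ratio=3.0)
    print(f"{name}: {raw} tapes uncompressed, {packed} at an assumed 3:1")
```

If the data is already compressed before it reaches the drive, the uncompressed column is probably the one to budget for.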
Re:Wirewire drives? (Score:4, Informative)
Oh, but they are cheap. Just buy a large IDE disk and a $30 firewire/fast-usb enclosure.
I'm just not sure about the "long term", though. I have no idea what the shelf life of a hard disk is.
compression (Score:3, Informative)
Then backup using tapes just like every other place that has to do backups. Generally do full backups once a week and incremental ones nightly or whatever is necessary based on the data you are working with.
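A minimal sketch of the weekly-full / nightly-incremental rotation described above. Picking Sunday for the full backup and using file modification times to decide what goes into an incremental are both assumptions made for illustration; real backup tools track changes more carefully.

```python
import datetime
import os

def backup_kind(day, full_weekday=6):
    """Return 'full' once a week (Sunday by default), else 'incremental'."""
    return "full" if day.weekday() == full_weekday else "incremental"

def files_changed_since(root, cutoff_timestamp):
    """List files under root modified after cutoff_timestamp (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff_timestamp:
                changed.append(path)
    return sorted(changed)

print(backup_kind(datetime.date(2004, 8, 1)))  # a Sunday
print(backup_kind(datetime.date(2004, 8, 2)))  # a Monday
```

An incremental run would then write only the paths returned by `files_changed_since` for the time of the previous backup.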
Drawbacks, what are you willing to put up with? (Score:5, Informative)
Tape: Tapes break, they wear, they have dropouts, take a while to back everything up, can't always access files if you just want to restore something (Different methods vary, folks)... but ultimately, it's cheap when you use DAT because they're a common media. Swap the tapes twice as often (and throw old ones out) if you're paranoid about tape related failures.
Hard Drive: Most common form of backup I see now, mainly for the 1:1 size factor. Yeah, drives fail, too. Sometimes you have a pretty good warning when this is going to happen, sometimes you don't. (My 13GB Maxtor and 40GB IBM Deathstar drives both went *pfft* on reboot.) Get enough of them at once, you could swap out the logic boards if one does fry out. Ultimately, RAID or just simple 1:1 mirroring is probably the most efficient and easy method. Accessing bits and pieces is also easiest under this method. I personally just use an external USB2 case with a 120GB drive in it. Everything I want to back up goes on that drive, and then eventually... DVDRs. I turn off the drive when I don't need it... hopefully prolonging the life of it when I need it most.
DVDR: Not anymore. If we had these new-fangled DVDR discs (+ or -) say... when 2 to 6GB drives were common.... sure... But in addition to hard drives, recovering selective files is easy under this method too... Unless you use a backup program that crunches everything together on the disc in some spanning format. Burn times can be tedious... but it's not bad if you consider the overall amount of data you're putting on the disc. Cheaper than quality-brand name CDRs, though, in terms of price per mega/gigabyte. Only an idiot would trust $0.01-per-disc spindles for long-term backups. Even the longevity of DVDR has yet to be seen...
CDR: I'm not going to bother.
Network: Well, still relies on hard drives and other components... but good if you don't want to saddle one room with a ton of boxes. Simply for space and efficiency... external drive is probably better anyway.
Old fashioned method: Print everything out and keep it in a filing cabinet somewhere. You could always OCR the stuff later.
LTO Ultrium 2 Tape Drive (Score:4, Informative)
Re:Wirewire drives? (Score:5, Informative)
LaCie makes a 1 terabyte firewire (943 gigabyte formatted) drive. We get them for $1,080 a drive (Macmall matched Provantage's price). This is more than the article author spends now per gig, but these drives have done quite well in the studio. You can find cheaper firewire though.
We are at the point where hard drives give the best bang for the buck. The only fault of firewire is that my bosses have burned several bridges. Ground yourself before unplugging the drives. The bridges were cheap though. In any case, hard drives are probably the most failsafe and cost effective solution, with firewire being the easiest interface to use those drives with.
If its volume you want (Score:5, Informative)
Run the disks RAID 5 and you will get about 800GB of storage for $600. Now get two cheap ATA100 cards so you have a total of 6 channels, and mount each drive as a master on each channel. Build a 2GB root partition on the first disk (mirror it if you want) and then set the rest of the space up as a huge RAID 5 array.
Et voilà: a cheap, big server. To archive data, turn off the PC and throw it into the attic.
Re:age old problem... (Score:4, Informative)
Re:Give Up Now (Score:5, Informative)
Of course higher quality media might be better, but then you can no longer quote the $0.10/GB figure.
Re:Hard disks (Score:4, Informative)
If you want one of those nifty things with robotic arms and whatnot, plan on spending upwards of $3500. The AIT Automated Tape Library goes for that much and holds only 15 tapes. Plan on spending tens of thousands for something like Ampex's DIS 914 for 30 Terabytes.
Your friend is right: tapes are cheap. The equipment needed to support them is expensive, slow and error prone. It gets cost effective once you have enough money for a new Porsche though...
4*400GB SATA on RAID 5 (Score:5, Informative)
Not the cheapest, but fast, simple and saves you the unholy pleasure of having 2-3 DLT boxes to archive/cycle each month...
You already have a Linux cluster, so you could implement a distributed file system, or even simply a nightly incremental mirror to the target server if you can afford losing one day of work/computation...
It would help if you told us what sort of data you work with... databases and automated telescope tracking systems both need a large amount of storage, but you won't need the same system array for each...
I seem to remember a
You don't need to go to the Petabyte capacity but you will find some interesting comments on filesystems, disk virtualisation, 1U rack providers and so on... so a 1 Terabyte rack server is definitely possible...
Good luck...
DLT is the way to go (Score:3, Informative)
Ultrium (Score:3, Informative)
carousel for everyman (Score:4, Informative)
Cheap hardrives with firewire enclosure (Score:2, Informative)
Makes a cheap, fast way to put lots of data onto lots of hard drives. Using one of these bad boys means no extra money is spent on drive enclosures, cases etc. You only buy raw standard hard drives. Excellent if it's only backup, and you do not need lots of access. This solution is not automated however.
Hard drives are prone to failure. I was thinking of buying at least 2 drives of different brands to mirror, storing them in separate locations in sealed, air tight containers at just the right humidity/temperature. Also I think a disk check every 6 months or year would be necessary, and if any problems are found, replace the disk with another.
One beauty with this method is you only need to pay for disk space as you need it, and hard drives may still get much bigger. I was going to buy drives at the lowest cost/megabyte which at the moment is 160GB drives.
I would love to find more information on the physical storage of hard drives, especially how long they would be expected to last without use - months? years?
Re:1024 GMail Accounts (Score:2, Informative)
Um, 1TB a month in IDE drives is cheap . . . (Score:3, Informative)
Promise SX6000 = $255.95. (6) 200GB IDE drives in a Raid 5 = $624.95
If you had a separate boot drive from the SX6000, you could just bring the system down for a couple hour maintenance once a month and slam all the drives out and put fresh ones in.
Just keep buying new 200GB drives and shelve the old ones (or if it's *really* valuable and your home firesafe isn't enough, pay Iron Mountain or someone to keep it).
There aren't hidden labor costs outside of those two hours it takes to set up a new array every month. DVDs are about 60 bucks a month for a TB, with a hundred or so for a drive (which *will* need to be replaced occasionally if you are burning that much), but you'll spend hours and hours just dealing with the swap outs and breaking up your data.
If you don't have to keep the TB of data after a month or three, then your price gets even cheaper after you invest in your initial hard drive media sets . . . and you can put all the drives in hot swap chassis to further minimize your time dealing with the issue.
Of course this is all moot if your 1TB of data isn't valuable enough to invest 600 a month in . . .
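To put numbers on the six-drive set priced above, here is a small sketch. It reads the $624.95 as the price of all six 200GB drives, which is an interpretation of the post, and the helper name is invented for the example.

```python
def raid5_usable_gb(drive_count, drive_gb):
    """RAID 5 spends one drive's worth of space on parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_gb

drives_cost = 624.95      # six 200GB IDE drives, price quoted above
controller_cost = 255.95  # Promise SX6000, bought once

usable = raid5_usable_gb(6, 200)
print(f"usable space: {usable} GB")
print(f"media cost per GB: ${drives_cost / usable:.2f}")
print(f"first month incl. controller: ${(drives_cost + controller_cost) / usable:.2f}/GB")
```

Six drives in RAID 5 leave five drives' worth of usable space, which is where the 1 TB per monthly set comes from.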
Exabyte (Score:2, Informative)
http://www.exabyte.com/products/prodviews.cfm
I think you can store about 1.6 TB on a single tape or similar, but check them out. Tape drives have come a long way from old SCSI DATs transferring 20meg a minute. And they're fully automated and although there's an outlay cost for the tape drive, over time the cost per gigabyte for storage will be lower than hard drives.
If you have a security company do patrols of your office you can get them to take the tape offsite with them after nightly backups for added security... etc etc.
Bad idea (Score:5, Informative)
Re:Wirewire drives? (Score:5, Informative)
I'm not a "Common Man" then (Score:3, Informative)
I have to disagree with the sister system though. For most geeks like you and I a sister system would be fairly adequate. It would be better with an occasional off-site backup. However it really sounds like this guy's data is far too valuable to have only one copy of it and to have all copies be at one physical location. He really needs an off-site backup somewhere. Imagine for a moment if his home (I'm guessing he works from home, but this still applies to a real store-front business) was robbed. The crooks didn't know what they were taking. They saw two shiny computers in an office and figured they could hawk them on the street. There goes all his data, both copies. D'oh! So in short a sister system is a good idea but it probably won't do this guy much of any good. It would be a good local solution for a short term live mirror (ie, data is archived that night but the sister machine gives you a backup for that one day's work).
Re:Hard disks (Score:3, Informative)
Re:Give Up Now (Score:5, Informative)
Re:Finally a use for my 1GB Gmail invites... (Score:3, Informative)
Yee. Sent someone else who replied an invite too.
Mod this up and I'll send you one too.
Re:age old problem... VXA2 (Score:1, Informative)
Re:Hard disks (Score:4, Informative)
If I had the money, at a minimum I'd get a tape drive that could handle the 200 GB (uncompressed) tapes. Something like IBM's LTO Gen-2 Tape Library. That should run a bit less than $6,000.
For that matter, if I won the lottery, my first purchase would probably be a top of the line tape backup system instead of the usual new car.
Since I can't afford it, I use DVDs and CDs for backups. They are a pain in the neck and are not that dependable, but I keep backups up to a year on DVD+RW so if one fails, hopefully the others will have the data.
Instead of writing directly to the DVD writer, I write the backups to disk and then copy the backup sets to the DVDs.
I also keep a complete current backup of nearly everything important on a separate computer.
Re:Bad idea (Score:4, Informative)
Google does not want ANY data to be lost. They have many mirrors of all data.
Re:age old problem... (Score:2, Informative)
Re:Personally I prefer something in a blonde (Score:3, Informative)
KFG meant to say "You can have fast, good, or cheap. Pick two."
It's an old software design maxim that applies surprisingly well to this subject.
What about aggregate speed? (Score:2, Informative)
For simplicity, I'm not going to go into RAID tradeoffs, etc. and just stick with "striped data", which gives you maximum bang for the buck. You should draw up a simple spreadsheet with the following headings:
It's not exactly a great spreadsheet layout, but it should be enough to enter everything in and start seeing what is practical and what isn't. I'm sure that someone else would be able to enhance this a little further - any takers?
By the way, you really should think about RAID-5 at the very least. With plain striping, all it would take is just one failed drive to hose your data completely. Besides, as the array grows in size, the price tradeoff becomes smaller and smaller, to the point where it's really not worth your time to stripe all of your data without redundancy. I believe that the md drivers in Linux support up to 32 devices per RAID set. That takes your overhead from 1/5 of your array (in a 5-drive setup) down to 1/32 of your array.
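The shrinking overhead is just one drive's worth of parity spread over more drives. A quick sketch (the function name is hypothetical, and the 32-drive case simply takes the poster's recollection of the md limit at face value):

```python
from fractions import Fraction

def raid5_parity_overhead(drive_count):
    """Fraction of raw capacity used for parity in an N-drive RAID 5 set."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return Fraction(1, drive_count)

for n in (5, 8, 16, 32):
    overhead = raid5_parity_overhead(n)
    print(f"{n:2d} drives: {overhead} of raw capacity ({float(overhead):.1%}) goes to parity")
```

The flip side is that a single parity drive only covers one failure, so a wider set has more drives that can fail while a rebuild is in progress.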
A SAN-style setup lends itself well to this, but the price is very prohibitive to "the common man", as it requires very expensive hardware. You can emulate something like this via GFS support in Linux, which (theoretically) would allow you to aggregate your data.
If there is a requirement to keep the data online at all times, you'll need to spend more on some PC cases, as well as some networking to string the units together. Pick a reasonably-priced case that will house all the media units, have adequate power (at least 250 Watts, 300+ would be ideal) and keep them cool. Use a motherboard that is reliable, and can adapt to several different clock speeds for a given CPU; you'll want something that can be thrown out for less than $99.00 if it should go bonkers on you, but if the CPU burns up, you should be able to still get parts off the shelf and get the motherboard running again. Stick with the "commodity" or low-end CPUs, as (a) they tend to be cheaper, and (b) having been through a complete lifecycle, any bugs or issues with the CPUs will be well-known by now. Don't worry about the speed of the board or CPU at this point, as most "modern"
Mmmm, hard disks (Score:1, Informative)
I've got an EMC Clariion CX200 fibered to the servers, that means it's a SAN. It's 3 TB. I think it was $40k. Then I've got a Win2k3 Appliance Server fibered to the EMC Clariion CX300 with 12 TB. It shares out through various filesystems (nfs, smb, ftp etc.). So it's my NAS. The CX300 I think was about 100k. The SAN holds the online data, the NAS holds the archived data. Oh, and the NAS head is fibered to a Dell PowerVault 12TB LTO2 tape jukebox, which backs up the NAS. It was pretty cheap, I think it was 18k or so.
Oh yes, and just for the slashdot crowd. Yes, this equipment will be holding fMRI studies, as well as MRA, CT, DR, CR, CTA, US and a variety of other modalities. Being a PACS engineer is fun, although I do make less than a teacher. Fucking Economy.
Re:age old problem... (Score:1, Informative)
LTO2: 200GB/400GB
SAIT1: 500GB/1300GB
Re:Firewire hub with hardware RAID (Score:3, Informative)
Firewire drives can be daisychained, and in fact OS X allows you to set up software RAID on multiple firewire drives attached to the system. You can't move them to another system and get access, but that's about the only limitation that I've found and it's more than decent for local high-density storage.
How about compression? (Score:1, Informative)
First of all, is the 1TB of data that you collect every month mostly different or mostly the same as the data you collected in the past month?
Secondly, how compressible is the data you are collecting?
Thirdly, how much random access do you need to the data, or is a serial stream of the data good enough?
--
If the data you have is mostly the same from month to month then you only have to preserve the difference between 2 months.
If the data is highly compressible then you can use bzip2 -9 to make the data much smaller, therefore needing a lot less of the media than otherwise.
A 1TB file that compresses 50 to 1 will only be 20 GB. This will easily fit on 5 DVDs.
If you collect 1TB of data and diffing it with the previous month's data outputs only 100 GB of differences, and that compresses down 22 to 1, then you can fit it on a single 4.7 GB DVD.
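The disc counts above are easy to check with a few lines. This sketch assumes a nominal 4.7 GB single-layer DVD and ignores filesystem overhead; the function names are invented for the example.

```python
import math

DVD_GB = 4.7  # nominal single-layer DVD capacity, decimal gigabytes

def dvds_needed(data_gb, compression_ratio, disc_gb=DVD_GB):
    """Discs needed once data_gb has been compressed by compression_ratio."""
    compressed_gb = data_gb / compression_ratio
    return math.ceil(compressed_gb / disc_gb)

def min_ratio_for_one_disc(data_gb, disc_gb=DVD_GB):
    """Smallest compression ratio at which data_gb fits on a single disc."""
    return data_gb / disc_gb

print(dvds_needed(1000, 50))                  # 1 TB at 50:1 is 20 GB
print(round(min_ratio_for_one_disc(100), 1))  # ratio needed for a 100 GB diff
```

Plugging in a measured ratio from a sample of the real data would show quickly whether DVDs are even in the running.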
rsync is also good for copying the data on a system to a remote system.
more about the app. (Score:1, Informative)
This might sound off topic, but I have used different compression schemes (like wavelets) for sensor data, and regenerative techniques with statistical summaries, that gave me about 2000:1 equivalent compression. That still does not address the 1TB/month backup problem, but might help reduce the problem somewhat.
As for backup: in the past I have chosen to go with a modified cluster solution, where I set up a data server in another building and automatically backed up the new/unique data to it. The reason I chose slow/large HDs instead of tapes is my experience with using tapes over the course of 10 years. I could not even get replacement tapes if I needed them, not to mention the actual tape drives...
How long do you realistically need to keep the data around for? Can you recycle the media (say after 1 to 2 years)? Does it need to be truly permanent?
What are your near- and long-term requirements (Score:5, Informative)
I looked through some of the answers here, and as near as I can tell, you've got a bunch of home hobbyists telling you how to back up your home computers. Perhaps all your needs entail is a computer with an external IDE drive array and 4-10 200G SATA drives in it. But from your initial post, it's not clear what you need your offline storage _for_.
First of all, you mention that you generate and use 1 TB of data a month. What happens at the end of that month? Does all of the data become useless? Is some of it carried through? Is it useful for historical processing for some time after it's not "live" any more? The disposition of that offline data is important; you can't determine how you can most effectively back up your data until you know what you need to do with that data once it's backed up.
Since no one cares about backing up old data that they never use any more, I'm going to assume you need this data in some form in the future. I'm also assuming that your data ages out completely every month.
Realistically, you have two options: Large redundant disk arrays, or tape. Various factors give credence to one or the other.
First of all, get off of the SATA hacks, and realize you're going to need to go to SCSI, whether you end up with disk or tape. You're backing up data, you're going to want it to be reliably written out, and SCSI is the de facto standard for backup architecture. Yes, you pay more for it, but there's a reason for it: the SCSI equipment I manage at work fails a fraction of the percentage of time that the various IDE/ATA systems fail. While SATA is marketed as a consumer technology, it will never meet the rigors of being a reliable backup methodology.
Re:Hard disks (Score:3, Informative)
RAID does not a backup system make. You still need backups.
For increased on-line availability, how about a good distributed file system with several servers? And, of course, back everything up anyway.
Re:options options, what is your time and data wor (Score:3, Informative)
A disadvantage is that the data cannot change while you write all N+1 DVDs and restoring would require lots of DVD swaps (regardless of whether you've lost a DVD or not) and the ability to incrementally write files with gaps in them (not an issue with most filesystems).
Re:Finally a use for my 1GB Gmail invites... (Score:3, Informative)
Hotmail and M$ have been around for ages, and they still failed to back up Hotmail despite their infinite Windows licenses. What makes you think your 1 Gig will be backed up by Google?
Re:Wirewire drives? (Score:2, Informative)
ICI Optical Tape / CREO Vancouver (Score:2, Informative)
There were several packaging options for this tape.. including reels of 2" wide tape and cartridges.
I've lost track of what happened to it. All I remember is that this tape existed at one time and some research was being done to make data recorders of phenomenal storage capacity.
Back in the early 90's, there was one company in Campbell, California, known as "LaserTape" which was trying to design a tape drive for the PC which used cartridges of this tape. I have lost track of whatever happened to the company.
Re:Finally a use for my 1GB Gmail invites... (Score:3, Informative)
Inside source? Just call them up and ask! It's not hidden knowledge [com.com].
Re:Firewire drives? (Score:2, Informative)
In those applications, I've gone with dedicated ATA/133 cards with a nice roomy case with a bunch of removable drive bays. It's a pain to have to shutdown to swap drives, but less of a pain than Windows bluescreening, rebooting, and "fixing" your attached Firewire drive, scrambling all of the data on it, and making it impossible to run a recovery (no, I didn't have a backup of that data...)
I've also had weird crap happen with my Macs as well - some hardware doesn't show up unless you have it plugged in on startup. In theory it was a great idea (mix Firewire cases with removable drive bays), in practice, you're asking for trouble if you're using cheap parts (ie, bottom-basement cases, with cheap cables.)
A cheap, simple solution (Score:2, Informative)
Re:Use those HDDs! (Score:2, Informative)
"Microfilm!?" I hear you say, "But this is 2004!". I was surprised too until I heard their reasoning: the only thing you need to read microfilm is a magnifying glass and a light source.
And that, ladies and gentlemen, makes it virtually future-proof. Try finding a drive for your old 5 1/4 C64 floppy disks (or better yet, those bloody tapes that took 30 minutes only to fail to load).
Thanks to all for lots of useful suggestions (Score:2, Informative)
Hard drives based solution seems to be (currently at least) cheap + easy to use for immediate use
DVDs are cheaper for long term storage but automation devices are still not commonly available (plus capacity per DVD is small)
Software RAID is slow for writing but okay for reading data
Tapes may be a viable alternative for long term data storage but the tape drives require an initial investment
Some readers have mentioned "LaCie Bigger Disk". $1199 for 1 TB disk space ... a price to pay for convenience.
Since a lot of you have asked me, now I will explain the nature of my data, its storage and analysis.
I am a scientific researcher working in computational biology. I do atomistic modeling and generate snapshots of protein conformations along MD trajectories. The data is analyzed several times to calculate different quantities. Those of you who do this kind of stuff know that we can collect and store only a fraction of data we want. I generate this data on supercomputers and then compress it (using bzip2 and gzip) to store temporarily and permanently.
The data generated is usually analyzed within the next month. A good fraction of the data (about 70-80%) needs to be analyzed again several times for different quantities in the next few months. In my experience 80% of data is usually discarded after a year or so. Therefore 2.4 TB/year need permanent storage.
So 500 GB at least is required for "daily use", 1-2 TB would be nice to have for intermediate use and over 2 TB will need "permanent" storage.
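The permanent-storage figure follows directly from the numbers above. This small sketch takes 1 TB/month and an 80% discard rate from the post; the multi-year projection is an added assumption that nothing is ever deleted from the archive.

```python
MONTHLY_TB = 1.0           # roughly 1 TB generated per month
DISCARD_AFTER_YEAR = 0.80  # about 80% is discarded after a year or so

def permanent_tb_per_year(monthly_tb=MONTHLY_TB, discard=DISCARD_AFTER_YEAR):
    """Data per year that survives the discard step and needs long-term storage."""
    return monthly_tb * 12 * (1 - discard)

def archive_size_tb(years, monthly_tb=MONTHLY_TB, discard=DISCARD_AFTER_YEAR):
    """Cumulative permanent archive after a number of years (no deletions assumed)."""
    return years * permanent_tb_per_year(monthly_tb, discard)

print(f"permanent storage per year: {permanent_tb_per_year():.1f} TB")
for years in (1, 3, 5):
    print(f"after {years} year(s): {archive_size_tb(years):.1f} TB archived")
```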
Re:Give Up Now (Score:3, Informative)
It will get scratched and damaged.
Which means that unless you're adding recovery data (using QuickPar) or burning 2 copies, you will lose at least some data on the media within a few years. (Cheap media sometimes only lasts a few months if not stored in dark and climate-controlled conditions.)
QuickPar is nice because you can pick how much redundancy you want on the disc. I find that 5-10% is plenty for most uses and guards against all but catastrophic damage to the disc.
(The guideline for redundancy is based on how often you check the media vs how fast the media degrades or is damaged. If the media degrades at a rate of 1% per month and you only verify the disc annually, you'll want at least 12% redundancy but more like 18% redundancy.)
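That rule of thumb can be written down as a tiny calculation. The sketch assumes damage accumulates linearly between checks, and the 1.5x safety factor is inferred from the 12% versus 18% figures in the example rather than stated anywhere; the function name is made up.

```python
def recovery_redundancy(degrade_per_month, months_between_checks, safety_factor=1.5):
    """Return (minimum, suggested) redundancy as fractions of the disc.

    The minimum covers the damage expected between two verifications;
    the suggested figure pads that by safety_factor.
    """
    minimum = degrade_per_month * months_between_checks
    return minimum, minimum * safety_factor

minimum, suggested = recovery_redundancy(0.01, 12)
print(f"at least {minimum:.0%}, better {suggested:.0%}")
```

Checking the discs twice as often halves the redundancy needed under the same assumptions.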