Long-Term Storage of Moderately Large Datasets?
hawkeyeMI writes "I have a small scientific services company, and we end up generating fairly large datasets (2-3 TB) for each customer. We don't have to ship all of that, but we do need to keep some compressed archives. The best I can come up with right now is to buy some large hard drives, use software RAID in Linux to make a RAID5 set out of them, and store them in a safe deposit box. I feel like there must be a better way for a small business, but despite some research into Blu-ray, I've not been able to find a good, cost-effective alternative. A tape library would be impractical at the present time. What do you recommend?"
Different manufacturers (Score:5, Insightful)
Hard drives are ridiculously cheap these days, especially for how much data you are storing. You may wish to consider buying drives from different manufacturers but of the same size to put in a single mirrored set. This way if there is a problem with a particular batch of drives it won't ruin everything.
Use RAID6 not RAID5 (Score:4, Insightful)
I would use RAID6 rather than RAID5: two drive failures mean data loss with RAID5, while it takes three drive failures to lose data on RAID6.
Linux mdadm has supported RAID6 for years, and it's stable.
I would mix and match drives rather than buying all the same model from one maker: one Samsung, one WD, one Hitachi, one Seagate.
With 2TB drives, that gets you 4TB usable from 4 drives, and unlike a RAID1, any 2 drives can fail with no data loss.
You can further guard against data loss by making a second copy, again using different-brand drives for each clone.
Eight 2TB drives is around $1500. Not bad for a very safe 4TB backup.
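To make the capacity arithmetic above concrete, here is a minimal Python sketch for equal-size drives. The function names (usable_tb, failures_tolerated) are made up for illustration, and the minimum drive counts are the conventional ones (3 for RAID5, 4 for RAID6), not something checked against any particular RAID tool.

```python
# Sketch only: usable capacity and failure tolerance for equal-size drives.
# Names and minimum drive counts are illustrative assumptions.
PARITY_DRIVES = {"raid5": 1, "raid6": 2}


def usable_tb(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity in TB for a set of identical drives."""
    if level == "raid1":
        if drives < 2:
            raise ValueError("a mirror needs at least 2 drives")
        return size_tb  # every drive holds the same data
    if level not in PARITY_DRIVES:
        raise ValueError(f"unsupported level: {level}")
    parity = PARITY_DRIVES[level]
    if drives < parity + 2:
        raise ValueError(f"{level} needs at least {parity + 2} drives")
    return (drives - parity) * size_tb


def failures_tolerated(level: str, drives: int) -> int:
    """How many drives can fail before data is lost."""
    if level == "raid1":
        return drives - 1
    return PARITY_DRIVES[level]


if __name__ == "__main__":
    for level in ("raid5", "raid6"):
        print(level, usable_tb(level, 4, 2.0), "TB usable,",
              failures_tolerated(level, 4), "failure(s) tolerated")
```

With four 2TB drives, RAID6 gives 4TB usable and survives any two failures, while RAID5 gives 6TB but survives only one.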
Re:Exactly what you're doing (Score:5, Insightful)
That's why you hot-swap them. You treat them just like tapes. In fact, once you start doing that, you realize that RAID mirroring isn't helping you any (striping is another matter).
The best way to back up a big hard drive these days is with another big hard drive.
Re:Tape is your friend (Score:4, Insightful)
There's some code lurking in the amanda backup package I did a while back for "RAIT" (RAID with tape instead of disk) to make a stripe-set of tapes, if you need several tapes' worth of data in one set, with redundancy.
On the other hand, while LTO-4 tapes are about half the price ($40) of cheap 1TB disk drives ($80), the tape drives are about $2k apiece, so depending on how many data sets you want to keep, and for how long, the disk drives may really be cheaper...
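A rough way to see where the crossover sits, using only the prices quoted above ($40 per tape, $80 per disk, $2k for one tape drive). This is a sketch under a big simplifying assumption: one tape and one disk are treated as interchangeable units of storage, which ignores capacity differences, compression, and the cost of enclosures or spare drives. The function names are hypothetical.

```python
# Sketch only: tape vs. disk total cost using the prices from the comment above.
# Assumes one tape stores as much as one disk, which is a simplification.
import math


def tape_total(units: int, tape_price: float = 40.0,
               drive_price: float = 2000.0) -> float:
    """One tape drive up front, plus one tape per unit stored."""
    return drive_price + units * tape_price


def disk_total(units: int, disk_price: float = 80.0) -> float:
    """One hard disk per unit stored, no extra hardware counted."""
    return units * disk_price


def break_even_units(tape_price: float = 40.0, disk_price: float = 80.0,
                     drive_price: float = 2000.0) -> int:
    """Smallest unit count at which tape costs no more than disk."""
    if disk_price <= tape_price:
        raise ValueError("tape never catches up if media is not cheaper")
    return math.ceil(drive_price / (disk_price - tape_price))


if __name__ == "__main__":
    n = break_even_units()
    print(n, tape_total(n), disk_total(n))
```

Under these assumptions the two approaches cost the same at 50 units, so below that many archived media the disks come out cheaper.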
Re:Exactly. (Score:5, Insightful)
Ok, yes, we see you know a lot about this.
So what's your recommendation?
Blu-ray not the answer (Score:3, Insightful)
Re:Go with Blu-ray (Score:4, Insightful)
I thought burned optical discs started to degrade after a few years. Have they solved this problem?
Re:Blu-Ray (Score:3, Insightful)
Re:Use RAID6 not RAID5 (Score:3, Insightful)
Just because you had some problems with Samsung means nothing about their general reliability.
A few specific models have had problems, such as the IBM "Deathstar" models, or the recent Seagate firmware problems, but there is no evidence that whole brands are less reliable.
Read the Google report on drive brands; there are no clear winners or losers across brand lines in their exhaustive real-world tests.
Re:Exactly what you're doing (Score:5, Insightful)
(Or btrfs on a Linux distro)
Are you honestly suggesting using an in-development filesystem for backup purposes?
Re:Exactly. (Score:5, Insightful)
Feel free to ask more questions.
Re:Exactly what you're doing (Score:4, Insightful)
If it was medical records I'd be storing 5 copies in 5 geographically distinct locations, each with their own backup for the backup. I'd be checking the MD5s each day on all the backups to ensure they can be accessed when I need them.
I can about guarantee you that nobody stores medical records in this way. And realistically, why should they? 5 different locations is insane for just about any piece of data.
Geeks tend to go overboard when it comes to data paranoia and worry too much about technology, but then forget about all the human problems that go on. Most data loss doesn't come from some geographic catastrophe where a supervolcano destroys half a continent. More often someone changes some critical part of the backup scheme and the whole shebang comes crashing down. Super-redundant geographic co-location can't save you from one idiot who didn't understand that changing one critical name would silently take down the backup scheme.
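For what it's worth, the checksum check mentioned in the quoted comment is cheap to automate, and it also catches the "someone silently broke the backup" case, since missing files show up as failures. A minimal Python sketch follows, assuming the archives sit in one directory tree; the function names and the manifest layout are made up for illustration. MD5 is used because the quote mentions it, but any algorithm name accepted by hashlib.new (for example "sha256") can be passed instead.

```python
# Sketch only: record checksums for an archive tree, then re-verify later.
# Function names and manifest layout are illustrative assumptions.
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, algo: str = "md5",
                chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB archives do not fill memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path, algo: str = "md5") -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p, algo)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: Dict[str, str],
           algo: str = "md5") -> List[str]:
    """Return a list of problems; an empty list means everything matched."""
    problems = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            problems.append(f"missing: {rel}")
        elif file_digest(p, algo) != expected:
            problems.append(f"corrupt: {rel}")
    return problems
```

The manifest would be built once when the archive is written and kept somewhere other than the backup drive itself; a non-empty result from verify on any later run is the signal that a human should look.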
Re:Exactly what you're doing (Score:3, Insightful)
If disks in the safe deposit box are fast enough to access, running to the store to buy a generic power supply is fast enough recovery.