Long-Term Storage of Moderately Large Datasets?

hawkeyeMI writes "I have a small scientific services company, and we end up generating fairly large datasets (2-3 TB) for each customer. We don't have to ship all of that, but we do need to keep some compressed archives. The best I can come up with right now is to buy some large hard drives, use Linux software RAID to make a RAID 5 set out of them, and store them in a safe deposit box. I feel like there must be a better way for a small business, but despite some research into Blu-ray, I've not been able to find a good, cost-effective alternative. A tape library would be impractical at the present time. What do you recommend?"
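
For reference, the setup the submitter describes takes only a few commands with mdadm; a minimal sketch, assuming three blank disks with illustrative device names:

    # build a 3-disk RAID 5 array from whole disks
    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
    # put a filesystem on it and mount it
    mkfs.ext4 /dev/md0
    mkdir -p /mnt/archive
    mount /dev/md0 /mnt/archive

Note that whatever machine later reads the drives will need to reassemble the array (mdadm --assemble) first.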
  • Amazon AWS? (Score:5, Interesting)

    by TSHTF ( 953742 ) on Wednesday March 03, 2010 @06:23PM (#31351386) Homepage
    It might not be the cheapest option, but with Amazon's AWS [amazon.com], you can snail-mail them a drive with the data and they'll store it in S3 buckets.
  • Go with Blu-ray (Score:3, Interesting)

    by sabreofsd ( 663075 ) on Wednesday March 03, 2010 @06:29PM (#31351446) Homepage
    With the advent of 2TB drives, you could easily combine three of them with software RAID 5 as you suggested. Depending on how long you need to keep the data, recording it to dual-layer Blu-ray discs might be a better solution. Yes, it's a lot of discs (you can buy 100GB discs now), but they'll last longer and you don't have to worry so much about mechanical failure or needing a certain OS when you want to restore them.
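
    At 2-3 TB, the archive has to be chunked to fit on discs; a minimal sketch with stock tools (chunk size and device path are illustrative; growisofs can write BD-R media):

    # split a compressed archive into chunks that fit a single-layer 25GB BD-R
    tar czf - dataset/ | split -b 23G - dataset.tgz.part_
    # burn one chunk per disc
    growisofs -Z /dev/dvd -R -J dataset.tgz.part_aa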
  • Drobo fan and user (Score:3, Interesting)

    by Lvdata ( 1214190 ) on Wednesday March 03, 2010 @06:33PM (#31351502)
    You might look at a Drobo (www.drobo.com); they come in 4-, 5-, and 8-drive enclosures. With 1 TB disks you get 3 TB of usable space with a two-drive failure tolerance. I have the older 4-bay Drobo (two for myself, and two at separate clients' offices). It is much simpler to use, will scale to your 2-3 TB need, and accepts mismatched drives that normal RAID will not. Get one enclosure to start with and then, financing permitting, get a second for Drobo redundancy. Not the fastest or cheapest, but reasonably good on both counts, and simple to use.
  • by forgottenusername ( 1495209 ) on Wednesday March 03, 2010 @06:34PM (#31351504)

    I don't think it's a great solution. You're storing relatively fragile hard drives in a RAID 5 configuration in a lock box? It's not like you can tell that one of the drives has gone bad and needs to be replaced while it's sitting in a box. You'd have to regularly pull the data sets out, fire them up, and make sure everything is still functional.

    I'd at least want to do 2 complete sets of mirrored drives.

    Tape storage does hold up better on the shelf.

    Depending on how important the data is, I might do something like a local mirrored drive set in storage and an online copy at something like rsync.net. Stay away from S3; it's not designed to protect data, despite what AWS fans may say.
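
    The "fire them up and check" step can at least be scripted with a checksum manifest; a minimal sketch, with illustrative paths:

    # when archiving, record a checksum for every file
    find /mnt/archive -type f -print0 | xargs -0 md5sum > manifest.md5
    # at each periodic health check, report anything that no longer matches
    md5sum -c manifest.md5 | grep -v ': OK$'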

  • by __aastpl2241 ( 1723140 ) on Wednesday March 03, 2010 @06:36PM (#31351548)
    Magnetic disks are not permanent storage; AFAIK not even medium-term. They break when they must not. Optical media such as Blu-ray, kept in a really safe place (hidden from light, humidity, and temperature swings), can last much longer. Hard disk failure can always occur; I had a RAID 1 at home and both disks failed at the same time. You could also do something useful such as compressing your data and adding some extra disks for safety, with some kind of error-correcting code spread across them.
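
    The error-correcting-code idea is what parity archives do; a minimal sketch using the par2 tool, with an illustrative 10% redundancy level:

    # create parity blocks covering the archive (10% redundancy)
    par2 create -r10 dataset.tar.gz.par2 dataset.tar.gz
    # later: check the archive, and rebuild damaged parts from parity
    par2 verify dataset.tar.gz.par2
    par2 repair dataset.tar.gz.par2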
  • by strangeattraction ( 1058568 ) on Wednesday March 03, 2010 @07:17PM (#31352074)
    I repeat: never use DVDs as long-term storage. I have seen them go unreadable in anywhere from two to five years. I have fired up disk drives ten years later with no problems. They are cheap, reliable, and fast. Don't try to get fancy; just compress the data sets and store them across multiple volumes. Don't use RAID.
  • Re:GMail Drive (Score:5, Interesting)

    by 0100010001010011 ( 652467 ) on Wednesday March 03, 2010 @07:25PM (#31352160)

    That's what ZFS is for.

    # gmailfs is FUSE-based; 'none' stands in for the device argument
    mount -t gmailfs none /disk1 -o username=gmailuser,password=gmailpass
    mount -t gmailfs none /disk2 -o username=gmailuser,password=gmailpass
    mount -t gmailfs none /disk3 -o username=gmailuser,password=gmailpass
    mount -t gmailfs none /disk4 -o username=gmailuser,password=gmailpass
    mount -t gmailfs none /disk5 -o username=gmailuser,password=gmailpass

    # raidz wants block devices or files, not directories, so back it with file vdevs
    for i in 1 2 3 4 5; do truncate -s 1G /disk$i/vdev; done
    zpool create gzfs raidz1 /disk1/vdev /disk2/vdev /disk3/vdev /disk4/vdev /disk5/vdev

    Actually.... I think I just found my project for the evening. I mean it's already been done with 12 USB drives [sun.com]

  • by HeronBlademaster ( 1079477 ) <heron@xnapid.com> on Wednesday March 03, 2010 @07:28PM (#31352206) Homepage

    Stay away from S3; it's not designed to protect data, despite what AWS fans may say.

    Just curious... S3 stores all of your data at multiple, geographically separate data centers. How exactly does that not protect your data? What else would you want it to do in terms of protection? It even gives you md5 sums of your files if you want to verify them (check the ETag attribute of each object).

    So, honest question: what do you think they're missing to make S3 really protect data?
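
    That ETag check is scriptable; a minimal sketch, with a hypothetical, publicly readable bucket and key (and note the ETag only equals the MD5 for objects uploaded in a single part):

    # checksum the local copy
    md5sum archive.tar.gz
    # fetch the stored object's ETag with a HEAD request and compare
    curl -sI https://mybucket.s3.amazonaws.com/archive.tar.gz | grep -i etag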

  • by Anonymous Coward on Wednesday March 03, 2010 @07:47PM (#31352436)

    Are you storing this data for your customers, or are you just keeping it in case? If your customers are paying you to keep this data in archive form for some set period of time, then you need a solution that will meet that. Taking hard drives offline and hoping for the best is not a reliable long-term way to store data. If you leave them online, then at least you can detect disk failures, but then you are at risk from accidental or malicious deletion.

    If you are serious about protecting your customers' data, then it needs to be written to a real offline medium (tape, optical, cloud), preferably in two copies: one on your site, and one at a managed third-party site that the customer can access if your business model doesn't pan out. If the data contains any sensitive information, it should be written in an encrypted format, and you should provide your customer a way to decrypt that data if you are unable to.

    The cloud providers have some great options for these services, but the data sizes you are talking about should be much cheaper to manage with one or two LTO-4 drives direct-attached to a system of your choice; a bonus is that LTO-4 supports hardware encryption and is widely adopted. You may be a small business, but multi-TB data archives are normally a mid-to-large-size business problem, so the small-business online solutions I have seen are not priced to solve your problem (though there are new ones all the time). Finally, LTO-5 will be out soon, so LTO-4 prices should be coming down some over the next few months.

    If you are not storing this data for your customers' benefit but for your own (to reuse in future work, etc.), then just decide what it's worth to you. I still think tape or maybe Blu-ray will be more reliable / dependable / repeatable than hard drives, but if it's your data, do what you want with it.
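
    As a rough illustration of the tape route with stock tools (software encryption here; the device path is illustrative, gpg prompts for a passphrase, and an LTO-4 drive can do the encryption in hardware instead):

    # stream an encrypted, compressed archive to the tape drive
    tar czf - dataset/ | gpg --symmetric --cipher-algo AES256 | dd of=/dev/nst0 bs=256k
    # and to restore later
    dd if=/dev/nst0 bs=256k | gpg --decrypt | tar xzf -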

  • by iluvcapra ( 782887 ) on Wednesday March 03, 2010 @08:03PM (#31352582)
    As someone who did a lot of backing up (and maybe restoring, if I was lucky) to DDS DATs in the earlier part of the century, I can assure you there's a very good reason the drives are so cheap now :) The reliability was atrocious, and at $10 a cart, DVD-R is quite competitive.
  • Re:Exactly. (Score:4, Interesting)

    by TooMuchToDo ( 882796 ) on Wednesday March 03, 2010 @11:21PM (#31354078)
    Almost forgot to add: never pay for expensive disk systems. Put the intelligence into your application instead. It'll scale faster and much more cheaply. You also aren't locked into a technology (and instead can enjoy the falling costs of storage, both spinning and SSD).
  • by bruciferofbrm ( 717584 ) on Thursday March 04, 2010 @12:26AM (#31354430) Homepage

    A problem I have here is the definition of 'long term'. To each of us it means something different.

    In my job I have to archive 1.6 terabytes of data per day and keep it around for 45 days (which, BTW, is not my definition of LONG TERM). For this task I use Data Domain storage, which relies on data deduplication for massive compression.

    What you find is that at the block level your data may in fact be incredibly deduplicatable. In my case it very much is: I am currently storing 86 terabytes of rolling archives within 2.5 terabytes of physical disk space, roughly a 34:1 reduction. (A toy way to estimate how deduplicatable your own data is appears at the end of this comment.)

    The problem with any technology you use for 'long term' storage is the ability to read those archives later. Assuming the media doesn't degrade within the time frame you call 'long term', you must still have the tools to read that media again. If you use Blu-ray, then you must store a compatible drive with it (nothing says Sony won't change the standard in two years and make all current drives obsolete, so that no one makes them any more). Tape is worse: within two major model revisions, drives won't be able to read your media because its density is too low for the new drive-head technology. Hardware disk RAID has the issue that the controller the RAID was built with needs to stay with that RAID; another controller from the same manufacturer, with the same model number but a different firmware revision, may not be able to figure out the RAID and will declare the drives empty. Software RAID is a little easier to deal with, as long as you keep a copy of the OS you used to create it in the same box. But then, during your defined 'long term' period, will you still have access to a system you can even plug these drives into, or run the OS on?

    What you end up dealing with in reality is that as an archivist, you either ignore these facts, or you invest in a constant media / technology refresh and spend large amounts of time keeping your archives on the latest storage available.

    Of course, all this falls apart if your definition of 'long term' isn't as long as some will project. In my case, my archives roll over every 45 days. I could easily keep that data alive for years on a live piece of hardware with a service contract. If I do not trust that hardware enough, I can buy two and replicate between them. (which, actually I am, for disaster recovery purposes)

    With deduplication, my (acknowledged) high initial investment is quickly outweighed by the savings over single-purpose drives holding one copy each and wasting unused space. My purchase cost was less than $60k, but if I had to store all of that data in its raw form, my costs would be in the millions. However, if the data is not deduplicatable, then of course it is a moot point.

    Each answer has its flaws. You decide which risks are acceptable, plan as best you can for obsolescence, and settle on your definition of 'long term'. You also have to be ready to change your solution when the one you choose today fails to be the right solution for your needs in five years.
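
    As promised above, a toy way to estimate block-level deduplicatability: compare total 4 KiB blocks against unique ones (this writes one temp file per block, so only try it on a small sample):

    # slice a sample file into 4 KiB blocks
    split -a 6 -b 4096 sample.bin /tmp/blk_
    # total blocks vs. unique blocks
    ls /tmp/blk_* | wc -l
    md5sum /tmp/blk_* | awk '{print $1}' | sort -u | wc -l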

  • Re:Exactly. (Score:3, Interesting)

    by raddan ( 519638 ) * on Thursday March 04, 2010 @01:18AM (#31354722)
    Why distributed FS and not something like live mirroring/shadow copy? I wonder also... what do you consider an "expensive disk system"?

    We played around with DIY JBOD a bit (i.e., moving the complexity up into software) because it seemed a lot cheaper, but we have yet to get the thing to operate as reliably and simply as our fibre channel RAID units. The main problem we're running into is that for SATA to be practical, you need to multiplex several SATA disks onto single SATA ports, but software support for doing this (at least in Linux and Solaris) is terrible. SAS interconnects are obviously better in this regard, but the hardware doesn't end up being any cheaper than standard fibre channel RAID, so you spend the same amount of money while doing more work, because the spanning/striping/whatever is not managed for you. Also, under Linux, we've been somewhat dissatisfied with mdadm compared with something like OpenBSD's bioctl (which is easy to use and AWESOME). Sadly, OpenBSD is severely limited in some other areas that make it inappropriate for managing large disks (very large filesystems are not possible, fsck is prohibitively expensive with large volumes, etc.).

    The other thing I've been wondering (and it's unclear to me whether distributed filesystems solve this): they seem designed to spread data around on many nodes, but are they also good for allowing access FROM many hosts? E.g., our StorNext system allows us to plug any client into a fibre channel switch and simply use the disk without really having to think through what that means. But StorNext limits us because we can't plug, say, an OpenBSD machine into it (proprietary drivers, no client software), and that licensing can get expensive fast. StorNext's documentation also leaves something to be desired, and their salespeople do not inspire confidence (none of them could tell us how to actually install a license on a machine; I had to figure it out myself).
  • Re:Exactly. (Score:3, Interesting)

    by TooMuchToDo ( 882796 ) on Thursday March 04, 2010 @01:37AM (#31354824)

    Why distributed FS and not something like live mirroring/shadow copy? I wonder also... what do you consider an "expensive disk system"?

    Distributed file systems are great because your limitations are (almost) always going to be hardware. Want 1,000 boxes serving up your content? Get 1,000 commodity boxes with disk. Need 10,000? Also not a problem. A box filled with raw disk is WAY cheaper than an EMC, Nexsan, etc. (i.e., an expensive disk system).

    Serving over Ethernet should be fine, as you can always bond network connections together to increase throughput from your storage boxes to whatever boxes are processing the data (or even process the data *on* the storage boxes). I've bonded eight gigabit ports together on some of our storage boxes without problems (Linux kernel 2.6); a rough sketch of that setup follows below. I'm not a fan of fibre solutions: they're expensive, and I can get gigabit networking gear cheaper.

    In short, stick to commodity hardware and open-source distributed file systems. I know you're a little far along for that; we had the opportunity to build from the ground up.
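
    For reference, a minimal sketch of that kind of bonding on a 2.6 kernel (interface names and the address are illustrative, and 802.3ad mode needs switch support):

    # load the bonding driver in LACP (802.3ad) mode
    modprobe bonding mode=802.3ad miimon=100
    # bring up the bond, then enslave the physical ports
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7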

  • by jon3k ( 691256 ) on Thursday March 04, 2010 @12:28PM (#31358998)
    Are you doing disk-to-tape-to-tape, disk-to-tape and then disk-to-tape again, or disk-to-disk and then to tape? Are you using LTO-4? Just curious, because that seems like a lot of data to back up to tape (twice! every night!).
