Data Storage Media IT

Long-Term Storage of Moderately Large Datasets? (411 comments)

hawkeyeMI writes "I have a small scientific services company, and we end up generating fairly large datasets (2-3 TB) for each customer. We don't have to ship all of that, but we do need to keep some compressed archives. The best I can come up with right now is to buy some large hard drives, use software RAID in Linux to make a RAID5 set out of them, and store them in a safe deposit box. I feel like there must be a better way for a small business, but despite some research into Blu-ray, I've not been able to find a good, cost-effective alternative. A tape library would be impractical at the present time. What do you recommend?"

  • by rwa2 ( 4391 ) * on Wednesday March 03, 2010 @06:19PM (#31351342) Homepage Journal

    I don't think you can beat a bunch of conventional hard disks in a RAID5 for both cost-per-TB and backup/restore performance, not to mention medium-term data integrity. You might be able to make hooking up the drives more convenient with an eSATA multi-bay enclosure, but those are kinda expensive. But I bet your backup box already has some sort of hot-swap on it, like: http://www.amazon.com/Thermaltake-BlacX-eSATA-Docking-Station/dp/B001A4HAFS [amazon.com]

    I assume you already compress your data, since scientific datasets tend to compress well. You might consider compressing to squashfs, since it will let you do transparent decompression later on so you can skip the restore step if you just need a handful of files.

  • Tape is your friend (Score:5, Informative)

    by chill ( 34294 ) on Wednesday March 03, 2010 @06:25PM (#31351414) Journal

    LTO tape, properly stored, will outlast burned optical media and hard drives. Great stuff and designed specifically for what you're talking about.

    http://en.wikipedia.org/wiki/Linear_Tape-Open [wikipedia.org]

  • Amazon S3 (Score:4, Informative)

    by friedo ( 112163 ) on Wednesday March 03, 2010 @06:26PM (#31351420) Homepage

    It can get a little pricey for huge datasets, but Amazon S3 now has an option where you can ship your data [amazon.com] on a big set of disks directly to them, they will import everything into S3, and it will live there forever. The nice thing about S3 is unlike physical disks, it can grow essentially forever, and comes with retention and redundancy guarantees. And once your stuff is in S3, you can recycle the same disks to mail them more data.

  • Re:Exactly. (Score:5, Informative)

    by TooMuchToDo ( 882796 ) on Wednesday March 03, 2010 @06:34PM (#31351516)
    Because Amazon can be *expensive* compared to doing it yourself ($$$ for data in, $$$ for data out, $$$ for monthly storage). But heh, what do I know. I just manage the storage for one of the LHC detectors (5PB spinning disk, 17PB tape). Amazon is good when you've got VC money or have no IT folks.
  • by cruff ( 171569 ) on Wednesday March 03, 2010 @06:35PM (#31351524)

    I agree, when the tapes are stored in proper environmental conditions. You don't need a library, just use some stand alone tape drives. Also look at the claimed media lifetime and recovered bit error rate figures to see if you are choosing the right tape drive/media.

  • Re:Amazon S3 (Score:2, Informative)

    by Andraax ( 87926 ) <mario.butter@silent-tower.org> on Wednesday March 03, 2010 @06:35PM (#31351530) Homepage Journal

    I hope so. A 3TB dataset on Amazon S3 would run $450 / month for storage.
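The arithmetic behind that figure can be sketched as follows. The rate of $0.15 per GB-month is an assumption, picked because it reproduces the $450 quoted above for 3 TB; the function name is hypothetical, and transfer-in and transfer-out fees are ignored entirely.

```python
def s3_monthly_storage_cost(dataset_tb: float, usd_per_gb_month: float = 0.15) -> float:
    """Monthly storage charge in USD, using decimal units (1 TB = 1000 GB).

    The default rate is an assumption inferred from the thread, not a quoted price list.
    """
    return dataset_tb * 1000 * usd_per_gb_month


# The submitter's datasets run 2-3 TB per customer.
for tb in (2, 3):
    monthly = s3_monthly_storage_cost(tb)
    print(f"{tb} TB: ${monthly:,.2f}/month, ${12 * monthly:,.2f}/year per customer")
```

Because the charge recurs every month and scales with the number of customers, the yearly total is the number worth comparing against a one-time hardware purchase.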

  • by pavera ( 320634 ) on Wednesday March 03, 2010 @06:36PM (#31351542) Homepage Journal

    You misunderstood the post. He needs 2-3TB PER CLIENT, not 2-3TB total.

  • by Saint Aardvark ( 159009 ) on Wednesday March 03, 2010 @06:38PM (#31351562) Homepage Journal

    Couldn't agree more. A tape library (as in autochanger) might be out of your budget, but a simple tape drive wouldn't be too much -- say $5000 for an LTO4. Media is $50-$100 or so depending on where you shop. Seriously, you're not going to find a reasonable way of storing that much data anywhere else.

    BTW, if you're not a member of LOPSA [lopsa.org], you may want to seriously consider it. Even if you're not a sysadmin, this is definitely a sysadmin-type question, and their mailing lists are second to none. It's an excellent resource.

  • use a tape drive (Score:4, Informative)

    by Lehk228 ( 705449 ) on Wednesday March 03, 2010 @06:40PM (#31351604) Journal
    You make the assertion that a tape archive would be impractical, but really it is the most practical solution. The drive will set you back a couple thousand, but 800 GB tapes are only around 40 bucks each, and unlike hard drives they are engineered for data storage. That comes to only $160 per 3 TB dataset (four tapes), or $200 if you use par2 files and an extra tape to make it recoverable in case a tape does fail.
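A minimal sketch of that media-cost estimate, reading the dataset as 3 TB (3000 GB) and taking the 800 GB capacity and $40 price from the comment. The function name and the `parity_tapes` parameter are hypothetical, and the one-time cost of the drive is left out.

```python
import math


def tape_cost(dataset_gb, tape_gb=800, usd_per_tape=40, parity_tapes=0):
    """Return (tapes needed, media cost in USD) for one archived dataset."""
    # A partly filled tape still has to be bought, so round up.
    data_tapes = math.ceil(dataset_gb / tape_gb)
    total_tapes = data_tapes + parity_tapes
    return total_tapes, total_tapes * usd_per_tape


plain = tape_cost(3000)
with_parity = tape_cost(3000, parity_tapes=1)
print(f"data only:     {plain[0]} tapes, ${plain[1]}")
print(f"plus par2 set: {with_parity[0]} tapes, ${with_parity[1]}")
```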
  • by TheMeld ( 13880 ) <`gro.tactsaf' `ta' `todhsals-hateehc'> on Wednesday March 03, 2010 @06:44PM (#31351660) Homepage

    The other thing to do if you want longish term reliability is to add redundancy to whatever you're storing with a tool like par2; http://www.par2.net/ [par2.net] and http://www.quickpar.org.uk/ [quickpar.org.uk] are your friends.

    Raid5 will help you if you lose a whole drive (e.g. it seizes up from sitting still for a long time). The par2 data will both allow you to verify that the data hasn't been corrupted and, if it has (e.g. a couple of sectors go bad), let you recover the data.
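The verification half of that idea can be illustrated with a plain checksum manifest. To be clear, this is not par2: it is a swapped-in, simpler technique that only detects damaged files and cannot repair them, so it complements recovery data rather than replacing it. All function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB archives never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or whose contents changed."""
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged
```

One way to use it would be to build the manifest when the archive is written, keep a copy of it somewhere other than the archive media, and run `find_damaged` whenever the drives come out of storage.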

  • by klubar ( 591384 ) on Wednesday March 03, 2010 @06:58PM (#31351838) Homepage

    Tape is probably your best option. You can buy a DAT-5 (or even a DAT-4) tape drive for not very much. The tapes cost about $10 to $30 each (depending on what tape option you choose). Make 3 copies of the data set, store one onsite, store another offsite in a secure/climate controlled facility and send the 3rd to the client. Buy a spare tape drive and use both to make writing across tapes easier. There is a wide variety of software to write to the tape; we use the aging Retrospect.

    The disk option is just way too complex; if anything, skip the RAID option and just store 2 copies. Putting the RAID sets back together and finding the RAID software will be nearly impossible in a couple of years. Use some standard formatting on the drives (FAT, NTFS, etc.) and you'll be good to go for the next 15 years.

  • by afidel ( 530433 ) on Wednesday March 03, 2010 @07:00PM (#31351870)
    I didn't see you mention LTO or DLT in your little rant against tape; both are superior to any other method for the storage of bulk offline data.
  • by mabhatter654 ( 561290 ) on Wednesday March 03, 2010 @07:07PM (#31351942)

    LTO4 tapes are reasonably priced, but the DRIVES push $5k. For long term even the drives go obsolete too quickly, then become MORE expensive in 5 - 10 years when you really need them.

    The best thing is probably what Google does, simply keep 3 "live" copies of the data. Then the data is always on current hardware. The data is on "production" hardware along with other stuff so it is properly monitored by the OS and database for integrity, and hardware is maintained with support. Drive arrays are cheap enough in large scale just to keep moving things over to one that's twice the size in half the space every few years. The problem of course is when you are a Small or Medium business keeping even $50k in hardware around you don't really NEED for operational purposes is a big money sink... way too much for "insurance". But unless you keep one copy of the data live and verify/refresh your tapes every year you really don't have "securely" stored data.

  • by rwa2 ( 4391 ) * on Wednesday March 03, 2010 @07:15PM (#31352030) Homepage Journal

    Yeah, keeping those drives in a huge online storage array is probably better. Then they can mirror them across multiple sites.

    Here's a compelling petabyte online RAID system for cheap:

    http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ [backblaze.com]

  • Re:Exactly. (Score:3, Informative)

    by hawkeyeMI ( 412577 ) <brock&brocktice,com> on Wednesday March 03, 2010 @07:24PM (#31352150) Homepage
    I already use S3 for some things. Unfortunately it would be about $500/month/customer's data on S3 right now, so that's out. I could buy a whole computer with hard drives every month for that.
  • by TravisHein ( 981987 ) on Wednesday March 03, 2010 @08:05PM (#31352598)
    How about a cluster file system like http://www.moosefs.com/ [moosefs.com], which works well on commodity hardware, or even on hardware that is end-of-life by current standards? Redundancy is achieved through several chunk server nodes in the system. Performance and size can be dynamically scaled over time by adding more nodes or disks to the system.
    Their project is GPL licensed, requires very little effort to set up, and likely much less effort over time to maintain and administer than a RAID or Tape system likely would, with the benefit of being able to choose what data is online or offline (by its presence on chunk servers being connected at the time).
  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Wednesday March 03, 2010 @08:18PM (#31352698)
    Comment removed based on user account deletion
  • IT Auditor Opinion (Score:1, Informative)

    by Anonymous Coward on Wednesday March 03, 2010 @08:34PM (#31352838)

    As usual, the "Ask Slashdot" doesn't have enough information to make a proper recommendation. There are a number of factors that need to be considered.

    1. How valuable is the data?
    2. How far into the future must you be able to produce the data for your customer?
    3. What resources do you have at your disposal to solve this problem?
    4. How secure do you need this to be?
    5. How quickly do you need to be able to produce the data?

    Once you answer those questions, you'll have a lot more insight into how to proceed. For example, if you suspect that a customer might want you to do something with their data in the next year or two, then your current solution seems reasonable.

    On the other hand, if you are contractually obligated to provide copies of the data upon request for the next 10 or 20 years, you would need to invest in proper future-proof archiving technologies and duplicate environmentally controlled storage in geographically disparate locations. You may even have to go so far as to include schemas or transform the data to more universal formats.

    As a related aside, if you incur big expenses to secure a client's data, you should be charging a premium for this service. At the very least, you should be using this as a selling point.

  • by lgw ( 121541 ) on Wednesday March 03, 2010 @08:35PM (#31352844) Journal

    Tape is really best for archiving, to this day. A single LTO drive won't break the bank for a small business, and it will be reliable.

    3 Things to remember about tape backup:

    Encrypt your backups. This is becoming available in the tape drive itself, but many backup applications will also do it for you in software. Limits embarrassment if a tape goes missing.

    Occasionally test restores. This is incredibly important - almost every unreadable tape in existence was unreadable when created. Any reasonable backup software will give you the ability to do this automatically (as part of the backup job). If practical, create a job that does a backup of everything, but verifies only some small volume. If you can read anything, chances are high that the whole tape is fine.

    Get those tapes offsite. A safe deposit box works for a tiny company, but someone like Iron Mountain works better and is less hassle. Store a copy of your encryption key in the same facility (but don't transport the tape and key together).
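The "back up everything, verify a small sample" idea from the list above could be sketched like this. It assumes the sample has already been restored from tape into a scratch directory by whatever backup software is in use; the function name, parameters, and directory layout are hypothetical.

```python
import filecmp
import random
from pathlib import Path


def spot_check_restore(source: Path, restored: Path, sample_size: int = 20, seed=None) -> list[str]:
    """Compare a random sample of restored files byte-for-byte with the originals.

    Returns the relative paths that are missing or differ; an empty list means
    every sampled file came back intact.
    """
    files = sorted(p.relative_to(source) for p in source.rglob("*") if p.is_file())
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))

    mismatches = []
    for rel in sample:
        candidate = restored / rel
        # shallow=False forces a content comparison instead of trusting size and mtime.
        if not candidate.is_file() or not filecmp.cmp(source / rel, candidate, shallow=False):
            mismatches.append(str(rel))
    return sorted(mismatches)
```

A clean sample does not prove the whole tape is readable; it only raises confidence, which is the trade-off the comment describes.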

  • by dbIII ( 701233 ) on Wednesday March 03, 2010 @11:35PM (#31354152)
    If the setup is raw files on tape (the stuff I've read in from reels recorded in the 1980s) or tar (stuff from the 1990s) it's not a big deal at all. It's only two-bit closed source single developer efforts on MSDOS that are difficult for anyone to read even if they have the hardware.
    With recent backup systems like AMANDA you can just dump a file with dd or similar and the instructions on how to deal with the data are there in the header in ASCII! It couldn't possibly be easier.
    Also the "proper environment" is a lot easier to achieve than with storing hard disks; putting them in boxes in a warehouse without a leaky roof is most of the way there. Those tapes from the 1980s I'm talking about were exposed to 30C temperatures and high humidity every year, which is not ideal. Theoretically they were spare copies that were only made for transport and only had worth after the originals and copies were mistakenly thrown away by the clients.
    The "people" problem is very easy to deal with. You make a lot of people responsible for backups of the department they are in, and make it VERY easy for them to do the backups. Somebody in small department X knows to put a tape in a few times a week, and so on, so you get a lot of people that know for certain that there are backups and how old they are. When the inevitable happens and things get lost or deleted there is no longer the panic of tracking down the IT guy and asking lots of questions whether something is lost or not. Instead of spreading panic you just get mild annoyance that it will take so long to get it back.
    Also when one person leaves there are still many that know the system.
    Of course using rsync to a machine elsewhere to get a copy of everything is useful, but that is not a backup, it's a copy, so it shouldn't be the only option. With things like that you eventually purge old deleted files or write over old versions of files, while the tape from two years ago has what was on there two years ago.
  • by Bigjeff5 ( 1143585 ) on Thursday March 04, 2010 @12:48PM (#31359292)

    LTO-4 drives can be had for under $2k if you shop carefully, and a 1.6TB tape can be had for between $50 and $80.

    Cheap, simple, and far more reliable long term than hard drives. They also support Write Once Read Many, which is a regulatory requirement for some industries. Getting WORM on a hard disk would be a really neat trick, particularly the part about guaranteeing that it cannot be written to more than once.

    A better way to look at it is that disk is best for quick backup and quick recovery from failure (which is very important for enterprise environments). Tape is for archiving and as a last protection against catastrophic failure.
