Building a Massive Single Volume Storage Solution?
An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with gigabit interfaces, where each node has about 1-2TB of disk space and each node is based on a low-power-consumption architecture. The next issue to tackle is finding a file system that can span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical.
I would like to know if any Slashdot readers have experience building out such a solution. Any help/idea(s) would be greatly appreciated!"
gmail (Score:4, Funny)
Er... be careful (Score:2, Informative)
Re:gmail (Score:2, Interesting)
I wonder if tinyurl can handle 25TB...
Re:gmail (Score:5, Funny)
The first, and presumably the reason this was posted to
Imagine a Beowulf cluster...
Stuart
The SCIENTIFIC Answer (Score:3, Funny)
GFS? (Score:5, Informative)
Oracle, also (Score:2)
Re:Oracle, also (Score:2)
Re:Oracle, also (Score:3, Insightful)
I second that.
Starting at 25TB and scaling to 1PB? And you want it cheap? If it was cheap to do that sort of thing, we'd all be lining up to get one of our own(*).
Seriously, though, you don't really specify how cheap you are expecting to get it for. What are your expectations, and just how far over-budget are the options you've looked at already? Do you really need 25TB/1PB in one volume, or could it be achieved by splitting it into smaller chunks and wor
Re:Oracle, also (Score:4, Funny)
I know. They don't know I know, but I do. It's data gathered by the black helicopters, by Echelon, by Carnivore, by our very own printers, by RFID, about every movement of every single one of us... *They* do it. They.
Re:Oracle, also (Score:3, Insightful)
If you don't want to participate, don't. Stop stuffing the threads with posts about how lame everyone's questions, knowledge and motivations are.
I'm actually interested in what people have thought about this very topic, AND I'm not a petabyte database expert. So it's news to me. And probably is to you as well.
Re:GFS? (Score:3, Informative)
Re:GFS? (Score:3, Informative)
Regards,
Steve
Controllers! (Score:3, Informative)
If you're not doing any processing on this, a good CPU should be able to handle the load.
Re:Controllers! (Score:4, Insightful)
Apple Xserve? (Score:3, Informative)
Re:Apple Xserve? (Score:4, Informative)
Scaling to petabytes means spanning storage across multiple systems.
Re:Apple Xserve? (Score:4, Informative)
Re:Apple Xserve? (Score:3, Insightful)
Only on fucking Slashdot.
Re:Apple Xserve? (Score:5, Interesting)
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Xsan has volume size limits (Score:3, Informative)
Short research indicates this was a limitation in 10.3, but I haven't found anything confirming or denying that 10.4 still has it.
Note that we've been looking into large amounts of Xsan storage here, but our requirements are a bit different. You can't hook >600 nodes up to the storage via fibre. Our problem is scaling out the NFS servers to be able to push all this data around.
Re:Apple Xserve? (Score:4, Informative)
I do remember some college building a nearline backup storage system using 1U servers, each with 2 or 3 RAID cards connected to something like 12 drives per machine in homemade brackets. It was hardly ideal, but it did work. Anybody remember where that was?
How about a PetaBox? (Score:5, Interesting)
The folks at the Internet Archive [archive.org] have already done the hard work of figuring out how to create a petabyte storage system [archive.org] using commodity hardware. The system works so well they started a company to sell PetaBoxes [capricorn-tech.com] to others. Why reinvent the wheel?
Re:How about a PetaBox? (Score:3, Informative)
I will add that the Archive has particular design and performance goals, namely:
- keep the cost / GB as low as possible
- keep cooling and power requirements low
- use the filesystem and bundle objects into large chunks (~100MB ARC files, last I checked)
- assume streaming writes affecting an edge of the system -- previously written data isn't modified
- assume random reads
- read latency is less important than cost / GB
I worked on the Archive ~5 years ago, and these are based on my unde
Andrew File System (Score:5, Informative)
Re: (Score:3, Informative)
Re:Andrew File System (Score:3, Informative)
Re:Andrew File System (Score:2)
Subject: 1.02 Who supplies AFS?
Transarc Corporation
The Gulf Tower, 707 Grant Street, Pittsburgh
phone: +1 (412) 338-4400
fax: +1 (412) 338-4404
Re:Andrew File System (Score:2, Informative)
AFS Rocks- Now stop (Score:5, Insightful)
Having said all this- If you are still intent on finding a good file system then use AFS. It's probably your best free solution. If you want to sleep at night call EMC.
-sirket
PetaBox (Score:4, Informative)
Re:PetaBox (Score:5, Funny)
Re:PetaBox (Score:4, Informative)
MogileFS from livejournal (Score:3, Informative)
http://www.danga.com/mogilefs/ [danga.com]
It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.
Go the Easy Route (Score:4, Funny)
Petabox (Score:2, Insightful)
http://www.archive.org/web/petabox.php [archive.org]
There is now a company that seems to make the same design:
http://www.capricorn-tech.com/products.html [capricorn-tech.com]
I don't know what FS they use, but apparently it is redundant.
Re:Petabox (Score:5, Insightful)
Re:Petabox (Score:3, Interesting)
Re:Petabox (Score:4, Insightful)
Interesting to think about. My brain probably holds about a petabyte of memories and it uses 20-60 watts. Mostly from sugar.
GPFS from IBM (Score:5, Interesting)
http://www-03.ibm.com/servers/eserver/clusters/so
Re:GPFS from IBM (Score:3, Interesting)
Re:GPFS from IBM (Score:3, Insightful)
From what I've heard, definitely give GFS a thorough shakedown before you decide to implement it, I'
Why? (Score:2, Insightful)
Just because you are starting at 25TB doesn't mean you aren't building a 1PB solution.
You also need to figure out what kind of bandwidth you need. It's very seldom that people have 1PB of data that is accessed by one person occasionally. If some sort of USB or 1394 connection will work, you are much better off than if you require InfiniBand.
Like many "ask Slashdot" questions this is the last place you should be l
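To put rough numbers on the bandwidth question, here's a back-of-the-envelope calculation (a Python sketch; link speeds are nominal and assume ideal sustained throughput with no protocol overhead):

```python
# Time to move a given amount of data over various links,
# assuming ideal sustained throughput and no protocol overhead.

def transfer_days(total_bytes, bits_per_sec):
    """Days needed to move total_bytes over a link running at bits_per_sec."""
    return total_bytes * 8 / bits_per_sec / 86400

PB = 10**15  # one petabyte, decimal
print(f"1 PB over USB 2.0 (480 Mb/s): {transfer_days(PB, 480e6):.0f} days")
print(f"1 PB over GigE (1 Gb/s):      {transfer_days(PB, 1e9):.0f} days")
print(f"1 PB over 10 GigE:            {transfer_days(PB, 10e9):.1f} days")
```

Even over gigabit, a full petabyte takes roughly three months to move, which is why the access pattern matters more than the raw capacity.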
Google Releases OSS? (Score:2)
Since when has Google released any open source software?
Re:Google Releases OSS? (Score:2, Informative)
Scale (Score:3, Interesting)
Clustering the disks with iSCSI or ATAoE is trivial - you can do that very easily, but the filesystem to run on top of it is where you will have problems.
PVFS - no redundancy; lose one node, lose them all.
GFS - does not scale well to those sizes or a large number of nodes; lots of hassle with the DLM.
GoogleFS - essentially write-once, geared to huge (50GB-class) files rather than small ones; little or no locking.
xFS - way too easy to lose your data.
It seems that you only have one option:
Lustre - VERY Expensive - lots of hassle with meta-data servers and lock servers.
Go with a company to take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.
Re:Scale (Score:4, Insightful)
Here's a couple to look at (Score:2, Informative)
MogileFS at http://www.danga.com/mogilefs/ [danga.com]
Wow (Score:5, Funny)
That's over 3 million hours of
Re:Wow (Score:2)
Re:Wow (Score:5, Funny)
Re:Wow (Score:4, Funny)
Come to think of it, so do I.
Data redundancy REQUIRED (Score:5, Informative)
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means the average disk is expected to have a failure about every 57 years. Sounds good, right? Now suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
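The arithmetic above is easy to sketch (a simplified Python model assuming failures are independent and spread evenly over time, which, as the comment notes, makes these numbers optimistic):

```python
# Average interval between failures in a pool of disks, assuming each
# disk independently fails once per MTBF on average.

def hours_between_failures(mtbf_hours, n_disks):
    """With n_disks running, the pool averages one failure per mtbf/n hours."""
    return mtbf_hours / n_disks

MTBF = 500_000  # vendor-rated hours per disk
for n in (1, 100, 1000):
    h = hours_between_failures(MTBF, n)
    print(f"{n:4d} disks: one failure every {h:10,.0f} hours (~{h / (24*7):.1f} weeks)")
```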
Re:Data redundancy REQUIRED (Score:4, Insightful)
Re:Data redundancy REQUIRED (Score:3, Informative)
For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve [wikipedia.org]. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.
The fa
Good point, bad data (Score:3, Insightful)
Not a sound assumption. Things don't fail uniformly over time. Suppose 70 babies are born with a life expectancy of 70 years. Is one of them guaranteed to die every year for the next 70 years? Obviously not. If they avoid some joint disaster (like they all take a trip on the Titanic), most of them will die within a decade or so of the 70-year mark.
Same with di
That's not MTBF, this is.. (Score:5, Informative)
You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices have a duty cycle. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then on average you should see a drive fail before its warranty is up about once per 500,000 hours of operation. This is why it is important to look not only at the MTBF but also at the warranty period.
As a side note: in theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of correlated drive failures. Additionally, you may want a standard period of time for drive replacement so as to schedule your downtime, as opposed to it all being unexpected.
Re:Data redundancy REQUIRED (Score:3, Interesting)
Oooo... (Score:2)
NFS on Reiser4 on RAID-5 on AoE (multipath) on LVM on RAID-5?
What kind of availability do you need? Does all data need to be up all the time (like a bank/telco), or most of the data need to be up all the time (like google), or all the data need to be up most of the time (like a movie studio)?
I just have to ask... (Score:5, Informative)
Re:I just have to ask... (Score:4, Funny)
With a network bootable bios, the nodes could just be plugged in and install an image off a server, then customize it based on their MAC.
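A minimal sketch of that boot flow, assuming dnsmasq as the DHCP/TFTP server and PXELINUX as the bootloader (both choices are assumptions; any PXE stack works the same way):

```
# /etc/dnsmasq.conf -- serve PXE boot info alongside DHCP leases
dhcp-range=192.168.0.50,192.168.0.250,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp
```

PXELINUX then looks for a config file named after the client's MAC address, e.g. /srv/tftp/pxelinux.cfg/01-00-16-3e-aa-bb-cc, falling back to pxelinux.cfg/default, so per-node customization can key off the MAC exactly as described.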
Yup, time to pick up the phone. (Score:5, Insightful)
I know people get tired of hearing "call IBM" as a solution to these questions, but in general if you have some massive IT infrastructure development task and are so lost on it that you're asking the
It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you would want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.
I built a 1.7 TB array for about $2000 (Score:3, Insightful)
Re:I just have to ask... (Score:3, Funny)
15 zeros are no bytes at all (Score:2)
15 zeros is no bytes at all... :)
Stress the importance .... (Score:4, Insightful)
Unfortunately, I should think needing a solution which can scale up to a Petabyte (!) of disk-space and a "fairly small" budget are at odds with one another.
Maybe you need to make a stronger case to someone that if such a mammoth storage system is required, it needs to be a higher priority item with better funding?
Heck, the loss of such large volumes of data would be devastating (I assume it's not your pr0n collection) to any organization. Building it on the cheap and having no backup(*)/redundancy systems would just be waiting to lose the whole thing.
(*) I truly have no idea how one backs up a petabyte
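For a sense of scale, the backup problem can at least be estimated (a Python sketch assuming LTO-3-class tape: ~400 GB native per cartridge and ~80 MB/s native drive speed; both figures are assumptions):

```python
# Rough size of the "back up a petabyte" problem on tape.
PB = 10**15
TAPE_CAPACITY = 400e9   # bytes per cartridge, uncompressed (LTO-3 class)
DRIVE_SPEED = 80e6      # bytes/sec, uncompressed

cartridges = PB / TAPE_CAPACITY
days_one_drive = PB / DRIVE_SPEED / 86400
print(f"cartridges needed: {cartridges:,.0f}")
print(f"single drive, running flat out: {days_one_drive:.0f} days")
print(f"16 drives in parallel: {days_one_drive / 16:.1f} days")
```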
Re:Stress the importance .... (Score:2)
For the most part (Score:5, Insightful)
I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks, you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day and not require human intervention to stay up and running (unless you can get humans for cheap too).
Re:For the most part (Score:3, Informative)
3Ware 12x S-ATA RAID5 card
12x 300GB RAID5
Linux machine
iSCSI target software - share out 1 LUN.
Duplicate this machine until you have enough storage.
One big box with a number of trunked/bonded GigE ports
iSCSI initiator software - mount all the LUNs.
Software RAID them together - striping if you aren't too worried, RAID5 if you are.
tada - big storage
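The recipe above can be made concrete. A hedged sketch assuming the open-iscsi initiator and Linux md software RAID; all IQNs, addresses, and device names are placeholders:

```shell
# On the big head box: discover and log in to each storage node's LUN.
iscsiadm -m discovery -t sendtargets -p 192.168.1.101
iscsiadm -m node -T iqn.2005-10.example:store1 -p 192.168.1.101 --login
# ...repeat per node; each LUN appears as a local disk (/dev/sdb, /dev/sdc, ...)

# Software-RAID the LUNs together: striping (RAID0) if you aren't worried
# about losing a node, RAID5 if you are.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# One big volume on top.
mkfs.xfs /dev/md0
mount /dev/md0 /bigstore
```

Note that a RAID5 rebuild here streams every surviving LUN over the network, so the bonded GigE ports are doing real work during recovery.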
Re:For the most part (Score:3, Informative)
Do It Right (Score:5, Insightful)
Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.
Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.
Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.
Hell, BUY it from EMC! (Score:5, Interesting)
Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?
Re:Hell, BUY it from EMC! (Score:4, Funny)
IBRIX (Score:4, Informative)
Re:IBRIX (Score:3, Informative)
storagetek... (Score:2)
Don't forget Coda.. (Score:2)
what about JFS and ATAoE? (Score:2)
This article in Linux Journal ( http://www.linuxjournal.com/article/8149 [linuxjournal.com] ) talks about doing just that. The hardware costs ring up and don't scale as you get into your capacity ranges unless you can get a deal buying bulk HDDs - something like $10K per 7.5 terabytes
Just wait 5 years ... (Score:4, Interesting)
Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.
A single raid50 of them will then give you your petabyte of storage, for around $6,000.
No Redundancy? (Score:5, Insightful)
Re:No Redundancy? (Score:5, Funny)
Hollywood script archive.
Re:No Redundancy? (Score:3, Funny)
iSCSI storage / san (Score:3, Informative)
http://www.equallogic.com/ [equallogic.com] They make nice SATA-RAID based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume expansion, hotswap, redundant controllers, redundant fans, etc).
http://www.equallogic.com/pages/products_PS100E.h
14 250G sata disks, 3U, 3.5 TB of raw storage.
http://www.equallogic.com/pages/products_PS300E.h
14 500G sata disks, 3U, 7 TB of raw storage.
http://www.equallogic.com/pages/products_PS2400E.
56+ TB
Looks good. I have not yet used them myself
Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/pro
16 sata disks, review:
http://www.infoworld.com/MPC_DataFrame_420/produc
This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp
iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageF
There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ [sourceforge.net] for building linux-based iSCSI target/SAN.
If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!
What we've done (30TB so far) (Score:5, Informative)
Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge awhile back) so it helps curb disk failures of the storage nodes themselves. We avoid disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.
Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot spare.
On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, we have a secondary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.
We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.
This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
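For anyone curious about the mechanics, the NBD-plus-software-RAID layering might look roughly like this (a sketch; hosts, ports, and device names are placeholders, and the DRBD mirror and LVM growth layers are omitted):

```shell
# On each storage node: export the local array over NBD (classic
# nbd-server syntax: port, then device).
nbd-server 2000 /dev/md0

# On the 64-bit head unit: attach each node as a network block device...
nbd-client node1 2000 /dev/nbd0
nbd-client node2 2000 /dev/nbd1
nbd-client node3 2000 /dev/nbd2
nbd-client node4 2000 /dev/nbd3   # hot spare

# ...then run software RAID5 across whole nodes, spare included, and
# put XFS on top.
mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
      /dev/nbd0 /dev/nbd1 /dev/nbd2 /dev/nbd3
mkfs.xfs /dev/md0
```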
Re:What we've done (30TB so far) (Score:3, Informative)
Ask Slashdot Formula: (Score:5, Funny)
I have been tasked with (insert very difficult, very important job). This is very important to my company. I have (insert number much lower than it should be) dollars to do this. I do not want to use (insert company name specializing in this exact thing) because management thinks they are too expensive. I think I can do this (insert better/faster/cheaper/...) than said company, even though they have vastly more experience and have invested much more time and research than I have. My continued and future employment probably rests on this project. Please advise.
Re:Ask Slashdot Formula: (Score:3, Funny)
Use Linux.
Regards,
Slashdot
Fibre Channel 30TB in 7 RU (Score:3)
RAID, Fibre Channel, 42 ATA drives per 7 RU chassis. Throw in 500GB drives and 1 parity drive for every 6 data drives and you have ~30 TB per chassis.
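Checking the parity arithmetic (a quick Python sketch; note that with 500GB drives and 6+1 groups the usable figure works out nearer 18 TB, so the ~30 TB presumably reflects larger drives or raw capacity):

```python
# Usable capacity for the layout described: 42 drives per chassis,
# arranged as RAID groups of 6 data + 1 parity drives.

def usable_tb(n_drives, drive_gb, data_per_group, parity_per_group):
    group_size = data_per_group + parity_per_group
    groups = n_drives // group_size
    return groups * data_per_group * drive_gb / 1000

print(usable_tb(42, 500, 6, 1))  # usable TB from 21 TB raw
```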
Can the company funding this really afford this? (Score:3, Insightful)
If you've been asked to do something like this by a company that can afford to buy one of the commercial off-the-shelf high-volume storage solutions, then I honestly can't imagine that any solution they try to knock up themselves will actually work (as I'm not aware of any free software solution that's currently up to the task).
If your company doesn't have / can't raise the capital to buy a commercial system for a project of this scale, I can't possibly see how they could afford to screw up on this and go with an untested idea that could very well end up being a huge money sink they wouldn't be able to dig themselves out of - one that could doom the entire company and all its investors given the cost it could run to.
And of course, for such a big project, they should hire people who would already know how to do something like this (which is not a dig, it's just crazy to skimp on staff when you have an ambitious project which requires large amounts of capital investment).
That said...
If I were going to do large-scale storage on the cheap, depending on the design of the software and the specific requirements (particularly if I was also developing the software we were going to use, or was able to set feature requirements and/or make the modifications myself), I would build the largest standard file shares I could with SATA disks (using commodity hardware, hot swappable, running Linux, with front-loading drive bays).
The specifics of handling the load balancing (via multiple front ends, multiple mount points, pre-determined hashing to balance things out, proxies/caches, hooks in the file system calls, hooks in the application to talk to a controller, etc.) depend entirely on the sort of application, however.
It's definitely likely to be far easier (and more cost effective) to have the software take care of knowing where the data is stored, rather than trying to build a single really large file share. I know at least one very well-known large company who've gone down this route (with essentially elaborately hacked-up versions of common OS software).
The downside is you have to support whatever hack you come up with to do this, but that shouldn't be an enormous amount of work (and you can probably afford to hire someone to support it full time for significantly less than the cost of a support contract for a commercial solution).
Why one volume? (Score:3, Informative)
What's making your question hard is the "make it like one volume" restriction. The problem is trivial otherwise. If I were you, I'd be asking whoever tasked you with this to *really* justify on a technical level why they need it to appear as a single volume, since that makes all the possible solutions slower, more costly, and more difficult to maintain.
Chances are extremely high that what they really want is a "/bigfatfs" directory visible everywhere, in which they will store many discrete items in subdirectories by project or by dataset or by user. You should convince them to let you build it from commodity machines serving a few TB each, mounted as separate filesystems underneath that umbrella directory. Then your only challenge is coherent management of the namespace of mountpoints for consistency across the environment (there are longstanding tools for this, like autofs + (ldap, nis, nis+, whatever)), and administration/assignment of new space requests within your cluster (which could be scripted to automatically allocate from the least-used volume that can satisfy the request, where "least used" could mean space or activity hotness, based on the metrics you're logging).
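The umbrella-directory approach might be sketched with autofs maps like these (map names, mount options, and hosts are placeholder assumptions; the same maps could be served from LDAP or NIS instead of flat files):

```
# /etc/auto.master -- one umbrella directory for everything
/bigfatfs  /etc/auto.bigfatfs

# /etc/auto.bigfatfs -- each key is a project/dataset, pointing at
# whichever commodity server actually holds it; every client sees the
# same /bigfatfs/<key> namespace.
projectA  -rw,hard,intr  node01:/export/vol0
projectB  -rw,hard,intr  node02:/export/vol0
userdirs  -rw,hard,intr  node03:/export/home
```

New space requests then reduce to adding a map entry pointing at the least-used server, which is easy to script.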
GPFS - performance and stability (Score:3)
Take it from someone who's messed with nearly every storage product on the market, if you want something that works fairly simply, performs at approaching spindle speed ( meaning the file system is not the bottleneck - if you have 10 GB/sec. storage bandwidth, expect to see near that with proper tuning ), is very stable ( compared to most storage solutions on the market - bear in mind that most storage products are aimed at large-block sequential I/O, and fall down - either performance-wise or stability-wise - when you throw other I/O patterns or combinations of patterns at them ), and is portable across nearly any Linux distribution ( with varying amounts of difficulty, I have had to hack their kernel patches before when using an unsupported kernel ), GPFS is the one. Of course, the problem there is I believe it's pretty expensive to run on non-IBM hardware. But if you have IBM hardware ( even if it's not the hardware you're running the FS on ) or some sort of in with IBM, they'll let you have it for a song and a dance.
Having said that, Lustre [lustre.org] is getting there. I'd say it's the equal of GPFS ( as a parallel filesystem - I believe it is even more flexible as a distributed filesystem ) in performance, probably scales roughly the same ( haven't played with it in a large installation, so can't tell you beyond looking at the architecture ), and is going to be the biggest player on the market in the future. It's also free ( IIRC Cluster File Systems sells support, but the code is freely available ) and not tied to IBM and whatnot, like GPFS is. Of course, HP has a big connection with Lustre, but not ownership thereof.
Those are really the only two that I would consider for a serious high-performance storage project. If you don't need great performance, that's when you can start looking at things like GFS, ADIC's StorNext, Ibrix, etc.
Oh, Gautham Sastri ( of former Maximum Throughput fame ) has a newer company called Terrascale, I recall them putting on a presentation at the 2003 or 2004 ( can't remember ) Supercomputing conference ( SC2005 is coming up in a few weeks, yeah!!! ) which showed pretty good performance ( relative to the small system they were using ), not sure how they're coming along...
Anyways, good luck...and don't forget to use Iozone [iozone.org] to benchmark the damn thing!
I'm no storage expert but... (Score:3, Funny)
Who let the PHBs out? (Score:3, Insightful)
In short, your stated objective smells. Not enough data.
WHAT is going to be done (database, file storage?)
HOW will it be accessed? (One large file, many smaller files)
WHEN will it be accessed? (During business hours, distributed over the day?)
AVERAGE TRANSFERS - will the whole schmear come over, selected parts?
SECURITY a concern? (Sensitive data, protected network)
BACKUP - a petabyte of tape storage is expensive, and takes quite a while to do.
POWER - do you have enough?
COOLING - ditto
SPACE - ditto - my $DAYJOB computer room is about 3000 sq ft... and we're going to be using all of it within 12 months.
That said, if you go with big drives over a lot of systems, use lots-o-NICs to keep the NIC from being the bottleneck. A single gig connection sounds fine, but wait until you have hundreds of people going for files at once. It'll get swamped. And swear off V-SAN from Cisco. Not worth it at all.
PetaBox? (Score:3, Informative)
terrascale is cool... (Score:3, Interesting)
Petabox.... (Score:2)
Re:Petabox.... (Score:2)
Does not appear to be a single volume..
That depends entirely on what software you run on top of the hardware, doesn't it.
Re:Petabox.... (Score:3, Funny)
Re:Go Virtual (Score:3, Interesting)
Re:call EMC. i am sure their clarion line will han (Score:3, Insightful)
and rebooting.
EMC is obsolete. Their customers just haven't discovered it yet.