Building a Massive Single Volume Storage Solution?
An anonymous reader asks: "I've been asked to build a massive storage solution that scales from an initial 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies I've been scoping out are iSCSI, AoE, and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with gigabit interfaces, where each node has about 1-2TB of disk space and is based on a low-power-consumption architecture. The next issue to tackle is finding a file system that can span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, though it will eventually have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products, such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical.
I would like to know if any Slashdot readers have experience building out such a solution. Any help/ideas would be greatly appreciated!"
GPFS from IBM (Score:5, Interesting)
http://www-03.ibm.com/servers/eserver/clusters/so
Re:Apple Xserve? (Score:5, Interesting)
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Scale (Score:3, Interesting)
Clustering the disks with iSCSI or ATAoE is the trivial part - you can do that very easily (there's a sketch at the end of this comment) - but the filesystem to run on top of it is where you will have problems.
PVFS - no redundancy: lose one node, lose them all.
GFS - does not scale well to those sizes or that many nodes; lots of hassle with the DLM.
GoogleFS - essentially write-once; no small files (where "small" means 50GB); little or no locking.
xFS - way too easy to lose your data.
That seems to leave you only one option:
Lustre - VERY expensive; lots of hassle with metadata servers and lock servers.
Go with a company that will take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.
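For reference, the easy part looks something like this with ATAoE (a minimal sketch, assuming the vblade and aoetools packages; the shelf/slot numbers, interface, and device names are all illustrative):

    # On each storage node: export a local disk over AoE
    vbladed 0 1 eth0 /dev/sdb    # node 1 exports /dev/sdb as e0.1
    vbladed 0 2 eth0 /dev/sdb    # node 2 exports as e0.2, and so on

    # On the head node: load the AoE driver and discover the exports
    modprobe aoe
    aoe-discover
    ls /dev/etherd/              # e0.1, e0.2, ... appear as block devices

An iSCSI target gets you to the same place. Once the exports show up as local block devices, though, you are right back at the hard question of what filesystem to put on top of them.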
Re:Go Virtual (Score:3, Interesting)
Re:gmail (Score:2, Interesting)
I wonder if tinyurl can handle 25TB...
Been there done that (Score:2, Interesting)
Just wait 5 years ... (Score:4, Interesting)
Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.
A single raid50 of them will then give you your petabyte of storage, for around $6,000.
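For what it's worth, the numbers work out if you assume the RAID 50 is built from eight 6-drive RAID5 legs (my assumption; the post doesn't say):

    echo $(( 48 * 125 ))        # $6000 for 48 hypothetical 25TB drives
    echo $(( (48 - 8) * 25 ))   # 8 legs lose 8 parity drives: 1000TB usable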
Re:Er... be careful (Score:1, Interesting)
Hell, BUY it from EMC! (Score:5, Interesting)
Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?
Re:Petabox (Score:3, Interesting)
How about a PetaBox? (Score:5, Interesting)
The folks at the Internet Archive [archive.org] have already done the hard work of figuring out how to create a petabyte storage system [archive.org] using commodity hardware. The system works so well they started a company to sell PetaBoxes [capricorn-tech.com] to others. Why reinvent the wheel?
Here's my solution (Score:2, Interesting)
My solution was to use the nodes' hard disks (each one has a 120GB Ultra320 10,000rpm disk) combined into a network RAID1+0 array (we use gigabit ethernet) to get more space. With that approach you can get as much redundancy as you need.
Here's what I did:
1. After installing the network block device server (nbd-server) on each of the nodes, I created a 100GB partition on each disk and exported them directly in raw mode;
2. On the master node (using nbd-client) I created a block device for each node's partition;
3. After that I installed the Linux software RAID tools (mdadm) and created a small RAID1 array for each pair of nodes. I ended up with 14 100GB network RAID1 arrays, each with its own md device;
4. I created a big 1.4TB (14 * 100GB) RAID0 array from the 14 RAID1 arrays and attached it to the master node;
5. The final step was to create a large ReiserFS filesystem on the big array.
You have to pay special attention to the array shutdown and startup procedures. I wrote my own scripts to take care of that for me; a sketch of the basic commands is below.
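Roughly, the setup looks like this (a minimal sketch, assuming the older port-based nbd-server syntax; host names, ports, and partition names are illustrative):

    # On each storage node: export the 100GB partition in raw mode
    nbd-server 2000 /dev/sda2

    # On the master node: one nbd device per storage node
    modprobe nbd
    nbd-client node01 2000 /dev/nbd0
    nbd-client node02 2000 /dev/nbd1
    # ... and so on up to /dev/nbd27 for 28 nodes

    # Mirror each pair of nodes (14 RAID1 arrays: md0..md13)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nbd0 /dev/nbd1
    # ... repeat for md1 through md13 ...

    # Stripe the 14 mirrors into one 1.4TB RAID0 array
    mdadm --create /dev/md14 --level=0 --raid-devices=14 /dev/md{0..13}

    # Filesystem on top
    mkreiserfs /dev/md14
    mount /dev/md14 /storage

The startup scripts just have to do this in the right order - nbd connections first, then the RAID1 arrays, then the RAID0, then the mount - and the reverse on shutdown.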
Our array may seem small compared to what you are looking for, but I am pretty sure the approach will scale to arrays much larger than ours.
Good luck.
Small companies and short-sighted management (Score:1, Interesting)
1 in 5 boxes arrived DOA. The ethernet cards didn't work. The cables to the hard drives weren't long enough. The hot-pluggable disk trays were flaky. The BIOS had to be flashed. The proprietary hard drive controller drivers sucked, so we had to buy new controllers. 1 in 10 disk drives was DOA.
Three months later, and $40K poorer, we had a system that couldn't pass 24 hours of stress testing without failing in some wacky way. For the $40K plus my salary time, we could have bought a usable system from IBM or HP or whomever, and it would have worked. Engineering big systems is non-trivial.
Re:GPFS from IBM (Score:3, Interesting)
Re:Data redundancy REQUIRED (Score:3, Interesting)
4.4TB on RAID5 per node at $0.32 per GB (Score:2, Interesting)
1) ~$100 - a motherboard with onboard SATA,
2) ~$40 - an additional PCI SATA controller with 4 ports,
3) ~$100 - the cheapest AMD64 CPU you can buy,
4) ~$150 - a Coolermaster Stacker case,
5) ~$1020 - 12 WD 400GB drives,
6) $0 - your favorite Linux distribution.
TOTAL: ~$1410
Each drive eats about 15W, so figure around 180W for the disks plus another 60W for the motherboard/CPU - comparable power consumption to an efficient SCSI solution, at a small fraction of the cost.
Personally, I created a RAID1 array of two 37GB 10krpm Raptor drives for the OS and critical stuff, and two RAID5 arrays of five 300GB drives each for even better cost per GB while doubling the redundancy. But that only gives you 2.4TB per node.
The configuration can be done with evms or lvm2. Rebuilding on the fly and replacing drives on the fly should work just fine in theory (I've never tried it on the fly), but if not, a scheduled five-minute downtime is fine too. My previous 0.5TB RAID5 has been up for more than 3 years so far, and a hard drive failure just required an mdadm --add of the replacement drive (see the sketch below).
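Spelled out, creating and repairing such an array looks roughly like this (a sketch; the device names and the ext3 choice are illustrative, not part of the original setup):

    # Create the 12-drive RAID5 array
    mdadm --create /dev/md0 --level=5 --raid-devices=12 /dev/sd[a-l]1

    mkfs.ext3 /dev/md0
    mount /dev/md0 /storage

    # After a drive failure: kick out the dead disk, add the replacement,
    # and md rebuilds in the background
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    mdadm /dev/md0 --add /dev/sdc1
    cat /proc/mdstat             # watch the rebuild progress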
Increasing the array size gets tricky (although it is an available option), and fiddling with various distributed network filesystems doesn't really seem worth it to me personally, but openmosix and other clustering solutions do offer distributed filesystems.
Just remember, the SATA architecture is nice - SCSI isn't really a requirement for this kind of solution.
Re:Petabox (Score:2, Interesting)
terrascale is cool... (Score:3, Interesting)
Petabox from Capricorn (Score:2, Interesting)
Capricorn Technologies Petabox
http://www.capricorn-tech.com/ [capricorn-tech.com]
Linux Devices Review
http://linuxdevices.com/news/NS2659179152.html [linuxdevices.com]