Building a Massive Single Volume Storage Solution?
An anonymous reader asks: "I've been asked to build a massive storage solution that scales from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget is fairly small. Some of the technologies I've been scoping out are iSCSI, AoE, and plain clustered/grid computers with JBOD (just a bunch of disks). Personally, I'm more inclined toward a grid cluster with a gigabit interface, where each node has about 1-2TB of disk space and is based on a low-power-consumption architecture. The next issue to tackle is finding a file system that can span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products, such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical.
I would like to know whether any Slashdot readers have experience building out such a solution. Any help or ideas would be greatly appreciated!"
GFS? (Score:5, Informative)
Apple Xserve? (Score:3, Informative)
Andrew File System (Score:5, Informative)
PetaBox (Score:4, Informative)
MogileFS from livejournal (Score:3, Informative)
http://www.danga.com/mogilefs/ [danga.com]
It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.
Comment removed (Score:3, Informative)
Re:Apple Xserve? (Score:4, Informative)
Scaling to petabytes means spanning storage across multiple systems.
Re:GFS? (Score:3, Informative)
Here's a couple to look at (Score:2, Informative)
MogileFS at http://www.danga.com/mogilefs/ [danga.com]
Data redundancy REQUIRED (Score:5, Informative)
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. It allows the system to keep working through a given number of failures, and it lets you plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
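To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python, using the 500,000-hour MTBF and 1000-disk figures from above (it makes the same evenly-spread, independent-failures simplification as the paragraph above, nothing more rigorous):

# Back-of-the-envelope: how often does *some* disk fail in a large array?
MTBF_HOURS = 500000    # per-disk mean time between failures (vendor figure)
NUM_DISKS = 1000

# With independent disks, the array as a whole sees failures NUM_DISKS times as often.
hours_between_failures = MTBF_HOURS / NUM_DISKS
print(f"One disk failure roughly every {hours_between_failures:.0f} hours "
      f"(about {hours_between_failures / (24 * 7):.1f} weeks)")

# Expected disk failures per year across the whole array.
failures_per_year = NUM_DISKS * 24 * 365 / MTBF_HOURS
print(f"Expect about {failures_per_year:.1f} disk failures per year")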
I just have to ask... (Score:5, Informative)
Re:Google Releases OSS? (Score:2, Informative)
IBRIX (Score:4, Informative)
Er... be careful (Score:2, Informative)
Re:Andrew File System (Score:2, Informative)
Re:PetaBox (Score:4, Informative)
Re:Apple Xserve? (Score:4, Informative)
Re:IBRIX (Score:3, Informative)
Re:IBRIX (Score:1, Informative)
Re:Data redundancy REQUIRED (Score:3, Informative)
For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve [wikipedia.org]. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.
The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
I think what you are referring to is how multiple observations of a uniformly distributed random variable tend to look. It doesn't have anything to do with fractals, though.
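A quick way to see what a bathtub-shaped hazard does to a large fleet is to simulate it. The sketch below is my own illustration (the mixture weights and Weibull parameters are made up, not taken from the linked article): it combines a Weibull with shape < 1 for infant mortality, an exponential for random failures, and a Weibull with shape > 1 for wear-out, then counts failures per year across 1000 drives.

import random

random.seed(1)
NUM_DRIVES = 1000
YEARS = 6

def lifetime_hours():
    """Draw one drive's lifetime from a crude three-part bathtub model."""
    r = random.random()
    if r < 0.03:
        # Infant mortality: manufacturing flaws; shape < 1 means decreasing hazard.
        return random.weibullvariate(2000, 0.5)
    elif r < 0.90:
        # Random failures during useful life: constant hazard, ~500k-hour MTBF.
        return random.expovariate(1.0 / 500000)
    else:
        # Wear-out failures: shape > 1 means the hazard grows with age.
        return random.weibullvariate(45000, 4.0)

failures_by_year = [0] * YEARS
for _ in range(NUM_DRIVES):
    year = int(lifetime_hours() // 8760)
    if year < YEARS:
        failures_by_year[year] += 1

for year, count in enumerate(failures_by_year, start=1):
    print(f"year {year}: {count} failed drives")

With these made-up numbers the output shows elevated failures in year one, a quieter middle, and a ramp back up as the wear-out term kicks in.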
Re:For the most part (Score:3, Informative)
eWare 12x S-ATA RAID5 card
12x 300GB RAID5
Linux machine
iSCSI target software - share out 1 LUN.
Duplicate this machine until you have enough storage.
One big box with a number of trunked/bonded GigE ports
iSCSI initiator software - mount all the LUNs.
Software RAID them together - striping if you aren't too worried, RAID5 if you are.
Ta-da - big storage, one volume, all accessible from one machine.
The maintenance will suck, though.
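A rough sketch of what the big-box (initiator) side of that could look like on Linux, assuming open-iscsi and mdadm; the portal addresses and device names below are placeholders for illustration, and in practice you would map the devices via /dev/disk/by-path rather than hard-coding /dev/sdX:

import subprocess

# Placeholder portal addresses, one per storage node exporting a single LUN.
PORTALS = ["192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.1.14"]

def sh(*args):
    """Run a command, echoing it first, and stop if it fails."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Discover and log in to each storage node's iSCSI target.
for portal in PORTALS:
    sh("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal)
    sh("iscsiadm", "-m", "node", "-p", portal, "--login")

# Each LUN now appears as a local block device; placeholder names here.
devices = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]

# Stripe (RAID0) if you aren't worried about losing a node, RAID5 if you are.
sh("mdadm", "--create", "/dev/md0", "--level=5",
   f"--raid-devices={len(devices)}", *devices)

# One filesystem on top (XFS picked arbitrarily), exported to clients from this box.
sh("mkfs.xfs", "/dev/md0")
sh("mount", "/dev/md0", "/bigstorage")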
Re:Andrew File System (Score:2, Informative)
Red Hat GFS != Google FS (Score:2, Informative)
iSCSI storage / san (Score:3, Informative)
http://www.equallogic.com/ [equallogic.com] They make nice SATA-RAID-based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume expansion, hot swap, redundant controllers, redundant fans, etc.).
http://www.equallogic.com/pages/products_PS100E.h
14 250G sata disks, 3U, 3.5 TB of raw storage.
http://www.equallogic.com/pages/products_PS300E.h
14 500G sata disks, 3U, 7 TB of raw storage.
http://www.equallogic.com/pages/products_PS2400E.
56+ TB
Looks good. I have not yet used them myself.
Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/pro
16 sata disks, review:
http://www.infoworld.com/MPC_DataFrame_420/produc
This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp
iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageF
There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ [sourceforge.net] for building linux-based iSCSI target/SAN.
If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!
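For the roll-your-own target route, most of the work is a config file plus the kernel module. Below is a hedged sketch that generates a minimal ietd.conf for the iSCSI Enterprise Target linked above; the IQN naming and device paths are placeholders, and the exact option spellings are from memory, so check the project's documentation before trusting them:

# Generate a minimal ietd.conf exposing one local RAID device per iSCSI target.
EXPORTS = {
    "iqn.2005-06.com.example:node01.lun0": "/dev/md0",
    "iqn.2005-06.com.example:node02.lun0": "/dev/md1",
}

lines = []
for iqn, device in EXPORTS.items():
    lines.append(f"Target {iqn}")
    # blockio hands I/O straight to the block device; fileio goes through the page cache.
    lines.append(f"    Lun 0 Path={device},Type=blockio")
    lines.append("")

config = "\n".join(lines)
with open("ietd.conf.generated", "w") as f:
    f.write(config)
print(config)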
Re:Apple Xserve? (Score:4, Informative)
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to something like 12 drives per machine in homemade brackets, but it was hardly ideal. It did work, though. Anybody remember where that was?
What we've done (30TB so far) (Score:5, Informative)
Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMdisk-based (something we concocted based on what I read about the DNALounge a while back), which helps curb OS disk failures on the storage nodes themselves. We guard against data-disk failure with RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.
Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk," and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason): all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot spare.
On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location, linked by gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, we have a secondary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US), but that's worth it because they're neato guys.
We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.
This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
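The "use LVM to join the two units together" step is worth spelling out, since it is what keeps the growth non-disruptive. A rough sketch, assuming the second cluster shows up as another software RAID device (/dev/md1, vg_storage, lv_data and /srv/data are placeholder names, not the poster's actual configuration):

import subprocess

# Assumed existing layout: /dev/md0 -> vg_storage -> lv_data, XFS mounted at /srv/data.
# The second cluster's aggregated RAID device (placeholder name):
NEW_DEVICE = "/dev/md1"

for cmd in (
    ["pvcreate", NEW_DEVICE],                          # new RAID set becomes an LVM physical volume
    ["vgextend", "vg_storage", NEW_DEVICE],            # add it to the existing volume group
    ["lvextend", "-l", "+100%FREE", "/dev/vg_storage/lv_data"],  # grow the logical volume into the new space
    ["xfs_growfs", "/srv/data"],                       # XFS grows online, while mounted
):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)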
Proprietary FS, commodity disk enclosures (Score:2, Informative)
The filesystem is going to be the hardest component of this. I know of no open-source FS that could handle it. I'm assuming this is all online storage and there is no desire to nearline it to tape. Ideally, you'd want something that could concatenate multiple LUNs (of RAIDed storage) without having to run through a volume manager. Nothing against volume managers, but it'd be another component to support. Looking at proprietary FSs, you've got CXFS from SGI [sgi.com], which could easily handle the PB requirement and plays nice on Linux. Sun's got QFS [sun.com], which would max out at 1PB and could do the volume management bit easily. Linux support was a little flaky last time I used it, but it's a free download and evaluation; you could go get it [sun.com] right now.
IBM's SAN-FS would also meet the capacity needs and would have the advantage of providing nearline capability, if you're into that. Sun's SAM-FS is basically the QFS product with nearline-to-tape capability. Linux is only supported as a client OS there. Of course, if you buy the mantra that Solaris is 'open-source,' then that might not be an issue.
As for hardware with any of the above solutions, you're going to be looking at multiple RAIDing disk enclosures of some kind. On a budget, probably SATA disks talking to the controller and iSCSI to the host. Fibre Channel to the host would be a little more costly, but might be worth it since iSCSI is only just getting mature enough to be usable in production.
Have you looked at.... (Score:2, Informative)
Re:GFS? (Score:3, Informative)
Regards,
Steve
Why one volume? (Score:3, Informative)
What's making your question hard is the "make it like one volume" restriction. The problem is trivial otherwise. If I were you, I'd be asking whoever tasked you with this to *really* justify on a technical level why they need it to appear as a single volume, since that makes all the possible solutions slower, more costly, and more difficult to maintain.
Chances are extremely high that what they really want is a "/bigfatfs" directory visible everywhere, in which they will store many discrete items in subdirectories by project, by dataset, or by user. You should convince them to let you build it from commodity machines serving a few TB each, mounted as separate filesystems underneath that umbrella directory. Then your only challenges are coherent management of the namespace of mountpoints for consistency across the environment (for which there are longstanding tools, like autofs plus LDAP, NIS, NIS+, whatever), and administration/assignment of new space requests within your cluster. That part can be scripted to allocate automatically from the least-used volume that can satisfy the request, where "least used" could mean free space or could mean activity hotness, based on the metrics you're logging.
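The allocation part really is a short script. A minimal sketch of "put the new project on the emptiest backing volume and expose it under the umbrella directory," using free space as the metric (the mount points and the symlink-under-/bigfatfs convention are made up for illustration):

import os

# Backing filesystems that together make up the umbrella namespace (placeholders).
BACKING_VOLUMES = ["/vol/node01", "/vol/node02", "/vol/node03", "/vol/node04"]
UMBRELLA = "/bigfatfs"

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def allocate(project, needed_bytes):
    """Create the project's directory on the emptiest volume that fits,
    then expose it under the umbrella directory via a symlink."""
    candidates = [v for v in BACKING_VOLUMES if free_bytes(v) >= needed_bytes]
    if not candidates:
        raise RuntimeError("no backing volume can satisfy the request")
    target = max(candidates, key=free_bytes)
    project_dir = os.path.join(target, project)
    os.makedirs(project_dir, exist_ok=True)
    link = os.path.join(UMBRELLA, project)
    if not os.path.lexists(link):
        os.symlink(project_dir, link)
    return project_dir

if __name__ == "__main__":
    print(allocate("project42", 500 * 1024**3))   # ask for 500 GB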
Re:Andrew File System (Score:3, Informative)
LeftHand Networks storage does it (Score:2, Informative)
I have worked with these units and they kick ass. You can do snapshots of entire servers quickly (given that you have the right infrastructure), set thresholds for volumes that can be increased or reduced on the fly, do brick-level restoration of files!!!, etc. And of course, my respect goes to their engineers. I saw them working on one unit because we had a really bad power failure that killed one HD. Man, those guys know their stuff up and down, and I've never seen anybody type commands that complex and that freaking long at that speed! They fixed the damn thing and got 99.99999% back from limbo!
I guess their storage boxes follow the LVM model, which is pretty cool, and the boxes run Linux!!!
Don't take my word for it, go to their website and take a look 'cause I tend to confuse people with my posts rather than pass info efficiently.
Have a good one.
What about MatrixStore? (Score:2, Informative)
Xsan has volume size limits (Score:3, Informative)
Short research indicates this was a limitation in 10.3, but I haven't found anything confirming or denying that 10.4 still has it.
Not that we've been looking into large amounts of Xsan storage here; our requirements are a bit different. You can't hook more than 600 nodes up to the storage via fibre. Our problem is scaling out the NFS servers to be able to push all this data around.
Controllers! (Score:3, Informative)
If you're not doing any processing on this, a good CPU should be able to handle the load.
Re:For the most part (Score:3, Informative)
Or did you mean to specify a gigabit switch in between the clients and the big box?
Also, I agree with your and others' point that administration will not be trivial. Be sure to spec at least six months of your time for writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship replacements based on an automated mail you can send about failed disks, rather than waiting on the turnaround of you pulling the drive and the delivery making a round trip.
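As a starting point for that kind of automation, here is a minimal sketch that scans /proc/mdstat for members marked failed and mails a report. The vendor address, sender, and the idea of mailing the RMA request straight from the script are assumptions for illustration; mdadm's own --monitor mode can handle the alerting side too.

import re
import smtplib
import socket
from email.message import EmailMessage

VENDOR_RMA_ADDRESS = "rma@disk-vendor.example.com"   # placeholder address

def failed_md_members(mdstat_path="/proc/mdstat"):
    """Return (array, member) pairs that mdstat marks as failed with '(F)'."""
    failed = []
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:\s*active", line)
            if not m:
                continue
            array = m.group(1)
            for dev, flag in re.findall(r"(\w+)\[\d+\](\(F\))?", line):
                if flag:
                    failed.append((array, dev))
    return failed

def mail_report(failures):
    msg = EmailMessage()
    msg["Subject"] = f"Failed disks on {socket.gethostname()}"
    msg["From"] = "storage-monitor@example.com"        # placeholder sender
    msg["To"] = VENDOR_RMA_ADDRESS
    msg.set_content("\n".join(f"{array}: {dev} marked failed"
                              for array, dev in failures))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    failures = failed_md_members()
    if failures:
        mail_report(failures)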
Re:What we've done (30TB so far) (Score:3, Informative)
Re:I just have to ask... (Score:1, Informative)
Yes, this is what SANs were developed for. Let me clue you in on our SAN, and what it cost us in the long run:
We bought a Xiotech Magnitude 1.4T SAN, with everything redundant, except the backplane (not an option in that line at the time). Well, what part dies? The backplane. Twice!
So, after converting all of our critical servers over to this SAN, it takes down our entire datacenter. We paid tons of money for this POS too.
Now, we're trying to sell it. We don't need it, have since moved to Infortrend for our storage requirements (way better product), and would like to get rid of it.
We tried eBay. Every time a customer inquired about re-licensing the crappy software on it from Xiotech, whaddaya know? It costs more to relicense the software than to just buy a whole new unit from Xiotech. We called Xiotech and complained to high hell about this.
We tried to donate it to a local university... SAME DEAL! Xiotech tried to relicense the software, EVEN AFTER KNOWING IT'S A CHARITABLE DONATION, for more than a new unit!
So, in our closet sits a @@#@!(%$ pile of @($*!@ called a Xiotech. I will speak badly about this company until the day I die. I will butt into other people's conversations that are unrelated to massive amounts of storage and tell them to curse the name Xiotech and stay bloody clear of that company. I will sacrifice small animals in Xiotech's name, so hopefully something horribly evil will happen to that company, and to the damn CIO who tells us to deal with it and hangs up on us.
My ire towards Xiotech is strong and will never die. I wish a worse fate for Xiotech than I do SCO.
That's not MTBF, this is.. (Score:5, Informative)
You have a drive that is rated at 500,000 hours MTBF. Suppose you bought a drive and let it run at its rated duty. Drives are normally rated to run 100% of the time, but many other devices have a duty cycle. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours of operation, on average, you should have a drive fail before its warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.
As a side note: in theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your downtime, as opposed to it all being unexpected.
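To make the MTBF-versus-warranty relationship concrete: under the usual constant-failure-rate (exponential) assumption, the chance that a given drive dies before its warranty runs out is roughly 1 - exp(-T/MTBF). A quick sketch, with a 3-year warranty and the 500,000-hour figure from earlier (the exponential assumption is the simplification here, which is exactly what the bathtub-curve objection above is about):

import math

MTBF_HOURS = 500000
WARRANTY_YEARS = 3
NUM_DRIVES = 1000

warranty_hours = WARRANTY_YEARS * 24 * 365
# Probability a single drive fails before its warranty runs out.
p_fail = 1 - math.exp(-warranty_hours / MTBF_HOURS)
print(f"P(a given drive fails within {WARRANTY_YEARS} years) = {p_fail:.1%}")
print(f"Expected failures across {NUM_DRIVES} drives over the warranty period: "
      f"{NUM_DRIVES * p_fail:.0f}")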
Re:Hell, BUY it from EMC! (Score:1, Informative)
Sure have; it happened when some dumbass striped their database across multiple hypers on the same physical spindle. There's a reason they give you those device layout maps; you should probably pay attention to them.
Re:Petabox (Score:2, Informative)
Let's say $0.07 per kWh.
Then the 50kW as you said would be:
50*24*31*$0.07 = $2,604/month
So it isn't super cheap; I guess that is why you don't hear about everyday people buying a petabyte of storage. I think if you try to save more on electricity (like coming up with some other device besides hard drives), you will end up paying a huge amount for whatever saves you that electricity, beyond the electricity costs.
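The same arithmetic, parameterized so you can plug in your own electric rate and load (the 50kW draw and $0.07/kWh are the figures from this thread; the rest is just the calculation):

RATE_PER_KWH = 0.07          # dollars per kWh
LOAD_KW = 50.0               # steady draw of the whole storage farm
HOURS_PER_MONTH = 24 * 31

monthly_cost = LOAD_KW * HOURS_PER_MONTH * RATE_PER_KWH
yearly_cost = LOAD_KW * 24 * 365 * RATE_PER_KWH
print(f"${monthly_cost:,.0f}/month, ${yearly_cost:,.0f}/year in electricity")
# Spread over a petabyte, that is only a few dollars per TB per month.
print(f"${monthly_cost / 1000:.2f} per TB per month for 1 PB")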
PetaBox? (Score:3, Informative)
Re:GPFS from IBM (Score:2, Informative)
I've implemented multiple GPFS file systems in the multi-terabyte range. It's a pretty robust file system. With full redundancy at the disk/controller/Brocade/server level per file system, I can still write more than 3 gb/s and read better than 3.5 gb/s. This was a design for redundancy, not performance.
20+ terabytes of FAStT fibre-attached storage. After four "SURPRISE" power outages following Katrina, which caused the loss of 12 disks, I still did not lose a single byte of the customer's data. GPFS can be pretty robust if implemented correctly.
I'd have no qualms about putting together a petabyte of GPFS file systems.
Re:How about a PetaBox? (Score:3, Informative)
I will add that the Archive has particular design and performance goals, namely:
- keep the cost / GB as low as possible
- keep cooling and power requirements low
- use the filesystem and bundle objects into large chunks (~100MB ARC files, last I checked)
- assume streaming writes affecting an edge of the system -- previously written data isn't modified
- assume random reads
- read latency is less important than cost / GB
I worked on the Archive ~5 years ago, and these are based on my understanding of the Archive from that period, so some of these may have changed.
But essentially, these are instantiated as: off-the-shelf SATA disks in fairly standard cases with either normal or special low-power motherboards, running a free OS (the Archive has used both Linux and FreeBSD), with off-the-shelf networking equipment.
--Pat
That's a lot of disks there son! (Score:2, Informative)
Well, let's see... using 300GB SCSI disks (assuming you can find RAID 0 hardware to support enough disks), you can build out a 1PB storage system with about 3,334 disks. That would set you back about $3.3 million, assuming you paid retail prices for the disks, ~$1,000/disk at CDW today. Of course, if you ordered 3,000+ disks, I'm sure they would cut you a deal on the price.
Any hope of daisy-chaining together a few dozen direct-attached storage devices to a NAS server? Something like a Dell PowerVault 220 with 14 300GB SCSI drives will set you back about $21K and give you 3.4TB per 3U of space (RAID-5), so you would have some safety net built in (albeit not much). Slap 10 of these on a PowerVault 6000 series and you should have a ballpark of 34TB (while shy of what you're looking for, it gets you in the right direction). Total cost around $250K - do it four times and spread the work out over four logical volumes and you should get in the neighborhood of 1 petabyte. You could then set up a redundant server structure, and for $2 million you have a redundant mirrored architecture ready for one side to fail and be brought back online quickly.
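For a sanity check on disk counts at this scale, here is the raw arithmetic as a sketch; the retail price per 300GB disk and the 14-disk-shelf RAID-5 layout follow the figures mentioned above, while enclosure, controller, and server costs are left out:

TARGET_TB = 1000            # 1 PB target
DISK_GB = 300
DISK_PRICE = 1000           # rough retail price per 300GB SCSI disk
SHELF_DISKS = 14            # e.g. one PowerVault-style enclosure

# Ceiling division written with negative floor division.
raw_disks = -(-TARGET_TB * 1000 // DISK_GB)
print(f"No redundancy: {raw_disks} disks, ~${raw_disks * DISK_PRICE:,} in disks")

# RAID-5 per shelf loses one disk of capacity per 14-disk enclosure.
usable_per_shelf_gb = (SHELF_DISKS - 1) * DISK_GB
shelves = -(-TARGET_TB * 1000 // usable_per_shelf_gb)
disks = shelves * SHELF_DISKS
print(f"RAID-5 shelves: {shelves} shelves, {disks} disks, "
      f"~${disks * DISK_PRICE:,} in disks alone")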
Google FS is not a real FS! (Score:2, Informative)
Also, the GoogleFS has very narrow requirements/goals. It works best for programs that only append to the files.