Does ZFS Obsolete Expensive NAS/SANs?
hoggoth writes "As a common everyman who needs big, fast, reliable storage without a big budget, I have been following a number of emerging technologies and I think they have finally become usable in combination. Specifically, it appears to me that I can put together the little brother of a $50,000 NAS/SAN solution for under $3,000. Storage experts: please tell me why this is or isn't feasible." Read on for the details of this cheap storage solution.
Get a CoolerMaster Stacker enclosure like this one (just the hardware, not the software) that can hold up to 12 SATA drives. Install OpenSolaris and create ZFS pools with RAID-Z for redundancy. Export some pools with Samba for use as a NAS. Export some pools with iSCSI for use as a SAN. Run it over Gigabit Ethernet. Fast, secure, reliable, easy to administer, and cheap. Usable from Windows, Mac, and Linux. As a bonus, ZFS lets me create daily or hourly snapshots at almost no cost in disk space or time.
Total cost: 1.4 Terabytes: $2,000. 7.7 Terabytes: $4,200 (Just the cost of the enclosure and the drives). That's an order of magnitude less expensive than other solutions.
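The "order of magnitude" claim is easy to sanity-check from the submitter's own numbers. A quick sketch, assuming the $50,000 figure refers to a commercial system of comparable capacity:

```python
# Rough cost-per-terabyte comparison implied by the figures above.
diy_small = 2000 / 1.4      # DIY build, 1.4 TB
diy_large = 4200 / 7.7      # DIY build, 7.7 TB
commercial = 50000 / 7.7    # assumed $50k commercial SAN at the same capacity

print(round(diy_small))   # -> 1429 dollars/TB
print(round(diy_large))   # -> 545 dollars/TB
print(round(commercial))  # -> 6494 dollars/TB
```

At the 7.7 TB size the ratio works out to roughly 12x, which is where the order-of-magnitude claim comes from.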
Add redundant power supplies, NIC cards, SATA cards, etc as your needs require.
ZFS (Score:5, Informative)
Congratulations, you discovered the "File Server" (Score:5, Informative)
BBH
ok for low end, not for high (Score:4, Informative)
Current issues (Score:5, Informative)
Real SANs do more (Score:5, Informative)
Still, I bet you could do what you want on the cheap. Being in health care, response time and availability really are life-and-death, but many other industries don't need to spend the extra. Best of luck.
It's just not the same thing. (Score:5, Informative)
Then you have the other features like dual redundant everything: controllers, power supplies, etc. Then you have thermal capabilities of rack-mount solutions that often are different from SATA, etc, etc.
HW Raid vs ZFS (Score:5, Informative)
No (Score:5, Informative)
Reliable? (Score:5, Informative)
Businesses also want service and support. They want the system to phone home when a drive starts getting errors, so a tech shows up at their door with a new drive before they even notice there are problems. They want to have highly trained tech support available 24/7 and parts available within 4 hours for as long as they own the SAN.
Finally, the performance of this solution almost certainly pales in comparison to a real SAN. These are all things that a home-grown solution doesn't offer. Saving 47K on a SAN is great, unless it breaks 3 years from now and your company is out of business for 3 days waiting for a replacement motherboard off eBay.
That being said, everything has a cost associated with it. If management is ok with saving actual money in the short term by giving up long term reliability and performance, then go for it. But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying.
No but... (Score:4, Informative)
I've never gotten the point of standalone NAS boxes. They never seemed fundamentally different from a traditional server, just with a premium price attached. I may not have seen the high-end stuff, however.
SAN is an entirely different situation altogether. You could have ZFS implemented on top of a SAN-backed block device (though I don't know if ZFS has any provisions to make this desirable). SAN is about solid performance to a number of nodes with extreme availability in mind. Most of the time in a SAN, every hard drive is a member of a RAID, with each drive having two paths to power and to two RAID controllers in the chassis, each RAID controller having two uplinks to either two hosts or two FC switches, and each host having two uplinks to either the two different controllers or to two FC switches. Obviously, this gets pricey for good reason, which may or may not be applicable to your purposes (frequently not), but the point of most SAN setups is no single point of failure. For simple operation of multiple nodes on a common block device, HA software decides which single node owns/mounts any FS at a given time. Other times, a SAN filesystem like GPFS is used to mount the block device concurrently among many nodes, for active-active behavior.
For the common case of 'decently' available storage, a robust server with RAID arrays has for a long time been more appropriate for the majority of uses.
Re:It's just not the same thing. (Score:5, Informative)
We have a Sun X4500 which uses 48 500GB SATA drives and ZFS to produce about 20TB of redundant storage. The performance we have seen from this machine is amazing. We're talking hundreds of gigabytes per second and no noticeable stalling on concurrent accesses.
Google has found that SATA drives don't fail noticeably more often than SAS/SCSI drives, but even if they did, having several hot spares means it doesn't matter that much.
SATA is a great disk standard. You get a lot more bang for your buck overall.
Re:It's just not the same thing. (Score:5, Informative)
I work in the world of hardware manufacturing, and I can tell you that this "magical" extra-testing process simply does not exist. Hardware failures are always expensive, and we do everything we can to prevent them. We build burn-in procedures based on what most call the 90% rule, but you cannot guarantee reliability beyond that; past that point, reliability is determined by better device design. Anyone who says differently either does not fully understand individual test-harness processes or does not understand how burn-in procedures work.
In short, more money is not necessarily better. Higher-volume designs typically are, though...
Re:Everyman? (Score:5, Informative)
Not being rich, I have a couple of external HDs totalling a little less than 1TB, and it's nearly full. The rest is archived on DVD or transferred to HD for storage (cheaper, faster and more reliable than DVD).
So yeah, I can easily imagine why any organisation dealing with huge media files would be interested. Heck, I'd be a client for a safe, multi-TB storage system if I could afford it... Not everybody only deals with text files for a living.
ZFS is great, but... (Score:4, Informative)
This hardly depends on ZFS... (Score:5, Informative)
That being said, this sort of solution may or may not be appropriate, depending on site needs. Sometimes support is worth it.
You're also grossly overestimating the cost of an entry-level iSCSI SAN solution. Even going with EMC, hardly the cheapest of vendors, you can pick up a 6TB solution for about $15k, not $50k. Go with a second tier vendor and you can cut that number in half.
vs Reiser4 (someday, maybe) (Score:5, Informative)
Reiser4 had the same problems with fsync -- basically, fsync called sync. This was because their sync is actually a damned good idea -- wait till you have to (memory pressure, sync call, whatever), then shove the entire tree that you're about to write as far left as it can go before writing. This meant awesome small-file performance -- as long as you have enough RAM, it's like working off a ramdisk, and when you flush, it packs them just about as tightly as you can with a filesystem. It also meant far less fragmentation -- allocate-on-flush, like XFS, but given a gig or two of RAM, a flush wasn't often.
The downside: Packing files that tightly is going to fragment more in the long run. This is why it's common practice for defragmenters to insert "air holes". Also, the complexity of the sync process is probably why fsync sucked so much. (I wouldn't mind so much if it was smarter -- maybe sync a single file, but add any small files to make sure you fill up a block -- but syncing EVERYTHING was a mistake, or just plain lazy.) Worse, it causes reliability problems -- unless you sync (or fsync), you have no idea if your data will be written now, or two hours from now, or never (given enough RAM).
(ZFS probably isn't as bad, given it's probably much easier to slice your storage up into smaller filesystems, one per task. But it's a valid gotcha -- without knowing that, I'd have just thrown most things into the same huge filesystem.)
There's another problem with reliability: Basically, every fast journalling filesystem nowadays is going to do out-of-order write operations. Entirely too many hacks depend on ordered writes (ext3 default, I think) for reliability, because they use a simple scheme for file updating: Write to a new temporary file, then rename it on top of the old file. The problem is, with out-of-order writes, it could do the rename before writing the data, giving you a corrupt temporary file in place of the "real" one, and no way to go back, even if the rename is atomic. The only way to get around this with traditional UNIX semantics is to stick to ordered writes, or do an fsync before each rename, killing performance.
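A minimal Python sketch of the safe version of the pattern described above; `atomic_write` is a hypothetical helper name, not an API from any library:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` via tempfile-then-rename.
    The fsync *before* the rename is what closes the reordering
    window: without it, the rename may reach disk first and a
    crash can leave an empty or corrupt file in place."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the data durable first
        os.rename(tmp, path)      # atomic replacement on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)        # clean up the stray tempfile
        raise

atomic_write("example.txt", b"new contents\n")
print(open("example.txt", "rb").read())  # b'new contents\n'
```

As the comment notes, that per-file fsync is exactly the performance cost you pay to be safe on filesystems with out-of-order writes.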
I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.
Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- SATA calls it Native Command Queuing. Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?
What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.
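To make the idea concrete, here is a purely hypothetical user-space sketch of what such a transaction API might look like -- no real POSIX interface offers this, and a kernel would have to guarantee the write ordering for it to actually survive a crash. The `WriteTransaction` class and its methods are invented for illustration:

```python
import os
import tempfile

class WriteTransaction:
    """Hypothetical sketch: stage many small file replacements,
    then commit them with one sync at the end, instead of paying
    an fsync per rename as the tempfile-then-rename hack requires."""

    def __init__(self):
        self._staged = []  # list of (tmp_path, final_path) pairs

    def write(self, path: str, data: bytes) -> None:
        # Stage the new contents in a tempfile in the target directory.
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        self._staged.append((tmp, path))

    def commit(self) -> None:
        for tmp, path in self._staged:
            os.rename(tmp, path)  # each rename is atomic on its own
        os.sync()                 # one flush for the whole batch
        self._staged.clear()

txn = WriteTransaction()
txn.write("a.conf", b"alpha\n")
txn.write("b.conf", b"beta\n")
txn.commit()
```

The point of the sketch is the shape of the interface: the program names the group of operations, and the ordering constraint is expressed once, at commit time.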
The fragmentation issue I see as a simple tradeoff: Packing stuff tightly saves you space and gives you performance, but increases fragmentation. Running a defragger (or "repacker") every once in awhile would have been nice. Problem is, they never got one written. Common UNIX (and Mac) philosoph
Replace NAS? Sure. SAN? No way. (Score:4, Informative)
Why not good ol' trusted Linux? (Score:5, Informative)
After that you just create a filesystem on top of the RAID. If you don't like ext3 or don't trust it, there is always xfs. I've had some rough times with reiserfs, xfs, and ext3, and for all that experience I would go with xfs for long-running server environments (and now get flamed for this little bit; use ext3 all you want).
The advantage is that you use very well tested code.
The problem comes with hotswapping. I don't know if the drivers are up to that yet. But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping. So Linux might be faster on that one.
That is a setup I would compare to a plug'n play SAN solution. And it totally depends on the environment. If the Linux box goes down for some reason for a couple hours/days, how much will that cost you? If it is more than twice the SAN-solution, you might just buy the SAN and if it fails just pull the disks and put them in the new one. I dunno if that would work on Linux.
Have you tried out Starfish? (Score:3, Informative)
Ever heard of Starfish? It's a new distributed clustered file system:
Starfish Distributed Filesystem [digitalbazaar.com]
From the website:
And you can build clusters at relatively low cost:
(warning: I work for the company that created Starfish)
-- manu
Re:Cheap, redundant, and performant storage. (Score:4, Informative)
And hiring a team to develop something similar to the Google File System is not cheap. Even high-end SANs will be cheaper.
Re:It's just not the same thing. (Score:3, Informative)
In which case you're talking complete rubbish. "Hundreds of gigabytes per second"? Just one GB/sec would need 4x2Gbit FC links, all exceeding their peak theoretical throughput.
Re:Congratulations, you discovered the "File Serve (Score:3, Informative)
Why not?
Sure, a 16-channel SATA controller with RAID 0/1/5 will cost you $400. But that will handle, using 750GB drives that have recently entered the "affordable" range, a total of 12TB (or more practically, a 10.5TB RAID5 with one hot spare). Find an OEM that can set you up with that for under $5000 total.
Now, that uses a PC chassis and wouldn't look "nice" in a rack. So what? If you need 10TB and don't want to blow $50k on it, you don't have a lot of choices... So if you insist on all racked equipment, buy a rack shelf kit and lay it on its side (and hide it with blanks if you care that much)
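The capacity arithmetic above can be sketched out; the helper name and the one-hot-spare assumption are mine, following the parent's description:

```python
def raid5_usable_tb(drives: int, drive_gb: int, hot_spares: int = 1) -> float:
    """Usable RAID5 capacity: one drive's worth of parity, plus any
    hot spares, is subtracted from the raw drive count."""
    data_drives = drives - hot_spares - 1  # minus spares, minus parity
    return data_drives * drive_gb / 1000.0

print(raid5_usable_tb(16, 750))  # 16x 750GB, one spare -> 10.5 TB
print(16 * 750 / 1000.0)         # raw capacity           -> 12.0 TB
```

That matches the parent's numbers: 12 TB raw, or a 10.5 TB RAID5 with one hot spare.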
Re:Congratulations, you discovered the "File Serve (Score:3, Informative)
What you have is a box that exports block devices out over layer 2. Another device loads it as a block device, and can now treat it in whatever fashion it could deal with any other block device. For example, I have 2 "shelves" of Serial ATA drives going. I have a third box where I could either load Linux, using md to create RAID sets, or, as I've actually done, use the hardware on each of the two shelves, create a RAID5 set on each, then use md to create a RAID1 set out of the two RAID5s. I then take my spankin' new md0 device, which is huge for my needs (7.5TB), use LVM to create a volume group (called 'office' for me) and that creates
Now you can format those lv's like any other partition/slice. I've used xfs on all of mine, but you could use ext2/3 if you really wanted.
Re:It's just not the same thing. (Score:3, Informative)
Google Releases Paper on Disk Reliability:
http://hardware.slashdot.org/article.pl?sid=07/02
Re:Reliable? (Score:5, Informative)
The CMU study by Bianca Schroeder and Garth A. Gibson would suggest otherwise. In fact, there was no significant difference in replacement rates between SCSI, FC, or SATA drives.
"They want the system to phone home when a drive starts getting errors"
Of course, the other recent study by Google showed that predictive health checks may be of limited value as an indicator of impending catastrophic disk failure.
Basically, empirical research has shown that the SAN storage vendors are screwing their customers every day of the week.
"Saving 47K on a SAN is great, unless it breaks 3 years from now"
Of course, saving 47K on a SAN means you can easily triple your redundancy, still save money, and when it breaks, you have two more copies.
At the same time, the guy spending the extra 47k on an 'Enterprise Class Ultra Reliable SAN' will get the same breakage 3 years from now, he won't have been able to afford all those redundant copies, and as he examines his contract with the SAN vendor, he notes that they actually don't promise anything.
"But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying."
Premium grade bovine manure with (fake) gold flakes?
Really, handing the decision makers several scientific papers and a few google search strings would leave them much better equipped to make a rational decision.
Having several years of experience with the kind of systems you're talking about, I can just say I've experienced several situations where, if we didn't have system-level redundancy, we would have suffered not only system downtime but actual data loss on expensive 'enterprise grade' SANs. That experience, as well as the research, has left me somewhat sceptical towards the claims of SAN vendors.
Re:Specifics please. (Score:4, Informative)
Re:ZFS is great, but... (Score:1, Informative)
Now seriously, you must be a windows "admin"... In the rare case a reboot is necessary, you should know that you can patch an offline boot partition (via lu) and boot into it afterwards (whenever). The downtime is measured in seconds and you can boot back to the original un-patched partition if anything goes wrong.
That's unix administration for you.
Re:Specifics please. (Score:5, Informative)
No system can compensate for bad management by people, but I digress.
All data is critical. But, to say that your data is less safe with a system that cost $4700 than a system that cost $50,000 is fallacious without some heavy proof behind it. For now, I am going to ignore that a functional backup is part of "mission critical" and just address the online storage portion of the argument.
Let's start with a server white box. Something with redundant power supplies, ethernet, etc. Put a mirrored boot drive in it. Install Linux. So far, the cost is fairly low. Add an external disk array, at least 15 slots, the ones with hot-swap, hot-spare, RAID 5, redundant power supplies and fill it with inexpensive (but large) SATA drives. Promise sells one, as do others. Attach to server, voila, a cheaper solution than EMC for serving up large amounts of disk space.
What if a drive fails? The system recreates the data (it is RAID5, after all) onto a hot spare. You remove the bad drive, insert a new one, run the administration. The users won't even notice their MP3's and Elf Bowling game were ever in danger.
For the people who believe strongly in really expensive storage solutions, please explain why. I would like to know if you also hold the same theory for your desktop PCs, because surely, a more expensive PC has to be better. Right?
Re:Specifics please. (Score:4, Informative)
True, but there is a difference. The difference is in QA.
The "consumer-grade" and "business-grade" are the same off the shelf stuff, but if you are getting business-grade stuff from a reputable vendor they QA the consumer-grade parts, throw out the bad ones, and stamp "business-grade" on the ones that survive. This is why the business-grade level of products often are a generation or so behind the consumer-grade level.
Yes, you can get lucky and get consumer-grade stuff that works great. But if it doesn't, then you are the QA guy, and the downtime is on you. If the time for you to do the QA and the associated downtime is cheaper than the cost of business-grade, then by all means do it. Otherwise, you have to pay the extra bucks.
Now, regarding NASs, I think these things are overpriced, especially the maintenance on them. The maintenance goes through the roof once the equipment is beyond the MTBF of the drives, which is where a high dollar NAS should shine right? Any piece of crap RAID box will work when all the drives are new and functioning well. What you are paying for is the redundancy and availability, which is redundant to pay for when all of the equipment is new.
Re:Everyman? (Score:5, Informative)
I routinely work with files that are 100 GB - 300 GB each.
Just copying one file from drive to drive takes hours.
I have about 4 Terabytes in use, with another 4 Terabytes for backup.
My usage is the exact opposite of database usage (which most storage is optimized for).
I need to copy huge sequential files. I rarely need many small reads or writes.
Because of the long time it takes to move these files around, I think NFS or CIFS would be too slow. That's why I am interested in the ability of ZFS to easily export iSCSI targets. Some tests I read showed that ZFS exporting iSCSI is about 4 times faster than ZFS exporting NFS or CIFS.
I am comparing to drives directly attached via eSATA, so it's got to be fast to come anywhere close to what I get with eSATA.
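For a sense of the transfer times involved, here is a back-of-the-envelope calculation; the 0.8 protocol-efficiency figure and the 3 Gbit/s eSATA link rate are my assumptions, not numbers from the parent:

```python
def transfer_hours(size_gb: float, link_gbit: float, efficiency: float = 0.8) -> float:
    """Hours to move `size_gb` gigabytes over a `link_gbit` Gbit/s link
    at an assumed protocol efficiency (0.8 is a rough guess)."""
    seconds = size_gb * 8 / (link_gbit * efficiency)
    return seconds / 3600

print(round(transfer_hours(300, 1.0), 2))  # 300 GB over GigE  -> 0.83 h
print(round(transfer_hours(300, 3.0), 2))  # over 3 Gbit eSATA -> 0.28 h
```

Even under optimistic assumptions, a single Gigabit link is roughly a 3x penalty versus an eSATA-class link, which is why the iSCSI-vs-NFS overhead matters so much here.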
Re:No (Score:3, Informative)
It's generally not about the 64- vs 128- vs whatever.
It's about the additional reliability (current bugs aside), and the ease of filesystem/pool management. For example, a Sun developer was developing on a workstation with bad hardware, which occasionally caused incorrect data to be written to disk. After setting up raidz, ZFS automatically detected and corrected the error: http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta [sun.com]
Scary, yes. Doing that definitely isn't something I'd recommend, but it does show one of the powerful features of ZFS.
Re:Everyman? (Score:2, Informative)
Re:Specifics please. (Score:4, Informative)
It's the same reason we buy Dell. We could buy white boxes or parts from Newegg for all of our systems, but talk about a hassle when it comes to them needing hardware maintenance or just assembly. With the support Dell offers, we get a complete box that's been tested, we just need to reformat and install our own stuff. Something breaks, we make a 10 minute phone call and get a replacement the next day, with or without assisted installation. But we pay probably 30% more per box for that.
Re:I'm a Hardware Guy, Not a ZFS Guy (Score:5, Informative)
OK:
Re:ZFS (Score:4, Informative)
Increased reliability (all data is checksummed, even in non-raid configurations), near brainless management (e.g., newfs is not needed, raid configurations are trivial to setup, etc.), built-in optional compression (even for swap, if you're feeling masochistic), etc.. Encryption is in development.
See my other posts here for links.
RAID controller failure (Score:5, Informative)
Because I've seen that happen more than once.
I'm not saying the more expensive solution is better. I'm just saying that in my personal experience I've seen *more* data destroyed from RAID controller failure than from hard drive failure. I would love to find out the solution to that one.
I do not claim to be a hardware expert or system administrator, so there may be a well known solution (don't buy 'brand X' RAID controllers). I just don't happen to know it.
Re:ZFS and Sun boxes (Score:3, Informative)
You can pick up those 750GB Seagate SATA drives for about $200 each now...
Re:ZFS (Score:2, Informative)
Re:Specifics please. (Score:3, Informative)
A Netapp S500 with 12 disks and NAS/iSCSI features is a good example. Roughly 10k and you get Netapp's SMB/CIFS implementation (considered excellent), NFS, iSCSI, snapshots, etc. Slightly lower price points can be had through Adaptec Snap Servers. They have a nice SAS JBOD expansion unit for their systems. HP just released new "storage servers" based on Microsoft's storage server OS; heck of a lot of value in those systems.
The delta between 3-4k and 10k isn't trivial, and if your budget is tight perhaps you should roll your own. But 10k for supported NAS/iSCSI that functions a few minutes after you get it in the rack isn't a ripoff either. Not by a long shot.
Re:ZFS and Sun boxes (Score:2, Informative)
Just a brief summary:
-- SATA refers to the new Serial ATA.
-- ATA or PATA refers to the older "Parallel" ATA. (ATA dates back to IBM PC AT and refers to that machine's AT Attachment interface.)
-- IDE refers to any drive (ATA, SCSI...) with integrated drive electronics, that is, everything that has come after the ancient dumb drives that required a model-specific controller on the motherboard. In other words, not a very useful term anymore.
-- SCSI refers to Small Computer System Interface; funny how it's the one used in the bigger iron. Beats the pants out of ATA when handling multiple daisy-chained drives; SATA is catching up in handling multiple drives. Traditional SCSI also uses a parallel interface and cabling.
-- SAS refers to Serially Attached SCSI (some inspiration from SATA perhaps?).
-- FC refers to Fibre Channel, a SCSI-like very fast interconnect type and interface protocol; often (but not always) uses optical cabling.
-- iSCSI refers to SCSI over TCP/IP, typically carried over Ethernet.
But I never understood the difference between a SAN and a NAS when the configuration gains any complexity beyond a textbook example. You can have a SAN with many NAS boxes, or you can have NAS with multiple SANs, sooo...
Re:ZFS (Score:4, Informative)
Simple administration and data integrity. This is all it takes to make a 6-disk RAID at home:
zpool create sun711 raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t0d0
That gets you a ZFS storage pool mounted at
Of course, once you have a storage pool, you can then create additional file systems from there. Here's how you create some NFS storage:
zfs create sun711/storage
zfs set sharenfs=rw sun711/storage
When I had a disk start getting flaky (it started reporting high raw read error rates - that's what I get for buying drives on ebay...), I simply did the following:
zpool offline sun711 c0t5d0
zpool replace sun711 c0t5d0
There you have it... it can't get much simpler than that.
Re:ZFS (Score:5, Informative)
A correction:
Copy-on-write is quite a misnomer here (even if Sun uses that term). It is a transactional filesystem. Blocks are not copied upon write; they are only written, and then the transaction log is updated. It's far more clever than old-fashioned COW schemes. It can be compared with NetApp's WAFL filesystem.
Re:ZFS and Sun boxes (Score:3, Informative)
In really, really simple terms, a SAN provides configurable disk space to a host, while a NAS supplies file space and file serving to one or more hosts. Many storage solutions offer various functionality and can provide both NAS and SAN functionality at some level.
Re:ZFS (Score:3, Informative)
Re:Specifics please. (Score:1, Informative)
Re:Specifics please. (Score:3, Informative)
If you honestly think a SAN, particularly an EMC SAN, is a single point of failure, then you don't understand SANs. Take a look at the high-end Clariion and Symmetrix models for redundancy and scalability. Then take a look at who some of EMC's bigger customers are.
No, I don't work for EMC, but my employer has several mid-level EMC storage systems. They work. Reliably, quietly, and yes, they scale nicely too.
Re:ZFS (Score:3, Informative)
"ZFS is a new kind of filesystem that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. We've blown away 20 years of obsolete assumptions, eliminated complexity at the source, and created a storage system that's actually a pleasure to use.
ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth and stranded storage. Thousands of filesystems can draw from a common storage pool, each one consuming only as much space as it actually needs. The combined I/O bandwidth of all devices in the pool is available to all filesystems at all times.
All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS filesystem, ever. Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations. If one copy is damaged, ZFS will detect it and use another copy to repair it."
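The checksum-and-self-heal idea quoted above can be sketched in a toy form. This is illustrative only -- ZFS uses its own on-disk layout and checksum algorithms (fletcher, SHA-256), and the dictionary "blocks" here are invented for the example:

```python
import hashlib

def store(data: bytes) -> dict:
    """Store a 'block' alongside its checksum, as an end-to-end
    integrity scheme does."""
    return {"data": data, "sum": hashlib.sha256(data).hexdigest()}

def read_self_healing(primary: dict, mirror: dict) -> bytes:
    """Return the block's contents, repairing the primary copy from
    the mirror if its checksum no longer matches (silent corruption)."""
    if hashlib.sha256(primary["data"]).hexdigest() == primary["sum"]:
        return primary["data"]
    good = mirror["data"]
    assert hashlib.sha256(good).hexdigest() == mirror["sum"], "both copies bad"
    primary["data"] = good       # heal the damaged copy in place
    primary["sum"] = mirror["sum"]
    return good

a = store(b"important block")
b = store(b"important block")
a["data"] = b"bit-flipped block"   # simulate silent corruption
print(read_self_healing(a, b))     # b'important block'
print(a["data"])                   # primary has been repaired
```

The key point mirrored from the blurb: because the checksum travels with the block, corruption is detected on read rather than silently returned to the application.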
Re:ZFS and Sun boxes (Score:3, Informative)
Well, actually, IDE's history is a bit different than that. IDE requires a host bus interface, but, yes, these drives do have their disk controllers built into the PCA (printed circuit assembly) attached to the disk mechanism.
Before Compaq and others developed the first IDE systems, hard drives usually had external controller boards that used low-level commands. IDE standardized the host interface to disk storage at the driver level, and standardized the host bus/drive command set at the bus level.
And it's not just disk drives that use the IDE stack; other devices can be attached to the IDE bus, too.
SCSI drives require a SCSI host bus adapter with a dedicated processor, and that adapter does the heavy lifting for disk access. IDE requires the host CPU to do much of the processing, whereas with SCSI the adapter does the majority of the work. This model was carried over to FC technology, which likewise offloads processing from the CPU.
SCSI/FC are preferred in the 'big iron' type of installations. IDE/ATA/SATA are fine on a dedicated NAS system. In effect, the CPU of the NAS motherboard is doing the work that is done by the host bus adapters in SCSI/FC.
At the drive mech level, FC is a copper interface. The design of the connector on the disk mech allows it to be plugged. This provides the ability to quickly replace failed drives. The drive mechs are aggregated into some type of array to provide protection from data loss. This array of drives is then attached to systems via fiber optic cabling.
You can simulate some, but not all, of the benefits of a FC/SCSI array using SATA technology. I don't know if the IDE drivers are being rewritten to use the multi-core processors yet, but that would help reduce some of the latency.
Short answer, if what the OP was aiming for is to get into a large disk array for cheap, trading some reliability and performance for low cost, the idea is a good one. I would be looking for a multi-core cpu in the motherboard and an OS that has parallel processing drivers for the IDE channel. Be sure that all the drives have plenty of cooling. Have a backup solution. Some day, this lash-up will give you heartache, but till then, you've saved money.
Can you tell I used to work in the disk storage business?
Good luck.
Re:ZFS and Sun boxes (Score:3, Informative)
Re:Multiple-disk failures? Why?! (Score:3, Informative)
The first problem with your gnuplot script is that you're assuming a Poisson distribution for HDD failures (which is incorrect). Statistical failure distribution follows a Weibull distribution with k roughly equivalent to 7.5. Unfortunately, because you build your argument off of a Poisson distribution approximation, the rest of the analysis doesn't make much sense.
If you are interested in HDD failure rates and failure prediction, there is a fantastic paper done by Bianca Schroeder and Garth Gibson of CMU. I think this is the link to their main research website [cmu.edu].
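The practical difference between the two models is the hazard (instantaneous failure) rate: a Poisson/exponential model assumes it is constant over a drive's life, while a Weibull model lets it change with age. A small sketch, using illustrative shape and scale values rather than anything fitted to the CMU data:

```python
# Weibull hazard rate h(t) = (k/scale) * (t/scale)^(k-1).
# k > 1: hazard rises with age (wear-out); k < 1: infant mortality;
# k = 1 collapses to the constant-rate exponential (Poisson) model.

def weibull_hazard(t: float, k: float, scale: float) -> float:
    return (k / scale) * (t / scale) ** (k - 1)

k, scale = 1.5, 5.0  # illustrative values only
rates = [weibull_hazard(t, k, scale) for t in (1.0, 3.0, 5.0)]
print([round(r, 3) for r in rates])   # hazard increases with age
print(weibull_hazard(2.0, 1.0, 5.0))  # k = 1: constant 0.2 at any age
```

This is why the choice of distribution matters for the parent's argument: with a constant hazard, two drive failures close together look like bad luck; with an age-dependent hazard, drives bought and worn together are genuinely more likely to fail together.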
I think you miss the point of systems such as Starfish and other distributed clustered file systems. You have many other points of failure in a system: memory, CPU, power supply, power outage, motherboard, network switch, OS kernel, router, network cable, and the all important "oops, I tripped over the power cord". There are also times that you want to take down nodes in a highly-available cluster for maintenance without affecting your applications - to do this, you need a file system that assumes and can work around node-level failure.
There is much more to highly-available clustering than just making sure your disk sub-systems are bulletproof.
-- manu
enter GPL3 (Score:4, Informative)
That's nice except Jonathan Schwartz has indicated that OpenSolaris will go GPL3, assuming the final version of the license is OK.
Re:ZFS (Score:1, Informative)
Re:No (Score:3, Informative)
For example, ZFS will panic when you lose enough data or devices that the pool is no longer functional. If you take care and use a replicated pool though, this is unlikely to ever happen. Even if it does, all it requires is that you reattach the devices and reboot. If the disks truly are dead, then you are going to backups anyway. You do have backups, right?
ZFS has some rough edges yet, but to call it "far from ready" is mere FUD. Most of the problems with it are a matter of convenience, and nearly all that have been mentioned in the comments are actively being worked on. With a little bit of care, and proper backups, ZFS is rock solid. Meanwhile, it is improving every day, and if you choose not to revisit it, it is your loss.
Re:Multiple-disk failures? Why?! (Score:1, Informative)
Nice little blog about that (from a manufacturer) ;
http://blogs.netapp.com/dave/TechTalk/2006/03/21/
Re:RAID controller failure (Score:3, Informative)
Maybe not suitable for high performance situations, but I've not found it slow.
Re:Multiple-disk failures? Why?! (Score:2, Informative)
Re:ZFS (Score:4, Informative)
Re:ZFS (Score:3, Informative)
Also, if you are using a controller (as on a SAN) that has a battery-backed write cache, you will want to disable fsync() for the ZIL in the /etc/system file. There is also a nice script that can be used to limit the amount of memory ZFS uses, though, again, due to the poor memory accounting, it doesn't work that well. I have limited ZFS to 512MB of RAM and it grabs 3GB. Solaris 10 u4 is supposed to address some of this. Because of the ZIL (part of the integrity check/scrubber), performance for multiple writes suffers, and it can make using it as an NFS server less than ideal. It has a LOT of potential, and with enough spindles it can overcome some of these issues temporarily. It just has not matured to the level at which they are selling ZFS.