Does ZFS Obsolete Expensive NAS/SANs?
hoggoth writes "As a common everyman who needs big, fast, reliable storage without a big budget, I have been following a number of emerging technologies and I think they have finally become usable in combination. Specifically, it appears to me that I can put together the little brother of a $50,000 NAS/SAN solution for under $3,000. Storage experts: please tell me why this is or isn't feasible." Read on for the details of this cheap storage solution.
Get a CoolerMaster Stacker enclosure like this one (just the hardware, not the software) that can hold up to 12 SATA drives. Install OpenSolaris and create ZFS pools with RAID-Z for redundancy. Export some pools with Samba for use as a NAS. Export some pools with iSCSI for use as a SAN. Run it over Gigabit Ethernet. Fast, secure, reliable, easy to administer, and cheap. Usable from Windows, Mac, and Linux. As a bonus, ZFS lets me create daily or hourly snapshots at almost no cost in disk space or time.
Total cost: 1.4 Terabytes: $2,000. 7.7 Terabytes: $4,200 (Just the cost of the enclosure and the drives). That's an order of magnitude less expensive than other solutions.
Add redundant power supplies, NICs, SATA controllers, etc., as your needs require.
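For the curious, the pool setup itself is only a couple of commands. Here's a minimal sketch driving the stock zpool/zfs utilities from Python; the pool name, filesystem, and disk device names (c1t0d0 and friends) are hypothetical stand-ins, not the submitter's actual layout:

    import subprocess

    DISKS = [f"c1t{n}d0" for n in range(6)]   # six hypothetical SATA devices

    def run(*args):
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    # One single-parity RAID-Z pool across all six disks.
    run("zpool", "create", "tank", "raidz", *DISKS)
    # A filesystem to export over Samba as the NAS piece.
    run("zfs", "create", "tank/media")
    # Sanity check that the pool is online and healthy.
    run("zpool", "status", "tank")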
Cheap, redundant, and performant storage. (Score:5, Interesting)
Google has a great solution [storagemojo.com] that focuses on the “cheap” part without compromising much on the other two. If you haven't read up on the Google File System, definitely take the time to. At the very least, it seems to call into question the need to shell out tens of thousands for high-end storage solutions that promise reliability in proportion to the dollar.
Re:Everyman? (Score:3, Interesting)
If you like your music to actually SOUND good, 128kbps sucks. I personally rip my music using a variable bitrate between 224 and 320 kbps. Unfortunately, this makes for VERY large files. But my music sounds FANTASTIC!
Re:Everyman? (Score:3, Interesting)
Consolidate your multimedia and run MythTV for a while. Once you rip and encode several TV series, all your DVD films, and have Myth recording your favourite shows, a terabyte doesn't seem like much. If you want an idea of future examples of massive storage consumption, imagine having MythTV record all channels all the time, so you'd basically be able to decide post-transmission what you want to view and save...
Of course, while I agree most NAS and SAN solutions are grotesquely overpriced and mainly useful for separating fools from their money, I can't really see why one would bring up ZFS and OpenSolaris for this purpose. Something like Openfiler [openfiler.com] would be vastly more appropriate, proven, and easy to manage.
Re:Everyman? (Score:4, Interesting)
I actually have two 48GB databases full of minimal instruction sequences for generating boolean functions. Do I win the obscure use of disk space prize?
Re:Its just not the same thing. (Score:3, Interesting)
Show me how you build a RAID 50 of 32 SATA or IDE drives.
Get yourself a nice big rackmount case and some 8 or 16 port SATA controllers.
Also show me a SINGLE SATA or IDE drive that can touch the data I/O rates of a U320 SCSI drive with a 15k spindle speed.
Of course, the number of single drives you can buy for the cost of that single 15k drive will likely make a reasonable showing...
Low-end consumer drives can't do the high-end stuff. Don't even try to convince anyone of this. Guess what: those uses are not anywhere near strange for big companies. With a giant SQL DB you want... no, you NEED the fastest drives you can get your hands on, and that is SCSI or Fibre Channel.
This is technically true; however, "low end consumer" drives can certainly beat the performance of what *used* to be "high end" until relatively recently, and are quite adequate for significant proportions of the market.
Individually, 7.2k SATA drives are quite a bit slower than 15k SAS/FC drives. But get a dozen or two of those SATA spindles in an array, and you've got some (relative to cost) serious performance.
Re:ok for low end, not for high (Score:3, Interesting)
BTW, I read recently that where 4Gb FC really excels is in large block sequential transfers and that small random access transfers are actually better over gigabit iSCSI. Check it out: http://searchstorage.techtarget.com/columnItem/0,
Plus you really have to think about other bottlenecks. How many disks need to be striped to consistently saturate the bandwidth of 1Gb Ethernet? 10Gb ethernet? What about the bus that the host adapter/NIC is on? Precious few boxes have 4x PCIe and then what about CPU overhead, managing all this streaming data? Just food for thought...
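To put rough numbers on that, here's a back-of-the-envelope Python calc; the ~70MB/s sustained sequential rate per 7.2k SATA drive is an assumption, not a measurement:

    import math

    SATA_STREAM = 70e6   # assumed sustained sequential rate of one 7.2k SATA drive, bytes/sec

    for name, bits_per_sec in (("1GbE", 1e9), ("10GbE", 10e9)):
        drives = math.ceil((bits_per_sec / 8) / SATA_STREAM)
        print(f"{name}: ~{drives} striped drives to fill the wire")

    # 1GbE: ~2 drives; 10GbE: ~18 drives -- before protocol, bus, and CPU overhead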
I'm a Hardware Guy, Not a ZFS Guy (Score:3, Interesting)
RAID0 = a bunch of hard drives strung together to look like one big drive. In the implementation I'm referring to, the data is striped at a block size and written across each disk simultaneously (or nearly so). This is the fastest disk subsystem available but the most susceptible to failure. If one disk fails, you're toast.
Does ZFS do anything in this situation? I have 'one big drive' presented to the operating system, the striping is abstracted at the hardware layer, and I have a semi-expensive ($300 - 800) RAID card running this. In a random I/O workload I get about 150 iops per drive, streaming is another ball of wax typically more interface limited/block size limited than head movement.
To save money, I'd drop the RAID card and put ZFS down. I now have 12 drives (SAS attached); can I get better performance with ZFS like I could with the RAID card? Think log drives for a big DB, or scratch storage space while manipulating a metric assload of video files. Gbit/sec transfer rates for real-time storage of HD video.
RAID 0+1. All the perf benefits of RAID0, but 2x the drives. Typically two cabinets RAID 0'd, then RAID 1 across the two RAID 0s. I get redundancy at a slight performance penalty due to 2x the writes happening, but no degradation on the read.
What can ZFS do for me here? Again, performance improvements/changes?
RAID5
50% penalty in performance even with the best card, because with a high drive count you have to read in data, calc parity, then do the write. However, one drive's worth of parity gives me full redundancy; if I lose a second drive, though, I'm toast. RAID6 is sometimes used to describe distributing the hot spare into the array: no more disk space, but it can take two simultaneous drive failures and keep running.
What does ZFS do for us here?
This isn't a troll; I really know nothing about ZFS, and I'm genuinely curious how I could avoid having to do the above to protect the photography/video data for my photography business. Would be cool if I could do it on my Mac Pro.
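To put numbers on the RAID5 case above, here's a quick Python back-of-the-envelope using the 150 IOPS/drive figure from the parent (the 12-drive count is just an example). For what it's worth, RAID-Z sidesteps this read-modify-write cycle by always writing full stripes:

    DRIVES = 12                        # hypothetical array size
    IOPS_PER_DRIVE = 150               # the parent's random-I/O figure

    raw_iops = DRIVES * IOPS_PER_DRIVE
    # Classic RAID-5 small-write penalty: read old data + read old parity +
    # write new data + write new parity = 4 disk I/Os per logical write.
    raid5_write_iops = raw_iops // 4
    print(raw_iops, raid5_write_iops)  # 1800 random reads vs ~450 random writes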
Let's pretend you're right for a moment... (Score:5, Interesting)
So, that's $6k total instead of $3k.
The one problem I have with OCFS2 is that when it fences a system, it tends to either bring the whole thing down (kernel panic) or, in newer versions, give you the option of forcibly rebooting instead. This killed it for a project I was working on, where one of the machines had other mission-critical systems running that were not on the OCFS2 volume, and thus it seemed absurd to panic and bring down everything else too.
So if that's your problem, you can always build a third, identical system to run the other stuff on. $9k.
Even if you figure another $1k for random stuff, like maybe a LOT of gigabit crossovers, or 10gig fiber, or something, that's still a fifth of the cost of the "business-grade" or whatever else he was considering. Even assuming the worst-case scenario, where the homebrew system costs a lot more to maintain (even electricity and cooling, maybe), how long will it take for it to cost another $40k? And this way, you have an ENTIRELY redundant system -- the only way you lose it is if, say, the whole building blows up.
I mean, I sort of agree that you get what you pay for. But when the difference in price is that much, the only way it's ever worth it is if there's really great support with the high-end package. And is it $40k worth of support? If not, I imagine this guy could put together a company selling little $3k, $6k, and $10k systems for $20k each (including support), shaving off $30k even for the most paranoid.
And all of that is pretending you're right about the cheap consumer-grade hardware actually being less reliable.
Re:ZFS (Score:1, Interesting)
I'm actually very interested in such a project, but I see nothing compelling here.
Re:Everyman? (Score:5, Interesting)
Here's why:
1) Snapshots. ZFS lets me make lots of snapshots to protect myself from user error, viruses, etc. destroying my data. ZFS snapshots are so lightweight that I can make them hourly at nearly no cost in time and disk space (a rough sketch of an hourly snapshot job follows this list).
2) Data integrity. Even RAID-5 can allow some errors to creep into my data (google: bit rot). ZFS has a much higher level of data integrity protection.
3) Cost/Performance. ZFS RAID-Z appears to be much faster than software RAID-5, and on par with or faster than hardware RAID-5. Hardware RAID-5 is much more expensive than software.
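Point 1 in practice can be a tiny cron job. A rough sketch, with a made-up filesystem name and retention count, shelling out to the standard zfs utility:

    import subprocess, time

    FS, KEEP = "tank/media", 24   # hypothetical filesystem; keep one day of hourlies

    # Take this hour's snapshot.
    stamp = time.strftime("%Y%m%d-%H00")
    subprocess.run(["zfs", "snapshot", f"{FS}@hourly-{stamp}"], check=True)

    # List snapshots oldest-first and prune beyond the retention window.
    names = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-s", "creation"],
        check=True, capture_output=True, text=True).stdout.split()
    hourlies = [s for s in names if s.startswith(f"{FS}@hourly-")]
    for old in hourlies[:-KEEP]:
        subprocess.run(["zfs", "destroy", old], check=True)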
Re:Everyman? (Score:2, Interesting)
Of course, that was before I jumped onto the NAS storage cash cow, doing pretty much exactly what the article poster wrote, only I turned around and sold my PC-based NAS boxes for about half the price of "enterprise" solutions, which still represents a 400% markup for me.
You have to realize, the companies and people building these overpriced RAID arrays are just your average greedy bastard, usually no smarter or more skilled than any other geek. Most of the computer-attached devices today are little more than an XScale processor, a tiny bit of RAM and Linux. Broadband routers, NAS boxes, KVM/IP switches, "smart" network adapters, heck I wouldn't be surprised to find home entertainment devices running Linux. We're in the age of mashups, where any idiot with a marketing budget can slap various I/O ports on a board and "invent" appliances.
Re:Specifics please. (Score:5, Interesting)
The premium paid for higher-end storage is decidedly nonlinear. For marginally more reliable or faster storage, you pay about a factor of ten. One example I'm familiar with is Hitachi. We had a 64TB HDS array a few years ago that was worth roughly $2M. We could have purchased an equivalent amount of commodity storage for probably $200k at the time, but didn't. Why would we spend the extra money? Speed, configurability, expandability, and reliability.
First of all, speed. That thing was loaded with 73GB 15k FCAL drives. RAID was in sets of four disks, with no two disks in a set sharing the same controller, backplane, or cache segment. Speaking of cache, the rule was 1GB per TB, so we had 64GB of fast, low-latency, fully mirrored cache on the thing. It was insanely fast, and (most importantly) didn't slow down under point load. One tool ran automatically on the array itself, looking for hotspots and reallocating data on the fly.
Configurability: We could mirror data synchronously or asynchronously to our DR site, by filesystem, file, block, LUN, or byte. We could dynamically (re)allocate storage to multiple systems, and moving databases between machines was a breeze. Disk could be allocated from different pools (i.e. different performing drives could be installed), depending on requirements. Quality-of-Service restrictions could be put in place as well, although we never used them.
Expandability: The beast had 32 pairs of FC connections, could support 96GB of internal mirrored cache, and I can't remember how much actual disk. The key wasn't the amount of disk we could put on it, so much as how well the bandwidth scaled--and it scaled well.
Finally, the real key - Reliability. All connections were dual-pathed, with storage presented to a pair of smart FC switches which were zoned to present storage to various systems. We could lose three of the four power cables to the main unit (auxiliary disk cabinets only had two power connections each), and still run. We could lose any entire rack, and still run. We could lose any switch in our environment, and still run. We could lose two disks from the same RAID set and still run. When we lost a disk, the system would automatically suck up some cache to use for remirroring the data to multiple disks as fast as possible, and then after protecting it, would remirror back to a single logical device. In the event that we lost the entire device, we could run from our DR site synchronous mirror with less than a ten second failover.
This sort of thing is massive overkill for most people and companies, but when someone is doing realtime commodities trading, (or banking, or stock exchanges, etc.) the protection and support are worth the extra money. You just can't build that sort of thing on your own for any less money, at the end of the day.
Absolutely true (Score:3, Interesting)
Old guys like me may remember when National Semiconductor was, if I recall rightly, fined for faking test records in the same year it won an award for the field reliability of its military products. Or the discovery in the early 90s that volume-produced Japanese semiconductors were far more reliable than many JAN devices. There is just no substitute for having to manufacture in volume in a competitive world.
In semiconductors, the downside is that the things that produce higher reliability, like thicker oxide and bigger anti-static diodes, also slow down clocks. You would think that, for a really reliable disk array, you would want a less-than-state-of-the-art system running conservatively. I guess this is a case where a great deal of practical and experimental experience is the best recipe for success, and perhaps this is where SAN manufacturers shine.
Re:Specifics please. (Score:3, Interesting)
While I agree with your "Solly Cholly" comment, I've steered clear of the SAN/NAS solution for one simple reason: it's a single point. Maybe I'm just paranoid or unreasonable, but having ANY single point of failure creeps me out, even if it has excellent numbers. Any SINGLE point will have the following problems:
1) Uptime: if it goes down, so does everything else. Is there more to add here?
2) Scalability: Any single point limits the performance of the system at large. Sure, your SAN/NAS might be mighty quick, and mighty fast. But given its current workload, can you sustain 100x growth annually for a few years without re-architecting your entire information infrastructure?
3) Price: It costs lots of cash upfront. Yes, expensive. But it gets even more expensive as your system grows beyond the performance of a single NAS.
4) Locality: Less of an issue than it used to be, but still an issue. If your IT infrastructure depends on the existence of a single-point SAN/NAS, your ability to spread to other geographical regions is limited. Sure, there's the WAN. But WANs are slow, unreliable, riddled with hiccups, and anybody who has to move very much data becomes acutely aware of these inconveniences!
I choose to be expensive in another way: intelligent system design. A well-architected software stack should immediately be able to scale up or down to virtually any size, from a $100 second-hand P3 to a 100-node cluster of servers, and handle it with minimal fuss. Thus, I've designed my software to scale up linearly. We started our hosted web-based product on a single P4 1U server; it now runs on a 4-node, 16-core Opteron cluster.
Maybe you're not expecting to handle 100x growth annually. But I AM anticipating this kind of growth. Growth over the past 3-4 years has been between 40% and 100% annually, and there's every reason to believe that the next year will see a severe hockey-stick as a few potentially landmark-scale agreements have just been made.
We're ready to grow FAST to basically any size, modeling after the
Re:ZFS is great, but... (Score:2, Interesting)
Hence my note that the single-user/reboot process isn't *really* required, but if you look at Sun's patch documentation that's what it almost invariably says.
I don't want to sound critical of ZFS or Solaris - that's not my intent. Solaris is a wonderful operating environment, and ZFS is an incredibly powerful filesystem and volume manager. However, the fact of the matter is that implementing and administering an appliance-based solution is almost invariably going to be simpler and better supported, especially since ZFS is a relatively immature technology - there are some things you just flat out can't do yet, and some things you can't do as easily.
If the question is "Does ZFS Obsolete Expensive NAS/SANs?" the answer is clearly "not yet." At some point that may be a more difficult question to answer - ZFS is amazingly powerful despite being very young, and along with Solaris itself it continues to rapidly improve in important ways.
Re:Specifics please. (Score:3, Interesting)
Wow, talk about not understanding what a SAN actually is. A $4700 unit is no comparison to what you get for $50k from an HP or IBM SAN solution. The biggest factor: I started with 30TB, now I need 100TB! Oh noes, I need to buy more units and create more volumes. With a SAN you fill up your fabric or iSCSI switch with more storage racks, allowing you to mix and match SATA, SCSI, FCA or anything else you like. Then replicate the whole deal off-site at the hardware level.
Now, mirrored OS drives? Are you mad? You've got a SAN; the suckers will boot straight from the SAN, meaning you'll never have to worry about a RAID array failure or hard drive failure causing downtime. This also means there is no need for hard drives in your blade, which reduces the internal temperature under full load. Of course, since the OS drive is connected through a fibre-channel SAN, it is a hell of a lot faster than two mirrored drives, which means faster and more reliable servers and increased uptime. A motherboard fries? No problem: either remap the LUN to another server with identical hardware (which could be configured automatically) or just change which fabric port the server is plugged into.
There are hundreds more advantages here; ZFS does address a lot of them, but it is not at that level and not near the performance. How about automatic archiving of files that haven't been accessed in, say, six months? Either archive to near-line storage or to tape, or do a tiered approach and archive to near-line after six months and tape after a year. The management is there for SANs and is very mature. It's just getting started for ZFS. ZFS is turning out to be a great low-cost alternative, meaning fewer risks starting out, but once you're into SAN country and not just DASD land, the rules are different.
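The archiving policy described above is simple enough to sketch. This toy Python version handles only the first tier (live to near-line), with invented paths and thresholds; real SAN lifecycle-management software does far more:

    import os, shutil, time

    LIVE, NEARLINE = "/tank/projects", "/nearline/archive"   # hypothetical tiers
    SIX_MONTHS = 182 * 24 * 3600
    now = time.time()

    for root, _dirs, files in os.walk(LIVE):
        for name in files:
            src = os.path.join(root, name)
            if now - os.stat(src).st_atime > SIX_MONTHS:     # untouched for ~6 months
                dst = os.path.join(NEARLINE, os.path.relpath(src, LIVE))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)   # a tape tier would sweep NEARLINE later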
Re:ZFS (Score:3, Interesting)
FUD.
Interesting times. These days, we not only get the typical Redmond FUD, but FUD from the Linux people.
Available on OpenSolaris and FreeBSD (and being ported at least to NetBSD, AFAIK). Those are free software operating systems.
Your problem is that not everything fits in your little GNU/Linux box.
The best laid plans of mice and men... (Score:3, Interesting)
True story:
Two years ago next month, a clumsy plumber got a propane torch too close to a sprinkler head, with the expected consequences: LOTS of water took the path of least resistance, until it finally filtered its way into the basement data center, coming out right on top of our SAN.
Obviously, it didn't survive.
Re:Have you tried out Starfish? (Score:3, Interesting)
Yes, at the present speed of development and testing, mirroring should be available by the end of August 2007.
Yes, absolutely. Keep in mind that preference will be given to the 500GB storage node until it fills up to around 420GB. Starfish tries to load-balance storage across all nodes, thus it always picks the nodes with the most free space remaining.
There will be different storage node selection strategies in the future. Currently there is round-robin (which load-balances based on number of files per storage node) and largest-free-storage (which load-balances files to the storage node with the largest amount of available storage).
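For readers who haven't seen Starfish, the two strategies boil down to something like the following guess at their shape; the class and function names here are invented for illustration, not Starfish's actual API:

    import itertools

    class Node:
        def __init__(self, name, free_bytes):
            self.name, self.free = name, free_bytes

    def largest_free_storage(nodes):
        # Place the next file on the node with the most free space remaining.
        return max(nodes, key=lambda n: n.free)

    nodes = [Node("a", 500e9), Node("b", 120e9)]
    rr = itertools.cycle(nodes)              # round-robin: one file per node in turn
    print(largest_free_storage(nodes).name)  # "a", until it drains down to node b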
Re:Its just not the same thing. (Score:3, Interesting)
Hundreds of gigabytes per second is limited to the very largest GPFS or Lustre configurations. The fastest filesystem throughput I'm aware of, anywhere in the world, is about 100 GB/second at Sandia National Labs. That requires dozens of high-end RAID cabinets.
Re:ZFS (Score:4, Interesting)
I did the math. That would handle 3.4x10^38 Bytes, or 340 trillion YottaBytes (1 YottaByte = 1 billion PetaBytes, 1 PetaByte = 1 million GigaBytes). That's a very large number of Bytes, but I still wouldn't use the word never. I usually even try to avoid the phrase "never in my lifetime", but in this case that's probably a safe bet.
Note: I'm using the hard drive manufacturer's definition of *bytes here.
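The arithmetic is easy to check in Python:

    total_bytes = 2 ** 128                     # one byte per 128-bit address
    print(f"{total_bytes:.4e} bytes")          # 3.4028e+38
    print(f"{total_bytes / 10**24:.4e} YB")    # 3.4028e+14, i.e. ~340 trillion yottabytes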
Re:ZFS (Score:2, Interesting)
640K is enough for anyone!
Our government will find a way.
Re:ZFS (Score:5, Interesting)
Mass of the earth = 5.9742 × 10^27 grams
Make the drives out of the earth, and you need a drive density of 57GB/gram.
A drive with a density of 1 bit per carbon atom would mass 5.4 × 10^10 metric tons.
Size of said nanotech drive: a cube 2.88 km tall (at the standard density of carbon).
Never in your lifetime is a really safe bet.
T
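The parent's figures check out, assuming decimal bytes, carbon at 12.011 g/mol, and graphite's density of ~2.26 g/cm^3:

    AVOGADRO = 6.022e23
    bits = 2 ** 128 * 8                     # a full 128-bit pool, in bits
    grams = bits / (AVOGADRO / 12.011)      # one bit per carbon atom
    print(grams / 1e6)                      # ~5.4e10 metric tons
    side_cm = (grams / 2.26) ** (1 / 3)     # volume as a cube of graphite
    print(side_cm / 1e5)                    # ~2.88 km tall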
Re: RAID controller failure (Score:3, Interesting)
The more important question may be whether there are other, unknown sources of error. Not many people would even think of hard drive controller card failure when building a storage solution, and that's probably one of the major advantages of the more expensive vendors: they have the experience to know many more error sources. I'm not sure which weighs heavier: the enormous difference in price, or the chance to avoid a source of major errors that one would never have thought of on one's own (even if there's a 1 ppm risk of it happening)?
Re:RAID controller failure (Score:4, Interesting)
Look at the high-end Sun Fire X4500 server ("Thumper") they released a few months ago: 48 drives in 4U with *no* hardware RAID controller. Sun designed this server as the perfect machine to run ZFS.
Hitachi _can_ go down! (Score:3, Interesting)
However, 2-3 years ago we stumbled (very painfully!) across a firmware bug which took the primary Hitachi array down:
As we (i.e. the Hitachi service reps) were upgrading the mirrored cache, an error hit the active half, and it turned out that the firmware would always check the mirror (a very good idea, right?) before falling back on re-reading the disk(s). However, the firmware error handler, which could have handled an error on the mirror copy as well (as long as the data wasn't dirty, of course), did not know how to handle a _missing_ copy; instead it blew away the entire array while crashing.
It took us three days to get everything back up, even though most of the critical systems were running off of the WAN backup copy after 2-3 hours.
Terje
PS. That particular firmware bug has of course been extinguished, but there are bound to be more lurking around. Getting totally non-stop operation is a _hard_ problem!