Data Storage Hardware Hacking

Building a Massive Single Volume Storage Solution?

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE, and plain clustered/grid computers with JBOD (just a bunch of disks). Personally, I'm more inclined toward a grid cluster with a gigabit interface, where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products, such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical. I would like to know if any Slashdot readers have experience building out such a solution. Any help/ideas would be greatly appreciated!"
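
    As a rough sanity check on the grid idea (a minimal sketch; the 1-2TB per-node figures are the submitter's, everything else is plain arithmetic and ignores redundancy):

        # Back-of-the-envelope node counts for a grid where each node
        # contributes 1-2 TB of usable space, per the submission.
        import math

        TB_PER_PB = 1024

        def nodes_needed(capacity_tb, per_node_tb):
            """How many nodes are needed to reach a given usable capacity."""
            return math.ceil(capacity_tb / per_node_tb)

        for target_tb in (25, TB_PER_PB):        # 25 TB initial, 1 PB eventual
            for per_node in (1, 2):
                print(f"{target_tb:>5} TB at {per_node} TB/node -> {nodes_needed(target_tb, per_node)} nodes")

    At the petabyte end that is 512-1024 chassis before any redundancy is added, which is worth keeping in mind when reading the power and cost comments below.
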
This discussion has been archived. No new comments can be posted.

  • Petabox (Score:2, Insightful)

    by russ_allegro ( 444120 ) on Tuesday October 25, 2005 @03:28PM (#13874240) Homepage
    archive.org made a petabox

    http://www.archive.org/web/petabox.php [archive.org]

    There is now a company that seems to make the same design:

    http://www.capricorn-tech.com/products.html [capricorn-tech.com]

    I don't know what FS they use, but apparently it is redundant.

  • Why? (Score:2, Insightful)

    by Anonymous Coward on Tuesday October 25, 2005 @03:29PM (#13874251)
    What are you doing on a limited budget trying to build a 1PB solution? And why are you on a budget?

    Just because you are starting at 25TB doesn't mean you aren't building a 1PB solution.

    You also need to figure out what kind of bandwidth you need. It's very seldom that people have 1PB of data that is accessed by one person occasionally. If some sort of USB or 1394 connection will work, you are much better off than requiring InfiniBand.

    Like many "ask Slashdot" questions this is the last place you should be looking for help...
  • by gstoddart ( 321705 ) on Tuesday October 25, 2005 @03:38PM (#13874360) Homepage
    I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB ... Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small.

    Unfortunately, I should think needing a solution which can scale up to a Petabyte (!) of disk-space and a "fairly small" budget are at odds with one another.

    Maybe you need to make a stronger case to someone that if such a mammoth storage system is required, it needs to be a higher priority item with better funding?

    Heck, the loss of such large volumes of data would be devastating (I assume it's not your pr0n collection) to any organization. Building it on the cheap and having no backup (*)/redundancy systems would be just waiting to lose the whole thing.

    (*) I truly have no idea how one backs up a petabyte
  • IBRIX (Score:1, Insightful)

    by Anonymous Coward on Tuesday October 25, 2005 @03:39PM (#13874374)
    Check out the IBRIX Clustered Filesystem. http://www.ibrix.com/ [ibrix.com]
  • For the most part (Score:5, Insightful)

    by retinaburn ( 218226 ) on Tuesday October 25, 2005 @03:39PM (#13874376)
    the reason you can't find a cheap way to do this is because it just isn't cheap.

    I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks, you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day and not require human intervention to stay up and running (unless you can get humans for cheap too).
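
    To put a rough number on that (a sketch only; the MTBF figure is an assumed nominal value, not something the poster gave, and independent failures are assumed):

        # Expected drive failures per year for a large pool of disks, assuming
        # a nominal MTBF of 500,000 hours -- optimistic for cheap consumer
        # drives in a warm, vibrating rack.
        HOURS_PER_YEAR = 24 * 365

        def expected_failures_per_year(num_drives, mtbf_hours=500_000):
            return num_drives * HOURS_PER_YEAR / mtbf_hours

        for n in (8, 100, 1000):
            print(f"{n:>4} drives -> ~{expected_failures_per_year(n):.1f} failures/year")

    At a thousand drives that averages out to a failed disk every few weeks, which is exactly why the system has to tolerate failures without hand-holding.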

  • Do It Right (Score:5, Insightful)

    by moehoward ( 668736 ) on Tuesday October 25, 2005 @03:41PM (#13874402)

    Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are from people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

    Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

    Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.
  • by OrangeSpyderMan ( 589635 ) on Tuesday October 25, 2005 @03:42PM (#13874407)
    Agreed. We have around 50 TByte of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400), which are DAS, around 10% fail on datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares plus tolerance :-) Last time, one server didn't make it back up because of this... though strictly speaking it would appear it was actually the PSUs that let go.
  • by Kadin2048 ( 468275 ) <slashdot.kadin@xox y . net> on Tuesday October 25, 2005 @03:48PM (#13874473) Homepage Journal
    Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

    I know people get tired of hearing "call IBM" as a solution to these questions, but in general, if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

    It's not even a question of whether you could do it in-house or not; given enough resources, you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.
  • Re:Why? (Score:2, Insightful)

    by temojen ( 678985 ) on Tuesday October 25, 2005 @03:50PM (#13874498) Journal
    Unless you are the mint, every budget is limited.
  • Re:Petabox (Score:5, Insightful)

    by afidel ( 530433 ) on Tuesday October 25, 2005 @03:56PM (#13874563)
    This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage! When you combine the cooling for that with the cost of electricity you are talking some serious money. If you have trouble getting the capital funds for something like this how are you ever going to pay the operating costs?
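
    For a sense of scale (a sketch; only the 50kW figure comes from the comment above, while the electricity price and cooling overhead are assumptions that vary a lot by site):

        # Rough annual electricity cost for a 50 kW storage farm plus cooling.
        IT_LOAD_KW = 50.0
        COOLING_FACTOR = 1.5          # assumed total draw per watt of IT load
        PRICE_PER_KWH = 0.10          # assumed $/kWh
        HOURS_PER_YEAR = 24 * 365

        total_kw = IT_LOAD_KW * COOLING_FACTOR
        annual_cost = total_kw * HOURS_PER_YEAR * PRICE_PER_KWH
        print(f"Total draw ~{total_kw:.0f} kW, annual electricity ~${annual_cost:,.0f}")

    Even with these fairly gentle assumptions, the power bill alone runs to tens of thousands of dollars a year.
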
  • No Redundancy? (Score:5, Insightful)

    by Giggles Of Doom ( 267141 ) <michael&redlightning,net> on Tuesday October 25, 2005 @03:57PM (#13874567) Homepage
    A PETABYTE without redundancy? I can't imagine having that much data I didn't care about.
  • Another, There Is. (Score:2, Insightful)

    by LifesABeach ( 234436 ) on Tuesday October 25, 2005 @04:09PM (#13874721) Homepage
    If designing for speed, NOT cost:
        given 2PB = 1 Human Brain, non interlaced
        1024 TB == 1 PB
        1 TB == 1 PC Computer with 1200GB H/D, 2Gig RAM, Networking

    If designing for cost, NOT speed:
        1 DVD = 4.5GB
        1 PB = 1024 TB = 1,048,576 GB
        1 PC Computer, with a DVD drive like the one mentioned above.
        1 Robotic CNC Arm, with DVD Gripper(tm)
        1 Very Huge Wire Cage to hold DVDs like a jukebox.
        (This has been done before, but with tapes)
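
    Running the disc math from the list above (a sketch; the 4.5GB-per-DVD figure is the poster's, and parity copies and jukebox mechanics are ignored):

        # How many DVDs it takes to hold one petabyte at 4.5 GB per disc.
        import math

        GB_PER_PB = 1024 * 1024       # 1 PB = 1,048,576 GB, as in the post above
        GB_PER_DVD = 4.5

        print(f"{math.ceil(GB_PER_PB / GB_PER_DVD):,} discs")   # roughly 233,000 DVDs
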
  • by composer777 ( 175489 ) * on Tuesday October 25, 2005 @04:16PM (#13874809)
    I'm just a Linux hobbyist and programmer, so take any advice I give with a grain of salt, but here's what I did for my setup at home. To start, you're looking at a little over $1,000 per TB, and that's about as cheap as it gets with redundancy. I have 8 drives in one machine, it's in a RAID 5 config, and I have a hot spare. However, if I were doing this for a mission-critical application, I would have it in a RAID 6 configuration with a hot spare and buy a hot-swap cage, which would further add to the costs. Then I would simply export the RAID volume using iSCSI, and see if there is a way to RAID all of the iSCSI volumes together using a master server. I imagine that if you do it right, you could scale such a system up to a fairly large number of machines. You would probably want something faster than Gigabit Ethernet, probably 10 Gigabit, connecting everything together; otherwise things could get a bit congested at the head node.

    Where all this could get terribly expensive is in power requirements; it requires less power to run a cage of hard drives than it does to run a network of PCs. I'd imagine that any money you save on hardware, you would spend on your power bill. Either way, you're looking at, bare minimum, about $30K to start for 25TB, and I would add another $10K of padding just to be safe, to pay for stuff like a UPS (which you want), a high-end switch (which you'll also need), cabling, etc. In other words, it's not cheap, and like the parent just said, it will probably be cheaper in the long run to have someone like IBM do it for you. Do you really want to be responsible for 25-1,000 TB of data?
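
    The parent's $1,000/TB and $30K figures can be sanity-checked with a small estimator (a sketch; the fixed overhead for UPS, switch and cabling is an assumed round number):

        # Very rough build-cost estimate for a DIY iSCSI/RAID storage grid.
        def estimate_cost(usable_tb, cost_per_tb=1000, fixed_overhead=10_000):
            """Ballpark dollars for a given amount of usable space."""
            return usable_tb * cost_per_tb + fixed_overhead

        for tb in (25, 100, 1024):
            print(f"{tb:>5} TB -> ~${estimate_cost(tb):,}")

    Scaled to a petabyte, even the optimistic DIY number lands around a million dollars before power, cooling and staff time.
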
  • Re:Scale (Score:4, Insightful)

    by Wesley Felter ( 138342 ) <wesley@felter.org> on Tuesday October 25, 2005 @04:19PM (#13874838) Homepage
    Why do people keep talking about GoogleFS, given that it doesn't exist outside Google?
  • Re:Do It Right (Score:2, Insightful)

    by stanmann ( 602645 ) on Tuesday October 25, 2005 @04:20PM (#13874852) Journal
    Yes, there are lots of things that can be done by an open source team on the cheap... Massive hardware components aren't currently one of them. And aren't likely to be in the future.
  • by @madeus ( 24818 ) <slashdot_24818@mac.com> on Tuesday October 25, 2005 @04:33PM (#13874979)
    I appreciate this might not seem like helpful advice, but...

    If you've been asked to do something like this by a company that can afford to buy one of the commercial off-the-shelf high-volume storage solutions, then I honestly can't imagine any solution they try to knock up themselves will actually work (as I'm not aware of any free software solution that's currently up to the task).

    If your company doesn't have / can't raise the capital to buy a commercial system for a project of this scale, I can't possibly see how they could afford to screw up on this and go with an untested idea that could very well end up being a huge money sink they wouldn't be able to dig themselves out of - one that could doom the entire company and all its investors, given the cost it could run to.

    And of course, for such a big project, they should hire people who would already know how to do something like this (which is not a dig; it's just crazy to skimp on staff when you have an ambitious project that requires large amounts of capital investment).

    That said...

    If I were going to do large-scale storage on the cheap, then depending on the design of the software and the specific requirements (particularly if I was also developing the software we were going to use, or was able to set feature requirements and/or make the modifications myself), I would build the largest standard file shares I could with SATA disks (using commodity hardware, hot swappable, running Linux, with front-loading drive bays).

    The specifics of handling the load balancing (via multiple front ends, multiple mount points, pre-determined hashing to balance things out, proxies/caches, hooks in the file system calls, hooks in the application to talk to a controller, etc.) depend entirely on the sort of application, however.

    It's definitely likely to be far easier (and more cost effective) to have the software take care of knowing where the data is stored, rather than trying to build a single really large file share. I know at least one very well-known large company that's gone down this route (with essentially elaborately hacked-up versions of common OS software).

    The downside is you have to support whatever hack you come up with to do this, but that shouldn't be an enormous amount of work (and you can probably afford to hire someone to support it full time for significantly less than the cost of a support contract for a commercial solution).
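
    As an illustration of the "software knows where the data is stored" approach described above (a minimal sketch; the node names and the plain-modulo hashing are invented for the example, not something the poster specified):

        # Hash-based placement: the application derives which file share holds
        # a file from its path instead of relying on one giant shared volume.
        import hashlib

        NODES = [f"storage{i:02d}.example.com" for i in range(16)]   # hypothetical hosts

        def node_for(path):
            """Deterministically pick a storage node for a given file path."""
            digest = hashlib.md5(path.encode("utf-8")).hexdigest()
            return NODES[int(digest, 16) % len(NODES)]

        print(node_for("/video/2005/10/25/cam01.mpg"))

    The obvious catch is that adding or removing nodes reshuffles almost everything with a plain modulo, which is where consistent hashing, or a lookup database as suggested elsewhere in this thread, comes in.
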
  • by fm6 ( 162816 ) on Tuesday October 25, 2005 @04:34PM (#13874992) Homepage Journal
    If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
    Not a sound assumption. Things don't fail uniformly over time. Suppose 70 babies are born with a life expectancy of 70 years. Is one of them guaranteed to die every year for the next 70 years? Obviously not. If they avoid some joint disaster (like they all take a trip on the Titanic), most of them will die within a decade or so of the 70-year mark.

    Same with disk drives — most failures will be clustered around the 57-year mark. Not that your attitude towards redundancy is wrong. Just as people sometimes die in infancy, some disk drives break down quickly. So there's a chance that you'll lose some drives from your thousand-disk system in the first year.

    How big a chance? To answer that question, you need more statistics about drive failure — and a much better grasp of probability theory.
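
    The "how big a chance" question has a simple first-order answer if you accept a fixed annualized failure rate and independent drives (both assumptions are shaky for real hardware, as the parent points out; the rates below are illustrative):

        # Chance of at least one drive failing in the first year, given an
        # assumed annualized failure rate (AFR) per drive.
        def p_at_least_one_failure(num_drives, afr):
            """P(at least one failure) = 1 - P(no failures at all)."""
            return 1.0 - (1.0 - afr) ** num_drives

        for afr in (0.01, 0.03):
            for n in (8, 100, 1000):
                print(f"AFR {afr:.0%}, {n:>4} drives -> {p_at_least_one_failure(n, afr):.1%}")

    With a thousand drives, some first-year failures are close to a certainty under any plausible rate, so the redundancy question is not really optional.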

  • by aminorex ( 141494 ) on Tuesday October 25, 2005 @04:44PM (#13875102) Homepage Journal
    What once required talent and brilliance today only requires reading a how-to file, configuring, and rebooting.

    EMC is obsolete. Their customers just haven't discovered it yet.
  • by sirket ( 60694 ) on Tuesday October 25, 2005 @05:01PM (#13875307)
    Stop what you are doing right now. If your architecture requires you to have one huge volume, then you have architected things wrong. Imagine trying to fsck this damned thing! And what about file system corruption? What the hell are you going to do when you lose a petabyte of data because of some file system corruption? Small, sensible, easily managed partitions are the way to go. Use a database to organize where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this.

    Having said all this: if you are still intent on finding a good file system, then use AFS. It's probably your best free solution. If you want to sleep at night, call EMC.

    -sirket
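
    A minimal sketch of the "use a database to organize where given files are stored" idea from the comment above (the schema and names are invented for the example):

        # Tiny file-location catalog: a table maps each logical file to the
        # smaller volume that actually holds it, so no single giant volume exists.
        import sqlite3

        db = sqlite3.connect("catalog.db")
        db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, volume TEXT NOT NULL)")

        def register(path, volume):
            db.execute("INSERT OR REPLACE INTO files (path, volume) VALUES (?, ?)", (path, volume))
            db.commit()

        def locate(path):
            row = db.execute("SELECT volume FROM files WHERE path = ?", (path,)).fetchone()
            return row[0] if row else None

        register("/archive/2005/10/tape0042.dump", "vol07")
        print(locate("/archive/2005/10/tape0042.dump"))   # -> vol07
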
  • Re:GPFS from IBM (Score:3, Insightful)

    by Obasan ( 28761 ) on Tuesday October 25, 2005 @05:23PM (#13875574)
    Having implemented GPFS, I feel qualified to say it kicks butt. As the poster mentions, it's not cheap, but if you want reliability and support it may be well worth it. That's where you need to decide the level of risk you are willing to expose your data to. One limitation of GPFS is that it does (or did, last I looked) only run on IBM hardware, either pSeries or xSeries with FAStT Fibre Channel at the back end.

    As for GFS, definitely give it a thorough shakedown before you decide to implement it; I've heard some horror stories.
  • Get out now!! (Score:2, Insightful)

    by egriebel ( 177065 ) * <edgriebel AT gmail DOT com> on Tuesday October 25, 2005 @05:36PM (#13875736) Journal
    Really, go now before your company's stinginess brings you down too.

    There's a reason why terabyte storage arrays for commercial applications cost a lot of money, and why consulting services from IBM, EMC, Hitachi, etc. have huge per-hour costs. If you/your management can't see that, you really have no business being there. Sure, anyone can throw a JBOD RAID together for a thousand bucks, but I wouldn't trust anything more important than MP3s to it.

  • by buss_error ( 142273 ) on Tuesday October 25, 2005 @05:48PM (#13875867) Homepage Journal
    Sounds like the PHBs have been at this. First, *why* does it have to be a single file system? With Oracle, MySQL, and MS-SQL you can do partitioning, if your need is databases. If your need is really a monolithic file, then I'll bet that the single file size won't be multi-hundreds of gigs.

    In short, your stated objective smells. Not enough data.

    WHAT is going to be done (database, file storage?)

    HOW will it be accessed? (One large file, many smaller files)

    WHEN will it be accessed? (During business hours, distributed over the day?)

    AVERAGE TRANSFERS - will the whole schmear come over, selected parts?

    SECURITY a concern? (Sensitive data, protected network)

    BACKUP - a petabyte of tape storage is expensive, and takes quite a while to do.

    POWER - do you have enough?

    COOLING - ditto

    SPACE - ditto - my $DAYJOB computer room is about 3000 sq ft... and we're going to be using all of it within 12 months.

    That said, if you go with big drives spread over a lot of systems, use lots of NICs to keep the NIC from being the bottleneck. A single gigabit connection sounds fine, but wait until you have hundreds of people going for files at once. It'll get swamped. And swear off VSAN from Cisco. Not worth it at all.
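
    To see why a single gigabit link gets swamped (a sketch; the per-client rate is an assumed example and protocol overhead is ignored, so reality is worse):

        # How many concurrent clients saturate one link at a given per-client rate.
        LINK_MBPS = 1000                  # one Gigabit Ethernet port
        PER_CLIENT_MBPS = 10 * 8          # clients each pulling 10 MB/s

        print(f"~{LINK_MBPS / PER_CLIENT_MBPS:.0f} clients saturate the link")   # about a dozen

    A dozen busy clients is enough to flatten one NIC, which is the point of bonding several of them or putting multiple front ends in front of the data.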

  • Re:Oracle, also (Score:3, Insightful)

    by Spudley ( 171066 ) on Tuesday October 25, 2005 @05:52PM (#13875917) Homepage Journal
    It sounds like a seriously ambitious project to approach...

    I second that.

    Starting at 25TB and scaling to 1PB? And you want it cheap? If it was cheap to do that sort of thing, we'd all be lining up to get one of our own(*).

    Seriously, though, you don't really specify how cheap you are expecting to get it for. What are your expectations, and just how far over-budget are the options you've looked at already? Do you really need 25TB/1PB in one volume, or could it be achieved by splitting it into smaller chunks and working out some sort of load-sharing system?

    And in any case, what on Earth kind of data do they anticipate will take a petabyte of contiguous storage????

    [(*) Yes, I'm aware that in X years, someone's going to be looking back at this in the /. archive, laughing about how tiny our disk storage space was back in 2005]
  • Re:Petabox (Score:4, Insightful)

    by Databass ( 254179 ) on Tuesday October 25, 2005 @06:22PM (#13876196)
    This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage!

    Interesting to think about. My brain probably holds about a petabyte of memories and it uses 20-60 watts. Mostly from sugar.
  • Re:Oracle, also (Score:3, Insightful)

    by Catbeller ( 118204 ) on Tuesday October 25, 2005 @06:44PM (#13876359) Homepage
    And you don't have an answer to the question.

    If you don't want to participate, don't. Stop stuffing the threads with posts about how lame everyone's questions, knowledge and motivations are.

    I'm actually interested in what people have thought about this very topic, AND I'm not a petabyte database expert. So it's news to me. And probably is to you as well.
  • Re:Apple Xserve? (Score:3, Insightful)

    by Anonymous Coward on Tuesday October 25, 2005 @06:48PM (#13876394)
    "This product is tangentially related to a product which, five years ago, I had unspecified bad experiences with. Ergo, this product sucks."

    Only on fucking Slashdot.
  • Re:Controllers! (Score:4, Insightful)

    by sr180 ( 700526 ) on Tuesday October 25, 2005 @09:14PM (#13877309) Journal
    The CPU might be able to handle this load easily, but my question is will the bus (PCI or otherwise) be able to handle this load?
  • by Anonymous Coward on Wednesday October 26, 2005 @02:45AM (#13878637)
    Your ideas are wrong. First off, if you have redundancy set up correctly for your arrays, drive failure will not be an issue. You just replace drives as they fail and let the array rebuild. Hell, set up hot spares, so the array rebuilds automatically when there is a failure. Then you just replace the bad drive at your leisure and set it up as a new hot spare.

    Secondly, you generally can't mix drive types, as they tend not to be exactly the same size. This will really mess up any attempts to rebuild a failed drive, or redundancy in general. Additionally, most "hot-swap" array solutions require drives of a specific mounting type and form-factor, which is going to throw that idea out the window.
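
    A small sketch of the capacity arithmetic behind both points (drive counts and sizes are examples; this mirrors the usual RAID behavior of sizing every member to match the smallest disk):

        # Usable space of a RAID 6 set, and the cost of mixing drive sizes:
        # every member is treated as if it were the smallest one, and two
        # members' worth of space goes to parity.
        def raid6_usable_gb(member_sizes_gb):
            return (len(member_sizes_gb) - 2) * min(member_sizes_gb)

        uniform = [400] * 8                 # eight matched 400 GB members (hot spare not counted)
        mixed   = [400] * 7 + [250]         # one smaller drive sneaks in
        print(raid6_usable_gb(uniform))     # 2400 GB usable
        print(raid6_usable_gb(mixed))       # 1500 GB usable -- the small drive caps every member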
