Data Storage | Hardware Hacking

Building a Massive Single Volume Storage Solution?

An anonymous reader asks: "I've been asked to build a massive storage solution that scales from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with a gigabit interface, where each node has about 1-2TB of disk space and is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that can span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products, such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical. I would like to know if any Slashdot readers have experience building out such a solution. Any help/ideas would be greatly appreciated!"
This discussion has been archived. No new comments can be posted.

  • GFS? (Score:5, Informative)

    by fifirebel ( 137361 ) on Tuesday October 25, 2005 @03:24PM (#13874189)
    Have you checked out GFS [redhat.com] from RedHat (formerly Sistina)?
  • Apple Xserve? (Score:3, Informative)

    by mozumder ( 178398 ) on Tuesday October 25, 2005 @03:24PM (#13874198)
    Can't you hook up 4x 7TB Xserve RAIDs to a PowerMac and use that?
  • Andrew File System (Score:5, Informative)

    by mroch ( 715318 ) * on Tuesday October 25, 2005 @03:25PM (#13874211)
    Check out AFS [openafs.org].
  • PetaBox (Score:4, Informative)

    by Anonymous Coward on Tuesday October 25, 2005 @03:26PM (#13874221)
    How about the PetaBox [capricorn-tech.com], used by the Internet Archive [archive.org]?
  • by mikeee ( 137160 ) on Tuesday October 25, 2005 @03:27PM (#13874224)
    Livejournal developed their own distributed filesystem:

    http://www.danga.com/mogilefs/ [danga.com]

    It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.
  • Re:Apple Xserve? (Score:4, Informative)

    by Jeff DeMaagd ( 2015 ) on Tuesday October 25, 2005 @03:29PM (#13874248) Homepage Journal
    Apple Xserve may be the cheapest of that kind of storage, but it probably doesn't fit the original idea of commodity hardware.

    Scaling to petabytes means spanning storage across multiple systems.
  • Re:GFS? (Score:3, Informative)

    by N1ck0 ( 803359 ) on Tuesday October 25, 2005 @03:32PM (#13874296)
    GFS over a FC SAN with some EMC CLARiiON CX700s as the hosts is the solution I'm looking at deploying next year, although there are still some thoughts on using iSCSI instead of FC. It all really depends on what your usage patterns and performance requirements are. I don't believe GFS supports ATAoE systems, but since there is Linux support I doubt it would be too far of a stretch.
  • by cheesedog ( 603990 ) on Tuesday October 25, 2005 @03:34PM (#13874314)
    One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

    Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means the average disk is expected to fail about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

    Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

    You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding lets you plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
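To make the parent's arithmetic concrete, here is a quick sketch of the back-of-the-envelope calculation, under the same (admittedly unrealistic) assumption that failures are spread evenly over time:

```shell
# Expected failure interval for a large disk population, assuming
# evenly spread failures: MTBF divided by number of disks.
mtbf_hours=500000
disks=1000
interval=$(awk -v m="$mtbf_hours" -v n="$disks" 'BEGIN { printf "%.0f", m / n }')
weeks=$(awk -v h="$interval" 'BEGIN { printf "%.1f", h / (24 * 7) }')
echo "one disk failure every ${interval} hours (~${weeks} weeks)"
```

In practice the bathtub-shaped failure curve makes the early (infant mortality) and late (wear-out) periods worse than this average suggests.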

  • by jcdick1 ( 254644 ) on Tuesday October 25, 2005 @03:34PM (#13874319)
    ...what your management was thinking. I can't imagine a storage requirement that large that you could build in a distributed model that would beat an EMC, Hitachi, IBM, or similar SAN solution on price per GB. The administration and DR costs alone for something like this would be astronomical. There just isn't a way to do something this big on the cheap. This is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.
  • by ggvaidya ( 747058 ) on Tuesday October 25, 2005 @03:39PM (#13874367) Homepage Journal
    A while ago [google.com]
  • IBRIX (Score:4, Informative)

    by Wells2k ( 107114 ) on Tuesday October 25, 2005 @03:42PM (#13874414)
    You may want to take a look at IBRIX [ibrix.com] systems. They do a pretty robust parallel file system that has redundancy and failover.
  • Er... be careful (Score:2, Informative)

    by LeonGeeste ( 917243 ) * on Tuesday October 25, 2005 @03:43PM (#13874422) Journal
    That violates their terms of use pretty severely. I don't know what they would do (Google's not the "suing-for-the-hell-of-it" type), but that wouldn't last very long when they found out. And they would find out. +5 Interesting? Well, curiosity killed the cat.
  • Re:PetaBox (Score:4, Informative)

    by MikeFM ( 12491 ) on Tuesday October 25, 2005 @03:52PM (#13874516) Homepage Journal
    I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations [buffalotech.co.uk], which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them, or just mount them normally, depending on your needs. $1-$2 per GB for external RAID storage isn't bad at all.
  • Re:Apple Xserve? (Score:4, Informative)

    by stang7423 ( 601640 ) on Tuesday October 25, 2005 @03:55PM (#13874541)
    Apple has a solution for this. Xsan [apple.com] is a distributed filesystem based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.
  • Re:IBRIX (Score:3, Informative)

    by Wells2k ( 107114 ) on Tuesday October 25, 2005 @03:57PM (#13874570)
    Something else I forgot about is the actual hardware... you may want to take a look at the nStor [nstor.com] products. Their hardware RAID systems are relatively economical, and you can go to fibrechannel drives with fibre connected boxes quite easily with their equipment.
  • Re:IBRIX (Score:1, Informative)

    by Anonymous Coward on Tuesday October 25, 2005 @04:02PM (#13874630)
    IBRIX is an excellent product, especially if used in conjunction with HPCC or a renderfarm that requires very low access times and high availability, and it works with storage systems that will scale up to petabytes.
  • by Alef ( 605149 ) on Tuesday October 25, 2005 @04:09PM (#13874732)
    If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

    For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve [wikipedia.org]. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.

    The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

    I think what you are referring to is how multiple observations of a uniformly distributed stochastic variable generally look. It doesn't have anything to do with fractals, though.

  • Re:For the most part (Score:3, Informative)

    by epiphani ( 254981 ) <epiphani&dal,net> on Tuesday October 25, 2005 @04:10PM (#13874738)
    It won't be cheap, but how about this idea. You'll get plenty of data redundancy out of it, though you may need to spend some extra bucks on stability and maintainability.

    3ware 12x S-ATA RAID5 card
    12x 300GB RAID5
    Linux machine
    iSCSI target software - share out 1 LUN.

    Duplicate this machine until you have enough storage.

    One big box with a number of trunked/bonded gigE ports
    iSCSI initiator software - mount all the LUNs.
    Software-RAID them together - striping if you aren't too worried, RAID5 if you are.

    Tada - big storage, one volume, all accessible from one machine.

    The maintenance will suck, though.
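A rough sketch of what the head box's side of this recipe might look like with open-iscsi and mdadm. Portal addresses, device names, and counts here are invented for illustration; this obviously needs root and the real hardware:

```shell
# Log in to the iSCSI target exported by each storage machine.
for portal in 192.168.1.11 192.168.1.12 192.168.1.13 192.168.1.14; do
    iscsiadm -m discovery -t sendtargets -p "$portal"
    iscsiadm -m node -p "$portal" --login
done

# Software-RAID the imported LUNs together: RAID0 (striping) if you
# aren't too worried, RAID5 if you are.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# One filesystem, one big volume.
mkfs.xfs /dev/md0
mount /dev/md0 /storage
```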
  • by Trepalium ( 109107 ) on Tuesday October 25, 2005 @04:11PM (#13874754)
    Transarc was acquired by IBM in 1998, and released OpenAFS [ibm.com] in 2000. This [ibm.com] used to be IBM's site for Transarc technologies, but it looks like it doesn't exist anymore, and instead just redirects to IBM's software page.
  • by Anonymous Coward on Tuesday October 25, 2005 @04:11PM (#13874756)
    Read the post that you're replying to more carefully next time.
  • iSCSI storage / san (Score:3, Informative)

    by pasikarkkainen ( 925729 ) on Tuesday October 25, 2005 @04:13PM (#13874776)
    There seem to be lots of SATA-RAID based iSCSI SAN devices available nowadays. Some links to products I have seen:

    http://www.equallogic.com/ [equallogic.com] They make nice SATA-RAID based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume expansion, hotswap, redundant controllers, redundant fans, etc).

    http://www.equallogic.com/pages/products_PS100E.htm [equallogic.com]
    14 250GB SATA disks, 3U, 3.5TB of raw storage.

    http://www.equallogic.com/pages/products_PS300E.htm [equallogic.com]
    14 500GB SATA disks, 3U, 7TB of raw storage.

    http://www.equallogic.com/pages/products_PS2400E.htm [equallogic.com]
    56+ TB

    Looks good. I have not yet used them myself :)

    Another iSCSI SATA SAN possibility:
    http://www.mpccorp.com/smallbiz/store/servers/product_detail/dataframe_420.html [mpccorp.com]
    16 SATA disks, review:
    http://www.infoworld.com/MPC_DataFrame_420/product_53700.html?view=1&curNodeId=0 [infoworld.com]

    This company also has SATA iSCSI SAN devices:
    http://www.dynamicnetworkfactory.com/products.asp/section/Product~Categories/category/iSCSI/options/IPBank/drivetype/L~Series/formfactor/Integrated/inface/SATA~-~Serial~ATA [dynamicnetworkfactory.com]

    iSCSI SAN comparison:
    http://www.networkcomputing.com/story/singlePageFormat.jhtml?articleID=170702726 [networkcomputing.com]

    There are also software iSCSI target solutions for use with your own/custom hardware:
    http://iscsitarget.sourceforge.net/ [sourceforge.net] for building a Linux-based iSCSI target/SAN.

    If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!
  • Re:Apple Xserve? (Score:4, Informative)

    by TRRosen ( 720617 ) on Tuesday October 25, 2005 @04:19PM (#13874845)
    To do this would cost around $50,000 with Xserve RAIDs and Xsan... $2,000/TB is probably the best price you're going to get. You could do this with generic hardware, but the cost of assembly, the extra room, extra power consumption, and the maintenance and engineering costs will certainly wipe out what you might save. The Xserve RAID solution could be up in a day and fit in one (actually 1/2) rack.

    I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to something like 12 drives per machine in homemade brackets, but it was hardly ideal. It did work, though. Anybody remember where that was?

  • by bernz ( 181095 ) on Tuesday October 25, 2005 @04:21PM (#13874861) Homepage
    We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size are key, performance less so.

    Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK-based (something we concocted based on what I read about the DNALounge a while back), which helps curb disk failures of the storage nodes themselves. We guard against disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.

    Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason): all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot spare.

    On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, we have a secondary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US), but that's worth it because they're neato guys.

    We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.

    When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.

    This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
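For readers unfamiliar with the pieces, a hedged sketch of the head-unit assembly described above, using nbd-client, mdadm, and LVM. Hostnames, ports, and device names are made up; the real build will differ:

```shell
# Import each storage node's RAID5 array as a network block device.
nbd-client node1 2000 /dev/nbd0
nbd-client node2 2000 /dev/nbd1
nbd-client node3 2000 /dev/nbd2
nbd-client node4 2000 /dev/nbd3   # hot-spare node

# Software RAID5 across the nodes, with the spare standing by.
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      --spare-devices=1 /dev/nbd0 /dev/nbd1 /dev/nbd2 /dev/nbd3

# When a second cluster is added later, LVM can join the arrays
# into one volume, as the parent describes.
pvcreate /dev/md0
vgcreate bigvg /dev/md0
lvcreate -l 100%FREE -n bigvol bigvg
mkfs.xfs /dev/bigvg/bigvol
```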

  • by SteveOU ( 541402 ) <sbishop20.cox@net> on Tuesday October 25, 2005 @04:22PM (#13874867)

    The filesystem's going to be the hardest component of this. I know of no open-source FS that could handle this. I'm assuming this is all online storage, and there is no desire to nearline it to tape. Ideally, you'd want something that could concatenate multiple LUNs (of RAIDed storage) without having to run through a volume manager. Nothing against volume managers, but it'd be another component to support. Looking at proprietary FSs, you've got CXFS from SGI [sgi.com], which could easily handle the PB requirement and plays nice on Linux. Sun's got QFS [sun.com], which would max out at 1PB and could do the volume management bit easily. Linux support was a little flaky last time I used it, but it's a free download and evaluation; you could go get it [sun.com] right now.

    IBM's SAN-FS would also meet the capacity needs and would have the advantage of providing nearline capability, if you're into that. Sun's SAM-FS is basically the QFS product with nearline-to-tape capability. Linux is only supported as a client OS there. Of course, if you buy the mantra that Solaris is 'open-source,' then that might not be an issue.

    As for hardware with any of the above solutions, you're going to be looking at using multiple RAIDing disk enclosures of some kind. At a budget, probably SATA disks talking to the controller, and iSCSI to the host. FibreChannel to the host would be a little more costly, but might be worth it since iSCSI is just getting mature enough to be usable in production.

  • by Farfromlosin ( 558901 ) on Tuesday October 25, 2005 @04:24PM (#13874883) Homepage Journal
    Capricorn Tech [capricorn-tech.com]? They power the Internet Archive. "Capricorn Technologies was founded in 2004 and provides petabyte-class storage solutions for organizations worldwide. Capricorn's PetaBox technology grew out of a search for high density, low cost, low power storage systems for the world's largest data collections. Capricorn Technologies is proud to be a leader in the next data storage revolution."
  • Re:GFS? (Score:3, Informative)

    by LnxAddct ( 679316 ) <sgk25@drexel.edu> on Tuesday October 25, 2005 @04:33PM (#13874985)
    I second the parent post. GFS is exactly what he wants. Although I've never used it in the 1 PB range, I can vouch for it working excellently with TBs.
    Regards,
    Steve
  • Why one volume? (Score:3, Informative)

    by photon317 ( 208409 ) on Tuesday October 25, 2005 @04:34PM (#13874995)

    What's making your question hard is the "make it like one volume" restriction. The problem is trivial otherwise. If I were you, I'd be asking whoever tasked you with this to *really* justify on a technical level why they need it to appear as a single volume, since that makes all the possible solutions slower, more costly, and more difficult to maintain.

    Chances are extremely high that what they really want is a "/bigfatfs" directory visible everywhere, in which they will store many discrete items in subdirectories by project or by dataset or by user. You should convince them to let you build it from commodity machines serving a few TB each, mounted as separate filesystems underneath that umbrella directory. Then your only challenges are coherent management of the namespace of mountpoints for consistency across the environment (for which there are longstanding tools, like autofs + ldap/nis/nis+), and administration/assignment of new space requests within your cluster (which could be scripted to automatically allocate from the least-used volume that can satisfy the request, where "least used" could mean space or activity hotness, based on the metrics you're logging).
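For the namespace-management piece, an autofs setup along these lines keeps "/bigfatfs" consistent everywhere. Server names, keys, and mount options here are purely illustrative:

```
# /etc/auto.master -- hand /bigfatfs to the map below
/bigfatfs  /etc/auto.bigfatfs

# /etc/auto.bigfatfs -- one key per project/dataset/user, each
# pointing at whichever commodity server holds that volume
projectA  -rw,hard,intr  server01:/export/projectA
projectB  -rw,hard,intr  server02:/export/projectB
alice     -rw,hard,intr  server03:/export/alice
```

The map itself can live in LDAP or NIS instead of a flat file, which is what makes the namespace consistent across the whole environment.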
  • by miles31337 ( 539573 ) on Tuesday October 25, 2005 @04:40PM (#13875051)
    No longer true: OpenAFS 1.3.x (soon to be 1.4) has support for larger files.
  • by smartsaga ( 804661 ) on Tuesday October 25, 2005 @04:50PM (#13875178)
    http://www.lefthandnetworks.com/ [lefthandnetworks.com] supports everything the person in the article is talking about. As you add more of these units, the volumes are spread over the units you add. This means that you can add storage as you go and still have redundancy. You can configure each individual unit to use RAID 0, 1, or 5, and still have a volume, or many, spread across multiple storage units that in turn hold parts of a whole volume or set of volumes. It's like having double mirroring: once within each individual storage unit (which has many IDE drives in RAID 1 or 5) and again at the storage unit level. Of course this assumes that you have at least two storage units. And, yes, this means that to have redundancy you have to add them in pairs (I think), and have some storage units in one physical location and their pairs in another location for disaster recovery (fire, earthquakes, you know things can happen).

    I have worked with these units and they kick ass. You can do snapshots of entire servers quickly, given that you have the right infrastructure, set thresholds for volumes that can be increased or reduced on the fly, brick-level restoration of files, etc. And of course, my respect goes to their engineers. I saw them working on one unit because we had a really bad power failure that killed one HD. Man, those guys know their stuff up and down, and I've never seen anybody type commands so complex and so freaking long at that speed! They fixed the damn thing and got 99.99999% back from limbo!

    I guess their storage boxes follow the model of LVM which is pretty cool and the storage boxes run Linux!!!

    Don't take my word for it, go to their website and take a look 'cause I tend to confuse people with my posts rather than pass info efficiently.

    Have a good one.
  • by Steve.Murray ( 578192 ) on Tuesday October 25, 2005 @04:54PM (#13875223)
    MatrixStore from Object Matrix http://www.object-matrix.com/ [object-matrix.com] uses commodity hardware and clusters it together to create a highly expandable, reliable and secure storage environment.
  • by Rhys ( 96510 ) on Tuesday October 25, 2005 @04:59PM (#13875282)
    I want to say it is 16 Tbyte offhand, but I'm not sure on that.

    Short research indicates this was a limitation in 10.3, but I haven't found anything confirming or denying that 10.4 still has it.

    Not that we've been looking into large amounts of Xsan storage here, but our requirements are a bit different. You can't hook >600 nodes up to the storage via fibre. Our problem is scaling out the NFS servers to be able to push all this data around.
  • Controllers! (Score:3, Informative)

    by man_of_mr_e ( 217855 ) on Tuesday October 25, 2005 @05:06PM (#13875373)
    You could get a bunch of Broadcom 8 port SATA controllers, which equals about 4TB per controller. 4 or 5 controllers = 16-20TB per box, then you can run the cables into an outside drive bay enclosure and one box can control 40 500GB hard drives.

    If you're not doing any processing on this, a good CPU should be able to handle the load.
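The per-box capacity arithmetic works out as follows (a trivial sketch using the controller and drive figures assumed above):

```shell
# 4-5 eight-port SATA controllers with 500GB drives per the comment.
controllers=5
ports_per_controller=8
drive_gb=500
drives=$((controllers * ports_per_controller))
capacity_tb=$((drives * drive_gb / 1000))
echo "${drives} drives = ${capacity_tb}TB per box"
```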
  • Re:For the most part (Score:3, Informative)

    by fool ( 8051 ) on Tuesday October 25, 2005 @05:17PM (#13875503) Homepage
    well, since all of the (high-end) PC's we were looking at for snort boxen had severe problems pushing even 5Gbit/s (not GByte) of traffic in/out over the PCI busses simultaneously, you hit a bottleneck pretty quickly there, even before you get to 25TB with your disk sizes. at 500GB disks you get pretty close, but you're at the ceiling already. while a decent (not even cutting-edge) machine could push a Gbit to the server pretty easily, the server, no matter how beefy, needs a ton of internal bandwidth to gather/process/serve the data timely-like. if he only needs 100mbit/s of data service then he's golden =)

    or did you mean to specify a GBit switch in between the clients/big box?

    also, agree with yours and others' proclamation that administration will not be trivial. be sure to spec at least 6 months of your time in writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship based on an automated mail you can send out about failed disks, rather than waiting on turnaround from you pulling the drive and the delivery making a round trip.
  • by bernz ( 181095 ) on Tuesday October 25, 2005 @05:27PM (#13875621) Homepage
    I'll put this out as a side point since I'm the OP: if we had to do more than 50TB, I think we'd go to a "real" solution like EMC or something like that. This has been very good for us, but given the need for that amount of storage, we'd also have the money to spend on a superduper storage machine. Homebrew has been wonderful to get to this point, but unless we get the kind of employees necessary to really write our own FS a la GoogleFS, I can't see us taking this solution much further past where it is now, only because I can't see myself putting THAT much scalable trust into something like NBD or software RAID5. At least not without really, really close inspection of the limitations of that code.
  • by Anonymous Coward on Tuesday October 25, 2005 @05:34PM (#13875706)
    ...what your management was thinking. I can't imagine a storage requirement that large that you could build in a distributed model that would beat an EMC, Hitachi, IBM, or similar SAN solution on price per GB. The administration and DR costs alone for something like this would be astronomical. There just isn't a way to do something this big on the cheap. This is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.
     
    Yes, this is what SANS were developed for. Let me clue you in on our SAN, and what it cost us in the long run:
     
    We bought a Xiotech Magnitude 1.4T SAN, with everything redundant, except the backplane (not an option in that line at the time). Well, what part dies? The backplane. Twice!
    So, after converting all of our critical servers over to this SAN, it takes down our entire datacenter. We paid tons of money for this POS too.
     
    Now, we're trying to sell it. We don't need it, have since moved to Infortrend for our storage requirements (way better product), and would like to get rid of it.
     
    We tried eBay. Every time a customer inquired about re-licensing the crappy software on it from Xiotech, whadda ya know? It costs more to relicense the software than to just buy a whole new unit from Xiotech. We call Xiotech and complain to high hell about this.
     
    We try to donate it to a local university... SAME DEAL! Xiotech tries to relicense the software, EVEN AFTER KNOWING IT'S A CHARITABLE DONATION, for more than a new unit!
     
    So, in our closet sits a @@#@!(%$ pile of @($*!@ called a Xiotech. I will speak badly about this company until the day I die. I will butt into other people's conversations that are unrelated to massive amounts of storage, and tell them to curse the name Xiotech and stay bloody clear of that company. I will sacrifice small animals in Xiotech's name, so hopefully something horribly evil will happen to that company, and the damn CIO that tells us to deal with it, and hangs up on us.
     
    My ire towards Xiotech is strong and will never die. I wish a worse fate for Xiotech than I do SCO.
  • by beldraen ( 94534 ) <chad...montplaisir@@@gmail...com> on Tuesday October 25, 2005 @05:54PM (#13875942)
    Just a comment about MTBF. It's often not understood, and that's one of my little pet peeves with tech producers, because they don't try to correct it. MTBF is a reliability rating tied to lasting the warranty period.

    Say you have a drive rated at 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices have a duty period. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before its warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.

    As a side note: in theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement, so as to schedule your downtime, as opposed to it all being unexpected.
  • by Anonymous Coward on Tuesday October 25, 2005 @06:00PM (#13876003)
    Have you ever seen a single drive failure cause a 30TB SAN to grind to a halt?

    Sure have, it happened when some dumbass striped their database across multiple hypers on the same physical spindle. There's a reason they give you those device layout maps, you should probably pay attention to them.
  • Re:Petabox (Score:2, Informative)

    by russ_allegro ( 444120 ) on Tuesday October 25, 2005 @06:38PM (#13876318) Homepage
    They claim ~40 watts per terabyte. That is pretty darn low; if you are going to try to come up with your own solution from off-the-shelf parts it'll be hard to match. If they can't pay for 40 watts per terabyte for a petabyte, maybe they should reconsider whether they need the petabyte for now.

    Let's say $0.07 per kWh.
    Then the 50kW, as you said, would be:
    50*24*31*$0.07 = $2,604/month

    So it isn't super cheap; guess that is why you don't hear about everyday people buying a petabyte of storage. I think if you try to save more on electricity (like coming up with some other device besides hard drives) you will end up paying a huge amount for whatever makes you save that electricity, beyond the electricity costs.
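The monthly figure above checks out; a one-liner version, keeping the rate in cents so the arithmetic stays in integers:

```shell
# Monthly electricity cost for 50kW at $0.07/kWh over a 31-day month.
kw=50
rate_cents=7                      # cents per kWh
hours=$((24 * 31))
monthly_dollars=$((kw * hours * rate_cents / 100))
echo "~\$${monthly_dollars}/month"
```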
  • PetaBox? (Score:3, Informative)

    by mr_zorg ( 259994 ) on Tuesday October 25, 2005 @06:46PM (#13876370)
    The PetaBox, as previously discussed on Slashdot [slashdot.org] sounds like just what you want...
  • Re:GPFS from IBM (Score:2, Informative)

    by icehawk55 ( 585876 ) on Tuesday October 25, 2005 @08:34PM (#13877095)

    I've implemented multiple GPFS file systems in the multi-terabyte range. It's a pretty robust file system. With full redundancy at the disk/controller/Brocade/server level per file system, I can still write more than 3 GB/s and read better than 3.5 GB/s. This was a design for redundancy, not performance.

    20+ terabytes of FAStT fibre-attached storage. After four "SURPRISE" power outages after Katrina, which caused the loss of 12 disks, I still did not lose a single byte of data for the customer. GPFS can be pretty robust if implemented correctly.

    I'd have no qualms about putting together a petabyte of gpfs file systems.

    Icehawk55
  • by yppiz ( 574466 ) on Tuesday October 25, 2005 @09:42PM (#13877432) Homepage
    You beat me to this link.

    I will add that the Archive has particular design and performance goals, namely:

    - keep the cost / GB as low as possible
    - keep cooling and power requirements low
    - use the filesystem and bundle objects into large chunks (~100MB ARC files, last I checked)
    - assume streaming writes affecting an edge of the system -- previously written data isn't modified
    - assume random reads
    - read latency is less important than cost / GB

    I worked on the Archive ~5 years ago, and these are based on my understanding of the Archive from that period, so some of these may have changed.

    But essentially, these are instantiated as: off-the-shelf SATA disks in fairly standard cases with either normal or special low-power motherboards, running a free OS (the Archive has used both Linux and FreeBSD), with off-the-shelf networking equipment.

    --Pat
  • by giberti ( 110903 ) on Tuesday October 25, 2005 @11:31PM (#13877912) Homepage
    File systems aside, you're talking about a whole lot of hardware here! Is it really necessary to have all this data online at the same time? Is it possible to store it in some other way (i.e. tapes)? That would probably be a whole lot cheaper!

    Well, let's see... using 300GB SCSI disks (assuming you can find RAID hardware to support enough disks) you can build out a 1PB storage system with about 3,334 disks. That would set you back about $3.3 million, assuming you paid retail prices for the disks, ~$1,000/disk at CDW today. Of course, if you ordered 3,000+ disks, I'm sure they would cut you a deal on the price.

    Any hope of daisy-chaining together a few dozen direct-attached storage devices to a NAS server? Something like a Dell PowerVault 220 with 14 300GB SCSI drives will set you back about $21K and give you 3.4TB / 3U of space (RAID-5), so you would have some safety net built in (albeit not much). Slap 10 of these on a PowerVault 6000 series and you should have a ballpark of 34TB (which, while shy of what you're looking for, gets you in the right direction). Total cost around $250K; do it four times and spread the work out over four logical volumes and you should get in the neighborhood of 1 petabyte. You could then set up a redundant server structure, and for $2 million you have a redundant mirrored architecture ready for one side to fail and be brought back online quickly.
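The disk-count estimate above can be sketched quickly (taking 1PB as 1,000,000GB and ~$1,000 per drive, as the comment does):

```shell
# Disks and retail cost for 1PB of 300GB drives at ~$1,000 each.
pb_gb=1000000
disk_gb=300
price=1000
disks=$(( (pb_gb + disk_gb - 1) / disk_gb ))   # ceiling division
cost=$(( disks * price ))
echo "${disks} disks, ~\$${cost}"
```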
  • by 1tsm3 ( 754925 ) on Wednesday October 26, 2005 @03:48AM (#13878846)
    GoogleFS is not a real file system. It's just a bunch of APIs that the program calls. GoogleFS is not integrated into the Linux VFS, so you can't mount a GoogleFS. All programs need to be modified to use the GoogleFS API.

    Also, GoogleFS has very narrow requirements/goals. It works best for programs that only append to files.
