RAID Solutions For Terrabyte Databases? 28
gullevek asks: "We are about to implement a huge database. In the first two years it will grow to 650 GB, and it will keep growing after that, possibly to 2-4 TB. I have never implemented a DB this large before. The DB software has been chosen, but now I have to find the right hardware. The basic components are not a problem, but what about storage? I would prefer to use RAID, of course, but what type? RAID 5 may be the best for surviving disk failure, but it can be quite slow. RAID 1 is fastest but the most expensive, especially at these sizes.
And what about the type of drive: SCSI-3? FCA? FibreChannel? Do you folks at Slashdot have any suggestions?"
Re:You have many options, (Score:1)
Get some bids.... (Score:4)
Put the word out that you need some RAID, and a service contract to go with it. Start talking with other folks who have similar-sized solutions, and see who their vendor is.
RAID Solutions for Exabyte Tape Drives (Score:1)
Re:EMC, in my home state (Score:1)
Your database vendor should have some suggestions. (Score:3)
I don't see a question in this unless you're using MySQL or some other relatively "low-end" DB in which case, you probably have larger concerns to deal with.
-JF
Re:EMC, in my home state (Score:2)
Money? (Score:1)
In Over Your Head? (Score:2)
As usual, rant first then opinion.
When I see questions like this on Slashdot, I get chills. This is obviously a big-budget job, and yet the guy responsible for the project seems to be asking some very basic -- too basic -- questions. Honestly, this isn't a flame. I've been in that boat myself from time to time. However, before I'd ever consider holding myself up to public ridicule, I'd do some heavy research. (And hope the boss never finds out. {grin})
Anyway, don't even consider RAID 5. That'll be double-dog slow. You need to go mirrored. Yes, that's more expensive. However, I can't imagine a database vendor recommending anything but mirroring. Actually, Oracle says folks should use RAID 0+1, which is mirrored stripes. We don't, but our system was set up before that was the recommendation.
In terms of drive size, the more spindles you have the better. That means buy nine-gig drives, not 80-gig drives. Of course, with your sizing requirements, you may buy 18-gig drives instead. However, my advice still stands: more spindles means more speed.
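The spindle trade-off is easy to sketch out. Here's a back-of-the-envelope comparison for the poster's 650 GB first-year target; the 100-IOPS-per-spindle figure is an illustrative assumption for a drive of that era, not a vendor spec:

```python
def spindle_count(db_gb, drive_gb):
    """Drives needed to hold db_gb on drives of drive_gb each (ceiling)."""
    return -(-db_gb // drive_gb)

DB_GB = 650              # first-year database size from the question
IOPS_PER_SPINDLE = 100   # rough, illustrative random-I/O figure

for drive_gb in (9, 18, 80):
    n = spindle_count(DB_GB, drive_gb)
    print(f"{drive_gb:3d} GB drives: {n:3d} spindles, "
          f"~{n * IOPS_PER_SPINDLE} aggregate IOPS")
```

Same capacity either way, but the nine-gig configuration gives roughly eight times the aggregate random I/O of the 80-gig one.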
Drive technology will be determined by which vendor you choose. We're an IBM shop and use SSA drives exclusively for our RS/6000 systems. SSA is fast, allows multiple paths (up to eight) to each drive, and allows for easy clustering. Since you will have a large number of drives, it's more important to have multiple paths to each drive than a single ultra-fast backbone. (I.e., sharing 120 Mbps across 40 drives isn't as good as sharing 40 Mbps across groups of 10 drives.)
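The parenthetical works out like this (using the poster's own illustrative bandwidth figures, not measured numbers):

```python
# One 120 Mbps backbone shared by all 40 drives,
# versus separate 40 Mbps loops shared by 10 drives each.
shared_backbone = 120 / 40   # Mbps available per drive
grouped_paths = 40 / 10      # Mbps available per drive

print(f"single backbone: {shared_backbone} Mbps/drive")
print(f"grouped paths:   {grouped_paths} Mbps/drive")
```

Less total bandwidth, but more of it per drive, plus the redundancy of independent paths.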
As a shareholder of EMC, I highly recommend their products. They are the best bar none. If you are a big player (Wal-Mart, Charles Schwab, etc.), their cost isn't much more than a less qualified solution. (Of course, I don't think you'll be a big enough player to get the really good discounts.)
Overall, the best advice I've seen in this thread is to ask your database vendor what you need to buy. Oracle|Sybase|IBM wants you to have a good database experience and will not give you bad advice on the hardware front. In fact, Oracle (who I most often work with) can sometimes help you get a bigger discount (no one ever pays list price) out of IBM|Sun|DEC.
You're in over your head. Make sure you follow the George W. game plan and get yourself some fine advisors.
InitZero
Definitely talk to EMC!! (Score:1)
But anyway... if you're serious about your data, and judging by the amount of data you are planning to store I would guess you are, then EMC is who you should be talking to. They are totally first rate!
Re:Get some bids.... (Score:1)
Re:Spelling mistake in the title, LOL (Score:1)
--
You have many options, (Score:1)
For extremely good performance and many features at a high price, the EMC Symmetrix [emc.com] is definitely the way to go.
RAID 1 is NOT the fastest, RAID 1 is mirroring. It is slow. RAID 0 is the fastest, but has no redundancy. RAID 0+1 is the way to go for speed/redundancy.
My suggestion... (Score:1)
Get the highest-quality disks, the fastest channel (Fibre all the way!), and as much storage as possible,
and maybe, just maybe, it will still be in use 20 years from now, and the company will be glad they put in that initial investment.
old saying.... (Score:1)
"No one ever got fired for choosing EMC "
It's pretty much a given that if you are planning to go that big, you should not even try to do this on your own; that's asking for trouble (given that you admit this is new ground for you). A little while ago, when I was researching hosting services (LogicTier, Loud Cloud, etc.), EMC was pretty much the standard.....
Re:Your database vendor should have some suggestio (Score:1)
Who needs RAID or a relational database? Just read the linux clustering HOWTO and apt-get yourself an enterprise database. MySQL is a better and faster database than anything else out there. Who needs online backup or transactions anyway.
Besides, since it is open source I can just skip lunch and write a perl script to fix all of my corrupted data!
[/sarcasm]
EMC disk all the way (expensive & it's worth it) (Score:2)
If you are running a medium-to-large database of a few terabytes, I would recommend investing in their TimeFinder solution, which allows you to make exact copies of the database (or data) by splitting the disks via a third mirror, which they call Business Continuity Volumes (BCVs). This makes backups quicker and easier (simply split the third mirror and mount the volumes on your backup host), database schema changes less risky (split the mirror before the change, and you have a speedy backout), and overall your life easier (you will sleep at night).
The above is making the assumption that you are also using a real database such as Oracle that can handle raw devices, and online backups, multiple nodes, etc.
If you can't afford the Cadillac, you can go for their lower-end Clariion arrays, which are also damn good little units if your budget is tight. You can essentially do the same things, except it's not as slick. Either one of them can scale to several terabytes of storage per unit.
---Hey, it's me.
Re:EMC disk all the way (expensive & it's worth it (Score:1)
They'll give you like a 10 page document describing why this piece of hardware cannot fail.
If it does, though, BOOM. The whole EMC frame locks up for any transaction whatsoever, until you get EMC onsite with another to replace it. Hope your data integrity isn't something you're concerned with.
Bang for the buck, we've done better with Network Appliance installations than I think EMC could ever hope for...
Terabyte database (Score:1)
Why don't you go compile your kernel again? (Score:1)
Here's a few highlights:
"Get larger, not necessarily as fast drives for your primary partitions. These can and should be on very large RAID/5 partitions"
RAID/5 + Databases = Bad data. RAID-5 reduces write performance by about 30% (and uses more cpu), and does not protect your data from controller failure (or for more than one disk failure per volume).
All data chunks need to be on simple volumes or RAID 0+1. This allows you to have up to 50% of the disks in a volume fail without a loss of data. If you use DMP on the fibre channel array, you'll also get load balancing.
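The write-performance gap the parent is describing follows from the classic RAID write-penalty model: a small RAID 5 write costs four physical I/Os (read data, read parity, write data, write parity), while a mirrored write costs two. A quick sketch, again assuming an illustrative 100 IOPS per spindle:

```python
def effective_write_iops(spindles, iops_per_spindle, penalty):
    """Small-write IOPS a volume can sustain, given the number of
    physical I/Os each logical write costs at this RAID level."""
    return spindles * iops_per_spindle // penalty

SPINDLES, IOPS = 10, 100  # hypothetical 10-disk volume
for level, penalty in (("RAID 0", 1), ("RAID 0+1", 2), ("RAID 5", 4)):
    print(f"{level:8s}: {effective_write_iops(SPINDLES, IOPS, penalty)} write IOPS")
```

The model ignores controller caches, which is why real-world RAID 5 degradation (the ~30% cited above) is often less dramatic than the raw 4x penalty, but the ordering holds.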
"Get larger, not necessarily as fast drives for your primary partitions. These can and should be on very large RAID/5 partitions."
In a perfect world, you would have more fast disks. In the past 2 years disk capacity has increased 10x while speeds have increased barely 2x. Multi-terabyte databases need to be on multiple, switched fibre channel arrays with the smallest (like 18GB) disks possible. This is expensive, but if you have 2TB of data online, you should have the money to buy a real solution.
"Get a bunch of smaller, but at least 10,000 RPM drives for your index storage. They should be on quite a few different hardware RAID adapters, and you should be using RAID/0 for them. For this, you don't care about losing a drive. The worst that can happen is reduced performance while you rebuild an index, you'll never lose any data."
This one is great. Any DBA who considers it no big deal to lose a whole dbspace worth of detached indexes needs to go back to Burger King. I'm sure everyone will be REAL happy when the database is in single-user mode while you 'just rebuild' all of your indexes. (All of your indexes were lost, since you have no mirroring, remember?)
compaq storageworks is a nice compromise (Score:2)
My last SAN project was 5 TB of raw disk, and the solution that EMC pitched me was close to $1 million more expensive than the competition.
In addition to being cheaper, the alternatives were mostly faster than EMC disk *AND* they played nicely with fiber channel hardware from other vendors (unlike EMC which likes to lock you in to their hardware only).
Not to knock EMC (solid product & killer support), but their obscene (no other way to describe it) pricing only makes their stuff worthwhile in situations where you need a 'black box' solution and some other guy is on the hook for hardware failures. In the life sciences, I see EMC disk being used on drug manufacturing process hardware as well as on databases and systems that come under FDA scrutiny (patient outcome and clinical trial data, etc.). Generally, only people with more money than sense purchase EMC for anything but the most absolutely mission-critical stuff.
The other thing that annoyed me about EMC was the overly aggressive, frat-boy-style sales force. The internal competition to make sales quotas is killer, I've heard.
I ended up going with Brocade fiber channel switches (Silkworm II) and Compaq StorageWorks disks. We needed a SAN that could talk to NT, Linux, Tru-64, Solaris, HP-UX and Irix systems all at once.
I'm not a Compaq cheerleader but I like the StorageWorks line because although they are not always the first to market with the latest buzzword technology when they do come out with a product it is generally really solid and actually reasonably priced. The other cool thing about their new universal drive form factor is that all of their disks are now plug and play from the lowest end proliant server all the way up to their high end systems.
As for RAID levels and such you really need your database architects to tell you what they need. It may end up being a mix of RAID 1+0 and RAID5 for some filesystems and they may ask for solid state disks to store indices and such. Hardware tuning for high-end databases is a whole field in itself and there are lots of people out there who can probably tell you exactly what should be needed.
What you are going to find at the end of this project is that disk capacity is pretty simple and easily handled. The real problem you are going to have is figuring out how to back up 5 TB worth of DB data :) Not a trivial task by any means...
just my $.02
-chris
Re:In Over Your Head? (Score:1)
So I also asked Slashdot. Hey, and I have to admit I got a lot of advice here, and after reading all the posts, I see some common ground.
If we all started out with the "top" knowledge, well, wouldn't that be very boring? Our goal (from a human point of view) is to learn. If I only ever did the things I already know how to do, I wouldn't evolve.
Anyway, you didn't flame me, and I realize myself how much I still have to learn about high-end DB things.
Thanks anyway for your advice. Yours were one of the best here!
mfg, Gul!
Re:EMC disk all the way (expensive & it's worth it (Score:1)
Next day?! Try 4 hours. I don't know what support you guys got, but they have 4 hours to be onsite and have it fixed with our Symmetrix.
I'm assuming an RDBMS for the database. (Score:5)
The first thing is to talk to your DBA and get his/her input. DBAs, competent ones, have done a lot of this type of work in the past, and they'll have an enormous amount of help to provide. They'll know your usage pattern by heart, and the usage pattern is what drives everything else.
The first thing to realize is that for most RDBMS usage patterns, RAID is a Very Very Very Bad Thing. But when I say "most", I mean "most with updates to live data."
RDBMSes use data in 4 main types of storage, and it's important to understand them:
You also want to bear in mind that your update speed is limited by your ability to handle log writes. Log writes aren't limited by bandwidth; they're limited by the latency of each disk. Every disk can handle only a certain number of operations per second. Even if you add more disks in a RAID configuration, you're never going to handle more transactions per second, because you're not increasing the operations-per-second of any single disk, and every disk in the set must be touched for each transactional write.
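That latency bound can be made concrete. If every commit must wait for one synchronous log write, the commit rate is capped by the write latency of the log disk regardless of how wide the array is; batching commits into a shared log write (group commit, which most RDBMSes support in some form) is what raises the ceiling. The 5 ms latency here is an illustrative seek-plus-rotation estimate, not a measured number:

```python
def max_tps(write_latency_ms, group_commit=1):
    """Upper bound on commits/sec when each log write takes
    write_latency_ms and group_commit transactions share one write."""
    return int(1000 / write_latency_ms * group_commit)

print(max_tps(5))                   # one commit per log write
print(max_tps(5, group_commit=10))  # ten commits batched per write
```

More spindles help random data I/O; only lower log latency (or batching) helps the commit rate.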
So with that in mind, allow me to recommend something:
The number one piece of advice I can give is to consult with others. If you haven't done this before, there are people (your DBA, your database vendor, your hardware vendor, your systems integrator) who have. This is serious business, and not something to screw around with. Terabyte-level databases are still NOT so common that everyone can and should attempt them. Having terabyte levels of data throughout an enterprise is common; having it in one application isn't. You'll probably not get it right the first time, so take your time and consult with every one of your vendors on capacity and performance planning.
Not to be crass or mean, but if you're asking slashdot, you probably shouldn't be doing this all by yourself.
Re:EMC, in my home state (Score:1)
No they don't. Try 10^6x10^7 to get started
SAN Datadirector from DataDirect Networks (Score:1)
Re:EMC disk all the way (expensive & it's worth it (Score:1)
Get proposals from Major Vendors (Score:2)
Good Luck.
The cure of the ills of Democracy is more Democracy.
Re: In Over Your Head? (Score:2)
And how are they going to achieve TBs with 18 GB drives??
Can you really not do the math?
1,000 GB divided by 18 GB comes to 56 drives (rounded up). Double that since it's mirrored: that's 112 drives. An IBM SSA drawer (7133-040) holds 16 drives, which comes to seven drawers.
We've got 148 drives (mostly 4.5 GB and 9 GB, since we're a smaller shop) online in a similar SSA configuration.
Of course, if the row size is substantial (binary objects such as images), it may make sense to use larger drives. However, if the data is primarily textual in nature (i.e., small), stick with the relatively small drives.
InitZero