Tips for Increasing Server Availability?
uptime asks: "I've got a friend who needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important but, it seems, more difficult to accomplish. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
Hosting (Score:5, Insightful)
If that isn't a road you want to venture down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
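For anyone who hasn't written one, here's a minimal sketch of how a Nagios-style check plugin works; it only assumes the standard plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The load thresholds and the Linux-only /proc/loadavg source are just illustrative, not a real RAID or load plugin.

#!/usr/bin/env python
# Toy Nagios-style check: read the 1-minute load average from /proc/loadavg
# (Linux only) and map it onto the standard plugin exit codes.
import sys

WARN, CRIT = 4.0, 8.0  # illustrative thresholds; tune per machine

def main():
    try:
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
    except (IOError, ValueError):
        print("LOAD UNKNOWN - could not read /proc/loadavg")
        return 3  # UNKNOWN
    if load1 >= CRIT:
        print("LOAD CRITICAL - load1=%.2f" % load1)
        return 2
    if load1 >= WARN:
        print("LOAD WARNING - load1=%.2f" % load1)
        return 1
    print("LOAD OK - load1=%.2f" % load1)
    return 0

if __name__ == "__main__":
    sys.exit(main())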
high availability of the service (Score:5, Informative)
RAID, ECC RAM, teamed NICs and all that stuff are very helpful, but if you want to make DARN sure that the service is as available as possible, run two servers.
P.S. - your second server should be able to handle the exact same load as the first server, or it's not going to be terribly helpful.
Re:high availability of the service (Score:2)
Re:high availability of the service (Score:5, Interesting)
Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.
Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have less than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers.
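The back-of-the-envelope version of that sizing rule, with numbers invented purely for illustration: the cluster has to carry peak load even with one (or more) members down, which is how you end up at four servers rather than two.

import math

def servers_needed(peak_rps, per_server_rps, tolerated_failures=1):
    """Smallest cluster that still carries peak load with
    `tolerated_failures` members out of service (the N+1 rule
    when tolerated_failures is 1)."""
    surviving = math.ceil(peak_rps / float(per_server_rps))  # boxes needed just to serve the load
    return int(surviving) + tolerated_failures

# Invented example: 900 req/s peak, each server good for 300 req/s,
# survive one failure -> 3 working servers + 1 spare = 4.
print(servers_needed(900, 300, tolerated_failures=1))  # -> 4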
Re:high availability of the service (Score:2)
Good point. I took it as read that the first server in question is handling the load on its own and the question was solely about providing higher availability. I put in a note about clustered/load balanced servers being equivalent in available capacity because in my experience, this is a place that novices will shortsightedly cut corners to save $$$.
Re:high availability of the service (Score:2)
Re:high availability of the service (Score:5, Interesting)
This is basically the principle that my company runs its servers on. We should be able to run perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLECs/ILECs.
As for the double failure in a RAID5 array that the article poster mentioned: for Pete's sake, buy a couple of spare disks. You should follow the same rule in building your RAID arrays as your server clusters. You *should* be able to lose 2/5ths of your disks without losing the array. This means that you need at least 1 spare for every 5 drives, for a total of 6 drives.
Add some good monitoring on top of these and your downtimes drop to almost nothing. In fact, you shouldn't ever see service downtimes with a proper setup, provided you actually bring machines back up as they fail.
Re:high availability of the service (Score:2)
A RAID 5 array can happily chug along after a single disk dies, but if you lose two drives on the same array at the same time, you are pretty well and good screwed.
--
The proof of this, of course, is left up to the reader (with help from Google.)
Re:high availability of the service (Score:2)
Depends what "at the same time" means. If it happens a few minutes apart and you have a hot spare, your data might be saved by the spare.
Re:high availability of the service (Score:2, Interesting)
You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated onto the spare before the array is back to its 'redundant' state.
Therefore, the gap between the two disks failing *must* be long enough for the regeneration to complete, or else you'll lose everything(?).
This can take quite some time.
On my s/w RAID5, it takes hours.
Furthermore, the process causes significantly more disk activity on the
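For a rough feel for why it takes hours (the throughput figures below are assumed, not measured): the whole replacement disk has to be rewritten, so the floor on rebuild time is roughly capacity divided by sustained rebuild throughput, and a software rebuild sharing the disks with normal traffic gets a fraction of the drives' raw speed.

def rebuild_hours(disk_gb, rebuild_mb_per_s):
    """Lower bound on RAID5 rebuild time: the whole replacement disk
    must be rewritten at the sustained rebuild rate."""
    return (disk_gb * 1024.0) / rebuild_mb_per_s / 3600.0

# Assumed numbers for illustration: a 250 GB member rebuilt at 20 MB/s
# (software RAID throttled by normal I/O) takes about 3.5 hours; the
# same disk at 80 MB/s on an otherwise idle array takes under an hour.
print("%.1f h" % rebuild_hours(250, 20))   # ~3.6 h
print("%.1f h" % rebuild_hours(250, 80))   # ~0.9 h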
Re:high availability of the service (Score:2)
[snip]
This can take quite some time.
On my s/w RAID5, it takes hours.
This is what 15K RPM SCSI drives and expensive RAID controllers are for.
s/w RAID is great 99% of the time, but...
Re:high availability of the service (Score:3, Interesting)
As for RAID5, it's not always practical, but bear in mind if you buy all your disks from the same mfg at the same time, your chances of concurrent failure
Re:high availability of the service (Score:2)
Re:high availability of the service (Score:2)
Um, details? (Score:1, Insightful)
You are hosting this on a 56K dial-up in your root cellar?
Your apps need to run on Microsoft Windows or HP-UX or...?
You've got a SAN or local disk or...?
You're using home-built white-box x86s or Sun E15000s or...?
You have sysadmin talent on hand? You're outsourced to IBM global services?
Who vets these silly questions? Oh, I forgot - the "Editors".
Re: Here they are (Score:2, Informative)
Here are some details:
The budget is probably $30k or less.
I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production, one sits idle to be swapped in the event of failure.
The server is hosted locally, bandwidth is not an issue.
The system is Windows based, ASP a
Re: Here they are (Score:2)
As for SQL, you could have two installations of SQL 2000 and use NLB to share among them; so long as you either manually take care of write transactions or use replica
But it is a generic problem: Here is how to start (Score:2)
Where's the link? (Score:5, Funny)
Identify, Prioritize, Budget (Score:3, Informative)
Not a bad idea... (Score:3, Funny)
Re:Not a bad idea... (Score:1)
Hire a Professional? (Score:3, Insightful)
Re:Hire a Professional? (Score:1)
A third party product is simply an additional tool in this case. One he will have to manage in addition to his server. It will help, but your friend needs to look at how he manages his technology (including product selection). I'm not trying to bust his chops but he can learn it sooner or he can learn it later. I wish him luck...
Define then plan (Score:5, Informative)
You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than under a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak-time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague strikes? That costs more still.
Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.
Once you know the requirements, list everything that may impact your availability, including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server - any database, app server, DNS server, load balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.
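One way to turn that inventory into numbers (the component figures below are invented for illustration, and the model assumes failures are independent, which real correlated failures violate): components that are all required multiply their availabilities together, and redundant copies of a component only fail when every copy fails.

def series(*avail):
    """Availability of components that are ALL required (each one is a
    single point of failure): multiply the individual availabilities."""
    total = 1.0
    for a in avail:
        total *= a
    return total

def parallel(a, n):
    """Availability of n independent redundant copies of one component:
    the set is down only when every copy is down."""
    return 1.0 - (1.0 - a) ** n

# Invented example: one 99.5% web server, one 99.9% switch, one 99.5% uplink.
single = series(0.995, 0.999, 0.995)                               # ~98.9%, roughly 4 days/yr of downtime
# Duplicate the web server and the uplink, keep the single switch:
redundant = series(parallel(0.995, 2), 0.999, parallel(0.995, 2))  # ~99.9%, and the switch is now the weak link
print("%.4f %.4f" % (single, redundant))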
With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives or hot-swap drives with warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours or do you need people on-site 24/7?
A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC-192, the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or to large natural disasters. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
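A hedged sketch of just the failover-decision half of that (the actual DNS update is left as a stub, because it depends entirely on which provider you use and their interface; the hostname and thresholds are invented):

import socket
import time

PRIMARY = ("www-primary.example.com", 80)    # invented hostname/port
FAILURES_BEFORE_FAILOVER = 3                 # consecutive failed probes before switching

def primary_up(timeout=5):
    """Can we still open a TCP connection to the primary site?"""
    try:
        socket.create_connection(PRIMARY, timeout).close()
        return True
    except socket.error:
        return False

def point_dns_at_backup():
    """Stub: call your DNS provider's update mechanism here to repoint the
    record at the backup site. Keep the record TTL low (a few minutes) or
    the switch won't take effect quickly for most clients."""
    raise NotImplementedError

failures = 0
while True:   # in practice, run this monitor from a third location, not from either site
    failures = 0 if primary_up() else failures + 1
    if failures >= FAILURES_BEFORE_FAILOVER:
        point_dns_at_backup()
        break
    time.sleep(60)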
Re:Define then plan (Score:1)
Re:Define then plan (Score:2)
Circa Y2K, we had a provider who promised full redundancy on everything but they had a piece-o-crap load balancer and firewall from a company that got bought by Cisco. Every so often this would cause an amusing array of serious network problems but only on a portion of the sites that were handled by this equipment. This had a severe impact on our site but reb
Uptime ++ (Score:3, Informative)
Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP ProLiant servers. Sure, there will be flames for this recommendation, but think about it: there is a reason that HP ProLiant servers are the ONLY choice in almost every Fortune 1000 company. Regardless of the anecdotal whining about poor support that is sure to follow my post, HP ProLiants ARE that good!
Use hot-pluggable SCSI drives attached to a battery-backed RAID controller, plus a hot-spare drive. Do NOT use IDE, and you might even want to forgo SATA, though it is a possibility. Use high quality ECC memory, dual processors, and redundant power supplies, and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only let you know, via alarms, alerts, pages, or email, before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, and can even restart Apache for you should it fail.
Now this is all well and good for reducing downtime to almost none, but it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load balancer, or implement a cluster. If you're super paranoid, both. Naturally you also need redundant power sources and network connectivity with all this, so that you do not have a single point of failure ANYWHERE.
Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try and use white-box desktop hardware you've already failed it!
Re:Uptime ++ (Score:2)
Not really so unlikely... (Score:2)
you can certainly have a second drive fail while another one is being reconstructed if all the drives in the RAID are near end of life. The only good way to prevent it is to intentionally fail & replace sufficiently old drives before they actually fail (i.e. before you start climbing the steep end of the "bathtub" curve).
It c
Re:Uptime ++ (Score:2)
Re:Uptime ++ (Score:2)
I never said that I experienced a multi-drive failure. I am aware of how they come about, and I know of those who have experienced them. You are making assumptions that are not supported by fact.
HotMail and Google were cheaper why?
I also never said this. I said they used less expensive components. I made no statement as to their total cost. Again, you are making assumptions tha
Re:Uptime ++ (Score:2)
Re:Uptime ++ (Score:2)
I would advise that the MTBF divided by the number of drives in the array be used as a rough guide
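A sketch of that guide, under an assumed and deliberately simplified model (independent drives, quoted MTBF taken at face value, invented numbers): the expected time to the first failure in an N-drive array is roughly the single-drive MTBF divided by N, and the chance of losing a second drive during the rebuild window scales with the rebuild time. Real arrays do worse, because drives from the same lot in the same hot chassis do not fail independently.

def first_failure_hours(mtbf_hours, n_drives):
    """Rough expected time to the first failure in an n-drive array,
    assuming independent drives with the quoted MTBF."""
    return mtbf_hours / float(n_drives)

def second_failure_during_rebuild(mtbf_hours, surviving_drives, rebuild_hours):
    """Approximate probability that one of the surviving drives also fails
    before the rebuild completes (simple independence approximation;
    correlated failures make the real number much higher)."""
    return 1.0 - (1.0 - rebuild_hours / mtbf_hours) ** surviving_drives

# Illustrative numbers: 500,000 h quoted MTBF, 6-drive array, 10 h rebuild.
print(first_failure_hours(500000, 6))                 # ~83,000 h, i.e. expect a failure within a decade
print(second_failure_during_rebuild(500000, 5, 10))   # ~1e-4 per rebuild, before correlation is accounted for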
Re:Uptime ++ (Score:1)
That's what RAID 5 does.
Rebuilding a drive in RAID 5 is a long, intensive process, as it has to rebuild the parity.
RAID 5 arrays often fail when a second near-end-of-life drive dies while a failed drive is being rebuilt. The number one cause of this is heat buildup over time, IMHO. Usually when a RAID 5 loses 2 drives, it is actually the case that one drive failed and there was no hot spare to take over; when the second drive fails (at some later date) the RAID is broken. Anyone running a RAID 5
Probability of simultaneous two disk failure (Score:3, Insightful)
On that basis I am going to make some wild-assed guesses that are more probable given the little information we have.
1) the drives were consumer models from the same production lot,
2) the death of the first drive was not immediately noticed,
3) compatible replacement drives are not easy to come by (no hot spare),
4) the second drive died before the first one was replaced,
5) the server did not have hot-swap drive carriers,
6) someone tried to replace the dead drive in the running chassis.
If you don't like my guesses, provide your own.
Re:Probability of simultaneous two disk failure (Score:1)
Bathtub Curve Makes it More Likely (Score:3, Informative)
Re:Probability of simultaneous two disk failure (Score:5, Insightful)
7) The drives are overheating. This happened to my two Seagate 200GB drives. I had to mount them on a heatsink; the normal bay does not provide adequate cooling for Seagate's 7200 RPM drives.
Re:Probability of simultaneous two disk failure (Score:2)
Happened to us... It didn't stop at the hard drives either.
But yes, it was cheap hardware.
HA is elusive (Score:3, Informative)
Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute [upsite.com] regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.
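For reference, the downtime budget implied by each "nines" figure is straightforward arithmetic (nothing assumed beyond a 365.25-day year), which is why five nines is such a tall order:

for nines in (0.99, 0.999, 0.9999, 0.99999):
    minutes_down = (1.0 - nines) * 365.25 * 24 * 60   # allowed downtime per year
    print("%.3f%% -> %.0f min/year (~%.1f h)" % (nines * 100, minutes_down, minutes_down / 60))
# 99%     -> ~5,260 min/yr (~88 h)
# 99.9%   -> ~526 min/yr  (~8.8 h)
# 99.99%  -> ~53 min/yr
# 99.999% -> ~5 min/yr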
Specifically you need to make everything in your system redundant. The web servers need to be redundant, you need to have redundant copies of the data. The paths to the internet need to be redundant and the environment should be remote and redundant.
Once you get a handle on your environment, you should consider some sort of clustering technology for server duality. I suggest you read "In Search of Clusters" [regehr.org] by Gregory F. Pfister to get a fundamental understanding of the technology.
As was posted earlier, you might just want to throw in the towel and accept web hosting. Use the Uptime Institute specifications against the ISP's service level agreement.
You might also consider a local ISP co-lo and do your own remote clustering.
Re:HA is elusive (Score:2)
Why? What makes 5 nines unachievable, especially for such a simple setup as described in TFQ?
HA for web applications isn't very difficult, it just requires doing a little architecture up front, and spending the appropriate amount of money. Load balanced N+1 clusters, multiple redundant Internet links and their associated hardware, none of this is difficult or even that expensive today.
There are very few places in the computing world where real HA is even remo
Re:HA is elusive (Score:1)
Re:HA is elusive (Score:2)
I fully agree that keeping any individual server to 5 nines availability is generally an unattainable goal; in my environment we strive for 3 nines on individual components, and 5 nines on the overall service. With that goal we build HA into everything we design, whether it's through load balancing, heartbeat/IP takeover, or building the failover logic in
Re:HA is elusive (Score:1)
I do clustering for databases, and we shoot for "no nines" (100%). Our apps a
Re:HA is elusive (Score:2)
We've bought some really large storage solutions at really cut-t
Re:HA is elusive (Score:1)
Re:HA is elusive (Score:2)
I've seen the ultramonkey stuff before, and even deployed a fairly complex LVS balanced cluster (the pair of lvs directors actually had a
Re:HA is elusive (Score:1)
Re:HA is elusive (Score:2)
Some examples of services I've run in active/passive with full stateful failover include Cisco PIX firewalls, Foundry ServerIron load balancers, NetApp filers, a pair of LVS directors (there was an app you ran which passed state data from th
Re:HA is elusive (Score:1)
I've never worked with Foundry ServerIron load balancers or NetApp (to do this). Unfortunately our clusters are not the standard type of workload usually found behind this kind of gear. Cisco, Veritas and Piranha (Red Hat LVS) all failed in work simulation testing. There were too many dropped connections and timeouts. The Cisco gear failed over OK, but had problems with failback and high load. Veritas was just a w
Re:HA is elusive (Score:2)
I strongly recommend NetApp to everyone I encounter professionally who needs HA NAS. The
Re:HA is elusive (Score:1)
What, an iSCSI implementation that's not junk? I'll believe it when I see it.
I really can't say too little about Veritas. They have been costing me sleep since UnixWare 7.1.1.
Why is it that the Beancounters
Re:HA is elusive (Score:2)
Depends on budget and requirements (Score:2)
Use a service monitor, such as Hobbit or Big Brother, to monitor the services. If a service fails once, have it auto-restarted. If it fails repeatedly, have the monitor reboot the box automatically. If the server keeps crashing, or if recovery locks up, then have it not
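A stripped-down sketch of the check-then-restart idea (this is not Hobbit or Big Brother themselves; the URL, restart command, and thresholds are placeholders you would replace with your own):

#!/usr/bin/env python
# Toy watchdog: probe the web server, restart it after a failed check,
# and give up (so a human gets paged) after repeated failures.
import subprocess
import time
import urllib2  # use urllib.request on Python 3

URL = "http://localhost/"                       # placeholder health-check URL
RESTART_CMD = ["/etc/init.d/httpd", "restart"]  # placeholder restart command
MAX_RESTARTS = 3

failures = 0
while True:
    try:
        urllib2.urlopen(URL, timeout=10).read()
        failures = 0                            # healthy again, reset the counter
    except Exception:
        failures += 1
        if failures > MAX_RESTARTS:
            print("service keeps dying - page a human")  # hand off rather than flap forever
            break
        subprocess.call(RESTART_CMD)
    time.sleep(60)                              # check once a minute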
Re:Depends on budget and requirements (Score:1)
And use software such as Pound [apsis.ch], which can be set up fairly easily [debian-adm...ration.org]. Of course if you're running a massive site like /. you might be better off with a hardware load balancer.
But Pound can be used to easily pool a collection of back-end hosts, and avoid forwarding connections if one of them dies. It will even take care of mai
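Conceptually, what Pound (or any reverse proxy) is doing for you, shown as a toy Python illustration rather than anything resembling its real configuration or code: keep a list of backends, skip the ones that fail a health probe, and round-robin across the rest.

import itertools
import socket

# Placeholder backend addresses; in real life these come from the proxy's config.
BACKENDS = [("10.0.0.11", 80), ("10.0.0.12", 80), ("10.0.0.13", 80)]
_rr = itertools.cycle(BACKENDS)

def alive(host, port, timeout=2):
    """Cheap health probe: can we open a TCP connection to the backend?"""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except socket.error:
        return False

def next_backend():
    """Round-robin over the backends, skipping any that fail the probe."""
    for _ in range(len(BACKENDS)):
        host, port = next(_rr)
        if alive(host, port):
            return host, port
    raise RuntimeError("no live backends")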
mod_globule, bitches (Score:2)
You need to start looking at server redundancy (Score:2)
Now you can just have other machines waiting to take the load with a quick reconfig, or you can start doi
Re:You need to start looking at server redundancy (Score:2)
Virtualisation ? (Score:2)
2 drive fail happening 2 times! (Score:2)
But, apparently you have to schedule a consistency check every week or so... if you do not do this, there is a possibility that data on the RAID gets out of sync/corrupted (or something like that) and when a drive fails the su
Re:2 drive fail happening 2 times! (Score:2)
Re:2 drive fail happening 2 times! (Score:2)
Re:2 drive fail happening 2 times! (Score:1)
Make sure you schedule the weekly check of the RAID!!!
Better yet, make sure you buy hardware that isn't complete crap and doesn't need you to manually check it to make sure it hasn't silently started corrupting itself!!!!
Sheesh -- next you'll be telling us to run mysqlcheck from time to time to make sure your "database" hasn't corrupted itself.
Remove Single Points of Failure (Score:2)
There was a really good LISA talk... (Score:5, Interesting)
Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
Anycast, Mosix (Score:2)
Mosix is a UNIX patch that allows processes to migrate across machines. There's even a neat ut
Similar but different (Score:2)
Worked on that a bit while I was employed there.
Basically, we synchronize routing table state across several hot-standby routers, so failover is instantaneous, limiting flapping in the network.
Rather cool, actually.
Some thoughts on this (Score:1)
Well, as you can see, you can get quite a bit of information here, obviously.
Since there were no specific details, just a request for information sources, I would say this:
Many vendors of products will offer an assortment of solutions to high availability needs. Legato used to have cluster software for Windows servers (for ava
Don't forget downtime due to human error/malice (Score:2)