Tips for Increasing Server Availability?
uptime asks: "I've got a friend that needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important, but more difficult to accomplish, it seems. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
high availability of the service (Score:5, Informative)
RAID, ECC RAM, teamed NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.
P.S. - your second server should be able to handle the exact same load as the first server, or it's not going to be terribly helpful.
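The P.S. generalizes beyond two boxes: with a pool of servers behind a balancer, the survivors must absorb the failed machine's share of the load. A minimal sketch of that arithmetic (all numbers invented for illustration):

```python
def required_per_server_capacity(peak_load, n_servers, n_failures=1):
    """Capacity each server needs so that the survivors can still carry
    the full peak load after n_failures servers drop out (N+1 planning)."""
    survivors = n_servers - n_failures
    if survivors < 1:
        raise ValueError("not enough surviving servers to carry any load")
    return peak_load / survivors

# With exactly two servers, the survivor must carry everything:
print(required_per_server_capacity(peak_load=1000, n_servers=2))  # 1000.0
# With four servers sized for one failure, each needs a third of peak:
print(required_per_server_capacity(peak_load=900, n_servers=4))   # 300.0
```

This is why a "spare" that is smaller than the primary buys very little: when it matters, it is carrying 100% of the load.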
Identify, Prioritize, Budget (Score:3, Informative)
Define then plan (Score:5, Informative)
You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak-time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague hits? That costs a great deal more still.
Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.
Once you know the requirements, list everything that may impact your availability, including hardware, OS, application(s), network switches, internet connectivity, etc. And it isn't just the web server - any database, app server, DNS server, load balancer, or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.
With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non-hot-swap drives, hot-swap drives, or hot-swap drives with a warm spare. You can determine whether you need hot redundant networking or whether a spare switch on the shelf is good enough. Can you page a tech and have him on-site in 2 hours, or do you need people on-site 24/7?
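One way to use that inventory is to multiply the pieces out: every component in the serving path that must be up drags the total down. A rough sketch, assuming independent failures and made-up availability figures:

```python
from functools import reduce

def chain_availability(availabilities):
    """Availability of components in series (web server, database,
    switch, uplink, ...): every one must be up, so the figures
    multiply. Assumes independent failures, which flatters reality."""
    return reduce(lambda a, b: a * b, availabilities, 1.0)

# Four pieces in the path, each a respectable 99.9%:
print(chain_availability([0.999] * 4))  # about 0.996 -- worse than any single part
```

The takeaway: a chain of "pretty reliable" parts is noticeably less reliable than any one of them, which is why the weakest link in the list matters so much.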
A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC-192 the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or large natural disasters. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e., can afford $$$), you can use routing protocols to reroute the traffic to your other location at the routing level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
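That last observation can be made concrete: with independent sites, the system is down only when every site is down at the same moment. A back-of-the-envelope sketch (availability figures invented for illustration, and switchover time ignored):

```python
def parallel_availability(site_availability, n_sites=2):
    """Availability with n independent sites, any one of which can
    serve traffic: the system fails only if all sites fail at once.
    Ignores DNS-failover switchover delay and correlated outages."""
    return 1.0 - (1.0 - site_availability) ** n_sites

# Two merely-decent 99% sites together beat one heroic single site:
print(parallel_availability(0.99))  # 0.9999
```

Two "two nines" sites combine to roughly four nines, which is exactly why each individual site can be built to a cheaper standard.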
Uptime ++ (Score:3, Informative)
Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP ProLiant servers. Sure, there will be flames for this recommendation, but think about it: there is a reason HP ProLiant servers are the standard choice in almost every Fortune 1000 company. Regardless of some anecdotal whining about poor support that is sure to follow my post, HP ProLiants ARE that good!
Use hot-pluggable SCSI drives attached to a battery-backed RAID controller, plus a hot-spare drive. Do NOT use IDE, and you might even want to forgo SATA, though it is a possibility. Use high-quality ECC memory, dual processors, and redundant power supplies, and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only warn you (via alarms, alerts, pagers, or email) before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, and can even restart Apache for you should it fail.
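The restart-the-service-on-failure idea is not tied to HP's tools. A bare-bones watchdog sketch in Python, run from cron or a loop; the health URL and service name are placeholders, and the `systemctl` call assumes a systemd-style host, so adjust for your environment:

```python
import subprocess
import urllib.request

def is_healthy(url, timeout=5):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False  # connection refused, timeout, non-HTTP error, ...

def restart_service(name):
    """Restart a service by name (assumes a systemd host with systemctl)."""
    subprocess.run(["systemctl", "restart", name], check=True)

# Usage (e.g. from a cron job every minute):
#     if not is_healthy("http://localhost/healthz"):
#         restart_service("apache2")
```

A dedicated monitoring system is still better, since a watchdog running on the failing box shares its fate; this just shows the shape of the check.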
Now this is all well and good for greatly reducing downtime to almost none, but it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load balancer, or implement a cluster. If you're super paranoid, both. Naturally you also need redundant power sources and network connectivity with all this, so that you do not have a single point of failure ANYWHERE.
Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try to use white-box desktop hardware, you've already failed!
HA is elusive (Score:3, Informative)
Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute [upsite.com] regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.
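"Dream on" has numbers behind it: each extra nine cuts the permitted downtime by a factor of ten. A quick sketch of the budget at each level:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525960

def downtime_minutes_per_year(availability):
    """Minutes of downtime per year permitted at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.3%} uptime -> {downtime_minutes_per_year(a):9.1f} minutes/year")
```

At 99.999% you get about five minutes per year, total, including every reboot, patch, and hardware swap; that is why five nines is a dream for a small shop.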
Specifically you need to make everything in your system redundant. The web servers need to be redundant, you need to have redundant copies of the data. The paths to the internet need to be redundant and the environment should be remote and redundant.
Once you get a handle on your environment, you should consider some sort of clustering technology for server redundancy. I suggest you read "In Search of Clusters" [regehr.org] by Gregory F. Pfister to get a fundamental understanding of the technology.
As was posted earlier, you might just want to throw in the towel and go with managed web hosting. Measure the ISP's service level agreement against the Uptime Institute specifications.
You might also consider a local ISP co-lo and do your own remote clustering.
Re: Here they are (Score:2, Informative)
Here are some details:
The budget is probably $30k or less.
I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production; one sits idle, to be swapped in should the production box fail.
The server is hosted locally, bandwidth is not an issue.
The system is Windows-based, using ASP and SQL Server.
The RAID array uses SCSI drives. They are hot-swappable, but there is no hot spare.
I am approaching other sysadmins and looking for advice from them as well. I am not as worried about the traffic or about the backbone at this point as I am keeping the hardware up and the data backed up and available. I am also interested in methods of getting things back up and running quickly should a hardware failure occur or the database become corrupt.
This is not my area of expertise (obvious from my questions) and I thought there might be some general guidelines for this sort of thing. I suspect that he will be paying someone to help with this, but I was hoping to get a good feel for what to expect and to have some knowledge beforehand so I can make better decisions (get more than one opinion, basically).
Thanks for everyone's responses, I appreciate your time.
Bathtub Curve Makes it More Likely (Score:3, Informative)
As usual, variety is the spice of life: drives bought from the same batch sit at the same point on the bathtub curve and tend to fail around the same time, which is exactly how you end up with two drives dying at once in a RAID5 array. Just don't buy lots of the same kind of stuff at once.
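The questioner's double-drive RAID5 loss is the classic case. Even with fully independent drives there is a window during rebuild when one more failure kills the array; same-batch drives are correlated and do worse. A rough sketch with invented failure-rate numbers:

```python
def second_failure_probability(surviving_drives, hourly_failure_rate, rebuild_hours):
    """Chance that at least one more drive fails while a degraded RAID5
    array is rebuilding. Assumes independent drives with a constant
    failure rate; same-batch drives are correlated and fail together
    far more often than this estimate suggests."""
    p_one_drive_survives = (1.0 - hourly_failure_rate) ** rebuild_hours
    return 1.0 - p_one_drive_survives ** surviving_drives

# Seven surviving drives, a 24-hour rebuild, and a pessimistic
# 0.01% per-drive hourly failure rate:
print(second_failure_probability(7, 0.0001, 24))  # roughly 0.017
```

Even under the rosy independence assumption that is a nearly 2% chance of losing the array per rebuild; mixing drive batches (and adding a hot spare to shorten the exposure window) attacks both factors.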