
Tips for Increasing Server Availability?

uptime asks: "I've got a friend who needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important but, it seems, more difficult to accomplish. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
  • Hosting (Score:5, Insightful)

    by hatch815 ( 917921 ) on Tuesday September 27, 2005 @01:34PM (#13659742)
    If you are moving to a level where you need uptime but can't dedicate more resources to overseeing it, you may want to consider a hosted solution. They host, monitor, upgrade, and do checkups (YMMV depending on whom you choose).

    If that isn't a route you want to go down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
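
    A minimal sketch of the kind of custom check Nagios can run against a box is below. It uses the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the status URL and thresholds are made-up examples, not anything specific to the poster's setup:

        #!/usr/bin/env python3
        # Minimal Nagios-style check: fetch a status URL and exit with the
        # standard plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
        # The URL and thresholds below are hypothetical examples.
        import sys
        import time
        import urllib.request

        URL = "http://localhost/server-status"   # hypothetical status page
        WARN_SECS = 2.0                          # warn if slower than this
        CRIT_SECS = 5.0                          # critical if slower, or down

        def main():
            start = time.time()
            try:
                with urllib.request.urlopen(URL, timeout=CRIT_SECS) as resp:
                    resp.read()
                    code = resp.status
            except Exception as exc:
                print("CRITICAL - %s" % exc)
                return 2
            elapsed = time.time() - start
            if code != 200:
                print("CRITICAL - HTTP %d in %.2fs" % (code, elapsed))
                return 2
            if elapsed > WARN_SECS:
                print("WARNING - HTTP %d in %.2fs" % (code, elapsed))
                return 1
            print("OK - HTTP %d in %.2fs" % (code, elapsed))
            return 0

        if __name__ == "__main__":
            sys.exit(main())

    Drop something like this into the plugins directory and wire it up as a check_command; the exit code is all Nagios really cares about.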
  • by PFactor ( 135319 ) on Tuesday September 27, 2005 @01:36PM (#13659762) Journal
    If you have a service that must be highly available, cluster or load balance the service. Use more than 1 box and either cluster them or load balance them.

    RAID, ECC RAM, teamed NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.

    P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful
    • Don't forget to have a redundant network connection; unless you sign up for HSRP, most data centers only hook you into one of their providers.
    • by Rolan ( 20257 ) * on Tuesday September 27, 2005 @01:48PM (#13659890) Homepage Journal
      P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful

      Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.

      Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have less than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers.
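
      A back-of-the-envelope sketch of that sizing rule, with made-up numbers (the peak load and per-box capacity are assumptions, not anything from this thread):

          # Rough N+M capacity sizing: how many boxes do you need so the
          # cluster still carries peak load with some of them down?
          # Peak load and per-server capacity figures are made up.
          import math

          peak_load = 900.0      # requests/sec at peak (assumed)
          per_server = 250.0     # requests/sec one box can sustain (assumed)
          spares = 1             # simultaneous failures to tolerate

          needed = math.ceil(peak_load / per_server)   # boxes required for load alone
          provisioned = needed + spares                # what you should actually run

          print("need %d for load, run %d (N+%d)" % (needed, provisioned, spares))
          # With any one box down you still have enough capacity for peak,
          # so the cluster degrades gracefully instead of collapsing.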

      • "Actually, each server should be able to handle the entire load of the cluster"

        Good point. I took it as read that the first server in question is handling the load on its own and the question was solely about providing higher availability. I put in a note about clustered/load balanced servers being equivalent in available capacity because in my experience, this is a place that novices will shortsightedly cut corners to save $$$.
        • or, for really busy sites, 3 out of 5 should be able to handle the load...
          • by wolf31o2 ( 778801 ) on Tuesday September 27, 2005 @03:51PM (#13660972)

            This is basically the principle that my company runs on with its servers. We should be able to run perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLEC/ILECs.

            As for the double-failure in a RAID5 array thing the article poster mentioned, for Pete's sake, buy a couple spare disks. You should follow the same rule in making your RAID arrays as your server clusters. You *should* be able to lose 2/5ths of your disks without losing the array. This means that you need at least 1 spare for every 5 drives, for a total of 6 drives.

            Add some good monitoring on top of these and your downtimes drop to almost nothing. In fact, you shouldn't ever see service downtimes with a proper setup, provided you actually bring machines back up as they fail.

            • Actually, per definition of RAID 5 - no, you *shouldn't* be able to lose 2/5ths of your disks at the same time without losing the array.
              A RAID 5 array can happily chug along after a single disk dies, but if you lose two drives on the same array at the same time, you are pretty well and good screwed.
              --
              The proof of this, of course, is left up to the reader (with help from Google.)
              • Actually, per definition of RAID 5 - no, you *shouldn't* be able to lose 2/5ths of your disks at the same time without losing the array.

                Depends what "at the same time" means. If it happens a few minutes apart and you have a hot spare, your data might be saved by the spare.
                • This is my take on RAID5 - I take it as written that I will be corrected :

                  You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated on the spare before it's back to its 'redundant' state.

                  Therefore, the time between the two disks failing *must* be enough for the regeneration to complete, else you'll lose everything(?).

                  This can take quite some time.

                  On my s/w RAID5, it takes hours.

                  Furthermore, the process causes significantly more disk activity on the
                    You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated on the spare before it's back to its 'redundant' state.

                    [snip]

                    This can take quite some time.

                    On my s/w RAID5, it takes hours.


                    This is what 15K RPM SCSI drives and expensive RAID controllers are for.

                    s/w RAID is great 99% of the time, but...
    • If you find yourself running a bunch of servers all with similar specs/configs, you should consider removing the disks from them and netbooting off a single image on another server (or a single image available on 2 other servers, just in case). Disks are far more likely to break than any other component, IMO; far more likely than fans or PSUs.
      As for RAID5, it's not always practical, but bear in mind that if you buy all your disks from the same manufacturer at the same time, your chances of concurrent failure
      • Not bad, but by booting from a remote image, you run the risk of the remote image server being unavailable, unless you have a redundant image server. Your best bet is to have the remote image server, but use/create a provisioning system that allows you to install the image you want to the machine OR run a network image. This eliminates the need for a redundant image server most of the time (machines can still run from local disk), but allows you to reprovision for temporary needs like a failed server or s
      • DO NOT listen to this advice. Every real RAID controller I have ever worked with requires exactly matched disks. In fact having different firmware versions on the drive controllers can cause problems. You can get away with unmatched disks using pseudo software controllers but I still wouldn't do it because I really doubt the manufacturer has tested such a configuration. Total software setups can of course use whatever as they are just a layer on top of the already existing disks and drivers.
  • Um, details? (Score:1, Insightful)

    by afabbro ( 33948 )
    You have a budget of $1 million?
    You are hosting this on a 56K dial-up in your root cellar?
    Your apps need to run on Microsoft Windows or HP-UX or...?
    You've got a SAN or local disk or...?
    You're using home-built white-box x86s or Sun E15000s or...?
    You have sysadmin talent on hand? You're outsourced to IBM global services?

    Who vets these silly questions? Oh, I forgot - the "Editors".

    • Re: Here they are (Score:2, Informative)

      by Anonymous Coward
      Well, I was really more after general approaches, to be honest. I didn't post a lot of specifics because I didn't think anyone was truly interested in solving my exact problem for me.

      Here are some details:
      The budget is probably $30k or less.
      I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production, one sits idle to be swapped in the event of failure.
      The server is hosted locally, bandwidth is not an issue.
      The system is Windows-based, ASP a
      • I would recommend the Google approach - cluster cheap computers. Clustering ASP can be easy (it depends on whether you use the Session variable) - look into Microsoft's Network Load Balancing [google.com], which, while it load balances HTTP applications, also provides clustering and failover (I think - you'd have to check) without setting up a formal Windows cluster.

        As for SQL, you could have two installations of SQL 2000 and use NLB to share among them; so long as you either manually take care of write transactions or use replica
    • First you should make a list of all your servers and define your tolerances (figure 2/5). What needs to be kept online in the event of a power outage, for example (maybe your PBX, telephones, publicly accessible servers, and associated networking equipment), is a good question that needs to be asked, but don't stop there. List every possible disaster (structure fire, flood, earthquake, terrorist attack, you name it) and determine which resources need to be available. Then you can design a redundant system
  • by azuroff ( 318072 ) on Tuesday September 27, 2005 @01:36PM (#13659767)
    Just post a link to this server of his - we'll gladly stress-test it for him at no charge. :)
  • by Nos. ( 179609 ) <andrewNO@SPAMthekerrs.ca> on Tuesday September 27, 2005 @01:59PM (#13659986) Homepage
    From the internet to the resources you are trying to provide, identify every point of failure: power outage, uplink, router, switch, server, etc. Prioritize - which are the most likely to fail? Budget - which ones can we afford, or is it cost-effective, to duplicate? Clustering may be an option, but might be too expensive. What about a cold/hot standby? That reduces downtime overall. You can find relatively inexpensive UPSs just about everywhere. Making your entire network redundant can take a lot of time and money.
  • by Ingolfke ( 515826 ) on Tuesday September 27, 2005 @02:06PM (#13660045) Journal
    I think sysadmins would respond very nicely to tips for increasing server availability. Let's say the average tip is about 15%, and for simplicity's sake we'll say that every hour of downtime costs about $10,000. Baseline service should be set either by an SLA or a measured baseline. For this we'll say 99% uptime per month. That allows for 7.2 hours of unscheduled downtime in a month. So if the server is up for 100% of that time, then you'd want to tip your sysadmin about $10,800 per server (assuming the #s stay the same) for a month of great work.
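
    The arithmetic here generalizes; a quick sketch using the comment's own made-up dollar figures:

        # Allowed downtime per month for a given availability target, plus the
        # tongue-in-cheek "tip": 15% of the downtime cost that was avoided.
        HOURS_PER_MONTH = 30 * 24     # 720 hours
        COST_PER_HOUR = 10000         # $/hour of downtime (the comment's figure)
        TIP_RATE = 0.15

        for target in (0.99, 0.999, 0.9999, 0.99999):
            allowed = HOURS_PER_MONTH * (1 - target)
            print("%.3f%% uptime allows %.2f hours of unscheduled downtime/month"
                  % (target * 100, allowed))

        # 99% allows 7.2 hours/month; a perfect month "saves" 7.2 h * $10,000/h
        # = $72,000, and a 15% tip on that is the $10,800 above.
        tip = HOURS_PER_MONTH * (1 - 0.99) * COST_PER_HOUR * TIP_RATE
        print("tip for a perfect month against a 99%% baseline: $%.0f" % tip)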
    • who then pays when it is down? SLAs, for the most part, are a joke. Did you ever get your $10,000 per hour back for downtime? I was down multiple T-1s for a week and never got a single penny. The owners estimated that we lose $8K per hour per T-1; that week was 6 T-1s x 10 hours x 4 days = 240 line-hours x $8K.
  • by marcus ( 1916 ) on Tuesday September 27, 2005 @02:13PM (#13660124) Journal
    That is all...
    • A server's stability is a transitive property: it is about as stable as its administrator's personality.

      A third party product is simply an additional tool in this case. One he will have to manage in addition to his server. It will help, but your friend needs to look at how he manages his technology (including product selection). I'm not trying to bust his chops but he can learn it sooner or he can learn it later. I wish him luck...

  • Define then plan (Score:5, Informative)

    by linuxwrangler ( 582055 ) on Tuesday September 27, 2005 @02:15PM (#13660140)
    You will find that "availability" is a vague term. First you need to have a discussion to determine what availability means. It must be put in measurable, non-vague terms. "99% uptime" is not a good definition; "the system must handle 99.7% of requests in 30 milliseconds or less" is much better, in part because it includes a performance expectation. It also recognizes that not every request will receive the desired level of response. Additionally, if you determine that you want N+1 redundancy then you need to know the appropriate value of N (how many servers are needed to provide the required response times).
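
    Once the goal is phrased that way it becomes something you can measure mechanically. A minimal sketch, assuming you can pull per-request latencies (in milliseconds) out of your access logs; the sample numbers are invented:

        # Check a "99.7% of requests within 30 ms" style target against a
        # list of observed request latencies (ms).  Sample data is made up;
        # in practice you would parse this out of your access logs.
        latencies_ms = [12, 18, 9, 45, 22, 17, 31, 14, 11, 28]

        TARGET_MS = 30.0
        TARGET_FRACTION = 0.997

        within = sum(1 for ms in latencies_ms if ms <= TARGET_MS)
        fraction = within / len(latencies_ms)

        print("%.1f%% of requests within %.0f ms (target %.1f%%): %s"
              % (fraction * 100, TARGET_MS, TARGET_FRACTION * 100,
                 "OK" if fraction >= TARGET_FRACTION else "MISS"))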

    You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak-time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague-of-locusts destroys our primary site? Cough up the dough.

    Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.

    Once you know the requirements, list everything that may impact your availability including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server - any database, app server, DNS server, load balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.

    With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives or hot-swap drives with warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours or do you need people on-site 24/7?

    A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC-192, the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or large natural disasters. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
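
    That last point is easy to put numbers on. A sketch, assuming the two sites fail independently and failover is instant (real DNS TTLs and shared dependencies will eat into this):

        # Combined availability of two independently failing sites with
        # perfect, instantaneous failover -- optimistic, but shows the idea.
        site_a = 0.995   # each site is 99.5% available on its own (assumed)
        site_b = 0.995

        both_down = (1 - site_a) * (1 - site_b)
        combined = 1 - both_down

        print("each site %.3f%%, combined %.5f%%" % (site_a * 100, combined * 100))
        # 0.5% * 0.5% = 0.0025% chance both are down at once, i.e. ~99.9975%
        # combined -- which is why neither site has to be gold-plated on its own.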

    • What I found running maintenance operations at a large multinational bank is that the definition of downtime is very important. You need to separate "scheduled" downtime (patches, preventative maintenance, etc.) from unplanned downtime (disk died). Otherwise, you will never get to a reasonable benchmark.
      • You are correct. Definitions are key. Be VERY VERY careful what constitutes "scheduled" downtime. Scheduling downtime to fix a problem is unacceptable.

        Circa Y2K, we had a provider who promised full redundancy on everything but they had a piece-o-crap load balancer and firewall from a company that got bought by Cisco. Every so often this would cause an amusing array of serious network problems but only on a portion of the sites that were handled by this equipment. This had a severe impact on our site but reb
  • Uptime ++ (Score:3, Informative)

    by Anonymous Coward on Tuesday September 27, 2005 @02:16PM (#13660143)
    Hmm, two RAID drives failed simultaneously? This is possible but so unlikely it isn't worth mentioning. Either the equipment he (Freudian You) is using is utter crap or they didn't really fail at the same time. Most likely one failed and no one noticed until the second failed. Or possibly he/you were using software RAID, in which case your OS or you failed and caused the apparent drive failure. In any of these cases the real cause of the failure is him/you!

    Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP Proliant servers. Sure, there will be flames for this recommendation but, think about it. There is a reason that HP Proliant servers are the ONLY choice in almost every Fortune 1000 company. Regardless of some anecdotal whining about poor support that is sure to follow my post, HP Proliants ARE that good!

    Use hot-pluggable SCSI drives attached to a battery-backed RAID controller and a hot-spare drive. Do NOT use IDE, and you might even want to forgo SATA, though they are a possibility. Use high quality ECC memory, dual processors, redundant power supplies, and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only let you know, via alarms or alerts or pagers or email, before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, or even restart Apache for you should it fail.

    Now this is all well and good for greatly reducing downtime to almost none but, it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load-balancer or implement a cluster. If you're super paranoid, both. Naturally you need to also have redundant power sources and network connectivity with all this so that you do not have a single point of failure ANYWHERE.

    Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try and use white-box desktop hardware you've already failed it!

    • Eh... two drives... it can happen. I manage 175 remote Proliant servers ranging from lowly ML330's to 8-CPU DL740's. These machines are reliable and have excellent value-added monitoring features, but anything can happen. I average about 20-30 drive failures annually across those 175 machines. I had two drives fail simultaneously on a four-drive Raid 10 array recently following a power-outage. The drives wouldn't spin up. And unfortunately, they were mirror pairs. I didn't have any predictive-failure or S.M
    • ... when the drives in a RAID set near their end of life: Given
      • the bathtub curve [microsoft.com] of disk failure rates, and
      • that a RAID reconstruct can take about a day on a lot of RAID sets

      you can certainly have a second drive fail while another one is being reconstructed if all the drives in the RAID are near end of life. The only good way to prevent it is to intentionally fail & replace sufficiently old drives before they actually fail (i.e. before you start climbing the steep end of the "bathtub" curve).

      It c
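
      To put a rough number on that window, here is a sketch using a constant failure rate, which is optimistic for old drives (the whole point of the bathtub curve); the MTBF and rebuild time are assumptions:

          # Rough odds of losing a second drive while a RAID 5 rebuild runs.
          # Constant-failure-rate (exponential) model -- optimistic for drives
          # at end of life, where the real rate is much higher.
          import math

          mtbf_hours = 500000      # vendor MTBF per drive (assumed)
          rebuild_hours = 24       # time to reconstruct onto the spare (assumed)
          remaining_drives = 4     # surviving members of a 5-drive RAID 5

          rate = 1.0 / mtbf_hours
          p_drive_survives = math.exp(-rate * rebuild_hours)
          p_second_failure = 1 - p_drive_survives ** remaining_drives

          print("P(second failure during rebuild) ~ %.4f%%" % (p_second_failure * 100))
          # ~0.02% with healthy drives; multiply the rate by 10-50x for drives
          # climbing the end of the bathtub curve and it stops being negligible.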

    • Ever heard of a failure somewhere else other than the drives causing the failure? The controller? The Cable? 2 drives from the same batch that had a flaw? And saying it is going to cost big piles of cash, you obviously haven't heard of how Hotmail and Google started. (Hotmail back when it ran BSD.) He's not asking for marketing speak for big iron, but real examples. My One Big Servers have cost me more sleep than my little cheap, but well thought out cluster ever has. And the cluster has been running for ov
    • Rebuilding is pretty disk intensive. It can push a second drive that's near failure over the edge.
      • I'm no expert, but that sounds like the filesystem or RAID program isn't doing a good job of planning where it puts the data (or its hash) in the array. While there will be benefits in avoiding putting data in the same access pattern across an array of drives (if the drive heads all scratch the same area of disk surface), if you have the data lined up nicely, rebuilding the arrays should be lighter work on the drive.

        I would advise that the MTBF divided between the drives in the array be used to give a guide
        • RAID 5 was specified.
          That's what RAID 5 does.
          Rebuilding a drive in RAID 5 is a long, intensive process, as it has to rebuild the parity.
          RAID 5s often fail when a second near-end-of-life drive dies while rebuilding a failed drive. The number one cause of this is heat buildup over time, IMHO. Usually when a RAID 5 loses 2 drives, it is actually the case that one drive failed and there was no hot spare to take over; when the second drive fails (at some later date) the RAID is broken. Anyone running a RAID 5
  • by metoc ( 224422 ) on Tuesday September 27, 2005 @02:23PM (#13660195)
    The odds of two drives failing simultaneously are extremely low given the MTBF of modern drives. You have a better chance of a power supply or fan failure.

    On that basis I am going to make some wild assed guesses that are more probable given the little information we have.

    1) the drives were consumer models from the same production lot,
    2) the death of the first drive was not immediately noticed,
    3) compatible replacement drives are not easy to come by (no hot spare),
    4) the second drive died before the first one was replaced,
    5) the server did not have hot-swap drive carriers,
    6) someone tried to replace the dead drive in the running chassis

    If you don't like my guesses provide your own
    • I have actually had this happen. Actually, option 4 is closest to it, although the time between failures was too short to do anything. We have an older Dell computer (First mistake!) with a SCSI hot swap RAID. One of the drives failed, and we replaced it with the spare and ordered a second drive. About a day later, another drive failed before the replacement arrived. For some reason, the Windows Admin removed the drive even though there was no replacement available. He placed the bad one back in, and within
    • You're right that disks don't fail together that often, but components do tend to fail when you get them or at the end of their expected lifetimes (just like us!). This is called the bathtub curve. If you buy a bunch of disks at the same time with the same MTBF, you'll get a big spike of failures within the first few days or in say 4 years. If you use RAID5 on lots of disks, you're hosed because it can't tolerate a failure during a recovery. This may sound exotic, but it's a key design consideration on
    • by mangu ( 126918 ) on Tuesday September 27, 2005 @03:32PM (#13660800)
      If you don't like my guesses provide your own


      7) The drives are overheating. This happened to my two Seagate 200GB drives. I had to mount them in a heatsink; the normal bay does not provide adequate cooling for Seagate's 7200 rpm drives.

    • 7. PSU went crazy and killed all the drives in the array and the spares instantly.
      Happened to us... It didn't stop at the hard drives either.
      But yes, it was cheap hardware.
  • HA is elusive (Score:3, Informative)

    by Ropati ( 111673 ) on Tuesday September 27, 2005 @02:27PM (#13660228)
    Preventing downtime is an expensive, time-consuming exercise with few limits.

    Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute [upsite.com] regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.

    Specifically you need to make everything in your system redundant. The web servers need to be redundant, you need to have redundant copies of the data. The paths to the internet need to be redundant and the environment should be remote and redundant.

    Once you get a handle on your environment, you should consider some sort of clustering technology for server duality. I suggest you read "In Search of Clusters" [regehr.org] by Gregory F. Pfister to get a fundamental understanding of the technology.

    As was posted earlier, you might just want to throw in the towel and accept web hosting. Use the Uptime Institute specifications against the ISP's service level agreement.

    You might also consider a local ISP co-lo and do your own remote clustering.
    • Are you looking for 99.999% uptime? Dream on.

      Why? What makes 5 nines unachievable, especially for such a simple setup as described in TFQ?
      HA for web applications isn't very difficult; it just requires doing a little architecture up front and spending the appropriate amount of money. Load-balanced N+1 clusters, multiple redundant Internet links and their associated hardware - none of this is difficult or even that expensive today.

      There are very few places in the computing world where real HA is even remo

      • You do realize that that is less than six minutes of downtime per year.
        • Yes, I'm fully aware. And like I said, it's not that hard to achieve. You have to be talking about service availability, though, not component availability.

          I fully agree that keeping any individual server to 5 nines availability is generally an unattainable goal; in my environment we strive for 3 nines on individual components, and 5 nines on the overall service. With that goal we build HA into everything we design, whether it's through load balancing, heartbeat/IP takeover, or building the failover logic in

          • Five nines is difficult, even for simple services like web serving. For this guy's $30K budget, it will be very difficult. He will need at least 3 machines + NAS. Several whitebox machines would be fine, and even cheap, but the shared storage will not be cheap: IBM 4200 fiber-attached SATA or similar. Only if his web content is very static can he get by without an external RAID (all content on each server, rsync, and a gang of servers).

            I do clustering for DB, and we shoot for "no nines" (100%). Our apps a
            • I'd disagree on the cost of the shared storage, especially depending on how it's done. I'd also disagree that whether the content is static or not really makes a difference. If he's running a bunch of CGIs (as an example), which all pull data from a database, then simply doing a heartbeat/IP-takeover pair of whitebox DB servers isn't that expensive. The shared storage there could be expensive, but depending on the DB engine it may not be necessary.

              We've bought some really large storage solutions at really cut-t

              • My point about static content was that if his content is static, he can avoid having shared storage. No shared storage significantly reduces the price tag. The problem with using a NetApp or similar is the single point of failure. The cheapest possible way with shared storage would be a 2-box NFS server cluster with shared SCSI. Finding hardware that is reliable for this can be problematic. LSI Logic MegaRAID (PERC 3) is usable and not too expensive... I built one cluster that used AFS, but distributed fi
                • You don't actually need active/active to avoid the single point of failure, though I do see some advantages in terms of resource costs to going active/active. No matter what you won't pick up a huge amount of savings, since you can't load active/active boxes as heavily as active/passive but it does look a little better when presenting to a cost-conscious management team.

                  I've seen the ultramonkey stuff before, and even deployed a fairly complex LVS balanced cluster (the pair of lvs directors actually had a

                  • With active/passive, when you lose the active director, you lose the session mapping through it. With more complex things going on through LVS, like telnet or remote instruments or FTP, or even some .asp, when you lose the LVS state, you lose the connection. Sure, they can just reconnect, but when 400 techs & data entry people get bumped off, and 40 instruments in a lab have to be reset and reloaded, that's not an acceptable level of service. For continuous sessions, active-passive is a single point
                    • I can speak from experience that active/passive versus active/active is not the determining factor for single point of failure with load balancers (or anything else for that matter). It merely determines whether both nodes are providing services or if one is in standby.

                      Some examples of services I've run in active/passive with full stateful failover include Cisco PIX firewalls, Foundry ServerIron load balancers, NetApp filers, a pair of LVS directors (there was an app you ran which passed state data from th

                    • I really have to look into NetApp. Service levels are critical to us and they sound outstanding.

                      I've never worked with Foundry ServerIron load balancers or NetApp (to do this). Unfortunately, our clusters are not the standard type of workload usually found behind this type of gear. Cisco, Veritas and Piranha (Red Hat LVS) all failed in work-simulation testing. There were too many dropped connections and timeouts. The Cisco gear failed over OK, but had problems with failback and high load. Veritas was just a w
                    • Overall I think the Foundries are excellent devices. They balance my mail cluster exceptionally well. The site here has a number of IP-based virtual hosts on a cluster of 7 servers, and the Foundries balance each virtual host separately, which means that on a server-load basis the balancing is uneven. For a single host to a group of machines, though, they're really quite good. The failover and failback work quite well also.

                      I strongly recommend NetApp to everyone I encounter professionally who needs HA NAS. The

                    • With our in-house expertise in LVS, I'm not likely to use any of the Foundry ServerIron gear; it's too hard to justify on the budget. I like getting new kit to play with, though. Where would Foundry have a reasonable cost justification (over LVS)? Where is the Foundry ServerIron best suited?

                      What, an iSCSI implementation that's not junk? I'll believe it when I see it.

                      I really can't say too little about Veritas. They have been costing me sleep since UnixWare 7.1.1.
                      Why is it that the Beancounters
                    • I doubt the ServerIron gear, for example, has any significant advantages over a well-run LVS cluster, which it sounds like you've got. There are some higher-end LBs out there that I've looked at to do things like SSL acceleration, caching, etc. - stuff that LVS doesn't really hook into very well, though you could easily build it from open source components should you want those features. For my environment SSL acceleration is a big deal, because I'm running 30+ sites with different certs, and therefore stuck doin
  • If you need High Availability (ie: almost, but not quite, 100% uptime) then you want two or more boxes which you can either load-balance between (dropping crashed servers from the list) OR which you can fail-over to in the event of a problem.

    Use a service monitor, such as Hobbit or Big Brother, to monitor the services. If a service fails once, have it auto-restarted. If it fails repeatedly, have the monitor reboot the box automatically. If the server keeps crashing, or if recovery locks up, then have it not
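
    A stripped-down version of that escalation logic, as a sketch: the health-check URL and restart command are hypothetical placeholders, and a real deployment would express this in the monitor's own configuration rather than a hand-rolled loop:

        # Toy watchdog with escalation: probe a service, restart it on the
        # first few failures, and only escalate after repeated ones.
        # The URL and restart command are hypothetical placeholders.
        import subprocess
        import time
        import urllib.request

        CHECK_URL = "http://localhost:8080/health"         # hypothetical
        RESTART_CMD = ["/etc/init.d/mywebapp", "restart"]   # hypothetical
        MAX_RESTARTS = 3
        INTERVAL = 30   # seconds between probes

        def healthy():
            try:
                with urllib.request.urlopen(CHECK_URL, timeout=5) as resp:
                    return resp.status == 200
            except Exception:
                return False

        failures = 0
        while True:
            if healthy():
                failures = 0
            else:
                failures += 1
                if failures <= MAX_RESTARTS:
                    subprocess.call(RESTART_CMD)    # try a plain service restart
                else:
                    print("still down after %d restarts - escalate "
                          "(page someone, or reboot the box)" % MAX_RESTARTS)
            time.sleep(INTERVAL)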

    • If you need High Availability (ie: almost, but not quite, 100% uptime) then you want two or more boxes which you can either load-balance between (dropping crashed servers from the list)

      And using software such as Pound [apsis.ch], which can be set up fairly easily [debian-adm...ration.org]. Of course, if you're running a massive site like /. you might be better off with a hardware load balancer.

      But Pound can be used to easily pool a collection of back-end hosts, and avoid forwarding connections if one of them dies. It will even take care of mai

  • Put a server in San Francisco, put a server in Massachusetts, put a server in Florida or Texas, put a server in Chicago, put a server in New York, and then redundantly cluster them all with something like mod_globule or rsync scripting mojo. Problem solved.
  • Drives die. Fans die. Power supplies die. Motherboards die. Having a RAID array is not enough. There are plenty of other things in a single system that can go wrong that can take the system down for a period of time. The biggest issue I have with the limited description here is the fact you are talking about one system. If you want availability, you need to be looking at scaling by the machine.

    Now you can just have other machines waiting to take the load with a quick reconfig, or you can start doi
    • The parent has said some of the most important things about high availability, and load balancing includes HA as well. The nugget of the logic driving this is that with redundancy outside servers, you don't need redundancy *within* a server. Then whiteboxes will do, RAID is less important, and a reboot on one machine will not take you down (you can actually kill the correct half of your machines without a hitch). Add in redundant, separate UPSs and network switches - my pager is *so* quiet these days - and y
  • Despite the obvious price tag, VMware products allow you to "virtualize" your server(s) and to run them across multiple hardware hosts. OK, that just adds another option, but it's nice to be able to just "move" one virtual host from one hardware box to another without shutting it down.
  • Well, 2 simultaneous drive failures: it has happened to me on two separate occasions on the same Dell PE2650. I also complained to tech support that this was very strange, that I did not trust the server anymore, and that there had to be a hardware failure in the array or backplane.

    But apparently you have to schedule a consistency check every week or so... if you do not do this, there is a possibility that data on the RAID gets out of sync/corrupted (or something like that) and when a drive fails the su
    • So does a RAID consistency check mean you have to reboot the system every week or so and use the RAID BIOS to do the consistency check? If so, that sort of defeats any uptime requirements, since I would imagine that consistency checks on large RAID volumes could take quite a while...
    • Make sure you schedule the weekly check of the RAID!!!

      Better yet, make sure you buy hardware that isn't complete crap and needs you to manually check it to make sure it hasn't silently started corrupting itself!!!!

      Sheesh -- next you'll be telling us to run mysqlcheck from time to time to make sure your "database" hasn't corrupted itself.

  • You need to take a look at the infrastructure and see where the single points of failure are. After you have come up with a list of single points of failure, you can then analyze the cost vs. risk of each one. Here's a list of things I would look at (off the top of my head):
    • Access provider - Do you have more than 1? If you only have one, do they have multiple uplinks? Do you have multiple connections to your provider? If so, do they terminate into different pieces of equipment? Also, if these are d
  • ... about this a few years back. I forget the guy's name; he was administering a site that did stock quotes with pretty graphs, etc. I suspect I don't remember all of his points anymore, but:
    • two (or more!) network feeds from different vendors; verify monthly, as best you can, that they don't have common routing (sometimes you end up sharing a fiber even though it doesn't look like it...). These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
    • redundant front end servers which have their own copies of static content and cache any active content from...
    • redundant back-end servers that actually do the active content, and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that traffic doesn't fight with the actual web service.
    • Backup power (UPS + generator) with regular tests. (test on one side of your redundant servers at a time, just in case...)
    • Log only raw IPs; have a backend system with a caching DNS setup where you do web reports. Do things like log file compression, reports, etc. on the back-end server only.
    • tripwire all the config stuff against a tripwire database burned to CD-ROM.
    • update configs on a test server (you do have test servers, right?); when they're right, update the tripwire stuff, build a new tripwire CD-ROM, then update the production boxes.
    • use a fast network-switch-style load balancer on the front. They also help defend your servers against certain DoS attacks (e.g. SYN floods).
    • when things get busy, load your test servers with the latest production stuff, and bring them into the load balance pool. If it takes N servers to handle a given load, it takes N+1 or N+2 to dig back out of a hole, because the load has at least 1 server out of commission at a time...
    • use revision control (RCS, CVS, subversion, whatever) on your config files.
    • use rsync or some such to keep 2 copies of your version control, above.
    • make sure you can reproduce a front-end or back-end machine from media + revision control + backups in under an hour. Test this regularly with a test server.
    If you have a site whose content changes less frequently, (i.e. at most daily) burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CDROM, as well. Then if you blow a server, you can just slap the CD's/DVD's in another box and be back in business, and it's much harder to hack.

    Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
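
    On the tripwire-against-read-only-media point above, the core idea fits in a few lines; a sketch, assuming a baseline manifest of "path<TAB>sha256" lines kept on the CD-ROM, with example paths:

        # Tripwire in miniature: compare SHA-256 hashes of config files
        # against a baseline manifest kept on read-only media.
        # The manifest location and watched paths are examples only.
        import hashlib
        import os

        BASELINE = "/mnt/cdrom/config-manifest.txt"    # "path<TAB>sha256" per line
        WATCHED_DIRS = ["/etc/apache2", "/etc/ssh"]    # example config trees

        def sha256_of(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            return h.hexdigest()

        baseline = {}
        with open(BASELINE) as f:
            for line in f:
                path, digest = line.rstrip("\n").split("\t")
                baseline[path] = digest

        for root in WATCHED_DIRS:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if path not in baseline:
                        print("NEW      %s" % path)
                    elif sha256_of(path) != baseline[path]:
                        print("CHANGED  %s" % path)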

  • Anycast allows you to serve the same IP from multiple servers distributed around the net. It's a BGP hack, and some people don't like it because of that, but it's used to keep some of the DNS root servers up. It has the added ability to divide and conquer DDoS attacks. The caveat is that routing changes break stateful connections, so that's why DNS can use it well (single-packet UDP connections most of the time).

    Mosix is a UNIX patch that allows processes to migrate across machines. There's even a neat ut
    • Chiaro S.T.A.R.: STateful Assured Routing.

      Worked on that a bit while I was employed there.

      Basically, we synchronize routing table state across several hot-standby routers, so failover is instantaneous, limiting flapping in the network.

      Rather cool, actually.

  • "Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."

    Well, as you can see, you can get a bit of information here, obviously.

    Since there were no specific details, just a request for information sources, I would say this:

    Many vendors of products will offer an assortment of solutions to high availability needs. Legato used to have cluster software for Windows servers (for ava

  • Modern server hardware is pretty good. If you have redundant disks and PSUs, which are the main moving parts in a system, that should be enough. Your bigger worry, especially if you're running Windows systems, is downtime due to rebooting for service patches, and downtime due to malicious break-ins. You can mitigate this to an extent by having lots of servers. Make sure they all have their own passwords so breaking into one won't compromise your whole network. Check regularly using tripwire or simi
