Tips for Increasing Server Availability?

uptime asks: "I've got a friend who needs some help with his web server's availability. On two separate occasions, his server has had a problem that made it unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important but also, it seems, more difficult to accomplish. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."

  • by Rolan ( 20257 ) * on Tuesday September 27, 2005 @01:48PM (#13659890) Homepage Journal
    P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful

    Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.

    Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have fewer than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers. (A quick back-of-the-envelope sizing sketch follows.)
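
    Not anyone's real capacity figures, just a Python sketch of the sizing rule above; the request rates and per-server capacity numbers are made up for illustration:

        # Back-of-the-envelope cluster sizing: how many servers are needed so that
        # peak load still fits after losing `tolerated_failures` machines?
        import math

        def servers_needed(peak_requests_per_sec, per_server_capacity, tolerated_failures=1):
            # Servers required to carry peak load with every machine healthy...
            base = math.ceil(peak_requests_per_sec / per_server_capacity)
            # ...plus enough spares that the survivors still carry it after failures.
            return base + tolerated_failures

        # Example: 900 req/s peak, each box comfortably handles 300 req/s, and we
        # want to survive one dead server: 3 + 1 = 4 machines.
        print(servers_needed(900, 300, tolerated_failures=1))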

  • ... about this a few years back. I forget the guy's name; he was administering a site that did stock quotes with pretty graphs, etc. I suspect I don't remember all of his points anymore, but:
    • two (or more!) network feeds from different vendors; verify monthly, as best you can, that they don't share common routing (sometimes you end up sharing a fiber even though it doesn't look like it...). These connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
    • redundant front end servers which have their own copies of static content and cache any active content from...
    • redundant back-end servers that actually generate the active content and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that traffic doesn't fight with the actual web service.
    • Backup power (UPS + generator) with regular tests. (test on one side of your redundant servers at a time, just in case...)
    • Log only raw IPs; do your web reports on a back-end system with a caching DNS setup. Do log-file compression, reports, etc. on the back-end server only.
    • tripwire all the config stuff against a Tripwire database burned to CD-ROM (a minimal hash-check sketch in the same spirit follows this list).
    • update configs on a test server first (you do have test servers, right?); when they're right, update the Tripwire database, burn a new Tripwire CD-ROM, then update the production boxes.
    • use a fast network-switch-style load balancer on the front. It also helps defend your servers against certain DoS attacks (e.g., SYN floods).
    • when things get busy, load your test servers with the latest production build and bring them into the load-balancing pool. If it takes N servers to handle a given load, plan on N+1 or N+2 to dig back out of a hole, since you'll have at least one server out of commission at a time while you recover...
    • use revision control (RCS, CVS, subversion, whatever) on your config files.
    • use rsync or some such to keep two copies of the version control repository above.
    • make sure you can reproduce a front-end or back-end machine from media + revision control + backups in under an hour. Test this regularly with a test server.
    If you have a site whose content changes infrequently (i.e., at most daily), burn the whole site to a CD-ROM or DVD-ROM image and boot your web servers from it as well. Then if you blow a server, you can just slap the CDs/DVDs into another box and be back in business, and it's much harder to hack.

    Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
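
    Not Tripwire itself, just a minimal Python sketch of the same config-integrity idea from the list above; the manifest path and its format are hypothetical placeholders:

        # Tripwire-flavored integrity check: compare SHA-256 hashes of config files
        # against a baseline manifest kept on read-only media (e.g. a mounted CD-ROM).
        import hashlib
        import sys

        BASELINE = "/mnt/cdrom/config-manifest.txt"   # hypothetical: lines of "<sha256>  <path>"

        def sha256_of(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            return h.hexdigest()

        def main():
            bad = 0
            with open(BASELINE) as manifest:
                for line in manifest:
                    if not line.strip():
                        continue
                    expected, path = line.split(None, 1)
                    path = path.strip()
                    try:
                        actual = sha256_of(path)
                    except OSError as err:
                        print(f"MISSING {path}: {err}")
                        bad += 1
                        continue
                    if actual != expected:
                        print(f"CHANGED {path}")
                        bad += 1
            sys.exit(1 if bad else 0)

        if __name__ == "__main__":
            main()

    Run it from cron and you get a crude alarm whenever a production config drifts from the baseline burned to disc.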

  • by wolf31o2 ( 778801 ) on Tuesday September 27, 2005 @03:51PM (#13660972)

    This is basically the principle that my company runs on with its servers. We should be able to run perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLECs/ILECs.

    As for the double failure in a RAID5 array that the article poster mentioned, for Pete's sake, buy a couple of spare disks. Follow the same rule for your RAID arrays as for your server clusters: you *should* be able to lose 2/5ths of your disks without losing the array. That means at least one spare for every five drives, for a total of six drives.

    Add some good monitoring on top of these and your downtime drops to almost nothing. In fact, you shouldn't ever see service downtime with a proper setup, provided you actually bring machines back up as they fail. (A bare-bones monitoring sketch follows.)
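
    A bare-bones Python sketch of the "good monitoring" part; the URLs and the alert mechanism (just a print here) are placeholders for whatever paging or Nagios-style setup you actually run:

        # Poll each service URL once a minute and complain when a health check fails.
        import time
        import urllib.request

        CHECKS = [
            "http://web1.example.com/healthz",   # hypothetical endpoints
            "http://web2.example.com/healthz",
        ]

        def is_up(url, timeout=5):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return 200 <= resp.status < 300
            except Exception:
                return False

        while True:
            for url in CHECKS:
                if not is_up(url):
                    print(f"ALERT: {url} failed its health check")
            time.sleep(60)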

  • by captainclever ( 568610 ) <rj@NoSPaM.audioscrobbler.com> on Tuesday September 27, 2005 @04:08PM (#13661132) Homepage
    If you find yourself running a bunch of servers with similar spec/config, consider removing the disks from them and netbooting off a single image on another server (or a single image available on two other servers, just in case). Disks are far more likely to break than any other component, IMO; far more likely than fans or PSUs.
    As for RAID5, it's not always practical, but bear in mind that if you buy all your disks from the same manufacturer at the same time, your chances of concurrent failure are increased (the batch the disks came from may be suspect). You could buy disks from different manufacturers. Hot spares are always handy too.
    The webservers for last.fm [www.last.fm] are all diskless and boot off a single Debian image, which makes it a hell of a lot easier to upgrade/update them. We use Perlbal [danga.com] (from the LiveJournal crew) as a reverse-proxy load balancer, which works nicely. (A toy sketch of the reverse-proxy idea follows.)
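
    This isn't Perlbal or its config, just a toy Python sketch of the round-robin reverse-proxy idea; the backend addresses are made up:

        # Forward each incoming GET to the next backend in the pool, round-robin.
        import itertools
        import urllib.request
        from http.server import BaseHTTPRequestHandler, HTTPServer

        BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]   # hypothetical pool
        pool = itertools.cycle(BACKENDS)

        class Proxy(BaseHTTPRequestHandler):
            def do_GET(self):
                backend = next(pool)
                try:
                    with urllib.request.urlopen(backend + self.path, timeout=5) as resp:
                        body = resp.read()
                        status = resp.status
                    self.send_response(status)
                    self.send_header("Content-Length", str(len(body)))
                    self.end_headers()
                    self.wfile.write(body)
                except Exception:
                    self.send_error(502, "backend unavailable")

        HTTPServer(("", 8000), Proxy).serve_forever()

    A real balancer also needs health checks, header pass-through, and something smarter than a single thread, which is exactly what Perlbal and its kind provide.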
  • by dwater ( 72834 ) on Tuesday September 27, 2005 @11:46PM (#13664103)
    This is my take on RAID5 (I take it as given that I will be corrected):

    You can lose one disk in a RAID5 array and it still works. However, the contents of the failed disk need to be regenerated onto the spare before the array is back to its 'redundant' state.

    Therefore, the time between the two disk failures *must* be longer than the time the regeneration takes to complete, or you'll lose everything(?).

    This can take quite some time.

    On my s/w RAID5, it takes hours.

    Furthermore, the process causes significantly more disk activity on the remaining disks, increasing the risk of another failure.

    The time taken to regenerate the array depends on many things. The ones I can think of:

    1) the read performance of the disks remaining in the array, and the write performance of the spare.
    2) controller performance (or interface/chipset/memory/CPU performance if it's software RAID).

    Of course, once the array is back to its redundant state, it still isn't back to 'normal', because the spare has now been used up and needs to be replaced.

    Better to use mirroring (0+1?) in some way, IMO. If you have three mirrors, you can hot-swap one 'reflection' out and immediately replace it, keeping several off-line 'reflections' the same way you would tapes and using them as backups.

    I think 0+1 is faster too, since the '0' is a stripe. Of course, it's also more expensive for a given capacity. (A rough rebuild-time arithmetic sketch follows.)
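
    A rough Python sketch of why the rebuild window matters; the capacity and throughput figures are illustrative guesses, not measurements:

        # RAID5 rebuild time is roughly bounded by disk capacity divided by the
        # sustained rebuild throughput (reads from every survivor, writes to the spare).
        def rebuild_hours(disk_capacity_gb, rebuild_mb_per_sec):
            seconds = (disk_capacity_gb * 1024) / rebuild_mb_per_sec
            return seconds / 3600

        # Example: a 250 GB disk rebuilt at a sustained 20 MB/s (software RAID under
        # load is often much slower than the drive's raw speed) takes about 3.6 hours,
        # during which a second failure loses the array.
        print(f"{rebuild_hours(250, 20):.1f} hours")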

"And remember: Evil will always prevail, because Good is dumb." -- Spaceballs

Working...