
HA Metrics on Non-clustered Systems?

javester asks: "Has anybody given any thought to compiling high-availability metrics for the different OSs on a non-clustered system? Is 'high-availability Windows' an oxymoron? Can you even get close to the 5 9s (99.999%, which is about 5 minutes of downtime a year) on a typical stand-alone Windows 2000 Server running IIS 5 with the typical patch-it-up, three-finger-salute routine? If you are just serving up a web site on a plain-vanilla Windows box, how highly available can it get? By my calculations, with the typical reboot cycle being 3 minutes, and with a security patch requiring a reboot released on a weekly basis, a stand-alone Windows box loses about 156 minutes a year just applying patches! So it can never get past the third 9! In *nix environments, reboots are not required as often (except for kernel changes - how APT!), since you can recycle the appropriate daemon without restarting. But really, has anybody made a formal study?" Now wouldn't this make an interesting college project?
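The questioner's arithmetic is easy to check. A minimal sketch, where the 3-minute reboot and weekly patch cadence are the question's own assumptions rather than measured figures:

```python
# Back-of-the-envelope availability math from the question above.
# Assumptions (from the question, not measured): a 3-minute reboot,
# and one reboot-requiring security patch per week.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960

def downtime_budget(nines):
    """Minutes of downtime per year permitted at N nines of availability."""
    return MINUTES_PER_YEAR * 10 ** -nines

def availability(downtime_minutes):
    """Fraction of the year the box is up, given total downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR

patch_downtime = 52 * 3  # 156 minutes/year lost to patch reboots

print(f"five-nines budget: {downtime_budget(5):.2f} min/yr")
print(f"four-nines budget: {downtime_budget(4):.2f} min/yr")
print(f"patch-only uptime: {availability(patch_downtime):.4%}")
```

On these assumptions the 156 minutes alone blows the four-nines budget (about 52.6 minutes a year) but still fits inside the three-nines budget (about 526 minutes), before counting any unplanned outages.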
  • Regardless of the operating system, how can you make a high availability system out of a PC? PCs are designed to be cheap, not reliable or maintainable.

    When I think of high availability, I think of systems with redundancy and hot backups that can switch to alternate hardware in a few seconds if a failure is detected.

    • That's true if by PC you mean "beige box."

      But not if you mean "Intel (or compatible) based system."

      You can get a "PC" with redundant power, "lights out" management, hot-plug PCI, hot-plug RAID disks, etc.

      -Peter
  • Not to mention that Windows 95 had a bug where it would blue screen after 49.7 days of continuous uptime (its 32-bit millisecond tick counter overflowing): after doing absolutely nothing.

    But yes, it would be a neat project if someone put in the resources to do it right.
    • Ok, I'm not about to defend Microsoft but this post is so typical of Slashdot that it makes me want to vomit.

      We're running a hypothetical web server on a Windows box, right? I'm sure we have non-essential services disabled then... (Since not doing so would immediately qualify us as a worthless paper MCSE who should go home and leave system administration to Linux geeks.) That would mean that the last time we needed to patch our system was June 10th. Five patches this year (if we are running Index Server). And if I recall correctly, the Index Server patch is one of the increasingly common no-reboot variety. I realize that Windows has its problems, and I would never claim that Linux has as many, but why does it give some people intense pleasure to grossly exaggerate the degree of Windows insecurity?

  • I met with [insert name of huge telecom company here] this week. My boss and I discussed uptime with the regional salesperson. He talked about "five 9s" uptime on Windows like it was nothing. Now, jump forward to yesterday afternoon, when their engineers called me asking about our proposed configuration... I told them, and asked what the approximate guaranteed uptime would be... the answer was about "89-92%". heh... gotta love engineers being honest.

    sorry... just thought it was funny...
  • by Anonymous Coward
    First, the troll part is that patches for Microsoft's OS are not released weekly. Moreover, the number of patches that apply to your situation and your installed software will only come up monthly, at most.

    Second, this nonsense of 5, 6, or 7 nines of availability is bullshit. No one is using the box that often. Cluster, bring boxes down during the normal 9-5 workday, fix them, and bring them back online. Give the support staff a life; reduce cost by having only one shift, with a rotating pager.

    Third, this whole Six Sigma nonsense on availability is bullshit. A box that has to be available every minute of the day is the weak link in the production chain. Go back, cluster, and re-architect the whole thing.

    Fourth, most of the boxes I see with high uptime (as in years of uptime) aren't doing squat. You're effectively saying that a box like that was set up perfectly, with perfect software, perfect administration, perfect hardware, and no part of the software needed updating, patching, or restoring from a backup, ever, nor did any of the hardware age, get flaky, fail, need expansion, or anything else that would normally happen. You've been to Vegas; calculate the odds of it all working out wonderfully.

    Fifth, having two or more boxes clustered is cheaper, and makes more sense, than trying to keep one box up for years at a time. The space shuttle has, what, 5 or 7 computers, checking each other? DNS and mail servers have secondaries. In fact, anyone who's smart, and plans to make money, doesn't put all their computing eggs into one basket.
  • by embobo ( 1520 ) on Saturday September 22, 2001 @02:23AM (#2333668) Homepage

    Where are the _real_ (not marketdroid) numbers for HA systems? My impression is that companies never want to give out uptime numbers. If you'd like to do any sort of study, I suspect you can't just walk up to IBM and say: "Hi! I'm a college student. Can I have your HA stats?" It will take much negotiation and then years to collect the data. That is: not a college project, unless your advisor is Ghod and can bend companies to his will, and you start the study as a freshman.

    "...reboots are not required as often, since you can recycle the appropriate daemon without restarting." Umm, you can? What gave you the impression that this was the case in general? Certainly for simple apps like sendmail or apache one can do this, but any more complicated app is much more involved. For example, take the recent update of /.: can you update a Slash-based site merely by cycling (HUPing) a daemon? No, there was a huge database conversion involved.

    Finally, it is going to be very hard to get statistically significant results with such a high degree of precision. For example, mon [kernel.org] will track downtime. It says that my mailserver has been available 99.95% of the time over the past two years. I know that is wrong (and mon notes this as well). The server hasn't gone down at all, but the method of measurement (snmp) has introduced error, because its own availability isn't 100%.
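The parent's measurement-error point can be illustrated with a toy simulation; the 0.05% probe-failure rate below is an invented figure for illustration, not anything mon actually reports:

```python
import random

# Toy illustration of monitoring error: a server that is genuinely up
# 100% of the time, polled by a monitor whose probes occasionally fail.
# The probe-failure rate is an invented figure, not a real mon statistic.
random.seed(1)
probe_failure_rate = 0.0005  # assumption: 0.05% of polls fail
polls = 200_000

observed_up = sum(1 for _ in range(polls)
                  if random.random() > probe_failure_rate)

measured = observed_up / polls
print(f"measured availability: {measured:.4%}")
# Reads as slightly under 100% even though the server never went down.
```

The monitoring path itself becomes part of the measurement, so at five-nines precision the instrument's error can dwarf the quantity being measured.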

  • In the real world, where uptime of a service matters, it's not about the availability of the box but rather the application. You can achieve 5 9s of uptime on an application, even with Windows machines, given proper management. This is why HA clusters are so important, and anyone who runs a business-critical application knows this.

    If you have a service that absolutely has to be up one machine isn't the answer...but of course it will cost you.

    -aaron
  • Never Reboot? (Score:3, Informative)

    by pete-classic ( 75983 ) <hutnick@gmail.com> on Saturday September 22, 2001 @01:14PM (#2334790) Homepage Journal
    In *nix environments, reboots are not required as often (except for kernel changes[...])

    I've never used it, but AFAIK you can even get around this with the HURD.

    -Peter
  • Discounting special environments such as Tandem (which require a LOT of rewriting applications to work) there is no way to get "high availability" out of any single system, for just the reasons you describe. Patches, hardware/software upgrades, OS panics, sysadmin stupidity, leaky roofs, twitchy circuit breakers, fiber-seeking backhoes, and a myriad of other minor and major catastrophes can't simply be put on hold. Hence, any single system is likely to have availability in the 80-95% range in a typical environment.

    If you beef up things with advanced monitoring, very reliable hardware, well trained staff, good vendor support, and are generally very careful, you can get a single system up to about the 99% range. This is about 87 hours of downtime per year-- enough for quarterly patches, bi-annual hardware work, and the occasional problem. Most of this is scheduled, and we're likely to only see about 12 hours of "surprises" a year, even if it's not a good year.

    Yes, I'm pulling most of these numbers out of my ass, but my ass has been around the block a couple times. Feel free to fudge the numbers if your experience varies. Changing the 12 hours around won't make that much difference in the later calculations. (below)

    Assuming 365.25 days per year, that 12 hours means we have a "no surprises" uptime of roughly 99.8 percent. Not hard if you're running things in a stable environment.

    The only way to get those extra two nines is to start running things with spares. If you have a spare system and are careful not to build an environment where both can fail simultaneously (e.g. both on the same power grid), the chance of running into a simultaneous surprise becomes 1 - (0.002 * 0.002) = 0.999996 (99.9996%). Voila! Five nines. (You don't multiply the original 99% since I'm assuming that you're smart enough not to schedule patches for the same day on both primary and standby systems.) In order to be able to multiply the probabilities together they do need to be truly independent of one another-- again, not on the same power grid, in the same building, etc.

    In short-- there's no way to get five nines from a single system. You need clusters of independent systems.
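The failover arithmetic in the comment above, as a sketch (reusing its 0.2% unscheduled-downtime figure and its independence assumptions):

```python
# The comment's failover arithmetic: two systems, each up 99.8% of the
# time outside scheduled maintenance, with truly independent failures
# (separate power grids, buildings, etc.) and maintenance never
# scheduled on both at once. The 0.2% figure is the comment's own
# admittedly rough estimate.
single_down = 1 - 0.998       # 0.2% chance of a "surprise" outage

both_down = single_down ** 2  # both out at the same moment
pair_up = 1 - both_down

print(f"pair availability: {pair_up:.6f}")  # 0.999996 -- five nines
```

The multiplication is only valid while the failures stay independent; a shared power feed, switch, or roof correlates the two outage probabilities and quietly destroys the extra nines.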
  • by addaon ( 41825 )
    Perhaps I'm missing something, so please, enlighten me, don't label me a troll.

    In what situation, ever, does the uptime of a single system matter? It doesn't matter if my server has one nine (90%) uptime, because I'd be crazy to have just one server. Instead, it's the uptime of the cluster/infrastructure/system and fallback that matters, right? I mean, uptime isn't just a feel-good-about-lots-of-nines number; it's supposed to reflect how often your site/service is unavailable. Which, with only a little work and some redundancy, is as close to never as you choose, even when a single system explodes in a fireball. No?
