The Internet

Quickly Switching Your Servers to Backups? 73

Posted by Cliff
from the fast-failover dept.
moogoogaipan writes "After a few days thinking about the quickest way to bring my website back to the internet users, I am still stuck at DNS. From experience, even if I set the TTL for my DNS zone file as low as 5 minutes, there are still DNS servers out there that won't update until a few days later (Yeah. I'm looking at you, AOL). Here is my situation. Say that I have my web servers and database servers at a remote backup location, ready to serve. If we get hit by an earthquake at our main location, what can I do in a few hours to get everyone to go to our backup location?"
This discussion has been archived. No new comments can be posted.

  • BGP (Score:5, Informative)

    by ckdake (577698) <ckdake @ c k d a ke.com> on Tuesday May 08, 2007 @05:19PM (#19043309) Homepage
    Same Provider at both (N) locations, Same IPs for servers/services, Just don't advertise the prefixes via BGP from the backup location until the primary one goes down.
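A minimal sketch of the "announce only when the primary is down" logic described above, assuming a watchdog at the backup site that polls the primary. `FailoverState`, `step`, and the three-failure threshold are hypothetical names and values chosen for illustration; the returned action strings stand in for whatever your router or BGP daemon actually needs, which this sketch does not try to model.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Hypothetical watchdog state kept at the backup site."""
    consecutive_failures: int = 0
    announcing: bool = False


def step(state: FailoverState, primary_up: bool, threshold: int = 3) -> str:
    """Return 'announce', 'withdraw', or 'hold' for one health-check result."""
    if primary_up:
        state.consecutive_failures = 0
        if state.announcing:
            # Primary is back: stop advertising the prefixes from the backup.
            state.announcing = False
            return "withdraw"
        return "hold"

    state.consecutive_failures += 1
    if not state.announcing and state.consecutive_failures >= threshold:
        # Primary has missed enough checks in a row: take over the prefixes.
        state.announcing = True
        return "announce"
    return "hold"
```

Requiring several consecutive failures before announcing is one way to avoid reacting to a single missed probe. Whether to withdraw automatically on recovery, as this sketch does, or leave that step to a human is a judgment call.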
    • Re:BGP (Score:5, Informative)

      by georgewilliamherbert (211790) on Tuesday May 08, 2007 @06:46PM (#19044895)
      Bingo.

      This is exactly what BGP (or OSPF feeding in to your providers' BGP) is made for.

      If you're big enough, you can get your own AS number and do this without having the same provider at each end (useful if the disaster that happens is that the software on all of provider X's core routers goes insane all at once, which happens from time to time).

      DNS just can't be assumed to fail fast enough for very high reliability services. You can do DNS right... low TTLs and all... and some providers just cache the results and do the wrong thing, and some client systems will never look up the changed data if the old IP stops responding, until someone reboots the browser or workstation.

      BGP.
      • Re: (Score:3, Interesting)

        by sjames (1099)

        In addition, make sure that your routers at both locations have a routable IP NOT in your IP block and use that as the source for your BGP sessions AND make sure you can log in remotely using that address (perhaps from a short list of external IPs you control). That way you can log in to each and make sure the route is actually announced by only one of them. Not all failover situations will completely take your network down (but unless you have a way to do it yourself, you'll sure wish it did).

        • DNS doesn't do the job seamlessly 100% of the time but BGP isn't robust enough to do the job alone. Solution = do both.
          • by sjames (1099)

            Actually, BGP is certainly robust enough unless your provider configures their router very badly. The advantage if they do is that as a paying customer you have some leverage to get it corrected or change providers. By contrast, DNS configurations will be on 3rd party servers and so beyond corrective reach.

            That's not to say there are no problems at all with BGP, but those typically have more to do with router admins who haven't yet noticed that a few formerly reserved ranges are actually valid now and oth

            • At the point where you are doing site/site DR via BGP you are most likely going to be running both your edge routers and your DNS based GSLB servers.

              When I mentioned robustness I was really thinking of two specific issues:
              -partial failures (where not all of your services are down)
              -routing instability directly following a major 'disaster'

              I've seen people use iBGP w/ route health injection to move traffic internally between DC's, but AS-at-a-time is not particularly granular.

              Users well outside of a disaster e
              • by sjames (1099)

                For cases where a handful of servers go down in an otherwise functional facility, simple failover will do fine: a backup server takes over the relevant IP address, or you double up the service on another machine and alias the IP. For example, I run DNS on my web server, but block external queries. Should a DNS server go away, I add its address on the web server and add a permit rule to the firewall.

                Another strategy for single server failures is to NAT its IP over to its counterpart in the DR location and

      • by opec (755488) *
        Could we possibly fit any more acronyms in there?
  • Server Clusters (Score:2, Informative)

    by Tuoqui (1091447)
    NLB (Network Load Balancing) Cluster, link the two together and have them both serve the website. Not only will it not go down (barring freak accidents like both locations being hit at once) but it will also have the added benefit of presumably double the bandwidth and such.

    Only problem is if you're locating them in two separate locations that they need to be able to communicate with each other and keep identical copies of the website and be able to connect to any databases you may need.

    Basically any server
    • by Tackhead (54550) on Tuesday May 08, 2007 @05:32PM (#19043579)
      > Only problem is if you're locating them in two separate locations that they need to be able to communicate with each other and keep identical copies of the website and be able to connect to any databases you may need.

      Depending on the industry, that's a very real problem.

      Sysadmin: "Don't worry, we're already switched over to the hot spare, just get out of there!"
      CIO: "What if the whole building goes?"
      Sysadmin: "No worries. Remember that $1M we spent stringing all that fiber over to the other datacenter?"
      CIO: "Oh yeah, the one in WTC 2!"
      Sysadmin: "Aaw, shit."

      • by akita (16773)
        I fail to see how this is flamebait.
      • by Anonymous Coward
        How is this flamebait?

        The canonical example is "the failover system is on the floor below us" / "the AC failed and dumped 100 gallons of water into the machine room floor" / "oh shit".

        Guaranteeing sub-15-minute failover is hard. Many IT departments in the industry aim for 5 minutes of downtime per year. An earthquake wasn't likely to hit NYC, and even the "worst-case scenario" of an airplane hitting the tower would only destroy one tower, and putting the hot spare in "the other tower" was a perfectl
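The "5 minutes of downtime per year" target corresponds to roughly 99.999% availability ("five nines"). A quick sketch of the conversion; `downtime_minutes_per_year` and `availability_for_downtime` are hypothetical helpers, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget in minutes per year for an availability such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_for_downtime(minutes: float) -> float:
    """Availability implied by a yearly downtime budget given in minutes."""
    return 1.0 - minutes / MINUTES_PER_YEAR


five_nines_budget = downtime_minutes_per_year(0.99999)  # a little over 5 minutes
single_outage = availability_for_downtime(15)            # one 15-minute failover
```

Under these assumptions, a single 15-minute failover in a year already uses about three times the five-nines budget.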

        • by Sobrique (543255)
          This is why many sites are now moving to a three-site model: a 'local failover' and a 'nearline async' copy. Certainly SAN vendors are on the bandwagon, so with EMC kit (and probably others, but the EMC one is the training course I've been on) you can do 'SRDF/STAR', which basically does synchronous storage replication between two sites, with asynchronous replication to a remote site (so it can be further away geographically; with synchronous Fibre Channel replicas you hit problems with latency as you go furth
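The latency point can be made concrete with a back-of-the-envelope estimate, assuming light travels through fiber at roughly 200,000 km/s (about two thirds of its speed in vacuum) and ignoring switching and protocol overhead. `round_trip_ms` is a hypothetical helper and the distances are illustrative only.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in fiber: ~200 km per millisecond


def round_trip_ms(distance_km: float) -> float:
    """Lower bound on the round-trip time over a fiber path of the given length."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


metro_rtt = round_trip_ms(100.0)    # two sites in the same region
remote_rtt = round_trip_ms(1000.0)  # a site far enough away to survive a regional event
```

A synchronous write cannot be acknowledged until the remote copy confirms it, so every write pays at least one round trip. That is tolerable at metro distances and increasingly painful further out, which is the usual argument for making the distant copy asynchronous.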
  • by MeanMF (631837) on Tuesday May 08, 2007 @05:21PM (#19043345) Homepage
    Talk to your ISP. They can set it up so the IP addresses at the main location can be rerouted to the DR site almost instantly.
  • by eln (21727) on Tuesday May 08, 2007 @05:25PM (#19043413) Homepage
    You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake.

    In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event. This is true even for business critical functions. Unless your business would cease to exist within 24 hours if your website went down, I would consider a 72 hour return to service to be perfectly adequate, and it would cost a whole lot less time and money to set up. Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center.
    • Talk to trading companies: 5-10 minutes could mean millions (so that VP kept telling me).

      We had a major failure of all systems one day. All the servers were up and running fine when we looked at them locally. The network went down, (not the database/web/file servers) the actual network blade thing. The network blades were working fine. We had communication on each blade, just not cross blades. We asked the network guys why they didn't have a backup plan. We had to have 5 sets of backups, (backup on site,
      • by eln (21727) on Tuesday May 08, 2007 @06:17PM (#19044409) Homepage
        Ok sure, but if you're the kind of company that is making millions of dollars a minute through your website, you're paying qualified IT professionals to go out and spend a bundle of money developing an architecture that will allow full global load balancing with constant mirroring, probably with dedicated circuits between sites. You are definitely not posting Ask Slashdot articles about how to get around other ISPs' annoying habit of holding on to DNS records for too long.
        • by vux984 (928602)
          Even if it's only 2 million dollars (which is about as low as it can be and be called 'millions')...

          24 hours
          x60 minutes/hour = 1440 minutes
          x365 days/year = 525,600 minutes
          x 2,000,000 $/minute = 1,051,200,000,000 $

          A trillion dollars! 100 times what even Google's total revenue is. It's a problem nobody has.

          And let's face it, if your "website" is generating that kind of revenue, you wouldn't have an ISP. You'd BE an ISP, with your own peering, fiber, and so on.
          • Re: (Score:3, Insightful)

            by autocracy (192714)
            A million dollars in revenue isn't what he meant, I think. A bank's revenue, for example, is much less than its transaction amount. A stock exchange as well.

            Also, consider that $2 million x 60 x 40 x 52 comes out to $249,600,000,000. I bet Bank of America sees a billion dollars a day move. Peak transaction volume is often used when calculating potential loss, so it may only be $2 million / minute during the highest hour -- but that's always the hour you'll fail during ;)
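The two totals in this sub-thread can be checked directly. A quick sketch, with `yearly_total` as a hypothetical helper; the $2 million/minute rate, the round-the-clock schedule, and the 40-hour, 52-week schedule all come from the comments above.

```python
DOLLARS_PER_MINUTE = 2_000_000

ROUND_THE_CLOCK_MINUTES = 24 * 60 * 365  # every minute of a 365-day year
BUSINESS_HOURS_MINUTES = 60 * 40 * 52    # 40 hours a week, 52 weeks a year


def yearly_total(dollars_per_minute: int, minutes_per_year: int) -> int:
    """Total dollars moved in a year at a constant per-minute rate."""
    return dollars_per_minute * minutes_per_year


all_year = yearly_total(DOLLARS_PER_MINUTE, ROUND_THE_CLOCK_MINUTES)
office_hours = yearly_total(DOLLARS_PER_MINUTE, BUSINESS_HOURS_MINUTES)
```

Both figures describe volume moved, not revenue lost, which is the distinction the replies go on to argue about.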
            • by vux984 (928602)
              How can not being able to move X dollars be construed as an X dollar "loss"? If my online banking goes down and I'm unable to place a web order to transfer my Mom X, move Y from chequing to savings, and purchase Z mutual fund for an entire day, it's absurd to suggest the bank lost X+Y+Z. Even *I* didn't lose X+Y+Z from the inconvenience, and it's my money.
    • by CrankyOldBastard (945508) on Tuesday May 08, 2007 @07:02PM (#19045151)
      "You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake."

      I wish I still had mod points left. Many people overlook that you should always compare the cost of a disaster/break-in/breakdown to the cost of being prepared for it. I've seen situations where over $10,000 was spent on a bug that would have cost about $200 in downtime. Similarly I've seen a few thousand dollars spent fixing a bug that affected one customer who was paying $15 per month.

    • by CBravo (35450)

      Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center
      Ha, this happened to IBM last year. The whole site in Belgium burned down when they tried to upgrade some electricity. Our contracts stated 72 hours but it became 1.5 weeks.
    • by afabbro (33948)
      In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event.

      Defined by who? Someone without a passable command of the English language?

      (If your answer is any sort of government agency, then I was right.)

  • A geographical load balancing solution, such as Coyote Point's Envoy [coyotepoint.com] or F5's Global Traffic Manager [f5.com]. Very expensive though.
    • Re: (Score:3, Informative)

      F5 mainly uses DNS for its Global Traffic Management solution. There are other bits and pieces, but that is the core, really.
  • Um (Score:2, Funny)

    by Anonymous Coward
    You could hire an actual IT administrator who knows what they're doing? Like, one who's actually trained?
    • Re: (Score:2, Funny)

      by CRiMSON (3495)
      But then it wouldn't be another half-assed implementation. Come on what were you thinking.
  • I hate to say this, but sometimes you want to avoid hosting locations in places that are earthquake, hurricane, or just natural disaster prone if it is that critical.

    Then again, even places like NYC are victims of total power grid failure once in a blue moon, so you do want some type of clustering in place like the prior people mentioned. I can't tell you how many times in IT I've heard someone say, "Some guy in Georgia just dug up a major fiber cable with his backhoe!"
    • Re: (Score:3, Informative)

      by bahwi (43111)
      I've got my personal/small business server (it does nothing but serve a crappy webpage, so it's not critical) down in Florida. It's gone down because of router issues but never a hurricane, and oh yes, it's been hit. I actually think those places may be better as they are built to weather those types of storms.

      Yeah, I've heard lots of people sweating and panicking because a backhoe was working somewhere near the datacenter. On site and beads of sweat on their forehead.
      • Re: (Score:3, Interesting)

        by walt-sjc (145127)
        There is an AT&T data center in Virginia that was hit by a tornado. Our servers are there. In the part that got hit hardest, water was pouring in and down onto a few racks of servers. The servers were still up, but they powered down that section of the data center for safety reasons. Our servers were fortunate not to be affected, and AT&T kept them running throughout the whole ordeal (power grid was down too, so they were on generator for a couple days.) BTW, that was the "before SBC" AT&T.
    • Oops (Score:3, Funny)

      by tedhiltonhead (654502)
      That was me... sorry... my bad. FSB's (Fiber Seeking Backhoe) are tough to control.
    • That's what makes places like Boise, Idaho ideal areas for datacenters: 1) No natural disasters 2) Good Power Grid 3) Cheap HydroPower 4) Unused Fiber Networks. solutionpro is the way to go
  • DNS failover (Score:4, Informative)

    by linuxwrangler (582055) on Tuesday May 08, 2007 @05:42PM (#19043787)
    We have used DNS failover from dnsmadeeasy.com for a couple years and have put it to the test a couple times. They have had perfect reliability and a low cost (typically well under $100/year).

    The method is not perfect, but it is plenty good enough for our needs to protect against something that takes a datacenter down for a prolonged time (several minutes/hours/days). And the price

    And to those who recommend avoiding "disaster prone" places: they all have people. People like the backhoe guy who took out the OC192 down the street. Or the core drillers who managed to punch both the primary and secondary optical links to a building of ours at a point where they were too close to each other.

    You can roll your own by having a DNS server at each site and DNS 1 always issues IP of server 1 while DNS 2 always issues IP of server 2. But there are a number of issues like traffic hitting both sites at the same time. And you will have to detect more than just a down link so you will be scripting web test and DNS update systems. By the time you are done, you will have spent decades' worth of dnsmadeeasy fees.

    Note: dnsmadeeasy isn't the only game in town. Just the one we happen to use.
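A rough sketch of the "web test" half of a roll-your-own setup, assuming the check must confirm more than a live link: the page has to return 200 and contain an expected marker string. `http_probe`, `record_to_serve`, and both IP addresses (taken from the documentation ranges) are hypothetical, and pushing the chosen record to your DNS servers is left out because it depends entirely on the DNS software in use.

```python
import http.client
import urllib.request
from typing import Callable

PRIMARY_IP = "192.0.2.10"    # placeholder address for site 1
BACKUP_IP = "198.51.100.10"  # placeholder address for site 2


def http_probe(ip: str, host: str, marker: str, timeout: float = 5.0) -> bool:
    """Return True only if the site answers 200 and the page contains `marker`."""
    request = urllib.request.Request(f"http://{ip}/", headers={"Host": host})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
            return response.status == 200 and marker in body
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP error statuses, malformed replies.
        return False


def record_to_serve(probe: Callable[[str], bool]) -> str:
    """Pick the A record the DNS servers should hand out right now."""
    return PRIMARY_IP if probe(PRIMARY_IP) else BACKUP_IP
```

In use, something like `record_to_serve(lambda ip: http_probe(ip, "www.example.com", "OK"))` would run on a schedule from a vantage point outside both sites. Even then it only changes what your own DNS servers answer; it does nothing about records already cached elsewhere.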
    • Re:DNS failover (Score:5, Informative)

      by Joe5678 (135227) on Tuesday May 08, 2007 @06:27PM (#19044605)
      dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.

      The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.

      As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.
      • Re: (Score:2, Informative)

        Disclosure time: I deal with this stuff every day and I have a vested interest in commercial solutions to this issue.
        dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.

        The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.

        As another poster already me
  • by tedhiltonhead (654502) on Tuesday May 08, 2007 @06:01PM (#19044171)
    For a real answer, buy Theo Schlossnagle's book, "Scalable Internet Architectures". Theo presented a lengthy and highly-informative session at OSCON last year, and I subsequently bought and read his book. Worth every penny if you're professionally involved in providing reliable Internet services of any kind.
  • Rather than the name.

    Your networking gurus will find that an interesting one. In reality though, a couple of days is probably good enough in the event of something like an earthquake, you may find that people have other things on their minds when that amount of shit hits the fan.

    Still, kudos on planning for it. Most IT depts are taken completely by surprise and the business goes down in flames. A decent global filesystem (like AFS) helps massively.

     
    • by QuantumRiff (120817) on Tuesday May 08, 2007 @06:30PM (#19044635)
      you may find that people have other things on their minds when that amount of shit hits the fan.

      Really interesting point that seems to be overlooked. The CEO is concerned about getting everything back up and running (since statistically they have no heart or pulse), but the employees are more concerned about finding family members in the wreckage of their house, cleanup, watching the kids because schools are shut down, etc.

      Whatever you do, ensure it is as automated as possible, and please, please, please don't forget to test. I've heard too many stories about everything looking okay, until the emergency generator runs for several hours, vibrating a connection loose and causing it to shut down. It would pass the monthly test run, which was only 15 minutes long. "Hmm, power is out, and power poles are blown all over the streets, do I stay safely inside? Or do I brave a trip across town to try to flip a switch for my wonderful employer?"
      • by mabu (178417)
        Speaking as the owner of an ISP that was hit by Hurricane Katrina, I can comment on this... while everyone else in the family was looking for a place to live, my customers demanded their web sites be back online ASAP, so it took me 24-48 hours to get everything back online even though our network never went down - the military took over the building and messed with the generator transfer switch and screwed things up.

        I slept on floors and in people's cars while I got the network back online. Only then did I
      • by jrvz (734655) *
        Re: testing. I remember a command center that ran on a generator only as long as the inside tank had fuel. When they tried to switch to the outside tank, the generator died. There was water in the connecting pipe, which had frozen.
    • by gbjbaanb (229885)
      in the event of something like an earthquake, you may find that people have other things on their minds

      Only if you live there. Most people rent servers in a geographically different location, and their customers may easily be all over the world. If you live in the UK, your servers are in Texas, and your customers are in Australia, you only care about the disaster like it's just another news story.

      The OP is not necessarily concerned about disasters though - power cuts (increasingly likely nowadays), cabl
  • If you can get somebody with multiple already redundant locations to put up a redirection page, then you are all set, since you can basically ride on their redundancy. Personally I think this is the cheapest solution. I don't see any way around DNS RFC violators.

    Other options would likely require messing with routing info. I think that is chancy at best.
  • by DamonHD (794830) <d@hd.org> on Tuesday May 08, 2007 @08:08PM (#19045957) Homepage
    Hi,

    An alternative is to forget the all-or-nothing view: with some simple round-robin DNS and enough geographically separated servers for the DNS and HTTP/whatever, even if one is taken out by a quake or Act of Congress (ewwww, those nature programmes), *most* users will still get through just fine. Any clients/proxies that are smart and can try out multiple A records for one hostname will always get through if even one of your servers is reachable.

    Example: my main UK server failed strangely yesterday morning, but only about 30% of my visitors can even have noticed, and the other servers worldwide took up some of the load. Just simple and reliable and cheap round-robin DNS.

    Rgds

    Damon
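A small sketch of what a "smart" client does with round-robin DNS, assuming it simply walks the list of returned addresses until one accepts a connection. `resolve_all` and `first_reachable` are hypothetical helper names; the `connect` parameter exists only so the logic can be exercised without a network.

```python
import socket
from typing import Callable, List, Optional, Tuple


def resolve_all(host: str, port: int) -> List[str]:
    """Return every distinct address the resolver hands back for `host`."""
    addresses: List[str] = []
    for *_ignored, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def first_reachable(
    addresses: List[str],
    port: int,
    connect: Callable = socket.create_connection,
    timeout: float = 3.0,
) -> Optional[Tuple[str, object]]:
    """Try each address in order; return (address, connection) for the first that works."""
    for address in addresses:
        try:
            connection = connect((address, port), timeout)
        except OSError:
            continue  # this server is down or unreachable, try the next record
        return address, connection
    return None  # every listed server failed
```

A client that stops at the first address and gives up is the one that notices the outage, which is consistent with only a fraction of visitors being affected when one of several servers fails.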
  • by Wabbit Wabbit (828630) on Tuesday May 08, 2007 @08:24PM (#19046117)
    The industry pros discuss this sort of thing there all the time. The colocation sub-forum would be the best place to ask. I know that sounds odd, but that's the area on WHT where the best network/transit/BGP people hang out.
  • You need cross-site IP address takeover. You can accomplish this generically with BGP (but if you're asking these questions, I'd stay away from this for now), or work with your ISP to set up a simple way to accomplish this.
  • by ejoe_mac (560743) on Tuesday May 08, 2007 @11:11PM (#19047535)
    When $ is no issue, a tier 1 colocation provider with their own services would be the best option. They've got big pipes, and will work with you to have the additional services needed. I'd go as far as to say that you're going to want to have a failover script that they would follow in the event of site A going offline. You'd need redundant equipment, or use a DR firm for getting back up.
  • You can move a block of IP addresses, most sites will honor an advertisement of a /24 block, in my experience. With BGP you can cause this IP block to start getting routed to equipment in another part of the world. In other words, you can keep your DNS the same and cause the IP addresses to move. No DNS propagation time required. BGP changes can propagate in a minute or two, unless it's been flapping and remote routers have dampened the route.

    Sean
    • It's been a long time since I've been near any kind of routing, wouldn't that require access to two separate Autonomous Systems? I'd be interested to know how this works, good refresher :-)

      I personally thought that using VLANs would be a quicker way to go about it, but that's obviously a more localised solution. But you're right in looking for solutions at network level, there's too much ignoring of DNS TTL values going on (AOL being the example we all love) to make any other measure quick enough.

      = Ch =
  • A disaster-tolerant system will have servers configured at multiple sites with real-time data updates. OpenVMS can do this with the remote sites being far enough away that you need to fly there. "How do I restore service?" is the wrong question; ask "How do I avoid downtime even if a site fails?"

    The Amsterdam Police IT group recently announced a 10 year uptime. OpenVMS.org [openvms.org] has details on celebrating 10 years of uninterrupted service.

  • But, can't you use a CN for your external so that clients are forced to come to your DNS server to get the actual IP? I'm not sure it's the most efficient way, but then again your TTL is only 300. Like you make yourself the SOA and only use CN, and then just point the CN to the currently running server? It *would* mean having to have failover DNS servers though. I'm no admin, but I do admin-type work from time to time (read: I admin a single server where I have an internship that only touches the intran
    • by spydum (828400)
      The complaint is not that his DNS server does not act appropriately, but that there are providers on the net who run software that cache responses for far longer than the norm. Those folks will have old IP addresses in their cache for anywhere from 24-48 hours (sometimes longer, but that is rare).
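A toy model of the caching problem described above, assuming a resolver that enforces its own minimum cache lifetime regardless of the zone's TTL. `stale_seconds` and its parameters are hypothetical; the 24-hour floor is taken from the lower end of the 24-48 hour range mentioned above, not from any measured resolver.

```python
def stale_seconds(zone_ttl: int, resolver_min_ttl: int, cached_at: int, failover_at: int) -> int:
    """Seconds after failover during which this resolver still serves the old IP."""
    effective_ttl = max(zone_ttl, resolver_min_ttl)  # the resolver's floor wins if larger
    expires_at = cached_at + effective_ttl
    return max(0, expires_at - failover_at)


# Worst case: the record was cached at the very moment the primary site failed.
well_behaved = stale_seconds(zone_ttl=300, resolver_min_ttl=0, cached_at=0, failover_at=0)
misbehaving = stale_seconds(zone_ttl=300, resolver_min_ttl=86_400, cached_at=0, failover_at=0)
```

Under these assumptions a well-behaved resolver is stale for at most the 5-minute TTL, while one with a 24-hour floor keeps sending users to the dead address for up to a day, and nothing in the zone file can shorten that.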
      • by Gazzonyx (982402)
        Ah, I was thinking they would cache his DNS server's IP instead of what his server resolved the server to, like a single world IP and then treating the servers as internal IPs. I withdraw my idea, it probably wouldn't work at all. I was truly half asleep when I posted that; not sure what I was thinking. Oh, yeah, I was thinking about studying for finals... speaking of which...
  • You don't want to mess with BGP unless you have plenty of money to have a redundant location, and a large enough IP block to justify it. You may find an ISP that has this set up or their own block, but I don't know of any.

    The way to go is DNS. For an example of this, look at Akamai:
    (Removed because of the fucking lameness filter. It was a very useful DNS lookup. Try 'dig images.apple.com' to see what I saw.)

    It's done using an extremely short TTL on the final A record. Obviously this handles the vast ma
  • sitebacker - that will change your dns within 3m of your site going down.

    cheers,

    Alex
