
Quickly Switching Your Servers to Backups?
moogoogaipan writes "After a few days thinking about the quickest way to bring my website back online for internet users, I am still stuck at DNS. From experience, even if I set the TTL for my DNS zone file as low as 5 minutes, there are still DNS servers out there that won't update until days later (Yeah. I'm looking at you, AOL). Here is my situation. Say that I have my web servers and database servers at a remote backup location, ready to serve. If we get hit by an earthquake at our main location, what can I do in a few hours to get everyone to go to our backup location?"
BGP (Score:5, Informative)
Re:BGP (Score:5, Informative)
This is exactly what BGP (or OSPF feeding into your providers' BGP) is made for.
If you're big enough, you can get your own AS number and do this without having the same provider at each end (useful if the disaster that happens is that the software on all of provider X's core routers goes insane all at once, which happens from time to time).
DNS just can't be assumed to fail over fast enough for very-high-reliability services. You can do DNS right... low TTLs and all... and some providers just cache the results and do the wrong thing, and some client systems will never look up the changed data if the old IP stops responding, until someone restarts the browser or reboots the workstation.
BGP.
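For a sense of what that looks like in practice, here is a minimal sketch of a BGP-driven failover health check, assuming you front each site's router with ExaBGP (a tool the poster doesn't mention, so treat this as one possible setup rather than the prescribed one): a small process announces the service prefix while the local site passes its health check and withdraws it otherwise. The prefix, next-hop, and health URL are placeholders.

#!/usr/bin/env python3
# Hypothetical ExaBGP "process" script: ExaBGP reads announce/withdraw
# commands from this program's stdout and speaks BGP to the upstream router.
# The prefix, next-hop, and health-check URL are placeholders.
import sys
import time
import urllib.request

PREFIX = "192.0.2.0/24"       # service prefix announced from this site
NEXT_HOP = "198.51.100.1"     # this site's border router
HEALTH_URL = "http://127.0.0.1/healthz"

def site_healthy():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

announced = False
while True:
    healthy = site_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {PREFIX} next-hop {NEXT_HOP}\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {PREFIX} next-hop {NEXT_HOP}\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)

The backup site would run the same check against its own copy of the service, typically announcing the prefix less preferably (longer AS path, lower local-pref at the provider) so it only attracts traffic once the primary stops announcing.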
Re: (Score:3, Interesting)
In addition, make sure that your routers at both locations have a routable IP NOT in your IP block and use that as the source for your BGP sessions, AND make sure you can log in remotely using that address (perhaps from a short list of external IPs you control). That way you can log in to each and make sure the route is actually announced by only one of them. Not all failover situations will take your network down completely (but unless you have a way to take it down yourself, you'll sure wish it had).
Re: (Score:1)
Re: (Score:2)
Actually, BGP is certainly robust enough unless your provider configures their router very badly. The advantage if they do is that as a paying customer you have some leverage to get it corrected or change providers. By contrast, DNS configurations will be on 3rd party servers and so beyond corrective reach.
That's not to say there are no problems at all with BGP, but those typically have more to do with router admins who haven't yet noticed that a few formerly reserved ranges are actually valid now and oth
Re: (Score:1)
When I mentioned robustness I was really thinking of two specific issues:
-partial failures (where not all of your services are down)
-routing instability directly following a major 'disaster'
I've seen people use iBGP w/ route health injection to move traffic internally between DC's, but AS-at-a-time is not particularly granular.
Users well outside of a disaster e
Re: (Score:2)
For cases where a handful of servers go down in an otherwise functional facility, simple failover, where a backup server takes over the relevant IP address and/or a service is doubled up on another machine with the IP aliased, will do fine. For example, I run DNS on my web server, but block external queries. Should a DNS server go away, I add its address on the web server and add a permit rule to the firewall.
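To make that concrete, here is a rough sketch of the takeover step, assuming a Linux box with iptables; the interface name and addresses are invented for illustration, and your firewall and OS tooling will differ.

#!/usr/bin/env python3
# Rough sketch: take over a failed DNS server's IP on the local web server
# and permit external queries to it. Interface and addresses are placeholders.
import subprocess

FAILED_DNS_IP = "203.0.113.53"   # address of the DNS server that went away
IFACE = "eth0"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Alias the failed server's address onto this machine.
run(["ip", "addr", "add", f"{FAILED_DNS_IP}/32", "dev", IFACE])

# 2. Allow DNS queries to the aliased address (previously blocked externally).
for proto in ("udp", "tcp"):
    run(["iptables", "-I", "INPUT", "-p", proto, "--dport", "53",
         "-d", FAILED_DNS_IP, "-j", "ACCEPT"])

# 3. Nudge neighbours to update their ARP entries for the moved address.
run(["arping", "-c", "3", "-U", "-I", IFACE, FAILED_DNS_IP])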
Another strategy for single server failures is to NAT its IP over to its counterpart in the DR location and
Re: (Score:1)
Server Clusters (Score:2, Informative)
The only problem with locating them in two separate locations is that they need to be able to communicate with each other, keep identical copies of the website, and be able to connect to any databases you may need.
Basically any server
Re:Server Clusters (Score:5, Funny)
Depending on the industry, that's a very real problem.
Sysadmin: "Don't worry, we're already switched over to the hot spare, just get out of there!"
CIO: "What if the whole building goes?"
Sysadmin: "No worries. Remember that $1M we spent stringing all that fiber over to the other datacenter?"
CIO: "Oh yeah, the one in WTC 2!"
Sysadmin: "Aaw, shit."
Re: (Score:1)
Re: (Score:1)
The canonical example is "the failover system is on the floor below us" / "the AC failed and dumped 100 gallons of water into the machine room floor" / "oh shit".
Guaranteeing sub-15-minute failover is hard. Many IT departments in the industry aim for 5 minutes of downtime per year. An earthquake wasn't likely to hit NYC, and even the "worst-case scenario" of an airplane hitting the tower would only destroy one tower, and putting the hot spare in "the other tower" was a perfectl
Re: (Score:1)
Get the ISP involved (Score:4, Insightful)
Scale back your expectations (Score:5, Informative)
In the event of a major disaster, the need for "immediate" recovery is actually defined as being back up and running within 24 hours of the event. This is true even for business-critical functions. Unless your business would cease to exist within 24 hours of your website going down, I would consider a 72-hour return to service perfectly adequate, and it would cost a whole lot less time and money to set up. Keep in mind that we are talking about an eventuality that would only occur if your primary site were entirely disabled for an extended period of time, which is highly unlikely if you're hosted in any kind of modern data center.
Re: (Score:1)
We had a major failure of all systems one day. All the servers were up and running fine when we looked at them locally. It was the network that went down (not the database/web/file servers), the actual network blade thing. The blades themselves were working fine; we had communication on each blade, just not across blades. We asked the network guys why they didn't have a backup plan. We had to have 5 sets of backups, (backup on site,
Re:Scale back your expectations (Score:5, Insightful)
Re: (Score:2)
24 hours
x 60 minutes/hour = 1,440 minutes
x 365 days/year = 525,600 minutes
x $2,000,000/minute = $1,051,200,000,000
Over a trillion dollars! That's 100 times what even Google's total revenue is. It's a problem nobody has.
And let's face it, if your "website" is generating that kind of revenue, you wouldn't have an ISP. You'd BE an ISP, with your own peering, fiber, and so on.
Re: (Score:3, Insightful)
Also, consider that $2 million/minute x 60 minutes x 40 hours x 52 weeks comes out to $249,600,000,000. I bet Bank of America sees a billion dollars a day move. Peak transaction volume is often used when calculating potential loss, so it may only be $2 million/minute during the highest hour -- but that's always the hour you'll fail during.
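A quick sanity check of the arithmetic in the two comments above (using 365 days/year for the annual figure):

# Downtime-cost figures quoted above, recomputed.
cost_per_minute = 2_000_000                      # $/minute, as quoted

minutes_per_year = 24 * 60 * 365                 # 525,600 minutes
print(minutes_per_year * cost_per_minute)        # 1,051,200,000,000 (~$1 trillion)

# Peak-hours-only version: 60 min/h x 40 h/week x 52 weeks
print(cost_per_minute * 60 * 40 * 52)            # 249,600,000,000 (~$250 billion)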
Re: (Score:1)
Re:Scale back your expectations (Score:5, Insightful)
I wish I still had mod points left. Many people overlook that you should always compare the cost of a disaster/break-in/breakdown to the cost of being prepared for it. I've seen situations where over $10,000 was spent on a bug that would have cost about $200 in downtime. Similarly, I've seen a few thousand dollars spent fixing a bug that affected one customer who was paying $15 per month.
Re: (Score:2)
Re: (Score:2)
Defined by who? Someone without a passable command of the English language?
(If your answer is any sort of government agency, then I was right.)
geographical load balancing (Score:2)
Re: (Score:3, Informative)
Um (Score:2, Funny)
Re: (Score:2, Funny)
Location (Score:2)
Then again, even places like NYC fall victim to total power grid failure once in a blue moon, so you do want some type of clustering in place like the prior posters mentioned. I can't tell you how many times in IT I've heard someone say, "Some guy in Georgia just dug up a major fiber cable with his backhoe!"
Re: (Score:1)
For instance, try to compare the specs of the two locations. They're the same, except for the first page! But the link to the second location goes to the second page, so you can't even tell that it's a different location!
WHY IS THIS A FLASH SITE?
Re: (Score:3, Informative)
Yeah, I've heard of lots of people sweating and panicking because a backhoe was working somewhere near the datacenter. On site, with beads of sweat on their foreheads.
Re: (Score:3, Interesting)
Oops (Score:3, Funny)
Re:Location, Location, Location (Score:1)
DNS failover (Score:4, Informative)
The method is not perfect, but it is plenty good enough for our needs to protect against something that takes a datacenter down for a prolonged time (several minutes/hours/days). And the price
And to those who recommend avoiding "disaster prone" places: they all have people. People like the backhoe guy who took out the OC192 down the street. Or the core drillers who managed to punch both the primary and secondary optical links to a building of ours at a point where they were too close to each other.
You can roll your own by having a DNS server at each site, where DNS 1 always issues the IP of server 1 and DNS 2 always issues the IP of server 2. But there are a number of issues, like traffic hitting both sites at the same time. And you will have to detect more than just a down link, so you will end up scripting web-test and DNS-update systems. By the time you are done, you will have spent decades' worth of dnsmadeeasy fees.
Note: dnsmadeeasy isn't the only game in town. Just the one we happen to use.
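For anyone rolling their own as described above, the moving parts come down to a health check against the primary plus a dynamic update of the A record when it fails. Here's a minimal sketch using dnspython and RFC 2136 dynamic updates; the zone, names, and addresses are placeholders, and a real setup would add TSIG authentication, retries, and flap damping.

#!/usr/bin/env python3
# Rough sketch of a roll-your-own DNS failover loop.
# Zone, record name, addresses, and the DNS server are placeholders.
import time
import urllib.request
import dns.query
import dns.update

ZONE = "example.com"
RECORD = "www"                   # www.example.com
PRIMARY_IP = "192.0.2.10"
BACKUP_IP = "203.0.113.10"
DNS_SERVER = "198.51.100.53"     # authoritative server accepting dynamic updates
TTL = 60

def primary_up():
    try:
        with urllib.request.urlopen(f"http://{PRIMARY_IP}/", timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def point_record_at(ip):
    update = dns.update.Update(ZONE)
    update.replace(RECORD, TTL, "A", ip)
    dns.query.tcp(update, DNS_SERVER, timeout=10)

current = PRIMARY_IP
while True:
    desired = PRIMARY_IP if primary_up() else BACKUP_IP
    if desired != current:
        point_record_at(desired)
        current = desired
    time.sleep(30)

Even with this in place, the OP's original complaint still applies: resolvers that ignore your TTL will keep handing out the stale address, which is exactly why the BGP answers above exist.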
Re:DNS failover (Score:5, Informative)
The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.
As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.
Re: (Score:2, Informative)
dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.
excellent point (Score:3, Insightful)
It's incumbent on manglement to have useful plans, and you should help make what they have useful. Shift the end focus and present it to them.
Re: (Score:2)
Oddly enough, those of us who give a damn about whether or n
Buy "Scalable Internet Architectures" (Score:4, Informative)
Re: (Score:1)
What?! I'm just saying thanks.
Re: (Score:1)
Fail the IP address across (Score:2)
Your networking gurus will find that an interesting one. In reality though, a couple of days is probably good enough in the event of something like an earthquake; you may find that people have other things on their minds when that amount of shit hits the fan.
Still, kudos on planning for it. Most IT depts are taken completely by surprise and the business goes down in flames. A decent global filesystem (like AFS) helps massively.
Re:Fail the IP address across (Score:5, Insightful)
Really interesting point that seems to be overlooked. The CEO is concerned about getting everything back up and running (since statistically they have no heart or pulse), but the employees are more concerned about finding family members in the wreckage of their houses, cleanup, watching the kids because schools are shut down, etc.
Whatever you do, make it as automated as possible, and please, please, please don't forget to test. I've heard too many stories about everything looking okay until the emergency generator runs for several hours, vibrates a connection loose, and shuts down. It would pass the monthly test run, which was only 15 minutes long. "Hmm, power is out, and power poles are blown all over the streets, do I stay safely inside? Or do I brave a trip across town to try to flip a switch for my wonderful employer?"
Re: (Score:2)
I slept on floors and in people's cars while I got the network back online. Only then did I
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Only if you live there. Most people rent servers in a geographically different location, and their customers may easily be all over the world. If you live in the UK, your servers are in Texas, and your customers are in Australia, you only care about the disaster the way you'd care about just another news story.
The OP is not necessarily concerned about disasters though - power cuts (increasingly likely nowadays), cabl
Redirection (Score:2)
Other options would likely require messing with routing info. I think that is chancy at best.
Re: (Score:1)
Re: (Score:2)
Nope. It means that each client request for an address gets the next address in the list. A client does not do another DNS lookup when the port 80 connection fails.
This is true.
To address your final statement, geo-location to find the closest server and having a backup sites are two problems that ar
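To illustrate the behavior being discussed: the resolver hands the client every A record for the name, but whether a dead address gets skipped depends entirely on the client trying the next one. A small sketch of a client that does walk the list (the hostname is a placeholder):

#!/usr/bin/env python3
# A client that tries every A record instead of giving up on the first dead
# address. Many simple clients only ever try the first result.
import socket

HOST = "www.example.com"   # placeholder
PORT = 80

def connect_with_fallback(host, port):
    last_err = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            s = socket.socket(family, socktype, proto)
            s.settimeout(5)
            s.connect(sockaddr)
            return s                  # first address that answers wins
        except OSError as err:
            last_err = err            # dead address: try the next record
    raise last_err or OSError("no usable addresses for " + host)

conn = connect_with_fallback(HOST, PORT)
print("connected to", conn.getpeername())
conn.close()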
Stochastic Resilience (Score:4, Interesting)
Try asking at webhostingtalk.com (Score:3, Informative)
Re: (Score:1)
But your answer (and your posting as AC) certainly tell me which parts of WHT *you're* familiar with, and more importantly, which parts you aren't.
Cross site IP takeover (Score:2)
Back in the day I used Exodus to do this (Score:4, Informative)
It's called "BGP"... (Score:2)
Sean
Umm, doesn't that get complicated? (Score:2)
I personally thought that using VLANs would be a quicker way to go about it, but that's obviously a more localised solution. But you're right in looking for solutions at the network level; there's too much ignoring of DNS TTL values going on (AOL being the example we all love) to make any other measure quick enough.
= Ch =
Disaster Tolerant is the answer (Score:1)
A disaster tolerant system will have servers configured at multiple sites with real-time data updates. OpenVMS can do this with the remote sites far enough away that you need to fly there. "How do I restore service?" is the wrong question; ask "How do I avoid downtime even if a site fails?"
The Amsterdam Police IT group recently announced a 10-year uptime. OpenVMS.org [openvms.org] has details on celebrating 10 years of uninterrupted service.
Not so sure about this... (Score:2)
Re: (Score:1)
Re: (Score:2)
DNS is the way to go (Score:2)
The way to go is DNS. For an example of this, look at Akamai:
(Removed because of the fucking lameness filter. It was a very useful DNS lookup. Try 'dig images.apple.com' to see what I saw.)
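Since the lookup got eaten by the filter, here is a rough way to reproduce it without dig, printing the TTL on each record in the answer section so the short TTL on the final A record is visible (assumes dnspython is installed):

# Roughly what 'dig images.apple.com' would have shown: the CNAME chain in
# the answer section, with the TTL of each record.
import dns.rdatatype
import dns.resolver

answer = dns.resolver.resolve("images.apple.com", "A")
for rrset in answer.response.answer:
    for record in rrset:
        print(rrset.name, rrset.ttl, dns.rdatatype.to_text(rrset.rdtype), record)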
It's done using an extremely short TTL on the final A record. Obviously this handles the vast ma
ultradns have a product called .... (Score:2)
cheers,
Alex