Dealing with Development House Disasters?
Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens...you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas/horror stories regarding avoiding, dealing with, and getting past whatever disasters have occurred at your development houses. Have at you!"
Secure the data (Score:5, Informative)
Next, ensure you can get your backup data when you need it. You mention a colo; I'm assuming you have a duplicate system there. Can you get access at 22:15 on July 4th?
Now draw up your usual emergencies: fire, flood, tornado, earthquake, etc. Have a plan for how to get your systems up and running. Do you need to rent office space? What about net connectivity?
Lastly, re-check your data backups: do you have everything, is it error-free, do you have more than one copy? If you have the data, your company can recover.
No data, no job, it's as simple as that.
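To make the "is it error-free, do you have more than one copy" check concrete, here's a minimal sketch in Python. It assumes plain-file backups rather than tape, and all names are illustrative:

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large backups don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copies(original: Path, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or no longer match the original."""
    want = checksum(original)
    return [c for c in copies if not c.exists() or checksum(c) != want]
```

Run something like this on a schedule against every copy, on-site and off-site; a backup you've never verified is a backup you don't have.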
Re:Secure the data (Score:4, Informative)
And something else as important: Where are your backup tapes?
If they are sitting in a locked cabinet right next to the computers, they won't survive the asteroid blast either.
Off-site backups. A pain to maintain, but a good idea for any contingency plan.
Re:Secure the data (Score:1)
Re:Secure the data (Score:1)
Uh, maybe I'm being clueless (Hey, I've just come back from the pub so I'm a bit fuzzy!), but don't transformers tend to have really big f*ckoff magnets in them?
;)
If so I'd imagine your tapes might well be very safe, but they'll also be very blank...
Re:Secure the data (Score:1)
Re:Secure the data (Score:4, Funny)
Re:Secure the data (Score:1)
Just a quick addition to this (which I know doesn't really apply to heavily IT-dependent companies such as software development firms or co-lo companies, but is relevant for the large majority of companies) -
(after checking your backups actually restore) it may be worth keeping your contingency plans minimal, e.g. just covering critical systems with a few spare workstations, and considering this option:
how much would it cost to run the office the good ol' fashioned manual paper way for, say, up to a week?
A certain friend of mine (Score:5, Funny)
A friend of mine once gave a response that was less gentle:
Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.
There is one manager for every 5 non-managers, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.
We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?
We don't need a contingency plan, sir, we ARE IN the contingency plan.
Get real, sir.
Still kept his job. Ok, maybe he wasn't that snotty...
The Sky is falling!!! (Score:4, Funny)
This was an architectural firm.
Rebar, or Rerod, if you're in Missouri NT (Score:1)
Re: re-bar (Score:2)
a peice of rebarb(sp?) through the motherboard
This stuff is called 're-bar', which is short for 'reinforcing-bar'. It is a metal rod about a half-inch in diameter (there are larger/smaller versions) that is used to add strength to concrete structures. Re-bar is made with a coarse pattern on the outside so the concrete can get a grip.
For those of you who are still lost, this is the stuff that Cordelia fell on in 'Lover's Walk', an episode from the second or third season of Buffy the Vampire Slayer.
Re: re-bar (Score:2)
I find that the following explanation is a better example:
When you were a kid, it was the metal bars you stole from the new house next door to play "Darth Vader vs. Luke Skywalker".
Partial answer: do dry runs of the plans (Score:4, Insightful)
For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.
Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.
As well as not-so-dry runs (Score:2)
A great way to test the UPS batteries and auto-shutdown software is to walk over to the wall and yank the power cord of the UPS out of the socket.
Plugging it back in after 30 seconds is a good way to test the "power came back, cancel the shutdown" part of the software, too.
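The "power came back, cancel the shutdown" behavior described above boils down to a cancellable grace timer. A minimal sketch, assuming hypothetical `on_battery`/`on_line_power` callbacks from whatever UPS monitor you run:

```python
import threading

class ShutdownGuard:
    """Sketch of 'power failed, shut down unless power returns in time' logic."""

    def __init__(self, grace_seconds: float, shutdown) -> None:
        self.grace = grace_seconds
        self.shutdown = shutdown   # callable invoked if power stays out
        self._timer = None

    def on_battery(self):
        # Power lost: arm a timer; shutdown fires only after the grace period.
        if self._timer is None:
            self._timer = threading.Timer(self.grace, self.shutdown)
            self._timer.start()

    def on_line_power(self):
        # Power restored in time: cancel the pending shutdown.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

The yank-the-cord test exercises exactly both paths: `on_battery` followed by the timer expiring, and `on_battery` followed by `on_line_power` within the grace window.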
Backups (Score:2)
No person is so special that someone else can't be trained to be 'their backup' - no single person should ever hold the only set of 'keys to the kingdom'.
Re:Backups (Score:1)
hehe
Re:Backups (Score:1)
In our office, we've had one programmer laid low with no notice by a drug reaction and surgery, out for four months; another programmer injured in an accident, out for a month. In both cases, the paper trail didn't begin to explain just what they were working on adequately for the remainder of the crew to pick up where they left off, and they were, of course, the only ones with info on their specific bits.
Now that everyone's back, there's still no buy-in from management that we need
Categorize (Score:2, Interesting)
So, your meteorite example would probably fall into something between a horrible fire and earthquake, as the kind
Volcano Disaster Plan (Score:3, Funny)
Re:Volcano Disaster Plan (Score:1)
that's not all that far-fetched. buildings in the area of the WTC on 9/11/2001 were all dusty on the inside from the collapsed towers. additionally, there was a museum that saved their inventory by closing off the vents and other ductwork before the two towers fell.
Re:Volcano Disaster Plan (Score:2)
I didn't think so.
Re:Volcano Disaster Plan (Score:1)
Probably not. But I do remember that the first PC my father bought back in the mid-80s, a Victor 9000 [drake.nl], did come with a dust cover...
Re:Volcano Disaster Plan (Score:1)
They probably just paid for the off site data backup from IBM and didn't think much about it. It's the sort of thing IBM does very well.
What not to do (Score:3, Insightful)
I know of one outfit that shall remain nameless.
They laid off their sys admin.
Then soon found out what a root password was for.
Re:What not to do (Score:1)
Re:What not to do (Score:1)
Cracking Sun's hardware security. Easy. (Score:1)
Re:What not to do (Score:2)
I'm sure the sysadmin has no legal right to withhold the password to the system. They own it, and owned it when he set the password in the first place.
A couple of thoughts (Score:3, Insightful)
Identify critical paths and personnel. An organization can function for several weeks without C*O's, but without the "worker bees" the company will grind to a halt almost immediately. Also consider the effects of losing large numbers of staff. If every developer and admin quit, what would be the effect on the company?
Verify the physical security of all sites. Imagine "man with a gun" security breaches -- would an armed intruder (willing to kill) be able to cause significant damage?
I visited the colo site for a company I was with. They had multiple electrical hookups and enough fuel to run the diesel generators for 48 hours. There were two separate water main connections (a holdover from the water-cooled mainframe days). The colo was connected to two different phone companies (on opposite sides of the building), plus a dish for satellite uplink. It was incredibly expensive to build and maintain, but it would be more expensive if there was ever any downtime.
Also take cost into account. How much is your company willing and able to spend for disaster prevention and recovery? Don't forget to include your time in those figures...
simple (Score:5, Insightful)
break it down into procedures that you can take based on real problems, not causes. some things to consider:
etc. I've seen too many recovery plans that are focused on the cause rather than on solutions, which are really what these plans are all about. if you really need to, you can cross-reference the plans with potential causes. this seems to satisfy the cio types who stay up late wondering 'what happens if _____'.
of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. having a disaster recovery plan is no substitute for good policy and an adequate hardware/isp budget.
Re:simple (Score:4, Interesting)
The concept of breaking down the recovery phases is the best recovery advice I can give you. Worry about things in sort of a concentric ring of problems, much as the previous poster presented. Start with the simplest broken piece and move on to the more complicated.
The things companies went through due to the WTC going poof is a true real-world example of the worst case scenario occurring - not only did the data center disappear, so did the staff that ran IT there!
Of the recovery efforts I've read about - the guys that had deals with hot-standby facilities out of the immediate area came back the quickest.
Re:simple (Score:3, Interesting)
Stage a disaster.
Either that, or just fake it. Take the data to your recovery site, try to bring it all up on the internet (under a dr.www.?????.com DNS name, for example), and see what's accessible.
In case of failure, tune the plan, and try again in 6 months.
In case of success, tune the plan, and try again in a year.
Jason
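The drill-and-retest cadence above can be sketched as a checklist runner plus a scheduler. The probe is injected (e.g. an HTTP GET against the dr.* hostname) so the drill logic itself can be rehearsed without touching production DNS; names and intervals are illustrative:

```python
from datetime import date, timedelta

def dr_checklist(urls, probe):
    """Probe each service you expect at the DR site.
    Returns (results, failures); `probe` answers True if the service came up."""
    results = {url: probe(url) for url in urls}
    failed = [u for u, ok in results.items() if not ok]
    return results, failed

def next_drill(last: date, all_passed: bool) -> date:
    # Tune the plan and re-test: ~6 months after a failed drill,
    # ~12 months after a clean one (approximated as 30-day months).
    return last + timedelta(days=360 if all_passed else 180)
```

The point of recording `next_drill` somewhere visible is that DR plans rot; a plan that hasn't been exercised since the last hardware refresh is a guess, not a plan.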
Re:simple (Score:1, Funny)
Really simple (Score:2)
Re:Really simple (Score:1, Troll)
Re:Really simple (Score:2)
Pop culture.
plan for the effect not cause (Score:2)
plan for that.
if you need to relocate an office, move it to a hotel. you have space (rooms, convention space) all designed to have furniture brought in, plus phones, power and data, available on short notice. you also have staff to keep it all clean.
plan the order of IT systems restoration. development can wait a week, but if your cash flow stops you're screwed
Contingency Plans (Score:2)
The only reason the NYSE stopped during 9/11 was for political reasons. Most of the brokerage firms in the WTC had warehouses in Jersey rented for years waiting for something like 9/11 to happen.
It comes down to cost: how much is it worth if you lose all your records? How much is it worth if you are down for 2 weeks? First figure out how much a minute of downtime costs you.
Backups... and another RAID board (Score:2, Insightful)
Little-realized thing until it's too late: do you have raid arrays? If so, you might want to ponder making sure you can always "get" another raid card of the same type currently running your array. Either have another on a shelf (off site, whatever), or another compatible system.
Nothing sucks more than having something stupid like the raid card (or motherboard) die, and finding you can't get a replacement.
2 points (Score:2)
2. make sure that every person in the company can be replaced, either directly (by someone else) or indirectly (by a combination of people)
Years ago (Score:3, Interesting)
The guy told me something about their disaster preps (it was financial data).
The first thing he pointed out was that there were two complete mainframes, side by side. Each was capable of doing the whole job, but....
The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...
Then he said, "we have a second Identical data center about 5 miles across town"
Each data center could handle all the customers from that region - yes, there would be a performance hit, but...
Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.
And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that
Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world ALL went "kaboom"
Chicago Flood of 1991? (Score:2)
Risk management (Score:3, Insightful)
What you've been asked (volunteered) to do is a risk analysis. This is a whole lot more than being the l33t admin of a high-availability site, whatever many Slashdotters seem to think. You've hit on some of the non-technical risks (your CTO example), but to address this properly you need to concentrate on identifying risks and how to handle them.
The first thing to realise is that you can't have a preformed contingency plan for everything. What you can do is identify every point of risk, weight it according to likelihood and severity, and develop plans for the "likely worst cases" that you discover. The rest of the risk is a business risk, that is, you insure yourself against it and deal with it if and when it happens.
You should also bear in mind that, from a technical viewpoint, there are no absolute guarantees. Almost all high-availability strategies protect against a single point of failure, but this isn't enough. What if you have multiple failures? How quickly can you detect and respond to a failure? How long can you suffer a complete outage (this is really important to know, and "we can't" is not an acceptable answer)? Uptime costs money; calculate the point of balance.
Ask Google about "organizational risk" - you'll find a lot of information about auditing risk that can put you on the right path.
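The "weight it according to likelihood and severity" step can be sketched in a few lines. The 1-5 scales and the sample scores here are made up for illustration, not a methodology:

```python
def prioritize(risks):
    """Rank risks by expected impact: likelihood x severity on 1-5 scales.
    Plan in detail for the top of the list; treat the long tail as
    ordinary business risk and insure against it."""
    return sorted(risks,
                  key=lambda r: r["likelihood"] * r["severity"],
                  reverse=True)

risks = [
    {"name": "colo fire",    "likelihood": 2, "severity": 5},   # score 10
    {"name": "disk failure", "likelihood": 4, "severity": 3},   # score 12
    {"name": "meteorite",    "likelihood": 1, "severity": 5},   # score 5
]
```

Even a toy ranking like this is useful in front of a CEO: it shows why the boring disk failure deserves a detailed plan and the meteorite gets an insurance policy.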
Make sure the disaster is profitable (Score:3, Interesting)
Make sure the company stays profitable.
All this involves is insuring against disasters, and making sure the payout will *exceed* whatever it costs to recover.
If my office exploded and I had to re-build everything from scratch, I'm fine, because my company will soon be getting a cheque that will cover everything, including the expected profits for the next 6 months. If we rebuild in 5 months, the disaster is actually a revenue generator for the company.
Compare this to someone with excellent plans and the ability to get rebuilt in just a month, but no insurance on the lost profits. They are facing a net loss even if they work like crazy and get the rebuild done in 3 weeks.
Of course you should still do off-site backups to deal with problems that are 'serious' but not 'disastrous', and have contingency plans in place for day to day traumas, but I think the best way to deal with 'the big one' is simply to insure against it, and make sure the insurance covers the lost profits.
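The arithmetic in the rebuild-in-5-months example can be sketched directly. All figures are hypothetical, and the payout model (rebuild costs plus a fixed window of covered profit) is a simplification of real business-interruption insurance:

```python
def net_position(rebuild_cost, months_down, monthly_profit,
                 insured, covered_months=6):
    """Cash position after a total-loss disaster.
    Insured: payout covers the rebuild plus `covered_months` of profit,
    so finishing the rebuild early turns the disaster into net revenue.
    Uninsured: you eat the rebuild cost and every month of lost profit."""
    payout = rebuild_cost + covered_months * monthly_profit if insured else 0
    return payout - rebuild_cost - months_down * monthly_profit
```

With the numbers from the comment: insured and back up in 5 months with 6 months covered leaves you one month of profit ahead; uninsured, even a heroic 3-week rebuild is a net loss.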
Dev? (Score:3, Informative)
For a development shop, I should think that all you really need is to make sure you've got secure, recent, usable backups of your source code and important licensing/contract data.
Unless you live in an area prone to a certain type of natural disaster, the types of things that cause real contingency plans to go into effect have a statistically small chance of ever happening to you. It's just not worth the money and effort to go to great lengths to make sure the company is running full speed the next day rather than two to three weeks down the road. As long as your data is safe, you should be ok. Just take good backups in duplicate - put one set in a fire safe onsite and ship one to a secure offsite location, perhaps using a service provider like Iron Mountain (although I'd encrypt that tape before I gave it to some random Iron Mountain driver if I were you).
Some businesses have to worry about 24/7 production operations that can't be allowed to stop. Typical examples of extreme uptime environments are stock exchanges, utility/telco companies, various emergency services, etc. In a lot of these sorts of cases, it's actually justifiable to double or even quadruple the cost of your implementations and the ongoing maintenance and salary costs just to make sure that when a 1:1,000,000 chance event occurs, you experience a 10 second performance hiccup rather than a serious outage. In some cases a one hour outage simply cannot be tolerated at any cost. These are the environments that really have a hard time pushing the bleeding edge of engineering geographically redundant "systems", where systems includes the machines, the networks, and the people using them. A development house, in contrast, is a pretty easy problem.
backup tapes in a fire safe?!? (Score:1)
Re:backup tapes in a fire safe?!? (Score:2)
Yes, most fire safes I've ever seen are mostly worthless. When you read the specs, it turns out they let through way too much heat for most practical purposes - all they're doing is blocking the physical flame. You're still running a fair chance of total heat destruction on the inside. However, there are fire safes out there that are capable of protecting magnetic media; you just have to look around for them.
For example, check out http://www.firekingoffice.com/fk_data.html
Decide where to spend your effort. (Score:4, Informative)
Risk category: e.g. people risk (CTO gets hit by a bus), infrastructure risk (your server is destroyed by aliens), legal risk (you get sued by the RIAA for having an MP3 somewhere on your network), commercial risk (your biggest client goes bust), regulatory risk (the government licenses development shops), etc.
Risk impact
You then have to reach a business decision how much to spend on mitigating each risk. Clearly it's worth spending time on "high likelihood, disastrous impact" risks, but you may not care much about a transport strike stopping the cleaning staff from getting into work for a couple of days.
When you know which risks you care about, identify mitigation strategies. Typically, this starts with identifying an owner for the risk, who is in charge of the mitigation strategy. For instance, the development manager may have to find ways of mitigating the "top coder headhunted" risk by implementing code review processes, knowledge sharing systems, etc.
You should not let the business view risk management as a technology issue - it's a business issue. Risk mitigation has associated costs, either financial, time, or opportunity cost - the best way to avoid not getting paid for your work is to avoid working for unreliable bill payers. If you come up with a wonderful risk list and proposals for mitigation, your work will be wasted unless the business is willing to bear the cost of implementing your proposals.
How about.... (Score:1)
We actually kept supplying a Big3 auto manufacturer in a "Just in time" sequencing operation when that happened to us.
Not bad...
September the 11th (Score:2)
Re:September the 11th (Score:2)
1. In the first bombing attack (in '93?), hundreds of businesses went permanently under because they lost their invoices and were unable to get back up and bill their customers (had no offsite backup).
2. Many businesses that had primary operations in WTC on 9-11-01 were running real time mirroring to remote sites and had zero downtime from an equipment/data availability perspective. Of course, they lost a lot of key people
consider going bankrupt (Score:2)
I'm not saying that you shouldn't plan and protect, I'm saying that, in real life, you have to look at what your risks are, what your legal obligations are, and what your competitors are doing. Accepting