Topics: Programming, Security, IT, Technology

Dealing with Development House Disasters? (59 comments)

Skinnytie asks: "I was recently asked by the CEO of the company for which I work to find a resource from which to better understand what to do in the event of a disaster. 'I'm on it, Sir' was my response, and I ran back to my desk and started writing contingency plans and trying to imagine what to do if a meteorite strikes our co-lo facility. I quickly came to realize that there is far more that *could* happen (the CTO gets hit by a bus, or the in-house server room gets abducted by aliens... you get the point) than I am even prepared to write plans for. I thought I'd hit the Slashdot audience up for some ideas and horror stories about avoiding, dealing with, and getting past whatever disasters have occurred at your development houses. Have at you!"
This discussion has been archived. No new comments can be posted.

  • Secure the data (Score:5, Informative)

    by Usquebaugh ( 230216 ) on Thursday April 03, 2003 @08:55PM (#5657433)
    First and foremost, make sure you have your data backed up and secure. Nothing else matters if the data is missing. When did you last test your backups?

    Next, ensure you can get to your backup data when you need it. You mention a colo, so I'm assuming you have a duplicate system there. Can you get access at 22:15 on July 4th?

    Now draw up the usual emergencies: fire, flood, tornado, earthquake, etc. Have a plan for how to get your systems up and running. Do you need to rent office space? What about net connectivity?

    Lastly, re-check your data backups: do you have everything, is it error-free, do you have more than one copy? If you have the data, your company can recover.

    No data, no job, it's as simple as that.
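
    A minimal sketch of the kind of backup testing being described: restore the archive somewhere disposable and compare checksums against a manifest written at backup time. The paths, the manifest format, and the tar+SHA-256 approach are illustrative assumptions, not anything from this discussion:

```python
# Minimal backup-restore verification sketch (hypothetical paths/manifest format).
# Restores a tar archive to a scratch directory and checks each file's SHA-256
# against a manifest written at backup time ("<sha256>  <relative/path>" per line).
import hashlib
import tarfile
import tempfile
from pathlib import Path

BACKUP_ARCHIVE = Path("/backups/nightly.tar.gz")   # assumed location
MANIFEST = Path("/backups/nightly.sha256")         # assumed manifest file


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(archive: Path, manifest: Path) -> bool:
    ok = True
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)                # the actual restore test
        for line in manifest.read_text().splitlines():
            expected, rel = line.split(maxsplit=1)
            restored = Path(scratch) / rel
            if not restored.exists() or sha256(restored) != expected:
                print(f"MISSING OR CORRUPT: {rel}")
                ok = False
    return ok


if __name__ == "__main__":
    print("backup OK" if verify_backup(BACKUP_ARCHIVE, MANIFEST) else "backup FAILED")
```
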
    • Re:Secure the data (Score:4, Informative)

      by stefanlasiewski ( 63134 ) <slashdotNO@SPAMstefanco.com> on Thursday April 03, 2003 @08:59PM (#5657479) Homepage Journal
      Lastly, re-check your data backups: do you have everything, is it error-free, do you have more than one copy? If you have the data, your company can recover.

      And something else as important: Where are your backup tapes?

      If they are sitting in a locked cabinet right next to the computers, they won't survive the asteroid blast either.

      Off-site backups. A pain to maintain, but a good idea for any contingency plan.
    • Just a quick addition to this (which I know doesn't really apply to heavily IT-dependent companies such as software development firms or co-lo companies, but is relevant for a large majority of companies):

      (after checking that your backups actually restore) it may be worth keeping your contingency plans minimal, e.g. just covering critical systems with a few spare workstations, and considering this option:
      how much would it cost to run the office the good ol' fashioned manual paper way for, say, up to a week?

  • 'I'm on it, Sir' was my response

    A friend of mine once gave a response that was less gentle:


    Sir, you just laid off half the developers, and half of the support staff, but you didn't reduce the marketing staff.

    There is one manager for every 5 non-managers, we're still not meeting our financial targets, our new "Premium services" campaign is earning $1 for every $1000 we invested, we don't have enough tech staff to fix the bugs, the QA department was reduced to a single person and can't even find the bugs, and tech support is dealing with a growing number of irate customers every day.

    We can barely keep up with the endless list of new tasks that you assign, sir, and you want me to waste my time daydreaming about asteroids?

    We don't need a contingency plan, sir, we ARE IN the contingency plan.

    Get real, sir.


    Still kept his job. Ok, maybe he wasn't that snotty...
  • by TheDarkRogue ( 245521 ) on Thursday April 03, 2003 @08:56PM (#5657445)
    The ceiling (or floor to the party (company success celebration) going on upstairs) of the server room fell in at my friend's place of work, impaling their file server with a piece of rebarb(sp?) through the motherboard and a RAID/IDE controller card. All was well till the hole caught on fire due to what was later found to be a damaged coffee-cup heater that a stack of books had fallen onto. No one was seriously injured by the event, other than the group of people who designed the building.

    This was an architectural firm.

    • a peice of rebarb(sp?) through the motherboard

      This stuff is called 're-bar', which is short for 'reinforcing-bar'. It is a metal rod about a half-inch in diameter (there are larger/smaller versions) that is used to add strength to concrete structures. Re-bar is made with a coarse pattern on the outside so the concrete can get a grip.

      For those of you who are still lost, this is the stuff that Cordelia fell on in 'Lover's Walk', an episode from the second or third season of Buffy the Vampire Slayer.
      • For those of you who are still lost, this is the stuff that Cordelia fell on in 'Lover's Walk', an episode from the second or third season of Buffy the Vampire Slayer. Does that help? :-)

        I find that the following explanation is a better example:

        When you were a kid, it was the metal bars you stole from the new house next door to play "Darth Vader vs. Luke Skywalker".
  • by herrlich_98 ( 267669 ) on Thursday April 03, 2003 @08:59PM (#5657483)
    One really hard part of plans is to catch *all* the things you are going to need to recover. This advice is kinda like the old advice to actually test your backups occasionally.

    For example, if you have plans to relocate the headquarters to a different site, then every 6 or 12 months or so, try it out to sort out the "glitches". To expand on this example, does everyone know where the other site is and how to get there? How will they know to go there? Are there problems of quorum, where half the managers will be at one site and half at the other, making contradictory decisions? Etc, etc, etc... This is also a good time to learn about how your organization operates.

    Whatever your plans and contingencies, do regular dry runs of them, to the extent that it is practically possible.

    • A great way to test the UPS batteries and auto-shutdown software is to walk over to the wall and yank the power cord of the UPS out of the socket.

      Plugging it back in after 30 seconds is a good way to test the "power came back, cancel the shutdown" part of the software, too.

  • Of everything - not just those files in CVS. Every person, every concept, every document needs to be duplicated, or be /easily/ reconstructed from others that are.
    No person is so special that someone else can't be trained to be 'their backup' - no single person should ever hold the only set of 'keys to the kingdom'.
    • Hey, stop it! Us BOFHs want to stay in our positions of total power / blackmail and corruption.
      hehe ;-)
    • I can't agree more.

      In our office, we've had one programmer laid low with no notice by a drug reaction and surgery, out for four months; another programmer injured in an accident, out for a month. In both cases, the paper trail didn't begin to explain just what they were working on adequately for the remainder of the crew to pick up where they left off, and they were, of course, the only ones with info on their specific bits.

      Now that everyone's back, there's still no buy-in from management that we need

  • Categorize (Score:2, Interesting)

    I don't have experience with this myself, but if I were in your boat I would make a system for classifying types of disaster and the appropriate recovery methods for each. For instance, at the top level you would have a disaster resulting in either physical or non-physical damage. From there you could classify disaster types according to how much and what physical damage occurred.

    So, your meteorite example would probably fall into something between a horrible fire and earthquake, as the kind
  • by Rick the Red ( 307103 ) <Rick.The.Red@nOsPaM.gmail.com> on Thursday April 03, 2003 @09:38PM (#5657729) Journal
    I worked at a large aircraft manufacturer in the Pacific Northwest back when Mt. St. Helens blew. They quickly imposed a "Volcano Disaster Plan" that, AFAIK, is still officially in place. We never followed it, however, because it included such mandates as turning off all equipment at night and sealing it up in plastic and duct tape just in case the building got dusted with ash. Yeah, right! (remember, this was 1980, well before a computer could fit in a garbage bag) It was bad enough for us with our CAD workstations and Tektronix terminals; I can imagine what the boys running the IBM big iron thought of that plan. Where are you going to find a plastic bag big enough for a 370 mainframe?
    • sealing it up in plastic and duct tape just in case the building got dusted with ash. Yeah, right!

      That's not all that far-fetched. Buildings in the area of the WTC on 9/11/2001 were all dusty on the inside from the collapsed towers. Additionally, there was a museum that saved its inventory by closing off the vents and other ductwork before the two towers fell.
      • And so, since 9/11, has every business in New York been sealing up all their electronic equipment with plastic and tape each evening, then un-sealing it the next morning, then re-sealing it, etc?

        I didn't think so.

        • And so, since 9/11, has every business in New York been sealing up all their electronic equipment with plastic and tape each evening, then un-sealing it the next morning, then re-sealing it, etc?

          Probably not. But I do remember that the first PC my father bought back in the mid-80s, a Victor 9000 [drake.nl], did come with a dust cover...

    • can imagine what the boys running the IBM big iron thought of that plan.


      They probably just paid for the off site data backup from IBM and didn't think much about it. It's the sort of thing IBM does very well.

  • What not to do (Score:3, Insightful)

    by the_other_one ( 178565 ) on Thursday April 03, 2003 @09:52PM (#5657818) Homepage

    I know of one outfit that shall remain nameless.

    They laid off their sys admin.

    Then soon found out what a root password was for.

    • I apologise if I'm being ignorant (as usual), but couldn't they just boot-disk it and reset the root password? (If the BIOS was passworded to stop floppy boot, then use a mobo jumper to reset it.) The only flaw is this would be a pain (read: major financial loss) on production/critical servers.
    • Uh, you make it sound like they LYNCHED the sysadmin and chucked him into the subfloor.

      I'm sure the sysadmin has no legal right to withhold the password to the system. They own it, and owned it when he set the password in the first place.
  • by travail_jgd ( 80602 ) on Thursday April 03, 2003 @09:53PM (#5657821)
    Instead of planning for fire, flood, and alien attack, categorize things by level of severity: whether or not the primary site is intact, the amount of time it will take to resume normal operations, whether the event is isolated/regional/national, etc. It doesn't matter if your office is ruined by an asteroid or a terrorist's bomb, the company will still be doing its work at another site.

    Identify critical paths and personnel. An organization can function for several weeks without C*O's, but without the "worker bees" the company will grind to a halt almost immediately. Also consider the effects of losing large numbers of staff. If every developer and admin quit, what would be the effect on the company?

    Verify the physical security of all sites. Imagine "man with a gun" security breaches -- would an armed intruder (willing to kill) be able to cause significant damage?

    I visited the colo site for a company I was with. They had multiple electrical hookups and enough fuel to run the diesel generators for 48 hours. There were two separate water main connections (a holdover from the water-cooled mainframe days). The colo was connected to two different phone companies (on opposite sides of the building), plus a dish for satellite uplink. It was incredibly expensive to build and maintain, but it would be more expensive if there was ever any downtime.

    Also take cost into account. How much is your company willing and able to spend for disaster prevention and recovery? Don't forget to include your time in those figures... :)
  • simple (Score:5, Insightful)

    by farnsworth ( 558449 ) on Thursday April 03, 2003 @10:07PM (#5657891)
    Having both worked on disaster recovery plans and having worked in a data center in the World Trade Center that was completely destroyed, I can say that the best recovery plans are extremely simple.

    Break it down into procedures that you can follow based on real problems, not causes. Some things to consider:

    1. internet connectivity down (switch to backup colo)
    2. unrecoverable db (drive to offsite backup storage and get backup data)
    3. app servers fried (engage hot standby boxes)
    4. all of the above (shit, it's going to be a long night)

    etc. I've seen too many recovery plans that are focused on the cause, rather than on solutions, which is really what these plans are all about. If you really need to, you can cross-reference the plans with potential causes. This seems to satisfy the CIO types who stay up late wondering 'what happens if _____'.

    Of course, a backup plan is totally useless if the 'course of action' section is not possible to carry out, due to bad backup practices or lack of failover equipment. Having a disaster recovery plan is no substitute for good policy and an adequate hardware/ISP budget.
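
    The problem-oriented breakdown above maps naturally onto a simple runbook keyed by the observed failure rather than its cause. A minimal sketch; the scenario names and steps are invented for illustration:

```python
# Sketch of a problem-oriented runbook: each entry is keyed by the observed
# failure, not by its cause. Scenario names and steps are illustrative only.
RUNBOOK = {
    "internet_connectivity_down": [
        "Confirm the outage with the upstream provider",
        "Repoint DNS / routing to the backup colo",
    ],
    "database_unrecoverable": [
        "Drive to the offsite storage facility and retrieve the latest backup set",
        "Restore onto the standby database host and verify row counts",
    ],
    "app_servers_fried": [
        "Power up the hot-standby application boxes",
        "Redeploy the last known-good release and rerun smoke tests",
    ],
}


def print_procedure(problem: str) -> None:
    """Print the recovery steps for an observed problem, whatever caused it."""
    steps = RUNBOOK.get(problem)
    if steps is None:
        print(f"No procedure for '{problem}' -- escalate to the on-call lead.")
        return
    for i, step in enumerate(steps, start=1):
        print(f"{i}. {step}")


if __name__ == "__main__":
    print_procedure("database_unrecoverable")
```
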

    • Re:simple (Score:4, Interesting)

      by stevew ( 4845 ) on Thursday April 03, 2003 @10:20PM (#5657955) Journal
      I've done a lot of disaster scenario planning for emergency service providers - and we come up with some really wild stuff, but the above is perhaps the best advice I've seen here so far. Don't worry about the aliens hijacking the data center, worry about the data center resources not being available for whatever reason.

      The concept of breaking down the recovery phases is the best recovery advice I can give you. Worry about things in sort of a concentric ring of problems, much as the previous poster presented. Start with the simplest broken piece and move on to the more complicated.

      What companies went through due to the WTC going poof is a true real-world example of the worst-case scenario occurring - not only did the data center disappear, so did the staff that ran IT there!

      Of the recovery efforts I've read about - the guys that had deals with hot-standby facilities out of the immediate area came back the quickest.
    • Re:simple (Score:3, Interesting)

      by Zapman ( 2662 )
      Adding to the last point you mention, the most critical thing you can do to any plan is TEST IT.

      Stage a disaster.

      Either that, or just fake it. Take the backup data and try to bring it all up, put it on the internet (under a dr.www.?????.com DNS name, for example) and see what's accessible.

      In case of failure, tune the plan, and try again in 6 months.

      In case of success, tune the plan, and try again in a year.

      Jason
      • Re:simple (Score:1, Funny)

        by Anonymous Coward
        I took your advice and set fire to the machine room. I didn't warn anyone because it is supposed to be a realistic test. They didn't seem to take it too well when I explained that they had "done a very good job" on the test. The only server that was really toasted was the one I tested for the "crazy lawnmower man comes into server room and sloshes gasoline on prized IBM mainframe and drops a match." Some people really cried about all that data loss but a test is a test, right? If people don't do backups th
  • Offsite backups, and more than one person who can perform/knows the same critical aspects of the business (code or other specialized information), are all that is really necessary. Also keep a standardized software inventory (OSes, versions, other software and versions, service packs) of what is on which machines. This is really a procedural issue. This is about all you can really do to be ready for about anything that is thrown at you.
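
    A standardized inventory like the one suggested above can start as a small script run on each host. A minimal sketch using only the Python standard library; the JSON output format and field names are assumptions:

```python
# Minimal host-inventory sketch: records OS, version, and hostname as JSON.
# The output format and field names are assumptions for illustration.
import json
import platform
import socket
from datetime import date


def collect_inventory() -> dict:
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),          # e.g. "Linux", "Windows"
        "os_release": platform.release(),
        "os_version": platform.version(),
        "architecture": platform.machine(),
        "python": platform.python_version(),
        "collected_on": date.today().isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory(), indent=2))
```
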
    • <monotone>I have been told to think for myself. I will think for myself for I have been told I will think for myself. The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think for myself...The mass media are the opiate; I will think fo
  • Plan for the effect, not the cause. Aliens, fire, and flood are all different causes, but they can all produce the same result - destruction of the building.
    Plan for that.

    If you need to relocate an office, move it to a hotel. You have space (rooms, convention space) all designed to have furniture brought in, plus phones, power and data, available on short notice. You also have staff to keep it all clean.

    Plan the order of IT systems restoration. Development can wait a week, but if your cash flow stops, you're screwed.
  • First, find out how much they want to spend; e.g., somewhere like the NYSE has 3 remote sites that immediately mirror every transaction that occurs.

    The only reason the NYSE stopped during 9/11 was for political reasons. Most of the brokerage firms in the WTC had warehouses in Jersey rented for years, waiting for something like 9/11 to happen.

    It comes down to cost: how much is it worth if you lose all your records? How much is it worth if you are down for 2 weeks? First figure out how much a minute of downtime costs
  • by Anonymous Coward
    Yes, backups backups backups... OFF SITE BACKUPS. Get a fire safe and keep a set in there too, what the hell.

    Little-realized thing until it's too late: Do you have RAID arrays? If so, you might want to ponder making sure you can always "get" another RAID card of the same type currently running your array. Either have another on a shelf (off site, whatever), or another compatible system.

    Nothing sucks more than to have something stupid like the RAID card (or motherboard) die, and you can't find a replacement tha

  • 1. Off-site backup of ALL data (and out-of-country or out-of-state backup, if you're worried about rebels or floods).

    2. Make sure that every person in the company can be replaced, directly (by someone else) or indirectly (by a combination of people).
  • Years ago (Score:3, Interesting)

    by CharlieG ( 34950 ) on Thursday April 03, 2003 @11:56PM (#5658460) Homepage
    Years ago (Back in the days of drum memory), I was taken on a tour of a data center (first computer I ever saw)

    The guy told me something about their disaster preps (it was financial data)

    The first thing he pointed out was that there were 2 complete mainframes, side by side. Each was capable of doing the whole job, but....

    The next was pointing out that they had redundant power, plus a generator, and a lesson they learned the hard way - the generator had the ability to power the Air Conditioner as well as the computers - if your server room gets too hot...

    Then he said, "we have a second Identical data center about 5 miles across town"

    Each data center could handle all the customers from that region - yes, there would be a performance hit, but...

    Then he said, there are 7 more cities around the world, each with 2 data centers like this one - all transactions go to all 8 cities.

    And then last, he said there was one more data center, in the Outback of Australia. They figured that it was the least likely place to get nuked, and they even planned for that

    Yep, paranoid enough that they wanted their data to survive even if all the major financial centers in the world went "kaboom"
  • Or thereabouts. I was working in the north suburbs at the time, so it didn't affect us, but a few years later I worked in the Board of Trade building. The bottom line is that the building was totally closed down for an extended period of time. The Chicago Board of Trade (I think they actually own and control the building) pulled up generators on the street to run their systems, but none of the tenants could get any power. The story I got from the company I eventually worked for was that they were able
  • Risk management (Score:3, Insightful)

    by Twylite ( 234238 ) <twylite&crypt,co,za> on Friday April 04, 2003 @02:18AM (#5659008) Homepage

    What you've been asked (volunteered) to do is a risk analysis. This is a whole lot more than being a l33t admin of a high-availability site, as many Slashdotters seem to think. You've hit on some of the non-technical risks (your CTO example), but to address this properly you need to concentrate on identifying risks and how to handle them.

    The first thing to realise is that you can't have a preformed contingency plan for everything. What you can do is identify every point of risk, weight it according to likelihood and severity, and develop plans for the "likely worst cases" that you discover. The rest of the risk is a business risk, that is, you insure yourself against it and deal with it if and when it happens.

    You should also bear in mind that, from a technical viewpoint, there are no absolute guarantees. Almost all high-availability strategies protect against a single point of failure, but this isn't enough. What if you have multiple failures? How quickly can you detect and respond to a failure? How long can you suffer a complete outage? (This is really important to know, and "we can't" is not an acceptable answer.) Uptime costs money; calculate the point of balance.

    Ask Google about "organizational risk" - you'll find a lot of information about auditing risk that can put you on the right path.
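
    The "point of balance" can be made concrete with simple expected-loss arithmetic: compare the annualized cost of an outage (frequency times impact) against the annual cost of mitigating it. A sketch with figures invented purely for illustration:

```python
# Expected-loss sketch: compare annualized risk cost against mitigation cost.
# All figures below are invented for illustration.

def expected_annual_loss(incidents_per_year: float, cost_per_incident: float) -> float:
    """Annualized expected loss = frequency x impact."""
    return incidents_per_year * cost_per_incident


def worth_mitigating(incidents_per_year: float,
                     cost_per_incident: float,
                     annual_mitigation_cost: float) -> bool:
    """Mitigation pays for itself when it costs less than the loss it prevents."""
    return annual_mitigation_cost < expected_annual_loss(incidents_per_year,
                                                         cost_per_incident)


if __name__ == "__main__":
    # Hypothetical: one multi-hour outage every two years, costing $50k each,
    # versus a $20k/year hot-standby contract.
    eal = expected_annual_loss(0.5, 50_000)
    print(f"Expected annual loss: ${eal:,.0f}")
    print("Hot standby worth it?", worth_mitigating(0.5, 50_000, 20_000))
```
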

  • by Andy_R ( 114137 ) on Friday April 04, 2003 @05:27AM (#5659470) Homepage Journal
    There is a natural tendency to think that this is all about keeping the data safe, or about having procedures in place, but I have a different way of looking at it that I think is more practical:

    Make sure the company stays profitable.

    All this involves is insuring against disasters, and making sure the payout will *exceed* whatever it costs to recover.

    If my office exploded and I had to rebuild everything from scratch, I'd be fine, because my company would soon be getting a cheque that covers everything, including the expected profits for the next 6 months. If we rebuilt in 5 months, the disaster would actually be a revenue generator for the company.

    Compare this to someone with excellent plans and the ability to get rebuilt in just a month, but no insurance on the lost profits. They are facing a net loss even if they work like crazy and get the rebuild done in 3 weeks.

    Of course you should still do off-site backups to deal with problems that are 'serious' but not 'disastrous', and have contingency plans in place for day-to-day traumas, but I think the best way to deal with 'the big one' is simply to insure against it, and make sure the insurance covers the lost profits.
  • Dev? (Score:3, Informative)

    by photon317 ( 208409 ) on Friday April 04, 2003 @06:47AM (#5659690)

    For a development shop, I should think that all you really need is to make sure you've got secure, recent, usable backups of your source code and important licensing/contract data.

    Unless you live in an area prone to a certain type of natural disaster, the types of things that cause real contingency plans to go into effect have a statistically small chance of ever happening to you. It's just not worth the money and effort to go to great lengths to make sure the company is running at full speed the next day rather than two to three weeks down the road. As long as your data is safe, you should be OK. Just take good backups in duplicate - put one set in a fire safe onsite and ship one to a secure offsite location, perhaps using a service provider like Iron Mountain (although I'd encrypt that tape before I gave it to some random Iron Mountain driver, if I were you).

    Some businesses have to worry about 24/7 production operations that can't be allowed to stop. Typical examples of extreme uptime environments are stock exchanges, utility/telco companies, various emergency services, etc. In a lot of these cases, it's actually justifiable to double or even quadruple the cost of your implementations and the ongoing maintenance and salary costs just to make sure that when a 1:1,000,000-chance event occurs, you experience a 10-second performance hiccup rather than a serious outage. In some cases a one-hour outage simply cannot be tolerated at any cost. These are the environments that really do have to push the bleeding edge of engineering geographically redundant "systems", where systems includes the machines, the networks, and the people using them. A development house, in contrast, is a pretty easy problem.
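
    On the aside about encrypting the tape before handing it to a courier: a minimal software sketch of encrypt-before-shipping, assuming the third-party Python cryptography package is available (a real tape workflow would more likely use hardware encryption or GPG, and would stream data rather than read the whole archive into memory). File names are illustrative:

```python
# Minimal encrypt-before-shipping sketch using symmetric (Fernet) encryption.
# Assumes the third-party "cryptography" package; file names are illustrative.
from pathlib import Path

from cryptography.fernet import Fernet

ARCHIVE = Path("nightly.tar.gz")       # the backup set to ship offsite
ENCRYPTED = Path("nightly.tar.gz.enc")
KEY_FILE = Path("backup.key")          # keep this onsite, NOT with the tape


def encrypt_archive() -> None:
    key = Fernet.generate_key()
    KEY_FILE.write_bytes(key)          # store/escrow the key separately
    token = Fernet(key).encrypt(ARCHIVE.read_bytes())
    ENCRYPTED.write_bytes(token)


def decrypt_archive() -> bytes:
    key = KEY_FILE.read_bytes()
    return Fernet(key).decrypt(ENCRYPTED.read_bytes())


if __name__ == "__main__":
    encrypt_archive()
    print(f"wrote {ENCRYPTED} ({ENCRYPTED.stat().st_size} bytes); key in {KEY_FILE}")
```
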
    • I thought those things got hot enough to melt tapes. Certainly the last ones I laid eyes on said explicitly that they wouldn't work for tapes. More expensive models may or may not do better, but if the assumption about the heat and duration of the fire exceeds their rating, you're still pretty screwed.

      • Yes most fire safes I've ever seen are mostly worthless. When you read the specs, it turns out they let through way too much heat for most practical purposes - all they're doing is blocking the physical flame. You're still running a fair chance of total heat destruction on the inside. However, there are fire safes out there that are capable of protecting magnetic media, you just have to look around for them.

        For example, check out http://www.firekingoffice.com/fk_data.html

  • by PinglePongle ( 8734 ) on Friday April 04, 2003 @07:57AM (#5659841) Homepage
    There have been a number of posts already, but I would create a simple spreadsheet breaking down all the risks you can think of, as follows:

    • Risk category : eg. people risk (CTO gets hit by a bus), infrastructure risk (your server is destroyed by aliens), legal risk (you get sued by the RIAA for having an MP3 somewhere on your network), commercial risk (your biggest client goes bust), regulatory risk (the government licenses development shops) etc.

    • Risk impact - the impact on the business if the risk were to occur. Probably best to summarize into 5 or so levels, from "negligible" to "unrecoverable".
    • Likelihood - the chances of the risk occurring. Your office is unlikely to get hit by a meteorite, but your top coder may get headhunted and take all that knowledge with him.

    You then have to reach a business decision on how much to spend on mitigating each risk. Clearly it's worth spending time on "high likelihood, disastrous impact" risks, but you may not care much about a transport strike stopping the cleaning staff from getting into work for a couple of days.
    When you know which risks you care about, identify mitigation strategies. Typically, this starts with identifying an owner for the risk, who is in charge of the mitigation strategy. For instance, the development manager may have to find ways of mitigating the "top coder headhunted" risk by implementing code review processes, knowledge sharing systems, etc.
    You should not let the business view risk management as a technology issue - it's a business issue. Risk mitigation has associated costs - financial, time, or opportunity costs; for example, the best way to avoid not getting paid for your work is to avoid working for unreliable bill payers. If you come up with a wonderful risk list and proposals for mitigation, your work will be wasted unless the business is willing to bear the cost of implementing your proposals.
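
    Those spreadsheet columns translate directly into a tiny risk register that can be sorted by priority (impact times likelihood). A sketch; the example rows and the 1-5 scales are invented:

```python
# Tiny risk-register sketch mirroring the spreadsheet columns described above.
# Example rows and the 1-5 scales are invented for illustration.
from dataclasses import dataclass


@dataclass
class Risk:
    category: str      # people / infrastructure / legal / commercial / regulatory
    description: str
    impact: int        # 1 = negligible ... 5 = unrecoverable
    likelihood: int    # 1 = very unlikely ... 5 = near certain
    owner: str

    @property
    def priority(self) -> int:
        return self.impact * self.likelihood


REGISTER = [
    Risk("people", "Top coder headhunted, knowledge lost", 4, 3, "Dev manager"),
    Risk("infrastructure", "Colo facility destroyed", 5, 1, "Ops lead"),
    Risk("commercial", "Biggest client goes bust", 4, 2, "CEO"),
]

if __name__ == "__main__":
    # Highest-priority risks first: these are the ones worth real mitigation spend.
    for risk in sorted(REGISTER, key=lambda r: r.priority, reverse=True):
        print(f"[{risk.priority:>2}] {risk.category:<14} {risk.description}  (owner: {risk.owner})")
```
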
  • The electric company uses a backhoe to cut all your data lines?

    We actually kept supplying a Big3 auto manufacturer in a "Just in time" sequencing operation when that happened to us.

    Not bad...
  • I wonder how businesses in the towers coped after 11/9. They must have had to apply their contingency plans; perhaps you could try looking for someone in one of those companies.
    • I remember hearing two things regarding the WTC's and disaster recovery planning.

      1. In the first bombing attack (in '93?), hundreds of businesses went permanently under because they lost their invoices and were unable to get back up and bill their customers (had no offsite backup).

      2. Many businesses that had primary operations in WTC on 9-11-01 were running real time mirroring to remote sites and had zero downtime from an equipment/data availability perspective. Of course, they lost a lot of key people
  • You can spend enormous amounts of money on trying to protect against every eventuality. But is that worth it? What makes the US economy so dynamic is that companies can take risks and that they do fail. And while you are trying to take costly steps to protect your data, your competitors may well not be.

    I'm not saying that you shouldn't plan and protect, I'm saying that, in real life, you have to look at what your risks are, what your legal obligations are, and what your competitors are doing. Accepting

"If it ain't broke, don't fix it." - Bert Lantz

Working...