Disaster Recovery?
M. Grochmal asks: "A three-alarm fire at Southern Maine Technical College burned through the Computer Technology and Technical Graphics departments. We have salvaged most of what we can, but cannot return to the building until the asbestos risk decreases. The hard part now is rebuilding the networks in another building. The schedules have been rearranged, and many of the department's students and faculty are volunteering to relocate salvageable computers, as well as install and configure the new computers that will be arriving in the next day or so. On top of that, we have to rebuild the Netware servers, restore from backups, and get them networked again. I was wondering how other Slashdot readers have recovered from unforeseen damage to their work (and learning) environments. You can read about the fire here and see what the schedule is. Wish us luck."
NDS and the 3 day rule (Score:2, Informative)
Re:NDS and the 3 day rule (Score:2, Interesting)
That said, of course you should have had offsite replicas of each partition. With offsite replicas (or at least replicas on servers that didn't get destroyed) you can remove the crashed servers, still have access to your NDS resources (IDs, passwords, etc), and then re-install the crashed servers to the tree later.
But with or without offsite replicas - three days makes no sense. If the whole tree is involved in the disaster (all servers containing replicas of the tree are gone), it doesn't matter how long it takes you to begin NDS recovery. If the whole tree is not involved, but ALL replicas of a partition are, you may have issues with external references and subordinate reference partitions until you've completed recovery - but nothing that would require 'rebuilding your tree'. If at least one replica survives, you do NOT have a problem, just a challenge.
Netware crashed server recovery has been interesting over the years. Netware 5.x & NDS8 removed the old DSMaint NLM that could preserve server references, so until the recent release of the XBROWSE program, the only clean way to quickly recover from a crashed server while preserving server references was to hang on to an old NW4x box to run the DSMaint process.
Recovering NDS from real disasters can be... challenging. If you don't know what you are doing, you *can* really mess things up. Be careful and you'll be fine.
Re:NDS and the 3 day rule (Score:1)
My comment was based on the assumption that he has an active tree still alive and CHANGING. In that case, it'd be asinine to put one of those servers back in after 3 days.
PS: Glad to see at least one other person familiar with NDS on
A Day late and a Dollar short question of the year (Score:2, Informative)
Uhh, you would have a DR plan in place BEFORE the building burns down, one developed around the business or school's needs. Tape backups and another site to restore to. Perhaps the information was even mirrored to it via a SAN. Everyone keeps their tapes offsite, right?
That's all I've been doing since the 9/11 incident, and I think it has something to do with the fact that I work in the Boston WTC; everyone here is a bit paranoid about data loss since we had offices that got toasted in NYC.
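A minimal sketch of the nightly offsite-copy idea the parent describes. The paths, retention window, and use of local directories instead of real tape hardware are all illustrative assumptions; production setups would use dedicated backup software:

```python
import datetime
import pathlib
import tarfile


def backup(src, offsite, keep=7):
    """Archive src into the offsite directory, keeping only the
    `keep` most recent dated archives (hypothetical retention policy)."""
    stamp = datetime.date.today().isoformat()
    archive = pathlib.Path(offsite) / f"backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tf:
        # store paths relative to the source directory's name
        tf.add(src, arcname=pathlib.Path(src).name)
    # prune archives older than the retention window
    # (ISO date stamps sort lexicographically, so oldest come first)
    for old in sorted(pathlib.Path(offsite).glob("backup-*.tar.gz"))[:-keep]:
        old.unlink()
    return archive
```

The point of the pruning step is that "offsite" storage is finite; the parent's tape rotation scheme accomplishes the same thing physically.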
Re:A Day late and a Dollar short question of the y (Score:1)
As for recovering without a plan: if there was ever anything you wished you could go back to the beginning and change, now's the time.
Been there, done that ... (Score:2)
Even though our Telecomm Department was able to pull enough equipment out of the Telecomm Engineering lab to get the network sort of back up, we were without full connectivity for almost a month. It took about 4 days to get the electricity back on in our undamaged building, and we didn't have phone service for about 2 weeks. There are a few buildings on campus that are still unusable.
Best of luck. It sounds like your situation is going to be more tedious than difficult, though.
Some Idea ... (Score:2, Informative)
Besides, I suppose it would be best to see the positive side of the incident; I'm sure rebuilding the network will be a good experience! Anyway, good luck to you.
Seize the moment! (Score:5, Insightful)
I was visiting some friends at your campus just this past December; sorry to hear about your loss.
Sadly, I can't give you any suggestions on how to better recover from your current situation -- seems like what can be done now is being done. It seems there's not been much of a response to this as yet, so I'll go out on a limb and offer some ideas that may sound obvious, but forest and trees and all that.
I'm reading between the lines, but I suspect that prior thoughts of backups and disaster recovery were shot down by the PHBs as being too expensive or time consuming. Here's your chance!
You now have a rare opportunity where proposals for FUTURE disaster recovery would actually be listened to!
First off, document what you are doing now! Write it down in a notebook, carry around a pocket tape recorder, use a PDA, hire some students who will answer a phone so that when something comes to mind, you can just dial a phone and get it recorded; whatever, but document what it is actually costing to recover! And not just the hardware/software expenses either! Increased calls to the help desk. Impact on faculty and students' schedules. Reconstructing the network topology.
Anything you can think of, now, document it! If, upon later review, some things are questionable, you can omit them then. But if during that later review the thought is: "Gee, this took more than we had thought it would; too bad we didn't keep track..." Get the picture?
So, now you'll have some kind of baseline as to what the actual recovery costs were, in this case. With that, you can now make a strong business case to implement a solid disaster recovery plan. Include server configs, backups, inventory of hardware and software... in short you've got a list of what you actually had to do to recover from this disaster; use that to identify what you'd need to do again.
Other ideas off the top of my head: Get a fire suppression system. Split some of the equipment (e.g. labs) across multiple buildings so that if one burns down, there's some infrastructure that is still usable. You'll have a working system that you can refer to while rebuilding the destroyed system, too.
What is the question? (Score:3, Informative)
Disaster Recovery Resources [labmice.net] - it contains a lot of useful articles about disaster recovery.
I wish you luck!
Disaster Recovery..... (Score:3, Insightful)
Re:Disaster Recovery..... (Score:1)
PROs:
1) scheduled full backups (TSM uses a full incremental package, only "full" is your first run)
2) Better user interface (layout) (TYPES -- Legato: Unix-X/cli, Win32-gui/cli; TSM: Unix-cli, Any-http)
3) Better/more complete client/group/schedule configuration
4) Interwoven, multi-session push to device (not single threaded)
5) Can get site license (maybe w/TSM too?)
6) GREAT spin up time for even moderately tech person ('tier 1' (gen. operational support) in 1week, 'tier 2' (installs/restores/basic troubleshooting) in ~1-3 months, YMMV)
7) Others I'm not thinking of
CONs:
1) BOTH: Complete lack of metrics/reporting characteristics (Legato allows for a '@completionDO' script execution, so think DB....)
2) TSM has a more complete Table of Contents/Index while viewing Volumes in the GUI (Legato needs restore initiation to read the TOC)
3) LEGATO: Client interface is a "canary in a coal mine". If there are issues causing slowdowns, you often see it first in the interfaces.
4) BOTH: Media management for offsite. TSM has DRM (Disaster Recovery Manager) which apparently makes it easier to deal with, but Legato (I think) has a media management program called AlphaStor(?) that is supposed to be end-all/be-all (nothing like marketing, ehhh?).
5) Others I'm not thinking of
Overall, I have not had as 'happy' a time working with TSM as I have had with Legato. I'm sure that someone who is more comfortable with TSM could/would have counters to some of the apparent TSM shortcomings, and I welcome hearing about them. But having been pitched into the deep end in an environment of Legato and TSM, and mostly learning as I go, Legato is working out better for me.
Lastly, having NOT worked with Veritas NetBackup, but having talked to some who have, the biggest complaint I have heard involved a convoluted, non-intuitive user interface and reportedly looooong spin-up times to reach operational support levels. In this particular case, I was told by someone with 2 backup products under their belt that it would take them at least one month to reach the point of being able to do general operational support... and that it has 37 different active views/interfaces to the application.
I do not speak for my team, my company, my government, my race, or, sometimes, myself. Please insert 'Large Grain o'Salt'(TM) in mouth at this time.
Good time to get rid of legacy shit... (Score:2)
Since everything is destroyed for the most part, use this as an opportunity to get rid of those pesky NT 3.51, Novell Netware, and Vax machines that have been cluttering up the computer room.
Ditch that legacy shit and start anew with the insurance check. (Presuming the machines were insured.)
Re:Good time to get rid of legacy shit... (Score:2)
sPh
Re:Good time to get rid of legacy shit... (Score:1)
You have never worked at a university (Score:2)
Yes, some systems can be consolidated, upgraded, or otherwise made more space-efficient, but you need to maintain the same OS and similar hardware, or you're looking at significantly increasing your workload due to instability, installation headaches, etc.
Now, there may be some systems that just can't be recovered, but it's not the system admin's job to decide that -- it's management's. The system admin can give advice, but they don't run the university, and if management decides that it's in the best interest to restore 17-year-old mainframes that suck down $300k/yr in maintenance contracts and cooling costs, and occupy 1/2 of the space in the machine room, it's their decision. You can either do what they tell you to, or find a new job.
Now, if your management repeatedly doesn't listen to you, and continues to do what you warn them against, you'd probably be happier finding a new job. One of the nice benefits of university jobs is that you can carry over TIAA-CREF to most schools.
Personally, I would recommend first recovering every system possible, and once that's done, and you have everything back and operational again, work on migrating out machines that are harder to maintain and recover. Do not worry about getting rid of systems unless they just can't be recovered. Don't worry about anything else until after the systems are recovered.
PS. Don't put machine rooms in basements. Sewer pipes breaking above the machine room is bad.
Data Recovery Is Only Half the Battle... (Score:2, Insightful)
Without this vital aspect, companies such as Deutsche Bank (which was ravaged by the WTC disaster on 9/11 [pcmag.com]) would have been down for days/weeks while attempting to relocate, rebuild, and restore their data center operations...
I, for instance, work at a rather large, international Fortune 500 company, and we have BCP strategies that include a complete off-site location. This facility houses fail-over systems for all business-critical processes, including a 1.2 terabyte, mirrored SAP database that can go online within minutes' notice, and a phone bank/workstations for our 50+ CSRs (customer service reps) and our global helpdesk. Even more, we frequently (twice yearly) perform non-production drills to validate the systems' health and improve upon our strategies...
This is obviously a bit late for you, but I would suggest reading up on the matter [gartner.com] a bit more thoroughly prior to redesigning your future systems and developing your next DRP...
Re: Data Recovery Is Only Half the Battle... (Score:2)
But from there the decision tree goes like this: cost of having a real disaster recovery plan like the big guys? $X. Probability of a disaster? p%. Cost of having our sysadmin and his buddies work 25 hours a day for 3 or 4 days, slapping together whatever equipment he can find at CompUSA and doing the minimum necessary to get back online? $Z. Cost of not having our computer system for 3 days? $C.
And for a small organization, $X is almost always greater than p% * ($Z + $C), the expected cost of just winging it. So creating an extensive DRP isn't justified. Tough luck for the guys who DO end up working 25 hours a day for a week or so (been there), but the economics are usually pretty clear.
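The back-of-the-envelope comparison above can be made concrete. All dollar figures and the probability below are invented for illustration, not taken from the parent:

```python
def dr_plan_worthwhile(plan_cost, p_disaster, improvised_cost, downtime_cost):
    """Return True if paying plan_cost up front beats the expected
    cost of improvising: p_disaster * (scramble + downtime)."""
    expected_without_plan = p_disaster * (improvised_cost + downtime_cost)
    return plan_cost < expected_without_plan


# Small shop: $50k plan vs. a 2% yearly chance of a $20k scramble
# plus $100k of downtime -- 50000 < 0.02 * 120000 is False.
small_shop = dr_plan_worthwhile(50_000, 0.02, 20_000, 100_000)
```

With small-shop numbers the expected loss is tiny next to the plan's price tag, which is exactly the parent's point; the calculus flips only when downtime costs or disaster probability get large.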
sPh