Big Red Button Disasters?
FredDC asks: "The Daily WTF has a story about a Big Red Button disaster. What Big Red Button disasters have you experienced? Which ones have you caused? Are there any that you've heard about, or do you know of any that can happen any day now?"
What kind of idiot... (Score:4, Interesting)
Re:A literal "Big Red Button" disaster (Score:5, Interesting)
A QA Intern Story... (Score:5, Interesting)
Small Red Button (Score:5, Interesting)
Arghhhhhhhhhhhhhhhhhhh @ little fingers.
I either rip the bastard thing right off the board or dig out the regkey thingy to disable it.
Re:Well... (Score:3, Interesting)
First Job Ever (Score:5, Interesting)
Two actually (Score:4, Interesting)
He manually tripped the battery-low condition with the intention that the UPS would abort the shutdown when the power came good again. Nope, all the servers were triggered for shutdown (it couldn't be aborted on the UPS or the servers) and had to be rebooted. The best part was that the UPS sent commands to another site for servers to shut down there. We had to phone another data centre and get them to go power on the servers after we quickly faxed through forms telling them what they had to do (cabinet number, server name etc).
2) At another job, I had just wired up a big red button next to the door in the new data centre (someone had forgotten to install one, so I had to do it on a weekend). One of the guys I worked with phoned to ask whether the switch was connected. I told him it was, and that I hadn't installed the Molly Guard yet, but would do it once I had finished all the testing after lunch. He said OK and hung up. He took it upon himself to finish the testing to save me the time. Under normal circumstances this wouldn't have been a problem; pulling apart APC UPS units wasn't a major concern to us at this point. What he assumed, however, was that I had left the toggle switch set to test. I hadn't. The switch worked and was live in case it needed to be used (it was there for a reason, and my being a block down the street getting lunch doesn't mean it might not be needed, as far as I am concerned). About 5 minutes later he phoned back and asked if I could come back to the office. I said, "Sure, not a problem, what's going on? The SQL server not patching?" (something else we were doing that day). "I am in the data centre." At that point I realised that normally I am asking him to walk out of the data centre because it's too noisy. Glad it was the weekend and there wasn't much going on.
I have also had a UPS engineer blow dust into a VESDA [vision-fs.com] and we had a few fire trucks turn up, but that wasn't a big red button issue.
X was running from desk to desk... (Score:3, Interesting)
On top of the support calls, he was of course getting his daily "yelling at escalation management party conference call" because, needless to say, not everything was smooth. For instance, that brand new customer was brand new to deploying a cell phone infrastructure: bad planning, downtime, and a crazy schedule were the least of their problems. To add some icing to this merry cake, our switching software was quite new as well, and prone to, er, quirks?
Talk about a recipe for wide-scale disaster ;-)
But this day, instead of storming to the next support specialist to wearily beg for some random node reboot, X was running to each and every desk, doing his grand floor tour, smiling like a madman, yelling for everyone to hear:
"They shut off the switch, they shut off the ***** switch, I tell you".
Turns out some cleaning lady tried to shut the **light** off after a good floor-cleaning session, but went for the BIG SWITCH, the one with the big conspicuous red handle, labelled "MAIN POWER - DO NOT..."
BLACKOUT
X did not get his yelling-at party that day, and possibly for a few days afterward.
(No real names were used, because all the above-mentioned companies are still operating as of today :)
Using jed or emacs (Score:4, Interesting)
I also had some directories I wanted deleted, so I renamed them with a ~ at the end as well so I could delete them in one swoop.
I was in a rush and didn't want to be warned, because I had about 50 or so temp files, so I did a
That one accidental space wiped out all the files and folders in that directory, then proceeded to begin deleting all the files in my home directory as well. It sucked most because I had just finished 1000 lines of HTML (this was in the days before dependable JavaScript and CSS). I spent the rest of the day sifting through the cache files on my Windows box for IE and Netscape for Windows testing. I was able to get most of it back, but man, that was a bad day.
Power Station Emergency Shutdown (Score:4, Interesting)
There were a couple days last summer where there was no spare capacity in the Northeast. It was simply so hot that all the AC's were cranked and the grid was saturated in many places. This year should be interesting.
Re:A literal "Big Red Button" disaster (Score:5, Interesting)
Taking out an entire City... (Score:5, Interesting)
In 1999, while I was working as a private consultant for the capital city of a small New England state, a colleague of mine was attempting to make a change to the city's core switches. Per usual with this guy, he over-sold his skill set and was way out of his league, while never willing to admit it.
Meanwhile, I was working in the server room on the squid web caching server while he was attempting the change...
I kept hearing him say things like "I wonder what this command does", and "I wonder what the reset command means. Should I enter it?"
Suddenly I was no longer ssh'ed into the proxy server... I looked up and asked "What the hell did you do?"
His answer: "I entered the reset command"
Me: "Well, fix it. Restore the configuration. It looks like you just reset EVERYTHING..."
Well, needless to say, there was NO saved configuration to restore, and no documentation for the city's network nor the equipment installed, and on this equipment the reset command was the command to reset it to its default settings. (BTW, he entered the reset command on the core switch) There were several local switches (connected via copper), and many fiber connections to all the remote departments across the city - several fire departments, the main police department, city hall, you name it... All off-line.
In the end, the city's network was DOWN for 3-4 full days while he contacted qualified people to attempt to rebuild the network...
We would have been better off if he had hit the big red button near the sliding glass door at the server room's exit.
sigh...
P.S. I am pretty sure he blamed it all on me.
Re:Well... (Score:3, Interesting)
Re:Using jed or emacs (Score:3, Interesting)
Say you want to delete all of the hidden files in a user's home directory. Quite naturally, as root, you type:
rm -rf
At some point you will realise this is taking far longer than it should. You have plenty of time to think about what just happened while you restore your filesystem from backup or reinstall.
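The command in the post above is cut off after rm -rf, and this note does not try to reconstruct it. As a general guard against a recursive delete that climbs out of the directory you meant, here is a minimal Python sketch. The function name safe_targets and its arguments are hypothetical, made up purely for illustration; the idea is to resolve every candidate name first and refuse anything that is the base directory itself or lies outside it (for example . or ..).

```python
from pathlib import Path


def safe_targets(base_dir, names):
    """Return resolved paths for names, refusing any that escape base_dir.

    Hypothetical helper for illustration; not part of any standard tool.
    """
    base = Path(base_dir).resolve()
    targets = []
    for name in names:
        target = (base / name).resolve()
        # Reject the base directory itself and anything not strictly inside it.
        if target == base or base not in target.parents:
            raise ValueError(f"refusing to delete {name!r}: not inside {base}")
        targets.append(target)
    return targets
```

Nothing is deleted here. The point is only that the list of targets gets checked before anything destructive runs, so a stray parent-directory entry raises an error instead of quietly widening the blast radius.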
Re:Power Station Emergency Shutdown (Score:5, Interesting)
To make things worse, I was told there was an industrial fatality in the aftermath when a panel was removed from a region of the generator that hadn't been properly depressurized. Then they determined that the required replacement rotor was too large to legally truck into Ontario over any public roadway from the U.S.-based factory where it originated. I was told they ended up doing a very complex comedy-cops operation under cover of darkness with many scouts and radios, but they did finally get it up and running again, months later.
This was well before the internet so I wasn't able to check out any of the details at the time, and it was a fairly small (yet costly) accident as these things go. I was surprised at the use of VAXes for grinding FFTs, as they seemed rather underpowered in raw CPU relative to other solutions from that era, though maybe not at the time the generator was first commissioned.
Right after my divorce... (Score:3, Interesting)
THNTD (Score:5, Interesting)
Just as trivia, that type of circuit is common on industrial equipment (think of the big press from the end scene in Terminator 1) and is called a Two-Hand No-Tie-Down. Basically there are two switches, and they have to both be depressed within a certain interval in order to close the circuit (generally 0.5s or so). If you "tie down" one of the switches, or have something leaning against it, or whatever, pressing the second switch won't trigger (otherwise it would be just a simple AND gate).
The circuits to do it are pretty standard and easily available. What's cooler is that you can actually get a basically identical circuit that uses compressed air or other gas instead of electricity (for use in chemical plants and other explosive atmospheres). One of the cooler things I've gotten to see made was a pneumatic "circuit board" cut out of Lucite for this purpose. I've always thought they would make a nice demonstration device for teaching kids about electronic circuits.
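For anyone who thinks better in code than in relays, the Two-Hand No-Tie-Down behaviour described above can be sketched roughly like this in Python. This is a toy model, not how any real safety controller is built: the class name, the "left"/"right" switch labels and the timestamps passed in by the caller are all assumptions for illustration, and the 0.5 s window is just the "0.5s or so" figure from the comment.

```python
class TwoHandNoTieDown:
    """Toy model: output is on only if both switches went down close together."""

    def __init__(self, window_s=0.5):
        self.window_s = window_s  # assumed interval, per "0.5s or so"
        self.pressed_at = {"left": None, "right": None}

    def press(self, switch, t):
        # Record the moment the switch went down; a held switch keeps its old time.
        if self.pressed_at[switch] is None:
            self.pressed_at[switch] = t

    def release(self, switch):
        self.pressed_at[switch] = None

    def output(self):
        left = self.pressed_at["left"]
        right = self.pressed_at["right"]
        if left is None or right is None:
            return False
        # A plain AND gate would stop at the check above. The extra timing
        # test is what defeats a tied-down switch.
        return abs(left - right) <= self.window_s
```

With one switch taped down at t=0, pressing the other at t=5 leaves the output off no matter how many times you mash it, because the two press times are too far apart. Releasing both and pressing them together within the window turns it on.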
Re:A literal "Big Red Button" disaster (Score:1, Interesting)
A half million dollars later, they decided it was best to switch to a pull switch on all installations...
About six months after that the new power engineer up at one of our other sites was doing his rounds after his long change. He wanted to know what the "Do Not Operate" tag was for on the control panel so he stopped to check on it. In doing so he pulled on the shiny new emergency stop button it was attached to and took down an 80MW gas turbine. (Incidentally, the "Do Not Operate" tag was someone's bright idea to designate the emergency stop button until the lamacoid sign could be reattached after changing the button style.)
How about a Big Red Power Cord? (Score:2, Interesting)
It turned out that a main switch had the power cable removed by someone who needed the outlet to charge her cell phone. Since she is a retail manager with a different company who happened to share one of our retail spaces, I traded out the knowledge of her mistake for goods from her store, including a nifty pair of Timberland boots and some sweaters.
1. Wait for BRB disaster.
2. Determine it's not your fault, nor that of anyone on your team.
3. Blackmail the culprit.
4. Profit!!
Nuclear Power Plant Emergency Shutdown (Score:5, Interesting)
From http://en.wikipedia.org/wiki/Chernobyl_disaster [wikipedia.org] :
"At 1:23:04 the experiment began. The unstable state of the reactor was not reflected in any way on the control panel, and it did not appear that anyone in the reactor crew was fully aware of any danger. Steam to the turbines was shut off and, as the momentum of the turbine generator drove the water pumps, the water flow rate decreased, decreasing the absorption of neutrons by the coolant. The turbine was disconnected from the reactor, increasing the level of steam in the reactor core. As the coolant heated, pockets of steam formed voids in the coolant lines. Due to the RBMK reactor-type's large positive void coefficient, the steam bubbles increased the power of the reactor rapidly, and the reactor operation became progressively less stable and more dangerous. As the reaction continued, the excess xenon-135 was burnt up, increasing the number of neutrons available for fission. The prior removal of manual and automatic control rods had no substitute, leading to a runaway reaction.
At 1:23:40 the operators pressed the AZ-5 ("Rapid Emergency Defense 5") button that ordered a "SCRAM" - a shutdown of the reactor, fully inserting all control rods, including the manual control rods that had been incautiously withdrawn earlier. It is unclear whether it was done as an emergency measure, or simply as a routine method of shutting down the reactor upon the completion of an experiment (the reactor was scheduled to be shut down for routine maintenance). It is usually suggested that the SCRAM was ordered as a response to the unexpected rapid power increase. On the other hand, Anatoly Dyatlov, chief engineer at the nuclear station at the time of the accident, writes in his book:
"Prior to 01:23:40, systems of centralized control
The slow speed of the control rod insertion mechanism (18-20 seconds to complete), and the flawed rod design which initially reduces the amount of coolant present, meant that the SCRAM actually increased the reaction rate. At this point an energy spike occurred and some of the fuel rods began to fracture, placing fragments of the fuel rods in line with the control rod columns. The rods became stuck after being inserted only one-third of the way, and were therefore unable to stop the reaction. At this point nothing could be done to stop the disaster. By 1:23:47 the reactor jumped to around 30 GW, ten times the normal operational output. The fuel rods began to melt and the steam pressure rapidly increased, causing a large steam explosion. Generated steam traveled vertically along the rod channels in the reactor, displacing and destroying the reactor lid, rupturing the coolant tubes and then blowing a hole in the roof.[7] After part of the roof blew off, the inrush of oxygen, combined with the extremely high temperature of the reactor fuel and graphite moderator, sparked a graphite fire. This fire greatly contributed to the spread of radioactive material and the contamination of outlying areas
Re:Well... (Score:1, Interesting)
Yeah, next time, let's put the button behind a door...
Re:Well... (Score:5, Interesting)
One for the nerds (Score:5, Interesting)
You can do really interesting things as root; in a place I worked one of my colleagues wouldn't admit that he had done the following on one of our biggest and most important UNIXes: He had logged on as root and opened up
Now, what is normally the very first line in
Absence of button caused problem. (Score:3, Interesting)
Unfortunately resetting them involved touching a wire to a ground pin....which was near to a 48V avionics supply pin.... as I found out....twice.
Burning out the only two systems in existence did not make me popular.
UPS power switch (Score:3, Interesting)
Eject a floppy... NOT! (Score:1, Interesting)
We're on-site performing a minor upgrade to one component of a national card-transaction switching network, on the main switch, which in those days was a lowly NCR 3000 system...
The upgrade has gone well, so my colleague pushes the button to eject the floppy (remember them?) containing the patch... Of course he then realises which button he's *really* got his finger on, but before he can think to type 'shutdown' single-handedly with his free hand, he is interrupted by the operator wanting to do his 2AM backup; his finger then slips from the button.
With the system down nobody in the [popular holiday destination] country can perform Visa/Mastercard/JCB or cross-bank ATM and POS transactions (from the main 4 banks). Luckily it was 2AM, a low volume time...
Ops guy is too stupid to even realise the consequences, insisting it's time for him to do the backup, whilst my colleague is frantically re-booting, fscking the volumes, and running consistency checks on our databases, and telling him to fsck right off
Luckily, all is OK and everything's back on-line about 20 minutes later.
Stupid place to put a power button, right next to the floppy eject... We 'fixed' it by taping a match-box cover over the power switch
Re:Small Red Button (Score:2, Interesting)
Small moist fingers can (and do) still fit in the holes left and make contact shutting down the computer all the same - cover the contact points with electrical tape to be sure
Re:Well... (Score:3, Interesting)
A large video post production facility was using a space which used to belong to Sperry Univac years before. They had all the big red emergency power shutoff buttons on the wall which we covered with a guard. One new employee asked what those big red buttons did and someone answered "if you push it, the clowns come out". Sure enough, before anyone could tackle him, he yanked the guard off and pushed the button. Put us out of business for an hour.
Sidebar: my Uncle used to work for Univac as a system troubleshooter and he remembered that old building. He also told me long ago what "IBM" stood for - "Itty Bitty Monkeys, because that's what's inside their machines".
Re:Absence of button caused problem. (Score:3, Interesting)
I worked in Military Electronics in the 8Xs too. My story has to do with a backwards BRS - the time a BRS was supposed to trip, but didn't
We were testing some power supplies in the environmental lab. The heaters in the test chamber had a triple-redundant cutoff: they were normally controlled by a computer via an IEEE-488 bus, there was a bimetallic thermostat on the fixture, and there was also a thermocouple with an overtemp cutout that dropped the entire 3-phase supply to the heaters. The original design of the test fixtures used mercury-wetted relays, but they had been replaced by standard contactors due to environmental concerns.
Well, one weekend (when the lab was in automated mode) the computer controlling the IEEE bus failed. It turns out there were 2 other problems we didn't know about: the contacts in the bimetallic thermostat were corroded into place, and the contacts in the (formerly wetted) contactor were welded... Oooops.
The good news was that the test chamber was able to control the test fixture temp for MOST of the weekend by dumping LARGE quantities of LN2 through the test fixture, and the control for power to the power supplies themselves DID turn off, so we only fried about 1-2 PWBs in each power supply that was being tested. But it was 2-3 weeks of labor to strip the 4 power supplies down, inspect and repair them. The Navy was NOT happy.
The welded contactor was replaced with a (duh) mercury-wetted version, a test procedure was put into place to test the thermostat, the reason for the bus failure was investigated (we didn't figure it out; btw, it never failed again) and we put in a separate watchdog/phone dialer: if the bus hung, OR there was an overtemp, or... it would start dialing. We only got called once over the next 5 years: a solenoid valve actually cracked and failed. We looked at the rated life, and we were well beyond it. We replaced it and its twins, and we were fine.
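The watchdog/dialer decision described above boils down to an OR over independent alarm conditions. A minimal Python sketch, covering only the two conditions the post actually names (hung bus, overtemp); the function name, the way a "hung" bus is detected (no reply within a timeout) and both threshold values are hypothetical placeholders, not details from the original setup:

```python
def should_dial(now_s, last_bus_reply_s, temp_c,
                bus_timeout_s=30.0, overtemp_c=100.0):
    """Return True if the dialer should start calling.

    Thresholds are made-up placeholders, not values from the post.
    """
    # Treat the bus as hung if it has been silent for longer than the timeout.
    bus_hung = (now_s - last_bus_reply_s) > bus_timeout_s
    overtemp = temp_c > overtemp_c
    # Either condition alone is enough; the watchdog does not depend on
    # the control computer being alive to notice the problem.
    return bus_hung or overtemp
```

The useful property is that the check runs separately from the thing it is watching, so a dead control computer shows up as a stale bus timestamp rather than as silence.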
Control Systems (Score:3, Interesting)
One thing I found interesting is that the emergency stop button for safety
systems always has electrical current going through it. In the case of saws
and robots, that current might hold a relay closed, which in turn delivers power
to the saw or robot.
The reason they wire it that way is so that if the wire ever breaks or becomes
disconnected from the emergency stop button, the machine stops.
For those systems, stopping the machine when you don't mean to is preferable
to not stopping the machine the one time in ten years that you really need to.
I had worked with computers for years and would never have thought of doing it that way.
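The wiring idea above can be written down as a tiny truth function. This is only a sketch in Python with made-up names (relay_energized, machine_has_power and their boolean arguments are all hypothetical); it models the stop button as a normally-closed contact in series with the relay coil, which is one common way to get the behaviour described.

```python
def relay_energized(wire_intact, estop_pressed):
    # Current reaches the relay coil only if the loop is unbroken AND the
    # normally-closed stop button has not been pressed.
    return wire_intact and not estop_pressed


def machine_has_power(run_requested, wire_intact, estop_pressed):
    # The relay contacts feed the saw or robot, so losing the loop current
    # for any reason drops power to the machine.
    return run_requested and relay_energized(wire_intact, estop_pressed)
```

A broken wire and a pressed button look identical to the relay: both remove the current, and both stop the machine. If the button instead had to send a "stop" signal down the wire, a broken wire would silently disable the stop, which is exactly the failure this arrangement avoids.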
The most common "big red button" I see turns off the power to the subway third
rail. Now if they could do something about workers getting hit by trains.
Re:A literal "Big Red Button" disaster (Score:3, Interesting)