

Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?
NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!
I'm retired now (Score:5, Funny)
But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
Re:I'm retired now (Score:5, Funny)
I don't have anything nearly that bad - my worst only cost me data. A friend taught me a trick (while I was still learning Linux): you could play music with dd by writing the audio data to /dev/dsp. But as I said, I was still learning Linux and hadn't quite gotten all of the device names into my head, and I mixed /dev/dsp up with /dev/sda...
Re: (Score:3)
Re: (Score:2)
My worst was pretty tame in comparison. Over promised on some specs I couldn't deliver on in the end. Cost the client about $4k - oops.
Other way round for me (Score:2, Interesting)
I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.
I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.
NASA ouchie (Score:2)
I was on the NASA Genesis probe team. Only a few hundred million lost on that one when it crashed into Earth...
On the plus side, it discovered life... (Score:3)
... too bad it was here :)
Re:I'm retired now (Score:5, Interesting)
At the time I was not aware of the unbelievable bureaucracy of large multinationals, and what this would do to my project. Normally I estimate the amount of real work, and add 20% for project management overhead. Maybe another 20% for red tape. But in this case, the PM was more or less forced to involve an ever increasing legion of other teams from various Centers of Excellence in the client's organization. A simple upgrade turned into a project that ran for over half a year. And by agreeing to this approach, I probably cost the client around $300,000. Of course it was mostly their own organization that ran up the cost, and they asked for this in the first place, so they never gave me any grief.
underestimates... (Score:3)
Underestimating time needed happens all the time in the software industry. It is probably worse in the gaming industry, where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my current job, where I've put in ~100-hour weeks to fulfill them (telecommuting for many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160-hour weeks in the office for a game release cr
Re: (Score:3)
The moment I hear "Center of Excellence" I run for the exit.
Re:I'm retired now (Score:5, Funny)
I'm writing firmware today that stores the date as a 16-bit unsigned integer giving the number of days since 1/1/2000. When printed, it is converted to an 8-bit unsigned year and formatted with %02u (2 digits). I'm well aware that this will fail on 1/1/2100, but... I'll almost certainly be dead and no one will be running this code in 85 years' time, surely...
I'm starting to feel bad about it now.
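For the curious, the two failure dates in that scheme can be checked with shell arithmetic (GNU date assumed for the -d flag): the 16-bit day counter itself is good until 2179, but the %02u year formatting breaks first, on 1/1/2100.

```shell
# Day counter value on 2100-01-01 under the "days since 2000-01-01" scheme
# (GNU date assumed for -d):
days=$(( ($(date -u -d 2100-01-01 +%s) - $(date -u -d 2000-01-01 +%s)) / 86400 ))
echo "day counter: $days"    # 36525 -- still fits a 16-bit counter (max 65535)
# But the 8-bit year offset reaches 100 in 2100, and %02u happily prints 3 digits:
printf 'formatted year: %02u\n' $(( 100 & 0xFF ))   # prints "100", not a 2-digit year
```

So the printed date goes wrong 79 years before the counter itself wraps (65535 days is roughly 179 years).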
$24,000 (Score:2, Interesting)
I was in charge of ordering a leak correlation system for the water utility that I work for. The system I chose was not quite what we needed, but it worked. One week after the warranty expired, I dropped the correlation unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.
Outage.. (Score:5, Interesting)
I once unplugged the wrong thing in a datacenter, which took 20k domains offline. I traced the cable from the machine to the wall two or three times before pulling it, too...
They didn't have any cable management and only one border router..
Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.
Re: (Score:2)
Something similar. I took almost an entire ISP down. We had a few servers with about 200 domains running BSD located at their "data center", which was more like a couple of shelves and a long bench. Anyway, they were supposed to be running a script to verify that two servers were mirroring the other two. I got lazy and stopped checking the logs for it, and eventually they stopped running the backups or the script to verify them. One day a drive failed and about 50 domains were offline. I couldn't remote into any serv
Re: (Score:3, Interesting)
That reminds me of a cleaner who, for some unknown reason, had keys to open all the rooms, including the server room. Around Christmas time she needed to find a wall plug for the Christmas tree. She found one in the server room with the switches/routers/UPS/backups/aircos, and just plugged the Christmas lights into an unused socket between the UPS and the switches. Of course, as usual, the Christmas lights didn't work and short-circuited the network, which
Re:Outage.. (Score:4, Informative)
DNS servers on the same subnet. You know, the thing you aren't supposed to do, but everyone does anyway.
Re: (Score:2)
Re: (Score:3)
My DNS servers are on the same subnet and there isn't one cable anywhere you could unplug that would take them both offline.
What about:
* Router misconfig, takes out default gateway for a while for both
* An extra cable is added and MSTP/RSTP was disabled by accident, or something like that.
* etc etc
Anyway, for all your proud boasting, you may one day discover that people do the funniest things. If your DNS servers are in fact the same box with two IPs ...
Re: (Score:2, Informative)
Be careful about criticizing others. Routers don't have default gateways; they have null routes. They can also be set up to be redundant gateways for others and have many redundant null routes themselves...
Turning off STP on just one router would never be a problem. There are master and standby root bridges. Even if they both go down, others will step in to take the job. It would require a total network shutdown of all layer-three equipment before it would be a problem, and even then, TTL limits and excess t
Re: (Score:2)
there isn't one cable anywhere you could unplug that would take them both offline.
I didn't say they were invulnerable. Calm down.
Re: (Score:3)
Hmm...
1. Create a domain.
2. Have that domain host a single page saying "Nothing can take down this page."
3. Have that page and DNS server hosted in a datacenter in an enemy country.
4. Sit back and watch.
Weaponized hubris - what could possibly go wrong?
Re: (Score:3, Insightful)
As with most mistakes, it is part of a system that is faulty and awaiting one simple mistake to escalate.
Any one human can make a mistake. However a good system should have built in methods to protect against this.
Why wasn't there a backup system? Why didn't it have failover network/power? Why wasn't there proper labeling?
Chances are there was a culture of trying to save money: paying for a redundant system costs twice as much, or more. Having those network guys spend hours cleaning up and reorgani
Re:Outage.. (Score:5, Interesting)
"As with most mistakes, it is part of a system that is faulty and awaiting one simple mistake to escalate."
Couldn't agree more.
"Chances are there was a culture of trying to save money"
Sometimes the "cargo cult" is so ingrained that even the techs are unable to see it.
Anecdote:
I was in a hiring process; I don't remember if it was Google or Amazon. One of the questions (from a hands-on tech team lead) was about a single server that went crazy and couldn't spawn any more processes, so it was almost impossible to do anything with the computer. It was still offering whatever services it hosted just fine.
It went more or less like this:
Me: Has this happened before?
Recruiter: Nope.
Me: So... Can I try this, or that, or this other one?
R: No, because you can't run any new process.
M: OK, reboot it (I of course know saying something like that is taboo for a Unix/Linux sysadmin). Let's look at the boot messages to see if we get some clue, and let's monitor it afterwards to see if this happens again. If it does, we will be in a better position to diagnose; if not, we will put it on the "computer gnomes" account.
R: Won't try to diagnose anymore before rebooting?
M: Nope. My time is valuable and there will surely be more productive things on my to-do list.
R: But the computer hosts a service that, if turned off, will cost the company a bazillion!
M: Nope. If that were the case, the powers-that-be would have engineered the service with high availability in mind, which in turn means we could reboot the server without further hesitation. Since that's not the case, the implication is that the business already considered it a non-critical service, so the point above about me costing money still applies.
R: But, but, but...
[...]
Of course, I knew from the very beginning that the answer he wanted was a way to list the processes without spawning a new process, so after a while I went down that route. I vaguely remember there was some Bash builtin that would let me do it, though not exactly which one; but at the time I wanted to see the culture of that place.
There's no need to say I wasn't hired. But I didn't want to be hired either. Not on that team, at least.
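Incidentally, the builtin-only trick the interviewer was apparently fishing for can be sketched roughly like this, assuming a Linux box with /proc: globbing and `read` both happen inside the already-running shell, so no new process is spawned.

```shell
# List PIDs and command names without fork()ing a single new process.
# The /proc glob and the `read` builtin are handled by the shell itself
# (assumes Linux with /proc mounted):
for p in /proc/[0-9]*; do
  read -r comm < "$p/comm" 2>/dev/null && echo "${p##*/} $comm"
done
```

Each output line is a PID followed by the process name from /proc/PID/comm; processes that exit mid-loop are silently skipped.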
Re: (Score:3)
"I just read you as, "not my problem""
Yes, that's the case... from a certain point of view.
I usually respect others' work enough to give them their due credit. In this case, it means I credit the system architect with being able to design the system properly. No high availability means it's not a critical server, so I adapt my procedures accordingly.
"figuring out what went wrong is precisely your job"
No, it isn't. My job is to produce the most value for the company within my assigned competencies. Some
Re: (Score:3)
The Job interview process is actually a two way process.
The company needs/wants the resource; that is why there are open positions.
The Person needs/wants a job or a better job, that is why they are applying.
Even at the height of the last recession, and it was a big one, average unemployment in America was under 10% of the population. While that created a market where employers had the advantage, it was only an advantage, not supreme power.
1. The employers wanted people who were currently employed (Using a
Re: (Score:3)
I unplugged the only border router.
My HD DVD player and collection... (Score:2)
Heh - would have to total all that up... sigh... but it still works!
Improper use of systems (Score:4, Interesting)
Well... (Score:5, Interesting)
I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.
Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.
I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transitioned over, the final rsync would be faster.
So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.
Next day everyone is working happily, everything is working smoothly, no worries.
Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.
So, during the night, the thing automatically rsynced and overwrote an entire day's work for about 80 people.
Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).
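For what it's worth, a guard like this makes that failure mode much harder (the paths, hostname, and sentinel location are all hypothetical):

```shell
#!/bin/sh
# Hypothetical nightly pre-cutover sync job. Gating it on a sentinel file
# means disabling it on cutover day is a single `rm`, not an edit someone
# has to remember under pressure.
SENTINEL=/etc/migration-in-progress        # assumed path
[ -e "$SENTINEL" ] || exit 0               # cutover done: job is a no-op
rsync -a /srv/home/ newserver:/srv/home/   # hypothetical source and target
```

With this in place, turning the old machine back on does nothing destructive: the sentinel was removed at cutover, so the cron job exits immediately.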
Re:Well... (Score:5, Insightful)
Definitely partially my fault for not disabling the cron job,
Or pulling the network cable. You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.
Re:Well... (Score:5, Interesting)
You know the old saying, "make something idiot-proof and someone will come up with a better idiot."
They'd have plugged it back in. Again, the guy physically went into the server room and pushed a button.
I certainly should've disabled the cron job or, better yet (as pointed out by the AC down there), have known what rsync actually was and used it properly. I know I said I did in the original post, but in retrospect I couldn't have, as it wouldn't have overwritten everything. This was about 20 years ago...
Re: (Score:2)
Or pulling the network cable. You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.
Since this was a server, unless he was at the console copying it off to a USB stick, he'd probably hook the server back up to the network so he could copy it to his client.
Re: (Score:2)
Is that a quote from somewhere? Who said that?
In any case, I'm adding it to my list of quotes.
Re: (Score:2, Offtopic)
" And odds are, they will outrank you."
No, the odds are they *are* you.
Re: (Score:2)
Everyone else has already told you what you did wrong 20 years ago. Here's my take: If you were actually rsync'ing all of the user data, then the developer wouldn't have known the difference and would never have had the inkling to turn the old machine back on.
Re: (Score:2)
I believe they were looking for old versions of some files, possibly from directories they never asked to be rsynced.
And, again, 20 years ago. I have definitely learned my lessons AGES since then (:
Re: (Score:2)
Uh... doesn't rsync have a flag to only sync files that are newer? If 80 people did their work and saved it on the new box, how did rsyncing their data from the old box overwrite newer files?
Re: Well... (Score:2)
Yup, and as I said to the other people who said that very sane thing "you are right, and I was likely wrong about using rsync. This was 20 years ago and I probably didn't know how to use rsync yet."
Re:Well... (Score:5, Funny)
Can't speak for a cost, but I thought this one was funny...
A company I used to work for used Lotus Notes. For some reason, and I don't remember exactly what the reason was, I set up my e-mail to copy my mail to another account. I think it was just a "hey, I can do this" thing, playing with the e-mail system. Unfortunately, I made a typo in the name of the account to forward to.
When I came in the next morning, the e-mail system was running really slowly. Everyone was complaining about it. I logged into my e-mail and, lo and behold, there were all sorts of e-mails in my account complaining about how it couldn't send this message to the other account and, of course, the contents of each e-mail was a message that it couldn't send this message to the other account, whose contents were a complaint that... you get the idea.
I turned off the script and deleted all the e-mails. And, suddenly, from the office next door, I hear, "Hey! E-mail is working again!"
Shhhhh...
Re: (Score:2)
Good call... this was about 20 years ago, and it's not likely that I used rsync (not sure I knew how to do that back then).
My memories of the event are not... perfect. But it's likely that I just used scp to dump entire directories. I couldn't have been using rsync because, as you say, it wouldn't have done as much damage.
Re: (Score:3)
They weren't supposed to, but the head developers were like gods at that place. They had the root passwords and I wasn't allowed to restrict them in any way.
It stemmed from them being among the original 10 people when the company started, and even though the place was now a 200+ employee organisation, in some ways they still ran it like 10-person operation.
I did vocally complain about this. They quite often went in and overrode stuff I did.
Re: (Score:2)
*laughs*
This was 20 years ago, and in a company that still thought it was very small even though it was medium-sized. The devs were gods. They outranked me in every way and had root access to all my servers.
Re: (Score:2)
As I've mentioned, this was about 20 years ago, so I can't really remember it 100%.
However, this was one of those shops that started with about 10 employees, and even though by then it was 200+, it still operated as if it was a small, small company. The head devs were part of the original 10, and they were like gods. They had full access to EVERYTHING. Including root access to all the servers. They were basically allowed to do whatever they wanted.
If something went wrong where they and someone else was invo
Patent filing missed. (Score:2)
Re: (Score:2)
Yeah, it would have been a bullshit patent. I could have still made millions from it, though. Sadly, that's how our patent system works(?).
Re: (Score:3, Interesting)
Is it purely your mistake? (Score:2)
I have been part of a large mistake costing hundreds of thousands of dollars.
However, most mistakes are part of a chain of little mistakes, all combining into a big one. For example, say someone happens to trip over a plug and unplugs a production server. Then come the questions: why was the cable run where it could be tripped over, and who decided it wasn't worth the time or money to get a better cable-management system...
Normally a person will get fired for a mistake if it was due
$480 Phone bill (Score:2, Funny)
Re: (Score:2)
Tech mistake (Score:2)
Whole computer. (Score:2)
$10k. .... per day (Score:2)
I made a calculation error that cost $10k per day. Took 9 months to straighten things out.
I later won an award for outstanding work.
Re: (Score:2)
Oh the joys of working at IBM.
Re: (Score:2)
I guess you're a banker..
Software bugs (Score:2)
Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
- rounding error when programming a timer in an embedded system, resulting in a baud rate to be 10% off, causing problems with several units shipped to customers
- overflow of an 8-bit counter, resulting in a serial protocol failing
Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
A Photographic Slide (Score:2)
I think one of my coworkers stole it, as I did not get along with them.
Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
Was fired not long after.
About $2M -- But not really a mistake... (Score:5, Interesting)
So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).
Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
Re:About $2M -- But not really a mistake... (Score:4, Informative)
The poster was not the boss. The boss calls the final shots. The technician's job is to present the risks (trade-offs) as accurately and clearly as possible. If the boss(es) then choose to ignore the risk warnings, the blame falls on them. If you usurp their power, you are out the door (unless it's a legal matter).
Incidentally, I was in a somewhat similar situation where marketing planned to release about 30 websites for satellite offices all at once along with a press release about the new sites. I pointed out our "budget-oriented" infrastructure may not be able to handle such a sudden load, and suggested staggering the releases. Other technicians agreed with my warning, but the marketing chief was really disappointed, saying something like, "It's better P/R to have one big release. Staggering the releases takes the punch out of it."
I was tempted to respond, "30 crashed sites is not good P/R either", but smartly bit my tongue (based on prior experience with "reality" statements). He was a true P-H-B, always looking for a cheap, short-sighted shortcut, but he tried to blame us when his paper tigers got eaten. He drove one guy to retire early. Later he was under investigation for giving contracts to his buddies instead of basing them on merit. Not surprisingly, his buddies were also idiots.
Fried an early... (Score:3)
$40,000 - $60,000 (Score:2)
Took an online trading company offline for a day (Score:5, Interesting)
I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work not just on servers, but on installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.
I made the mistake of showing competence in networking, so I was asked to "expand my role" (new title, same salary) and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random-manufacturer switches). The actual upgrade went fine, and basic testing (ping) showed everything stable, but as soon as trading opened the next day everything went to hell: performance dropped through the floor and customers started calling in about trades timing out. Long story short, it turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, and half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet".
Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at a (also doomed) dot-com online supermarket.
On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.
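For the record, the "lock speed to 100 and duplex to full" workaround on Solaris was a handful of ndd settings along these lines (written from memory, so treat the exact parameter names as approximate and check them against your Solaris release):

```shell
# Force hme0 to 100 Mbit/s full duplex instead of autonegotiating with the
# switch. Solaris-only; parameter names recalled from memory, so verify
# against your release's documentation before relying on them.
ndd -set /dev/hme adv_autoneg_cap 0
ndd -set /dev/hme adv_100fdx_cap 1
ndd -set /dev/hme adv_100hdx_cap 0
ndd -set /dev/hme adv_10fdx_cap 0
ndd -set /dev/hme adv_10hdx_cap 0
```

The matching switch port then has to be hard-set to 100/full as well, or the duplex mismatch just moves to the other side of the link.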
Re: (Score:3)
I knew a guy who did support for a multi-million-pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers, because he did all the support remotely and it would be a 100-mile trip up to their office if a machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.
So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone ac
I killed three networks, but that was planned. (Score:3)
obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC Card disk drives in the switches that were the backbone of the Internet, when we kept the vendor's engineers on the phone all day for a failed switch... and read the duty cycle of the drives to them: something like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.
But I got a haircut indeed when we had to get our stuff out of a colocation facility that was shutting down. We built a mirror data system in the new place, had the trunks up, and cut over the traffic. Then it was time to demanage and power down the old shelf. Telcordia had assigned a code to the new unit that was one letter different from the old one.
The good news is I got the new one back up in 20 minutes, and they didn't stake me out over an anthill.
My $5 million bug (Score:2)
I wonder... (Score:5, Insightful)
How many people will refrain from posting because the statute of limitations hasn't run out yet?
Re:I wonder... (Score:5, Interesting)
How many people will refrain from posting because the statute of limitations hasn't run out yet?
Well, I'm certainly not going to admit to the most costly mistake as it appears no one realizes it was me and what I had done. So I'm not gonna do it; wouldn't be prudent.
The most embarrassing mistake was when I inadvertently brought down the client's network (a major hospital) in the middle of the day. I didn't realize what I had done until about three minutes later, when about a dozen IT guys flooded the computer room, paying particular attention to the area I had just been working in. It appears I made an error. To this day I am likely persona non grata in that computer room.
Click of death (Score:5, Interesting)
My worst IT disaster was suffering a hard drive failure, the click of death. I had a few days' warning, and I deliberately kept the PC on 24/7 instead of the normal switching on and off, to make sure the drive stayed alive until its replacement arrived.
Obviously I had to turn the PC off to change the drive; it was not hot-swappable. When I powered the PC up, the old hard drive failed and didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.
Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label wouldn't come off, heated up the clothes iron, then applied the iron directly to the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my PC and powered up.
The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.
Out of interest, the drive failed about three months before I would have done a forced drive change as a backup / failure-prevention measure. I got lucky.
Re: (Score:3)
Wait, what?
Not sure how much $$$ (Score:2)
F-16 panel flew off in flight (Score:5, Interesting)
Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.
The aircraft flew, the panel broke off, punching several other holes in the side as it departed.
Training mission aborted. Much sheet metal work was needed.
Actual repair cost? Unknown, but easily 5 figures if not more.
Power cable mistake (Score:2, Interesting)
I was working in IT for a desktop publishing house. We had spent just under $4000 on 36-inch flat-panel displays. I accidentally plugged in the printer power cable and immediately fried a monitor. My boss was not happy. The internship did not go well for the rest of the summer.
2 million (Score:2)
I let an upgrade bug slip by me during a software upgrade of the accounting software. In retrospect it should have been caught before it got out of hand. It got out of hand in about 3-4 seconds and had a cascading effect, bringing down the company's whole datacenter.
It happened when a "guaranteed" bid was due for a 2 million dollar job. We had nothing. Not so guaranteed...
Fortunately (?) I had an ownership stake in the company, so I screwed myself too. Figuring ~12% profit on the job was typical an
~$60k (Score:2)
Just my time (Score:3)
Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.
The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.
Two months in, the new provider vanished, along with all of my data. I wasn't very concerned about the month's worth of money I had lost by not getting the 3 months I had paid for; I think it was only about $15. "Okay," I thought, "I'll just pull my data out of my nightly backups and move on." It turns out I had forgotten to point my local cron script, which pulled the data over rsync, at the new IP address. My backups had not been pulled in over 2 months.
Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.
I learned my lesson, though.
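A cheap guard against exactly this failure mode is to check the backup's freshness, not just its existence (the path below is hypothetical):

```shell
# Alert if the newest backup is older than two days (path hypothetical).
# `find -mtime -2` prints the file only if it was modified within ~48 hours,
# so an empty result means the backup is stale or missing entirely.
BACKUP=/backups/nightly.tar.gz
if [ -z "$(find "$BACKUP" -mtime -2 2>/dev/null)" ]; then
  echo "WARNING: $BACKUP is stale or missing" >&2
fi
```

Dropped into the same crontab as the pull job, this turns a silently dead backup into a daily nag you can't miss.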
The Final Nail (Score:5, Interesting)
The total cost was actually sweet FA in numerical terms, but I think I put the final nail in the company's coffin.
My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.
I was taken on as an Engineer, but rapidly found myself wearing a wide range of hats, from Sales to Customer Support to System Design to Project Management to web development in PHP and, finally, IT Support.
Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.
I will say, the Server was the problem.
It was a dinosaur. It was 14 years old, twice as old as the company, and had been bought second-hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows 2000 Server, and was dedicated solely to serving the company accounts and acting as networked file storage. Inside the case were four HDDs... a pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It still had a pair of lockable Zip disk drives fitted, though the keys were long lost, along with a floppy drive and a CD drive with no write ability. Or ability to read DVDs.
It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.
By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.
So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.
The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning, at 1am or so. Each successive backup would overwrite the previous one, because there just wasn't the spare space to hold two (no money to pay for it).
I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.
Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.
I just did what I thought I could to keep the Titanic afloat.
So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.
But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....
Unfortunately - that only works if the damaged disk decides to drop out of the array.
It didn't.
I find th
Re: (Score:3)
There's a clawing feeling that it was somehow 'My Fault'.... and it probably was. With hindsight, maybe I should've set it to run the backup while we were in the building, rather than at home over the weekend. I could've used an external drive to keep one locally too. There were probably a dozen things that I could've done that would've prevented it.
Only one thing which really mattered... verifying your backups. If you don't do that, there's almost no point in making any. (It gives you something to pray for...)
Re: (Score:3, Informative)
Databases should be backed up with a text dump (such as a list of SQL INSERT statements), not the raw database file, because the file's internal pointers are fragile. A text dump "flattens" those pointers. If you do back up the actual database file, shut off all DB writes first, for the duration of the backup. And keep multiple generations.
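This is easy to demonstrate with SQLite, whose connection object exposes exactly this kind of text dump (a sketch only; the grandparent's accounts database was presumably something else entirely, and the function names here are made up):

```python
import sqlite3

def dump_to_sql(db_path, out_path):
    """Back up a database as a flat text file of SQL statements,
    not as a copy of the binary database file."""
    conn = sqlite3.connect(db_path)
    with open(out_path, "w") as f:
        for stmt in conn.iterdump():  # emits CREATE/INSERT statements
            f.write(stmt + "\n")
    conn.close()

def restore_from_sql(sql_path, db_path):
    """Rebuild a database from the text dump -- which is also the
    cheapest way to *verify* that the backup actually restores."""
    conn = sqlite3.connect(db_path)
    with open(sql_path) as f:
        conn.executescript(f.read())
    conn.close()
```

Restoring each dump into a scratch database and counting rows would have caught the problem long before the disk started screaming.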
Re: (Score:3)
I honestly had no idea how it actually backed up, it was a function within the accounts application itself to generate the backup. Which it did, to a local disk. I then had an automatic scheduled upload of that backup to the server.
Ultimately, like I said, I'm not really an IT guy - I was the one with google and enough patience to fuck about until things worked again. We didn't have one. We did pay one company a hundred quid a month for a while in case something went TU, but we stopped paying him six months
Not my mistake, but my boss' (Score:3)
I didn't get some contractors fired soon enough (Score:2)
Two totally incompetent twits from a populous South Asian country. Cost about $32k in salary and a 4-month schedule slip. Another contractor, who is competent, said she suspected they gave 'ghost' interviews, a common practice in her country. I heard managers say the same thing: that the two who showed up for work were not the ones they phone-interviewed. They did not know command-line basics in either bash or Windows, how to use remote desktop, Java, unit tests, and other things we required.
Oddly enough
Killed a project by ordering a code audit (Score:2)
I was brought onto a small web startup project as a co-lead. By this time the project was already 2.5 years old and had been rewritten at least three times by progressively less lousy developers. The final iteration was built on CodeIgniter (MVC framework), a decent choice in 2013.
My first day I'm browsing the codebase to see what's what, and a grep finds something like "UPDATE my_table set foo=" . $_POST['bar']. Not in a controller... not in a model... in a view.
So I immediately told the other leads tha
Multiple multi-million dollar satellites. (Score:2)
I had a friend whose job it was to find a way to break satellites. She said she was quite often successful.
(Hey, the OP didn't say it had to be an accident.)
Re: (Score:2)
Re:Multiple multi-million dollar satellites. (Score:5, Funny)
Now the programmers in the audience could probably think of like 10 different specific things that could be coded into the system to prevent that from happening, but this company didn't. Which really isn't too surprising. I asked one of the devs on the ground systems team if the ground system was using GMT or UTC. His answer was "What's the difference?" I was able to infer from his answer that it was most likely GMT, and that did appear to be the case. Somewhere deep in the bowels of the system there was presumably some piece of code written by an Indian contractor with a math degree adjusting times for leap seconds, but it wasn't in any code that anyone knew about.
The early history of that company read like a Monty Python sketch. The first satellite exploded on the launch pad. The second satellite fell over and then exploded. The third satellite burned down, fell over, exploded and then sank into the swamp. The fourth satellite got into orbit and was promptly bricked by sending the wrong version of Windows(!) to it. To be fair they only had to do that because they launched it with the wrong version of Windows(!!) in the first place. One would think that ANY version of Windows would be the wrong version of Windows to shoot into space, but that's why you're not the head of a billion dollar satellite company.
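For anyone wondering about the GMT/UTC dig: colloquially they're treated as the same thing, but UTC is the standard that carries leap seconds, and most software can't even represent one. A quick illustration in Python (used here purely as an example; the satellite ground system obviously wasn't Python):

```python
from datetime import datetime, timezone

# 2016-12-31 23:59:60 UTC was a real leap second, but
# datetime, like most time libraries, refuses to represent it.
try:
    datetime(2016, 12, 31, 23, 59, 60, tzinfo=timezone.utc)
except ValueError as e:
    print("no leap seconds here:", e)
```

Two systems that disagree on leap-second handling silently drift whole seconds apart, which is exactly the kind of discrepancy nobody notices until a satellite does.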
Re: Multiple multi-million dollar satellites. (Score:3)
Wow. Just, wow.
Re: Multiple multi-million dollar satellites. (Score:3)
I talked to someone recently who lost a day of science data from a UAV because the Windows system driving the instrument decided to auto update while in the air with something like a 56kbps data rate.
I recently built a field instrument and made it Linux based specifically to prevent things like that, as well as to keep power and latency down by being able to kill unnecessary background tasks.
BGP4 (Score:2)
During an acquisition, the company being acquired helpfully passed along the list of AS they used in their BGP4 configurations in their core routers.
They helpfully had included the ones from other networks they provided connectivity to as well, but just had sent the AS numbers over in one big list, unlabeled, along with the AS their network originated: "Do these."
So during the network integration I dutifully entered the entire list of AS into the core routers as AS to be originated. Needless to say, hilari
Surrendered three letter .COM domain (Score:2)
Got this domain "hsa.com" in the *very* early days of the Internet (pre-web). Decided that since we were a Canadian company, we should have a Canadian domain, so I surrendered it and got hsa.on.ca. (We weren't allowed to have hsa.ca, since all our offices were in Ontario...)
A three letter .com address would probably have been the most valuable asset of the company :-).
Out-of-sync DB entries for CC payments (Score:2)
At the same time, there was a bug somewhere that periodically caused only one of the two tables to be written to. When we noticed that the tables were out of sync, we immediately jumped to the timestamps to make some sense of the situation, which of course didn't work in this case.
Took only a few hours to sort out
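Joining the two tables on the transaction key is the reliable way to find the gaps once the timestamps have let you down. A sketch in Python with SQLite (the table and column names are made up; the poster's schema isn't shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments_a (txn_id TEXT PRIMARY KEY, amount INTEGER);
    CREATE TABLE payments_b (txn_id TEXT PRIMARY KEY, amount INTEGER);
    INSERT INTO payments_a VALUES ('t1', 100), ('t2', 250), ('t3', 75);
    INSERT INTO payments_b VALUES ('t1', 100), ('t3', 75);
""")
# Rows written to one table but not the other: join on the key,
# don't trust the timestamps.
missing = conn.execute("""
    SELECT a.txn_id FROM payments_a a
    LEFT JOIN payments_b b ON a.txn_id = b.txn_id
    WHERE b.txn_id IS NULL
""").fetchall()
print(missing)  # [('t2',)]
```

Running the same query with the tables swapped catches drift in the other direction.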
Crashed the Uni Mainframe Once (Score:2)
Mostly I make my career out of fixing other people's tech mistakes. Which is not something that un
About three days work, but PITA (Score:2)
Basically a loading tool with a bug I knew from testing: you could set it correctly once in production, but if you set it twice, every user was f*cked up and could only be fixed from the web interface, about 5 clicks per user, no programmatic solution. And of course we had an error in the production setup. I altered that part - which I could - but forgot to take out the "you can run this only once" settings. Hundreds of users borked, and the vendor support would take forever or claim there's no other way, w
a bug i found once (Score:2)
Re: (Score:2)
Posted this before (Score:3)
But it's worth repeating in this context. Thankfully, it wasn't me.
When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.
Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help after 4 PM.
The techs had to be found, gathered, and flown out from CA to disassemble and reassemble said machine. No wires until 1 or 2 PM Monday. Much money lost for all customers.
To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.
$8m UPS modification. (Score:3)
One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load-testing function. I came up with a small design involving a contactor and some minor wiring changes, and we were part way through implementing it on every UPS at this site.
This UPS was part of a redundant pair that fed an emergency shutdown system at an oil refinery. In between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue, then started the process for the second one. After calling the control room to let them know they would receive an alarm, I switched off the UPS and was suddenly met with a stream of profanities over the radio.
We lost power to 80 field instruments, which triggered a fail-safe action on the shutdown system, tripping 4 units at the refinery; one of them was the FCCU, which is core to a lot of refinery processes. To add insult to injury, the unit couldn't be hot-restarted because of a stuck valve, and it then thermally contracted, breaking off large chunks of coke from the overhead line which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days. I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.
Total cost of the outage was about $8 million. Fortunately only partially my fault.
interesting synchronicity (Score:3)
Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).
I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".
Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.
Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all datasets in question. I lost very little work (only scripts were executed against these datasets, and I still have all the scripts).
My real screw up?
Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.
I had become accustomed to my home system being deterministic in the order it listed things. My bad.
This was back at the very beginnings of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).
Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.
Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.
In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.
I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.
I nearly cost my company millions (Score:3)
I nearly cost my employer several million by fixing a bug.
The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line, the parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, door panels, etc.) and when a stillage was full, or when a certain amount of time had passed since the first part was picked, then a label was printed, applied to the stillage, and it was dispatched over the road to the factory.
Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.
I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.
Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.
The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!
I looked back over the code and found that there were actually two very similar bugs, one of which was triggered fairly regularly, and one which happened much less frequently, but the same fix worked for both of them.
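The story doesn't say what the roll-over bug actually was, but the classic shape of it is comparing wrapped serial numbers naively. A hedged sketch in Python (the original system certainly wasn't Python, and these function names are invented for illustration):

```python
def next_serial(n):
    """Serial numbers run 0001..9999, then wrap back to 0001."""
    return 1 if n >= 9999 else n + 1

def is_newer_naive(a, b):
    # Buggy: breaks every time the counter wraps past 9999.
    return a > b

def is_newer_wrapped(a, b, span=9999):
    # Treat the serial space as circular: a is newer than b if it
    # sits less than half the span ahead of b, modulo the span.
    return 0 < (a - b) % span < span // 2

print(is_newer_naive(1, 9999))    # False -- the monthly failure
print(is_newer_wrapped(1, 9999))  # True  -- 0001 follows 9999
```

The same circular-comparison trick is why TCP sequence numbers survive wrapping a 32-bit counter.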
Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."
Re: (Score:2)
Re:Intel CPU sockets are terrible. (Score:4, Informative)
Re: (Score:2)