
Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?

NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500), or what software bug did you let slip that caused damage? (No comment on the details, but it cost a client about $20K.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post; that would scare off too many people!

  • by Anonymous Coward on Saturday July 04, 2015 @01:04PM (#50044143)

    But back in the 1960s, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
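    The usual after-the-fact repair for two-digit years was "windowing": read each stored year relative to a pivot. A minimal Python sketch, assuming a hypothetical pivot of 50 (real systems picked their own). Note that it only postpones the problem, since a stored 50 decodes as 1950.

```python
def expand_two_digit_year(yy: int, pivot: int = 50) -> int:
    """Map a stored two-digit year (0-99) to a four-digit year.

    Values below `pivot` are read as 20xx, the rest as 19xx. The default
    pivot of 50 is a hypothetical choice for illustration only.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"two-digit year out of range: {yy}")
    return 2000 + yy if yy < pivot else 1900 + yy


# A record from 1968 and one from 2015, both stored with only two digits.
print(expand_two_digit_year(68))  # 1968
print(expand_two_digit_year(15))  # 2015
```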

    • by Rei ( 128717 ) on Saturday July 04, 2015 @01:11PM (#50044175) Homepage

      I don't have anything nearly that bad; my worst only cost me data. A friend taught me (while I was still learning Linux) a trick: how you could play music with dd by outputting the sound to /dev/dsp. But as I said, I was still learning Linux and hadn't quite gotten all of the device names into my head, and I mixed /dev/dsp up with /dev/sda...
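    dd itself will happily write to whatever output you name. One way to guard a raw-output script against this exact mix-up is to refuse block devices: the OSS sound device /dev/dsp is a character device, while a disk like /dev/sda is a block device. A minimal sketch; the function names are hypothetical, and it guards a Python writer, not dd.

```python
import os
import stat


def is_block_device(path: str) -> bool:
    """Return True if `path` exists and is a block device (e.g. a whole disk)."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    return stat.S_ISBLK(mode)


def open_for_raw_output(path: str):
    """Open `path` for binary writing, refusing to target a block device.

    A sound device like /dev/dsp is a character device, so it passes;
    a disk like /dev/sda is a block device, so it is rejected.
    """
    if is_block_device(path):
        raise PermissionError(f"refusing to write raw data to block device {path}")
    return open(path, "wb")
```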

    • My worst was pretty tame in comparison. I overpromised on some specs I couldn't deliver on in the end. Cost the client about $4k. Oops.

      • by Anonymous Coward

        I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.

        I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.

      • I was on the NASA Genesis price team. Only a few hundred million lost on that one when it crashed into Earth...

      • Re:I'm retired now (Score:5, Interesting)

        by JaredOfEuropa ( 526365 ) on Saturday July 04, 2015 @05:53PM (#50045359) Journal
        I over-promised on a time estimate once, or rather: I let myself be convinced to pad the estimate. Not by a vendor but by the client! One of the client's systems was due for an upgrade, and between myself and the support guys in India I figured it would be a 19 man-day job. I would run it as a "small project" meaning that I could run it any way I wanted. However, the client asked me: "Can you make the estimate 21 days?" That meant it would be a "proper" project run according to the client's methodology, which the client preferred for budgetary reasons. I had nothing to worry about according to the manager, a PM would be assigned to me to take care of the project formalities. So I agreed.

        At the time I was not aware of the unbelievable bureaucracy of large multinationals, and what this would do to my project. Normally I estimate the amount of real work, and add 20% for project management overhead. Maybe another 20% for red tape. But in this case, the PM was more or less forced to involve an ever increasing legion of other teams from various Centers of Excellence in the client's organization. A simple upgrade turned into a project that ran for over half a year. And by agreeing to this approach, I probably cost the client around $300,000. Of course it was mostly their own organization that ran up the cost, and they asked for this in the first place, so they never gave me any grief.
        • Underestimating time needed happens all the time in the software industry. It probably is worse in the gaming industry where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my job now where I've put in ~100 hour weeks to fulfill (telecommuting many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160 hour weeks in the office for a game release cr

        • The moment I hear "Center of Excellence" I run for the exit.

    • by AmiMoJo ( 196126 ) <mojo@world3.nBLUEet minus berry> on Saturday July 04, 2015 @06:40PM (#50045515) Homepage Journal

      I'm writing firmware today that stores the date as a 16 bit unsigned integer giving the number of days since 1/1/2000. When printed it is converted to an 8 bit unsigned year and formatted with %02u (2 digits). I'm well aware that this will fail on 1/1/2100, but... I'll almost certainly be dead and no-one will be running this code in 85 years' time, surely...

      I'm starting to feel bad about it now.

  • $24,000 (Score:2, Interesting)

    by Anonymous Coward

    I was in charge of ordering a leak correlation system for a water utility that I work for. The system I chose was not quite what we needed, but it worked. One week after the warranty expired, I dropped the correlation unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.

  • Outage.. (Score:5, Interesting)

    by steveb3210 ( 962811 ) on Saturday July 04, 2015 @01:09PM (#50044163)

    I unplugged the wrong thing in a datacenter once, which took 20k domains offline. I had traced the cable from the machine to the wall two or three times before pulling it, too.

    They didn't have any cable management and only one border router.

    Didn't lose my job. I was a very young sysadmin who was still learning, but good at what I did, and everyone kinda shrugged it off as a lesson learned.

    • Something similar. Took almost an entire ISP down. I had a few servers with about 200 domains running BSD, located at their "data center", which was more like a couple of shelves and a long bench. Anyway, they were supposed to be running a script to verify that two servers were mirroring the other two. I got lazy and stopped checking the logs for it, and eventually they stopped running the backups or the verification script. One day a drive failed and about 50 domains were offline. I couldn't remote into any serv

    • Re: (Score:3, Interesting)

      by Anonymous Coward

      That reminds me of a cleaner who, for some unknown reason, had the keys to open all rooms, including the server room. Around Christmas time she needed to find a wall plug for the Christmas tree. She found one in the server room with the switches/routers/UPS/backups/air conditioners and just plugged the Christmas lights into an unused socket, between the UPS and the switches. Of course, as usual, the Christmas lights didn't work and short-circuited the network, which

  • Heh - would have to total all that up... sigh... but it still works!

  • by pierced2x ( 527997 ) <pierced2x@gmail.com> on Saturday July 04, 2015 @01:14PM (#50044195)
    I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.
  • Well... (Score:5, Interesting)

    by Jethro ( 14165 ) on Saturday July 04, 2015 @01:16PM (#50044203) Homepage

    I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.

    Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.

    I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transition over the rsync would be faster.

    So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.

    Next day everyone is working happily, everything is working smoothly, no worries.

    Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.

    So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.

    Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).

    • Re:Well... (Score:5, Insightful)

      by drinkypoo ( 153816 ) <martin.espinoza@gmail.com> on Saturday July 04, 2015 @01:31PM (#50044297) Homepage Journal

      Definitely partially my fault for not disabling the cron job,

      Or pulling the network cable. You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.

      • Re:Well... (Score:5, Interesting)

        by Jethro ( 14165 ) on Saturday July 04, 2015 @01:46PM (#50044389) Homepage

        You know the old saying, "make something idiot-proof and someone will come up with a better idiot."

        They'd have plugged it back in. Again, the guy physically went into the server room and pushed a button.

        I certainly should've disabled the cron job or, better yet (as pointed out by AC down there), actually known what rsync does and used it properly. I know I said I used rsync in the original post, but in retrospect I couldn't have, as it wouldn't have overwritten everything. This was about 20 years ago...

      • by Kjella ( 173770 )

        Or pulling the network cable. You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.

        Since this was a server, unless he was at the console copying it off to a USB stick, he'd probably have hooked the server back up to the network so he could copy it to his client.

      • You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.

        Is that a quote from somewhere? Who said that?

        In any case, I'm adding it to my list of quotes.

      • Re: (Score:2, Offtopic)

        by radarskiy ( 2874255 )

        " And odds are, they will outrank you."

        No, the odds are they *are* you.

    • by adolf ( 21054 )

      Everyone else has already told you what you did wrong 20 years ago. Here's my take: If you were actually rsync'ing all of the user data, then the developer wouldn't have known the difference and would never have had the inkling to turn the old machine back on.

      • by Jethro ( 14165 )

        I believe they were looking for old versions of some files, possibly from directories they never asked to be rsynced.

        And, again, 20 years ago. I have definitely learned my lessons AGES since then (:

    • Uh... doesn't rsync have a flag to only sync files that are newer? If 80 people did their work and saved it on the new box, how did rsyncing their data from the old box overwrite newer files?

      • Yup, and as I said to the other people who said that very sane thing "you are right, and I was likely wrong about using rsync. This was 20 years ago and I probably didn't know how to use rsync yet."
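      The rule behind rsync's -u/--update option is documented as "skip files that are newer on the receiver". The function below is a hypothetical Python sketch of that single rule, not rsync's implementation; it also assumes both machines' clocks are roughly right, since it compares modification times.

```python
import os
import shutil


def copy_unless_receiver_newer(src: str, dst: str) -> bool:
    """Copy src to dst, skipping it if dst has a newer modification time.

    This mirrors the rule described for rsync's -u/--update option.
    Returns True if the file was copied.
    """
    if os.path.exists(dst) and os.path.getmtime(dst) > os.path.getmtime(src):
        return False  # the receiver holds the newer copy: leave it alone
    shutil.copy2(src, dst)  # copy2 also carries over the source mtime
    return True
```

With this rule, the old server waking up at night would have skipped every file edited on the new server that day.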

    • Re:Well... (Score:5, Funny)

      by R3d M3rcury ( 871886 ) on Saturday July 04, 2015 @04:51PM (#50045121) Journal

      Can't speak for a cost, but I thought this one was funny...

      A company I used to work for used Lotus Notes. For some reason, and I don't remember exactly what the reason was, I set up my e-mail to copy my mail to another account. I think it was just a "hey, I can do this" thing, playing with the e-mail system. Unfortunately, I made a typo in the name of the account to forward to.

      When I came in the next morning, the e-mail system was running really slowly. Everyone was complaining about it. I logged into my e-mail and, lo and behold, there were all sorts of e-mails in my account complaining that it couldn't send this message to the other account and, of course, the contents of each e-mail was a message that it couldn't send this message to the other account, and the contents of that message was a complaint that...you get the idea.

      I turned off the script and deleted all the e-mails. And, suddenly, from the office next door, I hear, "Hey! E-mail is working again!"
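      Mail systems generally break this kind of loop by counting hops (for example, the Received: headers a message has collected) and refusing to forward past a limit. A minimal Python simulation of a mailbox whose forward target is mistyped; the Message class and the limit of 25 are hypothetical, not how Lotus Notes worked.

```python
from dataclasses import dataclass

MAX_HOPS = 25  # hypothetical cap; real mail servers make this configurable


@dataclass
class Message:
    body: str
    hops: int = 0  # stands in for counting "Received:" headers


def run_forwarding_rule(first: Message) -> int:
    """Simulate a mailbox whose forward target does not exist.

    Every forward fails, the failure notice lands back in the same mailbox,
    and the rule fires again. Returns how many bounces were generated
    before the hop limit broke the loop.
    """
    bounces = 0
    queue = [first]
    while queue:
        msg = queue.pop()
        if msg.hops >= MAX_HOPS:
            continue  # drop instead of forwarding: this is the loop breaker
        # Forwarding to the mistyped address fails, producing a bounce that
        # wraps the original and arrives back in the same mailbox.
        queue.append(Message(body=f"Undeliverable: {msg.body}", hops=msg.hops + 1))
        bounces += 1
    return bounces
```

Without the `hops` check the `while` loop never ends, which is roughly what the mail server was doing all night.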


  • In 1993, I failed to file the US Patent on "A means of accessing a relational database via the Internet." If we'd known we could do it, CompuServe might still be around.
  • I have been part of a large mistake costing hundreds of thousands of dollars.
    However, most big mistakes are a chain of little mistakes that combine into one big one. For example, if someone happens to trip over a cable and unplugs a production server, the questions follow: why was the cable out where it could be tripped over, and who decided it wasn't worth the time and money to get a better system of cable management...

    Normally a person will get fired for a mistake if it was due

  • by Anonymous Coward
    When I was 12 years old and hanging out on BBSs in 1989, I didn't realize dialing Gilroy from San Jose was long distance (Both were 408 area code). My parents were not pleased at the nearly $500 phone bill.
  • I maneuvered downward the left button of the mouse attached to the computer I was working on, whose pointer was right on a small GIF saying "Send". That technically sent a message I should never have sent. Cost me a lot.
  • Not me, but a friend. In high school the best computer in the school was a 386SX. They decided to upgrade it to a DX by adding a maths co-processor to the main board. So they ordered one, and when it arrived, they gave it to my friend to install for some reason. Now, the chip had one corner cut, which you are supposed to line up with the cut corner on the socket, so you know it's seated the right way. Of course, my friend put it in completely backwards (because it fit in any direction). So he tries to boot u
  • by Anonymous Coward

    I made a calculation error that cost $10k per day. Took 9 months to straighten things out.

    I later won an award for outstanding work.

  • Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
    - rounding error when programming a timer in an embedded system, resulting in a baud rate 10% off and causing problems with several units shipped to customers
    - overflow of an 8-bit counter, resulting in a serial protocol failing

    Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
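    Both bugs in that list are easy to reproduce. A Python sketch, assuming a hypothetical 8 MHz peripheral clock, 16x oversampling and a 57600 baud target (the poster's actual hardware isn't given): plain integer division truncates the divisor and lands about 8.5% fast, rounding to nearest lands about 3.5% slow. Rounding shrinks the error but cannot remove it at this clock. The last function shows the 8-bit wraparound.

```python
CLOCK_HZ = 8_000_000  # hypothetical peripheral clock
OVERSAMPLE = 16       # common UART oversampling factor


def divisor_truncated(baud: int) -> int:
    """Integer division simply drops the fraction."""
    return CLOCK_HZ // (OVERSAMPLE * baud)


def divisor_rounded(baud: int) -> int:
    """Add half the denominator first so the result rounds to nearest."""
    denom = OVERSAMPLE * baud
    return (CLOCK_HZ + denom // 2) // denom


def baud_error_percent(baud: int, divisor: int) -> float:
    """Percentage difference between the achieved and the requested rate."""
    actual = CLOCK_HZ / (OVERSAMPLE * divisor)
    return 100.0 * (actual - baud) / baud


def increment_u8(counter: int) -> int:
    """An 8-bit counter: 255 + 1 wraps back to 0."""
    return (counter + 1) & 0xFF


for name, div in (("truncated", divisor_truncated(57_600)),
                  ("rounded", divisor_rounded(57_600))):
    print(name, div, f"{baud_error_percent(57_600, div):+.1f}%")
# truncated 8 +8.5%
# rounded 9 -3.5%
```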

  • Lost a slide for a third-party client that was to be featured in a skateboarding magazine.
    I think one of the coworkers stole it as I did not get along with them.

    Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
    Was fired not long after.
  • by jnaujok ( 804613 ) on Saturday July 04, 2015 @01:35PM (#50044323) Homepage Journal
    Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly, however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.

    So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).

    Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
  • by michael_cain ( 66650 ) on Saturday July 04, 2015 @01:37PM (#50044339) Journal
    digital signal processing chip from TI. The $750 (in 1986 dollars) wasn't the big deal. That the parts had serial numbers hand-lettered on them and I had to go back on the waiting list to get a replacement was.
  • A long time ago on mainframes. IBM 3083s and VAXes. I was running analysis on some waveform data that took probably about 20 reels of mag tape. Fucking marine seismic data. I sent the big deck of cards down to the floor on a Friday. First thing Monday, I had to go to the VP's office. He explained that, as of Monday morning, the fucking job was still running. Turns out, instead of sampling the data every 4ms, I accidentally sampled it every 2ms. Back then, you didn't own your mainframes; IBM leased them to you. The
  • by Nonesuch ( 90847 ) on Saturday July 04, 2015 @01:47PM (#50044399) Homepage Journal

    I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.

    I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary) and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine and basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell: performance dropped through the floor and customers started calling in about trades timing out. Long story short, it turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, and half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet".

    Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at a (also doomed) dot-com online supermarket.

    On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.

    • by AmiMoJo ( 196126 )

      I knew a guy who did support for a multi million pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers because he did all the support remotely and it would be a 100 mile trip up to their office if the machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.

      So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone ac

  • obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC card disk drives in the switches that were the backbone of the Internet when we kept the vendor engineering on the phone all day for a failed switch... and read the duty cycle of the drives to them, like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.

    but I got a haircut indeed when we had to get our stuff out of a colo that was shutting down. built a mirror data system for that in the new place, had the trunks up, cut over the traffic. then it was time to demanage and power down the old shelf. telcordia assigned a code to the new unit that was one letter different than the old one.

    the good news is I got the new one back up in 20 minutes and they didn't stake me out over an anthill.

  • We were writing a Unix program to parse transactions from some specialized terminals that read customer invoices and the checks that accompanied them, writing the transactions to digital tape to carry over to the mainframe system. During testing, our tapes were compared to tapes generated by the legacy IBM system. Our team lead got a call from the customer liaison *early* one morning saying "Do you realize one of your batches was 5 MILLION DOLLARS SHORT?" Yes, she was shouting. Turns out that the $5 millio
  • I wonder... (Score:5, Insightful)

    by waspleg ( 316038 ) on Saturday July 04, 2015 @01:50PM (#50044421) Journal

    How many people will refrain from posting because the statute of limitations hasn't run out yet?

    • Re:I wonder... (Score:5, Interesting)

      by dcollins117 ( 1267462 ) on Saturday July 04, 2015 @03:45PM (#50044897)

      How many people will refrain from posting because the statute of limitations hasn't run out yet?

      Well, I'm certainly not going to admit to the most costly mistake as it appears no one realizes it was me and what I had done. So I'm not gonna do it; wouldn't be prudent.

      The most embarrassing mistake was when I inadvertently brought down the client's network (a major hospital) in the middle of the day. Didn't realize what I had done until about three minutes later, when about a dozen IT guys flooded the computer room, paying particular attention to the area I was just working in. It appears I made an error. To this day I am likely persona non grata in that computer room.

  • Click of death (Score:5, Interesting)

    by Wowsers ( 1151731 ) on Saturday July 04, 2015 @01:51PM (#50044431) Journal

    My worst IT disaster was suffering from a hard drive failure, click of death. I had warning of a few days of it, and I deliberately kept the pc on 24/7 instead of normal switch on/off, to make sure the drive stayed alive until its replacement arrived.

    Obviously I had to turn the pc off to change the drive; it was not hot-swappable. When I powered the pc up, the old hard drive failed and didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.

    Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my pc and powered up.

    The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.

    Out of interest, the failed drive died about three months before the forced drive change I do as backup / failure prevention. I got lucky.

  • I used to work as an SDH/DWDM admin. In the early 2000s, my colleague screwed up a major firmware update on an STM1/4 ADM [wikipedia.org], and I, as the senior admin (haha, I was in my early twenties), had to drive up to the site, since the affected node was unresponsive to the management system. After many unsuccessful attempts to recover it, at about 3 am I decided to hard reboot the node, which caused it to boot up from the corrupt firmware bank (it had two of those), which in turn just erased all the configuration, including tra
  • by YrWrstNtmr ( 564987 ) on Saturday July 04, 2015 @01:54PM (#50044451)
    Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
    Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as part of one of my systems.

    Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
    Training mission aborted. Much sheet metal work needed.

    Actual repair cost? Unknown, but easily 5 figures if not more.
  • Power cable mistake (Score:2, Interesting)

    by Anonymous Coward

    Working for a desktop publishing house in IT. Spent just under $4000 on 36-inch flat panel displays. Accidentally plugged in the printer power cable. Immediately fried the monitor. My boss was not happy. The internship did not go well for the rest of the summer.

  • I let an upgrade bug slip by me during a software upgrade for the accounting software. In retrospect it should have been caught before it got out of hand. It got out of hand in about 3-4 seconds and had a cascading effect, bringing down the whole datacenter for the company.

    It happened when a "guaranteed" bid was due for a 2 million dollar job. We had nothing. Not so guaranteed...

    Fortunately (?) I had an ownership stake in the company, so I screwed myself too. Figuring ~12% profit on the job was typical an

  • I was working as a Jr. Network admin, helping to install some new Cisco PoE switches to facilitate our building's move to VoIP phones. I aligned a brand new 48-port PoE switch slightly off when inserting it into the chassis, and bent the insanely complex connector at the back of the card, rendering it unusable. Fortunately, we had a ridiculous service agreement with Cisco, and a new card arrived at our office within 4 hours. I distinctly remember buying burritos and beer for me and the Sr. admin to help mak
  • by corychristison ( 951993 ) on Saturday July 04, 2015 @02:02PM (#50044505)

    Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.

    The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.

    Two months in, the new provider vanished, along with all of my data. I wasn't very concerned about the month's worth of money I had lost by not getting the 3 months I had paid for; I think it was only about $15. "Okay," I thought, "I'll just pull my data out of my nightly backups and move on." It turns out I had forgotten to point my local cron script, which pulled the data over rsync, at the new IP address. My backups had not been pulled in over 2 months.

    Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.

    I learned my lesson, though.
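    A silent nightly job can stop producing backups without anyone noticing, as happened here and with the mirror-verification script further up. A separate check that complains when the newest backup is too old catches that. A minimal Python sketch; the 26-hour threshold and the function names are hypothetical, and how the alert gets delivered is left out.

```python
import os
import time
from typing import Optional

MAX_AGE_SECONDS = 26 * 3600  # hypothetical: a nightly job plus two hours of slack


def newest_backup_age(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if there are none."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    return (now - max(mtimes)) if mtimes else None


def backups_look_fresh(backup_dir: str, now: Optional[float] = None) -> bool:
    """False when the directory is empty or the newest file is too old."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= MAX_AGE_SECONDS
```

Running it from a different machine or scheduler than the backup job itself keeps one forgotten cron entry from silencing both.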

  • The Final Nail (Score:5, Interesting)

    by Dartz-IRL ( 1640117 ) on Saturday July 04, 2015 @02:16PM (#50044569)

    The total cost was actually sweet FA in numbers terms, but I think I put the final nail in the company's coffin.

    My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.

    I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.

    Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.

    I will say, the Server was the problem.

    It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as networked file storage. Inside the case were four HDDs: a pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted, though the keys were long lost, along with a floppy drive and a CD drive with no write ability. Or ability to read DVDs.

    It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.

    By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.

    So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.

    The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous because there just wasn't the spare space to hold two (No money to pay for it)

    I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.

    Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.

    I just did what I thought I could to keep the Titanic afloat.

    So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.

    But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....

    Unfortunately - that only works if the damaged disk decides to drop out of the array.

    It didn't.

    I find th

    • There's a clawing feeling that it was somehow 'My Fault'.... and it probably was. With hindsight, maybe I should've set it to run the backup while we were in the building, rather than at home over the weekend. I could've used an external drive to keep one locally too. There were probably a dozen things that I could've done that'd stop it.

      Only one thing which really mattered... verifying your backups. If you don't do that, there's almost no point in making any. (It gives you something to pray for...)
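      One cheap form of verification is checking that the copy is byte-for-byte identical to its source. A minimal Python sketch with hypothetical names. It only proves the transfer was faithful; it does not prove the source was healthy (a mirror of a dying disk copies the damage too) or that the data restores, so an occasional trial restore is the stronger test.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_source(source: str, backup: str) -> bool:
    """True only if the backup is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(backup)
```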

    • Re: (Score:3, Informative)

      by Tablizer ( 95088 )

      Databases should be backed up with a text-dump (such as an SQL INSERT list), not the actual database file, because of the internal pointers that are fragile. A text-dump "flattens" the pointers. If you do use the actual database file as a backup, shut all DB writing off first, during the backup. And keep multiple generations.
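      For SQLite, Python's standard sqlite3 module already produces such a text dump via `Connection.iterdump()`. A minimal sketch that writes a dated dump, keeps a hypothetical three generations, and restores one as a check. It does not handle quiescing writes during the dump, and the file naming scheme is an assumption.

```python
import sqlite3
from pathlib import Path

KEEP_GENERATIONS = 3  # hypothetical retention count


def write_text_dump(conn: sqlite3.Connection, backup_dir: Path, stamp: str) -> Path:
    """Write conn's contents as plain SQL statements, then prune old dumps.

    `stamp` should sort chronologically (e.g. 20150704-0100) so that
    sorting file names also sorts generations by age.
    """
    target = backup_dir / f"dump-{stamp}.sql"
    with target.open("w", encoding="utf-8") as f:
        for statement in conn.iterdump():  # yields SQL text, one statement at a time
            f.write(statement + "\n")
    for old in sorted(backup_dir.glob("dump-*.sql"))[:-KEEP_GENERATIONS]:
        old.unlink()  # drop the oldest generations beyond the retention count
    return target


def restore_text_dump(dump_path: Path) -> sqlite3.Connection:
    """Rebuild a database from a dump, which doubles as a restore test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump_path.read_text(encoding="utf-8"))
    return conn
```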

      • I honestly had no idea how it actually backed up, it was a function within the accounts application itself to generate the backup. Which it did, to a local disk. I then had an automatic scheduled upload of that backup to the server.

        Ultimately, like I said, I'm not really an IT guy - I was the one with google and enough patience to fuck about until things worked again. We didn't have one. We did pay one company a hundred quid a month for a while in case something went TU, but we stopped paying him six months

  • by whoever57 ( 658626 ) on Saturday July 04, 2015 @02:19PM (#50044579) Journal
    Not selling the company for $250M because he wanted $300M during the dot-com boom. My boss personally owned about 30% of the company at this point.
  • Two totally incompetent twits from a populous south Asian country. Cost about $32k in salary and a 4-month schedule slippage. Another contractor, who is competent, said she suspected they gave 'ghost' interviews, a common practice in her country. I heard managers say the same thing: the two who showed up for work were not the ones they had phone-interviewed. They did not know command-line basics in either bash or Windows, how to use remote desktop, Java, unit tests, and other things we required.

    Oddly enough

  • I was brought onto a small web startup project as a co-lead. By this time the project was already 2.5 years old and had been rewritten at least three times by progressively less lousy developers. The final iteration was built on CodeIgniter (MVC framework), a decent choice in 2013.

    My first day I'm browsing the codebase to see what's what, and a grep finds something like "UPDATE my_table set foo=" . $_POST['bar']. Not in a controller... not in a model... in a view.

    So I immediately told the other leads tha
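    The fix for string-concatenated SQL like the snippet above is a parameterized query (and moving it out of the view into a model). Below is a minimal sketch in Python with SQLite, assuming a hypothetical `id` column to select the row; `my_table` and `foo` come from the comment. PHP's PDO prepared statements offer the same placeholder mechanism.

```python
import sqlite3


def update_foo(conn: sqlite3.Connection, row_id: int, new_value: str) -> None:
    """Update one row; the value travels as a bound parameter, never as SQL text."""
    # "?" placeholders are filled in by the driver, so input such as
    # "x'; DROP TABLE my_table; --" is stored as a plain string, not executed.
    conn.execute("UPDATE my_table SET foo = ? WHERE id = ?", (new_value, row_id))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, foo TEXT)")
    conn.execute("INSERT INTO my_table (id, foo) VALUES (1, 'old')")
    update_foo(conn, 1, "x'; DROP TABLE my_table; --")
    print(conn.execute("SELECT foo FROM my_table WHERE id = 1").fetchone()[0])
```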

  • I had a friend whose job it was to find ways to break satellites. She said she was quite often successful.

    (Hey, the OP didn't say it had to be an accident.)

    • So once she tried to break a satellite and she fixed it by mistake? Oops!
    • by Greyfox ( 87712 ) on Saturday July 04, 2015 @04:15PM (#50045003) Homepage Journal
      Funnily enough at the satellite company I worked for that one time, one of the older guys there mentioned how he almost lost a satellite once by logging in to his own account and issuing a maneuver command to the satellite. Problem was the satellite was expecting times in GMT and got them in MST. Took them days to get it oriented correctly again.

      Now the programmers in the audience could probably think of ten different specific things that could be coded into the system to prevent that from happening, but this company didn't. Which really isn't too surprising. I asked one of the devs on the ground systems team if the ground system was using GMT or UTC. His answer was "What's the difference?" I was able to infer from his answer that it was most likely GMT, and that did appear to be the case. Somewhere deep in the bowels of the system there was presumably some piece of code written by an Indian contractor with a math degree adjusting times for leap seconds, but it wasn't in any code that anyone knew about.

      The early history of that company read like a Monty Python sketch. The first satellite exploded on the launch pad. The second satellite fell over and then exploded. The third satellite burned down, fell over, exploded and then sank into the swamp. The fourth satellite got into orbit and was promptly bricked by sending the wrong version of Windows(!) to it. To be fair, they only had to do that because they launched it with the wrong version of Windows(!!) in the first place. One would think that ANY version of Windows would be the wrong version of Windows to shoot into space, but that's why you're not the head of a billion-dollar satellite company.
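      One of those "ten things" is simply refusing any command time that doesn't carry an explicit time zone and normalizing everything to UTC before it goes out. A minimal Python sketch; the function name and the fixed UTC-7 offset for MST are illustrative assumptions. Note that Python's datetime ignores leap seconds, so this does not address the GMT-versus-UTC leap-second question.

```python
from datetime import datetime, timedelta, timezone

# Mountain Standard Time as a fixed UTC-7 offset (daylight saving not modeled).
MST = timezone(timedelta(hours=-7), "MST")


def to_utc_command_time(local_time: datetime) -> datetime:
    """Convert an operator-entered time to UTC, refusing naive (zone-less) timestamps."""
    if local_time.tzinfo is None:
        raise ValueError("command time has no time zone; refusing to guess")
    return local_time.astimezone(timezone.utc)


if __name__ == "__main__":
    entered = datetime(2015, 7, 4, 13, 0, tzinfo=MST)
    print(to_utc_command_time(entered))  # 2015-07-04 20:00:00+00:00
```

      Rejecting naive timestamps outright is the important design choice: the original failure was a time that looked valid but was silently interpreted in the wrong zone.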

  • by Bookwyrm ( 3535 )

    During an acquisition, the company being acquired helpfully passed along the list of AS they used in their BGP4 configurations in their core routers.

    They helpfully had included the ones from other networks they provided connectivity to as well, but just had sent the AS numbers over in one big list, unlabeled, along with the AS their network originated: "Do these."

    So during the network integration I dutifully entered the entire list of AS into the core routers as AS to be originated. Needless to say, hilari

  • Got the domain "hsa.com" in the *very* early days of the Internet (pre-web). I decided that since we were a Canadian company, we should have a Canadian domain, so I surrendered it and got hsa.on.ca. (We weren't allowed to have hsa.ca, since all our offices were in Ontario...)

    A three letter .com address would probably have been the most valuable asset of the company :-).

  • Worst thing (so far) has been formatting a PHP date() DB timestamp wrong for entries associating users and payments. I think it was something like accidentally using 'M' for both month and minute.
    At the same time, there was a bug somewhere that periodically caused only one of the 2 tables to be written to. When we noticed that the tables were out of sync, we immediately jumped to the timestamps to make some sense of the situation, which of course didn't work in this case.

    Took only a few hours to sort out
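    In PHP's date() the month is `m` and minutes are `i` (`M` is the short month name), so the letters are easy to mix up. Python's strftime has an analogous trap: `%m` is the month and `%M` is the minute. A small illustration, assuming a single shared format constant (the names are hypothetical) so the pattern is written and reviewed once rather than retyped at each call site.

```python
from datetime import datetime

# strftime directives: %m = month (01-12), %M = minute (00-59).
DB_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_db_timestamp(ts: datetime) -> str:
    """Format a timestamp for storage using one shared, reviewed pattern."""
    return ts.strftime(DB_FORMAT)


if __name__ == "__main__":
    ts = datetime(2015, 7, 4, 13, 25, 9)
    buggy = ts.strftime("%Y-%M-%d %H:%M:%S")  # minute accidentally used as the month
    print(buggy)                    # 2015-25-04 13:25:09
    print(format_db_timestamp(ts))  # 2015-07-04 13:25:09
```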
  • Was curious what an apparently undocumented feature on the login page did. Turns out what it did was crash the mainframe. Go figure. You'd think they'd take that shit off the login page, but apparently no one had ever been so curious as to explore it before. Which says a lot about that uni, now that I think about it. Also, once trash talked a uni in a story on a news blag website. Yeah, those were the days...

    Mostly I make my career out of fixing other people's tech mistakes. Which is not something that un

  • Basically a loading tool with a bug I knew about from testing: you could set it correctly once in production, but if you set it twice, every user was f*cked up and could only be fixed from the web interface, at about 5 clicks per user, with no programmatic solution. And of course we had an error in the production setup. I altered that part, which I could, but forgot to take out the "you can run this only once" settings. Hundreds of users borked, and the vendor support would take forever or claim there's no other way, w

  • was created by my boss. I fixed the bug instead of reporting it. The boss was incompetent and was costing the company millions in missed opportunities and in increased turnover of really good people. He couldn't see when his successes were pure accidents and when his mistakes were entirely foreseeable and preventable. I had a few opportunities to get him fired when fixing his messes. I wasn't ruthless. It cost a number of good smart people their jobs and cost the company millions (in fixes, unnecessar
  • Back in the 80's I worked for a field service organisation, fixing and maintaining PDP-11 and VAX systems, but also CDC-9766 removable disk systems: big 14" removable disk packs like the ones you see in old sci-fi movies. One of my customers had a string of 10 or so attached to a five-node Tandem NonStop system.

    Each week they brought two out of ten off-line for me to work on. I cleaned the heads, then used a servo disk pack to realign those heads.
    To do this, I needed to remove the control cable from the string,

  • by Oligonicella ( 659917 ) on Saturday July 04, 2015 @05:01PM (#50045161)

    But it's worth repeating in this context. Thankfully, it wasn't me.

    When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.

    Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
    Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help after 4 PM.

    The techs had to be found, gathered and flown out from CA to disassemble said machine and reassemble. No wires until 1 or 2 PM Monday. Much money loss for all customers.

    To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.

  • by thegarbz ( 1787294 ) on Saturday July 04, 2015 @05:13PM (#50045199)

    One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load-testing function. I came up with a small design involving a contactor and some minor wiring changes, and we were part way through implementing it on every UPS at this site.

    This UPS was part of a redundant pair that fed an emergency shutdown (ESD) system at an oil refinery. Between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue, then started the process for the second one. After calling the control room to let them know they would receive an alarm, I switched off the UPS and was suddenly met with a stream of profanities over the radio.

    We lost power to 80 field instruments, which triggered a fail-safe action on the shutdown system, tripping 4 units at the refinery. One of them was the FCCU, which is core to a lot of refinery processes. To add insult to injury, the unit could not be hot restarted because of a stuck valve, and it then thermally contracted, breaking off large chunks of coke from the overhead line, which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days. I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.

    Total cost of the outage was about $8 million. Fortunately, only partially my fault.

  • by epine ( 68316 ) on Saturday July 04, 2015 @07:17PM (#50045637)

    Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).

    I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".

    Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.

    Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all the datasets in question. I lost very little work (only scripts were executed against these datasets and I still have all the scripts).

    My real screw up?

    Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.

    I had become accustomed to my home system being deterministic in the order it listed things. My bad.

    This was back at the very beginning of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).

    Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.

    Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.

    In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.

    I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.

  • I nearly cost my employer several million by fixing a bug.

    The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line. The parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, door panels, etc.), and when a stillage was full, or when a certain amount of time had passed since the first part was picked, a label was printed, applied to the stillage, and the stillage was dispatched over the road to the factory.

    Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.

    I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.

    Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.

    The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!

    I looked back over the code and found that there were actually two very similar bugs, one of which happened fairly regularly and one which happened much less frequently, but the same fix worked for both of them.

    Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."
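    The comment doesn't say what the actual roll-over bugs were, so this is only a sketch of the general technique for a counter that wraps 9999 to 0001: advance with an explicit wrap, and compare two serials by their forward distance around the ring rather than by raw value (in the spirit of RFC 1982 serial number arithmetic). The function names, and the assumption that compared serials are less than half a cycle apart, are hypothetical.

```python
SERIAL_MIN, SERIAL_MAX = 1, 9999  # 0000 is never issued; 9999 wraps to 0001
CYCLE = SERIAL_MAX - SERIAL_MIN + 1  # 9999 distinct serials


def next_serial(current: int) -> int:
    """Advance a label serial, wrapping 9999 -> 1."""
    if not SERIAL_MIN <= current <= SERIAL_MAX:
        raise ValueError(f"serial out of range: {current}")
    return SERIAL_MIN if current == SERIAL_MAX else current + 1


def issued_after(a: int, b: int) -> bool:
    """True if serial a came after serial b, assuming they are < half a cycle apart."""
    distance = (a - b) % CYCLE  # forward steps from b to a around the ring
    return 0 < distance < CYCLE // 2


if __name__ == "__main__":
    print(next_serial(9999))        # 1
    print(issued_after(1, 9999))    # True: 0001 follows 9999 after the wrap
    print(9999 < 1)                 # False: the naive comparison gets it backwards
```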
