Embedded Computer Horror Stories?

Embedded Geek asks: "As computers are embedded in more and more products these days, I was wondering if any developers out there had any good war stories to share. Whether it's scary (software problems at a Czech nuclear power plant) or simply amusing (personal experience - once caught a bug in the software for an airliner I was writing code for where flushing two toilets at the same time would reset the cabin lighting computer), post your stories of embedded software causing bizarre or unexpected problems here. Shameless plug: The best ongoing source for this subject matter I have ever seen is the ACM Forum On Risks To The Public In Computers And Related Systems, found on Usenet as comp.risks." As our technology progresses, the need for better debugging of our complex systems becomes more and more apparent. These stories are illustrations of this fact. How would you improve the current techniques?
  • I would augment our current debugging techniques by installing some rather special logic in libc that'd use RPC to erase, say, five randomly-chosen sectors from the hard disk of one of the application's developers (either chosen at random or based on the last contributor to the offending region of code, or random but weighted by % of source contributed) every time a production build segfaulted. With Bluetooth, we can even extend this to embedded devices. Although not all bugs are segfaults, I imagine the resultant attention to detail would help a lot :)
    Then again, there'd be an Infocalypse in Redmond...
  • by Embedded Geek ( 532893 ) on Thursday November 01, 2001 @01:16AM (#2505824) Homepage
    Well, since I alluded to this in the main post, I guess I should come clean on my "Two Toilet Reset" story. Before that, though, I need to make some things perfectly clear:
    • The buggy code never made it onto any aircraft. It was caught during integration in a laboratory by our customer (the manufacturer of the aircraft) nearly a year before the first test flight of the system.
    • Like most everything on any aircraft, the lighting system was designed to handle other systems (including embedded computers) acting flakey. Resetting the lighting computer would not have killed the lights or plunged the cabin into darkness. It simply would have disabled some of the automated lighting controls (e.g. dim the lights before the movie, etc.) for a short period of time.
    • The bug was entirely my fault and does not reflect on the quality of the products by my employer or our customer. Indeed, the fact that it was caught speaks highly of the processes put in place by aircraft manufacturers, the FAA, and aviation regulators throughout the world.

    Whew! Now that that's out of the way, here's the tale (pardon the length)...

    The system I was working on controlled the cabin lighting, reading lights, attendant call lights, and the public address system for a popular passenger aircraft. Basically, it handled all the automated functions (except for in-flight entertainment - movies, video games and such) used by anyone in the passenger cabins. One input was a discrete that indicated when a toilet valve was thrown (flushed). The discrete would stay set if the waste tanks were filled with, er, waste. Our system would light the "toilet occupied" light above the door, which we normally only lit when another discrete indicated the door was locked.
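
    (A quick, hypothetical sketch of that logic, folding in the waste-tank behaviour described further down the thread; the names here are made up, not the real system's.)

        /* Illustrative only: the "occupied" light is normally driven by the
         * door-lock discrete, and is forced on when the tank serving that
         * toilet is full so passengers skip it. */
        #include <stdbool.h>

        bool occupied_light(bool door_locked, bool waste_tank_full)
        {
            return door_locked || waste_tank_full;
        }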

    For various reasons, we were upgrading the computers to an embedded 80186, a workhorse 16 bit processor that my company had a lot of experience with. As a technical lead, I ported an executive (it would be misleading to call it an "embedded operating system," but it did the same job) that we had used on another 80186 based system.

    The discrete driver used a linked list to handle queued inputs, chaining a number of structures together, each about 20 bytes big. I did not need a lot of the features the driver had, so I pared it down to a three byte structure. What I didn't know was that the garbage collection routine did a type casting trick with a four byte pointer (instead of putting the pointer into the structure itself) which would overwrite the three byte structures. The net result was that the chaining was worthless: the driver interrupt routine could only handle one input each time it ran (at 60Hz).
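
    (For readers who want to see the shape of the bug, here is a hypothetical reconstruction of that free-list trick; the names and field layout are invented, not the actual driver code, but the failure mode is the one described above: a four-byte pointer written over a three-byte node clobbers its neighbour.)

        #include <stdint.h>

        /* The original driver used roughly 20-byte nodes; the port pared
         * them down to three bytes. */
        typedef struct {
            uint8_t channel;   /* which discrete input fired */
            uint8_t state;     /* sampled state */
            uint8_t next_idx;  /* link to the next queued input */
        } discrete_t;          /* sizeof == 3 */

        static void *free_list;

        /* The "garbage collection" trick: a freed node is reused to hold a
         * pointer to the next free node, via a cast.  With 20-byte nodes the
         * pointer fits; with three-byte nodes it spills into the
         * neighbouring element and corrupts the chain. */
        void free_node(discrete_t *node)
        {
            /* writes a full pointer (4 bytes in the story) into a 3-byte slot */
            *(void **)node = free_list;
            free_list = node;
        }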

    During our in-house testing, we hit the discretes as fast as we could, but we never caused the bug to show up. Our customer, however, had a dedicated lab that simulated an entire 400-passenger aircraft. To allow a single operator to stress test the system, a lot of the inputs (passenger reading light switches, attendant call switches, etc.) were wired together. One such arrangement was wiring two "toilet waste tank full" discretes together.

    So, shortly after we delivered our first hardware and software to the lab, the lab guys were checking out the wiring on their test stands. One of them threw the ganged switch (that is, flushed two toilets). My driver received two inputs in the same 1/60 of a second. It read the first, processed it just fine, then tried to read the second via a corrupted pointer and... BAM!!... my computer reset.

    The lab guys thought it was hilarious. My boss was less amused, I'm afraid.

    The moral of the story? I guess that, first, you can never have too much integration testing. I hammered on our test rack for hours on end, but we had never thought to wire multiple discretes together (heck, the customer's lab guys only did it because it was more convenient than throwing multiple switches at once). If it had made it into the field, I guarantee we never would've found the root cause (what are the odds of a flight attendant standing next to a toilet with a stopwatch?).

    The other moral is the one I've heard a million times but didn't take to heart until this bug bit me: ported code is new code. I didn't understand the garbage collection in the driver and just took it on faith that because it had worked on the old project that it would work on this one. You have to understand (and test) ported code just as much as you would stuff you wrote yourself.

    • "What I didn't know was that the garbage collection routine did a type casting trick with a four byte pointer (instead of putting the pointer into the structure itself) which would overwrite the three byte structures."

      Could this have been foreseen/avoided if the code (written by someone else and being reused by you, if I understand correctly) had been better commented?

      • Could this have been foreseen/avoided if the code (written by someone else and being reused by you, if I understand correctly) had been better commented?

        You hit the nail right on the head, Unitron. I would have preferred to have had a union of the structure and the pointer, so the size of the whole thing would've been adjusted automatically if the structure's size got stripped down below four bytes. Nevertheless, a simple "/* Structure is overwritten by pointers in obscure_garbage_routine() */" would have been inelegant but would've set off alarm bells in my head when I did the port.
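
        (A quick sketch of that union idea, with made-up names: because the free-list link shares storage with the payload, the node can never shrink below pointer size no matter how far the payload is pared down.)

            #include <stdint.h>

            typedef struct {
                uint8_t channel;
                uint8_t state;
                uint8_t next_idx;
            } discrete_payload_t;            /* three bytes of real data */

            typedef union {
                discrete_payload_t payload;  /* in-use form */
                void *next_free;             /* free-list form; keeps sizeof() >= pointer size */
            } discrete_node_t;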

        The fact is, though, that latent bugs exist in most systems and that by porting code we implicitly accept the code in its entirety - bugs and all. That's why (IMHO) you have to test it to the same level as if it were brand new: source code should be tested and inspected to its core, COTS DLLs should be tested for interface errors, and pieces of hardware should be hammered with invalid signals, messages, and environmental conditions.

    • Our system would light the "toilet occupied" light above the door, which we normally only lit when another discrete indicated the door was locked.

      That's amazing; I had always assumed that this was a simple mechanical electrical switch in the lock mechanism directly controlling power to the light above the door. I suppose it does make sense to have some computing power if it's a large aircraft. Are 747s (which have a block of 6 toilets midway through the cabin) smart enough to know how many cubicles are occupied and adjust a single toilet-free indicator in the main passenger section appropriately?
      • assumed... a simple mechanical electrical switch in the lock mechanism

        sql*kitten: So did I when I came on board! The fact is that in a widebody aircraft, everything is wired into some kind of computer. For lighting, it's simply because it's cheaper and weighs less to have a primitive LAN than to directly wire a switch from your seat, through the floor, up around the interior wall of the plane, and back up to your reading lamp in the ceiling. For everything else, there are usually master controls so the attendants can override anything (cancel all the attendant calls, ring everyone in the toilets to go back to their seats, etc.).

        smart enough to know how many cubicles are occupied and adjust a single toilet-free indicator in the main passenger section appropriately?

        Actually, the way the plumbing is, there are multiple tanks serving different toilets. The system simply "occupies" those that don't work and the passengers have to hike to find empty ones.

        BONUS TRIVIA: If a depressurization alarm goes off, the "return to seats" signs light up everywhere in the plane except the toilets. Someone who wrote our requirements (FAA?) probably did a study or something and found that it was safer to sit on the throne than to try to stumble back to your seat during a hurricane!

  • by Paul Lamere ( 21149 ) on Thursday November 01, 2001 @06:55AM (#2506300) Homepage Journal
    My favorites from comp.risks concern the test flights for the F-16.

    When the test pilots checked to see what would happen if they tried to put the landing gear up while the plane was on the tarmac... ouch.

    When the pilots flew the plane over the equator, it flipped over (luckily only in a simulator).

    More on these here:

    http://catless.ncl.ac.uk/Risks/3.44.html
  • by bluGill ( 862 )

    The current system I'm working on mounts tapes in robots (at least, that is what I can tell you). The customer has it installed, but their backups keep timing out. It turns out their network is so congested that TCP/IP is backing off, assuming a very slow network. When a lot of jobs are requested at once over the same socket, the last few mounts don't make it through before the program times out. (We are still trying to figure out how anything on that subnet works.) When we tell the customer to fix their network, they respond, "You should have realized this problem before you sold the system to us, now fix your system." Of course, we don't supply the backup software, just the robot. (Note that to avoid giving away things I'm under NDA for, I've changed some details in this story; you get the point, but don't examine the details too closely.)

    Same system, but this time with the tape drives. The customer sees data corruption. Not good at all. We don't know where it is - it could be their end or ours - but of course we are blamed. At least we are not on the network (we use a SAN, but it isn't congested).

    Clustered system of two Solaris machines. We are writing our own cluster software. Management gives us one Ethernet interface between them (a cluster must have redundant connections between its nodes), and then wonders why we can't solve the bug where the backup system wants to take over all the time while the master is still operating.

  • I worked on a controller for high voltage switchgear for 6 years and came across a couple of weird 'bugs'. High voltage switchgear is basically the 'light switch' for a whole building or subdivision. Most of them are big boxes, 8'x6'x4', that hold the switches, fuses, and controller, switching up to 35kV@12kA. The overhead switch systems are the stuff you see in utility substations, handling up to 140kV@25kA.

    Anyway, in the first case we had a problem where, once in a great while - maybe once in every 50-100 operations - one set of outputs would change state to all zeros. I banged my head against the wall for a week because we could not reproduce it in the lab. So I hauled all my equipment (PC, logic analyser, digital scope, analog scope) down to the shop floor. I was able to reproduce the problem about once per day for two weeks before finding the cause: the latch chip was getting scrambled by the EMF of a large motor that was nearby. We couldn't change the hardware layout, so the solution was to add code to rewrite all output ports whenever the sampling interrupt occurred, 360 times per second.
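
    (Roughly what that workaround looks like in code; the port addresses, names, and ISR hookup below are invented for illustration, not the real hardware map. The point is just that a latch scrambled by the motor's field stays wrong for at most one 1/360 s tick.)

        #include <stdint.h>

        #define NUM_OUTPUT_PORTS 4u

        /* Memory-mapped output latches (placeholder addresses). */
        static volatile uint8_t *const output_port[NUM_OUTPUT_PORTS] = {
            (volatile uint8_t *)0x0300, (volatile uint8_t *)0x0301,
            (volatile uint8_t *)0x0302, (volatile uint8_t *)0x0303,
        };

        /* Software copy of the state each latch is supposed to hold. */
        static uint8_t output_shadow[NUM_OUTPUT_PORTS];

        void sampling_isr(void)
        {
            /* ...the normal 360 Hz input sampling work happens here... */

            /* Refresh every output latch, changed or not, so a corrupted
             * latch is rewritten on the very next tick. */
            for (unsigned i = 0; i < NUM_OUTPUT_PORTS; i++) {
                *output_port[i] = output_shadow[i];
            }
        }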

    We had never seen this happen in the lab in 4 years of development. Of course in the lab the entire package was laid out differently, for ease of access, so the same EMF effect did not occur. The moral is you need to test under 'real world' conditions.

    The second one was simpler to find, but more of a pain in the end. I did a point update on the compiler, something like going from version 3.2 to 3.3. After this, one of our logging functions, which wrote to an EEPROM, no longer worked. This one at least could be reproduced every time. After running around in circles for a couple of days, I finally compiled the code to assembly instead of binary and started reading it line by line. It turned out the new version of the compiler had optimized out the write to the EEPROM - it was a little too aggressive in its optimization.
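
    (For reference, a minimal sketch of the kind of write that got deleted; the address and names are made up. Declaring the register volatile is what normally stops an optimizer from treating the store as dead - though, as the replies below note, this particular compiler ignored volatile at its highest optimization setting.)

        #include <stdint.h>

        /* Hypothetical memory-mapped EEPROM data register. */
        #define EEPROM_DATA_REG (*(volatile uint8_t *)0x8000u)

        void eeprom_log_byte(uint8_t value)
        {
            /* Because the target is volatile, the compiler must emit this
             * store even though nothing in the program reads it back. */
            EEPROM_DATA_REG = value;
        }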

    The moral here is watch out for changes in the tools. They can have just as much of an effect on things as anything you might write and you may never realize it unless you look really close.
    • jsfetzik, I hate to ask, but was your compiler C/C++? If so, did it support the volatile keyword, or did the optimizer wipe it out anyway?
      • At the time we were using C for about 90% of the code and assembler for the other 10%. We were using the volatile keyword. The problem was that when the highest optimization setting was used with the newer version of the compiler, it basically ignored the volatile keyword and optimized the write out.
        • I was afraid of that. I've heard of it happening (and of compilers that ignore volatile all the time) but it's never bitten me personally.

          My sympathies. :)

          • Yeah, live and learn, as they say. We switched to a different vendor's compiler about a year later and were worried about similar things. We ended up having no real problems with the new vendor's tools.

            In fact, while double-checking things we found the cause of a display bug we had. It was so minor - it did not affect functionality, it just made one screen look odd - that we never spent the time to track down the problem. The bug disappeared completely with the new compiler. Once again, a strange optimization by the old compiler.
  • by renehollan ( 138013 ) <rhollan@@@clearwire...net> on Thursday November 01, 2001 @12:15PM (#2507211) Homepage Journal
    Oh, this was a doozy to fix. Well, finding it was tough too, but only because management wouldn't let us do a proper root cause analysis until it was way too late, choosing instead to "tweak things, take careful measurements, and see what improves things". Of course, by the time the go-ahead was given to do the root cause analysis, the customer was royally pissed, and we were given little time.

    Basically, we had this distributed test system with embedded test controllers in about 1500 locations, nation-wide (no, not in North America, but it was still a fairly big country). There was a central location with master controllers managing the whole thing. And there was a networking bug.

    Like many embedded systems, this one had a feature which would permit downloading new operating code to it. And, while normal operation worked fine, most of the time we downloaded new code the remote units would lock up and the watchdog would reboot them (to the original operating code). Now, software download testing done in our labs with PCs basically FTPing code to the units worked fine. Software download with the Solaris-based "real" master controllers worked fine. Software download in the field didn't. For some reason, the PC-based testing was done at the same data rates as under field conditions, but the Solaris-based testing wasn't (go figure). And therein lay the key.

    See, for some reason, the Solaris networking stack would try to initialize the TCP round-trip timer based on sending a single byte on a newly opened socket and adjust it from there. This was fine when sending short (i.e. "normal" command and control) packets. But software download rammed long packets down the pipe. So, the RTT was set based on the ACK of a single byte, and then the next transmission involved about 1500 bytes... at 9600 bps. You guessed it: TCP timeout and retry. See, the Solaris stack was dumb; it didn't know that the last byte hadn't been sent yet before starting its ACK timer... Of course, the fact that we had a virtual device tunnelling all this traffic encapsulated in PPP frames over an X.25 circuit and locally accepting whole IP frames in one gulp didn't help... the poor Solaris TCP/IP stack got SOOO confused.
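
    (Some back-of-the-envelope numbers for why that hurts, using only the figures quoted above. An RTT seeded from a one-byte probe reflects little more than the link's turnaround time, while just clocking a ~1500-byte segment out at 9600 bps takes over a second, so a retransmit timer based on the probe fires long before the first real ACK can possibly arrive.)

        #include <stdio.h>

        int main(void)
        {
            const double line_bps  = 9600.0;   /* link speed from the story */
            const double probe_len = 1.0;      /* bytes in the RTT "calibration" send */
            const double seg_len   = 1500.0;   /* bytes in one download segment */

            /* Serialization (transmission) time alone, ignoring propagation
             * delay, PPP/X.25 overhead, and queueing. */
            printf("probe:   %.4f s\n", probe_len * 8.0 / line_bps);  /* ~0.0008 s */
            printf("segment: %.4f s\n", seg_len * 8.0 / line_bps);    /* ~1.25 s   */
            return 0;
        }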

    So, anyway, here's Solaris sending duplicate packets in fits and starts, and here's our box receiving them and throwing most of them away. To accommodate the slow data rate and long packets, a rather large TCP window was used, which just made things worse... Solaris would send a single byte of payload, initialize its TCP RTT, and then squeeze off a window of data in several packets. These would time out... Solaris would repeat... But wait, it gets better!

    Eventually Solaris would see all these ACKs coming back, for older packets, in multiplicate. Well, after two acks for the same old packet (i.e. with future packets already acked), it would move the window back and start all over.

    So far, all we've got is a really slow, messed up link. But, because of the slow TCP RTT convergence at Solaris' end, our poor embedded target is getting duplicate packets left and right, and, more importantly, repetitions of packets that came before ones already acknowledged. In fact, Solaris was being a bit too aggressive in resending windows, doing it on the second duplicate "old" ACK instead of the third one. But, the important thing is, our embedded TCP stack had a bug handling this situation... and we crashed big time.

    O.K. So we find the problem. Easy to fix, and update all the units... all over the country.... with hundreds of truck rolls... at our expense. I don't think so. Boss insists that it be fixed using the existing download mechanism.

    Now, at this point I should perhaps mention that none of this code was my responsibility. But, I had a reputation for making miracles happen, so it suddenly became my responsibility (I HATE that -- getting shit for not fixing someone else's bugs fast enough, but that's another rant).

    Lots of people were still insisting on tweaking some parameters and trying stuff, and to be fair, some FTP client mods helped Solaris' TCP RTT converge better, minimizing timeouts, but it still wasn't perfect. The application guys weren't about to throttle back their PPP virtual device because they were "too busy", so we got no help there. The damn download app was trying to compensate for a situation that it just didn't have enough data about. I knew this was not going to lead to an acceptable solution... the poor app was being so overly conservative in sending data that throughput was shot to hell. Clearly the code in the remote test controllers had to be fixed - after all, it had the worst bug of all. But how?

    To summarize, you need to download a lot of code to a unit that you can't download a lot of code to at any reasonable rate of speed. It was suggested to make the download app really conservative, but I pointed out that even if it worked 99.99% of the time, after 6000 packets the chances of not crashing due to the bug were only 54%. It would have taken forever, the customer was pissed, my boss wanted it yesterday, and everyone was mad that I hadn't fixed it yet. (This was three days into the root cause analysis.)
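
    (The arithmetic behind that figure, for the curious: if each packet independently dodges the bug with probability 0.9999, the chance that all 6000 survive is 0.9999^6000, which works out to roughly 0.55 - essentially the coin-flip odds quoted above.)

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            /* Probability that none of 6000 packets triggers the crash. */
            double p_all_ok = pow(0.9999, 6000.0);   /* ~= 0.5488 */
            printf("P(no crash over 6000 packets) = %.2f\n", p_all_ok);
            return 0;
        }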

    Clearly, we had to download new code to the test controllers. We could download a little (because that didn't tickle the bug) slowly, but that was about all. I suggested the following:

    "We find a minimal fix for the broken TCP/IP stack in the test controllers, download an in-situ patch program, that upon reset, patches the relevant pages in flash, and reboots. Then, using the more robust stack, we download a final working version that fixes the rest of the bugs" (by this time we found about 6, but only 2 were critical).

    Management went ape-shit, to put it politely. "You want to download SELF-MODIFYING code!? That breaks ALL the rules!! And you call yourself a miracle worker?" (Er, no, they did, I never said that, but I digress). Clearly someone needed a scapegoat and I was about to become it. The legitimate concern was, of course, that if it didn't work and locked up the test controllers, we'd have to roll a truck to each site. But, without a solution, we'd have to do that anyway, so what did we have to lose?

    Now, I wasn't in this alone... I had help from the guy responsible for the download app and the guy responsible for the flash drivers in the test controller. We all agreed that this was the best way to proceed and, after much arguing, name-calling, and time wasting, we were given the go-ahead. O.K., we're one week into our allotted two-week "fix it or else" deadline.

    Well, we managed to fix it, tested with 20 live customer test head controllers (pipelining the development and testing of progressively refined versions), stressed it, and it worked like a charm. I think, of some 1500 remote test controllers, all but 20-40 got repaired this way, and those had other problems.

    Me? When it came to review time, I got a note that "Rene improved software download performance". That's it. One stinking sentence. I saved the company at least some $250k - $500k in repair costs, avoided a potential multimillion-dollar contract cancellation, and that's all I get?

    I no longer work for that company and am much happier now working elsewhere. But I think this bug, and subsequent fix, ranks up there with NASA-grade "no, we can't send anyone there" repair hacks.
