How Do You Deal w/ "Heisenbugs"?

Posted by Cliff
from the cures-for-the-rabid-coder dept.
horos1 asks: "I was wondering how people out there deal with 'Heisenbugs': bugs that have no logical, programmatic cause (mostly from C and C++ programs, and especially threaded C/C++ programs), and that may change or disappear if you modify the state of the program. For example: we have a multi-threaded C++ app which cores in about 5 places, at a different memory address each time, with crashes that disappear if we turn threading off. They seem to be caused by a memory overrun error, but this too is exceedingly hard to fix, even with tools like Purify, because such tools tend to give several 'false positives' on memory errors, as well as core when we link with certain libraries. Anyway, this is getting very annoying... Any help with this, as well as pointers on how to deal with bugs like this, would be greatly appreciated."
  • by Anonymous Coward on Wednesday April 11, 2001 @06:51PM (#297261)
    You could use a "Heisenberg Compensator".
  • by Dancin_Santa (265275) <> on Wednesday April 11, 2001 @07:37PM (#297262) Journal
    Critical Sections

    printf() statements

    Also keep in mind that many C and C++ built-ins are not necessarily thread-safe.

    Dancin Santa
  • The closer you get to finding out where the bug is, the less you can reproduce the bug. Whenever you make a measurement, you disturb the system and alter the results. So much for deterministic computers. Not a helpful comment, I know. =)
  • Assertions. More assertions. Even more assertions. And most importantly: work in Java.
  • by Spock the Vulcan (196989) on Wednesday April 11, 2001 @08:56PM (#297265)
    printf() statements for debugging multithreaded code are a really bad idea. printf() is an I/O call, and hence a blocking call, and can therefore affect the way your threads are scheduled - resulting in mysterious things, e.g. the problem disappears when you use the printf(), but reappears if you comment it out.
  • by Christopher Thomas (11717) on Wednesday April 11, 2001 @08:57PM (#297266)
    Several useful suggestions have already been posted. One additional one: Initialization.

    Rig macro wrappers for malloc() and calloc() and the like that initialize new memory to 0xdeadbeef if you have a memory debug flag active. This should make your program crash more abruptly when it starts doing something questionable.

    This won't help much for finding the race condition or non-thread-safe code causing the problem, but it may give you some idea of what's being stomped, and make the heisenbug more predictable.
  • by dutky (20510) on Wednesday April 11, 2001 @09:32PM (#297267) Homepage Journal

    Patient: "Doctor, when I write multi-threaded programs in C++ they dump core all over the place and I don't understand why!"

    Doctor: "Don't do that."

    I know it sounds glib, but it really is the heart of the matter. Writing proper multi-threaded programs is difficult and takes a fair amount of skill. If you or your programming staff don't have the skill to do it, you are better off sticking to something less challenging. This isn't meant as a put-down, either: there are plenty of simple solutions to problems that, at first blush, may look like they require threads (or dynamic memory allocation, or any of a number of other complex and error-prone tactics). You could replace your multiple threads with a polling loop and a switch statement, or spawn completely separate processes and communicate through pipes (you might be surprised how much performance you can get out of either solution compared to the threaded code).

    If you feel that you really must use the cool threaded code (or a complex, dynamically allocated data structure, or whatever) then you may need to dedicate a week or two to a careful re-examination of the code and the accompanying re-implementation to eliminate race-conditions/memory-leaks/mutual exclusion errors/etc. While there are a lot of cool debugging tools out there that can help you find some kinds of errors, there is really no substitute for a deep and thorough understanding of your code. You can either spend the time understanding the mess you have, or try replacing it with something less complex but easier to maintain. (but, maybe, harder to extend/scale/etc.)

    Here are some of my favorite high-tech problems with low-tech solutions:

    • Multi-threading: multiple processes or polling loop with switch statement.
    • Search Trees or other tree structures: hash tables.
    • Linked lists or other linked structures: dynamically allocated arrays (double the capacity when the size reaches it, halve when the size drops well below it, like Java vectors)
    • Complex decision logic: state machines.
    Admittedly, sometimes you really do need to go with the more complex solution, but it is best to avoid it whenever possible. Besides, you can always put the complex stuff in the next version.
  • by bko (73379) on Wednesday April 11, 2001 @10:31PM (#297268) Journal
    Here are a few tips that help. They are not exhaustive, nor necessarily the best way.

    Debugging tips

    0. If you are running Linux: apply the patch that causes the thread that had the segfault to dump core. The default Linux semantics (under 2.2.x at least) are for the threads to exit STARTING with the one that had the problem. Then, the LAST thread dumps its program counter / stack info into your core file. The result: you get what look like "random" crashes when really they aren't very random.

    There is a patch which fixes this behavior: try this patch [].

    1. See if you can get your program to crash in a debugger or dump core. I presume that you are getting this by your comments.

    2. Record the places where you get crashes. Each time anyone gets a crash, have them record where, as best they can.

    Try to figure out what's getting overwritten, even if it's not clear when or how.

    3. Try to increase the frequency of the crash (e.g. by running on an SMP machine). This usually provides people with more incentive and a better chance to test whether any given change really fixed things.

    4. The next few hints fall into the category of "reduce the problem code". It sounds to me like you don't have a good feeling as to where the problem is happening. Try to eliminate sections of the code, by any means necessary. Examples include one big lock to force serialization, test programs that only exercise certain modules of your code, etc. I know there is often a temptation to just jump in, but some extra scaffolding to reduce the possible problems is almost always valuable on hard debugging problems.

    Reduce the amount of data shared between threads. We are using a message passing interface, where each thread more or less has its own data. This has been a big win. We often copy data before passing on to the next thread, just to make sure.

    5. Understand what is and isn't thread-safe in your libs. For example, did you know that it is almost impossible to make the C++ std::string thread-safe without changing the implementation? That's because the implementation is copy-on-write. So, even when you're not sharing any data between threads, you are....

    I hope some of these help. Threading problems aren't easy, particularly in C++...

  • by The Mayor (6048) on Wednesday April 11, 2001 @10:32PM (#297269)
    "Heisenbugs", as you call them, are almost always the result of memory management bugs. Array boundaries are being overwritten, null pointers are being accessed, freed memory is being used, or some other memory is being misused. It's a very common problem.

    Now, one answer to this problem is to use smart pointers and automatic garbage collection. That won't help array boundaries, but you can use an equivalent wrapper for arrays. It's not a bad practice to get into for large-scale C++ development.

    Another solution, and one I have found to be *incredibly* useful, is to use a memory profiler. There are loads of memory profilers on the market today. Visual C++ has one included, the GNU Foundation has one [], and loads of commercial companies offer profilers. Most of the commercial ones have nice functions, such as pretty (and useful) graphical displays and the ability to profile code that has already been compiled. I remember profiling Solaris (the OS itself) back in the early 90s, finding loads of memory leaks and memory management problems.

    Even though I'm not a betting man, I would lay money that if you rid yourself of memory management problems, your code will no longer contain these "Heisenbugs".
  • by The Mayor (6048) on Wednesday April 11, 2001 @10:35PM (#297270)
    Oh yes, I forgot to mention. If you use Purify, then use it regularly. Set up ignore rules. After a few weeks, you'll find that you no longer get any false positives. Purify is a *very* good tool, and one I have used for years. If you think Purify is too difficult for your problem, then hire a programmer that knows how to use Purify. Purify is *tha bomb*.
  • by Detritus (11846) on Thursday April 12, 2001 @12:13AM (#297271) Homepage
    Anyone who is writing multi-threaded software should have a good understanding of semaphores, critical sections, deadlock, etc. It is easy to generate hard-to-find, hard-to-reproduce bugs if you don't understand these concepts.

    Consider using something other than C/C++.

    You can never be too paranoid. Check all passed parameters for validity. Do range checks on indices. Verify that a pointer to foo actually points to an object of type foo.

    Don't use dynamic memory allocation.

    Have a recovery strategy for resource exhaustion.

    Check for things that "can't happen".

    For every global data structure, you should be able to say which procedures access/modify it, and under what conditions.

    Read some good books on real-time and concurrent programming.

  • Don't code in C/C++. Use a language that enforces types, array bounds, and pointer use. Use Pascal for the kind of low-level programming C/C++ is used for; now that Kylix is available, there's no reason to deal with all the ugly pointers of C.
  • by jon_adair (142541) on Thursday April 12, 2001 @05:51AM (#297273) Homepage

    "Heisenbugs", as you call them, are almost always the result of memory management bugs.

    Absolutely. In about 75% of the cases I've seen, they were from clobbering something on the stack. In one case I built malloc() and free() wrappers and preceded every array or memory reference with an (#ifdef DEBUG) check to make sure the index was in bounds. I found dozens and dozens of cases where the indexes went out of bounds.

    One of my favorites is code sort of like this:
    main() { int x[100]; int i; for ( i=0; i != 101; i++ ) { x[i] = i * i; } }
    Where overflowing the array steps on the loop index.

    Another case I saw that my team chased off and on for weeks was one where we didn't initialize one field of a time-related structure.

    It probably doesn't impact anyone these days, but we spent an hour one day stripping some 16-bit Windows or DOS code down to just a dang printf("Hello world."); and it cored inside the printf(). Finally we noticed that there were a lot of large arrays declared locally in main(), so the stack was almost completely used up. The next function call would core no matter what.

  • by bluGill (862) on Thursday April 12, 2001 @06:13AM (#297274)

    I have a lot of can't-happen checks. They never, or rarely, trigger (anymore). I have to maintain code written by someone less paranoid about buggy hardware: for every 100 triggered "can't happen except on hardware errors" checks, I've fixed 110 software bugs. This on new, untested hardware where I have found hardware bugs by other means.

    I've also fixed several crashes because I knew the code well enough to know what state it should be in and what it should be doing. Once I proved it wasn't in that state, I had to figure out why not, and from there the fix was easy. Unfortunately, figuring out why it was in the wrong state is hard.

    Duplicatable problems are easy to fix. If you can crash in one of 5 cases, then splatter printf's all over those areas. Consider writing your own printf which just writes to memory, not the screen, so you don't block. Then when your program crashes you pull that memory from the core file and you know where each function was last. Just knowing what function each thread was in last is a big clue.

    Finally, code inspections are a must. Get some good programmers who have never seen that section of code and have them inspect it. If nothing else, it will ensure that your comments are meaningful before the programmer quits.

  • Yup. Too many people use threads because they're cool, which is really silly. Very few people can write multithreaded code correctly, and it takes a helluva lot more experience than most people have. It took me three months to write my first heavily multiprocessor/multithreaded C code and get it to run without any problems. Once you learn to avoid the pitfalls it becomes easier, though. In general: spinlock, spinlock, spinlock. It's much easier to use global atomic spinlocks than anything else. And try to do it in Java, which at least gives some handholding, before trying it in nightmarishly complex C or C++, which core dumps for the slightest reason.
  • Unless, of course, you want your application to be portable to other Unices.
  • True, true. But for debugging in general they are fairly useful.

    Dancin Santa
  • Since it sounds like you are using multiple libraries, and there are no problems when you don't use threads, my working assumption would be that one or more of the libs is not thread-safe. It's amazing how often a static variable or two used to share state information or preserve it across multiple calls sneaks into the code. Only takes one of those to break a threaded app.

    My own approach, at least in C, matches the suggestion of a previous post: a polling loop and a switch statement (or equivalent). Breaking code up into appropriate blocks to handle the underlying events (I got a packet; I got a button click; I got something from a pipe) can certainly yield some ugly code. But it does let you guarantee that critical sections are handled properly, and any code that invokes library routines that you can't verify as thread-safe is a critical section.

    I wouldn't want to try this approach if there are a lot of threads (dozens, hundreds) involved.

  • by dmorin (25609) <> on Thursday April 12, 2001 @08:12AM (#297279) Homepage Journal
    A little philosophy, here, goes a long way.
    1. Unless you are dealing with some form of hardware interface beyond the norm (for example, embedded systems) or massively multi-user, then odds are you are working in a deterministic system. It might be an incredibly complex one, but decide up front whether it is a deterministic system or not, and never let that thought leave your head. In a deterministic system, all bugs are reproducible, and there is always logic. You just haven't found it yet.
    2. It's all ones and zeroes. Remind yourself of this occasionally. You know that removing a print statement should not cause a bug to disappear, but the compiler doesn't care. Think like the computer. I used to love tracking down null pointers in C because it's such a simple, raw error: "Some register has a 0 in it that shouldn't. Find the 0, find the register, find who loaded the register with that 0. Done."
    3. Minimalism is your friend. Start by removing everything. Then put pieces back until the bug shows up again. Then remove pieces from the other end. Trap the sumbitch in as small a hunk of code as you can.
    4. Always remember that finding bugs is a good thing, because that means one less bug in your system. So if in your travels you discover something and go "Aha!", fix it, and then discover that it didn't solve your problem, fret not, you still fixed a bug.

    My two favorite examples of past debugging:

    • In an embedded system (medical device) I had a loop that ran almost half a million times. Roughly every dozen runs of the system, I would get a weird math error. Extraordinarily hard to locate. Minimalizing the thing eventually showed me that once in a blue moon, a statement that said "If a < b..." was being evaluated wrong. Turns out that I had a non-threadsafe floating point library, and every now and then I would task switch after the comparison had been done, into a task that modified the compare bit, and then back -- before the result of the compare was acted on! So even though a < b, I would end up executing the else block.
    • Poor hardware guy was going crazy because his printf statement (C program) was causing his code to go into an infinite loop. Turned out that he was telling his format string to expect a short, but then feeding it an int (or something like that, cut me some slack, it was almost 10 years ago). The extra bytes remained on the stack, and when the code tried to return from the procedure, it saw those bytes and returned to the wrong offset! Which in this rare instance happened to be a perfectly valid piece of code, given the numbers that were being left on the stack. Instant loop. If the number being printed had been different, or if the code didn't line up in just the right way that the address was a valid statement to jump to, the bug would have been entirely different.
  • Patience, young Jedi.

    Quite seriously. There are lots of debugging things you could do where first thought is "Ug, that would take me forever." True. It might. But if you start now you'll be done that much sooner.

    Consistency, too.

    The worst thing in the world for debugging is to sit back, go "Hmmmm, I wonder if this is it?", throw in a debug statement and see what happens. That's what you do when you don't really want to find the problem (I used to do that in the old days when I knew that firing up the system took 10 minutes each time). Be thorough and consistent. If you know that your program crashes in one of 12 methods, don't assume it happens in method foo() and only breadcrumb that one -- sit down and do them all. You can always take it out later.

    (I have something of a reputation on my team that after a bug has stumped the developers for days, I am given it and I solve it in hours. Normally the first thing I do is turn debugging statements on for every component. Sure, it takes me a little while to do it, and when I'm done I get megs of output, but I almost always find the problem straightaway, too.)

  • by Milo_Mindbender (138493) on Thursday April 12, 2001 @09:06AM (#297281) Homepage
    Setting up your memory allocators to flood RAM with the NaN (not-a-number) value will catch uninitialized floating point data instantly. To catch other missed initializers, flooding RAM with a different value each time you run can help smoke them out.

    Built-in test code that periodically checks structures to make sure the values in them make sense is also very useful, and easy to do in C++.

    Also, having a few "signature" fields in your structures (essentially, unused data items that are initialized to a specific value and never changed) can help locate tough-to-find memory overwrites. Just check the signature field of a structure before using the other data in that structure. If the signature isn't right, the memory has been overwritten.

    It can also be helpful to have a pointer check routine that looks at the value of a pointer and determines if it is legal for the platform you are on. For example, checking to make sure it falls in a range where the hardware actually has memory, or making sure it doesn't overlap system-reserved areas.

  • I really like mpatrol. You can log pretty much everything (errors, mallocs, frees, news, etc.), plus cause it to instigate errors (i.e. malloc returns NULL, or fill memory with 0xDEADBEEF on alloc). Fix ALL the problems, even the ones which are not causing your Heisenbug... eventually you _can_ collapse the wave function.
  • by edwards (25927) on Thursday April 12, 2001 @09:25AM (#297283)
    A "Heisenbug" is when the act of debugging itself changes or masks the bug. The poster is referring to non-deterministic bugs, such as can easily happen when multithreading.
  • Most so-called Heisenbugs are really race conditions, and their behavior looks non-deterministic only because the interactions involved with context switches and volatile variables can become extremely complicated. The best way to remove Heisenbugs is to program defensively, taking care to prevent possible races from occurring. Examine all the possible places shared data structures can be changed and make sure that they are properly protected by mutexes. Ensure that the mutexes are updated in the proper order to avoid possible deadlocks. Printf debugging works better than any other method here, I find.
  • Which, incidentally, doesn't have any assertions. :)
  • Hehe, I got something like this once with a non-threaded C program. I was writing a small little CGI program to grab a file from my webserver, and display it on a page that was publicly accessible (damn firewall). So, I'm going along, writing data to a socket, then reading the result. First, it would never exit, so I changed to non-blocking. Then it would never get the data. I added a printf() between the write and read for debugging, and the program worked just fine. Removed it, and the program broke again. I eventually figured out that the program was executing the read statement too soon; the server hadn't responded yet. Added a sleep(1) in there and, lo and behold, the damn thing decided to work. Sorry, I know this has nothing to do with threads, but the above comment reminded me of this situation. I'm done rambling now.
  • [stuff about writing in Kylix]

    Or look at Ada 95 [] next time you need to design a piece of multithreaded software. Ada has threading capabilities built into the language, and has proper array datatypes, so there is no need to fiddle with pointers or worry about bounds checking.

    Ada [] rights many of the wrongs of Pascal and has a strong and rich type system built in.

    I am amazed that so many insist on staying away from modern languages that make it easy to find bugs at compile time. Oh well.

    The GNAT [] compiler is GPL'ed and available for Win32 and Linux and quite a number of other platforms.
  • If you have a number of "parallel" calculations or procedures to perform, nothing beats a good implementation of co-routines, hunks of code that are called from a master scheduler or process loop. When done correctly, the response time of a co-routine implementation can be better than a multi-threaded implementation because the operating system never has to worry about context switching.

    Now, that doesn't mean that threads don't have their place. In my programs, the only use for threads (or processes, for that matter) is real-time I/O monitoring, and I do the absolute minimum I can in each thread/process so as to avoid tromping on shared variables. Everything else is under the control of the main program.

    The primary purpose of forking is to deal with human-time activities -- that's one reason that some Web servers will fork a process to handle a request, so that each user sees some response early. Another good reason to fork is to reduce the number of file pointers that have to be monitored. (There has been discussion of this in a number of different places.)

    Just remember that threading imposes an overhead that you can ill afford. Within embedded systems with real-time implications (single-CPU modems are a great example), the allocation of time is critical to balance the need to stream data with the absolute requirement to maintain synchronization with the modem line.

    I believe you will also find that an appropriate co-routine implementation is much easier to debug, because the time relationships between the routines are strictly controlled by the routines themselves -- it is very easy to implement "critical path" segments without the use of semaphores or other tricks.

    The down side? In real-time applications, the key to successfully using co-routines is to design each co-routine to do a useful amount of work without hogging the CPU and still relinquish control often enough to let other time-critical routines run appropriately. Judicious use of event trigger flags set by interrupts and periodically tested by your co-routines can ease this task. The same can be done in Unix applications with judicious implementation of signal-based I/O.

    One aspect of co-routine implementation coupled with the correct use of select() timeouts is that it becomes very simple to implement an "I'm-sane" indicator in the user interface, reassuring your user that just because nothing appears to be happening doesn't mean the program is napping on the job.

  • As a complement to the parent's described behavior, I would encourage the reader to check out GNU pth [], a very portable single-process non-preemptive threading library. With this you get the sort of benefits the parent describes, but the work is mostly done for you already.
  • We have been using Insure++ from ParaSoft [], and it works wonders. In my experience, it catches far more bugs than Purify, the other such product. The only catch is that the execution is a lot slower, since every memory access is monitored. But, what would you rather waste: CPU cycles or your own time? :-)
  • Ok, I could have used a select() to get it working instead, but this was just a quick hack to get data through the firewall. It isn't exactly production software. I was simply looking for a quick and dirty hack and that's what I got.
  • IANA low-level programmer (at the moment), but I like what I understand of Eiffel's "Design by Contract"(tm) assertion/documentation methodology. It seems like it should help the programmer debug and understand his code. The language also has garbage collection and plays well w/ C & C++. And it comes with a large open source library. So why is Eiffel faring poorly in the 'language wars'? Thx a lot.
