Ultra-Stable Software Design in C++?
null_functor asks: "I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries. The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms. Basically, it allows users to crunch a god-awful amount of data over several computing nodes. The application is meant to primarily run on Linux, but should be portable to Windows without much difficulty." While there's more to this, what strategies should a developer take to ensure that the resulting program is as crash-free as possible?
"I'm thinking of decoupling the modules physically so that, even if one crashes/becomes unstable (say, the distributed communication module encounters a segmentation fault, has a memory leak or a deadlock), the others remain alive, detect the error, and silently re-start the offending 'module'. Sure, there is no guarantee that the bug won't resurface in the module's new incarnation, but (I'm guessing!) it at least reduces the number of absolute system failures.
How can I actually implement such a decoupling? What tools (System V IPC/custom socket-based message-queue system/DCE/CORBA? my knowledge of options is embarrassingly trivial :-( ) would you suggest should be used? Ideally, I'd want the function call abstraction to be available just like in, say, Java RMI.
And while we are at it, are there any software _design patterns_ that specifically tackle the stability issue?"
inline code (Score:4, Informative)
You can easily embed C/C++ in other languages. Take a look at Inline::CPP [cpan.org], for example. With code like:
use Inline CPP;
print "9 + 16 = ", add(9, 16), "\n";
print "9 - 16 = ", subtract(9, 16), "\n";
__END__
__CPP__
int add(int x, int y) {
    return x + y;
}
int subtract(int x, int y) {
    return x - y;
}
you can put the parts that need to be fast in C++, and the parts that need to be easy in Perl. (If you do the GUI in perl, you won't have to worry about portability or memory allocation. And the app will be fast, because the computation logic is written in C++.)
> The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms.
Yup. There's no need for the GUI to know how to do computations, remember. The more separate components you have, the more reliable your application can be. Make sure you have good specs for communication between components. Ideally, someone should be able to write one component without needing the other one to test against. For testing, write unit tests that emulate the specs... and make sure your tests are correct!
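One way to make that concrete in C++: define the inter-component interface as an abstract class, and test the other components against a fake implementation that emulates the spec. (A sketch only; the `Storage` interface and its methods are invented for the example, not taken from the poster's design.)

```cpp
#include <map>
#include <string>

// Hypothetical spec: the storage component the GUI talks to.
// The abstract interface *is* the spec; real and fake both implement it.
struct Storage {
    virtual ~Storage() {}
    virtual void put(const std::string& key, const std::string& value) = 0;
    virtual std::string get(const std::string& key) = 0;
};

// Fake used to test other components without the real storage module built.
struct FakeStorage : Storage {
    std::map<std::string, std::string> data;
    void put(const std::string& key, const std::string& value) override {
        data[key] = value;
    }
    std::string get(const std::string& key) override {
        // Spec choice (assumed here): missing keys return an empty string.
        return data.count(key) ? data[key] : std::string();
    }
};
```

The GUI team can then code and test against `Storage` alone, swapping in the real module only at integration time.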
Don't get too fancy... (Score:5, Informative)
Furthermore, you're crunching large amounts of data, so I'm guessing batch processing. If you can have the application not be a server, then you simplify things a lot. Make it a utility that takes data on standard input and runs whatever analysis you need, and duct tape it together with cron or a simple program that watches for new input files.
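A minimal sketch of that utility shape, with the analysis factored out so it can be unit-tested on any stream (the "analysis" here, a running total, is a made-up placeholder):

```cpp
#include <istream>
#include <ostream>
#include <sstream>

// Read numbers from `in`, write the running total to `out`.
// Keeping the core stream-based means main() reduces to
// run_analysis(std::cin, std::cout), and the logic is testable
// without touching real stdin/stdout.
void run_analysis(std::istream& in, std::ostream& out) {
    double total = 0.0, x = 0.0;
    while (in >> x) {
        total += x;
        out << total << "\n";
    }
}
```

Cron (or a tiny watcher script) then just pipes each new input file through the binary; the program itself never has to be a long-running server.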
Also, I'd like to suggest that you consider whether other languages could be efficient for the task. For example, Java is pretty good numerically, and as far as your libraries go, see if you can use SWIG to generate JNI wrappers. Also, then you get Java RMI.
Next, get yourself down to one platform. It's *way* easier to develop software with tight constraints on a single platform than across multiple platforms. Investigate QNX: a reliable operating system (though admittedly quirky) with a beautiful IPC API. In any case, make sure you get a well-tested library with message queues, etc. You don't want to be using raw sockets; you could, but that's just another pain in the ass on top of everything else.
Last, figure out what the cost of a failure is. Getting that last few percent of reliability is very very expensive. Unless you're a pacemaker or respirator, the cost of failure is probably not as high as the cost of five nines of uptime.
Don't code to impress. (Score:5, Informative)
When coding something that needs to be stable, you need to keep your ego aside and concentrate on the task at hand. Stick with tried and true methods; don't go with any algorithm you are not 100% comfortable with, even if it makes the code less ugly. Be sure to follow good practices: make many functions/methods, and make each one as simple as possible; it is easier to check a function for bugs when it is simple. Secondly, document it like you never want to touch the code again (in code and out of code); you want to know what is going on at all times, and the bigger it gets, the larger the chance you could get lost in your own code. When working in a team and you change someone else's code, document that you made the change.
Next, take into account what causes most crashes:
Bad memory allocation / buffer overflows.
Memory leaks.
Endless loops.
Bad calls to the hardware.
Bad calls to the OS.
Deadlock.
If you are going to decouple modules, keep in mind that you will need to do as much processing as possible with minimum message passing, and allow for mirrors so that if one system is down another can take its place without killing the network.
For IPC I tend to like a TCP/IP client-server setup, because it offers platform independence and allows for expansion across the network. Or try other server methods, such as a good SQL server where you can put all the shared data in one spot and get it back. But not knowing the actual requirements, it may just be a stupid idea.
I would suggest that you also ask in places other than Slashdot. While there are many experts on this topic here, there is an equal if not greater number of kids who think they know what they are talking about, or who have their ego invested in some technology or method.
Re:You're not the first one.... (Score:2, Informative)
As for your question about CORBA, look into Ice [zeroc.com] from ZeroC. I read about it somewhere and it sounded cool
Re:You're not the first one.... (Score:3, Informative)
Test, test, test. Use test driven development if you can. Have a good test harness and use it all the time. Do the stereotypical "random input" tests. Test each and every component to destruction white-box style and then double test all your interfaces. Design by contract can help here. Use a tinderbox that runs continual builds. Maintain strict version controls. Maintain code discipline (getting rid of C++ helps here too). Realize that you probably do not (currently) have the skills to produce the kind of product you are talking about and be willing to commit the time and effort to tear up mistakes, to start over and to teach yourself. What you are attempting to do is not easy.
Uphill Battle (Score:2, Informative)
To make your program as crash proof as YOU can control you should validate your requirements using Use Cases, minimize Design Complexity, use good C++ programming practices, and do extensive testing at every level using white box and black box testing techniques. Testing is key, and regression testing after changes is even more key. Don't assume fixing this didn't break that. Test with REAL data if you can. Test with invalid data so you will test your error handling, test at maximum usage levels to validate no memory leaks, resource contentions, deadlocks, etc.
However, at some point things get out of your control, as you don't write the C++ system calls, or the compiler code, or any OS features the code uses. So bugs in those can cause your program to crash. It wasn't your code that crashed, but you'll get the blame. So to be crash-proof it takes a "system" that is crash-proof; your program is just one part of that.
Use state machines (Score:2, Informative)
Here's a great framework to start with:
http://www.quantum-leaps.com/products/qf.htm [quantum-leaps.com]
And the book:
http://www.quantum-leaps.com/writings/book.htm [quantum-leaps.com]
be assertive (Score:2, Informative)
Pretend you are writing code for an airplane... (Score:2, Informative)
I've spent over a decade refining how best to create stable, great software. And guess what? I still learn things every day. If you are really new to enterprise-grade software, the best thing you can do is search amazon and choose 3 to 5 great books about writing stable, bug-free enterprise code and just start reading and scheming. Give yourself lots of time. Be neurotic, type-A, attention to every detail, stay up at night wondering how your system could fail and what you can do to prevent it. Some immediate thoughts, however:
1. Good hardware. Obviously. Redundant everything, self-diagnosing, etc. How can things go wrong? What will go wrong? How can I know when something is going wrong? How can I fix it quickly without impacting the system? Etc.
2. Enterprise grade (n-tier) architecture: You'll definitely want to do something where you have a database running on one or two (or more) machines, at least two business servers and at least two web servers. Redundancy is good. As you suggested, a setup like this lets you isolate problems (and provides for better security in general).
3. Test, test, test. From the very start, every day to the very end. Start coding by writing test suites for your code. Learn about unit testing, black box testing, user testing, regression testing, etc. And hire developers and QA whose sole job is test, test, test using great automated testing software.
4. Profile. Stress-load-test. Know how your system responds to all scenarios. Feel comfortable knowing the limits of your system. There should be no surprises.
5. Assert. Learn the magic of assert(). If your code isn't at least 25% asserts, you are not trying hard enough.
I told myself I was only going to write the first five thoughts that came to my mind, otherwise I could spend weeks trying to answer your question!
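A tiny illustration of point 5; the function and its invariants are invented for the example, but the pattern (assert every precondition and every postcondition you rely on) is the one the parent describes:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average of a non-empty sample set. The precondition and the
// postcondition are stated as asserts, so a violated assumption
// fails loudly in debug builds instead of propagating garbage.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty() && "mean() needs at least one sample");
    double sum = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i)
        sum += xs[i];
    double m = sum / xs.size();
    assert(m == m && "result must not be NaN");  // NaN != NaN
    return m;
}
```

In release builds the asserts compile away (via NDEBUG), so the checks cost nothing in production.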
Re:XP (Score:2, Informative)
Exactly. Three words: Test Driven Development.
Since you're tied to C++, may I suggest CppUnitLite2 1.1 [gamesfromwithin.com]...
It's incredible how much more productive you can be writing the tests first (contrary to what you might think initially). I hardly ever need a debugger anymore, and I know that the code I wrote does the right thing, and doesn't adversely affect something else.
Put the unit tests as a post-build step (or a dummy target in a makefile) and any defect will pop up instantly. If you find a bug not covered by your test suite, add a test that reproduces the problem, ensuring that it will never bite you again.
If you're not familiar with TDD, check out Wikipedia for an explanation and some useful external links: http://en.wikipedia.org/wiki/Test_driven_developm
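CppUnitLite2's actual macros aren't reproduced here, but a minimal hand-rolled check in the same spirit shows the post-build-step idea: the test runner returns a failure count, so the build can abort the moment a defect appears. (The `clamp` function and its "regression" case are invented for the example.)

```cpp
#include <cstdio>

// Minimal CHECK macro: a failed check names the file and line so
// the post-build step (or make target) can print it and fail the build.
static int g_failures = 0;

#define CHECK_EQUAL(expected, actual)                          \
    do {                                                       \
        if ((expected) != (actual)) {                          \
            std::fprintf(stderr, "%s:%d: check failed\n",      \
                         __FILE__, __LINE__);                  \
            ++g_failures;                                      \
        }                                                      \
    } while (0)

// The function under test. A test reproducing a previously found
// bug lives right next to the original tests, so it can never bite again.
int clamp(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

int run_tests() {
    CHECK_EQUAL(5, clamp(5, 0, 10));
    CHECK_EQUAL(10, clamp(99, 0, 10));  // regression: upper bound must hold
    CHECK_EQUAL(0, clamp(-3, 0, 10));
    return g_failures;  // nonzero -> build step fails
}
```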
Re:Here's your best bet. (Score:2, Informative)
Re:Here's your best bet. (Score:5, Informative)
1) Wrap your legacy libs with SWIG
2) Code a working prototype in Python
3) Profile it (never skip this step)
4) Use SWIG to write the bottle neck parts in C++
5) Use Valgrind to ensure you are still OK memory wise
6) Profit!!
Re:Here's your best bet. (Score:3, Informative)
Two additional points:
1.) You don't need to replace all the python code.
2.) Use a garbage collector like http://www.hpl.hp.com/personal/Hans_Boehm/gc/ [hp.com] for your C++ code.
Re:Development Practices (Score:4, Informative)
Jedidiah.
Re:They Write the Right Stuff (Score:3, Informative)
You can still be a balls-out code-monkey if the verification-analysis-requirements-code loop in your organization is well designed.
The part about blame is important, too. The fact that gangs of humans are applying a vast store of partially-learned rules to a purely imagined set of requirements through a skein of lossy transmission lines with any number of distractions means that noise is inevitable, and in something as literal as code for Von Neumann machines, any noise means error.
So it's not your fault, as long as you're honest about it. And if people accept that, you won't try to hide things, and that means the feedback processes will work to iteratively smooth out the kinks until the entire feature space and code set is 100% covered by correct functionality and test.
Assuming all of the requirements are valid, for which there are only heuristic tools and few guarantees...
my experience (Score:3, Informative)
I've dealt with software that automatically restarts a dead process, and in my experience, it doesn't work so well. If you want ultra-stable software, you want to know what caused the crash and why.
For your situation, where I guess you're doing lots of time consuming computing, I'd think you should also set checkpoints, save intermediate results, or something, so if it does crash, you can restart in the middle instead of going back to 0. (A standard practice when I was analyzing large databases for corruption, a task that could take days)
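A sketch of that checkpointing idea, stream-based so it works the same against a file or a test buffer (the checkpoint format and field names are invented for the example):

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Persist how far the computation got plus its partial results,
// so a crashed run can resume mid-stream instead of restarting at 0.
struct Checkpoint {
    long next_record;              // where to resume in the input
    std::vector<double> partials;  // intermediate results so far
};

void save_checkpoint(std::ostream& out, const Checkpoint& cp) {
    out << cp.next_record << " " << cp.partials.size();
    for (std::size_t i = 0; i < cp.partials.size(); ++i)
        out << " " << cp.partials[i];
    out << "\n";
}

// Returns false on a missing or garbled checkpoint, in which case
// the caller falls back to starting from the beginning.
bool load_checkpoint(std::istream& in, Checkpoint& cp) {
    std::size_t n = 0;
    if (!(in >> cp.next_record >> n)) return false;
    cp.partials.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        if (!(in >> cp.partials[i])) return false;
    return true;
}
```

In practice you'd write to a temp file and rename() it into place, so a crash mid-save never corrupts the previous good checkpoint.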
Re:You're not the first one.... (Score:3, Informative)
The use of a managed language will not necessarily result in more stable code. Recovering from SIGSEGV by installing a POSIX signal handler, or detecting the death of a forked child process, can be done in C++ with much the same ease as catching a NullPointerException in Java.
I would agree that having to write memory management code is error-prone, but it is possible to be careful (i.e., use auto pointers, STL vectors instead of arrays, etc.). You do need to be very good with C++, however.
My suggestion to the author is to look at the application servers that support C++ components. I worked with a small, relatively unknown server, but even that product had a feature that kept the server up even when some C++ component (running as a forked process) crashed.
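The forked-child approach the parent mentions can be sketched in a few lines of POSIX C++ (this is an illustrative watchdog, not the API of any particular application server; `run_supervised` and its policy are invented):

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run `task` in a forked child; if the child crashes or exits nonzero,
// restart it, up to `max_restarts` times. Returns the restarts needed.
// A segfault in the child takes down only the child, not the supervisor.
int run_supervised(int (*task)(), int max_restarts) {
    for (int restarts = 0; ; ++restarts) {
        pid_t pid = fork();
        if (pid == 0) {
            _exit(task());          // child: do the work, report status
        }
        int status = 0;
        waitpid(pid, &status, 0);   // parent: wait for the child to finish
        bool ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
        if (ok || restarts >= max_restarts)
            return restarts;
        // else: child died or failed; loop around and restart it
    }
}
```

As the grandparent thread notes, this only masks the bug; logging the child's exit status (and core dump) so the root cause gets fixed is the other half of the job.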
good coding techniques (Score:5, Informative)
As for GUI programming, if you are strictly tied to C++, I would recommend Qt (www.trolltech.com); they have a fabulous API (it takes getting used to, but it makes sense once you do). The nice part about Qt is that it is source-portable to just about every major platform (X11, Win32, Mac).
It is possible to write reliable, fault-tolerant code in C++ (realize, please, that perfect code is impossible in any language); it just has to be well thought out and done right.
proxy
Consider Ada (Score:2, Informative)
Using the good-old Ada programming language (for which a new standard should be issued this year), you can go closer to what you're looking for (though I'm not sure your goal is realistic).
Here's a pointer to the new standard : http://adaic.org/standards/05rm/html/RM-TTL.html [adaic.org]
With Ada:
It probably won't solve all of your problems, but it might help.
For a free, quite strong Ada compiler, have a look at https://libre2.adacore.com/ [adacore.com] (it's based on GCC).
Oh yes, Ada is a statically, strongly, strictly typed language (e.g. the compiler won't let you assign an integer to a float variable). My opinion is that it's a Good Thing for critical programs. Useless to restart a "type war" on this subject ;-)
Good luck.
Re:Question from a Newbie (Score:2, Informative)
Re:Development Practices (Score:4, Informative)
There is no silver bullet for what you describe other than sound development practices.
True, but it should be pointed out that C++ is well-equipped to make such sound development practices easy. Consider the major sources of instability in C programs:
In my experience, doing the above religiously will ensure you never see segmentation faults. The next step, of course, is to make sure your code correctly implements the desired functionality. C++ is no different from Java or any other OO language in this respect. Clear requirements definition, modularity, clean separation of concerns and testing, both automated and manual, are the basic keys to generating correct and maintainable code in any language.
[*] A story: I once asked a guy on my team to write a little program to monitor a bank of modems, accepting incoming calls and exchanging data with the callers. He spent two weeks and produced nearly 10,000 lines of code
Re: Consider Ada (Score:3, Informative)
Yes, it's very frustrating when you first start using it because it makes you say what you mean and mean what you say, but once you get over that you find yourself writing programs where certain kinds of bugs simply never happen.
The strong typing (no, strong, not the fluff some other languages call "strong") and run-time checks work to move error-discovery forward. That is, you find stuff a compile time that most languages leave you to find at run time, and you find stuff at run time that most languages leave you to (hopefully) notice that you're getting incorrect outputs.
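For what it's worth, a slice of that discipline can be imported into C++ with distinct wrapper types, so the compiler rejects mixed-up arguments the same way Ada does (the `Meters`/`Feet` types here are invented for the example):

```cpp
// Distinct wrapper types approximate Ada's strict typing: Meters and
// Feet do not convert into each other, even though both hold a double.
struct Meters { double v; explicit Meters(double x) : v(x) {} };
struct Feet   { double v; explicit Feet(double x)   : v(x) {} };

Meters to_meters(Feet f) { return Meters(f.v * 0.3048); }

// descend(Meters) will not accept a raw double or a Feet value;
// the mistake surfaces at compile time, not as a wrong answer at run time.
double descend(Meters altitude) { return altitude.v; }
```

`descend(Feet(100.0))` simply fails to compile, which is exactly the error-discovery-moved-forward effect the parent describes.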
Use STL; avoid pointers; avoid fixed buffers. (Score:4, Informative)
I'm sure that there's tons I've left out, but this has worked reasonably well for me. The only problem is that the STL can be slow. Sure, map may be O(log(n)), but the constants are huge. Unfortunately, for practical reasons, performance and security are often inversely proportional.
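On the "constants are huge" point: when the data is mostly read-only, a common trick is a sorted std::vector searched with std::lower_bound, which keeps the O(log n) lookups but with much better cache behaviour than a node-based map. A sketch (the key/value layout is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<int, int> Entry;  // (key, value), kept sorted by key

// Compare an entry against a bare key, so lower_bound ignores values.
struct KeyLess {
    bool operator()(const Entry& e, int key) const { return e.first < key; }
};

// Binary-search a sorted table; returns false if the key is absent.
bool lookup(const std::vector<Entry>& table, int key, int& value_out) {
    std::vector<Entry>::const_iterator it =
        std::lower_bound(table.begin(), table.end(), key, KeyLess());
    if (it == table.end() || it->first != key) return false;
    value_out = it->second;
    return true;
}
```

The trade-off: inserts into the middle are O(n), so this only wins when the table is built once and queried many times.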
Build it around a stable database (Score:2, Informative)
I am by no means a specialist in this field and of course I do not know whether your project actually allows this approach.
But if I were asked to do this, I would take a database (a stable release of MySQL is what I would choose) and use it both as the persistent object storage and communication module. The GUI and the number-crunching module(s) would be set up to primarily communicate through the database, rather than directly with each other. A task state/queue table in the database would inform the modules what tasks have not been assigned yet, are running, are complete, have not returned in the expected time, or have failed. This would make it asynchronous and highly traceable; databases are (supposed to be) good at managing the interactions between multiple user processes and still maintaining data integrity. Admittedly this is not the best approach if you want your results real-time.
The central management of the processes could be kept minimalistic and simple, and "therefore" robust: some very simple communication with number-crunching processes to test whether they are alive (a TCP/IP socket read-write might do), re-opening a task that has not returned in an expected time period (if its process still returns later, the newly started process will have to detect that its work was already done, and discard its results instead of writing them back), and perhaps signalling critical task completion to users (by GUI message, e-mail, text message, ...). The central management would not have the startup responsibility for distributed number-crunching modules; that would remain with the local servers they are running on. Such a process can then "knock on the door" of the database, register its presence, and take the next available task, or wait until one is available.
The persistent form of the core data structures would be in database tables, but the modules would of course have their share of the data in memory as class representations of the data structures, defined to be initialized from the database tables and written back to them. These class representations of the data structures then could be in a common library shared by the different modules, but alternatively you might opt for different class representations for e.g. the GUI and the number-crunching modules if that is more efficient (it often is) and even write them in different languages if that is more convenient. I admit that that adds to the amount of code and therefore to the amount of bugs. On the other hand, you could write two "completely" independent implementations of the same task.
If your number-crunching is complex and long, then evaluate whether you can write back intermediate states to the database as a recovery point, or even split the calculations in completely independent modules, each one starting and ending with a given database state. The desirability of this depends, of course, on the balance between I/O and processing costs. If you have modules that are relatively simple and safe but need to work quickly through a large amount of data, you could consider database stored methods for these; not very distributed but it reduces the amount of I/O and they can easily be called by client processes.
The database does not care in what language the different modules are written, so you can then write every one in the language that is most appropriate. For example, there may be no reason at all to write (parts of) the GUI in C++ -- and that is something I would try to avoid. If performance allows it, I would use Java for the GUI, both for portability and simply to avoid the mess of writing user interfaces in C++; in my experience that does not tend to be the most stable solution.
For the C++ part I would start by structuring pretty strongly; write a large number of simple classes instead of a smaller number of complex ones, and test every class before you move on to the next level. The "salami approach" works well if you plan it well. It is perfectly possible to write very robust C++ this way.
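The task state/queue idea above, reduced to an in-memory sketch; in the real design each transition would be an UPDATE on the queue table, and the names here are invented:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// States mirror the proposed queue-table column:
// unassigned -> running -> (done | failed), with timed-out tasks reopened.
enum TaskState { UNASSIGNED, RUNNING, DONE, FAILED };

struct Task {
    std::string id;
    TaskState state;
    explicit Task(const std::string& i) : id(i), state(UNASSIGNED) {}
};

// A worker "knocks on the door" and claims the next unassigned task.
// Returns the index of the claimed task, or -1 if none is available.
int claim_next(std::vector<Task>& queue) {
    for (std::size_t i = 0; i < queue.size(); ++i) {
        if (queue[i].state == UNASSIGNED) {
            queue[i].state = RUNNING;  // in SQL: UPDATE ... SET state='running'
            return static_cast<int>(i);
        }
    }
    return -1;
}

// The watchdog reopens a task whose worker never came back in time.
void reopen(Task& t) { if (t.state == RUNNING) t.state = UNASSIGNED; }
```

Against a real database, the claim step would need to be a single atomic statement (or wrapped in a transaction) so two workers can't grab the same task.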
Have you have flown on a commercial airline? (Score:4, Informative)
There is a standard called DO-178B Level A that applies to aircraft software upon which lives depend. There is a saying in the commercial avionics business: "Nobody has ever died from a software failure on an airplane, yet." There have been some accidents where software played a role, but I won't quibble with that now.
The point is that safety-critical software is developed routinely. It has been developed in assembly language. It has certainly been developed in Ada, C, and subsets of C++. It is expensive. Validation of avionics software and certification in an aircraft can easily cost an order of magnitude more than just writing the software, and writing the software using the required processes and producing the required artifacts is not cheap either.
Re:Fault Tolerance Vs. Stability (Score:3, Informative)
Wow. An entire thread devoted to this question, and so far this is the only answer that actually addresses the problem. Every other suggestion seems to be "change languages" or "here's how to avoid bugs".
Anyway, let's talk specifics here. For the theoretical end of software fault tolerance, you can get a quick overview here [ibm.com] or here [cmu.edu].
In terms of practicalities, I know of an older fault tolerance library for Unix that includes watchdog, checkpointing, and replication utilities, and was created by AT&T (details and downloads here [att.com]). A newer version appears to be available for Windows under the name SwiFT [bell-labs.com]. Disclaimer: I haven't actually used either of these, and they both seem a little old. But I don't know of anything newer that isn't a proprietary in-house solution.
Finally, in terms of general design patterns for fault-tolerant distributed systems, you can't go past Joe Armstrong's PhD thesis, "Making reliable systems in the presence of software errors" [www.sics.se]. While it's mostly about Erlang, many of the ideas he presents are readily applicable to other languages too.
Re:Listen to what he said!! (Score:4, Informative)
So what he needs to do is develop a design that is robust in the face of errors. In other words, it needs to be fault tolerant. There are well-known design practices for doing this (checkpoints, watchdogs, rollbacks, etc.) as well as design patterns for robust distributed computation (see, for one example, Joe Armstrong's thesis on making reliable systems in the presence of software errors [www.sics.se].
No, the situation the OP is in is not ideal. But it's also not impossible to work with, and there are techniques that can help him to get closer to achieving his goals within the constraints placed upon him.
Learn to use STL and don't *ever* type "new[]" (Score:3, Informative)
1) Do *all* memory management via STL vector/string.
2) Don't ever type "new[]"/"delete[]".
Just don't do it. Not. Ever. Use std::vector instead.
"Arrays are evil" - the C++ FAQ.
PS: You can still use malloc()/free() but only as a last resort in low-level classes which are designed for data storage.
3) Get a reference-counted pointer and use it.
Automatic memory management...'nuff said.
4) Attach an alarm bell to your "~" key.
If you're writing destructors for classes which don't control system resources (eg. files) then you're probably doing something wrong - see notes 1, 2 and 3.
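Points 1 through 4 in one sketch (illustrative only; this uses the now-standard shared_ptr where code of this vintage would have reached for boost::shared_ptr):

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// No new[]/delete[], and no hand-written destructor anywhere: the
// vector owns its buffer, the string owns its characters, and the
// reference-counted pointer owns the heap object.
std::string join(const std::vector<std::string>& parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i) out += ",";
        out += parts[i];
    }
    return out;  // freed automatically wherever it ends up
}

struct Node {
    std::string name;
    explicit Node(const std::string& n) : name(n) {}
    // Note: no destructor needed; nothing here owns a raw resource.
};

std::shared_ptr<Node> make_node(const std::string& n) {
    return std::make_shared<Node>(n);  // point 3: reference-counted
}
```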
Re:You're not the first one.... (Score:3, Informative)
Of course it doesn't - you just killed it.
C++ runtime libraries are loaded into the program's memory space, just like the Java runtime is in the memory space of whatever Java program you are running. The difference is that the operating system has a specific loader for the kind of files (ELF on Linux) a C++ compiler produces, and labels them accordingly in the list of running programs instead of labeling them all as ld.so processes; on the other hand, the Java runtime contains its own code for loading Java programs, so the system will label them all as Java processes.
Did this have any point?
And for all you know, gcc could just cat your C++ source files together and add a runtime interpreter to them. If that runtime interpreter was "darn fast", would you know or care?
I'm trying to understand your point, I really am, but it seems to be some weird idea of an interpreted language being bad even if it is as fast as a precompiled one, and that just doesn't make sense to me.
Also, did you mean "Java runtime" instead of "Java process"? The JVM and the program being run form a single process, just like a C++ program and its runtime libraries, so this really doesn't make sense otherwise.
C is interpreted, then. On Linux, a C program depends on ld.so to bring all the necessary C libraries into the process's memory space and link them just prior to execution (or during execution, if you use dynamic lib loading). And, of course, it is the kernel that reads it into memory in the first place.
More generally, each and every program depends on some other process to read it in and "figure stuff out", unless you want to suggest that the code will magically jump from the disk into main memory? The BIOS is probably the only exception to this rule, since it resides in ROM and is in the computer's main memory at boot. Every other program, including the operating system, is loaded by some other program and is therefore interpreted by your standards. And of course a modern processor will also change machine-code execution order to get more execution speed, a clear example of compilation :).
The real question is: does it matter? Why on Earth would I care whether my code satisfies some completely arbitrary requirement of being "non-interpreted", if the interpreter is so smart that I can't measure a speed difference?
Re: cycle (Score:1, Informative)
See also: wikipedia article on reference counting [wikipedia.org].
Re:I'm gonna take a guess, but.. (Score:3, Informative)
This is why all of the comments about using a managed language are completely missing the point. Catching exception conditions before they terminate the process is great for tracking down bugs and for code that's allowed to recover from errors, but if you need error-free code, then you cannot afford to have these exception conditions in the first place.
Your other point about checking all return values is 100% right.
My own recommendation is to keep the code as simple as possible, and avoid anything that 'hides' code. Use OO design if you like, but be very wary of C++ constructs that might result in a whole load of code being fired off behind your back. This is especially true of operator overloading and temporary objects with their constructors.
Do as little dynamic memory allocation as possible. Sometimes you can arrange things so that you know before you start exactly how much memory you'll need for everything and then allocate this all statically. Even consider accepting arbitrary limitations on data sizes if it helps achieve this and doesn't cost much in terms of flexibility.
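One way to do that: grab everything up front and treat exceeding the limit as an explicit, checkable error rather than a silent reallocation. (A sketch; the class and its policy are invented for the example.)

```cpp
#include <cstddef>
#include <vector>

// All memory is reserved once, at startup; after that, add() never
// reallocates, and an over-limit insert is reported, not absorbed.
class BoundedLog {
public:
    explicit BoundedLog(std::size_t limit) : limit_(limit) {
        entries_.reserve(limit);  // the one and only allocation
    }

    // Returns false instead of growing past the preallocated limit.
    bool add(int value) {
        if (entries_.size() >= limit_) return false;
        entries_.push_back(value);  // guaranteed not to reallocate
        return true;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::size_t limit_;
    std::vector<int> entries_;
};
```

This is exactly the "accept an arbitrary limitation" trade: you give up unbounded growth, and in exchange the steady state can never fail an allocation or fragment the heap.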
QuickCheck (Score:3, Informative)
The idea is simply to define the "space" of legal inputs for each module and the correctness criterion for each input, and then generate random inputs based on the spec. This is far more effective than traditional hand-coded test data at both unit and system test levels, and as an added bonus the test spec doubles as a formal specification of the correct behaviour that coders can actually work from. This is similar to the XP practice of "test-driven development".
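QuickCheck itself is Haskell, but the idea ports to C++ in a few lines. A hand-rolled sketch, with an invented property (sorting preserves length and yields non-decreasing output) and a trivial generator:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// The correctness criterion for one input.
bool sort_property_holds(std::vector<int> xs) {
    std::size_t n = xs.size();
    std::sort(xs.begin(), xs.end());
    if (xs.size() != n) return false;           // length preserved
    for (std::size_t i = 1; i < xs.size(); ++i)
        if (xs[i - 1] > xs[i]) return false;    // non-decreasing
    return true;
}

// Draw `trials` random inputs from the legal-input space and check
// the property on each. A fixed seed keeps failures reproducible.
bool quickcheck_sort(int trials, unsigned seed) {
    std::srand(seed);
    for (int t = 0; t < trials; ++t) {
        std::vector<int> xs(std::rand() % 20);  // length 0..19
        for (std::size_t i = 0; i < xs.size(); ++i)
            xs[i] = std::rand() % 100 - 50;     // values -50..49
        if (!sort_property_holds(xs)) return false;
    }
    return true;
}
```

A real harness would also shrink a failing input to a minimal counterexample, which is where QuickCheck proper earns its keep.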
Paul.
Re:You're not the first one.... (Score:2, Informative)
By the way I couldn't care less what you say about "the mantra", I just don't like it when false information gets spread. So until you can find proof that "managed code" running under the CLR requires that it be interpreted, I'm going to stick with the impression you're wrong.
Sources: The Common Language Infrastructure Annotated Standard [1st edition] pg 8. Figure 1-4 (Execution Model) on the following page clearly shows "unmanaged native code" and "managed native code" existing alongside each other. Since this book is basically an expanded version of the ECMA CLR specification, the quote and figure are probably available online for you to check, but the page numbers are probably different.
Please cite your sources that show managed code==interpreted and hence satisfy your premise before drawing the conclusion that MS have redefined terminology.
My Top Ten (Score:3, Informative)
Bulletproof code isn't cheap, but it can be done.
This is the most insightful comment I've seen so far. Particular tools can fix particular problems, but that's the easy part. The hard part is finding and noticing the problems, so that you know to look for (or make) the tools.
My teams have in-production bug rates well below one per developer-month. Here are the ten things I think are most important:
Re:inline code (Score:3, Informative)
A language that leaks memory in real use cases is just not good enough. A language that slows down execution for periods of time may be good enough. The difference is that the first impairs correctness, while the second does so to performance. While bad performance can be tolerated, incorrectness can't.
Re:Sorry, *not* in C++ (Score:3, Informative)
You might want to let them know about that.
Aeralib [faa.gov]