Learning and Maintaining a Large Inherited Codebase?
An anonymous reader writes "A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't. I spend a huge amount of time finding the right place to make a change, far more than I do changing anything. How would you learn such a big hunk of code? And how discouraged should I be that I can't seem to 'get' this code as well as the original developers?"
Time (Score:5, Interesting)
A good starting point (Score:4, Interesting)
Some things I do to figure out code... (Score:2, Interesting)
Re:30 to 40 thousand lines isn't large by any meas (Score:3, Interesting)
I inherited a code base of 1.5 million lines of code at the last job I was at. Thankfully I wasn't the only one responsible for it. My advice to the original poster is to add lots of logging information. Log statements should document what the code is doing at any point in time and tell you where it is doing it. If it's Java you can get the stack trace from anywhere, which is very handy for logging.
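A minimal sketch of that kind of logging, written in Python rather than Java (my choice for the example, not the poster's). The logger name and the `apply_discount` routine are hypothetical stand-ins for inherited code; `stack_info=True` is the standard-library option that attaches the current call stack to a log record.

```python
import io
import logging

def make_logger(stream):
    """Build a logger whose records say where they were emitted."""
    logger = logging.getLogger("inherited_app")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # File, line number and function name answer "where is it doing it?"
    handler.setFormatter(logging.Formatter(
        "%(levelname)s %(filename)s:%(lineno)d %(funcName)s: %(message)s"))
    logger.addHandler(handler)
    return logger

def apply_discount(logger, price, percent):
    """Hypothetical legacy routine, instrumented to report what it does."""
    # stack_info=True appends the call stack that led to this line.
    logger.debug("price=%r percent=%r", price, percent, stack_info=True)
    result = price * (100 - percent) / 100
    logger.debug("result=%r", result)
    return result

def demo():
    """Run the instrumented routine and return the captured log text."""
    buffer = io.StringIO()
    apply_discount(make_logger(buffer), 200, 25)
    return buffer.getvalue()
```

Calling `print(demo())` shows each message prefixed with its file, line and function, followed by the stack for the first call.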
Re:Use it (Score:2, Interesting)
Am I off base here? What do you think about intermediate variables that are not strictly necessary?
I can't say you're off base per se (I don't have nearly enough production dev experience to make statements like that, and even if I did, I couldn't speak for everyone), but my personal style is not quite the complete opposite of yours.
I pretty heavily use intermediate variables. Why? A couple big reasons. One, if you give the temporary variables decent names, they serve as additional documentation. Two, if you're debugging, you can look at those intermediate values in a debugger (or log them) much easier than you could if they weren't explicitly stored somewhere. In most graphical debuggers you can just hover the mouse over a variable and see its value; if you didn't have that variable, you'd have to enter the expression in the immediate window or set up a watch or something like that.
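To make that concrete, here is a rough sketch of the same made-up calculation written both ways. The function names, fees and factors are invented for illustration only.

```python
def shipping_cost_dense(weight_kg, distance_km, express):
    """Everything in one expression: compact, but nothing to inspect."""
    return round((2.5 + 0.8 * weight_kg) * (1 + distance_km / 1000)
                 * (1.5 if express else 1.0), 2)

def shipping_cost_named(weight_kg, distance_km, express):
    """Same arithmetic, with each step named so it can be watched or logged."""
    base_fee = 2.5 + 0.8 * weight_kg           # flat fee plus per-kg charge
    distance_factor = 1 + distance_km / 1000   # grows with distance
    express_multiplier = 1.5 if express else 1.0
    total = base_fee * distance_factor * express_multiplier
    return round(total, 2)
```

Both return the same value; the second one just gives a debugger (or a log statement) three extra names to show you.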
Re:Not at all. (Score:1, Interesting)
Being a genius means getting the right feature out on time.
The customers don't care about craftsmanship. It sucks, but
deal with it.
Re:30 to 40 thousand lines isn't large by any meas (Score:3, Interesting)
Just out of curiosity, what is your opinion of a "Large" codebase then?
My first programming job was on an enterprise system that was over 7 million lines of just C++ code by the time I left, not including SQL stored procedures, web server code for the reporting system, and surely other code stuff that I can't recall. The entire development team for the system was something like 45 programmers. So to many of us, 30-40 klocs does not seem like a large codebase at all.
That said, I've also inherited code in the 10-50 kloc area of magnitude that was far more of a challenge/nightmare to decipher and maintain than that 7 million line system was. Code maintainability has more to do with good system architecture and coding standards than it has to do with the size of the code base; without those your system will likely collapse under its own bloat long before it can grow to millions of lines.
That's small (Score:3, Interesting)
Medium size is 250 to 750 thousand lines of code (one person can still understand how it all works). Big is 1 to 10 million lines of code. Really big is >10 million.
I have worked on code bases of all of those sizes, and I like the medium size the best -- it's big enough to be interesting, and small enough that you can understand it all.
One that I've worked on (over 25 million lines) is just too big for my tastes -- over 3 hours to do a clean recompile is excessive.
Re:Large? (Score:5, Interesting)
Are you Microsofties really so stupid and ignorant that you're not aware of the ports of GNU utilities to Windows [sourceforge.net] or Cygwin [cygwin.com] or even your own company's Interix [wikipedia.org] and Services for UNIX [wikipedia.org] products?
No, but to explain this, I need to give you some background.
When I joined Microsoft, I hadn't used any version of Windows at all for any reason other than playing games. After joining Microsoft, I never used Windows at home for any purpose other than logging into the VPN to work from home... and since I did not even have an x86 machine, this required using Virtual PC on my Mac OSX box.
Now, I know of all of these tools, and I even could install GVim on the machine as well. However, I was working in a Build Group. This required me to occasionally log into 100 different machines at once in order to start the build process for WinXP/Server 2003. Most of these machines require no more input than logging in and starting up a single app... thus no reason to install special software on them.
Then, something would break, and I would have to read logs, and/or code on the actual box that had the exact problem. Spending an hour installing apps to do my job would be an unacceptable use of my time, and delay the build unnecessarily.
I learned to use the tools that were available with the environment that I was in. Thus, I did almost all of my programming at Microsoft in notepad.exe, and I'm not kidding you.
Were I in a different group, the results could have been different... but having 100 different machines, most of which I didn't have admin rights to, meant that even just installing Notepad++ or something like that would have been a waste of time.
Re:Try to learn the structure (Score:3, Interesting)
Depending on the language and domain, one way to speed up learning the structure can be to see if you can match it to some set of programming idioms, and then read up on those idioms if it's not a style of programming you're familiar with. For example, if it's C++, can you figure out by looking at the code's layout whether it was written by someone big into C++ design patterns? If so, it might be easier to reverse-engineer what it's doing if you read a C++ design-patterns book, and then match large segments of the code to "oh it's just implementing [pattern]". In some languages there are 3-4 main styles of programming, and figuring out which of them the author adhered to, and then reading something up on that idiom, can really speed things up.
Re:Large? (Score:5, Interesting)
What the hell? Are you serious?
So Microsoft themselves hired you to work on Windows, although you were a Mac user and had absolutely no real experience with Windows?
Not only that, but you had to manually log in to hundreds of systems just to run a script? They didn't push for this to be automated, and you tossed back on the street where you belong? What the hell?
Don't get me wrong, I don't doubt that your story is true. It's the sort of shit that we should expect from any large company, especially Microsoft. Please tell me you're an H1B, though. At least then it'd make some sense why they'd hire you. H1Bs typically aren't worth more than a batch file.
Yeah, it took me about a month before I understood that my entire group would be replaced by a few scripts in the Open Source world.
The primary problem was that because the source code was not a "product", the build code was so full of holes and edge-cases and hacks, that it broke almost constantly, and required someone to babysit it for the whole 14-some hours that it takes to compile.
Actually, in my orientation class, we went over patents, copyright, and trademark, and I knew it all, and the teacher asked me how I knew so much, and I told her that I owned a registered copyright on some GPL code, and she was like, "and your managers hired you knowing that?" And I was like, know about it? It's the only reason I got hired by Microsoft... be damn sure I didn't submit a resumé.
I'm afraid the time may already have passed (Score:2, Interesting)
If both the original developers and the knowledge they had have been lost, then it is probably already too late to perform any major maintenance on this code base. The project has already entered its “servicing” stage.
At that point, you basically have two possible approaches that actually work: you can restrict maintenance to small-scale changes, which may be sufficient if the goal is just to keep the project ticking over for a while, or you can accept The Big Rewrite (which isn’t so big in this case) in order to get a project that can be properly maintained.
If you want to go down the tactical changes path, there are a couple of approaches to finding your way around the code.
If you’re familiar with the general field of the software, just not this particular code, then you can work top-down. Start with the key, high-level concepts you know the program implements, and try to find the code that represents those:
Hopefully, if the code has a reasonable modular design and you just don’t know what it is yet, this sort of approach will identify the organisation of the code at a very coarse level, but then you can try to break down each area in more detail the same way.
Alternatively, you can work bottom-up. Find a significant starting point, such as:
Examine the code near that point. Look at what kinds of data it works with. Look at what functions it calls, and what functions call it. Try to figure out the wider significance of the code you started with, and the other code to which it relates. Then move up a level: what is the purpose of all of that code collectively? Repeat until you’ve explored as far as you need to.
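As a rough sketch of the "what does it call, and what calls it" step: if the inherited code happened to be Python (an assumption; the question never names a language), the standard `ast` module can build a crude call index. The `SAMPLE` source and its function names are made up, and the index only sees direct calls by plain name, not method calls or anything dynamic.

```python
import ast
from collections import defaultdict

def call_index(source):
    """Map each function name to the set of plain names it calls."""
    tree = ast.parse(source)
    calls = defaultdict(set)
    for func in ast.walk(tree):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Calls inside nested functions are also credited to the outer one.
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    calls[func.name].add(node.func.id)
    return dict(calls)

def callers_of(index, target):
    """Invert the index: which functions call `target`?"""
    return sorted(name for name, callees in index.items() if target in callees)

# Made-up source to explore.
SAMPLE = '''
def load(path):
    return parse(read(path))

def parse(text):
    return text.split()

def report(path):
    print(len(load(path)))
'''
```

Starting from `load`, `call_index(SAMPLE)["load"]` gives the functions below it and `callers_of(call_index(SAMPLE), "load")` gives the ones above it, which is one level of the walk described above.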
After some other discussions about these topics, I recently wrote up a couple of articles with some more background information than I’ve given here — link in my sig if anyone’s interested (though be warned that they are pretty long).
Design patterns are your friend (Score:2, Interesting)
"A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't."
A couple of times in your career? You must be lucky. Most jobs you can get coding will always involve taking over someone else's code.
In my experience, design patterns are your best friend, bearing in mind that most of the code base will always remain a black box to you.
For example, when I was doing some health insurance work, I had inherited a code base that was substantially larger than 30 or 40 thousand lines of code. The objective was to make the code that used an older, fixed-length record format work with the newer X12 837 EDI format, which is basically XML but almost without any tags to help you figure out where the data begins and ends. Suffice it to say that the task was to figure out how to smoothly stick a square peg in a round hole.
The task itself determined the design patterns, of which an adapter pattern was the most used. The type of pattern in turn dictated what in the code to look for in order to implement it, and (of course) how the new code would be built. For example, since we were using an adapter pattern, the first order of business was to find out how the data was represented in the code base, and then trick the "black box" into using your own spiffy, new representation of the data.
For the most part I didn't have to care all that much how the application handled the data as long as I got the right data into a form the application would accept in my adapter.
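A toy sketch of the adapter idea, not the actual insurance code. Every name and both record layouts here are invented for illustration: the fixed-width layout is made up, and the delimited record is NOT a real EDI segment. The point is only that the "black box" keeps seeing the attributes it always used.

```python
class FixedLengthClaim:
    """Stand-in for the legacy representation: fields sliced from a fixed-width line."""
    def __init__(self, line):
        self.member_id = line[0:10].strip()   # columns 1-10 (invented layout)
        self.amount_cents = int(line[10:20])  # columns 11-20, zero padded

def legacy_total(claims):
    """Stand-in for the black box: it only touches amount_cents."""
    return sum(claim.amount_cents for claim in claims)

class DelimitedClaimAdapter:
    """Presents a delimited record through the attributes the legacy code expects."""
    def __init__(self, record, sep="*"):
        fields = record.split(sep)            # hypothetical layout: TAG*member*amount
        self.member_id = fields[1]
        self.amount_cents = round(float(fields[2]) * 100)

# Old-style and new-style data flow through the same untouched legacy function.
old_claim = FixedLengthClaim("A123".ljust(10) + "12550".rjust(10, "0"))
new_claim = DelimitedClaimAdapter("REC*B456*74.50")
```

`legacy_total([old_claim, new_claim])` works without any change to `legacy_total`, which is the whole appeal: the adapter absorbs the format difference.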
Re:A good starting point (Score:2, Interesting)
I think this is the fastest way to find the right place to make a change. Stepping the application through a debugger is probably faster than reading through the code to learn how things are done.
Re:Use it (Score:1, Interesting)
The main thing that bothers me when working with other people's code is the sheer number of variables they use. I tend not to declare a new variable unless it is absolutely necessary (and in object-oriented programming variables other than pointers are almost never necessary). It seems like code written this way is easier to read and understand (and significantly smaller). This is slashdot, so there are a lot of other programmers out there. Am I off base here?
No, you're on target. Making a variable to temporarily store a value amounts to writing unnecessary plumbing. You can abstract that plumbing out very easily, through "functional monadism". This makes things much easier: you can manipulate the plumbing apart from the things it plumbs.
You are definitely on the right track. If you haven't done this, try learning SQL, and then a functional programming language. You will see that the computation of a function amounts to the computation of a subset of the cartesian product of types (or sets, more generally). Evaluating a function amounts to evaluating a query. It's easiest to write queries against data types in certain "normal forms". This means that a program has three essential components: definitions of the normal forms (the data definition languages for SQL), queries to run against values of these forms, and, as a practical matter, data to query.
My Dick is Bigger than Your 250,000 lines of code (Score:5, Interesting)
Really. A guy asks a question for help and all of these people keep telling him 30-40,000 lines of code isn't much.
That's a lot of code to get your arms around if you didn't write it. It's not the end of the world, but it is a sizeable task, and is the type of topic that few professional journals or books will ever be written about.
Having been in similar situations, my advice would be:
1) Try to get an understanding of the history of the code. Who wrote it? Why? How many developers? How long has it been around? Do people love it or hate it? Is there a version control system in place you can use for information?
2) Look at it from a technical viewpoint. Is it complete? Does it compile and run? How many languages are used? Are there interfaces with other systems you need to know about? What dependencies are there? How easy is it to set up a test server? What parts are well coded? What parts stink up the joint?
3) Dig for functional documentation. What does it do? For whom does it do it? What business needs does it support? How mission critical is it?
4) Meet with the business owners. Seriously. This helps you do two things: #1-- Define the real business need (which may be different than what was understood by the previous developers), and #2-- Set appropriate expectations about maintenance. You'll work hard to maintain and keep it working, but you are working from a disadvantaged position. It is important they know this and support you in your efforts, rather than complain loudly when something doesn't work.
5) Plan to remove the dead weight. There's always a lot of dead weight in these near-abandoned projects. Get an idea how to simplify things and plan your work in phases.
6) Set up real test and development servers. Yeah, you know that wasn't already done.
7) Use version control. But you know this. It's 2010, and no developer worth his/her salt would code a paying project without version control. Right?
8) All fixes will take much longer than if you wrote the code, so be careful with estimating time.
Re:Not at all. (Score:2, Interesting)
But YOU get the blame, which is the problem. This kind of thing happened to me recently when I inherited a big pile of MS-Access code with variables like A34 and 300 objects (tables, reports, queries, etc.). I went from an "excellent" rating to a "C" rating on my evaluation because they wanted quick turnaround. I felt like the victim of a hit-and-run. I'm not the one who did the crime, yet I'm the one with the black eye and a missing wallet. The Pasta Mugger did a number on me.
At least with text source code you can find or write variable, function, and command indexers/profilers to help one see the structure, find definitions, and browse relationships. Not so easy to do that with MS-Access with all its proprietary binary crap. I found a way to extract some of the info, but it looks different from how you'd see it inside MS-Access so it's hard to relate to. Gotta love MS.
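For plain-text source, a bare-bones indexer of the kind I mean can be sketched in a few lines of Python. The file names and contents in the usage note are made up, and this version just records every identifier-looking word with its file and line; it knows nothing about the language's grammar, comments or strings.

```python
import re
from collections import defaultdict

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_index(files):
    """files: mapping of file name -> source text.

    Returns a dict mapping each identifier to a list of (file, line_no)
    pairs, one entry per line on which the identifier appears.
    """
    index = defaultdict(list)
    for name, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            # set() so a word repeated on one line is listed once for that line
            for word in set(IDENTIFIER.findall(line)):
                index[word].append((name, line_no))
    return dict(index)
```

Feeding it something like `{"a.bas": "A34 = 1\nTotal = A34 + B7", "b.bas": "Print A34"}` and looking up `"A34"` lists every line that touches the mystery variable, which is at least a start on working out what it is for.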
Re:Large? (Score:3, Interesting)
I used to work in a similar environment in a university. Tons of Windows machines that I didn't have admin access to. I just carried a USB stick with me with all sorts of tools that didn't require any more access than a user would have. Seriously, Borland made a grep for DOS that was 7 KB back in the '90s. It doesn't sound like you were very creative, but your story does illustrate why the lack of decent command line tools *by default* sucks.
I didn't even have physical access to the machines. We just RDPed into them, and I had to be logged into every machine at the same time.
While I had a DFS share that had some of my own tools in it, the problem with running GVim or such off of that is just one of convenience... there were already decent command-line tools available... findstr really does cover everything that I've ever tried to do with grep...
So, the effort of going out of my way to jury rig all this stuff together wasn't any better than just using the tools that were present.
We don't really NEED grep... We just need a tool that works LIKE grep.
I just completed something similar... (Score:2, Interesting)
As a result, although the new functionality worked fine, the application still suffered from the "spaghetti" code of patches upon patches from years of various developers adding additional capabilities, but no one ever addressed the reliability of the application. The support group for this application was clearly frustrated with years of late night calls and hours and hours spent trying to correct errors.
About 6 months ago I was tasked with essentially "cloning" the application for new business purposes. I proposed porting the application to a newer, more modern language (Java). It took a lot of selling (i.e. convincing management and other developers that the end result would run just as fast, be easier to maintain and have more reliability), but I was able to get them to buy off on it.
The rewrite was completed about 3 months ago and the results were better than I had hoped for. I was able to complete the rewrite in the same amount of time allocated for the original "enhancement" project. The application actually runs faster than the old one, has yet to crash (it runs 24x7), and the code is well structured and easy to maintain. We're now in the position that if/when another "enhancement" is requested to the old application, we can simply clone the new Java version and completely replace the old app. Given the results of the last project, it won't be a hard sell (especially to the support group) to go the Java route.
I know this is a long post, but the bottom line is that sometimes (more often than many realize), recoding an old application in a modern language and bringing it into the 21st century rather than patching old code can pay dividends beyond the basic added functionality.
Have you tried Krugle? (Score:2, Interesting)
If you surf on over to Krugle.com [krugle.com], you will see that they now offer a free evaluation copy as a standard product. If you want to get a feeling for what can be done with the tool, just check out Krugle.org [krugle.org], where lots of open-source projects are indexed online. I would definitely recommend using the free evaluation tool as a way of speeding your high-level understanding of any new-to-you code base.