Programmer's Language-Aware Spell Checker?
Jerry Asher writes "Not all of my coworkers are careful about spelling errors. Sometimes this causes real embarrassment as spelling errors creep into software interfaces. Does anyone know of spell checkers for programming languages? I don't want a text spell checker, I want a programming-language-aware spell checker. A spell checker that I can pass all of my code through and will flag spelling errors in function names, variable names, and comments, but will ignore language keywords, language constructs and expressions, and various programming styles (camel code, or underscores, or...). I want a spell checker that knows that void *functionSigniture(char *myRoutine) contains one spelling error. Does anyone have such a thing for Java or C++? Are there any Eclipse plugins that do this?"
Sounds like a good idea (Score:5, Interesting)
should be fairly simple to implement (Score:2, Interesting)
It's a good question ... (Score:5, Interesting)
People here making fun of his request and saying that this should be set in stone in design documents, or be checked in peer code reviews, are obviously not working in a run-of-the-mill software company where there's neither the inclination nor the time to do everything the formal way. Also, I have yet to see the first design document that correctly enumerates all the requirements for the software, let alone all the names for the variables to be used.
TextMate does some... (Score:3, Interesting)
You can right-click on any "word" (variable name, subroutine name, whatever, just generally a whitespace-delimited group of characters) and it will check the spelling and present alternatives in the context menu. It also recognizes things like perl's sigils so correcting '$teh' turns into '$the', not 'the'.
It _won't_ automatically check spelling except in strings (so e.g. if I have '$teh = "This is a tset.";', 'tset' will be underlined, '$teh' won't). It doesn't include comments in its automatic checking either, which is probably the most annoying part about it.
Overall I typically just don't bother with it, but someone _has_ thought along these lines, at least.
No one's offering solutions... (Score:2, Interesting)
I assume a compiler will parse the source and in the process identify which tokens are keywords and literals, and which are programmer-defined identifiers in the code. The spell checker would either use the same algorithm, or latch onto that part of the algorithm to get at all of the identifiers. There are two possible word separators in typical code--either capital letters or underscores. (If you have something more bizarre, then I think it's a lost cause). So pass those identifiers through a filter that chops them up at each capital letter or underscore (with some exceptions, say, if the identifier is all caps). So, now you've got a pile of strings which are either oddball programming convention stuff, like "p" and "g" for pointers and globals, or things that should generally be words. The rules can include "toss out single character identifiers", "toss out everything up to first capital or underscore", etc. If you have coding guidelines that enforce variable naming conventions, this should get you most of the way.
Now you have English words that you can pass through your standard spelling engine, possibly with a dictionary tweaked for your field of endeavor to decrease false positives and escapes.
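A minimal sketch of that chopping filter, in Python. The function name split_identifier and the exact rules (keep all-caps chunks whole, drop digits, toss single-character pieces) are my own reading of the description above, not an existing tool.

```python
import re

def split_identifier(identifier):
    """Chop an identifier into candidate words at underscores and capitals."""
    words = []
    for chunk in identifier.split("_"):
        # Assumption: digits and other non-letters carry no spelling information.
        chunk = re.sub(r"[^A-Za-z]", "", chunk)
        if chunk.isupper():
            # Exception from the post: an all-caps chunk is kept whole.
            words.append(chunk)
        else:
            # Start a new piece at every capital letter.
            words.extend(re.findall(r"[A-Z][a-z]*|[a-z]+", chunk))
    # Toss out single-character pieces such as "p" and "g" prefixes.
    return [w.lower() for w in words if len(w) > 1]
```

With these rules, split_identifier("functionSigniture") gives ["function", "signiture"], and the second piece is what you would hand to the spelling engine. A prefix like the "p" in "pBuffer" is dropped before it gets that far.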
Re:Eclipse WTP 3.3 Europa seems to do this.. almos (Score:3, Interesting)
When it comes to StudlyCaps, anything identified as an identifier can be split _before_ any uppercase letter. This would produce a lot of single-letter tokens for ALL-CAPS #defines and the like, but as a nearby post said, you're going to ignore one-two letter tokens anyway. The usual conventions say XMLHttpRequest or XML_http_request so I wouldn't bother with XMLhttpRequest (and thus "lhttp").
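For what it's worth, here is one possible variation, sketched in Python. Instead of splitting before every uppercase letter, it treats a run of capitals followed by a capitalised word as an acronym, then drops one-two letter tokens as suggested. The regex and the name studly_tokens are assumptions of mine, not anything Eclipse does.

```python
import re

# Three alternatives, tried in order:
#   1. a run of capitals that is followed by a capitalised word (an acronym),
#   2. an optionally capitalised lowercase word,
#   3. any remaining run of capitals (e.g. an ALL-CAPS #define part).
TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")

def studly_tokens(identifier, min_length=3):
    """Split a StudlyCaps or underscore identifier, ignoring very short tokens."""
    return [t for t in TOKEN.findall(identifier) if len(t) >= min_length]
```

XMLHttpRequest comes out as XML, Http, Request and XML_http_request as XML, http, request. The unconventional XMLhttpRequest still yields "Lhttp", so that case stays a false positive, which seems acceptable for the reason given above.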
How about this (Score:5, Interesting)
Enough rant. How about this:
perl -ne "s/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/
That will give a list of unique words in your source code (use find and xargs to scan the whole source tree). Then you can run that list of words through an ordinary spellchecker such as ispell. Unfortunately when you find a mistake you have to go back and grep for it to find where it occurs. You would also need a personal dictionary for things that are not English words but nonetheless appear in code.
I would probably keep the private word list containing things like 'foreach' and 'const' with the program source code, and have a makefile target 'make spellcheck' that runs a command like the above and then prints out all words found that are not in
find . -type f -name '*.c' | xargs perl -ne "s/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/
sort -u private_word_list
diff -u allowed_words found_words | grep -E '^[+][^+]'
The private word list can be kept under version control and checked in whenever you add a new non-English word like 'Frobule' to your source code.
Adding filenames and line numbers to the output is left as an exercise for the reader. You might also want to change the perl command to ignore words with length < 5.
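For anyone who wants to skip the exercise, a rough Python sketch follows. The Perl command above is cut off, so the word extraction here only mirrors its visible part (insert a space between a lowercase and an uppercase letter, then keep runs of letters); everything else, including the names load_words and find_unknown and the command-line layout, is an assumption.

```python
import re
import sys

def load_words(path):
    """Read an allowed-word list, one word per line, into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def words_in_line(line):
    """Split camelCase at lower-to-upper boundaries and return lowercase words."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", line)
    return [word.lower() for word in re.findall(r"[A-Za-z]+", spaced)]

def find_unknown(lines, allowed, min_length=5):
    """Yield (line_number, word) for each word that is not in the allowed set."""
    for number, line in enumerate(lines, start=1):
        for word in words_in_line(line):
            # Words shorter than min_length are ignored, as suggested above.
            if len(word) >= min_length and word not in allowed:
                yield number, word

# Hypothetical usage: python spellcheck.py allowed_words file1.c file2.c ...
if __name__ == "__main__" and len(sys.argv) > 2:
    allowed = load_words(sys.argv[1])
    for source in sys.argv[2:]:
        with open(source, encoding="utf-8", errors="replace") as handle:
            for number, word in find_unknown(handle, allowed):
                print(f"{source}:{number}: {word}")
```

The allowed list would be the ordinary dictionary merged with the private word list, so the output is in the usual file:line: form that editors can jump to.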
Visual Assist (Score:3, Interesting)
I'm not associated with Whole Tomato, but if anyone from WT sees this, can I have a free subscription?
Create a dictionary for your project (Score:2, Interesting)
I used aspell and went through huge parts of the source, telling it what wasn't misspelled. It was an incredible pain in the neck because it got confused over all the variable names, bits of C syntax etc etc.
Once I had a dictionary, though, I could recheck the source periodically and although there were a lot of false warnings, we still caught a lot of problems that would have gone into the production release.
As you can work out, I didn't restrict the test to strings - this is because misspelled variable names can cause bugs too so I checked for them as well.
Cheers,
Tim
Re:May I suggest.... ctags & aspell? (Score:1, Interesting)
Or they are "English", such as a checker that flags "setColour" as incorrect because it uses a US English dictionary and the name uses British spelling.
This is a non-trivial problem to do right. The spell checker has to be not only familiar with CamelCase, word fragments that might be added (like the Obj, Int, Ptr or various prefixes), and the programming language syntax, but it would also need to be familiar with the native spoken language.
One strategy might be to strip out all the programming syntax fluff (something like ctags [wikipedia.org]) and then run a spell checker on that with a custom dictionary and a script to split up such things as CamelCase. You'd have to do the same for comments (which ctags normally ignores).
In any case, with ctags, something like aspell [wikipedia.org], and a bit of custom scripting and dictionary fiddling, it looks tricky but doable as a batch process. Doing it interactively in the editor would be slightly trickier, but if your editor can invoke programs, not hard.
Annoying perhaps but (Score:5, Interesting)
HRESULT MFGetService(
IUnknown* punkObject,
REFGUID guidService,
REFIID riid,
LPVOID* ppvObject
);
You'll probably just end up spending all your day removing false positives.
Re:Visual Assist (Score:3, Interesting)
Comments like this make me wonder. Is it so hard to imagine a spelling checker for say the C language that finds words that were not written the way they were intended? Limiting yourself to correct English words for identifiers is stupid. Assuming that a spelling checker for a programming language would do that is about ten times more stupid.
The problem is that the market for such a spelling checker is much smaller than the market for a spelling checker for natural language, so nobody bothered writing one. The other problem is that correction is much, much more difficult.
Re:Eclipse WTP 3.3 Europa seems to do this.. almos (Score:1, Interesting)
Check for typos instead (Score:2, Interesting)
I'm not sure spell-checking can really be made to work because, by definition, spell-checkers flag anything that is not in the allowed list (also called a dictionary) as an error. But source code always contains tons of identifiers that are not real words, like pid, ret, req, riid, etc. The problem is that there are hundreds if not thousands of them in a large project, and you keep getting a ton of new ones, which makes maintaining a custom dictionary a pain.
But I've been annoyed by spelling errors too and what I noticed is that the same errors come over and over again. So what I did is write a script that specifically checks for common typos. And I've very imaginatively called it 'typos'.
What's great with this approach is that, no matter whether you're writing a C, Perl, PHP or HTML file, 'seperate' is never going to be a real word. So we can identify these with no cumbersome custom dictionary, and a very very low false positive rate.
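To illustrate the idea (this is not the actual 'typos' script, just a Python sketch of the approach), a blacklist check can be a few lines. The word list and the name find_typos are made up for the example; a real list would be much longer.

```python
import re

# A tiny, illustrative blacklist mapping a common typo to its correction.
COMMON_TYPOS = {
    "seperate": "separate",
    "recieve": "receive",
    "occured": "occurred",
    "lenght": "length",
}

# Match the typos anywhere, including inside identifiers like seperateThreads.
PATTERN = re.compile("|".join(COMMON_TYPOS), re.IGNORECASE)

def find_typos(lines):
    """Return (line_number, typo_as_written, suggestion) for every hit."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in PATTERN.finditer(line):
            typo = match.group(0)
            hits.append((number, typo, COMMON_TYPOS[typo.lower()]))
    return hits
```

Because it matches substrings, it catches typos buried in camelCase names without any identifier splitting. The catch is that short entries can show up inside legitimate words, so the list should stick to longer misspellings.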
Typos is open-source (GPL) and has no dependency that I know of (besides perl). So you can try it out just by downloading it, making the script executable, and running it with no argument on your source:
Re:How about this (Score:1, Interesting)
Re:What the fuck is the OP on? (Score:3, Interesting)
"Misspelt" is a legitimate spelling in British English. It's in the OED, with examples from 1762 to 1990.
Since I have just corrected you, I assume I have made an error somewhere in this post, though I haven't managed to find it.
Ken Thompson and creat() (Score:5, Interesting)
"If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'."
Re:Man Dies Waiting for Eclipse to Launch (Score:4, Interesting)
Work in a Windows environment in Virginia. Access the Eclipse workspace directory through a mounted drive pointing to your home directory on a UNIX box in Montana. On the UNIX machine, your home directory is actually mounted on a Windows box back in Virginia.
God help you if you have the "compile on save" option enabled. And don't even THINK of rebuilding the workspace.
And yes, I know this from experience.
Re:Ken Thompson and creat() (Score:3, Interesting)
Now, in essentially every program in the world there is a function named 'create_something' or alternatively 'createSomething'. Had Ken Thompson's creat() function been spelled create(), early C compilers would have treated them the same way, thus making any function starting with "create" useless (not to mention resulting in error-prone behavior: which function actually gets called? It varies by compiler...)
By naming the function 'creat', KT neatly sidestepped the whole problem. I wonder if that's why he did it? Either way, if he did it all over again, and this time named the function create(), no early programs would have been able to have any function that started with those six letters. Our programming habits would thus be quite different. Perhaps we'd have to say creat_foo() instead...
I never use the creat() function anyway (it's just an alias for open()) so frankly I prefer it this way. Of course, today's compilers no longer have the silly 6-character rule, but if you're aiming to write über-portable code, it's still advisable to have all your function and global variable names unique in the first six characters. There are utilities that will verify that this is the case for you.
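A check like that is easy to sketch in Python. This is only an illustration of the idea, not one of the utilities mentioned; the name prefix_collisions is invented, and it assumes you already have the list of external names (from a tags file or the linker's symbol table, say).

```python
from collections import defaultdict

def prefix_collisions(names, significant=6):
    """Group names that are identical in their first `significant` characters."""
    groups = defaultdict(set)
    for name in names:
        groups[name[:significant]].add(name)
    # Only prefixes shared by two or more distinct names are a problem.
    return {prefix: sorted(group)
            for prefix, group in groups.items() if len(group) > 1}
```

Fed the names create, create_file, createSomething, creat and open, it reports one collision group under the prefix "create" and leaves creat and open alone.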
By the way, do you have a source for that quote? I've heard it too, but I'd be interested in knowing where KT said it.
I'm sorry... (Score:3, Interesting)
I don't think I'd like to hire someone who can't spell. It speaks volumes about you.
Intelligence starts with a keen understanding and application of your language.
If you simply must have it, EditPlus has syntax highlighting and offers spellchecking dictionaries.