
Programmer's Language-Aware Spell Checker?

Jerry Asher writes "Not all of my coworkers are careful about spelling errors. Sometimes this causes real embarrassment as spelling errors creep into software interfaces. Does anyone know of spell checkers for programming languages? I don't want a text spell checker, I want a programming-language-aware spell checker: one that I can pass all of my code through and that will flag spelling errors in function names, variable names, and comments, but will ignore language keywords, language constructs and expressions, and various programming styles (camel case, or underscores, or...). I want a spell checker that knows that void *functionSigniture(char *myRoutine) contains one spelling error. Does anyone have such a thing for Java or C++? Are there any Eclipse plugins that do this?"

  • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Tuesday September 04, 2007 @05:25AM (#20461629) Homepage
    And not too hard to implement - all you need is a lexer and a few functions to classify different naming styles. lexertl [benhanson.net] even comes ready with a full example for C++, so get to it ;)
  • by BoxedFlame ( 231097 ) on Tuesday September 04, 2007 @05:33AM (#20461665) Homepage
    A small script to split up camelCase into separate words, then feed the result through a normal spell checker. After that, just whitelist certain prefixes, such as the "m" found in "mSomeVariable".
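
    A minimal sketch of that idea in Python. The DICTIONARY and WHITELIST sets are hypothetical stand-ins; a real run would load a full word list (for example /usr/share/dict/words) and whatever prefixes your coding style uses.

```python
import re

# Hypothetical stand-ins: a real run would load a full word list and a
# project-specific whitelist instead of these tiny sets.
DICTIONARY = {"some", "variable", "function", "signature", "my", "routine"}
WHITELIST = {"m"}  # naming-convention prefixes, e.g. the "m" in mSomeVariable


def split_camel_case(identifier):
    """Split an identifier into words at lower-to-upper case boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return spaced.split()


def misspelled_words(identifier):
    """Return the parts of an identifier found in neither word list."""
    words = [w.lower() for w in split_camel_case(identifier)]
    return [w for w in words if w not in DICTIONARY and w not in WHITELIST]


print(misspelled_words("mSomeVariable"))      # []
print(misspelled_words("functionSigniture"))  # ['signiture']
```

    Only the lookup is shown here; suggesting corrections is what the "normal spell checker" would still have to do.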
  • by YeeHaW_Jelte ( 451855 ) on Tuesday September 04, 2007 @05:33AM (#20461667) Homepage
    We've got code here that refers to 'insurrances', 'insurances', 'insurrences' and 'insurences', I'm not kidding.

    People here making fun of his request and saying that this should be set in stone in design documents, or be checked in peer code reviews, are obviously not working in a run-of-the-mill software company where there's neither the inclination nor the time to do everything the formal way. Also, I have yet to see the first design document that correctly enumerates all the requirements for the software, let alone all the names for the variables to be used.
  • by NNKK ( 218503 ) on Tuesday September 04, 2007 @05:44AM (#20461735) Homepage
    TextMate on OS X has spell checking functionality that is semi-useful, but it's not really "aggressive" enough, and there doesn't seem to be a way to make it such with prefs/configuration.

    You can right-click on any "word" (variable name, subroutine name, whatever, just generally a whitespace-delimited group of characters) and it will check the spelling and present alternatives in the context menu. It also recognizes things like perl's sigils so correcting '$teh' turns into '$the', not 'the'.

    It _won't_ automatically check spelling except in strings (so e.g. if I have '$teh = "This is a tset.";', 'tset' will be underlined, '$teh' won't). It doesn't include comments in its automatic checking either, which is probably the most annoying part about it.

    Overall I typically just don't bother with it, but someone _has_ thought along these lines, at least.
  • by Chaset ( 552418 ) on Tuesday September 04, 2007 @06:38AM (#20462035) Homepage Journal
    Well, I'm a total newbie in terms of compiler architectures and such, but throwing it out there for the purpose of discussion...

    I assume a compiler will parse the source and in the process identify which tokens are keywords and literals, and which are programmer-defined identifiers in the code. The spell checker would either use the same algorithm, or hook into that part of the algorithm to get at all of the identifiers. There are two possible word separators in typical code--either capital letters or underscores. (If you have something more bizarre, then I think it's a lost cause). So pass those identifiers through a filter that chops them up at each capital letter or underscore (with some exceptions, say, if the identifier is all caps). So, now you've got a pile of strings which are either oddball programming convention stuff, like "p" and "g" for pointers and globals, or things that should generally be words. The rules can include "toss out single character identifiers", "toss out everything up to first capital or underscore", etc. If you have coding guidelines that enforce variable naming conventions, this should get you most of the way.

    Now you have English words that you can pass through your standard spelling engine, possibly with a dictionary tweaked for your field of endeavor to decrease false positives and escapes.
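
    The chopping filter described above could look roughly like this in Python. The function name identifier_words and the exact rules are one reading of the description, not an established tool: underscores separate words, a capital starts a new word unless the chunk is all caps, and single-character pieces are tossed.

```python
import re


def identifier_words(identifier):
    """Chop an identifier into candidate English words.

    Rules (one reading of the description, not a standard):
      * underscores separate words;
      * a capital letter starts a new word, unless the chunk is all caps;
      * single-character pieces such as "p" or "g" are thrown away.
    """
    words = []
    for chunk in identifier.split("_"):
        if chunk.isupper():
            pieces = [chunk]  # MAX, BUFFER: keep all-caps chunks whole
        else:
            # a lowercase run, optionally led by one capital, or a lone capital
            pieces = re.findall(r"[A-Z]?[a-z]+|[A-Z]", chunk)
        words.extend(p.lower() for p in pieces if len(p) > 1)
    return words


print(identifier_words("pBufferSize"))      # ['buffer', 'size']
print(identifier_words("g_total_count"))    # ['total', 'count']
print(identifier_words("MAX_BUFFER_SIZE"))  # ['max', 'buffer', 'size']
```

    A side effect of the single-character rule is that acronyms embedded in mixed-case names (the XML in XMLParser) fall apart into single letters and are dropped, so they never reach the spelling engine.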

  • by KiloByte ( 825081 ) on Tuesday September 04, 2007 @06:42AM (#20462055)
    In fact, there are only two kinds of things to look at:
    • string literals (not what the poster wanted, but this is what needs spelchekars the most)
    • identifiers
    The former can be done by a simple regexp, the latter... you can do a LALR parser, but why even bother? Just look for _any_ potential identifier; in most languages, that's [a-zA-Z_][a-zA-Z_0-9]*; and simply add the few keywords which are not English words to your dictionary. In fact, this would be nearly programming language agnostic.

    When it comes to StudlyCaps, anything identified as an identifier can be split _before_ any uppercase letter. This would produce a lot of single-letter tokens for ALL-CAPS #defines and the like, but as a nearby post said, you're going to ignore one- or two-letter tokens anyway. The usual conventions say XMLHttpRequest or XML_http_request so I wouldn't bother with XMLhttpRequest (and thus "lhttp").
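
    The two extraction passes could be sketched in Python as follows. The string-literal pattern assumes C-style double-quoted strings with backslash escapes, and the function name extract is made up for illustration; comments and character literals are not handled.

```python
import re

# Double-quoted string literal with backslash escapes (C-like languages).
STRING_LITERAL = re.compile(r'"((?:[^"\\\n]|\\.)*)"')
# Any potential identifier; keywords are not filtered out here.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def extract(source):
    """Return (string_literals, identifiers) found in a piece of source text."""
    strings = STRING_LITERAL.findall(source)
    # Blank out the literals so their contents are not counted as identifiers.
    code_only = STRING_LITERAL.sub('""', source)
    identifiers = IDENTIFIER.findall(code_only)
    return strings, identifiers


strings, identifiers = extract('char *teh = "This is a tset.";')
print(strings)      # ['This is a tset.']
print(identifiers)  # ['char', 'teh']
```

    Keywords such as char come out with the identifiers, which is why the non-English ones have to go into the dictionary.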

  • How about this (Score:5, Interesting)

    by Ed Avis ( 5917 ) <ed@membled.com> on Tuesday September 04, 2007 @06:48AM (#20462093) Homepage
    Yes, this is a legitimate problem. I work on code that has spelling mistakes embedded into interfaces and it's very annoying. The fashionable use of StudlyCaps in programming (why? who decided that TextLikeThis is more readable than text_like_this?) makes the job a little harder but not impossible, as long as you follow the sane rule of making each word start with a capital and continue in lowercase, even if it is an acronym (so XmlParser not XMLParser or, God forbid, XMLparser - though of course XML_parser would be better than any of those).

    Enough rant. How about this:

    perl -ne 's/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/ /c; foreach (split) { print qq{$_\n} unless $seen{lc $_}++ }' source_file...

    That will give a list of unique words in your source code (use find and xargs to scan the whole source tree). Then you can run that list of words through an ordinary spellchecker such as ispell. Unfortunately when you find a mistake you have to go back and grep for it to find where it occurs. You would also need a personal dictionary for things that are not English words but nonetheless appear in code.

    I would probably keep the private word list containing things like 'foreach' and 'const' with the program source code, and have a makefile target 'make spellcheck' that runs a command like the above and then prints out all words found that are not in /usr/share/dict/words or in the private word list. Indeed, why not this:

    find . -type f -name '*.c' | xargs perl -ne 's/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/ /c; foreach (split) { print qq{$_\n} unless $seen{lc $_}++ }' >found_words
    sort -u private_word_list /usr/share/dict/words >allowed_words
    grep -vixFf allowed_words found_words

    The private word list can be kept under version control and checked in whenever you add a new non-English word like 'Frobule' to your source code.

    Adding filenames and line numbers to the output is left as an exercise for the reader. You might also want to change the perl command to ignore words with length < 5.
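
    One possible take on that exercise, sketched in Python rather than perl. The function names load_words, unknown_words and check_file are made up for this example; the word splitting mirrors the one-liner above, and words shorter than five letters are skipped as suggested.

```python
import re


def load_words(*paths):
    """Read one word per line from each file into a lowercase set."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            words.update(line.strip().lower() for line in handle)
    return words


def unknown_words(lines, allowed, min_length=5):
    """Yield (line_number, word) for each word not in the allowed set."""
    for number, line in enumerate(lines, start=1):
        # Same idea as the perl one-liner: break camelCase, drop non-letters.
        line = re.sub(r"([a-z])([A-Z])", r"\1 \2", line)
        for word in re.findall(r"[A-Za-z]+", line):
            if len(word) >= min_length and word.lower() not in allowed:
                yield number, word


def check_file(path, allowed):
    """Return 'file:line: word' strings for one source file."""
    with open(path, encoding="utf-8", errors="replace") as source:
        return [f"{path}:{n}: {w}" for n, w in unknown_words(source, allowed)]


allowed = {"function", "signature", "routine"}
sample = ["void *functionSigniture(char *myRoutine);"]
print(list(unknown_words(sample, allowed)))  # [(1, 'Signiture')]
```

    Unlike the one-liner, this reports every occurrence instead of each unique word once, which is what you want when you have to go and fix them. In real use, allowed would come from something like load_words("/usr/share/dict/words", "private_word_list").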
  • Visual Assist (Score:3, Interesting)

    by soundman32 ( 147936 ) on Tuesday September 04, 2007 @07:02AM (#20462179) Homepage
    Doesn't Visual Assist from Whole Tomato do this? I've used it in the past and I'm sure spelling mistakes (and a whole host of other things) were pointed out.

    I'm not associated with Whole Tomato, but if anyone from WT sees this, can I have a free subscription :-)

  • by thaig ( 415462 ) on Tuesday September 04, 2007 @07:08AM (#20462207) Homepage
    I had your problem once because I was working with people whose first language was not English. I don't write US English either and I always left English spellings in by mistake.

    I used aspell and went through huge parts of the source, telling it what wasn't misspelled. It was an incredible pain in the neck because it got confused over all the variable names, bits of C syntax etc etc.

    Once I had a dictionary, though, I could recheck the source periodically and although there were a lot of false warnings, we still caught a lot of problems that would have gone into the production release.

    As you can work out, I didn't restrict the test to strings - this is because misspelled variable names can cause bugs too so I checked for them as well.

    Cheers,

    Tim
  • by Anonymous Coward on Tuesday September 04, 2007 @07:19AM (#20462277)
    "Similarly, there should probably be a set of words added that aren't "English" but are used often enough to be worth adding to the dictionary. Things like Obj, Int, and Ptr."

    Or they are "English" and still get flagged: a checker with a US English dictionary marks "setColour" as incorrect because it uses the British spelling.

    This is a non-trivial problem to do right. The spell checker has to be not only familiar with CamelCase, word fragments that might be added (like the Obj, Int, Ptr or various prefixes), and the programming language syntax, but it would also need to be familiar with the native spoken language.

    One strategy might be to strip out all the programming syntax fluff (something like ctags [wikipedia.org]) and then run a spell checker on that with a custom dictionary and a script to split up such things as CamelCase. You'd have to do the same for comments (which ctags normally ignores).

    In any case, with ctags, something like aspell [wikipedia.org], and a bit of custom scripting and dictionary fiddling, it looks tricky but doable as a batch process. Doing it interactively in the editor would be slightly trickier, but if your editor can invoke programs, not hard.
  • Annoying perhaps but (Score:5, Interesting)

    by Taagehornet ( 984739 ) on Tuesday September 04, 2007 @07:47AM (#20462451)
    True, identifier names containing spelling errors can be a real annoyance, but I somehow doubt you'll ever find a usable solution, at least not as long as you'll need to interface to code beyond your control. What spell checker wouldn't choke on regular C++? Just picking a random declaration from MSDN (feel free to choose any other API, it won't change anything):

    HRESULT MFGetService(
    IUnknown* punkObject,
    REFGUID guidService,
    REFIID riid,
    LPVOID* ppvObject
    );


    You'll probably just end up spending all your day removing false positives.
  • Re:Visual Assist (Score:3, Interesting)

    by gnasher719 ( 869701 ) on Tuesday September 04, 2007 @07:52AM (#20462485)
    '' It points out spelling mistakes in "strings" but not variable names. ie, it won't point out that the variable lAnsIdx is spelt incorretly, like the submitter is asking for, that would be just stupid. ''

    Comments like this make me wonder. Is it so hard to imagine a spelling checker for say the C language that finds words that were not written the way they were intended? Limiting yourself to correct English words for identifiers is stupid. Assuming that a spelling checker for a programming language would do that is about ten times more stupid.

    The problem is that the market for such a spelling checker is much smaller than the market for a spelling checker for natural language, so nobody bothered writing one. The other problem is that correction is much, much more difficult.
  • by Anonymous Coward on Tuesday September 04, 2007 @07:58AM (#20462531)
    It's actually impossible for the computer to know whether you're creating an infinite loop.
  • by fgouget ( 925644 ) on Tuesday September 04, 2007 @09:25AM (#20463321)

    I'm not sure spell-checking can really be made to work because, by definition, spell-checkers flag anything that is not in the allowed list (also called the dictionary) as an error. But source code always contains tons of identifiers that are not real words, like pid, ret, req, riid, etc. The problem is that there are hundreds if not thousands of them in a large project, and you keep getting new ones, which makes maintaining a custom dictionary a pain.

    But I've been annoyed by spelling errors too and what I noticed is that the same errors come over and over again. So what I did is write a script that specifically checks for common typos. And I've very imaginatively called it 'typos'.

    What's great with this approach is that, no matter whether you're writing a C, Perl, PHP or HTML file, 'seperate' is never going to be a real word. So we can identify these with no cumbersome custom dictionary, and a very very low false positive rate.
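
    The known-typos idea can be illustrated with a few lines of Python. This is only a sketch of the technique, not the 'typos' script itself; the COMMON_TYPOS table and the find_typos function are invented for the example.

```python
import re

# A tiny illustrative table; a real list would be much longer.
COMMON_TYPOS = {
    "seperate": "separate",
    "recieve": "receive",
    "occured": "occurred",
}

# Substring match on purpose, so typos buried inside identifiers are caught.
TYPO_PATTERN = re.compile("|".join(COMMON_TYPOS), re.IGNORECASE)


def find_typos(text):
    """Return (line_number, typo, suggestion) for every known typo in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in TYPO_PATTERN.finditer(line):
            typo = match.group(0)
            hits.append((number, typo, COMMON_TYPOS[typo.lower()]))
    return hits


print(find_typos("int seperateWords;\n/* Recieve the data */"))
# [(1, 'seperate', 'separate'), (2, 'Recieve', 'receive')]
```

    Matching substrings works for long typos like these; a very short entry could show up inside legitimate words and would need word-boundary handling.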

    Typos is open-source (GPL) and has no dependency that I know of (besides perl). So you can try it out just by downloading it, making the script executable, and running it with no argument on your source:

  • Re:How about this (Score:1, Interesting)

    by Anonymous Coward on Tuesday September 04, 2007 @09:35AM (#20463401)
    underscores are hard to type
  • by brusk ( 135896 ) on Tuesday September 04, 2007 @10:12AM (#20463741)
    You have one again confirmed Hartman's Law (or Skitt's, depending on preference; see http://en.wikipedia.org/wiki/Hartman's_law [wikipedia.org]).

    "Misspelt" is a legitimate spelling in British English. It's in the OED, with examples from 1762 to 1990.

    Since I have just corrected you, I assume I have made an error somewhere in this post, though I haven't managed to find it.
  • by Maximum Prophet ( 716608 ) on Tuesday September 04, 2007 @11:38AM (#20464801)
    Wow, 240 comments about spelling and programming and no-one's mentioned the famous Ken Thompson quote:

    "If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'."
  • by kalirion ( 728907 ) on Tuesday September 04, 2007 @12:15PM (#20465355)
    Here's one way to ensure an Eclipse launch takes enough time for you to go grocery shopping:

    Work in a windows environment in Virginia. Access the Eclipse workspace directory through a mounted drive pointing to your home directory on a UNIX box in Montana. On the UNIX machine, your home directory is actually mounted on a Windows box back in Virginia.

    God help you if you have the "compile on save" option enabled. And don't even THINK of rebuilding the workspace.

    And yes, I know this from experience.
  • by 808140 ( 808140 ) on Tuesday September 04, 2007 @02:48PM (#20467787)
    This is completely off the top of my head, but do you remember how early C compilers used to only recognize the first six characters of a function name? So, for example, create_foo() and create_bar() were recognized the same way.

    Now, in essentially every program in the world there is a function named 'create_something' or alternatively 'createSomething'. Had Ken Thompson's creat() function been spelled create(), early C compilers would have treated them the same way, thus making any function starting with "create" useless (not to mention resulting in error-prone behavior: which function actually gets called? It varies by compiler...)

    By naming the function 'creat', KT neatly sidestepped the whole problem. I wonder if that's why he did it? Either way, if he did it all over again, and this time named the function create(), no early programs would have been able to have any function that started with those six letters. Our programming habits would thus be quite different. Perhaps we'd have to say creat_foo() instead...

    I never use the creat() function anyway (it's equivalent to open() with the flags O_WRONLY|O_CREAT|O_TRUNC) so frankly I prefer it this way. Of course, today's compilers no longer have the silly 6-character rule, but if you're aiming to write über-portable code, it's still advisable to have all your function and global variable names unique in the first six characters. There are utilities that will verify that this is the case for you.

    By the way, do you have a source for that quote? I've heard it too, but I'd be interested in knowing where KT said it.
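
    The six-character uniqueness check mentioned above is easy to approximate. A rough Python sketch, assuming you already have a list of external names (the function name prefix_collisions is hypothetical, and this is not one of the existing utilities):

```python
from collections import defaultdict


def prefix_collisions(names, significant=6):
    """Group names that are indistinguishable in their first few characters."""
    groups = defaultdict(set)
    for name in names:
        groups[name[:significant]].add(name)
    return {prefix: sorted(group)
            for prefix, group in groups.items() if len(group) > 1}


print(prefix_collisions(["create_foo", "create_bar", "creat", "open"]))
# {'create': ['create_bar', 'create_foo']}
```

    Note that creat itself is only five characters long, so it does not collide with anything starting with "create".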
  • I'm sorry... (Score:3, Interesting)

    by DragonTHC ( 208439 ) <<moc.lliwtsalsremag> <ta> <nogarD>> on Tuesday September 04, 2007 @04:12PM (#20468967) Homepage Journal
    If you are too damn lazy or too stupid to type your language properly, then you shouldn't be a programmer. Become an insurance adjuster or something less demanding.

    I don't think I'd like to hire someone who can't spell. It speaks volumes about you.

    Intelligence starts with a keen understanding and application of your language.

    If you simply must have it, EditPlus has syntax highlighting and offers spellchecking dictionaries.
