Programming IT Technology

Parsing Algorithms and Resources? 52

Derek Williams asks: "I'm a senior majoring in computer engineering & computer science and I've been programming for about 7 years, mainly in C and Java. While I've had quite a few courses that delve into some of the deeper topics of programming (e.g. Object Oriented Design), I find that the majority of programs I write, both for work and elsewhere, involve parsing. Although I have no problem tackling these sorts of programs, I was wondering if there was some branch of computer science dedicated to the study of parsing. What books and websites out there are of interest to someone looking to learn more about parsing and algorithms relating to it?"
This discussion has been archived. No new comments can be posted.

  • by arb ( 452787 ) <amosbaNO@SPAMgmail.com> on Wednesday June 19, 2002 @03:50AM (#3727516) Homepage
    Parsing Techniques - A Practical Guide [cs.vu.nl]
    Flexible Parsing [warwick.ac.uk]
    Workshop on The Evaluation of Parsing Systems [slashdot.org]
    Robust Parsing [slashdot.org]
    Parsing Resources [cs.uu.nl]

    Probably the last one on that list would be the most useful starting place...

  • Your parsing adventure has barely begun.

    Until you've parsed the Linux kernel in a single regex, you haven't lived.
  • by hackwrench ( 573697 ) <hackwrench@hotmail.com> on Wednesday June 19, 2002 @04:09AM (#3727552) Homepage Journal
    Compilers and parsing seem to go hand-in-hand. With that in mind, here's a link to Jack Crenshaw's Let's Build a Compiler [iecc.com]
  • In short, unless you have a very good reason (i.e. you know what you are doing) to write your own parsers, you should be using a parser generator (use google with the right keywords and you will find plenty of resources) or regular expressions (for the simple stuff, again: google is your best friend).
  • Try some books on Compiler Design, especially the first chapters about analysis. They should provide lots of information about parsing. One good book is Modern Compiler Design [amazon.com].
    • This book is heavy on the theory and such, a bit light on examples, but since they primarily use (f)lex and (b)yacc, finding examples if you get stuck isn't too hard. They dedicate a lot of pages to parsing and parser theory.
  • Book recommendations (Score:3, Informative)

    by Evil Attraction ( 150413 ) on Wednesday June 19, 2002 @04:14AM (#3727570)
    Start with Learning Perl [oreilly.com], proceed with Programming Perl [oreilly.com] and finish off with Mastering Regular Expressions [oreilly.com]. Your parsing needs will be filled forever.
    • by Tom7 ( 102298 )

      Your needs for parsing regular (or regular-ish) grammars in perl will be filled. Unfortunately, regular expressions are not always the right tool for the job (see the gargantuan e-mail address parser in the back of the MRE book for an example) because many grammars are not regular. Try parsing (say) C with perl! Also, perl has poor facilities for manipulating such parse trees, especially compared to a language with datatypes and pattern matching like SML or OCaml.

      Perl makes parsing flat data brief (though regular expression libraries in other high level languages make it just as easy, if a little more verbose), but it is not by any stretch the be-all and end-all of parsing.
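
      To make the "many grammars are not regular" point concrete, here is a minimal Python sketch (illustrative, not from the thread): a regex handles flat, regular structure fine, but checking arbitrarily nested parentheses needs recursion (or a stack).

```python
import re

# Flat, regular structure: a regex is the right tool.
assert re.fullmatch(r"\d+(,\d+)*", "1,22,333")

# Nested structure: no single regex can do it, but recursion can.
def balanced(s, i=0):
    """Return the index just past a balanced group starting at s[i], or -1."""
    if i >= len(s) or s[i] != '(':
        return -1
    i += 1
    while i < len(s) and s[i] != ')':
        if s[i] == '(':
            i = balanced(s, i)      # recurse into the nested group
            if i == -1:
                return -1
        else:
            i += 1
    return i + 1 if i < len(s) else -1

assert balanced("(()(()))") == 8    # consumes the whole balanced string
assert balanced("(()") == -1        # unclosed group is rejected
```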
  • Red Dragon Book (Score:5, Informative)

    by blackcoot ( 124938 ) on Wednesday June 19, 2002 @04:23AM (#3727590)
    You've got a couple of choices -- finding yourself a good regular expression library seems like a good start ;-) If you're looking to do something a little more interesting than just lexical analysis, check out the red dragon book [bookpool.com] (better known as Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman). I used it in my compiler course and I can tell you that they cover all the various parsing techniques (recursive descent, LL, LALR, SLR, etc.) very well, along with some other stuff. They concentrate on Lex/Yacc as tools -- you may prefer to check out ANTLR [antlr.org], Terence Parr's parser generator. It can be targeted at a bunch of languages and can also produce tree walkers for when it comes time to use your parsed data.
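
    As a flavor of the recursive-descent technique mentioned above, here is a toy Python sketch (purely illustrative, one function per grammar rule) that evaluates arithmetic as it parses:

```python
import re

# Grammar:  expr -> term (('+'|'-') term)* ;  term -> NUM | '(' expr ')'
def tokenize(src):
    return re.findall(r"\d+|[+\-()]", src)

def parse_expr(toks):
    val = parse_term(toks)
    while toks and toks[0] in "+-":       # left-associative chain
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == '+' else val - rhs
    return val

def parse_term(toks):
    tok = toks.pop(0)
    if tok == '(':                        # parenthesized sub-expression
        val = parse_expr(toks)
        assert toks.pop(0) == ')', "expected ')'"
        return val
    return int(tok)

assert parse_expr(tokenize("1+(2-3)+4")) == 4
```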
    • I'd have to agree that the dragon book is required reading if one wants to know about parser theory. However, it is a very dense book, and without proper guidance one can get very confused. Many of the examples are applicable only on a theoretical level, since the tools mentioned have limitations: for example, all variables used by yacc are global to the program, and the parser is not reentrant. It's better to get the theory from the dragon book and then apply it to more modern tools than lex and yacc. I haven't gone through the links others have mentioned, but I would treat the dragon book as a last resort: if you want to come up to speed quickly, it will not help you do that.
      • bison has been made to be reentrant (using the hairy skeleton) -- but this is not something for the faint of heart. just a technical point ;-) as for ANTLR, i believe that it is by design all nice and encapsulated & reentrant and stuff. altho, i really do have to wonder -- who in their right mind would want a multi-threaded parser? parsing is an inherently sequential process, so unless you're having to parse multiple streams simultaneously, i don't know why this would be of any use to you. even then, perhaps you should rethink your design -- like doing batch processing of your input and then dispatching a thread to work on the already parsed input.
        • Imagine a program, where there is an embedded scripting language. Imagine that the scripts can be attached to various objects to tie the system together. Imagine that these scripts are interpreted by a parser, and all have to run at once. Imagine that you have just found a use for a reentrant parser.
          • You have me up to the last one. Parsers are not interpreters. If you mean that you have a bunch of scripts to be interpreted concurrently, the interpreter, not the parser, needs to be reentrant. If I've understood you correctly, this problem is solved fairly easily (without requiring reentrant code parser-side): the parser has an input queue containing bundles of raw source code. The parser does its trick, then spits out the resulting parse trees to an output queue where a bunch of worker threads are waiting to handle the interpreting. Considering that parsing is an inherently I/O-bound problem, I seriously doubt that you'll see enough of a performance difference to warrant the misery of trying to deal with the nasty mess that bison spits out when you ask it to use its reentrant skeleton. Of course, I may well be misunderstanding you.
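
            The queue design described above can be sketched in a few lines of Python (a toy illustration; the "parse trees" here are just tuples, and the source strings are made up):

```python
import queue
import threading

sources = ["1+1", "2+2", "3+3"]     # raw script bundles (toy input)
parsed_q = queue.Queue()            # parser output -> worker input
results = []
lock = threading.Lock()

def parse(src):
    # Stand-in for the real (single-threaded, non-reentrant) parser.
    a, b = src.split('+')
    return ('add', int(a), int(b))  # a tiny "parse tree"

def worker():
    while True:
        tree = parsed_q.get()
        if tree is None:            # sentinel: shut down this worker
            break
        op, a, b = tree             # "interpret" the parsed tree
        with lock:
            results.append(a + b)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for src in sources:                 # the parser runs alone, sequentially
    parsed_q.put(parse(src))
for _ in threads:
    parsed_q.put(None)
for t in threads:
    t.join()

assert sorted(results) == [2, 4, 6]
```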
            • Yes, you're misunderstanding me. A nice simple way of writing an interpreter is to use a parser like Bison. Write Bison rules to recognize all the statements in your language. When you've got a looping construct, just reset your input stream to the top of the loop and continue parsing. Each rule executes code that does something.

              Take a look at the ubiquitous calculator example and imagine that expanded to include all the regular language features that you'd expect to see.

              • A better example would be to consider something multithreaded, like a web-browser, loading frames. Any multithreaded program, for that matter, that might have multiple threads interested in parsing, needs a multithreaded parser. That is, unless you enjoy lock contention.

                --Joe
  • For reading about parsing (and regular expressions), the O'Reilly book "sed & awk" is a good start.
    Building on that (and extending it), you'll find a lot of information in "Programming Perl".
    And if you're writing in C, there's the O'Reilly book "lex & yacc".

    http://py-howto.sourceforge.net/regex/regex.html
    http://sitescooper.org/tao_regexps.html

    Hey, I can't help it, that I find the books by that publisher exactly what I'm looking for in my programming needs :)
  • by Faré ( 6486 ) on Wednesday June 19, 2002 @05:22AM (#3727737) Homepage
    There is a vast literature on parsing, and a lot of (mostly useless) parsing theory is a large part of the curriculum in many CS departments. Check CiteSeer [nec.com] to get a glimpse of how much literature there is.

    The best tools to build parsers and manipulate parsed syntax trees are functional languages, such as OCaml [inria.fr] (with its streams parsers, camlp4, ocamllex/ocamlyacc, etc.), or SML, Haskell, etc.

    Of course, if you were a LISPer, you'd know that although you have lots of well-known tools to build new parsers, such as Meta [telent.net] or Zebu [telent.net], the best thing to do about a parser is not to write it, but rather to reuse the builtin extensible universal parser, READ, and its extensible universal unparser, WRITE.

    If you spend most of your time writing parsers, you're not just using the wrong tools, you're also using the wrong approach.

    Just my .2 mg of e-gold worth...

  • by ncarey ( 13288 ) <ncarey@speakeasy.org> on Wednesday June 19, 2002 @06:07AM (#3727804) Homepage

    Well...the Dragon book for starters, as mentioned earlier. That's probably the ur-source for most of the theory behind the magic. Makes my head hurt, though.

    Terence Parr's book, Practical Computer Language Recognition and Translation [antlr.org] (out of print). His doctoral dissertation is a useful thing too (try the Purdue University library).

    comp.compilers [comp.compilers] is another useful resource. It's archived at http://compilers.iecc.com [iecc.com].

    Alan Holub [holub.com]'s Compiler Design in C is a classic.

    The ACM [acm.org]'s SIGPLAN [acm.org] ("Special Interest Group on Programming Languages") and its journal, ACM SIGPLAN Notices, are fine resources. So is ACM Transactions on Programming Languages and Systems [wustl.edu].

    Don't forget the IEEE [ieee.org] as well.

    Not to mention Abelson and Sussman: Structure and Interpretation of Computer Programs.

    The garbage collection page [ukc.ac.uk] is a good source for information on memory management and garbage collection.

    Your university's library is another good resource.

    Well. That should keep you out of trouble.

  • Finite automata & Natural Language. Study of Finite State Machines progressing to more complicated grammars and languages. At PSU it's a relatively difficult undergrad elective.

    Very informative and much more interesting than most of my other courses. Doesn't teach you lex/yacc (not that hands-on), but you will be better grounded in regular expressions, languages, parsing, etc.
  • Reverse parsing (Score:3, Interesting)

    by ka9dgx ( 72702 ) on Wednesday June 19, 2002 @07:48AM (#3728064) Homepage Journal
    Things get really interesting when the parser is set up to preserve context so that it can be run backwards, spitting out source from the symbol tables and code segment tokens. You need to preserve whitespace and declaration order, but it's definitely feasible for a language that doesn't use tons of Macros (sorry C/C++ programmers) like Pascal, Basic, etc.
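
    As a toy illustration of the idea (the node shapes and the target syntax here are invented for the example, not tied to any particular tool), a parse tree that keeps enough context can be walked back into indented source:

```python
# A tiny parse tree: a while loop containing one assignment.
tree = ('while', ('<', 'i', '10'),
        [('assign', 'i', ('+', 'i', '1'))])

def unparse(node, indent=0):
    # Walk the tree "backwards", re-emitting source text.
    pad = '  ' * indent
    kind = node[0]
    if kind == 'while':
        body = '\n'.join(unparse(s, indent + 1) for s in node[2])
        return f"{pad}while {unparse_expr(node[1])} do\n{body}\n{pad}end"
    if kind == 'assign':
        return f"{pad}{node[1]} := {unparse_expr(node[2])}"

def unparse_expr(e):
    # Leaves are plain strings; interior nodes are (op, lhs, rhs).
    return e if isinstance(e, str) else f"{e[1]} {e[0]} {e[2]}"

assert unparse(tree) == "while i < 10 do\n  i := i + 1\nend"
```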

    --Mike--

    • VisualWorks (Smalltalk) does this: you can tell the compiler to reformat your source. I think this is done by compiling and then decompiling an object method.
    • Having learned the hard way (again, in my compiler class), this is infeasible for a truly compiled language for a couple of simple reasons:

      1) Which symbols get stored in your symbol table? Most people don't export the entire symbol table -- just the global symbols (which usually means only global variables shared across translation units and function declarations). And even then, you aren't guaranteed to get the global symbols -- if you've linked statically or run strip over your executable to make it smaller, you're out of luck.

      2) Loops?
      Assembly-like languages don't have loops; they have jumps and logical tests. So how exactly are you planning on matching loop structures, given that they all look essentially identical in assembly?

      There are other, equally hairy problems to be dealt with. If you're talking about an interpreted language, you may have more hope.

      The best you can really hope for is a disassembler, so dust off your assembly skillz.
  • Lex and Yacc (Score:2, Informative)

    by Aniquel ( 151133 )
    I'm surprised that no one has mentioned lex and yacc. Guys, perl != generic parser. If you're looking to delve into the science of scanners and parsers, from tokenizing to LR(1) grammars, look at lex and yacc. Their Java equivalents, JLex and CUP, are just as good for that language (and a whole lot easier to use, I might add). O'Reilly has a small book on using them, and it would probably be a good introduction to the science of parsing - not to mention the foundation of compilers.
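
    In the same spirit as a lex specification, a scanner can be driven by a table of (name, pattern) rules. Here is a tiny Python sketch of the idea (the token names and patterns are made up for illustration):

```python
import re

# lex-style spec: (token name, pattern) pairs, tried left to right
TOKENS = [("NUM", r"\d+"), ("ID", r"[A-Za-z_]\w*"),
          ("OP", r"[+\-*/=]"), ("WS", r"\s+")]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

def tokenize(src):
    # The named group tells us which rule fired; whitespace is
    # discarded, much like a lex rule with an empty action.
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src) if m.lastgroup != "WS"]

assert tokenize("x = y + 42") == [("ID", "x"), ("OP", "="),
                                  ("ID", "y"), ("OP", "+"), ("NUM", "42")]
```

Real lex/flex additionally guarantees longest-match and reports unmatched input as an error; this sketch silently skips anything the master pattern misses.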

    • Nobody had mentioned lex/yacc, flex/bison, or other lexical-analyzer/parser generators because that's not what he asked for. He wanted resources about parsing and the theory behind it. Of course, it raises the question: how can you get through a university CS curriculum and not learn this stuff? Or even how to use the university library's catalog to find the information? It's not as if it's difficult to find.
  • Much of parsing has to do with regular expressions, which, in turn, deal with NFAs/DFAs. Most parsing, in the end, comes down to these finite automata, which have been on most schools' CS curricula for many, many years.
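
    For the curious, a DFA really is just a transition table, and running one is a loop of table lookups. A minimal Python sketch (illustrative only):

```python
# DFA accepting binary strings with an even number of 1s.
# Transition table: (state, input symbol) -> next state.
DELTA = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd",  "0"): "odd",  ("odd",  "1"): "even"}

def accepts(s, start="even", accepting=("even",)):
    state = start
    for ch in s:
        state = DELTA[(state, ch)]   # one lookup per input symbol
    return state in accepting

assert accepts("1001")       # two 1s: accepted
assert not accepts("10")     # one 1: rejected
assert accepts("")           # zero 1s: accepted
```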
  • While it's kind of scary that you're a senior with 7 years of experience and don't know this, it should be easy to catch up by reading the beginning of a compilers book. Older compiler books tend to concentrate on parsing a lot, since parsing used to be "most" of the work involved in compiling. Actually, all the theory on how to conserve the last bit of memory using some kind of wacky table is pretty useless these days, but it's there.

    Though perl's built-in regexp support makes its syntax more brief for matching regular grammars, many grammars are not regular (The grammars of most programming languages are context free or worse!). For those grammars, a parser generator like bison can make life a bit easier.

    Then, manipulating parse trees in a language like C or perl is pretty painful once you've used datatypes and pattern matching in a language like SML, O'Caml, or Haskell. If you're going to be doing a lot of work on the parsed data, and its structure is not trivial, give these a shot.

    Finally, about the only thing that's excited me about parsing for years is the idea of parser combinators: writing parsers that look like the grammar IN the language itself. They can allow the parsing of arbitrary grammars, and don't require resorting to external tools. There's a good beginning explanation in Paulson's "ML for the Working Programmer", though I can't find a tutorial online... maybe it's time to write one!
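
    The combinator idea carries over to any language with first-class functions. A minimal Python sketch (illustrative; combinator libraries in ML or Haskell are far more polished): a parser is a function from a string to (value, rest), or None on failure, and bigger parsers are built by combining smaller ones.

```python
def lit(c):
    # Parser for one literal character.
    return lambda s: (c, s[1:]) if s[:1] == c else None

def seq(p, q, f):
    # Run p then q; combine their results with f.
    def run(s):
        r1 = p(s)
        if r1 is None:
            return None
        v1, s1 = r1
        r2 = q(s1)
        if r2 is None:
            return None
        v2, s2 = r2
        return f(v1, v2), s2
    return run

def alt(p, q):
    # Try p; if it fails, try q on the same input.
    return lambda s: p(s) or q(s)

def digit(s):
    return (int(s[0]), s[1:]) if s[:1].isdigit() else None

# expr -> digit '+' expr | digit   (the code mirrors the grammar)
def expr(s):
    return alt(seq(digit, seq(lit('+'), expr, lambda _, e: e),
                   lambda d, e: d + e),
               digit)(s)

assert expr("1+2+3") == (6, "")
```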

  • Of course there's a branch of CS that studies parsing. It's called the Theory of Computation. Any good introductory Theory of Comp. book will give you an introduction to parsing. Any decent book on compiler construction will give you the gory details about implementing parsers. What do they teach CS students these days, if a senior doesn't know this?
    • Wow, somebody stole my Rant!

      Language Theory and Computer Theory go hand in hand. RegEx is just the start. Pushdown automata and Turing machines are where my Comp Theory course finished up, with the next semester leading to Compilers.

      But that stuff is old hat. Neural Networks and pattern recognition in 3D is much funner.
  • The Dragon Book [amazon.com] is the standard reference on this subject.

    However, I've found that Constructing Language Processors for Little Languages [amazon.com] by Randy M. Kaplan helped me more in constructing small one-off parsers for various purposes. The Dragon Book provides a good grounding in theory, but I've found the utilitarian approach in Little Languages more accessible.

  • I have found this book [amazon.com] very good. I am a Java programmer and I wanted to understand parsing from the bottom up, before using something like ANTLR. This book gives a great foundation, with code to explain it all.

  • Check out JavaCC at WebGain's [webgain.com] site:
    "Java Compiler Compiler (JavaCC) is the most popular parser generator for use with Java applications. A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc. "
    It's very cool, and there are scores of grammars already defined.

  • We had 2 courses at school, the first being abstract machines, languages, and computations (regular expressions, grammars, automata, turing machines). The second was application of the first - programming languages and their processors.

    See "Introduction to Languages and the Theory of Computation" by Martin.

  • The front end of every language compiler or interpreter is a parser. Start there. Lex, Yacc, and ANTLR are good places to start. Perl was originally designed for parsing files, and most languages have parsing libraries, but their value can vary greatly.

  • then C++ shouldn't scare you too much. Check out the Spirit parser library [sourceforge.net]. In a nutshell, it's kind of like Lex and Yacc, but without the need for separate tools, i.e. it's all done with standard C++. On the theory side, while everybody is mentioning compiler books, I find those to be a little rushed. Get a good book on the theory of computation (sorry, don't remember the name of the one I like :( ). I found their explanations of automata, regular expressions, grammars, etc. much more elucidating than the compiler books. I think that this is because the comp theory books treat them for their own sake instead of in passing. And, of course, if you're looking for algorithms, Knuth has plenty :)
  • The following is not meant as a flame.

    This is an odd question coming from a cs/ce senior. What, may I ask, have you been studying, when you should have been in a compiler course?

    I've been out of school for a while, but it is inconceivable that compilers would now be an optional part of any CS curriculum.
  • I've browsed other replies, and I think they've missed the following:
    • Programming Language Research [cmu.edu] - Links maintained by a CMU student.
    • Compilers.Net [compilers.net]
    • Lambda the Ultimate [weblogs.com] - I found this from Meerkat. While somewhat more esoteric than straight-up parsing talk, I'm seeing it spawn a lot of programming-language discussion across blogs.
    Just to be redundant, though: none of these sites is a replacement for a good low-level book, like the Aho, Sethi, Ullman text. Of course, if you are looking to really bake your noodle, go straight to computational linguistics, starting with Chomsky.

    *Smirk*

  • While there's lots of theory, I've been living for years off the same small set of tools:

    For C/C++, there's YACC/LEX or BISON/FLEX.

    For Java, there's CUP/JFLEX.

  • Some have mentioned the dragon book (Aho, Ullman, et al.) and the tiger book (Appel's "Modern Compiler Implementation"). Both are excellent references.

    If you're at all mathematically minded, you should look into a good book on formal languages, such as Harrison's "Introduction to Formal Language Theory", or R. Nigel Horspool's "LR Parsing". (Caution: Don't buy these. Look for them in your local university library.)

    Once you have the basics, you should look into the work of Mark Hopkins [uwm.edu]. The guy is a genius, and understands more of the theory of parsing than anyone I've ever seen, and has invented some techniques which far surpass the "state of the art" of lex and yacc.

  • Try using google.com. sheesh

    -Kevin
