


Parsing Algorithms and Resources?
Derek Williams asks: "I'm a senior majoring in computer engineering & computer science and I've been programming for about 7 years, mainly in C and Java. While I've had quite a few courses that delve into some of the deeper topics of programming (e.g. Object Oriented Design), I find that the majority of programs I write, both for work and elsewhere, involve parsing. Although I have no problem tackling these sorts of programs, I was wondering if there was some branch of computer science dedicated to the study of parsing. What books and websites out there are of interest to someone looking to learn more about parsing and algorithms relating to it?"
A quick search on Google for parsing returns... (Score:4, Informative)
Flexible Parsing [warwick.ac.uk]
Workshop on The Evaluation of Parsing Systems [slashdot.org]
Robust Parsing [slashdot.org]
Parsing Resources [cs.uu.nl]
Probably the last one on that list would be the most useful starting place...
A quick search on Google by arb (Score:3, Funny)
You should at least look at the results of the search. One of the links is about compression algorithms and the other two are about natural language parsing.
The following two are the only relevant ones:
Stepan Kasal
Re:A quick search on Google by arb (Score:1)
Learn Perl (Score:1)
Until you've parsed the Linux kernel in a single regex, you haven't lived.
Compilers and Parsing... (Score:3, Interesting)
duh! welcome to the 21st century (Score:2)
Read books on Compiler Design (Score:2, Interesting)
Re:Read books on Compiler Design (Score:2, Interesting)
Book recommendations (Score:3, Informative)
Not... (Score:2)
Your needs for parsing regular (or regular-ish) grammars in Perl will be met. Unfortunately, regular expressions are not always the right tool for the job (see the gargantuan e-mail address parser in the back of the MRE book for an example) because many grammars are not regular. Try parsing (say) C with Perl! Also, Perl has poor facilities for manipulating such parse trees, especially compared to a language with datatypes and pattern matching like SML or OCaml.
Perl makes parsing flat data brief (though regular expression libraries in other high-level languages make it just as easy, if a little more verbose), but it is not by any stretch the be-all and end-all of parsing.
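To make the parse-tree point concrete, here's a minimal OCaml sketch (the type and function names are made up for illustration): a variant type for a toy expression language, with evaluation and a small rewrite both done by pattern matching. The same tree surgery in Perl or C means hand-rolled hashes or structs plus switch ladders.

(* A tiny expression tree as an algebraic datatype. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* Evaluate by structural pattern matching. *)
let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b

(* A small rewrite: fold away multiplications by 0 or 1. *)
let rec simplify = function
  | Mul (Num 0, _) | Mul (_, Num 0) -> Num 0
  | Mul (Num 1, e) | Mul (e, Num 1) -> simplify e
  | Add (a, b) -> Add (simplify a, simplify b)
  | Mul (a, b) -> Mul (simplify a, simplify b)
  | Num n -> Num n

let () =
  let t = Add (Num 2, Mul (Num 1, Num 3)) in
  Printf.printf "%d\n" (eval (simplify t))  (* prints 5 *)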
Red Dragon Book (Score:5, Informative)
Re:Red Dragon Book (Score:1)
Re:Red Dragon Book (Score:1)
Re:Red Dragon Book (Score:2)
Re:Red Dragon Book (Score:1)
Re:Red Dragon Book (Score:2)
Take a look at the ubiquitous calculator example and imagine that expanded to include all the regular language features that you'd expect to see.
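For anyone who hasn't run into it: the calculator example boils down to roughly the following. This is a minimal recursive-descent sketch in OCaml rather than the book's lex/yacc version, with made-up token and function names, covering only +, * and parentheses with the usual precedence.

(* Grammar (the calculator example, stripped down):
     expr   ::= term   [ "+" expr ]
     term   ::= factor [ "*" term ]
     factor ::= NUMBER | "(" expr ")"              *)
type token = Int of int | Plus | Star | LParen | RParen

(* Each rule becomes a function from a token list to
   (value, remaining tokens). *)
let rec expr toks =
  let v, rest = term toks in
  match rest with
  | Plus :: rest' -> let v', rest'' = expr rest' in (v + v', rest'')
  | _ -> (v, rest)

and term toks =
  let v, rest = factor toks in
  match rest with
  | Star :: rest' -> let v', rest'' = term rest' in (v * v', rest'')
  | _ -> (v, rest)

and factor = function
  | Int n :: rest -> (n, rest)
  | LParen :: rest ->
      let v, rest' = expr rest in
      (match rest' with
       | RParen :: rest'' -> (v, rest'')
       | _ -> failwith "expected ')'")
  | _ -> failwith "expected a number or '('"

let () =
  (* 2 * (3 + 4) *)
  let v, _ = expr [Int 2; Star; LParen; Int 3; Plus; Int 4; RParen] in
  Printf.printf "%d\n" v  (* prints 14 *)

Expanding that to a real language means more token kinds, more precedence levels, and error recovery, which is exactly where lex/yacc start earning their keep.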
Re:Red Dragon Book (Score:1)
A better example would be something multithreaded, like a web browser loading frames. Any multithreaded program that has several threads doing parsing needs a reentrant, thread-safe parser. That is, unless you enjoy lock contention.
--Joe
sed and regular expressions (Score:2, Informative)
Perl's regular expressions are based on (and extend) sed's, and you'll find a lot of information in "Programming Perl".
And if you're writing in C, there's again an O'Reilly book, "lex & yacc".
http://py-howto.sourceforge.net/regex/regex.htm
http://sitescooper.org/tao_regexps.html
Hey, I can't help it that I find that publisher's books to be exactly what I'm looking for in my programming.
Re:sed and regular expressions (Score:1)
Re:sed and regular expressions (Score:1)
Re:sed and regular expressions (Score:1)
How to deal with parsing (Score:3, Informative)
The best tools for building parsers and manipulating parsed syntax trees are functional languages, such as OCaml [inria.fr] (with its stream parsers, camlp4, ocamllex/ocamlyacc, etc.), or SML, Haskell, etc.
Of course, if you were a LISPer, you'd know that although you have lots of well-known tools to build new parsers, such as Meta [telent.net] or Zebu [telent.net], the best thing to do about a parser is not to write it, but rather to reuse the built-in extensible universal parser, READ, and its extensible universal unparser, WRITE.
If you spend most of your time writing parsers, you're not just using the wrong tools, you're also using the wrong approach.
Just my .2 mg of e-gold worth...
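In that same spirit of not writing what you can reuse, here's a rough sketch assuming an OCaml where the Stream and Genlex modules still ship in the standard library (newer releases moved them into the camlp-streams package): the stock generic lexer saves you from hand-rolling a tokenizer, and the keyword list below is just illustrative.

(* Reuse the stock lexer instead of hand-rolling one. *)
let lexer = Genlex.make_lexer ["("; ")"; "+"; "*"]

(* Drain the token stream into a list. *)
let tokens_of_string s =
  let strm = lexer (Stream.of_string s) in
  let rec collect acc =
    match Stream.peek strm with
    | None -> List.rev acc
    | Some tok -> Stream.junk strm; collect (tok :: acc)
  in
  collect []

let () =
  tokens_of_string "2 * (3 + 4)"
  |> List.iter (function
       | Genlex.Int n -> Printf.printf "INT %d\n" n
       | Genlex.Kwd k -> Printf.printf "KWD %s\n" k
       | _ -> print_endline "OTHER")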
compiler design books and resources. (Score:3, Informative)
Well...the Dragon book for starters, as mentioned earlier. That's probably the ur-source for most of the theory behind the magic. Makes my head hurt, though.
Terence Parr's book, Practical Computer Language Recognition and Translation [antlr.org] (out of print). His doctoral dissertation is a useful thing too (try the Purdue University library).
comp.compilers [comp.compilers] is another useful resource. It's archived at http://compilers.iecc.com [iecc.com].
Allen Holub [holub.com]'s Compiler Design in C is a classic.
The ACM [acm.org]'s SIGPLAN [acm.org] ("Special Interest Group on Programming Languages") and its journal, SIGPLAN Notices, are fine resources. So is ACM Transactions on Programming Languages and Systems [wustl.edu].
Don't forget the IEEE [ieee.org] as well.
Not to mention Abelson and Sussman: Structure and Interpretation of Computer Programs.
The garbage collection page [ukc.ac.uk] is a good source for information on memory management and garbage collection.
Your university's library is another good resource.
Well. That should keep you out of trouble.
One course that might be relevant (Score:1)
Very informative and much more interesting than most of my other courses. Doesn't teach you lex/yacc (not that hands-on), but you will be better grounded in regular expressions, languages, parsing, etc.
Reverse parsing (Score:3, Interesting)
--Mike--
Re:Reverse parsing (Score:1)
Re:Reverse parsing (Score:1)
1) Which symbols get stored in your symbol table? Most people don't export the entire symbol table -- just the global symbols (which usually means only global variables shared across translation units and function declarations). And even then, you aren't guaranteed to get the global symbols -- if you've linked statically or run strip over your executable to make it smaller, you're out of luck.
2) Loops?
Assembly-like languages don't have loops; they have jumps and conditional tests. So how exactly are you planning on matching loop structures, given that they all look essentially identical in assembly?
There are other, equally hairy problems to be dealt with. If you're talking about an interpreted language, you may have more hope.
The best you can really hope for is a disassembler; dust off your assembly skillz.
Lex and Yacc (Score:2, Informative)
Re:Lex and Yacc (Score:1)
compilers (Score:1)
Compiler Books, data types, combinator parsing (Score:2)
Though Perl's built-in regexp support makes its syntax briefer for matching regular grammars, many grammars are not regular (the grammars of most programming languages are context-free or worse!). For those grammars, a parser generator like bison can make life a bit easier.
Then, manipulating parse trees in a language like C or Perl is pretty painful once you've used datatypes and pattern matching in a language like SML, OCaml, or Haskell. If you're going to be doing a lot of work on the parsed data, and its structure is not trivial, give these a shot.
Finally, about the only thing that's excited me about parsing for years is the idea of parsing combinators: writing parsers that look like the grammar IN the language itself. They allow the parsing of arbitrary grammars and don't require resorting to external tools. There's a good introductory explanation in Paulson's "ML for the Working Programmer", though I can't find a tutorial online... maybe it's time to write one!
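In lieu of a tutorial link, here's a very small taste of the idea in OCaml rather than Paulson's SML; the combinator names and the toy grammar are illustrative only. A parser is a function from input to an optional result plus the leftover input, and the grammar ends up reading almost like its BNF.

(* A parser consumes a list of chars and either fails or returns
   a result together with whatever input is left. *)
type 'a parser = char list -> ('a * char list) option

(* Match one specific character. *)
let chr c : char parser = function
  | x :: rest when x = c -> Some (c, rest)
  | _ -> None

(* Alternation: try p, fall back to q. *)
let ( <|> ) p q : 'a parser =
  fun input -> match p input with Some _ as r -> r | None -> q input

(* Sequencing: run p, feed its result to f, run the parser f returns. *)
let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun input ->
    match p input with
    | Some (x, rest) -> (f x) rest
    | None -> None

(* Succeed without consuming input. *)
let return x : 'a parser = fun input -> Some (x, input)

(* digit ::= '0' | '1' | ... | '9' *)
let digit : char parser =
  List.fold_left ( <|> ) (chr '0')
    [chr '1'; chr '2'; chr '3'; chr '4'; chr '5';
     chr '6'; chr '7'; chr '8'; chr '9']

(* pair ::= '(' digit ',' digit ')' -- the code mirrors the grammar *)
let pair : (char * char) parser =
  chr '(' >>= fun _ ->
  digit   >>= fun a ->
  chr ',' >>= fun _ ->
  digit   >>= fun b ->
  chr ')' >>= fun _ ->
  return (a, b)

let () =
  match pair ['('; '3'; ','; '7'; ')'] with
  | Some ((a, b), _) -> Printf.printf "parsed (%c, %c)\n" a b
  | None -> print_endline "no parse"

Real combinator libraries add repetition, backtracking control, and error reporting on top of this skeleton, but the shape stays the same.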
Theory of Computation (Score:1)
Re:Theory of Computation (Score:1)
Language theory and computation theory go hand in hand. Regular expressions are just the start; pushdown automata and Turing machines are where my computation theory course finished up, with the next semester leading into compilers.
But that stuff is old hat. Neural Networks and pattern recognition in 3D is much funner.
Constructing Little Languages (Score:1)
The Dragon Book [amazon.com] is the standard reference on this subject.
However, I've found that Constructing Language Processors for Little Languages [amazon.com] by Randy M. Kaplan helped me more in constructing small one-off parsers for various purposes. The Dragon Book provides a good grounding in theory, but I've found the utilitarian approach in Little Languages more accessible.
Java Book on Parsers (Score:2)
JavaCC (Score:1)
State machines (Score:1)
We had two courses at school, the first being abstract machines, languages, and computation (regular expressions, grammars, automata, Turing machines). The second was an application of the first: programming languages and their processors.
See "Introduction to Languages and the Theory of Computation" by Martin.
Languages and compiles... (Score:1)
If you can stomach Java (Score:1)
Branch of computer science dedicated to the study (Score:2)
This is an odd question coming from a cs/ce senior. What, may I ask, have you been studying, when you should have been in a compiler course?
I've been out of school for a while, but it is inconceivable that compilers would now be an optional part of any CS curriculum.
Links from my bookmark list... (Score:1)
*Smirk*
A few tools do it all... (Score:2)
For C/C++, there's YACC/LEX or BISON/FLEX.
For Java, there's CUP/JFLEX.
Other books (Score:2)
Some have mentioned the Dragon book (Aho, Ullman, et al.) and the Tiger book (Appel's "Modern Compiler Implementation"). Both are excellent references.
If you're at all mathematically minded, you should look into a good book on formal languages, such as Harrison's "Introduction to Formal Language Theory", or R. Nigel Horspool's "LR Parsing". (Caution: Don't buy these. Look for them in your local university library.)
Once you have the basics, you should look into the work of Mark Hopkins [uwm.edu]. The guy is a genius; he understands more of the theory of parsing than anyone I've ever seen and has invented techniques that far surpass the "state of the art" of lex and yacc.
dumb question (Score:2)
-Kevin