# grammar, spaces and line breaks are important # grammar declarations are read line by line, so just don't use line breaks for long rules and comments. nonterminal E Expr terminal ( ) + * - == typedterminal num Integer topsym E # top level grammar symbol # precedence and associativity declarations, precedence determined by order # These rules must appear before the production rules. left * 500 left + 400 left - 400 left == 300 left ) 50 left ( 900 left num 980 # Grammar production rules, order breaks reduce-reduce conflicts # spaces must separate grammar elements E --> num:x { return new intexp(x); } E --> E:e1 + E:e2 { return new sumexp(e1,e2); } E --> E:e1 * E:e2 { return new multexp(e1,e2); } E --> E:e1 - E:e2 { return new subexp(e1,e2); } E --> E:e1 == E:e2 { return new eqexp(e1,e2); } E --> - E:e { return new negexp(e); } E --> ( E:e ) { return e; } # ends file, whatever follows EOF is ignored: EOF In this version of the online calculator parser, we use the inherent ability of the LR state machine to resolve shift-reduce conflicts using external declarations. This allows us to use ambiguous grammars. This feature is missing from "SableCC" but is available in Java_Cup, which is just a Java port of the traditional YACC tool. Ambiguous grammars are most natural for humans. They also correspond much more directly to the abstract syntax structures we wish to build. In other words, which grammar would you prefer working with for parsing arithemtic expressions? 1. ambiguous, SLR(1) with external declarations E --> E+E | E-E | E*E | E/E | num | (E) left-associative +, -, * , / precedence-order (*,/) > (+,-) 2. non-ambiguous SLR(1) (and thus LALR(1) and LR(1)) E --> E+T | E-T | T T --> T*F | T/F | F F --> num | (E) 3. non-ambiguous LL(1) E --> T E1 E1 --> "" | +T E1 | -T E1 T --> F T1 T1 --> "" | *F T1 | /F T1 F --> num | (E) ** E1 and T1 are introduced for left-recursion elimination In terms of speed, there is no question that grammar 1 is the fastest, since there are less reductions to make. But even more importantly, ask yourself: what kind of abstract syntax tree do you want to have for an expression such as 2+3*4? Why don't you sketch out the trees relative to the three grammars. For grammar 1, it is: E / | \ E + E | / | \ 2 E * E | | 3 4 For grammar 2, it is a bit less obvious: E / | \ E + T | / | \ T T * F | | | F F 4 | | 2 3 You can also try to make a tree for grammar #3 but I will not give extra credit :). The point is not just that the ambiguous grammar is more convenient, but also that it more directly correspond to how we want to represent the expression *in the abstract*. In such a role grammar symbols such as T, F (not to mention E1, T1) are simply superflous. For the LL(1) grammar, even the simplest expression "3" will have quite a tree: E / \ T E1 / \ | F T1 "" | | 3 "" What role does T1 and E1 play? Why would I want such entities in my abstract syntax tree? At the very least I will have to walk over the tree to "clean it up". What's more, in actual use we usually want to associate some meaning and action relative to each non-terminal. That is, each non-terminal symbol should represent some valid program component: each E represents a valid arithemtic sub-expression. With the unambiguous LR grammar, at least each non-terminal still corresponds to a valid subexpression. But what kind of expression does "* F T1" represent? Clearly "* 3" is not a well-formed expression or subexpression. What kind of meaning can I associate with just "* 3"? Is it dereferencing a pointer? What kind of animal is this?