Parser Generator Stage 3
///////////////////////////////////////////////////////////////////////////

By now you should have:

* A metaparser that parses a grammar into internal representations. Be sure
  to have tested it against several of my examples.
* Representations of Nullable, First and Follow sets.
* A representation of a finite state machine.

However, there's still quite a bit of code to write, and some tricky points
to understand.

A. How to Parse

First you need to understand how the LR parsing algorithm uses the state
machine to parse, so we can decide what kind of information is required at
parser runtime (as opposed to parser generation time). Implementing the LR
parsing algorithm to follow the state machine really isn't hard. However, we
must also incorporate the "semantic actions", i.e., the bit of Java code
that's to be executed after each reduction.

Intuitively, the parse stack contains a sequence of terminal and nonterminal
grammar symbols (a "viable prefix"). However, once we've built the viable
prefix state machine, we don't need to store grammar symbols on the stack,
just a state number. Each shift of a terminal goes to a new state, and each
reduction to the left-hand side of a production rule pops some states from
the stack and then jumps to a new state. Each state therefore corresponds to
some grammar symbol. The only thing we need to keep track of, other than the
state number, is the value (a Java "Object") that represents the attribute
associated with the symbol.

How do we use the state machine that we've generated to actually parse? I use
the following data structure, which represents an "item" on the stack (please
don't confuse it with the items that make up LR states):

public class Sitem
{
   public int si;       // state index
   public Object value; // arbitrary attribute associated with symbol
   public Sitem(int a) {si=a;}
   public Sitem(int a, Object b) {si=a; value=b;}
}

My abstract class that all parsers inherit from is basically the following.
The type parameter TOPTYPE is the type associated with the "topsym"
nonterminal of the grammar. The function absparser.parse() returns something
of this type (javac will give a compiler warning because the cast to the
unknown TOPTYPE cannot be checked statically - but in fact we know it will
work because we're casting from type Object). Note that this code is only a
guideline: you're still responsible for whatever you decide to use.

public abstract class absparser<TOPTYPE>
{
   public ArrayList<Sitem> Stack = new ArrayList<Sitem>(); // parse stack
   public Gsym lookahead;            // lookahead symbol
   public ArrayList<Hashtable<String,stateaction>> FSM;    // points to FSM
   public ArrayList<Grule> Rules;    // skeleton of grammar rules
   public boolean acceptnow = false; // set externally
   public Mylexer lexer;             // see Mylexer interface below

   public abstract void setactions(); // generated for particular grammar
   public abstract void settable();   // generated for particular grammar
   //...

   public TOPTYPE parse() throws Exception
   {
      settable();   // dispatch to concrete class methods
      setactions();
      Stack.add(new Sitem(0,null)); // state 0 is always initial for me
      lookahead = lexer.next();     // get next Gsym token from lexer
      while (lookahead!=null && !acceptnow)
      {
         Sitem topitem = Stack.get(Stack.size()-1); // top-of-stack state
         stateaction action = FSM.get(topitem.si).get(lookahead.sym);
         if (action!=null) action.doit(this); // dispatch to action
         else { /* FSM doesn't know what to do! */ }
      } // main parse loop
      if (acceptnow)
      {
         return (TOPTYPE) Stack.get(Stack.size()-1).value;
         // The above line will cause a javac compiler warning: it's ok.
      }
      // else report failure
   } // parse
   ...
} // absparser

The class "Mylexer" is an interface that allows my code to be abstract enough
to work with any lexical analyzer.
interface Mylexer  // delegate interface for lexical tokenizer
{
   Gsym next() throws java.io.IOException;
}

Here's a sample "concrete lexer," which was in fact automatically generated
by my metaparser with the "genflex" option:

class Mynameslexer implements Mylexer
{
   nameslexer lx;
   public Mynameslexer(nameslexer x) {lx=x;}
   public Mynameslexer(java.io.Reader r) { lx = new nameslexer(r); }
   public Gsym next() throws java.io.IOException { return lx.yylex(); }
} // specialized lexer class

In this class, nameslexer is the name of a class that was generated by JFlex.
Mynameslexer therefore acts as an object adapter (one of the design patterns)
that allows me to integrate a JFlex lexer into my generic code.

B. What does "doit" do?

Let's assume that the FSM is of type ArrayList<Hashtable<String,stateaction>>.
This means that the FSM is looked up using the current state index plus a
grammar symbol (as a string). The lookup returns a "stateaction", which can be
reduce, shift, accept, or gotonext. When we call doit(), each action needs to
perform the appropriate operation:

shift: push the next state on top of the stack, and set lookahead to the
next token.

reduce: pop a number of Sitems from the stack, as determined by the length of
the right-hand side of the rule being reduced.

gotonext: this is always done right after a reduce: look up the FSM entry for
the current state (the state on top of the stack) and the nonterminal symbol
being reduced to, and push the next state on the stack.

accept: set the absparser.acceptnow variable to true to stop the parser loop.

However, "doit" also needs to execute the Java code associated with each
production rule: it needs to execute Grule.Ruleaction. Instances of Ruleaction
should be generated by your parser generator.
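The dispatch mechanism described above can be sketched in miniature. The
following is my own toy illustration, not the actual support code: only shift
and accept are shown (reduce and gotonext also need the rule list, so they are
elided), grammar symbols are plain Strings instead of Gsyms, and the lexer is
replaced by a token list. The names Sitem, stateaction, shift, accept, Stack,
FSM, lookahead and acceptnow follow the handout; toyparser, run and nexttoken
are invented for the sketch.

```java
import java.util.*;

class Sitem { int si; Object value; Sitem(int a, Object b){ si=a; value=b; } }

abstract class stateaction { abstract void doit(toyparser p); }

class shift extends stateaction {
   int next;                       // state to push
   shift(int n){ next = n; }
   void doit(toyparser p) {
      p.Stack.add(new Sitem(next, p.lookahead)); // token's attribute rides along
      p.lookahead = p.nexttoken();               // consume the token
   }
}

class accept extends stateaction {
   void doit(toyparser p){ p.acceptnow = true; } // stop the parse loop
}

// Degenerate "grammar": the input must be the token a followed by EOF.
class toyparser {
   ArrayList<Sitem> Stack = new ArrayList<>();
   ArrayList<Hashtable<String,stateaction>> FSM = new ArrayList<>();
   Iterator<String> input; String lookahead; boolean acceptnow = false;
   String nexttoken(){ return input.hasNext() ? input.next() : null; }

   boolean run(List<String> tokens) {
      for (int i = 0; i < 2; i++) FSM.add(new Hashtable<>());
      FSM.get(0).put("a", new shift(1));
      FSM.get(1).put("EOF", new accept());
      Stack.add(new Sitem(0, null));
      input = tokens.iterator(); lookahead = nexttoken();
      while (lookahead != null && !acceptnow) {
         stateaction act = FSM.get(Stack.get(Stack.size()-1).si).get(lookahead);
         if (act == null) return false;  // no table entry: syntax error
         act.doit(this);
      }
      return acceptnow;
   }

   public static void main(String[] args) {
      System.out.println(new toyparser().run(Arrays.asList("a", "EOF"))); // prints true
   }
}
```

The point of the sketch is the dispatch itself: the loop never inspects the
kind of action; it just looks up the table and calls doit, exactly as in the
absparser.parse loop.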
For example, for the grammar rule

T --> T:t * F:f { return new multexp(t,f); }

I generated:

candidate = new Grule();
candidate.lhs = new Gsym("T",null,true); // only info needed at runtime
Rules.add(candidate);
candidate.act = new Ruleaction()
{
   public Object compval()
   {
      int si = Stack.size()-1;
      Sitem stackele = null;
      stackele = Stack.remove(si--);
      Expr f = (Expr) stackele.value;
      stackele = Stack.remove(si--);
      stackele = Stack.remove(si--);
      Expr t = (Expr) stackele.value;
      return new multexp(t,f);
   }
};

The rhs of this rule has three symbols, which correspond to the three calls to
Stack.remove. The elements are removed from right to left. Two of the symbols
(Gsyms) name their respective attribute values (f and t). Code must be
generated to assign the values from the stack to these variables, with correct
typing. This is where the Gsym.assoc and Gsym.javatype fields are used.

Your reduce.doit(..) procedure can be combined with gotonext.doit() (which can
then be left blank). It needs to execute the Grule.Ruleaction, then look up
the next state to go to after the reduce, and push that state along with the
value returned by Grule.Ruleaction on top of the stack.

C. Exactly what code should be generated for each grammar?

The best way to answer this is with a simple example.
Assume that the grammar file simple.grammar contains:

terminal a b c d
nonterminal S String
nonterminal T Integer
topsym S

S --> a T:n d { return n+" in the middle"; }
T --> b c { return 3; }
EOF

I generated the following file, called simpleparser.java:

// generated parser dependent on absparser.java
import java.util.*;
public class simpleparser extends absparser<String>
{
   public simpleparser(ArrayList<Gsym> S) { setup("simple",S); }
   public simpleparser(Mylexer lx) { newsetup("simple",lx); }

   public void settable()
   {
      FSM = new ArrayList<Hashtable<String,stateaction>>(8);
      for(int i=0;i<8;i++) FSM.add(new Hashtable<String,stateaction>(18));
      FSM.get(0).put("a",new shift(1));
      FSM.get(0).put("S",new gotonext(2));
      FSM.get(1).put("b",new shift(4));
      FSM.get(1).put("T",new gotonext(3));
      FSM.get(2).put("EOF",new accept());
      FSM.get(3).put("d",new shift(6));
      FSM.get(4).put("c",new shift(7));
      FSM.get(5).put("EOF",new reduce(2));
      FSM.get(6).put("EOF",new reduce(0));
      FSM.get(7).put("d",new reduce(1));
   } // settable

   public void setactions()
   {
      Rules = new ArrayList<Grule>(3);
      Grule candidate = null;

      candidate = new Grule();
      candidate.lhs = new Gsym("S",null,true);
      Rules.add(candidate);
      candidate.act = new Ruleaction()
      {
         public Object compval()
         {
            int si = Stack.size()-1;
            Sitem stackele = null;
            stackele = Stack.remove(si--);
            stackele = Stack.remove(si--);
            Integer n = (Integer) stackele.value;
            stackele = Stack.remove(si--);
            return n+" in the middle";
         }
      };

      candidate = new Grule();
      candidate.lhs = new Gsym("T",null,true);
      Rules.add(candidate);
      candidate.act = new Ruleaction()
      {
         public Object compval()
         {
            int si = Stack.size()-1;
            Sitem stackele = null;
            stackele = Stack.remove(si--);
            stackele = Stack.remove(si--);
            return 3;
         }
      };

      candidate = new Grule();
      candidate.lhs = new Gsym("START",null,true);
      Rules.add(candidate);
      candidate.act = new Ruleaction()
      {
         public Object compval()
         {
            int si = Stack.size()-1;
            Sitem stackele = null;
            stackele = Stack.remove(si--);
            stackele = Stack.remove(si--);
            return Stack.get(Stack.size()-1).value;
         }
      };
   } // setactions
} // parser class
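To see the generated tables and rule actions in motion, here is a hand-written
miniature of the whole runtime on the same simple.grammar example. The class
and field names (Sitem, Grule, Ruleaction, shift, reduce, gotonext, accept)
follow the handout; everything else - miniparser, nexttoken, pop, lambdas in
place of anonymous Ruleaction classes, Strings in place of Gsyms, and the
omission of state 5 and the START wrapper rule - is my own simplification.
Treat it as a sketch, not as the actual support code.

```java
import java.util.*;

class Sitem { int si; Object value; Sitem(int a, Object b){ si=a; value=b; } }
interface Ruleaction { Object compval(); }
class Grule { String lhs; Ruleaction act; }

abstract class stateaction { abstract void doit(miniparser p); }
class shift extends stateaction {
   int next; shift(int n){ next=n; }
   void doit(miniparser p){
      p.Stack.add(new Sitem(next, p.lookahead));
      p.lookahead = p.nexttoken();
   }
}
class gotonext extends stateaction {
   int next; gotonext(int n){ next=n; }
   void doit(miniparser p){}            // work is done inside reduce.doit
}
class reduce extends stateaction {
   int ri; reduce(int r){ ri=r; }
   void doit(miniparser p){
      Grule r = p.Rules.get(ri);
      Object v = r.act.compval();       // pops the rhs Sitems, computes value
      int top = p.Stack.get(p.Stack.size()-1).si;
      gotonext g = (gotonext) p.FSM.get(top).get(r.lhs);
      p.Stack.add(new Sitem(g.next, v)); // goto after the reduce
   }
}
class accept extends stateaction { void doit(miniparser p){ p.acceptnow = true; } }

class miniparser {
   ArrayList<Sitem> Stack = new ArrayList<>();
   ArrayList<Hashtable<String,stateaction>> FSM = new ArrayList<>();
   ArrayList<Grule> Rules = new ArrayList<>();
   Iterator<String> input; String lookahead; boolean acceptnow = false;
   String nexttoken(){ return input.hasNext() ? input.next() : null; }
   // pop from the top: equivalent to the handout's Stack.remove(si--)
   Sitem pop(){ return Stack.remove(Stack.size()-1); }

   void settable() {                    // same table as the generated settable
      for (int i = 0; i < 8; i++) FSM.add(new Hashtable<>());
      FSM.get(0).put("a", new shift(1));  FSM.get(0).put("S", new gotonext(2));
      FSM.get(1).put("b", new shift(4));  FSM.get(1).put("T", new gotonext(3));
      FSM.get(2).put("EOF", new accept());
      FSM.get(3).put("d", new shift(6));
      FSM.get(4).put("c", new shift(7));
      FSM.get(6).put("EOF", new reduce(0));
      FSM.get(7).put("d", new reduce(1));
   }
   void setactions() {
      Grule r0 = new Grule(); r0.lhs = "S";   // S --> a T:n d
      r0.act = () -> { pop(); Integer n = (Integer) pop().value; pop();
                       return n + " in the middle"; };
      Grule r1 = new Grule(); r1.lhs = "T";   // T --> b c
      r1.act = () -> { pop(); pop(); return 3; };
      Rules.add(r0); Rules.add(r1);
   }
   Object parse(List<String> tokens) {
      settable(); setactions();
      Stack.add(new Sitem(0, null));
      input = tokens.iterator(); lookahead = nexttoken();
      while (lookahead != null && !acceptnow) {
         stateaction a = FSM.get(Stack.get(Stack.size()-1).si).get(lookahead);
         if (a == null) return null;    // syntax error
         a.doit(this);
      }
      return acceptnow ? Stack.get(Stack.size()-1).value : null;
   }
   public static void main(String[] args) {
      System.out.println(new miniparser().parse(
         Arrays.asList("a","b","c","d","EOF")));  // prints 3 in the middle
   }
}
```

Tracing the input a b c d EOF by hand is a good exercise: the reduce by
T --> b c fires at state 7, pops states 7 and 4, and the gotonext lookup in
state 1 pushes state 3 with the Integer 3 as its value.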
Beware that there's a limit to how large a method can be in Java: when
settable gets too large, you need to either break it up into multiple
functions (each function can just end with a tail call to the next) or have
the parser read the table from a file. I tested both methods, and the first
method seemed faster even when there are tens of thousands of lines to
execute.

D. Selecting Runtime Data Structures

We do not wish to re-parse the grammar and regenerate the large state machine
each time we want to run the parser. The parser generator should therefore
create a .java file (and possibly some other support files). Let us review
the different data structures we constructed while parsing the grammar and
generating the state machine, and decide which of them are needed at parser
runtime (yours may vary, depending on your program):

1. Information on which symbols are terminal and which are nonterminal: NO.
2. Nullable, First and Follow sets: NO.
3. Information concerning operator precedence and associativity: NO.

All of this information went into the construction of the FSM; the FSM is all
that we need. Part of the benefit of LR parsing is its runtime efficiency:
follow the FSM, never go backwards (so the paper tape with holes in it won't
rip).

4. The list of grammar production rules: ONLY SOME PARTS OF EACH RULE. Much
of the information associated with each grammar rule is not used at runtime
either, but we do need the following:

4a. The left-hand side symbol of each rule (but not the right-hand side).
4b. The delegate Ruleaction act, which now must be instantiated.

We need to know the left-hand side symbol of each grammar rule so we can use
it to look up the FSM after a reduce (to find the gotonext state). The
information in the right-hand side list of symbols is used at parser
generation time to form the Ruleaction delegate of each rule.
Observe that (at least in my code) it is the Ruleaction that actually calls
Stack.remove (which also returns the element that was removed). Remember that
Java does not support true closures: the Ruleaction code cannot refer to a
(non-final) local variable, though it may refer to instance variables and
static variables.

Finally, we definitely need the FSM:

5. The table representing the finite state machine itself. However, as my
example shows, we should generate code to construct the FSM quickly.

In my own program, I used the same data structures at runtime as at
generation time: Gsym, Grule, ArrayList<Hashtable<String,stateaction>>. This
may not be the best choice. For example, since we know how many states there
are, we can use an ordinary array instead of an ArrayList. We can certainly
use a different structure to represent each grammar rule, since the
information required is completely different at parser runtime. But I did not
regard such optimizations as significant enough given the memory capacity of
modern machines.

/////

For this stage of the assignment, you should be able to generate a file as I
have done, write the doit methods of the stateaction classes (it's ok to
combine reduce.doit with gotonext.doit), and parse some small examples. You
should have enough to complete the truth table generation assignment. In the
next stage we'll add error reporting and put some final touches on the
parser generator.