Chapter 2: More Advanced Calculator

!use crate::exprtrees::*;
!use crate::exprtrees::Expr::*;
!use rustlr::LBox;
!/* anything on a ! line is injected verbatim into the parser */

absyntype Expr
nonterminal E
nonterminal ES
terminal + - * / ( ) ;
terminal int
topsym ES
resync ;

left * 500
left / 500
left + 400
left - 400

E --> int:@Val(m)@ {Val(m)}
E --> E:e1 + E:e2 { Plus(parser.lb(e1),parser.lb(e2)) }
E --> E:e1 - E:e2 { Minus(parser.lb(e1),parser.lb(e2)) }
E --> E:e1 / E:e2 { Divide(parser.lb(e1),parser.lb(e2)) }
E --> E:e1 * E:e2 { Times(parser.lb(e1),parser.lb(e2)) }
E --> - E:e { Negative(parser.lb(e)) }
E --> ( E:e )  { e }
ES --> E:n ; { Seq(vec![parser.lb(n)]) }
ES ==> ES:@Seq(mut v)@  E:e ;  {
   v.push(parser.lb(e));
   Seq(v)
 } <==

# ==> and <== are required for rules spanning multiple lines

EOF

In the second chapter of this tutorial, we redo the calculator example and introduce a more complete set of features of Rustlr. This grammar differs from the first in the following principal ways

The grammar is ambiguous. There are shift-reduce conflicts from the pure grammar that are resolved using operator precedence and associativity rules as evidenced by declarations such as left * 500. A terminal symbol that's to be used as an operator can be declared as left or right associative and a positive integer defines the precedence level. Each grammar production rule is also assigned a precedence, which is the same as the highest precedence among the symbols on the right-hand side of the grammar. A lookahead symbol with a higher precendence than the rule, or a right-associative lookhead, will result in a shift instead of reduce. (It is recommended that precedence levels greater than 100 are used for now, due to some internal matters that should be fixed soon).

Speaking of grammar ambiguity, rustlr also resolves reduces-reduces conflicts by always favoring the first rule that appears in the grammar, although a warning is always printed regardless of trace level.

The grammar requires a separate module, exprtrees.rs, that defines its abstracy syntax type 'Expr' as well as a lexical scanner using an available tokenizing tool (basic_lexer). There is also a separate main.rs
The grammar allows a sequence of arithmetic expressions to be evaluated in turn by separating them with a semicolon. The semicolon also allows us to define a simple error-recovery point: resync ; indicates that when a parser error is encountered, the parser will skip past the next semicolon, then look down its parse stack for a state with which it can continue parsing. In otherwords, failure to parse one expression does not mean it will not try to parse the next ones.
The grammar's abstract syntax uses LBox, which encapsulates a Rust Box along with the line number, column number, and potentially other information associated with each syntactic construct. LBox works like a Box in that it implements deref coercion on the boxed value, but which also carries along the additional information when they're needed. The parser.lb function invoked from within the semantic actions automatically transfers the parser's lexical information while creating an LBox. It is recommended that LBox (or LRc) be used instead of Box (Rc) when defining the recursive enums and structs that typically form the abstract syntax representation. This allows accurate error reporting after the parse tree is built, as in the division-by-zero example explained below.
The labels attached to grammar symbols on the right-hand side of grammar productions can be more than a simple variable, but can be a pattern enclosed in @...@. Rustlr generates an if-let expression that attempts to bind the pattern to the value associated with that symbol. The value is moved to a mut variable before being deconstructed by the pattern. In general, the label associated with a right-hand side grammar symbol can be of the following forms:
1. E:a + E:b: this is found in the first grammar, each symbol 'a', 'b' is a immutable Rust variable that's assigned to the value popped from the parse stack. This kind of label can also be a simple, irrefutable pattern such as a tuple: E:(a,b) is also acceptable if the absyntype of the grammar is, for example, (usize,usize).
2. E:@Seq(mut v)@: as seen in this grammar. This pattern is if-let bound to the value popped from the stack as a mutable variable. The specified semantic action is injected into the body of if-let. A parser error report is generated if the pattern fails to match, in which case the default value of the abstract syntax type is returned. To be precise, the semantic action function generated for the last rule of the grammar is
```
 rule.Ruleaction = |parser|{ parser.stack.pop();  let e:Expr=parser.stack.pop().unwrap().value;
 let mut _vflab_0=parser.stack.pop().unwrap().value;  
 if let (Seq(mut v),)=(_vflab_0,) { 
   v.push(parser.lb(e));
   Seq(v)
 }  else {parser.bad_pattern("(Seq(mut v),)")} };
```
1. E:es@Seq(v)@ The pattern can be named. 'es' will be a mut variable assigned to the value popped from the stack and an if-let is generated that attempts to match the pattern to &mut es. In particular, the last production rule of this grammar is equivalent to:
```
ES ==> ES:es@Seq(v)@  E:e ;  {
   v.push(parser.lb(e));
   es
 } <==   
```
In contrast to an non-named pattern, the value is not moved into the pattern, which means we can still refer to the value by its name.

The Abstract Syntax Type Expr

Let's take a closer look at the definition of the abstract syntax type:

pub enum Expr
{
   Val(i64),
   Plus(LBox<Expr>,LBox<Expr>),  // LBox replaces Box for recursive defs
   Times(LBox<Expr>,LBox<Expr>),
   Divide(LBox<Expr>,LBox<Expr>),
   Minus(LBox<Expr>,LBox<Expr>),
   Negative(LBox<Expr>),
   Seq(Vec<LBox<Expr>>),
   Nothing,                    // for integration into lexer/parser
}  // no copy/clone trait defined

The variant 'Nothing' allows us to define a default, which is required for the 'absyntype' of the grammar:

impl Default for Expr
{
  fn default() -> Self { Nothing }
}//impl Default

Unlike in the first example, here evaluation is defined after the parsing stage, when the abstract syntax tree is available as a complete structure. Note how the LBox is used in the same way as a Box in most of the cases except for Division. Here we access the line number enclosed inside the LBox to print an error message when division-by-zero is detected. For this still-simple example, the lexer we implemented did not accurately keep track of the column information.

pub fn eval(e:&Expr) -> i64
{
   match e {
     Val(x) => *x, 
     Plus(x,y) => eval(x) + eval(y), // deref coercion works nicely here
     Times(x,y) => eval(x) * eval(y),
     Divide(x,y) => {
       let yval = eval(y);
       if yval==0 {
         eprintln!("Division by zero on line {}, returning 0 as default",y.line-1); // -1 to offset ; already read
	 0// returns default
       } else {eval(x) / yval}
     },
     Minus(x,y) => eval(x) - eval(y), 
     Negative(x) => -1 * eval(x),
     Seq(V) => {
       let mut ev = 0;
       for x in V
       {
         ev = eval(x);
	 println!("result for line {}: {} ;",x.line-1,&ev);	 
       }
       ev
     },
     Nothing => 0,
   }//match
}//eval

Lexical Scanner.

The file exprtrees.rs also contains a simple lexical analyzer called exprscanner using basic_lexer, which is a typical tokenizer. All tokenizers must be adopted to implement Rustlr's Lexer trait, and produce Lextoken structures where the 'sym' fields match the names of terminal symbols of the grammar.

pub struct exprscanner<'t>
{
  st: Str_tokenizer<'t>, // from basic_lexer
  line: usize,
}
impl<'t> exprscanner<'t>
{
   pub fn new(s:&'t str) -> exprscanner<'t>
   {
      exprscanner{ st: Str_tokenizer::new(s), line:1, }
   }
}   

// Since this is a simple example, we will not use a file tokenizer. Instead,
// we will just assume that each ; marks a new line.  Column information
// is not kept.
impl<'t> Lexer<Expr> for exprscanner<'t>
{
   fn nextsym(&mut self) -> Option<Lextoken<Expr>>
   {
      let tok = self.st.next();
      if let None = tok {return None;}
      let token = tok.unwrap();
      let retval = 
      match token {
        Token::Alphanum(n) => { Lextoken::new(n,Nothing) },
        Token::Symbol(c) => { if c==";" {self.line+=1;} Lextoken::new(c,Nothing) },
        Token::Integer(n) => Lextoken::new(String::from("int"),Val(n)),
        _ => Lextoken::new(String::from("error"),Nothing),
      };//match
      Some(retval)
   }
   fn linenum(&self) -> usize {self.line}
}//impl Lexer<Expr> for exprscanner

Generate the parser with

rustlr calculator.grammar lalr -trace 3 > calculator.states

This creates a file calculatorparser.rs, although each time it's generated the state numbers may be different. Create a cargo crate with the following dependencies in Cargo.toml:

# path to latest available rustlr, version on github is always newer
rustlr = "0.1.4"  
basic_lexer = "0.1.3"

copy the main.rs, exprtrees.rs and the generated calculatorparser.rs files into src/. The supplied main parses and evaluates the following input:

-5-(4-2)*5; 
5-7- -9 ; 
4*3-9; 
2+1/(2-1-1);

Cargo run should expect the following output:

expression tree after parse: Seq([Minus(Negative(Val(5)), Times(Minus(Val(4), Val(2)), Val(5))), Minus(Minus(Val(5), Val(7)), Negative(Val(9))), Minus(Times(Val(4), Val(3)), Val(9)), Plus(Val(2), Divide(Val(1), Minus(Minus(Val(2), Val(1)), Val(1))))])
result for line 1: -15 ;
result for line 2: 7 ;
result for line 3: 3 ;
Division by zero on line 4, returning 0 as default
result for line 4: 2 ;
final result after evaluation: 2

Training The Parser For Error Reporting

It is recommended that, when a parser is generated, the -trace 3 option is given, which will print all the LR states that are created. This may be helpful when training the parser. Each time the grammar is regenerated the states may have different numbers identifying them, even if the grammar is unchanged.

With a newly generated parser, when a parser error is encountered, the line and column numbers are printed and an "unexpected symbol" error message is given. To print more helpful error messages, the parser can be trained interactively. Interactive training also produces a script for future, automatic retraining when a new parser is generated.

Modify main.rs by changing the input to include some errors, such as:

let mut input =
"-5-(4-2)*5; 
3(1+2);
5%5;
5-7- -9 ; 
4*3-9; 
2+1/(2-1-1);
";

The new 2nd and 3rd lines now contain errors. Note that the supplied main already calls parse_train(&mut lexer,"calculatorparser.rs"); For input with no errors, this call works the same way as parse(&mut lexer); The parse_train function takes a path to a copy of the parser being trained. Cargo run will lead to the following (possible) training session:

ERROR on line 2, column 0: unexpected symbol ( .. 
>>>TRAINER: if this message is not adequate (for state 1), enter a replacement (default no change): insert an operator before (
>>>TRAINER: should this message be given for all unexpected symbols in the current state? (default yes) no
ERROR on line 3, column 0: unexpected symbol % .. 
>>>TRAINER: if this message is not adequate (for state 1), enter a replacement (default no change): this symbol is not recognized as an operator in this language
expression tree after parse: Seq([Minus(Negative(Val(5)), Times(Minus(Val(4), Val(2)), Val(5))), Minus(Minus(Val(5), Val(7)), Negative(Val(9))), Minus(Times(Val(4), Val(3)), Val(9)), Plus(Val(2), Divide(Val(1), Minus(Minus(Val(2), Val(1)), Val(1))))])
result for line 1: -15 ;
result for line 4: 7 ;
result for line 5: 3 ;
Division by zero on line 6, returning 0 as default
result for line 6: 2 ;
parser error, best effort after recovery: 2

Notice that error recovery was effective and the parser still produced a usable parse tree: however, the parser's error_occurred flag will be set. It is under consideration as to wether future editions of Rustlr will also allow the error-recovery strategy to be trainable in the same way. For now, only a fixed number of strategies are available. In the opinion of the author, the resync technique is the simplest and most effective.

The training session augments the LR state transition table with new entries, which are found at the end of the modified parser in the load_extras function:

fn load_extras(parser:&mut RuntimeParser<Expr,Expr>)
{
  parser.RSM[1].insert("ANY_ERROR",Stateaction::Error("this symbol is not recognized as an operator in this language"));
  parser.RSM[1].insert("(",Stateaction::Error("insert an operator before ("));
}//end of load_extras: don't change this line as it affects augmentation

When the "unexpected symbol" is recognized as a declared symbol of the grammar, the trainer will be given the option of entering the error message for either just that symbol, or all unexpected symbols in the same state. If the latter is chosen then an entry is created for the reserved ANY_ERROR symbol. If the unexpected symbol is not recognized as a terminal symbol of the grammar, an ANY_ERROR entry is always created. You can see the contents of state 1 if you created it with the -trace 3 option.

When the modified parser runs and encounters another unexpected symbol in the same state, it will first see if there is an entry for that symbol; if none exists, it will look for an ANY_ERROR entry for a message to display. Thus the two entries do not conflict with eachother.

The interactive session also generated a script file, which would be called "calculatorparser.rs_script.txt", with the following contents:

# Rustlr training script for calculatorparser.rs

2	0	( ::: insert an operator before (
3	0	ANY_ERROR ::: this symbol is not recognized as an operator in this language

This script can be used to retrain a newly genenerated parser (with different state numbers) with the train_from_script function provided the same input for the original training.

Final note: most of the more advanced features introduced in this chapter are not supported if the parser is generated with the -verbose option, which is only intended to test very small examples and for debugging grammars.