```
valuetype i32
nonterminals E T F
terminals + * ( ) num
topsym E
E --> E:e + T:t { e+t }
E --> T:t { t }
T --> T:t * F:f { t*f }
T --> F:f { f }
F --> ( E:e ) { e }
F --> num:n { n }
EOF
```
This tutorial will start with a sample grammar. Those who are already familiar with similar LR parser generation tools may wish to skip to the more advanced example in Chapter 2.
Everything after `EOF` is ignored and can be used for comments, in addition to lines that start with `#`. Because the parser that reads in a grammar file is (currently) hand-coded, the `#` symbol must be the first character on the line: you cannot place a `#` comment after something else on the same line.
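For instance, a minimal (hypothetical) grammar file using both styles of commentary might look like this:

```
# comment lines must begin with # in the first column
valuetype i32
nonterminals E
terminals num
topsym E
E --> num:n { n }
EOF
Everything after the EOF line, including this sentence, is also ignored.
```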
This classic example of LR parsing is found in virtually all compiler textbooks. It is an unambiguous grammar. You can produce a parser from this grammar file with:
```bash
rustlr test1.grammar lalr -verbose
```
The first argument to the executable is the path of the grammar file. The second argument selects `lalr` or `lr1` as the method for generating the finite-state machine of the parser; LALR suffices for most examples. Another option that can be given is `-trace 1`, `-trace 2`, etc., which sets the tracing level:

- Level 0 prints nothing at all (unless errors occur while reading the grammar).
- Level 1 (the default) prints some minimal information.
- Level 2 prints basic information concerning the grammar, including the initial state.
- Level 3 prints all states.

Each level n+1 also prints all information from level n. The last argument, `-verbose`, is only appropriate for small grammars: the state transition table will be written in a form readable by humans. Larger parsers should not use this option; their state transition tables will be encoded in a compact binary form.
You can also generate a parser from within a program that declares `use rustlr::rustler` by calling `rustler("test1","lalr")`, which assumes that "test1.grammar" can be found in the working directory.
The generated parser will be a program called test1parser.rs. RustLr will derive the name of the grammar (test1) from the file path, unless there is a declaration of the form
```
grammarname somename
```

in the grammar spec, in which case the generated parser will be called "somenameparser.rs".
You can look at the contents of the generated parser. Since this grammar only requires 12 states, the generated parser is in a readable format. Without the `-verbose` option you will still be able to read some of the code, but not the state transition table. Please note that most of the more advanced features of Rustlr are not available with the `-verbose` option, which exists only for debugging. The generated parser must import some elements of rustlr, so it should be used inside a crate. We will come back to how to use the generated parser later.
The first line in the grammar specification:
```
valuetype i32
```

(alternatively `absyntype i32`) determines the type of value returned by the parser. In most cases that would be some enum defining an abstract syntax tree, but here we will just calculate an i32 value. The default valuetype (if none is declared) is i64. The valuetype you choose must implement the `Default` trait.
RustLr requires that all grammar symbols be defined before any production rules, using one or more "nonterminals" or "terminals" directives.
The following terminal symbols are reserved and should not be used in a grammar:
```
EOF ANY_ERROR : | @ { } --> ==> <==
```
The following symbols should also NOT be used as non-terminals in your grammar:
```
START valuetype absyntype grammarname resync resynch topsym errsym nonterminal terminal nonterminals terminals flexname lexname typedterminal left right externtype externaltype
```
For example, if ":" is to be one of the terminal symbols of your language, then you should call it something like COLON inside the grammar. You will then adapt your lexical analyzer so that ":" is translated into "COLON" before the token is sent to the parser. If you want to treat whitespace as a token, your lexer must similarly translate whitespace into something like WHITESPACE. The non-terminal symbol START and the terminal EOF will always be added as additional symbols to the grammar. The other symbols that should not be used as non-terminals are reserved to avoid clashes with grammar directives.
You should designate one particular non-terminal symbol as the top symbol:

```
topsym E
```

The parser generator will always create an extra production rule of the form `START --> topsym EOF`.
You will get an error message if the grammar symbols are not defined before the grammar rules. Each rule is indicated by a non-terminal symbol followed by `-->`, `::=`, or `==>`. The symbol `::=` is interpreted to be the same as `-->`. `==>` is for rules that span multiple lines, which you will find used in other grammars (later chapters). You can specify multiple production rules with the same left-hand side nonterminal using `|`, which you will also find used in other grammars.
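As an illustration, the two F rules of the sample grammar could be written on one line using `|` (a hypothetical reformulation; the tutorial's grammar file keeps them separate):

```
F --> ( E:e ) { e } | num:n { n }
```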
The right-hand side of each rule must separate its symbols with whitespace. For each symbol, you can optionally bind a simple pattern (usually a variable) to the value attached to that symbol: this value will have type valuetype, which is i32 in this example. If the valuetype is something else, other simple patterns are possible, for example `E:(a,b)` if the valuetype is a pair. However, patterns may not contain whitespace. The right-hand side of a rule may be empty, which will make the non-terminal on the left side of `-->` "nullable".
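For example, a hypothetical nonterminal `Es` (not part of the tutorial grammar) becomes nullable when one of its rules has an empty right-hand side:

```
Es --> E:e Es { e }
Es -->
```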
Each rule can optionally end with a semantic action inside { and }, which may only appear after all the grammar symbols of the rule. This is a piece of Rust code that will be injected verbatim into the generated parser. This code will have access to any labels defined using ":" as immutable variables of type valuetype. The semantic action must return a value of type valuetype (in this case i32). If no semantic action is given, then a default one is created that just returns `valuetype::default()`, which is why the valuetype must implement the `Default` trait. Here's an example, taken from the generated parser, of how the code is injected:
```rust
rule.Ruleaction = |parser|{ let f:i32=parser.stack.pop().unwrap().value; parser.stack.pop(); let t:i32=parser.stack.pop().unwrap().value; t*f };
```
This is the semantic action generated from the rule
```
T --> T:t * F:f { t*f }
```
The parser generator is not responsible if you write an invalid semantic action that's rejected by the Rust compiler.
Within the { } block, you may also call other actions on the parser, including reporting error messages and telling the parser to abort (see later chapters). However, you should not try to "pop the stack" or change the parser state in other ways: leave that to the generated code.
Take a look at the "main.rs" associated with this grammar. Its principal contents define a lexical analyzer that conforms to the Lexer trait.
```rust
pub struct lexer1<'t>(Str_tokenizer<'t>); // Str_tokenizer from basic_lexer

impl<'t> Lexer<i32> for lexer1<'t>
{
   fn nextsym(&mut self) -> Option<Lextoken<i32>>
   {
      let tok = self.0.next();
      if let None = tok { return None; }
      let token = tok.unwrap();
      let retval =
        match token {
          Token::Symbol(n) => { Lextoken::new(n,0) },
          Token::Alphanum(n) => { Lextoken::new(n,0) },
          Token::Integer(n) => Lextoken::new("num".to_owned(),n as i32),
          _ => Lextoken::new("error".to_owned(),-999),
        };//match
      Some(retval)
   }
}//impl Lexer for lexer1

fn main()
{
   let mut input = "5+2*3";
   let args:Vec<String> = std::env::args().collect(); // command-line args
   if args.len()>1 {input = &args[1];}
   let mut parser1 = make_parser();
   let mut lexer = lexer1(Str_tokenizer::new(input));
   let result = parser1.parse(&mut lexer);
   println!("result after parsing {}: {}",input,result);
}//main
```
You must become familiar with the Lexer trait and the Lextoken struct which are defined by rustlr:
```rust
pub trait Lexer<AT:Default>
{
  fn nextsym(&mut self) -> Option<Lextoken<AT>>; // must implement
  fn linenum(&self) -> usize { 0 }   // reimplement to return current line number
  fn column(&self) -> usize { 0 }    // reimplement to return current column
  fn current_line(&self) -> &str { "" } // reimplement ...
}

pub struct Lextoken<AT:Default>
{
  pub sym: String, // must correspond to name of terminal symbol
  pub value: AT,   // value of terminal symbol
}

pub fn new(name:String, val:AT) -> Lextoken<AT> // creates a Lextoken
```
The parser requires a Lexer-trait object to be passed to it.
The nextsym function of the trait object must produce Some(Lextoken) until end of input is reached, at which point it should return None. The Lextoken::new function can be called to create a lex token. Very importantly, the "sym" owned string of the Lextoken must match identically with the name of a terminal symbol of your grammar. The "value" of the token is something of type valuetype as defined by the grammar. In this case, for example, each integer constant must be translated into a Lextoken with .sym=="num" and .value = the value of the integer.
This example uses basic_lexer, available on crates.io, as the background lexical analyzer, but any tokenizer can be adopted.
Since for this example we do not need the functionalities of `Lexer::{linenum, column, current_line}`, the default implementations suffice.
Once a lexer has been created, invoke the `make_parser` function in the generated parser to create a mutable parser. Then call `parser1.parse(&mut lexer)`, where `lexer` is the Lexer-trait object. This function will return a value of type valuetype. It will return a valuetype value even if parsing failed (but error messages will be printed). After `.parse` returns, you can also check whether an error occurred with the parser's `error_occurred` function before deciding to use the valuetype result that was returned.