```
valuetype i32
nonterminals E T F
terminals + * ( ) num
topsym E
E --> E:e + T:t { e+t }
E --> T:t { t }
T --> T:t * F:f { t*f }
T --> F:f { f }
F --> ( E:e ) { e }
F --> num:n { n }
EOF
```
This tutorial will start with a sample grammar. Those who are already familiar with similar LR parser generation tools may wish to skip to the more advanced example in Chapter 2.
Everything after `EOF` is ignored and can be used for comments, in addition to lines that start with `#`. Because the parser that reads in a grammar file is (currently) hand-coded, the `#` symbol must be the first character on the line: you cannot place a `#` comment after something else on the same line.
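For instance, a minimal (hypothetical) grammar file using both styles of commentary might look like this:

```
# comment lines must begin with # in the first column
valuetype i32
nonterminals E
terminals num
topsym E
E --> num:n { n }
EOF
Everything after the EOF line, including this sentence, is also ignored.
```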
This classic example of LR parsing is found in virtually all compiler textbooks. It is an unambiguous grammar. You can produce a parser from this grammar file with:
```bash
rustlr test1.grammar lalr -verbose
```
The first argument to the executable is the path of the grammar file. The second argument selects `lalr` or `lr1` as the method for generating the finite-state machine of the parser; LALR suffices for most examples. Another option that can be given is `-trace 1`, `-trace 2`, etc., which sets the tracing level:

- Level 0 prints nothing at all (unless errors occur while reading the grammar).
- Level 1 (the default) prints some minimal information.
- Level 2 prints basic information concerning the grammar, including the initial state.
- Level 3 prints all states.

Each level n+1 also prints all information from level n. The last argument, `-verbose`, is only appropriate for small grammars: the state transition table will be written in a form readable by humans. Larger parsers should not use this option; their state transition tables will be encoded in a compact binary form.
You can also generate a parser from within a program that declares `use rustlr::rustler` by calling `rustler("test1","lalr")`, which assumes that "test1.grammar" can be found in the working directory.
The generated parser will be a program called test1parser.rs. RustLr will derive the name of the grammar (test1) from the file path, unless there is a declaration of the form
```
grammarname somename
```

in the grammar spec, in which case the generated parser will be called "somenameparser.rs".
You can look at the contents of the generated parser. Since this grammar only requires 12 states, the generated parser is in a readable format. Without the `-verbose` option you will still be able to read some of the code, but not the state transition table. Please note that most of the more advanced features of Rustlr are not available with the `-verbose` option, which exists only for debugging. The generated parser must import some elements of rustlr, so it should be used inside a crate. We will come back to how to use the generated parser later.
The first line in the grammar specification:
```
valuetype i32
```

(alternatively `absyntype i32`) determines the type of value returned by the parser. In most cases that would be some enum defining an abstract syntax tree, but here we will just calculate an i32 value. The default valuetype (if none is declared) is i64. The valuetype you choose must implement the `Default` trait.
RustLr requires that all grammar symbols be defined before any production rules, using one or more "nonterminals" or "terminals" directives.
The following terminal symbols are reserved and should not be used in a grammar:
```
EOF ANY_ERROR : | @ { } --> ==> <==
```
The following symbols should also NOT be used as non-terminals in your grammar:
```
START valuetype absyntype grammarname resync resynch topsym errsym nonterminal terminal nonterminals terminals flexname lexname typedterminal left right externtype externaltype
```
For example, if ":" is to be one of the terminal symbols of your language, then you should call it something like COLON inside the grammar. You will then adapt your lexical analyzer so that ":" is translated into "COLON" before the token is sent to the parser. If you want to treat whitespace as a token, your lexer must similarly translate whitespace into something like WHITESPACE. The non-terminal symbol START and the terminal EOF will always be added as additional symbols to the grammar. The other symbols that should not be used as non-terminals are reserved to avoid clashes with grammar directives.
You should designate one particular non-terminal symbol as the top symbol:

```
topsym E
```

The parser generator will always create an extra production rule of the form `START --> topsym EOF`.
You will get an error message if the grammar symbols are not defined before the grammar rules. Each rule is indicated by a non-terminal symbol followed by `-->`, `::=`, or `==>`. The symbol `::=` is interpreted to be the same as `-->`. `==>` is for rules that span multiple lines, which you will find used in other grammars (later chapters). You can specify multiple production rules with the same left-hand side nonterminal using `|`, which you will also find used in other grammars.
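As an illustration, the two F rules of the sample grammar could be written on one line using `|` (a hypothetical reformulation; the tutorial's grammar file keeps them separate):

```
F --> ( E:e ) { e } | num:n { n }
```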
The right-hand side of each rule must separate its symbols with whitespace. For each symbol, you can optionally bind a simple pattern (usually a variable) to the value attached to that symbol: this value will have type valuetype, which is i32 in this example. If the valuetype is something else, other simple patterns are possible, for example `E:(a,b)` if the valuetype is a pair. However, patterns may not contain whitespace. The right-hand side of a rule may be empty, which will make the non-terminal on the left side of `-->` "nullable".
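For example, a hypothetical nonterminal `Es` (not part of the tutorial grammar) becomes nullable when one of its rules has an empty right-hand side:

```
Es --> E:e Es { e }
Es -->
```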
Each rule can optionally end with a semantic action inside { and }, which may only appear after all the grammar symbols of the rule. This is a piece of Rust code that will be injected verbatim into the generated parser. This code will have access to any labels defined using ":" as immutable variables of type valuetype. The semantic action must return a value of type valuetype (in this case i32). If no semantic action is given, then a default one is created that just returns `valuetype::default()`, which is why the valuetype must implement the `Default` trait. Here's an example, taken from the generated parser, of how the code is injected:
```rust
rule.Ruleaction = |parser|{ let f:i32=parser.stack.pop().unwrap().value; parser.stack.pop(); let t:i32=parser.stack.pop().unwrap().value; t*f };
```
This is the semantic action generated from the rule
```
T --> T:t * F:f { t*f }
```
The parser generator is not responsible if you write an invalid semantic action that's rejected by the Rust compiler.
Within the { } block, you may also call other actions on the parser, including reporting error messages and telling the parser to abort (see later chapters). However, you should not try to "pop the stack" or change the parser state in other ways: leave that to the generated code.
Take a look at the "main.rs" associated with this grammar. Its principal contents define a lexical analyzer that conforms to the Lexer trait.
```rust
pub struct lexer1<'t>(Str_tokenizer<'t>); // Str_tokenizer from basic_lexer

impl<'t> Lexer<i32> for lexer1<'t>
{
   fn nextsym(&mut self) -> Option<Lextoken<i32>>
   {
      let tok = self.0.next();
      if let None = tok { return None; }
      let token = tok.unwrap();
      let retval =
        match token {
          Token::Symbol(n) => { Lextoken::new(n,0) },
          Token::Alphanum(n) => { Lextoken::new(n,0) },
          Token::Integer(n) => Lextoken::new("num".to_owned(),n as i32),
          _ => Lextoken::new("error".to_owned(),-999),
        };//match
      Some(retval)
   }
}//impl Lexer for lexer1

fn main()
{
   let mut input = "5+2*3";
   let args:Vec<String> = std::env::args().collect(); // command-line args
   if args.len()>1 {input = &args[1];}
   let mut parser1 = make_parser();
   let mut lexer = lexer1(Str_tokenizer::new(input));
   let result = parser1.parse(&mut lexer);
   println!("result after parsing {}: {}",input,result);
}//main
```
You must become familiar with the Lexer trait and the Lextoken struct which are defined by rustlr:
```rust
pub trait Lexer<AT:Default>
{
  fn nextsym(&mut self) -> Option<Lextoken<AT>>; // must implement
  fn linenum(&self) -> usize { 0 }   // reimplement to return current line number
  fn column(&self) -> usize { 0 }    // reimplement to return current column
  fn current_line(&self) -> &str { "" } // reimplement ...
}

pub struct Lextoken<AT:Default>
{
  pub sym: String, // must correspond to name of terminal symbol
  pub value: AT,   // value of terminal symbol
}

pub fn new(name:String, val:AT) -> Lextoken<AT> // creates a Lextoken
```
The parser requires a Lexer-trait object to be passed to it.
The nextsym function of the trait object must produce Some(Lextoken) until end of input is reached, at which point it should return None. The Lextoken::new function can be called to create a lex token. Very importantly, the "sym" owned string of the Lextoken must match identically with the name of a terminal symbol of your grammar. The "value" of the token is something of type valuetype as defined by the grammar. In this case, for example, each integer constant must be translated into a Lextoken with .sym=="num" and .value = the value of the integer.
This example uses basic_lexer, available on crates.io, as the background lexical analyzer, but any tokenizer can be adopted.
Since for this example we do not need the functionalities of `Lexer::{linenum, column, current_line}`, the default implementations suffice.
Once a lexer has been created, invoke the `make_parser` function in the generated parser to create a mutable parser. Then call `parser1.parse(&mut lexer)`, where `lexer` is the Lexer-trait object. This function will return a value of type valuetype. It will return a valuetype value even if parsing failed (but error messages will be printed). After `.parse` returns, you can also check whether an error occurred with the parser's `error_occurred` function before deciding to use the valuetype result that was returned.