Pick a markdown and code style:
Please note that this tutorial has been rewritten for Rustlr version 0.3, Parsers created since version 0.1.3 remain compatible. The original version of this chapter is available here.
This tutorial is written for those with sufficient background in computer science and in Rust programming, with some knowledge of context free grammars and basic bottom-up parsing concepts. Those who are already familiar with similar LR parser generation tools may wish to skip to the more advanced example in Chapter 2 or Chapter 4.
The tutorial will start with a sample grammar.
valuetype i32
nonterminals E T F
terminals + * ( ) num
topsym E
E --> E:e + T:t { e.value + t.value }
E --> T:t { t.value }
T --> T:(t) * F:(f) { t*f }
T --> F:(f) { f }
F --> ( E:e ) { e.value }
F --> num:n { n.value }
lexvalue num Num(n) (n as i32)
EOF
These are the contents of a Rustlr grammar file, called test1.grammar.
This classic example of LR parsing is found in virtually all compiler
textbooks. It is an unambiguous grammar. After you cargo install rustlr
you can produce a LALR parser from this grammar file with:
rustlr test1.grammar
The first and the only required argument to the executable is the path of the grammar file. Optional arguments (after the grammar path) that can be given to the executable are:
lexvalue
.The generated parser will be a program
test1parser.rs
that contains a make_parser
function. If the -genlex
option
is used, it will also contain a struct test1lexer
that implements
the Tokenizer. RustLr will derive the name of the grammar
(test1) from the file path, unless there is a declaration of the form
grammarname somename
in the grammar spec, in which case the parser generated will be called "somenameparser.rs". The parser must import some elements of rustlr so it should be used in a crate. We will come back to how to use the generated parser later.
The first line in the grammar specification:
valuetype i32
(alternatively absyntype i32
) defines the type of value returned by
the parser. In most cases that would be some enum that defines an
abstract syntax tree, but here we will just calculate an i32 value.
The default valuetype (if none declared) is ()
, unit.
The valuetype you choose must implement the Default trait.
RustLr requires that all grammar symbols be defined before any production rules using multiple "nonterminals" or "terminals" directives.
topsym E
You should designate one particular non-terminal symbol as the top symbol:
The parser generator will always create an extra production rule of the
form START --> topsym EOF
You will get an error message if the grammar symbols are not defined before
the grammar rules. Each rule is indicated by a non-terminal symbol followed
by -->
, or ==>
. The symbol ==>
is for rules that span multiple
lines that you will find used
in other grammars (later chapters). You can specify multiple production
rules with the same left-hand side nonterminal using | which you will
also find used in other grammars.
The right hand side of each rule must separate each symbol with
whitespaces. For each grammar symbol such as E, you can optionally
bind a "label" such as E:a
, E:(a)
, E:@pattern@
or
E:v@pattern@
. Each type of binding carries a different meaning and
affects how they will be used in the semantic action part of the rule. The
grammar used in this Chapter will only use the first two forms: a
and (a)
.
The right-hand side of a rule may be empty, which will make the
non-terminal on the left side of -->
"nullable".
Each rule can optionally end with a semantic action inside { and },
which can only follow all grammar symbols making up the right-hand
side of the production rule. This is a piece of Rust code that will
be injected verbatim into the generated parser. This code will have
access to any labels associated with the symbols defined using ":".
In a label such as E:e
, e is of type StackedItem, which includes the
following fields:
e.value
refers to the semantic value associated with this
symbol, which in this case is of type i32 but in general will be of the
type defined by the "valuetype" or "absyntype" directive.However, if we are only interested in the .value of the label, we can
also capture the value directly using the form demonstrated by
T:(t)
: in this case t
refers only to the .value of the popped
StackedItem. In case the valuetype can be described by an irrefutable
pattern, such as (i32,i32)
, a label such as E:(a,b)
can also used
to directly capture the value. The other kinds of labels (with the
@
symbol) will be described in the next chapter.
The semantic action code must return a value of type valuetype (in this case i32). If no semantic action is given, then a default one is created that just returns valuetype::default(), which is why the valuetype must implement the Default trait. Here's an example, taken from the generated parser, of how the code is injected:
rule.Ruleaction = |parser|{ let mut t = parser.popstack(); let mut _item1_ = parser.popstack(); let mut e = parser.popstack(); e.value + t.value };
This is the semantic action generated from the rule
E --> E:e + T:t { e.value + t.value }
Notice that if a symbol carries no label, then rustlr generates a name
_item{n}_
for it. The parser generator is not responsible if you
write an invalid semantic action that's rejected by the Rust compiler.
Within the { } block, you may also call other actions on the parser,
including reporting error messages and telling the parser to abort.
However, you should not try to "pop the stack"
or change the parser state in other ways: leave that to the generated
code.
A lexical scanner (aka "tokenizer", "lexer", etc) can either be created manually by implementing the Tokenizer trait, or be generated automatically from a minimal set of declarations using the built-in StrTokenizer. This tokenizer makes zero-copy of the source. It is capable of recognizing multi-line string literals and comments, alphanumeric and non alpha-numeric symbols, decimal and hexadecimal constants, floating point constants. It also has the option of returning newline and whitespaces (with count) as tokens. It returns the starting line and column numbers of each recognized token. But it has limitations and may not be the best tokenizer for every scenario. The process of adopting another tokenizer for use by a Rustlr parser will be covered in a speparate chapter.
For this grammar, a lexer is generated from a single declaration
lexvalue num Num(n) (n as i32)
This line states that a token of the form RawToken::Num(n) should be recognized as the terminal grammar symbol "num", carrying semantic value (n as i32) - because in Num(n), n is of type i64 and the semantic value attached to each grammar symbol must be of the declared absyntype (valuetype). The rest of the lexical scanner is derived from the declarations of terminal symbols in the grammar.
To understand what declarations are needed to generate a lexer in general, the reader should become familiar with RawToken. This is what StrTokenizer returns. The RawToken enum contains the following principal variants:
Alphanum(&str): where the string represents an (ascii) alphanumeric symbol that does not start with a digit. The underscore character is also recognized as alphanumeric.
Symbol(&str): a string consisting of non alphanumeric characters such as "==",
Num(i64): Both decimal and hexidecimals (starting with "0x") are recognized as Nums. However, although the returned value is signed, a negative integer such as "-12" is recognized as a Symbol("-") followed by a Num(12), and thus must be recognized at the parser level. Despite this, it is still more convenient to return the more generic signed form. Also, "3u8" would be reconized as a Num(3) followed by an Alphanum("u8").
Float(f64): like the case of Num, this represents unsigned, decimal floats.
BigNumber(&str): Numbers that are too large for i64 or f64 are represented verbatim.
Char(char): this represents a character literal in single quotes such as 'c'
Strlit(&str): A string literal delineated by double quotes. These strings can span multiple lines and can contain nested, escaped quotes. The surrounding double quotes are included in the literal.
Newline: optional token indicating a newline character. These tokens are not returned by the tokenizer by default, but can be returned with the directive
lexattribute keep_newline = true
Whitespace(usize): another optional token that carries the number of consecutive whitespaces. This option is likewise enabled with
lexattribute keep_whitespace = true
Verbatim(&str): another optional token carrying verbatim text, usually comments. Enable with
lexattribute keep_comment = true
By default, StrTokenizer recognizes C-style comments, but this can be customized with, for example,
lexattribute set_line_comment("#")
Custom(&'static str, &str): user-defined token type (since Version 0.2.95). The static string defines the token type-key and the other string should point to raw text. This token type is intended to be paired with declarations such as
lexattribute add_custom("uint32",r"^[0-9]+u32")
Text matching the given regex will be returned as a Custom("uint32",_) token. Please note that custom regular expressions should not start with whitespaces and will override all other token types. Multiple custom types are matched by the order in which they appear in the grammar file. Note: this is a change to the original feature introduced in version 0.2.95, in which they were matched by the alphabetical ordering of their keys. An anchor (^) will always be added to the start of the regex if none is given.
The most important lexer-generation directive is lexvalue. For every terminal symbol in the grammar that carries a (non-default) semantic value, typically numerical and string literals, a lexvalue directive is needed to identify the corresponding RawToken that represents the terminal and how to translate the RawToken's value to the valuetype/absyntype value to be associated with the terminal symbol. The lexvalue directive must identify the name of the terminal symbol, the RawToken form, and the valuetype form that should be recreated from the RawToken.
Besides lexvalue, there are two other lexer-generation directives,
lexname, which allows the mapping of a reserved symbol such as {
to a terminal symbol (see below), and lexattribute which allows the
customization of the scanner. Further usage of these directives can be
found in other chapters and examples.
Please note that malformed lexattribute declarations will only result in errors when the generated parser is compiled.
The generated lexer is a struct called test1lexer alongside the make_parser()
function inside the generated parser file. One creates a mutable instance
of the lexer using the generated test1lexer::from_str
and test1lexer::from_source
functions.
Here is the main.rs associated with this grammar, which forms a simple calculator. Its principal contents creates a parser, a lexer, and invokes the parser on the first command-line argument.
mod test1parser;
use test1parser::*;
fn main() {
let mut input = "5+2*3";
let args:Vec<String> = std::env::args().collect(); // command-line args
if args.len()>1 {input = &args[1];}
let mut parser1 = make_parser(); // calls function in mod test1parser
let mut tokenizer1 = test1lexer::from_str(input); //creates lexer
let result = parser1.parse(&mut tokenizer1);
println!("result after parsing {}: {}",input,result);
}//main
Alternatively, we can choose to create a test1lexer from another source, such as a file, with:
let source = rustlr::LexSource::new("file path").unwrap();
let mut tokenizer1 = test1lexer::from_source(&source);
An instance of the runtime parser is created by calling the make_parser
function.
Once a lexer has also been created, parsing can commence by calling
`parser1.parse(&mut tokenizer1)`
This function will return a value of type valuetype. It will return a valuetype-value
even if parsing failed (but error messages will be printed). After
.parse returns, you can also check if an error had occurred by calling
parser1.error_occurred()
before deciding to use the valuetype result
that was returned.
An alternative way to invoke the parser is to call
let result = parse_with(&mut parser1, &mut tokenizer1)
.unwrap_or_else(|x|{println!("Parsing errors occurred; results not guaranteed");
x});
The parse_with
function returns a Result<T,T>
where T
is the valuetype/absyntype.
To run the program, cargo new
a new crate and copy
the contents of main.rs and test1parser.rs to src/main.rs and src/test1parser.rs respectively. Add to Cargo.toml
under [dependencies]:
rustlr = "0.3"
cargo run "2+3*4"
will print 14 and cargo run "(2+3)*4"
will print 20.
The following terminal symbols are reserved and should not be used in a grammar:
EOF ANY_ERROR _WILDCARD_TOKEN_ : | @ { } --> ::= ==> <== _
The following symbols should also NOT be used as non-terminals in your grammar:
START valuetype absyntype grammarname resync resynch topsym errsym nonterminal terminal nonterminals terminals lexvalue lexname typedterminal left right externtype externaltype lifetime lexattribute any symbol starting with `SEQ` or `NEW..NT` may potentially, but unlikely, cause conflict.
For example, if ":" is to be one of the terminal symbols of your language, then you should call it something like COLON instead in the grammar. You will then adopt your lexical analyzer so that ":" is translated to COLON. This can be accomplished with the directive (if generating a lexer automatically):
lexname COLON :
This directive is equivalent to
lexvalue COLON Symbol(":") <valuetype>::default()
where valuetype refers to the declared valuetype. Underneath, the ":" symbol is translated into a TerminalToken with .sym="COLON" before sending the token to the parser. If you want to treat a whitespace as a token your lexer must similarly translate whitespaces. For automatic lexer generation, use something like the following:
lexvalue WHITESPACE Whitespace(n) value
assuming that WHITESPACE is a declared terminal symbol and "value" is the value you want to be associated with the symbol (usually this is just the valuetype::default()). Whitespace(n) is a variant of RawToken.
It is possible to combine a lexname declaration with the declaration of a terminal symbol with
lexterminal COLON :
The symbol START and terminal EOF will always be added as additional symbols to the grammar. The other symbols that should not be used for non-terminals are for avoiding clash with grammar directives.
The following identifiers (variable names) are reserved and should only be used carefully from within the semantic actions of a grammar production (rust code inside {}s):
parser
: the code generated from the semantic actions is of the form
|parser|{...}
. The parser refers to the instance of the runtime
parser ZCParser. It is valid to invoke certain functions on this object inside the
semantic actions, including parser.report (to report an error message),
parser.abort and most importantly, parser.lbx, which forms an LBox
smartpointer by inserting into it line/column information that accompanies
an abstract syntax value (see next chapter). However, there are other functions on parser that are
exported, but should only be called by the automatically generated portion of
the code. For example, calling parser.popstack() would remove an extra
state/value from the parse stack and corrupt the core parsing algorithm._item0_, item1_, item{n}_
: these variables may be generated
to hold the values that are popped from the stack.SYMBOLS, TABLE
: these are constant arrays holding essential information
about the LR state machine.make_parser
, load_extras
, _semaction_for_{n}_
Most rustlr projects will consist of mulitple files: the .grammar file, a module defining the abstract syntax type, a module defining a lexical analyzer, the generated parser as another module, and presumably a main to launch the program. In this additional example, enough code has been injected into the .grammar so that rustlr can generate a relatively self-contained program, that includes a lexer and a main, and illustrates a few extra features of Rustlr.