Chapter 4: Automatically Generating the Abstract Syntax

Rustlr is capable of automatically generating the data types (enums and structs) for the abstract syntax of a language as well as the semantic actions required to create instances of those types. For beginners new to writing grammars and parsers, we do not recommend starting with an automatically generated AST. The user must understand clearly the relationship between concrete and abstract syntax and the best way to learn this relationship is by writing ASTs by hand, as demonstrated in the previous chapters. Even with Rustlr capable of generating nearly everything one might need from a parser, it is still possible that careful fine tuning will be required.

Automatically creating the AST from the grammar (or the grammar from the AST) is not a new idea and can cause problems if not done carefully. There is a gap between the parse tree and the abstract syntax tree that must be bridged. The grammar usually contains extraneous elements, in the form of non-terminal symbols such as 'T' and 'F' that enforce operator precedence, as well as terminal symbols such as ';' that should not be included in the AST. In addition, the AST should stay relatively stable when minor modifications are made to the grammar. Rustlr addresses these problems by essentially defining a language, embedded within the grammar syntax, for describing how ASTs should be generated. This language allows one to create ASTs that are relatively stable and independent of the format of the grammar. It is also possible to override the types and semantic actions of any non-terminal, and to use a hybrid approach between automatic and manually written AST generation.

We redo the enhanced calculator example from Chapter 2. Although some form of abstract syntax can be generated for any grammar, the format of the grammar can greatly influence the form of the AST types. To illustrate the various choices, this grammar is a hybrid between the purely unambiguous grammar of Chapter 1 and the one in Chapter 2 in that operator precedence declarations are only given for the binary arithmetic operators. For the unary minus and the = sign in let-expressions, we choose to define different syntactic categories in the form of extra non-terminal symbols UnaryExpr and LetExpr. Along with Expr they define three levels of precedence from weakest to strongest: LetExpr, Expr, and UnaryExpr. Writing ambiguous grammars with operator precedence/associativity declarations is convenient and can make the grammar more readable. Such grammars also lead to more reasonable abstract syntax: symbols such as T in E --> T often have no meaning at the abstract syntax level. However, when there are a large number of operators and precedence levels, using such declarations alone may be problematic (see the original ANSI C grammar). Besides, these categories sometimes have genuine semantic meaning, such as the distinction between lvalues and rvalues.

The following grammar is found here.

# the auto directive means AST types and semantic actions will be generated
auto
lifetime 'lt
terminals + - * / ( ) = ;
terminals let in
valueterminal int ~ i64 ~ Num(n) ~ n
valueterminal var ~ &'lt str ~ Alphanum(n) ~ n
lexattribute set_line_comment("#")

nonterminals Expr ExprList
nonterminal UnaryExpr : Expr
nonterminal LetExpr : Expr

topsym ExprList
resync ;

left * 500
left / 500
left + 400
left - 400

UnaryExpr:Val --> int
UnaryExpr:Var --> var
UnaryExpr:Neg --> - UnaryExpr
UnaryExpr --> ( LetExpr )

Expr --> UnaryExpr
Expr:Plus --> Expr + Expr
Expr:Minus --> Expr - Expr
Expr:Div --> Expr / Expr
Expr:Times --> Expr * Expr

LetExpr --> Expr
LetExpr:Let --> let var = Expr in LetExpr

ExprList:nil -->
ExprList:cons --> LetExpr:car ; ExprList:cdr

EOF

Note the following, further differences between this grammar and the one presented in Chapter 2:

There are no semantic actions.
There is no "absyntype" or "valuetype" declaration; any such declaration would be ignored when using the auto option, which is enabled by the auto directive at the top of the grammar, or by the -auto flag given to the rustlr executable.
Only the types of values carried by certain terminal symbols must be declared (with typedterminal or valueterminal). A valueterminal declaration is just a combination of a typedterminal and a lexvalue declaration, with ~ separating the components. The other terminals all have type () (unit).
The non-terminal symbol on the left-hand side of a production rule may carry a label. These labels will become the names of enum variants to be created.

Process the grammar with rustlr calcauto.grammar (without the auto directive inside the grammar, run rustlr with the -auto option). Two files are created. Besides calcautoparser.rs there will be, in the same folder as the parser, a calcauto_ast.rs with the following (principal) contents:

#[derive(Debug)]
pub enum ExprList<'lt> {
  nil,
  cons{car:Expr<'lt>,cdr:LBox<ExprList<'lt>>},
  ExprList_Nothing,
}
impl<'lt> Default for ExprList<'lt> { fn default()->Self { ExprList::ExprList_Nothing } }

#[derive(Debug)]
pub enum Expr<'lt> {
  Plus(LBox<Expr<'lt>>,LBox<Expr<'lt>>),
  Minus(LBox<Expr<'lt>>,LBox<Expr<'lt>>),
  Div(LBox<Expr<'lt>>,LBox<Expr<'lt>>),
  Times(LBox<Expr<'lt>>,LBox<Expr<'lt>>),
  Var(&'lt str),
  Neg(LBox<Expr<'lt>>),
  Val(i64),
  Let(&'lt str,LBox<Expr<'lt>>,LBox<Expr<'lt>>),
  Expr_Nothing,
}
impl<'lt> Default for Expr<'lt> { fn default()->Self { Expr::Expr_Nothing } }

Compare these types with the manually written ones in Chapter 2: they are not so different. For example, the expression 5 - 7 - -9 will be represented in abstract syntax as Minus(Minus(Val(5), Val(7)), Neg(Val(9))). This is exactly what we want.
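To make the example concrete, here is a minimal sketch that builds the tree for 5 - 7 - -9 by hand and evaluates it. It is not generated code: Box stands in for rustlr's LBox so that the sketch has no external dependency, and the names sample and eval are hypothetical.

```rust
// Stand-in for the generated Expr type. Assumption: Box replaces rustlr's
// LBox, so no line/column information is recorded here.
#[allow(non_camel_case_types)]
#[derive(Debug)]
pub enum Expr<'lt> {
    Plus(Box<Expr<'lt>>, Box<Expr<'lt>>),
    Minus(Box<Expr<'lt>>, Box<Expr<'lt>>),
    Div(Box<Expr<'lt>>, Box<Expr<'lt>>),
    Times(Box<Expr<'lt>>, Box<Expr<'lt>>),
    Var(&'lt str),
    Neg(Box<Expr<'lt>>),
    Val(i64),
    Let(&'lt str, Box<Expr<'lt>>, Box<Expr<'lt>>),
    Expr_Nothing,
}

// Hypothetical interpreter. The environment is a stack of (name, value)
// bindings. None signals an unbound variable, a failed arithmetic operation
// (division by zero or overflow), or the Expr_Nothing placeholder.
pub fn eval<'lt>(e: &Expr<'lt>, env: &mut Vec<(&'lt str, i64)>) -> Option<i64> {
    use Expr::*;
    match e {
        Val(n) => Some(*n),
        // Search from the top of the stack so inner bindings shadow outer ones.
        Var(x) => env.iter().rev().find(|entry| entry.0 == *x).map(|entry| entry.1),
        Neg(a) => eval(a, env)?.checked_neg(),
        Plus(a, b) => eval(a, env)?.checked_add(eval(b, env)?),
        Minus(a, b) => eval(a, env)?.checked_sub(eval(b, env)?),
        Times(a, b) => eval(a, env)?.checked_mul(eval(b, env)?),
        Div(a, b) => eval(a, env)?.checked_div(eval(b, env)?),
        Let(x, v, body) => {
            let value = eval(v, env)?;
            env.push((*x, value));
            let result = eval(body, env);
            env.pop(); // the binding is only visible inside the body
            result
        }
        Expr_Nothing => None,
    }
}

// The tree for 5 - 7 - -9, as given in the text.
pub fn sample() -> Expr<'static> {
    use Expr::*;
    Minus(
        Box::new(Minus(Box::new(Val(5)), Box::new(Val(7)))),
        Box::new(Neg(Box::new(Val(9)))),
    )
}

fn main() {
    let tree = sample();
    println!("{:?} evaluates to {:?}", tree, eval(&tree, &mut Vec::new()));
}
```

Because subtraction is declared left-associative, the tree groups as (5 - 7) - (-9), and the sketch evaluates it to Some(7).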

Generally speaking, a new type is created for each non-terminal symbol of the grammar, which will also share the same name as the non-terminal itself. But this would mean that separate types would be created for LetExpr and UnaryExpr as well, which would lead to convoluted types that serve no purpose. Their creation was avoided with the following declarations:

nonterminal UnaryExpr : Expr
nonterminal LetExpr : Expr

The syntax means that instead of generating new types, the ASTs representing the rules for UnaryExpr and LetExpr would extend the enum that would be created for Expr. The type created for Expr must be an enum for this to work (it would not work if it were a struct). Leave out the : Expr portion from the declarations and we will get these types instead. They would be more cumbersome to work with.

Rules of AST Generation

An enum is created for each non-terminal symbol of the grammar that appears on the left-hand side of multiple production rules, unless the type of the non-terminal is declared to "extend" another type as explained above. The name of the enum is the same as the name of the non-terminal. The names of the variants are derived from the labels given to the left-hand side nonterminal, or are automatically generated from the nonterminal name and the rule number (e.g. Expr_8). A special Nothing variant is also created to represent a default. There is normally an enum variant for each production rule of this non-terminal. Each variant is composed of the right-hand side symbols of the rule that are associated with non-unit types. If none of the right-hand side symbols are given labels, a tuple-variant is created. The presence of any right-hand side label will result in a struct-like variant with named fields: the names will correspond to the labels, or are generated automatically in the form _item{i}_ where i refers to the position of the symbol on the right-hand side. Unit-typed values can also become part of the enum if the symbol is given a label. For example, A:case1 --> a B, where terminal symbol a is of unit type, will result in an enum variant case1(B), whereas A:acase --> a:m B will result in a variant acase{m:(), _item1_:B}. It is recommended that either labels are given to all right-hand side symbols that are to be included in the variant, or to none at all.

A struct is created for any non-terminal symbol that appears on the left-hand side of exactly one production rule, unless the type of that nonterminal is declared to extend another type. You can also force an enum to be created instead of a struct by giving the singleton rule a left-hand side label, in which case the label will name the sole variant of the enum (besides the _Nothing default). This would be required when you know that the type will be extended with other variants, as demonstrated above.

The struct may be empty if all right-hand-side symbols of the single production rule are associated with the unit type and do not have labels. Rustlr will generate code to derive the Debug and Default traits for all structs (this works fine for recursive structs).

The name of the struct is the same as the non-terminal. If any of the grammar symbols on the right-hand side of the rule is given a label, a struct with named fields is created: each field is named by its label, or by _item{i}_ for a symbol that has no label. For example, a nonterminal Ifelse with a singleton rule

Ifelse --> if Expr:condition Expr:truecase else Expr:falsecase

will result in the generation of:

#[derive(Default,Debug)]
pub struct Ifelse {
  pub condition: LBox<Expr>,
  pub truecase: LBox<Expr>,
  pub falsecase: LBox<Expr>,
}

If none of the symbols on the right have labels, rustlr creates a tuple struct. For example, a singleton rule such as whileloop --> while ( expr ) expr will produce a struct whileloop(expr,expr). Be careful to avoid using Rust keywords as the names of non-terminals.

Rustlr also calculates a reachability closure so it is aware of which non-terminals are mutually recursive. It uses this information to determine where smart pointers are required when defining these recursive types. Rustlr always uses its custom LBox smart pointer, which also records line/column information. Notice that the variant ExprList::cons has only the second component in an LBox. One can, for the sake of recording position information, always create an LBox regardless of reachability by giving the component a "boxed label". That is, ExprList:cons --> Expr:[car] SEMICOLON ExprList will generate a variant that also has its first component in an LBox. The reachability relation also determines if a type requires a lifetime parameter.
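The idea of a reachability closure can be sketched as follows. This is not Rustlr's implementation: the graph representation and the names reachability, needs_pointer and calc_graph are assumptions made for illustration. Each type maps to the set of types that appear among its fields, the closure is computed by iterating to a fixed point, and a field needs a smart pointer when its type can reach back to the type being defined.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical representation: each type name maps to the types that
// appear directly among its fields.
type Graph = BTreeMap<&'static str, BTreeSet<&'static str>>;

// Transitive closure by fixed-point iteration: closure[a] contains c
// whenever c can be reached from a through one or more fields.
fn reachability(direct: &Graph) -> Graph {
    let mut closure = direct.clone();
    let mut changed = true;
    while changed {
        changed = false;
        let names: Vec<&'static str> = closure.keys().copied().collect();
        for a in names {
            let current: Vec<&'static str> = closure[a].iter().copied().collect();
            for b in current {
                let via: Vec<&'static str> = closure
                    .get(b)
                    .map(|s| s.iter().copied().collect())
                    .unwrap_or_default();
                for c in via {
                    if closure.get_mut(a).unwrap().insert(c) {
                        changed = true;
                    }
                }
            }
        }
    }
    closure
}

// A field of type `field` inside the definition of `owner` needs a smart
// pointer when `field` can reach back to `owner`.
fn needs_pointer(closure: &Graph, owner: &str, field: &str) -> bool {
    closure.get(field).map_or(false, |s| s.contains(owner))
}

// The two types generated for the calculator grammar. The node "&'lt str"
// stands for the borrowed string carried by the var terminal.
fn calc_graph() -> Graph {
    let mut g = Graph::new();
    g.insert("ExprList", BTreeSet::from(["Expr", "ExprList"]));
    g.insert("Expr", BTreeSet::from(["Expr", "&'lt str"]));
    g
}

fn main() {
    let closure = reachability(&calc_graph());
    // car: Expr cannot reach ExprList, so it is stored directly.
    println!("car boxed: {}", needs_pointer(&closure, "ExprList", "Expr"));
    // cdr: ExprList reaches itself, so it must be behind a pointer.
    println!("cdr boxed: {}", needs_pointer(&closure, "ExprList", "ExprList"));
    // ExprList reaches a borrowed type, so it needs a lifetime parameter.
    println!("lifetime: {}", closure["ExprList"].contains("&'lt str"));
}
```

Under these assumptions the sketch agrees with the generated cons variant shown earlier: car is an unboxed Expr, cdr is boxed, and both types carry the 'lt parameter.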

Although the generated parser code may not be very readable, rustlr also generates semantic actions that create instances of these AST types. For example, the rule Expr:Plus --> Expr + Expr will have semantic action equivalent to one created from:

Expr --> Expr:[a] + Expr:[b] {Plus(a,b)}

Recall from Chapter 2 that a label of the form [a] means that the semantic value associated with the symbol is enclosed in an LBox.

'Passthru'

There are three production rules in the grammar that do not correspond to enum variants: Expr --> UnaryExpr, LetExpr --> Expr and UnaryExpr --> ( LetExpr ). Rustlr recognizes such rules from the following facts:

There is no left-hand side label for the rule.
There is exactly one grammar symbol on the right-hand side that has a non-unit type, and that type is the same as the type of the left-hand side symbol. The other symbols, if any, are of unit type.
There are neither labels nor operator precedence/associativity declarations for the other symbols.

For the rule UnaryExpr --> ( LetExpr ), it therefore infers that the parentheses on the right hand side carry no meaning at the AST level, and thus generates a semantic action for this rule that would be equivalent to:

  UnaryExpr --> ( LetExpr:e ) { e }

We refer to such cases as "pass-thru" cases. If the automatically inferred "meaning" of this rule is not what's desired, it can be altered by using an explicit left-side label: this will generate a separate enum variant (at the cost of an extra LBox) that distinguishes the presence of the parentheses. Note that the rule UnaryExpr:Neg --> - UnaryExpr was not recognized as a pass-thru case by virtue of the left-hand side label Neg. Unlike the parentheses, the minus symbol certainly has meaning beyond the syntactic level. We can also force the minus sign to be included in the AST by giving it an explicit label such as -:minus UnaryExpr. This would create an enum variant that includes a unit type value.
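The three pass-thru conditions can be read as a predicate over a production rule. The following is a sketch under a simplified, hypothetical model of rules (the Sym and Rule structs and all function names are assumptions, not Rustlr's internal representation). The string "()" marks a unit-typed symbol, and both UnaryExpr and LetExpr are given the type Expr because they extend it.

```rust
// Hypothetical, simplified view of a symbol on a rule's right-hand side.
struct Sym {
    ty: &'static str,            // "()" stands for the unit type
    label: Option<&'static str>, // right-hand side label, if any
    has_precedence: bool,        // true if a left/right declaration names it
}

// Hypothetical, simplified view of a production rule.
struct Rule {
    lhs_ty: &'static str,
    lhs_label: Option<&'static str>,
    rhs: Vec<Sym>,
}

fn is_passthru(rule: &Rule) -> bool {
    // Condition 1: no left-hand side label.
    if rule.lhs_label.is_some() {
        return false;
    }
    let (valued, units): (Vec<&Sym>, Vec<&Sym>) =
        rule.rhs.iter().partition(|s| s.ty != "()");
    // Condition 2: exactly one non-unit symbol, with the same type as the
    // left-hand side.
    // Condition 3: the remaining symbols carry no labels and no
    // precedence/associativity declarations.
    valued.len() == 1
        && valued[0].ty == rule.lhs_ty
        && units.iter().all(|s| s.label.is_none() && !s.has_precedence)
}

fn terminal(has_precedence: bool) -> Sym {
    Sym { ty: "()", label: None, has_precedence }
}

fn expr() -> Sym {
    Sym { ty: "Expr", label: None, has_precedence: false }
}

// UnaryExpr --> ( LetExpr )
fn paren_rule() -> Rule {
    Rule { lhs_ty: "Expr", lhs_label: None, rhs: vec![terminal(false), expr(), terminal(false)] }
}

// UnaryExpr --> - UnaryExpr, with or without the label Neg.
// The minus sign has a precedence declaration (left - 400) in the grammar.
fn neg_rule(lhs_label: Option<&'static str>) -> Rule {
    Rule { lhs_ty: "Expr", lhs_label, rhs: vec![terminal(true), expr()] }
}

fn main() {
    println!("( LetExpr ): {}", is_passthru(&paren_rule()));
    println!("Neg, labeled: {}", is_passthru(&neg_rule(Some("Neg"))));
    println!("Neg, unlabeled: {}", is_passthru(&neg_rule(None)));
}
```

Read this way, the third condition suggests that the unary-minus rule would fail the test even without the Neg label, since - carries a precedence declaration in this grammar.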

Flattening Structs

Rustlr provides another way to control the generation of ASTs so that it is not always dependent on the structure of the grammar, although it is not illustrated in the calculator example. When writing a grammar, we sometimes create extra non-terminal symbols and rules for the purpose of organization. As an abstract example:

A --> a Threebs c
Threebs --> b b b

Rustlr will create two tuple structs for these types. Assuming that a, b, c are not of unit type, there will be a struct A(a,Threebs,c) and a struct Threebs(b,b,b). However, it is possible to declare in the grammar, once the non-terminals A and Threebs have been declared, that the type Threebs can be flattened into other structures:

flatten Threebs

This means that the AST for Threebs should be absorbed into other types if possible (multiple nonterminals can be so declared on the same line). This will still create a struct Threebs(b,b,b), but it will create for A: struct A(a,b,b,b,c).
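The two shapes can be compared side by side in a small sketch. The value types of a, b and c are not given in the text, so &'static str, i64 and bool are used as placeholders, and the names NestedA, FlatA and flatten are hypothetical. The generated type would be named A in both cases.

```rust
// Placeholder value types (assumptions): a carries &'static str,
// b carries i64, c carries bool.
#[derive(Debug, PartialEq)]
struct Threebs(i64, i64, i64);

// Shape of A without the flatten declaration.
#[derive(Debug, PartialEq)]
struct NestedA(&'static str, Threebs, bool);

// Shape of A with `flatten Threebs`: the three b fields are absorbed.
#[derive(Debug, PartialEq)]
struct FlatA(&'static str, i64, i64, i64, bool);

// The correspondence between the two shapes, written out by hand.
fn flatten(n: NestedA) -> FlatA {
    let NestedA(a, Threebs(b1, b2, b3), c) = n;
    FlatA(a, b1, b2, b3, c)
}

fn main() {
    let nested = NestedA("a", Threebs(1, 2, 3), true);
    println!("{:?}", flatten(nested));
}
```

The flattened form carries the same information with one less level of nesting to pattern match through.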

Both structs and enums can absorb 'flatten' types. However, there are several enforced rules governing the flattening of types:

Only struct types can be flattened: thus only a nonterminal that has but a single production rule can have its AST absorbed into other types. Enum types can absorb 'flatten' structs but cannot be absorbed into other types.
Types already defined to 'extend' the enum of another type cannot be flattened.
A tuple struct can only absorb the flattened form of another tuple struct. In the above example, if Threebs were a non-tuple struct with named fields (which can be created by giving one of the b's a label), then it could not be absorbed into A.
A boxed-labeled field cannot absorb a 'flatten' type. That is, if the rule for A above was written A --> a:a Threebs:[b] c:c then the AST for A would become pub struct A{a:a, b:LBox<Threebs>, c:c}. This is the only way to prevent the absorption of a 'flatten' type on a case-by-case basis.
Mutually recursive types cannot flatten into each other.
Nested flattening is not currently supported. This is a temporary restriction.

The fifth rule, that mutually recursive types cannot flatten into each other, is rather subtle. Consider the production rules A --> B and B --> A. It is perfectly valid to declare flatten B: this will result in a struct A(LBox<A>), where the LBox is created for the AST of B using reachability calculations. What we cannot have is both flatten A and flatten B: the flattening is only allowed in one direction. Otherwise we would be replacing B with A and A with ... what? One consequence of this restriction is that a type cannot flatten into itself: B --> B would not be valid for flatten B, since B is mutually recursive with itself.

The last restriction is related to the mutual-flattening restriction. However, there are cases where it would be safe to flatten A into B and then flatten B into C. This ability is not currently supported (as of Rustlr 0.3.5).

Importance of Labels

The usage of labels greatly affects how the AST datatype is generated. Labels on the left-hand side of a production rule give names to enum variants. Their presence also cancels "pass-thru" recognition by always generating an enum variant for the rule. A left-hand side label will also prevent a struct from being generated even when a nonterminal has but a single production rule. The absence of labels on the right-hand side leads to the creation of tuple variants or structs. The presence of right-side labels creates structs or struct-variants with named fields. A label on a unit-typed grammar symbol means that the symbol won't be ignored and will be included in the type. If a non-terminal has a single production rule, the lack of any labels left or right leads to the creation of a simpler tuple struct. The use of boxed labels such as [e] forces the semantic value to be wrapped inside an LBox whether or not it is required to define recursive types. Boxed labels also prevent the absorption of 'flatten' types.

Overriding Types and Actions

It is always possible to override the automatically generated types and actions. In case of ExprList, the labels 'nil' and 'cons' are sufficient for rustlr to create a linked-list data structure. However, the right-recursive grammar rule is slightly non-optimal for LR parsing (the parse stack grows until the last element of the list before ExprList-reductions take place). One might wish to use a left-recursive rule and a Rust vector to represent a sequence of expressions. This can be done in several ways, one of which is by making the following changes to the grammar. First, change the declaration of the non-terminal symbol ExprList as follows:

nonterminal ExprList Vec<LBox<Expr<'lt>>>

You probably want to use an LBox even inside a Vec to record the line/column position information. Then replace the two production rules for ExprList with the following:

ExprList --> { vec![] }
ExprList --> ExprList:ev LetExpr:[e] ; { ev.push(e); ev }

When writing your own types and actions alongside automatically generated ones, it's best to examine the types that are generated to determine their correct usage: for example, whether a lifetime parameter is required for Expr.
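The earlier remark about the parse stack can be illustrated with a hand simulation. This is only a sketch, not an LR parser: it assumes each list element has already been reduced to LetExpr, tracks nothing but the symbols on the stack, and uses the hypothetical names max_stack_right and max_stack_left.

```rust
// Right-recursive: ExprList --> LetExpr ; ExprList, with an empty base rule.
// Nothing can be reduced to ExprList until the end of the input is seen.
fn max_stack_right(n: usize) -> usize {
    let mut stack: Vec<&str> = Vec::new();
    let mut deepest = 0;
    for _ in 0..n {
        stack.push("LetExpr");
        stack.push(";");
        deepest = deepest.max(stack.len());
    }
    stack.push("ExprList"); // reduce by the empty rule at the end of input
    deepest = deepest.max(stack.len());
    for _ in 0..n {
        // reduce LetExpr ; ExprList to ExprList
        stack.truncate(stack.len() - 3);
        stack.push("ExprList");
    }
    deepest
}

// Left-recursive: ExprList --> ExprList LetExpr ; with an empty base rule.
// A reduction takes place after every element.
fn max_stack_left(n: usize) -> usize {
    let mut stack: Vec<&str> = vec!["ExprList"]; // reduce by the empty rule first
    let mut deepest = stack.len();
    for _ in 0..n {
        stack.push("LetExpr");
        stack.push(";");
        deepest = deepest.max(stack.len());
        // reduce ExprList LetExpr ; to ExprList
        stack.truncate(stack.len() - 3);
        stack.push("ExprList");
    }
    deepest
}

fn main() {
    for n in [1, 10, 100] {
        println!("n={}: right={}, left={}", n, max_stack_right(n), max_stack_left(n));
    }
}
```

In this simplified model the right-recursive rule needs a stack that grows linearly with the length of the list, while the left-recursive rule stays at a constant depth.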

The presence of a non-empty semantic action will override automatic AST generation. It is also possible to inject custom code into the automatically generated code:

ExprList -->  {println!("starting a new ExprList sequence"); ... }

The ellipsis is allowed only before the closing right-brace and indicates that the automatically generated portion of the semantic action should follow. It cannot appear anywhere else.

An easier way to parse a sequence of expressions separated by ; and to create a vector for it, is to use the special suffixes +, *, ?, <_*> and <_+>. These are described in Chapter 5.

Invoking the Parser

Parsers created from grammars in auto mode must use the parse_with and parse_train_with functions to invoke the parser, as already shown in Chapter 3. Since the above grammar also contains lexer generation directives, all we need to do is to write the procedures that interpret the AST (see main).

   let mut scanner = calcautoparser::calcautolexer::from_str("2*3+1;");
   let mut parser = calcautoparser::make_parser();
   let result = calcautoparser::parse_with(&mut parser, &mut scanner);
   let tree = result.unwrap_or_else(|x|{println!("Parsing errors encountered; results are partial.."); x});
   println!("\nAST: {:?}\n",&tree);
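The closure passed to unwrap_or_else above returns its own argument, which implies that the error side of the result carries a value of the same type as the success side: a partial tree. The pattern can be sketched in isolation as follows. The function parse_numbers is a hypothetical stand-in for a generated parse function, with Vec<i64> playing the role of the AST.

```rust
// Hypothetical stand-in for a parse function: both Ok and Err carry a
// tree, and Err means errors were encountered and the tree is partial.
fn parse_numbers(input: &str) -> Result<Vec<i64>, Vec<i64>> {
    let mut tree = Vec::new();
    let mut errors = false;
    for tok in input.split(';').map(str::trim).filter(|t| !t.is_empty()) {
        match tok.parse::<i64>() {
            Ok(n) => tree.push(n),
            Err(_) => errors = true, // skip the bad item and keep going
        }
    }
    if errors { Err(tree) } else { Ok(tree) }
}

// Same shape as the snippet above: report the problem, keep the partial tree.
fn tree_or_partial(input: &str) -> Vec<i64> {
    parse_numbers(input).unwrap_or_else(|partial| {
        eprintln!("Parsing errors encountered; results are partial..");
        partial
    })
}

fn main() {
    println!("{:?}", tree_or_partial("1; 2; 3;"));
    println!("{:?}", tree_or_partial("1; x; 3;"));
}
```

Either way the caller gets a tree to work with. Code that must distinguish a clean parse from a partial one can match on the Result instead of unwrapping it.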

The parse_with and parse_train_with functions were also backported for grammars with a single valuetype.

Please note that using LBox is already included in all parsers generated with the -genabsyn or -auto option, so do not use !use ... to include it again.