Additional and Informal Notes
on the Theory and Implementation of "Lambdayacc."
Chuck Liang
I first became aware of the existence of bounded right context
grammars while reading Donald Knuth's original paper on LR parsing. I
was already basically convinced that there was no way to describe
unique handles algebraically without regular expressions - which would
be the same as constructing the "viable prefix recognizing"
deterministic finite state machine in LR parsers. It would be
entirely unfair to say that I am trying to challenge Knuth by going
back to BRC grammars. In his paper Knuth gave two examples of LR(0)
grammars that were not BRC. The first of these is similar to the one in
my paper. The second is basically the same as the following example:
S -> (A) | c
A -> (S) | c
Here, A -> c would be the correct handle after an odd number of ('s,
while S -> c would be the handle after an even number of ('s. Clearly
the correct handle cannot be determined without looking back at the
entire contents of the stack, and thus the grammar is not BRC. The BRC(1,0)
equivalent grammar would be:
S -> DA) | c
A -> ES) | c
D -> (
E -> (
The extra non-terminals D and E (for odd and even) capture the state
information necessary for the handle to be uniquely determined by
looking back only one symbol. This example nicely illustrates the
essential difference between BRC and non-BRC LR grammars: for the
latter class it is possible to *derive* a kind of state information
from the grammar, while for the BRC class it is necessary to embed the
state information *inside* the grammar. In Prolog-style logic
programming we are often confronted with the same type of problem.
As opposed to imperative languages, where variables can be set to
different values at different times, logic programming must capture
the state information necessary for computation *as part* of the
logical specification. This usually entails predicates having to
carry a large number of parameters, as well as other techniques that
lead to poor performance unless extra-logical features are used
extensively. The resulting logic program will lose much of its
"declarative" character. Logic programming is best used when it can
avoid carrying state information and instead concentrate on
what it does best: logical deduction.
Thus what I have done can be seen as the application of Knuth's
insight in reverse. By having the grammar encode the state
information, the resulting logic program parser need not be so
burdened. That is, the tradeoff between BRC and general LR grammars
is a positive one in the context of logic programming. Perhaps future
developments in logic programming, such as those based on linear
logic, would allow state information of the kind used in LR parsers to
be specified succinctly. But currently, it is a lot easier to modify
the grammar (consider the computation and representation of the
LR(0) parsing DFA for even the simple grammar above).
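To make the comparison concrete, here is a minimal hand-written sketch of a
shift-reduce parser for the BRC(1,0) grammar above. It is written in Python
purely for illustration (Lambdayacc itself generates lambda Prolog clauses;
the function parse and its stack representation are my own choices here).
The point is that every reduction is decided by the handle plus at most one
stack symbol beneath it, with no parsing DFA at all:

```python
# Illustrative sketch (not Lambdayacc output): a shift-reduce parser for
#   S -> D A )  |  c        D -> (
#   A -> E S )  |  c        E -> (
# Each reduction is picked by the handle and at most ONE stack symbol
# below it -- the defining property of a BRC(1,0) grammar.

def parse(tokens):
    stack = []              # parse stack of grammar symbols
    toks = list(tokens)
    while True:
        if stack[-3:] == ['D', 'A', ')']:
            stack[-3:] = ['S']                    # S -> D A )
        elif stack[-3:] == ['E', 'S', ')']:
            stack[-3:] = ['A']                    # A -> E S )
        elif stack and stack[-1] == 'c':
            below = stack[-2] if len(stack) > 1 else None
            # one symbol of left context picks the handle:
            stack[-1] = 'A' if below == 'D' else 'S'
        elif stack and stack[-1] == '(':
            below = stack[-2] if len(stack) > 1 else None
            # a ( after D starts an A -> E S ); otherwise it starts S -> D A )
            stack[-1] = 'E' if below == 'D' else 'D'
        elif toks:
            stack.append(toks.pop(0))             # shift
        else:
            break
    return stack == ['S']

print(parse('c'))        # True
print(parse('(c)'))      # True:  S -> (A), A -> c
print(parse('((c))'))    # True:  the inner c reduces to S
print(parse('(c))'))     # False
```

The one-symbol lookback in the c and ( cases is exactly the state
information that the extra non-terminals D and E embed in the grammar.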
As a more practical example of this point, in the file "fndefgram.mod",
you will find two productions which may appear to be redundant at first:
tlparen ==> [lparen]
tcomma ==> [comma]
These productions are needed precisely because we use BRC grammars
(actually simple BRC grammars). The parser can parse most of Lambda
Prolog. In particular, Lambda Prolog allows declarations of the
following kinds:
type f,g,h (i -> i) -> i.
pred1 A B :- pred2 A, pred3 B.
etc.
There are three sources of possible ambiguity that this syntax can cause for
a BRC parser:
1. identifiers such as "i" and "pred2" can be parsed as either
"term expression" or "type expression" - the same is true
for application (a b).
2. "," can mean conjuction in a formula, or as a simple separator between
identifiers in a type declaration.
3. expressions enclosed in multiple parentheses may be term or type
expressions.
Without the extra productions for tlparen and tcomma, reduce-reduce
conflicts will result. In an LR parser, these conflicts are
disambiguated by the "type" keyword. The tlparen and tcomma symbols
propagate this information so that a simple BRC parser suffices.
The user may complain that this method entails learning about
characteristics of a certain class of grammars before the parser
generator can be used. But this applies to any parser generator. For
example, the grammar
S -> Aa | Bb
A -> (A) | c
B -> (B) | c
is perfectly unambiguous, but it is not an LR(k) grammar for any k:
deciding whether the innermost c reduces via A -> c or B -> c requires
seeing the final a or b, which may be arbitrarily far ahead. Unless one
understands why, he or she will not be able to use any
yacc-style tool effectively.
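To see the difficulty concretely: A and B generate exactly the same strings,
and only the final a or b reveals which production should have reduced the
innermost c. A tiny illustration in Python, with a hypothetical helper
handle_for_c that I introduce only to make the point:

```python
def handle_for_c(s):
    """For a sentence of  S -> Aa | Bb,  A -> (A) | c,  B -> (B) | c,
    report which production reduces the innermost c.  Only the LAST
    token decides -- and it can be arbitrarily far from the c."""
    assert s[-1] in 'ab'
    return 'A -> c' if s[-1] == 'a' else 'B -> c'

for n in (0, 1, 10):
    w = '(' * n + 'c' + ')' * n + 'b'
    # the deciding b sits n+1 tokens after the c to be reduced,
    # so no fixed lookahead k covers every n
    print(n, handle_for_c(w))
```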
Implementation-Related Notes:
The current implementation of LambdaYacc has the following characteristics,
which will be modified in the future:
No epsilon productions are allowed. This is only an implementation
simplification, and does not reflect any defect in the theoretical
foundations, which encompass epsilon rules. As a consequence of this,
however, the lookahead set is allowed to contain non-terminal symbols,
and the "first" relation is checked dynamically. The closure of the first
relation is computed at generation time and added as atomic clauses to the
parser (via the "gen_first" predicate).
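For concreteness, the following is a rough sketch of what computing the
closure of the first relation amounts to when there are no epsilon
productions. It is in Python purely for illustration; the actual generator
emits the closure as atomic lambda Prolog clauses through gen_first, and the
name first_closure and the pairs-as-set representation are my own choices:

```python
# Sketch: FIRST(X, Y) holds when some X-production begins with Y,
# taken transitively.  With no epsilon productions, only the leading
# symbol of each right-hand side matters.

def first_closure(productions):
    """productions: list of (lhs, rhs) pairs, rhs a non-empty list."""
    first = {(lhs, rhs[0]) for lhs, rhs in productions}
    changed = True
    while changed:                      # naive transitive closure
        changed = False
        for (a, b) in list(first):
            for (c, d) in list(first):
                if b == c and (a, d) not in first:
                    first.add((a, d))
                    changed = True
    return first

grammar = [
    ('S', ['D', 'A', ')']), ('S', ['c']),
    ('A', ['E', 'S', ')']), ('A', ['c']),
    ('D', ['(']),           ('E', ['(']),
]
fc = first_closure(grammar)
print(('S', '(') in fc)   # True: S first-derives D, which derives (
```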
Free or "logic" variables are used for attributes in the grammar.
This, as explained in my paper, simplifies the implementation, but
is a non-logical feature. As a consequence, the "freshcopy" clauses
are used profusely to keep these variables from being bound.
...
Please check back for more!