Compilers – Top-Down Parsing: my study notes

These notes are from the Week 3 of the Compilers course by Stanford.

The parsing phase comes right after the lexing phase is complete with its result as input. The basic difference between the two is that the lexer transforms a string of characters into a string of tokens, and later on, the parser transforms that string of tokens into a parse tree or abstract syntax tree. It’s worth noting that some compilers merge both phases into one.

The parsing rules are equivalent to context-free grammars, which usually are a way to say whether or not a language belongs to a grammar. For example, a part of the CFG of the COOL programming can be expressed as follows:

EXPR -> if EXPR then EXPR else EXPR fi
EXPR -> while EXPR loop EXPR pool
EXPR -> identifier

Derivations

Derivations are a way of producing parse trees from a given string of tokens based on a given CFG. However many derivations might exist, so it is important to choose, in particular, left-most and right-most derivations that are widely used in parsers implementations.

E -> E + E
   | E * E
   | (E)
   | id

Ambiguity

Given the previous grammar, we can say it is ambiguous because it can yield to different parse trees. For example, if we feed the string id * id + id we see two possible parse trees:

Leftmost parse tree
Rightmost parse tree

There are several ways to handle this type of ambiguity. The first one is to rewrite the grammar in a way it generates only one parse tree. The previous grammar can be rewritten like that:

E  -> E' + E
    | E'
E' -> id * E'
    | id
    | (E) * E'
    | (E)

If we feed the same string id * id + id to the previous grammar, we would have the following parse tree:

It’s worth noting that E’ can also generate parenthesized expressions and it allows to, recursively, parse internal pluses. Another example, to show ambiguity is a grammar that generates parse trees for if-then-elses where elses are optional.

E -> if E then E
  |  if E then E else E
  |  OTHER

This grammar is ambiguous because it generates more than one parse tree. The following example illustrates this because the else might be associated with the outermost if or to the innermost if, and both would yield a valid parse tree.

if A then if B then C else D

So the rule to disambiguate the previous example might be to the else match the closest unmatched if and by solving this at the grammar level, we can have:

E   -> MIF
    |  UIF

MIF -> if E then MIF else MIF
    |  OTHER

UIF -> if E then E
    |  if E then MIF else UIF

This grammar works because all unmatched ifs will always be interpreted as a UIF. If we, on the other hand, look at the previous grammar, we’ll see that there are two possibilities of matching for this expression.

We can note that most unambiguous grammars are more complex to read and understand. As this is a problem by itself, we have another way of solving the ambiguity problem, which is choosing a set of criteria to, when an ambiguity happens, choose a non-ambiguity parse tree. One very popular criterion is to declare precedence.

Error Handling

Basically, a compiler has two jobs translate valid programs and detect non-valid programs. There are many possible errors that can be detected by the compiler: lexical, syntax, and semantic. So it can be noted that error handling should be present in many phases of the compilation.

In addition to that, an error handler should have the following properties:

  • Report errors
  • Recover from an error
  • Do not interfere with the performance when compiling valid code

There are mainly two types of error handling used nowadays by the compilers, they are called Panic mode and Error productions and are error detection strategies.

  • Panic mode: This is the simplest and most popular error recovery strategy. The way it works is by discarding input until it finds something that is known, defined by some rule.
  • Error productions: Basically identifies common mistakes and adds those to the language grammar.

Also, it is worth mentioning error correction techniques, which basically try to find a nearby program by trying insertion or deletion of tokens and exhaustive search (for some well-defined boundaries). The downsides are that they make the implementation more complex, slow down the parsing of correct programs, and nearby doesn’t necessarily mean the intended program.

Abstract Syntax Trees

They are a core data structure used by compilers. They are similar to the parse tree but ignoring some details such as abbreviating the one-child nodes and removing parenthesis. This makes them more suitable for the compiler to work since only the essential information is kept.

Recursive Descent

Recursive Descent is a Top-Down parsing algorithm, which means it produces the parse tree from Top to bottom and from left to right. This video takes a walkthrough approach to explain the algorithm:

Recursive Descent Parsing

Here is how this algorithm can be put in code to parse the grammar below:

E -> T | T + E
T -> int | int * T | ( E )
bool term(TOKEN tok) { return *next++ == tok; }

bool E1() { return T(); }
bool E2() { return T() && term(PLUS) && E(); }

bool E() {
    TOKEN *state = next;
    return (next = state, E1()) 
        || (next = state, E2());
}

bool T1() { return term(INT); }
bool T2() { return term(INT) && term(TIMES) && term(INT); }
bool T3() { return term(OPEN) && E() && term(CLOSE); }

bool T() {
    TOKEN *state = next;
    return (next = state, T1()) 
        || (next = state, T2())
        || (next = state, T3());
}

The interesting part of this algorithm is that the backtracking is taking care of by both the functions returning and the state being restored before trying any production.

It is important to note the limitations of this algorithm. If we give the input int * int to the program described above, we would see it being rejected. This happens because the algorithm doesn’t offer a way to go back and try other different productions. The problem is related to left recursion, and there are some techniques to eliminate mechanically this potential problem.

1 thought on “Compilers – Top-Down Parsing: my study notes

Leave a Reply

Your email address will not be published. Required fields are marked *