These notes are from the Week 2 of the Compilers course by Stanford. You may notice that this week also covers Finite Automata, but I left this out on purpose.
This phase transforms the source code file into the tokens of the program. To achieve this it needs to both identify and then classify the elements according to their roles. Internally, the tokens are represented by both their class and then their value. For example, let’s look at the example
foo = 42 which can be divided into tokens:
Lexical Analysis Examples
Some interesting examples that appeared in some programming languages.
In Fortran, the whitespace is insignificant, so
VAR1 is equal to
VA R1. The reason for this rule is when the code was typed into punch cards, it was easy that, by mistake, the operators would add whitespaces to the code.
Another interesting example, in Fortran, is shown below. While example A is a do statement, example B is a variable assignment. Both of the following examples would require the compiler to look ahead and decide the context of each expression since the only difference there is the comma.
DO 5 I = 1,25 // Example A DO 5 I = 1.25 // Example B
In PL/1, keywords are not reserved, so this statement would be perfectly legal:
IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN.
The following example is interesting because there is no way to say whether
DECLARE is a keyword or an array reference. If there is an equal sign right after the closing parenthesis, it would be an array assignment. It is interesting to note that, as the number of arguments is unbounded, it would require an unbounded lookahead.
DECLARE (ARG1, ..., ARGN)
As both Fortran and PL/1 have been designed in the 50s and 60s, not much was known before about what NOT to do in programming languages, but this hasn’t gone away completely today. Let’s look at this C++ example:
Foo<Bar> // Example A cin >> var; // Example B
The problem with that is when we have nested templates (e.g.,
Foo<Bar<Baz>>) the compiler would label it as the stream operator, and as it would be out of context, the compilation would fail. For some time, the only solution to this was to add a space between the closing angle brackets of the template, like
Foo<Bar<Baz> >, ugh!