Unit 3: LR Parsers and Syntax-Directed Translation

Syllabus:

LR parsers, SLR, LALR and Canonical LR parser.

Syntax Directed Translations: Syntax directed definitions, construction of syntax trees, syntax directed translation scheme, implementation of syntax directed translation, Intermediate-Code Generation: three address code, quadruples and triples.

Section 1: LR Parsers

1.1 Introduction to LR Parsers

LR parsers are a class of bottom-up parsers that can handle a wide range of context-free grammars. They are particularly powerful due to their ability to parse deterministic context-free languages without backtracking. The "L" in LR stands for "Left-to-right" scanning of the input, while the "R" stands for "Rightmost derivation" in reverse.

LR parsers are capable of handling a larger class of grammars than LL parsers, making them suitable for many programming languages. They work by maintaining a stack of states and using a parsing table to determine the next action based on the current state and the next input symbol.

Types of LR Parsers:

LR parsers are categorized into several types based on their capabilities and the complexity of the parsing tables they use. The most common types include:

SLR (Simple LR): The simplest form of LR parsing, using FOLLOW sets for reductions.
LALR (Look-Ahead LR): A more powerful version that merges states in the SLR parser to reduce the size of the parsing table while maintaining the ability to handle a larger class of grammars.
CLR (Canonical LR): The most powerful LR parser, using lookahead symbols to resolve conflicts but resulting in larger parsing tables.
LR(0): The basic form of LR parsing without lookahead, which can handle a limited set of grammars.
LR(1): An extension of LR(0) that incorporates one lookahead symbol, allowing for more complex grammars. LR parsers are widely used in compiler construction due to their efficiency and ability to handle a large class of grammars. They are particularly effective for programming languages with complex syntax, making them a popular choice for parser generators like Yacc and Bison.

Characteristics of LR Parsers:

Bottom-Up Approach: Builds the parse tree from leaves to root by recognizing handles.
Deterministic: Makes parsing decisions without backtracking.
Shift-Reduce Actions: Based on shifting tokens onto a stack and reducing by grammar rules.
State Machine Driven: Uses a deterministic finite automaton to track parsing states.
Lookahead Utilization: Uses lookahead tokens to guide parsing decisions.

LR Parser Structure:

An LR parser consists of:

Input Buffer: Holds the string to be parsed
Stack: Stores state numbers and grammar symbols
Parsing Table: Contains ACTION and GOTO functions
Driver Program: Controls the overall parsing process

                  +----------------+
                  | Driver Program |
                  +----------------+
                          |
                          v
+--------+        +----------------+        +----------+
| Input  | -----> |  LR Parser     | -----> | Output   |
| Buffer |        |  (State Stack) |        | (Parse   |
+--------+        +----------------+        |  Tree)   |
                          |                 +----------+
                          v
                  +----------------+
                  | Parsing Table  |
                  | (ACTION/GOTO)  |
                  +----------------+

LR Parsing Algorithm:

The general algorithm for LR parsing follows these steps:

Initialize:
- Push state 0 onto the stack
- Get the first input token
Main Parsing Loop:
- Let current state be the top of the stack
- Determine the action for current state and input token
Process Actions:
- SHIFT:
  - Push the current token onto stack
  - Push the next state onto stack
  - Advance to next input token
- REDUCE by production A → β:
  - Pop 2*|β| symbols from stack (states and grammar symbols)
  - Let new current state be the top of stack
  - Push LHS nonterminal (A) onto stack
  - Push the state from GOTO[current state, A] onto stack
- ACCEPT:
  - Parsing completed successfully
  - Exit the algorithm
- ERROR:
  - Report syntax error
  - Attempt error recovery
Repeat the main loop until input is accepted or an unrecoverable error occurs

The algorithm uses the ACTION and GOTO tables to determine parsing decisions:

ACTION[state, terminal] indicates what to do with each terminal symbol
GOTO[state, nonterminal] determines which state to transition to after a reduction

1.2 Construction of LR(0) Items

LR parsing relies on constructing sets of items that represent the parser's state at each point. An LR(0) item is a grammar production with a "dot" marker indicating how much of the production has been seen.

LR(0) Items:

For a production A → XYZ, the possible LR(0) items are:

A → •XYZ (Nothing seen yet)
A → X•YZ (X seen)
A → XY•Z (X and Y seen)
A → XYZ• (Complete production seen)

Closure Operation:

For a set of LR(0) items I, the closure is computed as:

Initially, I is in the closure.
If A → α•Bβ is in the closure and B → γ is a production, add B → •γ to the closure.
Repeat step 2 until no more items can be added.

Example:

If I contains: E → E•+T
The closure is: E → E•+T
                (No new items added as the dot is not before a nonterminal)

If I contains: E → •E+T
The closure is: E → •E+T
                E → •T
                T → •T*F
                T → •F
                F → •(E)
                F → •id

GOTO Operation:

The GOTO function determines transitions between item sets:

GOTO(I, X) = Closure({ A → αX•β | A → α•Xβ is in I })

Example:

If I = {E → •E+T, E → •T, T → •F, F → •id, ...}
Then GOTO(I, id) = {F → id•}

Constructing the LR(0) Automaton:

Compute closure of initial item S' → •S (where S is the start symbol).
Compute GOTO for each grammar symbol from the initial state.
Repeat step 2 for each new state until no new states can be generated.

This automaton forms the basis for constructing LR parsing tables.

1.3 Simple LR (SLR) Parsing

SLR parsing is the simplest form of LR parsing, using FOLLOW sets to determine when to reduce.

Constructing SLR Parsing Tables:

For each state i in the LR(0) automaton:

For ACTION[i,a] where a is a terminal:
- If [A → α•aβ] is in state i, add "shift j" to ACTION[i,a], where j = GOTO(i,a)
- If [A → α•] is in state i (A ≠ S'), add "reduce A → α" to ACTION[i,a] for all a in FOLLOW(A)
- If [S' → S•] is in state i, add "accept" to ACTION[i,$]
For GOTO[i,A] where A is a nonterminal:
- Set GOTO[i,A] = j where j = GOTO(i,A) in the automaton

SLR Parsing Example:

Consider the grammar:

E → E + T | T
T → T * F | F
F → (E) | id

A simplified SLR parsing table might look like:

STATE | ACTION                           | GOTO
      | id    | +     | *     | (     | ) | $ | E | T | F
------+-------+-------+-------+-------+---+---+---+---+---
  0   | s5    |       |       | s4    |   |   | 1 | 2 | 3
  1   |       | s6    |       |       |   |acc|   |   |
  2   |       | r2    | s7    |       |   |r2 |   |   |
  3   |       | r4    | r4    |       |   |r4 |   |   |
  4   | s5    |       |       | s4    |   |   | 8 | 2 | 3
  5   |       | r6    | r6    |       |   |r6 |   |   |
  6   | s5    |       |       | s4    |   |   |   | 9 | 3
  7   | s5    |       |       | s4    |   |   |   |   |10
  8   |       | s6    |       |       |s11|   |   |   |
  9   |       | r1    | s7    |       |   |r1 |   |   |
 10   |       | r3    | r3    |       |   |r3 |   |   |
 11   |       | r5    | r5    |       |   |r5 |   |   |

Where:

s5 means "shift and go to state 5"
r2 means "reduce by production 2"
acc means "accept"

SLR Conflicts:

SLR parsers may encounter conflicts when the parsing table has multiple entries in the same cell:

Shift-Reduce Conflict: When the parser can't decide whether to shift the next input or reduce a production.
Reduce-Reduce Conflict: When the parser can't decide which of two or more productions to reduce.

These conflicts indicate that the grammar is not SLR(1).

1.4 Canonical LR (CLR) Parsing

Canonical LR or LR(1) is the most powerful form of LR parsing, using lookahead symbols to avoid conflicts that occur in simpler forms.

LR(1) Items:

An LR(1) item is an LR(0) item with a lookahead symbol: [A → α•β, a]

The lookahead 'a' indicates that this production should only be reduced when the next input symbol is 'a'.

Construction of LR(1) Automaton:

Similar to LR(0) automaton construction, but with additional tracking of lookahead symbols.

Closure Operation:
- If [A → α•Bβ, a] is in the closure and B → γ is a production, add [B → •γ, b] for each b in FIRST(βa).
- If β can derive ε, also include a in the lookaheads.
GOTO Operation:
- GOTO(I, X) = Closure({ [A → αX•β, a] | [A → α•Xβ, a] is in I })
Automaton Construction:
- Starts with closure({[S' → •S, $]})
- Computes GOTO for each grammar symbol from each state

LR(1) Parsing Table Construction:

For each state i in the LR(1) automaton:

For ACTION[i,a] where a is a terminal:
- If [A → α•aβ, b] is in state i, add "shift j" to ACTION[i,a], where j = GOTO(i,a)
- If [A → α•, a] is in state i (A ≠ S'), add "reduce A → α" to ACTION[i,a]
- If [S' → S•, $] is in state i, add "accept" to ACTION[i,$]
For GOTO[i,A] where A is a nonterminal:
- Set GOTO[i,A] = j where j = GOTO(i,A) in the automaton

CLR vs. SLR:

The main advantage of CLR over SLR is its ability to handle a wider class of grammars without conflicts. However, this comes at the cost of significantly larger parsing tables, as the number of states grows due to the addition of lookahead symbols.

1.5 Lookahead LR (LALR) Parsing

LALR(1) parsing represents a compromise between SLR (simple but less powerful) and CLR (powerful but large tables).

LALR(1) Construction:

Construct the CLR(1) automaton.
Merge states that have the same core (same LR(0) items):
- Two states i and j are merged if they contain the same LR(0) items, regardless of lookaheads.
- The lookaheads in the merged state are the union of the lookaheads from the original states.

LALR Parsing Table Construction:

Follows the same procedure as CLR table construction, but using the merged states.

LALR Parsing Properties:

Table Size: Significantly smaller than CLR, often the same size as SLR.
Parsing Power: Stronger than SLR, but may introduce reduce-reduce conflicts not present in CLR.
Practical Usage: Most commonly used in practical parser generators like Yacc and Bison.

LALR Conflicts:

LALR parsers may have conflicts not present in CLR parsers due to the merging of states:

Original Conflicts: Conflicts already present in CLR (rare).
Induced Conflicts: Conflicts created by merging states with different lookahead contexts.

1.6 Comparison of LR Parsing Methods

LR parsing methods form a hierarchy in terms of parsing power:

Grammar Classes:  LL(1) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1) ⊂ Context-Free Grammars

Comparison of practical characteristics:

Feature	SLR(1)	LALR(1)	CLR/LR(1)
Parsing power	Good	Very Good	Excellent
Table size	Small	Moderate	Very large
Construction time	Fast	Moderate	Slow
Implementation	Simple	Moderate	Complex
Error detection	Good	Good	Excellent
Practical usage	Limited	Widespread	Rare

Most practical parsers use LALR(1) as it strikes a good balance between power and efficiency. Tools like Yacc and Bison typically generate LALR(1) parsers by default.

Section 2: Syntax-Directed Translation

Syntax-Directed Translation (SDT) is a method of associating semantic actions with grammar productions. It allows for the generation of intermediate code, type checking, and other semantic analysis tasks during parsing. It bridges the gap between syntax analysis and semantic analysis in compiler design.

SDT = CFG (Context-Free Grammar) + Semantic Rules

Syntax-Directed Translation (SDT) is a method of translating high-level programming language code into intermediate code or other representations, based on syntax rules (grammar) and associated semantic rules or actions.

Application of SDT:

Intermediate Code Generation: Translating high-level constructs into a lower-level representation.
Type Checking: Ensuring that operations are performed on compatible types.
Error Reporting: Providing meaningful error messages during parsing.
Symbol Table Management: Maintaining information about identifiers, types, and scopes.
Code Optimization: Performing optimizations on the intermediate representation before generating target code.
Code Generation: Generating target machine code or assembly language from the intermediate representation.
Semantic Analysis: Performing checks and transformations based on the semantic rules of the language.
Attribute Evaluation: Computing values associated with grammar symbols during parsing.

SDT can be organized into two main categories:

Syntax-Directed Definitions (SDD): Define the semantics of a grammar using attributes and rules.
Syntax-Directed Translation Schemes (SDTS): Embed semantic actions directly within grammar productions.

2.1 Syntax-Directed Definitions (SDD)

Syntax-Directed Definitions (SDD) are a formal way to define the semantics of a grammar. They associate attributes with grammar symbols and specify how these attributes are computed.

Components of an SDD:

Context-Free Grammar: The underlying grammar G that defines the syntax.
Attributes: Values associated with grammar symbols:
- Synthesized attributes: Computed from child nodes to parent in parse tree.
- Inherited attributes: Passed from parent or siblings to children in parse tree.
Semantic Rules: Functions that compute attribute values based on other attributes.

Types of Attributes:

Synthesized Attributes: Computed from children (used in bottom-up parsing).
Inherited Attributes: Passed from parent/siblings (used in top-down parsing).

Example of an SDD:

For a simple expression grammar with type checking:

Production                 Semantic Rules
--------------------------------------------------
E → E₁ + T                 E.type = if E₁.type = T.type = int 
                                     then int
                                     else error
T → T₁ * F                 T.type = if T₁.type = F.type = int
                                     then int
                                     else error
F → (E)                    F.type = E.type
F → id                     F.type = lookup(id.name).type
F → num                    F.type = int

Dependency Graphs:

A dependency graph shows how attribute values depend on each other:

Nodes are the attributes
Edges indicate dependencies between attributes

Example for E → E₁ + T:

    E.type
     ↑   ↑
     |   |
 E₁.type T.type

Evaluation Order:

An SDD is well-defined if there's no circular dependency in the attributes. The evaluation order can be:

Bottom-up: Suitable for synthesized attributes (post-order traversal).
Top-down: Suitable for inherited attributes (pre-order traversal).
Mixed: Requires dependency analysis to determine order.

2.2 Construction of Syntax Trees

Syntax trees (or abstract syntax trees, ASTs) are tree representations of the syntactic structure of source code. They abstract away unnecessary details from the parse tree, focusing on the essential structure of the program.

AST Node Structure:

Abstract Syntax Tree Node Structure:

An AST node typically contains:

Node type identifier (operator, identifier, constant, etc.)
Pointers to child nodes (typically left and right for binary operations)
Value storage appropriate to the node type, which might include:
- Integer values for integer literals
- Floating-point values for real literals
- String pointers for identifiers and string literals
- Operation codes for operators
- Type information for declarations

AST Construction During Parsing:

For a grammar rule E → E₁ + T:

AST Construction Process:

For a grammar rule like E → E₁ + T, the AST node creation involves:

Create a new expression node of type OPERATOR
Set the operator value to '+'
Set the left child to the AST node from E₁
Set the right child to the AST node from T
Return the new node as the AST representation for E

During parsing, this node construction happens when the corresponding grammar rule is recognized, building the tree bottom-up as the parse progresses.

Example AST for an Expression:

For the expression "a + b * c":

AST vs. Parse Tree:

Parse Tree: Represents the exact syntactic derivation, including all grammar rules.
AST: Represents the essential structure, omitting unnecessary productions.

Example:

Parse Tree:                     AST:
      E                          +
     / \                        / \
    E   T                      a   *
    |  / \                        / \
    T  T   F                     b   c
    |  |   |
    F  F   id(c)
    |  |
   id(a) id(b)

2.3 Syntax-Directed Translation Schemes (SDTS)

Syntax-Directed Translation Schemes (SDTS) are continuations of SDDs, where semantic actions are embedded within grammar productions. They allow for direct code generation or other actions during parsing.

Differences between SDD and SDTS:

SDD (Definition)	SDTS (Scheme)
Semantic rules are written separately from grammar	Semantic actions are embedded within productions
Used for attribute evaluation	Used for direct code generation or actions
Example: E → E₁ + T { E.type = E₁.type }	Example: E → E₁ + T { emit('+') }

Embedded Actions:

Actions (enclosed in {}) are executed when the parser reaches their position during parsing:

E → T { print('+') } '+' E

Types of SDTS:

Postfix SDTS:
- Actions are executed after the production is recognized.
- Suitable for bottom-up parsing.
Prefix SDTS:
- Actions are executed before the production is recognized.
- Suitable for top-down parsing.
Infix SDTS:
- Actions are executed in the middle of the production.
- Suitable for mixed parsing strategies.
Mixed SDTS:
- Combines different action execution strategies.
- Suitable for complex parsing scenarios.

Example SDTS for Expression Translation:

Converting infix expressions to postfix:

E → E₁ + T  { print('+') }
E → T        { /* no action */ }
T → T₁ * F   { print('*') }
T → F        { /* no action */ }
F → (E)      { /* no action */ }
F → id       { print(id.lexeme) }

For "a + b * c", this produces: "a b c * +"

SDTS Implementation Approaches:

SDTS with synthesized attributes: Actions only compute attributes going up the parse tree.
SDTS with inherited attributes: Actions can pass information down the parse tree.
L-attributed SDTS: Restricted form suitable for top-down parsing.
S-attributed SDTS: Only synthesized attributes (suitable for bottom-up parsing).

2.4 Implementation of Syntax-Directed Translation

Implementing SDTS depends on the parsing method being used.

For Top-Down Parsers (Recursive Descent):

Top-Down Parser Implementation of SDTS:

For recursive descent parsers, semantic actions can be embedded directly in the parsing functions:

The parsing function for nonterminal E would:
- Call the parsing function for T
- Check for the + token
- If found, match and consume it
- Call the parsing function for E recursively
- Execute the semantic action for addition (e.g., print "+")
The parsing function for nonterminal T would:
- Call the parsing function for F
- Check for the * token
- If found, match and consume it
- Call the parsing function for T recursively
The parsing function for nonterminal F would:
- Check for the ( token
- If found, match and consume it
- Call the parsing function for E
- Check for the ) token
- If found, match and consume it
- Execute the semantic action for parenthesized expressions

For Bottom-Up Parsers (LR-based):

Actions are executed during reduce steps:

Bottom-Up Parser Implementation of SDTS:

For LR-based parsers, semantic actions are executed during reduction steps:

When reducing by production E → E + T:
- Calculate the value of E by adding the values of E and T
- Associate this computed value with the E symbol being produced
When reducing by production E → T:
- Transfer the value of T directly to E with no calculation
When reducing by production T → T * F:
- Calculate the value of T by multiplying the values of T and F
- Associate this computed value with the T symbol being produced
When reducing by production T → F:
- Transfer the value of F directly to T with no calculation
When reducing by production F → (E):
- Calculate the value of E and associate it with F
- Execute any semantic actions for parenthesized expressions
When reducing by production F → id:
- Retrieve the value of the identifier and associate it with F
- Execute any semantic actions for identifiers
When reducing by production F → num:
- Retrieve the numeric value and associate it with F
- Execute any semantic actions for numeric literals

Parser Generator Implementation:

Tools like Yacc/Bison allow embedding semantic actions directly in grammar specifications. These actions are executed when the corresponding grammar rule is reduced during parsing.

A typical grammar specification in parser generators includes:

Grammar Rules: Define the syntax structure
Semantic Actions: Code blocks executed during reduction
Value Passing: Each symbol can have an associated value ($1, $2, etc.)
Result Assignment: The result of a calculation ($$ variable)
Embedded Output: Print statements or other side effects

For example, a simple arithmetic expression grammar with semantic actions would define:

Addition and multiplication operations that compute values
Value passing from terminals like NUMBER to higher-level expressions
Rules for handling parenthesized expressions

Attribute Grammar Implementation:

For more complex attribute dependencies, attribute evaluation follows a systematic process:

Dependency-Based Traversal:
- Traverse the parse tree in a manner that respects attribute dependencies
- For synthesized attributes, evaluate children before parent nodes
- For inherited attributes, evaluate parent nodes before children
Recursive Attribute Evaluation Process:
- First evaluate attributes for all child nodes
- Then compute attributes for the current node based on:
  - Child node attributes (synthesized)
  - Parent node attributes (inherited)
  - The production rule being applied
Type Checking Example:
- For expression nodes, verify type compatibility of operands
- For binary operations like addition, check if both operands are compatible
- Propagate type errors upward in the tree
- Apply coercion rules when needed

This systematic approach ensures that all attribute values are computed in the correct order based on their dependencies.

2.5 Intermediate Code Generation

Intermediate Code Generation is a compiler phase where high-level source code is translated into a low-level intermediate representation (IR). This IR is machine-independent, making it easier to apply optimizations before final code generation. It serves as a bridge between source code and machine code.

Purpose of Intermediate Code:

Machine Independence: Allows for optimizations and transformations without being tied to a specific target architecture.
Optimization: Enables various optimization techniques to improve performance.
Simplification: Simplifies the translation process by breaking it into smaller, manageable parts.
Portability: Facilitates the generation of code for different target machines from the same intermediate representation.
Debugging: Intermediate code can be used for debugging and analysis before generating final machine code.

Common forms of intermediate code include three-address code, quadruples, and triples.

Three-Address Code:

Three-address code is a linear representation where each instruction contains at most one operator and three addresses (2 operands + 1 result). It is a low-level representation that simplifies the generation of machine code.

Each instruction typically has the form:

x = y op z

Where:

x, y, and z are names, constants, or compiler-generated temporary variables
= is an assignment operator
op is an operator (arithmetic, logical, etc.)

Example: For the expression a + b * c, the three-address code would be:

t1 = b * c
t2 = a + t1

This representation allows for easy optimization and code generation, as each instruction is simple and straightforward.

Types of Three-Address Instructions:

Assignment: x = y op z or x = op y or x = y
Control Flow: goto L, if x goto L, if x relop y goto L
Procedure Calls: param x, call p, n, return y
Indexed Assignments: x = y[i], x[i] = y
Address and Pointer Operations: x = &y, x = *y, *x = y

Translation Scheme for Generating Three-Address Code:

E → E₁ + T    { E.place = new_temp();
                 emit(E.place + " = " + E₁.place + " + " + T.place); }
                 
E → T         { E.place = T.place; }

T → T₁ * F    { T.place = new_temp();
                 emit(T.place + " = " + T₁.place + " * " + F.place); }
                 
T → F         { T.place = F.place; }

F → (E)       { F.place = E.place; }

F → id        { F.place = id.lexeme; }

For "a + b * c", this produces:

t1 = b * c
t2 = a + t1

2.6 Quadruples and Triples

Intermediate code is typically stored in data structures that facilitate optimization and code generation. Two common structures are quadruples and triples.

Quadruples:

Quadruples are a representation of three-address code where each instruction is represented as a tuple of four fields. This structure allows for easy manipulation and optimization.

Each quadruple (or quad) consists of: (op, arg1, arg2, result)

op: The operation to be performed (e.g., +, -, *, /)
arg1: The first operand (could be a variable, constant, or temporary)
arg2: The second operand (could be a variable, constant, or temporary)
result: The location where the result of the operation will be stored (could be a variable or temporary)

Example Representation: For the expression a + b * c, the quadruple representation would be:

t1 = b * c   → (*, b, c, t1)
t2 = a + t1 → (+, a, t1, t2)

Index | Op  | Arg1 | Arg2 | Result
------|-----|------|------|-------
1     | *   | b    | c    | t1
2     | +   | a    | t1   | t2

Where t1 and t2 are temporary variables generated by the compiler.

Triples:

Triples are a more compact representation of three-address code where each instruction is represented as a tuple of three fields. Unlike quadruples, triples do not explicitly store the result.

Each triple consists of: (op, arg1, arg2)

op: The operation to be performed (e.g., +, -, *, /)
arg1: The first operand (could be a variable, constant, or temporary)
arg2: The second operand (could be a variable, constant, or temporary)

Example Representation: For the expression a + b * c, the triple representation would be:

t1 = b * c   → (*, b, c) -> (1)
t2 = a + t1 → (+, a, 1) -> (2)

Index | Op  | Arg1 | Arg2
------+-----|------|------
1     | *   | b    | c
2     | +   | a    | 1

In the above example, the result of b * c is not explicitly stored but can be referenced by its index (1) in subsequent operations.

Comparison of Quadruples and Triples:

Feature	Quadruples	Triples
Structure	Four fields (op, arg1, arg2, result)	Three fields (op, arg1, arg2)
Result Storage	Explicitly stored in result field	Not explicitly stored
Indexing	Each quad has a unique index	Each triple has a unique index
Memory Usage	Slightly larger due to extra field	More compact representation
Access	Direct access to result	Indirect access via index

Quadruples are generally easier to work with for code generation and optimization due to their explicit result storage. Triples are more compact and can save memory but may require additional handling during code generation.