
Overview of a Compiler

Published Sep 30, 2019 · Last updated Mar 27, 2020

What is a compiler?

A compiler is a translator.  Thus, it (almost) necessarily deals with two different languages: an input language and an output language.  The internals of a compiler mirror this duality: there is a front-end that consumes the input language, and there is a back-end that emits the output language.

The front-end has a parser, while the back-end has a code generator.

The two are connected via a data structure, very generically called an IDS, standing for Intermediate or Internal Data Structure.  This is usually a tree, often an Abstract Syntax Tree (AST), though sometimes the similar but much more verbose Parse Tree, aka Concrete Syntax Tree (CST).  The parser reads the input and constructs an IDS, and the code generator reads the IDS and emits assembly language, machine code, or something else (such as C).
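
To make that shape concrete, here is a minimal sketch of the pipeline in C (the names Ast, parse, generate, and compile are hypothetical, not from any particular compiler):

    #include <stdio.h>

    typedef struct Ast Ast;                      /* the IDS: usually a tree (see below) */

    Ast *parse(const char *source);              /* front-end: text -> IDS */
    void generate(const Ast *tree, FILE *out);   /* back-end: IDS -> output */

    void compile(const char *source, FILE *out) {
        Ast *tree = parse(source);               /* recognize the input language */
        generate(tree, out);                     /* emit the output language */
    }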


We can also equate a parser to a deserializer, and a code generator to a serializer.

A serializer takes an object graph and deflates it into a file format that can be taken as an ordered series of (bits or) bytes.  A deserializer does the reverse: it takes an ordered series of bytes and "inflates" that into an object graph.


A lint checker is half of a compiler: all it outputs is warnings or error messages, not a translation.  A pretty-printer is a tool that works like a compiler but outputs the same language as the input — depending on the language, this may require as much complexity as a compiler, or much less.


The Parser

A parser, in its essence, is a recognizer.  It is a recognizer of language (language syntax) that can take linguistic components and assemble them into syntactic constructs of the language.  In programming languages, the components being assembled are characters, and they are assembled into expressions, declarations, statements, etc.  Typically the language is specified by a grammar.  The grammar tells us what the valid constructs of the language are, and allows the parser to recognize them.
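
To sketch how a grammar drives a parser, suppose a hypothetical rule Expression : Term (('+' | '-') Term)*.  A recursive-descent parsing function mirrors the rule almost literally (peek, advance, parseTerm, and makeBinary are assumed helpers, not standard functions):

    int peek(void);                              /* assumed: look at the next character */
    int advance(void);                           /* assumed: consume and return it */
    Ast *parseTerm(void);                        /* assumed: parses one Term */
    Ast *makeBinary(int op, Ast *l, Ast *r);     /* assumed: builds a tree node */

    Ast *parseExpression(void) {
        Ast *left = parseTerm();
        while (peek() == '+' || peek() == '-') {
            int op = advance();                  /* consume the operator */
            left = makeBinary(op, left, parseTerm());
        }
        return left;
    }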

Language semantics are not really part of parsing: once the input is recognized as syntax of the input language, it becomes subject to the semantics of the input language.

Error Handling

There are many approaches to error handling.  At one extreme, any error prevents translation, since the input (program) is invalid.  However, this is not terribly graceful, so a number of approaches allow the initial phases of translation to proceed so that additional errors can be reported, rather than stopping at the first one.  Also, the quality of error messages can vary dramatically, from a bare "syntax error" to something more meaningful that directs the user toward a possible solution.
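
For example, one common approach is panic-mode recovery: report the error, then skip ahead to a likely statement boundary so that parsing (and error reporting) can continue.  A minimal sketch, reusing the assumed peek/advance helpers (reportError is likewise assumed):

    void reportError(const char *fmt, ...);      /* assumed diagnostic routine */

    void recoverToStatementEnd(void) {
        reportError("expected ';'");             /* more useful than a bare "syntax error" */
        while (peek() != ';' && peek() != EOF)
            advance();                           /* resynchronize at a statement boundary */
    }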

What is a scanner or lexer and why use it?

A lexer or scanner is used to assist the parser in recognizing smaller components of the language.

To be clear, a parser does not need a lexer: the parser can consume text directly.

However, a serious language will offer, arguably, somewhat overlapping constructs: there is the main language itself, and then there are other, secondary features like whitespace, comments, and conditional compilation (C's #ifdef, and C#'s #if, for example).  These other features are almost like having a language within a language.  They get in the way of, and complicate parsing of, the main language — here's where the lexer shines: it consumes (parses) these secondary constructs, allowing the parser to focus on the main language alone.  Keeping the lexer separated from the parser can simplify both.

A lexer typically outputs some notion of tokens, an intermediate representation of all possible inputs, condensed — however, this is not strictly necessary, and (in my humble opinion) it is an unnecessary intermediate mapping.  When I write a scanner/parser, I have the scanner handle whitespace, comments, #ifdefs, identifiers, string literals, and numeric constants, and otherwise pass characters like +, -, ., etc. directly to the parser for interpretation, without mapping them to tokens.
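
A sketch of a scanner in that style, which consumes whitespace and (for illustration) C-style // comments itself, and passes every other character straight through (identifiers, literals, and #ifdefs are omitted to keep it short):

    int scan(FILE *in) {
        for (;;) {
            int c = fgetc(in);
            if (c == ' ' || c == '\t' || c == '\n')
                continue;                        /* consume whitespace */
            if (c == '/') {
                int d = fgetc(in);
                if (d == '/') {                  /* consume a // comment */
                    while ((c = fgetc(in)) != '\n' && c != EOF)
                        ;
                    continue;
                }
                ungetc(d, in);                   /* not a comment: put it back */
            }
            return c;                            /* pass +, -, ., etc. to the parser */
        }
    }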

The IDS

As mentioned before, an IDS is usually some kind of object graph: normally a tree data structure, such as an AST, that captures the syntactic recognition of an input program in the input language.

For example, a = b + c is represented in an AST as:

      =
    /    \
   a      +
        /   \
       b     c
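
One hypothetical node shape for such a tree, along with code that builds the example above:

    struct Ast {
        int op;                                  /* '=', '+', etc., or 0 for a leaf */
        const char *name;                        /* identifier name, when a leaf */
        Ast *left, *right;
    };

    Ast *leaf(const char *name);                 /* assumed leaf constructor */

    Ast *example(void) {                         /* builds the tree for a = b + c */
        return makeBinary('=', leaf("a"),
                          makeBinary('+', leaf("b"), leaf("c")));
    }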

Semantic Analysis

Once the program is parsed, it is legal by the syntax of the language (its grammar), but that doesn't necessarily mean the program is valid.  Semantic analysis determines other aspects of program legality, requirements that are outside of the grammar, such as requiring a variable to be defined or declared before it is used.  Semantic analysis will use the IDS to ensure such requirements are observed.
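
A minimal sketch of a declared-before-use check, walking the IDS and consulting a symbol table (SymbolTable and lookup are assumed):

    typedef struct SymbolTable SymbolTable;      /* assumed scope structure */
    int lookup(SymbolTable *s, const char *name);/* assumed: non-zero if declared */

    void checkDeclared(const Ast *n, SymbolTable *scope) {
        if (n == NULL)
            return;
        if (n->op == 0 && !lookup(scope, n->name))
            reportError("'%s' used before declaration", n->name);
        checkDeclared(n->left, scope);
        checkDeclared(n->right, scope);
    }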

The Code Generator

A code generator walks the IDS and generates assembly language or machine code instructions to carry out the semantics of the input program.  This requires a mapping of semantic constructs in the input language to things in the output language that will accomplish the intent — with precision.  Fundamentally, the code generator will analyze the AST to determine what operations in machine code will compute the desired statements and expressions captured in the AST.
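
For a flavor of that walk, here is a sketch that evaluates an expression subtree for its value, post-order, targeting an imaginary stack machine (emit is an assumed output helper; the instruction names are illustrative only):

    void emit(const char *fmt, ...);             /* assumed: writes one instruction */

    void genValue(const Ast *n) {
        if (n->op == 0) {                        /* leaf: load the variable's value */
            emit("load %s", n->name);
            return;
        }
        genValue(n->left);                       /* children first (post-order) */
        genValue(n->right);
        emit(n->op == '+' ? "add" : "sub");      /* operator maps to an instruction */
    }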

Evaluation Context: LHS, RHS, Branch

As we walk the IDS to generate machine code, there are several different evaluation contexts; two of them are left hand side and right hand side.

Left Hand Side vs. Right Hand Side

For example, in a = b + c, we evaluate the values of b and c, add them together, and then store the result in a.  Here b and c have the context that we call right hand side — b and c are on the right hand side (of an assignment) — whereas a has left hand side context: a is on the left hand side of the assignment.  We need the values of b and c, yet the location of a, in order to store the result.

A complex left hand side expression will itself also involve some right hand side evaluation.  For example, a[i] = 5 requires evaluating a and i as values (as if on the right hand side), and only the array indexing itself is evaluated as left hand side (for a storage location).  5, of course, is understood as right hand side.
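
Continuing the imaginary stack-machine sketch, an assignment evaluates its right hand side for a value and its left hand side for a location (simplified here to just simple variables and array indexing):

    void genAssign(const Ast *n) {
        genValue(n->right);                      /* right hand side: a value */
        if (n->left->op == 0) {
            emit("store %s", n->left->name);     /* simple variable: a known location */
        } else {                                 /* a[i]: a and i as values ... */
            genValue(n->left->left);
            genValue(n->left->right);
            emit("storeindex");                  /* ... the indexing as a location */
        }
    }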

Branch Evaluation

The other evaluation context is branching, for control flow.  In an if-statement, for example, we want to know whether to execute the then-part or skip it.  Thus, what we really want to accomplish is a possible change in the flow of control.  The naive way to do this is to evaluate the conditional expression in an if-statement for its value (i.e. using right hand side evaluation), and then execute the then-part if the value is true (or non-zero, depending on the language) and skip the then-part if the value is false (or zero).  While this does work, it generates suboptimal code even for very simple expressions.  For example, in if ( a == b ) ... the naive approach would compare a with b, generate a boolean true/false value, and then test and branch on that boolean value.  However, in most machine code languages, we can branch directly on the result of the compare, omitting the creation of the intermediate boolean value.  So, when we evaluate a conditional expression for branching, we provide the evaluator with a branch target rather than asking it for a value.
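
A sketch of branch-context evaluation, where the caller supplies the branch target instead of asking for a boolean value (EQ is a hypothetical operator code; cmp, bne, and bzero are illustrative instructions):

    enum { EQ = 256 };                           /* hypothetical operator code for == */

    void genBranchIfFalse(const Ast *cond, const char *target) {
        if (cond->op == EQ) {                    /* a == b: branch on the compare itself */
            genValue(cond->left);
            genValue(cond->right);
            emit("cmp");
            emit("bne %s", target);              /* no intermediate boolean value */
        } else {
            genValue(cond);                      /* fall back to value evaluation */
            emit("bzero %s", target);            /* branch if that value is false/zero */
        }
    }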

Optimization

Once the intermediate data structure is constructed, we can analyze it for potential optimization.  Classic optimizations include common sub-expression elimination, loop-invariant code motion, and induction-variable strength reduction.  Some compilers choose to do such optimization after generating code, because generating code exposes more opportunities for these optimizations.  For example, in A[i] = B[i], the compiler may expand this to:
    t1 = i*4
    t2 = B[t1]
    t3 = i*4
    A[t3] = t2
As you can see, there is a common sub-expression opportunity: reusing t1 instead of creating a second i*4 in t3.  Some compilers lower the AST to another intermediate representation (called lowering, as this other intermediate is more verbose and closer to the machine code).  Some also choose to do optimization both on the AST and in the intermediate, as well as during and after machine code generation.
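
To make the payoff concrete, after common sub-expression elimination that expansion shrinks to:

    t1 = i*4
    t2 = B[t1]
    A[t1] = t2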

Control Flow

Many compilers translate from some kind of structured programming language into assembly language or machine code.  Machine code (and assembly language) uses an if-goto style of control flow instead of structured statements.  A conversion is required that involves negating conditionals and managing branches and branch target labels.  While structured programming uses constructs like "if this then do that", assembly language (and machine code) uses constructs like "if not this, then skip around that".  See my other article Conditional Logic in Assembly Language for more information on this subject.
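
Sketched in C itself (which conveniently has goto), the structured statement if (a == b) { doThat(); } becomes, conceptually:

        if (a != b) goto skip;                   /* negated conditional */
        doThat();
    skip: ;                                      /* branch target label */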

Machine Type

Generating code differs quite a bit depending on the machine type: whether Accumulator Machine, Stack Machine, or Register Machine.
See my other article Classifications of Instruction Set Architectures for a better description of these machine types.


This article is Copyright (c) 2019, Erik L. Eidt
