[TOC]
This file condenses the key compile-and-link concepts (lexing, parsing, automata, object formats, and linking) into a practical reference. Diagrams in res/ illustrate the pipeline and object layouts.
The compilation pipeline transforms source code into an executable image. Typical stages include preprocessing, lexical analysis, parsing, semantic checks, IR generation and optimization, code generation, assembly, and linking.
The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.
Interactions between the lexical analyzer and the parser
Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides the identification of lexemes:
- Stripping out comments and whitespace.
- Correlating error messages generated by the compiler with the source program.
Sometimes, lexical analyzers are divided into a cascade of two processes:
- Scanning consists of simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
- Lexical analysis proper is the more complex portion, which produces tokens from the output of the scanner.
When discussing lexical analysis, we use three related but distinct terms:
- A token is a pair consisting of a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Definitions of operations on languages:
| OPERATION | DEFINITION AND NOTATION |
|---|---|
| Union of $L$ and $M$ | $L \cup M = \{s \mid s \in L \text{ or } s \in M\}$ |
| Concatenation of $L$ and $M$ | $LM = \{st \mid s \in L \text{ and } t \in M\}$ |
| Kleene closure of $L$ | $L^{*} = \bigcup_{i=0}^{\infty} L^{i}$ |
| Positive closure of $L$ | $L^{+} = \bigcup_{i=1}^{\infty} L^{i}$ |
BASIS: There are two rules that form the basis:
- $\epsilon$ is a regular expression, and $L(\epsilon)$ is $\{\epsilon\}$, that is, the language whose sole member is the empty string.
- If $a$ is a symbol in $\Sigma$, then $a$ is a regular expression, and $L(a) = \{a\}$, that is, the language with one string, of length one, with $a$ in its one position.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose $r$ and $s$ are regular expressions denoting languages $L(r)$ and $L(s)$, respectively.

- $(r)|(s)$ is a regular expression denoting the language $L(r) \cup L(s)$.
- $(r)(s)$ is a regular expression denoting the language $L(r)L(s)$.
- $(r)^{*}$ is a regular expression denoting $(L(r))^{*}$.
- $(r)$ is a regular expression denoting $L(r)$.
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:
- The unary operator $*$ has the highest precedence and is left associative.
- Concatenation has the second-highest precedence and is left associative.
- $|$ has the lowest precedence and is left associative.
Algebraic laws for regular expressions:
| LAW | DESCRIPTION |
|---|---|
| $r \mid s = s \mid r$ | $\mid$ is commutative |
| $r \mid (s \mid t) = (r \mid s) \mid t$ | $\mid$ is associative |
| $r(st) = (rs)t$ | Concatenation is associative |
| $r(s \mid t) = rs \mid rt$; $(s \mid t)r = sr \mid tr$ | Concatenation distributes over $\mid$ |
| $\varepsilon r = r\varepsilon = r$ | $\varepsilon$ is the identity for concatenation |
| $r^{*} = (r \mid \varepsilon)^{*}$ | $\varepsilon$ is guaranteed in a closure |
| $r^{**} = r^{*}$ | $*$ is idempotent |
A regular definition is a sequence of definitions of the form:

$$
d_1 \rightarrow r_1 \\
d_2 \rightarrow r_2 \\
\cdots \\
d_n \rightarrow r_n
$$

where:

- Each $d_i$ is a new symbol, not in $\Sigma$ and not the same as any other of the $d$'s, and
- Each $r_i$ is a regular expression over the alphabet $\Sigma \cup \{d_1, d_2, \cdots, d_{i-1}\}$.
A nondeterministic finite automaton (NFA) consists of:

- A finite set of states $S$.
- A set of input symbols $\Sigma$, the input alphabet. We assume that $\epsilon$, which stands for the empty string, is never a member of $\Sigma$.
- A transition function that gives, for each state and each symbol in $\Sigma \cup \{\epsilon\}$, a set of next states.
- A state $s_0$ from $S$ that is distinguished as the start state (or initial state).
- A set of states $F$, a subset of $S$, that is distinguished as the accepting states (or final states).
A deterministic finite automaton (DFA) is a special case of an NFA where:

- There are no moves on input $\epsilon$, and
- For each state $s$ and input symbol $a$, there is exactly one edge out of $s$ labeled $a$.
Algorithm 3.18: Simulating a DFA.
INPUT: An input string $x$ and a DFA $D$ with start state $s_0$, accepting states $F$, and transition function $move$.

OUTPUT: Answer "yes" if $D$ accepts $x$; "no" otherwise.

METHOD: Apply the algorithm in Fig. 3.27 to the input string $x$.
The general idea behind the subset constructions is that each state of the constructed DFA corresponds to a set of NFA states.
Algorithm 3.20: The subset construction of a DFA from an NFA.
INPUT: An NFA $N$.

OUTPUT: A DFA $D$ accepting the same language as $N$.

METHOD: Our algorithm constructs a transition table Dtran for $D$. Each state of $D$ is a set of NFA states, constructed so that $D$ simulates "in parallel" all possible moves $N$ can make on a given input string.
Operations on NFA states:
| OPERATION | DESCRIPTION |
|---|---|
| $\epsilon\text{-}closure(s)$ | Set of NFA states reachable from NFA state $s$ on $\epsilon$-transitions alone |
| $\epsilon\text{-}closure(T)$ | Set of NFA states reachable from some NFA state $s$ in set $T$ on $\epsilon$-transitions alone; $= \bigcup_{s \in T} \epsilon\text{-}closure(s)$ |
| $move(T, a)$ | Set of NFA states to which there is a transition on input symbol $a$ from some state $s$ in $T$ |
Algorithm 3.22: Simulating an NFA.
INPUT: An input string $x$ and an NFA $N$ with start state $s_0$, accepting states $F$, and transition function $move$.

OUTPUT: Answer "yes" if $N$ accepts $x$; "no" otherwise.

METHOD: The algorithm keeps a set of current states $S$, those that are reached from $s_0$ along a path labeled by the input read so far. It initializes $S$ to $\epsilon\text{-}closure(s_0)$, replaces $S$ by $\epsilon\text{-}closure(move(S, c))$ for each input symbol $c$, and answers "yes" at the end of the input iff $S \cap F \neq \emptyset$.
Algorithm 3.36: Construction of a DFA from a regular expression $r$.

INPUT: A regular expression $r$.

OUTPUT: A DFA $D$ that recognizes $L(r)$.

METHOD:

- Construct a syntax tree $T$ from the augmented regular expression $(r)\#$.
- Compute nullable, firstpos, lastpos, and followpos for $T$, using the methods of Sections 3.9.3 and 3.9.4.
- Construct Dstates, the set of states of DFA $D$, and Dtran, the transition function for $D$, by the procedure of Fig. 3.62.
A Lex program is turned into a transition table and actions, which are used by a finite-automaton simulator. These components are:

- A transition table for the automaton.
- Those functions that are passed directly through Lex to the output.
- The actions from the input program, which appear as fragments of code to be invoked at the appropriate time by the automaton simulator.
Linking is the process of collecting and combining various pieces of code and data into a single file that can be loaded (copied) into memory and executed.
Given this notion of strong and weak symbols, Linux linkers use the following rules for dealing with duplicate symbol names:
- Rule 1. Multiple strong symbols with the same name are not allowed.
- Rule 2. Given a strong symbol and multiple weak symbols with the same name, choose the strong symbol.
- Rule 3. Given multiple weak symbols with the same name, choose any of the weak symbols.
Compilers and assemblers generate relocatable object files (including shared object files). Linkers generate executable object files. Technically, an object module is a sequence of bytes, and an object file is an object module stored on disk in a file.
- ELF header: Begins with a 16-byte sequence that describes the word size and byte ordering of the system that generated the file.
- `.text`: The machine code of the compiled program.
- `.rodata`: Read-only data such as the format strings in printf statements, and jump tables for switch statements.
- `.data`: Initialized global and static C variables. Local C variables are maintained at run time on the stack and do not appear in either the `.data` or `.bss` sections.
- `.bss`: Uninitialized global and static C variables, along with any global or static variables that are initialized to zero.
- `.symtab`: A symbol table with information about functions and global variables that are defined and referenced in the program.
- `.rel.text`: A list of locations in the `.text` section that will need to be modified when the linker combines this object file with others.
- `.rel.data`: Relocation information for any global variables that are referenced or defined by the module.
- `.debug`: A debugging symbol table with entries for local variables and typedefs defined in the program, global variables defined and referenced in the program, and the original C source file. It is present only if the compiler driver is invoked with the `-g` option.
- `.line`: A mapping between line numbers in the original C source program and machine code instructions in the `.text` section. It is present only if the compiler driver is invoked with the `-g` option.
- `.strtab`: A string table for the symbol tables in the `.symtab` and `.debug` sections and for the section names in the section headers.
Static linkers are invoked by compiler drivers such as GCC. They combine multiple relocatable object files into a single executable object file. Multiple object files can define the same symbol, and the rules that linkers use for silently resolving these multiple definitions can introduce subtle bugs in user programs.
Shared libraries are modern innovations that address the disadvantages of static libraries. A shared library is an object module that, at either run time or load time, can be loaded at an arbitrary memory address and linked with a program in memory. This process is known as dynamic linking and is performed by a program called a dynamic linker.
The dynamic linker then finishes the linking task by performing the following relocations:
- Relocating the text and data of `libc.so` into some memory segment.
- Relocating the text and data of `libvector.so` into another memory segment.
- Relocating any references in `prog2l` to symbols defined by `libc.so` and `libvector.so`.
[1] Randal E. Bryant, David R. O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed.






