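When a CPU executes a sequence of instructions (written in a low-level language called machine language), the flow is usually as follows: it fetches an instruction from memory, decodes it to determine the operation type and the operands, and executes it. It then fetches the next instruction, decodes and executes it, and repeats this cycle over and over.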
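The fetch, decode, execute cycle can be sketched with a toy interpreter. This is only an illustration: the four-instruction set (`LOAD`, `ADD`, `PRINT`, `HALT`), the single accumulator, and the function name `run` are all assumptions made for the example, not a model of any real CPU.

```python
# A toy machine: each instruction is a tuple (opcode, operand...).
# The instruction set below is hypothetical and exists only for this sketch.

def run(program):
    """Run `program` (a list of instructions) and return the printed values."""
    acc = 0        # a single accumulator register
    pc = 0         # program counter: index of the next instruction
    output = []
    while True:
        instruction = program[pc]          # fetch
        pc += 1
        opcode, *operands = instruction    # decode: operation type and operands
        if opcode == "LOAD":               # execute
            acc = operands[0]
        elif opcode == "ADD":
            acc += operands[0]
        elif opcode == "PRINT":
            output.append(acc)
        elif opcode == "HALT":
            return output
        else:
            raise ValueError(f"unknown opcode: {opcode}")


program = [("LOAD", 40), ("ADD", 2), ("PRINT",), ("HALT",)]
result = run(program)
```

A real CPU decodes bit patterns rather than tuples, but the loop has the same shape: fetch, decode, execute, repeat.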
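The basic process of turning plain-text source code into instructions is shown in the figure below.
The Object File in the figure above is a binary file that the machine can recognize. After the necessary linking, it becomes a program the CPU can execute. Once a process pointing to this code segment is created, the program can be loaded into memory and run under the operating system.
Source code is turned into object code by compilation. Wikipedia defines a compiler as follows: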
A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). More generally, compilers are a specific type of translators.
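Compilation can be roughly divided into three steps: code analysis, optimization, and translation into assembly or machine instructions.
Code analysis checks whether the program is written correctly. It consists of lexical analysis, syntax analysis, and semantic analysis.
Lexical analysis identifies the words (lexemes) in the source text and the category of each, and is usually implemented with a finite state machine. For example, the finite state machine in the figure below can recognize a number (an integer part followed by an optional fractional part):
If the input is "42.15", the output is <NUMBER, 42.15>, where NUMBER is the token type, indicating that this is a number, and 42.15 is the corresponding value.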
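A minimal sketch of such a finite state machine follows. The state names (`START`, `INT`, `DOT`, `FRAC`) and the function name `scan_number` are assumptions chosen for this example. It also assumes that a trailing dot with no digits after it, as in "42.", is rejected.

```python
DIGITS = "0123456789"

def scan_number(text):
    """Return ("NUMBER", lexeme) if `text` is an integer part followed by an
    optional fractional part; otherwise return None."""
    state = "START"
    for ch in text:
        if state in ("START", "INT") and ch in DIGITS:
            state = "INT"        # reading the integer part
        elif state == "INT" and ch == ".":
            state = "DOT"        # saw the decimal point, need a digit next
        elif state in ("DOT", "FRAC") and ch in DIGITS:
            state = "FRAC"       # reading the fractional part
        else:
            return None          # no transition defined: reject
    # Only INT and FRAC are accepting states.
    if state in ("INT", "FRAC"):
        return ("NUMBER", text)
    return None


token = scan_number("42.15")
```

A full lexer would run several such machines over the input and emit a stream of tokens; this one decides a single lexeme.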
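Syntax analysis checks whether the structure of a statement is correct, catching errors such as int "hello world" or mismatched parentheses. This is usually checked with a context-free grammar. Context-free grammars are expressive enough to describe the syntax of most programming languages; in fact, the syntax of almost every programming language is defined by one. At the same time, they are simple enough that efficient parsing algorithms can be built to test whether a given string is generated by a given context-free grammar.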
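To make the parenthesis example concrete, here is a small recursive-descent recognizer for the context-free grammar S -> "(" S ")" S | empty, which generates exactly the balanced parenthesis strings. The grammar and the function name `is_balanced` are illustrative assumptions; real parsers handle far larger grammars, usually with generated tables or hand-written descent routines.

```python
def is_balanced(text):
    """Check whether `text` can be derived from S -> "(" S ")" S | empty."""
    pos = 0  # index of the next unread character

    def parse_s():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            # Production S -> "(" S ")" S
            pos += 1
            parse_s()
            if pos >= len(text) or text[pos] != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1
            parse_s()
        # Otherwise use S -> empty and consume nothing.

    try:
        parse_s()
    except SyntaxError:
        return False
    # The whole input must have been consumed.
    return pos == len(text)
```

For instance, `is_balanced("(()())")` is accepted, while `is_balanced("(()")` and `is_balanced(")(")` are rejected.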
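Semantic analysis uses the surrounding context to decide whether a piece of code contains errors. Consider this C fragment, for example: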
int arr[2],b;
b = arr * 10;
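The source program is structurally correct. Semantic analysis checks the types and reports the errors: an array variable cannot be used as an operand in this expression, and the types of the right-hand and left-hand sides of the assignment do not match.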
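A rough sketch of this kind of check is shown below. It assumes a hypothetical tuple-based AST, a symbol table that maps names to type strings such as "int" and "int[2]", and a deliberately simplified rule that `*` accepts only `int` operands. Real C type checking is considerably more involved.

```python
def type_of(expr, symbols):
    """Return the type of `expr`, or raise TypeError on a semantic error."""
    kind = expr[0]
    if kind == "num":
        return "int"
    if kind == "var":
        name = expr[1]
        if name not in symbols:
            raise TypeError(f"undeclared variable '{name}'")
        return symbols[name]
    if kind == "mul":
        left = type_of(expr[1], symbols)
        right = type_of(expr[2], symbols)
        # Simplified rule for this sketch: both operands must be int.
        if left != "int" or right != "int":
            raise TypeError(f"invalid operands to *: {left} and {right}")
        return "int"
    raise ValueError(f"unknown expression kind: {kind}")


def check_assign(target, expr, symbols):
    """Return a list of error messages for the statement `target = expr`."""
    if target not in symbols:
        return [f"undeclared variable '{target}'"]
    try:
        value_type = type_of(expr, symbols)
    except TypeError as err:
        return [str(err)]
    if symbols[target] != value_type:
        return [f"cannot assign {value_type} to {symbols[target]}"]
    return []


# Symbol table for: int arr[2], b;
symbols = {"arr": "int[2]", "b": "int"}
# AST for: b = arr * 10;
errors = check_assign("b", ("mul", ("var", "arr"), ("num", 10)), symbols)
```

Here `errors` holds one message about the invalid operands, whereas the same check on `b = b * 10` returns an empty list.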
The basic process of writing a compiler is as follows:
- Parsing. Convert the text of the program into an AST (abstract syntax tree). For example, "z = x+y" is an ASSIGN-type STATEMENT whose left-hand side is the VARIABLE z, whose op is =, and whose right-hand side is a BINARY-OP-type EXPRESSION whose op is + and whose arguments are the VARIABLE x and the VARIABLE y. The capitalized words are the names for various types of syntax.
- Elaboration. Just simplifying the code a bit. Examples: expand "x++" into "x+=1", convert while loops into for loops, expand out typedefs.
- Static checking. Make sure every function returns, variables aren't used uninitialized, the types are OK, etc.
- More elaboration. Convert booleans to integers, "int x = 3" to "int x; x=3", expand out structs to offsets from pointers.
- Convert to an Intermediate Representation (IR). This is supposed to be simpler than the source language, but more expressive than Assembly Language. An IR is useful because you can reuse the same IR for multiple source languages and multiple assembly languages.
- Apply compiler optimizations to the IR. Constant folding, constant propagation, dead code elimination, common subexpression elimination ...
- Do register allocation (programmers can use infinitely many variables, but your CPU only has some fixed number of registers. You can put variables on the stack or in memory, but that's slow).
- Expand out your IR into assembly language.
- Set up stack frames, etc.
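Two of the steps above, constant folding and expansion into low-level instructions, can be sketched on a tiny expression tree. Everything here is an assumption made for illustration: the tuple-based AST, the function names, and the stack-machine mnemonics (`PUSH`, `LOAD`, `ADD`, `SUB`, `MUL`, `STORE`). Targeting a stack machine also sidesteps register allocation entirely, which a real backend cannot do.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL"}


def fold(expr):
    """Constant folding: evaluate sub-expressions whose operands are constants."""
    if expr[0] != "binop":
        return expr                          # ("num", n) or ("var", name)
    _, op, left, right = expr
    left, right = fold(left), fold(right)
    if left[0] == "num" and right[0] == "num":
        return ("num", OPS[op](left[1], right[1]))
    return ("binop", op, left, right)


def emit(expr, code):
    """Append instructions for a hypothetical stack machine that compute `expr`."""
    if expr[0] == "num":
        code.append(f"PUSH {expr[1]}")
    elif expr[0] == "var":
        code.append(f"LOAD {expr[1]}")
    else:
        _, op, left, right = expr
        emit(left, code)                     # operands first,
        emit(right, code)
        code.append(MNEMONICS[op])           # then the operator


def compile_assign(target, expr):
    """Compile `target = expr` after folding constants."""
    code = []
    emit(fold(expr), code)
    code.append(f"STORE {target}")
    return code


# z = x + 2 * 3
expr = ("binop", "+", ("var", "x"), ("binop", "*", ("num", 2), ("num", 3)))
code = compile_assign("z", expr)
```

With folding applied, `2 * 3` is computed at compile time, so `code` becomes `["LOAD x", "PUSH 6", "ADD", "STORE z"]`.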
References:
What is object code? - Webopedia
Compiler - Wikipedia, the free encyclopedia
How does a compiler convert high level programming languages into assembly?