What is a compiler?
https://hackernoon.com/compilers-and-interpreters-3e354a2e41cf
The simplest definition of a compiler is a program that translates code written in a high-level programming language (like JavaScript or Java) into low-level code (like Assembly) directly executable by the computer or another program such as a virtual machine.
For example, the Java compiler converts Java code to Java bytecode executable by the JVM (Java Virtual Machine). Other examples are V8, the JavaScript engine from Google, which converts JavaScript code to machine code, and GCC, which can convert code written in programming languages such as C, C++, Objective-C and Go, among others, to native machine code.
What’s in the black box?
So far we’ve looked at a compiler as a magic black box which contains some spell to convert high-level code to low-level code. Let’s open that box and see what’s inside.
A compiler can be divided into 2 parts.
- The first one, generally called the front end, scans the submitted source code for syntax errors, checks (and infers if necessary) the type of each declared variable and ensures that each variable is declared before use. If there is any error, it provides informative error messages to the user. It also maintains a data structure called a symbol table, which contains information about all the symbols found in the source code. Finally, if no error is detected, another data structure, an intermediate representation of the code, is built from the source code and passed as input to the second part.
- The second part, the back end, uses the intermediate representation and the symbol table built by the front end to generate low-level code.
Both the front end and the back end perform their operations in a sequence of phases. Each phase generates a particular data structure from another data structure emitted by the phase before it.
The phases of the front end generally include lexical analysis, syntax analysis, semantic analysis and intermediate code generation, while the back end includes optimization and code generation.
Structure of a compiler
Lexical Analysis
The first phase of the compiler is the lexical analysis. In this phase, the compiler breaks the submitted source code into meaningful elements called lexemes and generates a sequence of tokens from the lexemes.
A lexeme can be thought of as a uniquely identifiable string of characters in the source programming language, for example, keywords such as `if`, `while` or `func`, identifiers, strings, numbers, operators or single characters like `(`, `)`, `.` or `:`.
A token is an object describing a lexeme. Along with the value of the lexeme (the actual string of characters of the lexeme), it contains information such as its type (is it a keyword? an identifier? an operator? …) and the position (line and/or column number) in the source code where it appears.
Sequence of lexemes generated during lexical analysis
If the compiler encounters a string of characters for which it cannot create a token, it stops and reports an error. This happens, for example, when it encounters a malformed string or number, or a character that cannot start any token (such as a stray `#` in Java source code).
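The lexical analysis described above can be sketched in a few lines of Python. This is only an illustration: the token categories, the `Token` class and the keyword set are assumptions made for the example, not the design of any real compiler.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    type: str    # category of the lexeme (KEYWORD, IDENT, NUMBER, OP, PUNCT)
    value: str   # the lexeme itself
    line: int    # 1-based line number
    column: int  # 1-based column number

# Hypothetical keyword set for the toy language used in this article.
KEYWORDS = {"func", "if", "while"}

# Order matters: earlier alternatives win over later ones.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=]"),
    ("PUNCT",   r"[(){}:,.]"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
    ("INVALID", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Break the source text into lexemes and return one Token per lexeme."""
    tokens = []
    line, line_start = 1, 0
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        column = match.start() - line_start + 1
        if kind == "NEWLINE":
            line, line_start = line + 1, match.end()
        elif kind == "SKIP":
            continue
        elif kind == "INVALID":
            raise SyntaxError(f"unexpected character {text!r} at line {line}, column {column}")
        else:
            if kind == "IDENT" and text in KEYWORDS:
                kind = "KEYWORD"
            tokens.append(Token(kind, text, line, column))
    return tokens

SOURCE = "func sum(n: Int): Int = {\n  n * (n + 1) / 2\n}"
for token in tokenize(SOURCE):
    print(token)
```

Each token carries its type, its value and its position, so later phases can report errors at the right line and column.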
Syntax Analysis
During syntax analysis, the compiler uses the sequence of tokens generated during lexical analysis to build a tree-like data structure called an Abstract Syntax Tree, or AST for short. The AST reflects the syntactic and logical structure of the program.
Abstract Syntax Tree generated after syntax analysis
Syntax analysis is also the phase where any syntax errors are detected and reported to the user in the form of informative messages. For instance, in the example above, if we forget the closing brace `}` after the definition of the `sum` function, the compiler should return an error stating that there is a missing `}`, and the error should point to the line and column where the `}` is missing.
If no error is found during this phase, the compiler moves to the semantic analysis phase.
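As a rough illustration of syntax analysis, the sketch below parses an arithmetic expression such as the body of `sum` into an AST with a small recursive-descent parser. The grammar, the tuple-based node shapes (`("num", 1)`, `("var", "n")`, `(operator, left, right)`) and the simplified `tokenize` helper are assumptions made for this example.

```python
import re

def tokenize(source):
    # Simplified lexer for this sketch: tokens are plain strings, and any
    # character it does not recognise is silently skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", source)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, expected):
        token = self.advance()
        if token != expected:
            raise SyntaxError(f"expected {expected!r}, found {token!r}")

    # expr := term (('+' | '-') term)*
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    # term := factor (('*' | '/') factor)*
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    # factor := NUMBER | IDENT | '(' expr ')'
    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = self.expr()
            self.expect(")")
            return node
        if token.isdigit():
            return ("num", int(token))
        if token[0].isalpha() or token[0] == "_":
            return ("var", token)
        raise SyntaxError(f"unexpected token {token!r}")

def parse(source):
    parser = Parser(tokenize(source))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("n * (n + 1) / 2"))
# ('/', ('*', ('var', 'n'), ('+', ('var', 'n'), ('num', 1))), ('num', 2))
```

A missing closing parenthesis, as in `parse("(n + 1")`, raises a `SyntaxError`. A real parser would also report the line and column, using the positions stored in the tokens.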
Semantic Analysis
During semantic analysis, the compiler uses the AST generated during syntax analysis to check if the program is consistent with all the rules of the source programming language. Semantic analysis encompasses:
- Type inference. If the programming language supports type inference, the compiler will try to infer the type of all untyped expressions in the program. If a type is successfully inferred, the compiler will annotate the corresponding node in the AST with the inferred type information.
- Type checking. Here, the compiler checks that all values being assigned to variables and all arguments involved in an operation have the correct type. For example, the compiler makes sure that no variable of type `String` is being assigned a `Double` value, that a value of type `Bool` is not passed to a function accepting a parameter of type `Double`, or again that we’re not trying to divide a `String` by an `Int`, as in `"Hello" / 2` (unless the language definition allows it).
- Symbol management. Along with performing type inference and type checking, the compiler maintains a data structure called a symbol table, which contains information about all the symbols (or names) encountered in the program. The compiler uses the symbol table to answer questions such as: Is this variable declared before use? Are there 2 variables with the same name in the same scope? What is the type of this variable? Is this variable available in the current scope? And many more.
The output of the semantic analysis phase is an annotated AST and the symbol table.
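The sketch below illustrates type checking and symbol management on a tuple-based AST. The node shapes, the `SymbolTable` and `SemanticError` classes and the rule that arithmetic operators only accept `Int` operands are all assumptions made for the example; a real language defines its own rules.

```python
class SemanticError(Exception):
    pass

class SymbolTable:
    """A stack of scopes, each mapping a name to its type."""

    def __init__(self):
        self.scopes = [{}]

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_):
        # Two symbols with the same name in the same scope are an error.
        if name in self.scopes[-1]:
            raise SemanticError(f"'{name}' is already declared in this scope")
        self.scopes[-1][name] = type_

    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise SemanticError(f"'{name}' is used before being declared")

def type_of(node, symbols):
    """Return the type of an expression node, or raise SemanticError."""
    kind = node[0]
    if kind == "num":
        return "Int"
    if kind == "str":
        return "String"
    if kind == "var":
        return symbols.lookup(node[1])
    # Otherwise the node is a binary operation: (operator, left, right).
    op, left, right = node
    left_type = type_of(left, symbols)
    right_type = type_of(right, symbols)
    if left_type != "Int" or right_type != "Int":
        raise SemanticError(
            f"operator '{op}' cannot be applied to {left_type} and {right_type}"
        )
    return "Int"

symbols = SymbolTable()
symbols.declare("n", "Int")

# AST of n * (n + 1) / 2
body = ("/", ("*", ("var", "n"), ("+", ("var", "n"), ("num", 1))), ("num", 2))
print(type_of(body, symbols))  # Int

try:
    type_of(("/", ("str", "Hello"), ("num", 2)), symbols)
except SemanticError as error:
    print(error)  # operator '/' cannot be applied to String and Int
```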
Intermediate Code Generation
After the semantic analysis phase, the compiler uses the annotated AST to generate machine-independent, low-level intermediate code. One such intermediate representation is the three-address code.
The three-address code (3AC), in its simplest form, is a language in which an instruction is an assignment and has at most 3 operands.
Most instructions in 3AC are of the form `a := b <operator> c` or `a := b`.
The above drawing depicts 3AC code generated from the annotated AST created during the compilation of the function

    func sum(n: Int): Int = {
        n * (n + 1) / 2
    }
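A minimal sketch of how 3AC could be produced from the AST of this function body is shown below. The tuple-based AST, the temporary names `t1`, `t2`, ... and the result name `id1` are assumptions made for the example.

```python
import itertools

def generate_3ac(ast, result_name):
    """Return a list of three-address instructions (as strings) for an expression."""
    code = []
    counter = itertools.count(1)

    def emit(node):
        # Returns the name (or literal) that holds the value of `node`.
        kind = node[0]
        if kind == "num":
            return str(node[1])
        if kind == "var":
            return node[1]
        op, left, right = node
        left_operand = emit(left)
        right_operand = emit(right)
        temp = f"t{next(counter)}"
        code.append(f"{temp} := {left_operand} {op} {right_operand}")
        return temp

    result = emit(ast)
    code.append(f"{result_name} := {result}")
    return code

# AST of n * (n + 1) / 2
body = ("/", ("*", ("var", "n"), ("+", ("var", "n"), ("num", 1))), ("num", 2))
for instruction in generate_3ac(body, "id1"):
    print(instruction)
# t1 := n + 1
# t2 := n * t1
# t3 := t2 / 2
# id1 := t3
```

Every instruction is an assignment with at most three operands, which is what makes this form easy for later phases to analyse and rewrite.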
The intermediate code generation concludes the front end phase of the compiler.
Optimization
In the optimization phase, the first phase of the back end, the compiler applies different optimization techniques to improve the generated intermediate code, for instance by making it faster or shorter.
For example, a very simple optimization on the 3AC code in the previous example would be to eliminate the temporary assignment `t3 := t2 / 2` and directly assign to `id1` the value `t2 / 2`.
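The optimization just described can be sketched as a tiny rewrite over a list of 3AC strings. The function name `fold_final_copy` and the convention that temporaries are named `t` followed by digits are assumptions for this illustration; real optimizers work on richer data structures and apply many such passes.

```python
import re

def fold_final_copy(code):
    """If the last instruction only copies a temporary defined by the
    instruction right before it, assign the computed value directly."""
    if len(code) < 2:
        return list(code)
    dest, source = code[-1].split(" := ", 1)
    temp, expression = code[-2].split(" := ", 1)
    if source == temp and re.fullmatch(r"t\d+", temp):
        return code[:-2] + [f"{dest} := {expression}"]
    return list(code)

code = ["t1 := n + 1", "t2 := n * t1", "t3 := t2 / 2", "id1 := t3"]
for instruction in fold_final_copy(code):
    print(instruction)
# t1 := n + 1
# t2 := n * t1
# id1 := t2 / 2
```

The rewrite is safe here because the removed temporary is defined immediately before its only use, so no other instruction can depend on it.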
Code Generation
In this last phase, the compiler translates the optimized intermediate code into machine-dependent code, Assembly or any other target low-level language.
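To give a feel for this phase, the sketch below translates 3AC strings into instructions for a made-up accumulator machine. The `LOAD`, `STORE`, `ADD`, `SUB`, `MUL` and `DIV` mnemonics are hypothetical and do not correspond to any real instruction set.

```python
# Hypothetical mnemonics for an imaginary single-accumulator machine.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def generate_assembly(code):
    """Translate 3AC strings into instructions for the imaginary machine."""
    assembly = []
    for instruction in code:
        dest, expression = instruction.split(" := ", 1)
        parts = expression.split()
        if len(parts) == 1:
            # a := b
            assembly.append(f"LOAD {parts[0]}")
        else:
            # a := b <operator> c
            left, op, right = parts
            assembly.append(f"LOAD {left}")
            assembly.append(f"{OPCODES[op]} {right}")
        assembly.append(f"STORE {dest}")
    return assembly

for line in generate_assembly(["t1 := n + 1", "t2 := n * t1", "id1 := t2 / 2"]):
    print(line)
```

A real back end would also have to choose registers, respect calling conventions and pick among many candidate instruction sequences, all of which this sketch ignores.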
Compiler vs. Interpreter
Let’s conclude this article with a note about the difference between compilers and interpreters.
Interpreters and compilers are very similar in structure. The main difference is that an interpreter directly executes the instructions in the source programming language while a compiler translates those instructions into efficient machine code.
An interpreter will typically generate an efficient intermediate representation and immediately evaluate it. Depending on the interpreter, the intermediate representation can be an AST, an annotated AST or a machine-independent low-level representation such as the three-address code.
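For comparison with the compiler phases above, here is a minimal sketch of an interpreter that uses the AST as its intermediate representation and evaluates it immediately. The tuple-based AST and the choice of integer division for `/` on `Int` values are assumptions made for the example.

```python
import operator

# Assumption: `/` on Int values means integer division in the toy language.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
}

def evaluate(node, env):
    """Walk the AST and compute its value directly; no code is generated."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    op, left, right = node
    return OPERATIONS[op](evaluate(left, env), evaluate(right, env))

# AST of n * (n + 1) / 2
body = ("/", ("*", ("var", "n"), ("+", ("var", "n"), ("num", 1))), ("num", 2))
print(evaluate(body, {"n": 4}))   # 10
print(evaluate(body, {"n": 10}))  # 55
```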
Difference between Compiler and Interpreter
http://www.c4learn.com/c-programming/compiler-vs-interpreter/
No | Compiler | Interpreter |
---|---|---|
1 | Takes the entire program as input | Takes a single instruction as input |
2 | Intermediate object code is generated | No intermediate object code is generated |
3 | Conditional control statements execute faster | Conditional control statements execute slower |
4 | Memory requirement is higher (since object code is generated) | Memory requirement is lower |
5 | The program need not be compiled every time | The higher-level program is converted into a lower-level program every time it runs |
6 | Errors are displayed after the entire program is checked | Errors are displayed for every instruction interpreted (if any) |
7 | Example: C compiler | Example: BASIC |
https://www.programiz.com/article/difference-compiler-interpreter
Interpreter | Compiler |
---|---|
Translates the program one statement at a time. | Scans the entire program and translates it as a whole into machine code. |
It takes less time to analyze the source code, but the overall execution time is slower. | It takes a large amount of time to analyze the source code, but the overall execution time is comparatively faster. |
No intermediate object code is generated, hence it is memory efficient. | Generates intermediate object code which further requires linking, hence it requires more memory. |
Continues translating the program until the first error is met, in which case it stops. Hence debugging is easy. | It generates the error message only after scanning the whole program. Hence debugging is comparatively hard. |
Programming languages like Python and Ruby use interpreters. | Programming languages like C and C++ use compilers. |
https://stackoverflow.com/questions/2377273/how-does-an-interpreter-compiler-work
Compiler characteristics:
- spends a lot of time analyzing and processing the program
- the resulting executable is some form of machine-specific binary code
- the computer hardware interprets (executes) the resulting code
- program execution is fast
Interpreter characteristics:
- relatively little time is spent analyzing and processing the program
- the resulting code is some sort of intermediate code
- the resulting code is interpreted by another program
- program execution is relatively slow
What is a translator?
An S -> T translator accepts code expressed in source language S, and translates it to equivalent code expressed in another (target) language T.
Examples of translators:
- Compilers - translate high-level code to low-level code, e.g. Java -> JVM bytecode
- Assemblers - translate assembly language code to machine code, e.g. x86as -> x86
- High-level translators - translate code from one PL to another, e.g. Java -> C
- Decompilers - translate low-level code to high-level code, e.g. JVM bytecode -> Java
What is an interpreter?
An S interpreter accepts code expressed in language S, and immediately executes that code. It works by fetching, analysing, and executing one instruction at a time.
This is great when the user is entering instructions interactively (think Python) and would like to get the output before entering the next instruction. It is also useful when the program is to be executed only once or needs to be portable.
- Interpreting a program is much slower than executing native machine code
  - Interpreting a high-level language is ~100 times slower
  - Interpreting an intermediate-level language (such as JVM bytecode) is ~10 times slower
- If an instruction is called repeatedly, it will be analysed repeatedly - time-consuming!
- No need to compile code
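The fetch, analyse, execute cycle described above can be sketched as follows. The four-instruction language (`set`, `add`, `print`, `jump_if_less`) is invented for this example. The counter shows that an instruction inside a loop is analysed again on every pass.

```python
def run(program):
    """Fetch, analyse and execute one instruction at a time.

    Returns the printed values and the number of analyses performed.
    """
    lines = program.strip().splitlines()
    env, output = {}, []
    analysed = 0   # how many times an instruction was analysed
    pc = 0         # index of the next instruction to fetch
    while pc < len(lines):
        opcode, *args = lines[pc].split()   # fetch and analyse
        analysed += 1
        pc += 1
        if opcode == "set":                 # set <name> <value>
            env[args[0]] = int(args[1])
        elif opcode == "add":               # add <name> <value>
            env[args[0]] += int(args[1])
        elif opcode == "print":             # print <name>
            output.append(env[args[0]])
        elif opcode == "jump_if_less":      # jump_if_less <name> <limit> <target>
            name, limit, target = args
            if env[name] < int(limit):
                pc = int(target)
        else:
            raise ValueError(f"unknown instruction {opcode!r}")
    return output, analysed

PROGRAM = """
set i 0
add i 1
print i
jump_if_less i 3 1
"""

print(run(PROGRAM))  # ([1, 2, 3], 10)
```

The program has only 4 instructions, yet 10 analyses are performed, because the three instructions in the loop body are split and decoded again on each of the 3 iterations.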
Differences
Behaviour
- A compiler translates source code to machine code, but does not execute the source or object code.
- An interpreter executes source code one instruction at a time, but does not translate the source code.
Performance
- A compiler takes quite a long time to translate the source program to native machine code, but subsequent execution is fast
- An interpreter starts executing the source program immediately, but execution is slow
Interpretive compilers
An interpretive compiler is a good compromise between compilers and interpreters. It translates the source program into virtual machine code, which is then interpreted.
An interpretive compiler combines fast translation with moderately fast execution, provided that:
- VM code is lower-level than the source language, but higher-level than native machine code
- VM instructions have simple formats (can be quickly analysed by an interpreter)
Example: JDK provides an interpretive compiler for Java.
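The idea of an interpretive compiler can be sketched with a toy stack-based virtual machine: the expression is first translated into simple VM instructions, which a small loop then interprets. The instruction names `PUSH`, `LOAD` and `BINOP`, the tuple-based AST and the use of integer division are assumptions made for this example. It is not how the JVM works internally.

```python
import operator

BINARY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,  # assumption: integer division for Int values
}

def compile_to_vm(node):
    """Translation step: flatten an expression tree into stack-machine code."""
    kind = node[0]
    if kind == "num":
        return [("PUSH", node[1])]
    if kind == "var":
        return [("LOAD", node[1])]
    op, left, right = node
    return compile_to_vm(left) + compile_to_vm(right) + [("BINOP", op)]

def run_vm(code, env):
    """Interpretation step: the simple instruction format is quick to decode."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(env[arg])
        elif opcode == "BINOP":
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY[arg](left, right))
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return stack.pop()

# AST of n * (n + 1) / 2
body = ("/", ("*", ("var", "n"), ("+", ("var", "n"), ("num", 1))), ("num", 2))
vm_code = compile_to_vm(body)   # translated once
print(vm_code)
print(run_vm(vm_code, {"n": 4}))    # 10
print(run_vm(vm_code, {"n": 10}))   # 55
```

The source is analysed only once, and the VM code can then be run many times, which is where the "moderately fast execution" comes from.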
A compiler translates the entire program before it is run.
An interpreter translates one statement into machine language, executes it, and proceeds to the next statement.
Examples with Languages
Interpreted
- Python
- Ruby
- PHP
- Java
- Perl
- R
- Powershell
Compiled
- C
- C++
- C#
- Objective-C
- Swift
- Fortran
...
Java is Both a Compiled and Interpreted Language
https://techwelkin.com/compiler-vs-interpreter
When you write a Java program, the javac compiler converts your program into something called bytecode. All Java programs run inside a JVM (this is the secret behind Java being a cross-platform language). The bytecode produced by javac is loaded into the memory of the JVM, which is started by another program called java. The JVM interprets the bytecode instruction by instruction and converts it into machine code that the underlying hardware executes. The following flowchart shows how a Java program executes.
Execution of a Java program. Java is both a compiled and interpreted language.