AEG: Automatic Exploit Generation
As the title suggests, AEG is a technique that automatically generates exploits for a given program, starting from its source code. Fig. 5 shows the workflow of AEG. In this blog we go through these steps one by one.
Pre-Process
Compile the source code into binary code $B_{gcc}$ and LLVM bytecode $B_{llvm}$. Notice that AEG is a two-input single-output system: it does not take the source code itself as input, but it needs the bytecode to do source analysis. Bytecode is a platform-independent, intermediate representation of code that is not executed directly by the hardware, but by a virtual machine (VM).
Source-Analysis
This step does not do complex analysis. It just finds the largest buffer size in the program and outputs $max$, the maximum size of symbolic data / exploits, which is at least 10% larger than the largest buffer size.
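The bound can be sketched in a few lines. This is only an illustration of the "at least 10% larger" rule; the function name `max_symbolic_size` and the list of buffer sizes are assumptions, not part of AEG.

```python
def max_symbolic_size(buffer_sizes):
    """Return a bound that is at least 10% larger than the largest buffer."""
    largest = max(buffer_sizes)
    # Integer ceiling of largest / 10, so the bound never falls below +10%.
    return largest + -(-largest // 10)


print(max_symbolic_size([16, 64, 256]))  # 282
```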
Bug-Find
Preconditioned Symbolic Execution
Traditional symbolic execution for bug finding represents each byte of the input (the would-be exploit) with a symbolic variable, turning the whole input into symbolic data. At every branch in the program, new constraints are created to model the different outcomes of the branch. In the end, we try to solve for concrete data that satisfies all of the constraints collected by an interpreter along its path.
if (a > 1)
    b = a - 1;
else
    b = a + 1;
Taking the simple code above as an example, we create interpreters to model the different paths through the if branch, namely a>1 and a<=1. When we want the result b to be 2, we can set the constraints to a>1 & a-1=2, which gives a=3, and to a<=1 & a+1=2, which gives a=1.
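The same reasoning can be sketched in Python. This is a toy under stated assumptions: a brute-force search over a small integer range stands in for the constraint solver a real symbolic executor would use, and `PATHS` and `solve_for_b` are hypothetical names.

```python
# Each path through the toy program is a (label, path condition, result) triple.
PATHS = [
    ("a > 1", lambda a: a > 1, lambda a: a - 1),    # then-branch: b = a - 1
    ("a <= 1", lambda a: a <= 1, lambda a: a + 1),  # else-branch: b = a + 1
]


def solve_for_b(target, search_range=range(-100, 101)):
    """Return {path label: input a} for every path that can produce b == target."""
    solutions = {}
    for label, condition, result in PATHS:
        for a in search_range:
            # A solution must satisfy the path condition and b == target together.
            if condition(a) and result(a) == target:
                solutions[label] = a
                break
    return solutions


print(solve_for_b(2))  # {'a > 1': 3, 'a <= 1': 1}
```

If a path has no solution, it simply does not appear in the result. For example, b=0 is only reachable through the a<=1 path.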
Symbolic execution can help attackers find potential exploits, but the search space can be extremely large, especially with loops, because every iteration can produce new branches. To prune the search space, the authors introduced Preconditioned Symbolic Execution, which adds extra constraints (preconditions) before exploring and solving the paths, including:
- Known Length: To overflow a buffer, the input obviously has to exceed the length of the buffer, so we do not need to consider inputs that are shorter than the buffer;
- Known Prefix: Sometimes we know the prefix of the input, e.g., an HTTP GET request always starts with “GET”;
- Concolic Execution: The precondition is a single program path, given by a known example input that follows it.
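To see how much a precondition prunes, here is a small sketch. It enumerates concrete strings over a tiny alphabet and filters them, whereas a real implementation would add the precondition to the path constraints instead. The alphabet, `MAX_LEN`, and all function names are assumptions made for the example.

```python
from itertools import product

ALPHABET = "GET/"  # tiny illustrative alphabet (assumption)
MAX_LEN = 5        # longest input we consider (assumption)


def all_inputs():
    """Enumerate every string over ALPHABET with length 0..MAX_LEN."""
    for n in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def known_length(buffer_size):
    """Keep only inputs long enough to overflow a buffer of buffer_size bytes."""
    return lambda s: len(s) > buffer_size


def known_prefix(prefix):
    """Keep only inputs that start with the known prefix."""
    return lambda s: s.startswith(prefix)


def count_candidates(precondition=lambda s: True):
    """Count the inputs that remain to be explored under the precondition."""
    return sum(1 for s in all_inputs() if precondition(s))
```

Without a precondition there are 1365 candidate inputs. `known_length(4)` leaves 1024 of them and `known_prefix("GET")` leaves only 21.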
Path Prioritization
Each branch leads to two different paths, so the number of paths grows exponentially, especially in loops. This raises the question of which path to explore first. So, the authors proposed Path Prioritization to decide the order of exploration. It includes two main techniques:
- Buggy-Path-First: One bug on a path means subsequent statements are also likely to be buggy (and hopefully exploitable), so AEG gives buggy paths a higher priority and continues exploring them.
- Loop Exhaustion: The loop-exhaustion strategy gives higher priority to an interpreter exploring the maximum number of loop iterations, hoping that computations involving more iterations are more likely to produce bugs like buffer overflows. In this way, a loop only creates one new interpreter.
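One way to picture the two heuristics is a priority queue of interpreters. The sketch below is an assumption about how they could be combined (bugs first, loop iterations as a tie-breaker), not the scheduler used in the paper. `PathScheduler` and its parameters are hypothetical names.

```python
import heapq
from itertools import count


class PathScheduler:
    """Pick the next interpreter: buggy paths first, then deeper loop iterations."""

    def __init__(self):
        self._heap = []
        self._tiebreak = count()  # keeps insertion order for equal priorities

    def add(self, name, bugs_found=0, loop_iterations=0):
        # heapq pops the smallest key, so both scores are negated.
        key = (-bugs_found, -loop_iterations, next(self._tiebreak))
        heapq.heappush(self._heap, (key, name))

    def next_path(self):
        return heapq.heappop(self._heap)[1]


scheduler = PathScheduler()
scheduler.add("plain")
scheduler.add("long_loop", loop_iterations=50)
scheduler.add("buggy", bugs_found=1)
print([scheduler.next_path() for _ in range(3)])  # ['buggy', 'long_loop', 'plain']
```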
Environment Modelling
As a practical tool, AEG models the different sources through which an attacker can feed input to a program, including files, sockets, environment variables, library function calls and system calls.
Exploit Generation
After finding a path leading to a bug, we need to check whether it is exploitable. AEG then generates the exploit and verifies that it actually gives the attacker a shell through this bug.
DBA: Dynamic Binary Analysis
In the Bug-Find step, we obtain the path constraints leading to a bug and the names of the vulnerable functions and buffers. With them, AEG can reproduce the bug and observe the behavior of the program at runtime. That is what DBA does. During DBA, AEG instruments the given executable binary $B_{gcc}$. When it detects the vulnerable function call, it stops execution and examines the stack, recording the stack memory contents.
To attack the program, the exploit needs to overwrite contents of the stack. However, inappropriate values may cause crashes. So, AEG restores the stack contents that do not need to change, making sure the program does not crash during the attack.
Exploit-Gen
With the runtime information gained from DBA, AEG tries to generate exploits accordingly. There are multiple types of exploits, but the paper only presents one algorithm, which generates stack-overflow return-to-stack exploits and works as follows:
The exp_str stores the expected contents of the stack. It overwrites the saved return address (the value that is loaded into EIP, the register pointing to the next instruction) and places the shellcode afterwards. EIP is thereby redirected into the next stack frame (offset+8), which contains the shellcode. The other stack contents between bufaddr and &retaddr (the part of exp_str before offset) remain unchanged and are restored during the attack.
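The layout described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm verbatim. It assumes a 32-bit little-endian target, a return address that is overwritten with the absolute address bufaddr+offset+8, and 4 filler bytes between the return address and the shellcode. `build_exp_str` is a hypothetical name and the shellcode is a placeholder passed in by the caller.

```python
import struct


def build_exp_str(stack_bytes, bufaddr, retaddr_addr, shellcode):
    """Lay out exp_str: restored stack, new return address, filler, shellcode."""
    offset = retaddr_addr - bufaddr      # distance from buffer start to saved return address
    if offset < 0 or len(stack_bytes) < offset:
        raise ValueError("recorded stack does not cover the bytes before the return address")
    jmp_target = bufaddr + offset + 8    # address where the shellcode will land (assumption)

    exp_str = bytearray(stack_bytes[:offset])   # stack restoration: keep the recorded bytes
    exp_str += struct.pack("<I", jmp_target)    # overwrite the saved return address
    exp_str += b"\x90" * 4                      # filler so the shellcode starts at offset + 8
    exp_str += shellcode                        # placeholder payload supplied by the caller
    return bytes(exp_str)
```

Calling it with the stack contents recorded by DBA keeps every byte before the return address identical to the original stack, which is what prevents a crash before the function returns.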