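Reposted from: http://www.xiesiyi.com/posts/deep-water-printf-float-in-int-type.html
This article grew out of a chat with a friend. We talked about a problem, he dug into it and wrote it up, and the result is very well written, so I am reposting it here for the record.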
Abstract
For a programmer, as a user of the printf interface in the C language, he or she should make sure the format specifier matches the types of the variables, or the result is undefined. In fact, the result of mismatched types may be well defined from the perspective of the implementor of this interface.
Background
Last week, a friend of mine showed me an Obj-C code snippet, and we wanted to figure out exactly what it outputs.
The output from the Xcode IDE, running on an iPhone 6s simulator, is as follows:
Note that across three runs, some numbers in the output stay the same while others change every time. In particular, the result of printing price, which has type float, with the int specifier and without a cast looks indeterminate at first sight.
This Obj-C code snippet looks simple enough for a quick answer. To verify my first thought, I quickly translated it into C to look at it at a lower level, assuming NSLog is a macro wrapping C's printf (Obj-C is a superset of C anyway ~). Here is the C code snippet:
The output is as follows. Note the line that prints with %d and no cast: it still looks unpredictable:
// compile sample.c
$ gcc -g -o sample sample.c
// run for once
$ ./sample
specifier   casting     input
%f                      1.500000
%d          (int)       1
%d                      2147483630
%d          (int *)     -493699412
%d          *(int *)    1095237632
Solving
The output astonished and puzzled me, both instantly and for the rest of the week. One question kept haunting me: is it true that the output of printf is undefined?
To study this further and thoroughly, I decided to check a concise snippet focusing on printf("%d\n", price). In my opinion, this is the key to seeing what happens under the hood.
Before checking the code, it is necessary to make clear which environment the programs run in: x86-64 Linux with gcc, on a Debian virtual guest.
// programming environment
$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
$ gcc --version
gcc (Debian 4.9.2-10) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The first C snippet is used to show the sizes of different data types.
// compile size.c
$ gcc -g -o size size.c
// and run
$ ./size
sizoef(int) is 4
sizoef(long) is 8
sizoef(float) is 4
sizoef(double) is 8
The concise C code demonstrating printf("%d\n", price) and its output are shown below:
//
// printf_int.c
//
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    float price = 1.5;
    printf("%d\n", price);
    return 0;
}
// compile printf_int
$ gcc -g -o printf_int printf_int.c
// and run for three times
$ ./printf_int
-1044129512
$ ./printf_int
559638616
$ ./printf_int
1742869704
No surprise at all: the output is weird and it changes on every run.
I know the common gotcha that the format string of the printf function must match the types of the variable argument list, or the behavior is undefined. Now the output makes me think long and hard about why and how the undefined behavior comes about.
The typical memory layout for a C program is composed of
- text segment : containing machine instructions
- data segment : initialized data
- bss segment : uninitialized data
- heap : dynamically allocated memory
- stack : memory area for storing variables during function calls
Usually, on Linux running on an x86 Intel CPU, the stack area starts at a higher memory address and expands toward lower addresses as it grows. The heap area starts at the top of the bss segment and grows toward the bottom of the stack area.
When calling a function, the arguments passed by the caller are stored on the stack if needed (pushed onto the stack). Arguments are pushed in the reverse of their order in the source code. That is to say, the last argument in the source code is pushed first. This is true for calling conventions that pass every argument on the stack. For x86-64 programs on Linux, some registers are involved in storing the arguments.
Up to six integer or pointer arguments are loaded directly into their respective registers without being pushed onto the stack. The registers and the arguments are specified by the CPU ABI, which is well introduced in this article written by Eli Bendersky. I borrow the illustrating image here:
The image illustrates a call to a function myfunc like:
long myfunc(long a, long b, long c, long d, long e, long f, long g, long h)

Fig. x64 frame nonleaf (illustrating) (copied from Eli)
Under the x86-64 calling convention, the first integer or pointer argument is loaded into register %rdi and the second into %rsi. Another important point for the function printf is that its arguments are examined by the compiler for type checking and type promotion. Because printf is variadic, an argument of type float is promoted to type double. In the statement printf("%d\n", price);, the first argument is the format string and the second argument price (the float number 1.5) is promoted to double. (See more about type promotion.)
If that were the whole story, the output would be the bits of price interpreted as type int. Unfortunately, it is not. The binary representation of double price = 1.5 is
// 64-bit binary representation of double 1.5
// as hex
0x3FF8000000000000
// as binary
00111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000
If you verify this yourself, neither the upper 32 bits nor the lower 32 bits match the %d output when interpreted as an int.
Things get more complicated. Thanks to Eli Bendersky again, the article referenced above also indicates that floating-point arguments are stored in xmm registers, while only arguments of integer or pointer type are handled by the six general-purpose registers. This gives a clue: examine the xmm0 register, the first xmm register. To examine registers, I have to use the powerful debugging tool gdb.
Tip
- In order to debug with gdb, the executable should be compiled with the -g option of gcc.
- The gdb command n (line 22 and line 25) is short for next, which executes the next statement, the one shown by the latest output (line 19 and line 23 respectively).
Line 26 is the integer -8328 output by the printf function. Pay close attention to line 20: the int value -8328 is held in the lower 32 bits (%esi) of the %rsi register. To learn more about the x86-64 registers, follow this link
What a coincidence! Or is it deterministic? Yes, it is both deterministic and defined. I will explain soon.
Recall that the %rsi register holds the second integer or pointer (address) argument, but of which function? Here it is the main function! Look closely at line 18:
Temporary breakpoint 1, main (argc=1, argv=0x7fffffffdf78) at printf_int.c:5
The argument argv is the second argument and its content is a 64-bit address, 0x7fffffffdf78. It is a pointer, so its content is held by register %rsi.
I use Python to manipulate the numbers. Let us convert the lower 32 bits of this address into an integer.
# lower 32 bits of the address, in hex
ffffdf78
# binary magnitude of its two's-complement value (2**32 - 0xffffdf78 = 0x2088)
0b10000010001000
# the lower 32 bits interpreted as a signed integer
-8328 // -0b10000010001000
Amazing!!!
Now we can explain these two statements.
float price = 1.5; printf("%d\n", price);
When we print it with a %d format specifier, the following happens:
- Passing the arguments. There are two. The first is a pointer to the format string, so it (an address) is loaded into the %rdi register. The second is price, of type float (but promoted to double), so it (1.5) is loaded into the %xmm0 register, and the content of the %rsi register remains unchanged. Here is the black magic.
- When printf is called, it parses the format string to determine the type of the value to print. Here the %d specifier is the first one encountered during parsing, so printf treats the value as an integer (as %d indicates). printf then fetches the value from the %rsi register and prints the content as an integer.
- The %rsi register is not altered when this call is set up, so the output is not determined by the call printf("%d\n", price); itself. It is determined by the last call that changed the %rsi register.
In general, we can summarize:
- Type promotion is applied first. Type char is promoted to int, type float is promoted to double, and so on.
- For arguments of integer or pointer type, up to six are loaded into the specified registers (%rdi %rsi %rdx %rcx %r8 %r9), floating-point (single or double) arguments are loaded into xmm registers, and the others are pushed onto the stack frame.
- Arguments of double type are loaded into %xmm registers, which are designed for holding floating-point numbers.
- Before the printf function is called, the arguments are stored according to their declared types (after type promotion if needed). When printf executes, each value is fetched according to the format string.
Let's verify these conclusions. Here we add a function sum which expects two int arguments. Calling sum with two integers has a side effect: the %rsi register is loaded with the second argument, and it remains intact through the following call to printf. That printf then takes the integer value for the %d specifier from the %rsi register as set by the previous call to sum. We can therefore expect the output of printf to be determined by the second integer argument of the preceding sum call, and to vary with it.
Here is the output:
// sum(2, 5)
// compile
$ gcc -g -o printf_int_sum printf_int_sum.c
// run for three times
$ ./printf_int_sum
%d price: 5
$ ./printf_int_sum
%d price: 5
$ ./printf_int_sum
%d price: 5
// after we change to sum(2, 7)
// compile
$ gcc -g -o printf_int_sum printf_int_sum.c
// run
$ ./printf_int_sum
%d price: 7
$ ./printf_int_sum
%d price: 7
$ ./printf_int_sum
%d price: 7
Observing the above outputs, when we call sum(2, 5);, price is printed as 5; when we call sum(2, 7);, price is printed as 7. The behavior is what we expect: the second integer or pointer argument of sum determines the output of price. It is defined! Hooray!
So far, we have figured out how to predict the output of printf. One mystery remains: why does the result of the sample.c program change on every run? According to the conclusions, the output should be determined by the second integer or pointer argument of the main function, so I add a line in the file printf_int_argv.c to print argv, which is a pointer. This program printf_int_argv is run on a real machine and under gdb respectively.
//
// printf_int_argv.c
//
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    float price = 1.5;
    printf("%%d price: %d\n", price);
    printf("%%p argv: %p\n", argv);
    return 0;
}
The output from running on a real machine is shown as follows:
// run on real machine, with default system settings
$ ./printf_int_argv
%d price: 1520235144
%p argv: 0x7ffd5a9cf288
$ ./printf_int_argv
%d price: -1531767544
%p argv: 0x7ffea4b31508
$ ./printf_int_argv
%d price: 1047310888
%p argv: 0x7ffe3e6cb228
The output from running with gdb is shown as follows:
The outputs differ between these two environments. The output from gdb is explained well by the previous conclusions: it is determined by, and varies with, the value of argv. See lines 22 - 32: the content of %esi is exactly the same as the printed value of price. The difference arises because argv stays unchanged when debugging with gdb, while it varies on the real machine.
Why and how does argv change? I googled for c why argv address changes every time and found a useful link, Environment variable's address is changing?. From it I got some key concepts:
- ASLR
- /proc/sys/kernel/randomize_va_space
Continuing to search with these concepts, I get these from ASLR @wikipedia:
Address space layout randomization (ASLR) is a computer security technique involved in protection from buffer overflow attacks. In order to prevent an attacker from reliably jumping to, for example, a particular exploited function in memory, ASLR randomly arranges the address space positions of key data areas of a process, including the base of the executable and the positions of the stack, heap and libraries.
As far as I know, the argv list sits just above the stack, so its address should also change on every run. I finally decided to disable ASLR and run printf_int_argv again, with the fresh output here:
Warning
You need root permission to disable or enable ASLR. Here is a step-by-step guide from the link:
The following values are supported:
0 – No randomization. Everything is static.
1 – Conservative randomization. Shared libraries, stack, mmap(), VDSO and heap are randomized.
2 – Full randomization. In addition to elements listed in the previous point, memory managed through brk() is also randomized.
So, to disable it, run
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
and to enable it again, run
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space
// when ASLR is disabled
$ ./printf_int_argv
%d price: -6504
%p argv: 0x7fffffffe698
$ ./printf_int_argv
%d price: -6504
%p argv: 0x7fffffffe698
$ ./printf_int_argv
%d price: -6504
%p argv: 0x7fffffffe698
Hooray! Hooray! The value of argv stays unchanged when ASLR is disabled, so the output of printf, which results from interpreting argv as an integer, now stays the same on every run.
Warning
You should enable ASLR after this experiment. Do NOT forget it.
Conclusions
For a programmer, as a user of the printf interface in the C language, he or she should make sure the format specifier matches the types of the variables, or the result is undefined. In fact, the result of mismatched types may be defined as follows from the perspective of the implementor of this interface.
Function calls are modeled as stack frames, and the arguments passed by the caller are stored according to their types (possibly after type checking and type promotion). On an x86-64 CPU, the first six arguments of integer or pointer type are loaded into their respective registers (%rdi %rsi %rdx %rcx %r8 %r9), floating-point (single or double) arguments are loaded into xmm registers, and the others are pushed onto the stack frame.
Specifically, when the printf function is called to print something, the first %d in the format string indicates that the content of register %rsi should be fetched. When trying to print a float variable price with %d, as in printf("%d\n", price), %rsi is not set for this call since the second argument (price) is NOT of integer or pointer type; %rsi keeps the value set by the last function call that had a second argument of such a type.
If the printf function is the first function called in an int main(int argc, char **argv) program, the content of register %rsi is the second argument of main, that is, the value of argv, which is a pointer (holding an address). When ASLR is enabled (on a Linux system), the value of argv changes randomly to enhance security, and so does the content of %rsi. That is why the output of printf("%d\n", price) changes on every run.
Take-away Tips
- Always use the right format specifier for the printf function, or you will get unexpected results.
- In fact, the output produced by a mismatched format specifier is defined in some way if you examine the registers and the function call stack. Once you understand this magic, you should still always follow the previous tip ~.