Ways to Guarantee 16-Byte Stack Alignment at Function Call Time

Latest recommended article published 2025-09-25 15:30:47


License: CC 4.0 BY-SA

Tags: #gcc #assembly #nested #codec #crash #x86

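This article looks at the memory alignment requirements of the SSE instruction set, the compatibility problems they cause between compilers, and some ways to work around them.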

Code that uses SSE instructions often requires the memory addresses involved in the computation to be 16-byte aligned; otherwise the program crashes.
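
To make the requirement concrete, here is a minimal C sketch of an alignment check. The helper names `is_aligned16` and `bytes_to_align16` are made up for this illustration and do not come from FFmpeg or x264.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when p is a multiple of 16, i.e. usable by aligned SSE
 * loads and stores such as movdqa. */
static bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: number of bytes from p up to the next
 * 16-byte boundary (0 when p is already aligned). */
static size_t bytes_to_align16(const void *p)
{
    return (size_t)((16u - ((uintptr_t)p & 15u)) & 15u);
}
```

An address that is only 4-byte aligned passes this check in one case out of four, which is why the failures described below look intermittent.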

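By default, if a (static or dynamic) library is compiled with gcc and the program that calls it is also compiled with gcc, everything works, because all of the code uses the same stack setup logic. If something goes wrong, it is a genuine bug rather than a stack setup mismatch at a function call boundary.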

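Problems appear when the (static or dynamic) library is compiled with gcc but the program that calls it is compiled with MSVC++. gcc both guarantees and assumes that the stack is 16-byte aligned at every function call. If the library performs SSE operations on variables located at the start of its stack frame, that is fine in theory, because gcc takes the stack address to be 16-byte aligned already. Function calls made by MSVC++ code do not meet this requirement, however; they generally guarantee only 4-byte alignment, and the subsequent SSE operations crash.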

For example, FFmpeg contains the following code:

void ff_fdct_sse2(int16_t *block)
{
    DECLARE_ALIGNED(16, int64_t, align_tmp[16]);
    int16_t * const block1= (int16_t*)align_tmp;

    fdct_col_sse2(block, block1, 0);
    fdct_row_sse2(block1, block);
}

Compiled with gcc, it produces the following code:

00000e50 <_ff_fdct_sse2>:
 e50: 81 ec 8c 00 00 00       sub    $0x8c,%esp
 e56: b9 00 00 00 00          mov    $0x0,%ecx
 e5b: 89 9c 24 84 00 00 00    mov    %ebx,0x84(%esp)
 e62: 8b 94 24 90 00 00 00    mov    0x90(%esp),%edx
 e69: bb 30 00 00 00          mov    $0x30,%ebx
 e6e: 89 b4 24 88 00 00 00    mov    %esi,0x88(%esp)
 e75: be 40 00 00 00          mov    $0x40,%esi
 e7a: 66 0f 6f 42 10          movdqa 0x10(%edx),%xmm0
 e7f: 66 0f 6f 4a 60          movdqa 0x60(%edx),%xmm1
 e84: 66 0f 6f d0             movdqa %xmm0,%xmm2
 e88: 66 0f 6f 5a 20          movdqa 0x20(%edx),%xmm3
 e8d: 66 0f ed c1             paddsw %xmm1,%xmm0
 e91: 66 0f 6f 62 50          movdqa 0x50(%edx),%xmm4
 e96: 66 0f 71 f0 03          psllw  $0x3,%xmm0
 e9b: 66 0f 6f 2a             movdqa (%edx),%xmm5
 e9f: 66 0f ed e3             paddsw %xmm3,%xmm4
 ea3: 66 0f ed 6a 70          paddsw 0x70(%edx),%xmm5
 ea8: 66 0f 71 f4 03          psllw  $0x3,%xmm4
 ead: 66 0f 6f f0             movdqa %xmm0,%xmm6
 eb1: 66 0f e9 d1             psubsw %xmm1,%xmm2
 ......

The following instruction allocates the space for all local variables on the stack in one step:

sub $0x8c, %esp

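On closer inspection, this single subtraction can only place the two local variables below on a 16-byte boundary because gcc already assumes that the address of the function's first argument is divisible by 16. On entry esp points at the return address, 4 bytes below that argument, so esp is congruent to 12 modulo 16; subtracting 0x8c (140) then leaves esp on a 16-byte boundary.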

DECLARE_ALIGNED(16, int64_t, align_tmp[16]);
int16_t * const block1= (int16_t*)align_tmp;

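MSVC++ does not satisfy this condition. If the function is called from MSVC++ code, the stack is only 4-byte aligned, so just one of the four possible offsets (0, 4, 8, 12 modulo 16) is the one gcc expects, and the chance of a crash is roughly 75%.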
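
The 75% figure can be checked with a little address arithmetic. The sketch below models only the `sub $0x8c,%esp` prologue from the disassembly above; the base address is an arbitrary 16-byte-aligned value picked for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Frame size taken from the prologue: sub $0x8c,%esp */
enum { FRAME_SIZE = 0x8c };

/* esp after the prologue, given esp at function entry
 * (where it points at the return address). */
static uint32_t esp_after_prologue(uint32_t esp_at_entry)
{
    return esp_at_entry - FRAME_SIZE;
}

/* Counts how many of the four 4-byte-aligned entry offsets
 * (0, 4, 8, 12 modulo 16) leave esp 16-byte aligned. */
static int count_aligned_cases(void)
{
    const uint32_t base = 0x0012ff00u; /* arbitrary aligned address (assumption) */
    int aligned = 0;

    for (uint32_t off = 0; off < 16; off += 4) {
        if (esp_after_prologue(base + off) % 16 == 0)
            aligned++;
    }
    return aligned;
}
```

Only the entry offset 12 survives, which is exactly the offset a gcc caller produces: 16-byte aligned at the call instruction, minus 4 bytes for the return address.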

-----------------------------------------------------------------------

Either of the following approaches solves the problem:

1) The simplest approach is to mark the implementation of each exported function with gcc's __attribute__((force_align_arg_pointer)), for example:

int __attribute__((force_align_arg_pointer)) avcodec_open(AVCodecContext *avctx, AVCodec *codec)
{
    //....
}

The gcc documentation describes the advantages and drawbacks of force_align_arg_pointer as follows:

On the Intel x86, the force_align_arg_pointer attribute may be applied to individual function definitions, generating an alternate prologue and epilogue that realigns the runtime stack. This supports mixing legacy codes that run with a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility. The alternate prologue/epilogue is slower and bigger than the regular one, and it requires one dedicated register for the life of the function. This also lowers the number of registers available if used in conjunction with the regparm attribute. The force_align_arg_pointer attribute is incompatible with nested functions; this is considered a hard error.

Note that force_align_arg_pointer requires gcc 4.2 or later.
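
As a rough sketch of how this might look on a library's own entry points, the attribute can be wrapped in a macro so that the code still builds with compilers or targets that do not know it. The macro `REALIGN_STACK` and the function `sum_via_aligned_tmp` are invented for this example and are not part of FFmpeg.

```c
#include <stdint.h>
#include <string.h>

/* Use the attribute only with GCC-compatible compilers on x86;
 * elsewhere the macro expands to nothing. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* Hypothetical exported entry point: copies up to 64 samples into a
 * 16-byte-aligned stack buffer (the kind of local that SSE code
 * relies on) and returns their sum. */
REALIGN_STACK int sum_via_aligned_tmp(const int16_t *src, int n)
{
    _Alignas(16) int16_t tmp[64];
    int sum = 0;

    /* Clamp n to the buffer size. */
    if (n < 0)
        n = 0;
    if (n > 64)
        n = 64;

    memcpy(tmp, src, (size_t)n * sizeof tmp[0]);
    for (int i = 0; i < n; i++)
        sum += tmp[i];
    return sum;
}
```

Only functions that can be entered from foreign code need the attribute; internal functions called from already-realigned code can keep the cheaper regular prologue.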

2) Maintain the stack alignment for the call by hand. There are many ways to do this, depending on the implementer; two sample implementations follow.

a) An MSVC-style call demo. The code comes from the x264 project and can be assembled with yasm.

;-----------------------------------------------------------------------------
; void x264_stack_align( void (*func)(void*), void *arg );
;-----------------------------------------------------------------------------
cglobal x264_stack_align
    push ebp
    mov  ebp, esp
    sub  esp, 4
    and  esp, ~15
    mov  ecx, [ebp+8]
    mov  edx, [ebp+12]
    mov  [esp], edx
    call ecx
    mov  esp, ebp
    pop  ebp
    ret

A brief explanation of the logic of the code above:

In essence, this assembly code performs by hand the simplest possible function call sequence on x86.

step 1>

    push ebp       ; save the caller's frame pointer (ebp)
    mov  ebp, esp  ; ebp = esp; both now point at the slot holding the saved ebp

The stack now looks like this (from high address to low address):

    +----------------+
    | arg            |
    +----------------+
    | func           |
    +----------------+
    | return address |
    +----------------+
    | old ebp        | <-- ebp
    +----------------+

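step 2>

    sub esp, 4     ; reserve 4 bytes so that the store in step 3 cannot overwrite the saved ebp
    and esp, ~15   ; grow the stack further if needed so that esp is 16-byte aligned
                   ; (on x86 the stack grows from high addresses to low addresses)

    +----------------+
    | arg            |
    +----------------+
    | func           |
    +----------------+
    | return address |
    +----------------+
    | old ebp        | <-- ebp
    +----------------+
    | reserved       |
    +----------------+
    | ...            | <-- esp
    +----------------+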

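step 3>

    mov ecx, [ebp+8]   ; load the parameter func into ecx
    mov edx, [ebp+12]  ; load the parameter arg into edx
    mov [esp], edx     ; store edx at [esp], which places arg on the stack as the
                       ; outgoing argument; esp is guaranteed to be 16-byte aligned here

    +----------------+
    | arg            |
    +----------------+
    | func           |
    +----------------+
    | return address |
    +----------------+
    | old ebp        | <-- ebp
    +----------------+
    | reserved       |
    +----------------+
    | ...            |
    +----------------+
    | arg            | <-- esp
    +----------------+

    call ecx           ; call func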

step 4> Stack cleanup

    mov esp, ebp   ; release the stack space and restore esp
    pop ebp        ; restore the caller's ebp
    ret            ; pop the return address and jump to it, resuming the caller
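
The address arithmetic in these steps can be replayed in plain C. The following is only a model of the pointer values, not a replacement for the assembly; the struct and function names are hypothetical, and the `reserve` parameter corresponds to the constant in `sub esp, 4`.

```c
#include <stdint.h>

typedef struct {
    uint32_t ebp;      /* frame pointer after "mov ebp, esp" */
    uint32_t esp_call; /* esp at the moment of "call ecx"    */
} align_result;

/* Models the stack pointer updates of x264_stack_align.
 * esp_entry is esp on entry, pointing at the return address. */
static align_result simulate_stack_align(uint32_t esp_entry, uint32_t reserve)
{
    align_result r;
    uint32_t esp = esp_entry;

    esp -= 4;              /* push ebp               */
    r.ebp = esp;           /* mov ebp, esp           */
    esp -= reserve;        /* sub esp, 4             */
    esp &= ~UINT32_C(15);  /* and esp, ~15           */
    r.esp_call = esp;      /* arg is stored at [esp] */
    return r;
}
```

For every 4-byte-aligned entry value the result is 16-byte aligned and the argument slot lies strictly below the saved ebp. An entry offset of 8 modulo 16 is the case where `and esp, ~15` alone would not move esp, and only the `sub esp, 4` keeps the saved ebp from being overwritten.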

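Another variant of this implementation reserves 8 bytes instead of 4, forwards two arguments to func, and uses leave in place of mov esp, ebp / pop ebp: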

;-----------------------------------------------------------------------------
; void x264_stack_align( void (*func)(void*), void *arg );
;-----------------------------------------------------------------------------
cglobal x264_stack_align
    push ebp
    mov  ebp, esp
    sub  esp, 8
    and  esp, ~15
    mov  ecx, [ebp+8]
    mov  edx, [ebp+12]
    mov  [esp], edx
    mov  edx, [ebp+16]
    mov  [esp+4], edx
    call ecx
    leave
    ret

Interested readers can work through it in the same way.
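
On the calling side, the single-argument signature `void (*func)(void*)` means that a call needing several parameters has to bundle them into a struct. The sketch below shows that pattern. So that it compiles on its own, `stack_align_call` is a plain C stand-in with the same shape as x264_stack_align (an assumption for this example); in a real build the assembly routine would be linked instead. The `add_args` bundle and the worker are hypothetical.

```c
/* Stand-in for the assembly routine: same signature, but it simply
 * forwards the call without realigning anything. */
static void stack_align_call(void (*func)(void *), void *arg)
{
    func(arg);
}

/* Hypothetical argument bundle: inputs plus a slot for the result. */
typedef struct {
    int a;
    int b;
    int result;
} add_args;

/* Thunk with the void(*)(void*) shape expected by the aligner. */
static void add_thunk(void *opaque)
{
    add_args *args = (add_args *)opaque;
    args->result = args->a + args->b;
}

/* Caller-side wrapper: packs the arguments, goes through the
 * aligner, and unpacks the result. */
static int add_on_aligned_stack(int a, int b)
{
    add_args args = { a, b, 0 };
    stack_align_call(add_thunk, &args);
    return args.result;
}
```

The return value travels back through the struct because the thunk itself returns void.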

b) A gcc-style macro call

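#define STACK_ALIGNCALL16(x) \
{ \
    asm("movl %esp, %ebx");        /* save the original esp in ebx */ \
    asm("andl $0xfffffff0, %esp"); /* round esp down to a 16-byte boundary */ \
    asm("subl $12, %esp");         /* reserve 12 bytes of padding */ \
    asm("pushl %ebx");             /* push the saved esp; esp is 16-byte aligned again */ \
    x; \
    asm("popl %ebx");              /* reload the saved esp value */ \
    asm("movl %ebx, %esp");        /* restore the original esp */ \
}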

STACK_ALIGNCALL16(bytesFilled = avcodec_encode_video(codecContext, target, maxTargetSize, avframe));

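The logic is simple: save esp in ebx, force esp onto a 16-byte boundary, reserve 12 bytes of stack, then push ebx, at which point esp is 16-byte aligned again. Note that the macro changes esp and ebx without telling the compiler, so it only works when the code generated for x does not conflict with that.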
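
The claim that esp ends up aligned again can be verified with the same kind of arithmetic model. This is a sketch of the pointer values only; `esp_before_call` is a hypothetical name.

```c
#include <stdint.h>

/* Models STACK_ALIGNCALL16: returns the value of esp just before
 * the wrapped statement x runs. */
static uint32_t esp_before_call(uint32_t esp)
{
    esp &= UINT32_C(0xfffffff0); /* andl $0xfffffff0, %esp */
    esp -= 12;                   /* subl $12, %esp         */
    esp -= 4;                    /* pushl %ebx             */
    return esp;
}
```

The 12 bytes of padding plus the 4-byte push add up to exactly 16, so the boundary established by the `andl` is preserved.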

The code above is easy to convert to MSVC++ inline assembly syntax; interested readers can do the conversion themselves.
