Implementation of OpenSolaris System Calls on x86 Systems

License: CC 4.0 BY-SA

Tags:

#x86 #system #solaris #preprocessor #wrapper #combinations

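This article describes the different system call methods on the Solaris x86 and x64 platforms and how they work, including the traditional lcall $0x27, the newer int $0x91, and the sysenter and syscall instructions optimized for Intel and AMD processors. It traces a system call through a real example to show the transition from user space to kernel space.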

x86 syscall primer

Getting started on a project as complex as an operating system can be quite a daunting task. To help OpenSolaris newcomers sort out their head from their tail, here's a look at the system call infrastructure on Solaris x86 and Solaris x64.

I'll go over the different system call methods used, their departure points in userland and entry points in the kernel, and then we'll actually follow one into the kernel with the debugger to see it all in action.

Background

Processors in the x86 world support a number of different system call methods, and some are faster than others. In Solaris, unoptimized system calls take one of three possible paths into the kernel:

lcall $0x27

Used for years as the standard Solaris syscall method.

int $0x91

Linux has used int for years (vector 0x80). Solaris finally adopted int as the base syscall method in Solaris 11 (under development) and earned a significant performance increase as a result. It will be available soon in a Solaris 10 update.

lcall $0x7

Used by some (very old) statically linked binaries.

Fast Syscalls and Hardware Capability Libraries

When a well-behaved application makes a system call, it jumps through a wrapper function in libc, so changing the instruction used to enter the kernel becomes a matter of changing the wrappers in libc. Recently I integrated support for faster, chip-proprietary system calls into Solaris 10: sysenter (from Intel) and syscall (from AMD). Along with new kernel entry points, new hwcap (as in "hardware capability") versions of libc were provided to take advantage of these new, faster instructions. (Tim Marsland has written about the hw capability architecture, and Darren Moffat has written about how the system goes about selecting and using a hwcap libc.)

I often get confused about which system call method is used on which type of system. For the record, the following table shows which methods are supported by the various flavor combinations of x86 kernels, CPUs, and user application types shipping today:

u64 = 64-bit user applications
u32 = 32-bit user applications

Kernel	CPU	syscall	sysenter
64-bit kernel	Intel Xeon	u64 (`64-bit libc`)	u32 (`hwcap1`)
64-bit kernel	AMD Opteron	u64 (`64-bit libc`), u32 (`hwcap2`)	-
32-bit kernel	Intel Xeon	-	u32 (`hwcap1`)
32-bit kernel	AMD Opteron	-	u32 (`hwcap1`)

(The hwcap libraries referenced live in the /usr/lib/libc directory.)

Note that we only support AMD's syscall instruction in the 64-bit kernel. Using syscall/sysret in the 32-bit kernel is too complicated and not worth the trouble.
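The table can also be read as a small lookup. The sketch below is illustrative only: the name `fast_syscall_method` and the key layout are made up for this example, and the entries are transcribed from the table above, not taken from the Solaris source.

```python
# Illustrative lookup transcribed from the table above.
# The dictionary layout and function name are hypothetical.
FAST_SYSCALL = {
    # (kernel_bits, cpu, app_bits): (instruction, libc variant)
    (64, "Intel Xeon", 64): ("syscall", "64-bit libc"),
    (64, "Intel Xeon", 32): ("sysenter", "hwcap1"),
    (64, "AMD Opteron", 64): ("syscall", "64-bit libc"),
    (64, "AMD Opteron", 32): ("syscall", "hwcap2"),
    (32, "Intel Xeon", 32): ("sysenter", "hwcap1"),
    (32, "AMD Opteron", 32): ("sysenter", "hwcap1"),
}


def fast_syscall_method(kernel_bits, cpu, app_bits):
    """Return (instruction, libc variant), or None if the table has no entry."""
    return FAST_SYSCALL.get((kernel_bits, cpu, app_bits))


# A 32-bit application on an Opteron uses syscall only under the 64-bit kernel.
print(fast_syscall_method(64, "AMD Opteron", 32))
print(fast_syscall_method(32, "AMD Opteron", 32))
```

Combinations the table does not list, such as a 64-bit application on a 32-bit kernel, return `None`.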

Digging In

To illustrate this, let's take a look at the libc source code. It lives under the usr/src/lib/libc directory. The important entries here are:

i386/ - 32-bit source code and unoptimized binary
amd64/ - 64-bit source code and binary
i386_hwcap1/ - Intel CPU-specific source code and binary
i386_hwcap2/ - AMD CPU-specific source code and binary

A simple system call to use for this example is mkdir(2). We can use mdb to disassemble the text bits and see how libc jumps into the kernel:

rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir:                          movl   $0x50,%eax
mkdir+5:                        syscall
mkdir+7:                        jb     -0x82847 <__cerror>
mkdir+0xd:                      ret

We can see that the system call number (See Eric Schrock's post for more information on system call numbers) is stashed away in register %eax so the kernel can find it later, and then the syscall instruction is executed to transfer control to the kernel.

This example was run on an AMD Opteron system; on other hardware we'd expect to find either lcall $0x27 or sysenter as the control transfer instruction. We can get at the unoptimized libc by unmounting the hwcap library:

rab> su
Password: 
# umount /lib/libc.so.1
rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir:                          movl   $0x50,%eax
mkdir+5:                        lcall  $0x27,$0x0
mkdir+0xc:                      jb     -0x82b2c <__cerror>
mkdir+0x12:                     ret
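The offsets in the two listings follow from the instruction lengths. As a rough check, the sketch below adds up the byte lengths of each instruction. The encodings are the standard IA-32 ones, supplied here as an assumption and not copied from mdb output, and the jb displacement bytes are zero placeholders because only the lengths matter.

```python
# Byte encodings of the instructions in the two mkdir listings.
# Assumed standard IA-32 encodings; jb displacement bytes are placeholders.
MOV_EAX_0X50 = bytes([0xB8, 0x50, 0x00, 0x00, 0x00])              # movl $0x50,%eax
SYSCALL = bytes([0x0F, 0x05])                                     # syscall
LCALL_27 = bytes([0x9A, 0x00, 0x00, 0x00, 0x00, 0x27, 0x00])      # lcall $0x27,$0x0
JB_REL32 = bytes([0x0F, 0x82, 0x00, 0x00, 0x00, 0x00])            # jb <rel32>
RET = bytes([0xC3])                                               # ret


def offsets(instructions):
    """Return the starting offset of each instruction in the sequence."""
    result, position = [], 0
    for encoding in instructions:
        result.append(position)
        position += len(encoding)
    return result


hwcap2_mkdir = [MOV_EAX_0X50, SYSCALL, JB_REL32, RET]
base_mkdir = [MOV_EAX_0X50, LCALL_27, JB_REL32, RET]

print([hex(o) for o in offsets(hwcap2_mkdir)])  # ['0x0', '0x5', '0x7', '0xd']
print([hex(o) for o in offsets(base_mkdir)])    # ['0x0', '0x5', '0xc', '0x12']
```

The two-byte syscall and the seven-byte lcall account for the different offsets of the jb and ret instructions in the two disassemblies.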

Tracing it back to the source

Ah-hah - now let's look at the source for the libc mkdir(2) wrapper to complete the userland picture:

rab> pwd
.../usr/src/lib/libc/common/sys
rab> cat mkdir.s
[ snip ]
#include "SYS.h"

        SYSCALL_RVAL1(mkdir)
        RET
        SET_SIZE(mkdir)

In order to organize the source in a portable way that avoids reproducing the same code in more than one place, many portions of libc are implemented as preprocessor macros. mkdir(2) is so simple that it needs nothing but a SYSCALL macro (here SYSCALL_RVAL1), found in SYS.h. For reasons too boring to repeat here, the SYSCALL macro eventually expands into a corresponding SYSTRAP macro (here SYSTRAP_RVAL1). All 32-bit variants of libc share one SYS.h, and preprocessor macros defined via Makefiles in the binary directories determine which instructions go into the SYSTRAP macro:

rab> pwd
.../usr/src/lib/libc/i386/inc
rab> grep SYSTRAP_RVAL1 SYS.h
#define SYSTRAP_RVAL1(name)     __SYSCALL(name)
#define SYSTRAP_RVAL1(name)     __SYSENTER(name)
#define SYSTRAP_RVAL1(name)     __SYSLCALL(name)

One of the above macros is used depending on which libc is being built: __SYSCALL() for hwcap2, __SYSENTER() for hwcap1, and __SYSLCALL() for the unoptimized base libc at /lib/libc.so.1.

rab> cat SYS.h
[ snip ]
#define __SYSLCALL(name)                \
        /* CSTYLED */                   \
        movl    $SYS_/**/name, %eax;    \
        lcall   $SYSCALL_TRAPNUM, $0
[ snip ]
#define __SYSCALL(name)                 \
        /* CSTYLED */                   \
        movl    $SYS_/**/name, %eax;    \
        .byte   0xf, 0x5        /* syscall */

We added support for AMD's syscall instruction to Solaris, but we were using a slightly older version of our assembler which (embarrassingly enough) didn't yet recognize the instruction, so its opcode had to be manually hard-coded into libc.
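The macro selection described above can be modeled in a few lines. This is a hypothetical sketch and not the real build machinery: the variant keys (`base`, `hwcap1`, `hwcap2`) and the function name `expand_systrap` are invented for illustration, and the macro bodies mirror the SYS.h excerpt. The body of __SYSENTER is not shown in that excerpt, so the sketch leaves it unexpanded instead of guessing.

```python
# Hypothetical model of how SYSTRAP_RVAL1 selects an instruction sequence.
# Bodies mirror the SYS.h excerpt; __SYSENTER is not shown there, so it
# is deliberately left unexpanded.
MACRO_BODIES = {
    "__SYSLCALL": [
        "movl    $SYS_{name}, %eax",
        "lcall   $SYSCALL_TRAPNUM, $0",
    ],
    "__SYSCALL": [
        "movl    $SYS_{name}, %eax",
        ".byte   0xf, 0x5        /* syscall */",
    ],
}

# Which macro each libc build uses, per the text above.
SYSTRAP_BY_LIBC = {
    "base": "__SYSLCALL",
    "hwcap1": "__SYSENTER",
    "hwcap2": "__SYSCALL",
}


def expand_systrap(libc_variant, name):
    """Return the assembly lines SYSTRAP_RVAL1(name) stands for in a build."""
    macro = SYSTRAP_BY_LIBC[libc_variant]
    body = MACRO_BODIES.get(macro)
    if body is None:
        return [f"{macro}({name})"]  # body not available in the excerpt
    return [line.format(name=name) for line in body]


for variant in ("base", "hwcap2"):
    print(variant, expand_systrap(variant, "mkdir"))
```

The `.byte 0xf, 0x5` line in the hwcap2 expansion is the hand-coded opcode that shows up as `syscall` in the earlier disassembly.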

Jumping Over the Fence

That's all for userland; the easy part is over. Because the actual workings of the differing system call instructions vary widely, the kernel uses separate code paths to deal with each. The function entry points used are (only those for 32-bit applications making system calls are shown):

Kernel	Entry Instruction	Kernel Entry Point
64-bit kernel	`lcall`*	`trap()`
64-bit kernel	`syscall`	`sys_syscall32()`
64-bit kernel	`sysenter`	`sys_sysenter()`
32-bit kernel	`lcall`	`sys_call()`
32-bit kernel	`sysenter`	`sys_sysenter()`

* In the 64-bit kernel, 32-bit system calls made via lcall come in to the system via a segment-not-present trap (#np), a matter which is beyond the scope of this document. Trust me, you don't want to get into segmentation now...
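To keep the entry points straight, the table can be written as a mapping. This is only a sketch transcribed from the table above; the name `kernel_entry_point` is hypothetical, and the mapping covers 32-bit applications only, as the table does.

```python
# Entry points for 32-bit applications, transcribed from the table above.
# The function and dictionary names are hypothetical.
KERNEL_ENTRY_32BIT_APPS = {
    # (kernel_bits, instruction): kernel entry point
    (64, "lcall"): "trap",  # arrives via a segment-not-present trap (#np)
    (64, "syscall"): "sys_syscall32",
    (64, "sysenter"): "sys_sysenter",
    (32, "lcall"): "sys_call",
    (32, "sysenter"): "sys_sysenter",
}


def kernel_entry_point(kernel_bits, instruction):
    """Return the entry point name, or None for unsupported combinations."""
    return KERNEL_ENTRY_32BIT_APPS.get((kernel_bits, instruction))


print(kernel_entry_point(64, "syscall"))
print(kernel_entry_point(32, "syscall"))
```

The second lookup returns `None`, matching the earlier note that syscall is supported only in the 64-bit kernel.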

Seeing it in Action

Using the kernel debugger we can step out of the classroom and watch these creatures in their native wild habitats. Boot a machine and from the system console get the kernel debugger loaded and ready. Enter the debugger, and then set a breakpoint on the syscall entry point. I'm still using the same Opteron machine as above (running the 64-bit kernel), so I need to re-mount the hwcap library:

root> mount -O -F lofs /usr/lib/libc/libc_hwcap2.so.1 /lib/libc.so.1
root> mdb -K

Welcome to kmdb
Loaded modules: [ cpc ptm ufs unix krtld sppp nca lofs genunix ip logindmux usba
 specfs nfs random sctp ]
[0]> sys_syscall32:b
[0]> :c
kmdb: stop at sys_syscall32
kmdb: target stopped at:
sys_syscall32:  swapgs
[1]> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 fffffffffbc230a0  1b    0    0  60   no    no t-0    ffffffff82b38520 fsflush
  1 ffffffff8bdd1800  1b    0    0  49   no    no t-0    ffffffff8cc991e0 ksh

We set a breakpoint, and tripped over it immediately after continuing (because system calls are a very common occurrence on even an idle machine). We can see that CPU1 tripped the breakpoint first (as evidenced by the [1] in the kmdb prompt), and that ksh is the process running. Which system call is the shell making? Remember that the libc wrapper function stashed the system call number in register %eax. When we are in the 64-bit kernel, %eax is the lower 32 bits of register %rax:

[1]> <rax=D
                98

That is syscall 98, which, according to the sysent table (see sysent.c), is the shell doing a sigaction(2). This makes sense, because shells are always messing around with signals.

Clear the breakpoint and try the same thing with the 64-bit entry point (it is sys_syscall()), but this time enter the debugger by sending a break over the console (how one does this varies depending on the terminal being used to access the console):

[1]> :z
[1]> sys_syscall:b
[1]> :c
root>
root>
root>

Because this is an otherwise idle machine, nothing trips the 64-bit syscall breakpoint just yet. There just aren't very many 64-bit processes running. We can run one manually to trigger the breakpoint:

root> /usr/bin/amd64/ls 
kmdb: stop at sys_syscall
kmdb: target stopped at:
sys_syscall:    swapgs
[1]> <rax=D
                115

We see that the first 64-bit system call made by the 64-bit ls is mmap(2), which makes sense because the 64-bit dynamic linker needs to begin setting up the new process's address space.
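The manual lookups in this section can be sketched as follows. This is a minimal illustration, not the kernel's sysent table: `SYSCALL_NAMES` contains only the three numbers that appear in this article (0x50 for mkdir, 98 for sigaction, 115 for mmap), and the function name `syscall_from_rax` is made up.

```python
# Only the syscall numbers mentioned in this article; the real sysent
# table in sysent.c is far larger. Names here are hypothetical.
SYSCALL_NAMES = {
    0x50: "mkdir",      # 80, from the libc disassembly
    98: "sigaction",    # seen at the sys_syscall32 breakpoint
    115: "mmap",        # seen at the sys_syscall breakpoint
}


def syscall_from_rax(rax):
    """Take the low 32 bits of %rax (that is, %eax) and look up the name."""
    eax = rax & 0xFFFFFFFF
    return eax, SYSCALL_NAMES.get(eax, "unknown")


print(syscall_from_rax(98))
print(syscall_from_rax(115))
```

The mask models the statement that %eax is the lower 32 bits of %rax; any number outside the small table comes back as "unknown".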