Getting started on a project as complex as an operating system can be quite a daunting task. To help OpenSolaris newcomers sort out their head from their tail, here's a look at the system call infrastructure on Solaris x86 and Solaris x64.
I'll go over the different system call methods used, their departure points in userland and entry points in the kernel, and then we'll actually follow one into the kernel with the debugger to see it all in action.
Background
Processors in the x86 world support a number of different system call methods, and some are faster than others. In Solaris, unoptimized system calls take one of three possible paths into the kernel:
- lcall $0x27 - Used for years as the standard Solaris system call method.
- int $0x91 - Linux has used int for system calls for years (with vector 0x80). Solaris finally adopted int as the base system call method in Solaris 11 (under development) and earned a significant performance increase as a result. It will be available soon in a Solaris 10 update.
- lcall $0x7 - Used by some (very old) statically linked binaries.
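Unlike int $0x91, which names an interrupt vector, the two lcall forms are far calls whose target selector is 0x27 or 0x7. As a sketch, the standard x86 selector layout can be unpacked in C. The struct and function names are hypothetical and exist only for illustration.

Decoding shows that both selectors point into the LDT (local descriptor table) at user privilege level 3. Reading them as Solaris call gates in LDT entries 4 and 0 is an inference from the values, not something stated above.

```c
#include <stdint.h>

/* Fields of an x86 segment selector (architectural layout):
 *   bits 0-1   RPL, the requested privilege level (3 = user mode)
 *   bit  2     TI, the table indicator (0 = GDT, 1 = LDT)
 *   bits 3-15  index of the descriptor within that table
 * The struct and function names are illustrative only. */
struct selector_fields {
    unsigned index;  /* descriptor table entry number */
    unsigned ldt;    /* 1 if the selector refers to the LDT */
    unsigned rpl;    /* requested privilege level, 0..3 */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.rpl = sel & 0x3u;         /* low two bits */
    f.ldt = (sel >> 2) & 0x1u;  /* table indicator bit */
    f.index = sel >> 3;         /* remaining 13 bits */
    return f;
}
```

For example, decode_selector(0x27) yields index 4, LDT, RPL 3, and decode_selector(0x7) yields index 0, LDT, RPL 3.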
Fast Syscalls and Hardware Capability Libraries
When a well-behaved application makes a system call, it jumps through a wrapper function in libc, so changing the instruction used to enter the kernel is a matter of changing the wrappers in libc. Recently I integrated support for faster, chip-proprietary system call instructions into Solaris 10: sysenter (from Intel) and syscall (from AMD). Along with new kernel entry points, new hwcap (as in "hardware capability") versions of libc were provided to take advantage of these new, faster instructions. Tim Marsland has written about the hardware capability architecture, and Darren Moffat has written about how the system goes about selecting and using a hwcap libc.
I often get confused about which system call method is used on which type of system. For the record, the following table shows which methods are supported by the various flavor combinations of x86 kernels, CPUs, and user application types shipping today:
u64 = 64-bit user applications, u32 = 32-bit user applications
| Kernel | CPU | syscall | sysenter |
| 64-bit kernel | Intel Xeon | u64 (64-bit libc) | u32 (hwcap1) |
| 64-bit kernel | AMD Opteron | u64 (64-bit libc), u32 (hwcap2) | - |
| 32-bit kernel | Intel Xeon | - | u32 (hwcap1) |
| 32-bit kernel | AMD Opteron | - | u32 (hwcap1) |
Note that we only support AMD's syscall instruction in the 64-bit kernel. Using syscall/sysret in the 32-bit kernel is too complicated and not worth the trouble.
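The table, together with the note that syscall is used only under the 64-bit kernel, can be restated as a small decision function. This is a hypothetical sketch: the enum, struct and function names are invented. It covers only the four kernel/CPU rows shown. It returns INSN_NONE for a 64-bit application on a 32-bit kernel, a combination that cannot run at all.

```c
#include <stddef.h>

/* Hypothetical names; not Solaris interfaces. */
enum kernel_bits { KERNEL_32, KERNEL_64 };
enum cpu_vendor  { CPU_INTEL_XEON, CPU_AMD_OPTERON };
enum app_bits    { APP_32, APP_64 };
enum fast_insn   { INSN_NONE, INSN_SYSCALL, INSN_SYSENTER };

struct fast_path {
    enum fast_insn insn;  /* instruction used to enter the kernel */
    const char *libc;     /* libc variant that supplies the wrapper */
};

struct fast_path fast_syscall_path(enum kernel_bits k, enum cpu_vendor cpu,
                                   enum app_bits app)
{
    struct fast_path p = { INSN_NONE, NULL };

    if (k == KERNEL_64 && app == APP_64) {
        /* 64-bit applications use syscall on both CPUs */
        p.insn = INSN_SYSCALL;
        p.libc = "64-bit libc";
    } else if (k == KERNEL_64 && cpu == CPU_AMD_OPTERON) {
        /* 32-bit apps on Opteron under the 64-bit kernel: the table
         * lists no sysenter support here, so syscall (hwcap2) is used */
        p.insn = INSN_SYSCALL;
        p.libc = "hwcap2";
    } else if (app == APP_32) {
        /* 32-bit apps on Xeon (either kernel) and on Opteron with the
         * 32-bit kernel, which never uses syscall */
        p.insn = INSN_SYSENTER;
        p.libc = "hwcap1";
    }
    /* Remaining case: a 64-bit app on a 32-bit kernel (INSN_NONE). */
    return p;
}
```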
Digging In
To illustrate this, let's take a look at the libc source code, which lives under the usr/src/lib/libc directory. The important entries here are:
- i386/ - 32-bit source code and unoptimized binary
- amd64/ - 64-bit source code and binary
- i386_hwcap1/ - Intel CPU-specific source code and binary
- i386_hwcap2/ - AMD CPU-specific source code and binary
A simple system call to use for this example is mkdir(2). We can use mdb to disassemble the text bits and see how libc jumps into the kernel:
rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir: movl $0x50,%eax
mkdir+5: syscall
mkdir+7: jb -0x82847 <__cerror>
mkdir+0xd: ret
We can see that the system call number (see Eric Schrock's post for more information on system call numbers) is stashed in register %eax so the kernel can find it later. Then the syscall instruction is executed to transfer control to the kernel.
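The jb __cerror line is the error path. On Solaris x86, the kernel reports failure by setting the carry flag and leaving the error number in %eax. jb ("jump if below") branches exactly when the carry flag is set.

Here is a minimal C model of that convention. trap_return and wrapper_result are hypothetical names, and __cerror is reduced to its essential job. The real routine may do more than this.

```c
#include <errno.h>

/* Hypothetical model of what the kernel hands back to the wrapper:
 * the carry flag plus the contents of %eax. */
struct trap_return {
    int  carry;  /* CF: set by the kernel when the call failed */
    long eax;    /* return value on success, error number on failure */
};

/* Mirrors "jb __cerror; ret": jb is taken only when CF is set. */
long wrapper_result(struct trap_return r)
{
    if (r.carry) {
        errno = (int)r.eax;  /* __cerror: store the error number ... */
        return -1;           /* ... and return -1 to the caller */
    }
    return r.eax;            /* success: ret with %eax unchanged */
}
```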
The mkdir disassembly above comes from an AMD Opteron system. On an Intel system we'd expect to find either lcall $0x27 or sysenter as the control transfer instruction instead. We can get at the unoptimized libc by unmounting the hwcap library, which is mounted over /lib/libc.so.1:
rab> su
Password:
# umount /lib/libc.so.1
rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir: movl $0x50,%eax
mkdir+5: lcall $0x27,$0x0
mkdir+0xc: jb -0x82b2c <__cerror>
mkdir+0x12: ret
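The offsets mdb prints can be checked by hand from the encoded length of each instruction:

- movl $0x50,%eax is 5 bytes (B8 plus a 4-byte immediate).
- syscall is 2 bytes (0F 05).
- lcall $0x27,$0x0 is 7 bytes (9A, a 4-byte offset, then a 2-byte selector).
- jb with a 32-bit displacement is 6 bytes (0F 82 plus 4 bytes).
- ret is 1 byte (C3).

The sketch below is hypothetical and only knows these few opcodes; it is not a real disassembler. The jb displacement bytes in its inputs are zeroed placeholders, since only their length matters.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the 32-bit x86 instructions in the mkdir
 * listings; returns 0 for any other opcode. */
size_t insn_length(const uint8_t *p)
{
    switch (p[0]) {
    case 0xB8:          /* movl $imm32,%eax */
        return 5;
    case 0x9A:          /* lcall $sel,$off: offset32 then selector16 */
        return 7;
    case 0xC3:          /* ret */
        return 1;
    case 0x0F:          /* two-byte opcodes */
        switch (p[1]) {
        case 0x05:      /* syscall */
        case 0x34:      /* sysenter */
            return 2;
        case 0x82:      /* jb rel32 */
            return 6;
        }
        return 0;
    }
    return 0;
}

/* The little-endian immediate of "movl $imm32,%eax": the system
 * call number loaded by the wrapper. */
uint32_t movl_imm32(const uint8_t *p)
{
    return (uint32_t)p[1] | (uint32_t)p[2] << 8 |
           (uint32_t)p[3] << 16 | (uint32_t)p[4] << 24;
}

/* Walk instruction by instruction and return the offset of the first
 * ret, or (size_t)-1 if an unknown opcode is reached first. */
size_t ret_offset(const uint8_t *code)
{
    size_t off = 0;

    while (code[off] != 0xC3) {
        size_t len = insn_length(code + off);
        if (len == 0)
            return (size_t)-1;
        off += len;
    }
    return off;
}
```

Fed the syscall variant, the walk visits 0, 5, 7 and stops at 0xd. Fed the lcall variant, it visits 0, 5, 0xc and stops at 0x12. These match the labels in the two listings.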
Tracing it back to the source
Ah-hah - now let's look at the source for the libc mkdir(2) wrapper to complete the userland picture:
rab> pwd
.../usr/src/lib/libc/common/sys
rab> cat mkdir.s
[ snip ]
#include "SYS.h"
SYSCALL_RVAL1(mkdir)
RET
SET_SIZE(mkdir)
To organize the source portably and avoid reproducing the same code in more than one place, many portions of libc are implemented as preprocessor macros. mkdir(2) is so simple that it needs nothing but the SYSCALL_RVAL1 macro, found in SYS.h. For reasons too boring to repeat here, that macro eventually expands into a corresponding SYSTRAP_RVAL1 macro. All 32-bit variants of libc share one SYS.h. Preprocessor macros defined via Makefiles in the binary directories determine which instructions go into the SYSTRAP macros:
rab> pwd
.../usr/src/lib/libc/i386/inc
rab> grep SYSTRAP_RVAL1 SYS.h
#define SYSTRAP_RVAL1(name) __SYSCALL(name)
#define SYSTRAP_RVAL1(name) __SYSENTER(name)
#define SYSTRAP_RVAL1(name) __SYSLCALL(name)
Which of the above macros is used depends on which libc is being built: __SYSCALL() for hwcap2, __SYSENTER() for hwcap1, and __SYSLCALL() for the unoptimized base libc at /lib/libc.so.1.
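The same build-time selection can be sketched in plain C: one shared header defines a macro differently depending on which flag the build passes. HYPO_USE_SYSCALL and HYPO_USE_SYSENTER are hypothetical stand-ins for whatever flags the real libc Makefiles define. TRAP_INSN and trap_insn are likewise invented names.

```c
/* HYPO_USE_SYSCALL / HYPO_USE_SYSENTER are hypothetical build flags,
 * standing in for the ones the real libc Makefiles define. */
#if defined(HYPO_USE_SYSCALL)
#define TRAP_INSN "syscall"            /* hwcap2 build: __SYSCALL() */
#elif defined(HYPO_USE_SYSENTER)
#define TRAP_INSN "sysenter"           /* hwcap1 build: __SYSENTER() */
#else
#define TRAP_INSN "lcall $0x27,$0x0"   /* base build: __SYSLCALL() */
#endif

/* Report which instruction this build was configured to use. */
const char *trap_insn(void)
{
    return TRAP_INSN;
}
```

Compiling with -DHYPO_USE_SYSCALL or -DHYPO_USE_SYSENTER selects the other strings. This parallels how the hwcap builds end up with __SYSCALL() or __SYSENTER() while sharing one header.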
rab> cat SYS.h
[ snip ]
#define __SYSLCALL(name) \
/* CSTYLED */ \
movl $SYS_/**/name, %eax; \
lcall $SYSCALL_TRAPNUM, $0
[ snip ]
#define __SYSCALL(name) \
/* CSTYLED */ \
movl $SYS_/**/name, %eax; \
.byte 0xf, 0x5 /* syscall */
We added support for AMD's syscall instruction to Solaris while still using a slightly older version of our assembler. That assembler (embarrassingly enough) didn't yet recognize the instruction, so its opcode had to be hard-coded into libc by hand.
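The $SYS_/**/name construct in these macros glues SYS_ onto the macro argument, forming the constant that holds the system call number. The trick presumably relies on a traditional (pre-ANSI) preprocessor, which deletes a comment without leaving a space behind. An ANSI C preprocessor turns the comment into a space instead and spells token pasting as ##.

Here is a minimal C equivalent. SYS_mkdir is defined locally as 80 (0x50, the value seen in the mdb listings) rather than taken from a system header. SYSNUM and mkdir_sysnum are hypothetical names.

```c
/* Local stand-in: the listings show mkdir loading 0x50 (80). */
#define SYS_mkdir 80

/* ANSI C token pasting: SYS_ and the argument become one token,
 * which is then replaced by its value. */
#define SYSNUM(name) SYS_##name

int mkdir_sysnum(void)
{
    return SYSNUM(mkdir);  /* expands to SYS_mkdir, then to 80 */
}
```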
Jumping Over the Fence
That's all for userland; the easy part is over. Because the different system call instructions work in very different ways, the kernel uses a separate code path for each. The kernel entry points are listed below (only those used by 32-bit applications making system calls are shown):
| Kernel | Entry Instruction | Kernel Entry Point |
| 64-bit kernel | lcall* | trap() |
| 64-bit kernel | syscall | sys_syscall32() |
| 64-bit kernel | sysenter | sys_sysenter() |
| 32-bit kernel | lcall | sys_call() |
| 32-bit kernel | sysenter | sys_sysenter() |
* In the 64-bit kernel, 32-bit system calls made via lcall enter the system through a segment-not-present trap (#np), a matter which is beyond the scope of this document. Trust me, you don't want to get into segmentation now...
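The entry-point table can likewise be read as a lookup keyed on kernel type and entry instruction. This is a hypothetical C sketch, with all enum and function names invented for illustration. Like the table, it covers only 32-bit callers and does not model the int $0x91 path.

```c
/* Hypothetical names; the enumerators mirror the rows of the table. */
enum kernel_width { K32, K64 };
enum entry_insn   { VIA_LCALL, VIA_SYSCALL, VIA_SYSENTER };
enum entry_fn     { EP_NONE, EP_TRAP, EP_SYS_CALL,
                    EP_SYS_SYSCALL32, EP_SYS_SYSENTER };

/* Printable kernel function names, indexed by enum entry_fn. */
static const char *const entry_fn_name[] = {
    "(none)", "trap", "sys_call", "sys_syscall32", "sys_sysenter"
};

/* Kernel entry point reached by a 32-bit application. */
enum entry_fn entry_point32(enum kernel_width k, enum entry_insn insn)
{
    switch (insn) {
    case VIA_LCALL:
        /* On the 64-bit kernel, lcall arrives as a #np trap. */
        return k == K64 ? EP_TRAP : EP_SYS_CALL;
    case VIA_SYSCALL:
        /* syscall is supported only by the 64-bit kernel. */
        return k == K64 ? EP_SYS_SYSCALL32 : EP_NONE;
    case VIA_SYSENTER:
        /* Both kernels use a handler named sys_sysenter(). */
        return EP_SYS_SYSENTER;
    }
    return EP_NONE;
}

const char *entry_point32_name(enum kernel_width k, enum entry_insn insn)
{
    return entry_fn_name[entry_point32(k, insn)];
}
```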
Seeing it in Action
Using the kernel debugger we can step out of the classroom and watch these creatures in their native wild habitats. Boot a machine and from the system console get the kernel debugger loaded and ready. Enter the debugger, and then set a breakpoint on the syscall entry point. I'm still using the same Opteron machine as above (running the 64-bit kernel), so I need to re-mount the hwcap library:
root> mount -O -F lofs /usr/lib/libc/libc_hwcap2.so.1 /lib/libc.so.1
root> mdb -K
Welcome to kmdb
Loaded modules: [ cpc ptm ufs unix krtld sppp nca lofs genunix ip logindmux usba
specfs nfs random sctp ]
[0]> sys_syscall32:b
[0]> :c
kmdb: stop at sys_syscall32
kmdb: target stopped at:
sys_syscall32: swapgs
[1]> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc230a0 1b 0 0 60 no no t-0 ffffffff82b38520 fsflush
1 ffffffff8bdd1800 1b 0 0 49 no no t-0 ffffffff8cc991e0 ksh
We set a breakpoint and tripped over it immediately after continuing, because system calls are very common even on an idle machine. We can see that CPU1 tripped the breakpoint first (as evidenced by the [1] in the kmdb prompt), and that ksh is the running process. Which system call is the shell making? Remember that the libc wrapper function stashed the system call number in register %eax. In the 64-bit kernel, %eax is the lower 32 bits of register %rax:
[1]> <rax=D
98
According to the sysent table (see sysent.c), system call 98 is sigaction(2). That makes sense, because shells are always messing around with signals.
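What <rax=D plus a glance at sysent.c amounts to can be sketched in C: truncate %rax to its low 32 bits, then look the number up.

The table here is a tiny hypothetical stand-in holding only the numbers seen in this walkthrough. The kernel's real sysent table is indexed by system call number and holds handler information rather than names. eax_of and syscall_name are invented names.

```c
#include <stddef.h>
#include <stdint.h>

/* %eax is simply the low 32 bits of %rax. */
uint32_t eax_of(uint64_t rax)
{
    return (uint32_t)rax;
}

/* Hypothetical number-to-name table, limited to this walkthrough. */
struct syscall_entry {
    uint32_t    num;
    const char *name;
};

static const struct syscall_entry known[] = {
    { 80,  "mkdir" },      /* 0x50, loaded by the libc wrapper */
    { 98,  "sigaction" },  /* made by ksh */
    { 115, "mmap" },       /* first call from the 64-bit ls */
};

/* Name of the system call whose number is in the low half of rax,
 * or NULL if it is not in the table. */
const char *syscall_name(uint64_t rax)
{
    uint32_t num = eax_of(rax);

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
        if (known[i].num == num)
            return known[i].name;
    return NULL;
}
```

For example, syscall_name(98) returns "sigaction" and syscall_name(115) returns "mmap", matching the two kmdb sessions.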
Clear the breakpoint and try the same thing with the 64-bit entry point, sys_syscall(). This time, enter the debugger by sending a break over the console (how one does this varies depending on the terminal being used to access the console):
[1]> :z
[1]> sys_syscall:b
[1]> :c
root>
root>
root>
Because this is an otherwise idle machine, nothing trips the 64-bit syscall breakpoint just yet. There just aren't very many 64-bit processes running. We can run one manually to trigger the breakpoint:
root> /usr/bin/amd64/ls
kmdb: stop at sys_syscall
kmdb: target stopped at:
sys_syscall: swapgs
[1]> <rax=D
115
We see that the first 64-bit system call made by the 64-bit ls is mmap(2), which makes sense because the 64-bit dynamic linker needs to begin setting up the new process's address space.