How the Kernel Boots: The Kernel Boot Process

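This article describes in detail how the Linux kernel boots from real mode into protected mode: how the kernel is loaded into memory, how the real-mode kernel code runs, how the jump into protected mode is made, and how the kernel initializes itself once it is in protected mode.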


The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I’ll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what’s happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff – like interrupts and memory – gets only a nod for now. The post ends with the highlights for the Windows boot.

At this point in the Intel x86 boot story the processor is running in real-mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system:

[Figure: RAM contents after the boot loader is done]
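
To make the 1 MB real-mode limit concrete, here is a small sketch of how a real-mode segment:offset pair becomes a linear address. The function name and the `a20_enabled` flag are my own illustration (the A20 gate is not covered in this post); the arithmetic itself is the standard segment-times-16-plus-offset rule.

```python
A20_MASK = 0xFFFFF  # 20 address lines give 2**20 bytes = 1 MB


def real_mode_linear(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Translate a real-mode segment:offset pair into a linear address.

    Illustrative model only: with the (assumed) A20 line disabled the
    result wraps around at 1 MB, which is why real mode "sees" 1 MB.
    """
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        linear &= A20_MASK
    return linear


if __name__ == "__main__":
    # An address inside the low 640K region.
    print(hex(real_mode_linear(0x9000, 0x0200)))
    # The highest segment plus a small offset wraps back to address 0.
    print(hex(real_mode_linear(0xFFFF, 0x0010)))
```

Both the segment and the offset are 16 bits wide, so no combination can reach far past the first megabyte, which is why only the small real-mode piece of the kernel can be touched from here.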

The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file in your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory.

The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It’s then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:

Translator's note: booting on ARM is completely different; see http://lxr.linux.no/#linux+v2.6.25.6/Documentation/arm/Booting.


[Figure: Architecture-specific Linux Kernel Initialization]

The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It’s in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a “bugger_off_msg” to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware.

After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It’s in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb (the bytes EB 3A in the file, which hexdump shows as a little-endian 16-bit word). You can verify this by running hexdump on your kernel image and seeing the bytes at that offset – just a sanity check to make sure it’s not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122.

Translator's note: the jump from start_of_setup into main() is at header.S:289.
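
If you would rather script the sanity check than eyeball hexdump output, something like the sketch below works. The field offsets (setup_sects at 0x1F1, boot_flag at 0x1FE, the jump at 0x200, the "HdrS" magic at 0x202 and the protocol version at 0x206) come from the Linux x86 boot protocol document; the function names and the synthetic demo buffer are mine, and the demo values are made up rather than taken from a real kernel.

```python
import struct

SECTOR = 512


def parse_setup_header(image: bytes) -> dict:
    """Pull a few boot-protocol fields out of the start of a kernel image."""
    setup_sects = image[0x1F1]
    if setup_sects == 0:
        setup_sects = 4  # the boot protocol treats 0 as 4 for old kernels
    (boot_flag,) = struct.unpack_from("<H", image, 0x1FE)
    opcode = image[0x200]
    (rel8,) = struct.unpack_from("<b", image, 0x201)  # signed displacement
    (version,) = struct.unpack_from("<H", image, 0x206)
    return {
        # Boot sector plus setup sectors make up the real-mode piece.
        "real_mode_size": (setup_sects + 1) * SECTOR,
        "boot_flag_ok": boot_flag == 0xAA55,
        "is_short_jump": opcode == 0xEB,
        # A short jump is relative to the end of the 2-byte instruction.
        "jump_target": 0x202 + rel8,
        "magic_ok": image[0x202:0x206] == b"HdrS",
        "protocol": (version >> 8, version & 0xFF),
    }


def make_demo_image() -> bytes:
    """Build a synthetic 1 KB buffer (NOT a real kernel) for the demo."""
    buf = bytearray(1024)
    buf[0x1F1] = 11                      # hypothetical setup_sects
    struct.pack_into("<H", buf, 0x1FE, 0xAA55)
    buf[0x200:0x202] = b"\xeb\x3a"       # the jump described in the post
    buf[0x202:0x206] = b"HdrS"
    struct.pack_into("<H", buf, 0x206, 0x0206)  # hypothetical version
    return bytes(buf)


if __name__ == "__main__":
    for key, value in parse_setup_header(make_demo_image()).items():
        print(key, value)
```

To try it on a real image, read the first kilobyte of your vmlinuz file and pass it to `parse_setup_header` in place of the demo buffer.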


main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real-mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real-mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.

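
To give a feel for what a Global Descriptor Table entry holds, here is a sketch that packs and unpacks an 8-byte x86 segment descriptor following the layout in the Intel manuals. The helper names are hypothetical and this is not the kernel's own macro; the access byte 0x9B and flags nibble 0xC used in the demo are the usual values for a flat 4 GB code segment, which I am assuming resembles what a temporary boot GDT contains.

```python
def encode_descriptor(base: int, limit: int, access: int, flags: int) -> int:
    """Pack base, 20-bit limit, access byte and flags nibble into 64 bits."""
    return (
        (limit & 0xFFFF)                    # limit bits 0..15
        | ((base & 0xFFFFFF) << 16)         # base bits 0..23
        | ((access & 0xFF) << 40)           # present, DPL, type
        | (((limit >> 16) & 0xF) << 48)     # limit bits 16..19
        | ((flags & 0xF) << 52)             # G, D/B, L, AVL
        | (((base >> 24) & 0xFF) << 56)     # base bits 24..31
    )


def segment_size(descriptor: int) -> int:
    """Size in bytes covered by the descriptor's limit."""
    limit = (descriptor & 0xFFFF) | (((descriptor >> 48) & 0xF) << 16)
    granular = (descriptor >> 55) & 1       # G bit: limit counts 4 KB pages
    return (limit + 1) * (4096 if granular else 1)


def logical_to_linear(descriptor: int, offset: int) -> int:
    """Protected-mode translation: segment base plus offset."""
    base = ((descriptor >> 16) & 0xFFFFFF) | (((descriptor >> 56) & 0xFF) << 24)
    return (base + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    flat_code = encode_descriptor(base=0, limit=0xFFFFF, access=0x9B, flags=0xC)
    print(hex(flat_code))
    print(segment_size(flat_code) == 4 * 1024**3)
    print(hex(logical_to_linear(flat_code, 0x100000)))
```

With a base of zero the logical and linear addresses coincide, which is what makes this kind of "flat" segment convenient during boot.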


We’re now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we’re running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there’s no need for it yet. What’s important is that we’re no longer confined to the 640K barrier and can now address up to 4GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.

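
The two CR0 bits that matter here can be modelled with plain integers. PE is bit 0 and PG is bit 31 per the Intel manuals; everything else in this sketch (the function names, treating the register as an ordinary number) is illustrative, since the real switch is a privileged register write done in assembly.

```python
CR0_PE = 1 << 0    # Protection Enable
CR0_PG = 1 << 31   # Paging


def enable_protected_mode(cr0: int) -> int:
    """Return the CR0 value with the PE bit set, other bits untouched."""
    return cr0 | CR0_PE


def describe(cr0: int) -> str:
    """Summarize the mode implied by a CR0 value."""
    mode = "protected mode" if cr0 & CR0_PE else "real mode"
    paging = "paging on" if cr0 & CR0_PG else "paging off"
    return f"{mode}, {paging}"


if __name__ == "__main__":
    cr0 = 0x10  # hypothetical starting value with PE and PG clear
    print(describe(cr0))
    print(describe(enable_protected_mode(cr0)))
```

Setting PE alone leaves PG clear, which matches the state described above: protected mode with paging still disabled.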


decompress_kernel() prints the familiar “Decompressing Linux…” message. Decompression happens in-place and once it’s finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints “done.” and the comforting “Booting the kernel.” By “Booting” it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.

The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:

Translator's note: this shows that two global descriptor tables exist over the course of the switch from real mode to protected mode; the one installed by the real-mode code is only temporary. The same goes for the interrupt descriptor table.
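
Since page tables get only a nod here, a tiny model may help show what they do once paging is on. The sketch assumes classic 32-bit paging with 4 KB pages and no PAE (a 10-bit directory index, a 10-bit table index and a 12-bit offset); the nested dictionaries stand in for the real in-memory tables, and the example addresses are arbitrary values I picked.

```python
def split_linear(addr: int) -> tuple:
    """Split a 32-bit linear address into (directory, table, offset)."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF


def translate(page_directory: dict, addr: int) -> int:
    """Walk a toy two-level page table and return the physical address.

    page_directory maps directory index -> {table index -> frame number}.
    A missing entry raises KeyError, standing in for a page fault.
    """
    directory, table, offset = split_linear(addr)
    frame = page_directory[directory][table]
    return (frame << 12) | offset


if __name__ == "__main__":
    # Hypothetical mapping: one page whose frame sits at physical 1 MB.
    tables = {768: {256: 0x100}}
    print(split_linear(0xC0100ABC))
    print(hex(translate(tables, 0xC0100ABC)))
```

The real tables are arrays of 32-bit entries that also carry permission and status bits, which this model leaves out.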

[Figure: Architecture-independent Linux Kernel Initialization]

start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do – a runnable process – process zero gets booted out of the CPU, only to return when no runnable processes are available.

Translator's note: cpu_idle() is an architecture-dependent function.
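
The relationship between process zero and everything else can be shown with a toy model. This is in no way how the Linux scheduler is implemented; it is only a sketch of the rule that the idle task runs exactly when nothing else is runnable. All names here are hypothetical, and each runnable process is assumed to need a single tick.

```python
from collections import deque

IDLE_PID = 0  # process zero, which hosts the idle loop


def pick_next(run_queue: deque) -> int:
    """Return the next runnable pid, or the idle pid if there is none."""
    return run_queue.popleft() if run_queue else IDLE_PID


def simulate(arrivals: dict, ticks: int) -> list:
    """Return which pid runs on each tick.

    arrivals maps a tick number to the pids that become runnable then.
    """
    run_queue = deque()
    timeline = []
    for tick in range(ticks):
        run_queue.extend(arrivals.get(tick, []))
        timeline.append(pick_next(run_queue))
    return timeline


if __name__ == "__main__":
    # pid 1 shows up at tick 1; pids 7 and 8 show up together at tick 3.
    print(simulate({1: [1], 3: [7, 8]}, ticks=6))
```

Process zero fills every gap in the timeline and steps aside the moment real work arrives.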

But here’s the kicker for us. This idle loop is the end of the long thread we followed since boot; it’s the final descendant of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here: jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can’t be the whole story, otherwise the computer would do no work.

At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we’ve seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real-mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors. Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.

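
The fallback order that init_post() follows is easy to express in a few lines. The sketch below mirrors only the ordering described above; `choose_init`, `KernelPanic` and the `can_execute` callback are names I made up, and the real kernel attempts to execute each path rather than asking a predicate.

```python
INIT_CANDIDATES = ("/sbin/init", "/etc/init", "/bin/init", "/bin/sh")


class KernelPanic(Exception):
    """Stand-in for the kernel giving up when no init can be started."""


def choose_init(can_execute, candidates=INIT_CANDIDATES) -> str:
    """Return the first candidate that can be executed, in order."""
    for path in candidates:
        if can_execute(path):
            return path
    raise KernelPanic("no init found")


if __name__ == "__main__":
    # Hypothetical system where only /etc/init and /bin/sh are usable.
    available = {"/etc/init", "/bin/sh"}
    print(choose_init(lambda path: path in available))
```

Whichever program wins this race becomes PID 1.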


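Translator's note: what follows is a brief overview of the Windows boot process. I am not particularly interested in it myself, so I am skipping over it and keeping it here only for future reference.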

The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to boot one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected mode code, into the boot loader itself (C:\NTLDR). So instead of having two regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:

[Figure: Windows Kernel Initialization]

The Windows user-mode start-up is naturally very different. There’s no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the local security authentication subsystem. The classic Windows login dialog runs in the context of Winlogon.

This is the end of this boot series. Thanks everyone for reading and for feedback. I’m sorry some things got superficial treatment; I’ve gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular “Software Illustrated” posts like this series along with other topics. Meanwhile, here are some resources:

  • The best, most important resource, is source code for real kernels, either Linux or one of the BSDs.
  • Intel publishes excellent Software Developer’s Manuals, which you can download for free.
  • Understanding the Linux Kernel is a good book and walks through a lot of the Linux Kernel sources. It’s getting outdated and it’s dry, but I’d still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I’ve heard other positive reviews for that book, so it sounds worth checking out.
  • For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.
  • Translator's addition: I would also recommend a book I consider good, Professional Linux Kernel Architecture.

Updated: a fairly good illustrative diagram

