The Kernel Boot Process

The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I’ll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what’s happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff - like interrupts and memory - gets only a nod for now. The post ends with the highlights for the Windows boot.

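At this point in the Intel x86 boot story the processor is running in real-mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system. (Note that the first jump to the kernel entry point still happens in real mode; protected mode has not been entered yet. As the figure below shows, the kernel image is loaded into two places: the still-compressed bulk sits above the kernel base at 1 MB (0x100000), and the real-mode piece sits below the 640K barrier (0xA0000).)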

RAM contents after boot loader is done

The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file in your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory. (In other words, the real-mode kernel code lives below 640K, while the part above the kernel base at 1 MB is the one that runs in protected mode.)

The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It’s then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:

Architecture-specific Linux Kernel Initialization
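To make the exchange through the real-mode kernel header a little more concrete, here is a minimal Python sketch that reads a few header fields the way a boot loader might. The offsets (0x1F1, 0x1FE, 0x202, 0x206, 0x228) are taken from the Linux/x86 boot protocol documentation; the function names, the synthetic image, and the values stored in it are illustrative assumptions, not anything from the kernel sources.

```python
import struct

# Field offsets as documented in the Linux/x86 boot protocol.
SETUP_SECTS_OFF = 0x1F1   # size of the setup code, in 512-byte sectors
BOOT_FLAG_OFF = 0x1FE     # boot sector signature, 0xAA55
HEADER_MAGIC_OFF = 0x202  # the four bytes "HdrS"
VERSION_OFF = 0x206       # boot protocol version, major in the high byte
CMD_LINE_PTR_OFF = 0x228  # filled in by the boot loader


def read_setup_header(image: bytes) -> dict:
    """Read the fields a boot loader needs before loading the kernel."""
    (setup_sects,) = struct.unpack_from("<B", image, SETUP_SECTS_OFF)
    (boot_flag,) = struct.unpack_from("<H", image, BOOT_FLAG_OFF)
    magic = image[HEADER_MAGIC_OFF:HEADER_MAGIC_OFF + 4]
    (version,) = struct.unpack_from("<H", image, VERSION_OFF)
    if boot_flag != 0xAA55 or magic != b"HdrS":
        raise ValueError("image does not carry a boot protocol header")
    if setup_sects == 0:
        setup_sects = 4  # the protocol defines 0 as meaning 4
    return {
        # boot sector (one sector) plus the setup code
        "real_mode_bytes": (setup_sects + 1) * 512,
        "protocol_version": (version >> 8, version & 0xFF),
    }


def set_cmd_line_ptr(image: bytearray, address: int) -> None:
    """What the boot loader writes back: where the command line lives."""
    struct.pack_into("<I", image, CMD_LINE_PTR_OFF, address)


def make_fake_image() -> bytearray:
    """Build a synthetic header; the values are arbitrary, for illustration."""
    image = bytearray(0x400)
    image[SETUP_SECTS_OFF] = 12
    struct.pack_into("<H", image, BOOT_FLAG_OFF, 0xAA55)
    image[HEADER_MAGIC_OFF:HEADER_MAGIC_OFF + 4] = b"HdrS"
    struct.pack_into("<H", image, VERSION_OFF, 0x0207)
    return image


image = make_fake_image()
header = read_setup_header(bytes(image))
set_cmd_line_ptr(image, 0x9F000)  # hypothetical address below 640K
```

Pointing read_setup_header() at the first kilobyte of a real kernel image should work the same way, though which protocol version you see depends on the kernel.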

The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It’s in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a “bugger_off_msg” to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware.

After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It’s in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb. You can verify this by running hexdump on your kernel image and seeing the bytes at that offset – just a sanity check to make sure it’s not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122. (To recap: the real-mode kernel starts executing here, start_of_setup creates the stack and initializes the bss segment, and control then passes to boot/main.c.)
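As a quick check on that 2-byte jump, the sketch below decodes it by hand. It assumes only the standard x86 encoding of a short jump (opcode 0xEB followed by a signed 8-bit displacement relative to the next instruction); the function name is made up for illustration.

```python
def decode_short_jmp(code: bytes, address: int) -> int:
    """Return the target of an x86 short jump (EB rel8) located at address."""
    opcode, rel8 = code[0], code[1]
    if opcode != 0xEB:
        raise ValueError("not a short jmp")
    if rel8 >= 0x80:          # the displacement is a signed byte
        rel8 -= 0x100
    return address + 2 + rel8  # relative to the end of the 2-byte instruction


# The 16-bit word 0x3aeb is stored little-endian, so the bytes are EB 3A.
code = (0x3AEB).to_bytes(2, "little")
target = decode_short_jmp(code, 0x200)
print(code.hex(), hex(target))
```

So the word 0x3aeb at offset 0x200 is the byte pair eb 3a, a jump forward over the rest of the header to offset 0x23c of the real-mode image.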

main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real-mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real-mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table. (In short, two things must happen before the switch to protected mode: IDTR must point to an interrupt descriptor table and GDTR must point to a GDT. See the 80386 manual for details on the GDT.)
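The difference in address translation can be sketched in a few lines of Python. The first function is the real-mode rule (segment times 16 plus offset, ignoring A20 gate details); the second packs an 8-byte segment descriptor of the kind a GDT holds, following the architectural layout in the Intel manuals. Both function names and the way the flags are passed in as one 16-bit value are assumptions made for this example, not the kernel's own code.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Real mode: linear address = segment * 16 + offset (A20 wrap ignored)."""
    return (segment << 4) + offset


def gdt_entry(flags: int, base: int, limit: int) -> int:
    """Pack a segment descriptor into its 64-bit form.

    flags carries the access byte in bits 0-7 and the granularity/size
    nibble in bits 12-15; base is 32 bits and limit is 20 bits.
    """
    return (
        ((base & 0xFF000000) << 32)    # base bits 24-31  -> descriptor bits 56-63
        | ((flags & 0xF0FF) << 40)     # access byte and flag nibble
        | ((limit & 0x000F0000) << 32) # limit bits 16-19 -> descriptor bits 48-51
        | ((base & 0x00FFFFFF) << 16)  # base bits 0-23   -> descriptor bits 16-39
        | (limit & 0x0000FFFF)         # limit bits 0-15
    )


# A flat 4 GB code segment and data segment: base 0, limit 0xFFFFF pages.
flat_code = gdt_entry(0xC09B, 0, 0xFFFFF)
flat_data = gdt_entry(0xC093, 0, 0xFFFFF)
print(hex(real_mode_linear(0x9000, 0x0200)), hex(flat_code), hex(flat_data))
```

With flat descriptors like these, a protected-mode logical address maps straight onto the same linear address, which is why the temporary GDT can be so small.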

(The actual process, from Wikipedia, "Linux startup process":

Kernel phase

The kernel in Linux handles all operating system processes, such as memory management, task scheduling, I/O, interprocess communication, and overall system control. This is loaded in two stages - in the first stage the kernel (as a compressed image file) is loaded into memory and decompressed, and a few fundamental functions such as basic memory management are set up. Control is then switched one final time to the main kernel start process. Once the kernel is fully operational – and as part of its startup, upon being loaded and executing – the kernel looks for an init process to run, which (separately) sets up a user space and the processes needed for a user environment and ultimate login. The kernel itself is then allowed to go idle, subject to calls from other processes.

Kernel loading stage

The kernel as loaded is typically an image file, compressed into either zImage or bzImage formats with zlib. A routine at the head of it does a minimal amount of hardware setup, decompresses the image fully into high memory, and takes note of any RAM disk if configured.[6] It then executes kernel startup via ./arch/i386/boot/head and the startup_32() (for x86 based processors) process.

Kernel startup stage

The startup function for the kernel (also called the swapper or process 0) establishes memory management (paging tables and memory paging), detects the type of CPU and any additional functionality such as floating point capabilities, and then switches to non-architecture specific Linux kernel functionality via a call to start_kernel().

start_kernel executes a wide range of initialization functions. It sets up interrupt handling (IRQs), further configures memory, starts the Init process (the first user-space process), and then starts the idle task via cpu_idle(). Notably, the kernel startup process also mounts the initial RAM disk ("initrd") that was loaded previously as the temporary root file system during the boot phase. The initrd allows driver modules to be loaded directly from memory, without reliance upon other devices (e.g. a hard disk) and the drivers that are needed to access them (e.g. a SATA driver). This split of some drivers statically compiled into the kernel and other drivers loaded from initrd allows for a smaller kernel. The root file system is later switched via a call to pivot_root() which unmounts the temporary root file system and replaces it with the use of the real one, once the latter is accessible. The memory used by the temporary root file system is then reclaimed.

Thus, the kernel initializes devices, mounts the root filesystem specified by the boot loader as read only, and runs Init (/sbin/init) which is designated as the first process run by the system (PID = 1).[3] A message is printed by the kernel upon mounting the file system, and by Init upon starting the Init process. It may also optionally run Initrd to allow setup and device related matters (RAM disk or similar) to be handled before the root file system is mounted.[3]

According to Red Hat, the detailed kernel process at this stage is therefore summarized as follows:[4]

"When the kernel is loaded, it immediately initializes and configures the computer's memory and configures the various hardware attached to the system, including all processors, I/O subsystems, and storage devices. It then looks for the compressed initrd image in a predetermined location in memory, decompresses it, mounts it, and loads all necessary drivers. Next, it initializes virtual devices related to the file system, such as LVM or software RAID before unmounting the initrd disk image and freeing up all the memory the disk image once occupied. The kernel then creates a root device, mounts the root partition read-only, and frees any unused memory. At this point, the kernel is loaded into memory and operational. However, since there are no user applications that allow meaningful input to the system, not much can be done with it." An initramfs-style boot is similar, but not identical to the described initrd boot.

At this point, with interrupts enabled, the scheduler can take control of the overall management of the system, to provide pre-emptive multi-tasking, and the init process is left to continue booting the user environment in user space.

Init process

SysV init

Init is the father of all processes. Its primary role is to create processes from a script stored in the file /etc/inittab. This file usually has entries which cause init to spawn gettys on each line that users can log in. It also controls autonomous processes required by any particular system. A run level is a software configuration of the system which allows only a selected group of processes to exist. The processes spawned by init for each of these run levels are defined in the /etc/inittab file.

— manual page for Init[7]

Init's job is "to get everything running the way it should be"[8] once the kernel is fully running. Essentially it establishes and operates the entire user space. This includes checking and mounting file systems, starting up necessary user services, and ultimately switching to a user-environment when system startup is completed. It is similar to the Unix and BSD init processes, from which it derived, but in some cases has diverged or become customized. In a standard Linux system, Init is executed with a parameter, known as a runlevel, that takes a value from 0 to 6, and that determines which subsystems are to be made operational. Each runlevel has its own scripts which codify the various processes involved in setting up or leaving the given runlevel, and it is these scripts which are referenced as necessary in the boot process. Init scripts are typically held in directories with names such as "/etc/rc...". The top level configuration file for init is at /etc/inittab.[8])

We’re now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we’re running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there’s no need for it yet. What’s important is that we’re no longer confined to the 640K barrier and can now address up to 4GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.
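The CR0 manipulation is small enough to model directly. In the sketch below the bit positions (PE is bit 0, PG is bit 31) come from the Intel manuals; the helper names and the starting register value are assumptions chosen for illustration, and a real switch also needs the far jump that protected_mode_jump performs.

```python
CR0_PE = 1 << 0    # Protection Enable
CR0_PG = 1 << 31   # Paging


def enter_protected_mode(cr0: int) -> int:
    """Set the PE bit and leave every other bit, including PG, untouched."""
    return cr0 | CR0_PE


def describe(cr0: int) -> str:
    """Summarize the CPU mode implied by a CR0 value."""
    mode = "protected" if cr0 & CR0_PE else "real"
    paging = "on" if cr0 & CR0_PG else "off"
    return f"{mode} mode, paging {paging}"


cr0_before = 0x00000010  # hypothetical real-mode value with PE clear
cr0_after = enter_protected_mode(cr0_before)
print(describe(cr0_before), "->", describe(cr0_after))
```

Paging stays off after this step; it is only enabled later, once page tables exist.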

decompress_kernel() prints the familiar “Decompressing Linux…” message. Decompression happens in-place and once it’s finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints “done.” and the comforting “Booting the kernel.” By “Booting” it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.

The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:

(That is, the kernel image sitting above the kernel base at 1 MB (0x100000) is decompressed, and execution continues in protected mode.)

Architecture-independent Linux Kernel Initialization

start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do – a runnable process – process zero gets booted out of the CPU, only to return when no runnable processes are available.
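The relationship between process zero and everything else can be captured in a toy model. This is only a sketch of the rule described above (run something runnable if there is anything, otherwise fall back to the idle task); the names and the list-of-tuples run queue are invented for illustration and bear no resemblance to the real scheduler's data structures.

```python
IDLE_TASK = "swapper (pid 0)"


def pick_next(run_queue):
    """Return the first runnable task, or the idle task when there is none.

    run_queue is a list of (name, runnable) pairs; this is a toy model.
    """
    for name, runnable in run_queue:
        if runnable:
            return name
    return IDLE_TASK


# While kernel_init is runnable it displaces the idle task...
print(pick_next([("kernel_init", True)]))
# ...and once nothing is runnable, process zero gets the CPU back.
print(pick_next([("kernel_init", False)]))
```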

But here’s the kicker for us. This idle loop is the end of the long thread we followed since boot, it’s the final descendent of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here, jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can’t be the whole story otherwise the computer would do no work.

At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we’ve seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real-mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors. Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.
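The fallback order used by init_post() is easy to express as a loop. The sketch below takes the four paths from the text; the KernelPanic class, the pick_init() name, and the can_execute callback are hypothetical stand-ins for what the kernel does with its own exec machinery.

```python
INIT_CANDIDATES = ("/sbin/init", "/etc/init", "/bin/init", "/bin/sh")


class KernelPanic(RuntimeError):
    """Stand-in for the kernel's panic() in this toy model."""


def pick_init(can_execute):
    """Return the first candidate that can be executed, in kernel order.

    can_execute is a caller-supplied predicate (an assumption of this
    sketch); the real kernel simply tries to exec each path in turn.
    """
    for path in INIT_CANDIDATES:
        if can_execute(path):
            return path
    raise KernelPanic("no init found")


# A system with only a shell still boots, just not into a full init.
available = {"/bin/sh"}
chosen = pick_init(lambda path: path in available)
print(chosen)
```

Passing a predicate such as os.access(path, os.X_OK) would check a real filesystem instead of the fake set used here.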

The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to boot one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected mode code, into the boot loader itself (C:\NTLDR). So instead of having two regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:

Windows Kernel Initialization

The Windows user-mode start-up is naturally very different. There’s no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the local security authentication subsystem. The classic Windows login dialog runs in the context of Winlogon.

This is the end of this boot series. Thanks everyone for reading and for feedback. I’m sorry some things got superficial treatment; I’ve gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular “Software Illustrated” posts like this series along with other topics. Meanwhile, here are some resources:

  • The best, most important resource, is source code for real kernels, either Linux or one of the BSDs.
  • Intel publishes excellent Software Developer’s Manuals, which you can download for free.
  • Understanding the Linux Kernel is a good book and walks through a lot of the Linux Kernel sources. It’s getting outdated and it’s dry, but I’d still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I’ve heard other positive reviews for that book, so it sounds worth checking out.
  • For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.

From: http://duartes.org/gustavo/blog/post/kernel-boot-process/
