The Relationship Between Oops and Panic in Linux

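This article explains the Oops mechanism in the Linux kernel: what causes an Oops, how the kernel handles it, and what it means for system stability. A concrete example shows what an Oops message contains and how it affects system behavior.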

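Walk along the river often enough and your shoes are bound to get wet. Use Linux long enough and it will crash on you at some point. If you are lucky, you will see a so-called "Oops" message (on the screen or in the system log), for example: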

Unable to handle kernel paging request at virtual address f899b670
printing eip:
c01de48c
*pde = 00737067
Oops: 0002 [#1]
Modules linked in: bluesmoke_e752x bluesmoke_mc md5 ipv6 parport_pc lp parport nls_cp936 vfat fat dm_mod button battery asus_acpi ac joydev yenta_socket pcmcia_core uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore ipw2200 ieee80211 ieee80211_crypt sk98lin ext3 jbd
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00210286 (2.6.9-11.21AXKProbes)
EIP is at kobject_add+0x83/0xd7
eax: c038db78 ebx: c038db04 ecx: f899b670 edx: f8a4a630
esi: c038db4c edi: f8a4a614 ebp: c038db80 esp: d7568f2c
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 8227, threadinfo=d7568000 task=f4ea99b0)
Stack: f8a4a614 ffffffea f8a4a5e4 00000000 c01de4f9 f8a4a614 c038db00 c024a1d4
f8a4a5c0 f8a4a5e4 f8a4a5f4 d7568000 c024a661 1d244b3c 00000000 0000000a
c032421b 00000000 00000000 00000015 00000014 00000016 f89ddb34 f8a4a5c0
Call Trace:
[] kobject_register+0x19/0x39
[] bus_add_driver+0x36/0x97
[] driver_register+0x82/0x89
[] pci_register_driver+0x85/0xa1
[] init_module+0xa/0x14 [bluesmoke_e752x]
[] sys_init_module+0x1ec/0x323
[] syscall_call+0x7/0xb
Code: 85 d2 0f 85 06 04 00 00 85 ed 75 0d 8b 47 28 83 c0 10 e8 82 01 00 00 89 c5 8b 47 28 8d 57 1c 83 c0 08 89 47 1c 8b 48 04 89 50 04 <89> 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00 0f

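An Oops can be thought of as a kernel-level segmentation fault. When an application performs an illegal memory access or executes an illegal instruction, it receives a signal (SIGSEGV for the bad access, SIGILL for the bad instruction). The default action is to terminate the process and dump core, but the application may also catch the signal and handle it itself. When the kernel itself makes this kind of mistake, it prints an Oops message.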
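The user-space half of this analogy can be sketched in C. The following is only an illustration and is not taken from the original article: the helper name try_store() is hypothetical, and recovering with siglongjmp() is just one of several ways a program might handle the signal itself. Storing through an invalid pointer is undefined behavior in standard C; the sketch assumes a Linux/POSIX system, where such a store typically raises SIGSEGV.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

/* Where to resume after a caught SIGSEGV (hypothetical helper state). */
static sigjmp_buf recover_point;

/* Signal handler: abandon the faulting store and jump back. */
static void on_segv(int signo)
{
    (void)signo;
    siglongjmp(recover_point, 1);
}

/*
 * Try to store `value` at `addr`.
 * Returns 1 if the store succeeded, 0 if it raised SIGSEGV,
 * and -1 if the handler could not be installed.
 */
int try_store(volatile int *addr, int value)
{
    struct sigaction sa, old;
    int ok;

    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    /* Save the signal mask too, so SIGSEGV is unblocked again after the jump. */
    if (sigsetjmp(recover_point, 1) == 0) {
        *addr = value;  /* may fault */
        ok = 1;
    } else {
        ok = 0;         /* we arrived here from on_segv() */
    }

    sigaction(SIGSEGV, &old, NULL);  /* restore the previous disposition */
    return ok;
}
```

A process that survives this way is in the same position as a kernel that survives an Oops: it keeps running, but whatever the interrupted code was in the middle of doing is left unfinished.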

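Plenty of articles explain how to read these Oops messages (http://pczou.blogchina.com/545558.html). This one only aims to explain how an Oops is produced, using the 2.6 kernel series as the example: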

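It starts with handling a memory access exception (fault) raised by the hardware. Some faults are innocent (demand paging, for example), while others are caused by kernel bugs.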

1. do_page_fault() arch/i386/mm/fault.c

If the illegal access was made by the kernel, do_page_fault() first prints the EIP, the PDE, and related information, for example:

Unable to handle kernel paging request at virtual address f899b670
printing eip:
c01de48c
*pde = 00737067

It then calls die("Oops", regs, error_code);

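After that, if the system is still alive (at least two conditions must hold: 1. the fault happened in process context; 2. panic_on_oops is not set), the kernel kills the current process and keeps running as if nothing had happened. Such luck is rare, though, and it does not last long when it does occur.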

2. do_page_fault() -> die() arch/i386/kernel/traps.c

die() first prints one line:

Oops: 0002 [#1]

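Here 0002 is the error code (only bit 1 is set: a write, to a page that was not found, made in kernel mode), and #1 is the number of Oopses that have occurred so far. The bits of the error code are defined as follows: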

* error_code:
*       bit 0 == 0 means no page found, 1 means protection fault
*       bit 1 == 0 means read, 1 means write
*       bit 2 == 0 means kernel, 1 means user-mode
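To make the bit layout concrete, here is a small user-space sketch that decodes an error code according to the three bits listed above. The struct and function names (pf_error, decode_pf_error, describe_pf_error) are hypothetical and exist only for this illustration; they are not kernel APIs.

```c
#include <stdio.h>

/* Decoded view of the page-fault error code bits listed above. */
struct pf_error {
    int protection_fault; /* bit 0: 0 = no page found, 1 = protection fault */
    int write_access;     /* bit 1: 0 = read, 1 = write */
    int user_mode;        /* bit 2: 0 = kernel, 1 = user mode */
};

/* Split the low three bits of the error code into separate flags. */
struct pf_error decode_pf_error(unsigned long error_code)
{
    struct pf_error e;

    e.protection_fault = (int)((error_code >> 0) & 1UL);
    e.write_access     = (int)((error_code >> 1) & 1UL);
    e.user_mode        = (int)((error_code >> 2) & 1UL);
    return e;
}

/* Write a one-line description into buf; returns snprintf()'s result. */
int describe_pf_error(unsigned long error_code, char *buf, size_t len)
{
    struct pf_error e = decode_pf_error(error_code);

    return snprintf(buf, len, "%s, %s, %s",
                    e.protection_fault ? "protection fault" : "no page found",
                    e.write_access ? "write" : "read",
                    e.user_mode ? "user mode" : "kernel mode");
}
```

For the code 0002 from the Oops in this article, describe_pf_error(0x0002, buf, sizeof buf) fills buf with "no page found, write, kernel mode". This agrees with the Code: line of the dump, where the marked bytes <89> 11 encode a store through ecx, and ecx holds the faulting address f899b670.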


Next, die() calls show_registers(regs) to print the registers, the current process, the stack, the instruction bytes, and so on:

Modules linked in: bluesmoke_e752x bluesmoke_mc md5 ipv6 parport_pc lp parport nls_cp936 vfat fat dm_mod button battery asus_acpi ac joydev yenta_socket pcmcia_core uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore ipw2200 ieee80211 ieee80211_crypt sk98lin ext3 jbd
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00210286 (2.6.9-11.21AXKProbes)
EIP is at kobject_add+0x83/0xd7
eax: c038db78 ebx: c038db04 ecx: f899b670 edx: f8a4a630
esi: c038db4c edi: f8a4a614 ebp: c038db80 esp: d7568f2c
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 8227, threadinfo=d7568000 task=f4ea99b0)
Stack: f8a4a614 ffffffea f8a4a5e4 00000000 c01de4f9 f8a4a614 c038db00 c024a1d4
f8a4a5c0 f8a4a5e4 f8a4a5f4 d7568000 c024a661 1d244b3c 00000000 0000000a
c032421b 00000000 00000000 00000015 00000014 00000016 f89ddb34 f8a4a5c0
Call Trace:
[] kobject_register+0x19/0x39
[] bus_add_driver+0x36/0x97
[] driver_register+0x82/0x89
[] pci_register_driver+0x85/0xa1
[] init_module+0xa/0x14 [bluesmoke_e752x]
[] sys_init_module+0x1ec/0x323
[] syscall_call+0x7/0xb
Code: 85 d2 0f 85 06 04 00 00 85 ed 75 0d 8b 47 28 83 c0 10 e8 82 01 00 00 89 c5 8b 47 28 8d 57 1c 83 c0 08 89 47 1c 8b 48 04 89 50 04 <89> 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00 0f


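In interrupt context, die() calls panic() directly [once the Oops information has been printed, panic() halts the system]. In process context, whether panic() is called depends on the panic_on_oops setting. The default value of panic_on_oops is 0, meaning that an Oops does not trigger panic(). It can be changed through sysctl: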
sysctl -w kernel.panic_on_oops=1

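The very existence of panic_on_oops shows that an Oops does not necessarily kill the system, nor does it necessarily require a reboot, just as a user program may manage to keep running after a segfault. Still, once an Oops has occurred the system is no longer entirely healthy. Even if everything looks normal on the surface, some locks may be held and never released, which can soon deadlock the system.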
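The decision made at the end of die() can be summarized as a tiny user-space model. This is a simplified sketch of the logic as this article describes it, not the kernel source: the enum and the function decide_oops_outcome() are hypothetical names, and the two boolean parameters stand in for the kernel's interrupt-context check and the panic_on_oops setting.

```c
#include <stdbool.h>

/* What happens after the Oops message has been printed. */
enum oops_outcome {
    OOPS_KILL_CURRENT_TASK, /* the current process is killed, the kernel carries on */
    OOPS_PANIC              /* the kernel calls panic() */
};

/* Model of the choice described above; not actual kernel code. */
enum oops_outcome decide_oops_outcome(bool in_interrupt, bool panic_on_oops)
{
    if (in_interrupt)
        return OOPS_PANIC;          /* no process to sacrifice in interrupt context */
    if (panic_on_oops)
        return OOPS_PANIC;          /* the administrator asked for a panic */
    return OOPS_KILL_CURRENT_TASK;  /* process context, default settings */
}
```

Only one of the four input combinations lets the system survive: process context with panic_on_oops left at 0.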

So what is panic()? It is similar to abort() in user space: do a little cleanup, and then it is safe to die (or reboot).

3. do_page_fault() -> die() -> panic()

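panic() uses the kernel.panic setting to decide how long to wait before rebooting. If kernel.panic=0, it enables interrupts and spins in an infinite loop. Otherwise, it reboots the system after that many seconds.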
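To check how a given machine would behave, the current value can be read from /proc/sys/kernel/panic, the file behind the kernel.panic sysctl. The sketch below is a user-space illustration only: read_int_file() and panic_final_step() are hypothetical helper names, and treating negative values the same as 0 is an assumption, since the article only discusses 0 and positive timeouts.

```c
#include <stdio.h>

/* Final step of panic() as described in this article. */
enum panic_step {
    PANIC_SPIN_FOREVER,      /* kernel.panic == 0: loop with interrupts enabled */
    PANIC_REBOOT_AFTER_DELAY /* kernel.panic  > 0: reboot after that many seconds */
};

/*
 * Read a single integer from a file such as /proc/sys/kernel/panic.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_int_file(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/* Map a kernel.panic value to the behavior the article describes. */
enum panic_step panic_final_step(int panic_timeout)
{
    /* Assumption: non-positive values are treated as "never reboot". */
    return panic_timeout > 0 ? PANIC_REBOOT_AFTER_DELAY : PANIC_SPIN_FOREVER;
}
```

Calling read_int_file("/proc/sys/kernel/panic", &timeout) and passing the result to panic_final_step() tells you whether a panic on that machine would hang it or lead to an automatic reboot.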

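As you can see, every one of these is a death, but the causes differ and the symptoms vary even more. Common causes of death include:

  • Illegal memory access (for example, accessing address 0)
  • Illegal instruction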

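Sometimes the kernel executes an illegal instruction on purpose, as BUG() (include/asm/bug.h) does, in order to trigger an Oops. This is similar to calling assert() in a user program.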

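Where the death occurs also matters a great deal, and directly determines how it manifests, for example:

  • Process context
  • Interrupt context

In interrupt context, interrupts are disabled and some locks are often held, so there is usually no option other than dying.

Process context allows more freedom: with luck, the system can limp along for a while.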

### Linux Kernel Oops Explanation and Solution

#### Understanding Kernel Oops

A kernel oops is an error condition raised when something goes wrong inside the Linux kernel. It indicates that the kernel has detected a serious problem with its internal state or has encountered invalid data while executing privileged code. When that happens, the kernel prints diagnostic information about what went wrong and then either continues running or panics, depending on its configuration.

The provided example demonstrates how loading a poorly written kernel module can destabilize the operating system[^1]. Specifically, calling `panic()` forces normal operation to stop immediately, as shown below:

```c
#include <linux/kernel.h>   /* printk(), panic() */
#include <linux/module.h>   /* module entry points */

int init_module(void)
{
    printk(KERN_INFO "Hello world. Now we crash.\n");
    panic("Down we go, panic called!");
    return 0;   /* never reached: panic() does not return */
}
```

Execution stops abruptly because `panic()` writes the given message to the kernel log and never returns to its caller.

In another scenario described elsewhere, inserting a faulty driver can cause memory access violations with a similar outcome: the inserted module accesses protected memory improperly and faults at runtime[^2].

#### Diagnosing Kernel Oopses

When diagnosing a kernel oops, examine the stack trace, the register state, and the call path leading up to the failure, all of which are available through logging mechanisms such as dmesg output or syslog entries. For instance, the logs produced by a failed attempt to insert a problematic module give insight into the root cause of the unexpected behavior.

#### Preventing Future Occurrences

To prevent kernel oopses caused by custom driver development, developers should follow best practices while coding and place robustness checks around error-prone critical sections, especially those that interact directly with hardware resources. For MTD (Memory Technology Device) interfaces, for example, the handlers for flash storage devices connected over SPI buses and the like must be registered properly[^3]:

```c
void register_mtd_chip_driver(struct mtd_chip_driver *);
```

In addition, rigorous testing that combines static analysis tools with dynamic verification helps catch potential pitfalls early and reduces the likelihood of catastrophic failures after deployment outside a controlled lab environment.