Reposted from: http://ilinuxkernel.com/?p=537
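While a workload was running, the Linux system froze: nothing useful was printed on the screen, the network dropped, and the keyboard and mouse stopped responding. Symptoms like these may be caused by a deadlock in the Linux kernel. Because nothing useful is printed and nothing is recorded in the kernel log, the root cause cannot be determined.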
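Getting the kernel to print relevant information before it freezes is therefore critical to diagnosing the problem. One effective approach is to enable options under the "Kernel Hacking" menu and then rebuild the kernel.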
The configuration options that help with kernel deadlocks are:
[*] Detect Soft Lockups
[ ]   Panic (Reboot) On Soft Lockups
[*] Detect Hung Tasks
(120) Default timeout for hung task detection (in seconds)
[*]   Panic (Reboot) On Hung Tasks
[ ] Lock usage statistics
[ ] Spinlock debugging: sleep-inside-spinlock checking
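To confirm which of these options a given kernel was actually built with, its build configuration can be inspected. The following is a minimal sketch, not a definitive tool. The mapping from the menu entries above to Kconfig symbols (CONFIG_DETECT_SOFTLOCKUP, CONFIG_DETECT_HUNG_TASK, and so on) is an assumption based on 2.6.32-era naming and differs in other kernel versions; the helper names and the sample configuration text are hypothetical.

```python
import re

# Assumed Kconfig symbols for the menu entries listed above
# (2.6.32-era names; other kernel versions use different ones).
DEADLOCK_OPTIONS = [
    "CONFIG_DETECT_SOFTLOCKUP",
    "CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC",
    "CONFIG_DETECT_HUNG_TASK",
    "CONFIG_DEFAULT_HUNG_TASK_TIMEOUT",
    "CONFIG_BOOTPARAM_HUNG_TASK_PANIC",
    "CONFIG_LOCK_STAT",
    "CONFIG_DEBUG_SPINLOCK_SLEEP",
]


def parse_kernel_config(text):
    """Return {symbol: value} for every option found in .config text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            settings[m.group(1)] = "n"
            continue
        # Enabled options appear as "CONFIG_FOO=y" or "CONFIG_FOO=120"
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            settings[m.group(1)] = m.group(2)
    return settings


def report_deadlock_options(text):
    """Look up each deadlock-related option; 'missing' if not mentioned."""
    settings = parse_kernel_config(text)
    return {name: settings.get(name, "missing") for name in DEADLOCK_OPTIONS}


# Hypothetical .config fragment mirroring the menu selections above.
SAMPLE_CONFIG = """\
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
"""

if __name__ == "__main__":
    for name, value in report_deadlock_options(SAMPLE_CONFIG).items():
        print(f"{name:40s} {value}")
```

On a real system the text would come from the kernel's own configuration file (many distributions install one under /boot) rather than from the sample string.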
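The following example comes from a SLES 11.1 system running a 2.6.32.12-0.7 kernel. Before the system froze, the kernel detected the deadlock and panicked, printing the deadlocked process and related information.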
ftp           D ffff88010d7c5978  4472 22481   4259 0x00000000
 ffff880031d1dd08 0000000000000016 ffff8801025cf6f8 0000000000000005
 ffff880031d1dc08 ffff880031d1dfd8 000000000000ca98 00000000001d1580
 0000000000004000 00000000001d1580 ffff880031d1dca8 ffffffff8108970c
Call Trace:
 [<ffffffff8108970c>] ? __lock_acquire+0xf7f/0x1038
 [<ffffffff81076ca2>] ? sched_clock_cpu+0x9a/0x183
 [<ffffffff810859e9>] ? mark_held_locks+0x6e/0xa3
 [<ffffffff8184df73>] ? mutex_lock_nested+0x2cf/0x4ff
 [<ffffffff8184df89>] mutex_lock_nested+0x2e5/0x4ff
 [<ffffffff810b1d6c>] ? generic_file_aio_write+0x62/0xea
 [<ffffffff810b1d6c>] ? generic_file_aio_write+0x62/0xea
 [<ffffffff810b1d6c>] generic_file_aio_write+0x62/0xea
 [<ffffffff810f93b2>] do_sync_write+0xee/0x157
 [<ffffffff810704ca>] ? autoremove_wake_function+0x0/0x4d
 [<ffffffff810fa23d>] vfs_write+0xed/0x1d1
 [<ffffffff810fa43e>] sys_write+0x5c/0x9f
 [<ffffffff8100376b>] system_call_fastpath+0x16/0x1b
1 lock held by ftp/22481:
 #0:  (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<ffffffff810b1d6c>] generic_file_aio_write+0x62/0xea
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 536, comm: khungtaskd Not tainted 2.6.32.120609_new #14
Call Trace:
 [<ffffffff8184a9b4>] panic+0x9f/0x228
 [<ffffffff810a2b59>] watchdog+0x33c/0x3a2
 [<ffffffff810a293d>] ? watchdog+0x120/0x3a2
 [<ffffffff810a281d>] ? watchdog+0x0/0x3a2
 [<ffffffff81070177>] kthread+0xa0/0xaf
 [<ffffffff8100477a>] child_rip+0xa/0x20
 [<ffffffff81048bbd>] ? finish_task_switch+0x0/0xfe
 [<ffffffff8100413c>] ? restore_args+0x0/0x30
 [<ffffffff8107001c>] ? kthreadd+0xc0/0x17b
 [<ffffffff810700d7>] ? kthread+0x0/0xaf
 [<ffffffff81004770>] ? child_rip+0x0/0x20
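When many tasks are reported, it can help to pull the key facts out of each hung-task report mechanically: the blocked task, its state and PID, the reliable stack frames, and the locks listed as held. The following is a minimal sketch, assuming the report layout shown in the log above. The regular expressions and the function name summarize_hung_task are hypothetical, and the sample is a shortened excerpt of the ftp report. Frames prefixed with "?" are addresses found on the stack that the unwinder could not confirm, so the sketch skips them.

```python
import re

# Task header: comm, state, saved pc, free stack, pid, parent pid, flags.
TASK_RE = re.compile(
    r"^\s*(\S+)\s+([A-Z])\s+[0-9a-f]{16}\s+\d+\s+(\d+)\s+(\d+)\s+0x[0-9a-f]+"
)
# Stack frame: "[<addr>] [? ]symbol+0xoff/0xsize"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Held lock: "#N:  (lock class){usage}, at: [<addr>] symbol+..."
LOCK_RE = re.compile(r"#\d+:\s+\((.+?)\)\{[^}]*\}, at: \[<[0-9a-f]+>\] (\w+)\+")


def summarize_hung_task(report):
    """Extract task, state, pid, reliable frames and held locks."""
    summary = {"task": None, "state": None, "pid": None,
               "frames": [], "locks": []}
    for line in report.splitlines():
        m = TASK_RE.match(line)
        if m and summary["task"] is None:
            summary["task"], summary["state"] = m.group(1), m.group(2)
            summary["pid"] = int(m.group(3))
            continue
        # Check lock lines first: they also contain a "[<addr>] symbol+" part.
        m = LOCK_RE.search(line)
        if m:
            summary["locks"].append((m.group(1), m.group(2)))
            continue
        m = FRAME_RE.search(line)
        if m and not m.group(1):  # skip unreliable "?" frames
            summary["frames"].append(m.group(2))
    return summary


# Shortened excerpt of the ftp report from the log above.
SAMPLE_REPORT = """\
ftp           D ffff88010d7c5978  4472 22481   4259 0x00000000
Call Trace:
 [<ffffffff8108970c>] ? __lock_acquire+0xf7f/0x1038
 [<ffffffff8184df89>] mutex_lock_nested+0x2e5/0x4ff
 [<ffffffff810b1d6c>] ? generic_file_aio_write+0x62/0xea
 [<ffffffff810b1d6c>] generic_file_aio_write+0x62/0xea
 [<ffffffff810f93b2>] do_sync_write+0xee/0x157
 [<ffffffff810fa23d>] vfs_write+0xed/0x1d1
 [<ffffffff810fa43e>] sys_write+0x5c/0x9f
 [<ffffffff8100376b>] system_call_fastpath+0x16/0x1b
1 lock held by ftp/22481:
 #0:  (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<ffffffff810b1d6c>] generic_file_aio_write+0x62/0xea
"""

if __name__ == "__main__":
    info = summarize_hung_task(SAMPLE_REPORT)
    print(info["task"], info["state"], info["pid"])
    print(" <- ".join(info["frames"]))
    for lock_class, site in info["locks"]:
        print("lock:", lock_class, "at", site)
```

For the excerpt, the first reliable frame is mutex_lock_nested, reached from generic_file_aio_write. This matches what the log shows directly: the ftp task is in state D, waiting on a mutex in its write path.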