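1: Symptoms
Recently, while on call, I ran into a problem: an ECS instance was unreachable for a while and appeared to have hung.
The logs contained a large number of "serial8250: too much work for irq4" errors, along with RCU stall (rcu_stall) warnings.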

2: Analysis
The message "serial8250: too much work for irq4" is printed by the function below. It indicates that the serial port is raising too many interrupts:
static irqreturn_t serial8250_interrupt(int irq, void *dev_id)
{
	...
	do {
		...
		l = l->next;
		/* PASS_LIMIT is defined as 512 */
		if (l == i->head && pass_counter++ > PASS_LIMIT) {
			/* If we hit this, we're dead. */
			printk_ratelimited(KERN_ERR
				"serial8250: too much work for irq%d\n", irq);
			break;
		}
	} while (l != end);
	...
}
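To make the effect of PASS_LIMIT easier to see, here is a minimal user-space sketch of the loop above. It is only a simplified model, not the driver code: it assumes a single port on the IRQ line, and `struct fake_port`, `fake_handle_port()` and `irq_loop_hits_limit()` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PASS_LIMIT 512 /* same value the 8250 driver uses */

/* Hypothetical stand-in for one UART: how many more times it reports work. */
struct fake_port {
    int pending;
};

/* Returns true if the port had work and one unit of it was handled. */
static bool fake_handle_port(struct fake_port *p)
{
    if (p->pending <= 0)
        return false;
    p->pending--;
    return true;
}

/*
 * Simplified model of the handler loop for a single port on the IRQ line.
 * Returns true when the loop gives up because of PASS_LIMIT, which is the
 * situation in which the kernel printed "too much work for irq".
 */
bool irq_loop_hits_limit(struct fake_port *p)
{
    int pass_counter = 0;

    assert(p != NULL);
    for (;;) {
        if (!fake_handle_port(p))
            return false; /* port is idle: normal exit */
        if (pass_counter++ > PASS_LIMIT)
            return true;  /* still busy after too many passes */
    }
}
```

In this model, a port that keeps reporting work trips the limit on its 514th consecutive busy pass, because the comparison uses a post-increment together with `>`. Whatever work is left at that point stays pending and would simply raise the interrupt again.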
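Looking at the corresponding interrupt statistics:
IRQ 4 has a very large interrupt count, even though the machine had only been up for a few hours.
The interrupt count for ttyS0 was 352857,
and these interrupts were generated within a very short period of time. ttyS0 is the device the console writes its log output to.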
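On Linux, per-IRQ counters like the one quoted above are exposed in /proc/interrupts. The following is a rough sketch, under the assumption of the usual "IRQ: per-CPU counts, chip, trigger, device" line layout, of how the total for one IRQ could be read; `irq_total_from_line()` and `irq_total()` are hypothetical helper names, and the sample line in the comment is made up for illustration (only the 352857 figure comes from this incident).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one /proc/interrupts line, e.g. (illustrative layout):
 *   "  4:     352857          0   IO-APIC    4-edge      ttyS0"
 * Returns the sum of the per-CPU counters if the line belongs to `irq`,
 * or -1 if it is a different line (header, other IRQ, NMI:, ...).
 */
long long irq_total_from_line(const char *line, int irq)
{
    char *end;
    const char *p;
    long n;
    long long total = 0;

    assert(line != NULL);
    n = strtol(line, &end, 10); /* strtol skips leading whitespace */
    if (end == line || *end != ':' || n != irq)
        return -1;

    p = end + 1;
    for (;;) {
        unsigned long long c = strtoull(p, &end, 10);
        /* Stop at the first token that is not a plain number. */
        if (end == p || (*end != '\0' && !isspace((unsigned char)*end)))
            break;
        total += (long long)c;
        p = end;
    }
    return total;
}

/*
 * Scan a file in /proc/interrupts format; returns -1 if the file cannot
 * be opened or the IRQ is not listed. Lines longer than the buffer
 * (machines with very many CPUs) are not handled by this sketch.
 */
long long irq_total(const char *path, int irq)
{
    char line[4096];
    long long total = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        total = irq_total_from_line(line, irq);
        if (total >= 0)
            break;
    }
    fclose(f);
    return total;
}
```

Calling `irq_total("/proc/interrupts", 4)` twice, a few seconds apart, and comparing the two values gives a rough idea of how fast the serial interrupt count is growing.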
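At this point the working hypothesis is that the volume of console output was too high, which produced a large number of interrupts, and the RCU stall detector reported the anomaly. The machine had not actually hung yet at that moment. Later a jbd2 hang was triggered as well, so data could no longer be read from the disk normally, and that is when the machine became unusable.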
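3: Conclusion
We asked the SREs what they had been doing at the time of the incident. They reported that they had attached to the VM with virsh console and sent a file with sz, and that they saw a large amount of binary data printed to the console. Output of this kind triggers a large number of interrupts, which kept cpu0 busy servicing them. The downloaded file could not be written, the disk got stuck, and the system hung.
The following patch removes the "too much work for irq4" printk.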
commit 9d7c249a1ef9bf0d5696df14e6bc067004f16979
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Aug 2 13:14:32 2018 +0200

    serial: 8250: drop the printk from serial8250_interrupt()

    If the UART has (legitimate) work to do and we break out of the loop,
    nothing changes: the interrupt is most likely already pending in the
    interrupt controller and we end up in the handler anyway. This printk is
    hardly helping.

    Older kernels also had a comment saying that a bad configuration might
    lead to this but I don't see how that should happen because a wrongly
    configured interrupt number would let the handler leave "early" with
    IRQ_NONE and the spurious detected will handle that (weill since 2.6.11,
    before that we had no spurious detector). In that case, we would never
    loop that often here.

    This loop looks like an optimisation in order to pull the bytes from the
    FIFO which were received while we were already here instead of waiting
    for the interrupt. This might have been a good idea while the CPUs were
    slow and FIFOs small.

    There are other serial driver in tree, like the amba-pl*, which also
    have this kind of a loop but without the printk (and were based on this
    driver).

    Remove the printk which might trigger in otherwise valid situtations.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
index 8fe3d0ed229e..94f3e1c64490 100644
--- a/drivers/tty/serial/8250/8250_core.c
+++ b/drivers/tty/serial/8250/8250_core.c
@@ -130,12 +130,8 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id)
                 l = l->next;
-                if (l == i->head && pass_counter++ > PASS_LIMIT) {
-                        /* If we hit this, we're dead. */
-                        printk_ratelimited(KERN_ERR
-                                "serial8250: too much work for irq%d\n", irq);
+                if (l == i->head && pass_counter++ > PASS_LIMIT)
                         break;
-                }
         } while (l != end);
         spin_unlock(&i->lock);