Subject: [PATCH] net: solve a NAPI race


License: CC 4.0 BY-SA


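This post walks through a NAPI (New API) race condition in the Linux kernel that delayed delivery of some received packets by about 200 ms. Eric Dumazet found and fixed it in 2017. The root cause: when another context already owns NAPI_STATE_SCHED, the napi_schedule() call made by the hard IRQ handler does nothing, and re-arming the IRQ afterwards does not make the device signal the pending packet again. The fix adds a NAPI_STATE_MISSED flag so that napi_complete_done() reschedules the poll and no packet is missed.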

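Many people, myself included, assume Linux is mature enough that it should no longer contain surprising bugs. I recently came across one that was not fixed until 2017.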

From 39e6c8208d7b6fb9d2047850fb3327db567b564b Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 28 Feb 2017 10:34:50 -0800
Subject: [PATCH] net: solve a NAPI race

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq

                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)
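
The interleaving above can be replayed in a small user-space model. This is only a sketch under stated assumptions: it uses C11 atomics instead of kernel bit operations, and every name in it (model_napi, old_schedule, old_complete, replay_old_race, MODEL_STATE_SCHED) is hypothetical and not part of the kernel. It models the pre-patch behaviour as a plain test-and-set of the SCHED bit followed by an unconditional clear, and counts how many packets are left in the ring with no poll queued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for NAPIF_STATE_SCHED. */
#define MODEL_STATE_SCHED 0x1UL

struct model_napi {
    atomic_ulong state;
    int ring_pending;   /* packets waiting in the RX ring */
    int polls_queued;   /* polls queued but not yet started */
};

/* Pre-patch style scheduling: only the caller that flips SCHED from 0 to 1
 * queues a poll; everyone else silently does nothing. */
static void old_schedule(struct model_napi *n)
{
    unsigned long old = atomic_fetch_or(&n->state, MODEL_STATE_SCHED);

    if (!(old & MODEL_STATE_SCHED))
        n->polls_queued++;
}

/* Pre-patch style completion: just drop the SCHED bit. */
static void old_complete(struct model_napi *n)
{
    atomic_fetch_and(&n->state, ~MODEL_STATE_SCHED);
}

/* Replays the diagram step by step and returns the number of packets that
 * stay in the ring while no poll is queued and SCHED is clear. */
static int replay_old_race(void)
{
    struct model_napi n;

    atomic_init(&n.state, 0);
    n.ring_pending = 2;
    n.polls_queued = 0;

    old_schedule(&n);      /* thread 1: busy polling or napi_watchdog() */
    n.polls_queued--;      /* thread 1: napi->poll() starts */
    n.ring_pending -= 2;   /* thread 1: reads 2 packets */

    n.ring_pending += 1;   /* device: 3rd packet arrives */
    old_schedule(&n);      /* thread 2: hard irq, SCHED already owned */

    old_complete(&n);      /* thread 1: napi_complete_done(napi, 2) */
    /* rearm_irq() would go here; the device does not signal packet 3 again. */

    if (n.polls_queued == 0 && !(atomic_load(&n.state) & MODEL_STATE_SCHED))
        return n.ring_pending;
    return 0;
}
```

In this model replay_old_race() reports one stranded packet: it sits in the ring until some later event schedules NAPI again, which matches the delayed delivery described in the commit message.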

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
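
The two-bit protocol described above can be sketched the same way. This is a user-space approximation, not the kernel code: it uses atomic_compare_exchange_weak() from C11 in place of cmpxchg(), it ignores other state bits such as the disable flag, and all names (model_napi, model_schedule_prep, model_complete, replay_fixed_race, MODEL_STATE_*) are hypothetical. The point it illustrates is that both functions read the state word, compute the new value of both bits, and commit them in one atomic step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NAPIF_STATE_SCHED and NAPIF_STATE_MISSED. */
#define MODEL_STATE_SCHED  0x1UL
#define MODEL_STATE_MISSED 0x2UL

struct model_napi {
    atomic_ulong state;
    int ring_pending;   /* packets waiting in the RX ring */
    int polls_queued;   /* polls queued but not yet started */
};

/* Returns true if the caller acquired SCHED. If SCHED was already owned,
 * records the attempt by setting MISSED in the same atomic update. */
static bool model_schedule_prep(struct model_napi *n)
{
    unsigned long val = atomic_load(&n->state);
    unsigned long next;

    do {
        next = val | MODEL_STATE_SCHED;
        if (val & MODEL_STATE_SCHED)
            next |= MODEL_STATE_MISSED;
        /* On failure, val is reloaded with the current state. */
    } while (!atomic_compare_exchange_weak(&n->state, &val, next));

    return !(val & MODEL_STATE_SCHED);
}

static void model_schedule(struct model_napi *n)
{
    if (model_schedule_prep(n))
        n->polls_queued++;
}

/* Clears MISSED. If MISSED was set, keeps SCHED and queues one more poll,
 * returning false; otherwise releases SCHED and returns true. */
static bool model_complete(struct model_napi *n)
{
    unsigned long val = atomic_load(&n->state);
    unsigned long next;

    do {
        assert(val & MODEL_STATE_SCHED);   /* only the owner may complete */
        next = val & ~(MODEL_STATE_SCHED | MODEL_STATE_MISSED);
        if (val & MODEL_STATE_MISSED)
            next |= MODEL_STATE_SCHED;
    } while (!atomic_compare_exchange_weak(&n->state, &val, next));

    if (val & MODEL_STATE_MISSED) {
        n->polls_queued++;
        return false;
    }
    return true;
}

/* Replays the same interleaving as the diagram and returns the number of
 * packets still in the ring once no poll is queued any more. */
static int replay_fixed_race(void)
{
    struct model_napi n;

    atomic_init(&n.state, 0);
    n.ring_pending = 2;
    n.polls_queued = 0;

    model_schedule(&n);    /* thread 1: busy polling or napi_watchdog() */
    n.polls_queued--;      /* thread 1: napi->poll() starts */
    n.ring_pending -= 2;   /* thread 1: reads 2 packets */

    n.ring_pending += 1;   /* device: 3rd packet arrives */
    model_schedule(&n);    /* thread 2: hard irq, sets MISSED */

    model_complete(&n);    /* thread 1: sees MISSED, queues another poll */

    while (n.polls_queued > 0) {   /* the rescheduled poll runs */
        n.polls_queued--;
        n.ring_pending = 0;        /* drains the ring */
        model_complete(&n);
    }
    return n.ring_pending;
}
```

With the MISSED bit in place, the hard IRQ's failed attempt leaves a trace, the first completion queues a second poll, and replay_fixed_race() ends with an empty ring.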

The relevant call chains in mlx5_core are as follows:

alloc_comp_eqs
  mlx5_create_map_eq
    request_irq(mlx5_eq_int)

mlx5_eq_int                          # hard irq handler
  mlx5_eq_cq_completion
    cq->comp -> mlx5e_completion_event
      napi_schedule
        napi_schedule_prep
          set NAPIF_STATE_SCHED
          if NAPIF_STATE_SCHED is already set
            set NAPIF_STATE_MISSED
        __napi_schedule
          ____napi_schedule
            __raise_softirq_irqoff(NET_RX_SOFTIRQ)

mlx5e_napi_poll                      # softirq handler
  mlx5e_poll_tx_cq
  mlx5e_poll_rx_cq
  napi_complete_done
    clear NAPIF_STATE_SCHED
    if NAPIF_STATE_MISSED is set
      __napi_schedule
  mlx5e_cq_arm