Troubleshooting notes: RX dropped on NIC `bond0` (detailed edition)

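1. Symptoms and quick conclusions

Symptom: ifconfig bond0 or ip -s link show bond0 reports RX errors = 0 but a high RX dropped count.

Conclusions (quick triage):

RX errors = 0 → the NIC is not reporting hardware-level errors such as bad frames or CRC failures.
High RX dropped → packets are being discarded by the kernel or the driver (software drops, queue overflow, NAPI/softirq processing falling behind), or the cause lies upstream (switch behavior, traffic distribution).
Look at the underlying slave NICs (e.g. eth0, eth1, ensX) and at kernel/interrupt metrics rather than at ethtool -S bond0 (the bond device itself usually exposes no statistics there).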

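2. Possible causes (in priority order, most common first)

RX ring / buffer on the physical NIC is too small (most common)
Softirq / NET_RX work piles up on one CPU (softirq processing cannot keep up)
Interrupts (IRQs) or queues are unevenly distributed and overload a single core (MSI-X or multi-queue not used properly)
Bond mode or hash policy skews traffic onto one link (LACP, balance-xor, etc.)
Upstream switch port or buffer overflow, or inconsistent LACP configuration
Driver/firmware problems, SR-IOV/VF/DPDK conflicts, or incorrect offload settings
Pressure from the container/virtualization network stack (OVS, Cilium, Docker bridge)
Very rare: hardware bug, DMA problem, or abnormal flow control on a router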

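3. Information to collect first

Start by capturing a snapshot of the environment and current state, for later analysis or to hand to the network engineers or the vendor: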

# bond information
cat /proc/net/bonding/bond0

# ifconfig / ip output
ifconfig bond0
ip -s link show bond0

# list the slaves
awk '/Slave Interface/ {print $3}' /proc/net/bonding/bond0

# brief status of each slave (repeat for every slave)
ip -s link show eth0
ethtool eth0

# NIC statistics (the important part)
ethtool -S eth0 | grep -Ei "rx|drop|err|miss|over"

# ring buffer sizes
ethtool -g eth0

# interrupt distribution
grep -E "eth0|ens|eno|bond0" /proc/interrupts

# softirq counters (the file is /proc/softirqs, plural)
watch -n1 cat /proc/softirqs

# watch the ksoftirqd threads in top (top -p accepts at most 20 PIDs)
top -H -p "$(pgrep -d, ksoftirqd | cut -d, -f1-20)"

# kernel log: look for related errors in the time window
journalctl -k --since "1 hour ago"
dmesg | tail -n 200

Save the output to files so it can be compared and sent along (adjust the interface list to the actual slaves):

mkdir -p /tmp/net-triage && \
for i in bond0 eth0 eth1; do \
  ip -s link show "$i" > "/tmp/net-triage/$i.ip" 2>&1; \
  ethtool -S "$i" > "/tmp/net-triage/$i.ethtool" 2>&1 || true; \
  ethtool -g "$i" > "/tmp/net-triage/$i.ring" 2>&1 || true; \
done
cat /proc/interrupts > /tmp/net-triage/interrupts
cat /proc/softirqs > /tmp/net-triage/softirqs
cat /proc/net/bonding/bond0 > /tmp/net-triage/bond0.info

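4. Step-by-step troubleshooting (run in order)

Step 0: Initial confirmation (look at the big picture first)

ip -s link show bond0
ifconfig bond0

Note the absolute RX dropped value, how fast it grows, and whether it keeps rising.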
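To put a number on the growth rate, one option is to read the kernel's per-interface counter twice and divide by the interval. The following is a minimal sketch, assuming the standard sysfs counter /sys/class/net/<dev>/statistics/rx_dropped is readable; the function names and the 10-second default window are illustrative choices, not part of the original procedure.

```shell
# drops_per_sec BEFORE AFTER SECONDS
# Integer drops per second between two counter readings.
drops_per_sec() {
    echo $(( ($2 - $1) / $3 ))
}

# sample_rx_dropped DEV [SECONDS]
# Hypothetical helper: reads the sysfs rx_dropped counter twice and
# prints the average drop rate over the window.
sample_rx_dropped() {
    local dev=$1 secs=${2:-10} f before after
    f="/sys/class/net/$dev/statistics/rx_dropped"
    before=$(cat "$f") || return 1
    sleep "$secs"
    after=$(cat "$f") || return 1
    echo "$dev rx_dropped: $(drops_per_sec "$before" "$after" "$secs")/s over ${secs}s"
}

# Usage: sample_rx_dropped bond0 10
```

A rate that stays at zero means the counter is only a historical total; a rate that follows traffic peaks points to a capacity problem rather than a one-off event.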

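Step 1: Decide which layer is responsible (bond or slave)

cat /proc/net/bonding/bond0

Identify the slaves (e.g. eth0, eth1). Then, for each slave:

ethtool -S eth0 | grep -Ei "rx|drop|miss|over|err"

How to read the result (counter names vary by driver):

If one slave shows large rx_dropped / rx_missed_errors values → the problem is in that physical NIC, its driver or hardware queues, or the upstream switch.
If no slave shows significant counters but bond0 dropped is high → the problem is more likely in the kernel (softirq, RPS, nf_conntrack, network namespaces/containers). In active-backup mode, also check whether the drops are simply frames arriving on the inactive slave, which the bonding driver discards.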
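With hundreds of driver counters per NIC, it helps to show only the RX counters that look like drops and are non-zero. This is a sketch of such a filter; it reads ethtool -S output on stdin, and the keyword list (drop, miss, over, no_buf, discard) is an assumption, since counter names differ between drivers.

```shell
# nonzero_rx_drops
# Reads `ethtool -S <dev>` output on stdin and prints NAME=VALUE for RX
# counters whose name suggests a drop/overflow and whose value is > 0.
nonzero_rx_drops() {
    awk -F: '
        NF == 2 {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim indentation
            lname = tolower(name)
            if (lname ~ /rx/ && lname ~ /drop|miss|over|no_buf|discard/ && $2 + 0 > 0)
                print name "=" ($2 + 0)
        }'
}

# Usage:
#   for s in eth0 eth1; do echo "== $s"; ethtool -S "$s" | nonzero_rx_drops; done
```

An empty result for every slave, combined with a rising bond0 counter, is the signal to move on to the kernel-side checks.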

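Step 2: Check the RX ring buffer (critical)

ethtool -g eth0

Compare the RX value under "Current hardware settings" with the RX value under "Pre-set maximums", and check whether the current value is small (e.g. 256 or 512).

If it is small (< 2048) and traffic is heavy, enlarge it temporarily (do not exceed the pre-set maximum):

ethtool -G eth0 rx 4096

Check that the command succeeds (some drivers or hardware impose limits and return an error; some drivers also briefly take the link down while resizing). After enlarging, watch whether the dropped rate falls.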
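The comparison of current and maximum ring size can be scripted. The sketch below parses ethtool -g output and only ever requests a size up to the reported maximum. The function names and the 4096 default target are assumptions for illustration, and the parser assumes the usual "Pre-set maximums" / "Current hardware settings" layout with numeric RX values.

```shell
# rx_ring_sizes
# Reads `ethtool -g <dev>` output on stdin, prints "<current_rx> <max_rx>".
rx_ring_sizes() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        $1 == "RX:"                  { value[section] = $2 }
        END { print value["cur"], value["max"] }'
}

# grow_rx_ring DEV [TARGET]
# Hypothetical helper: raises the RX ring to TARGET (default 4096),
# capped at the driver-reported maximum. Needs root.
grow_rx_ring() {
    local dev=$1 target=${2:-4096} sizes cur max
    sizes=$(ethtool -g "$dev" | rx_ring_sizes) || return 1
    cur=${sizes% *}
    max=${sizes#* }
    if [ "$target" -gt "$max" ]; then
        target=$max
    fi
    if [ "$cur" -lt "$target" ]; then
        ethtool -G "$dev" rx "$target"
    fi
}

# Usage: grow_rx_ring eth0 4096
```

Capping at the reported maximum avoids the error path, but it does not replace testing: larger rings cost memory and may have driver-specific side effects.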

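Step 3: Check interrupts and queues (queue count / IRQ affinity)

Show the number of queues (channels):

ethtool -l eth0

Show the interrupt distribution:

grep -E "eth0|ens|eno|igb|ixgbe|mlx5" /proc/interrupts

Check whether all interrupts land on a single CPU core, or whether a few cores carry a very high load.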
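Reading /proc/interrupts by eye gets hard on machines with many cores. As a rough aid, the sketch below sums the interrupt counts per CPU for all rows matching a pattern; it assumes the usual layout (a header row of CPU names, then one row per IRQ with one count column per CPU). The function name is illustrative.

```shell
# irq_per_cpu PATTERN
# Reads /proc/interrupts-style text on stdin and prints, for each CPU,
# the summed count of all IRQ rows whose line matches PATTERN.
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }      # header row: CPU0 CPU1 ...
        $0 ~ pat {
            for (i = 1; i <= ncpu; i++)
                sum[i] += $(i + 1)       # field 1 is the IRQ number
        }
        END {
            for (i = 1; i <= ncpu; i++)
                printf "CPU%d %d\n", i - 1, sum[i]
        }'
}

# Usage: irq_per_cpu eth0 < /proc/interrupts
```

The counts are totals since boot, so comparing two runs a few seconds apart shows the current distribution more reliably than a single reading.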

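To rebalance automatically, run irqbalance either as a service or as a one-off pass:

systemctl start irqbalance   # if installed
irqbalance --oneshot

Or adjust the affinity by hand (example only, use with care):

echo f > /proc/irq/<irq_num>/smp_affinity   # hex CPU mask; f = CPUs 0-3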
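The affinity value is a hexadecimal bitmask with one bit per CPU (bit 0 = CPU0). Building it by hand is error-prone, so here is a small sketch that derives the mask from a list of CPU numbers. It is limited to CPUs 0-31; on larger systems the kernel expects comma-separated 32-bit groups, or the smp_affinity_list file can be used instead, which accepts ranges such as 0-3. The function name and the IRQ number in the usage line are hypothetical.

```shell
# cpu_mask CPU...
# Prints the hex bitmask with the given CPU numbers set (CPUs 0-31 only).
cpu_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# cpu_mask 0 1 2 3   prints f
# cpu_mask 4 5 6 7   prints f0
#
# Usage (as root, IRQ 45 is a made-up example):
#   cpu_mask 2 3 > /proc/irq/45/smp_affinity
```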

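Step 4: Watch softirqs and CPU usage

watch -n1 cat /proc/softirqs
top -H

Check whether the NET_RX row grows rapidly on individual CPUs (each column is one CPU), and whether any ksoftirqd/<n> thread is using a lot of CPU.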
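Because the softirq counters are cumulative, the useful signal is the per-CPU increase between two snapshots. The sketch below computes that from two saved copies of /proc/softirqs (the kernel's per-CPU softirq counter file); the function name and the temporary file paths in the usage lines are illustrative.

```shell
# net_rx_delta BEFORE_FILE AFTER_FILE
# Given two saved copies of /proc/softirqs, prints the per-CPU increase
# of the NET_RX row between them.
net_rx_delta() {
    awk '
        $1 != "NET_RX:" { next }
        FNR == NR {                      # first file: remember the counts
            for (i = 2; i <= NF; i++) before[i] = $i
            next
        }
        {                                # second file: print the difference
            for (i = 2; i <= NF; i++)
                printf "CPU%d +%d\n", i - 2, $i - before[i]
        }
    ' "$1" "$2"
}

# Usage:
#   cat /proc/softirqs > /tmp/softirqs.a; sleep 10
#   cat /proc/softirqs > /tmp/softirqs.b
#   net_rx_delta /tmp/softirqs.a /tmp/softirqs.b
```

One CPU with a far larger increase than the rest is the pattern this step looks for.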

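If softirq load is high, consider:

Enabling RPS/RFS (software packet steering across CPUs)
Increasing the number of queues or spreading the interrupts
Adjusting CPU pinning (move interrupts to idle cores)

RPS example (temporary, lost on reboot):

# set the CPU mask for rx queue 0 of eth0 (ff = CPUs 0-7)
echo 000000ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
# RFS also needs the global flow table; rps_flow_cnt alone has no effect
sysctl -w net.core.rps_sock_flow_entries=32768
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

(Note: the mask and the flow counts must be adapted to the machine's CPU count and queue count; repeat for every rx-N queue.)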
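On a multi-queue NIC the same settings have to be written to every rx-N directory. The sketch below loops over the queues and splits the global flow table evenly between them (rps_flow_cnt = rps_sock_flow_entries / number of queues, as suggested in the kernel's network scaling documentation). The function name and the optional sysfs-root argument are assumptions, added so the loop can be dry-run against a scratch directory.

```shell
# set_rps DEV MASK SOCK_FLOW_ENTRIES [SYSFS_ROOT]
# Writes MASK to rps_cpus and an even share of SOCK_FLOW_ENTRIES to
# rps_flow_cnt for every rx queue of DEV. Needs root on a real system.
set_rps() {
    local dev=$1 mask=$2 entries=$3 root=${4:-/sys/class/net}
    local q n=0
    for q in "$root/$dev"/queues/rx-*; do
        if [ -d "$q" ]; then
            n=$((n + 1))
        fi
    done
    if [ "$n" -eq 0 ]; then
        return 1                          # no rx queues found
    fi
    for q in "$root/$dev"/queues/rx-*; do
        echo "$mask" > "$q/rps_cpus"
        echo $((entries / n)) > "$q/rps_flow_cnt"
    done
}

# Usage (as root):
#   sysctl -w net.core.rps_sock_flow_entries=32768
#   set_rps eth0 ff 32768
```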

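Step 5: Check the bond mode and load distribution

cat /proc/net/bonding/bond0

Look at the mode (balance-rr, active-backup, balance-xor, 802.3ad, balance-tlb, balance-alb, ...) and the currently active slave.

Common problems:

802.3ad (LACP) requires a matching configuration on the switch; otherwise traffic may be redistributed incorrectly.
Hash-based modes (balance-xor, 802.3ad) keep a single 5-tuple flow on one link, so one heavy flow can saturate that link and cause drops. balance-rr instead spreads the packets of one flow across links, which can cause reordering. For received traffic, the distribution across links is decided by the switch's hash policy, not by the server.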
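The fields that matter for this step can be pulled out of the bonding status file in one pass. This sketch assumes the usual "Key: value" layout of /proc/net/bonding/<bond>; the function name and output keys are illustrative.

```shell
# bond_summary
# Reads /proc/net/bonding/<bond> content on stdin and prints the mode,
# transmit hash policy, active slave (if any) and the slave names.
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/           { print "mode=" $2 }
        /^Transmit Hash Policy:/   { print "xmit_hash_policy=" $2 }
        /^Currently Active Slave:/ { print "active_slave=" $2 }
        /^Slave Interface:/        { print "slave=" $2 }'
}

# Usage: bond_summary < /proc/net/bonding/bond0
```

The transmit hash policy only governs traffic leaving the server, so for an RX drop investigation it is mainly useful as the value to compare against the switch-side hash configuration.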

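Step 6: Check the upstream switch (needs a network engineer)

Work with the switch administrator to check:

Input discards / CRC / errors in the switch port counters (show interface <port> counters)
LACP state and the hash policy of the aggregated port (is it consistent with the server?)
Whether port speed and duplex match
Switch buffer overflow / tail drop indicators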

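Step 7: Rule out driver / firmware / offload / virtualization interference

Check the NIC driver and firmware version with ethtool -i eth0 and compare against the vendor's release notes for known bugs.
Check the offload settings (GRO, GSO, TSO, RX checksum; shown by ethtool -k eth0) for conflicts with containers or DPDK.
If SR-IOV / VFs are in use, confirm that the VF configuration has no flow-control or queue problems.
In container networking (Cilium/OVS), check whether OVS conntrack or iptables/nftables is the bottleneck.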
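When comparing a healthy host with an affected one, a compact view of the main offload switches is easier to diff than the full feature list. The sketch below extracts a handful of top-level features from ethtool -k output; the selection of features and the function name are assumptions and can be extended as needed.

```shell
# offload_state
# Reads `ethtool -k <dev>` output on stdin and prints FEATURE=on|off for
# a few top-level offloads (indented sub-features are ignored).
offload_state() {
    awk -F': ' '
        $1 ~ /^(rx-checksumming|tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload)$/ {
            split($2, v, " ")            # drop suffixes such as "[fixed]"
            print $1 "=" v[1]
        }'
}

# Usage:
#   ethtool -k eth0 | offload_state > /tmp/eth0.offload
#   diff /tmp/eth0.offload /tmp/eth1.offload
```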

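5. Common fixes (immediate and long-term)

Temporary (quick to verify, fast to take effect)

Enlarge the RX ring: ethtool -G eth0 rx 4096
Start irqbalance or spread IRQs by hand
Enable RPS: set /sys/class/net/.../queues/rx-*/rps_cpus and rps_flow_cnt
Temporarily steer the affected node's traffic onto another link (as load relief)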

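Medium term (more stable)

Switch to a more suitable bond mode (coordinate with the network side)
Upgrade firmware/driver (fixes provided by the vendor)
Adjust the switch hash policy so traffic is spread more evenly
Tune kernel parameters (for example net.core.netdev_budget and net.core.netdev_max_backlog) or otherwise reduce softirq latency (e.g. CPU priority adjustments)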

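Permanent (system / architecture level)

Plan capacity ahead of time: size rings and queue counts for the expected PPS and concurrency
Use multiple queues spread over multiple cores (MSI-X + RPS/RFS)
For high-PPS services, use DPDK or SR-IOV (where applicable and after evaluation)
Monitoring and alerting (see the next section)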

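6. Monitoring and alerting recommendations (essential items)

RX dropped, TX dropped and errors from ifconfig / ip -s link (per-minute granularity)
NIC-level ethtool -S fields (rx_missed_errors, rx_no_buffer_count, etc.; names vary by driver)
Interrupt distribution (/proc/interrupts) and the interrupt rate of individual queues (sample when something looks wrong)
Growth trend of NET_RX in /proc/softirqs
Ring buffer configuration (only needs recording when it changes)
Switch port drop/discard counters (shared with the network engineers)
PPS (packets per second) and bandwidth (bps) together, to tell small-packet high-PPS situations apart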
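The last item can be illustrated with a little arithmetic: from two readings of the packet and byte counters, PPS, bps and the average packet size follow directly. This sketch assumes the readings come from /sys/class/net/<dev>/statistics/rx_packets and rx_bytes; the function name and the output format are illustrative.

```shell
# rate_summary PKTS1 BYTES1 PKTS2 BYTES2 SECONDS
# Prints packets/s, bits/s and average packet size between two readings.
rate_summary() {
    local dp=$(( $3 - $1 )) db=$(( $4 - $2 )) secs=$5 avg=0
    if [ "$dp" -gt 0 ]; then
        avg=$(( db / dp ))
    fi
    echo "pps=$(( dp / secs )) bps=$(( db * 8 / secs )) avg_pkt_bytes=$avg"
}

# Usage:
#   s=/sys/class/net/eth0/statistics
#   p1=$(cat $s/rx_packets); b1=$(cat $s/rx_bytes); sleep 10
#   p2=$(cat $s/rx_packets); b2=$(cat $s/rx_bytes)
#   rate_summary "$p1" "$b1" "$p2" "$b2" 10
```

A high pps value with a small avg_pkt_bytes at modest bandwidth is the small-packet case, where per-packet CPU cost rather than link capacity is the likely limit.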

7. Automated collection script (one-shot sampling, usable on many hosts)

Save the script below as /usr/local/bin/net-triage.sh and run it; it produces /tmp/net-triage-<ts>.tar.gz:

#!/bin/bash
# One-shot snapshot of bond0, its slaves, and interrupt/softirq state.
TS=$(date +%s)
OUTDIR=/tmp/net-triage-$TS
mkdir -p "$OUTDIR"
echo "$TS" > "$OUTDIR/ts"

# basic info
uname -a > "$OUTDIR/uname"
ip -s link > "$OUTDIR/ip_s_link"
cat /proc/net/bonding/bond0 > "$OUTDIR/bond0" 2>&1 || true

# per-slave details (slave names are read from the bonding status file)
for sl in $(awk '/Slave Interface/ {print $3}' /proc/net/bonding/bond0 2>/dev/null); do
  echo "=== $sl ===" > "$OUTDIR/$sl.info"
  ip -s link show "$sl" >> "$OUTDIR/$sl.info" 2>&1
  ethtool -i "$sl" >> "$OUTDIR/$sl.info" 2>&1
  ethtool -S "$sl" > "$OUTDIR/$sl.ethtool" 2>&1 || true
  ethtool -g "$sl" > "$OUTDIR/$sl.ring" 2>&1 || true
  ethtool -l "$sl" > "$OUTDIR/$sl.queues" 2>&1 || true
done

# interrupt and softirq counters (the file is /proc/softirqs, plural)
cat /proc/interrupts > "$OUTDIR/interrupts"
cat /proc/softirqs > "$OUTDIR/softirqs"
dmesg | tail -n 500 > "$OUTDIR/dmesg"
journalctl -k --no-pager -n 500 > "$OUTDIR/journal_dmesg" 2>&1 || true

tar czf "/tmp/net-triage-$TS.tar.gz" -C /tmp "net-triage-$TS"
echo "/tmp/net-triage-$TS.tar.gz"

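8. Worked cases (how to read the output and decide)

Case A: ethtool -S eth0 shows a very high rx_missed_errors
→ The NIC's RX ring is most likely being exhausted (ring too small, or bursts from upstream); enlarge the ring first.
Case B: ethtool -S eth0 shows no notable counters, but /proc/softirqs shows NET_RX surging on CPU0 and ksoftirqd/0 using a lot of CPU
→ The bottleneck is softirq processing capacity; consider RPS, spreading interrupts, or adding CPU resources.
Case C: /proc/net/bonding/bond0 shows mode 802.3ad, but LACP on the switch side is inconsistent or part of the port-channel is flapping
→ Verify LACP on the switch with the network engineers, confirm the hash / port-channel configuration, and check for drops on the switch.
Case D: interrupt counts are all concentrated on a single IRQ/queue
→ Adjust IRQ affinity manually or automatically, or enable/restart irqbalance. If only one queue receives traffic at all, also check the queue count with ethtool -l.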

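9. Information to include when talking to network engineers or the vendor

The server-side collection bundle (net-triage-<ts>.tar.gz produced by the script above)
Output of cat /proc/net/bonding/bond0 (shows the bond mode)
Output of ethtool -S, ethtool -g and ethtool -i for each slave
Time-series samples of /proc/interrupts and /proc/softirqs (at least two points in time, 30 s to 1 min apart)
Switch-side port counters (input discards, CRCs, errors) and LACP status output
Traffic characteristics of the application during the incident (PPS vs. bps, any small-packet flood)
Whether the driver, kernel, firmware or switch configuration was changed recently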

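10. Common misconceptions and cautions

Misconception: bond0 shows drops, so it must be a bonding bug → in practice the relevant counters are usually on the slaves.
Misconception: set the ring to the maximum without testing → some hardware/drivers have upper limits or other side effects (memory consumption has to be considered).
Caution: before changing IRQ affinity, RPS, ring sizes and similar settings in production, make the changes step by step during low traffic and monitor how the metrics respond.
Caution: some settings (such as ethtool -G) are lost after the NIC is reset or the driver is reloaded and must be made persistent (systemd unit / rc.local / modprobe.d or vendor tools).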

11. Persisting the settings (systemd service example)

Apply the ring buffer and RPS settings at boot (example):
/etc/systemd/system/net-tuning.service

[Unit]
Description=Network tuning for NICs
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/net-tuning.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

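/usr/local/sbin/net-tuning.sh (example; adapt to the actual NICs):

#!/bin/bash
# example for eth0
ethtool -G eth0 rx 4096 || true
# set RPS mask for rx queue 0 (ff = CPUs 0-7, 8-core example)
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus || true
# RFS needs the global flow table as well as the per-queue count
sysctl -w net.core.rps_sock_flow_entries=32768 || true
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt || true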

Then:

chmod +x /usr/local/sbin/net-tuning.sh
systemctl daemon-reload
systemctl enable --now net-tuning.service

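12. Quick checklist (printable)

Run cat /proc/net/bonding/bond0 and list the slaves
For each slave, check rx_dropped / rx_missed_errors / rx_no_buffer_count with ethtool -S <slave>
Check the RX ring size with ethtool -g <slave> → if small, try enlarging it with ethtool -G
Check interrupt/queue distribution with cat /proc/interrupts and ethtool -l <slave>
Check softirq / ksoftirqd CPU usage with cat /proc/softirqs and top -H
Run irqbalance --oneshot or adjust affinity by hand, then observe the result
Review the rx dropped time series from ip -s link / ifconfig (does it follow traffic?)
Verify switch-side counters, LACP and hash policy with the network engineers
If the driver/firmware is suspected, record ethtool -i and confirm with the vendor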
