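Background:
In a production environment, a server's network link suddenly dropped and SSH connections failed.
Initial diagnosis:
The kernel log showed the following NIC errors: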
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period
Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: detected SFP+: 5
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 24 11:52:43 localhost kernel: bond5: link status definitely up for interface eth0, 10000 Mbps full duplex.
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx_buffer_info[next_to_clean]
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
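When the log is long, it can help to count how often each kind of message appears before reading it line by line. The following is a minimal sketch, not part of the original investigation. The function name summarize_ixgbe_events and the category labels are hypothetical, and the patterns are copied only from the messages in the excerpt above, so other ixgbe or bonding messages would go uncounted.

```python
import re
from collections import Counter

# Patterns copied from the messages in the log excerpt above.
# The category names on the left are illustrative labels, not driver terms.
PATTERNS = {
    "tx_unit_hang": re.compile(r"Detected Tx Unit Hang"),
    "adapter_reset": re.compile(r"Reset adapter"),
    "rx_queue_not_cleared": re.compile(
        r"RXDCTL\.ENABLE on Rx queue \d+ not cleared"
    ),
    "bond_link_down": re.compile(r"link status definitely down"),
    "bond_link_up": re.compile(r"link status definitely up"),
}


def summarize_ixgbe_events(lines):
    """Count how many log lines match each known pattern.

    `lines` is any iterable of strings, e.g. an open log file.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of lines, so it can be fed an open file object for the system log (the path depends on the distribution and is not stated in the original) or the saved output of dmesg.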
NIC PCI information:
# lspci -vvv -s 84:00.0
84:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+