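I. Initial Fault Report
The customer reported that the messages log on a Linux server contained a large number of I/O error alerts and that paths to storage disks had been lost. The customer's initial suspicion was that a faulty HBA card in the server had caused the loss of the storage paths.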
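II. Initial Investigation
On arriving on site, we first inspected the server in the machine room. The server was running normally, both of the host's HBA cards were healthy, and there were no hardware alerts. This made the cause the customer suspected, an HBA card fault, very unlikely.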
III. Further Investigation
1. Connect to the system remotely over SSH and first check the /var/log/messages log. The fault log excerpt is as follows:
Jun 7 03:23:31 pmsdb1 multipathd: mpathap: sdy - directio checker reports path is down
Jun 7 03:23:31 pmsdb1 multipathd: mpathar: sds - directio checker reports path is down
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:7: [sdy] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:7: [sdy] Sense Key : Illegal Request [current]
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:7: [sdy] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:7: [sdy] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sdy, sector 0
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:1: [sds] Sense Key : Illegal Request [current]
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:1: [sds] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:1: [sds] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sds, sector 0
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:2: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:2: [sdt] Sense Key : Illegal Request [current]
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:2: [sdt] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:2: [sdt] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sdt, sector 0
Jun 7 03:23:31 pmsdb1 multipathd: mpathau: sdt - directio checker reports path is down
Jun 7 03:23:31 pmsdb1 multipathd: mpathav: sdaa - directio checker reports path is down
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:9: [sdaa] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:9: [sdaa] Sense Key : Illegal Request [current]
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:9: [sdaa] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:9: [sdaa] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sdaa, sector 0
Jun 7 03:23:32 pmsdb1 multipathd: mpathax: sdab - directio checker reports path is down
Jun 7 03:23:32 pmsdb1 kernel: sd 6:0:3:6: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 03:23:32 pmsdb1 kernel: sd 6:0:3:6: [sdx] Sense Key : Illegal Request [current]
Jun 7 03:23:32 pmsdb1 kernel: sd 6:0:3:6: [sdx] Add. Sense: Logical unit not supported
Jun 7 03:23:32 pmsdb1 kernel: sd 6:0:3:6: [sdx] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
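The messages log shows that the paths to disks such as sdaa, sdx, sdy, and sdt were lost. Each failed read returned hostbyte=DID_OK together with the sense data "Illegal Request / Logical unit not supported", and every affected device sits at SCSI address 6:0:3:x. In addition, running df -h to check the file systems also reported errors.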
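With this many interleaved kernel and multipathd lines, it can help to condense the log into one record per device. The following is a minimal Python sketch, not part of the original investigation. The function names `summarize` and `shared_targets` are hypothetical, and the regular expressions assume the exact message wording seen in the excerpt above, which can differ between kernel and multipath-tools versions.

```python
import re
from collections import defaultdict

# A few lines copied verbatim from the log excerpt above.
SAMPLE = """\
Jun 7 03:23:31 pmsdb1 multipathd: mpathap: sdy - directio checker reports path is down
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:7: [sdy] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sdy, sector 0
Jun 7 03:23:31 pmsdb1 kernel: sd 6:0:3:1: [sds] Add. Sense: Logical unit not supported
Jun 7 03:23:31 pmsdb1 kernel: end_request: I/O error, dev sds, sector 0
"""

# Patterns assume the message format shown in this particular log.
PATH_DOWN = re.compile(
    r"multipathd: (\S+): (\S+) - directio checker reports path is down")
SENSE = re.compile(
    r"kernel: sd (\d+):(\d+):(\d+):(\d+): \[(\w+)\] Add\. Sense: (.+)$")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def summarize(lines):
    """Build one record per sd device from syslog lines."""
    devices = defaultdict(dict)
    for line in lines:
        line = line.rstrip()
        m = PATH_DOWN.search(line)
        if m:
            # Remember which multipath map reported this path as down.
            devices[m.group(2)]["map"] = m.group(1)
            continue
        m = SENSE.search(line)
        if m:
            host, channel, target, lun, dev, text = m.groups()
            devices[dev]["hctl"] = f"{host}:{channel}:{target}:{lun}"
            devices[dev]["sense"] = text.strip()
            continue
        m = IO_ERROR.search(line)
        if m:
            dev = m.group(1)
            devices[dev]["io_errors"] = devices[dev].get("io_errors", 0) + 1
    return dict(devices)


def shared_targets(devices):
    """Return the distinct host:channel:target prefixes of the failing paths."""
    return {info["hctl"].rsplit(":", 1)[0]
            for info in devices.values() if "hctl" in info}


if __name__ == "__main__":
    result = summarize(SAMPLE.splitlines())
    for dev, info in sorted(result.items()):
        print(dev, info)
    print("targets:", shared_targets(result))
```

To run it against the real log, pass an open file instead of the sample, for example `summarize(open("/var/log/messages"))`. If all failing paths collapse to a single host:channel:target prefix, as the 6:0:3:x addresses in this excerpt do, attention is directed at that one target port or at the LUN mapping behind it. In general, hostbyte=DID_OK with the sense "Logical unit not supported" means the target answered the command but rejected the addressed LUN, which fits the earlier finding that the HBA cards themselves were healthy.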
2. Use fdisk -l to list the system's disks, and use the Linux multipath command multipath -ll to view the disks mapped from the storage array to the host:
mpathaw (36e02861100592fcbf9f6602800000036) dm-25 HUAWEI,XSG1
size=512G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
`- 6:0:3:4 sdv 65:80 failed faulty running
mpathav (36e02861100592fcbf9f6623c0000003b) dm-24 HUAWEI,XSG1
size=512G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
`<