Environment
- Red Hat Enterprise Linux 5
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 7
Issue
- Getting error: 'kernel: sd h:c:t:l: SCSI error: return code = 0x00010000':
  May 22 23:50:10 localhost kernel: Device sdb not ready.
  May 22 23:50:10 localhost kernel: end_request: I/O error, dev sdb, sector 0
  May 22 23:50:10 localhost kernel: SCSI error : <0 0 2 14> return code = 0x10000
- SAN access issue reported on a RHEL server; the following messages were observed in /var/log/messages:
  May 5 04:15:00 localhost kernel: sd 3:0:1:42: Unhandled error code
  May 5 04:15:00 localhost kernel: sd 3:0:1:42: SCSI error: return code = 0x00010000
  May 5 04:15:00 localhost kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
- In device-mapper multipath, some paths fail and go down. The logs contain many "tur checker reports path is down" events, and multipath -ll output shows paths in a failed/faulty state:
  [root@example ~]# multipath -ll
  sdb: checker msg is "tur checker reports path is down"
  mpath0 (2001738006296000b) dm-8 IBM,2810XIV
  [size=200G][features=1 queue_if_no_path][hwhandler=0][rw]
  \_ round-robin 0 [prio=3][active]
   \_ 2:0:0:1 sda 8:0  [active][ready]
   \_ 2:0:1:1 sdb 8:16 [failed][faulty]
   \_ 4:0:8:1 sdc 8:32 [active][ready]
   \_ 4:0:9:1 sdd 8:48 [active][ready]
- We had a SAN director failure and the system panicked because the lpfc driver lost its devices.
Resolution
SCSI error: return code = 0x00010000 DID_NO_CONNECT
- There is likely a hardware issue related to the connectivity problems. Contact storage hardware support for assistance in determining the cause and addressing the problem.
- Parallel SCSI
  - DID_NO_CONNECT = SCSI SELECTION failed because there was no device at the address specified
- iSCSI
  - The iSCSI layer returns this if the replacement/recovery timeout has expired or if the user asked to shut down the session.
- FC/SAN
  - Typically this will follow the loss of a storage port, for example:
    kernel: rport-0:0-1: blocked FC remote port time out: saving binding
    kernel: sd 0:0:1:22: Unhandled error code
    kernel: sd 0:0:1:22: SCSI error: return code = 0x00010000
    kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
- In the above example the remote port timed out and the controller lost connection to the devices behind that storage port.
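For the transport-specific timeouts above, the current values can be inspected with the open-iscsi tools and the FC sysfs attributes. A minimal sketch, assuming at least one iSCSI node record and/or FC remote port exists on the host (output is hardware-dependent):

```shell
# iSCSI: show the recovery/replacement timeout recorded for each node
# (the setting is node.session.timeo.replacement_timeout)
iscsiadm -m node -o show | grep replacement_timeout

# FC: show the current state of each remote port
# (prints "file:contents" for every rport's port_state attribute)
grep . /sys/class/fc_remote_ports/rport-*/port_state
```

A remote port whose state is not "Online" is a good starting point for investigating DID_NO_CONNECT errors on FC.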
Root Cause
The SCSI error return code 0x00010000 is broken down into its constituent bytes and decoded as shown below:
0x 00  01  00  00
   |   |   |   |
   |   |   |   +-- status byte : {likely} not valid, see other fields
   |   |   +------ msg byte    : {likely} not valid, see other fields
   |   +---------- host byte   : DID_NO_CONNECT - no connection to device {possibly device doesn't exist or transport failed}
   +-------------- driver byte : {likely} not valid, see other fields
The return code 0x00010000 is a DID_NO_CONNECT: the hardware transport connection to the device is no longer available. This SCSI error indicates that the I/O command is being rejected because it cannot be sent to the device until the hardware transport becomes available again. These errors are a symptom of some other root cause, and that root cause needs to be found and addressed.
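The byte layout above can be verified with simple shell arithmetic. A minimal sketch that masks out each field of the return code, with the fields ordered from least to most significant byte (status, msg, host, driver):

```shell
rc=$(( 0x00010000 ))

status=$((  rc        & 0xff ))   # status byte (least significant)
msg=$((    (rc >> 8)  & 0xff ))   # message byte
host=$((   (rc >> 16) & 0xff ))   # host byte -- 0x01 is DID_NO_CONNECT
driver=$(( (rc >> 24) & 0xff ))   # driver byte (most significant)

printf 'status=0x%02x msg=0x%02x host=0x%02x driver=0x%02x\n' \
       "$status" "$msg" "$host" "$driver"
# prints: status=0x00 msg=0x00 host=0x01 driver=0x00
```

Only the host byte is set here, which is why the status, msg, and driver fields are marked "not valid" in the breakdown above.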
Either access to the device is temporarily unavailable or the device is no longer available within the configuration. A temporary service interruption can be caused by maintenance activity in the SAN, such as a switch reboot. Such activity can cause a link down condition and/or remote port timeouts. It is possible that hardware transport connectivity will return some time later. A permanent loss of connectivity can occur if storage is reconfigured so that the LUN is no longer presented to the host.
Check for the following or similar event messages within the system logs:
Mar 28 19:52:53 hostname kernel: qla2xxx 0000:04:00.1: LOOP DOWN detected (4 4 0 0).
Mar 28 19:53:23 hostname kernel: rport-3:0-0: blocked FC remote port time out: saving binding
Mar 28 19:53:23 hostname kernel: sd 3:0:0:1: SCSI error: return code = 0x00010000
Look at what was being logged just before these DID_NO_CONNECT errors started appearing.
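One way to do that is to locate the first DID_NO_CONNECT message and then scan the preceding context for transport-level events. A sketch, assuming the default log location; the context size and search patterns are illustrative:

```shell
# find where the DID_NO_CONNECT errors begin
grep -n 'return code = 0x00010000' /var/log/messages | head -1

# then scan the preceding lines for likely transport-level triggers
grep -B 30 'return code = 0x00010000' /var/log/messages \
    | grep -iE 'link down|loop down|rport|remote port'
```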
You can also have remote ports time out (loss of connectivity) without a link down event. See the following for more information on remote port timeouts: Saw error "rport-1:0-0: blocked FC remote port time out: saving binding" in system log?
There is a 30-second delay between the link down event at 19:52:53 and the remote port time out at 19:53:23. When a remote port is lost, a delay within the driver called dev_loss_tmo, the device loss timeout, is applied before further action is taken. If the port returns to the configuration before that timeout expires, then I/O is retried immediately. If the port has not returned by that time, then all I/O is immediately returned with DID_NO_CONNECT status, and any further I/O will fail immediately after the remote port time out event has occurred. See the DM Multipath (RHEL 5) and DM Multipath (RHEL 6) configuration guides for more information on dev_loss_tmo behavior. Also, "How to set custom dev_loss_tmo value permanently?" has information on setting dev_loss_tmo outside of multipath.
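As a sketch of how to inspect or change dev_loss_tmo directly through sysfs (the rport name below is illustrative, and a change made this way does not persist across reboots; see the articles above for a persistent setting):

```shell
# current dev_loss_tmo (in seconds) for one remote port
cat /sys/class/fc_remote_ports/rport-3:0-0/dev_loss_tmo

# raise it to 60 seconds for that port (non-persistent, root required)
echo 60 > /sys/class/fc_remote_ports/rport-3:0-0/dev_loss_tmo
```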
Other symptoms can result from the host losing connectivity to storage. If no connectivity remains, issues such as file systems going read-only can result from being unable to commit necessary metadata changes to the disks hosting the filesystem.
See How do I interpret scsi status messages in RHEL like "sd 2:0:0:243: SCSI error: return code = 0x08000002"? for more information.