Resource monitor times out when a sub path of the multipath device fails

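This post covers a resource monitor timeout on Red Hat Enterprise Linux Server release 7 that is triggered when a sub path of a multipath device fails. The root cause is that SCSI error handling and IO failover take too long. The resolution is to tune dm-multipath and SCSI error handling options so that error handling completes sooner and the monitor does not time out; the specific configuration steps are given below.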


https://access.redhat.com/solutions/2599071

SOLUTION Verified - Updated January 9, 2017 at 17:12


Environment

  • Red Hat Enterprise Linux Server release 7

Issue

A resource monitor operation times out when a sub path of the multipath device fails.
The following errors are seen in "/var/log/messages":


Aug 16 13:19:21 xxxxx kernel: qla2xxx [0000:08:00.0]-801c:1: Abort command issued nexus=1:3:0 --  1 2002.
Aug 16 13:19:44 xxxxx  lrmd[4223]: warning: KEP_vg_arch_monitor_60000 process (PID 22244) timed out

Resolution

Tune the dm-multipath and SCSI error handling options to reduce the time needed to complete error handling on the affected paths, so that IO fails over quickly to the remaining available sub paths. This should help avoid monitor timeouts when only a few of the sub paths to the SAN devices are affected:

[1] Edit the "defaults" section of the "/etc/multipath.conf" file to reduce "checker_timeout" and "polling_interval":


        defaults {
                checker_timeout         10
                polling_interval        5
        }
  • Then update the "devices" section to reduce "fast_io_fail_tmo" and "dev_loss_tmo":


    devices {
        device {
                vendor                  "3PARdata"
                product                 "VV"
                failback                "immediate"
                rr_weight               "uniform"
                no_path_retry           2       ### Edited
                rr_min_io_rq            1
                fast_io_fail_tmo        5       ### Edited
                dev_loss_tmo            10      ### Edited
        }
    }
    

    In the snippet above, "no_path_retry" is reduced to 2 because in RHEL 7 the minimum value of "dev_loss_tmo" is calculated as the product of the current "polling_interval" and "no_path_retry" for the device. With "polling_interval" set to 5 and "no_path_retry" set to 2, that minimum is 10, so lowering "dev_loss_tmo" to 10 for the affected paths requires reducing "no_path_retry" to 2.

    After making these changes, reload the "multipathd" service so that they take effect (on RHEL 7, for example, with "systemctl reload multipathd").
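The "dev_loss_tmo" constraint from step [1] can be sanity-checked with a little shell arithmetic before editing the file. This is only a sketch: the function names min_dev_loss_tmo and dev_loss_tmo_ok are hypothetical, and they simply apply the product rule stated in this article ("polling_interval" times "no_path_retry"); nothing is read from multipathd itself.

```shell
#!/bin/sh
# Hypothetical helper: lowest dev_loss_tmo implied by the rule in this article,
# i.e. polling_interval * no_path_retry.
min_dev_loss_tmo() {
    polling_interval="$1"
    no_path_retry="$2"
    echo $((polling_interval * no_path_retry))
}

# Hypothetical helper: succeeds when the requested dev_loss_tmo is not below
# that minimum. Arguments: requested dev_loss_tmo, polling_interval, no_path_retry.
dev_loss_tmo_ok() {
    requested="$1"
    [ "$requested" -ge "$(min_dev_loss_tmo "$2" "$3")" ]
}

# Values from the configuration in step [1]: polling_interval 5, no_path_retry 2.
min_dev_loss_tmo 5 2
if dev_loss_tmo_ok 10 5 2; then
    echo "dev_loss_tmo 10 is consistent with polling_interval 5 and no_path_retry 2"
fi
```

Under the same rule, keeping "polling_interval" at 5 with a hypothetical larger "no_path_retry" such as 18 would imply a minimum of 90, which is why the retry count is lowered together with the timeout.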

[2] Then set the following two tunings:


        eh_timeout    5   Per device.
        eh_deadline   5   Overall cap on the error handler; the countdown only starts when the error handler kicks in.

To set the above options:


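        # Run as root. Set the per-device error handler timeout for every sd* device.
        cd /sys/block
        for d in sd*
        do
            echo 5 > "$d/device/eh_timeout"
        done

        # Cap the overall error handler time on each FC HBA host.
        cd /sys/class/scsi_host
        for host in host1 host2
        do
            echo 5 > "$host/eh_deadline"
        done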

Note: The changes to the "eh_timeout" and "eh_deadline" tuning options are not persistent across reboots. You can append the commands from step [2] to the "/etc/rc.local" file so that these options are tuned automatically at boot.
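One way to package the step [2] commands for "/etc/rc.local" is a small function. This is a sketch and not part of the original solution: the function name set_eh_tunings and its sysfs-root argument are assumptions (the root argument exists only so the logic can be tried against a scratch directory), and the host list host1 host2 is the one from this article and will differ on other systems.

```shell
#!/bin/sh
# Hypothetical helper: write eh_timeout for every sd* device and eh_deadline
# for the listed SCSI hosts.
# Usage: set_eh_tunings SECONDS SYSFS_ROOT HOST...
set_eh_tunings() {
    seconds="$1"
    sysfs_root="$2"
    shift 2
    # Per-device error handler timeout.
    for f in "$sysfs_root"/block/sd*/device/eh_timeout; do
        if [ -w "$f" ]; then echo "$seconds" > "$f"; fi
    done
    # Overall error handler deadline for each named host only.
    for host in "$@"; do
        f="$sysfs_root/class/scsi_host/$host/eh_deadline"
        if [ -w "$f" ]; then echo "$seconds" > "$f"; fi
    done
}

# On a real system (as root), for the two HBAs in this article:
# set_eh_tunings 5 /sys host1 host2
```

Files that do not exist or are not writable are skipped silently, so the function is safe to call early in boot. On RHEL 7, "/etc/rc.d/rc.local" generally has to be made executable (chmod +x) before it runs at boot, so that is worth checking when the settings do not survive a restart.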

Root Cause

  • The cluster node has SAN devices connected through 2 Qlogic FC HBAs. Command timeouts and command aborts occurred only on the sub paths connected through Qlogic FC HBA host1, so the sub paths through the other HBA, host2, were still available. Because those sub paths were still available, the monitor IO operation on the above resources was not expected to fail.
  • SCSI error handling, followed by IO failover from the affected sub paths to the unaffected paths, took long enough to exceed the monitor timeout on the resource. As a result, the monitor operation on the resources timed out and pacemaker initiated a recovery action on them.