Lease Renewal Failure Tests with libvirt and sanlock

Renewal failure test with libvirt and sanlock using the default configuration
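
For reference, the "default configuration" here is the stock libvirt sanlock plugin setup: qemu.conf selects the sanlock lock manager and qemu-sanlock.conf enables automatic per-disk leases under /var/lib/libvirt/sanlock. A minimal sketch of what each node is assumed to have (shown for Test2; Test1 uses host_id = 1, matching the own column in the lease dumps below):

$ grep '^lock_manager' /etc/libvirt/qemu.conf
lock_manager = "sanlock"

$ grep -Ev '^#|^$' /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 2
require_lease_for_disks = 1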

Create a VM that uses a local disk (on the Test1 or Test2 node)

  • Initialize the VM disk image:
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd3
  • Create the test3_sanlock VM:
$ vi test3_sanlock.xml
<domain type='kvm'>
    <name>test3_sanlock</name>
    <memory>262144</memory>
    <vcpu>1</vcpu>

    <os>
        <type arch='x86_64' machine='pc'>hvm</type>
        <boot dev='hd'/>
    </os>

    <devices>
        <disk type='file' device='disk'>
            <driver name='qemu' type='qcow2'/>
            <source file='/dev/sdd3'/>
            <target dev='hda' bus='ide'/>
        </disk>
        
        <input type='tablet' bus='usb'/>
        <input type='mouse' bus='ps2'/>

        <graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
    </devices>
</domain>
  • Define the VM:
$ virt-xml-validate test3_sanlock.xml
test3_sanlock.xml validates

$ virsh define test3_sanlock.xml
Domain test3_sanlock defined from test3_sanlock.xml

$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test3_sanlock                  shut off
 -     test_sanlock                   shut off
  • Start the VM:
$ virsh start test3_sanlock
Domain test3_sanlock started
  • Status of all VMs:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     test3_sanlock                  running
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test_sanlock                   shut off

  • VM process status:
$ ps -aux | grep /dev/sdd3
qemu      20345 22.8  5.0 675848 94884 ?        Sl   10:13   0:10 /usr/libexec/qemu-kvm -name test3_sanlock -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 9cc6d09e-46e0-4fee-9113-938221810bad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test3_sanlock/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
root      20430  0.0  0.0 112664   972 pts/0    S+   10:14   0:00 grep --color=auto /dev/sdd3
  • Lease file listing:
$ ls /var/lib/libvirt/sanlock -l
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:55 1199371e4095b4aeb587631d5e61ea06
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:54 3f8518ff5358a6757f0a5918e3ec7be2
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:55 51d27a61a6a3dd58637b6e00bb719cae
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:54 6cfae8f6c4541a92e4aa52f14e9977a5
-rw------- 1 sanlock sanlock 1048576 Apr 20 20:24 7edb5b6820e56426339607637d18e871
-rw------- 1 sanlock sanlock 1048576 Apr 21 09:56 d726322246f67ea910a344b4426b4e0d
-rw-rw---- 1 sanlock sanlock 1048576 Apr 21 09:56 __LIBVIRT__DISKS__

  • Lease status:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__       581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1 0000003949 0001 0002
00000512                   __LIBVIRT__DISKS__       10de2b81-24c8-4aa6-8b29-867833982ff5.Test2 0000003164 0002 0002

$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 d726322246f67ea910a344b4426b4e0d 0000004112 0002 0003 1

  • sanlock log:
$ cat /var/log/sanlock.log
...
2017-04-21 10:10:09+0800 3896 [19810]: sanlock daemon started 3.4.0 host 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2
2017-04-21 10:10:09+0800 3896 [19810]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2017-04-21 10:10:25+0800 3911 [19817]: r1 cmd_acquire 2,10,19859 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
2017-04-21 10:10:44+0800 3930 [19817]: r2 cmd_acquire 2,10,19914 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
2017-04-21 10:11:04+0800 3950 [19816]: s1 lockspace __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-21 10:13:46+0800 4112 [19816]: s1:r3 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,20345
2017-04-21 10:13:46+0800 4112 [19816]: s1:r4 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,20345
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 1 2 4872 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 2 3 4090 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2

State after the network has been disconnected for 5 seconds
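
The log does not record how the network was cut. A typical way, assuming the iSCSI and NFS storage traffic runs over a dedicated interface (named eth1 here purely for illustration), would be:

$ ip link set eth1 down      # cut storage traffic on this node
$ ip link set eth1 up        # bring it back later when reconnecting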

  • The VM is still running normally:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     test3_sanlock                  running
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test_sanlock                   shut off
  • The VM process is still alive:
$ ps -aux | grep /dev/sdd3
qemu      20345  6.1  5.0 675848 94884 ?        Sl   10:13   0:15 /usr/libexec/qemu-kvm -name test3_sanlock -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 9cc6d09e-46e0-4fee-9113-938221810bad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test3_sanlock/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
root      20837  0.0  0.0 112664   972 pts/0    S+   10:17   0:00 grep --color=auto /dev/sdd3
  • The sanlock log already shows errors:
$ cat /var/log/sanlock.log
...
2017-04-21 10:17:40+0800 4347 [20022]: __LIBVIR aio timeout RD 0x7fda5c0008c0:0x7fda5c0008d0:0x7fda6e094000 ioto 10 to_count 1
2017-04-21 10:17:40+0800 4347 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:17:40+0800 4347 [20022]: s1 renewal error -202 delta_length 10 last_success 4316
2017-04-21 10:18:01+0800 4367 [20022]: __LIBVIR aio timeout RD 0x7fda5c000910:0x7fda5c000920:0x7fda6df92000 ioto 10 to_count 2
2017-04-21 10:18:01+0800 4367 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:01+0800 4367 [20022]: s1 renewal error -202 delta_length 20 last_success 4316

State after the network has been disconnected for 30 seconds

  • The VM has been shut off:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test3_sanlock                  shut off
 -     test_sanlock                   shut off

  • The VM process has been terminated:
$ ps -aux | grep /dev/sdd3
root      21018  0.0  0.0 112664   972 pts/0    S+   10:19   0:00 grep --color=auto /dev/sdd3

  • The sanlock log shows that sanlock killed VM process 20345 with SIGTERM (signal 15); see the timing note after the log:
$ cat /var/log/sanlock.log
...
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 1 2 4872 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 2 3 4090 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2
2017-04-21 10:17:40+0800 4347 [20022]: __LIBVIR aio timeout RD 0x7fda5c0008c0:0x7fda5c0008d0:0x7fda6e094000 ioto 10 to_count 1
2017-04-21 10:17:40+0800 4347 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:17:40+0800 4347 [20022]: s1 renewal error -202 delta_length 10 last_success 4316
2017-04-21 10:18:01+0800 4367 [20022]: __LIBVIR aio timeout RD 0x7fda5c000910:0x7fda5c000920:0x7fda6df92000 ioto 10 to_count 2
2017-04-21 10:18:01+0800 4367 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:01+0800 4367 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:18:10+0800 4376 [19810]: s1 check_our_lease warning 60 last_success 4316
2017-04-21 10:18:11+0800 4377 [19810]: s1 check_our_lease warning 61 last_success 4316
2017-04-21 10:18:12+0800 4378 [19810]: s1 check_our_lease warning 62 last_success 4316
2017-04-21 10:18:13+0800 4379 [19810]: s1 check_our_lease warning 63 last_success 4316
2017-04-21 10:18:14+0800 4380 [19810]: s1 check_our_lease warning 64 last_success 4316
2017-04-21 10:18:15+0800 4381 [19810]: s1 check_our_lease warning 65 last_success 4316
2017-04-21 10:18:16+0800 4382 [19810]: s1 check_our_lease warning 66 last_success 4316
2017-04-21 10:18:17+0800 4383 [19810]: s1 check_our_lease warning 67 last_success 4316
2017-04-21 10:18:18+0800 4384 [19810]: s1 check_our_lease warning 68 last_success 4316
2017-04-21 10:18:19+0800 4385 [19810]: s1 check_our_lease warning 69 last_success 4316
2017-04-21 10:18:20+0800 4386 [19810]: s1 check_our_lease warning 70 last_success 4316
2017-04-21 10:18:21+0800 4387 [19810]: s1 check_our_lease warning 71 last_success 4316
2017-04-21 10:18:21+0800 4388 [20022]: __LIBVIR aio timeout RD 0x7fda5c000960:0x7fda5c000970:0x7fda6de90000 ioto 10 to_count 3
2017-04-21 10:18:21+0800 4388 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:21+0800 4388 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:18:22+0800 4388 [19810]: s1 check_our_lease warning 72 last_success 4316
2017-04-21 10:18:23+0800 4389 [19810]: s1 check_our_lease warning 73 last_success 4316
2017-04-21 10:18:24+0800 4390 [19810]: s1 check_our_lease warning 74 last_success 4316
2017-04-21 10:18:25+0800 4391 [19810]: s1 check_our_lease warning 75 last_success 4316
2017-04-21 10:18:26+0800 4392 [19810]: s1 check_our_lease warning 76 last_success 4316
2017-04-21 10:18:27+0800 4393 [19810]: s1 check_our_lease warning 77 last_success 4316
2017-04-21 10:18:28+0800 4394 [19810]: s1 check_our_lease warning 78 last_success 4316
2017-04-21 10:18:29+0800 4395 [19810]: s1 check_our_lease warning 79 last_success 4316
2017-04-21 10:18:30+0800 4396 [19810]: s1 check_our_lease failed 80
2017-04-21 10:18:30+0800 4396 [19810]: s1 kill 20345 sig 15 count 1
2017-04-21 10:18:30+0800 4396 [19810]: dead 20345 ci 2 count 1
2017-04-21 10:18:30+0800 4397 [19810]: s1 all pids clear
2017-04-21 10:18:42+0800 4408 [20022]: __LIBVIR aio timeout RD 0x7fda5c0009b0:0x7fda5c0009c0:0x7fda6dd8e000 ioto 10 to_count 4
2017-04-21 10:18:42+0800 4408 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:42+0800 4408 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 0 0x7fda5c0008c0 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 1 0x7fda5c000910 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 2 0x7fda5c000960 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 3 0x7fda5c0009b0 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 0 0x7fda5c0008c0 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 1 0x7fda5c000910 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 2 0x7fda5c000960 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 3 0x7fda5c0009b0 busy
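
A note on the timing: with sanlock's default 10-second I/O timeout, a host's lockspace lease is treated as lost 80 seconds (8 x io_timeout) after the last successful renewal, and only then are the lease-holding processes killed. That matches the countdown above: last_success 4316 + 80 = 4396, the tick at which "check_our_lease failed 80" is logged and PID 20345 gets SIGTERM ("kill 20345 sig 15").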

  • The kernel messages show that iSCSI, multipath and NFS have all started failing:
$ dmesg -c
[ 4116.360505] qemu-kvm: sending ioctl 5326 to a partition!
[ 4116.360528] qemu-kvm: sending ioctl 80200204 to a partition!
[ 4116.961920] kvm: zapping shadow pages for mmio generation wraparound
[ 4123.393299] kvm [20345]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
[ 4335.502566]  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298988592, last ping 4298993600, now 4298998608
[ 4335.502712]  connection1:0: detected conn error (1022)
[ 4335.502718]  connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298988592, last ping 4298993600, now 4298998608
[ 4335.502723]  connection5:0: detected conn error (1022)
[ 4339.501709]  connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298992593, last ping 4298997600, now 4299002608
[ 4339.501777]  connection3:0: detected conn error (1022)
[ 4339.501785]  connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298992593, last ping 4298997600, now 4299002608
[ 4339.501790]  connection4:0: detected conn error (1022)
[ 4340.509931]  connection6:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298993601, last ping 4298998608, now 4299003616
[ 4340.509940]  session1: session recovery timed out after 5 secs
[ 4340.509952]  connection6:0: detected conn error (1022)
[ 4340.509956]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298993601, last ping 4298998608, now 4299003616
[ 4340.509959]  connection2:0: detected conn error (1022)
[ 4340.509964]  session5: session recovery timed out after 5 secs
[ 4341.499795] device-mapper: multipath: Failing path 8:80.
[ 4341.499852] device-mapper: multipath: Failing path 8:128.
[ 4344.508453]  session3: session recovery timed out after 5 secs
[ 4344.508470] sd 5:0:0:1: rejecting I/O to offline device
[ 4344.508474] sd 5:0:0:1: [sdg] killing request
[ 4344.508483]  session4: session recovery timed out after 5 secs
[ 4344.508488] sd 5:0:0:1: [sdg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4344.508489] sd 5:0:0:1: [sdg] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4344.508491] blk_update_request: I/O error, dev sdg, sector 0
[ 4344.508507] sd 6:0:0:1: rejecting I/O to offline device
[ 4344.508509] sd 6:0:0:1: [sdj] killing request
[ 4344.508517] sd 6:0:0:1: [sdj] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4344.508518] sd 6:0:0:1: [sdj] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4344.508519] blk_update_request: I/O error, dev sdj, sector 0
[ 4345.501500] device-mapper: multipath: Failing path 8:96.
[ 4345.501545] device-mapper: multipath: Failing path 8:144.
[ 4345.504705] blk_update_request: I/O error, dev dm-1, sector 104857472
[ 4345.504726] blk_update_request: I/O error, dev dm-1, sector 104857472
[ 4345.504728] Buffer I/O error on device dm-1, logical block 13107184
[ 4345.516361]  session6: session recovery timed out after 5 secs
[ 4345.516382] sd 8:0:0:1: rejecting I/O to offline device
[ 4345.516386] sd 8:0:0:1: [sdh] killing request
[ 4345.516393]  session2: session recovery timed out after 5 secs
[ 4345.516398] sd 8:0:0:1: [sdh] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4345.516400] sd 8:0:0:1: [sdh] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4345.516402] blk_update_request: I/O error, dev sdh, sector 0
[ 4345.516419] sd 4:0:0:1: rejecting I/O to offline device
[ 4345.516421] sd 4:0:0:1: [sde] killing request
[ 4345.516427] sd 4:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4345.516429] sd 4:0:0:1: [sde] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4345.516430] blk_update_request: I/O error, dev sde, sector 0
[ 4345.522653] sd 4:0:0:1: rejecting I/O to offline device
[ 4345.522680] device-mapper: multipath: Failing path 8:64.
[ 4345.522690] blk_update_request: I/O error, dev dm-0, sector 209715072
[ 4345.522974] blk_update_request: I/O error, dev dm-0, sector 209715072
[ 4345.522978] Buffer I/O error on device dm-0, logical block 26214384
[ 4346.501748] device-mapper: multipath: Failing path 8:112.
[ 4391.783505] watchdog watchdog0: watchdog did not stop!
[ 4521.500331] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500336] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500337] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500338] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500339] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500340] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500341] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500342] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847179] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847184] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847185] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847186] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847187] nfs: server 192.168.195.131 not responding, still trying
  • The lease status can no longer be read:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__

Reconnect the network

  • Restart the services:
$ systemctl stop libvirtd
$ systemctl stop sanlock

$ mount -o remount /var/lib/libvirt/sanlock

$ systemctl start sanlock
$ systemctl start libvirtd
  • Check the logs:
$ sanlock client log_dump
2017-04-21 10:57:39+0800 6743 [24933]: sanlock daemon started 3.4.0 host b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
  • Check the lease status:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__       581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1 0000007621 0001 0002
00000512                   __LIBVIRT__DISKS__       25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2 0000004316 0002 0003

$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 d726322246f67ea910a344b4426b4e0d 0000004112 0002 0003 1
  • Start the VM again:
$ virsh start test3_sanlock
Domain test3_sanlock started
  • Check the logs:
$ sanlock client log_dump
2017-04-21 10:57:39+0800 6743 [24933]: sanlock daemon started 3.4.0 host b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 10:57:39+0800 6743 [24933]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2017-04-21 15:13:53+0800 22095 [24938]: cmd_add_lockspace 2,10 __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0 flags 0 timeout 0
2017-04-21 15:13:53+0800 22095 [24938]: s1 lockspace __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-21 15:13:53+0800 22095 [43517]: s1 delta_acquire begin __LIBVIRT__DISKS__:2
2017-04-21 15:13:53+0800 22095 [43517]: s1 delta_acquire delta_large_delay 80 delay 140
2017-04-21 15:16:13+0800 22235 [43517]: s1 delta_acquire write 2 4 22235 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 15:16:13+0800 22235 [43517]: s1 delta_acquire delta_short_delay 20
2017-04-21 15:16:33+0800 22255 [43517]: s1 delta_acquire done 2 4 22235
2017-04-21 15:16:34+0800 22256 [24938]: s1 add_lockspace done
2017-04-21 15:16:34+0800 22256 [24938]: cmd_add_lockspace 2,10 done 0
2017-04-21 15:16:35+0800 22257 [24933]: s1 host 1 2 23019 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 15:16:35+0800 22257 [24933]: s1 host 2 4 22235 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 15:17:48+0800 22330 [24933]: cmd_register ci 2 fd 10 pid 43842
2017-04-21 15:17:48+0800 22330 [24933]: cmd_restrict ci 2 fd 10 pid 43842 flags 1
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 ci_in 3 fd 13
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 ci_in 3 fd 13 count 1 flags 0
2017-04-21 15:17:49+0800 22330 [24938]: s1:r1 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,43842
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire begin 2 0 0
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire leader 1 owner 2 3 4112 max mbal[1] 2 our_dblock 2 2 2 3 4112 1
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire owner 2 3 4112 was old local new is 4
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 phase1 mbal 2002
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 phase2 bal 2002 inp 2 4 22330 q_max -1
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 commit self owner 2 4 22330
2017-04-21 15:17:49+0800 22330 [24938]: r1 acquire_disk rv 1 lver 2 at 22330
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 result 0 pid_dead 0
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 ci_in 3 fd 13
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 ci_in 3 fd 13 count 1 flags 0
2017-04-21 15:17:49+0800 22330 [24938]: s1:r2 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,43842
2017-04-21 15:17:49+0800 22330 [24938]: r2 cmd_acquire 2,10,43842 acquire_token -17 
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 result -17 pid_dead 0

Everything is back to normal.

Renewal failure test using libvirt's on_lockfailure element
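
Per the libvirt domain XML documentation, <on_lockfailure> selects the action libvirt performs when a lock manager (sanlock here) reports that a lease was lost; the accepted values are poweroff, restart, pause and ignore. The two test domains below use poweroff and pause respectively.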

  • Prepare the VM disk images:
$ wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd3
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd4
  • Configure and define a VM that powers off on lock failure:
$ vi test_lockfailure1.xml
<domain type='kvm'>
    <name>test_lockfailure1</name>
    <memory>262144</memory>
    <vcpu>1</vcpu>
	
    <os>
        <type arch='x86_64' machine='pc'>hvm</type>
        <boot dev='hd'/>
    </os>
	
	<on_lockfailure>poweroff</on_lockfailure>
	
    <devices>
        <disk type='file' device='disk'>
            <driver name='qemu' type='qcow2'/>
            <source file='/dev/sdd3'/>
            <target dev='hda' bus='ide'/>
        </disk>
		
        <graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
    </devices>
</domain>

$ virsh define test_lockfailure1.xml 
Domain test_lockfailure1 defined from test_lockfailure1.xml
  • Configure and define a VM that pauses on lock failure:
$ vi test_lockfailure2.xml
<domain type='kvm'>
    <name>test_lockfailure2</name>
    <memory>262144</memory>
    <vcpu>1</vcpu>
	
    <os>
        <type arch='x86_64' machine='pc'>hvm</type>
        <boot dev='hd'/>
    </os>
	
	<on_lockfailure>pause</on_lockfailure>
	
    <devices>
        <disk type='file' device='disk'>
            <driver name='qemu' type='qcow2'/>
            <source file='/dev/sdd4'/>
            <target dev='hda' bus='ide'/>
        </disk>
		
        <graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
    </devices>
</domain>
$ virsh define test_lockfailure2.xml 
Domain test_lockfailure2 defined from test_lockfailure2.xml

$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test_lockfailure1              shut off
 -     test_lockfailure2              shut off
  • Start the VMs:
$ virsh start test_lockfailure1
Domain test_lockfailure1 started

$ virsh start test_lockfailure2
Domain test_lockfailure2 started

$ virsh vncdisplay test_lockfailure1
:0

$ vncviewer :0

$ virsh vncdisplay test_lockfailure2
:1

$ vncviewer :1
  • Check the current status:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     test_lockfailure1              running
 2     test_lockfailure2              running

$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-04-25 11:13:26 CST; 5min ago
  Process: 1072 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 1109 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─1109 /usr/sbin/sanlock daemon
           └─1110 /usr/sbin/sanlock daemon

Apr 25 11:13:18 Test1 systemd[1]: Starting Shared Storage Lease Manager...
Apr 25 11:13:26 Test1 systemd[1]: Started Shared Storage Lease Manager.

$ sanlock client log_dump
2017-04-25 11:13:19+0800 26 [1109]: sanlock daemon started 3.4.0 host dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:13:44+0800 50 [1112]: cmd_add_lockspace 2,9 __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0 flags 0 timeout 0
2017-04-25 11:13:44+0800 50 [1112]: s1 lockspace __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-25 11:13:44+0800 50 [3078]: s1 delta_acquire begin __LIBVIRT__DISKS__:1
2017-04-25 11:13:44+0800 50 [3078]: s1 delta_acquire delta_large_delay 80 delay 140
2017-04-25 11:16:04+0800 190 [3078]: s1 delta_acquire write 1 3 190 dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:16:04+0800 190 [3078]: s1 delta_acquire delta_short_delay 20
2017-04-25 11:16:24+0800 211 [3078]: s1 delta_acquire done 1 3 190
2017-04-25 11:16:25+0800 211 [1112]: s1 add_lockspace done
2017-04-25 11:16:25+0800 211 [1112]: cmd_add_lockspace 2,9 done 0
2017-04-25 11:16:26+0800 212 [1109]: s1 host 1 3 190 dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:16:26+0800 212 [1109]: s1 host 2 4 40508 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-25 11:17:10+0800 256 [1109]: cmd_register ci 2 fd 9 pid 13022
2017-04-25 11:17:10+0800 256 [1113]: cmd_killpath 2,9,13022 flags 0
2017-04-25 11:17:10+0800 256 [1109]: cmd_restrict ci 2 fd 9 pid 13022 flags 1
2017-04-25 11:17:10+0800 256 [1112]: cmd_inquire 2,9,13022 ci_in 3 fd 12
2017-04-25 11:17:10+0800 256 [1112]: cmd_inquire 2,9,13022 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-25 11:17:10+0800 256 [1113]: cmd_acquire 2,9,13022 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:10+0800 256 [1113]: s1:r1 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,9,13022
2017-04-25 11:17:10+0800 256 [1113]: r1 paxos_acquire begin 2 0 0
2017-04-25 11:17:10+0800 257 [1113]: r1 paxos_acquire leader 2 owner 2 4 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-25 11:17:10+0800 257 [1113]: r1 paxos_acquire leader 2 free
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 phase1 mbal 1
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 phase2 bal 1 inp 1 3 257 q_max -1
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 commit self owner 1 3 257
2017-04-25 11:17:10+0800 257 [1113]: r1 acquire_disk rv 1 lver 3 at 257
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 result 0 pid_dead 0
2017-04-25 11:17:10+0800 257 [1112]: cmd_inquire 2,9,13022 ci_in 3 fd 12
2017-04-25 11:17:10+0800 257 [1112]: cmd_inquire 2,9,13022 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 ci_in 3 fd 12 count 1 flags 0
2017-04-25 11:17:10+0800 257 [1113]: s1:r2 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,9,13022
2017-04-25 11:17:10+0800 257 [1113]: r2 cmd_acquire 2,9,13022 acquire_token -17 
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 result -17 pid_dead 0
2017-04-25 11:17:12+0800 258 [1112]: cmd_align 3,12
2017-04-25 11:17:12+0800 258 [1112]: cmd_align 3,12 done 1048576
2017-04-25 11:17:12+0800 258 [1113]: cmd_write_resource 3,12 __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0
2017-04-25 11:17:12+0800 258 [1109]: cmd_register ci 3 fd 12 pid 13044
2017-04-25 11:17:12+0800 258 [1112]: cmd_killpath 3,12,13044 flags 0
2017-04-25 11:17:12+0800 258 [1109]: cmd_restrict ci 3 fd 12 pid 13044 flags 1
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r3 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire begin 2 0 0
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire leader 0 free
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 phase1 mbal 1
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 phase2 bal 1 inp 1 3 259 q_max -1
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 commit self owner 1 3 259
2017-04-25 11:17:12+0800 259 [1113]: r3 acquire_disk rv 1 lver 1 at 259
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result 0 pid_dead 0
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r4 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r4 cmd_acquire 3,12,13044 acquire_token -17 
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result -17 pid_dead 0

$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__       dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1 0000000437 0001 0003
00000512                   __LIBVIRT__DISKS__       b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2 0000040508 0002 0004

$ sanlock direct dump /var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 0ff52295a7bb219221ea57f31f343ba8 0000000259 0001 0003 1

$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 d726322246f67ea910a344b4426b4e0d 0000000257 0001 0003 3

$ ps -aux | grep qemu-kvm
qemu      13022 31.6  5.1 678052 96468 ?        Sl   11:17   1:18 /usr/libexec/qemu-kvm -name test_lockfailure1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 37266b5e-d0fc-4287-bf2d-fcaa75aca64c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test_lockfailure1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
qemu      13044 32.2  5.1 678056 96896 ?        Sl   11:17   1:19 /usr/libexec/qemu-kvm -name test_lockfailure2 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2562a258-d4be-486f-959a-1fb993958811 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-test_lockfailure2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd4,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
  • Disconnect the network (for about 3 minutes) and wait a while:
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-04-25 11:13:26 CST; 29min ago
  Process: 1072 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 1109 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─ 1109 /usr/sbin/sanlock daemon
           ├─ 1110 /usr/sbin/sanlock daemon
           └─14234 /usr/libexec/libvirt_sanlock_helper qemu:///system 2562a258-d4be-486f-959a-1fb993958811 pause

Apr 25 11:24:37 Test1 sanlock[1109]: 2017-04-25 11:24:37+0800 703 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__L...__DISKS__
Apr 25 11:24:37 Test1 sanlock[1109]: 2017-04-25 11:24:37+0800 703 [3078]: s1 renewal error -2 delta_length 10 last_success 580
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
Hint: Some lines were ellipsized, use -l to show in full.

$ systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-04-25 11:13:42 CST; 29min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 1878 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           ├─ 1878 /usr/sbin/libvirtd
           ├─12842 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
           └─12843 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper

Apr 25 11:16:26 Test1 dnsmasq-dhcp[12842]: DHCP, IP range 192.168.122.2 -- 192.168.122.254, lease time 1h
Apr 25 11:16:26 Test1 dnsmasq[12842]: reading /etc/resolv.conf
Apr 25 11:16:26 Test1 dnsmasq[12842]: using nameserver 114.114.114.114#53
Apr 25 11:16:26 Test1 dnsmasq[12842]: read /etc/hosts - 6 addresses
Apr 25 11:16:26 Test1 dnsmasq[12842]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Apr 25 11:16:26 Test1 dnsmasq-dhcp[12842]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Apr 25 11:17:10 Test1 libvirtd[1878]: libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-0...ntos.org)
Apr 25 11:17:10 Test1 libvirtd[1878]: hostname: test1
Apr 25 11:17:10 Test1 libvirtd[1878]: Failed to acquire lock: File exists
Apr 25 11:17:12 Test1 libvirtd[1878]: Failed to acquire lock: File exists
Hint: Some lines were ellipsized, use -l to show in full.

$ ps -aux | grep qemu-kvm

$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__       dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1 0000000437 0001 0003
00000512                   __LIBVIRT__DISKS__       b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2 0000040508 0002 0004

$ sanlock direct dump /var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 0ff52295a7bb219221ea57f31f343ba8 0000000459 0001 0001 1

$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                   __LIBVIRT__DISKS__                 d726322246f67ea910a344b4426b4e0d 0000000456 0001 0001 1

$ sanlock client log_dump
...
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result 0 pid_dead 0
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r4 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r4 cmd_acquire 3,12,13044 acquire_token -17 
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result -17 pid_dead 0
2017-04-25 11:23:05+0800 611 [3078]: __LIBVIR aio timeout RD 0x7f49740008c0:0x7f49740008d0:0x7f4989c30000 ioto 10 to_count 1
2017-04-25 11:23:05+0800 611 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:05+0800 611 [3078]: s1 renewal error -202 delta_length 10 last_success 580
2017-04-25 11:23:05+0800 611 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:15+0800 621 [3078]: s1 delta_renew reap -202
2017-04-25 11:23:25+0800 631 [3078]: __LIBVIR aio timeout RD 0x7f4974000910:0x7f4974000920:0x7f4989a2d000 ioto 10 to_count 2
2017-04-25 11:23:25+0800 631 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:25+0800 631 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:23:26+0800 632 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:33+0800 640 [1109]: s1 check_our_lease warning 60 last_success 580
2017-04-25 11:23:34+0800 641 [1109]: s1 check_our_lease warning 61 last_success 580
2017-04-25 11:23:35+0800 642 [1109]: s1 check_our_lease warning 62 last_success 580
2017-04-25 11:23:36+0800 642 [3078]: s1 delta_renew reap -202
2017-04-25 11:23:36+0800 643 [1109]: s1 check_our_lease warning 63 last_success 580
2017-04-25 11:23:37+0800 644 [1109]: s1 check_our_lease warning 64 last_success 580
2017-04-25 11:23:38+0800 645 [1109]: s1 check_our_lease warning 65 last_success 580
2017-04-25 11:23:39+0800 646 [1109]: s1 check_our_lease warning 66 last_success 580
2017-04-25 11:23:40+0800 647 [1109]: s1 check_our_lease warning 67 last_success 580
2017-04-25 11:23:41+0800 648 [1109]: s1 check_our_lease warning 68 last_success 580
2017-04-25 11:23:42+0800 649 [1109]: s1 check_our_lease warning 69 last_success 580
2017-04-25 11:23:43+0800 650 [1109]: s1 check_our_lease warning 70 last_success 580
2017-04-25 11:23:44+0800 651 [1109]: s1 check_our_lease warning 71 last_success 580
2017-04-25 11:23:45+0800 652 [1109]: s1 check_our_lease warning 72 last_success 580
2017-04-25 11:23:46+0800 652 [3078]: __LIBVIR aio timeout RD 0x7f4974000960:0x7f4974000970:0x7f498992b000 ioto 10 to_count 3
2017-04-25 11:23:46+0800 652 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:46+0800 652 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:23:46+0800 652 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:46+0800 653 [1109]: s1 check_our_lease warning 73 last_success 580
2017-04-25 11:23:47+0800 654 [1109]: s1 check_our_lease warning 74 last_success 580
2017-04-25 11:23:48+0800 655 [1109]: s1 check_our_lease warning 75 last_success 580
2017-04-25 11:23:49+0800 656 [1109]: s1 check_our_lease warning 76 last_success 580
2017-04-25 11:23:50+0800 657 [1109]: s1 check_our_lease warning 77 last_success 580
2017-04-25 11:23:51+0800 658 [1109]: s1 check_our_lease warning 78 last_success 580
2017-04-25 11:23:52+0800 659 [1109]: s1 check_our_lease warning 79 last_success 580
2017-04-25 11:23:53+0800 660 [1109]: s1 check_our_lease failed 80
2017-04-25 11:23:53+0800 660 [1109]: s1 set killing_pids check -1 remove 0
2017-04-25 11:23:53+0800 660 [1109]: s1:r1 client_using_space pid 13022
2017-04-25 11:23:53+0800 660 [1109]: s1 kill 13022 sig 100 count 1
2017-04-25 11:23:53+0800 660 [1109]: s1:r3 client_using_space pid 13044
2017-04-25 11:23:53+0800 660 [1109]: s1 kill 13044 sig 100 count 1
2017-04-25 11:23:53+0800 660 [1109]: client_pid_dead 2,9,13022 cmd_active 0 suspend 0
2017-04-25 11:23:53+0800 660 [1109]: dead 13022 ci 2 count 1
2017-04-25 11:23:54+0800 660 [1112]: cmd_inquire 3,12,13044 ci_in 2 fd 9
2017-04-25 11:23:54+0800 660 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:23:56+0800 662 [3078]: s1 delta_renew reap -202
2017-04-25 11:24:06+0800 672 [3078]: __LIBVIR aio timeout RD 0x7f49740009b0:0x7f49740009c0:0x7f4989829000 ioto 10 to_count 4
2017-04-25 11:24:06+0800 672 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:06+0800 672 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:24:07+0800 673 [3078]: s1 delta_renew begin reap
2017-04-25 11:24:17+0800 683 [3078]: s1 delta_renew reap -202
2017-04-25 11:24:27+0800 693 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:27+0800 693 [3078]: s1 renewal error -2 delta_length 20 last_success 580
2017-04-25 11:24:33+0800 700 [1109]: s1 kill 13044 sig 9 count 41
2017-04-25 11:24:33+0800 700 [1109]: client_pid_dead 3,12,13044 cmd_active 0 suspend 0
2017-04-25 11:24:33+0800 700 [1109]: dead 13044 ci 3 count 41
2017-04-25 11:24:33+0800 700 [1109]: s1 all pids clear
2017-04-25 11:24:33+0800 700 [1109]: s1 set thread_stop
2017-04-25 11:24:33+0800 700 [1109]: s1 wdmd_test_live 0 0 to disable
2017-04-25 11:24:37+0800 703 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:37+0800 703 [3078]: s1 renewal error -2 delta_length 10 last_success 580
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio destroy 4 incomplete ops


$ cat /var/log/libvirt/qemu/test_lockfailure1.log 
2017-04-25 04:36:33.021+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-03-02:09:45, c1bm.rdu2.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-126.el7_3.6), hostname: test1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name test_lockfailure1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 37266b5e-d0fc-4287-bf2d-fcaa75aca64c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-4-test_lockfailure1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
qemu: terminating on signal 15 from pid 1878
2017-04-25 04:38:17.745+0000: shutting down

$ cat /var/log/libvirt/qemu/test_lockfailure2.log 
2017-04-25 04:36:36.153+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-03-02:09:45, c1bm.rdu2.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-126.el7_3.6), hostname: test1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name test_lockfailure2 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2562a258-d4be-486f-959a-1fb993958811 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-test_lockfailure2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd4,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on

$ sanlock client status -D
daemon ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
    our_host_name=ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
    use_watchdog=1
    high_priority=1
    mlock_level=1
    quiet_fail=1
    debug_renew=0
    renewal_history_size=180
    gid=179
    uid=179
    sh_retries=8
    use_aio=1
    kill_grace_seconds=40
    helper_pid=1108
    helper_kill_fd=4
    helper_full_count=0
    helper_last_status=15942
    monotime=15956
    version_str=3.4.0
    version_num=3.4.0
    version_hex=03040000
    smproto_hex=00000001
p -1 helper
    ci=0
    fd=5
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=0
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0
p -1 listener
    ci=1
    fd=7
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=0
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0
p -1 status
    ci=2
    fd=9
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=5
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0

$ ps -aux | grep qemu
root      13827  0.0  0.3 252016  5828 ?        S    12:38   0:00 /usr/libexec/libvirt_sanlock_helper qemu:///system 2562a258-d4be-486f-959a-1fb993958811 pause

# Wait a bit longer, then run the command again:
$ ps -aux | grep qemu

sanlock renewal failure test using a custom killpath

Create the lease files

  • Create the lockspace lease:
$ mkdir -pv /mnt/nfs/sanlock
$ dd if=/dev/zero bs=1048576 count=1 of=/mnt/nfs/sanlock/idLease
$ sanlock direct init -s LS:0:/mnt/nfs/sanlock/idLease:0
init done 0
$ chown sanlock:sanlock /mnt/nfs/sanlock/idLease
  • Create the resource lease:
$ dd if=/dev/zero bs=1048576 count=1 of=/mnt/nfs/sanlock/leaderLease
$ sanlock direct init -r LS:leader:/mnt/nfs/sanlock/leaderLease:0
init done 0
$ chown sanlock:sanlock /mnt/nfs/sanlock/leaderLease
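
Before running the test script, the freshly initialised leases can be sanity-checked with the same dump command used in the earlier tests; at this point they should contain only what init wrote (the LS lockspace and the leader resource, with zeroed timestamps and no owner):

$ sanlock direct dump /mnt/nfs/sanlock/idLease
$ sanlock direct dump /mnt/nfs/sanlock/leaderLease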

Write the test scripts

$ vi killpath.sh
#!/bin/bash

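# Run by sanlock when this client's lease cannot be renewed; "$1" is the
# argument registered together with this path via sanlock.killpath() in the
# Python test script below.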
/usr/bin/echo "[`date`] $1" >> /var/log/killpath.log

$ chmod a+x killpath.sh
$ vi killpath_test.py
#!/usr/bin/python
import sys
import time
from multiprocessing import Process

import sanlock


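# serviceMain: register with the local sanlock daemon, set a killpath (the
# program and argument sanlock should run against this process if lease
# renewal fails), join lockspace "LS" with the given host id, acquire the
# "leader" resource lease, then loop forever to stand in for a real service.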
def serviceMain(hostId, lockspacePath, leasePath):
	sfd = sanlock.register()
	while True:
		try:
			sanlock.killpath('/opt/sanlock/killpath.sh', ['killpath test'], sfd)
		except sanlock.SanlockException:
			print "Failed to set killpath, try again in 3s."
			time.sleep(3)
			continue
		
		print "Set killpath ok.\n"
		
		try:
			if not sanlock.inq_lockspace("LS", hostId, lockspacePath):
				print "Try to acquire host id LS:%s:%s:0" % (hostId, lockspacePath)

				print time.strftime("%Y %m %Z %H:%M:%S"), "Enter Lockspace Begin"
				sanlock.add_lockspace("LS", hostId, lockspacePath)
				print time.strftime("%Y %m %Z %H:%M:%S"), "Enter Lockspace End"

			print "Try to acquire leader lease LS:leader:%s:0" % leasePath
			sanlock.acquire("LS", "leader", [(leasePath, 0)], sfd)
			
		except sanlock.SanlockException:
			print "Failed to acquire leases, try again in 10s."
			time.sleep(10)
		else:
			break

	while True:
		print time.strftime("%Y %m %Z %H:%M:%S"), "Service is running"
		time.sleep(5)


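# The parent loop below restarts serviceMain 15 seconds after the child
# exits, e.g. after sanlock has killed it because its lease could not be
# renewed.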
if __name__ == "__main__":
	try:
		hostId = int(sys.argv[1])
		lockspacePath = sys.argv[2]
		leasePath = sys.argv[3]
	except Exception:
		sys.stderr.write(
			"Usage: %s host_id lockspace_path lease_path\n" % sys.argv[0])
		exit(1)

	while True:
		p = Process(target=serviceMain,
					args=(hostId, lockspacePath, leasePath))
		p.start()
		p.join()
		time.sleep(15)

Run the test script

$ ./killpath_test.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Set killpath ok.

Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 10:41:43 Enter Lockspace Begin
2017 04 CST 10:42:04 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 10:42:04 Service is running
2017 04 CST 10:42:09 Service is running
2017 04 CST 10:42:14 Service is running
2017 04 CST 10:42:19 Service is running
2017 04 CST 10:42:24 Service is running
2017 04 CST 10:42:29 Service is running
2017 04 CST 10:42:34 Service is running
2017 04 CST 10:42:39 Service is running
2017 04 CST 10:42:44 Service is running
2017 04 CST 10:42:49 Service is running
2017 04 CST 10:42:54 Service is running

$ ps -aux | grep killpath_test
root      95519  0.0  0.3 165956  5960 pts/1    S+   10:41   0:00 /usr/bin/python ./killpath_test.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease

Stop the NFS service and check sanlock status and logs
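
How the NFS service was stopped is not shown. Assuming the lease files sit on an NFS export mounted at /mnt/nfs, running something like the following on the NFS server is enough to make renewals fail:

$ systemctl stop nfs-server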

$ sanlock client status -D
daemon ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
    our_host_name=ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
    use_watchdog=1
    high_priority=1
    mlock_level=1
    quiet_fail=1
    debug_renew=0
    renewal_history_size=180
    gid=179
    uid=179
    sh_retries=8
    use_aio=1
    kill_grace_seconds=40
    helper_pid=1108
    helper_kill_fd=4
    helper_full_count=0
    helper_last_status=36506
    monotime=36548
    version_str=3.4.0
    version_num=3.4.0
    version_hex=03040000
    smproto_hex=00000001
p -1 helper
    ci=0
    fd=5
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=0
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0
p -1 listener
    ci=1
    fd=7
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=0
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0
p 95519 
    ci=2
    fd=9
    pid=95519
    flags=2
    restricted=0
    cmd_active=0
    cmd_last=7
    pid_dead=0
    kill_count=28
    kill_last=36548
    suspend=0
    need_free=0
p -1 status
    ci=3
    fd=10
    pid=-1
    flags=0
    restricted=0
    cmd_active=0
    cmd_last=5
    pid_dead=0
    kill_count=0
    kill_last=0
    suspend=0
    need_free=0
s LS:1:/mnt/nfs/sanlock/idLease:0
    list=spaces
    space_id=4
    io_timeout=10
    host_generation=3
    renew_fail=1
    space_dead=1
    killing_pids=1
    used_retries=0
    external_used=0
    used_by_orphans=0
    renewal_read_extend_sec=10
    corrupt_result=0
    acquire_last_result=1
    renewal_last_result=-202
    acquire_last_attempt=36379
    acquire_last_success=36379
    renewal_last_attempt=36513
    renewal_last_success=36441
r LS:leader:/mnt/nfs/sanlock/leaderLease:0:2 p 95519
    list=held
    flags=0
    lver=2
    token_id=63
$ sanlock client log_dump
...
2017-04-26 10:42:04+0800 36400 [1110]: r63 paxos_acquire owner 1 2 35560 was old local new is 3
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 phase1 mbal 2001
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 phase2 bal 2001 inp 1 3 36400 q_max -1
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 commit self owner 1 3 36400
2017-04-26 10:42:04+0800 36400 [1110]: r63 acquire_disk rv 1 lver 2 at 36400
2017-04-26 10:42:04+0800 36400 [1110]: cmd_acquire 2,9,95519 result 0 pid_dead 0
2017-04-26 10:42:05+0800 36401 [1107]: s4 host 1 3 36379 ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
2017-04-26 10:43:15+0800 36471 [95520]: LS aio timeout RD 0x7efce80008c0:0x7efce80008d0:0x7efcef2f6000 ioto 10 to_count 1
2017-04-26 10:43:15+0800 36471 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:15+0800 36471 [95520]: s4 renewal error -202 delta_length 10 last_success 36441
2017-04-26 10:43:16+0800 36472 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:26+0800 36482 [95520]: s4 delta_renew reap -202
2017-04-26 10:43:36+0800 36492 [95520]: LS aio timeout RD 0x7efce8000910:0x7efce8000920:0x7efcef1f4000 ioto 10 to_count 2
2017-04-26 10:43:36+0800 36492 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:36+0800 36492 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:43:36+0800 36492 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:45+0800 36501 [1107]: s4 check_our_lease warning 60 last_success 36441
2017-04-26 10:43:46+0800 36502 [1107]: s4 check_our_lease warning 61 last_success 36441
2017-04-26 10:43:46+0800 36502 [95520]: s4 delta_renew reap -202
2017-04-26 10:43:47+0800 36503 [1107]: s4 check_our_lease warning 62 last_success 36441
2017-04-26 10:43:48+0800 36504 [1107]: s4 check_our_lease warning 63 last_success 36441
2017-04-26 10:43:49+0800 36505 [1107]: s4 check_our_lease warning 64 last_success 36441
2017-04-26 10:43:50+0800 36506 [1107]: s4 check_our_lease warning 65 last_success 36441
2017-04-26 10:43:51+0800 36507 [1107]: s4 check_our_lease warning 66 last_success 36441
2017-04-26 10:43:52+0800 36508 [1107]: s4 check_our_lease warning 67 last_success 36441
2017-04-26 10:43:53+0800 36509 [1107]: s4 check_our_lease warning 68 last_success 36441
2017-04-26 10:43:54+0800 36510 [1107]: s4 check_our_lease warning 69 last_success 36441
2017-04-26 10:43:55+0800 36511 [1107]: s4 check_our_lease warning 70 last_success 36441
2017-04-26 10:43:56+0800 36512 [1107]: s4 check_our_lease warning 71 last_success 36441
2017-04-26 10:43:56+0800 36512 [95520]: LS aio timeout RD 0x7efce8000960:0x7efce8000970:0x7efcef0f2000 ioto 10 to_count 3
2017-04-26 10:43:56+0800 36512 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:56+0800 36512 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:43:57+0800 36513 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:57+0800 36513 [1107]: s4 check_our_lease warning 72 last_success 36441
2017-04-26 10:43:58+0800 36514 [1107]: s4 check_our_lease warning 73 last_success 36441
2017-04-26 10:43:59+0800 36515 [1107]: s4 check_our_lease warning 74 last_success 36441
2017-04-26 10:44:00+0800 36516 [1107]: s4 check_our_lease warning 75 last_success 36441
2017-04-26 10:44:01+0800 36517 [1107]: s4 check_our_lease warning 76 last_success 36441
2017-04-26 10:44:02+0800 36518 [1107]: s4 check_our_lease warning 77 last_success 36441
2017-04-26 10:44:03+0800 36519 [1107]: s4 check_our_lease warning 78 last_success 36441
2017-04-26 10:44:04+0800 36520 [1107]: s4 check_our_lease warning 79 last_success 36441
2017-04-26 10:44:05+0800 36521 [1107]: s4 check_our_lease failed 80
2017-04-26 10:44:05+0800 36521 [1107]: s4 set killing_pids check -1 remove 0
2017-04-26 10:44:05+0800 36521 [1107]: s4:r63 client_using_space pid 95519
2017-04-26 10:44:05+0800 36521 [1107]: s4 kill 95519 sig 100 count 1
2017-04-26 10:44:07+0800 36523 [95520]: s4 delta_renew reap -202
2017-04-26 10:44:17+0800 36533 [95520]: LS aio timeout RD 0x7efce80009b0:0x7efce80009c0:0x7efceeff0000 ioto 10 to_count 4
2017-04-26 10:44:17+0800 36533 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:17+0800 36533 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:44:17+0800 36533 [95520]: s4 delta_renew begin reap
2017-04-26 10:44:27+0800 36543 [95520]: s4 delta_renew reap -202
2017-04-26 10:44:37+0800 36553 [95520]: s4 delta_renew read rv -2 offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:37+0800 36553 [95520]: s4 renewal error -2 delta_length 20 last_success 36441
2017-04-26 10:44:45+0800 36561 [1107]: s4 kill 95519 sig 9 count 41
2017-04-26 10:44:45+0800 36561 [1107]: client_pid_dead 2,9,95519 cmd_active 0 suspend 0
2017-04-26 10:44:45+0800 36561 [1107]: dead 95519 ci 2 count 41
2017-04-26 10:44:45+0800 36561 [1107]: s4 all pids clear
2017-04-26 10:44:45+0800 36561 [1107]: s4 set thread_stop
2017-04-26 10:44:45+0800 36561 [1107]: s4 wdmd_test_live 0 0 to disable
2017-04-26 10:44:48+0800 36564 [95520]: s4 delta_renew read rv -2 offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:48+0800 36564 [95520]: s4 renewal error -2 delta_length 10 last_success 36441
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 0 0x7efce80008c0 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 1 0x7efce8000910 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 2 0x7efce8000960 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 3 0x7efce80009b0 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 0 0x7efce80008c0 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 1 0x7efce8000910 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 2 0x7efce8000960 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 3 0x7efce80009b0 busy

$ cat /var/log/killpath.log 
[2017年 04月 26日 星期三 11:07:10 CST] killpath test

This shows that as long as the process holding the lease does not exit, sanlock will not release the client data structures associated with that process.
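For reference, below is a minimal sketch of what such a lease-holding test client could look like. It is not the original killpath_test.py: the lockspace name LS, the resource name leader and the two lease paths are taken from the output above, while the helper path /usr/local/bin/killpath.sh and the exact signatures of the sanlock Python binding calls (register(), killpath(), add_lockspace(), acquire()) are assumptions.

#!/usr/bin/python
# Sketch of a killpath-style test client (assumptions noted above). It
# holds the leader lease and never exits on its own, so on renewal
# failure sanlock has to escalate from the killpath helper to SIGKILL.
import sys
import time

import sanlock

LOCKSPACE = "LS"      # lockspace name seen in the sanlock output above
RESOURCE = "leader"   # resource name of the leader lease


def main():
    host_id = int(sys.argv[1])     # e.g. 1
    id_lease = sys.argv[2]         # e.g. /mnt/nfs/sanlock/idLease
    leader_lease = sys.argv[3]     # e.g. /mnt/nfs/sanlock/leaderLease

    # Register with the sanlock daemon; the returned fd ties the
    # acquired leases to this process.
    fd = sanlock.register()

    # Ask sanlock to run a helper program instead of sending SIGTERM
    # when lease renewal fails. A helper that only logs (as the
    # killpath.log output shows) never terminates the process, so
    # sanlock falls back to SIGKILL after kill_grace_seconds.
    # Note: the killpath(path, args, slkfd) signature is an assumption.
    sanlock.killpath("/usr/local/bin/killpath.sh", [], fd)

    # Join the lockspace with our host id, then take the leader lease.
    sanlock.add_lockspace(LOCKSPACE, host_id, id_lease)
    sanlock.acquire(LOCKSPACE, RESOURCE, [(leader_lease, 0)], slkfd=fd)

    while True:
        print("%s Service is running" % time.strftime("%Y %m %Z %H:%M:%S"))
        time.sleep(5)


if __name__ == "__main__":
    main()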

Renewal failure testing with the watchdog

sanlock daemon failure test with the watchdog disabled

The default configuration already disables the watchdog-related functionality, but when the sanlock service starts it still loads the softdog driver and starts the wdmd service by default.

  1. Stop the wdmd service and unload the softdog driver.
$ lsmod | grep softdog
softdog                13319  0

$ fuser /dev/watchdog0
/dev/watchdog0:      34034

$ ps -aux | grep wdmd
root      34034  0.0  0.1  13804  3440 ?        SLs  11:22   0:00 /usr/sbin/wdmd

$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2017-04-24 11:22:08 CST; 1min 10s ago
  Process: 34032 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
  Process: 34030 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
 Main PID: 34034 (wdmd)
   CGroup: /system.slice/wdmd.service
           └─34034 /usr/sbin/wdmd

4月 24 11:22:08 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:22:08 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:22:08 Test1 wdmd[34034]: wdmd started S0 H1 G179
4月 24 11:22:08 Test1 wdmd[34034]: /dev/watchdog0 armed with fire_timeout 60

$ systemctl stop wdmd

$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since 一 2017-04-24 11:23:36 CST; 39s ago
  Process: 34032 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
  Process: 34030 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
 Main PID: 34034 (code=exited, status=0/SUCCESS)

4月 24 11:22:08 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:22:08 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:22:08 Test1 wdmd[34034]: wdmd started S0 H1 G179
4月 24 11:22:08 Test1 wdmd[34034]: /dev/watchdog0 armed with fire_timeout 60
4月 24 11:23:36 Test1 systemd[1]: Stopping Watchdog Multiplexing Daemon...
4月 24 11:23:36 Test1 systemd[1]: Stopped Watchdog Multiplexing Daemon.

$ rmmod softdog
$ lsmod | grep softdog

$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2017-04-24 11:05:05 CST; 20min ago
  Process: 31606 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 31608 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─31608 /usr/sbin/sanlock daemon
           └─31609 /usr/sbin/sanlock daemon

4月 24 11:05:05 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:05:05 Test1 systemd[1]: Started Shared Storage Lease Manager.
  2. Start the simpleHA.py test script on the Test1 or Test2 node.
$ ./simpleHA.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 11:10:32 Enter Lockspace Begin
2017 04 CST 11:10:53 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 11:10:53 Service is running
2017 04 CST 11:11:03 Service is running
2017 04 CST 11:11:13 Service is running
2017 04 CST 11:11:23 Service is running
2017 04 CST 11:11:33 Service is running

$ sanlock direct dump /mnt/nfs/sanlock/idLease
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                                   LS       583c5507-34a7-412f-ad9d-daae650a7a8f.Test1 0000009313 0001 0001

$ sanlock direct dump /mnt/nfs/sanlock/leaderLease 
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000                                   LS                                           leader 0000000000 0001 0001 14

$ sanlock client log_dump
2017-04-24 11:05:05+0800 7939 [31608]: sanlock daemon started 3.4.0 host 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:10:32+0800 8266 [31608]: cmd_register ci 2 fd 9 pid 32438
2017-04-24 11:10:32+0800 8266 [31617]: cmd_add_lockspace 3,10 LS:1:/mnt/nfs/sanlock/idLease:0 flags 0 timeout 0
2017-04-24 11:10:32+0800 8266 [31617]: s1 lockspace LS:1:/mnt/nfs/sanlock/idLease:0
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire begin LS:1
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire write 1 1 8266 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire delta_short_delay 20
2017-04-24 11:10:52+0800 8286 [32439]: s1 delta_acquire done 1 1 8266
2017-04-24 11:10:53+0800 8287 [31617]: s1 add_lockspace done
2017-04-24 11:10:53+0800 8287 [31617]: cmd_add_lockspace 3,10 done 0
2017-04-24 11:10:53+0800 8287 [31616]: cmd_acquire 2,9,32438 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:10:53+0800 8287 [31616]: s1:r1 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,32438
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire begin 2 0 0
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire leader 0 free
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 phase1 mbal 1
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 phase2 bal 1 inp 1 1 8287 q_max -1
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 commit self owner 1 1 8287
2017-04-24 11:10:53+0800 8287 [31616]: r1 acquire_disk rv 1 lver 1 at 8287
2017-04-24 11:10:53+0800 8287 [31616]: cmd_acquire 2,9,32438 result 0 pid_dead 0
2017-04-24 11:10:54+0800 8287 [31608]: s1 host 1 1 8266 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:11:53+0800 8347 [31608]: client_pid_dead 2,9,32438 cmd_active 0 suspend 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 release async r_flags 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 paxos_release leader 1 owner 1 1 8287
2017-04-24 11:11:53+0800 8347 [31618]: r1 release async done r_flags 0
2017-04-24 11:12:08+0800 8362 [31608]: cmd_register ci 2 fd 9 pid 32710
2017-04-24 11:12:08+0800 8362 [31616]: cmd_acquire 2,9,32710 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:12:08+0800 8362 [31616]: s1:r2 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,32710
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire begin 2 0 0
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire leader 1 owner 1 1 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire leader 1 free
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 phase1 mbal 1
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 phase2 bal 1 inp 1 1 8362 q_max -1
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 commit self owner 1 1 8362
2017-04-24 11:12:08+0800 8362 [31616]: r2 acquire_disk rv 1 lver 2 at 8362

...

2017-04-24 11:26:55+0800 9248 [31608]: client_pid_dead 2,9,34711 cmd_active 0 suspend 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 release async r_flags 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 paxos_release leader 13 owner 1 1 9188
2017-04-24 11:26:55+0800 9248 [31618]: r13 release async done r_flags 0
2017-04-24 11:27:10+0800 9263 [31608]: cmd_register ci 2 fd 9 pid 34871
2017-04-24 11:27:10+0800 9263 [31616]: cmd_acquire 2,9,34871 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:27:10+0800 9263 [31616]: s1:r14 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,34871
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire begin 2 0 0
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire leader 13 owner 1 1 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire leader 13 free
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 phase1 mbal 1
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 phase2 bal 1 inp 1 1 9263 q_max -1
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 commit self owner 1 1 9263
2017-04-24 11:27:10+0800 9263 [31616]: r14 acquire_disk rv 1 lver 14 at 9263
2017-04-24 11:27:10+0800 9263 [31616]: cmd_acquire 2,9,34871 result 0 pid_dead 0
2017-04-24 11:28:10+0800 9323 [31608]: client_pid_dead 2,9,34871 cmd_active 0 suspend 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 release async r_flags 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 paxos_release leader 14 owner 1 1 9263
2017-04-24 11:28:10+0800 9323 [31618]: r14 release async done r_flags 0
2017-04-24 11:28:25+0800 9338 [31608]: cmd_register ci 2 fd 9 pid 35092
2017-04-24 11:28:25+0800 9338 [31616]: cmd_acquire 2,9,35092 ci_in 2 fd 9 count 1 flags 0
  3. Kill the sanlock service with signal 9 (SIGKILL).
$ killall -9 sanlock

$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since 一 2017-04-24 11:29:13 CST; 10s ago
  Process: 31606 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 31608 (code=killed, signal=KILL)

4月 24 11:05:05 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:05:05 Test1 systemd[1]: Started Shared Storage Lease Manager.
4月 24 11:29:13 Test1 systemd[1]: sanlock.service: main process exited, code=killed, status=9/KILL
4月 24 11:29:13 Test1 systemd[1]: Unit sanlock.service entered failed state.
4月 24 11:29:13 Test1 systemd[1]: sanlock.service failed.

$ sanlock client log_dump

Output of the simpleHA.py script:

Process Process-16:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./simpleHA.py", line 10, in serviceMain
    sfd = sanlock.register()
SanlockException: (111, 'Sanlock registration failed', 'Connection refused')
Process Process-17:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap

After waiting for a while (about 3 minutes), the system did not reboot and showed no other anomalies.
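As the traceback above shows, once the daemon is gone every new worker process fails in sanlock.register() with errno 111 (Connection refused). A client that should survive a daemon restart could wrap registration in a retry loop; the sketch below is an illustration only, not part of simpleHA.py.

# Minimal sketch: retry registration until the sanlock daemon's
# control socket is reachable again.
import errno
import time

import sanlock


def register_with_retry(interval=10):
    """Keep calling sanlock.register() until the daemon accepts us."""
    while True:
        try:
            return sanlock.register()
        except sanlock.SanlockException as e:
            # errno 111 (ECONNREFUSED): the daemon is not running (yet).
            if e.errno != errno.ECONNREFUSED:
                raise
            time.sleep(interval)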

sanlock renewal failure test with the watchdog enabled

  1. Enable sanlock's watchdog support.
$ modprobe softdog
$ dmesg -c
[ 9711.194159] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)

$ vi /etc/sanlock/sanlock.conf
use_watchdog = 1

$ systemctl enable wdmd && systemctl start wdmd && systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2017-04-24 11:35:16 CST; 6ms ago
  Process: 36108 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
  Process: 36105 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
 Main PID: 36110 (wdmd)
   CGroup: /system.slice/wdmd.service
           └─36110 /usr/sbin/wdmd

4月 24 11:35:16 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:35:16 Test1 wdmd[36110]: wdmd started S0 H1 G179
4月 24 11:35:16 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:35:16 Test1 wdmd[36110]: /dev/watchdog0 armed with fire_timeout 60

$ systemctl restart sanlock && systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2017-04-24 11:35:22 CST; 10ms ago
  Process: 36133 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 36134 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─36134 /usr/sbin/sanlock daemon
           └─36135 /usr/sbin/sanlock daemon

4月 24 11:35:22 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:35:22 Test1 systemd[1]: Started Shared Storage Lease Manager.

$ fuser /dev/watchdog0
/dev/watchdog0:      36110

$ ps -aux | grep wdmd
root      36110  0.0  0.1  13804  3440 ?        SLs  11:35   0:00 /usr/sbin/wdmd

$ ll /proc/36110/fd
总用量 0
lrwx------ 1 root root 64 4月  24 11:35 0 -> /dev/null
lrwx------ 1 root root 64 4月  24 11:35 1 -> /dev/null
lrwx------ 1 root root 64 4月  24 11:35 2 -> /dev/null
lrwx------ 1 root root 64 4月  24 11:35 3 -> socket:[228511]
l-wx------ 1 root root 64 4月  24 11:35 4 -> /run/wdmd/wdmd.pid
lrwx------ 1 root root 64 4月  24 11:35 5 -> /dev/shm/wdmd
lrwx------ 1 root root 64 4月  24 11:35 6 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 4月  24 11:35 7 -> socket:[228514]
l-wx------ 1 root root 64 4月  24 11:35 8 -> /dev/watchdog0

$ ps -aux | grep sanlock
sanlock   36134  0.0  0.9 259592 17908 ?        SLsl 11:35   0:00 /usr/sbin/sanlock daemon
root      36135  0.0  0.0  27680   208 ?        S    11:35   0:00 /usr/sbin/sanlock daemon

$ ll /proc/36134/fd
总用量 0
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 0 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 1 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 2 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 3 -> /var/log/sanlock.log
l-wx------ 1 sanlock sanlock 64 4月  24 11:35 4 -> pipe:[228572]
lr-x------ 1 sanlock sanlock 64 4月  24 11:35 5 -> pipe:[228573]
l-wx------ 1 sanlock sanlock 64 4月  24 11:35 6 -> /run/sanlock/sanlock.pid
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 7 -> socket:[228581]
lrwx------ 1 sanlock sanlock 64 4月  24 11:35 8 -> anon_inode:[eventfd]

$ ll /proc/36135/fd
总用量 0
lrwx------ 1 root root 64 4月  24 11:35 0 -> /dev/null
lrwx------ 1 root root 64 4月  24 11:35 1 -> /dev/null
lrwx------ 1 root root 64 4月  24 11:35 2 -> /dev/null
lr-x------ 1 root root 64 4月  24 11:35 3 -> pipe:[228572]
l-wx------ 1 root root 64 4月  24 11:35 6 -> pipe:[228573]
  2. Start the simpleHA.py test script on the Test1 or Test2 node.
$ ./simpleHA.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 11:46:59 Enter Lockspace Begin
2017 04 CST 11:49:40 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 11:49:40 Service is running
2017 04 CST 11:49:50 Service is running
2017 04 CST 11:50:00 Service is running
2017 04 CST 11:50:10 Service is running
2017 04 CST 11:50:20 Service is running

  3. Kill the sanlock service with signal 9 (SIGKILL).
$ killall -9 sanlock

$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since 一 2017-04-24 11:53:07 CST; 14s ago
  Process: 36133 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
 Main PID: 36134 (code=killed, signal=KILL)

4月 24 11:35:22 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:35:22 Test1 systemd[1]: Started Shared Storage Lease Manager.
4月 24 11:53:07 Test1 systemd[1]: sanlock.service: main process exited, code=killed, status=9/KILL
4月 24 11:53:07 Test1 systemd[1]: Unit sanlock.service entered failed state.
4月 24 11:53:07 Test1 systemd[1]: sanlock.service failed.


$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2017-04-24 11:35:16 CST; 19min ago
  Process: 36108 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
  Process: 36105 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
 Main PID: 36110 (wdmd)
   CGroup: /system.slice/wdmd.service
           └─36110 /usr/sbin/wdmd

4月 24 11:54:26 Test1 wdmd[36110]: test failed rem 51 now 10899 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:27 Test1 wdmd[36110]: test failed rem 50 now 10900 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:28 Test1 wdmd[36110]: test failed rem 49 now 10901 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:29 Test1 wdmd[36110]: test failed rem 48 now 10902 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:30 Test1 wdmd[36110]: test failed rem 47 now 10903 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:31 Test1 wdmd[36110]: test failed rem 46 now 10904 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:32 Test1 wdmd[36110]: test failed rem 45 now 10905 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:33 Test1 wdmd[36110]: test failed rem 44 now 10906 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:34 Test1 wdmd[36110]: test failed rem 43 now 10907 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:35 Test1 wdmd[36110]: test failed rem 42 now 10908 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1

Output of the simpleHA.py script:

Process Process-4:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./simpleHA.py", line 10, in serviceMain
    sfd = sanlock.register()
SanlockException: (111, 'Sanlock registration failed', 'Connection refused')
Process Process-5:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)

After a while, the system rebooted.

Summary

  • With watchdog support enabled, if sanlock still cannot terminate the lease-holding processes after the escalation period, it stops feeding the watchdog and the watchdog reboots the system. Killing the sanlock daemon has the same effect in this test: the wdmd test for the sanlock lockspace keeps failing (see the countdown in the wdmd log above), and once the 60-second fire_timeout armed on /dev/watchdog0 expires, the machine is reset.

Reposted from: https://my.oschina.net/LastRitter/blog/1540118
