Lease-renewal failure test for libvirt and sanlock with the default configuration
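- For reference, "default configuration" here refers to the stock libvirt sanlock lock-manager setup. On the test nodes it presumably looks roughly like the following; the exact values are an assumption rather than a capture from the hosts (host_id is 1 on Test1 and 2 on Test2, matching the lockspace entries seen in the logs below):
$ grep lock_manager /etc/libvirt/qemu.conf
lock_manager = "sanlock"
$ cat /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
require_lease_for_disks = 1
host_id = 2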
Create a virtual machine that uses a local disk (on the Test1 or Test2 node)
- Initialize the virtual machine disk image:
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd3
- Create the test3_sanlock virtual machine:
$ vi test3_sanlock.xml
<domain type='kvm'>
<name>test3_sanlock</name>
<memory>262144</memory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64' machine='pc'>hvm</type>
<boot dev='hd'/>
</os>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/dev/sdd3'/>
<target dev='hda' bus='ide'/>
</disk>
<input type='tablet' bus='usb'/>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
</devices>
</domain>
- Define the virtual machine:
$ virt-xml-validate test3_sanlock.xml
test3_sanlock.xml validates
$ virsh define test3_sanlock.xml
Domain test3_sanlock defined from test3_sanlock.xml
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test3_sanlock                  shut off
 -     test_sanlock                   shut off
- Start the virtual machine:
$ virsh start test3_sanlock
Domain test3_sanlock started
- State of all the virtual machines:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     test3_sanlock                  running
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test_sanlock                   shut off
- Virtual machine process status:
$ ps -aux | grep /dev/sdd3
qemu 20345 22.8 5.0 675848 94884 ? Sl 10:13 0:10 /usr/libexec/qemu-kvm -name test3_sanlock -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 9cc6d09e-46e0-4fee-9113-938221810bad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test3_sanlock/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
root 20430 0.0 0.0 112664 972 pts/0 S+ 10:14 0:00 grep --color=auto /dev/sdd3
- Lease file listing:
$ ls /var/lib/libvirt/sanlock -l
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:55 1199371e4095b4aeb587631d5e61ea06
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:54 3f8518ff5358a6757f0a5918e3ec7be2
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:55 51d27a61a6a3dd58637b6e00bb719cae
-rw------- 1 sanlock sanlock 1048576 Apr 20 19:54 6cfae8f6c4541a92e4aa52f14e9977a5
-rw------- 1 sanlock sanlock 1048576 Apr 20 20:24 7edb5b6820e56426339607637d18e871
-rw------- 1 sanlock sanlock 1048576 Apr 21 09:56 d726322246f67ea910a344b4426b4e0d
-rw-rw---- 1 sanlock sanlock 1048576 Apr 21 09:56 __LIBVIRT__DISKS__
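- The hash-named files are the per-disk resource leases that libvirt creates automatically. According to the libvirt sanlock documentation the file name is the MD5 checksum of the fully qualified disk path; this can be cross-checked (our addition, not part of the original run; if that rule holds, the command should print the d726322246f67ea910a344b4426b4e0d lease used by test3_sanlock):
$ echo -n /dev/sdd3 | md5sum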
- Lease status:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1 0000003949 0001 0002
00000512 __LIBVIRT__DISKS__ 10de2b81-24c8-4aa6-8b29-867833982ff5.Test2 0000003164 0002 0002
$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ d726322246f67ea910a344b4426b4e0d 0000004112 0002 0003 1
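- Besides sanlock direct dump, the daemon's live view of lockspaces and held resources can also be queried; these commands were not captured in the original run but are useful for the same check:
$ sanlock client status
$ sanlock client gets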
- sanlock log:
$ cat /var/log/sanlock.log
...
2017-04-21 10:10:09+0800 3896 [19810]: sanlock daemon started 3.4.0 host 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2
2017-04-21 10:10:09+0800 3896 [19810]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2017-04-21 10:10:25+0800 3911 [19817]: r1 cmd_acquire 2,10,19859 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
2017-04-21 10:10:44+0800 3930 [19817]: r2 cmd_acquire 2,10,19914 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
2017-04-21 10:11:04+0800 3950 [19816]: s1 lockspace __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-21 10:13:46+0800 4112 [19816]: s1:r3 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,20345
2017-04-21 10:13:46+0800 4112 [19816]: s1:r4 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,20345
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 1 2 4872 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 2 3 4090 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2
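How the network was disconnected is not recorded in this transcript; a common way to simulate the outage is to take the storage-facing interface down and bring it back up afterwards (the interface name below is a placeholder):
$ ip link set dev ens33 down
$ ip link set dev ens33 up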
State 5 seconds after the network is disconnected
- The virtual machine is still running normally:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     test3_sanlock                  running
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test_sanlock                   shut off
- The virtual machine process is still alive:
$ ps -aux | grep /dev/sdd3
qemu 20345 6.1 5.0 675848 94884 ? Sl 10:13 0:15 /usr/libexec/qemu-kvm -name test3_sanlock -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 9cc6d09e-46e0-4fee-9113-938221810bad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test3_sanlock/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
root 20837 0.0 0.0 112664 972 pts/0 S+ 10:17 0:00 grep --color=auto /dev/sdd3
- The sanlock log already shows errors:
$ cat /var/log/sanlock.log
...
2017-04-21 10:17:40+0800 4347 [20022]: __LIBVIR aio timeout RD 0x7fda5c0008c0:0x7fda5c0008d0:0x7fda6e094000 ioto 10 to_count 1
2017-04-21 10:17:40+0800 4347 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:17:40+0800 4347 [20022]: s1 renewal error -202 delta_length 10 last_success 4316
2017-04-21 10:18:01+0800 4367 [20022]: __LIBVIR aio timeout RD 0x7fda5c000910:0x7fda5c000920:0x7fda6df92000 ioto 10 to_count 2
2017-04-21 10:18:01+0800 4367 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:01+0800 4367 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
State 30 seconds after the network is disconnected
- The virtual machine has been shut off:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test1_sanlock                  shut off
 -     test2_sanlock                  shut off
 -     test3_sanlock                  shut off
 -     test_sanlock                   shut off
- The virtual machine process has been terminated:
$ ps -aux | grep /dev/sdd3
root 21018 0.0 0.0 112664 972 pts/0 S+ 10:19 0:00 grep --color=auto /dev/sdd3
- The sanlock log shows that it killed the virtual machine process 20345 with SIGTERM (signal 15):
$ cat /var/log/sanlock.log
...
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 1 2 4872 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 10:13:46+0800 4112 [19810]: s1 host 2 3 4090 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2
2017-04-21 10:17:40+0800 4347 [20022]: __LIBVIR aio timeout RD 0x7fda5c0008c0:0x7fda5c0008d0:0x7fda6e094000 ioto 10 to_count 1
2017-04-21 10:17:40+0800 4347 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:17:40+0800 4347 [20022]: s1 renewal error -202 delta_length 10 last_success 4316
2017-04-21 10:18:01+0800 4367 [20022]: __LIBVIR aio timeout RD 0x7fda5c000910:0x7fda5c000920:0x7fda6df92000 ioto 10 to_count 2
2017-04-21 10:18:01+0800 4367 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:01+0800 4367 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:18:10+0800 4376 [19810]: s1 check_our_lease warning 60 last_success 4316
2017-04-21 10:18:11+0800 4377 [19810]: s1 check_our_lease warning 61 last_success 4316
2017-04-21 10:18:12+0800 4378 [19810]: s1 check_our_lease warning 62 last_success 4316
2017-04-21 10:18:13+0800 4379 [19810]: s1 check_our_lease warning 63 last_success 4316
2017-04-21 10:18:14+0800 4380 [19810]: s1 check_our_lease warning 64 last_success 4316
2017-04-21 10:18:15+0800 4381 [19810]: s1 check_our_lease warning 65 last_success 4316
2017-04-21 10:18:16+0800 4382 [19810]: s1 check_our_lease warning 66 last_success 4316
2017-04-21 10:18:17+0800 4383 [19810]: s1 check_our_lease warning 67 last_success 4316
2017-04-21 10:18:18+0800 4384 [19810]: s1 check_our_lease warning 68 last_success 4316
2017-04-21 10:18:19+0800 4385 [19810]: s1 check_our_lease warning 69 last_success 4316
2017-04-21 10:18:20+0800 4386 [19810]: s1 check_our_lease warning 70 last_success 4316
2017-04-21 10:18:21+0800 4387 [19810]: s1 check_our_lease warning 71 last_success 4316
2017-04-21 10:18:21+0800 4388 [20022]: __LIBVIR aio timeout RD 0x7fda5c000960:0x7fda5c000970:0x7fda6de90000 ioto 10 to_count 3
2017-04-21 10:18:21+0800 4388 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:21+0800 4388 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:18:22+0800 4388 [19810]: s1 check_our_lease warning 72 last_success 4316
2017-04-21 10:18:23+0800 4389 [19810]: s1 check_our_lease warning 73 last_success 4316
2017-04-21 10:18:24+0800 4390 [19810]: s1 check_our_lease warning 74 last_success 4316
2017-04-21 10:18:25+0800 4391 [19810]: s1 check_our_lease warning 75 last_success 4316
2017-04-21 10:18:26+0800 4392 [19810]: s1 check_our_lease warning 76 last_success 4316
2017-04-21 10:18:27+0800 4393 [19810]: s1 check_our_lease warning 77 last_success 4316
2017-04-21 10:18:28+0800 4394 [19810]: s1 check_our_lease warning 78 last_success 4316
2017-04-21 10:18:29+0800 4395 [19810]: s1 check_our_lease warning 79 last_success 4316
2017-04-21 10:18:30+0800 4396 [19810]: s1 check_our_lease failed 80
2017-04-21 10:18:30+0800 4396 [19810]: s1 kill 20345 sig 15 count 1
2017-04-21 10:18:30+0800 4396 [19810]: dead 20345 ci 2 count 1
2017-04-21 10:18:30+0800 4397 [19810]: s1 all pids clear
2017-04-21 10:18:42+0800 4408 [20022]: __LIBVIR aio timeout RD 0x7fda5c0009b0:0x7fda5c0009c0:0x7fda6dd8e000 ioto 10 to_count 4
2017-04-21 10:18:42+0800 4408 [20022]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-21 10:18:42+0800 4408 [20022]: s1 renewal error -202 delta_length 20 last_success 4316
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 0 0x7fda5c0008c0 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 1 0x7fda5c000910 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 2 0x7fda5c000960 busy
2017-04-21 10:19:42+0800 4468 [20022]: __LIBVIR close_task_aio 3 0x7fda5c0009b0 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 0 0x7fda5c0008c0 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 1 0x7fda5c000910 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 2 0x7fda5c000960 busy
2017-04-21 10:20:42+0800 4528 [20022]: __LIBVIR close_task_aio 3 0x7fda5c0009b0 busy
- The kernel messages show that iSCSI, multipath and NFS have all run into errors:
$ dmesg -c
[ 4116.360505] qemu-kvm: sending ioctl 5326 to a partition!
[ 4116.360528] qemu-kvm: sending ioctl 80200204 to a partition!
[ 4116.961920] kvm: zapping shadow pages for mmio generation wraparound
[ 4123.393299] kvm [20345]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
[ 4335.502566] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298988592, last ping 4298993600, now 4298998608
[ 4335.502712] connection1:0: detected conn error (1022)
[ 4335.502718] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298988592, last ping 4298993600, now 4298998608
[ 4335.502723] connection5:0: detected conn error (1022)
[ 4339.501709] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298992593, last ping 4298997600, now 4299002608
[ 4339.501777] connection3:0: detected conn error (1022)
[ 4339.501785] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298992593, last ping 4298997600, now 4299002608
[ 4339.501790] connection4:0: detected conn error (1022)
[ 4340.509931] connection6:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298993601, last ping 4298998608, now 4299003616
[ 4340.509940] session1: session recovery timed out after 5 secs
[ 4340.509952] connection6:0: detected conn error (1022)
[ 4340.509956] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298993601, last ping 4298998608, now 4299003616
[ 4340.509959] connection2:0: detected conn error (1022)
[ 4340.509964] session5: session recovery timed out after 5 secs
[ 4341.499795] device-mapper: multipath: Failing path 8:80.
[ 4341.499852] device-mapper: multipath: Failing path 8:128.
[ 4344.508453] session3: session recovery timed out after 5 secs
[ 4344.508470] sd 5:0:0:1: rejecting I/O to offline device
[ 4344.508474] sd 5:0:0:1: [sdg] killing request
[ 4344.508483] session4: session recovery timed out after 5 secs
[ 4344.508488] sd 5:0:0:1: [sdg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4344.508489] sd 5:0:0:1: [sdg] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4344.508491] blk_update_request: I/O error, dev sdg, sector 0
[ 4344.508507] sd 6:0:0:1: rejecting I/O to offline device
[ 4344.508509] sd 6:0:0:1: [sdj] killing request
[ 4344.508517] sd 6:0:0:1: [sdj] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4344.508518] sd 6:0:0:1: [sdj] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4344.508519] blk_update_request: I/O error, dev sdj, sector 0
[ 4345.501500] device-mapper: multipath: Failing path 8:96.
[ 4345.501545] device-mapper: multipath: Failing path 8:144.
[ 4345.504705] blk_update_request: I/O error, dev dm-1, sector 104857472
[ 4345.504726] blk_update_request: I/O error, dev dm-1, sector 104857472
[ 4345.504728] Buffer I/O error on device dm-1, logical block 13107184
[ 4345.516361] session6: session recovery timed out after 5 secs
[ 4345.516382] sd 8:0:0:1: rejecting I/O to offline device
[ 4345.516386] sd 8:0:0:1: [sdh] killing request
[ 4345.516393] session2: session recovery timed out after 5 secs
[ 4345.516398] sd 8:0:0:1: [sdh] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4345.516400] sd 8:0:0:1: [sdh] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4345.516402] blk_update_request: I/O error, dev sdh, sector 0
[ 4345.516419] sd 4:0:0:1: rejecting I/O to offline device
[ 4345.516421] sd 4:0:0:1: [sde] killing request
[ 4345.516427] sd 4:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 4345.516429] sd 4:0:0:1: [sde] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[ 4345.516430] blk_update_request: I/O error, dev sde, sector 0
[ 4345.522653] sd 4:0:0:1: rejecting I/O to offline device
[ 4345.522680] device-mapper: multipath: Failing path 8:64.
[ 4345.522690] blk_update_request: I/O error, dev dm-0, sector 209715072
[ 4345.522974] blk_update_request: I/O error, dev dm-0, sector 209715072
[ 4345.522978] Buffer I/O error on device dm-0, logical block 26214384
[ 4346.501748] device-mapper: multipath: Failing path 8:112.
[ 4391.783505] watchdog watchdog0: watchdog did not stop!
[ 4521.500331] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500336] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500337] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500338] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500339] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500340] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500341] nfs: server 192.168.195.131 not responding, still trying
[ 4521.500342] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847179] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847184] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847185] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847186] nfs: server 192.168.195.131 not responding, still trying
[ 4541.847187] nfs: server 192.168.195.131 not responding, still trying
- The lease status can no longer be queried normally:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
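- With the storage unreachable the dump can simply block on I/O; wrapping it in timeout(1) keeps the shell usable (our addition, not part of the original run):
$ timeout 10 sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__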
Reconnect the network
- Restart the services:
$ systemctl stop libvirtd
$ systemctl stop sanlock
$ mount -o remount /var/lib/libvirt/sanlock
$ systemctl start sanlock
$ systemctl start libvirtd
- Check the logs:
$ sanlock client log_dump
2017-04-21 10:57:39+0800 6743 [24933]: sanlock daemon started 3.4.0 host b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
- Check the lease status:
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1 0000007621 0001 0002
00000512 __LIBVIRT__DISKS__ 25869182-7a3b-4c95-9f66-51bb606cdfc3.Test2 0000004316 0002 0003
$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ d726322246f67ea910a344b4426b4e0d 0000004112 0002 0003 1
- Start the virtual machine again:
$ virsh start test3_sanlock
Domain test3_sanlock started
- Check the logs:
$ sanlock client log_dump
2017-04-21 10:57:39+0800 6743 [24933]: sanlock daemon started 3.4.0 host b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 10:57:39+0800 6743 [24933]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2017-04-21 15:13:53+0800 22095 [24938]: cmd_add_lockspace 2,10 __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0 flags 0 timeout 0
2017-04-21 15:13:53+0800 22095 [24938]: s1 lockspace __LIBVIRT__DISKS__:2:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-21 15:13:53+0800 22095 [43517]: s1 delta_acquire begin __LIBVIRT__DISKS__:2
2017-04-21 15:13:53+0800 22095 [43517]: s1 delta_acquire delta_large_delay 80 delay 140
2017-04-21 15:16:13+0800 22235 [43517]: s1 delta_acquire write 2 4 22235 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 15:16:13+0800 22235 [43517]: s1 delta_acquire delta_short_delay 20
2017-04-21 15:16:33+0800 22255 [43517]: s1 delta_acquire done 2 4 22235
2017-04-21 15:16:34+0800 22256 [24938]: s1 add_lockspace done
2017-04-21 15:16:34+0800 22256 [24938]: cmd_add_lockspace 2,10 done 0
2017-04-21 15:16:35+0800 22257 [24933]: s1 host 1 2 23019 581e732d-e4b1-4216-8b2f-1a63f08bb28d.Test1
2017-04-21 15:16:35+0800 22257 [24933]: s1 host 2 4 22235 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-21 15:17:48+0800 22330 [24933]: cmd_register ci 2 fd 10 pid 43842
2017-04-21 15:17:48+0800 22330 [24933]: cmd_restrict ci 2 fd 10 pid 43842 flags 1
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 ci_in 3 fd 13
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 ci_in 3 fd 13 count 1 flags 0
2017-04-21 15:17:49+0800 22330 [24938]: s1:r1 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,43842
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire begin 2 0 0
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire leader 1 owner 2 3 4112 max mbal[1] 2 our_dblock 2 2 2 3 4112 1
2017-04-21 15:17:49+0800 22330 [24938]: r1 paxos_acquire owner 2 3 4112 was old local new is 4
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 phase1 mbal 2002
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 phase2 bal 2002 inp 2 4 22330 q_max -1
2017-04-21 15:17:49+0800 22330 [24938]: r1 ballot 2 commit self owner 2 4 22330
2017-04-21 15:17:49+0800 22330 [24938]: r1 acquire_disk rv 1 lver 2 at 22330
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 result 0 pid_dead 0
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 ci_in 3 fd 13
2017-04-21 15:17:49+0800 22330 [24937]: cmd_inquire 2,10,43842 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 ci_in 3 fd 13 count 1 flags 0
2017-04-21 15:17:49+0800 22330 [24938]: s1:r2 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,10,43842
2017-04-21 15:17:49+0800 22330 [24938]: r2 cmd_acquire 2,10,43842 acquire_token -17
2017-04-21 15:17:49+0800 22330 [24938]: cmd_acquire 2,10,43842 result -17 pid_dead 0
Everything is back to normal.
Lease-renewal failure test using the on_lockfailure element in libvirt
- Prepare the virtual machine disk images:
$ wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd3
$ dd if=cirros-0.3.4-x86_64-disk.img of=/dev/sdd4
- Configure and define a virtual machine that is powered off on lock failure (on_lockfailure = poweroff):
$ vi test_lockfailure1.xml
<domain type='kvm'>
<name>test_lockfailure1</name>
<memory>262144</memory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64' machine='pc'>hvm</type>
<boot dev='hd'/>
</os>
<on_lockfailure>poweroff</on_lockfailure>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/dev/sdd3'/>
<target dev='hda' bus='ide'/>
</disk>
<graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
</devices>
</domain>
$ virsh define test_lockfailure1.xml
Domain test_lockfailure1 defined from test_lockfailure1.xml
- Configure and define a virtual machine that is paused on lock failure (on_lockfailure = pause):
$ vi test_lockfailure2.xml
<domain type='kvm'>
<name>test_lockfailure2</name>
<memory>262144</memory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64' machine='pc'>hvm</type>
<boot dev='hd'/>
</os>
<on_lockfailure>pause</on_lockfailure>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/dev/sdd4'/>
<target dev='hda' bus='ide'/>
</disk>
<graphics type='vnc' port='-1' listen = '0.0.0.0' autoport='yes' keymap='en-us'/>
</devices>
</domain>
$ virsh define test_lockfailure2.xml
Domain test_lockfailure2 defined from test_lockfailure2.xml
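- For reference, libvirt's on_lockfailure element accepts four actions: poweroff, restart, pause and ignore; only poweroff and pause are exercised here. The setting can be double-checked after defining a domain, e.g.:
$ virsh dumpxml test_lockfailure2 | grep on_lockfailure
  <on_lockfailure>pause</on_lockfailure>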
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test_lockfailure1              shut off
 -     test_lockfailure2              shut off
- Start the virtual machines:
$ virsh start test_lockfailure1
Domain test_lockfailure1 started
$ virsh start test_lockfailure2
Domain test_lockfailure2 started
$ virsh vncdisplay test_lockfailure1
:0
$ vncviewer :0
$ virsh vncdisplay test_lockfailure2
:1
$ vncviewer :1
- Check the current status:
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     test_lockfailure1              running
 2     test_lockfailure2              running
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2017-04-25 11:13:26 CST; 5min ago
Process: 1072 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 1109 (sanlock)
CGroup: /system.slice/sanlock.service
├─1109 /usr/sbin/sanlock daemon
└─1110 /usr/sbin/sanlock daemon
Apr 25 11:13:18 Test1 systemd[1]: Starting Shared Storage Lease Manager...
Apr 25 11:13:26 Test1 systemd[1]: Started Shared Storage Lease Manager.
$ sanlock client log_dump
2017-04-25 11:13:19+0800 26 [1109]: sanlock daemon started 3.4.0 host dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:13:44+0800 50 [1112]: cmd_add_lockspace 2,9 __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0 flags 0 timeout 0
2017-04-25 11:13:44+0800 50 [1112]: s1 lockspace __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
2017-04-25 11:13:44+0800 50 [3078]: s1 delta_acquire begin __LIBVIRT__DISKS__:1
2017-04-25 11:13:44+0800 50 [3078]: s1 delta_acquire delta_large_delay 80 delay 140
2017-04-25 11:16:04+0800 190 [3078]: s1 delta_acquire write 1 3 190 dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:16:04+0800 190 [3078]: s1 delta_acquire delta_short_delay 20
2017-04-25 11:16:24+0800 211 [3078]: s1 delta_acquire done 1 3 190
2017-04-25 11:16:25+0800 211 [1112]: s1 add_lockspace done
2017-04-25 11:16:25+0800 211 [1112]: cmd_add_lockspace 2,9 done 0
2017-04-25 11:16:26+0800 212 [1109]: s1 host 1 3 190 dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1
2017-04-25 11:16:26+0800 212 [1109]: s1 host 2 4 40508 b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2
2017-04-25 11:17:10+0800 256 [1109]: cmd_register ci 2 fd 9 pid 13022
2017-04-25 11:17:10+0800 256 [1113]: cmd_killpath 2,9,13022 flags 0
2017-04-25 11:17:10+0800 256 [1109]: cmd_restrict ci 2 fd 9 pid 13022 flags 1
2017-04-25 11:17:10+0800 256 [1112]: cmd_inquire 2,9,13022 ci_in 3 fd 12
2017-04-25 11:17:10+0800 256 [1112]: cmd_inquire 2,9,13022 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-25 11:17:10+0800 256 [1113]: cmd_acquire 2,9,13022 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:10+0800 256 [1113]: s1:r1 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,9,13022
2017-04-25 11:17:10+0800 256 [1113]: r1 paxos_acquire begin 2 0 0
2017-04-25 11:17:10+0800 257 [1113]: r1 paxos_acquire leader 2 owner 2 4 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-25 11:17:10+0800 257 [1113]: r1 paxos_acquire leader 2 free
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 phase1 mbal 1
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 phase2 bal 1 inp 1 3 257 q_max -1
2017-04-25 11:17:10+0800 257 [1113]: r1 ballot 3 commit self owner 1 3 257
2017-04-25 11:17:10+0800 257 [1113]: r1 acquire_disk rv 1 lver 3 at 257
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 result 0 pid_dead 0
2017-04-25 11:17:10+0800 257 [1112]: cmd_inquire 2,9,13022 ci_in 3 fd 12
2017-04-25 11:17:10+0800 257 [1112]: cmd_inquire 2,9,13022 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 ci_in 3 fd 12 count 1 flags 0
2017-04-25 11:17:10+0800 257 [1113]: s1:r2 resource __LIBVIRT__DISKS__:d726322246f67ea910a344b4426b4e0d:/var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d:0 for 2,9,13022
2017-04-25 11:17:10+0800 257 [1113]: r2 cmd_acquire 2,9,13022 acquire_token -17
2017-04-25 11:17:10+0800 257 [1113]: cmd_acquire 2,9,13022 result -17 pid_dead 0
2017-04-25 11:17:12+0800 258 [1112]: cmd_align 3,12
2017-04-25 11:17:12+0800 258 [1112]: cmd_align 3,12 done 1048576
2017-04-25 11:17:12+0800 258 [1113]: cmd_write_resource 3,12 __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0
2017-04-25 11:17:12+0800 258 [1109]: cmd_register ci 3 fd 12 pid 13044
2017-04-25 11:17:12+0800 258 [1112]: cmd_killpath 3,12,13044 flags 0
2017-04-25 11:17:12+0800 258 [1109]: cmd_restrict ci 3 fd 12 pid 13044 flags 1
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 0 cat_count 0 strlen 0
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r3 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire begin 2 0 0
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-25 11:17:12+0800 259 [1113]: r3 paxos_acquire leader 0 free
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 phase1 mbal 1
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 phase2 bal 1 inp 1 3 259 q_max -1
2017-04-25 11:17:12+0800 259 [1113]: r3 ballot 1 commit self owner 1 3 259
2017-04-25 11:17:12+0800 259 [1113]: r3 acquire_disk rv 1 lver 1 at 259
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result 0 pid_dead 0
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r4 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r4 cmd_acquire 3,12,13044 acquire_token -17
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result -17 pid_dead 0
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1 0000000437 0001 0003
00000512 __LIBVIRT__DISKS__ b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2 0000040508 0002 0004
$ sanlock direct dump /var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ 0ff52295a7bb219221ea57f31f343ba8 0000000259 0001 0003 1
$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ d726322246f67ea910a344b4426b4e0d 0000000257 0001 0003 3
$ ps -aux | grep qemu-kvm
qemu 13022 31.6 5.1 678052 96468 ? Sl 11:17 1:18 /usr/libexec/qemu-kvm -name test_lockfailure1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 37266b5e-d0fc-4287-bf2d-fcaa75aca64c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test_lockfailure1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
qemu 13044 32.2 5.1 678056 96896 ? Sl 11:17 1:19 /usr/libexec/qemu-kvm -name test_lockfailure2 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2562a258-d4be-486f-959a-1fb993958811 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-test_lockfailure2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd4,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
- Disconnect the network (for roughly 3 minutes) and wait for a while:
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2017-04-25 11:13:26 CST; 29min ago
Process: 1072 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 1109 (sanlock)
CGroup: /system.slice/sanlock.service
├─ 1109 /usr/sbin/sanlock daemon
├─ 1110 /usr/sbin/sanlock daemon
└─14234 /usr/libexec/libvirt_sanlock_helper qemu:///system 2562a258-d4be-486f-959a-1fb993958811 pause
Apr 25 11:24:37 Test1 sanlock[1109]: 2017-04-25 11:24:37+0800 703 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__L...__DISKS__
Apr 25 11:24:37 Test1 sanlock[1109]: 2017-04-25 11:24:37+0800 703 [3078]: s1 renewal error -2 delta_length 10 last_success 580
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
Apr 25 11:25:37 Test1 sanlock[1109]: 2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
Apr 25 11:26:37 Test1 sanlock[1109]: 2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
Hint: Some lines were ellipsized, use -l to show in full.
$ systemctl status libvirtd
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2017-04-25 11:13:42 CST; 29min ago
Docs: man:libvirtd(8)
http://libvirt.org
Main PID: 1878 (libvirtd)
CGroup: /system.slice/libvirtd.service
├─ 1878 /usr/sbin/libvirtd
├─12842 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
└─12843 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
Apr 25 11:16:26 Test1 dnsmasq-dhcp[12842]: DHCP, IP range 192.168.122.2 -- 192.168.122.254, lease time 1h
Apr 25 11:16:26 Test1 dnsmasq[12842]: reading /etc/resolv.conf
Apr 25 11:16:26 Test1 dnsmasq[12842]: using nameserver 114.114.114.114#53
Apr 25 11:16:26 Test1 dnsmasq[12842]: read /etc/hosts - 6 addresses
Apr 25 11:16:26 Test1 dnsmasq[12842]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Apr 25 11:16:26 Test1 dnsmasq-dhcp[12842]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Apr 25 11:17:10 Test1 libvirtd[1878]: libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-0...ntos.org)
Apr 25 11:17:10 Test1 libvirtd[1878]: hostname: test1
Apr 25 11:17:10 Test1 libvirtd[1878]: Failed to acquire lock: File exists
Apr 25 11:17:12 Test1 libvirtd[1878]: Failed to acquire lock: File exists
Hint: Some lines were ellipsized, use -l to show in full.
$ ps -aux | grep qemu-kvm
$ sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ dd75027f-b6f2-4f46-bd75-11d24f7ce2a9.Test1 0000000437 0001 0003
00000512 __LIBVIRT__DISKS__ b888aabb-0f5f-45ce-a310-ff8021f514fd.Test2 0000040508 0002 0004
$ sanlock direct dump /var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ 0ff52295a7bb219221ea57f31f343ba8 0000000459 0001 0001 1
$ sanlock direct dump /var/lib/libvirt/sanlock/d726322246f67ea910a344b4426b4e0d
offset lockspace resource timestamp own gen lver
00000000 __LIBVIRT__DISKS__ d726322246f67ea910a344b4426b4e0d 0000000456 0001 0001 1
$ sanlock client log_dump
...
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result 0 pid_dead 0
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 ci_in 4 fd 13
2017-04-25 11:17:12+0800 259 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 ci_in 4 fd 13 count 1 flags 0
2017-04-25 11:17:12+0800 259 [1113]: s1:r4 resource __LIBVIRT__DISKS__:0ff52295a7bb219221ea57f31f343ba8:/var/lib/libvirt/sanlock/0ff52295a7bb219221ea57f31f343ba8:0 for 3,12,13044
2017-04-25 11:17:12+0800 259 [1113]: r4 cmd_acquire 3,12,13044 acquire_token -17
2017-04-25 11:17:12+0800 259 [1113]: cmd_acquire 3,12,13044 result -17 pid_dead 0
2017-04-25 11:23:05+0800 611 [3078]: __LIBVIR aio timeout RD 0x7f49740008c0:0x7f49740008d0:0x7f4989c30000 ioto 10 to_count 1
2017-04-25 11:23:05+0800 611 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:05+0800 611 [3078]: s1 renewal error -202 delta_length 10 last_success 580
2017-04-25 11:23:05+0800 611 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:15+0800 621 [3078]: s1 delta_renew reap -202
2017-04-25 11:23:25+0800 631 [3078]: __LIBVIR aio timeout RD 0x7f4974000910:0x7f4974000920:0x7f4989a2d000 ioto 10 to_count 2
2017-04-25 11:23:25+0800 631 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:25+0800 631 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:23:26+0800 632 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:33+0800 640 [1109]: s1 check_our_lease warning 60 last_success 580
2017-04-25 11:23:34+0800 641 [1109]: s1 check_our_lease warning 61 last_success 580
2017-04-25 11:23:35+0800 642 [1109]: s1 check_our_lease warning 62 last_success 580
2017-04-25 11:23:36+0800 642 [3078]: s1 delta_renew reap -202
2017-04-25 11:23:36+0800 643 [1109]: s1 check_our_lease warning 63 last_success 580
2017-04-25 11:23:37+0800 644 [1109]: s1 check_our_lease warning 64 last_success 580
2017-04-25 11:23:38+0800 645 [1109]: s1 check_our_lease warning 65 last_success 580
2017-04-25 11:23:39+0800 646 [1109]: s1 check_our_lease warning 66 last_success 580
2017-04-25 11:23:40+0800 647 [1109]: s1 check_our_lease warning 67 last_success 580
2017-04-25 11:23:41+0800 648 [1109]: s1 check_our_lease warning 68 last_success 580
2017-04-25 11:23:42+0800 649 [1109]: s1 check_our_lease warning 69 last_success 580
2017-04-25 11:23:43+0800 650 [1109]: s1 check_our_lease warning 70 last_success 580
2017-04-25 11:23:44+0800 651 [1109]: s1 check_our_lease warning 71 last_success 580
2017-04-25 11:23:45+0800 652 [1109]: s1 check_our_lease warning 72 last_success 580
2017-04-25 11:23:46+0800 652 [3078]: __LIBVIR aio timeout RD 0x7f4974000960:0x7f4974000970:0x7f498992b000 ioto 10 to_count 3
2017-04-25 11:23:46+0800 652 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:23:46+0800 652 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:23:46+0800 652 [3078]: s1 delta_renew begin reap
2017-04-25 11:23:46+0800 653 [1109]: s1 check_our_lease warning 73 last_success 580
2017-04-25 11:23:47+0800 654 [1109]: s1 check_our_lease warning 74 last_success 580
2017-04-25 11:23:48+0800 655 [1109]: s1 check_our_lease warning 75 last_success 580
2017-04-25 11:23:49+0800 656 [1109]: s1 check_our_lease warning 76 last_success 580
2017-04-25 11:23:50+0800 657 [1109]: s1 check_our_lease warning 77 last_success 580
2017-04-25 11:23:51+0800 658 [1109]: s1 check_our_lease warning 78 last_success 580
2017-04-25 11:23:52+0800 659 [1109]: s1 check_our_lease warning 79 last_success 580
2017-04-25 11:23:53+0800 660 [1109]: s1 check_our_lease failed 80
2017-04-25 11:23:53+0800 660 [1109]: s1 set killing_pids check -1 remove 0
2017-04-25 11:23:53+0800 660 [1109]: s1:r1 client_using_space pid 13022
2017-04-25 11:23:53+0800 660 [1109]: s1 kill 13022 sig 100 count 1
2017-04-25 11:23:53+0800 660 [1109]: s1:r3 client_using_space pid 13044
2017-04-25 11:23:53+0800 660 [1109]: s1 kill 13044 sig 100 count 1
2017-04-25 11:23:53+0800 660 [1109]: client_pid_dead 2,9,13022 cmd_active 0 suspend 0
2017-04-25 11:23:53+0800 660 [1109]: dead 13022 ci 2 count 1
2017-04-25 11:23:54+0800 660 [1112]: cmd_inquire 3,12,13044 ci_in 2 fd 9
2017-04-25 11:23:54+0800 660 [1112]: cmd_inquire 3,12,13044 result 0 pid_dead 0 res_count 1 cat_count 1 strlen 113
2017-04-25 11:23:56+0800 662 [3078]: s1 delta_renew reap -202
2017-04-25 11:24:06+0800 672 [3078]: __LIBVIR aio timeout RD 0x7f49740009b0:0x7f49740009c0:0x7f4989829000 ioto 10 to_count 4
2017-04-25 11:24:06+0800 672 [3078]: s1 delta_renew read timeout 10 sec offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:06+0800 672 [3078]: s1 renewal error -202 delta_length 20 last_success 580
2017-04-25 11:24:07+0800 673 [3078]: s1 delta_renew begin reap
2017-04-25 11:24:17+0800 683 [3078]: s1 delta_renew reap -202
2017-04-25 11:24:27+0800 693 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:27+0800 693 [3078]: s1 renewal error -2 delta_length 20 last_success 580
2017-04-25 11:24:33+0800 700 [1109]: s1 kill 13044 sig 9 count 41
2017-04-25 11:24:33+0800 700 [1109]: client_pid_dead 3,12,13044 cmd_active 0 suspend 0
2017-04-25 11:24:33+0800 700 [1109]: dead 13044 ci 3 count 41
2017-04-25 11:24:33+0800 700 [1109]: s1 all pids clear
2017-04-25 11:24:33+0800 700 [1109]: s1 set thread_stop
2017-04-25 11:24:33+0800 700 [1109]: s1 wdmd_test_live 0 0 to disable
2017-04-25 11:24:37+0800 703 [3078]: s1 delta_renew read rv -2 offset 0 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
2017-04-25 11:24:37+0800 703 [3078]: s1 renewal error -2 delta_length 10 last_success 580
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:37+0800 703 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:47+0800 713 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:24:57+0800 723 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:07+0800 733 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:17+0800 743 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:27+0800 753 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:37+0800 763 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:47+0800 773 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:25:57+0800 783 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:07+0800 793 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:17+0800 803 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:27+0800 813 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 0 0x7f49740008c0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 1 0x7f4974000910 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 2 0x7f4974000960 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio 3 0x7f49740009b0 busy
2017-04-25 11:26:37+0800 823 [3078]: __LIBVIR close_task_aio destroy 4 incomplete ops
$ cat /var/log/libvirt/qemu/test_lockfailure1.log
2017-04-25 04:36:33.021+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-03-02:09:45, c1bm.rdu2.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-126.el7_3.6), hostname: test1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name test_lockfailure1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 37266b5e-d0fc-4287-bf2d-fcaa75aca64c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-4-test_lockfailure1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd3,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
qemu: terminating on signal 15 from pid 1878
2017-04-25 04:38:17.745+0000: shutting down
$ cat /var/log/libvirt/qemu/test_lockfailure2.log
2017-04-25 04:36:36.153+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.5 (CentOS BuildSystem <http://bugs.centos.org>, 2017-03-03-02:09:45, c1bm.rdu2.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-126.el7_3.6), hostname: test1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name test_lockfailure2 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2562a258-d4be-486f-959a-1fb993958811 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-test_lockfailure2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/sdd4,format=qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
$ sanlock client status -D
daemon ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
our_host_name=ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
use_watchdog=1
high_priority=1
mlock_level=1
quiet_fail=1
debug_renew=0
renewal_history_size=180
gid=179
uid=179
sh_retries=8
use_aio=1
kill_grace_seconds=40
helper_pid=1108
helper_kill_fd=4
helper_full_count=0
helper_last_status=15942
monotime=15956
version_str=3.4.0
version_num=3.4.0
version_hex=03040000
smproto_hex=00000001
p -1 helper
ci=0
fd=5
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=0
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
p -1 listener
ci=1
fd=7
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=0
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
p -1 status
ci=2
fd=9
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=5
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
$ ps -aux | grep qemu
root 13827 0.0 0.3 252016 5828 ? S 12:38 0:00 /usr/libexec/libvirt_sanlock_helper qemu:///system 2562a258-d4be-486f-959a-1fb993958811 pause
# Wait a little longer, then run it again:
$ ps -aux | grep qemu
Lease-renewal failure test for sanlock with a custom killpath
Create the lease files
- Create the lockspace (host-ID) lease:
$ mkdir -pv /mnt/nfs/sanlock
$ dd if=/dev/zero bs=1048576 count=1 of=/mnt/nfs/sanlock/idLease
$ sanlock direct init -s LS:0:/mnt/nfs/sanlock/idLease:0
init done 0
$ chown sanlock:sanlock /mnt/nfs/sanlock/idLease
- Create the resource lease:
$ dd if=/dev/zero bs=1048576 count=1 of=/mnt/nfs/sanlock/leaderLease
$ sanlock direct init -r LS:leader:/mnt/nfs/sanlock/leaderLease:0
init done 0
$ chown sanlock:sanlock /mnt/nfs/sanlock/leaderLease
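- The freshly initialized leases can be verified with direct dump before running the test (our addition, output omitted):
$ sanlock direct dump /mnt/nfs/sanlock/idLease
$ sanlock direct dump /mnt/nfs/sanlock/leaderLease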
Write the test scripts
$ vi killpath.sh
#!/bin/bash
/usr/bin/echo "[`date`] $1" >> /var/log/killpath.log
$ chmod a+x killpath.sh
$ vi killpath_test.py
#!/usr/bin/python
import sys
import time
from multiprocessing import Process
import sanlock
def serviceMain(hostId, lockspacePath, leasePath):
    sfd = sanlock.register()
    while True:
        try:
            sanlock.killpath('/opt/sanlock/killpath.sh', ['killpath test'], sfd)
        except sanlock.SanlockException:
            print "Failed to set killpath, try again in 3s."
            time.sleep(3)
            continue
        print "Set killpath ok.\n"
        try:
            if not sanlock.inq_lockspace("LS", hostId, lockspacePath):
                print "Try to acquire host id LS:%s:%s:0" % (hostId, lockspacePath)
                print time.strftime("%Y %m %Z %H:%M:%S"), "Enter Lockspace Begin"
                sanlock.add_lockspace("LS", hostId, lockspacePath)
                print time.strftime("%Y %m %Z %H:%M:%S"), "Enter Lockspace End"
            print "Try to acquire leader lease LS:leader:%s:0" % leasePath
            sanlock.acquire("LS", "leader", [(leasePath, 0)], sfd)
        except sanlock.SanlockException:
            print "Failed to acquire leases, try again in 10s."
            time.sleep(10)
        else:
            break
    while True:
        print time.strftime("%Y %m %Z %H:%M:%S"), "Service is running"
        time.sleep(5)

if __name__ == "__main__":
    try:
        hostId = int(sys.argv[1])
        lockspacePath = sys.argv[2]
        leasePath = sys.argv[3]
    except Exception:
        sys.stderr.write(
            "Usage: %s host_id lockspace_path lease_path\n" % sys.argv[0])
        exit(1)
    while True:
        p = Process(target=serviceMain,
                    args=(hostId, lockspacePath, leasePath))
        p.start()
        p.join()
        time.sleep(15)
Run the test script
$ ./killpath_test.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Set killpath ok.
Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 10:41:43 Enter Lockspace Begin
2017 04 CST 10:42:04 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 10:42:04 Service is running
2017 04 CST 10:42:09 Service is running
2017 04 CST 10:42:14 Service is running
2017 04 CST 10:42:19 Service is running
2017 04 CST 10:42:24 Service is running
2017 04 CST 10:42:29 Service is running
2017 04 CST 10:42:34 Service is running
2017 04 CST 10:42:39 Service is running
2017 04 CST 10:42:44 Service is running
2017 04 CST 10:42:49 Service is running
2017 04 CST 10:42:54 Service is running
$ ps -aux | grep killpath_test
root 95519 0.0 0.3 165956 5960 pts/1 S+ 10:41 0:00 /usr/bin/python ./killpath_test.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Stop the NFS service, then check the sanlock status and log
$ sanlock client status -D
daemon ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
our_host_name=ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
use_watchdog=1
high_priority=1
mlock_level=1
quiet_fail=1
debug_renew=0
renewal_history_size=180
gid=179
uid=179
sh_retries=8
use_aio=1
kill_grace_seconds=40
helper_pid=1108
helper_kill_fd=4
helper_full_count=0
helper_last_status=36506
monotime=36548
version_str=3.4.0
version_num=3.4.0
version_hex=03040000
smproto_hex=00000001
p -1 helper
ci=0
fd=5
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=0
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
p -1 listener
ci=1
fd=7
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=0
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
p 95519
ci=2
fd=9
pid=95519
flags=2
restricted=0
cmd_active=0
cmd_last=7
pid_dead=0
kill_count=28
kill_last=36548
suspend=0
need_free=0
p -1 status
ci=3
fd=10
pid=-1
flags=0
restricted=0
cmd_active=0
cmd_last=5
pid_dead=0
kill_count=0
kill_last=0
suspend=0
need_free=0
s LS:1:/mnt/nfs/sanlock/idLease:0
list=spaces
space_id=4
io_timeout=10
host_generation=3
renew_fail=1
space_dead=1
killing_pids=1
used_retries=0
external_used=0
used_by_orphans=0
renewal_read_extend_sec=10
corrupt_result=0
acquire_last_result=1
renewal_last_result=-202
acquire_last_attempt=36379
acquire_last_success=36379
renewal_last_attempt=36513
renewal_last_success=36441
r LS:leader:/mnt/nfs/sanlock/leaderLease:0:2 p 95519
list=held
flags=0
lver=2
token_id=63
$ sanlock client log_dump
...
2017-04-26 10:42:04+0800 36400 [1110]: r63 paxos_acquire owner 1 2 35560 was old local new is 3
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 phase1 mbal 2001
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 phase2 bal 2001 inp 1 3 36400 q_max -1
2017-04-26 10:42:04+0800 36400 [1110]: r63 ballot 2 commit self owner 1 3 36400
2017-04-26 10:42:04+0800 36400 [1110]: r63 acquire_disk rv 1 lver 2 at 36400
2017-04-26 10:42:04+0800 36400 [1110]: cmd_acquire 2,9,95519 result 0 pid_dead 0
2017-04-26 10:42:05+0800 36401 [1107]: s4 host 1 3 36379 ce8f2af7-36f4-47bb-9d41-6c001de66725.Test1
2017-04-26 10:43:15+0800 36471 [95520]: LS aio timeout RD 0x7efce80008c0:0x7efce80008d0:0x7efcef2f6000 ioto 10 to_count 1
2017-04-26 10:43:15+0800 36471 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:15+0800 36471 [95520]: s4 renewal error -202 delta_length 10 last_success 36441
2017-04-26 10:43:16+0800 36472 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:26+0800 36482 [95520]: s4 delta_renew reap -202
2017-04-26 10:43:36+0800 36492 [95520]: LS aio timeout RD 0x7efce8000910:0x7efce8000920:0x7efcef1f4000 ioto 10 to_count 2
2017-04-26 10:43:36+0800 36492 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:36+0800 36492 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:43:36+0800 36492 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:45+0800 36501 [1107]: s4 check_our_lease warning 60 last_success 36441
2017-04-26 10:43:46+0800 36502 [1107]: s4 check_our_lease warning 61 last_success 36441
2017-04-26 10:43:46+0800 36502 [95520]: s4 delta_renew reap -202
2017-04-26 10:43:47+0800 36503 [1107]: s4 check_our_lease warning 62 last_success 36441
2017-04-26 10:43:48+0800 36504 [1107]: s4 check_our_lease warning 63 last_success 36441
2017-04-26 10:43:49+0800 36505 [1107]: s4 check_our_lease warning 64 last_success 36441
2017-04-26 10:43:50+0800 36506 [1107]: s4 check_our_lease warning 65 last_success 36441
2017-04-26 10:43:51+0800 36507 [1107]: s4 check_our_lease warning 66 last_success 36441
2017-04-26 10:43:52+0800 36508 [1107]: s4 check_our_lease warning 67 last_success 36441
2017-04-26 10:43:53+0800 36509 [1107]: s4 check_our_lease warning 68 last_success 36441
2017-04-26 10:43:54+0800 36510 [1107]: s4 check_our_lease warning 69 last_success 36441
2017-04-26 10:43:55+0800 36511 [1107]: s4 check_our_lease warning 70 last_success 36441
2017-04-26 10:43:56+0800 36512 [1107]: s4 check_our_lease warning 71 last_success 36441
2017-04-26 10:43:56+0800 36512 [95520]: LS aio timeout RD 0x7efce8000960:0x7efce8000970:0x7efcef0f2000 ioto 10 to_count 3
2017-04-26 10:43:56+0800 36512 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:43:56+0800 36512 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:43:57+0800 36513 [95520]: s4 delta_renew begin reap
2017-04-26 10:43:57+0800 36513 [1107]: s4 check_our_lease warning 72 last_success 36441
2017-04-26 10:43:58+0800 36514 [1107]: s4 check_our_lease warning 73 last_success 36441
2017-04-26 10:43:59+0800 36515 [1107]: s4 check_our_lease warning 74 last_success 36441
2017-04-26 10:44:00+0800 36516 [1107]: s4 check_our_lease warning 75 last_success 36441
2017-04-26 10:44:01+0800 36517 [1107]: s4 check_our_lease warning 76 last_success 36441
2017-04-26 10:44:02+0800 36518 [1107]: s4 check_our_lease warning 77 last_success 36441
2017-04-26 10:44:03+0800 36519 [1107]: s4 check_our_lease warning 78 last_success 36441
2017-04-26 10:44:04+0800 36520 [1107]: s4 check_our_lease warning 79 last_success 36441
2017-04-26 10:44:05+0800 36521 [1107]: s4 check_our_lease failed 80
2017-04-26 10:44:05+0800 36521 [1107]: s4 set killing_pids check -1 remove 0
2017-04-26 10:44:05+0800 36521 [1107]: s4:r63 client_using_space pid 95519
2017-04-26 10:44:05+0800 36521 [1107]: s4 kill 95519 sig 100 count 1
2017-04-26 10:44:07+0800 36523 [95520]: s4 delta_renew reap -202
2017-04-26 10:44:17+0800 36533 [95520]: LS aio timeout RD 0x7efce80009b0:0x7efce80009c0:0x7efceeff0000 ioto 10 to_count 4
2017-04-26 10:44:17+0800 36533 [95520]: s4 delta_renew read timeout 10 sec offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:17+0800 36533 [95520]: s4 renewal error -202 delta_length 20 last_success 36441
2017-04-26 10:44:17+0800 36533 [95520]: s4 delta_renew begin reap
2017-04-26 10:44:27+0800 36543 [95520]: s4 delta_renew reap -202
2017-04-26 10:44:37+0800 36553 [95520]: s4 delta_renew read rv -2 offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:37+0800 36553 [95520]: s4 renewal error -2 delta_length 20 last_success 36441
2017-04-26 10:44:45+0800 36561 [1107]: s4 kill 95519 sig 9 count 41
2017-04-26 10:44:45+0800 36561 [1107]: client_pid_dead 2,9,95519 cmd_active 0 suspend 0
2017-04-26 10:44:45+0800 36561 [1107]: dead 95519 ci 2 count 41
2017-04-26 10:44:45+0800 36561 [1107]: s4 all pids clear
2017-04-26 10:44:45+0800 36561 [1107]: s4 set thread_stop
2017-04-26 10:44:45+0800 36561 [1107]: s4 wdmd_test_live 0 0 to disable
2017-04-26 10:44:48+0800 36564 [95520]: s4 delta_renew read rv -2 offset 0 /mnt/nfs/sanlock/idLease
2017-04-26 10:44:48+0800 36564 [95520]: s4 renewal error -2 delta_length 10 last_success 36441
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 0 0x7efce80008c0 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 1 0x7efce8000910 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 2 0x7efce8000960 busy
2017-04-26 10:44:48+0800 36564 [95520]: LS close_task_aio 3 0x7efce80009b0 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 0 0x7efce80008c0 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 1 0x7efce8000910 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 2 0x7efce8000960 busy
2017-04-26 10:44:58+0800 36574 [95520]: LS close_task_aio 3 0x7efce80009b0 busy
$ cat /var/log/killpath.log
[2017年 04月 26日 星期三 11:07:10 CST] killpath test
This shows that as long as the process that acquired the lease does not exit, sanlock does not release the client data structures associated with that process: it first performs the graceful kill (logged above as "kill ... sig 100", which triggered the registered kill path) and only escalates to SIGKILL ("kill ... sig 9") when the process is still alive.
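For reference, the kill path exercised above is registered by the lease-holding client itself through the python-sanlock binding. The fragment below is only a hedged sketch of that registration step, assuming the binding's killpath(path, args, slkfd) call; the helper path and argument are placeholders, not the exact script that produced /var/log/killpath.log.

import sanlock

# Register this process with the sanlock daemon; leases acquired with
# slkfd=fd are tied to this process and are reclaimed when it dies.
fd = sanlock.register()

# Ask sanlock to run a helper program instead of signalling the process
# directly when the lease can no longer be renewed (the graceful kill
# logged above as "kill ... sig 100").  If the process still does not
# exit, sanlock escalates to SIGKILL ("kill ... sig 9").
# The helper path and its argument below are hypothetical.
sanlock.killpath("/usr/local/bin/killpath.sh", ["graceful"], slkfd=fd)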
Renewal failure testing using the watchdog
Sanlock service failure test with the watchdog disabled
The default configuration already disables sanlock's watchdog support, but when the sanlock service starts it still loads the softdog driver and starts the wdmd service by default.
- Stop the wdmd service and unload the softdog driver.
$ lsmod | grep softdog
softdog 13319 0
$ fuser /dev/watchdog0
/dev/watchdog0: 34034
$ ps -aux | grep wdmd
root 34034 0.0 0.1 13804 3440 ? SLs 11:22 0:00 /usr/sbin/wdmd
$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2017-04-24 11:22:08 CST; 1min 10s ago
Process: 34032 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
Process: 34030 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
Main PID: 34034 (wdmd)
CGroup: /system.slice/wdmd.service
└─34034 /usr/sbin/wdmd
4月 24 11:22:08 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:22:08 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:22:08 Test1 wdmd[34034]: wdmd started S0 H1 G179
4月 24 11:22:08 Test1 wdmd[34034]: /dev/watchdog0 armed with fire_timeout 60
$ systemctl stop wdmd
$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
Active: inactive (dead) since 一 2017-04-24 11:23:36 CST; 39s ago
Process: 34032 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
Process: 34030 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
Main PID: 34034 (code=exited, status=0/SUCCESS)
4月 24 11:22:08 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:22:08 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:22:08 Test1 wdmd[34034]: wdmd started S0 H1 G179
4月 24 11:22:08 Test1 wdmd[34034]: /dev/watchdog0 armed with fire_timeout 60
4月 24 11:23:36 Test1 systemd[1]: Stopping Watchdog Multiplexing Daemon...
4月 24 11:23:36 Test1 systemd[1]: Stopped Watchdog Multiplexing Daemon.
$ rmmod softdog
$ lsmod | grep softdog
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2017-04-24 11:05:05 CST; 20min ago
Process: 31606 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 31608 (sanlock)
CGroup: /system.slice/sanlock.service
├─31608 /usr/sbin/sanlock daemon
└─31609 /usr/sbin/sanlock daemon
4月 24 11:05:05 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:05:05 Test1 systemd[1]: Started Shared Storage Lease Manager.
- Start the simpleHA.py test script on the Test1 or Test2 node (an illustrative sketch of such a script follows the sanlock log excerpt below).
$ ./simpleHA.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 11:10:32 Enter Lockspace Begin
2017 04 CST 11:10:53 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 11:10:53 Service is running
2017 04 CST 11:11:03 Service is running
2017 04 CST 11:11:13 Service is running
2017 04 CST 11:11:23 Service is running
2017 04 CST 11:11:33 Service is running
$ sanlock direct dump /mnt/nfs/sanlock/idLease
offset lockspace resource timestamp own gen lver
00000000 LS 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1 0000009313 0001 0001
$ sanlock direct dump /mnt/nfs/sanlock/leaderLease
offset lockspace resource timestamp own gen lver
00000000 LS leader 0000000000 0001 0001 14
$ sanlock client log_dump
2017-04-24 11:05:05+0800 7939 [31608]: sanlock daemon started 3.4.0 host 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:10:32+0800 8266 [31608]: cmd_register ci 2 fd 9 pid 32438
2017-04-24 11:10:32+0800 8266 [31617]: cmd_add_lockspace 3,10 LS:1:/mnt/nfs/sanlock/idLease:0 flags 0 timeout 0
2017-04-24 11:10:32+0800 8266 [31617]: s1 lockspace LS:1:/mnt/nfs/sanlock/idLease:0
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire begin LS:1
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire write 1 1 8266 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:10:32+0800 8266 [32439]: s1 delta_acquire delta_short_delay 20
2017-04-24 11:10:52+0800 8286 [32439]: s1 delta_acquire done 1 1 8266
2017-04-24 11:10:53+0800 8287 [31617]: s1 add_lockspace done
2017-04-24 11:10:53+0800 8287 [31617]: cmd_add_lockspace 3,10 done 0
2017-04-24 11:10:53+0800 8287 [31616]: cmd_acquire 2,9,32438 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:10:53+0800 8287 [31616]: s1:r1 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,32438
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire begin 2 0 0
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:10:53+0800 8287 [31616]: r1 paxos_acquire leader 0 free
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 phase1 mbal 1
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 phase2 bal 1 inp 1 1 8287 q_max -1
2017-04-24 11:10:53+0800 8287 [31616]: r1 ballot 1 commit self owner 1 1 8287
2017-04-24 11:10:53+0800 8287 [31616]: r1 acquire_disk rv 1 lver 1 at 8287
2017-04-24 11:10:53+0800 8287 [31616]: cmd_acquire 2,9,32438 result 0 pid_dead 0
2017-04-24 11:10:54+0800 8287 [31608]: s1 host 1 1 8266 583c5507-34a7-412f-ad9d-daae650a7a8f.Test1
2017-04-24 11:11:53+0800 8347 [31608]: client_pid_dead 2,9,32438 cmd_active 0 suspend 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 release async r_flags 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:11:53+0800 8347 [31618]: r1 paxos_release leader 1 owner 1 1 8287
2017-04-24 11:11:53+0800 8347 [31618]: r1 release async done r_flags 0
2017-04-24 11:12:08+0800 8362 [31608]: cmd_register ci 2 fd 9 pid 32710
2017-04-24 11:12:08+0800 8362 [31616]: cmd_acquire 2,9,32710 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:12:08+0800 8362 [31616]: s1:r2 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,32710
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire begin 2 0 0
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire leader 1 owner 1 1 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:12:08+0800 8362 [31616]: r2 paxos_acquire leader 1 free
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 phase1 mbal 1
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 phase2 bal 1 inp 1 1 8362 q_max -1
2017-04-24 11:12:08+0800 8362 [31616]: r2 ballot 2 commit self owner 1 1 8362
2017-04-24 11:12:08+0800 8362 [31616]: r2 acquire_disk rv 1 lver 2 at 8362
...
2017-04-24 11:26:55+0800 9248 [31608]: client_pid_dead 2,9,34711 cmd_active 0 suspend 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 release async r_flags 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:26:55+0800 9248 [31618]: r13 paxos_release leader 13 owner 1 1 9188
2017-04-24 11:26:55+0800 9248 [31618]: r13 release async done r_flags 0
2017-04-24 11:27:10+0800 9263 [31608]: cmd_register ci 2 fd 9 pid 34871
2017-04-24 11:27:10+0800 9263 [31616]: cmd_acquire 2,9,34871 ci_in 2 fd 9 count 1 flags 0
2017-04-24 11:27:10+0800 9263 [31616]: s1:r14 resource LS:leader:/mnt/nfs/sanlock/leaderLease:0 for 2,9,34871
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire begin 2 0 0
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire leader 13 owner 1 1 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2017-04-24 11:27:10+0800 9263 [31616]: r14 paxos_acquire leader 13 free
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 phase1 mbal 1
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 phase2 bal 1 inp 1 1 9263 q_max -1
2017-04-24 11:27:10+0800 9263 [31616]: r14 ballot 14 commit self owner 1 1 9263
2017-04-24 11:27:10+0800 9263 [31616]: r14 acquire_disk rv 1 lver 14 at 9263
2017-04-24 11:27:10+0800 9263 [31616]: cmd_acquire 2,9,34871 result 0 pid_dead 0
2017-04-24 11:28:10+0800 9323 [31608]: client_pid_dead 2,9,34871 cmd_active 0 suspend 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 release async r_flags 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 write_host_block host_id 1 flags 0 gen 0
2017-04-24 11:28:10+0800 9323 [31618]: r14 paxos_release leader 14 owner 1 1 9263
2017-04-24 11:28:10+0800 9323 [31618]: r14 release async done r_flags 0
2017-04-24 11:28:25+0800 9338 [31608]: cmd_register ci 2 fd 9 pid 35092
2017-04-24 11:28:25+0800 9338 [31616]: cmd_acquire 2,9,35092 ci_in 2 fd 9 count 1 flags 0
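For context, the fragment below is a rough, illustrative reconstruction of what a test script like simpleHA.py might look like, based only on the output and tracebacks shown in this section; the argument handling, timestamp format, worker lifetime and restart interval are assumptions inferred from the log above, not the real source.

import sys
import time
import multiprocessing

import sanlock

LOCKSPACE = "LS"
RESOURCE = "leader"

def serviceMain(leader_lease):
    # Each worker registers its own connection and acquires the leader
    # lease for its own pid (cmd_register / cmd_acquire in the log above).
    sfd = sanlock.register()
    sanlock.acquire(LOCKSPACE, RESOURCE, [(leader_lease, 0)], slkfd=sfd)
    for _ in range(6):                      # ~60 s, matching client_pid_dead above
        print("%s Service is running" % time.strftime("%Y %m %Z %H:%M:%S"))
        time.sleep(10)
    # The lease is released automatically when this process exits.

def main(host_id, id_lease, leader_lease):
    print("Try to acquire host id %s:%d:%s:0" % (LOCKSPACE, host_id, id_lease))
    print("%s Enter Lockspace Begin" % time.strftime("%Y %m %Z %H:%M:%S"))
    sanlock.add_lockspace(LOCKSPACE, host_id, id_lease)
    print("%s Enter Lockspace End" % time.strftime("%Y %m %Z %H:%M:%S"))
    print("Try to acquire leader lease %s:%s:%s:0" % (LOCKSPACE, RESOURCE, leader_lease))
    while True:
        worker = multiprocessing.Process(target=serviceMain, args=(leader_lease,))
        worker.start()
        worker.join()
        time.sleep(15)                      # pause before re-acquiring the leader lease

if __name__ == "__main__":
    main(int(sys.argv[1]), sys.argv[2], sys.argv[3])

A restart loop of this shape would explain why the log dump shows the lease version (lver) climbing from 1 to 14 over about fifteen minutes: each worker holds the leader lease for roughly a minute, releases it on exit, and the next worker re-acquires it shortly afterwards.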
- Terminate the sanlock service with signal 9 (SIGKILL)
$ killall -9 sanlock
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since 一 2017-04-24 11:29:13 CST; 10s ago
Process: 31606 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 31608 (code=killed, signal=KILL)
4月 24 11:05:05 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:05:05 Test1 systemd[1]: Started Shared Storage Lease Manager.
4月 24 11:29:13 Test1 systemd[1]: sanlock.service: main process exited, code=killed, status=9/KILL
4月 24 11:29:13 Test1 systemd[1]: Unit sanlock.service entered failed state.
4月 24 11:29:13 Test1 systemd[1]: sanlock.service failed.
$ sanlock client log_dump
Output of the simpleHA.py script:
Process Process-16:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "./simpleHA.py", line 10, in serviceMain
sfd = sanlock.register()
SanlockException: (111, 'Sanlock registration failed', 'Connection refused')
Process Process-17:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
After waiting for a while (about 3 minutes), the system did not reboot and showed no other anomalies.
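The "Connection refused" (errno 111) in the traceback above is simply what a client sees while the sanlock daemon is down. A client that is expected to survive a sanlock restart could wrap registration in a retry loop along these lines (a minimal sketch, not part of simpleHA.py; the retry interval is arbitrary):

import errno
import time

import sanlock

def register_with_retry(interval=10):
    # Keep retrying until the sanlock daemon accepts the connection again;
    # errno 111 (ECONNREFUSED) is the error shown in the traceback above.
    while True:
        try:
            return sanlock.register()
        except sanlock.SanlockException as e:
            if e.errno != errno.ECONNREFUSED:
                raise
            time.sleep(interval)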
Sanlock renewal failure test with the watchdog enabled
- Enable sanlock's watchdog support.
$ modprobe softdog
$ dmesg -c
[ 9711.194159] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
$ vi /etc/sanlock/sanlock.conf
use_watchdog = 1
$ systemctl enable wdmd && systemctl start wdmd && systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2017-04-24 11:35:16 CST; 6ms ago
Process: 36108 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
Process: 36105 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
Main PID: 36110 (wdmd)
CGroup: /system.slice/wdmd.service
└─36110 /usr/sbin/wdmd
4月 24 11:35:16 Test1 systemd[1]: Starting Watchdog Multiplexing Daemon...
4月 24 11:35:16 Test1 wdmd[36110]: wdmd started S0 H1 G179
4月 24 11:35:16 Test1 systemd[1]: Started Watchdog Multiplexing Daemon.
4月 24 11:35:16 Test1 wdmd[36110]: /dev/watchdog0 armed with fire_timeout 60
$ systemctl restart sanlock && systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2017-04-24 11:35:22 CST; 10ms ago
Process: 36133 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 36134 (sanlock)
CGroup: /system.slice/sanlock.service
├─36134 /usr/sbin/sanlock daemon
└─36135 /usr/sbin/sanlock daemon
4月 24 11:35:22 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:35:22 Test1 systemd[1]: Started Shared Storage Lease Manager.
$ fuser /dev/watchdog0
/dev/watchdog0: 36110
$ ps -aux | grep wdmd
root 36110 0.0 0.1 13804 3440 ? SLs 11:35 0:00 /usr/sbin/wdmd
$ ll /proc/36110/fd
总用量 0
lrwx------ 1 root root 64 4月 24 11:35 0 -> /dev/null
lrwx------ 1 root root 64 4月 24 11:35 1 -> /dev/null
lrwx------ 1 root root 64 4月 24 11:35 2 -> /dev/null
lrwx------ 1 root root 64 4月 24 11:35 3 -> socket:[228511]
l-wx------ 1 root root 64 4月 24 11:35 4 -> /run/wdmd/wdmd.pid
lrwx------ 1 root root 64 4月 24 11:35 5 -> /dev/shm/wdmd
lrwx------ 1 root root 64 4月 24 11:35 6 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 4月 24 11:35 7 -> socket:[228514]
l-wx------ 1 root root 64 4月 24 11:35 8 -> /dev/watchdog0
$ ps -aux | grep sanlock
sanlock 36134 0.0 0.9 259592 17908 ? SLsl 11:35 0:00 /usr/sbin/sanlock daemon
root 36135 0.0 0.0 27680 208 ? S 11:35 0:00 /usr/sbin/sanlock daemon
$ ll /proc/36134/fd
总用量 0
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 0 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 1 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 2 -> /dev/null
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 3 -> /var/log/sanlock.log
l-wx------ 1 sanlock sanlock 64 4月 24 11:35 4 -> pipe:[228572]
lr-x------ 1 sanlock sanlock 64 4月 24 11:35 5 -> pipe:[228573]
l-wx------ 1 sanlock sanlock 64 4月 24 11:35 6 -> /run/sanlock/sanlock.pid
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 7 -> socket:[228581]
lrwx------ 1 sanlock sanlock 64 4月 24 11:35 8 -> anon_inode:[eventfd]
$ ll /proc/36135/fd
总用量 0
lrwx------ 1 root root 64 4月 24 11:35 0 -> /dev/null
lrwx------ 1 root root 64 4月 24 11:35 1 -> /dev/null
lrwx------ 1 root root 64 4月 24 11:35 2 -> /dev/null
lr-x------ 1 root root 64 4月 24 11:35 3 -> pipe:[228572]
l-wx------ 1 root root 64 4月 24 11:35 6 -> pipe:[228573]
- Start the simpleHA.py test script on the Test1 or Test2 node.
$ ./simpleHA.py 1 /mnt/nfs/sanlock/idLease /mnt/nfs/sanlock/leaderLease
Try to acquire host id LS:1:/mnt/nfs/sanlock/idLease:0
2017 04 CST 11:46:59 Enter Lockspace Begin
2017 04 CST 11:49:40 Enter Lockspace End
Try to acquire leader lease LS:leader:/mnt/nfs/sanlock/leaderLease:0
2017 04 CST 11:49:40 Service is running
2017 04 CST 11:49:50 Service is running
2017 04 CST 11:50:00 Service is running
2017 04 CST 11:50:10 Service is running
2017 04 CST 11:50:20 Service is running
- Terminate the sanlock service with signal 9 (SIGKILL).
$ killall -9 sanlock
$ systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
Loaded: loaded (/usr/lib/systemd/system/sanlock.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since 一 2017-04-24 11:53:07 CST; 14s ago
Process: 36133 ExecStart=/usr/sbin/sanlock daemon (code=exited, status=0/SUCCESS)
Main PID: 36134 (code=killed, signal=KILL)
4月 24 11:35:22 Test1 systemd[1]: Starting Shared Storage Lease Manager...
4月 24 11:35:22 Test1 systemd[1]: Started Shared Storage Lease Manager.
4月 24 11:53:07 Test1 systemd[1]: sanlock.service: main process exited, code=killed, status=9/KILL
4月 24 11:53:07 Test1 systemd[1]: Unit sanlock.service entered failed state.
4月 24 11:53:07 Test1 systemd[1]: sanlock.service failed.
$ systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
Loaded: loaded (/usr/lib/systemd/system/wdmd.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2017-04-24 11:35:16 CST; 19min ago
Process: 36108 ExecStart=/usr/sbin/wdmd (code=exited, status=0/SUCCESS)
Process: 36105 ExecStartPre=/lib/systemd/systemd-wdmd watchdog-check (code=exited, status=0/SUCCESS)
Main PID: 36110 (wdmd)
CGroup: /system.slice/wdmd.service
└─36110 /usr/sbin/wdmd
4月 24 11:54:26 Test1 wdmd[36110]: test failed rem 51 now 10899 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:27 Test1 wdmd[36110]: test failed rem 50 now 10900 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:28 Test1 wdmd[36110]: test failed rem 49 now 10901 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:29 Test1 wdmd[36110]: test failed rem 48 now 10902 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:30 Test1 wdmd[36110]: test failed rem 47 now 10903 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:31 Test1 wdmd[36110]: test failed rem 46 now 10904 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:32 Test1 wdmd[36110]: test failed rem 45 now 10905 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:33 Test1 wdmd[36110]: test failed rem 44 now 10906 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:34 Test1 wdmd[36110]: test failed rem 43 now 10907 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
4月 24 11:54:35 Test1 wdmd[36110]: test failed rem 42 now 10908 ping 10880 close 10890 renewal 10818 expire 10898 client 36134 sanlock_LS:1
Output of the simpleHA.py script:
Process Process-4:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "./simpleHA.py", line 10, in serviceMain
sfd = sanlock.register()
SanlockException: (111, 'Sanlock registration failed', 'Connection refused')
Process Process-5:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
After waiting for a while, the system rebooted.
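The reboot timing matches the values in the wdmd log above: with sanlock's io_timeout of 10 s, the host-id lease expires 8 × io_timeout = 80 s after the last successful renewal, and the watchdog device fires once wdmd's "rem" countdown reaches zero. A small sanity check against the logged numbers (a sketch; the 8 × io_timeout relation is taken from sanlock's documented timeout scheme and is assumed to apply to this setup):

# Values copied from the wdmd log lines above.
io_timeout = 10          # sanlock I/O timeout in this setup (assumed default)
last_renewal = 10818     # "renewal 10818" (monotonic seconds)
lease_expire = 10898     # "expire 10898"

# Host-id lease expiry: 8 * io_timeout after the last successful renewal.
assert lease_expire == last_renewal + 8 * io_timeout

# wdmd counts "rem" down while its test for the lockspace keeps failing and
# lets the watchdog device fire when it reaches zero.  From the line
# "test failed rem 51 now 10899 ...":
now, rem = 10899, 51
print("watchdog fires at about monotonic time %d" % (now + rem))   # ~10950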
Summary
- If sanlock has watchdog support enabled and, after a further grace period, the processes holding leases still cannot be terminated, it stops feeding the watchdog, and the watchdog reboots the system.