Troubleshooting an abnormal (down) OSD in Rook Ceph
The problem first surfaced as follows:
[root@rook-ceph-tools-78cdfd976c-dhrlx /]# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
 -1       15.00000 root default
-11        3.00000     host master1
  4   hdd  1.00000         osd.4         up  1.00000 1.00000
  9   hdd  1.00000         osd.9       down        0 1.00000
 14   hdd  1.00000         osd.14        up  1.00000 1.00000
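Since this is a Rook-managed cluster, each OSD runs as a Kubernetes deployment, so the first step is to look at the pod backing osd.9. A minimal sketch, assuming the default rook-ceph namespace and Rook's usual labels and deployment names:

# List the OSD pods and find the one for osd.9
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
# Inspect its recent logs; a crash-looping OSD usually repeats the fatal error here
kubectl -n rook-ceph logs deploy/rook-ceph-osd-9 --tail=100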
Checking the Ceph cluster status then showed: 37 daemons have recently crashed
[root@rook-ceph-tools-78cdfd976c-dhrlx osd]# ceph -s
  cluster:
    id:     f65c0ebc-0ace-4181-8061-abc2d1d581e9
    health: HEALTH_WARN
            37 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,c,g (age 9m)
    mgr: a(active, since 13d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 14 up (since 10m), 14 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 20.64k objects, 72 GiB
    usage:   216 GiB used, 14 TiB / 14 TiB avail
    pgs:     97 active+clean

  io:
    client: 8.8 KiB/s rd, 1.2 MiB/s wr, 2 op/s rd, 49 op/s wr
This warning most likely reflects historical failures rather than a live problem, so list the recent crash reports:
ceph crash ls-new
2022-05-13T01:46:58.600474Z_11da8241-7462-49b5-8ab6-83e96d0dd1d9
Inspect the crash report:
ceph crash info 2022-05-13T01:46:58.600474Z_11da8241-7462-49b5-8ab6-83e96d0dd1d9
-2393> 2020-05-13 10:24:55.180 7f5d5677aa80 -1 Falling back to public interface
-1754> 2020-05-13 10:25:07.419 7f5d5677aa80 -1 osd.2 875 log_to_monitors {default=true}
-1425> 2020-05-13 10:25:07.803 7f5d48d7c700 -1 osd.2 875 set_numa_affinity unable to identify public interface 'eth0' numa node: (2) No such file or directory
-2> 2020-05-13 10:25:23.731 7f5d4436d700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 717694145, got 2263389519 in db/001499.sst offset 43727772 size 3899 code = 2 Rocksdb transaction:
-1> 2020-05-13 10:25:23.735 7f5d4436d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f5d4436d700 time 2020-05-13 10:25:23.733456
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x56297aa20f7d]
2: (()+0x4cb145) [0x56297aa21145]
3: (BlueStore::_kv_sync_thread()+0x11c3) [0x56297af95233]
4: (BlueStore::KVSyncThread::entry()+0xd) [0x56297afba3fd]
5: (()+0x7e65) [0x7f5d537bfe65]
6: (clone()+0x6d) [0x7f5d5268388d]
0> 2020-05-13 10:25:23.735 7f5d4436d700 -1 *** Caught signal (Aborted) **
in thread 7f5d4436d700 thread_name:bstore_kv_sync
ceph version
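The RocksDB "block checksum mismatch" followed by the failed ceph_assert in BlueStore::_kv_sync_thread() means the OSD's BlueStore/RocksDB metadata is corrupted on disk, so the daemon aborts on every start. This kind of corruption cannot be repaired in place; the usual remedy is to discard the OSD and rebuild it from the surviving replicas. A minimal sketch of that procedure, assuming standard Rook deployment names and using /dev/sdX as a placeholder for the actual backing device:

# Stop the crash-looping OSD pod (Rook names the deployment after the OSD id)
kubectl -n rook-ceph scale deployment rook-ceph-osd-9 --replicas=0

# Mark the OSD out and wait until ceph -s shows all PGs active+clean again
ceph osd out osd.9

# Remove it from the CRUSH map, auth database and OSD map in one step
ceph osd purge 9 --yes-i-really-mean-it

# Delete the old OSD deployment so Rook does not try to restart it
kubectl -n rook-ceph delete deployment rook-ceph-osd-9

# On the host, wipe the backing disk so Rook treats it as a fresh device
# (/dev/sdX is a placeholder; double-check the device before running this)
sgdisk --zap-all /dev/sdX
dd if=/dev/zero of=/dev/sdX bs=1M count=100 oflag=direct

# Restart the operator so it re-prepares the disk and brings up a new OSD
kubectl -n rook-ceph rollout restart deploy rook-ceph-operator

Only do this while the remaining replicas are healthy; the 97 active+clean PGs in the status output above suggest the data on osd.9 can be safely thrown away.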

In summary, this article walked through an incident in a Rook Ceph cluster where OSD 9 went down because of RocksDB data corruption: inspecting the crash log, confirming the root cause, reworking the disk, and restoring the cluster to a healthy state.
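Once the cluster is healthy again, the "37 daemons have recently crashed" warning will persist until the crash reports are archived (or age out). A minimal sketch, with the crash ID as a placeholder:

# Review the outstanding crash reports first
ceph crash ls
# Archive a single report...
ceph crash archive <crash-id>
# ...or acknowledge all of them at once
ceph crash archive-all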