1. Background
While learning how to deploy Ceph today, I found the cluster in an abnormal state:
ceph health
HEALTH_ERR 21 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 21 pgs stuck stale

My guess: while testing adding and removing OSDs, I either did not clean up completely or did not use the correct removal procedure.
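For reference, the manual OSD removal sequence commonly described for Ceph releases of this era can be sketched as below. This is only an illustration: I did not record which steps I actually ran, and osd_removal_steps is a hypothetical helper that prints the commands rather than executing them.

```shell
#!/bin/bash
# Print the manual OSD removal steps for one OSD ID (dry run only).
# Assumption: a manual, pre-"ceph osd purge" style removal; the steps that
# were really run on this cluster are not known.
osd_removal_steps() {
    id="$1"
    case "$id" in
        ''|*[!0-9]*)
            echo "usage: osd_removal_steps <numeric-osd-id>" >&2
            return 1
            ;;
    esac
    echo "ceph osd out ${id}"                # stop mapping new data to the OSD
    echo "systemctl stop ceph-osd@${id}"     # run on the host that owns the OSD
    echo "ceph osd crush remove osd.${id}"   # drop it from the CRUSH map
    echo "ceph auth del osd.${id}"           # delete its authentication key
    echo "ceph osd rm ${id}"                 # remove it from the OSD map
}

# Usage: review the steps for OSD 2 before running any of them.
#   osd_removal_steps 2
```

Skipping the later steps can leave the OSD in the CRUSH map, which is one way to end up with PGs that still point at an OSD that no longer exists.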
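2. Solution
The fix is to run ceph pg force_create_pg <pgid> to overwrite each problematic PG. This recreates the PG as an empty one, so any data it held is lost.
# List the problematic PGs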
ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 1 pgs stuck inactive; 21 pgs stuck stale; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 385.978354, current state creating, last acting [0]
pg 0.18 is stuck unclean for 385.978358, current state creating, last acting [0]
pg 0.38 is stuck stale for 5228.614958, current state stale+active+clean, last acting [1]
pg 0.2d is stuck stale for 5228.614920, current state stale+active+clean, last acting [1]
pg 0.2c is stuck stale for 5198.568749, current state stale+active+clean, last acting [2]
pg 0.2b is stuck stale for 5228.614940, current state stale+active+clean, last acting [1]
pg 0.2a is stuck stale for 5198.568753, current state stale+active+clean, last acting [2]
pg 0.1a is stuck stale for 5228.614950, current state stale+active+clean, last acting [1]
pg 0.1b is stuck stale for 5228.614951, current state stale+active+clean, last acting [1]
pg 0.d is stuck stale for 5198.568803, current state stale+active+clean, last acting [2]
pg 0.c is stuck stale for 5228.614961, current state stale+active+clean, last acting [1]
pg 0.22 is stuck stale for 5198.568796, current state stale+active+clean, last acting [2]
pg 0.1c is stuck stale for 5228.614956, current state stale+active+clean, last acting [1]
pg 0.5 is stuck stale for 5198.568804, current state stale+active+clean, last acting [2]
pg 0.3c is stuck stale for 5228.614978, current state stale+active+clean, last acting [1]
pg 0.3e is stuck stale for 5198.568821, current state stale+active+clean, last acting [2]
pg 0.34 is stuck stale for 5228.614975, current state stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.20 is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.36 is stuck stale for 5228.614981, current state stale+active+clean, last acting [1]
pg 0.1f is stuck stale for 5198.568809, current state stale+active+clean, last acting [2]
pg 0.35 is stuck stale for 5228.614983, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5228.614968, current state stale+active+clean, last acting [1]
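With this many lines it helps to see how the stuck PGs group by state and by last acting OSD. A minimal sketch, assuming the per-PG line format shown above; summarize_stuck_pgs is a hypothetical helper name, not a Ceph command:

```shell
#!/bin/bash
# Summarize `ceph health detail` lines by PG state and last acting set.
# Reads the text on stdin so it can also be tried on saved output.
summarize_stuck_pgs() {
    awk '
        # Per-PG lines look like:
        #   pg <id> is stuck <kind> for <secs>, current state <state>, last acting [<osds>]
        $1 == "pg" && $3 == "is" && $4 == "stuck" {
            state = ""; acting = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "state")  { state = $(i + 1); sub(/,$/, "", state) }
                if ($i == "acting") { acting = $(i + 1) }
            }
            key = state " " acting
            # A PG can be listed twice (inactive and unclean); count it once.
            if (!seen[$2 " " key]++) count[key]++
        }
        END { for (k in count) print count[k], k }
    ' | LC_ALL=C sort -k2,3
}

# Usage against a live cluster:
#   ceph health detail | summarize_stuck_pgs
```

On the output above this gives three groups: 1 PG in creating on [0], 14 stale+active+clean PGs on [1], and 7 on [2], which points at OSDs 1 and 2 as the ones that went away.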
Overwrite the problematic PGs
cat pg_id.sh
#!/bin/bash
PG_ID=(
0.18
0.38
0.2d
0.2c
0.2b
0.2a
0.1a
0.1b
0.d
0.c
0.22
0.1c
0.5
0.3c
0.3e
0.34
0.1d
0.20
0.36
0.1f
0.35
0.1e
)
for id in "${PG_ID[@]}"; do
    echo "$id"
    ceph pg force_create_pg "$id"
done
# Run it (with bash, since the script uses arrays)
bash pg_id.sh
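If there are a lot of problem PGs, the following loop can be used instead. Adjust the grep pattern to the state that ceph health detail actually reports (on this cluster it would be stale+active+clean):
for pg in $(ceph health detail | grep "stale+active+undersized+degraded" | awk '{print $2}' | sort -u); do
    ceph pg force_create_pg "$pg"
done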
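Since force_create_pg is destructive, it can be worth previewing which PGs the loop will touch before running it. A minimal sketch, assuming the same ceph health detail line format as above; pgs_in_state is a hypothetical helper that matches the state field exactly instead of grepping the whole line:

```shell
#!/bin/bash
# Print the unique IDs of PGs whose current state equals the given string.
# Reads `ceph health detail` text on stdin.
pgs_in_state() {
    want="$1"
    awk -v want="$want" '
        $1 == "pg" {
            for (i = 1; i < NF; i++) {
                if ($i == "state") {
                    state = $(i + 1); sub(/,$/, "", state)
                    if (state == want) print $2
                }
            }
        }
    ' | LC_ALL=C sort -u
}

# Dry run: print the commands instead of executing them.
#   ceph health detail | pgs_in_state "stale+active+clean" |
#       while read -r pg; do echo ceph pg force_create_pg "$pg"; done
```

Dropping the echo in the usage line turns the preview into the real run.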

Check the health information again
ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 674.102608, current state creating, last acting [0]
pg 0.18 is stuck unclean for 674.102613, current state creating, last acting [0]

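It looks like overwriting the PGs resolved the stale state, but one PG is still inactive and unclean.
Fixing the PG stuck in creating
for pg in $(ceph health detail | grep "creating" | awk '{print $2}' | sort -u); do
    ceph pg map "$pg"
done
# Once that finishes, restart all OSD services (repeat for each OSD ID)
systemctl restart ceph-osd@0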

Restart the services.
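The restart step for more than one OSD can be sketched as follows. osd_restart_cmds is a hypothetical helper; it only prints systemctl commands for the IDs it is given, and it assumes those OSDs run on the local host (on a multi-node cluster each OSD has to be restarted on its own host).

```shell
#!/bin/bash
# Turn a list of OSD IDs (one per line on stdin) into restart commands.
# Prints the commands rather than running them, so the list can be reviewed.
osd_restart_cmds() {
    while read -r id; do
        # Skip blank lines and anything that is not a plain integer ID.
        case "$id" in
            ''|*[!0-9]*) continue ;;
        esac
        echo "systemctl restart ceph-osd@${id}"
    done
}

# Usage (assumption: every listed OSD is local to this host):
#   ceph osd ls | osd_restart_cmds          # review
#   ceph osd ls | osd_restart_cmds | sh     # execute
```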


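While deploying Ceph, the cluster reported an abnormal state, with PGs stuck stale, inactive, and unclean. Running `ceph health detail` identifies the affected PGs, and `ceph pg force_create_pg <pgid>` forcibly recreates them, which cleared the stale PGs. For the PG that was inactive and unclean, I ran `ceph pg map <pg>` and restarted the OSD services. After these steps most of the PG problems were resolved, but some PGs remain inactive and unclean.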



