ceph: HEALTH_ERR 41 pgs are stuck inactive for more than 300 seconds

Originally published 2022-12-19 20:23:13

License: CC 4.0 BY-SA

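While deploying Ceph, the cluster reported an abnormal health state, with PGs stuck stale, inactive, and unclean. Running `ceph health detail` identifies the affected PGs, and `ceph pg force_create_pg <pgid>` force-creates them to clear the stale state. For the PG stuck inactive and unclean, the post runs `ceph pg map <pg>` and restarts the OSD service. After these steps most of the PG problems were resolved, but some PGs remained inactive and unclean.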

1. Background

While learning how to deploy Ceph today, I found the cluster in an abnormal state:

ceph health
HEALTH_ERR 21 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 21 pgs stuck stale

My guess: while testing adding and removing OSDs, I either did not clean up completely or did not use the correct removal procedure.
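
If leftover OSD entries are indeed the cause, the sketch below prints the classic manual removal sequence for a single OSD so that it can be reviewed before anything is run. The function name `osd_removal_steps` is hypothetical, and the sequence assumes an older (pre-Luminous) cluster managed with systemd; it is not taken from the original post.

```shell
# Hypothetical helper: print (do not run) the manual removal steps for one OSD.
# Usage: osd_removal_steps 3
osd_removal_steps() {
  # Accept only a plain numeric OSD ID.
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: osd_removal_steps <numeric-osd-id>" >&2
      return 1
      ;;
  esac
  cat <<EOF
ceph osd out $1
systemctl stop ceph-osd@$1
ceph osd crush remove osd.$1
ceph auth del osd.$1
ceph osd rm $1
EOF
}
```

After `ceph osd out`, it is usually worth waiting for the cluster to finish rebalancing before stopping the daemon and running the remaining steps.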

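2. Fix

The fix is to overwrite each problematic PG with ceph pg force_create_pg <pgid>. This recreates the PG as an empty one, which is acceptable here because this is a test cluster.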

# List the problematic PGs
ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 1 pgs stuck inactive; 21 pgs stuck stale; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 385.978354, current state creating, last acting [0]
pg 0.18 is stuck unclean for 385.978358, current state creating, last acting [0]
pg 0.38 is stuck stale for 5228.614958, current state stale+active+clean, last acting [1]
pg 0.2d is stuck stale for 5228.614920, current state stale+active+clean, last acting [1]
pg 0.2c is stuck stale for 5198.568749, current state stale+active+clean, last acting [2]
pg 0.2b is stuck stale for 5228.614940, current state stale+active+clean, last acting [1]
pg 0.2a is stuck stale for 5198.568753, current state stale+active+clean, last acting [2]
pg 0.1a is stuck stale for 5228.614950, current state stale+active+clean, last acting [1]
pg 0.1b is stuck stale for 5228.614951, current state stale+active+clean, last acting [1]
pg 0.d is stuck stale for 5198.568803, current state stale+active+clean, last acting [2]
pg 0.c is stuck stale for 5228.614961, current state stale+active+clean, last acting [1]
pg 0.22 is stuck stale for 5198.568796, current state stale+active+clean, last acting [2]
pg 0.1c is stuck stale for 5228.614956, current state stale+active+clean, last acting [1]
pg 0.5 is stuck stale for 5198.568804, current state stale+active+clean, last acting [2]
pg 0.3c is stuck stale for 5228.614978, current state stale+active+clean, last acting [1]
pg 0.3e is stuck stale for 5198.568821, current state stale+active+clean, last acting [2]
pg 0.34 is stuck stale for 5228.614975, current state stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.20 is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.36 is stuck stale for 5228.614981, current state stale+active+clean, last acting [1]
pg 0.1f is stuck stale for 5198.568809, current state stale+active+clean, last acting [2]
pg 0.35 is stuck stale for 5228.614983, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5228.614968, current state stale+active+clean, last acting [1]
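
Rather than copying PG IDs by hand from this output, they can be extracted by state. The following is a minimal sketch, assuming the line format shown above ("pg <id> is stuck ... current state <state>, last acting [...]"); the function name `pgs_in_state` is hypothetical.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# the unique IDs of PGs whose current state equals $1, in order of appearance.
# Usage: ceph health detail | pgs_in_state stale+active+clean
pgs_in_state() {
  awk -v want="$1" '
    $1 == "pg" && $3 == "is" && $4 == "stuck" {
      for (i = 1; i < NF; i++) {
        if ($i == "current" && $(i + 1) == "state") {
          state = $(i + 2)
          sub(/,$/, "", state)          # drop the trailing comma
          if (state == want && !seen[$2]++) print $2
        }
      }
    }'
}
```

Matching on the exact state field avoids picking up the summary line or PGs whose state merely contains the same substring.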

Overwrite the problematic PGs:

cat pg_id.sh
#!/bin/bash

# IDs of the problematic PGs, taken from `ceph health detail`
PG_ID=(
0.18
0.38
0.2d
0.2c
0.2b
0.2a
0.1a
0.1b
0.d
0.c
0.22
0.1c
0.5
0.3c
0.3e
0.34
0.1d
0.20
0.36
0.1f
0.35
0.1e
)

# Force-create each PG in turn
for id in "${PG_ID[@]}"; do
  echo "$id"
  ceph pg force_create_pg "$id"
done

# Run it (the script uses a bash array, so invoke it with bash rather than sh)
bash pg_id.sh

If there are many problematic PGs, you can run the following loop instead:

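# The grep pattern must match the state reported by `ceph health detail`;
# in the output above it is stale+active+clean
for pg in $(ceph health detail | grep "stale+active+clean" | awk '{print $2}' | sort -u)
do
  ceph pg force_create_pg "$pg"
done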
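
Because `ceph pg force_create_pg` recreates a PG as an empty one, it can help to review the generated commands before running them. The sketch below reads PG IDs from stdin, keeps only IDs of the form `<pool>.<hex>`, and prints the commands instead of executing them; the function name `print_force_create` is hypothetical.

```shell
# Hypothetical dry-run helper: read PG IDs on stdin (one per line) and print
# one force_create_pg command per unique, well-formed ID.
# Usage: print_force_create < pg_ids.txt        # review the output first
print_force_create() {
  awk '
    { gsub(/^[ \t]+|[ \t]+$/, "") }             # trim surrounding whitespace
    /^[0-9]+\.[0-9a-f]+$/ {
      if (!seen[$0]++) print "ceph pg force_create_pg " $0
      next
    }
    NF { print "skipping malformed PG ID: " $0 > "/dev/stderr" }
  '
}
```

Once the printed list looks right, it can be piped to `sh` to actually run the commands.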

Check the health information again:

ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 674.102608, current state creating, last acting [0]
pg 0.18 is stuck unclean for 674.102613, current state creating, last acting [0]
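
To confirm at a glance what is still stuck after a fix, the PGs can be tallied per state. This is a minimal sketch under the same assumption about the `ceph health detail` line format; the function name `count_pg_states` is hypothetical.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# "<state> <number of unique PGs>" for every state that appears.
# Usage: ceph health detail | count_pg_states
count_pg_states() {
  awk '
    $1 == "pg" && $4 == "stuck" {
      for (i = 1; i < NF; i++) {
        if ($i == "current" && $(i + 1) == "state") {
          state = $(i + 2)
          sub(/,$/, "", state)                    # drop the trailing comma
          # A PG listed as both inactive and unclean is counted once.
          if (!seen[$2 SUBSEP state]++) count[state]++
        }
      }
    }
    END { for (s in count) print s, count[s] }
  ' | sort
}
```

For the two lines above, both of which refer to pg 0.18, this reports a single PG in the creating state.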

It appears that overwriting only fixed the stale PGs; one PG is still stuck inactive and unclean.
Fixing the PG stuck in creating:

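for pg in $(ceph health detail | grep "creating" | awk '{print $2}' | sort -u)
do
  # ceph pg map only prints the osdmap epoch and the up/acting OSD sets for the PG
  ceph pg map "$pg"
done
# Afterwards, restart all OSD services (repeat for every OSD ID)
systemctl restart ceph-osd@0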
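
The comment above calls for restarting all OSD services, while the command shown restarts only osd.0. As a sketch, the OSD IDs on the current host can be derived from the OSD data directories, assuming the default layout `/var/lib/ceph/osd/ceph-<id>` and the default cluster name `ceph`; the function name `local_osd_ids` is hypothetical.

```shell
# Hypothetical helper: print the IDs of the OSDs whose data directories exist
# under $1 (default: /var/lib/ceph/osd), assuming the cluster name "ceph".
local_osd_ids() {
  for dir in "${1:-/var/lib/ceph/osd}"/ceph-*; do
    [ -d "$dir" ] || continue          # also skips an unmatched glob
    echo "${dir##*/ceph-}"             # strip everything up to "ceph-"
  done | sort -n
}

# Usage (run on each OSD host):
#   for id in $(local_osd_ids); do systemctl restart "ceph-osd@${id}"; done
```

Restarting the OSDs one at a time and checking `ceph health` in between keeps the impact of each restart small.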

Restart the service.