Handling failed OSD disks in pools balanced with upmap

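Preface

If you like Ceph, follow the WeChat official account 奋斗的cepher for more articles!

In the article 《坏盘处理时osd为什么不要rm》 (why an OSD should not be removed with rm when handling a failed disk), 松鼠哥 compared several ways of handling an OSD and what each does to the data. One detail in that article: the PG mapping stayed the same before and after the OSD rebuild, but the pool in that test was balanced in crush-compat mode. A reader commented that when a pool is balanced in upmap mode, the OSD cannot keep the same PG mapping across a rebuild.

松鼠哥 almost always recommends crush-compat for balancing pools and has little hands-on experience with upmap, so this scenario had not come up before. This post fills that gap by looking at what happens to the PGs before and after rebuilding a failed OSD in an upmap-balanced pool.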

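Getting started

In the same environment, first balance the pool with upmap. A single pass did not balance it well enough, so the following was run three times in a row: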

[root@testnode1 twj]# ceph osd getmap -o osdmap
[root@testnode1 twj]# osdmaptool ./osdmap --upmap afterupmap --upmap-pool songshuge.rgw.buckets.data  --upmap-max 100 --upmap-deviation 1
[root@testnode1 twj]# source afterupmap

Once the PGs are evenly distributed, pick an OSD (osd.50 here), record its PGs, then destroy it and mark it out of the cluster:

[root@testnode1 twj]# ceph pg ls-by-osd 50 |grep '47'|awk '{print $1}' > osd.50.pg 
[root@testnode1 twj]# systemctl stop ceph-osd@50.service
[root@testnode1 twj]# ceph osd destroy osd.50 --force
[root@testnode1 twj]# ceph osd out 50

After the data has finished rebalancing, rebuild the OSD and add it back to the cluster:

[root@testnode1 twj]# ceph-volume --cluster ceph lvm prepare --bluestore --data xxx  --osd-id 50  --osd-fsid uuidgen --block.db yy --block.wal zz
[root@testnode1 twj]# ceph-volume lvm activate 50 xxx

When rebalancing has finished again, record the OSD's PGs once more and compare the two lists:

[root@testnode1 twj]# ceph pg ls-by-osd 50 |grep '47'|awk '{print $1}' > osd.50.pg.new
[root@testnode1 twj]# ls -lt
total 4096
-rw-r----- 1 root root    1423 Aug 30 09:42 osd.50.pg
-rw-r----- 1 root root    1507 Aug 30 15:41 osd.50.pg.new
[root@testnode1 twj]# diff osd.50.pg osd.50.pg.new 
73a74
> 47.509
82a84
> 47.5dc
110a113
> 47.841
112a116
> 47.885
115a120
> 47.8c4
119a125
> 47.90d
130a137
> 47.9e5
139a147
> 47.a88
152a161
> 47.b6a
171a181,182
> 47.cc2
> 47.cc8
179a191
> 47.d98
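The raw diff output can be hard to scan when the lists are long. Below is a minimal sketch that summarizes the same comparison as "lost" and "gained" PGs. The function name `pg_set_diff` is hypothetical; it only assumes two files with one PG id per line, like `osd.50.pg` and `osd.50.pg.new` above.

```sh
# Hypothetical helper: summarize how an OSD's PG set changed.
# $1 = PG list recorded before the rebuild, $2 = PG list recorded after.
pg_set_diff() {
    # PGs that were on the OSD before but are missing afterwards.
    # -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
    grep -Fxv -f "$2" "$1" | sed 's/^/- /'
    # PGs that only appear on the OSD after the rebuild.
    grep -Fxv -f "$1" "$2" | sed 's/^/+ /'
}
```

Calling `pg_set_diff osd.50.pg osd.50.pg.new` would print one `+ <pgid>` line for each PG that the rebuilt OSD picked up and one `- <pgid>` line for each PG it no longer holds.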

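The two lists clearly differ: after the rebuild, osd.50 holds 12 PGs that it did not hold before. The same procedure under crush-compat mode left the PG distribution unchanged. So what is the cause?

The official documentation describes the modes as follows: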

1. crush-compat. This mode uses the compat weight-set feature (introduced in Luminous) to manage an alternative set of weights for devices in the CRUSH hierarchy. When the balancer is operating in this mode, the normal weights should remain set to the size of the device in order to reflect the target amount of data intended to be stored on the device. The balancer will then optimize the weight-set values, adjusting them up or down in small increments, in order to achieve a distribution that matches the target distribution as closely as possible. (Because PG placement is a pseudorandom process, it is subject to a natural amount of variation; optimizing the weights serves to counteract that natural variation.)

Note that this mode is fully backward compatible with older clients: when an OSD Map and CRUSH map are shared with older clients, Ceph presents the optimized weights as the “real” weights.

The primary limitation of this mode is that the balancer cannot handle multiple CRUSH hierarchies with different placement rules if the subtrees of the hierarchy share any OSDs. (Such sharing of OSDs is not typical and, because of the difficulty of managing the space utilization on the shared OSDs, is generally not recommended.)

2. upmap. In Luminous and later releases, the OSDMap can store explicit mappings for individual OSDs as exceptions to the normal CRUSH placement calculation. These upmap entries provide fine-grained control over the PG mapping. This balancer mode optimizes the placement of individual PGs in order to achieve a balanced distribution. In most cases, the resulting distribution is nearly perfect: that is, there is an equal number of PGs on each OSD (±1 PG, since the total number might not divide evenly).

To use upmap, all clients must be Luminous or newer.

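That is a lot of text. In short: crush-compat balances mainly by adjusting weights and is friendly to older clients, while upmap is implemented through the OSDMap and places some requirements on the client version. The read and upmap-read modes only exist from the R release onward, so they are left aside here.

In terms of implementation, crush-compat writes its balancing information into the CRUSH map, at the very end of it: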

# choose_args
choose_args 18446744073709551615 {
  {
    bucket_id -1
    weight_set [
      [ 7.953 8.055 7.992 ]
    ]
  }
......

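As long as the OSD is not removed with rm while it is being handled, this information is not damaged, which is what allows the OSD to keep the same PG mapping. upmap, by contrast, writes the PG balancing information into the OSDMap, and testing shows that destroying the OSD directly changes that information:

Before the destroy, the OSDMap epoch was 263625. The PG upmap entries in that OSDMap: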
[root@testnode1 ceph]# osdmaptool --print osdmap.263625 > qian.osd.upmap
osdmaptool: osdmap file 'osdmap.263625'
[root@testnode1 ceph]# more qian.osd.upmap 
osd.311 up   in  weight 1 up_from 257374 up_thru 261950 down_at 257366 last_clean_interval [257354,257373) [v2:192.168.1.1:10025/1265298,v1:192.168.1.1:10028/1265298] [v2:192.168.1.1:10048/2265298,v1:192.168.1.1:10049/2265298] exists,up f7f47c72-4a9d-4db9-9550-c555de928792

pg_upmap_items 47.3a [40,31]
pg_upmap_items 47.50 [58,51]
pg_upmap_items 47.57 [10,1]
......
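To make the `pg_upmap_items` lines above easier to read: each entry lists from/to OSD pairs, so `47.3a [40,31]` means "wherever CRUSH places osd.40 for PG 47.3a, use osd.31 instead". The sketch below illustrates only that substitution. The function name `apply_upmap_items` and the example CRUSH result are hypothetical, and the real OSDMap code applies additional validity checks that are omitted here.

```sh
# Hypothetical illustration of how a pg_upmap_items entry overrides CRUSH.
# $1 = space-separated OSD ids that CRUSH computed for a PG (made-up input)
# $2 = space-separated from/to pairs, e.g. "40 31" for the entry [40,31]
apply_upmap_items() {
    awk -v raw="$1" -v pairs="$2" 'BEGIN {
        # Build the from -> to table out of consecutive pairs.
        n = split(pairs, p, " ")
        for (i = 1; i + 1 <= n; i += 2) remap[p[i]] = p[i + 1]
        # Walk the CRUSH result and substitute any remapped OSD.
        m = split(raw, osds, " ")
        out = ""
        for (i = 1; i <= m; i++) {
            o = (osds[i] in remap) ? remap[osds[i]] : osds[i]
            out = out (i > 1 ? " " : "") o
        }
        print out
    }'
}
```

For example, `apply_upmap_items "40 12 7" "40 31"` prints `31 12 7`. If the entry is dropped from the OSDMap, the PG falls back to the plain CRUSH result.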

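As shown, the upmap entries sit at the very end of the OSDMap. After the destroy the OSDMap epoch has of course moved on, but the upmap entries have changed along with it:

After the destroy, the latest OSDMap epoch was 264472. The PG upmap entries in that OSDMap: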
[root@testnode1 ceph]# osdmaptool --print osdmap.264472 > hou.osd.upmap
osdmaptool: osdmap file 'osdmap.264472'
[root@testnode1 ceph]# more hou.osd.upmap 
osd.311 up   in  weight 1 up_from 257374 up_thru 261950 down_at 257366 last_clean_interval [257354,257373) [v2:192.168.1.1:10025/1265298,v1:192.168.1.1:10028/1265298] [v2:192.168.1.1:10048/2265298,v1:192.168.1.1:10049/2265298] exists,up f7f47c72-4a9d-4db9-9550-c555de928792

pg_upmap_items 47.3a [40,31]
pg_upmap_items 47.50 [58,51]
pg_upmap_items 47.57 [10,1]
......

Compare the two:
[root@testnode1 ceph]# grep 'pg_upmap_items' qian.osd.upmap > qian1
[root@testnode1 ceph]# grep 'pg_upmap_items' hou.osd.upmap > hou2
[root@testnode1 ceph]# wc -l hou2 
272 hou2
[root@testnode1 ceph]# wc -l qian1 
281 qian1
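The counts show 281 entries before and 272 after, so 9 upmap entries are gone. A minimal sketch for finding out which ones, and for checking whether a given OSD appears in them, is shown below. Both function names are hypothetical, and the input is assumed to be `osdmaptool --print` dumps like `qian.osd.upmap` and `hou.osd.upmap` above. Whether the missing entries all involve osd.50 is something to verify with the output, not something this sketch assumes.

```sh
# Hypothetical helper: pg_upmap_items lines present in dump $1 but absent from dump $2.
removed_upmaps() {
    grep '^pg_upmap_items' "$1" | grep -Fxv -f "$2"
}

# Hypothetical helper: pg_upmap_items lines in dump $1 that mention OSD id $2,
# either as the "from" or the "to" side of a pair.
upmaps_touching_osd() {
    awk -v id="$2" '
        $1 == "pg_upmap_items" {
            list = $3
            gsub(/\[|\]/, "", list)        # strip the surrounding brackets
            n = split(list, osds, ",")
            for (i = 1; i <= n; i++)
                if (osds[i] == id) { print; break }
        }' "$1"
}
```

Running `removed_upmaps qian.osd.upmap hou.osd.upmap` would list the dropped entries, and `upmaps_touching_osd qian.osd.upmap 50` would list every entry in the earlier map that references osd.50, so the two outputs can be compared directly.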