Ceph Journal Migration


For a while now I have been reading benchmarks, with a particular interest in how hardware choices affect software performance. One such question is how the type of disk holding the Ceph journal affects Ceph's performance. These notes record how to change the disk the journal lives on when Ceph runs filestore; the Ceph version is Jewel. The general advice is to put the Ceph journal on an SSD or NVMe device, with the OSD-to-journal-device ratio typically following:

  • SSD: 4-5 OSDs per journal device
  • NVMe: 12-18 OSDs per journal device

First, why use a journal at all? Quoting the Ceph documentation:

Speed: The journal enables the Ceph OSD Daemon to commit small writes quickly. Ceph writes small, random i/o to the journal sequentially, which tends to speed up bursty workloads by allowing the backing filesystem more time to coalesce writes. The Ceph OSD Daemon’s journal, however, can lead to spiky performance with short spurts of high-speed writes followed by periods without any write progress as the filesystem catches up to the journal.
Consistency: Ceph OSD Daemons require a filesystem interface that guarantees atomic compound operations. Ceph OSD Daemons write a description of the operation to the journal and apply the operation to the filesystem. This enables atomic updates to an object (for example, placement group metadata). Every few seconds–between filestore max sync interval and filestore min sync interval–the Ceph OSD Daemon stops writes and synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space. On failure, Ceph OSD Daemons replay the journal starting after the last synchronization operation.
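
Both sync intervals mentioned in the quote are ordinary configuration options, as is the journal size itself. To see the values a running OSD is actually using, they can be read off the daemon's admin socket; a quick check, using osd.0 from this post as the example:

# Inspect journal-related settings of a live OSD via its admin socket
ceph daemon osd.0 config get filestore_max_sync_interval
ceph daemon osd.0 config get filestore_min_sync_interval
ceph daemon osd.0 config get osd_journal_size    # in MB; 5120 matches the 5G partition below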

In short: speed and consistency. Speed is the reason the journal is best placed on a fast device such as an SSD; consistency I have always understood as something like a database's write-ahead log, usable for recovering data after a failure.

By default in Jewel, the Ceph journal sits on the HDD: on my system, a 5G partition was carved out of each OSD's own disk to serve as the journal, for example:

$ lsblk
...
sdb      8:16   0   1.8T  0 disk
├─sdb2   8:18   0     5G  0 part
└─sdb1   8:17   0   1.8T  0 part /var/lib/ceph/osd/ceph-0
...
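
The 5G size is just ceph-disk's default (osd journal size = 5120 MB). The Ceph docs' rule of thumb is to size the journal at no less than twice the product of the expected device throughput and filestore max sync interval; a back-of-envelope sketch, where the throughput figure is an assumption for illustration rather than a measurement:

# osd journal size >= 2 * expected throughput (MB/s) * filestore max sync interval (s)
tp=200        # assumed sustained throughput of the journal device, MB/s
sync_int=5    # filestore_max_sync_interval default, in seconds
echo "$((2 * tp * sync_int)) MB"    # -> 2000 MB, comfortably inside a 5G partition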

So we need to find this partition and repoint it at an SSD or NVMe partition. First, see where the journal currently lives:

# cd /var/lib/ceph/osd/ceph-0
# ls -l journal
lrwxrwxrwx 1 root root 58 12月 26 14:04 journal -> /dev/disk/by-partuuid/389057e5-a099-43b6-952e-ad0bff2e7893
# ls -l /dev/disk/by-partuuid/389057e5-a099-43b6-952e-ad0bff2e7893
lrwxrwxrwx 1 root root 10 11月 13 16:15 /dev/disk/by-partuuid/389057e5-a099-43b6-952e-ad0bff2e7893 -> ../../sdb2
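
Chasing the symlink chain by hand gets tedious once a host has several OSDs; a small sketch that resolves every journal on the node in one pass:

# Resolve each OSD's journal symlink to the physical partition behind it
for osd in /var/lib/ceph/osd/ceph-*; do
    echo "$osd -> $(readlink -f "$osd/journal")"
done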

What remains is simply to point this symlink at an SSD/NVMe partition instead. Here sde is assumed to be an SSD:

# This node has three OSDs, so carve three partitions out of sde (the SSD);
# keeping the same 5G size makes the before/after comparison fair
# parted /dev/sde
(parted) mklabel gpt
(parted) mkpart journal-0 1 5G
(parted) mkpart journal-1 5G 10G
(parted) mkpart journal-2 10G 15G
# Fix the owner and group, or the OSD may hit permission errors later
# (see the udev note after this block for making this survive a reboot)
# sudo chown ceph:ceph /dev/sde1
# sudo chown ceph:ceph /dev/sde2
# sudo chown ceph:ceph /dev/sde3
# cd /var/lib/ceph/osd/ceph-0
# ceph osd set noout (set noout so that stopping the OSD does not trigger a rebalance)
# systemctl stop ceph-osd@0
# ceph-osd -i 0 --flush-journal
# rm /var/lib/ceph/osd/ceph-0/journal
# Link the journal to the SSD partition; note the argument order: target first, link name second
# ln -s /dev/<ssd-partition-for-journal> /var/lib/ceph/osd/<osd-id>/journal
# The OSD directory also holds a "journal_uuid" file that has to be updated by hand
# with the new partition's PARTUUID; this is arguably a bug
# echo $partuuid > journal_uuid
# ceph-osd -i 0 --mkjournal
# systemctl start ceph-osd@0
# ceph osd unset noout
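
A caveat on the chown step above: ownership of a device node does not survive a reboot, so /dev/sde1-3 will revert to root and the OSDs will fail to open their journals. ceph-disk normally sidesteps this by stamping journal partitions with a special GPT typecode that the packaged ceph udev rules match on; for hand-made partitions, the simplest fix is a small udev rule. A sketch, with a rule file name of my own choosing:

# Make the journal partitions owned by ceph:ceph on every boot
cat > /etc/udev/rules.d/70-ceph-journal.rules <<'EOF'
KERNEL=="sde[1-3]", OWNER="ceph", GROUP="ceph", MODE="0660"
EOF
udevadm control --reload-rules && udevadm trigger

Alternatively, sgdisk --typecode can stamp the partitions with the Ceph journal type GUID so the stock rules apply without any custom file.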
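
The walkthrough above only handles osd.0, and this node has three OSDs. A sketch that repeats the same sequence for the other two, assuming the osd-to-partition mapping shown in the ceph-disk listing below (osd.1 on /dev/sde2, osd.4 on /dev/sde3):

#!/usr/bin/env bash
# Repeat the journal move for the remaining OSDs on this host
set -e
declare -A journals=( [1]=/dev/sde2 [4]=/dev/sde3 )   # osd id -> new journal partition
ceph osd set noout
for id in "${!journals[@]}"; do
    part=${journals[$id]}
    systemctl stop "ceph-osd@${id}"
    ceph-osd -i "$id" --flush-journal
    rm "/var/lib/ceph/osd/ceph-${id}/journal"
    partuuid=$(blkid -s PARTUUID -o value "$part")    # PARTUUID of the new partition
    ln -s "/dev/disk/by-partuuid/${partuuid}" "/var/lib/ceph/osd/ceph-${id}/journal"
    echo "$partuuid" > "/var/lib/ceph/osd/ceph-${id}/journal_uuid"
    ceph-osd -i "$id" --mkjournal
    systemctl start "ceph-osd@${id}"
done
ceph osd unset noout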

After the steps above, confirm the change with "ceph-disk list", for example:

$ sudo ceph-disk list
...
/dev/sda :
/dev/sda2 ceph journal
/dev/sda1 ceph data, active, cluster ceph, osd.4, journal /dev/sde3
/dev/sdb :
/dev/sdb2 ceph journal
/dev/sdb1 ceph data, active, cluster ceph, osd.1, journal /dev/sde2
/dev/sdc :
/dev/sdc2 ceph journal
/dev/sdc1 ceph data, active, cluster ceph, osd.0, journal /dev/sde1
...

# ls -l /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx 1 root root 58 11月 13 16:01 /var/lib/ceph/osd/ceph-0/journal -> /dev/disk/by-partuuid/8fa35d73-d973-4ff3-b103-a370a10bf4a1
# ls -l /dev/disk/by-partuuid/8fa35d73-d973-4ff3-b103-a370a10bf4a1
lrwxrwxrwx 1 root root 10 11月 13 16:09 /dev/disk/by-partuuid/8fa35d73-d973-4ff3-b103-a370a10bf4a1 -> ../../sde1

The journal for OSD-0 now points at a partition on the SSD; with that in place, we can run some benchmarks to compare the two setups.

While benchmarking, I found that rados bench run with its default parameters showed little difference between the two setups; the reply below explains why.

Hi Dave,
The SSD journal will help boost iops & latency which will be more apparent for small block sizes. The rados benchmark default block size is 4M, use the -b option to specify the size. Try at 4k, 32k, 64k …
As a side note, this is a rados level test, the rbd image size is not relevant here.
Maged.
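
Following that advice, something along these lines exercises the small-block path where the SSD journal should show up; the pool name "bench", the PG count, and the thread count here are my own choices:

# 60 seconds of 4k writes with 16 concurrent ops; keep the objects for a read pass
ceph osd pool create bench 128 128
rados bench -p bench 60 write -b 4096 -t 16 --no-cleanup
rados bench -p bench 60 seq -t 16    # sequential reads of the objects just written
rados -p bench cleanup

Repeating the write test with -b 32768 and -b 65536 gives the 32k/64k data points Maged suggests.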


Personal blog: http://www.jungler.cn/
