XFS - Data Block Sharing (Reflink)

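Summary: XFS now enables the reflink feature by default, letting users share data blocks between files. This allows fast snapshots of VM images and deduplication of file data, which makes more efficient use of storage. Reflink relies on copy on write to keep file contents intact, while iomap and redesigned in-kernel algorithms improve I/O performance and remove a memory allocation problem with large sparse files. As of xfsprogs 5.1, the feature is production ready and enabled by default on newly formatted filesystems.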

 


Matt Keenan

Following on from his recent blog XFS - 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel.

Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees; and deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation.

I'm pleased to announce that, with xfsprogs 5.1, the reflink feature is now production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below:

iomap for Faster I/O

Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient.

The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4.

In-Core Extent Tree

For many years, the in-core file extent cache in XFS used a single contiguous chunk of memory to store the mappings. This introduced a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such a large contiguous allocation when memory is fragmented.

Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports.

Users need only upgrade their kernel to take advantage of this improvement.
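To get a feel for the scale involved, here is a rough back-of-the-envelope sketch of how large the old single allocation could become. The record size of 16 bytes per in-core extent mapping is an assumption made for illustration (the article does not state it), and the helper name `extent_array_bytes` is hypothetical:

```shell
# Estimate the single contiguous allocation the old array-based extent
# cache would have needed for a file with a given number of extents.
# ASSUMPTION: 16 bytes per in-core extent record (not stated in the article).
extent_array_bytes() {
    # $1 = number of extents, $2 = bytes per record (defaults to 16)
    echo $(( $1 * ${2:-16} ))
}

# Per the article, the btree-based cache needs at most this much
# contiguous memory per chunk.
BTREE_CHUNK_BYTES=256

# Example: a file with 100 million extents
echo "array: $(extent_array_bytes 100000000) bytes in one piece"
echo "btree: many chunks of $BTREE_CHUNK_BYTES bytes each"
```

Under that assumption, 100 million extents would have needed about 1.6 GB of contiguous memory, whereas the btree only ever asks the kernel for 256-byte pieces.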

To begin experimenting with XFS's reflink support, one must format a new filesystem:

# mkfs.xfs /dev/sda1
meta-data=/dev/sda1              isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem:

# mount /dev/sda1 /storage
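The "reflink=1" check described above can also be scripted. The following is a minimal sketch, assuming the geometry text printed by `mkfs.xfs` (or by `xfs_info` on the mounted filesystem) is piped in on standard input; the function name `reflink_status` is hypothetical:

```shell
# Read mkfs.xfs / xfs_info style geometry text on stdin and report
# whether the exact phrase "reflink=1" appears in it.
reflink_status() {
    # -w keeps "reflink=1" from matching inside a longer token
    if grep -qw 'reflink=1'; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Possible usage on a mounted filesystem (not run here):
#   xfs_info /storage | reflink_status
```

The function only inspects text, so it can be tried out on saved mkfs output without touching a real device.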

At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details.

# mkdir /storage/images
# truncate -s 30g /storage/images/os8_base.img
# qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso

Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process. But first, we'll check that everything's in order:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 000000
<listing shortened for brevity>

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   32G   68G  32% /storage

Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly:

# /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img
0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k
0inputs+0outputs (0major+108minor)pagefaults 0swaps

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 100000
<listing shortened for brevity>
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   32G   68G  32% /storage

This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 100000
<listing shortened for brevity>

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..255]:             102762656..102762911  1 (50333856..50334111)      256 000000
   1: [256..15728639]:      52429216..68157599    1 (416..15728799)      15728384 100000
<listing shortened for brevity>

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   36G   64G  36% /storage

Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days.
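The "128K" figure follows from the TOTAL column, which xfs_bmap reports in 512-byte units: 256 × 512 bytes = 128 KiB. As a small sketch of that arithmetic, the hypothetical helper below takes one extent line from the verbose listing, converts TOTAL to KiB, and tests the "Shared extent" bit (0100000, octal) from the FLAG legend. It assumes TOTAL and FLAGS are the last two columns, as in the listings above:

```shell
# Describe one extent line of "xfs_bmap -v" output: size in KiB and
# whether the shared-extent flag is set. The function body runs in a
# subshell so its variables do not leak into the caller.
describe_extent() (
    # TOTAL and FLAGS are the last two whitespace-separated columns.
    total=$(printf '%s\n' "$1" | awk '{print $(NF-1)}')
    flags=$(printf '%s\n' "$1" | awk '{print $NF}')
    # TOTAL is counted in 512-byte units.
    kib=$(( total * 512 / 1024 ))
    # A leading 0 makes the shell read both numbers as octal.
    if [ $(( 0$flags & 0100000 )) -ne 0 ]; then
        echo "$kib KiB shared"
    else
        echo "$kib KiB private"
    fi
)

describe_extent '   0: [0..255]:             102762656..102762911  1 (50333856..50334111)      256 000000'
# prints "128 KiB private"
```

Run against the second extent of vm1.img, the same helper reports the remainder of the file as still shared with the base image.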

Let's turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After a fresh formatting, create a directory tree for our container base:

# mkdir -p /storage/containers/os8_base

In the directory we just created, install a base container OS image that we will later use for fast deployment. Once that's done, we shut down the container and check that everything's in order:

# df -h /storage/
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G  2.0G   98G   2% /storage
# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 000000

Ok, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system:

# /usr/bin/time cp -pRdu --reflink=always /storage/containers/os8_base /storage/containers/container1
0.01user 0.64system 0:00.68elapsed 96%CPU (0avgtext+0avgdata 2744maxresident)k
0inputs+0outputs (0major+129minor)pagefaults 0swaps

# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 100000

# df -h /storage/
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G  2.0G   98G   2% /storage
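Note that `--reflink=always` makes cp fail outright when the destination filesystem cannot share blocks. Scripts that must also work elsewhere may want a fallback. The following is a minimal single-file sketch, not a replacement for the recursive copy above; the function name `clone_or_copy` is hypothetical, and it assumes a cp that either understands `--reflink=always` or rejects it with an error:

```shell
# Try to clone one file by sharing its blocks; fall back to an ordinary
# copy when that is not possible. Prints which method succeeded.
clone_or_copy() {
    # $1 = source file, $2 = destination file
    if cp -p --reflink=always "$1" "$2" 2>/dev/null; then
        echo "reflinked"
    elif cp -p "$1" "$2"; then
        # Either the filesystem or this cp lacks reflink support.
        echo "copied"
    else
        return 1
    fi
}

# Possible usage (not run here):
#   clone_or_copy /storage/images/os8_base.img /storage/images/vm2.img
```

GNU cp also offers `--reflink=auto`, which performs the same fallback silently; the explicit version here is only useful when the script needs to know which path was taken.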

Now we let the container runtime do some work and update (for example) the bash binary:

# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 000000

# xfs_bmap -e -l -p -v -v -v /storage/containers/container1/bin/bash
/storage/containers/container1/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52442824..52444999  1 (14024..16199)    2176 000000

Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!
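For readers who would rather not compare BLOCK-RANGE columns by eye, here is a small sketch that does the comparison for two extent lines. It assumes the column layout shown in the listings above (BLOCK-RANGE is the third field, written as "start..end"); the function name `extents_overlap` is hypothetical:

```shell
# Report whether the physical block ranges of two "xfs_bmap -v" extent
# lines overlap. The body runs in a subshell to keep variables local.
extents_overlap() (
    # Field 3 is BLOCK-RANGE, formatted as "start..end".
    range_a=$(printf '%s\n' "$1" | awk '{print $3}')
    range_b=$(printf '%s\n' "$2" | awk '{print $3}')
    a_start=${range_a%%..*}
    a_end=${range_a##*..}
    b_start=${range_b%%..*}
    b_end=${range_b##*..}
    # Two closed ranges overlap when each starts before the other ends.
    if [ "$a_start" -le "$b_end" ] && [ "$b_start" -le "$a_end" ]; then
        echo "overlap"
    else
        echo "disjoint"
    fi
)

extents_overlap \
    '   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 000000' \
    '   0: [0..2175]:       52442824..52444999  1 (14024..16199)    2176 000000'
# prints "disjoint"
```

The two bash binaries from the final listing come out as disjoint, matching the observation that they no longer share blocks.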

The following is a step-by-step breakdown of the command `mkfs.xfs -b size=4096 -m reflink=1,crc=1 /dev/sdb1`:

---

### 1. What the command does

It creates an **XFS filesystem** on the device `/dev/sdb1` and uses options to set some of the filesystem's core features.

---

### 2. Options in detail

#### 1. `-b size=4096`

- **Function**: sets the filesystem block size to 4096 bytes (4 KiB).
- **Implications**:
  - Determines the smallest unit of space allocation.
  - Suits general-purpose workloads. 4096 bytes is already the default, so the option only makes the choice explicit.
  - Larger block sizes are rarely an option on Linux: traditionally the kernel can only mount an XFS filesystem whose block size does not exceed the system page size (4 KiB on x86_64).

#### 2. `-m reflink=1,crc=1`

The metadata option (`-m`) carries two suboptions:

- **`reflink=1`**:
  - Lets files share data blocks, with copy on write applied when a shared block is modified.
  - Saves space and supports fast clones and snapshots.
  - Requires a kernel with XFS reflink support and the `crc=1` metadata format. It does not depend on any hardware feature. As the article above notes, it is enabled by default as of xfsprogs 5.1.
- **`crc=1`**:
  - Enables CRC checksums on metadata.
  - Detects metadata corruption and improves reliability.
  - Already enabled by default; specifying it explicitly just makes sure it is on.

#### 3. `/dev/sdb1`

- The target device path.
- **Before running the command**:
  - Confirm the device holds no important data (formatting erases it).
  - Make sure the device is not mounted (otherwise the command fails).

---

### 3. Verifying the result

After formatting, check that the options took effect:

```bash
xfs_info /dev/sdb1
```

The output should contain:

- `bsize=4096` (block size)
- `reflink=1` (block sharing enabled)
- `crc=1` (metadata checksums enabled)

---

### 4. Notes

1. **Compatibility**:
   - `reflink` needs a kernel recent enough to support it.
   - Metadata CRCs are an on-disk format feature. They depend on kernel and xfsprogs support, not on the storage device.
2. **Performance trade-offs**:
   - `reflink` adds some overhead to metadata operations.
   - `crc=1` costs a little performance in exchange for metadata integrity checking.
3. **Use cases**:
   - VM and container storage (fast cloning with `reflink`).
   - Storage for critical data (metadata protected by CRC checksums).

---

### 5. Further options

To tune performance, other options can be combined:

- `-l logdev=/dev/ssd`: place the log on a separate SSD (`/dev/ssd` is a placeholder device path).
- `-d agcount=4`: set the number of allocation groups.

See `man mkfs.xfs` for the complete option reference.