XFS - Data Block Sharing (Reflink)

Reposted under the CC 4.0 BY-SA license.

Original article: https://blogs.oracle.com/linux/xfs-data-block-sharing-reflink

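The XFS filesystem now enables the reflink feature by default, allowing users to share data blocks between files. This makes it possible to snapshot VM images quickly and to deduplicate file data, improving storage efficiency. Reflink uses copy on write to keep file contents intact, while iomap and redesigned in-kernel algorithms improve I/O performance and resolve the memory allocation problem with large sparse files. Users need only upgrade to xfsprogs 5.1 or later to use this production-ready feature.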

Matt Keenan

Following on from his recent blog XFS - 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel.

Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees; and deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation.

I'm pleased to announce that, with xfsprogs 5.1, the reflink feature is now production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below:

iomap for Faster I/O

Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient.

The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4.

In-Core Extent Tree

For many years, the in-core file extent cache in XFS used a contiguous chunk of memory to store the mappings. This introduces a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented.

Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports.

Users need only upgrade their kernel to take advantage of this improvement.

Demonstration: Reflink

To begin experimenting with XFS's reflink support, one must format a new filesystem:

# mkfs.xfs /dev/sda1

meta-data=/dev/sda1              isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem:

# mount /dev/sda1 /storage
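The "reflink=1" check described above can also be scripted. The following is only a minimal sketch: `has_reflink` is a hypothetical helper name, not part of xfsprogs, and it assumes the geometry text arrives on standard input in the same format that mkfs.xfs printed.

```shell
#!/bin/sh
# Hypothetical helper (not part of xfsprogs): succeeds only if the geometry
# text on stdin contains the exact token "reflink=1".
has_reflink() {
    grep -q -E '(^|[^[:alnum:]_])reflink=1([^0-9]|$)'
}

# Sample geometry line copied from the mkfs output above.
if printf '%s\n' '         =                       reflink=1' | has_reflink; then
    echo "reflink enabled"
else
    echo "reflink not enabled"
fi
```

On a mounted filesystem the same helper could be fed from `xfs_info /storage`, which prints the geometry in the same layout as mkfs.xfs.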

At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details.

# mkdir /storage/images

# truncate -s 30g /storage/images/os8_base.img

# qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso

Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process. But first, we'll check that everything's in order:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img

/storage/images/os8_base.img:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS

0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 000000

<listing shortened for brevity>

# df -h /storage

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 100G 32G 68G 32% /storage

Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly:

# /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img

0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k

0inputs+0outputs (0major+108minor)pagefaults 0swaps

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img

/storage/images/vm1.img:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS

0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000

<listing shortened for brevity>

FLAG Values:

0100000 Shared extent

0010000 Unwritten preallocated extent

0001000 Doesn't begin on stripe unit

0000100 Doesn't end on stripe unit

0000010 Doesn't begin on stripe width

0000001 Doesn't end on stripe width
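The FLAGS column can be read against this legend one digit at a time. The sketch below is illustrative only: `decode_bmap_flags` is a hypothetical helper, and it assumes each flag occupies its own digit position exactly as the legend shows.

```shell
#!/bin/sh
# Hypothetical helper: print the legend entries that apply to one FLAGS value.
decode_bmap_flags() {
    f=$1
    # The listing prints 6 digits while the legend uses 7, so left-pad with zeros.
    while [ "${#f}" -lt 7 ]; do f="0$f"; done
    case $f in ?1?????) echo "Shared extent" ;; esac
    case $f in ??1????) echo "Unwritten preallocated extent" ;; esac
    case $f in ???1???) echo "Doesn't begin on stripe unit" ;; esac
    case $f in ????1??) echo "Doesn't end on stripe unit" ;; esac
    case $f in ?????1?) echo "Doesn't begin on stripe width" ;; esac
    case $f in ??????1) echo "Doesn't end on stripe width" ;; esac
    return 0
}

decode_bmap_flags 100000    # FLAGS value of vm1.img in the listing above
```

For the value 100000 this prints "Shared extent"; for 000000 it prints nothing.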

# df -h /storage

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 100G 32G 68G 32% /storage
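The cp command above passes a bare `--reflink`, which GNU cp treats as `--reflink=always`: the copy fails outright if the filesystem cannot share blocks. A script that must also work on filesystems without reflink support might fall back to an ordinary copy, as in this sketch. The helper name `clone_or_copy` and the temporary file names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: try a reflink clone first, fall back to a regular copy.
clone_or_copy() {
    if cp --reflink=always -- "$1" "$2" 2>/dev/null; then
        echo "reflink"
    else
        # A failed clone attempt may leave an empty destination; overwrite it.
        cp -- "$1" "$2" && echo "copy"
    fi
}

# Demo on throwaway files (paths are illustrative only).
tmp=$(mktemp -d)
printf 'base image contents\n' > "$tmp/base.img"
clone_or_copy "$tmp/base.img" "$tmp/vm1.img"
cmp -s "$tmp/base.img" "$tmp/vm1.img" && echo "contents match"
rm -rf "$tmp"
```

GNU cp offers similar behaviour directly through `--reflink=auto`; the explicit fallback here only makes it visible which path was taken.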

This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img

/storage/images/os8_base.img:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS

0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000

<listing shortened for brevity>

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img

/storage/images/vm1.img:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS

0: [0..255]: 102762656..102762911 1 (50333856..50334111) 256 000000

1: [256..15728639]: 52429216..68157599 1 (416..15728799) 15728384 100000

<listing shortened for brevity>

# df -h /storage

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 100G 36G 64G 36% /storage

Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days.
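The "128K" figure follows from the xfs_bmap listing, which reports offsets and lengths in units of 512-byte blocks. As a rough sketch of the arithmetic (`extent_kib` is a hypothetical helper that takes a FILE-OFFSET range with the surrounding brackets and colon removed):

```shell
#!/bin/sh
# Hypothetical helper: length in KiB of an extent, given its FILE-OFFSET
# range in 512-byte units, e.g. "0..255" from the listing above.
extent_kib() {
    start=${1%%..*}    # text before the ".."
    end=${1##*..}      # text after the ".."
    echo $(( (end - start + 1) * 512 / 1024 ))
}

extent_kib 0..255    # the extent rewritten by copy on write; prints 128
```

The same calculation applied to the remaining shared extent, `256..15728639`, gives the amount of data still shared with the base image within the portion of the listing shown.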

Let's turn our attention to the second major feature of reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After freshly formatting and mounting the filesystem, create a directory tree for our container base:

# mkdir -p /storage/containers/os8_base

In the directory we just created, install a base container OS image that we will later use for fast deployment. Once that's done, we shut down the container and check that everything's in order:

# df -h /storage/

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 100G 2.0G 98G 2% /storage

# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash

/storage/containers/os8_base/bin/bash:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS

0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 000000

Ok, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system: