Why Is the Clone Plugin Easier to Use Than Xtrabackup? A Look at the Implementation

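This article gives a first introduction to how the Clone Plugin works, how it resembles and differs from Xtrabackup, and the overall framework of its implementation.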

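Author: 戴骏贤 (小光), formerly a senior database systems engineer on the NetEase Games billing team, now a database operations expert at Tianyi Cloud (天翼云).

Produced by the ActionSky open source community (爱可生开源社区). Original content may not be used without authorization; to republish it, please contact the editor and credit the source.

The original article runs to about 2,600 Chinese characters, roughly a 9-minute read.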

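Starting with MySQL 8.0.17, MySQL officially provides the Clone feature. With a few simple SQL commands, a user can copy a remote or local database instance to another location and then quickly bring up a new instance from the copy.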

Image source: https://lefred.be/
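
As a rough illustration of the "simple SQL commands" mentioned above, the sketch below assembles the statements a recipient instance might run. The helper names, host, user, password, and directory are hypothetical. The statement forms follow the MySQL 8.0 clone plugin syntax, and the string formatting does no escaping, so treat this as an illustration rather than production code.

```python
def remote_clone_statements(donor_host, donor_port, clone_user, password):
    """Return the SQL a recipient could run to clone from a remote donor.

    All arguments are placeholders chosen for illustration.
    """
    return [
        # Load the clone plugin (the file name shown is the Linux one).
        "INSTALL PLUGIN clone SONAME 'mysql_clone.so';",
        # The recipient only accepts donors listed here.
        f"SET GLOBAL clone_valid_donor_list = '{donor_host}:{donor_port}';",
        # Replace the recipient's data with a snapshot of the donor.
        f"CLONE INSTANCE FROM '{clone_user}'@'{donor_host}':{donor_port} "
        f"IDENTIFIED BY '{password}';",
    ]


def local_clone_statement(target_dir):
    """Return the SQL that clones the local instance into a new directory."""
    return f"CLONE LOCAL DATA DIRECTORY = '{target_dir}';"


if __name__ == "__main__":
    for stmt in remote_clone_statements("10.0.0.1", 3306, "clone_user", "secret"):
        print(stmt)
    print(local_clone_statement("/data/clone_dir"))
```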

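The feature is made up of a series of worklogs (WL):

Clone local replica (WL#9209): implements cloning data locally.
Clone remote replica (WL#9210): builds on local clone to implement remote clone. The data is saved into a directory on a remote node, which solves the problem of deploying MySQL across nodes.
Clone Remote provisioning (WL#11636): copies the data directly into a MySQL instance that needs to be re-initialized. This WL also adds pre-checks.
Clone Replication Coordinates (WL#9211): captures and stores the replication position of the clone, so that the cloned instance can join a cluster properly.
Support cloning encrypted database (WL#9682): the last worklog handles copying data when encryption is enabled.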

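During an Xtrabackup backup, the biggest potential problem is that the redo log is copied more slowly than the production instance generates it.

The redo log is reused circularly: once a checkpoint has passed, old redo log records may be overwritten by new ones. If Xtrabackup has not finished copying the old records by then, the consistency of the data in the backup can no longer be guaranteed.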

Image source: https://www.cnblogs.com/linuxk/p/9372990.html

How the redo log works
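
The failure mode described above can be sketched with a toy model. This is not Xtrabackup's code: the function name and the single-number "rates" are assumptions made for illustration, and the model pessimistically assumes the checkpoint always keeps up with the writer, so any record more than one log capacity behind the write position has already been overwritten.

```python
def first_lost_tick(capacity, write_rate, copy_rate, ticks):
    """Return the first tick at which uncopied redo is overwritten, or None.

    capacity   -- size of the circular redo log (arbitrary units)
    write_rate -- redo generated by the server per tick
    copy_rate  -- redo copied by the backup tool per tick
    """
    write_lsn = 0   # how far the server has written
    copied_lsn = 0  # how far the backup tool has copied
    for tick in range(1, ticks + 1):
        write_lsn += write_rate
        # The copier cannot read past what has been written.
        copied_lsn = min(copied_lsn + copy_rate, write_lsn)
        # Once the writer laps the copier, uncopied records are gone.
        if write_lsn - copied_lsn > capacity:
            return tick
    return None


if __name__ == "__main__":
    print(first_lost_tick(capacity=100, write_rate=30, copy_rate=10, ticks=20))
    print(first_lost_tick(capacity=100, write_rate=30, copy_rate=30, ticks=20))
```

In this model the gap between writer and copier grows by `write_rate - copy_rate` per tick, so with a capacity of 100, a write rate of 30, and a copy rate of 10, the backup first loses redo at tick 6. When the copier keeps pace, the function returns `None`.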

So how does the Clone Plugin solve this problem? The overall design can be seen in WL#9209, which divides the clone process into 5 steps:

INIT: The clone object is initialized identified by a locator.
FILE COPY: The state changes from INIT to "FILE COPY" when snapshot_copy interface is called. Before making the state change we start "Page Tracking" at lsn "CLONE START LSN". In this state we copy all database files and send to the caller.
PAGE COPY: The state changes from "FILE COPY" to "PAGE COPY" after all files are copied and sent. Before making the state change we start "Redo Archiving" at lsn "CLONE FILE END LSN" and stop "Page Tracking". In this state, all modified pages as identified by Page IDs between "CLONE START LSN" and "CLONE FILE END LSN" are read from "buffer pool" and sent. We would sort the pages by space ID, page ID to avoid random read(donor) and random write(recipient) as much as possible.
REDO COPY: The state changes from "PAGE COPY" to "REDO COPY" after all modified pages are sent. Before making the state change we stop "Redo Archiving" at lsn "CLONE LSN". This is the LSN of the cloned database. We would also need to capture the replication coordinates at this point in future. It should be the replication coordinate of the last committed transaction up to the "CLONE LSN". We send the redo logs from archived files in this state from "CLONE FILE END LSN" to "CLONE LSN" before moving to "Done" state.
Done: The clone object is kept in this state till destroyed by snapshot_end() call.
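
The five states quoted above can be summarized as a small state machine. The sketch below is only a reading aid under stated assumptions: the class, table, and function names are hypothetical and do not correspond to the MySQL source. It records which tracking or archiving action precedes each state change and shows the page ordering by (space ID, page ID) that the worklog mentions.

```python
from enum import Enum


class CloneState(Enum):
    INIT = "INIT"
    FILE_COPY = "FILE COPY"
    PAGE_COPY = "PAGE COPY"
    REDO_COPY = "REDO COPY"
    DONE = "Done"


# current state -> (next state, actions taken before the state change)
TRANSITIONS = {
    CloneState.INIT: (
        CloneState.FILE_COPY,
        ["start page tracking at CLONE START LSN"],
    ),
    CloneState.FILE_COPY: (
        CloneState.PAGE_COPY,
        ["start redo archiving at CLONE FILE END LSN", "stop page tracking"],
    ),
    CloneState.PAGE_COPY: (
        CloneState.REDO_COPY,
        ["stop redo archiving at CLONE LSN"],
    ),
    CloneState.REDO_COPY: (CloneState.DONE, []),
}


def run_clone():
    """Walk the states in order and return a log of actions and transitions."""
    state = CloneState.INIT
    log = []
    while state in TRANSITIONS:
        next_state, actions = TRANSITIONS[state]
        log.extend(actions)
        state = next_state
        log.append(f"enter {state.value}")
    return log


def pages_to_send(tracked_page_ids):
    """Deduplicate tracked (space_id, page_id) pairs and sort them,
    so reads on the donor and writes on the recipient are less random."""
    return sorted(set(tracked_page_ids))


if __name__ == "__main__":
    for line in run_clone():
        print(line)
    print(pages_to_send([(2, 5), (1, 9), (2, 5), (1, 3)]))
```

The point of the ordering is that redo archiving starts before page tracking stops, so every change made during the clone is captured either as a modified page or as an archived redo record. This is why the clone does not depend on outrunning the circular redo log.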