Please get familiar with lookup mode first, then read about full-compaction mode.
Environment setup
CREATE CATALOG fs_catalog WITH (
'type'='paimon',
'warehouse'='file:/data/soft/paimon/catalog'
);
USE CATALOG fs_catalog;
drop table if exists t_changelog_fullcompaction;
CREATE TABLE t_changelog_fullcompaction(
age BIGINT,
money BIGINT,
hh STRING,
PRIMARY KEY (hh) NOT ENFORCED
) WITH (
'merge-engine' = 'deduplicate',
'changelog-producer' = 'full-compaction'
);
The relationship between Paimon snapshots and checkpoints
- One snapshot produces one data file.
- One checkpoint produces 1-2 snapshots, depending on whether that checkpoint triggers a compaction: if it does, there are 2 data files (one holding the merged data, one holding the data written during this checkpoint); otherwise there is only one (the data written during this checkpoint). You can confirm this with the query sketch below.
- With streaming writes, Paimon generates snapshots periodically according to the checkpoint interval.
- With batch writes (manually executed SQL scripts), every SQL statement generates a snapshot immediately, regardless of the configured checkpoint interval.
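If you want to verify this yourself, Paimon exposes a snapshots system table that you can query from the Flink SQL client; a minimal sketch (column names follow Paimon's system-table docs, where commit_kind distinguishes APPEND commits from COMPACT commits):

-- inspect snapshots: commit_kind tells APPEND (data written by a checkpoint)
-- apart from COMPACT (a compaction)
SELECT snapshot_id, commit_kind, commit_time
FROM t_changelog_fullcompaction$snapshots;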
In lookup mode, each checkpoint produces one changelog, whereas full-compaction waits for several checkpoints before producing one; how many depends on a parameter.

If full-compaction.delta-commits = 2 is configured, it means that in a streaming job a changelog is generated only after 2 checkpoints, i.e. after 2 snapshots have been produced.
In a batch scenario, every SQL statement produces one checkpoint.
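The option can either go into the CREATE TABLE statement (as in verification 2 below) or be added to an existing table; a sketch using Paimon's ALTER TABLE ... SET syntax:

-- add the option to an already-created table
ALTER TABLE t_changelog_fullcompaction SET ('full-compaction.delta-commits' = '2');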
Streaming write verification (1)
CREATE TEMPORARY TABLE t_auto_generated_data (
age BIGINT,
money BIGINT,
hh STRING,
PRIMARY KEY (hh) NOT ENFORCED
) WITH (
'connector' = 'datagen', -- use the datagen connector to auto-generate rows
'fields.age.kind' = 'random', -- age: random values
'fields.age.min' = '1', -- age: minimum value
'fields.age.max' = '100', -- age: maximum value
'fields.money.kind' = 'random', -- money: random values
'fields.money.min' = '0', -- money: minimum value
'fields.money.max' = '10000', -- money: maximum value
'fields.hh.kind' = 'random', -- hh: random strings
'fields.hh.length' = '10' -- hh: string length
);
-- set the checkpoint interval to 60 s
SET 'execution.checkpointing.interval'='60 s';
-- streaming write
insert into t_changelog_fullcompaction select * from t_auto_generated_data;
After a while, check the catalog directory and look at when the changelog files were created: one per minute, matching the checkpoint interval. So far the behavior is the same as lookup mode: one checkpoint produces one changelog.
root@wsl01:/data/soft/paimon/catalog/default.db/t_changelog_fullcompaction/bucket-0# ll
total 66904
-rw-r--r-- 1 root root   335296 Nov 28 10:04 changelog-3b6403dd-f6a3-4492-b376-9d6ecadc97e1-0.parquet
-rw-r--r-- 1 root root  9580478 Nov 28 10:05 changelog-3b6403dd-f6a3-4492-b376-9d6ecadc97e1-2.parquet
-rw-r--r-- 1 root root  9581448 Nov 28 10:06 changelog-3b6403dd-f6a3-4492-b376-9d6ecadc97e1-4.parquet
-rw-r--r-- 1 root root   335296 Nov 28 10:04 data-3a8b0d35-34db-4aca-81de-68dea000a669-0.parquet
-rw-r--r-- 1 root root  9580478 Nov 28 10:05 data-3a8b0d35-34db-4aca-81de-68dea000a669-1.parquet
-rw-r--r-- 1 root root  9581448 Nov 28 10:06 data-3a8b0d35-34db-4aca-81de-68dea000a669-2.parquet
-rw-r--r-- 1 root root 10026806 Nov 28 10:05 data-3b6403dd-f6a3-4492-b376-9d6ecadc97e1-1.parquet
-rw-r--r-- 1 root root 19477692 Nov 28 10:06 data-3b6403dd-f6a3-4492-b376-9d6ecadc97e1-3.parquet
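To watch the generated changelog being consumed downstream, you can run a streaming query against the table; a minimal sketch using Paimon's audit_log system table, whose rowkind column (+I/-U/+U/-D) reflects the change type of each entry:

-- stream-read the changelog; rowkind shows the change type of each row
SET 'execution.runtime-mode' = 'streaming';
SELECT rowkind, age, money, hh FROM t_changelog_fullcompaction$audit_log;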
Streaming write verification (2)
Now add the full-compaction.delta-commits = 2 parameter and see how changelog generation changes.
drop table if exists t_changelog_fullcompaction2;
CREATE TABLE t_changelog_fullcompaction2(
age BIGINT,
money BIGINT,
hh STRING,
PRIMARY KEY (hh) NOT ENFORCED
) WITH (
'merge-engine' = 'deduplicate',
'changelog-producer' = 'full-compaction',
'full-compaction.delta-commits' = '2'
);
-- set the checkpoint interval to 60 s
SET 'execution.checkpointing.interval'='60 s';
-- streaming write
insert into t_changelog_fullcompaction2 select * from t_auto_generated_data;
After a while, check the catalog directory and look at the changelog creation times: now one every 2 minutes, i.e. full-compaction.delta-commits × the checkpoint interval. In other words, n checkpoints produce one changelog.
Note that the trigger is the number of checkpoints, not a time interval; this becomes visible with batch writes.
root@wsl01:/data/soft/paimon/catalog/default.db/t_changelog_fullcompaction2/bucket-0# ll
total 131168
-rw-r--r-- 1 root root 10024339 Nov 28 10:16 changelog-99b10288-0cd7-471e-8afd-075620279701-1.parquet
-rw-r--r-- 1 root root 18879457 Nov 28 10:18 changelog-99b10288-0cd7-471e-8afd-075620279701-3.parquet
-rw-r--r-- 1 root root  2945606 Nov 28 10:20 changelog-99b10288-0cd7-471e-8afd-075620279701-5.parquet
-rw-r--r-- 1 root root 10024339 Nov 28 10:16 data-99b10288-0cd7-471e-8afd-075620279701-0.parquet
-rw-r--r-- 1 root root 28811744 Nov 28 10:18 data-99b10288-0cd7-471e-8afd-075620279701-2.parquet
-rw-r--r-- 1 root root 31599979 Nov 28 10:20 data-99b10288-0cd7-471e-8afd-075620279701-4.parquet
-rw-r--r-- 1 root root   335703 Nov 28 10:15 data-d3f11150-7e72-4cb6-addf-7079367bea42-0.parquet
-rw-r--r-- 1 root root  9575206 Nov 28 10:16 data-d3f11150-7e72-4cb6-addf-7079367bea42-1.parquet
-rw-r--r-- 1 root root  9568870 Nov 28 10:17 data-d3f11150-7e72-4cb6-addf-7079367bea42-2.parquet
-rw-r--r-- 1 root root  9575911 Nov 28 10:18 data-d3f11150-7e72-4cb6-addf-7079367bea42-3.parquet
-rw-r--r-- 1 root root  2941759 Nov 28 10:19 data-d3f11150-7e72-4cb6-addf-7079367bea42-4.parquet
-rw-r--r-- 1 root root     5890 Nov 28 10:20 data-d3f11150-7e72-4cb6-addf-7079367bea42-5.parquet
When delta-commits is not set, the behavior looks no different from lookup mode; so if you want to use full-compaction, you must set delta-commits to control how often compaction is triggered.
Lookup mode runs a compaction at every checkpoint: data latency is low, but resource consumption is high. If you do not care about latency, you can use full-compaction mode instead: several checkpoints are accumulated and then compacted in one go, which reduces resource consumption at the cost of higher data latency.
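For comparison, the lookup-mode equivalent of this table differs only in the changelog-producer option; a sketch (the table name t_changelog_lookup is made up here):

CREATE TABLE t_changelog_lookup(
age BIGINT,
money BIGINT,
hh STRING,
PRIMARY KEY (hh) NOT ENFORCED
) WITH (
'merge-engine' = 'deduplicate',
'changelog-producer' = 'lookup' -- compaction + changelog at every checkpoint
);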
Batch write verification
Each batch write triggers one checkpoint. Under full-compaction with 'full-compaction.delta-commits' = '2' configured, what happens?
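For reference, the three writes could look like the sketch below (the actual rows are not shown in the original, so the values here are hypothetical); in batch mode every INSERT is its own commit:

-- batch mode: every INSERT commits immediately (hypothetical sample rows)
SET 'execution.runtime-mode' = 'batch';
INSERT INTO t_changelog_fullcompaction2 VALUES (1, 100, 'a');
INSERT INTO t_changelog_fullcompaction2 VALUES (2, 200, 'b');
INSERT INTO t_changelog_fullcompaction2 VALUES (3, 300, 'c');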
After three writes there are 4 changelog files. The reason: we configured one compaction per 2 checkpoints, so on the third write Paimon sees that two checkpoints have already completed and runs a compaction, and the third write also produces its own data; 4 changelogs in total. A rough understanding is enough here; in practice nobody does batch writes into a full-compaction table.
root@wsl01:/data/soft/paimon/catalog/default.db/t_changelog_fullcompaction2/bucket-0# ll
total 40
-rw-r--r-- 1 root root 1217 Nov 28 10:39 changelog-01040eb2-9c68-45b7-b04c-00bd3f4b8c45-0.parquet
-rw-r--r-- 1 root root 1362 Nov 28 10:40 changelog-2ed968c6-d7c3-47c7-9416-00ad010ef7d0-1.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:39 changelog-30c7d2fe-942b-42bd-884f-c2ccf4c6005d-0.parquet
-rw-r--r-- 1 root root 1362 Nov 28 10:39 changelog-4a1f24c6-8b9c-455b-b153-cd5688b3e9f4-1.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:39 data-243a924d-b38f-4a8c-9704-bccbbc63a3dd-0.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:40 data-2ed968c6-d7c3-47c7-9416-00ad010ef7d0-0.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:39 data-2facc877-fd84-4717-b10f-28187bcca435-0.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:39 data-4a1f24c6-8b9c-455b-b153-cd5688b3e9f4-0.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:40 data-82c7c2f9-03b2-4aea-83ae-b0dc28a7427e-0.parquet
-rw-r--r-- 1 root root 1217 Nov 28 10:39 data-fa4c9a01-13e0-46f3-be45-c15f34577282-0.parquet
Conclusion
- The difference between changelog-producer=full-compaction and lookup mode lies in when the changelog is generated.
- lookup mode: one checkpoint produces one changelog; low data latency, high resource consumption.
- full-compaction mode: n checkpoints produce one changelog; higher data latency, lower resource consumption.
Use cases (all three conditions together):
- tables not ingested via CDC (they cannot provide a complete stream of data changes), and
- tables that will later be processed by streaming jobs, and
- scenarios where data latency is not a concern.