考虑到alter table 对clickhouse的性能和数据一致性有一定的影响,所有尽量使用创建新表来替换旧表以实现数据的变更于删除,下面我将使用状态表和删除表来实现表的更新与删除
1.创建测试表
CREATE TABLE uk_price_paid
(
`sid` UUID,
`price` UInt32,
`date` Date,
`addr1` String,
`addr2` String,
`state` String,
`key` String
)
ENGINE = MergeTree
ORDER BY key
这里的sid 不作为主键,key为唯一主键,每个key做对于的删除与修改时 state发生变化 如 0 新增 1 修改 2 删除 sid 使用UUID随机生成
2.添加测试数据 (随机生成10条key)
INSERT INTO uk_price_paid
SETTINGS max_http_get_redirects = 10
WITH splitByChar(' ', postcode) AS p
SELECT
generateUUIDv4() AS sid,
toUInt32(price_string) AS price,
now() - 1 AS date,
addr1,
addr2,
'0' AS state,
rand() AS key
FROM url('http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv', 'CSV', 'uuid_string String,\n price_string String,\n time String,\n postcode String,\n a String,\n b String,\n c String,\n addr1 String,\n addr2 String,\n street String,\n locality String,\n town String,\n district String,\n county String,\n d String,\n e String')
LIMIT 10
SETTINGS max_http_get_redirects = 10
3.添加修改数据和删除
//修改以9结尾key的数据,更改时间为29号做测试
INSERT INTO uk_price_paid SELECT
generateUUIDv4(),
price,
'20240729' AS date,
addr1,
addr2,
'1' AS state,
key
FROM uk_price_paid
WHERE substring(key, length(key), 1) = '9'
//删除以3结尾的key的数据,更改时间为30号做测试
INSERT INTO uk_price_paid SELECT
generateUUIDv4(),
price,
'20240730' AS date,
addr1,
addr2,
'2' AS state,
key
FROM uk_price_paid
WHERE substring(key, length(key), 1) = '3'
//现在有3条9结尾的key需要修改,3结尾的key都需要删除
4.创建STATE状态表
CREATE TABLE default.state_tb
(
`sid` UUID,
`date` Date,
`state` String,
`key` String
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192
AS SELECT
sid,
date,
state,
key
FROM uk_price_paid
5.创建DEL删除表
CREATE TABLE del_sid
ENGINE = MergeTree
ORDER BY key
AS SELECT
key,
sid,
state
FROM uk_price_paid AS T1
INNER JOIN
(
SELECT
key,
argMax(state, date) AS state
FROM state_tb
GROUP BY key HAVING state = '2'
) AS T2 ON T1.key = T2.key
因为状态为2的key的相关数据都需要删除,所有这里有状态为0,以3结尾的key
6.创建数据更新表(这里我将不改变修改后的key的原始id,只修改内容,及对应的29号的date数据)
CREATE TABLE uk_price_paid_new
ENGINE = MergeTree
ORDER BY key
AS SELECT
T1.*,
T2.key AS key,
T2.osid AS osid
FROM
(
SELECT *
FROM uk_price_paid
WHERE (key, sid) IN (
SELECT
key,
argMax(sid, date)
FROM state_tb
GROUP BY key
)
) AS T1
INNER JOIN
(
SELECT
key,
argMin(sid, date) AS osid
FROM state_tb
GROUP BY key
) AS T2 ON T1.key = T2.key
7.现在这里包含了所有最新的数据,但是它包含了需要删除的数据,再次创建表,来实现id变更为对应的原始id及osid字段,插入对于的新表时,将要删除的数据排除在外
drop table if EXISTS uk_price_paid_tmp
create table uk_price_paid_tmp as uk_price_paid
INSERT INTO uk_price_paid_tmp SELECT
osid AS sid,
price,
date,
addr1,
addr2,
state,
key
FROM uk_price_paid_new
WHERE sid NOT IN (
SELECT sid
FROM del_sid
)
现有的将是更新后的表,如果有增量时间需要的时候,更新相关环节的STATE和DEL表即可
验证一下,根据DEL表中应该有2条3结尾的key状态为0的需要删除,并且9结尾的key时间都应该是29号,如下图