hive/iceberg唯一性约束

weixin_44591926

已于 2022-08-01 14:50:04 修改

阅读量1.4k

点赞数

文章标签： hive 数据库数据仓库

于 2022-07-24 18:59:10 首次发布

本文链接：https://blog.youkuaiyun.com/weixin_44591926/article/details/125962594

版权

先了解下oracle的约束状态
https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/constraint.html#GUID-1055EA97-BA6F-4764-A15F-1024FD5B6DFE

ENABLE VALIDATE指定所有旧数据和新数据也符合约束。启用的验证约束保证所有数据都是有效的，并将继续有效。
ENABLE NOVALIDATE确保对受约束数据的所有新 DML 操作都符合约束。此子句不确保表中的现有数据符合约束。
DISABLE VALIDATE禁用约束并删除约束上的索引，但保持约束有效。此功能在数据仓库情况下最有用，因为它可以加载大量数据，同时还可以通过没有索引来节省空间。不允许其他SQL 语句对表的所有其他修改（插入、更新和删除）。
DISABLE NOVALIDATE表示 Oracle 不努力维护约束（因为它被禁用）并且不能保证约束是真的（因为它没有被验证）。

回来看hive的约束定义
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints

drop table if exists testonly;

create table testonly(id string,name string,age int);

alter table testonly add constraint pk_id primary key(id) disable novalidate;
alter table testonly add constraint unique_id unique(id) disable novalidate;

当我尝试把disable novalidate改为其他3种约束状态时，hive会进行友情提示
以enable validate为例
可见hive(version3.1.2)针对数据的唯一性约束暂不支持，个人认为存在以下两点原因：

hive作为数仓工具而非数据库，olap（roll-up/drill-down/slice/dice/pivot）是它的主战场
假设hive支持约束性检查，面对海量数据存储时每一条数据都做约束性检查，虽然提高了数据质量，但一定会带来存储效率的消耗

iceberg(version0.13.2)唯一性约束
flinkSQL操作为例：

set table.sql-dialect=default;
use catalog hive_catalog;
CREATE TABLE mytable (
    id   string,
    name string,
    age int,
    primary key(id) NOT ENFORCED
) PARTITIONED BY (id)
WITH (
    'connector'='iceberg',
    'format-version'='2',
    'write.upsert.enabled' = 'true',
    'write.distribution-mode' = 'hash',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'catalog-type'='hive',
    'catalog-name'='hive_catalog',
    'catalog-database'='mydbname',
    'catalog-table'='testonly',
    'uri'='thrift://host1:9083,thrift://host2:9083,thrift://host3:9083',
    'warehouse'='hdfs://ns1/user/hive/warehouse'
);
INSERT INTO mytable VALUES ('qq1','tommy',33333),('qq2','abc',33333);
INSERT INTO mytable VALUES ('qq1','tommy',456),('qq2','abc',789);
insert into mytable select 'qq3' as id,'testname' as name,100 as age from existtest_table limit 10000;

问题点：
在这里插入图片描述

每行记录都对应hdfs一个目录，海量数据insert时，会存在海量的小目录文件是可预料到的麻烦事
id='qq3’的数据经过flink反复写入iceberg表发现：id虽然保证了数据表检索时的唯一性，但是数据文件在累积增加，说明upsert更新的不是数据文件而是iceberg的元数据。面对无界流数据，update的操作有多少都不得而知，极端一点iceberg表中1条记录经过多次的update就可能占用上百兆的存储需求

结论：hive/iceberg保证数据唯一性，尽量不要沿用oltp的思维。保证源头数据质量，在真正使用到数据计算时采用olap方式去除数据重复性。