Chapter 2.2 StarRocks Table Design: Sort Keys and Data Models

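Columns fall into two broad categories: Key and Value. From a business perspective, Key columns correspond to dimension columns and Value columns correspond to metric columns. In StarRocks, the Key columns are the columns listed after the keywords 'duplicate key', 'aggregate key', 'unique key', or 'primary key' in the CREATE TABLE statement. Every column that is not a Key column is a Value column.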

1.1 The four models

Duplicate Key Model: the detail model
Aggregate Key Model: the aggregation model
Unique Key Model: the update model
Primary Key Model: the primary key model

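1.2 Sort keys

1.2.1 Overview

When you create a table in StarRocks, you can designate one or more columns (typically the first three) as the table's sort key. When data is loaded, it is stored on disk in the order defined by the sort key. When a query filters on these sort columns, StarRocks can use the sorted layout to quickly locate where the matching data lives on disk, so large amounts of irrelevant data are skipped during the scan phase and the query runs faster.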

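At first glance, the sort key of each model is simply the list of columns given after duplicate key, aggregate key, unique key, or primary key in the CREATE TABLE statement. The four models nevertheless differ in some details:

1.2.2 Categories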

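Duplicate Key model: the sort key is fairly flexible here, and a subset of the dimension columns can serve as the sort key. The sort key can be defined explicitly with duplicate key(). If duplicate key(col1, col2, ...) is omitted, the first three columns of the table are chosen as the sort key by default. In the CREATE TABLE statement, the sort key columns must be defined before all other columns. The columns listed in the sort key must also appear in the same order as in the column definitions; otherwise the statement fails.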
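To make the ordering rule concrete, here is a minimal sketch of a declaration that, according to the rule just described, is expected to be rejected. The table name test1_bad_order is hypothetical and exists only for this illustration; the exact error text depends on the StarRocks version and is not reproduced here.

```sql
-- Hypothetical table, used only to illustrate the ordering rule.
-- The key lists event_type before event_time, which does not match
-- the order of the column definitions, so the statement is expected to fail.
create table if not exists test1_bad_order (
    event_time datetime not null comment "datetime of event",
    event_type int not null comment "type of event",
    user_id int comment "id of user",
    channel int comment ""
)
duplicate key(event_type, event_time)
distributed by hash(user_id) buckets 10;
```

Swapping the key back to duplicate key(event_time, event_type) restores a prefix of the column list in definition order, which is the valid form.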

-- Table creation statement:
create table if not exists test1 (
    event_time datetime not null comment "datetime of event",
    event_type int not null comment "type of event",
    user_id int comment "id of user",
    channel int comment ""
)
duplicate key(event_time, event_type,user_id)
distributed by hash(user_id) buckets 10;

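-- If the sort key is defined explicitly with duplicate key(), four combinations
-- create the table without error:
--   event_time
--   event_time, event_type
--   event_time, event_type, user_id
--   event_time, event_type, user_id, channel

-- If duplicate key(col1, col2, ...) is omitted, the first three columns of the
-- table become the sort key by default.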
create table if not exists test1 (
    event_time datetime not null comment "datetime of event",
    event_type int not null comment "type of event",
    user_id int comment "id of user",
    channel int comment ""
)
distributed by hash(user_id) buckets 10;
-- Equivalent to:
create table if not exists test1 (
    event_time datetime not null comment "datetime of event",
    event_type int not null comment "type of event",
    user_id int comment "id of user",
    channel int comment ""
)
duplicate key(event_time, event_type,user_id)
distributed by hash(user_id) buckets 10;
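As a rough illustration of how the sort key affects queries, the sketch below runs two queries against the test1 table defined above. The filter values are made up for the example, and how much data is actually skipped depends on the data and the StarRocks version.

```sql
-- Filters on the leading sort key columns (event_time, then event_type),
-- so the sorted layout can be used to narrow the scan to the matching range.
select user_id, channel
from test1
where event_time >= '2024-01-01 00:00:00'
  and event_time <  '2024-01-02 00:00:00'
  and event_type = 1;

-- Filters only on channel, which is not part of the sort key,
-- so the sort order by itself does not help narrow the scan.
select count(*)
from test1
where channel = 3;
```

This is why the columns that appear most often in filter conditions are usually placed first in the sort key.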

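Aggregate Key model: data is aggregated by the sort key given in aggregate key() and then sorted. The sort key must satisfy the uniqueness constraint and must list all dimension columns in the order in which they are defined in the table.

-- Table creation statement: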
create table if not exists test2(
    site_id largeint not null comment "id of site",
    date date not null comment "time of event",
    city_code varchar(20) comment "city_code of user",
    pv bigint sum default "0" comment "total page views"
)
aggregate key(site_id, date, city_code)
distributed by hash(site_id)
properties (
"replication_num" = "3"
);


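-- The sort key must satisfy the uniqueness constraint and must list all
-- dimension columns in the order in which they are defined in the table.
-- Here the sort key is site_id, date, city_code and the metric (Value) column is pv.

-- The statement above can be shortened to: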
create table if not exists test2(
    site_id largeint not null comment "id of site",
    date date not null comment "time of event",
    city_code varchar(20) comment "city_code of user",
    pv bigint sum default "0" comment "total page views"
)
distributed by hash(site_id)
properties (
"replication_num" = "3"
);
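The effect of aggregating on the sort key can be sketched with a small load into test2. The rows below are invented sample data; the expected result follows from the sum aggregation declared on pv.

```sql
-- Sample rows (made up for illustration). The first two share the same
-- key (site_id, date, city_code), so their pv values are summed.
insert into test2 values
    (1, '2024-01-01', 'beijing', 10),
    (1, '2024-01-01', 'beijing', 5),
    (1, '2024-01-02', 'beijing', 7);

select * from test2;
-- Expected rows (order not guaranteed without an ORDER BY):
--   1, 2024-01-01, beijing, 15
--   1, 2024-01-02, beijing, 7
```

Each distinct combination of the key columns ends up as a single row, which is what the uniqueness constraint on the sort key means in practice.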

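Unique Key model: the sort key (also called the primary key) can be written in only one way, inside the parentheses of unique key(), and it must satisfy the uniqueness constraint.

-- Table creation statement: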
create table if not exists test3(
    create_time date no
