ClickHouse使用（十）

最新推荐文章于 2024-06-05 17:56:48 发布

原创最新推荐文章于 2024-06-05 17:56:48 发布 · 725 阅读

1 ·

CC 4.0 BY-SA版权

文章标签：

#数据库

大数据专栏收录该内容

19 篇文章

订阅专栏

本文详细介绍了ClickHouse的各种表引擎，包括File、Memory、Set、Join、Buffer、TinyLog、StripeLog和Distributed等，以及其内存管理和日志类型的特点。此外，还探讨了数据查询中的with子句、sample子句、arrayjoin子句、join子句、预处理where（prewhere）子句、groupby子句和limitby等操作，深入解析了查询性能和优化策略。

1 引擎

能够直接读取本地文件数据，通常用作扩充手段来使用
Create table table_name(
name String,
value UInt32
)engine=File(“CSV”)

1.1File引擎

能够直接读取本地文件数据，通常用作扩充手段来使用
Create table table_name(
name String,
value UInt32
)engine=File(“CSV”)

1.2内存类型的引擎

特点：将数据全部加载到内存，提供更好的查询性能，占用大量的内存空间

1.2.1 Memory

不支持持久化到磁盘，服务器重启数据会全部丢失

1.2.2 Set

特点：去重，可以持久化磁盘，不能select，只能作为 in查询的右部条件被使用

1.2.3 join 表引擎

可以持久化磁盘，能够被直接查询使用

1.2.4 Buffer

不能持久化，充当缓冲区

1.3 日志类型的表引擎

一次插入多次查询，数据量<100万

1.3.1 TinyLog

1.3.2 StripeLog

1.3.2 Log

1.4接口类表引擎

整合其他表数据，本身不存储表数据
engine=merge(database,table_name)
database 库名，table_name 表名（支持正则表达式，可以用在分表存储，联合查询上）

1.5 Dictionary

Dictionary表引擎是数据字典的一层代理封装，可以取代字段函数engin=dictionary（test_flat_dict）

1.6 Distributed表引擎

相当于Sharing方案中数据库中间件，本身不存储数据，能够作为分布式的一层透明代理，在集群内部自动开展数据的写入分发和查询路由工作

1.7 其他类型

12.5.1
Live View
可以针对sql查询结果作为监控目标，当目标数据增加时，Live View 可以及时发出响应
12.5.2
Null 表引擎
向Null表写入数据会返回成功，查询会返回一张空表
12.5.3
Url 表引擎
当执行select 时，会将请求转化成远程的get调用
当执行insert时，会将请求转化成远程post 调用

2 数据查询

大小写敏感，标准sql

2.1查询子句 with

（1）定义变量，增加可读性
如：select pow(pow(2,2),3) 等价于
with pow(2,2) as a select pow(a,3)
（2）调用函数
with sum() as …
（3）定义子查询
with （select 。。）as total 。。。
（4）在子查询中重复使用 with
select a,b from with (select …) as tt

2.2 from 子句

（1）可以使用final修饰，可以配合 CollapsingMergeTree 和 VersionedCollapsingMergeTree 等使用，强制在查询中合并，但是会降低性能

2.3 sample子句使用

能够实现采样功能只能用于MergeTree系列表引擎，并且要在创建表时生命采样表达式
create table hits_v1
(
…
UserID UInt64
)engine=MergeTree()
…
sample by intHash32(UserID)
（1）sample factor 按照百分比采样
select … from … sample 0.1
或者
select … from … sample 1/10
（2）sample rows 按照样本数量采样
select count() from … sample 10000
返回结果 9576 由于采样数据的最小粒度为index_granularity 索引粒度。结论：设置一个较小的采样值没有意义

（3）sample factor offset n
按照系数和偏移量采样
select countId from … sample 0.4 offset 0.5
在后百分之50数据中采样整体的百分之40
select countId from … sample 0.8 offset 0.5
在后百分之50数据中采样整体的百分之80，会溢出，自动截断，只采样后面百分之50

2.4 array join子句使用

在表内部将Array 类型的值展开成多行
如：
create table …
(
title String ,
value Array(UInt32)
)engin=log

select title,value from …
title value
food [1,2,3]
fruit [3,4]
meat []

inner array join (默认)
select title,value form … array join value
title value
food 1
food 2
food 3
fruit 3
fruit 4
left array join
select title,v,value form … array join value
title v value
food 1 [1,2,3]
food 2 [1,2,3]
food 3 [1,2,3]
fruit 3 [3,4]
fruit 4 [3,4]
meat 0 []

嵌套类型同理
create table …
(
title String ,
nest Nested(
v1 UInt32,
v2 UInt64
)
)engine = log

2.5 join子句使用

连接精度+连接类型
(1)连接精度
all/any/asof
默认为all，（可以配置）通过join key 进行只支持=式，建立连接；
cross join 不需要 join key，会产生笛卡尔积（两个表中的数据乘积，如A n条，B m条；共 n*m条）
（1）.1 all
左表中数据和右表中数据为乘积的关系
（1）.2 any
左表中数据和右表中数据多条匹配，则只返回第一条
（1）.3 asof
可以增加模糊匹配条件
如 select a.id ,a.name,b.rate,a.time,b.time form j_1 as a
asof inner join j_2 as b
on a.id = b.id and a.time >= b.time
连接键 a.id=b.id ,模糊匹配的条件 a.time>=b.time,返回结果为符合条件的右表中的第一条数据
可以使用 using（id，time）代替上面写法，time不能和id为同一个字段，必须是整形，浮点型符合日期等有序数据类型

(2)连接类型
(2).1内连接 inner
返回交集，按照左表连接键，遍历右表进行匹配

(2).2 outer
外连接 left/right/full(outer)
left 左连接左表全部数据，在右表中按照连接键匹配数据，右表匹配到多条数据显示（多条），右表无匹配数据，显示默认值
right 右连接右表全部数据，在左表中按照连接键匹配数据，左表匹配到多条数据显示（多条），左表无匹配数据，显示默认值
full 全连接，左连接和右连接的并集

(2).3 交叉连接 cross，笛卡尔积结合A*结合B
交叉连接 cross
表A （1,2,3）表B（a，b，c）
交叉结果 1,a;1,b;1,c;2,a;2,b;2,c;3,a;3,b;3,c;

(3)多表连接
多表连接=两两连接

(4)查询性能
(4).1 性能
左大右小原则（无论哪种连接方式，右表都会被整个加载）
无缓存支持
在大量维度属性需要补全的场景中，建议使用字典代替Join
(4).2 空值
可以通过配置 join_use_nulls 来指定默认值或者Null，
on a.id=b.id 等价于 using id

2.6 where prewhere

条件中有主键字段可通过索引优化查询
prewhere 可实现通过条件字段筛选数据后，加载查询字段信息的效果，比where 更优化（clickhouse 会自动启动这种优化）

2.7 group by 子句

select 后只有聚合函数可省略 group by

2.8 with rollup子句使用

按照聚合键从右向左上卷数据
如：
select table,name,sum() from system.parts group by table,name with rollup group by table;
除了按照table name 计算sum() ,还会按照 table 汇总小计
(8)With cube
像立方体一样聚合键之间的所有组合生成小计信息
select database,table,name,sum(bytes_on_disk) from
(select database,table,name,bytes_on_disk form system.parts where table =‘h1’)
group by database,table,name
with cube
order by database,table,name
如：上面例子组合
[空],[database,table,name],[datable],[table],[name],[datable,table],[datable,name],[table,name] 8种情况
database table name sum()
1460381504
201403_1_29_2 90381504
201403_1_6_1 80381504
h1 1460381504
h1 201403_1_29_2 90381504
h1 201403_1_6_1 80381504
dataset 90381504
dataset 201403_1_29_2 90381504
dataset h1 90381504
dataset h1 201403_1_29_2 90381504
default 201403_1_6_1 80381504
default 80381504
default h1 80381504
default h1 201403_1_6_1 80381504