【HQL】HiveQL 的查询语句

最新推荐文章于 2024-05-09 00:30:00 发布

和风与影

最新推荐文章于 2024-05-09 00:30:00 发布

阅读量678

点赞数

分类专栏： Hive HQL 文章标签： hive

本文链接：https://blog.youkuaiyun.com/weixin_45545090/article/details/124516168

版权

Hive 同时被 2 个专栏收录

17 篇文章

订阅专栏

HQL

11 篇文章

订阅专栏

今天学习 HQL 的查询语句。欢迎关注公众号回复 803 获取 Hive 编程指南。

1.select from 语句

以下面的表为例：

# 创建分区表
create table employees (	# mydb 指定了要创建表的数据库是 mydb
	name			string,
    salary			float,
    subordinates	array<string>,
    deductions		map<string, float>,
    address			struct<street:string, city:string, state:string, zip:int>)
partitioned by (country string, state string);

select 的结果，string 类型的列不加引号，集合的字符串元素会加引号，MAP 和 STRUCT 的结果用 JSON 格式表示。

# 集合内的元素用下标
select name,subordinates[0] from employees;

# MAP 元素用 key 作为下标
select name,deductions["State Taxes"] from employees;

# STRUCT 用 点 访问
select name,address.city from employees;

1.1 使用正则表达式指定列

# 选择 price 作为前缀的列
select symbol,'price.*' from stocks;

1.2 使用列值进行计算

选择的同时调用函数来处理

select	upper(name),salary,deductions["Federal Taxes"],
		round(salary*(1-deductions["Federal Taxes"]))
from	employees;

1.3 算术运算符

+ - * / % & | ^ ~

1.4 使用函数

见之前的文章。
Hive 中的常用函数合集。

1.5 limit 语句

限制返回行数。

1.6 列别名

用 as 取别名

1.7 嵌套 select 语句

可以将 from 写在前面。

from (
	select	upper(name),
    		salary,
    		deductions["Federal Taxes"] as def_taxes,
    		round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    from
    		employees
	) e
select	e.name,
		e.salary_minus_fed_taxes
where	e.salary_minus_fed_taxes > 70000;

1.8 case…when…then 语句

类似 MySQL。

1.9 本地模式

本地模式查询可以不执行 MapReduce 程序，直接读取文件，设置如下：

set hive.exec.mode.local.auto=true;

可以不执行 MapReduce 程序进行查询的例子如下：

# 直接读取整个文件
select * from employees;

# where 后是分区也是直接读取整个分区文件
select * from employees
where country='US' and state='CA'
limit 100;

2.where 语句

where 语句可在查询时用条件过滤。

2.1 常用条件

A=B, A<B, A>B, A<=B, A>=B
A<=>B：A 和 B 都为 NULL 返回 true，有一个为 NULL 返回 NULL，其他和=一致；
A<>B,A!=B：有一个为 NULL 返回 NULL，否则相等返回 TRUE，不等返回 FALSE；
A [not] between B and C：A、B、C 有一个为 NULL 返回 NULL，否则根据 A 是否在 B、C 直接返回结果；
A is [not] NULL
A rlike B, A regexp B：B 是正则表达式，判断 A 是否匹配。

2.2 浮点数比较

float 的 0.2 比 double 的 0.2 大，因为浮点数不能精确表示，0.2 的 float 类型是 0.2000001，double 类型是 0.200000000001，所有用where 过滤时，如果 float 类型的 0.2 判断 > 0.2 时，会返回 true，别筛选出来。所以如果 A 是 float 类型的 0.2，要筛选 > 0.2，可以这样做：

# 错误做法
where A > 0.2;

# 正确做法
where A > cast(0.2 as float);

2.3 like 和 rlike

rlike 可以用 Java 的正则表达式，功能更强大。例如：

# rlike 的写法
where street rlike '.*(Chicago|Ontario).*'

# like 的写法
where street like '%Chicago%' or street like '%Ontario%';

3.gourp by … having

类似 MySQL

4.join

Hive 只支持等值连接。例如下面是错误的：

# 错误
a join b on a.ymd <= b.ymd

因为 MapReduce 很难实现这种类型。

Hive 也不支持 on 子句中使用 or，只能 union all 代替。

多张表 join 时，每个 join 会启动一个 MapReduce job。

join 时小表放在前面可以启动 map-side join。

如果分区过滤条件放在 on 语句中，对于外连接会忽略掉分区过滤条件，对内连接是可用的。所以在 Hive 中进行 join 时，一般先用子查询将每个表中需要的字段先查找出来，再对结果进行 join 操作。

inner join、left outer join、right outer join、full outer join 类似其他数据库的操作，下面将一些使用较少的。

4.1 left semi join

返回左表的记录，前提是该记录在右表存在。例如：

# 例如下面的语法，Hive 中是不支持的
select s.ymd,s.symbol,s.price_close from stocks s
where s.ymd,s.symbol in
(select d.ymd, d.symbol from dividends d);

# Hive 中实现
select s.ymd,s.symbol,s.price_close
from stocks s left semi join dividends
on s.ymd = d.ymd and s.symbol=d.symbol;

4.2 笛卡尔积 join

错误的连接可能导致笛卡尔积，它会产生大量的数据，无法优化。设置hive.mapred.mod为 strict 会阻止笛卡尔积。

4.3 map-side join

设置开启 map-side join，并设置使用这个优化时小表的大小：

# 开启 map-side join
set hive.auto.convert.join=true;

# 设置小表大小
set hive.mapjoin.smalltable.filesize=25000000

# 分桶存储的大表也是可以使用的
set hive.optimize.bucketmapJOIN=true;

如果所有的表有相同的分桶数，且数据是按照连接键或桶的键排序的，那么可以使用更快的分类-合并连接（sort-merge JOIN）：

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucektmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

5.order by 和 sort by

order by 和其他 SQL 一样会全局排序。所有数据用一个 reducer 处理，数据量大耗时会很长。

sort by 只在每个 reducer 内部排序，保证每个 reducer 有序，是局部排序。

6.sort by + distribute by

比如我们希望一个字段中相同值发送到同一个 reducer 中处理，并对其排序，那么可以使用 sort by + distribute by，distribute by 要写在 sort by 前面：

select s.ymd,s.symbol,s.price_close
from stocks s
distribute by s.symbol
sort by s.symbol asc,s.ymd asc;

7.cluseter by

distribute by 和 sort by 是同一个字段时，相当于 cluster by：

select s.ymd,s.symbol,s.price_close
from stocks s
cluster by s.symbol;

8.类型转换

语法如下，如果 value 非法会返回 NULL：

cast(value as type)

binary 只能转 string，可以嵌套 cast 转为数字：

cast(cast(b as string) as double)

9.抽样查询

语法：

TABLESAMPLE (BUCKET x OUT OF y [ON colname])

colname 表示抽样的列，y 表示分多少个桶，x 表示桶的编号（从 1 开始）。例如把表随机分成 10 组取第 3 个桶：

select * from numbers tablesample(bucket 3 out of 10 on rand()) s;

上面的结果每次执行都是随机的。如果是指定列而非 rand() 函数，那么多次执行结果不变：

select * from numbers tablesample(bucket 3 out of 10 on number) s;

9.1 数据块抽样

按百分比抽样：

select * from numbers tablesample(0.1 percent) s;

抽样的最小单一是一个 HDFS 数据块。数据量小于一个数据块时会返回所有结果。

9.2 分桶表的输入裁剪

如果是分桶表的情况下，并且 tablesample 和 clustered by 指定相同的列，那么 tablesample 查询只会扫描表的该分桶下的数据：

# 创建分桶表
create table numbers_bucketed (number int) clustered by (number) into 3 buckets;

# 假如第 2 个分桶中的全部数据是 1 7 10 4
# 那么下面语句的结果是 1 7 10 4
select * from numbers_bucketed tablesample(bucket 2 out of 10 on number) s;