大数据工具——Hive（高级）

最新推荐文章于 2024-10-28 18:45:19 发布

NoMissU

最新推荐文章于 2024-10-28 18:45:19 发布

阅读量324

点赞数 1

文章标签： hive hadoop 数据库

本文链接：https://blog.youkuaiyun.com/m0_45993982/article/details/116277833

版权

本文详细介绍了Hive的高级查询语法，包括Join的特点、查询字句如WHERE、GROUP BY和LIMIT的使用，以及数据类型的深入探讨，特别是复杂数据类型如列表和Map的操作。同时，讲解了系统内置函数、窗口函数的应用，并提到了自定义函数的分类和加载方式。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一、查询基本语法

1.查询基本结构

写顺序

select 
from
join on 
group by 
having
order by
sort by
limit
union / union all

执行顺序

from
on
join
where
group by 
having
select
distinct
order by
limit

查询注意事项

尽量不要使用子查询、尽量不要使用 in not in 
select * from aa1 
where id in (select id from bb); 
查询尽量避免join连接查询，但是这种操作咱们是永远避免不了的。 
查询永远是小表驱动大表（永远是小结果集驱动大结果集）

二、Join的语法与特点

1. 表之间的关系

在关系型数据库里面，每个实体有自己的一张表（table），所有属性都是这张表的字段（field），表与表之间根据关联字段"连接"（join）在一起。所以，表的连接是关系型数据库的核心问题。

所谓"连接"，就是两张表根据关联字段，组合成一个数据集。
问题是，两张表的关联字段的值往往是不一致的，如果关联字段不匹配，怎么处理？
比如，表 A 包含张三和李四，表 B 包含李四和王五，匹配的只有李四这一条记录

只返回两张表匹配的记录，这叫内连接（inner join）。
返回匹配的记录，以及表 A 多余的记录，这叫左连接（left join）。
返回匹配的记录，以及表 B 多余的记录，这叫右连接（right join）。
返回匹配的记录，以及表 A 和表 B 各自的多余记录，这叫全连接（full join）。

在这里插入图片描述

2. Hive专有Jion的特点

2.1 left semi join
在hive中，有一种专有的join操作,left semi join,我们称之为半开连接。它是left join的一种优化形式，只能查询左表的信息，主要用于解决hive中左表的数据是否存在的问题。相当于exists关键字的用法。

2.2 子查询

hive对子查询支持不是很友好，特别是 "="问题较多;
分区字段对outer join 中的on条件无效，对inner join 中的on条件有效

2.3 map-side join
如果所用的表中有小表，将会把小表缓存在内存中，然后在map端进行连接查找。hive在map端 查找时会减小整体查询量，从内存中读取缓存的小表数据，效率较快，还省去大量数据传输和shuffle耗时

三、查询字句

1.where

后不能跟聚合函数

2.group by:

分组，通常和聚合函数搭配使用,查询的字段要么出现在group by 后面，要么出现在聚合函数里面

3.Hiving

是对分组以后的结果集进行过滤。

4.Limit

限制从从结果集中取数据的条数,一般用于分页

5.排序

order by
全局排序:引发全表扫描,reduce数量一个,不建议使用
sort by
默认分区内排序,当reduceTask数量是1时候,那么效果和order by 一样
一般和distribute by 搭配使用
distribute by
用来确定用哪个列(字段)来分区,一般要写在sort by的前面
cluster by
分区的列和分区内排序的列相同时,那么可以用cluster by deptno来代替 distribute by deptno sort
cluster by ：兼有distribute by以及sort by的升序功能。
排序只能是升序排序（默认排序规则），不能指定排序规则为asc 或者desc。

四、数据类型

1.简单数据类型

create table if not exists bs1( 
id1 tinyint, 
id2 smallint, 
id3 int, id4 bigint, 
sla float, 
sla1 double, 
isok boolean, 
content binary, 
dt timestamp 
)
row format delimited fields terminated by ','; 
-- 准备要插入的数据: 
23, 12,342523,455345345,30000,600005,nihao, helloworld ,2017-06-02 11:41:30

2.复杂数据类型

在这里插入图片描述

1. 列表

1.1.列转行

创建array类型的表

create table if not exists arr1 ( 
name string, 
score array<string>
) 
row format delimited fields terminated by '\t' 
collection items terminated by ',';

内嵌查询

explode:就是把一列数据转化成多行

select explode(score) score from arr1;

虚拟表

lateral view 会将explode生成的结果放到一个虚拟表中，然后这个虚拟表会和当前表join,来达到数据聚合的目的。

select name,cj from arr1 lateral view explode(score) score as cj;

2. 行转列

collect_set函数：它们都是将分组中的某列转为一个数组返回

select name,collect_set(sorce) from student group by name;

3. map

创建map类型的表

create table if not exists map1( 
name string, 
score map<string,int> 
)
row format delimited fields terminated by ' ' 
collection items terminated by ',' 
map keys terminated by ':';

Map格式数据查询

#查询数学大于35分的学生的英语和自然成绩： 
select 
m.name, 
m.score['english'] , 
m.score['nature'] 
from map1 m 
where m.score['math'] > 35;

3.1Map列转行

explode

explode 展开数据
select explode(score) as (m_class,m_score) from map1;

Lateral view

Lateral View和split，explode等一起使用，它能够将一行数据拆成多行数据，并在此基础上对拆分后的数据进行聚合。
select name,m_class,m_score from map1 lateral view explode(score) score as m_class,m_score;

3.2Map行转列

创建临时表

create table map_temp( 
name string, 
score1 int, 
score2 int, 
score3 int 
)
row format delimited fields terminated by ',';

导入新表map2（建表语句同map1）

insert into map2 
select name,map('chinese',score1,'math',score2,'english',score3) 
from map_temp;

struct

创建表

create table if not exists tax(
id int,
name string, 
belong array<string>, 
tax map<string,double>, 
addr struct<province:string,city:string,road:string> )
row format delimited fields terminated by ' ' 
collection items terminated by ',' 
map keys terminated by ':' 
stored as textfile;

表中数据案例

uid uname belong tax addr 
1 xdd ll,lw,lg,lm wuxian:300,gongjijin:1200,shebao:300 北京,西城区,中南海 
2 lkq lg,lm,lw,ll,mm wuxian:200,gongjijin:1000,shebao:200 河北,石家庄,中山路

查询例子

#查询：下属个数大于4个，公积金小于1200，省份在河北的数据 
select id, name, belong[0], belong[1], tax['wuxian'], tax['shebao'], addr.road 
from tax 
where size(belong) > 4 and 
	  tax['gongjijin'] < 1200 and 
      addr.province = '河北';

五、系统内置函数

1.函数查看

-- 显示Hive中所有函数 
show functions; 

-- 查看某个函数的用法 
desc function array;

2. 日期函数

-- 时间戳转日期 
select from_unixtime(1505456567); 
select from_unixtime(1505456567,'yyyyMMdd'); 
select from_unixtime(1505456567,'yyyy-MM-dd HH:mm:ss'); 
-- 获取当前时间戳 
select unix_timestamp();
-- 日期转时间戳 
select unix_timestamp('2017-09-15 14:23:00'); 
-- 计算时间差 
select datediff('2018-06-18','2018-11-21'); 
-- 查询当月第几天 
select dayofmonth(current_date); 
-- 月末: 
select last_day(current_date); 
--当月第1天: 
select date_sub(current_date,dayofmonth(current_date)-1); 
--下个月第1天: 
select add_months(date_sub(current_date,dayofmonth(current_date)-1),1); 
-- 当前日期 
select current_date 
-- 字符串转时间（字符串必须为：yyyy-MM-dd格式） 
select to_date('2017-01-01 12:12:12'); 
-- 日期、时间戳、字符串类型格式化输出标准时间格式： 
select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'); 
select date_format(current_date(),'yyyyMMdd'); 
select date_format('2017-01-01','yyyy-MM-dd HH:mm:ss');

3. 字符串函数

lower--（转小写） 
upper--（转大写）
length--（字符串长度，字符数)
concat--（字符串拼接） 
concat_ws --（指定分隔符）
substr--（求子串）

4. 类型转换函数

cast(value as type) -- 类型转换 
select cast('123' as int)+1;

5. 数学函数

round --四舍五入((42.3 =>42)) 
ceil --向上取整(42.3 =>43) 
floor --向下取整(42.3 =>42)

六、窗口函数

1.简介

窗口函数又名开窗函数，属于分析函数的一种。用于解决复杂报表统计需求的功能强大的函数。窗口函数用于计算基于组的某种聚合值，它和聚合函数的不同之处是：对于每个组返回多行，而聚合函数对于每个组只返回一行。开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化。

2.语法

create table if not exists t_order ( 
name string, 
orderdate string, 
cost int 
) 
row format delimited fields terminated by ',';

1. over 开窗

-- 1.不使用窗口函数 
-- 查询所有明细 
select * from t_order;
# 查询总量 select count(*) from t_order; 
 
-- 2.使用窗口函数 
select *, count(*) over() from t_order;

2. partition by子句

在over窗口中进行分区,对某一列进行分区统计,窗口的大小就是分区的大小

select name, orderdate, cost, sum(cost) over (partition by month(orderdate)) 
from t_order;

3. order by

order by子句会让输入的数据强制排序

select name, orderdate, cost, 
       sum(cost) over (partition by month(orderdate) order by orderdate) 
from t_order;

4.Window子句

在这里插入图片描述

select name,orderdate,cost,
       sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7  
from t_order;

七、自定义函数

1. 自定义函数分类

UDF：用户自定义函数，user defined function。一对一的输入输出最常用的）。
UDTF：用户自定义表生成函数。user defined table-generate function.一对多的输入输出。lateral view explode
UDAF：用户自定义聚合函数。user defined aggregate function。多对一的输入输出count sum max。

2.函数加载方式

2.1命令加载

# 进入到hive客户端,执行下面命令 
hive> add jar /opt/jar/udf.jar

-- 创建一个临时函数名,要跟上面hive在同一个session里面： 
hive> create temporary function toUP as 'com.hive.FirstUDF'; 

-- 检查函数是否创建成功 
hive> show functions; 

-- 测试功能
hive> select toUp('abcdef');

-- 删除函数
hive> drop temporary function if exists toUP;

2.2配置文件加载

# 1、将编写的udf的jar包上传到服务器上 
# 2、在hive的安装目录的bin目录下创建一个配置文件，文件名：.hiverc [root@hadoop01 hive]# vi ./bin/.hiverc 
add jar /hivedata/udf.jar; 
create temporary function toup as 'com.hive.FirstUDF'; 
3、启动hive 
[root@hadoop01 hive]# hive