【Hive_03】

Original · Latest recommended article published on 2024-05-07 18:46:59 · 462 views

License: CC 4.0 BY-SA

Tags: #hive #hadoop #bigdata

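This article covers Hive's complex data types (arrays, maps, and structs) and how to create and work with managed and external tables. Worked examples show how to create tables with complex-typed columns and how to query and transform their data. It also covers partitioned tables, static and dynamic partition inserts, and window functions, for example computing a running count of takeout orders.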


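1. How do you get Chinese text in Hive table comments to display correctly?
2. How do you set the beeline log level?

1. Common ways to start a service
1. Starting a service with a shell command
1. Run the script directly in the current session => for testing only
Drawback: once the session is closed, the service stops.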

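	2. Recommended:
		Start the service so that it runs in the background 
		[hadoop@bigdata32 ~]$ nohup hiveserver2 & 
		[hadoop@bigdata32 ~]$ nohup: ignoring input and appending output to ‘nohup.out’
			hiveserver2 is now running in the background of the current session 
			Its log output is written to nohup.out

		A further improvement: 
		 nohup hiveserver2 > ~/logs/hiveserver2.log 2>&1 &	
		 2>&1 => redirects stderr to stdout, so both streams end up in hiveserver2.log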

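1.table
1. Managed (internal) tables vs external tables
Managed table (MANAGED_TABLE) [Hive manages the table's data]:
External table: EXTERNAL

Differences:
Managed table: data + metadata => drop deletes both the table's data and its metadata
External table: metadata only => drop deletes only the metadata; the table's files on HDFS are kept

Managed: emp_manager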
create table emp_manager as select * from emp;

CREATE TABLE emp (
empno decimal(4,0) ,
ename string ,
job string ,
mgr decimal(4,0) ,
hiredate string ,
sal decimal(7,2) ,
comm decimal(7,2) ,
deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;

drop table emp_manager;

CREATE EXTERNAL TABLE emp_external (
empno decimal(4,0) ,
ename string ,
job string ,
mgr decimal(4,0) ,
hiredate string ,
sal decimal(7,2) ,
comm decimal(7,2) ,
deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;

Loading data

Local Linux file: emp.txt
hadoop fs -put ./emp.txt hdfs://bigdata32:9000/user/hive/warehouse/bigdata_hive.db/emp_external

Takeaway:
table = data on HDFS + metadata in the metastore

2. Converting between managed and external tables
alter table :
ALTER TABLE table_name SET TBLPROPERTIES table_properties;

	alter table emp_external set tblproperties ("EXTERNAL"="false");
	alter table emp_external set tblproperties ("EXTERNAL"="true");

alter table emp_external set tblproperties ("external"="false"); => does not work: the key must be uppercase "EXTERNAL"

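3. Complex data types
Complex data types:
1. Not used very often in general [but used a lot at small and mid-sized companies]
2. What you need to be able to do:
1. create table: create a table with complex-typed columns
2. select column: query such a table

1. Arrays:
ARRAY<data_type>

Q: if a column is declared as an array type, can one array hold values of different data types?
e.g. 1, 2, "hello"
A: no. Every element must be of the declared data_type.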

[hadoop@bigdata32 data]$ cat hive_array.txt
zhangsan beijing,shanghai,dalian,shenyang
lisi chengdu,hangzhou,shanghai,wuxi

create table hive_array(
name string,
locations array<string>
)
row format delimited fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/hadoop/tmp/data/hive_array.txt' into table hive_array;

Example queries:
1. What is each user's first work location?
select name ,locations[0] as first_loc_work from hive_array;
2. How many work locations does each person have?
select name , size(locations) from hive_array ;
3. Who works in shanghai?
select * from hive_array where array_contains(locations,'shanghai');

name locations
zhangsan ["beijing","shanghai","dalian","shenyang"]
lisi ["chengdu","hangzhou","shanghai","wuxi"]

Expanding the array into rows (one row per element):
name locations
zhangsan beijing
zhangsan shanghai
zhangsan dalian
zhangsan shenyang
4. Expand every value in locations
Lateral View:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias
udtf => one row in, multiple rows out
FROM baseTable (lateralView)*

select name,location
from hive_array lateral view explode(locations) loc_table as location;
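
To make the "one row in, multiple rows out" behaviour concrete, here is a small Python sketch (not Hive code) that mimics what `lateral view explode(locations)` does to the two sample rows. The helper name `explode_rows` is made up for illustration only.

```python
# Toy model of `lateral view explode(locations)`: one output row per array element.
rows = [
    ("zhangsan", ["beijing", "shanghai", "dalian", "shenyang"]),
    ("lisi", ["chengdu", "hangzhou", "shanghai", "wuxi"]),
]


def explode_rows(table):
    """Yield (name, location) once for every element of the array column."""
    for name, locations in table:
        for location in locations:
            yield name, location


exploded = list(explode_rows(rows))
for name, location in exploded:
    print(name, location)
```

Each input row contributes as many output rows as its array has elements, and the non-array column (`name`) is repeated on every one of them.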

2. Maps:
MAP<primitive_type, data_type>

[hadoop@bigdata32 data]$ cat hive_map.txt
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

id name relations age

create table hive_map(
id int comment 'user id',
name string comment 'user name',
relation map<string,string> comment 'family members',
age int comment 'age'
)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

load data local inpath '/home/hadoop/tmp/data/hive_map.txt' into table hive_map;

Requirements:
1. Get the name of each person's father

select id,name,age,relation['father'] as father from hive_map;

2. Get each person's family member roles (the map keys)

select id,name,age,map_keys(relation) as members from hive_map;
3. Get the names of each person's family members (the map values)
select id,name,age,map_values(relation) as members from hive_map;

4. Find the people who have a brother, together with the brother's name

select
id,name,age,relation['brother'] as brother
from hive_map
where
relation['brother'] is not null;

or
select
id,name,age,relation['brother'] as brother
from hive_map
where
array_contains(map_keys(relation), 'brother');
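
As a rough illustration of how the three delimiters in the DDL carve up one line of hive_map.txt, the following Python sketch parses two of the sample lines by hand. It is only a toy model of the text format; `parse_line` is a hypothetical helper, and a missing map key is represented by `None` where Hive would return NULL.

```python
# Toy parser for hive_map.txt using the delimiters declared in the DDL:
# fields ',', collection items '#', map keys ':'.
def parse_line(line):
    id_, name, relation_raw, age = line.split(",")
    relation = dict(item.split(":", 1) for item in relation_raw.split("#"))
    return {"id": int(id_), "name": name, "relation": relation, "age": int(age)}


lines = [
    "1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28",
    "4,mayun,father:mayongzhen#mother:angelababy,26",
]
table = [parse_line(line) for line in lines]

# Equivalent of: where relation['brother'] is not null
with_brother = [
    (row["name"], row["relation"]["brother"])
    for row in table
    if row["relation"].get("brother") is not None
]
print(with_brother)
```

Only zhangsan survives the filter, because mayun's map has no `brother` key.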

3. Structs (comparable to a Java bean):
STRUCT<col_name : data_type [COMMENT col_comment], ...>

[hadoop@bigdata32 data]$ cat hive_struct.txt
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

create table hive_struct(
ip string,
userinfo STRUCT<name:string,age:int>
)
row format delimited fields terminated by '#'
collection items terminated by ':';
load data local inpath '/home/hadoop/tmp/data/hive_struct.txt' into table hive_struct;

Requirement: just extract the fields inside the struct

A struct is like a class with attributes; access a field with the dot (.) operator

userinfo:
name
age
select ip,userinfo.name as name ,userinfo.age as age from hive_struct;

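4.table:
1. Managed tables:
1. Regular table
2. Partitioned table
2. External tables:
1. Regular table
2. Partitioned table

1. Regular tables: 
	managed
	external
2. Partitioned tables:
	managed
	external

Partitioned tables:
1. The table has one or more partition columns in addition to its regular columns
2. Purpose: faster queries

Use cases:
Regular table: usually holds a fairly small amount of data
Partitioned table: typically partitioned by date (dt)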

Partitioned table:
create table order_info (
orderid string,
ordertime string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Loading data
-- Not OK: without a partition clause the data goes into the default partition
load data local inpath '/home/hadoop/tmp/data/order.txt' into table order_info;

load data local inpath '/home/hadoop/tmp/data/order.txt'
into table order_info partition(dt='20211014');

List the table's partitions:
show partitions order_info;
Drop a partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]

alter table order_info drop PARTITION(dt='__HIVE_DEFAULT_PARTITION__');

emp: create a partitioned table

drop table emp_p;
CREATE TABLE emp_p (
empno decimal(4,0) ,
ename string ,
job string ,
mgr decimal(4,0) ,
hiredate string ,
sal decimal(7,2) ,
comm decimal(7,2)
)
PARTITIONED BY (deptno decimal(2,0))
row format delimited fields terminated by '\t'
stored as textfile;

Loading data (two ways):
load
insert
load data local inpath "/home/hadoop/tmp/data/emp_p.txt" overwrite into table emp_p partition(deptno=30);

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …)
[IF NOT EXISTS]] select_statement1 FROM from_statement;

insert into table emp_p partition(deptno=20)
select
empno,
ename,
job ,
mgr ,
hiredate,
sal ,
comm
from emp where deptno=20;

insert overwrite table emp_p partition(deptno=20)
select
empno,
ename,
job ,
mgr ,
hiredate,
sal ,
comm
from emp where deptno=20;

Question:
Can a single SQL statement put every row into its matching partition?

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp_p partition(deptno)
select
empno,
ename,
job ,
mgr ,
hiredate,
sal ,
comm ,
deptno
from emp;
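
The idea behind a dynamic-partition insert can be sketched in a few lines of Python: the value of the last selected column decides which partition directory a row lands in. This is only a toy model, not how Hive is implemented; the sample rows and the helper `dynamic_partition` are invented for illustration.

```python
from collections import defaultdict

# A few (empno, ename, deptno) rows; the values are illustrative only.
emp = [
    (7369, "SMITH", 20),
    (7499, "ALLEN", 30),
    (7782, "CLARK", 10),
    (7788, "SCOTT", 20),
]


def dynamic_partition(rows):
    """Route each row to a partition named after its last column (deptno)."""
    partitions = defaultdict(list)
    for *data, deptno in rows:
        partitions[f"deptno={deptno}"].append(tuple(data))
    return dict(partitions)


parts = dynamic_partition(emp)
for path in sorted(parts):
    print(path, parts[path])
```

Note that the partition column is not stored with the row data; it is encoded in the partition name, which is why `deptno` must come last in the select list.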

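insert data:
1. Static partitioning
2. Dynamic partitioning: requires turning on a switch (hive.exec.dynamic.partition.mode=nonstrict)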

insert overwrite table emp_p partition(deptno)
select
empno,
ename,
job ,
mgr ,
hiredate,
sal ,
comm ,
deptno
from emp where deptno=20;

insert overwrite table emp_p partition(deptno)
select
empno,
ename,
job ,
mgr ,
hiredate,
sal ,
comm ,
20 as deptno
from emp where deptno=20;

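Inserting data into a Hive partitioned table: deptno

emp
emp_p

Dynamic partitioning
Static partitioning: with 1,000,000 partitions
insert overwrite table emp_p partition(deptno=xxx)
select
xx
from emp where deptno=xxxx

you would have to run this 1,000,000 times

Dynamic partitioning: each row lands in the correct partition
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp_p partition(deptno)
select
xx
from emp;

Offline (batch) workloads: jobs run periodically, T+1
dt
The job that runs on 12-01 => processes the previous day's data
Dynamic:
insert overwrite table emp_p partition(deptno)
select
xx
from emp where dt= yesterday;
Static:

insert overwrite table emp_p partition(deptno='yesterday')
select
xx
from emp where dt= yesterday;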

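5. Window functions
Aggregate functions: combine multiple rows into a single row according to some rule
In theory, number of rows after aggregation <= number of rows before aggregation
Requirement:
What if we need to show both the original rows and the aggregated values?
e.g.
id name sal
1 zs 30000
2 ls 25000
3 ww 20000

Requirement: sort by salary in descending order and also show each row's rank
id name sal rank
1 zs 30000 1
2 ls 25000 2
3 ww 20000 3

Window function = window + function
Window: the set of rows the function computes over
Function: the function evaluated over that window
Syntax:
function over([partition by xxx,...] [order by xxx,...])
over: defines the window over the table's rows
partition by: the table columns to group by
order by: the table columns to sort by
function: a dedicated window function or an aggregate function
Data: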
haige,2022-11-10,1
haige,2022-11-11,5
haige,2022-11-12,7
haige,2022-11-13,3
haige,2022-11-14,2
haige,2022-11-15,4
haige,2022-11-16,4

create table user_mt (
name string,
dt string,
cnt int
)
row format delimited fields terminated by ',' ;

load data local inpath '/home/hadoop/tmp/mt.txt' overwrite into table user_mt;
Requirement:
A running-total problem: for each user, the cumulative number of takeout orders up to each day

[partition by xxx,...] [order by xxx,...]

select
name ,
dt ,
cnt ,
sum(cnt) over(partition by name order by dt ) as sum_cnt
from user_mt;
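
For the sample data, the running total that `sum(cnt) over(partition by name order by dt)` produces can be checked by hand with a short Python sketch. There is only one user (haige) and every dt is distinct, so a plain prefix sum over the rows in date order gives the same numbers; this is a toy check, not Hive code.

```python
from itertools import accumulate

# cnt values for user haige, already ordered by dt (2022-11-10 .. 2022-11-16)
cnt = [1, 5, 7, 3, 2, 4, 4]

# Running total up to and including the current row
sum_cnt = list(accumulate(cnt))
print(sum_cnt)  # [1, 6, 13, 16, 18, 22, 26]
```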
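Note:
A window function is evaluated after group by, so within a single query it can only see grouping columns and aggregate results; the simplest approach is to aggregate in a subquery first and apply the window function in the outer query

Specifying the window frame: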
select
name ,
dt ,
cnt ,
sum(cnt) over(partition by name order by dt ) as sum_cnt,
sum(cnt) over(partition by name order by dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as sum_cnt2,
sum(cnt) over(partition by name order by dt ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) as sum_cnt3,
sum(cnt) over(partition by name order by dt ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING ) as sum_cnt4
from user_mt;
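
The three `ROWS BETWEEN` frames above can be modelled in Python as slices around the current row. The sketch below is a toy model for a single partition that is already sorted by dt; the helper `rows_between` is hypothetical and takes `None` for an unbounded side.

```python
cnt = [1, 5, 7, 3, 2, 4, 4]  # haige's cnt values ordered by dt


def rows_between(values, preceding, following):
    """Sum each row's frame: `preceding` rows before it through `following` rows after it.

    Pass None for an unbounded side.
    """
    out = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = len(values) if following is None else min(len(values), i + following + 1)
        out.append(sum(values[start:end]))
    return out


sum_cnt2 = rows_between(cnt, None, 0)  # UNBOUNDED PRECEDING AND CURRENT ROW
sum_cnt3 = rows_between(cnt, 3, 0)     # 3 PRECEDING AND CURRENT ROW
sum_cnt4 = rows_between(cnt, 0, 1)     # CURRENT ROW AND 1 FOLLOWING
print(sum_cnt2)  # [1, 6, 13, 16, 18, 22, 26]
print(sum_cnt3)  # [1, 6, 13, 16, 17, 16, 13]
print(sum_cnt4)  # [6, 12, 10, 5, 6, 8, 4]
```

Near the edges of the partition the frame simply contains fewer rows, which is why the first value of `sum_cnt3` is 1 and the last value of `sum_cnt4` is 4.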

select
name ,
dt ,
cnt ,
sum(cnt) over(partition by name order by dt ) as sum_cnt,
sum(cnt) over( order by dt ) as sum_cnt1
from user_mt;

Note:
Without partition by, the window covers the whole table

Requirement:
Given the following user visit data
userid dt cnt
u01,2017/01/21,5
u02,2017/01/23,6
u03,2017/01/22,8
u04,2017/01/20,3
u01,2017/01/23,6
u01,2017/02/21,8
u02,2017/01/23,6
u01,2017/02/22,4

Requirement:
Use SQL to compute each user's cumulative visit count by month
userid month subtotal cumulative
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12

1. Compute each user's visit count per month

dt: 2017/01/21 => 2017-01-21
mon: 2017-01

select
userid,
date_format(replace(dt,'/','-'),'yyyy-MM') as mon,
sum(cnt) cnt_sum
from user_log
group by
userid,date_format(replace(dt,'/','-'),'yyyy-MM');

2. Build on that result to compute the cumulative visit count
select
userid,
mon,
cnt_sum,
sum(cnt_sum) over(partition by userid order by mon ) as cnt_all
from
(
select
userid,
date_format(replace(dt,'/','-'),'yyyy-MM') as mon,
sum(cnt) as cnt_sum
from user_log
group by
userid,date_format(replace(dt,'/','-'),'yyyy-MM')
) as a ;
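
The two-step approach (aggregate per user and month, then take a running total per user) can be verified against the expected output with a small Python sketch. It is a toy re-implementation of the logic on the eight sample rows, not Hive code.

```python
from collections import defaultdict
from itertools import accumulate

visits = [
    ("u01", "2017/01/21", 5), ("u02", "2017/01/23", 6), ("u03", "2017/01/22", 8),
    ("u04", "2017/01/20", 3), ("u01", "2017/01/23", 6), ("u01", "2017/02/21", 8),
    ("u02", "2017/01/23", 6), ("u01", "2017/02/22", 4),
]

# Step 1: group by (userid, month) and sum cnt
monthly = defaultdict(int)
for userid, dt, cnt in visits:
    month = dt[:7].replace("/", "-")  # '2017/01/21' -> '2017-01'
    monthly[(userid, month)] += cnt

# Step 2: running total per user, ordered by month
result = []
for userid in sorted({user for user, _ in monthly}):
    months = sorted(m for user, m in monthly if user == userid)
    subtotals = [monthly[(userid, m)] for m in months]
    for m, subtotal, total in zip(months, subtotals, accumulate(subtotals)):
        result.append((userid, m, subtotal, total))

for row in result:
    print(*row)
```

The first three rows match the expected output listed in the requirement; u03 and u04 each contribute one more row with a single month.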

create table user_log(
userid string,
dt string,
cnt int
)
row format delimited fields terminated by ',' ;

load data local inpath "/home/hadoop/tmp/data/exemple/user_visit.txt" into table user_log;

Window functions:
Functions:
1. Dedicated window functions
1. Ranking functions
2. Serial (sequence) functions
2. Aggregate functions

rank() over(partition by xx order by xxx) as rk
row_number() over(partition by xx order by xxx) as rn
dense_rank() over(partition by xx order by xxx) as dk

select
name,
dt,
cnt,
rank() over(partition by name order by dt) as rk,
row_number() over(partition by name order by dt) as rn,
dense_rank() over(partition by name order by dt) as dk
from user_mt;

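Summary:
rank(): numbers the rows of each partition in order, starting from 1; tied rows share a rank and leave gaps in the sequence
row_number(): numbers the rows of each partition in order, starting from 1; tied rows still get distinct numbers
dense_rank(): numbers the rows of each partition in order, starting from 1; tied rows share a rank and no gaps are left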
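
The difference between the three functions only shows up when there are ties. The Python sketch below is a toy model for one partition that is already sorted; the scores are invented so that a tie occurs, and `rank_functions` is a hypothetical helper.

```python
def rank_functions(values):
    """Return (value, rank, row_number, dense_rank) for already-sorted values."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that equals no real value
    for row_number, value in enumerate(values, start=1):
        if value != prev:
            rank = row_number  # rank jumps to the row position, leaving gaps after ties
            dense += 1         # dense_rank grows by exactly one per distinct value
            prev = value
        out.append((value, rank, row_number, dense))
    return out


ranked = rank_functions([90, 90, 80, 70])  # invented scores, sorted descending
for row in ranked:
    print(row)
```

With the two tied 90s, rank gives 1, 1, 3, 4; row_number gives 1, 2, 3, 4; dense_rank gives 1, 1, 2, 3.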

Use cases:
Top-N problems:
top 3: filter the ranked result with where rk <= 3
Requirement:
JD.com shops

[hadoop@bigdata32 exemple]$ cat user_shop.txt
user_id shop
u1,a
u2,b
u1,b
u1,a
u3,c
u4,b
u1,a
u2,c
u5,b
u4,b
u6,c
u2,c
u1,b
u2,a
u2,a
u3,a
u5,a
u5,a
u5,a

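pv => page views: 3 users who each view 10 pages give pv = 30
uv => unique visitors: 3 users who each view 10 pages give uv = 3

Requirements:
1. uv of each shop

select
shop,
count(distinct user_id) as uv
from user_shop
group by shop;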
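
Counting rows per shop yields pv, while uv requires counting distinct users. The following Python sketch computes both on the user_shop.txt sample as a toy check of the two definitions; it is not Hive code.

```python
from collections import Counter, defaultdict

# (user_id, shop) rows from user_shop.txt
visits = [tuple(pair.split(",")) for pair in (
    "u1,a u2,b u1,b u1,a u3,c u4,b u1,a u2,c u5,b u4,b "
    "u6,c u2,c u1,b u2,a u2,a u3,a u5,a u5,a u5,a"
).split()]

# pv: every visit counts, like count(*) per shop
pv = Counter(shop for _, shop in visits)

# uv: each user counts once per shop, like count(distinct user_id) per shop
users = defaultdict(set)
for user_id, shop in visits:
    users[shop].add(user_id)
uv = {shop: len(ids) for shop, ids in users.items()}

print(dict(pv))  # {'a': 9, 'b': 6, 'c': 4}
print(uv)        # {'b': 4, 'a': 4, 'c': 3} (key order may differ)
```

Shop a, for example, has 9 page views but only 4 unique visitors.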

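2. For each shop, the top 3 visitors by visit count
Output: shop name, visitor id, visit count

1. Visit count of each shop

2. Rank the visit counts

3. Top 3: shop name, visit count, rank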

select
shop,
cnt ,
rk
from
(
select
shop,
cnt ,
rank() over( order by cnt desc ) as rk
from
(
-- visit count per shop
select
shop,
count(1) as cnt
from user_shop
group by shop
) as a
) as a
where rk <4;

select
shop,
count(1) as cnt
from user_shop
group by shop;

Top 3 by shop name, visitor id, and visit count

1. Visit count of each visitor id in each shop

2. Rank the visit counts

3. Top 3: shop name, visitor id, visit count

select
shop,
user_id,
cnt,
rk
from
(
select
shop,
user_id,
cnt,
rank() over(partition by shop order by cnt desc ) as rk
from
(
-- 1. visit count of each visitor id in each shop
select
shop,
user_id,
count(1) as cnt
from user_shop
group by
shop,
user_id
) as a
) as a
where rk <4;
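
The three steps of the top-N query (count per shop and user, rank within each shop, keep rk < 4) can be replayed in Python on the sample data. This is a toy model of the logic, not Hive code; ties are ordered by user_id only to make the printed order deterministic.

```python
from collections import Counter

visits = [tuple(pair.split(",")) for pair in (
    "u1,a u2,b u1,b u1,a u3,c u4,b u1,a u2,c u5,b u4,b "
    "u6,c u2,c u1,b u2,a u2,a u3,a u5,a u5,a u5,a"
).split()]

# Step 1: group by shop, user_id
cnt = Counter((shop, user_id) for user_id, shop in visits)

top3 = []
for shop in sorted({s for s, _ in cnt}):
    # Step 2: order by cnt desc within the shop
    rows = sorted(
        ((user, c) for (s, user), c in cnt.items() if s == shop),
        key=lambda r: (-r[1], r[0]),
    )
    for user_id, c in rows:
        # rank() = 1 + number of rows in the partition with a strictly larger cnt
        rk = 1 + sum(1 for _, other in rows if other > c)
        if rk < 4:  # Step 3: keep ranks 1..3
            top3.append((shop, user_id, c, rk))

for row in top3:
    print(*row)
```

Because rank() lets tied rows share a rank, shop b returns four rows (two users at rank 1 and two at rank 3). If exactly three rows per shop are required, row_number() is the usual alternative.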