Hive_02 Data Definition and Manipulation Languages
DDL: Data Definition Language
Requirements:
1. Number of hires per department per year
select
deptno, year(hiredate),
count(*)
from emp
group by deptno, year(hiredate);
2. Company-wide hires per year and month
Using the year and month functions:
select year(hiredate),month(hiredate),count(*)
from emp
group by year(hiredate),month(hiredate);
Using the date_format function (note: the pattern must be 'yyyy-MM'; uppercase 'YYYY' is the week-based year in the underlying SimpleDateFormat and gives wrong results around year boundaries):
select
date_format(hiredate,'yyyy-MM') as ym,
count(*) as cnt
from emp
group by date_format(hiredate,'yyyy-MM');
3. Hires per year and month among the sales and manager staff with salary between 1500 and 2500
4. Hires per year and month among employees who have a performance bonus
5. Same as 3, but also returning the employee details
Database:
1. Hive ships with a default database named default; its path is /user/hive/warehouse
1. Creating a database
CREATE [REMOTE] (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
[]  the clause is optional
(|) choose exactly one of the alternatives
create database bigdata_hive;
Question: where does this database live on HDFS?
Database paths on HDFS:
default database: /user/hive/warehouse
non-default database: /user/hive/warehouse/dbname.db
create database if not exists bigdata_hive;
create database bigdata_hive2 LOCATION '/data/bigdata_hive2';
create database bigdata_hive3 WITH DBPROPERTIES ('creator'='doublehappy', 'create_dt'='2099-11-29');
create database if not exists bigdata_hive4 COMMENT 'this is database 4';
2. Listing databases
show databases;
show databases like "bigdata_hive*";
desc database bigdata_hive3;
desc database EXTENDED bigdata_hive3;
3.更改数据库
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …); – (Note: SCHEMA added in Hive 0.14.0)
alter database bigdata_hive3 set dbproperties (‘create_dt’=“2022-11-29”);
4. Dropping a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
DROP DATABASE bigdata_hive4;
DROP DATABASE bigdata_hive2 CASCADE; => CASCADE drops the database together with all of its tables: the "drop the database and run" operation
Viewing table information:
DESCRIBE [EXTENDED|FORMATTED] table_name
desc EXTENDED test;
desc FORMATTED test;
2. Tables
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
create table test(
id string comment 'user id',
name string,
age bigint
) comment 'test table'
row format delimited fields terminated by ','
stored as textfile;
data_type:
Commonly used data types:
Numeric:
integers: int, bigint
decimals: float, double, decimal
String: string
Date/time:
date:      DATE       yyyy-MM-dd
timestamp: TIMESTAMP  yyyy-MM-dd HH:mm:ss
Question: why must a delimiter be specified when creating a table?
Think of MapReduce:
the wordcount example
map phase: split each line on a delimiter
reduce phase: do the computation
When a table is created in Hive:
the table's data => plain files on HDFS
stu.txt => a plain file on HDFS:
1,zhangsan,18
2,lisi,18
3,doublehappy,15
4,zihang,28
metastore => the table's metadata, stored in MySQL:
id int
name string
age int
Hive data: HDFS + metastore => table
So the column delimiter of every line in the HDFS file must be declared.
Delimiters:
1. each HDFS line ends with the row delimiter \n
2. within a line, the columns also need a declared delimiter, e.g. ','
If table stu declares '\t' as its delimiter while the HDFS data uses ',':
id gets the whole line "1,zhangsan,18"
name is empty (NULL)
age is empty (NULL)
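A minimal sketch of the mismatch described above (the table name stu_bad is hypothetical; it declares '\t' while stu.txt uses ','):

```sql
-- Hypothetical table whose delimiter does not match the data file.
create table stu_bad (
  id   string,
  name string,
  age  int
)
row format delimited fields terminated by '\t';

-- load data local inpath '/home/hadoop/tmp/stu.txt' into table stu_bad;
-- Each line contains no '\t', so the whole line is taken as the first column:
--   id = '1,zhangsan,18', name = NULL, age = NULL
```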
Loading data:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL: load from the local filesystem into the Hive table
without LOCAL: load from HDFS into the Hive table
load data local inpath '/home/hadoop/tmp/test.txt' into TABLE test;
Exercise:
1. Create the emp table and load its data
CREATE TABLE emp (
empno decimal(4,0) ,
ename string ,
job string ,
mgr decimal(4,0) ,
hiredate string ,
sal decimal(7,2) ,
comm decimal(7,2) ,
deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/tmp/emp.txt' into table emp;
Ways to create a Hive table:
1. a regular CREATE TABLE statement
2. CREATE TABLE emp2 like emp; => copies only the table structure [the DDL]
3. create table emp3 as select * from emp; => structure + data [triggers an MR job]
CTAS:
use case: materialize the result of an ad-hoc query into a result table
DML:
1. load: loading data
1. load local data
2. load HDFS data
[OVERWRITE replaces the data already in the table]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
1.1 loading local data
load data local inpath '/home/hadoop/tmp/emp.txt' INTO TABLE emp;
1.2 loading HDFS data
[hadoop@bigdata32 tmp]$ hadoop fs -ls /data
[hadoop@bigdata32 tmp]$ hadoop fs -put ./emp.txt /data
[hadoop@bigdata32 tmp]$ hadoop fs -ls /data
Found 1 items
-rw-r--r--   3 hadoop supergroup        799 2022-11-30 21:53 /data/emp.txt
load data inpath '/data/emp.txt' INTO TABLE emp;
Note: a table consists of
its HDFS files
its metastore entry
so hadoop fs -mv xxx /table/ would also "load" data [advised: do not do it this way for now]
1.3 overwriting the data in a table
load data local inpath '/home/hadoop/tmp/emp.txt' OVERWRITE INTO TABLE emp;
2. insert [also triggers an MR job]
1. Inserting data into Hive tables from queries
insert into|OVERWRITE table tablename selectQuery
into: appends data
OVERWRITE: replaces data
2. Inserting values into tables from SQL [not recommended]
INSERT INTO TABLE tablename
VALUES values_row [, values_row ...]
1. every inserted row triggers a MapReduce job, so this is far too slow
emp2:
insert into table emp2
select * from emp;
insert overwrite table emp2
select * from emp where deptno=10;
3. update / delete => avoid them in Hive; they are inefficient
=> use insert OVERWRITE instead
4. Truncating a table
TRUNCATE [TABLE] table_name
truncate table emp2;
5. select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
1. where: filtering
supported operators in where_condition:
<
>
=
<> / !=
and
or
in
not in
between ... and ...
is null
is not null
Requirement:
return the rows whose deptno is 20 or 10
select
*
from emp
where deptno=20 or deptno =10;
select
*
from emp
where deptno in (10,20);
select
*
from emp
where deptno <> 20;
select
*
from emp
where deptno != 20;
2. order by: sorting
1. ascending (asc) is the default
2. use desc for descending
select
sal
from emp
order by sal desc;
3. like: pattern matching
1. _ matches exactly one character
2. % matches any number of characters
rlike / regexp: regular-expression matching
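A small sketch of the three match styles against the emp table loaded above (the patterns are illustrative):

```sql
select ename from emp where ename like 'S%';   -- starts with 'S'
select ename from emp where ename like '_A%';  -- second character is 'A'
select ename from emp where ename rlike 'LL';  -- regex: contains "LL"
```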
4. Combining tables
1. union: removes duplicates
2. union all: keeps duplicates
create table a(id int, name string) row format delimited fields terminated by ',';
create table b(id int, name string) row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/tmp/a.txt' into table a;
load data local inpath '/home/hadoop/tmp/b.txt' into table b;
select name from a
union all
select name from b;
select name from a
union all
select name from b
union all
select "lisi" as name;
select name, "1" as pk from a
union all
select name, "2" as pk from b
union all
select "lisi" as name, "3" as pk;
Question: what is Hive's default column delimiter when none is specified?
5. Handling NULL
1. filtering
where xxx is not null
is null is the complementary test; <=> is the null-safe equality operator
2. ETL conversion
ifnull => does not exist in Hive
coalesce => returns its first non-null argument
nvl => nvl(col, default): returns default when col is null
Extras:
Listing the functions Hive supports:
y = f(x)
SHOW FUNCTIONS [LIKE "<pattern>"];
show functions like "nvl"; => check whether the function exists in Hive
desc function nvl; => show how a particular function is used
select
empno,
ename,
job,
mgr,
hiredate,
sal,
nvl(comm,0) as comm_alias,
deptno
from emp ;
6. Grouping, aggregate functions, join
1. group by
1. used together with aggregate functions
2. groups by one or more columns
3. the non-aggregated columns in select must match the group by columns
having: filters the grouped result
can only be used after group by
2. aggregate functions
sum, max, min, avg, count
select
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp ;
select
job,
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp
group by job
having sal_sum > 6000;
Equivalent subquery form:
select
job,
sal_sum,
sal_max,
sal_min,
sal_avg,
cnt
from
(
select
job,
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp
group by job
) as a
where sal_sum > 6000;
7. join
1. inner join [join]
select
a.*,
b.*
from
(
select
*
from a
) as a join
(
select
*
from b
) as b
on a.id = b.id;
2. left join
select
a.*,
b.*
from
(
select
*
from a
) as a left join
(
select
*
from b
) as b
on a.id = b.id;
3. right join
select
a.*,
b.*
from
(
select
*
from a
) as a right join
(
select
*
from b
) as b
on a.id = b.id;
4. full join
select
a.*,
b.*
from
(
select
*
from a
) as a full join
(
select
*
from b
) as b
on a.id = b.id;
How to break a requirement down against a table:
dimensions: the group by columns
metrics: aggregate functions, plus plain columns
where: the filters
Requirement (on emp):
1. number of hires per department per year
table: emp
dimensions: department, year
metric: head count
where: none
select
<department>, <year>,
<head count>
from emp
group by <department>, <year>;
year => derived from hiredate via ETL
select
deptno, date_format(hiredate,'yyyy') as year,
count(1) as cnt
from emp
group by deptno, date_format(hiredate,'yyyy');
1980-12-17 00:00:00 => use a date-handling function such as date_format for the ETL
2. company-wide hires per year and month
dimensions: year, month
metric: head count
hiredate: 1980-12-17 00:00:00 => 1980-12
select
date_format(hiredate,'yyyy-MM') as ym,
count(1) as cnt
from emp
group by date_format(hiredate,'yyyy-MM');
Homework:
3. hires per year and month among the sales and manager staff with salary between 1500 and 2500
4. hires per year and month among employees who have a performance bonus
5. same as 3, but also returning the employee details
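A possible sketch for homework 3 and 4, assuming "sales and manager" maps to job in ('SALESMAN','MANAGER') and "performance bonus" means a non-null, positive comm; adjust the filters if your data models these differently:

```sql
-- Homework 3: hires per year-month, salesmen and managers, salary 1500-2500.
select
  date_format(hiredate,'yyyy-MM') as ym,
  count(1) as cnt
from emp
where job in ('SALESMAN','MANAGER')
  and sal between 1500 and 2500
group by date_format(hiredate,'yyyy-MM');

-- Homework 4: hires per year-month among employees with a bonus.
select
  date_format(hiredate,'yyyy-MM') as ym,
  count(1) as cnt
from emp
where comm is not null and comm > 0
group by date_format(hiredate,'yyyy-MM');
```

Homework 5 follows the same shape but keeps the employee columns in the select list instead of aggregating.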
1. Window functions
An aggregate function collapses multiple rows into one by some rule;
in theory, rows after aggregation <= rows before aggregation.
Requirement:
show the pre-aggregation rows and the aggregated result at the same time?
e.g.:
id name sal
1  zs   3w
2  ls   2.5w
3  ww   2w
Requirement: sort by salary descending and also show each row's rank
id name sal   rank
1  zs   3w    1
2  ls   2.5w  2
3  ww   2w    3
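The ranking above can be sketched with a window function (t, with columns id/name/sal, is a hypothetical table holding the three rows shown):

```sql
-- rank() numbers each row by its position in the window ordered by sal desc.
select
  id,
  name,
  sal,
  rank() over (order by sal desc) as rk
from t;
```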
Window function = window + function
window: the range of rows the function computes over at runtime
function: the function being evaluated
Syntax:
function over([partition by xxx,...] [order by xxx,...])
over: opens the window over the table
partition by: the table columns used for grouping
order by: the table columns used for ordering
function: a window function or an aggregate function
Data:
haige,2022-11-10,1
haige,2022-11-11,5
haige,2022-11-12,7
haige,2022-11-13,3
haige,2022-11-14,2
haige,2022-11-15,4
haige,2022-11-16,4
create table user_mt (
name string,
dt string,
cnt int
)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/tmp/mt.txt' overwrite into table user_mt;
Requirement:
running-total problem: each user's cumulative take-out order count per day
[partition by xxx,...] [order by xxx,...]
select
name ,
dt ,
cnt ,
sum(cnt) over(partition by name order by dt ) as sum_cnt
from user_mt;
Note:
in a single plain query block, a window function and group by cannot simply be mixed.
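When both are needed, a common pattern is to aggregate in a subquery first and open the window over the grouped result; a hedged sketch on user_mt:

```sql
-- Aggregate per (name, dt), then compute the running total over those rows.
select
  name,
  dt,
  sum(day_cnt) over (partition by name order by dt) as running_cnt
from (
  select name, dt, sum(cnt) as day_cnt
  from user_mt
  group by name, dt
) t;
```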
3. Command line
1. hive shell
2. jdbc => HiveServer2
Hive clients:
1. hive shell
2. beeline shell: connects over JDBC; requires the HiveServer2 (Thrift) service to be started
!connect jdbc:hive2://localhost:10000 hadoop
Note:
beeline => connects to Hive => which in turn accesses HDFS
so HDFS needs a proxy-user setting in core-site.xml:
hadoop.proxyuser.hadoop.hosts
hadoop.proxyuser.hadoop.groups
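A minimal sketch of those two properties in core-site.xml (assuming the service user is hadoop; "*" is the permissive value often used in a sandbox, not a production recommendation):

```xml
<!-- Allow the "hadoop" user (running HiveServer2) to impersonate beeline users. -->
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```

The NameNode must pick up the change afterwards (e.g. restart HDFS).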
Homework:
1. practice connecting to Hive with beeline
2. get DBeaver connected to Hive
3. window functions [a must-know interview topic]:
https://blog.youkuaiyun.com/weixin_45775873/article/details/109556502
work through the two examples there