Hive_02 Data Definition and Manipulation Languages
DDL: Data Definition Language
Requirements:
1. Number of hires per department per year
select
deptno, year(hiredate),
count(*)
from emp
group by deptno, year(hiredate);
2. Company-wide hires per year and month
Using the year and month functions:
select year(hiredate),month(hiredate),count(*)
from emp
group by year(hiredate),month(hiredate);
Using the date_format function (note: the pattern must be 'yyyy-MM'; uppercase 'YYYY' is the week-based year in the underlying SimpleDateFormat and gives wrong results around year boundaries):
select
date_format(hiredate,'yyyy-MM') as ym,
count(*) as cnt
from emp
group by date_format(hiredate,'yyyy-MM');
3. Hires per year and month among the sales and manager staff with salary between 1500 and 2500
4. Hires per year and month among employees who have a performance bonus
5. Same as 3, but also returning the employee details
Database:
1. Hive ships with a default database named default; its path is /user/hive/warehouse
1. Creating a database
CREATE [REMOTE] (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
[]  the clause is optional
(|) choose exactly one of the alternatives
create database bigdata_hive;
Question: where does this database live on HDFS?
Database paths on HDFS:
default database: /user/hive/warehouse
non-default database: /user/hive/warehouse/dbname.db
create database if not exists bigdata_hive;
create database bigdata_hive2 LOCATION '/data/bigdata_hive2';
create database bigdata_hive3 WITH DBPROPERTIES ('creator'='doublehappy', 'create_dt'='2099-11-29');
create database if not exists bigdata_hive4 COMMENT 'this is database 4';
2. Listing databases
show databases;
show databases like "bigdata_hive*";
desc database bigdata_hive3;
desc database EXTENDED bigdata_hive3;
3.更改数据库
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …); – (Note: SCHEMA added in Hive 0.14.0)
alter database bigdata_hive3 set dbproperties (‘create_dt’=“2022-11-29”);
4. Dropping a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
DROP DATABASE bigdata_hive4;
DROP DATABASE bigdata_hive2 CASCADE; => CASCADE drops the database together with all of its tables: the "drop the database and run" operation
Viewing table information:
DESCRIBE [EXTENDED|FORMATTED] table_name
desc EXTENDED test;
desc FORMATTED test;
2. Tables
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
create table test(
id string comment 'user id',
name string,
age bigint
) comment 'test table'
row format delimited fields terminated by ','
stored as textfile;
data_type:
Commonly used data types:
Numeric:
integers: int, bigint
decimals: float, double, decimal
String: string
Date/time:
date:      DATE       yyyy-MM-dd
timestamp: TIMESTAMP  yyyy-MM-dd HH:mm:ss
Question: why must a delimiter be specified when creating a table?
Think of MapReduce:
the wordcount example
map phase: split each line on a delimiter
reduce phase: do the computation
When a table is created in Hive:
the table's data => plain files on HDFS
stu.txt => a plain file on HDFS:
1,zhangsan,18
2,lisi,18
3,doublehappy,15
4,zihang,28
metastore => the table's metadata, stored in MySQL:
id int
name string
age int
Hive data: HDFS + metastore => table
So the column delimiter of every line in the HDFS file must be declared.
Delimiters:
1. each HDFS line ends with the row delimiter \n
2. within a line, the columns also need a declared delimiter, e.g. ','
If table stu declares '\t' as its delimiter while the HDFS data uses ',':
id gets the whole line "1,zhangsan,18"
name is empty (NULL)
age is empty (NULL)
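A minimal sketch of the mismatch described above (the table name stu_bad is hypothetical; it declares '\t' while stu.txt uses ','):

```sql
-- Hypothetical table whose delimiter does not match the data file.
create table stu_bad (
  id   string,
  name string,
  age  int
)
row format delimited fields terminated by '\t';

-- load data local inpath '/home/hadoop/tmp/stu.txt' into table stu_bad;
-- Each line contains no '\t', so the whole line is taken as the first column:
--   id = '1,zhangsan,18', name = NULL, age = NULL
```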
Loading data:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL: load from the local filesystem into the Hive table
without LOCAL: load from HDFS into the Hive table
load data local inpath '/home/hadoop/tmp/test.txt' into TABLE test;
Exercise:
1. Create the emp table and load its data
CREATE TABLE emp (
empno decimal(4,0) ,
ename string ,
job string ,
mgr decimal(4,0) ,
hiredate string ,
sal decimal(7,2) ,
comm decimal(7,2) ,
deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/tmp/emp.txt' into table emp;
Ways to create a Hive table:
1. a regular CREATE TABLE statement
2. CREATE TABLE emp2 like emp; => copies only the table structure [the DDL]
3. create table emp3 as select * from emp; => structure + data [triggers an MR job]
CTAS:
use case: materialize the result of an ad-hoc query into a result table
DML:
1. load: loading data
1. load local data
2. load HDFS data
[OVERWRITE replaces the data already in the table]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
1.1 loading local data
load data local inpath '/home/hadoop/tmp/emp.txt' INTO TABLE emp;
1.2 loading HDFS data
[hadoop@bigdata32 tmp]$ hadoop fs -ls /data
[hadoop@bigdata32 tmp]$ hadoop fs -put ./emp.txt /data
[hadoop@bigdata32 tmp]$ hadoop fs -ls /data
Found 1 items
-rw-r--r--   3 hadoop supergroup        799 2022-11-30 21:53 /data/emp.txt
load data inpath '/data/emp.txt' INTO TABLE emp;
Note: a table consists of
its HDFS files
its metastore entry
so hadoop fs -mv xxx /table/ would also "load" data [advised: do not do it this way for now]
1.3 overwriting the data in a table
load data local inpath '/home/hadoop/tmp/emp.txt' OVERWRITE INTO TABLE emp;
2. insert [also triggers an MR job]
1. Inserting data into Hive tables from queries
insert into|OVERWRITE table tablename selectQuery
into: appends data
OVERWRITE: replaces data
2. Inserting values into tables from SQL [not recommended]
INSERT INTO TABLE tablename
VALUES values_row [, values_row ...]
1. every inserted row triggers a MapReduce job, so this is far too slow
emp2:
insert into table emp2
select * from emp;
insert overwrite table emp2
select * from emp where deptno=10;
3. update / delete => avoid them in Hive; they are inefficient
=> use insert OVERWRITE instead
4. Truncating a table
TRUNCATE [TABLE] table_name
truncate table emp2;
5. select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
1. where: filtering
supported operators in where_condition:
<
>
=
<> / !=
and
or
in
not in
between ... and ...
is null
is not null
Requirement:
return the rows whose deptno is 20 or 10
select
*
from emp
where deptno=20 or deptno =10;
select
*
from emp
where deptno in (10,20);
select
*
from emp
where deptno <> 20;
select
*
from emp
where deptno != 20;
2. order by: sorting
1. ascending (asc) is the default
2. use desc for descending
select
sal
from emp
order by sal desc;
3. like: pattern matching
1. _ matches exactly one character
2. % matches any number of characters
rlike / regexp: regular-expression matching
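A small sketch of the three match styles against the emp table loaded above (the patterns are illustrative):

```sql
select ename from emp where ename like 'S%';   -- starts with 'S'
select ename from emp where ename like '_A%';  -- second character is 'A'
select ename from emp where ename rlike 'LL';  -- regex: contains "LL"
```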
4. Combining tables
1. union: removes duplicates
2. union all: keeps duplicates
create table a(id int, name string) row format delimited fields terminated by ',';
create table b(id int, name string) row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/tmp/a.txt' into table a;
load data local inpath '/home/hadoop/tmp/b.txt' into table b;
select name from a
union all
select name from b;
select name from a
union all
select name from b
union all
select "lisi" as name;
select name, "1" as pk from a
union all
select name, "2" as pk from b
union all
select "lisi" as name, "3" as pk;
Question: what is Hive's default column delimiter when none is specified?
5. Handling NULL
1. filtering
where xxx is not null
is null is the complementary test; <=> is the null-safe equality operator
2. ETL conversion
ifnull => does not exist in Hive
coalesce => returns its first non-null argument
nvl => nvl(col, default): returns default when col is null
Extras:
Listing the functions Hive supports:
y = f(x)
SHOW FUNCTIONS [LIKE "<pattern>"];
show functions like "nvl"; => check whether the function exists in Hive
desc function nvl; => show how a particular function is used
select
empno,
ename,
job,
mgr,
hiredate,
sal,
nvl(comm,0) as comm_alias,
deptno
from emp ;
6. Grouping, aggregate functions, join
1. group by
1. used together with aggregate functions
2. groups by one or more columns
3. the non-aggregated columns in select must match the group by columns
having: filters the grouped result
can only be used after group by
2. aggregate functions
sum, max, min, avg, count
select
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp ;
select
job,
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp
group by job
having sal_sum > 6000;
Equivalent subquery form:
select
job,
sal_sum,
sal_max,
sal_min,
sal_avg,
cnt
from
(
select
job,
sum(sal) as sal_sum,
max(sal) as sal_max,
min(sal) as sal_min,
avg(sal) as sal_avg,
count(1) as cnt
from emp
group by job
) as a
where sal_sum > 6000;
7. join
1. inner join [join]
select
a.*,
b.*
from
(
select
*
from a
) as a join
(
select
*
from b
) as b
on a.id = b.id;
2. left join
select
a.*,
b.*
from
(
select
*
from a
) as a left join
(
select
*
from b
) as b
on a.id = b.id;
3. right join
select
a.*,
b.*
from
(
select
*
from a
) as a right join
(
select
*
from b
) as b
on a.id = b.id;
4. full join
select
a.*,
b.*
from
(
select
*
from a
) as a full join
(
select
*
from b
) as b
on a.id = b.id;
How to break a requirement down against a table:
dimensions: the group by columns
metrics: aggregate functions, plus plain columns
where: the filters
Requirement (on emp):
1. number of hires per department per year
table: emp
dimensions: department, year
metric: head count
where: none
select
<department>, <year>,
<head count>
from emp
group by <department>, <year>;
year => derived from hiredate via ETL
select
deptno, date_format(hiredate,'yyyy') as year,
count(1) as cnt
from emp
group by deptno, date_format(hiredate,'yyyy');
1980-12-17 00:00:00 => use a date-handling function such as date_format for the ETL
2. company-wide hires per year and month
dimensions: year, month
metric: head count
hiredate: 1980-12-17 00:00:00 => 1980-12
select
date_format(hiredate,'yyyy-MM') as ym,
count(1) as cnt
from emp
group by date_format(hiredate,'yyyy-MM');
Homework:
3. hires per year and month among the sales and manager staff with salary between 1500 and 2500
4. hires per year and month among employees who have a performance bonus
5. same as 3, but also returning the employee details
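A possible sketch for homework 3 and 4, assuming "sales and manager" maps to job in ('SALESMAN','MANAGER') and "performance bonus" means a non-null, positive comm; adjust the filters if your data models these differently:

```sql
-- Homework 3: hires per year-month, salesmen and managers, salary 1500-2500.
select
  date_format(hiredate,'yyyy-MM') as ym,
  count(1) as cnt
from emp
where job in ('SALESMAN','MANAGER')
  and sal between 1500 and 2500
group by date_format(hiredate,'yyyy-MM');

-- Homework 4: hires per year-month among employees with a bonus.
select
  date_format(hiredate,'yyyy-MM') as ym,
  count(1) as cnt
from emp
where comm is not null and comm > 0
group by date_format(hiredate,'yyyy-MM');
```

Homework 5 follows the same shape but keeps the employee columns in the select list instead of aggregating.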
1. Window functions
An aggregate function collapses multiple rows into one by some rule;
in theory, rows after aggregation <= rows before aggregation.
Requirement:
show the pre-aggregation rows and the aggregated result at the same time?
e.g.:
id name sal
1  zs   3w
2  ls   2.5w
3  ww   2w
Requirement: sort by salary descending and also show each row's rank
id name sal   rank
1  zs   3w    1
2  ls   2.5w  2
3  ww   2w    3
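The ranking above can be sketched with a window function (t, with columns id/name/sal, is a hypothetical table holding the three rows shown):

```sql
-- rank() numbers each row by its position in the window ordered by sal desc.
select
  id,
  name,
  sal,
  rank() over (order by sal desc) as rk
from t;
```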
Window function = window + function
window: the range of rows the function computes over at runtime
function: the function being evaluated
Syntax:
function over([partition by xxx,...] [order by xxx,...])
over: opens the window over the table
partition by: the table columns used for grouping
order by: the table columns used for ordering
function: a window function or an aggregate function
Data:
haige,2022-11-10,1
haige,2022-11-11,5
haige,2022-11-12,7
haige,2022-11-13,3
haige,2022-11-14,2
haige,2022-11-15,4
haige,2022-11-16,4
create table user_mt (
name string,
dt string,
cnt int
)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/tmp/mt.txt' overwrite into table user_mt;
Requirement:
running-total problem: each user's cumulative take-out order count per day
[partition by xxx,...] [order by xxx,...]
select
name ,
dt ,
cnt ,
sum(cnt) over(partition by name order by dt ) as sum_cnt
from user_mt;
Note:
in a single plain query block, a window function and group by cannot simply be mixed.
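When both are needed, a common pattern is to aggregate in a subquery first and open the window over the grouped result; a hedged sketch on user_mt:

```sql
-- Aggregate per (name, dt), then compute the running total over those rows.
select
  name,
  dt,
  sum(day_cnt) over (partition by name order by dt) as running_cnt
from (
  select name, dt, sum(cnt) as day_cnt
  from user_mt
  group by name, dt
) t;
```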
3. Command line
1. hive shell
2. jdbc => HiveServer2
Hive clients:
1. hive shell
2. beeline shell: connects over JDBC; requires the HiveServer2 (Thrift) service to be started
!connect jdbc:hive2://localhost:10000 hadoop
Note:
beeline => connects to Hive => which in turn accesses HDFS
so HDFS needs a proxy-user setting in core-site.xml:
hadoop.proxyuser.hadoop.hosts
hadoop.proxyuser.hadoop.groups
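A minimal sketch of those two properties in core-site.xml (assuming the service user is hadoop; "*" is the permissive value often used in a sandbox, not a production recommendation):

```xml
<!-- Allow the "hadoop" user (running HiveServer2) to impersonate beeline users. -->
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```

The NameNode must pick up the change afterwards (e.g. restart HDFS).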
Homework:
1. practice connecting to Hive with beeline
2. get DBeaver connected to Hive
3. window functions [a must-know interview topic]:
https://blog.youkuaiyun.com/weixin_45775873/article/details/109556502
work through the two examples there