Hive入门（3）

最新推荐文章于 2025-06-01 20:57:16 发布

原创最新推荐文章于 2025-06-01 20:57:16 发布 · 162 阅读

1 ·

CC 4.0 BY-SA版权

Hive 专栏收录该内容

6 篇文章

订阅专栏

1.DML补充

(1)查询出不包含xx的数据

select * from emp where ename not in ('SCOTT');


//查询不包含ename等于SCOTT的所有行

（2）查询出指定的某行

 select * from emp where ename in ('SCOTT','ss');

查询返回ename =  SCOTT 和 ss的行

（3）查询某一区间的行 between and 【这是一个闭区间】

select * from emp where salary between 1000 and 3000;
OK
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30

2.聚合函数

（1）定义：具有多进一出操作的函数。例如max/min/count/avg/sum

select min(salary),max(salary),avg(salary),sum(salary) from emp;


 select count(1) from emp where depto=10;



count（1）  count（*）  count（字段） 区别


count(*),count(1)是一样的

count("字段名")：如果字段为非空字段那么它和count(1),count(*)是一样的

但如果是可以为空的字段，查询时间将会长很多

3.分组函数 group by

（1）需求：求每个部门的平均工资

select deptno,avg(salary) from emp group by deptno;


这里注意一点，select 出现的字段，如果没有出现组函数中（就是那些count，avg。。。），
必须出现在group by 函数中。

（2）需求：求每个部门每个工作的最高工资

select deptno,job,avg(salary) from emp group by deptno,job;

//好像聚合函数和其他字段一起使用时，必须和分区函数一起使用。

（3）需求：求每个部门，平均工资大于2000的

 select deptno,avg(salary) from emp group by deptno having avg(salary)>2000;


//这里用where是会报错的

where是从数据表中的字段单条直接进行筛选，having是对分组后的结果，整体进行筛选，where需在group 之前，数据先筛选。

（4）给数据加个标签

select ename,salary,
case 
when salary>1 and salary<=1000 then 'lower'
when salary>1000 and salary<=2000 then 'middle'
when salary>2000 and salary<=4000 then 'high'
else 'highest'
end
from emp;


//select xx case when xxx条件 and xxx条件 then '标签名' end  from tablename;

4.join（关联查询）

（1）join=inner join

select * from joina join joinb;



这个产生的笛卡儿积，俩者的数据都做了join，这是什么消耗性能的！！！

（2）left join

select * from joina left join joinb on joina.id=joinb.id;


OK
1	ruoze	1	30
2	jepson	2	29
3	yunfei	NULL	NULL

（3）right join

 select joina.id,joina.name from joina right join joinb on joina.id = joinb.id;

1	ruoze
2	jepson
NULL	NULL

（3）full join

select * from joina full join joinb on joina.id=joinb.id;

1	ruoze	1	30
2	jepson	2	29
3	yunfei	NULL	NULL
NULL	NULL	4	20

5.单级分区表（partition）

（1）分区表的作用和意义：在查询处理的时候，加上分区位置，能大大降低reduce io 。

（2）创建静态分区表：

create table order_partition(ordernumber string,orderevent string) partitioned by (eventdate string) row
    > format delimited fields terminated by '\t';


注意这个partitioned by

（3）根据分区表查询

select * from order_partition where eventdate='2018-12';

（4）如果将数据直接放到HDFS上的Hive分区下面，能不能查询到呢?

测试：

[root@hadoop01 data]#hdfs dfs -mkdir /user/hive/warehouse/order_partition/eventdate=2018-01/
18/12/27 21:41:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hadoop01 data]#hdfs dfs -put /home/hadoop/data/order.txt /user/hive/warehouse/order_partition/eventdate=2018-01/
18/12/27 21:41:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hadoop01 data]#hdfs dfs -ls /user/hive/warehouse/order_partition/eventdate=2018-01/
18/12/27 21:41:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 root root        192 2018-12-27 21:41 /user/hive/warehouse/order_partition/eventdate=2018-01/order.txt
[root@hadoop01 data]#

结果：查询不到的，因为根本在元数据库，根本就没有关于这个新分区的信息，该怎么办呢？

（5）刷新分区解决上述问题

这里用的alter命令，而没用msck命令，因为这个命令是全局更新，十分废时间，慎用。

hive> select * from order_partition where eventdate=2018-01;
OK
Time taken: 0.799 seconds
hive> select * from order_partition where eventdate=2018-02;
OK
Time taken: 0.124 seconds
hive> select * from order_partition where eventdate='2018-02';
OK
Time taken: 0.109 seconds
hive> select * from order_partition where eventdate='2018-01';
OK
Time taken: 0.085 seconds
hive> alter table order_partition add if not exists partition(eventdate='2018-01');
OK
Time taken: 0.138 seconds
hive> select * from order_partition where eventdate='2018-01';
OK
160202412041    2018-12-24      09:21:21.335+01	NULL	2018-01
160703140214    2018-10-01      13:42:34.224+02	NULL	2018-01
161102412241    2018-11-14      19:21:21.25+03	NULL	2018-01
160470314022    2018-05-02      14:32:14.21+01	NULL	2018-01


这里做了个测试发现，hdfs上那新创建的文件夹就是你的分区名字，也就是换成02就刷新不了，因为根本就没有这文件。

（6）除了上述这种手动的，添加数据和分区，还有一种方法，添加分区和数据


create table order_4_partition(
ordernumber string,
eventtime string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_4_partition;

//这俩个事创建表然后加载数据
insert overwrite table order_partition 
partition(eventdate='2014-08')
select * from order_4_partition;

//增加了08分区，然后将数据复制覆盖过去。

以上讲的都是单级分区

6.多级分区表

（1）创建多分区表

 create table order_mulit_partition(ordernumber string,eventtime string) partitioned by (event_month string,event_day string)
row format delimited fields terminated by '\t';


可以看出其实和单级分区表就是变成了多个列

（2）多分区加载数据

LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt' 
OVERWRITE INTO TABLE order_mulit_partition 
PARTITION(event_month='2018-05', event_day='01'); 

区别也只是要指定俩个分区
在HDFS上就是有在月文件夹下面还有个天文件夹

（3）查询也是一样的，加个分区。

7.动态分区

（1）需求：将emp表中的数据按照部门编号10和20.。。。。分区到其它表

create table ruozedata_static_emp 
(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
PARTITIONED by(deptno string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' ;

//先创建一个分区表
insert into table ruozedata_static_emp partition(deptno='10')
select empno,ename,job,mgr,hiredate,salary,comm from emp
where deptno=10;
按部门编号十来分

insert into table ruozedata_static_emp partition(deptno='20')
select empno,ename,job,mgr,hiredate,salary,comm from emp
where deptno=20;
按部门编号20来分

如果是1000个呢？那不得敲死，这就引出来了动态分区

（2）创建动态分区表


create table ruozedata_dynamic_emp 
(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
PARTITIONED by(deptno string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' ;

和静态一样的

（3）动态分区

先设置 set hive.exec.dynamic.partition.mode=nonstrict（注意这只是当前窗口模式有效，永久去hive-site设置，这也是hive设置语法的方式，key=value，set key表示查看这个key的值），将动态分区设置成非严格的方式，然后再执行

8.Function

（1）查看自带的内置函数

show functions;


desc function extended upper;

查看这方法的具体使用方法；

（2）使用举例,不会用desc function extended name ;查看example

select unix_timestamp() from emp;
查看当前的时间戳
select current_date from emp;
查看当前的日期
还有day month hour  minute

select date_add("2018-12-01",10) from emp;
日期天数加三
select date_sub("2018-12-01",10) from emp;
日期天数减三

 select substr('HIVE',1,3) from emp;
截取字符串的用法
还有连接字符concat
还有concat_ws//多个转一个


SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1;
  'www.facebook.com'