1. SELECT syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY
col_list]]
[LIMIT [offset,] rows]
Create the table and load the data
-- Test data: /home/hadoop/data/emp.dat
7369,SMITH,CLERK,7902,2010-12-17,800,,20
7499,ALLEN,SALESMAN,7698,2011-02-20,1600,300,30
7521,WARD,SALESMAN,7698,2011-02-22,1250,500,30
7566,JONES,MANAGER,7839,2011-04-02,2975,,20
7654,MARTIN,SALESMAN,7698,2011-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,2011-05-01,2850,,30
7782,CLARK,MANAGER,7839,2011-06-09,2450,,10
7788,SCOTT,ANALYST,7566,2017-07-13,3000,,20
7839,KING,PRESIDENT,,2011-11-07,5000,,10
7844,TURNER,SALESMAN,7698,2011-09-08,1500,0,30
7876,ADAMS,CLERK,7788,2017-07-13,1100,,20
7900,JAMES,CLERK,7698,2011-12-03,950,,30
7902,FORD,ANALYST,7566,2011-12-03,3000,,20
7934,MILLER,CLERK,7782,2012-01-23,1300,,10
-- Create the table and load the data
CREATE TABLE emp (
empno int,
ename string,
job string,
mgr int,
hiredate DATE,
sal int,
comm int,
deptno int
) row format delimited fields terminated by ",";
-- Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.dat'
INTO TABLE emp;
2. Operators
| Comparison operator | Description |
| --- | --- |
| =, ==, <=> | Equal to |
| <>, != | Not equal to |
| <, <=, >, >= | Less than, less than or equal to, greater than, greater than or equal to |
| is [not] null | Returns TRUE if A is NULL, otherwise FALSE; the NOT keyword reverses the result |
| in (value1, value2, ...) | TRUE if the value matches any value in the list |
| LIKE | % matches zero or more characters (any number of characters); _ matches exactly one character; the NOT keyword reverses the result |
| BETWEEN ... AND ... | Range check; the NOT keyword reverses the result |
| RLIKE, REGEXP | Java regular-expression match: returns TRUE on a match, otherwise FALSE. Matching is implemented with the JDK regex engine, so its rules apply. Note that, unlike LIKE, the pattern matches if it matches any substring of the value; anchor it with ^ and $ to require a match against the whole string |
Note: in general, when NULL participates in an operation the result is NULL; NULL <=> NULL evaluates to TRUE.
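These NULL semantics can be sketched in Python, with `None` standing in for NULL. The helper names below are hypothetical, purely for illustration; they model Hive's `=` (three-valued logic) versus the null-safe `<=>`:

```python
# Model Hive's three-valued logic: None stands in for NULL.

def eq(a, b):
    """Like Hive's `=`: any comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Like Hive's `<=>`: NULL <=> NULL is TRUE, NULL <=> x is FALSE."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(eq(None, 1))               # None  (i.e. NULL, not False)
print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
```

This is why `WHERE col = NULL` never matches anything: the predicate evaluates to NULL, not TRUE, so `IS NULL` or `<=>` must be used instead.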
3. GROUP BY clause
GROUP BY is typically used together with aggregate functions: it groups the data by one or more columns, then applies an aggregate to each group.
-- Compute the average salary of each department in emp
select deptno, avg(sal)
from emp
group by deptno;
-- Keep only departments whose average salary exceeds 1000
select deptno, avg(sal) aa
from emp
group by deptno
having aa > 1000;
- The WHERE clause filters the rows of the table; HAVING filters the query results (the data after grouping).
- A WHERE clause cannot contain aggregate functions; a HAVING clause can.
- HAVING is only used after GROUP BY aggregation.
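The pipeline above can be sketched in Python (the salary values are made up for illustration, not taken from emp): WHERE filters individual rows, GROUP BY buckets what remains, and HAVING filters on each bucket's aggregate:

```python
from collections import defaultdict

# (deptno, sal) rows -- illustrative values only
rows = [(10, 5000), (10, 1300), (20, 800), (20, 3000), (30, 950)]

# WHERE stage: filters individual rows, before any grouping
filtered = [(d, s) for d, s in rows if s > 0]

# GROUP BY stage: bucket the surviving rows by deptno
groups = defaultdict(list)
for deptno, sal in filtered:
    groups[deptno].append(sal)

# HAVING stage: filters on the aggregate (avg) of each group
result = {d: sum(v) / len(v) for d, v in groups.items()
          if sum(v) / len(v) > 1000}
print(result)  # {10: 3150.0, 20: 1900.0} -- dept 30 (avg 950) is dropped
```

Note that the HAVING condition could not be expressed as a WHERE condition, because the average only exists after grouping.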
4. JOIN
Hive supports the usual SQL JOIN statements. By default, only equi-joins are supported; non-equi-joins are not.
1. Inner join: [inner] join
2. Outer joins (outer join)
- Left outer join (left [outer] join): all rows of the left table are kept
- Right outer join (right [outer] join): all rows of the right table are kept
- Full outer join (full [outer] join): rows from both tables are kept
Sample tables u1 and u2:
0: jdbc:hive2://master:10000> select * from u1;
+--------+----------+
| u1.id | u1.name |
+--------+----------+
| 1 | a |
| 2 | b |
| 3 | c |
| 3 | d |
| 4 | e |
+--------+----------+
5 rows selected (0.647 seconds)
0: jdbc:hive2://master:10000>
0: jdbc:hive2://master:10000> select * from u2;
+--------+----------+
| u2.id | u2.name |
+--------+----------+
| 1 | d |
| 1 | e |
| 2 | f |
| 3 | g |
| 5 | h |
+--------+----------+
Inner join
0: jdbc:hive2://master:10000> select u1.id ,u1.name ,u2.id,u2.name from u1 join u2 on u1.id = u2.id;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-2.3.7/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
+--------+----------+--------+----------+
| u1.id | u1.name | u2.id | u2.name |
+--------+----------+--------+----------+
| 1 | a | 1 | d |
| 1 | a | 1 | e |
| 2 | b | 2 | f |
| 3 | c | 3 | g |
| 3 | d | 3 | g |
+--------+----------+--------+----------+
Left outer join
0: jdbc:hive2://master:10000> select u1.id ,u1.name,u2.id ,u2.name from u1 left join u2 on u1.id = u2.id;
+--------+----------+--------+----------+
| u1.id | u1.name | u2.id | u2.name |
+--------+----------+--------+----------+
| 1 | a | 1 | d |
| 1 | a | 1 | e |
| 2 | b | 2 | f |
| 3 | c | 3 | g |
| 3 | d | 3 | g |
| 4 | e | NULL | NULL |
+--------+----------+--------+----------+
Right outer join
0: jdbc:hive2://master:10000> select u1.id, u1.name,u2.id, u2.name from u1 right join u2 on u1.id = u2.id;
+--------+----------+--------+----------+
| u1.id | u1.name | u2.id | u2.name |
+--------+----------+--------+----------+
| 1 | a | 1 | d |
| 1 | a | 1 | e |
| 2 | b | 2 | f |
| 3 | c | 3 | g |
| 3 | d | 3 | g |
| NULL | NULL | 5 | h |
+--------+----------+--------+----------+
Full outer join
0: jdbc:hive2://master:10000> select u1.id, u1.name,u2.id, u2.name from u1 full join u2 on u1.id = u2.id;
+--------+----------+--------+----------+
| u1.id | u1.name | u2.id | u2.name |
+--------+----------+--------+----------+
| 1 | a | 1 | e |
| 1 | a | 1 | d |
| 2 | b | 2 | f |
| 3 | d | 3 | g |
| 3 | c | 3 | g |
| 4 | e | NULL | NULL |
| NULL | NULL | 5 | h |
+--------+----------+--------+----------+
Multi-table join
0: jdbc:hive2://master:10000> select u1.id, u1.name,u2.id, u2.name from u1 right join u2 on u1.id = u2.id;
+--------+----------+--------+----------+
| u1.id | u1.name | u2.id | u2.name |
+--------+----------+--------+----------+
| 1 | a | 1 | d |
| 1 | a | 1 | e |
| 2 | b | 2 | f |
| 3 | c | 3 | g |
| 3 | d | 3 | g |
| NULL | NULL | 5 | h |
+--------+----------+--------+----------+
0: jdbc:hive2://master:10000> select * from b_course;
+--------------+----------------+-----------------+
| b_course.id | b_course.name | b_course.score |
+--------------+----------------+-----------------+
| 3 | hadoop | 91 |
| 3 | python | 89 |
| 3 | c | 74 |
| 3 | java | 98 |
| 6 | c | 76 |
| 7 | python | 87 |
| 1 | hadoop | 80 |
| 1 | python | 91 |
| 1 | c | 78 |
| 1 | java | 90 |
| 8 | hadoop | 88 |
| 5 | java | 93 |
| 2 | python | 80 |
| 2 | c | 76 |
| 2 | java | 75 |
| 2 | hadoop | 93 |
+--------------+----------------+-----------------+
0: jdbc:hive2://master:10000> select u1.id, u1.name,u2.id, u2.name ,b.name,b.score
. . . . . . . . . . . . . . > from u1 right join u2 on u1.id = u2.id
. . . . . . . . . . . . . . > left join b_course b on u1.id = b.id
. . . . . . . . . . . . . . > ;
+--------+----------+--------+----------+---------+----------+
| u1.id | u1.name | u2.id | u2.name | b.name | b.score |
+--------+----------+--------+----------+---------+----------+
| 1 | a | 1 | d | hadoop | 80 |
| 1 | a | 1 | d | python | 91 |
| 1 | a | 1 | d | c | 78 |
| 1 | a | 1 | d | java | 90 |
| 1 | a | 1 | e | hadoop | 80 |
| 1 | a | 1 | e | python | 91 |
| 1 | a | 1 | e | c | 78 |
| 1 | a | 1 | e | java | 90 |
| 2 | b | 2 | f | python | 80 |
| 2 | b | 2 | f | c | 76 |
| 2 | b | 2 | f | java | 75 |
| 2 | b | 2 | f | hadoop | 93 |
| 3 | c | 3 | g | hadoop | 91 |
| 3 | c | 3 | g | python | 89 |
| 3 | c | 3 | g | c | 74 |
| 3 | c | 3 | g | java | 98 |
| 3 | d | 3 | g | hadoop | 91 |
| 3 | d | 3 | g | python | 89 |
| 3 | d | 3 | g | c | 74 |
| 3 | d | 3 | g | java | 98 |
| NULL | NULL | 5 | h | NULL | NULL |
+--------+----------+--------+----------+---------+----------+
Joining n tables requires at least n-1 join conditions; for example, joining four tables requires at least three join conditions.
Hive always executes joins from left to right, launching one MapReduce job for each pair of JOIN objects.
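The join semantics shown above can be sketched in Python on the same u1/u2 data, with `None` standing in for NULL. This is a simulation of the result semantics, not of how Hive actually executes joins:

```python
# (id, name) rows, exactly as in the u1 and u2 tables above
u1 = [(1, 'a'), (2, 'b'), (3, 'c'), (3, 'd'), (4, 'e')]
u2 = [(1, 'd'), (1, 'e'), (2, 'f'), (3, 'g'), (5, 'h')]

def inner_join(left, right):
    # Keep only pairs whose join keys match (equi-join)
    return [(l, r) for l in left for r in right if l[0] == r[0]]

def left_join(left, right):
    # Every left row survives; unmatched rows are padded with NULLs
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out += [(l, r) for r in matches]
        else:
            out.append((l, (None, None)))
    return out

def full_join(left, right):
    # Left join, plus the right rows that matched nothing on the left
    out = left_join(left, right)
    left_ids = {l[0] for l in left}
    out += [((None, None), r) for r in right if r[0] not in left_ids]
    return out

print(len(inner_join(u1, u2)))  # 5 rows, matching the output above
print(left_join(u1, u2)[-1])    # ((4, 'e'), (None, None))
print(full_join(u1, u2)[-1])    # ((None, None), (5, 'h'))
```

A right join is simply the mirror image: `left_join(u2, u1)` with the column order swapped. Note how the duplicate key 3 in u1 multiplies the matching rows, just as in the beeline output.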
5. Sorting clauses
5.1 order by
ORDER BY performs a global sort, so the query uses a single reducer.
-- Plain sort
select * from emp order by deptno;
-- Sort by an alias
select empno, ename, job, mgr, sal + nvl(comm, 0) salcomm,deptno
from emp
order by salcomm desc;
-- Sort by multiple columns
select empno, ename, job, mgr, sal + nvl(comm, 0) salcomm,deptno
from emp
order by deptno, salcomm desc;
-- Sort columns must appear in the SELECT clause. The following statement fails because deptno is missing from the SELECT clause:
select empno, ename, job, mgr, sal + nvl(comm, 0) salcomm
from emp
order by deptno, salcomm desc;
5.2 sort by
In many business scenarios globally ordered data is not needed, and sort by can be used instead. sort by sorts within each reducer, producing one sorted output file per reducer, i.e. locally ordered results.
-- Set the number of reducers
set mapreduce.job.reduces=2;
-- List employees in descending salary order
select * from emp sort by sal desc;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredate | emp.sal | emp.comm | emp.deptno |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| 7902 | FORD | ANALYST | 7566 | 2011-12-03 | 3000 | NULL | 20 |
| 7788 | SCOTT | ANALYST | 7566 | 2017-07-13 | 3000 | NULL | 20 |
| 7566 | JONES | MANAGER | 7839 | 2011-04-02 | 2975 | NULL | 20 |
| 7844 | TURNER | SALESMAN | 7698 | 2011-09-08 | 1500 | 0 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 2011-02-22 | 1250 | 500 | 30 |
| 7654 | MARTIN | SALESMAN | 7698 | 2011-09-28 | 1250 | 1400 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 2017-07-13 | 1100 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 2011-12-03 | 950 | NULL | 30 |
| 7369 | SMITH | CLERK | 7902 | 2010-12-17 | 800 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 2011-11-07 | 5000 | NULL | 10 |
| 7698 | BLAKE | MANAGER | 7839 | 2011-05-01 | 2850 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 2011-06-09 | 2450 | NULL | 10 |
| 7499 | ALLEN | SALESMAN | 7698 | 2011-02-20 | 1600 | 300 | 30 |
| 7934 | MILLER | CLERK | 7782 | 2012-01-23 | 1300 | NULL | 10 |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
-- Export the query result to files (descending by salary). Two output files are produced; within each file the rows are sorted by salary in descending order
insert overwrite local directory '/root/bigdata/test_data/hive/sort_sal'
row format delimited fields terminated by ' '
select * from emp sort by sal desc;
[root@master hive]# cd sort_sal/
[root@master sort_sal]# ll
total 8
-rw-r--r--. 1 root root 411 Oct 5 13:04 000000_0
-rw-r--r--. 1 root root 230 Oct 5 13:04 000001_0
[root@master sort_sal]# cat 000000_0
7902 FORD ANALYST 7566 2011-12-03 3000 \N 20
7788 SCOTT ANALYST 7566 2017-07-13 3000 \N 20
7566 JONES MANAGER 7839 2011-04-02 2975 \N 20
7844 TURNER SALESMAN 7698 2011-09-08 1500 0 30
7521 WARD SALESMAN 7698 2011-02-22 1250 500 30
7654 MARTIN SALESMAN 7698 2011-09-28 1250 1400 30
7876 ADAMS CLERK 7788 2017-07-13 1100 \N 20
7900 JAMES CLERK 7698 2011-12-03 950 \N 30
7369 SMITH CLERK 7902 2010-12-17 800 \N 20
[root@master sort_sal]#
[root@master sort_sal]# cat 000001_0
7839 KING PRESIDENT \N 2011-11-07 5000 \N 10
7698 BLAKE MANAGER 7839 2011-05-01 2850 \N 30
7782 CLARK MANAGER 7839 2011-06-09 2450 \N 10
7499 ALLEN SALESMAN 7698 2011-02-20 1600 300 30
7934 MILLER CLERK 7782 2012-01-23 1300 \N 10
5.3 Partitioned sort (distribute by)
distribute by sends specific rows to specific reducers, which helps subsequent aggregation and sorting.
distribute by is analogous to the partitioning step in MapReduce; combined with sort by, it produces sorted data within each partition.
distribute by must be written before sort by.
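A Python sketch of what distribute by plus sort by does, assuming Hive's default hash partitioning of the key (for small integer keys this reduces to the modulus used here; the rows are a hand-picked illustrative subset of emp):

```python
# Simulate `distribute by deptno sort by salcomm desc` with 3 reducers.
rows = [  # (empno, deptno, salcomm) -- illustrative subset of emp
    (7839, 10, 5000), (7934, 10, 1300),
    (7788, 20, 3000), (7369, 20, 800),
    (7698, 30, 2850), (7900, 30, 950),
]
num_reducers = 3

# distribute by: all rows with the same deptno go to the same reducer
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[row[1] % num_reducers].append(row)

# sort by: each reducer sorts only its own partition (local order)
for part in partitions:
    part.sort(key=lambda r: r[2], reverse=True)

for i, part in enumerate(partitions):
    print(f"reducer {i}: {part}")
```

With this scheme deptno 30 lands on reducer 0, deptno 10 on reducer 1, and deptno 20 on reducer 2, mirroring the three output files in the Hive run below: each file holds one department, sorted internally, but there is no ordering across files.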
set mapreduce.job.reduces=3;
-- Write the result to files and inspect the output
insert overwrite local directory '/root/bigdata/test_data/hive/dist_by'
row format delimited fields terminated by ' '
select empno, ename, job, deptno, sal + nvl(comm, 0) salcomm
from emp
distribute by deptno
sort by salcomm desc;
[root@master hive]# cd dist_by/
[root@master dist_by]# ll
total 12
-rw-r--r--. 1 root root 164 Oct 5 13:13 000000_0
-rw-r--r--. 1 root root 81 Oct 5 13:13 000001_0
-rw-r--r--. 1 root root 129 Oct 5 13:13 000002_0
[root@master dist_by]#
[root@master dist_by]#
[root@master dist_by]# cat 000000_0
7698 BLAKE MANAGER 30 2850
7654 MARTIN SALESMAN 30 2650
7499 ALLEN SALESMAN 30 1900
7521 WARD SALESMAN 30 1750
7844 TURNER SALESMAN 30 1500
7900 JAMES CLERK 30 950
[root@master dist_by]#
[root@master dist_by]# cat 000001_0
7839 KING PRESIDENT 10 5000
7782 CLARK MANAGER 10 2450
7934 MILLER CLERK 10 1300
[root@master dist_by]# cat 000002_0
7788 SCOTT ANALYST 20 3000
7902 FORD ANALYST 20 3000
7566 JONES MANAGER 20 2975
7876 ADAMS CLERK 20 1100
7369 SMITH CLERK 20 800
5.4 cluster by
When distribute by and sort by use the same column, cluster by can be used as a shorthand.
cluster by can only sort in ascending order; a sort direction cannot be specified.
-- The following two statements are equivalent
select * from emp distribute by deptno sort by deptno;
select * from emp cluster by deptno;