SELECT queries
hive (default)>
> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
Time taken: 0.229 seconds, Fetched: 14 row(s)
hive (default)> select empno,ename from emp ;
OK
empno ename
7839 KING
7782 CLARK
7934 MILLER
7902 FORD
7788 SCOTT
7566 JONES
7876 ADAMS
7369 SMITH
7698 BLAKE
7499 ALLEN
7844 TURNER
7654 MARTIN
7521 WARD
7900 JAMES
Time taken: 0.047 seconds, Fetched: 14 row(s)
hive (default)>
hive (default)> select * from emp where sal between 1000 and 2000;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
Time taken: 0.032 seconds, Fetched: 6 row(s)
hive (default)> select * from emp where sal>=3000;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
Time taken: 0.074 seconds, Fetched: 3 row(s)
hive (default)>
hive (default)>
> select * from emp limit 2;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
Time taken: 0.035 seconds, Fetched: 2 row(s)
hive (default)> select * from emp where comm is null;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
Time taken: 0.046 seconds, Fetched: 10 row(s)
hive (default)>
Aggregate functions
max/min/count/sum/avg. Their defining trait is many-in, one-out: many rows go in, a single row comes out.
-
Count how many employees are in department 10:
hive (default)> select deptno,count(1) sum from emp group by deptno having deptno=10;
-
Find the maximum, minimum, and average salary, plus the salary total:
hive (default)> select max(sal),min(sal),avg(sal),sum(sal) from emp;
Grouping with GROUP BY
Notes on using GROUP BY: 1. Every column in the SELECT list that is not wrapped in an aggregate function must also appear in GROUP BY, otherwise the query fails (see the sketch below);
2. GROUP BY is usually paired with HAVING to filter on aggregates, and with aggregate functions in the SELECT list;
3. GROUP BY can take multiple columns, separated by ",".
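For example, a minimal sketch of rule 1 against the same emp table (the exact error text varies by Hive version):
-- fails: ename is in SELECT but is neither aggregated nor listed in GROUP BY
select deptno, ename, avg(sal) from emp group by deptno;
-- works: every non-aggregated SELECT column is grouped
select deptno, avg(sal) from emp group by deptno;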
-
Average salary per department:
hive (default)> select deptno,avg(sal) avgsal from emp group by deptno;
-
Find the highest salary for each department (deptno) and job (job); straightforward, just mind which columns follow GROUP BY:
hive (default)> select deptno,job,max(sal) maxsal from emp group by deptno,job;
-
For each department, which jobs have an average salary above 2000:
hive (default)> select deptno,job,avg(sal) avgsal from emp group by deptno,job having avg(sal)>2000 ;
CASE WHEN ... THEN ... END: derive a label from an existing column
hive (default)>
>
> select ename,sal,
> case
> when sal>1 and sal<=1000 then 'lower'
> when sal>1000 and sal<=2000 then 'just so so'
> when sal>2000 and sal<=3000 then 'ok'
> else 'high'
> end
> from emp ;
OK
ename sal _c2
KING 5000.0 high
CLARK 2450.0 ok
MILLER 1300.0 just so so
FORD 3000.0 ok
SCOTT 3000.0 ok
JONES 2975.0 ok
ADAMS 1100.0 just so so
SMITH 800.0 lower
BLAKE 2850.0 ok
ALLEN 1600.0 just so so
TURNER 1500.0 just so so
MARTIN 1250.0 just so so
WARD 1250.0 just so so
JAMES 950.0 lower
Time taken: 0.048 seconds, Fetched: 14 row(s)
hive (default)>
Differences among order by, sort by, distribute by, and cluster by
- order by: global sort, performed in a single reducer; generally not recommended in production
- sort by: partial sort; rows are ordered within each reducer
- distribute by: routes each row to a particular reducer according to the given columns
- cluster by: equivalent to distribute by xxx sort by xxxx (provided xxx and xxxx are the same column); rarely used in practice, see the sketch below
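A quick sketch of the cluster by equivalence on the same emp table (note that cluster by only sorts ascending; for descending order you must fall back to distribute by + sort by ... desc):
-- these two queries route and sort rows identically
select * from emp distribute by deptno sort by deptno;
select * from emp cluster by deptno;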
By default only 1 reduce task runs. If you need multiple reducers, set the
mapred.reduce.tasks parameter.
set mapred.reduce.tasks=3 only takes effect for the current session; to make it permanent, set it in hive-site.xml, roughly like the snippet below.
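A minimal hive-site.xml property for the permanent setting might look like this (on Hadoop 2 the equivalent key is mapreduce.job.reduces, which the job logs below also mention):
<property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
</property>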
When sorting with order by and hive.mapred.mode=strict,
the query fails unless it ends with a limit clause. The system default is hive.mapred.mode=nonstrict,
the non-strict mode, so the query runs with or without limit, as sketched below.
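Roughly what the strict-mode check looks like in a session (a sketch; the exact error message depends on the Hive version):
hive (default)> set hive.mapred.mode=strict;
hive (default)> select * from emp order by empno; -- fails: order by without limit in strict mode
hive (default)> select * from emp order by empno limit 5; -- OK
hive (default)> set hive.mapred.mode=nonstrict;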
hive (default)> set mapred.reduce.tasks;
mapred.reduce.tasks=3
hive (default)> insert overwrite local directory '/home/hadoop/data/sort/' row format delimited fields terminated by ',' select * from emp sort by empno;
Query ID = hadoop_20190719141717_d0b35106-40a0-4d0a-a1f7-e17f4e544cdf
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1562553101223_0020, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0020/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2019-07-19 15:36:47,245 Stage-1 map = 0%, reduce = 0%
2019-07-19 15:36:51,327 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.77 sec
2019-07-19 15:36:56,557 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 2.09 sec
2019-07-19 15:36:58,644 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.0 sec
MapReduce Total cumulative CPU time: 5 seconds 0 msec
Ended Job = job_1562553101223_0020
Copying data to local directory /home/hadoop/data/sort
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 5.0 sec HDFS Read: 15426 HDFS Write: 802 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 0 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 18.827 seconds
Check /home/hadoop/data/sort to confirm there are 3 files:
[hadoop@hadoop001 ~]$ cd /home/hadoop/data/sort
[hadoop@hadoop001 sort]$ ll
total 12
-rw-r--r-- 1 hadoop hadoop 341 Jul 19 15:37 000000_0
-rw-r--r-- 1 hadoop hadoop 345 Jul 19 15:37 000001_0
-rw-r--r-- 1 hadoop hadoop 116 Jul 19 15:37 000002_0
[hadoop@hadoop001 sort]$
Hive built-in functions
Hive ships with a lot of built-in functions; there is no need to memorize them, only to know how to look them up:
show functions; — list all built-in functions
desc function length; — show the description of a function (length here), without examples
desc function extended length; — show the description of a function along with usage examples
hive (default)>
> create table dual(x string);
> insert into table dual values(''); // a dual table holding a single empty row, used for the tests below
Commonly used built-in functions
hive (default)> select ename,lower(ename) from emp;
OK
ename _c1
KING king
CLARK clark
MILLER miller
FORD ford
SCOTT scott
JONES jones
ADAMS adams
SMITH smith
BLAKE blake
ALLEN allen
TURNER turner
MARTIN martin
WARD ward
JAMES james
Time taken: 0.034 seconds, Fetched: 14 row(s)
hive (default)>
Date built-in functions
hive (default)> select current_date from dual;
OK
_c0
2019-07-19
Time taken: 0.034 seconds, Fetched: 1 row(s)
hive (default)>
> select current_timestamp from dual;
OK
_c0
2019-07-19 16:37:01.49
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive (default)>
hive (default)> select unix_timestamp() from dual; // unix timestamp of the current time
OK
_c0
1563525550
Time taken: 0.035 seconds, Fetched: 1 row(s)
hive (default)> select unix_timestamp("2019-07-19 16:42:00") from dual;
OK
_c0
1563525720
Time taken: 0.031 seconds, Fetched: 1 row(s)
hive (default)> select unix_timestamp("20190719 164200","yyyyMMdd HHmmss") from dual;//如果时间格式不对,需要先修改格式
OK
_c0
1563525720
Time taken: 0.033 seconds, Fetched: 1 row(s)
hive (default)>
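from_unixtime goes the other way, turning a unix timestamp back into a date string (a quick sketch; per the session above, 1563525720 maps back to 2019-07-19 16:42:00 in this session's timezone):
hive (default)> select from_unixtime(1563525720) from dual;
hive (default)> select from_unixtime(1563525720,'yyyy-MM-dd') from dual;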
hive (default)>
>
> select year("2019-07-19 16:42:00") from dual;
OK
_c0
2019
Time taken: 0.032 seconds, Fetched: 1 row(s)
hive (default)> select month("2019-07-19 16:42:00") from dual;
OK
_c0
7
Time taken: 0.031 seconds, Fetched: 1 row(s)
hive (default)> select day("2019-07-19 16:42:00") from dual;
OK
_c0
19
Time taken: 0.03 seconds, Fetched: 1 row(s)
hive (default)>
hive (default)> select date_add("2019-07-19 16:52:57.04",10) from dual; // add 10 days
OK
_c0
2019-07-29
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive (default)> select date_sub("2019-07-19 16:52:57.04",10) from dual; // subtract 10 days
OK
_c0
2019-07-09
Time taken: 0.032 seconds, Fetched: 1 row(s)
hive (default)>
Type conversion: cast
hive (default)> select cast("5"as int) from dual;
OK
_c0
5
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive (default)> select cast("5a"as int) from dual; //若果前面的5a不满足后面的类型,将会为null
OK
_c0
NULL
Time taken: 0.027 seconds, Fetched: 1 row(s)
hive (default)> select cast("2018-8-8"as date) from dual;
OK
_c0
2018-08-08
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive (default)> select cast(532 as string) from dual;
OK
_c0
532
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive (default)>
Built-in numeric functions
hive (default)> select round(5.4) from dual; // round: rounds to the nearest integer
OK
_c0
5.0
Time taken: 0.035 seconds, Fetched: 1 row(s)
hive (default)> select round(3.145654,2) from dual;
OK
_c0
3.15
Time taken: 0.027 seconds, Fetched: 1 row(s)
hive (default)>
hive (default)> select least(3,4,5) from dual; // returns the smallest of the arguments
OK
_c0
3
Time taken: 0.034 seconds, Fetched: 1 row(s)
hive (default)> select greatest(3,4,5) from dual; // returns the largest of the arguments
OK
_c0
5
Time taken: 0.025 seconds, Fetched: 1 row(s)
hive (default)>
String functions
hive (default)> select concat("www",".","baicu.com") from dual;
OK
_c0
www.baicu.com
Time taken: 0.037 seconds, Fetched: 1 row(s)
hive (default)> select concat_ws(".","192","168","1","1") from dual;
OK
_c0
192.168.1.1
Time taken: 0.027 seconds, Fetched: 1 row(s)
hive (default)>
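split, which the word-count example below relies on, cuts a string into an array. Its delimiter argument is a regular expression, so regex metacharacters such as "." must be escaped (a quick sketch):
hive (default)> select split('192.168.1.1','\\.') from dual; // returns ["192","168","1","1"]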
Word count (wc) with Hive
[hadoop@hadoop001 data]$ cat /home/hadoop/data/wc.txt
hello hello hello
world world
welcome
[hadoop@hadoop001 data]$
hive (default)> create table if not exists wc(sentence string);
OK
Time taken: 0.268 seconds
hive (default)> load data local inpath '/home/hadoop/data/wc.txt' overwrite into table wc;
Loading data to table default.wc
Table default.wc stats: [numFiles=1, numRows=0, totalSize=44, rawDataSize=0]
OK
Time taken: 0.198 seconds
hive (default)> select word,count(1) as cnt from
> (select explode(split(sentence,'\t')) as word from wc) t
> group by word order by cnt desc;
Query ID = hadoop_20190719160606_75417f04-c6d4-45c9-9309-3d0cadf21313
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1562553101223_0022, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0022/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-19 17:43:54,749 Stage-1 map = 0%, reduce = 0%
2019-07-19 17:43:58,883 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.51 sec
2019-07-19 17:44:03,008 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.74 sec
MapReduce Total cumulative CPU time: 2 seconds 740 msec
Ended Job = job_1562553101223_0022
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1562553101223_0023, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0023/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0023
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2019-07-19 17:44:09,024 Stage-2 map = 0%, reduce = 0%
2019-07-19 17:44:13,185 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.65 sec
2019-07-19 17:44:18,307 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 1.95 sec
MapReduce Total cumulative CPU time: 1 seconds 950 msec
Ended Job = job_1562553101223_0023
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.74 sec HDFS Read: 6753 HDFS Write: 169 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 1.95 sec HDFS Read: 4655 HDFS Write: 25 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 690 msec
OK
word cnt
hello 3
world 2
welcome 1
Time taken: 31.204 seconds, Fetched: 3 row(s)
hive (default)>
Note: explode is a one-in, many-out function: given an array (or map), it emits one row per element, expanding a single column value into multiple rows. A minimal example follows.
split is a built-in Hive function that cuts a string into an array.
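A minimal sketch of explode on its own, using the dual table created earlier:
hive (default)> select explode(split('a,b,c',',')) from dual; // returns three rows: a, b, c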