1. Table DDL
Before any SQL can be processed, the table must exist first.
CREATE EXTERNAL TABLE emp_external2(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/rouze/hive_external_table/emp/';
LOAD DATA LOCAL INPATH '/opt/emp.txt' into table emp_external2;
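As a sanity check, the '\t'-delimited layout the DDL above declares can be sketched outside Hive. The sample line below is hypothetical but follows the emp.txt column order; `\N` is Hive's default serialization for NULL.

```python
# Split one hypothetical emp.txt line the way Hive's delimited SerDe would:
# fields are separated by '\t', and '\N' (Hive's default) marks a NULL.
line = "7369\tSMITH\tCLERK\t7902\t1980-12-17\t800.0\t\\N\t20"
fields = [None if f == "\\N" else f for f in line.split("\t")]
empno, ename, job, mgr, hiredate, sal, comm, deptno = fields
print(empno, ename, float(sal), comm)  # comm is None for this row
```

If the file's delimiter does not match the FIELDS TERMINATED BY clause, Hive does not error out; it simply yields NULL columns, which is the first thing to check when a SELECT returns all NULLs.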
hive (default)> CREATE EXTERNAL TABLE emp_external2(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/rouze/hive_external_table/emp/';
OK
Time taken: 18.643 seconds
hive (default)> LOAD DATA LOCAL INPATH '/opt/emp.txt' into table emp_external2;
Loading data to table default.emp_external2
Table default.emp_external2 stats: [numFiles=0, totalSize=0]
OK
Time taken: 4.493 seconds
hive (default)> select *from emp_external2;
OK
emp_external2.empno emp_external2.ename emp_external2.job emp_external2.mgr emp_external2.hiredate emp_external2.sal emp_external2.comm emp_external2.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 1.948 seconds, Fetched: 14 row(s)
2. Two common table types in Hive
Managed tables: MANAGED TABLE
External tables: EXTERNAL TABLE
The difference between managed and external tables
Difference: dropping a managed table deletes both the data and the metadata; dropping an external table deletes only the metadata, and the data remains.
Dropping a managed table: both HDFS data and META (metastore entries) are deleted.
Dropping an external table: HDFS data is kept; only the META entry is deleted.
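To confirm which kind a given table is, DESCRIBE FORMATTED reports the table type. A minimal check against the table above:

```sql
-- The "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE
DESCRIBE FORMATTED emp_external2;
```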
hive (default)> create table emp1 as select * from emp_external2;
hive (default)> drop table emp_external2;
OK
Time taken: 0.138 seconds
Verified by testing.
Scenarios where EXTERNAL should be used:
Hive + Sentry (permission management)
Data backup / migration
Converting an external table to a managed table (and back)
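The external/managed flag can be flipped after creation via a table property. A hedged sketch (the property value is case-sensitive in some Hive versions, so upper-case is the safe choice):

```sql
-- external -> managed: dropping it will now also delete the HDFS data
ALTER TABLE emp_external2 SET TBLPROPERTIES('EXTERNAL'='FALSE');
-- managed -> external: dropping it will keep the HDFS data
ALTER TABLE emp_external2 SET TBLPROPERTIES('EXTERNAL'='TRUE');
```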
3. Why partition tables
log_90121229
log_90121230  -> create an index on the time field?
Without partitioning:
log
1229
1230
select … from log where day='…'
scans the whole table; the cost is disk IO plus network IO.
Partition table: each partition corresponds to a folder/directory on HDFS
log
day=90121229
… <==
day=90121230
…
select … from log where day='90121229'
reads only one directory, so the IO overhead is on a completely different level.
Typical partitionable entities: Ticket, service request, Order
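The saving described above comes from partition pruning: with a WHERE filter on the partition column, Hive lists and scans only that partition's directory instead of the full table. EXPLAIN makes the pruning visible; a sketch against the hypothetical day-partitioned log table from this scenario:

```sql
-- the plan should show that only the day=90121229 partition is read
EXPLAIN SELECT count(*) FROM log WHERE day = '90121229';
```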
4. Partition tables
create table order_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (default)> LOAD DATA LOCAL INPATH '/opt/order.txt' INTO TABLE order_partition PARTITION (event_month='2019-10');
hive (default)> select * from order_partition;
order_partition.order_no	order_partition.event_time	order_partition.event_month
1000 2019-10-12 2019-10
1001 2019-11-12 2019-10
1002 2019-10-12 2019-10
1003 2019-11-12 2019-10
1004 2019-12-12 2019-10
Time taken: 0.05 seconds, Fetched: 5 row(s)
When loading data into a partitioned table, you must specify the partition column.
hive (default)> LOAD DATA INPATH '/order.txt' INTO TABLE order_partition PARTITION (event_month='2019-11');
Loading data to table default.order_partition partition (event_month=2019-11)
Partition default.order_partition{event_month=2019-11} stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0]
OK
Time taken: 0.52 seconds
For partitioned tables, if data is written to HDFS directly (e.g. with hadoop fs -put),
by default SQL queries will not see it.
Why? Because the metastore has no record of that partition.
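SHOW PARTITIONS is a quick way to see which partitions the metastore actually knows about, before and after a refresh:

```sql
-- lists only partitions registered in the metastore,
-- regardless of what directories exist on HDFS
SHOW PARTITIONS order_partition;
```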
CREATE TABLE rouze_order_mulit_partition (
order_number string,
event_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH "/opt/order_created.txt"
OVERWRITE INTO TABLE rouze_order_mulit_partition;
hive (default)> LOAD DATA LOCAL INPATH "/opt/order_created.txt"
> OVERWRITE INTO TABLE rouze_order_mulit_partition ;
Loading data to table default.rouze_order_mulit_partition
Table default.rouze_order_mulit_partition stats: [numFiles=1, totalSize=213]
OK
Time taken: 3.403 seconds
insert overwrite table order_partition PARTITION(event_month='2014-07')
select * from rouze_order_mulit_partition;
Refresh the partition metadata:
hive (default)> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2014-07') ;
OK
Time taken: 0.108 seconds
ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2014-07') ;
msck repair table order_partition;
Experiment: write data directly to HDFS, then query it
[rouze@hadoop001 /opt]$hadoop fs -put order_created.txt /user/hive/warehouse/order_partition/event_month=2019-08
hive (default)> select * from order_partition where event_month='2019-08';
OK
order_partition.order_no order_partition.event_time order_partition.event_month
Time taken: 1.182 seconds
mysql> select * from partitions;
+---------+-------------+------------------+----------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+----------------------------+-------+--------+
| 1 | 1576430124 | 0 | event_month=2019-10-12 | 9 | 8 |
| 2 | 1576430284 | 0 | event_month=2019-10 | 10 | 8 |
| 3 | 1576431119 | 0 | event_month=2019-11 | 11 | 8 |
| 6 | 1576514931 | 0 | event_month=2014-07 | 22 | 8 |
| 7 | 1576517053 | 0 | event_month=2014-05/step=1 | 24 | 17 |
+---------+-------------+------------------+----------------------------+-------+--------+
5 rows in set (0.00 sec)
Refresh the metadata:
ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2019-08') ;
hive (default)> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2019-08')
> ;
OK
Time taken: 0.513 seconds
mysql> select * from partitions;
+---------+-------------+------------------+----------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+----------------------------+-------+--------+
| 1 | 1576430124 | 0 | event_month=2019-10-12 | 9 | 8 |
| 2 | 1576430284 | 0 | event_month=2019-10 | 10 | 8 |
| 3 | 1576431119 | 0 | event_month=2019-11 | 11 | 8 |
| 6 | 1576514931 | 0 | event_month=2014-07 | 22 | 8 |
| 7 | 1576517053 | 0 | event_month=2014-05/step=1 | 24 | 17 |
| 11 | 1576587354 | 0 | event_month=2019-08 | 26 | 8 |
+---------+-------------+------------------+----------------------------+-------+--------+
6 rows in set (0.01 sec)
hive (default)> select * from order_partition where event_month='2019-08';
order_partition.order_no	order_partition.event_time	order_partition.event_month
10703007267488 2014-05-01 06:01:12.334+01 2019-08
10101043505096 2014-05-01 07:28:12.342+01 2019-08
10103043509747 2014-05-01 07:50:12.33+01 2019-08
10103043501575 2014-05-01 09:27:12.33+01 2019-08
10104043514061 2014-05-01 09:03:12.324+01 2019-08
Time taken: 0.198 seconds, Fetched: 5 row(s)
Multi-level partition tables
A single-level partition maps to one directory level.
A three-level partition maps to three directory levels.
create table order_mulit_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string, step string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/data/order_created.txt' INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
LOAD DATA LOCAL INPATH "/opt/order_created.txt" INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
hive (default)> LOAD DATA LOCAL INPATH "/opt/order_created.txt" INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
Loading data to table default.order_mulit_partition partition (event_month=2014-05, step=1)
Partition default.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.29 seconds
When using a multi-level partitioned table, you must specify all of its partition columns when loading data.
CREATE TABLE emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE emp_partition PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
hive (default)> select * from emp_partition;
OK
emp_partition.empno emp_partition.ename emp_partition.job emp_partition.mgr emp_partition.hiredate emp_partition.sal emp_partition.comm emp_partition.deptno
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.107 seconds, Fetched: 3 row(s)
Dynamic partitions
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> INSERT OVERWRITE TABLE emp_dynamic_partition PARTITION (deptno) select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;
hive (default)> select * from emp_dynamic_partition;
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.sal emp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 0.412 seconds, Fetched: 14 row(s)
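The two `set` commands above are the minimum required: dynamic partitioning is off by default, and in strict mode at least one partition column must be static. There are also guard settings against partition explosions; the names below are real Hive properties, but their defaults vary by version, so the values shown are illustrative:

```sql
-- required before a dynamic-partition INSERT
set hive.exec.dynamic.partition=true;            -- enable the feature
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
-- optional guards against creating too many partitions in one job
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```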
5. Three ways to export data
INSERT OVERWRITE DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;
hadoop fs -get
hive -e "" > file
1. Export to a directory
hive (default)> INSERT OVERWRITE DIRECTORY '/opt/emp_insert' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;
Query ID = rouze_20191217230101_b4c7c1a3-6006-4fcf-a6d2-95ab0476155a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1576557773324_0004, Tracking URL = http://hadoop001:38088/proxy/application_1576557773324_0004/
Kill Command = /home/rouze/app/hadoop/bin/hadoop job -kill job_1576557773324_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-12-17 23:02:08,595 Stage-1 map = 0%, reduce = 0%
2019-12-17 23:02:23,309 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.75 sec
MapReduce Total cumulative CPU time: 3 seconds 750 msec
Ended Job = job_1576557773324_0004
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/opt/emp_insert/.hive-staging_hive_2019-12-17_23-01-49_763_8189611662586281089-1/-ext-10000
Moving data to: /opt/emp_insert
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.75 sec HDFS Read: 4190 HDFS Write: 154 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 750 msec
OK
empno ename
Time taken: 37.01 seconds
[rouze@hadoop001 /home/rouze]$hadoop fs -text /opt/emp_insert/000000_0
19/12/17 23:04:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369 SMITH
7499 ALLEN
7521 WARD
7566 JONES
7654 MARTIN
7698 BLAKE
7782 CLARK
7788 SCOTT
7839 KING
7844 TURNER
7876 ADAMS
7900 JAMES
7902 FORD
7934 MILLER
2. hadoop fs -get
[rouze@hadoop001 /home/rouze]$hadoop fs -get /user/hive/warehouse/emp/emp.txt ~/test_emp.txt
19/12/17 23:08:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[rouze@hadoop001 /home/rouze]$ll
总用量 24
-rw-rw-r--. 1 rouze rouze 26 12月 9 00:00 1.log
-rw-rw-r--. 1 rouze rouze 26 12月 9 00:05 2.log
-rw-r--r--. 1 rouze rouze 6 12月 7 22:13 aa
drwxrwxr-x. 4 rouze rouze 78 12月 15 00:21 app
drwxrwxr-x. 2 rouze rouze 19 12月 1 13:01 data
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 lib
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 log
-rw-r--r--. 1 rouze rouze 6 12月 1 01:49 slaves1
drwxrwxr-x. 2 rouze rouze 43 12月 1 00:37 software
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 sourcecode
drwxrwxr-x. 2 rouze rouze 6 12月 8 13:18 test
-rw-r--r--. 1 rouze rouze 6 12月 7 22:18 test_copy
-rw-r--r--. 1 rouze rouze 656 12月 17 23:08 test_emp.txt
drwxrwxrwx. 4 rouze rouze 207 12月 17 12:42 tmp
3. hive -e
[rouze@hadoop001 /home/rouze]$hive -e "select * from emp limit 1" > emp.log
which: no hbase in (/home/rouze/app/hive-1.1.0-cdh5.16.2/bin:/home/rouze/app/hadoop/bin:/home/rouze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rouze/.local/bin:/home/rouze/bin)
19/12/17 23:10:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in jar:file:/home/rouze/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
OK
Time taken: 12.611 seconds, Fetched: 1 row(s)
[rouze@hadoop001 /home/rouze]$cat emp.log
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Whether an MR job is launched at all is controlled by:
hive.fetch.task.conversion
Default Value: minimal in Hive 0.10.0 through 0.13.1, more in Hive 0.14.0 and later
Added In: Hive 0.10.0 with HIVE-2925; default changed in Hive 0.14.0 with HIVE-7397
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)
minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
“more” can take any kind of expressions in the SELECT clause, including UDFs.
(UDTFs and lateral views are not yet supported – see HIVE-5718.)
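The effect is easy to observe per session: with `none`, even a trivial query launches MapReduce; with `more`, a simple SELECT with a LIMIT returns directly as a fetch task. A sketch against the emp table used throughout these notes:

```sql
set hive.fetch.task.conversion=none;
select * from emp limit 3;   -- launches an MR job
set hive.fetch.task.conversion=more;
select * from emp limit 3;   -- plain fetch, no MR job
```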