1. Table DDL
Before any SQL can be processed, the table must exist first.
CREATE EXTERNAL TABLE emp_external2(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/rouze/hive_external_table/emp/';
LOAD DATA LOCAL INPATH '/opt/emp.txt' into table emp_external2;
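As a sanity check, the '\t'-delimited layout the DDL above declares can be sketched outside Hive. The sample line below is hypothetical but follows the emp.txt column order; `\N` is Hive's default serialization for NULL.

```python
# Split one hypothetical emp.txt line the way Hive's delimited SerDe would:
# fields are separated by '\t', and '\N' (Hive's default) marks a NULL.
line = "7369\tSMITH\tCLERK\t7902\t1980-12-17\t800.0\t\\N\t20"
fields = [None if f == "\\N" else f for f in line.split("\t")]
empno, ename, job, mgr, hiredate, sal, comm, deptno = fields
print(empno, ename, float(sal), comm)  # comm is None for this row
```

If the file's delimiter does not match the FIELDS TERMINATED BY clause, Hive does not error out; it simply yields NULL columns, which is the first thing to check when a SELECT returns all NULLs.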
hive (default)> CREATE EXTERNAL TABLE emp_external2(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/rouze/hive_external_table/emp/';
OK
Time taken: 18.643 seconds
hive (default)> LOAD DATA LOCAL INPATH '/opt/emp.txt' into table emp_external2;
Loading data to table default.emp_external2
Table default.emp_external2 stats: [numFiles=0, totalSize=0]
OK
Time taken: 4.493 seconds
hive (default)> select *from emp_external2;
OK
emp_external2.empno emp_external2.ename emp_external2.job emp_external2.mgr emp_external2.hiredate emp_external2.sal emp_external2.comm emp_external2.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 1.948 seconds, Fetched: 14 row(s)
2. Two common table types in Hive
Managed tables: MANAGED TABLE
External tables: EXTERNAL TABLE
The difference between managed and external tables
Difference: dropping a managed table deletes both the data and the metadata; dropping an external table deletes only the metadata, and the data remains.
Dropping a managed table: both HDFS data and META (metastore entries) are deleted.
Dropping an external table: HDFS data is kept; only the META entry is deleted.
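To confirm which kind a given table is, DESCRIBE FORMATTED reports the table type. A minimal check against the table above:

```sql
-- The "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE
DESCRIBE FORMATTED emp_external2;
```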
hive (default)> create table emp1 as select * from emp_external2;
hive (default)> drop table emp_external2;
OK
Time taken: 0.138 seconds
Verified by testing.
Scenarios where EXTERNAL should be used:
Hive + Sentry (permission management)
Data backup / migration
Converting an external table to a managed table (and back)
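The external/managed flag can be flipped after creation via a table property. A hedged sketch (the property value is case-sensitive in some Hive versions, so upper-case is the safe choice):

```sql
-- external -> managed: dropping it will now also delete the HDFS data
ALTER TABLE emp_external2 SET TBLPROPERTIES('EXTERNAL'='FALSE');
-- managed -> external: dropping it will keep the HDFS data
ALTER TABLE emp_external2 SET TBLPROPERTIES('EXTERNAL'='TRUE');
```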
3. Why partition tables
log_90121229
log_90121230  -> create an index on the time field?
Without partitioning:
log
1229
1230
select … from log where day='…'
scans the whole table; the cost is disk IO plus network IO.
Partition table: each partition corresponds to a folder/directory on HDFS
log
day=90121229
… <==
day=90121230
…
select … from log where day='90121229'
reads only one directory, so the IO overhead is on a completely different level.
Typical partitionable entities: Ticket, service request, Order
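The saving described above comes from partition pruning: with a WHERE filter on the partition column, Hive lists and scans only that partition's directory instead of the full table. EXPLAIN makes the pruning visible; a sketch against the hypothetical day-partitioned log table from this scenario:

```sql
-- the plan should show that only the day=90121229 partition is read
EXPLAIN SELECT count(*) FROM log WHERE day = '90121229';
```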
4. Partition tables
create table order_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (default)> LOAD DATA LOCAL INPATH '/opt/order.txt' INTO TABLE order_partition PARTITION (event_month='2019-10');
hive (default)> select * from order_partition;
order_partition.order_no	order_partition.event_time	order_partition.event_month
1000 2019-10-12 2019-10
1001 2019-11-12 2019-10
1002 2019-10-12 2019-10
1003 2019-11-12 2019-10
1004 2019-12-12 2019-10
Time taken: 0.05 seconds, Fetched: 5 row(s)
When loading data into a partitioned table, you must specify the partition column.
hive (default)> LOAD DATA INPATH '/order.txt' INTO TABLE order_partition PARTITION (event_month='2019-11');
Loading data to table default.order_partition partition (event_month=2019-11)
Partition default.order_partition{event_month=2019-11} stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0]
OK
Time taken: 0.52 seconds
For partitioned tables, if data is written to HDFS directly (e.g. with hadoop fs -put),
by default SQL queries will not see it.
Why? Because the metastore has no record of that partition.
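SHOW PARTITIONS is a quick way to see which partitions the metastore actually knows about, before and after a refresh:

```sql
-- lists only partitions registered in the metastore,
-- regardless of what directories exist on HDFS
SHOW PARTITIONS order_partition;
```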
CREATE TABLE rouze_order_mulit_partition (
order_number string,
event_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH "/opt/order_created.txt"
OVERWRITE INTO TABLE rouze_order_mulit_partition;
hive (default)> LOAD DATA LOCAL INPATH "/opt/order_created.txt"
> OVERWRITE INTO TABLE rouze_order_mulit_partition ;
Loading data to table default.rouze_order_mulit_partition
Table default.rouze_order_mulit_partition stats: [numFiles=1, totalSize=213]
OK
Time taken: 3.403 seconds
insert overwrite table order_partition PARTITION(event_month='2014-07')
select * from rouze_order_mulit_partition;
Refresh the partition metadata:
hive (default)> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2014-07') ;
OK
Time taken: 0.108 seconds
ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2014-07') ;
msck repair table order_partition;
Experiment: write data directly to HDFS, then query it
[rouze@hadoop001 /opt]$hadoop fs -put order_created.txt /user/hive/warehouse/order_partition/event_month=2019-08
hive (default)> select * from order_partition where event_month='2019-08';
OK
order_partition.order_no order_partition.event_time order_partition.event_month
Time taken: 1.182 seconds
mysql> select * from partitions;
+---------+-------------+------------------+----------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+----------------------------+-------+--------+
| 1 | 1576430124 | 0 | event_month=2019-10-12 | 9 | 8 |
| 2 | 1576430284 | 0 | event_month=2019-10 | 10 | 8 |
| 3 | 1576431119 | 0 | event_month=2019-11 | 11 | 8 |
| 6 | 1576514931 | 0 | event_month=2014-07 | 22 | 8 |
| 7 | 1576517053 | 0 | event_month=2014-05/step=1 | 24 | 17 |
+---------+-------------+------------------+----------------------------+-------+--------+
5 rows in set (0.00 sec)
Refresh the metadata:
ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2019-08') ;
hive (default)> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION(event_month='2019-08')
> ;
OK
Time taken: 0.513 seconds
mysql> select * from partitions;
+---------+-------------+------------------+----------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+----------------------------+-------+--------+
| 1 | 1576430124 | 0 | event_month=2019-10-12 | 9 | 8 |
| 2 | 1576430284 | 0 | event_month=2019-10 | 10 | 8 |
| 3 | 1576431119 | 0 | event_month=2019-11 | 11 | 8 |
| 6 | 1576514931 | 0 | event_month=2014-07 | 22 | 8 |
| 7 | 1576517053 | 0 | event_month=2014-05/step=1 | 24 | 17 |
| 11 | 1576587354 | 0 | event_month=2019-08 | 26 | 8 |
+---------+-------------+------------------+----------------------------+-------+--------+
6 rows in set (0.01 sec)
hive (default)> select * from order_partition where event_month='2019-08';
order_partition.order_no	order_partition.event_time	order_partition.event_month
10703007267488 2014-05-01 06:01:12.334+01 2019-08
10101043505096 2014-05-01 07:28:12.342+01 2019-08
10103043509747 2014-05-01 07:50:12.33+01 2019-08
10103043501575 2014-05-01 09:27:12.33+01 2019-08
10104043514061 2014-05-01 09:03:12.324+01 2019-08
Time taken: 0.198 seconds, Fetched: 5 row(s)
Multi-level partition tables
A single-level partition maps to one directory level.
A three-level partition maps to three directory levels.
create table order_mulit_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string, step string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/data/order_created.txt' INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
LOAD DATA LOCAL INPATH "/opt/order_created.txt" INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
hive (default)> LOAD DATA LOCAL INPATH "/opt/order_created.txt" INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',step='1');
Loading data to table default.order_mulit_partition partition (event_month=2014-05, step=1)
Partition default.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.29 seconds
When using a multi-level partitioned table, you must specify all of its partition columns when loading data.
CREATE TABLE emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE emp_partition PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
hive (default)> select * from emp_partition;
OK
emp_partition.empno emp_partition.ename emp_partition.job emp_partition.mgr emp_partition.hiredate emp_partition.sal emp_partition.comm emp_partition.deptno
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.107 seconds, Fetched: 3 row(s)
Dynamic partitions
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> INSERT OVERWRITE TABLE emp_dynamic_partition PARTITION (deptno) select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;
hive (default)> select * from emp_dynamic_partition;
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.sal emp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 0.412 seconds, Fetched: 14 row(s)
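The two `set` commands above are the minimum required: dynamic partitioning is off by default, and in strict mode at least one partition column must be static. There are also guard settings against partition explosions; the names below are real Hive properties, but their defaults vary by version, so the values shown are illustrative:

```sql
-- required before a dynamic-partition INSERT
set hive.exec.dynamic.partition=true;            -- enable the feature
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
-- optional guards against creating too many partitions in one job
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```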
5. Three ways to export data
INSERT OVERWRITE DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;
hadoop fs -get
hive -e "" > file
1. Export to a directory
hive (default)> INSERT OVERWRITE DIRECTORY '/opt/emp_insert' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;
Query ID = rouze_20191217230101_b4c7c1a3-6006-4fcf-a6d2-95ab0476155a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1576557773324_0004, Tracking URL = http://hadoop001:38088/proxy/application_1576557773324_0004/
Kill Command = /home/rouze/app/hadoop/bin/hadoop job -kill job_1576557773324_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-12-17 23:02:08,595 Stage-1 map = 0%, reduce = 0%
2019-12-17 23:02:23,309 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.75 sec
MapReduce Total cumulative CPU time: 3 seconds 750 msec
Ended Job = job_1576557773324_0004
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/opt/emp_insert/.hive-staging_hive_2019-12-17_23-01-49_763_8189611662586281089-1/-ext-10000
Moving data to: /opt/emp_insert
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.75 sec HDFS Read: 4190 HDFS Write: 154 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 750 msec
OK
empno ename
Time taken: 37.01 seconds
[rouze@hadoop001 /home/rouze]$hadoop fs -text /opt/emp_insert/000000_0
19/12/17 23:04:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369 SMITH
7499 ALLEN
7521 WARD
7566 JONES
7654 MARTIN
7698 BLAKE
7782 CLARK
7788 SCOTT
7839 KING
7844 TURNER
7876 ADAMS
7900 JAMES
7902 FORD
7934 MILLER
2. hadoop fs -get
[rouze@hadoop001 /home/rouze]$hadoop fs -get /user/hive/warehouse/emp/emp.txt ~/test_emp.txt
19/12/17 23:08:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[rouze@hadoop001 /home/rouze]$ll
总用量 24
-rw-rw-r--. 1 rouze rouze 26 12月 9 00:00 1.log
-rw-rw-r--. 1 rouze rouze 26 12月 9 00:05 2.log
-rw-r--r--. 1 rouze rouze 6 12月 7 22:13 aa
drwxrwxr-x. 4 rouze rouze 78 12月 15 00:21 app
drwxrwxr-x. 2 rouze rouze 19 12月 1 13:01 data
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 lib
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 log
-rw-r--r--. 1 rouze rouze 6 12月 1 01:49 slaves1
drwxrwxr-x. 2 rouze rouze 43 12月 1 00:37 software
drwxrwxr-x. 2 rouze rouze 6 12月 1 00:05 sourcecode
drwxrwxr-x. 2 rouze rouze 6 12月 8 13:18 test
-rw-r--r--. 1 rouze rouze 6 12月 7 22:18 test_copy
-rw-r--r--. 1 rouze rouze 656 12月 17 23:08 test_emp.txt
drwxrwxrwx. 4 rouze rouze 207 12月 17 12:42 tmp
3. hive -e
[rouze@hadoop001 /home/rouze]$hive -e "select * from emp limit 1" > emp.log
which: no hbase in (/home/rouze/app/hive-1.1.0-cdh5.16.2/bin:/home/rouze/app/hadoop/bin:/home/rouze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rouze/.local/bin:/home/rouze/bin)
19/12/17 23:10:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in jar:file:/home/rouze/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
OK
Time taken: 12.611 seconds, Fetched: 1 row(s)
[rouze@hadoop001 /home/rouze]$cat emp.log
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Whether an MR job is launched at all is controlled by:
hive.fetch.task.conversion
Default Value: minimal in Hive 0.10.0 through 0.13.1, more in Hive 0.14.0 and later
Added In: Hive 0.10.0 with HIVE-2925; default changed in Hive 0.14.0 with HIVE-7397
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)
minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
“more” can take any kind of expressions in the SELECT clause, including UDFs.
(UDTFs and lateral views are not yet supported – see HIVE-5718.)
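The effect is easy to observe per session: with `none`, even a trivial query launches MapReduce; with `more`, a simple SELECT with a LIMIT returns directly as a fetch task. A sketch against the emp table used throughout these notes:

```sql
set hive.fetch.task.conversion=none;
select * from emp limit 3;   -- launches an MR job
set hive.fetch.task.conversion=more;
select * from emp limit 3;   -- plain fetch, no MR job
```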