一,Hive数据导入的几种方式
首先列出讲述下面几种导入方式的数据和hive表。
Hive表:
创建testA:
- CREATE TABLE testA (
- id INT,
- name string,
- area string
- ) PARTITIONED BY (create_time string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
创建testB:
- CREATE TABLE testB (
- id INT,
- name string,
- area string,
- code string
- ) PARTITIONED BY (create_time string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
数据文件(sourceA.txt):
- 1,fish1,SZ
- 2,fish2,SH
- 3,fish3,HZ
- 4,fish4,QD
- 5,fish5,SR
- 1,zy1,SZ,1001
- 2,zy2,SH,1002
- 3,zy3,HZ,1003
- 4,zy4,QD,1004
- 5,zy5,SR,1005
(1)本地文件导入到Hive表
- hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');
- Copying data from file:/home/hadoop/sourceA.txt
- Copying file: file:/home/hadoop/sourceA.txt
- Loading data to table default.testa partition (create_time=2015-07-08)
- Partition default.testa{create_time=2015-07-08} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
- OK
- Time taken: 0.237 seconds
- hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');
- Copying data from file:/home/hadoop/sourceB.txt
- Copying file: file:/home/hadoop/sourceB.txt
- Loading data to table default.testb partition (create_time=2015-07-09)
- Partition default.testb{create_time=2015-07-09} stats: [numFiles=1, numRows=0, totalSize=73, rawDataSize=0]
- OK
- Time taken: 0.212 seconds
- hive> select * from testA;
- OK
- 1 fish1 SZ 2015-07-08
- 2 fish2 SH 2015-07-08
- 3 fish3 HZ 2015-07-08
- 4 fish4 QD 2015-07-08
- 5 fish5 SR 2015-07-08
- Time taken: 0.029 seconds, Fetched: 5 row(s)
- hive> select * from testB;
- OK
- 1 zy1 SZ 1001 2015-07-09
- 2 zy2 SH 1002 2015-07-09
- 3 zy3 HZ 1003 2015-07-09
- 4 zy4 QD 1004 2015-07-09
- 5 zy5 SR 1005 2015-07-09
- Time taken: 0.047 seconds, Fetched: 5 row(s)
(2)Hive表导入到Hive表
将testB的数据导入到testA表
- hive> INSERT INTO TABLE testA PARTITION(create_time='2015-07-11') select id, name, area from testB where id = 1;
- ...(省略)
- OK
- Time taken: 14.744 seconds
- hive> INSERT INTO TABLE testA PARTITION(create_time) select id, name, area, code from testB where id = 2;
- <pre name="code" class="java">...(省略)
说明:
1,将testB中id=1的行,导入到testA,分区为2015-07-11
2,将testB中id=2的行,导入到testA,分区create_time为id=2行的code值。
(3)HDFS文件导入到Hive表
将sourceA.txt和sourceB.txt传到HDFS中,路径分别是/home/hadoop/sourceA.txt和/home/hadoop/sourceB.txt中
- hive> LOAD DATA INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');
- ...(省略)
- OK
- Time taken: 0.237 seconds
- hive> LOAD DATA INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');
- <pre name="code" class="java">...(省略)
- OK
- Time taken: 0.212 seconds
- hive> select * from testA;
- OK
- 1 fish1 SZ 2015-07-08
- 2 fish2 SH 2015-07-08
- 3 fish3 HZ 2015-07-08
- 4 fish4 QD 2015-07-08
- 5 fish5 SR 2015-07-08
- Time taken: 0.029 seconds, Fetched: 5 row(s)
- hive> select * from testB;
- OK
- 1 zy1 SZ 1001 2015-07-09
- 2 zy2 SH 1002 2015-07-09
- 3 zy3 HZ 1003 2015-07-09
- 4 zy4 QD 1004 2015-07-09
- 5 zy5 SR 1005 2015-07-09
- Time taken: 0.047 seconds, Fetched: 5 row(s)
/home/hadoop/sourceA.txt'导入到testA表
/home/hadoop/sourceB.txt'导入到testB表
(4)创建表的过程中从其他表导入
- hive> create table testC as select name, code from testB;
- Total jobs = 3
- Launching Job 1 out of 3
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1449746265797_0106, Tracking URL = http://hadoopcluster79:8088/proxy/application_1449746265797_0106/
- Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1449746265797_0106
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2015-12-24 16:40:17,981 Stage-1 map = 0%, reduce = 0%
- 2015-12-24 16:40:23,115 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.11 sec
- MapReduce Total cumulative CPU time: 1 seconds 110 msec
- Ended Job = job_1449746265797_0106
- Stage-4 is selected by condition resolver.
- Stage-3 is filtered out by condition resolver.
- Stage-5 is filtered out by condition resolver.
- Moving data to: hdfs://hadoop2cluster/tmp/hive-root/hive_2015-12-24_16-40-09_983_6048680148773453194-1/-ext-10001
- Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/warehouse/testc
- Table default.testc stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 1.11 sec HDFS Read: 297 HDFS Write: 45 SUCCESS
- Total MapReduce CPU Time Spent: 1 seconds 110 msec
- OK
- Time taken: 14.292 seconds
- hive> desc testC;
- OK
- name string
- code string
- Time taken: 0.032 seconds, Fetched: 2 row(s)
二,Hive数据导出的几种方式
(1)导出到本地文件系统
- hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output' ROW FORMAT DELIMITED FIELDS TERMINATED by ',' select * from testA;
- Total jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1451024007879_0001, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0001/
- Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1451024007879_0001
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2015-12-25 17:04:30,447 Stage-1 map = 0%, reduce = 0%
- 2015-12-25 17:04:35,616 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
- MapReduce Total cumulative CPU time: 1 seconds 160 msec
- Ended Job = job_1451024007879_0001
- Copying data to local directory /home/hadoop/output
- Copying data to local directory /home/hadoop/output
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 1.16 sec HDFS Read: 305 HDFS Write: 110 SUCCESS
- Total MapReduce CPU Time Spent: 1 seconds 160 msec
- OK
- Time taken: 16.701 seconds
查看数据结果:
- [hadoop@hadoopcluster78 output]$ cat /home/hadoop/output/000000_0
- 1,fish1,SZ,2015-07-08
- 2,fish2,SH,2015-07-08
- 3,fish3,HZ,2015-07-08
- 4,fish4,QD,2015-07-08
- 5,fish5,SR,2015-07-08
(2)导出到HDFS
导入到HDFS和导入本地文件类似,去掉HQL语句的LOCAL就可以了
- hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output' select * from testA;
- Total jobs = 3
- Launching Job 1 out of 3
- Number of reduce tasks is set to 0 since there's no reduce operator
- Starting Job = job_1451024007879_0002, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0002/
- Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1451024007879_0002
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
- 2015-12-25 17:08:51,034 Stage-1 map = 0%, reduce = 0%
- 2015-12-25 17:08:59,313 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec
- MapReduce Total cumulative CPU time: 1 seconds 400 msec
- Ended Job = job_1451024007879_0002
- Stage-3 is selected by condition resolver.
- Stage-2 is filtered out by condition resolver.
- Stage-4 is filtered out by condition resolver.
- Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/hive-hadoop/hive_2015-12-25_17-08-43_733_1768532778392261937-1/-ext-10000
- Moving data to: /home/hadoop/output
- MapReduce Jobs Launched:
- Job 0: Map: 1 Cumulative CPU: 1.4 sec HDFS Read: 305 HDFS Write: 110 SUCCESS
- Total MapReduce CPU Time Spent: 1 seconds 400 msec
- OK
- Time taken: 16.667 seconds
查看hfds输出文件:
- [hadoop@hadoopcluster78 bin]$ ./hadoop fs -cat /home/hadoop/output/000000_0
- 1fish1SZ2015-07-08
- 2fish2SH2015-07-08
- 3fish3HZ2015-07-08
- 4fish4QD2015-07-08
- 5fish5SR2015-07-08
其他
采用hive的-e和-f参数来导出数据。
参数为: -e 的使用方式,后面接SQL语句。>>后面为输出文件路径
- [hadoop@hadoopcluster78 bin]$ ./hive -e "select * from testA" >> /home/hadoop/output/testA.txt
- 15/12/25 17:15:07 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
- Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties
- OK
- Time taken: 1.128 seconds, Fetched: 5 row(s)
- [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testA.txt
- 1 fish1 SZ 2015-07-08
- 2 fish2 SH 2015-07-08
- 3 fish3 HZ 2015-07-08
- 4 fish4 QD 2015-07-08
- 5 fish5 SR 2015-07-08
参数为: -f 的使用方式,后面接存放sql语句的文件。 >> 后面为输出文件路径
SQL语句文件:
- [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/sql.sql
- select * from testA
使用-f参数执行:
- [hadoop@hadoopcluster78 bin]$ ./hive -f /home/hadoop/output/sql.sql >> /home/hadoop/output/testB.txt
- 15/12/25 17:20:52 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
- Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties
- OK
- Time taken: 1.1 seconds, Fetched: 5 row(s)
参看结果:
- [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testB.txt
- 1 fish1 SZ 2015-07-08
- 2 fish2 SH 2015-07-08
- 3 fish3 HZ 2015-07-08
- 4 fish4 QD 2015-07-08
- 5 fish5 SR 2015-07-08