Dynamic Partitioned Tables
1 A table containing the dynamic-partition fields
- Prepare the data
5 男 16 北京 13754554587 2015-03-24
6 女 17 北京 13872374170 2017-02-20
2 女 16 北京 17938844884 2015-05-26
23 男 15 北京 13700000033 2016-07-25
4 男 17 北京 15257575773 2017-08-11
3 女 15 北京 15885888889 2018-05-03
10 男 16 北京 14098593989 2018-04-06
15 女 14 北京 14938983000 2019-06-24
[yao@master ~]$ cd data/
[yao@master data]$ vim stu_message.txt
- Create the table
hive> create table stu_message(
> stu_id int,
> sex string,
> age int,
> address string,
> tel_num string,
> ts date)
> row format delimited fields terminated by '\t'
> stored as textfile
> location '/yr/stu_message';
OK
Time taken: 0.12 seconds
- Load the data
hive> load data local inpath '/home/yao/data/stu_message.txt' into table stu_message;
Loading data to table yr.stu_message
Table yr.stu_message stats: [numFiles=0, totalSize=0]
OK
Time taken: 0.932 seconds
hive> select * from stu_message;
OK
5 男 16 北京 13754554587 2015-03-24
6 女 17 北京 13872374170 2017-02-20
2 女 16 北京 17938844884 2015-05-26
23 男 15 北京 13700000033 2016-07-25
4 男 17 北京 15257575773 2017-08-11
3 女 15 北京 15885888889 2018-05-03
10 男 16 北京 14098593989 2018-04-06
15 女 14 北京 14938983000 2019-06-24
Time taken: 0.078 seconds, Fetched: 8 row(s)
- Create a table containing all fields (year and month added)
hive> create table stu_mess(
> stu_id int,
> sex string,
> age int,
> address string,
> tel_num string,
> ts date,
> y int,
> m int)
> row format delimited fields terminated by '\t'
> stored as textfile;
OK
Time taken: 0.111 seconds
- Load the data
hive> insert into table stu_mess select stu_id,sex,age,address,tel_num,ts,year(ts) as y,month(ts) as m from stu_message; -- the year and month are extracted with the year() and month() functions
Query ID = yao_20190321150133_0ab47c53-1f4e-4edb-bc5e-6073cf526ed4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1553134178907_0006, Tracking URL = http://master:18088/proxy/application_1553134178907_0006/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job -kill job_1553134178907_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-03-21 15:01:50,482 Stage-1 map = 0%, reduce = 0%
2019-03-21 15:02:10,344 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1553134178907_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/yr.db/stu_mess/.hive-staging_hive_2019-03-21_15-01-33_891_108911504095429350-1/-ext-10000
Loading data to table yr.stu_mess
Table yr.stu_mess stats: [numFiles=1, numRows=8, totalSize=371, rawDataSize=363]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.3 sec HDFS Read: 4519 HDFS Write: 439 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 300 msec
OK
Time taken: 40.172 seconds
hive> select * from stu_mess;
OK
5 男 16 北京 13754554587 2015-03-24 2015 3
6 女 17 北京 13872374170 2017-02-20 2017 2
2 女 16 北京 17938844884 2015-05-26 2015 5
23 男 15 北京 13700000033 2016-07-25 2016 7
4 男 17 北京 15257575773 2017-08-11 2017 8
3 女 15 北京 15885888889 2018-05-03 2018 5
10 男 16 北京 14098593989 2018-04-06 2018 4
15 女 14 北京 14938983000 2019-06-24 2019 6
Time taken: 0.084 seconds, Fetched: 8 row(s)
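The y and m columns Hive computed with year(ts) and month(ts) can be re-derived with a small Python sketch (a toy check, not Hive's implementation):

```python
from datetime import date

def derive_partition(ts):
    """Mimic Hive's year(ts) and month(ts) for a yyyy-MM-dd string."""
    d = date.fromisoformat(ts)
    return d.year, d.month

# A few rows from stu_message: (stu_id, ts)
for stu_id, ts in [(5, "2015-03-24"), (6, "2017-02-20"), (15, "2019-06-24")]:
    y, m = derive_partition(ts)
    print(stu_id, y, m)   # matches the y and m columns in the query output
```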
2 Disable strict partition mode
- Partition mode: must be set to nonstrict to allow all partition columns to be dynamic (the default is strict)
hive> set hive.exec.dynamic.partition.mode=nonstrict;
- Enable dynamic partitioning (default true)
hive> set hive.exec.dynamic.partition=true;
- Maximum number of dynamic partitions (default 1000)
set hive.exec.dynamic.partitions=1000;
This only needs raising when the data volume is large; the data in this experiment is tiny, so the default was left alone.
3 Create the partitioned table
- Create the table
hive> create table stu_mess_part(
> stu_id int,
> sex string,
> age int,
> address string,
> tel_num string,
> ts date)
> partitioned by(y int,m int)
> row format delimited fields terminated by '\t'
> stored as textfile;
OK
Time taken: 0.131 seconds
- Load the data
hive> insert into table stu_mess_part partition(y,m) select * from stu_mess;
Query ID = yao_20190321151105_548604f0-746b-425b-8c07-6d78426de226
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1553134178907_0007, Tracking URL = http://master:18088/proxy/application_1553134178907_0007/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job -kill job_1553134178907_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-03-21 15:11:21,325 Stage-1 map = 0%, reduce = 0%
2019-03-21 15:11:33,308 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.71 sec
MapReduce Total cumulative CPU time: 3 seconds 710 msec
Ended Job = job_1553134178907_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/yr.db/stu_mess_part/.hive-staging_hive_2019-03-21_15-11-05_342_1048170690574644764-1/-ext-10000
Loading data to table yr.stu_mess_part partition (y=null, m=null)
Time taken for load dynamic partitions : 1957
Loading partition {y=2015, m=5}
Loading partition {y=2019, m=6}
Loading partition {y=2015, m=3}
Loading partition {y=2016, m=7}
Loading partition {y=2017, m=8}
Loading partition {y=2018, m=5}
Loading partition {y=2017, m=2}
Loading partition {y=2018, m=4}
Time taken for adding to write entity : 3
Partition yr.stu_mess_part{y=2015, m=3} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2015, m=5} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2016, m=7} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part{y=2017, m=2} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2017, m=8} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2018, m=4} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part{y=2018, m=5} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2019, m=6} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.71 sec HDFS Read: 4815 HDFS Write: 720 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 710 msec
OK
Time taken: 33.911 seconds
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part;
Found 5 items
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2015
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2016
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2018
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2019
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part;
Found 5 items
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2015
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2016
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2018
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2019
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2016;
Found 1 items
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2016/m=7
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2017;
Found 2 items
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2
drwxr-xr-x - yao supergroup 0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=8
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2;
Found 1 items
-rwxr-xr-x 1 yao supergroup 39 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2/000000_0
hive> dfs -cat /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2/000000_0;
6 女 17 北京 13872374170 2017-02-20
hive> show partitions stu_mess_part;
OK
y=2015/m=3
y=2015/m=5
y=2016/m=7
y=2017/m=2
y=2017/m=8
y=2018/m=4
y=2018/m=5
y=2019/m=6
Time taken: 0.159 seconds, Fetched: 8 row(s)
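Each partition listed by show partitions materializes as a y=…/m=… subdirectory under the table path, as the dfs -ls output above shows. A minimal sketch of that mapping:

```python
def partition_path(table_dir, y, m):
    """Hive lays out one directory per (y, m) pair: <table>/y=<y>/m=<m>."""
    return "{}/y={}/m={}".format(table_dir, y, m)

base = "/user/hive/warehouse/yr.db/stu_mess_part"
for y, m in [(2015, 3), (2015, 5), (2016, 7), (2017, 2)]:
    print(partition_path(base, y, m))
```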
Time-based partitioning is the most common choice in practice (e.g. e-commerce data).
Bucketing
- Why do bucketed tables exist?
Partitioning is a convenient way to isolate data and optimize queries, but not all data can be partitioned sensibly; in particular it can be hard to choose partition boundaries of a suitable size (a poor partitioning scheme can leave some partitions overloaded while others hold almost nothing). Bucketing is another technique for decomposing a dataset into more manageable parts.
- What is a bucketed table?
Hive can further organize each table or partition into buckets; that is, buckets are a finer-grained division of the data range. Bucketing is organized on a chosen column.
- How are rows assigned to buckets?
Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record is stored in.
- What are the advantages?
1. More efficient query processing. Buckets impose extra structure on the table that Hive can exploit for some queries, notably JOINs. If two tables are both bucketed on the join column, Hive can join each bucket against its matching bucket instead of scanning everything, which greatly reduces the volume of data involved in the JOIN.
2. More efficient sampling. When working with large datasets, being able to trial a query on a small slice of the data is a great convenience while developing and tuning queries.
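The hash-mod rule above can be sketched in Python. For an int column Hive's hash is the value itself, so a plain modulo is a fair stand-in (a sketch of the idea, not Hive internals):

```python
def bucket_of(key, num_buckets):
    # Hive: hash(column value) modulo the number of buckets.
    return hash(key) % num_buckets

# With 3 buckets, consecutive sno values cycle through the buckets:
for sno in (95001, 95002, 95003):
    print(sno, "-> bucket", bucket_of(sno, 3))
```

Two tables bucketed on the same column into the same number of buckets can then be joined bucket-by-bucket, since matching keys are guaranteed to land in like-numbered buckets.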
1 Example
- Prepare the data
[yao@master data]$ vim student.txt
[yao@master data]$ cat student.txt
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA
2 A first, failed attempt at bucketing
- Create the table
hive> create table students(
> sno int,
> sname string,
> sex string,
> sage int,
> sdept string)
> clustered by(sno) into 3 buckets
> row format delimited fields terminated by ','
> stored as textfile;
OK
Time taken: 0.162 seconds
- Enable bucketing (enforced bucketing)
hive> set hive.enforce.bucketing=true;
- Load the data
hive> load data local inpath '/home/yao/data/student.txt' overwrite into table students;
Loading data to table yr.students
Table yr.students stats: [numFiles=1, numRows=0, totalSize=527, rawDataSize=0]
OK
Time taken: 0.752 seconds
Inspect the table directory
hive> dfs -ls /user/hive/warehouse/yr.db/students;
Found 1 items
-rwxr-xr-x 1 yao supergroup 527 2019-03-21 15:55 /user/hive/warehouse/yr.db/students/student.txt
Bucketing was enforced, yet it did not succeed.
Although enforced bucketing was set, the students table directory holds only the single student.txt file. Each bucket materializes as one output file, so three buckets should have produced three files. LOAD DATA merely copies the file into the table directory without running a MapReduce job, so no bucketing took place.
3 A second, successful bucketing attempt
- Create the table
hive> create table stud_buck(
> sno int,
> sname string,
> sex string,
> sage int,
> sdept string)
> clustered by(sno)
> sorted by(sno desc)
> into 3 buckets
> row format delimited fields terminated by ',';
OK
Time taken: 0.12 seconds
- Configure bucketing and the reducer count
hive> set hive.enforce.bucketing=true;
hive> set mapreduce.job.reduces=3;
Set the variables: enable enforced bucketing, and set the number of reducers to the number of buckets.
- Load the data
Insert data into the newly created bucketed table (the inserted data must already be distributed into buckets and sorted):
hive> insert into table stud_buck select * from students distribute by(sno) sort by (sno);
Query ID = yao_20190321160721_d26f0b8f-319f-4378-aec3-f6146b59ba08
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0008, Tracking URL = http://master:18088/proxy/application_1553134178907_008/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job -kill job_1553134178907_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-21 16:07:40,026 Stage-1 map = 0%, reduce = 0%
2019-03-21 16:07:53,630 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.22 sec
2019-03-21 16:08:09,486 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.31 sec
MapReduce Total cumulative CPU time: 6 seconds 310 msec
Ended Job = job_1553134178907_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0009, Tracking URL = http://master:18088/proxy/application_1553134178907_009/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job -kill job_1553134178907_0009
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 3
2019-03-21 16:08:29,636 Stage-2 map = 0%, reduce = 0%
2019-03-21 16:08:36,029 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
2019-03-21 16:09:11,172 Stage-2 map = 100%, reduce = 44%, Cumulative CPU 4.37 sec
2019-03-21 16:09:26,828 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 13.03 sec
2019-03-21 16:09:27,866 Stage-2 map = 100%, reduce = 89%, Cumulative CPU 13.17 sec
2019-03-21 16:09:29,991 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 15.41 sec
MapReduce Total cumulative CPU time: 15 seconds 410 msec
Ended Job = job_1553134178907_0009
Loading data to table yr.stud_buck
Table yr.stud_buck stats: [numFiles=3, numRows=22, totalSize=527, rawDataSize=505]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.31 sec HDFS Read: 6819 HDFS Write: 909 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 3 Cumulative CPU: 15.41 sec HDFS Read: 12694 HDFS Write: 734 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 720 msec
OK
Time taken: 130.884 seconds
Inspect the output files
hive> dfs -ls /user/hive/warehouse/yr.db/stud_buck;
Found 3 items
-rwxr-xr-x 1 yao supergroup 193 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000000_0
-rwxr-xr-x 1 yao supergroup 170 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000001_0
-rwxr-xr-x 1 yao supergroup 164 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000002_0
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000000_0;
95001,李勇,男,20,CS
95004,张立,男,19,IS
95007,易思玲,女,19,MA
95010,孔小涛,男,19,CS
95013,冯伟,男,21,CS
95016,钱国,男,21,MA
95019,邢小丽,女,19,IS
95022,郑明,男,20,MA
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000000_1;
cat: `/user/hive/warehouse/yr.db/stud_buck/000000_1': No such file or directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000001_0;
95002,刘晨,女,19,IS
95005,刘刚,男,18,MA
95008,李娜,女,18,CS
95011,包小柏,男,18,MA
95014,王小丽,女,19,CS
95017,王风娟,女,18,IS
95020,赵钱,男,21,IS
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000002_0;
95003,王敏,女,22,MA
95006,孙庆,男,23,CS
95009,梦圆圆,女,18,MA
95012,孙花,女,20,CS
95015,王君,男,18,MA
95018,王一,女,19,IS
95021,周二,男,17,MA
# You can use distribute by(sno) sort by(sno asc), or cluster by(column) when the bucketing and sorting columns are the same.
# Note that cluster by is equivalent to bucketing (distribute) plus sorting (sort).
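The effect of distribute by(sno) sort by(sno) with 3 reducers can be simulated in Python: route each row to a reducer by hash modulo 3, then sort within each reducer. Under that assumption, sno values congruent modulo 3 end up together, which matches the three bucket files shown above (a sketch, not Hive internals):

```python
from collections import defaultdict

snos = [95003, 95001, 95002, 95006, 95004, 95005, 95007]  # deliberately unordered

reducers = defaultdict(list)
for sno in snos:
    reducers[hash(sno) % 3].append(sno)   # distribute by(sno)
for r in reducers:
    reducers[r].sort()                    # sort by(sno) within each reducer

for r in sorted(reducers):
    print("reducer %d output:" % r, reducers[r])
```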
4 Key points
SORTED BY additionally sorts the rows within each bucket by one or more columns.
distribute by <column> controls how map output is distributed: rows with the same value in that column are sent to the same reduce task.
CLUSTER BY is equivalent to combining distribute by and sort by on the same column, and the sort can only be ascending. In the table DDL, CLUSTERED BY specifies the bucketing column and the number of buckets; Hive takes the hash of the key modulo the bucket count, which spreads the data evenly and pseudo-randomly across the buckets.
5 Summary
A bucket is essentially the MapReduce partition concept: physically, each bucket is one file in the table directory, and the number of buckets (output files) a job produces equals its number of reduce tasks.
A partitioned table is a different concept. A partition is a subdirectory of the table; each subdirectory can hold any number of data files, and the directory lets a query narrow down to the files inside it, but the directory name itself says nothing about the file contents.
A bucket, by contrast, is determined by a value in the data: a large file is hash-split into many small files, and each small file can be sorted on its own. If another table is split into small files by the same rule, a join between the two tables no longer needs to scan both tables in full;
it only needs to match up the corresponding buckets, which greatly improves efficiency.
Likewise, sampling does not have to scan the whole file: it is enough to extract a slice from each bucket by the same rule.
6 Querying bucketed tables
- Bucket-based sampling
hive> select * from stu_mess_part_buck tablesample(bucket 1 out of 2 on sex); -- reads 2 of the 4 buckets: buckets 1 and 3
OK
5 男 16 北京 13754554587 2015-03-24 2015 3
23 男 15 北京 13700000033 2016-07-25 2016 7
4 男 17 北京 15257575773 2017-08-11 2017 8
10 男 16 北京 14098593989 2018-04-06 2018 4
Time taken: 1.621 seconds, Fetched: 4 row(s)
tablesample is the sampling clause; syntax: TABLESAMPLE(BUCKET x OUT OF y)
y must be a multiple or a factor of the table's total bucket count; Hive decides the sampling fraction from y.
For example, if the table is split into 64 buckets: when y=32, (64/32=) 2 buckets of data are sampled;
when y=128, (64/128=) 1/2 of one bucket of data is sampled.
x indicates which bucket to start sampling from.
For example, with 32 buckets in total, tablesample(bucket 3 out of 16)
samples (32/16=) 2 buckets of data in all,
namely the 3rd bucket and the (3+16=) 19th bucket.
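The rule above (start at bucket x, then take every y-th bucket) can be written down as a small helper for the case where y is a factor of the total bucket count; the helper name is ours, for illustration only:

```python
def sampled_buckets(x, y, total):
    """Buckets read by TABLESAMPLE(BUCKET x OUT OF y) when y divides total."""
    assert total % y == 0, "y must be a factor of the total bucket count here"
    return list(range(x, total + 1, y))

print(sampled_buckets(3, 16, 32))  # the 3rd and (3+16=)19th buckets
print(sampled_buckets(1, 2, 4))    # buckets 1 and 3
```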
- Percentage-based sampling
Hive also supports sampling by percentage. This form is block-based: it samples the given percentage of the data blocks under the input path, so the smallest sampling unit is one HDFS block, and if the table is smaller than a normal 128 MB block, all rows are returned.
Percentage-based sampling exposes a property for controlling the seed used for block-level sampling:
<property>
<name>hive.sample.seednumber</name>
<value>0</value>
</property>
7 Some drawbacks of bucketing
Getting data files into a bucketed table carries an extra MapReduce burden (a plain LOAD does not bucket the data; an INSERT ... SELECT job is needed).
In real production, bucketing is used far less often than partitioning.
Partitioning and bucketing combined
- Create the table
hive> create table stu_mess_part_buck(
> stu_id int,
> sex string,
> age int,
> address string,
> tel_num string,
> ts date)
> partitioned by (y int,m int)
> clustered by(stu_id) into 4 buckets
> row format delimited fields terminated by "\t"
> stored as textfile
> location '/yr/stu_mess_part_buck';
OK
Time taken: 0.095 seconds
Load the data
hive> insert into table stu_mess_part_buck partition(y,m) select * from stu_mess;
Query ID = yao_20190321162521_c78329dd-e687-4839-81fc-26d706f8631e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0010, Tracking URL = http://master:18088/proxy/application_1553134178907_010/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job -kill job_1553134178907_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-03-21 16:25:38,087 Stage-1 map = 0%, reduce = 0%
2019-03-21 16:25:45,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
2019-03-21 16:26:03,068 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 4.31 sec
2019-03-21 16:26:10,634 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 8.12 sec
2019-03-21 16:26:13,881 Stage-1 map = 100%, reduce = 42%, Cumulative CPU 9.05 sec
2019-03-21 16:26:16,029 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 12.12 sec
2019-03-21 16:26:17,080 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 18.26 sec
MapReduce Total cumulative CPU time: 18 seconds 260 msec
Ended Job = job_1553134178907_0010
Loading data to table yr.stu_mess_part_buck partition (y=null, m=null)
Time taken for load dynamic partitions : 1278
Loading partition {y=2018, m=5}
Loading partition {y=2016, m=7}
Loading partition {y=2017, m=2}
Loading partition {y=2019, m=6}
Loading partition {y=2015, m=3}
Loading partition {y=2018, m=4}
Loading partition {y=2015, m=5}
Loading partition {y=2017, m=8}
Time taken for adding to write entity : 4
Partition yr.stu_mess_part_buck{y=2015, m=3} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2015, m=5} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2016, m=7} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part_buck{y=2017, m=2} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2017, m=8} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2018, m=4} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part_buck{y=2018, m=5} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2019, m=6} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 18.26 sec HDFS Read: 21761 HDFS Write: 871 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 260 msec
OK
Time taken: 60.378 seconds
Query the result
hive> show partitions stu_mess_part_buck;
OK
y=2015/m=3
y=2015/m=5
y=2016/m=7
y=2017/m=2
y=2017/m=8
y=2018/m=4
y=2018/m=5
y=2019/m=6
Time taken: 0.109 seconds, Fetched: 8 row(s)
hive> dfs -ls /yr/stu_mess_part_buck;
Found 5 items
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2016
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2017
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2018
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2019
hive> dfs -ls /yr/stu_mess_part_buck/y=2015;
Found 2 items
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3
drwxr-xr-x - yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=5
hive> dfs -ls /yr/stu_mess_part_buck/y=2015/m=3;
Found 4 items
-rwxr-xr-x 1 yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000000_0
-rwxr-xr-x 1 yao supergroup 39 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000001_0
-rwxr-xr-x 1 yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000002_0
-rwxr-xr-x 1 yao supergroup 0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000003_0
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000000_0;
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000001_0;
5 男 16 北京 13754554587 2015-03-24
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000002_0;
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000003_0;
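Why is 000001_0 the only non-empty file in y=2015/m=3? That partition holds a single row with stu_id = 5, and with 4 buckets the row is written to file index hash(5) mod 4 = 1. A sketch of that placement (assuming Hive's int hash is the value itself):

```python
def bucket_file(stu_id, num_buckets):
    idx = hash(stu_id) % num_buckets   # Hive-style hash-mod bucket placement
    return "%06d_0" % idx              # Hive names bucket files 000000_0, 000001_0, ...

print(bucket_file(5, 4))   # the non-empty file listed above
```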
