Compression and Storage

Helpless_pain

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/Helpless_pain/article/details/70228386

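====Compression=======================================================

Compression formats commonly used with Hive: bzip2, gzip, lzo, snappy ...

Note: Hive can only use a compression format that Hadoop itself supports, because Hive runs on top of Hadoop.
Command to check which native compression libraries Hadoop can load: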
$ bin/hadoop checknative
hadoop: false
zlib:    false
snappy: false
lz4:     false
bzip2:   false
openssl: false
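This shows that no native compression libraries are loaded in this CDH Hadoop build, so codecs that need native code (such as snappy) are not available.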
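When several nodes have to be checked, the checknative output can be read programmatically. The following is a minimal Python sketch, assuming the output has the `name: true|false` shape shown above; the function name `parse_checknative` is made up for this illustration and is not part of Hadoop.

```python
def parse_checknative(output: str) -> dict:
    """Map each library name in `hadoop checknative` output to True/False."""
    status = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        fields = rest.split()
        # Only keep lines whose first field after the colon is true/false;
        # this also skips log lines that merely contain a colon.
        if fields and fields[0] in ("true", "false"):
            status[name.strip()] = fields[0] == "true"
    return status


# The output shown above, copied as a string.
sample = """hadoop: false
zlib:    false
snappy: false
lz4:     false
bzip2:   false
openssl: false"""

status = parse_checknative(sample)
missing = [name for name, ok in status.items() if not ok]
print("native libraries not loaded:", missing)
```

On a node where the native libraries load correctly, the same function would report `True` for those entries, since a path printed after `true` is ignored.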

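How to make Hadoop support a given compression format:
   ** The usual route is to rebuild the Hadoop source package with native library support; prebuilt native libraries also work if they match the Hadoop version.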

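----Using Apache Hadoop----------------------

** Extract 2.5.0-native-snappy.tar.gz and replace the original files in lib/native with its contents

How do the compression formats differ?
   ** Compression ratio (size after compression relative to the original)
   ** Compression and decompression speed; in general, the higher the compression ratio, the slower the compression
   ** Whether the compressed file supports splitting (i.e. whether it can be cut into input splits)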
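The ratio/speed trade-off can be tried out locally before touching a cluster. The sketch below uses only codecs from the Python standard library (zlib, which implements the algorithm gzip uses, and bz2); snappy and lzo are not in the standard library and are left out. The input is synthetic log-like data invented for this example, so the numbers only illustrate the method and will differ on real data.

```python
import bz2
import time
import zlib

# Synthetic tab-separated log lines; real data will compress differently.
data = b"".join(
    b"%d\thttp://example.com/page/%d\tguid-%d\n" % (i, i % 50, i % 1000)
    for i in range(20000)
)


def measure(name, compress, decompress):
    """Return compression ratio and compression time for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # round trip must be lossless
    return {"codec": name, "ratio": len(packed) / len(data), "seconds": elapsed}


results = [
    measure("zlib", zlib.compress, zlib.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
]
for r in results:
    print(f"{r['codec']}: ratio={r['ratio']:.3f} time={r['seconds']:.4f}s")
```

Splittability cannot be measured this way; it is a property of the file format. In Hadoop, bzip2 files can be split, while plain gzip files cannot.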

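Where is compression used, or where is it needed? (see figure)
Map input (determined by the input file type), map output (i.e. the shuffle input), and reduce output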

I. Map output (i.e. the shuffle input)

1) MapReduce settings:
-- Set the properties in mapred-site.xml (permanent)
<property>
   <name>mapreduce.map.output.compress</name>
   <value>true</value>
</property>
<property>
   <name>mapreduce.map.output.compress.codec</name>
   <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
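To confirm what a site file actually contains, the properties can be read back with a few lines of Python. This is a sketch under assumptions: the two properties above are wrapped in the `<configuration>` root element that Hadoop site files use, the XML is given as a string rather than read from a real path, and `read_properties` is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# The two <property> elements from above inside a <configuration> root.
MAPRED_SITE = """<configuration>
<property>
   <name>mapreduce.map.output.compress</name>
   <value>true</value>
</property>
<property>
   <name>mapreduce.map.output.compress.codec</name>
   <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
</configuration>"""


def read_properties(xml_text: str) -> dict:
    """Collect name/value pairs from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


props = read_properties(MAPRED_SITE)
enabled = props.get("mapreduce.map.output.compress") == "true"
codec = props.get("mapreduce.map.output.compress.codec", "")
print("map output compression enabled:", enabled)
print("codec class:", codec.rsplit(".", 1)[-1])
```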

2) In code (temporary, applies only to that job)
conf.set("mapreduce.map.output.compress","true");

3) Hive settings:     -- recommended
-- Permanent: hive-site.xml
-- Temporary: use set in the CLI or in a script file
set mapreduce.map.output.compress = true;
set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.intermediate = true; -- effect is not noticeable

Test:
select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour;

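After the first job finishes, open History > Counters and look at the "Map-Reduce Framework" group:
Map output materialized bytes: 1617585 (1.6M)
Reduce shuffle bytes: 1617585
CPU time spent (ms): 3270   2920   6190   -- map, reduce, total

Then set the following two properties and run the select again: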
set mapreduce.map.output.compress = true;
set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
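Compare the results:
Map output materialized bytes: 876177 (0.87M)
Reduce shuffle bytes: 876177
CPU time spent (ms): 3350   2880   6230 -- CPU usage normally goes up, with rare exceptions

*** Extension: also try set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec; and compare the effect

Summary: compressing intermediate output (the shuffle stage)
      Enabling compression reduces the volume of network and disk I/O, at the cost of extra CPU and memory load
      In other words, it shifts pressure from network I/O onto CPU and memory
Question: when should compression be used?
Analyze where the job's bottleneck is: I/O throughput or CPU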
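The trade-off can be put in numbers using the counters from the two runs above. The short Python calculation below only restates those counter values; the dictionary keys are names chosen for this example.

```python
# Counter values copied from the two test runs above.
before = {"shuffle_bytes": 1617585, "cpu_ms": 6190}  # compression off
after = {"shuffle_bytes": 876177, "cpu_ms": 6230}    # Snappy map output

shuffle_saved = 1 - after["shuffle_bytes"] / before["shuffle_bytes"]
cpu_added = after["cpu_ms"] / before["cpu_ms"] - 1

print(f"shuffle bytes saved: {shuffle_saved:.1%}")  # 45.8%
print(f"extra CPU time: {cpu_added:.1%}")           # 0.6%
```

For this query the shuffle volume drops by almost half while total CPU time rises by well under one percent, which is the kind of comparison that indicates whether a job is I/O-bound enough to benefit.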

II. Reduce output
   ** Saves HDFS space

1) MapReduce settings:
mapreduce.output.fileoutputformat.compress
mapreduce.output.fileoutputformat.compress.codec

2) Hive settings:
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

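Test:
** Change the X in empX for each run
create table emp1 row format delimited fields terminated by '\t' as select * from emp;
Then compare the type and size of the output files

Questions:
1. In how many places can compression be configured in Hive/MapReduce?
2. What are the benefits and drawbacks of compression?
3. How are the compression parameters set for each stage?
4. How do you decide whether to compress?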

====Hive file formats============================================================

file_format:
: SEQUENCEFILE
| TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
| RCFILE      -- (Note: Available in Hive 0.6.0 and later)
| ORC         -- (Note: Available in Hive 0.11.0 and later)
| PARQUET     -- (Note: Available in Hive 0.13.0 and later)
| AVRO        -- (Note: Available in Hive 0.14.0 and later)

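TEXTFILE:     plain text, row-oriented storage; simple, widely used, the default format
SEQUENCEFILE: binary file, row-oriented storage; a file can be divided into multiple blocks
ORC:          rows are grouped into blocks and each block is stored column by column; supported by Hive/Shark/Spark
PARQUET:      columnar storage format; supported by Hive, Spark, Drill, Impala, Pig, etc.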
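The "group rows into blocks, then store each block by column" idea can be shown with a toy in-memory model. This Python sketch is not the ORC on-disk format; the sample rows and the helper names `to_stripes` and `read_column` are invented for the illustration ("stripe" is the name ORC uses for such a block of rows).

```python
# Hypothetical sample rows: (id, url, provinceId).
rows = [
    ("1", "http://a.example/x", "sh"),
    ("2", "http://a.example/y", "bj"),
    ("3", "http://b.example/x", "sh"),
    ("4", "http://b.example/y", "gz"),
]
COLUMNS = ("id", "url", "provinceId")


def to_stripes(rows, rows_per_stripe):
    """Group rows into stripes, then store each stripe column by column."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        stripes.append(
            {name: [r[i] for r in chunk] for i, name in enumerate(COLUMNS)}
        )
    return stripes


def read_column(stripes, name):
    """Read one column without touching the other columns' values."""
    return [value for stripe in stripes for value in stripe[name]]


stripes = to_stripes(rows, rows_per_stripe=2)
print(stripes[0])
print(read_column(stripes, "provinceId"))
```

A query that needs only `provinceId` reads one list per stripe, whereas a row-oriented text file would have to be scanned in full. Values of the same column also sit next to each other, which tends to help compression.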

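How do these file formats differ?
   ** The files look different on HDFS
   ** Depending on the format chosen, the resulting file type, size, and query efficiency differ

Usage: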
create table track_log(
column1 type,
column2 type
....
)
row format ... "\t"
stored as textfile;   -- declares the format of the files stored on HDFS

-------------------------------------

-- Create table track_log1 in textfile format (the default)
create table if not exists track_log1(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trakerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition string
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Load data into track_log1
load data local inpath "/home/tom/2015082818" into table track_log1;

----------------------------------

-- Create a table using orc + snappy
create table if not exists track_orc_snappy(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trakerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition string
)
row format delimited fields terminated by '\t'
stored as orc TBLPROPERTIES("orc.compress"="SNAPPY");

-- Data can only be put into track_orc_snappy with insert (load data just moves the text file and does not convert it to ORC)
insert overwrite table track_orc_snappy select * from track_log1;

-------------------------------

-- Create a table using PARQUET + snappy
hive (mydb)> set parquet.compression=SNAPPY;
create table if not exists track_parquet_snappy(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trakerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition string
)
row format delimited fields terminated by '\t'
stored as parquet;

insert overwrite table track_parquet_snappy select * from track_log1;

-----------------------------------

Comparison:
a) File size: (very different)
hive (mydb)> dfs -du -s -h /user/hive/warehouse/mydb.db/*;   -- same as the du -sh command
37.6 M /user/hive/warehouse/mydb.db/track_log
6.5 M /user/hive/warehouse/mydb.db/track_orc_snappy
9.2 M /user/hive/warehouse/mydb.db/track_parquet_snappy

b) Query efficiency: (about the same on this data set)
> select provinceid,count(url),count(distinct guid) from track_log1 group by provinceid;
Time taken: 32.925 seconds, Fetched: 35 row(s)

> select provinceid,count(url),count(distinct guid) from track_orc_snappy group by provinceid;
Time taken: 34.396 seconds, Fetched: 35 row(s)

> select provinceid,count(url),count(distinct guid) from track_parquet_snappy group by provinceid;
Time taken: 35.651 seconds, Fetched: 35 row(s)
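The two comparisons can be expressed as ratios. The Python snippet below only restates the sizes and timings listed above, with table names kept as they appear in each listing; a single run per table is a small sample, so the timing differences should be read with caution.

```python
# Sizes (MB) from the dfs -du listing and times (s) from the three queries.
sizes_mb = {"track_log": 37.6, "track_orc_snappy": 6.5, "track_parquet_snappy": 9.2}
seconds = {"track_log1": 32.925, "track_orc_snappy": 34.396, "track_parquet_snappy": 35.651}

text_size = sizes_mb["track_log"]
size_ratio = {table: size / text_size for table, size in sizes_mb.items()}
for table, ratio in size_ratio.items():
    print(f"{table}: {ratio:.1%} of the textfile size")

fastest = min(seconds.values())
slowdown = {table: t / fastest - 1 for table, t in seconds.items()}
for table, extra in slowdown.items():
    print(f"{table}: {extra:+.1%} vs the fastest query")
```

On this data the ORC table is roughly a sixth of the text size and the Parquet table roughly a quarter, while the query times stay within about ten percent of each other.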

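Common choices:
   orc+snappy
   parquet+snappy

When would we use a file format other than the default?
   ** To save HDFS space
   ** Temporary tables: when saving the result of a select into another temporary table, a compressed file format (orc+snappy) can be used
   ** When extracting part of a large table, for example splitting a 5-column table out of a 36-column source table
       for a particular business need, the small table can use the compressed file format orc+snappy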

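=====Sorting======================================================================

order by
   Global sort over all data; commonly used. Hive does the final sort in a single reducer, so it can be slow on large inputs
sort by
   Sorts within each reducer; the result is not globally sorted
distribute by
   Similar to partitioning in MapReduce: rows are distributed across reducers by the given column, and there are as many final files as reducers; usually combined with sort by
cluster by
   Used when the distribute by and sort by columns are the same; used less often

** order by
** Export data to a local directory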
insert overwrite local directory "/home/tom/order" row format delimited fields terminated by "\t"
select * from emp order by empno;

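** distribute by and sort by
hive (mydb)> select count(distinct deptno) from emp; -- first find the number of departments
hive (mydb)> set mapreduce.job.reduces=3;   -- set this to the number of departments; with fewer reducers several departments end up in one file (rows are assigned by hashing deptno, so an even spread is still not guaranteed)
** There are as many final files as reducers
-- Distribute by deptno and sort by empno; sort by must come after distribute by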
insert overwrite local directory "/home/tom/sort" row format delimited fields terminated by "\t"
select * from emp distribute by deptno sort by empno;
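What distribute by and sort by do together can be simulated in a few lines of Python. This is a sketch under stated assumptions: the `(empno, deptno)` rows are hypothetical sample data, `distribute_sort` is a helper invented here, and a plain modulo on the integer key stands in for Hive's hash partitioning.

```python
from collections import defaultdict

# (empno, deptno) pairs; hypothetical sample rows, not the real emp table.
emp = [(7369, 20), (7499, 30), (7782, 10), (7566, 20), (7839, 10), (7521, 30)]


def distribute_sort(rows, num_reducers, dist_key, sort_key):
    """Assign each row to a reducer by key, then sort inside each reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: modulo on the integer key models hash partitioning.
        reducers[dist_key(row) % num_reducers].append(row)
    return {r: sorted(part, key=sort_key) for r, part in sorted(reducers.items())}


# Models: select * from emp distribute by deptno sort by empno (3 reducers)
files = distribute_sort(emp, 3, dist_key=lambda r: r[1], sort_key=lambda r: r[0])
for reducer, part in files.items():
    print(reducer, part)

# With 2 reducers every deptno here is even, so one reducer gets all rows.
skewed = distribute_sort(emp, 2, dist_key=lambda r: r[1], sort_key=lambda r: r[0])
print(skewed)
```

With three reducers each department lands in its own output file, sorted by empno inside the file but not across files. The two-reducer run shows why the reducer count alone does not guarantee an even spread.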

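** cluster by
** cluster by both distributes rows by the given column and sorts by that same column
** The example below puts records with the same empno into the same file and sorts them
hive (mydb)> set mapreduce.job.reduces=3;   -- still 3 for this test
-- By employee number: the system assigns rows to reducers, producing 3 files, each sorted by empno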
insert overwrite local directory "/home/tom/cluster" row format delimited fields terminated by "\t"
select * from emp cluster by empno;
-- By department number: rows are split into 3 files by deptno
insert overwrite local directory "/home/tom/cluster" row format delimited fields terminated by "\t"
select * from emp cluster by deptno;

PS:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>    -- default is 1 GB before Hive 0.14 and 256 MB from 0.14 on; input beyond this size adds another reducer
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>                  -- upper limit on the number of reducers
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>                   -- fixed number of reducers
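How the three settings interact can be approximated with a small function. This is a rough Python model of the behaviour described above, not Hive's actual code; `estimate_reducers` and its parameter names are hypothetical, and the input sizes are example values.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers, fixed_reducers=-1):
    """Approximate the reducer count from the three settings above."""
    if fixed_reducers > 0:
        # mapreduce.job.reduces was set explicitly: use it as is.
        return fixed_reducers
    # One reducer per bytes_per_reducer of input, rounded up...
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # ...capped by hive.exec.reducers.max, and never less than one.
    return max(1, min(max_reducers, wanted))


MB = 1000 * 1000
print(estimate_reducers(600 * MB, 256 * MB, 1009))                    # 3
print(estimate_reducers(600 * MB, 256 * MB, 2))                       # 2
print(estimate_reducers(600 * MB, 256 * MB, 1009, fixed_reducers=5))  # 5
```

The first call shows the size-based estimate, the second shows the cap taking effect, and the third shows an explicit reducer count overriding both.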