Compression
Reduces disk storage pressure and disk I/O load
Reduces network I/O load
1) First, make sure this Hadoop build supports compression. Check which native codecs are available:
$ bin/hadoop checknative
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
openssl: false
Snappy sits in the middle: its compression ratio is lower than zlib's or bzip2's, but it compresses and decompresses much faster, which makes it a good fit for intermediate MapReduce data.
2) Build Hadoop from source: mvn package -Pdist,native,docs -DskipTests -Dtar -Drequire.snappy
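The native profile only builds if the Snappy development headers and Hadoop's other native prerequisites are present on the build machine; a minimal sketch, assuming CentOS/RHEL package names:

$ sudo yum install -y snappy snappy-devel                              # headers needed by -Drequire.snappy
$ sudo yum install -y gcc gcc-c++ make cmake zlib-devel openssl-devel  # native build toolchain
$ protoc --version                                                     # Hadoop 2.x expects protobuf 2.5.0 on the PATH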
3) Replace $HADOOP_HOME/lib/native: upload the prebuilt native-library tarball straight into $HADOOP_HOME and unpack it
$ tar -zxf cdh5.3.6-snappy-lib-natirve.tar.gz
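The layout inside the tarball is not shown here; assuming it unpacks to a native/ directory, copying it over the stub libraries would look like:

$ cp -r native/* $HADOOP_HOME/lib/native/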
Check again: $ bin/hadoop checknative
Native library checking:
hadoop: true /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
Start HDFS, YARN and the JobHistory server.
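On a single-node setup the usual start commands are (paths relative to $HADOOP_HOME):

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ sbin/mr-jobhistory-daemon.sh start historyserver

Then submit a job: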
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 1 2
Once it finishes, check the job's configuration in the web UI:
mapreduce.map.output.compress          false
mapreduce.map.output.compress.codec    org.apache.hadoop.io.compress.DefaultCodec
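These values come from the job's job.xml, reachable from the JobHistory server's web UI (default port 19888; the hostname below is a placeholder):

http://<historyserver-host>:19888/jobhistory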
4) Configure mapred-site.xml, adding the following below the existing properties:
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
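On a multi-node cluster the edited file must reach every node before the restart; a sketch, with hypothetical hostnames:

$ scp etc/hadoop/mapred-site.xml hadoop102:/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop/
$ scp etc/hadoop/mapred-site.xml hadoop103:/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop/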
Stop all the daemons, restart HDFS, YARN and the JobHistory server, then submit another job:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 3 5
Once it finishes, re-check the job's configuration in the web UI:
mapreduce.map.output.compress          true   (job.xml now picks the value up from mapred-site.xml)
mapreduce.map.output.compress.codec    org.apache.hadoop.io.compress.SnappyCodec
Enabling Snappy compression + columnar storage (ORC / Parquet) in Hive
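Both methods below read from a plain-text staging table named file_text, which these notes never define; a minimal sketch of what it presumably looks like (schema assumed to mirror the target tables, sample data path hypothetical):

create table if not exists file_text(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as textfile;
load data local inpath '/opt/datas/page_views.data' into table file_text;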
Method 1: enable compression of the query's output files, using the legacy mapred.* property names (for shuffle-stage compression you would instead set hive.exec.compress.intermediate and mapreduce.map.output.compress)
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table if not exists file_orc_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as ORC
tblproperties("orc.compression"="Snappy");
insert into table file_orc_snappy select * from file_text;
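To confirm the property took effect and to inspect the files it produced (default warehouse location assumed):

hive> describe formatted file_orc_snappy;
$ bin/hdfs dfs -du -h /user/hive/warehouse/file_orc_snappy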
Method 2: compress the result files written by the reducers, using the current mapreduce.* property names
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table if not exists file_parquet_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="Snappy");
insert into table file_parquet_snappy select * from file_text;       -- appends rows
insert overwrite table file_parquet_snappy select * from file_text;  -- replaces the table's contents
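A quick sanity check is to compare the on-disk sizes of the plain-text table and the compressed columnar tables (default warehouse location assumed):

$ bin/hdfs dfs -du -h /user/hive/warehouse/file_text
$ bin/hdfs dfs -du -h /user/hive/warehouse/file_orc_snappy
$ bin/hdfs dfs -du -h /user/hive/warehouse/file_parquet_snappy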