This article is adapted from https://blog.youkuaiyun.com/weixin_40420525/article/details/84869883; I worked through it in practice and summarize the problems I ran into.
1. Prerequisites:
1. Java and Maven
2. Install the prerequisite libraries (if you have compiled Hadoop before, these should already be in place)
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
2. Install lzo
[hadoop@hadoop software]$ wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
[hadoop@hadoop app]$ tar -zxvf lzo-2.06.tar.gz -C ../app
[hadoop@hadoop app]$ cd lzo-2.06/
[hadoop@hadoop lzo-2.06]$ export CFLAGS=-m64
# Create a directory to hold the compiled lzo
[hadoop@hadoop lzo-2.06]$ mkdir lzo
# Point the install prefix at that directory
[hadoop@hadoop lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/lzo/
# Compile and install
[hadoop@hadoop lzo-2.06]$ make && make install
# Verify the install succeeded; output like the following is what you want
[hadoop@hadoop lzo-2.06]$ cd lzo/
[hadoop@hadoop lzo]$ ll
total 12
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 include
drwxrwxr-x 2 hadoop hadoop 4096 Dec 6 17:08 lib
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 share
3. Install hadoop-lzo
3.1 Download and extract
[hadoop@hadoop software]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip
# Extract; -d specifies the target directory
[hadoop@hadoop software]$ unzip master.zip -d ../app
# If unzip is not found, install it with yum (requires root)
[root@hadoop ~]# yum -y install unzip
3.2 Edit the pom.xml file in the extracted directory
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!-- Change this to your Hadoop version; mine is hadoop2.6.0-cdh5.7.0 -->
<hadoop.current.version>2.6.0</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
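If you are unsure which version string to use, it can be read from the first line of `hadoop version`. A small sketch; the echoed sample line stands in for the real command output on your cluster:

```shell
# Pull the version string out of a `hadoop version` first line.
# The echoed text is a sample; on a real cluster use `hadoop version | head -1`.
echo "Hadoop 2.6.0-cdh5.7.0" | sed 's/^Hadoop //'
# -> 2.6.0-cdh5.7.0
```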
3.3 Set the build environment variables
[hadoop@hadoop app]$ cd hadoop-lzo-master/
[hadoop@hadoop hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop hadoop-lzo-master]$ export CXXFLAGS=-m64
[hadoop@hadoop hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/lzo/include/ # include directory of the lzo built above
[hadoop@hadoop hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib/ # lib directory of the lzo built above
3.4 Build
[root@hadoop hadoop-lzo-master]# mvn clean package -Dmaven.test.skip=true
If this step fails, try switching to root; I ran into that problem myself.
BUILD SUCCESS in the output means it worked.
3.5 Install the build artifacts
# Inspect the build output
[hadoop@hadoop hadoop-lzo-master]$ ll
total 80
-rw-rw-r-- 1 hadoop hadoop 35147 Oct 13 2017 COPYING
-rw-rw-r-- 1 hadoop hadoop 19753 Dec 6 17:18 pom.xml
-rw-rw-r-- 1 hadoop hadoop 10170 Oct 13 2017 README.md
drwxrwxr-x 2 hadoop hadoop 4096 Oct 13 2017 scripts
drwxrwxr-x 4 hadoop hadoop 4096 Oct 13 2017 src
drwxrwxr-x 10 hadoop hadoop 4096 Dec 6 17:21 target
# Enter target/native/Linux-amd64-64 and run the following
[hadoop@hadoop hadoop-lzo-master]$ cd target/native/Linux-amd64-64
[hadoop@hadoop Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so
./libgplcompression.so.0
./libgplcompression.la
./libgplcompression.a
./libgplcompression.so.0.0.
[hadoop@hadoop Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
# Important: copy hadoop-lzo-0.4.21-SNAPSHOT.jar into Hadoop
[hadoop@hadoop hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[hadoop@hadoop hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
4. Update the Hadoop configuration files
4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib
4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
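A side note on property names: mapred.child.env is the old Hadoop 1.x spelling, which Hadoop 2.x still honors through its deprecated-key mapping. If you prefer the 2.x names, the equivalent settings would look like the fragment below (the value should match whatever LD_LIBRARY_PATH you set in 4.3; the path here assumes the lzo install prefix from step 2):

```xml
<!-- Hadoop 2.x equivalents of mapred.child.env -->
<property>
  <name>mapreduce.map.env</name>
  <value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib</value>
</property>
```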
5. Testing the setup
# Prepare a test file of roughly 600 MB
[hadoop@hadoop001 data]$ ll -h
-rw-r--r-- 1 hadoop hadoop 601M Apr 15 09:54 gen_logs
# Compress it with lzop
[hadoop@hadoop001 data]$ lzop gen_logs
[hadoop@hadoop001 data]$ ll -h
-rw-r--r-- 1 hadoop hadoop 601M Apr 15 09:54 gen_logs
-rw-r--r-- 1 hadoop hadoop 231M Apr 15 09:54 gen_logs.lzo
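As a quick sanity check on those sizes, the compressed file comes out at roughly 38% of the original; a sketch with shell integer arithmetic, using the megabyte figures from the listing above:

```shell
# Approximate lzo compression ratio from the listing: 601 MB -> 231 MB
ORIG_MB=601
LZO_MB=231
echo "$(( LZO_MB * 100 / ORIG_MB ))%"   # integer percentage of the original size
# -> 38%
```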
# Upload the file to HDFS
[hadoop@hadoop001 data]$ hadoop fs -put gen_logs.lzo /log
[hadoop@hadoop ~]$ hadoop fs -ls /log
Found 1 items
-rw-r--r-- 1 hadoop supergroup 241258919 2019-04-16 13:40 /log/gen_logs.lzo
# Run a wordcount
[hadoop@hadoop mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /log/gen_logs.lzo /output
As the log shows, there is only one split (number of splits:1):
19/04/17 14:30:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 14:30:07 INFO input.FileInputFormat: Total input paths to process : 1
19/04/17 14:30:07 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/17 14:30:07 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/17 14:30:08 INFO mapreduce.JobSubmitter: number of splits:1
A gzip file cannot be split, and an lzo file is not splittable by default either; lzo, however, can support splitting once an index has been built for it, so we create an index for the file:
[hadoop@hadoop000 hadoop]$ hadoop jar \
share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/log/gen_logs.lzo
# Check that the index file was generated
[hadoop@hadoop hadoop]$ hadoop fs -ls /log
Found 2 items
-rw-r--r-- 1 hadoop supergroup 241258919 2019-04-16 13:40 /log/gen_logs.lzo
-rw-r--r-- 1 hadoop supergroup 19208 2019-04-16 13:50 /log/gen_logs.lzo.index
Generating the index file alone is not enough; the job itself must also be changed to use LzoTextInputFormat.
Otherwise the index file is treated as just another input file, and the data is still processed by a single map.
[hadoop@hadoop mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /log/gen_logs.lzo /output
19/04/17 14:47:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 14:47:51 INFO input.FileInputFormat: Total input paths to process : 1
19/04/17 14:47:52 INFO mapreduce.JobSubmitter: number of splits:2
......
The log now shows number of splits:2, so splitting works.
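That split count matches what block-based splitting predicts. Assuming the default 128 MB HDFS block size (an assumption; check dfs.blocksize on your cluster), the expected count for the 241258919-byte .lzo file is the ceiling of size / block size:

```shell
# Expected splits for a splittable file: ceil(file_size / block_size)
SIZE=241258919                 # gen_logs.lzo size from the hdfs listing above
BLOCK=$((128 * 1024 * 1024))   # assumed dfs.blocksize of 128 MB
echo $(( (SIZE + BLOCK - 1) / BLOCK ))
# -> 2
```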