【Hadoop】Compression Options

This post compares several compression formats commonly used in MapReduce, including bzip2, zlib, lzo, and snappy. Weighing compression ratio against compression/decompression speed, it recommends using different codecs at different stages of a job to balance I/O efficiency and processing time.

http://comphadoop.weebly.com/experiment-and-results.html

http://comphadoop.weebly.com/

http://www.slideshare.net/ydn/hug-compression-talk

The articles above survey the compression formats commonly seen in MR and compare them with experimental data. Compression can typically be applied at three stages: map input, map output, and reduce output.

According to the experimental data in the third slide deck, the codecs trade space against time: by compression ratio, bzip2 > zlib (deflate, gzip) > lzo/snappy; by speed the ordering is reversed. The gap is largest on the compression side: lzo/snappy compress roughly 6~8x faster than zlib and 8~10x faster than bzip2. The decompression gap is smaller: lzo/snappy decompress about 1~2x faster than zlib (though nearly 20x faster than bzip2).
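The space/time tradeoff is easy to observe directly. A minimal sketch using only the Python standard library (which ships zlib and bzip2 but not lzo/snappy, so this only illustrates the zlib-vs-bzip2 end of the spectrum); the sample payload and compression levels are arbitrary choices for the demo:

```python
import bz2
import time
import zlib

# Repetitive log-like text: the kind of payload that compresses well.
data = b"INFO mapreduce.Job: map 50% reduce 0%\n" * 50000

for name, compress, decompress in [
    ("zlib",  lambda d: zlib.compress(d, 6), zlib.decompress),
    ("bzip2", lambda d: bz2.compress(d, 9),  bz2.decompress),
]:
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # round trip must be lossless
    print(f"{name}: ratio {len(data) / len(packed):.1f}x, "
          f"compress {t1 - t0:.4f}s, decompress {t2 - t1:.4f}s")
```

On most machines bzip2 shows a higher ratio but markedly slower times, matching the ordering reported in the slides.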

Accordingly, when choosing compression for an MR job:

  • For map input, SequenceFile + zlib (the default codec) is a common choice. Although bzip2 has the highest compression ratio, its decompression is so slow that the I/O time saved may be offset by decompression time. zlib decompresses reasonably fast and compresses better than lzo/snappy; since MR jobs are usually I/O-bound, zlib tends to give better overall performance.
  • For map output, which must be compressed on the map side and decompressed on the reduce side, the faster codecs lzo/snappy are usually preferred.
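The choices above can be expressed as a job configuration. A sketch using the Hadoop 2.x+ property names (older releases use the `mapred.compress.map.output` family instead); whether SnappyCodec is actually available depends on the native libraries installed on the cluster:

```xml
<!-- Per-job settings (e.g. in mapred-site.xml or via -D on the command line) -->
<configuration>
  <!-- Compress intermediate map output with a fast codec -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>

  <!-- Compress final job output with a higher-ratio codec -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
</configuration>
```

`hadoop checknative` reports which compression libraries are available on a given node.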

