hadoop compress file

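This article describes different ways to compress files with Hadoop Streaming, including a mapper that uses the cut command to extract a field and a mapper that uses the cat command to copy the content unchanged, and it covers both the Gzip and BZip2 codecs. Concrete examples show how to set the Hadoop job parameters that enable compression.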

compress files in directory to another directory

use ‘cut -f 2’

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/outputcut \
  -mapper "cut -f 2"

This produces one file in the output directory for each file in the input directory (with zero reducers, every map task writes its own part file). After decompressing a part file with ‘gunzip’, its length does not equal the length of the source file: it is shorter by 1 byte for every line in the file, probably because ‘\r\n’ line endings are replaced with ‘\n’.

[houzhizhen@localhost outputcut]$ ll
total 12
-rw-r--r--. 1 houzhizhen root 2938 May 16 10:07 part-00000.gz
-rw-r--r--. 1 houzhizhen root  325 May 16 10:07 part-00001.gz
-rw-r--r--. 1 houzhizhen root  128 May 16 10:07 part-00002.gz
-rw-r--r--. 1 houzhizhen root    0 May 16 10:07 _SUCCESS
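
The line-ending guess can be checked locally. The sketch below assumes that the source file and the matching part file are both on the local filesystem. The helper names (count_cr, size_of, size_diff) are made up for this illustration. It counts the carriage-return bytes in the original and compares that number with the size lost after decompression. If the two numbers match, the ‘\r\n’ explanation is plausible; if they differ, something else changed the data.

```shell
# Print the number of carriage-return (\r) bytes in a file.
count_cr() {
  tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Print the size of a file in bytes.
size_of() {
  wc -c < "$1" | tr -d ' '
}

# Print how many bytes smaller the decompressed part file is than the source.
# $1 = original file, $2 = gzip-compressed part file
size_diff() {
  orig_size=$(size_of "$1")
  out_size=$(gzip -dc "$2" | wc -c | tr -d ' ')
  echo $((orig_size - out_size))
}
```

For example, with a hypothetical source file named input.txt, run count_cr input.txt and size_diff input.txt part-00000.gz and compare the two numbers.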

use ‘/bin/cat’

The output result is identical to the previous test.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input /home/houzhizhen/defaultfs/test/input \
            -output /home/houzhizhen/defaultfs/test/output-gz \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
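
The map-only jobs above leave one .gz part file per map task. If a single archive is wanted without sending everything through one reducer, one possible approach is to concatenate the parts: a file made of several gzip members placed back to back is itself a valid gzip file. This is a minimal sketch rather than part of the original test. It assumes the part files have already been copied to a local directory, and merge_gz_parts is a hypothetical helper name.

```shell
# Concatenate the gzip part files of one job into a single .gz file.
# The shell expands part-*.gz in sorted order, so part-00000.gz comes first.
# $1 = local directory holding part-*.gz, $2 = merged output file
merge_gz_parts() {
  cat "$1"/part-*.gz > "$2"
}
```

For example, merge_gz_parts ./output-gz merged.gz followed by gzip -t merged.gz checks that the result is still readable. When the parts are still in the Hadoop filesystem, hadoop fs -cat with the same part-*.gz pattern, redirected to a local file, should give the same result.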

reduce into one compressed file directly

Note: this sends all the data to a single reduce task, so the job runs very slowly if the input is large.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -input /home/houzhizhen/defaultfs/test/input \
        -output /home/houzhizhen/defaultfs/test/archive \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat

decompress

cd /home/houzhizhen/defaultfs/test/archive
bunzip2 part-00000.bz2
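
Before unpacking a large archive, it can help to confirm that it holds the expected number of lines. The sketch below is one way to do that without writing the decompressed data to disk. lines_in_archive is a hypothetical helper, and it handles only the two extensions used in this article.

```shell
# Print the number of lines stored in a compressed file.
# The decompressor is chosen from the file extension.
lines_in_archive() {
  case "$1" in
    *.bz2) bzip2 -dc "$1" | wc -l | tr -d ' ' ;;
    *.gz)  gzip -dc "$1" | wc -l | tr -d ' ' ;;
    *)     echo "unsupported extension: $1" >&2; return 1 ;;
  esac
}
```

For example, lines_in_archive part-00000.bz2 can be compared with the total line count of the input files. Note that bunzip2 deletes the .bz2 file after unpacking unless the -k option is given.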