Avoiding empty reduce output files in MapReduce jobs

This post looks at why reduce tasks in MapReduce produce empty output files, the downsides this brings (extra NameNode load, wasted compute in downstream jobs), and how to avoid the empty files with LazyOutputFormat.


In MapReduce, a reduce task that produces no output still leaves behind a zero-byte file by default. The reason is that when the reducer starts, the framework cannot know whether any records will be emitted, so the default output format eagerly creates an output file; if nothing is ever written, closing it leaves an empty file, as in the listing below (a reducer sketch showing a typical way this happens follows the listing). This has several drawbacks:
1) Many small files are produced, which puts extra pressure on the NameNode.
2) When the next stage processes this data, the empty files waste some compute resources.
3) It simply looks untidy.

-rw-r--r--   3 hadoop supergroup          0 2018-05-09 10:38 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/_temporary
-rw-r--r--   3 hadoop supergroup     290779 2018-05-09 10:37 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00000
-rw-r--r--   3 hadoop supergroup     102365 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00001
-rw-r--r--   3 hadoop supergroup     210493 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00002
-rw-r--r--   3 hadoop supergroup     194585 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00003
-rw-r--r--   3 hadoop supergroup      97649 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00004
-rw-r--r--   3 hadoop supergroup          0 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00005
-rw-r--r--   3 hadoop supergroup      74188 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00006
-rw-r--r--   3 hadoop supergroup      61837 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00007
-rw-r--r--   3 hadoop supergroup     254879 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00008
-rw-r--r--   3 hadoop supergroup       6061 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00009
-rw-r--r--   3 hadoop supergroup          0 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00010
-rw-r--r--   3 hadoop supergroup     126900 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00011
-rw-r--r--   3 hadoop supergroup       3623 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00012
-rw-r--r--   3 hadoop supergroup      75816 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00013
-rw-r--r--   3 hadoop supergroup     358310 2018-05-09 10:38 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00014
-rw-r--r--   3 hadoop supergroup     115713 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00015
-rw-r--r--   3 hadoop supergroup      90556 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00016
-rw-r--r--   3 hadoop supergroup     150605 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00017
-rw-r--r--   3 hadoop supergroup          0 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00018
-rw-r--r--   3 hadoop supergroup      54610 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00019
-rw-r--r--   3 hadoop supergroup     163868 2018-05-09 10:36 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00020
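
For context, a typical situation that produces these empty part files is a reducer that filters its input and may emit nothing for some partitions. The sketch below is a minimal, hypothetical example (the class name and threshold are made up for illustration, not taken from the actual job): with the default FileOutputFormat, reducers that end up dropping every key still leave behind zero-byte part-r-* files like the ones above.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: only URLs seen more than once are written out.
// A partition in which every key is filtered away emits nothing, yet the
// default FileOutputFormat has already opened a part-r-NNNNN file for it.
public class AdUrlFilterReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        if (sum > 1) { // the filter may drop every key handled by this reducer
            context.write(url, new LongWritable(sum));
        }
    }
}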

The fix is simple: use LazyOutputFormat. As the name suggests, it is a lazy OutputFormat: it only creates the underlying file on HDFS when there is actually a record to write, so empty files are never produced.
It is used like this:

// Import the required classes
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// In the job driver, wrap the real output format with LazyOutputFormat
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
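
For completeness, here is a minimal driver sketch showing where the call fits. The class names, job name, and paths are placeholders (the reducer is the hypothetical one from the sketch above; the mapper is a placeholder for your own); note that FileOutputFormat.setOutputPath is still used to set the output directory, while LazyOutputFormat only wraps the actual writer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DailyAdUrlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "daily_ad_url");
        job.setJarByClass(DailyAdUrlDriver.class);

        job.setMapperClass(AdUrlMapper.class);         // placeholder: your own mapper class
        job.setReducerClass(AdUrlFilterReducer.class); // hypothetical reducer from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Wrap TextOutputFormat so each part file is created lazily,
        // i.e. only when the first record is actually written to it.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}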

After re-running the job, everything looks right: the empty files are gone, as shown below.

-rw-r--r--   3 hadoop supergroup          0 2018-05-09 11:04 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2018-05-09 11:03 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/_temporary
-rw-r--r--   3 hadoop supergroup     307566 2018-05-09 11:03 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00000
-rw-r--r--   3 hadoop supergroup     101794 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00001
-rw-r--r--   3 hadoop supergroup     219026 2018-05-09 11:03 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00002
-rw-r--r--   3 hadoop supergroup     195994 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00003
-rw-r--r--   3 hadoop supergroup     101289 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00004
-rw-r--r--   3 hadoop supergroup      73074 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00006
-rw-r--r--   3 hadoop supergroup      60595 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00007
-rw-r--r--   3 hadoop supergroup     258069 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00008
-rw-r--r--   3 hadoop supergroup       6061 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00009
-rw-r--r--   3 hadoop supergroup     127131 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00011
-rw-r--r--   3 hadoop supergroup       3623 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00012
-rw-r--r--   3 hadoop supergroup      75725 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00013
-rw-r--r--   3 hadoop supergroup     366958 2018-05-09 11:04 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00014
-rw-r--r--   3 hadoop supergroup     113457 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00015
-rw-r--r--   3 hadoop supergroup      87421 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00016
-rw-r--r--   3 hadoop supergroup     148302 2018-05-09 11:03 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00017
-rw-r--r--   3 hadoop supergroup      54856 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00019
-rw-r--r--   3 hadoop supergroup     162752 2018-05-09 11:02 /home/hadoop/offline/js_ad/daily_ad_url/2018-05-08/part-r-00020