MapReduce入门程序WordCount增强版_mapreduce新版wordcount2-优快云博客

文章来源:http://www.madongfly.cn/articles/enhanced-wordcount.html

WordCount程序应该是学习MapReduce编程最经典的样例程序了，小小一段程序就基本概括了MapReduce编程模型的核心思想。

现在考虑实现一个增强版的WordCount程序，要求：

提供大小写忽略的选项。
在原始串中，过滤掉一些内容，例如要过滤hexie，那么单词hexieshehui就作为shehui统计。第一个很好实现，只需要在map函数里判断一下要不要toLowerCase()即可。第二个也很好实现，将需要过滤的内容组合成一个长字符串，通过JobConf设置即可，但是如果需要过滤的参数很多，多到需要从DFS上的文件里读取呢。显然，我们可以在map函数里直接读取DFS上的文件，但是这并不是最优的办法，Hadoop的官方文档提供的WordCount2.0给了一个很好的办法。该代码还包括了其他一些很有用的技巧，让我们来好好分析一下吧。:)

package org.myorg;
import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount extends Configured implements Tool {
   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
     static enum Counters { INPUT_WORDS }
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();
     private boolean caseSensitive = true;
     private Set<String> patternsToSkip = new HashSet<String>();
     private long numRecords = 0;
     private String inputFile;
     public void configure(JobConf job) {
       caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
       inputFile = job.get("map.input.file");
       if (job.getBoolean("wordcount.skip.patterns", false)) {
         Path[] patternsFiles = new Path[0];
         try {
           patternsFiles = DistributedCache.getLocalCacheFiles(job);
         } catch (IOException ioe) {
           System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(ioe));
         }
         for (Path patternsFile : patternsFiles) {
           parseSkipFile(patternsFile);
         }
       }
     }
     private void parseSkipFile(Path patternsFile) {
       try {
         BufferedReader fis = new BufferedReader(new FileReader(patternsFile.toString()));
         String pattern = null;
         while ((pattern = fis.readLine()) != null) {
           patternsToSkip.add(pattern);
         }
       } catch (IOException ioe) {
         System.err.println("Caught exception while parsing the cached file '" + patternsFile + "' : " + StringUtils.stringifyException(ioe));
       }
     }
     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();
       for (String pattern : patternsToSkip) {
         line = line.replaceAll(pattern, "");
       }
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         output.collect(word, one);
         reporter.incrCounter(Counters.INPUT_WORDS, 1);
       }
       if ((++numRecords % 100) == 0) {
         reporter.setStatus("Finished processing " + numRecords + " records " + "from the input file: " + inputFile);
       }
     }
   }
   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       int sum = 0;
       while (values.hasNext()) {
         sum += values.next().get();
       }
       output.collect(key, new IntWritable(sum));
     }
   }
   public int run(String[] args) throws Exception {
     JobConf conf = new JobConf(getConf(), WordCount.class);
     conf.setJobName("wordcount");
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);
     conf.setMapperClass(Map.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);
     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);
     List<String> other_args = new ArrayList<String>();
     for (int i=0; i < args.length; ++i) {
       if ("-skip".equals(args[i])) {
         DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
         conf.setBoolean("wordcount.skip.patterns", true);
       } else {
         other_args.add(args[i]);
       }
     }
     FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
     FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
     JobClient.runJob(conf);
     return 0;
   }
   public static void main(String[] args) throws Exception {
     int res = ToolRunner.run(new Configuration(), new WordCount(), args);
     System.exit(res);
   }
}

下面我们来逐一分析一下该程序与原始版本的不同之处。

在最初版的wordCount里，程序是在main函数里直接runJob的，而增强版的main函数里通过调用ToolRunner.run()函数启动程序。
该函数的原型是public static int run(Configuration conf, Tool tool, String[] args)，其功能是将args作为参数，conf作为配置运行tool。

Tool 是Map/Reduce工具或应用的标准。ToolRunner用来运行实现了Tool接口的类，它与GenericOptionsParser合作解析Hadoop的命令行参数。
Hadoop命令行的常用选项有：

-conf
-D

-fs
-jt

应用程序应该只处理其定制参数，把标准命令行选项通过 ToolRunner.run(Tool, String[]) 委托给 GenericOptionsParser处理。

增强版的WordCount类继承了Configured类并实现了Tool接口，因此第95行中的第二个参数就是WordCount类。这也是典型的实现Tool接口的写法。Configured类提供了88行的函数getConf()，该函数功能是获得对象自身的配置。Tool接口主要实现一个 run函数，然后通过ToolRunner.run调用执行。

在run函数中，第83行，通过DistributedCache将参数文件分发到HDFS缓存文件。

DistributedCache 是Map/Reduce框架提供的功能，能够缓存应用程序所需的文件（包括文本，档案文件，jar文件等）。应用程序在JobConf中通过url(hdfs://)指定需要被缓存的文件。 DistributedCache假定由hdfs://格式url指定的文件已经在 FileSystem上了。Map-Redcue框架在作业所有任务执行之前会把必要的文件拷贝到slave节点上。DistributedCache运行高效是因为每个作业的文件只拷贝一次并且为那些没有文档的slave节点缓存文档。

DistributedCache 根据缓存文档修改的时间戳进行追踪。在作业执行期间，当前应用程序或者外部程序不能修改缓存文件。

用户可以通过设置mapred.cache.{files|archives}来分发文件。如果要分发多个文件，可以使用逗号分隔文件所在路径。也可以利用API来设置该属性： DistributedCache.addCacheFile(URI,conf)/ DistributedCache.addCacheArchive(URI,conf) and DistributedCache.setCacheFiles(URIs,conf)/ DistributedCache.setCacheArchives(URIs,conf) 其中URI的形式是 hdfs://host:port/absolute-path#link-name 在Streaming程序中，可以通过命令行选项 -cacheFile/-cacheArchive 分发文件。

在第25行获得缓存的参数文件。

在第12行用到了Counters， Counters 是多个由Map/Reduce框架或者应用程序定义的全局计数器。每一个Counter可以是任何一种 Enum类型。同一特定Enum类型的Counter可以汇集到一个组，其类型为Counters.Group。应用程序可以定义任意(Enum类型)的 Counters并且可以通过 map 或者 reduce方法中的 Reporter.incrCounter(Enum, long)或者 Reporter.incrCounter(String, String, long) 更新。之后框架会汇总这些全局counters。

在第54行用到了Reporter，Reporter是用于Map/Reduce应用程序报告进度，设定应用级别的状态消息，更新Counters（计数器）的机制。

Mapper和Reducer的实现可以利用Reporter 来报告进度，或者仅是表明自己运行正常。在那种应用程序需要花很长时间处理个别键值对的场景中，这种机制是很关键的，因为框架可能会以为这个任务超时了，从而将它强行杀死。另一个避免这种情况发生的方式是，将配置参数mapred.task.timeout设置为一个足够高的值（或者干脆设置为零，则没有超时限制了）。第57行就用reporter来设置了程序运行的状态。

第20行标记是否忽略大小写，该参数并没有在程序中设置，而是留给运行程序的用户了。

另外，在第50行，采用了StringTokenizer进行单词的分割，记得当时做项目的时候就查看过API，StringTokenizer是不推荐使用的，所以我们都是采用split来实现。

下面是增强版WordCount的运行样例及结果

输入样例：

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

运行程序：

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

输出：

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

现在通过DistributedCache插入一个模式文件，文件中保存了要被忽略的单词模式。

$ hadoop dfs -cat /user/joe/wordcount/patterns.txt
.
,
!
to

再运行一次，这次使用更多的选项：

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=true /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

应该得到这样的输出：

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1

再运行一次，这一次关闭大小写敏感性（case-sensitivity）：

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=false /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt

输出：

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

最后，比较囧的是，我竟然是第一次看到第60行的这种用法，虽然一眼就能判断出这是foreach操作，但是之前一直不知道Java还支持这种使用，查了一下，是1.5加入的特性。