Word Count

This article explains the fundamentals of the MapReduce framework. Using the WordCount example program, it walks through the split, map, combine, shuffle, and reduce phases, shows how MapReduce is used for large-scale data processing, and provides the full source code.

http://tlyxy228.blog.163.com/blog/static/181090120105208322823/

map(String key1, String value1):
// key1: document name
// value1: document contents (words)
  for each word w in value1:
    EmitIntermediate(w, "1")

reduce(String key2, Iterator values2):
// key2: a word
// values2: a list of counts emitted for that word
  int result = 0
  for each v in values2:
    result += ParseInt(v)
  Emit(AsString(result))
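
Before moving to Hadoop, the same flow can be seen in a minimal in-memory sketch in plain Java (the class and variable names here are illustrative, not part of any Hadoop API):

import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> docs = List.of(
                "hello world bye world",
                "hello hadoop bye hadoop",
                "bye hadoop hello hadoop");

        // Map phase: emit a (word, 1) pair for every word.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : docs)
            for (String w : doc.split("\\s+"))
                intermediate.add(Map.entry(w, 1));

        // Shuffle phase: group the values by key
        // (TreeMap also sorts the keys, so the printed order differs
        // from the example listing below).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : intermediate)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the counts for each word.
        grouped.forEach((word, counts) -> System.out.println(
                word + ": " + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}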

Example:

Input
hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
Output
hello: 3
world: 2
bye: 3
hadoop: 4

Step-by-step walkthrough:

hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
1. Split

With TextInputFormat, each input line becomes one (key, value) pair: the key is the line's byte offset in the file and the value is the line text (offsets below assume one-byte line terminators):

hello world bye world --> (0, "hello world bye world")

hello hadoop bye hadoop --> (22, "hello hadoop bye hadoop")

bye hadoop hello hadoop --> (46, "bye hadoop hello hadoop")

2. Map

hello world bye world 
--> 
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

hello hadoop bye hadoop
-->
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

bye hadoop hello hadoop
-->
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

{ Optional: Combine
Merge each map task's intermediate results locally (conceptually turning repeated <key, value> pairs into <key, list(value)>), which reduces both the number of tuples and the shuffle's network traffic.
Whether Combine is worthwhile depends partly on the data (how many duplicate keys there are) and partly on network bandwidth: when the network is very fast, Combine yields limited gains and may not improve performance at all.

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<world, 2>
<bye, 1>

<hello, 1>
<hadoop, 2>
<bye, 1>

<bye, 1>
<hadoop, 2>
<hello, 1>
}
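
(In the source code at the end of this article, the combiner is simply the Reduce class itself, via conf.setCombinerClass(Reduce.class); reusing the reducer as a combiner is valid here because integer addition is associative and commutative.)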

3. Fold/Shuffle

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

The intermediate pairs are partitioned by key2 into R partitions (using the same hash that decides which node runs each Reduce task) and sorted, producing {key, {value}} groups; each map task stores its partitions on its local disk.
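
As a sketch of that hash(key2) mod R rule, this is roughly what Hadoop's default HashPartitioner does under the old mapred API (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Mirrors the logic of Hadoop's default hash partitioning.
public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // No configuration needed for a stateless hash partitioner.
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is non-negative,
        // then take hash(key2) mod R as described above.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}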

4. Reduce
Each reduce task reads its {key, {value}} groups and, for each (key, {value}) pair, calls the application-defined reduce function. Which Reduce task (node) handles a given intermediate key key2 is decided by, e.g., hash(key2) mod R, so one reduce task may read intermediate data from several map nodes.

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

-->
<hello, 3>
<world, 2>
<bye, 3>
<hadoop, 4>

Source code: WordCount.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input record: key is the line's byte offset,
        // value is the line text. Emits <word, 1> for every token.
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per distinct word: sums all the 1s emitted for it.
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // The reducer doubles as the combiner: summing is associative
        // and commutative, so partial map-side sums are safe.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
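
Assuming the class is packaged into a jar (the jar name below is a placeholder), the job can be launched with something like `hadoop jar wordcount.jar WordCount <input_dir> <output_dir>`, where the two path arguments map to args[0] and args[1] above.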