Word Count

This article explains the fundamentals of the MapReduce framework. Using the WordCount example program, it walks through the split, map, combine, shuffle, and reduce phases, shows how MapReduce is used for large-scale data processing, and provides the full source code.

http://tlyxy228.blog.163.com/blog/static/181090120105208322823/

map(String key1, String value1):
// key1: document name
// value1: document contents (words)
  for each word w in value1:
    EmitIntermediate(w, "1")

reduce(String key2, Iterator values2):
// key2: a word
// values2: a list of counts emitted for that word
  int result = 0
  for each v in values2:
    result += ParseInt(v)
  Emit(AsString(result))
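
Before moving to Hadoop, the same flow can be seen in a minimal in-memory sketch in plain Java (the class and variable names here are illustrative, not part of any Hadoop API):

import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> docs = List.of(
                "hello world bye world",
                "hello hadoop bye hadoop",
                "bye hadoop hello hadoop");

        // Map phase: emit a (word, 1) pair for every word.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : docs)
            for (String w : doc.split("\\s+"))
                intermediate.add(Map.entry(w, 1));

        // Shuffle phase: group the values by key
        // (TreeMap also sorts the keys, so the printed order differs
        // from the example listing below).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : intermediate)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the counts for each word.
        grouped.forEach((word, counts) -> System.out.println(
                word + ": " + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}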

Example:

Input
hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
Output
hello: 3
world: 2
bye: 3
hadoop: 4

Step-by-step walkthrough:

hello world bye world
hello hadoop bye hadoop
bye hadoop hello hadoop
1. Split

With TextInputFormat, each input line becomes one (key, value) pair: the key is the line's byte offset in the file and the value is the line text (offsets below assume one-byte line terminators):

hello world bye world --> (0, "hello world bye world")

hello hadoop bye hadoop --> (22, "hello hadoop bye hadoop")

bye hadoop hello hadoop --> (46, "bye hadoop hello hadoop")

2. Map

hello world bye world 
--> 
<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

hello hadoop bye hadoop
-->
<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

bye hadoop hello hadoop
-->
<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

{ Optional: Combine
Merge each map task's intermediate results locally (conceptually turning repeated <key, value> pairs into <key, list(value)>), which reduces both the number of tuples and the shuffle's network traffic.
Whether Combine is worthwhile depends partly on the data (how many duplicate keys there are) and partly on network bandwidth: when the network is very fast, Combine yields limited gains and may not improve performance at all.

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<world, 2>
<bye, 1>

<hello, 1>
<hadoop, 2>
<bye, 1>

<bye, 1>
<hadoop, 2>
<hello, 1>
}
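
(In the source code at the end of this article, the combiner is simply the Reduce class itself, via conf.setCombinerClass(Reduce.class); reusing the reducer as a combiner is valid here because integer addition is associative and commutative.)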

3. Fold/Shuffle

<hello, 1>
<world, 1>
<bye, 1>
<world, 1>

<hello, 1>
<hadoop, 1>
<bye, 1>
<hadoop, 1>

<bye, 1>
<hadoop, 1>
<hello, 1>
<hadoop, 1>

-->

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

The intermediate pairs are partitioned by key2 into R partitions (using the same hash that decides which node runs each Reduce task) and sorted, producing {key, {value}} groups; each map task stores its partitions on its local disk.
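
As a sketch of that hash(key2) mod R rule, this is roughly what Hadoop's default HashPartitioner does under the old mapred API (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Mirrors the logic of Hadoop's default hash partitioning.
public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // No configuration needed for a stateless hash partitioner.
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is non-negative,
        // then take hash(key2) mod R as described above.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}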

4. Reduce
Each reduce task reads its {key, {value}} groups and, for each (key, {value}) pair, calls the application-defined reduce function. Which Reduce task (node) handles a given intermediate key key2 is decided by, e.g., hash(key2) mod R, so one reduce task may read intermediate data from several map nodes.

<hello, 1>
<hello, 1>
<hello, 1>

<world, 1>
<world, 1>

<bye, 1>
<bye, 1>
<bye, 1>

<hadoop, 1>
<hadoop, 1>
<hadoop, 1>
<hadoop, 1>

-->
<hello, 3>
<world, 2>
<bye, 3>
<hadoop, 4>

Source code: WordCount.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input record: key is the line's byte offset,
        // value is the line text. Emits <word, 1> for every token.
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per distinct word: sums all the 1s emitted for it.
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // The reducer doubles as the combiner: summing is associative
        // and commutative, so partial map-side sums are safe.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
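
Assuming the class is packaged into a jar (the jar name below is a placeholder), the job can be launched with something like `hadoop jar wordcount.jar WordCount <input_dir> <output_dir>`, where the two path arguments map to args[0] and args[1] above.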