Hadoop Ecosystem: How MapReduce Works (Part 6)

ansap

Column: 思普大数据技术. Tags: MapReduce, Hadoop, MR

Article link: https://blog.youkuaiyun.com/welun521/article/details/90700371

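MapReduce is the distributed parallel computing framework provided by Hadoop. Users do not need to write the code that makes a computation distributed and parallel; they only implement their own business logic in a Mapper and a Reducer. This greatly reduces the complexity of writing distributed programs.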

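The MapReduce computation model in Hadoop is one implementation of the general distributed computing pattern. Hadoop provides the low-level machinery of the framework: it ships the MapReduce program to the NodeManager nodes, routes each map task's output to the reduce task that aggregates it, and takes care of data distribution, task startup, monitoring, and resource scheduling. Developers only write the map and reduce business code according to the framework's rules and conventions, and leave the rest of the computation to Hadoop.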
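To make this division of labor concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class `MiniMapReduce` and its methods `run` and `wordCount` are hypothetical names made up for illustration, and a `TreeMap` stands in for everything the real framework does between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.BiFunction;

/** In-memory illustration of the map -> group by key -> reduce model (not Hadoop code). */
final class MiniMapReduce {

    /** "Framework" side: runs the user's map logic, groups values by key, runs the user's reduce logic. */
    static Map<String, Long> run(List<String> lines,
                                 BiConsumer<String, BiConsumer<String, Long>> mapper,
                                 BiFunction<String, List<Long>, Long> reducer) {
        // Map phase: every line may emit any number of (key, value) pairs.
        // The TreeMap plays the role of the shuffle: values are grouped by key and keys are sorted.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            mapper.accept(line, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one call per key, with all values collected for that key.
        Map<String, Long> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, reducer.apply(k, vs)));
        return result;
    }

    /** "User" side: only the business logic for counting words. */
    static Map<String, Long> wordCount(List<String> lines) {
        return run(lines,
                (line, emit) -> {
                    for (String w : line.split(",")) {
                        if (!w.isEmpty()) emit.accept(w, 1L);
                    }
                },
                (word, ones) -> ones.stream().mapToLong(Long::longValue).sum());
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a,b,a", "b,c"))); // prints {a=2, b=2, c=1}
    }
}
```

Only the two lambdas in `wordCount` correspond to what a developer writes; `run` is a toy stand-in for the part that Hadoop performs across many machines.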

Example: counting word frequencies (local run mode)

File E:\word.txt, with the following contents:

郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
郑州,开封,洛阳,南阳,信阳,驻马店,安阳,
周口,周口,郑州,开封,洛阳,周口,开封

1. Mapper class

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused for every record so that no new Writable is allocated per word
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // Goal: count how often each word occurs in the text.
        // Each call handles one input line (key is the byte offset of that line)
        // and writes one (word, 1) pair per word to the context.
        // Line format: 郑州,开封,洛阳,新乡,...
        String[] words = line.toString().split(",");
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // skip empty tokens such as the one between ",,"
            }
            word.set(w);
            context.write(word, ONE);
        }
    }

}

2. Reducer class

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text word, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values emitted for the same word arrive together in one call.
        // Add up the values themselves rather than counting the elements, so the
        // total stays correct even if the values have been pre-aggregated.
        long sum = 0;
        for (LongWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new LongWritable(sum));
    }
}

3. Driver class (main entry point)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriverMain {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "WordCountDriverMain");

        // Mapper and reducer classes that carry the job's business logic
        job.setJarByClass(WordCountDriverMain.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Key/value types of the final (reducer) output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input path of the job (here a single local file)
        FileInputFormat.setInputPaths(job, new Path("E:\\word.txt"));
        // Output directory of the job; it must not exist yet, otherwise the job fails
        FileOutputFormat.setOutputPath(job, new Path("E:\\output"));

        // Submit the job and block until it finishes, printing progress
        boolean result = job.waitForCompletion(true);

        if (result) {
            System.out.println("success");
        } else {
            System.err.println("failed");
        }
    }
}

Output: E:\output\part-r-00000

信阳   6
南阳   6
周口   3
安阳   6
开封   8
洛阳   7
郑州   7
驻马店   6
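The lines of part-r-00000 are sorted by key. Hadoop compares `Text` keys by their UTF-8 encoded bytes, not by any locale-specific order such as pinyin, which explains the order above. The sketch below reproduces that ordering in plain Java; `KeyOrderSketch` and `sortLikeTextKeys` are hypothetical names, and `Arrays.compareUnsigned` (Java 9+) is used here as a stand-in for Hadoop's byte-wise comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sorts strings the way byte-wise key comparison would (illustration only, not Hadoop code). */
final class KeyOrderSketch {

    /** Compares two strings by their UTF-8 bytes, treating each byte as unsigned. */
    static final Comparator<String> UTF8_BYTE_ORDER = (a, b) -> Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));

    /** Returns a sorted copy; the input list is left untouched. */
    static List<String> sortLikeTextKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(UTF8_BYTE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // The eight distinct words of word.txt, in order of first appearance
        List<String> cities = List.of("郑州", "开封", "洛阳", "南阳", "信阳", "驻马店", "安阳", "周口");
        System.out.println(sortLikeTextKeys(cities));
    }
}
```

Running it lists the cities in the same order as the output file, starting with 信阳 and ending with 驻马店.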

A brief overview of the MapReduce workflow

The whole MapReduce process can be summarized as follows:

input --> map --> shuffle --> reduce --> output

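The input files are divided into input splits, which by default correspond to HDFS blocks, and each split is processed by its own map task.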
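As a rough illustration of how many map tasks a file yields, the sketch below mirrors the split arithmetic of `FileInputFormat` for a single non-empty, splittable file: the split size is `max(minSize, min(maxSize, blockSize))`, and the last split may be up to 10% larger than the others. The class name `SplitCountSketch` is hypothetical, and the 128 MB block size and the min/max values are assumed defaults, not values taken from the article.

```java
/** Estimates the number of input splits (and thus map tasks) for one splittable file. */
final class SplitCountSketch {

    // The final split may be up to 10% larger than splitSize before another split is cut
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split's worth of data is left
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the last split
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE); // assumed defaults
        System.out.println(countSplits(300 * mb, splitSize)); // prints 3
        System.out.println(countSplits(130 * mb, splitSize)); // prints 1
    }
}
```

Under these assumptions the tiny word.txt from the example fits into a single split, so the job runs exactly one map task.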

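The output of the map phase is first written to an in-memory buffer and then spilled from the buffer to disk. By default the buffer is 100 MB and the spill threshold is 0.8, so a spill to disk starts once the buffer holds 80 MB. If a map task finishes with less than 80 MB of intermediate output, that data is still written to disk, because the map output must ultimately exist as a file. Records are partitioned and sorted as they are written to disk. A single map task may produce several such spill files; they are finally merged into one file, which is the output file of that map task.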
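The buffer, spill, and merge steps can be sketched in plain Java. This is a simplified model, not Hadoop's implementation: it counts records instead of bytes, keeps the "spill files" in memory, and spills synchronously. The class `SpillSketch` and all of its members are hypothetical names. The partition formula follows the usual hash partitioning scheme, with `String.hashCode` standing in for the key's hash. In a real job the two defaults mentioned above correspond to the settings `mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified model of a map-side output buffer that spills sorted runs and merges them. */
final class SpillSketch {

    /** One map output record, tagged with the partition (reduce task) it belongs to. */
    static final class Rec {
        final int partition;
        final String key;
        final long value;

        Rec(int partition, String key, long value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    static final Comparator<Rec> BY_PARTITION_THEN_KEY =
            Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key);

    private final int spillThreshold;                  // number of buffered records that triggers a spill
    private final int numPartitions;
    private final List<Rec> buffer = new ArrayList<>();
    final List<List<Rec>> spills = new ArrayList<>();  // stands in for the spill files on disk

    SpillSketch(int capacity, double spillPercent, int numPartitions) {
        this.spillThreshold = Math.max(1, (int) Math.round(capacity * spillPercent));
        this.numPartitions = numPartitions;
    }

    /** Called once per map output record. */
    void collect(String key, long value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Rec(partition, key, value));
        if (buffer.size() >= spillThreshold) {
            spill();
        }
    }

    /** Sorts the buffered records by partition and key, then writes them out as one run. */
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Rec> run = new ArrayList<>(buffer);
        run.sort(BY_PARTITION_THEN_KEY);
        spills.add(run);
        buffer.clear();
    }

    /** Flushes what is left in the buffer, then merges all runs into the single map output. */
    List<Rec> close() {
        spill();
        List<Rec> merged = new ArrayList<>();
        spills.forEach(merged::addAll);
        merged.sort(BY_PARTITION_THEN_KEY); // a real merge would stream the already sorted runs
        return merged;
    }

    public static void main(String[] args) {
        SpillSketch out = new SpillSketch(10, 0.8, 2);
        for (int i = 0; i < 20; i++) {
            out.collect("k" + (i % 5), 1L);
        }
        List<Rec> merged = out.close();
        System.out.println(out.spills.size() + " spills, " + merged.size() + " records in the merged output");
    }
}
```

With a capacity of 10 records and a threshold of 0.8, the 20 records above are spilled in runs of 8, 8, and 4. The last run shows that leftover data below the threshold still ends up "on disk" before the merge.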