A Detailed Walkthrough of the MapReduce WordCount Source Code

License: CC 4.0 BY-SA

Tags: #MapReduce #wordcount #SourceCode

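This article walks through the basic MapReduce execution flow and then examines the Map and Reduce phases of the WordCount program. In the Map phase, the input is presented as <key, value> pairs, where the key is the offset of the text and the value is the text content; the output pairs each word with a count of 1. The Reduce phase sums the counts of identical words and outputs the final word frequencies.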


Basic MapReduce Execution Flow

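Where programming languages use a "hello world" program as the introductory example, big-data processing commonly uses "wordcount" instead. The WordCount program counts how many times each word occurs in the input data. Its basic execution flow is shown in the figure below: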

[Figure: basic execution flow of the WordCount program]

A MapReduce-based WordCount program consists mainly of the following parts:

1. Split

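The program's input data is divided into splits, and each split is handed to one Map Task. The number of splits is configurable, for example by adjusting the minimum and maximum split size.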
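To make the split step concrete, here is a minimal sketch of how a file-based input format can derive the number of splits from the file size. It follows the rule that, to my understanding, Hadoop's `FileInputFormat` applies (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to run about 10% over). The class `SplitSketch` and its method names are hypothetical and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hadoop class: simulates how split lengths are chosen.
class SplitSketch {
    // Assumed slack factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the length in bytes of each split for a file of the given length.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            lengths.add(remaining); // last, possibly shorter or slightly longer, split
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        // A 300 MB file with 128 MB splits yields 128 MB + 128 MB + 44 MB.
        System.out.println(splitLengths(300 * mb, splitSize).size() + " splits");
    }
}
```

Each element of the returned list corresponds to one Map Task. Lowering the maximum split size below the block size is one way to get more, smaller splits.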

2. Map

The input is the data in one split. The Map Task breaks that data into words and stores them as <key, value> pairs, where the key is a word and the value is always 1, for example <I, 1>, <wish, 1>, ...
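The map step can be sketched in plain Java without any Hadoop dependency. The class `MapSketch` and its `map` method are hypothetical stand-ins for a real `Mapper`. The sketch splits each line on whitespace with `StringTokenizer` and ignores the input key, which with Hadoop's default text input is the byte offset of the line.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for a Mapper: turns one line of text into (word, 1) pairs.
class MapSketch {
    // offset is the input key (unused for counting); line is the input value.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1)); // emit <word, 1>
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "I wish I may")); // prints [I=1, wish=1, I=1, may=1]
    }
}
```

Note that the mapper does no counting of its own: a repeated word such as "I" is emitted twice, each time with the value 1, and the summing is left to later stages.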

3. Shuffle/Combine/Sort

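Simple MapReduce programs usually do not need to handle these steps explicitly: the framework supplies default shuffle and sort behavior, while a combiner is optional and runs only when the job configures one. Their roles are: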

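Combine: merges and pre-aggregates the output of a Map Task on its local node, which reduces the amount of data that the subsequent Shuffle has to transfer across the cluster.
Shuffle / Sort: transfers the results of the Map Tasks across the cluster and sorts them; after this stage the data becomes the input of the Reduce side.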
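A rough single-process sketch of these steps follows. `combine` pre-aggregates one map task's output locally, `partition` picks the reduce task for a key with the same hash-and-modulo rule that Hadoop's default `HashPartitioner` uses, and `shuffleAndSort` merges the combined outputs of several map tasks into key-sorted groups for one reducer. All class and method names here are hypothetical; a real cluster moves this data over the network and sorts it with spills to disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of combine, partition, shuffle and sort.
class ShuffleSketch {
    // Combine: sum the counts of identical words within ONE map task's output.
    static Map<String, Integer> combine(List<String> wordsFromOneMapTask) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String word : wordsFromOneMapTask) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    // Partition: non-negative hash modulo the number of reduce tasks.
    // Uses String.hashCode here, so the indices differ from Hadoop's Text hashing.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Shuffle + sort: gather the pairs destined for one reducer, grouped by sorted key.
    static TreeMap<String, List<Integer>> shuffleAndSort(
            List<Map<String, Integer>> combinedOutputs, int reducer, int numReduceTasks) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                if (partition(e.getKey(), numReduceTasks) == reducer) {
                    grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
                }
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, Integer> task0 = combine(Arrays.asList("I", "wish", "I"));
        Map<String, Integer> task1 = combine(Arrays.asList("wish", "you", "I"));
        // With a single reducer, every key lands in partition 0.
        System.out.println(shuffleAndSort(Arrays.asList(task0, task1), 0, 1));
        // prints {I=[2, 1], wish=[1, 1], you=[1]}
    }
}
```

Without the combine step, the first map task would ship two separate <I, 1> pairs; with it, a single <I, 2> crosses the network instead.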

4. Reduce

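The input to a Reduce Task is no longer a stream of independent <key, value> pairs: after sorting, pairs with the same key are grouped together, so each key arrives with all of its values. The Reduce Task aggregates them (for WordCount, by summing the counts) and produces the final output.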
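For WordCount, the reduce step amounts to summing the grouped values of each key. The sketch below assumes its input has already been grouped and sorted by key; `ReduceSketch` and its methods are hypothetical names, not Hadoop classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a Reducer: sums the counts collected for each word.
class ReduceSketch {
    // Called once per key with every value that was grouped under that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Applies reduce to each group and collects <word, total> in key order.
    static Map<String, Integer> runReducer(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("I", Arrays.asList(2, 1));
        grouped.put("wish", Arrays.asList(1, 1));
        System.out.println(runReducer(grouped)); // prints {I=3, wish=2}
    }
}
```

Because addition is associative and commutative, the same summing logic gives the right totals whether the values are raw 1s or partial sums from a combiner.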

WordCount Source Code Analysis

This article reads through the WordCount program as it appears in the Hadoop 2.7.6 source.

The WordCount source code in Hadoop 2.7.6 is as follows:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache