Part 1:
When writing a map function, you need to extend the following parent class and override some of its methods.
# The Context class provides various mechanisms for communicating with the Hadoop framework.
class Mapper<K1, V1, K2, V2>
{
void map(K1 key, V1 value, Mapper.Context context)
throws IOException, InterruptedException
{..}
}
# This method is called before any key/value pairs are passed to the map method. Default implementation: does nothing.
protected void setup( Mapper.Context context)
throws IOException, InterruptedException
# This method is called after all key/value pairs have been passed to the map method. Default implementation: does nothing.
protected void cleanup( Mapper.Context context)
throws IOException, InterruptedException
# This method controls the overall flow of task processing within the JVM. The default implementation calls setup once, then calls map repeatedly for each key/value pair in the split, and finally calls cleanup.
protected void run( Mapper.Context context)
throws IOException, InterruptedException
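To make these lifecycle methods concrete, here is a minimal sketch (not from the original text) of a mapper that overrides setup and cleanup; the configuration key example.skip.empty and the counter group "Example" are hypothetical names used only for illustration:
public static class LifecycleMapper extends Mapper<Object, Text, Text, IntWritable>
{
    // Hypothetical flag read from the job configuration in setup
    private boolean skipEmpty;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        // Runs once before any key/value pairs are passed to map
        skipEmpty = context.getConfiguration().getBoolean("example.skip.empty", true);
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        if (skipEmpty && value.toString().trim().isEmpty())
        {
            return;  // ignore blank lines when the flag is set
        }
        context.write(value, new IntWritable(1));  // emit the whole line with a count of 1
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException
    {
        // Runs once after all key/value pairs have been passed to map
        context.getCounter("Example", "MapperTasks").increment(1);
    }
}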
Part 2:
The Reducer is similar to the Mapper.
First extend the Reducer parent class and override the reduce method.
# The reduce method receives K2/V2 as input and produces K3/V3 as output.
public class Reducer<K2, V2, K3, V3>
{
void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
throws IOException, InterruptedException
{..}
}
This class also has setup, run, and cleanup methods with default implementations similar to those in the Mapper class, which can optionally be overridden:
# The following method is called before any key/value pairs are passed to the reduce method. Default implementation: does nothing.
protected void setup( Reducer.Context context)
throws IOException, InterruptedException
# The following method is called after all key/value pairs have been passed to the reduce method. Default implementation: does nothing.
protected void cleanup( Reducer.Context context)
throws IOException, InterruptedException
# The following method controls the overall flow of task processing within the JVM. The default implementation calls setup once, then calls reduce repeatedly for however many keys and associated value lists are provided to the Reducer, and finally calls cleanup.
protected void run( Reducer.Context context)
throws IOException, InterruptedException
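A corresponding minimal sketch for the reducer side, again with hypothetical names (the configuration key example.min.count and the counter group are assumptions, not part of the Hadoop API):
public static class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    // Hypothetical threshold read from the job configuration in setup
    private int minCount;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        // Runs once before any keys are passed to reduce
        minCount = context.getConfiguration().getInt("example.min.count", 1);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int total = 0;
        for (IntWritable val : values)
        {
            total += val.get();  // sum the values for this key
        }
        if (total >= minCount)
        {
            context.write(key, new IntWritable(total));  // emit <key, total> as K3/V3
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException
    {
        // Runs once after all keys have been passed to reduce
        context.getCounter("Example", "ReducerTasks").increment(1);
    }
}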
Part 3:
The driver (or configuration) logic usually lives in the main method of the class written to encapsulate a MapReduce job.
public class ExampleDriver
{
...
public static void main(String[] args) throws Exception
{
// Create a Configuration object that is used to set other options
Configuration conf = new Configuration() ;
// Create the object representing the job
Job job = new Job(conf, "ExampleJob") ;
// Set the name of the main class in the job jarfile
job.setJarByClass(ExampleDriver.class) ;
// Set the mapper class
job.setMapperClass(ExampleMapper.class) ;
// Set the reducer class
job.setReducerClass(ExampleReducer.class) ;
// Set the types for the final output key and value
job.setOutputKeyClass(Text.class) ;
job.setOutputValueClass(IntWritable.class) ;
// Set input and output file paths
FileInputFormat.addInputPath(job, new Path(args[0])) ;
FileOutputFormat.setOutputPath(job, new Path(args[1])) ;
// Execute the job and wait for it to complete
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
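Two further settings often appear in drivers like this, though they are not part of the listing above: a combiner (a local reduce run on each mapper's output) and, when the map output types differ from the final output types, explicit map output key/value classes. A minimal sketch, assuming the job's reducer is safe to reuse as a combiner:
// Optional extras inside main, after the other job.set* calls (sketch)
job.setCombinerClass(ExampleReducer.class);      // run a local reduce on each mapper's output
job.setMapOutputKeyClass(Text.class);            // only needed if the map output types
job.setMapOutputValueClass(IntWritable.class);   // differ from the final output types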
Part 4:
Before compiling any Hadoop-related code, the classpath needs to point at the Hadoop core JAR (hadoop-core-1.0.4.jar):
$ export CLASSPATH=.:${HADOOP_HOME}/hadoop-core-1.0.4.jar:${CLASSPATH}
WordCount1.java file
import java.io.* ;
import org.apache.hadoop.conf.Configuration ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount1
{
public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(" ") ;
for (String str: words)
{
word.set(str);
context.write(word, one);
}
}
}
public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int total = 0;
for (IntWritable val : values) {
total++ ;
}
context.write(key, new IntWritable(total));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount1.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
# Notes:
# The job specifies TextInputFormat as the format of the input data. By default this passes the mapper one record per line, where the key is the byte offset of the line within the file and the value is the text of the line.
# The mapper is executed once for each line of text in the input source, and each time it takes the line and breaks it into words. It then uses the Context object to output (commonly called emitting) each new key/value pair of the form <word, 1>. These are our K2/V2 values.
# We said earlier that the input to the reducer is a key and a corresponding list of values, and that some magic happens between the map and reduce methods to gather together the values for each key (the shuffle). Hadoop executes the reducer once for each key; the preceding reducer implementation simply counts the entries in the Iterable object and emits output for each word in the form <word, count>. These are our K3/V3 values.
# The input and output locations are specified via the arguments passed to the class.
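Although TextInputFormat is the default and the job above never sets it explicitly, the driver could name both the input and output formats directly; a small sketch of the two extra calls:
// Optional: state the default formats explicitly in the driver (sketch)
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);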
# Compile the Java source.
$ javac WordCount1.java
#Create a JAR file from the generated class files.
$ jar cvf wc1.jar WordCount1*class
#Submit the new JAR file to Hadoop for execution.
$ hadoop jar wc1.jar WordCount1 test.txt output
#Check the output file; it should be as follows:
$ hadoop fs -cat output/part-r-00000