Testing and Analyzing a First MapReduce Program

License: CC 4.0 BY-SA

Tags:

#mapreduce #Hadoop #distributed

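This post shows how to run a MapReduce program both in Eclipse and from the command line, using WordCount as the example to explain the MapReduce workflow. It walks through the mapper class, the reducer class, and the main function, explains what the Combiner does, and mentions a problem met in practice along with its fix.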


Now let's start on MapReduce. The first task is to get a MapReduce program running. There are two ways to do that: inside Eclipse, or from the command line. Eclipse comes first.

1. File -> New -> Other -> Map/Reduce Project opens the new-project dialog.

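Enter a name under Project name and click Finish. Then add a package under <project name>/src and put the MapReduce Java source in it. For convenience I used the WordCount.java example that ships with Hadoop (path: /${hadoop_home}/hadoop-examples-1.2.1.jar//org/apache/hadoop/examples/WordCount.java). Once it is added, right-click the project -> Run Configurations -> Java Application -> WordCount -> Arguments and enter the input path and the output path. The output path must not already exist on HDFS, otherwise the job fails with an error; the input path must exist on HDFS, so upload your input files to HDFS first. My input is a directory named input containing two files, file01 and file02, each holding just a few words.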

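Click Run to start the job. Because this is a pseudo-distributed setup, it takes about as long as an ordinary Java program. Afterwards, browse HDFS through the web UI to find the output.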

Open output01 and then the file part-r-00000: it lists each word together with its count.

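Next, running from the command line (more cumbersome, not recommended):

2. There are only three main steps, but the required libraries have to be packaged along with the program, which makes it tedious.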

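(1) Compile

Before this step, copy the required jars into the directory that holds the Java source file, then run

javac -classpath /${hadoop_home}/hadoop-core-1.2.1.jar -d <output directory> <source file>

(2) Package

jar -cvf <jar name> -C <class directory> .

(3) Run

${hadoop_home}/bin/hadoop jar <jar to run> [main class] <input path> <output path>

The main class has to be given when the jar's manifest does not name one.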

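For a more detailed walkthrough, see http://hi.baidu.com/royripple/item/8294721f0f4e33fb64eabfc0

Now let's go through the source code.

The Java file contains the WordCount class, which holds two nested classes and a main function. The nested classes could equally be written as separate top-level classes.

First, the mapper class: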

public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

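The new mapper class extends Mapper and overrides its map method. The four generic parameters are <input key type, input value type, output key type, output value type>. With the default input format the input key is the byte offset of the line within the file (not a line number), and it is of no use here. Text plays the role of Java's String and IntWritable that of Integer. The map method extracts the words from the text with

 word.set(itr.nextToken());

(note: by default StringTokenizer splits only on whitespace, that is space, tab, newline, carriage return and form feed, so punctuation such as !, + and - stays attached to the token) and writes each one to context, from which the framework delivers it to the Reducer. The Mapper output (what is written to context) is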

(Hello , 1)

(World , 1)

(Bye , 1)

(World , 1)

(Hello , 1)

(Hadoop , 1)

(GoodBye , 1)

(Hadoop , 1)
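The map step can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: `MapStepSketch` and its `map` helper are hypothetical names, and the sample lines are assumptions about what file01 might contain. It mimics what TokenizerMapper emits and shows that punctuation is not treated as a delimiter.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are involved.
class MapStepSketch {
    // Emits one (word, 1) pair per token, as TokenizerMapper does with context.write.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // The one-argument constructor splits on space, tab, newline, CR and form feed only.
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample line: four tokens, so four pairs.
        System.out.println(map("Hello World Bye World"));
        // Punctuation stays attached: the tokens are "Hello,", "World!" and "a-b".
        System.out.println(map("Hello, World! a-b"));
    }
}
```

If punctuation should be stripped, one option is the two-argument constructor `new StringTokenizer(line, delimiters)` with an explicit delimiter string.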

After the framework's shuffle and sort (leaving the Combiner aside for the moment), the Reducer's input is

(Hello,[1,1])

(World,[1,1])

(Bye,[1])

(Hadoop,[1,1])

(GoodBye,[1])
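The grouping that happens between map and reduce can be imitated in a few lines. This is only a rough sketch in plain Java, with hypothetical names (`ShuffleSketch`, `group`); the real framework also partitions, spills to disk and merges, none of which is modeled here. A TreeMap is used because the framework hands keys to the Reducer in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the group-by-key step between map and reduce.
class ShuffleSketch {
    // Collects all values that share a key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The eight (word, 1) pairs listed as the Mapper output.
        String[] keys = {"Hello", "World", "Bye", "World", "Hello", "Hadoop", "GoodBye", "Hadoop"};
        int[] ones = {1, 1, 1, 1, 1, 1, 1, 1};
        // prints {Bye=[1], GoodBye=[1], Hadoop=[1, 1], Hello=[1, 1], World=[1, 1]}
        System.out.println(group(keys, ones));
    }
}
```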

Next, the Reducer class:

public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

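Make sure the input types match the mapper's output types. The program is simple enough that there is little else to say: adding up the values under each key gives the number of occurrences of that word.

Finally, the main function: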
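As a small illustration of the reduce step, here is a plain-Java sketch under stated assumptions: `ReduceStepSketch` and `reduceToLine` are hypothetical names, and the tab between key and value reflects the default separator of Hadoop's text output format, which is how lines in part-r-00000 are laid out.

```java
import java.util.Arrays;

// Plain-Java imitation of IntSumReducer for a single key.
class ReduceStepSketch {
    // Sums every value that arrived for the key and formats one output line.
    static String reduceToLine(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        // Key and value separated by a tab, as in the default text output.
        return key + "\t" + sum;
    }

    public static void main(String[] args) {
        System.out.println(reduceToLine("Hello", Arrays.asList(1, 1)));
        System.out.println(reduceToLine("Bye", Arrays.asList(1)));
    }
}
```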

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

conf holds the job's configuration; it is what we use to control how the whole job runs.

new GenericOptionsParser(conf, args).getRemainingArgs();

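// Reads the arguments from the command line. GenericOptionsParser is a helper class provided by Hadoop that interprets the common Hadoop command-line options and sets the corresponding values on the Configuration object as needed; getRemainingArgs returns whatever is left, here the input and output paths.

Create the job and give it a name: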

Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);

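Set the Mapper and Reducer classes.

A call to setCombinerClass also appears here; it sets the Combiner. A Combiner pre-aggregates each map task's output locally before it is sent on, so the Reducer has far less data to process. In this program the Reducer's operation (summing the counts) is exactly what the Combiner needs to do, so the Reducer class doubles as the Combiner. Note that this does not always work: it is only valid when applying the operation to partial results still gives the same final answer, as with a sum or a maximum but not, for example, with a mean.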

 job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
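Why a summing Reducer can double as a Combiner, while some other operations cannot, is easy to check by hand. The sketch below is plain Java with made-up values and hypothetical names (`CombinerSketch`, `sum`, `mean`); it compares reducing everything at once against reducing per map task first and then reducing the partial results.

```java
import java.util.Arrays;
import java.util.List;

// Compares "reduce everything" with "combine per map task, then reduce".
class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Sum: partial sums from two map tasks add up to the same total.
        int direct = sum(Arrays.asList(1, 1, 1, 1));
        int combined = sum(Arrays.asList(sum(Arrays.asList(1, 1, 1)), sum(Arrays.asList(1))));
        System.out.println(direct + " vs " + combined); // prints 4 vs 4

        // Mean: the mean of per-task means differs from the overall mean.
        double directMean = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        double combinedMean = mean(Arrays.asList(
                mean(Arrays.asList(1.0, 2.0, 3.0)), mean(Arrays.asList(10.0))));
        System.out.println(directMean + " vs " + combinedMean); // prints 4.0 vs 6.0
    }
}
```

The framework may run the Combiner zero, one or several times per map task, which is another reason the operation has to give the same answer however the values are batched.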

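Then set the output types. setOutputKeyClass and setOutputValueClass set the Reducer's (that is, the job's) output types, and the Mapper's output types default to the same ones unless they are set separately. The input format can be set as well; it is not set here, so the default, TextInputFormat, is used.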

job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

Set the input and output paths:

 FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

Finally, run the job. The true argument only controls whether progress information is printed while it runs.

System.exit(job.waitForCompletion(true) ? 0 : 1);

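That completes the analysis. As an exercise I then wrote a program of my own, following the example in Hadoop: The Definitive Guide; the input data can be downloaded from the NCDC temperature records. The code is attached here: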

package example;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MaxTemperature {
	static class MaxTemperatureMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
		// Value the data uses to mark a missing temperature reading
		private static final int MISSING = 9999;

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			// Fixed-width record: the year occupies characters 15-18
			String year = line.substring(15, 19);
			int temperature;
			// The signed temperature occupies characters 87-91; skip a leading '+' before parsing
			if(line.charAt(87)=='+'){
				temperature = Integer.parseInt(line.substring(88, 92));
			}else{
				temperature = Integer.parseInt(line.substring(87, 92));
			}
			// Character 92 is the quality code; only emit readings that passed the check
			String quality = line.substring(92,93);
			if (temperature != MISSING && quality.matches("[01459]")){
				context.write(new Text(year),new IntWritable(temperature) );
			}
		}
	}
	
	static class MaxTemperatureReducer extends Reducer<Text,IntWritable,Text,IntWritable>{

		// @Override makes the compiler reject a signature that does not match Reducer.reduce
		@Override
		protected void reduce(Text key,Iterable<IntWritable> values,
				Context context)
				throws IOException, InterruptedException {
			// Keep the largest temperature seen for this year
			int maxValue = Integer.MIN_VALUE;
			for (IntWritable temp : values){
				maxValue = Math.max(maxValue, temp.get());
			}
			context.write(key, new IntWritable(maxValue));
		}		
	}
	
	public static void main(String arg[]) throws Exception{
		Configuration conf = new Configuration();
		String[] otherArg = new GenericOptionsParser(conf,arg).getRemainingArgs();
		if (otherArg.length != 2){
			System.out.println("need Inputfile and Outputfile");
			System.exit(2);
		}
		Job job= new Job(conf,"max temperature");
		job.setJarByClass(MaxTemperature.class);
		job.setMapperClass(MaxTemperatureMapper.class);
		job.setCombinerClass(MaxTemperatureReducer.class);
		job.setReducerClass(MaxTemperatureReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(otherArg[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArg[1]));
		
		System.exit(job.waitForCompletion(true)?0:1);
	}

}
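The fixed-width parsing in the mapper can be checked locally before running a job. What follows is a plain-Java sketch, not part of the original program: `NcdcParseSketch`, `parse` and `record` are hypothetical names, and the records are synthetic lines padded with zeros rather than real NCDC data. Only the character positions and the quality codes are taken from the mapper above.

```java
import java.util.Arrays;

// Plain-Java check of the column positions used by MaxTemperatureMapper.
class NcdcParseSketch {
    static final int MISSING = 9999;

    // Returns {year, temperature}, or null for a missing or low-quality reading.
    static int[] parse(String line) {
        int year = Integer.parseInt(line.substring(15, 19));
        // Skip a leading '+' so only digits (or a '-' sign) reach parseInt.
        int start = line.charAt(87) == '+' ? 88 : 87;
        int temperature = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        if (temperature == MISSING || !quality.matches("[01459]")) {
            return null;
        }
        return new int[] {year, temperature};
    }

    // Builds a synthetic 93-character record; all other columns are filled with '0'.
    static String record(String year, String signedTemp, char quality) {
        char[] cols = new char[93];
        Arrays.fill(cols, '0');
        year.getChars(0, 4, cols, 15);       // characters 15-18
        signedTemp.getChars(0, 5, cols, 87); // characters 87-91, sign included
        cols[92] = quality;                  // character 92
        return new String(cols);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse(record("1950", "+0022", '1')))); // [1950, 22]
        System.out.println(Arrays.toString(parse(record("1950", "-0011", '1')))); // [1950, -11]
        System.out.println(parse(record("1950", "+9999", '1'))); // null: missing reading
        System.out.println(parse(record("1950", "+0022", '2'))); // null: rejected quality code
    }
}
```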

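One thing to note here: when I first let Eclipse generate the reduce function automatically, the Context in it was

org.apache.hadoop.mapreduce.Reducer.Context

and the results came out completely different: the key/value pairs passed on from map did not seem to have been merged at all. I am not sure of the exact reason; it may become clear later on. After changing it to plain Context, i.e. org.apache.hadoop.mapreduce.Mapper.Context, the results were normal.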
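One possible explanation, offered as a guess rather than a confirmed diagnosis, is that the generated method did not actually override Reducer.reduce. In that case the framework would call the inherited default, which passes every value through unchanged, and that matches the symptom of pairs arriving unmerged. The plain-Java sketch below reproduces that kind of mistake with a hypothetical stand-in class (`BaseReducer` is not Hadoop's Reducer, and the mismatch shown is only an illustration of the mechanism).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverrideSketch {
    // Hypothetical stand-in for a framework base class with a pass-through default.
    static class BaseReducer<K, V> {
        void reduce(K key, Iterable<V> values, List<String> out) {
            for (V v : values) {
                out.add(key + "\t" + v);
            }
        }

        // The "framework" always calls reduce through the base-class signature.
        List<String> run(K key, Iterable<V> values) {
            List<String> out = new ArrayList<>();
            reduce(key, values, out);
            return out;
        }
    }

    // Third parameter type differs, so this is an overload and is never called by run().
    static class BrokenMax extends BaseReducer<String, Integer> {
        void reduce(String key, Iterable<Integer> values, ArrayList<String> out) {
            int max = Integer.MIN_VALUE;
            for (int v : values) {
                max = Math.max(max, v);
            }
            out.add(key + "\t" + max);
        }
    }

    // Same body, matching signature; @Override would fail to compile on BrokenMax.
    static class FixedMax extends BaseReducer<String, Integer> {
        @Override
        void reduce(String key, Iterable<Integer> values, List<String> out) {
            int max = Integer.MIN_VALUE;
            for (int v : values) {
                max = Math.max(max, v);
            }
            out.add(key + "\t" + max);
        }
    }

    public static void main(String[] args) {
        List<Integer> temps = Arrays.asList(0, 22, -11);
        System.out.println(new BrokenMax().run("1950", temps).size()); // prints 3: values passed through
        System.out.println(new FixedMax().run("1950", temps).size());  // prints 1: only the maximum
    }
}
```

Whatever the actual cause was, putting @Override on map and reduce is a cheap safeguard, because a signature that does not match the base class then becomes a compile error instead of a silently wrong result.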