MapReduce can process not only text data but also binary data.
A Hadoop sequence file is a binary sequence of key-value pairs. To use sequence files as MapReduce input, set SequenceFileInputFormat as the job's input format. The key and value types are fixed by the sequence file itself, so you only need to make sure the Mapper's input types match them: if the file stores IntWritable-IntWritable pairs, the Mapper's input key and value types must be IntWritable as well.
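The job below reads F:/java1/seq/1.seq; a file like it can be produced with SequenceFile.Writer. A minimal sketch, assuming a local filesystem; the class name and the (year, temperature) sample values are illustrative, not part of the original code:

package hadoop.mr.sequenfileinputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * SeqFileWriter: writes an IntWritable-IntWritable sequence file
 * that can serve as input for the job below (hypothetical helper).
 */
public class SeqFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("F:/java1/seq/1.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(IntWritable.class));
        try {
            // key = year, value = temperature reading (made-up sample data)
            writer.append(new IntWritable(1970), new IntWritable(30));
            writer.append(new IntWritable(1970), new IntWritable(35));
            writer.append(new IntWritable(1971), new IntWritable(28));
        } finally {
            writer.close();
        }
    }
}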
1. MaxTempMapper
package hadoop.mr.sequenfileinputformat;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * MaxTempMapper: forwards each (year, temperature) pair unchanged.
 */
public class MaxTempMapper extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void map(IntWritable key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        // The sequence file already supplies IntWritable keys and values,
        // so the mapper simply passes them through to the shuffle.
        context.write(key, value);
    }
}
2. MaxTempReducer
package hadoop.mr.sequenfileinputformat;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * MaxTempReducer: emits the maximum temperature observed for each key.
 */
public class MaxTempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable iw : values) {
            max = Math.max(max, iw.get());
        }
        context.write(key, new IntWritable(max));
    }
}
3. App
package hadoop.mr.sequenfileinputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * App: driver that wires the job together and runs it locally.
 */
public class App {
    public static void main(String[] args) throws Exception {
        args = new String[]{
                "F:/java1/seq/1.seq", "f:/java1/seq/out"
        };
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Delete the output directory if it already exists,
        // otherwise the job fails on startup.
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }
        Job job = Job.getInstance(conf);
        job.setJobName("MaxTemp");
        job.setJarByClass(App.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        // Add the input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Input format: read binary key-value pairs from a sequence file
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // Declare the map and reduce output key/value types
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);
        // Manually set the number of map tasks (requires importing MRJobConfig)
        // job.getConfiguration().set(MRJobConfig.NUM_MAPS, "3");
        job.waitForCompletion(true);
    }
}
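Because setNumReduceTasks(2) is set, the finished job leaves two output files under f:/java1/seq/out, part-r-00000 and part-r-00001; the default HashPartitioner decides which keys land in which file.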
Source-code trace:
1. SequenceFileInputFormat<K, V> extends FileInputFormat<K, V>.
2. The framework calls FileInputFormat's getSplits method,
3. which returns the split information.
4. It then calls SequenceFileInputFormat's createRecordReader method to create a SequenceFileRecordReader.
5. The Mapper's run method is invoked; its while (context.nextKeyValue()) loop drives one map call per record (see the sketch after this list).
6. Each nextKeyValue call is delegated to SequenceFileRecordReader's nextKeyValue method.
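Step 5 can be seen directly in Hadoop's Mapper class; its run method (simplified from the Hadoop 2.x source) looks like this:

// Simplified from org.apache.hadoop.mapreduce.Mapper (Hadoop 2.x).
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // Each nextKeyValue() call is delegated to the record reader,
        // here a SequenceFileRecordReader (step 6).
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}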