The Mapper Class Explained

happylzs2008

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

      public class Context
        extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
        public Context(Configuration conf, TaskAttemptID taskid,
                       RecordReader<KEYIN,VALUEIN> reader,
                       RecordWriter<KEYOUT,VALUEOUT> writer,
                       OutputCommitter committer,
                       StatusReporter reporter,
                       InputSplit split) throws IOException, InterruptedException {
          super(conf, taskid, reader, writer, committer, reporter, split);
        }
      }

      /**
       * Called once at the beginning of the task.
       */
      protected void setup(Context context
                           ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Called once for each key/value pair in the input split. Most applications
       * should override this, but the default is the identity function.
       */
      @SuppressWarnings("unchecked")
      protected void map(KEYIN key, VALUEIN value,
                         Context context) throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
      }

      /**
       * Called once at the end of the task.
       */
      protected void cleanup(Context context
                             ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Expert users can override this method for more complete control over the
       * execution of the Mapper.
       * @param context
       * @throws IOException
       */
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
      }
    }

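Mapper has four methods: setup(), map(), cleanup(), and run(). setup() typically does the preparation needed before any map() call. map() carries the main processing work. cleanup() handles wrap-up work, such as closing files or emitting key/value pairs after all map() calls have finished. run() provides the execution template setup() -> map() -> cleanup(). As the run() method above shows, input key/value pairs are obtained from the Context that is passed in, and the map() method above shows that output key/value pairs are also written through the Context. We set Context itself aside for now.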
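The lifecycle described above can be sketched without any Hadoop dependency. The block below is only an illustration under stated assumptions: MiniMapper, BufferingCounter, and LifecycleDemo are hypothetical names, not Hadoop classes, and a plain List of "key=value" strings stands in for the Context. It shows setup() preparing state, map() buffering instead of emitting, and cleanup() emitting the buffered key/value pairs at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free stand-in for Mapper. It has the same
// setup/map/cleanup/run shape, but reads records from a List and
// writes "key=value" strings to an output list instead of a Context.
abstract class MiniMapper {
    protected void setup(List<String> out) { /* nothing by default */ }

    protected abstract void map(String record, List<String> out);

    protected void cleanup(List<String> out) { /* nothing by default */ }

    // Same template as Mapper.run(): setup once, map per record, cleanup once.
    public final List<String> run(List<String> records) {
        List<String> out = new ArrayList<>();
        setup(out);
        for (String record : records) {
            map(record, out);
        }
        cleanup(out);
        return out;
    }
}

// Buffers counts in memory during map() and emits them only in cleanup().
class BufferingCounter extends MiniMapper {
    private Map<String, Integer> counts;

    @Override
    protected void setup(List<String> out) {
        counts = new LinkedHashMap<>();          // preparation before the first map() call
    }

    @Override
    protected void map(String record, List<String> out) {
        counts.merge(record, 1, Integer::sum);   // accumulate, no output yet
    }

    @Override
    protected void cleanup(List<String> out) {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());   // deferred emission
        }
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        // prints [a=2, b=1]
        System.out.println(new BufferingCounter().run(Arrays.asList("a", "b", "a")));
    }
}
```

In a real Mapper subclass the same three overrides would take a Context argument and call context.write() instead of appending to a list.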

Let's first look at three Mapper subclasses, located in src\mapred\org\apache\hadoop\mapreduce\lib\map:

1. TokenCounterMapper

    public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context
                      ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

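As the code shows, for each input key/value pair it uses StringTokenizer to split the value into tokens and emits a <token, one> pair for every token. These pairs are collected on the reduce side: all pairs with the same token as key are sent to the same reducer, so the reduce function only has to sum the values emitted by all mappers for a given token to obtain that token's count. This is why the class is called TokenCounterMapper.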
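The flow from emitted <token, one> pairs to final counts can be simulated in plain Java. This is a minimal sketch, not Hadoop code: TokenCountSketch and its methods are hypothetical names, and a single in-memory TreeMap stands in for the shuffle and the reducers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class TokenCountSketch {
    // Map phase: one <token, 1> pair per token, as TokenCounterMapper.map() does.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Stand-in for shuffle + reduce: group the pairs by token and sum the ones.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map phase over every line, then the simulated reduce.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return shuffleAndReduce(all);
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count(Arrays.asList("to be or", "not to be")));
    }
}
```

The mapper itself never counts anything. It only emits ones, and the total appears only after pairs with the same key have been brought together.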

In WordCount, the "Hello world" of MapReduce, we can use TokenCounterMapper directly as the mapper class. All it takes is a call to job.setMapperClass(TokenCounterMapper.class).

2. InverseMapper

    public class InverseMapper<K, V> extends Mapper<K,V,V,K> {

      /** The inverse function.  Input keys and values are swapped. */
      @Override
      public void map(K key, V value, Context context
                      ) throws IOException, InterruptedException {
        context.write(value, key);
      }
    }
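The swap performed by InverseMapper can be illustrated on an in-memory list. This is a sketch only: InverseSketch and invert() are hypothetical names, and the <word, count> input is an assumed example rather than something taken from the Hadoop source.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class InverseSketch {
    // Mirrors InverseMapper.map(): emit <value, key> for each <key, value>.
    static <K, V> List<Map.Entry<V, K>> invert(List<Map.Entry<K, V>> pairs) {
        List<Map.Entry<V, K>> out = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            out.add(new SimpleEntry<V, K>(p.getValue(), p.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> wordCounts = new ArrayList<>();
        wordCounts.add(new SimpleEntry<String, Integer>("hadoop", 3));
        wordCounts.add(new SimpleEntry<String, Integer>("mapper", 1));
        // prints [3=hadoop, 1=mapper]
        System.out.println(invert(wordCounts));
    }
}
```

Swapping this way makes the old value the new key, which matters because MapReduce groups and sorts records by key.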