Part 1:
When writing a map function, you need to extend the following parent class and override some of its methods.
# The Context class provides various mechanisms for communicating with the Hadoop framework.
class Mapper<K1, V1, K2, V2>
{
void map(K1 key, V1 value, Mapper.Context context)
throws IOException, InterruptedException
{..}
}
# This method is called before any key/value pairs are passed to the map method. Default implementation: does nothing.
protected void setup( Mapper.Context context)
throws IOException, InterruptedException
# This method is called after all key/value pairs have been passed to the map method. Default implementation: does nothing.
protected void cleanup( Mapper.Context context)
throws IOException, InterruptedException
# This method controls the overall flow of task processing within the JVM. The default implementation calls setup once, then calls map repeatedly for each key/value pair in the split, and finally calls cleanup.
protected void run( Mapper.Context context)
throws IOException, InterruptedException
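To make these lifecycle methods concrete, here is a minimal sketch (not from the original text) of a mapper that overrides setup and cleanup; the configuration key example.skip.empty and the counter group "Example" are hypothetical names used only for illustration:
public static class LifecycleMapper extends Mapper<Object, Text, Text, IntWritable>
{
    // Hypothetical flag read from the job configuration in setup
    private boolean skipEmpty;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        // Runs once before any key/value pairs are passed to map
        skipEmpty = context.getConfiguration().getBoolean("example.skip.empty", true);
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        if (skipEmpty && value.toString().trim().isEmpty())
        {
            return;  // ignore blank lines when the flag is set
        }
        context.write(value, new IntWritable(1));  // emit the whole line with a count of 1
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException
    {
        // Runs once after all key/value pairs have been passed to map
        context.getCounter("Example", "MapperTasks").increment(1);
    }
}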
Part 2:
The Reducer is similar to the Mapper.
First extend the Reducer parent class and override the reduce method.
# The reduce method receives K2/V2 as input and produces K3/V3 as output.
public class Reducer<K2, V2, K3, V3>
{
void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
throws IOException, InterruptedException
{..}
}
This class also has setup, run, and cleanup methods with default implementations similar to those in the Mapper class, which can optionally be overridden:
# The following method is called before any key/value pairs are passed to the reduce method. Default implementation: does nothing.
protected void setup( Reducer.Context context)
throws IOException, InterruptedException
# The following method is called after all key/value pairs have been passed to the reduce method. Default implementation: does nothing.
protected void cleanup( Reducer.Context context)
throws IOException, InterruptedException
# The following method controls the overall flow of task processing within the JVM. The default implementation calls setup once, then calls reduce repeatedly for however many keys and associated value lists are provided to the Reducer, and finally calls cleanup.
protected void run( Reducer.Context context)
throws IOException, InterruptedException
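A corresponding minimal sketch for the reducer side, again with hypothetical names (the configuration key example.min.count and the counter group are assumptions, not part of the Hadoop API):
public static class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    // Hypothetical threshold read from the job configuration in setup
    private int minCount;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        // Runs once before any keys are passed to reduce
        minCount = context.getConfiguration().getInt("example.min.count", 1);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int total = 0;
        for (IntWritable val : values)
        {
            total += val.get();  // sum the values for this key
        }
        if (total >= minCount)
        {
            context.write(key, new IntWritable(total));  // emit <key, total> as K3/V3
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException
    {
        // Runs once after all keys have been passed to reduce
        context.getCounter("Example", "ReducerTasks").increment(1);
    }
}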
Part 3:
The driver (or configuration) logic usually lives in the main method of the class written to encapsulate a MapReduce job.
public class ExampleDriver
{
...
public static void main(String[] args) throws Exception
{
// Create a Configuration object that is used to set other options
Configuration conf = new Configuration() ;
// Create the object representing the job
Job job = new Job(conf, "ExampleJob") ;
// Set the name of the main class in the job jarfile
job.setJarByClass(ExampleDriver.class) ;
// Set the mapper class
job.setMapperClass(ExampleMapper.class) ;
// Set the reducer class
job.setReducerClass(ExampleReducer.class) ;
// Set the types for the final output key and value
job.setOutputKeyClass(Text.class) ;
job.setOutputValueClass(IntWritable.class) ;
// Set input and output file paths
FileInputFormat.addInputPath(job, new Path(args[0])) ;
FileOutputFormat.setOutputPath(job, new Path(args[1])) ;
// Execute the job and wait for it to complete
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
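Two further settings often appear in drivers like this, though they are not part of the listing above: a combiner (a local reduce run on each mapper's output) and, when the map output types differ from the final output types, explicit map output key/value classes. A minimal sketch, assuming the job's reducer is safe to reuse as a combiner:
// Optional extras inside main, after the other job.set* calls (sketch)
job.setCombinerClass(ExampleReducer.class);      // run a local reduce on each mapper's output
job.setMapOutputKeyClass(Text.class);            // only needed if the map output types
job.setMapOutputValueClass(IntWritable.class);   // differ from the final output types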
Part 4:
Before compiling any Hadoop-related code, the classpath needs to point at the Hadoop core JAR (hadoop-core-1.0.4.jar):
$ export CLASSPATH=.:${HADOOP_HOME}/hadoop-core-1.0.4.jar:${CLASSPATH}
WordCount1.java file
import java.io.* ;
import org.apache.hadoop.conf.Configuration ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount1
{
public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(" ") ;
for (String str: words)
{
word.set(str);
context.write(word, one);
}
}
}
public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int total = 0;
for (IntWritable val : values) {
total++ ;
}
context.write(key, new IntWritable(total));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount1.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
# Notes:
# The job specifies TextInputFormat as the format of the input data. By default this passes the mapper one record per line, where the key is the byte offset of the line within the file and the value is the text of the line.
# The mapper is executed once for each line of text in the input source, and each time it takes the line and breaks it into words. It then uses the Context object to output (commonly called emitting) each new key/value pair of the form <word, 1>. These are our K2/V2 values.
# We said earlier that the input to the reducer is a key and a corresponding list of values, and that some magic happens between the map and reduce methods to gather together the values for each key (the shuffle). Hadoop executes the reducer once for each key; the preceding reducer implementation simply counts the entries in the Iterable object and emits output for each word in the form <word, count>. These are our K3/V3 values.
# The input and output locations are specified via the arguments passed to the class.
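Although TextInputFormat is the default and the job above never sets it explicitly, the driver could name both the input and output formats directly; a small sketch of the two extra calls:
// Optional: state the default formats explicitly in the driver (sketch)
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);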
# Compile the Java source.
$ javac WordCount1.java
#Create a JAR file from the generated class files.
$ jar cvf wc1.jar WordCount1*class
#Submit the new JAR file to Hadoop for execution.
$ hadoop jar wc1.jar WordCount1 test.txt output
#Check the output file; it should be as follows:
$ hadoop fs -cat output/part-r-00000