Hadoop's WordCount is the equivalent of the "Hello World" program in learning Java.
The processing in a Hadoop computation can be abstracted into two steps: map and reduce. Map decomposes the task into many independent pieces, and reduce gathers the partial results into the final answer.
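For WordCount, an illustrative trace of that data flow (input text is made up here for illustration): map emits a <word, 1> pair per token, and reduce sums the counts per word.

input:  "Hello Hadoop"        "Hello World"
map:    <Hello,1> <Hadoop,1>  <Hello,1> <World,1>
reduce: <Hadoop,1> <Hello,2> <World,1>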
Hadoop provides the following data types. They all implement the WritableComparable interface, so that data declared with these types can be serialized for network transfer and file storage, and compared for ordering (see the sketch after this list).
BooleanWritable: standard boolean value
ByteWritable: single-byte value
DoubleWritable: double-precision floating-point number
FloatWritable: single-precision floating-point number
IntWritable: integer value
LongWritable: long integer value
Text: text stored in UTF-8 format
NullWritable: used when the key or value in a <key, value> pair is empty
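A minimal sketch of what "serializable and comparable" means in practice, assuming hadoop-common is on the classpath (the class name WritableDemo is mine, for illustration only):

import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        IntWritable out = new IntWritable(42);

        // Serialize to a compact byte stream, as Hadoop does for
        // network transfer and file storage.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buf));

        // Deserialize into a fresh instance.
        IntWritable in = new IntWritable();
        in.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(in.get());                         // 42
        System.out.println(in.compareTo(new IntWritable(7))); // positive: WritableComparable ordering
    }
}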
Analysis of the main function:
First, the job is initialized:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
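Assuming the class is packaged into a jar (the name wordcount.jar is illustrative), the job can be launched with the hadoop jar command; args[0] and args[1] become the input and output paths:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

Note that the output directory must not already exist, or the job will fail at startup.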
Taking it line by line:

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

The JobConf class initializes the job, and conf.setJobName gives the job a name.
Next, set the data types for the key and value in the job's output:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

Because each output record is a word and its count, the types are Text and IntWritable respectively.
Set the classes that handle the job's Map (splitting), Combiner (merging intermediate results), and Reduce (final merging) phases; the combiner's role is illustrated below:

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
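The Reduce class can double as the combiner here because summing is associative and commutative: the combiner pre-aggregates each map task's output locally, cutting the data shuffled to the reducers. An illustrative trace on one map task:

map output:      <Hello,1> <Hello,1> <World,1>
combiner output: <Hello,2> <World,1>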
Set the input and output formats, and the input and output paths:

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

Analysis of the Map function
<span style="font-size:14px;"> </span><span style="font-family:Microsoft YaHei;font-size:14px;"> public static class Map extends MapReduceBase implements /*继承</span><span style="font-family:Microsoft YaHei;font-size:14px;">MapReduceBase类</span><span style="font-family:Microsoft YaHei;font-size:14px;">* Mapper<LongWritable, Text, Text, IntWritable> { /*实现mapper接口* private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } } </span>
The map method splits each input line into tokens on whitespace, then uses the OutputCollector to emit a <word, one> pair for each token.
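A standalone sketch of the splitting behavior the map method relies on (the class name TokenizeDemo is mine): by default, StringTokenizer treats spaces, tabs, and newlines as delimiters.

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Default delimiters are whitespace characters.
        StringTokenizer tokenizer = new StringTokenizer("Hello Hadoop\tHello");
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken()); // prints Hello, Hadoop, Hello
        }
    }
}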
Analysis of the Reduce function
The reduce function takes the input key as the output key, sums the received values, and emits the total as the output value.

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Extends the MapReduceBase class and implements the Reducer interface.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
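By the time reduce runs, the framework has already grouped all values for a given key together. An illustrative call for the key "Hello" with grouped values [1, 1]:

reduce("Hello", [1, 1])  →  emits <Hello, 2>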