Sample data:
2017/07/28 sina.com/lady/
2017/07/28 sina.com/play
2017/07/28 sina.com/movie
2017/07/28 sina.com/music
2017/07/28 sina.com/sport
2017/07/28 sina.com/sport
2017/07/28 163.com/sport ... (and so on)
# The date and the URL are separated by a single space
-
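Top-N implementation by visit count
MapReduce implementation: the map phase first splits each input line on the space and writes a <page, 1> pair to the context.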
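package mapreduce.page_topn;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageTopMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text values, Context context)
            throws IOException, InterruptedException {
        // Each line looks like "2017/07/28 sina.com/sport"
        String line = values.toString();
        String[] fields = line.split(" ");
        // Emit <page, 1>
        context.write(new Text(fields[1]), new IntWritable(1));
    }
}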
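To see what the mapper emits for the sample data, the split step can be tried in plain Java without a Hadoop dependency. This is only a minimal sketch: `MapStepSketch` and `mapLine` are hypothetical names, and the guard for malformed lines is an addition that the original mapper does not have.

```java
import java.util.AbstractMap;
import java.util.Map;

public class MapStepSketch {
    // Mirrors the mapper's logic: split on the space and produce <page, 1>.
    // Assumption (not in the original mapper): lines without a second
    // field are skipped by returning null instead of throwing.
    static Map.Entry<String, Integer> mapLine(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length < 2) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(fields[1], 1);
    }

    public static void main(String[] args) {
        String[] sample = {
            "2017/07/28 sina.com/movie",
            "2017/07/28 sina.com/sport",
            "2017/07/28 sina.com/sport"
        };
        for (String line : sample) {
            Map.Entry<String, Integer> pair = mapLine(line);
            if (pair != null) {
                // One <page, 1> pair per input line
                System.out.println(pair.getKey() + "\t" + pair.getValue());
            }
        }
    }
}
```

The framework then groups these pairs by key, so the reducer sees each page once together with all of its 1s.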
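The reducer receives the grouped <key, value> pairs from the map phase and sums the values for each page. A PageCount class wraps the page name and its count. Each reduce task puts the resulting <page, total count> PageCount objects into a TreeMap, and the ordering is defined by the comparison method implemented in PageCount. Once all reduce calls have finished, the framework calls cleanup; by overriding cleanup and iterating over the first n entries of the TreeMap, we get the n most-visited pages.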
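The TreeMap idea can be sketched in plain Java, independent of Hadoop. The `PageCount` below is a hypothetical stand-in, since the article's own class is not shown here: its fields, the descending order, and the tie-break on page name are all assumptions. The tie-break matters because a TreeMap treats keys that compare as equal as the same key, so two pages with the same count would otherwise overwrite each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TopNSketch {
    // Hypothetical stand-in for the article's PageCount class.
    static class PageCount implements Comparable<PageCount> {
        final String page;
        final int count;

        PageCount(String page, int count) {
            this.page = page;
            this.count = count;
        }

        @Override
        public int compareTo(PageCount other) {
            // Higher counts sort first
            int byCount = Integer.compare(other.count, this.count);
            // Assumption: break ties on page name so equal counts stay distinct keys
            return byCount != 0 ? byCount : this.page.compareTo(other.page);
        }
    }

    // What an overridden cleanup() would do: walk the sorted keys, stop after n.
    static List<String> topN(TreeMap<PageCount, Object> treeMap, int n) {
        List<String> result = new ArrayList<>();
        for (PageCount pc : treeMap.keySet()) {
            if (result.size() >= n) {
                break;
            }
            result.add(pc.page + " " + pc.count);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<PageCount, Object> treeMap = new TreeMap<>();
        // Values are unused, mirroring TreeMap<PageCount, Object> in the reducer
        treeMap.put(new PageCount("sina.com/sport", 2), null);
        treeMap.put(new PageCount("sina.com/movie", 1), null);
        treeMap.put(new PageCount("sina.com/music", 1), null);
        System.out.println(topN(treeMap, 2));
    }
}
```

Note that this yields a global top n only when the job runs with a single reduce task; with several reducers, each one would output its own top n.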
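package mapreduce.page_topn;

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageTopReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Holds one PageCount per page, kept sorted by PageCount's ordering
    TreeMap<PageCount, Object> treeMap = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mapper for this page
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        // Wrap the page name and count in a PageCount; PageCount overrides the comparison method used for sorting
        PageCount pageC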