Hadoop Example: The Top-K Problem

License: CC 4.0 BY-SA


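This article shows how to use Hadoop to solve the top-K problem over a massive dataset. A TreeMap serves as a min-heap or max-heap for filtering records, each mapper keeps only the top K records of its input split, and a single reducer merges those results into the global top K.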


A Hadoop solution for finding the top K records in a massive dataset:

Each map task runs as its own process. A job produces as many intermediate files as it has map tasks, and as many final output files as it has reduce tasks.

The top K we want is the global top K. So no matter how many map tasks run, the job must end with exactly one reduce task that merges their output and emits the top K.

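Approach and code:
1. Mappers
Use the default number of mappers: each input split is processed by one mapper.
In each map task we find the top k records of its input split. We keep them in a TreeMap, a sorted structure that makes the running top k easy to update.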
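The heap-like behaviour of a TreeMap comes down to which key firstKey() returns. The following is a minimal sketch of that property in plain Java; the class name TreeMapOrderDemo and the helper build are made up for illustration and are not part of the article's job.

```java
import java.util.Comparator;
import java.util.TreeMap;

public class TreeMapOrderDemo {
    // Hypothetical helper: builds a TreeMap keyed by the given values,
    // in ascending (default) or descending order.
    public static TreeMap<Integer, Integer> build(boolean descending, int... values) {
        TreeMap<Integer, Integer> tree;
        if (descending) {
            tree = new TreeMap<>(Comparator.<Integer>reverseOrder());
        } else {
            tree = new TreeMap<>();
        }
        for (int v : values) {
            tree.put(v, v);
        }
        return tree;
    }

    public static void main(String[] args) {
        // Ascending order: firstKey() is the smallest key, like the root of a min-heap.
        System.out.println(build(false, 5, 1, 9).firstKey()); // prints 1
        // Descending order: firstKey() is the largest key, like the root of a max-heap.
        System.out.println(build(true, 5, 1, 9).firstKey());  // prints 9
    }
}
```

Removing firstKey() whenever the map grows past k entries therefore keeps the k largest keys in the ascending case and the k smallest keys in the descending case.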

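Next, we add each new record to the TreeMap, which acts here as a min-heap or a max-heap. In map, every record tries to update the TreeMap, so once the split is exhausted the TreeMap holds the k values of the local top k. Note one difference from a typical mapper, which calls context.write (or output.collect in the old API) once per record. Here nothing is written until every record of the input split has been processed, so the context.write calls belong in cleanup.

cleanup is a method that runs once, after the mapper task has processed all of its records.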
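The map-then-cleanup pattern can be tried without a cluster. The sketch below is a plain-Java stand-in for a mapper, not Hadoop code: the class LocalTopKSketch and its methods map and cleanup are illustrative names that only mimic the lifecycle described above. It assumes the K largest values are wanted, so it uses an ascending TreeMap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java stand-in for a mapper; no Hadoop classes are involved.
public class LocalTopKSketch {
    private final int k;
    // Ascending TreeMap: firstKey() is the smallest held value,
    // so evicting it keeps the k largest values seen so far.
    private final TreeMap<Integer, Integer> tree = new TreeMap<>();

    public LocalTopKSketch(int k) {
        this.k = k;
    }

    // Plays the role of map(): updates the TreeMap and emits nothing.
    public void map(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // skip blank lines
        }
        int value = Integer.parseInt(trimmed);
        tree.put(value, value);
        if (tree.size() > k) {
            tree.remove(tree.firstKey());
        }
    }

    // Plays the role of cleanup(): emits once, after all records were seen.
    public List<Integer> cleanup() {
        return new ArrayList<>(tree.values());
    }

    public static void main(String[] args) {
        LocalTopKSketch mapper = new LocalTopKSketch(3);
        for (String line : new String[] {"7", "42", "", "3", "19", "8"}) {
            mapper.map(line);
        }
        System.out.println(mapper.cleanup()); // prints [8, 19, 42]
    }
}
```

However many records a split contains, this stand-in emits at most k values, which is what keeps the data sent to the reducer small.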

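2. Reducers
As explained above, there is only one reducer. It aggregates the mappers' output once more and selects the top k among those candidates, which gives the global result.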
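Why is a second selection pass enough? Any value in the global top k is also in the top k of its own split, so the union of the local results contains every global winner. The sketch below illustrates this in plain Java under that reasoning; GlobalTopKSketch, topK, and topKViaSplits are hypothetical names, the lists stand in for input splits, and the K largest distinct values are assumed to be the target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class GlobalTopKSketch {
    // Keeps the k largest distinct values of one list, in ascending order.
    public static List<Integer> topK(List<Integer> values, int k) {
        TreeMap<Integer, Integer> tree = new TreeMap<>();
        for (int v : values) {
            tree.put(v, v);
            if (tree.size() > k) {
                tree.remove(tree.firstKey());
            }
        }
        return new ArrayList<>(tree.values());
    }

    // "Mapper" step per split, then one "reducer" step over the combined candidates.
    public static List<Integer> topKViaSplits(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return topK(candidates, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(4, 90, 17, 33),
                Arrays.asList(51, 2, 88),
                Arrays.asList(60, 61, 5, 99));
        System.out.println(topKViaSplits(splits, 2)); // prints [90, 99]
    }
}
```

With more than one reducer, each would see only part of the candidates and produce its own output file, so no single file would be guaranteed to hold the global top k.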

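import java.io.IOException;
import java.util.Comparator;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// To find the K largest numbers, use a min-heap.
// To find the K smallest numbers, use a max-heap.
public class TopK {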

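	static class mapJob extends Mapper<LongWritable, Text, IntWritable, IntWritable>
	{
		// Default TreeMap, ascending order: acts as a min-heap whose first key is the
		// smallest value. Use it to find the K largest numbers.
		//private TreeMap<Integer, Integer> tree = new TreeMap<>();
		
		// TreeMap with a custom comparator, descending order: acts as a max-heap whose
		// first key is the largest value. It finds the K smallest numbers.
		// Equal numbers share one key, so duplicates are stored only once.
		private TreeMap<Integer, Integer> tree = new TreeMap<>(new Comparator<Integer>() {
			public int compare(Integer a, Integer b)
			{
				// Integer.compare avoids the int overflow that b - a can produce
				return Integer.compare(b, a);
			}
		});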
		
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException 
		{
			Configuration conf = context.getConfiguration();
			int k = Integer.parseInt(conf.get("k"));
			
			// Trim so that stray whitespace does not break number parsing
			String line = value.toString().trim();
			if (line.length() > 0)
			{
				int i = Integer.parseInt(line);
				
				tree.put(i, i);
				
				// Once more than k entries are held, evict the head of the map
				if (tree.size() > k) {
					tree.remove(tree.firstKey());
				}
			}
		}
		
		@Override
		protected void cleanup(Context context)
			throws IOException, InterruptedException 
		{
			// Emit the local top k only once, after the whole split has been read
			for (Integer i : tree.values())
			{
				context.write(new IntWritable(i), new IntWritable(i));
			}
		}
		
	}
	
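	static class reduceJob extends Reducer<IntWritable, IntWritable, IntWritable, NullWritable>
	{
		// Default TreeMap, ascending order
		//private TreeMap<Integer, Integer> tree = new TreeMap<>();
		
		// TreeMap with a custom comparator, descending order.
		// The ordering must match the one used in the mapper.
		private TreeMap<Integer, Integer> tree = new TreeMap<>(new Comparator<Integer>() {
			public int compare(Integer a, Integer b)
			{
				// Integer.compare avoids the int overflow that b - a can produce
				return Integer.compare(b, a);
			}
		});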
		
		@Override
		protected void reduce(IntWritable key, Iterable<IntWritable> values,
				Context context)
						throws IOException, InterruptedException {
			
			Configuration conf = context.getConfiguration();
			int k = Integer.parseInt(conf.get("k"));
			
			// Merge every candidate from the mappers into the same bounded TreeMap
			for (IntWritable i : values)
			{
				tree.put(i.get(), i.get());
				if (tree.size() > k) {
					tree.remove(tree.firstKey());
				}
			}
		}
		
		@Override
		protected void cleanup(Context context)
				throws IOException, InterruptedException {
			for(Integer i : tree.values())
			{
				context.write(new IntWritable(i), NullWritable.get());
			}
		}
	}
	
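	public static void main(String[] args) {
		Configuration conf = new Configuration();
		
		try {
			
			String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
			
			// The third argument is k. The first two are not read here,
			// because the input and output paths below are hard-coded.
			if (otherArgs.length != 3) {
				System.err.println("length != 3");
				System.exit(2);
			}
			conf.set("k", otherArgs[2]);
			
			// Job.getInstance replaces the deprecated Job constructor
			Job job = Job.getInstance(conf, "top K");
			job.setJarByClass(TopK.class);
			
			job.setMapperClass(mapJob.class);
			job.setReducerClass(reduceJob.class);
			
			job.setMapOutputKeyClass(IntWritable.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(IntWritable.class);
			job.setOutputValueClass(NullWritable.class);
			
			// A single reducer sees every local top k, so its result is global
			job.setNumReduceTasks(1);
			
			FileInputFormat.addInputPath(job, new Path("/usr/local/hadooptempdata/input/topk/"));
			FileOutputFormat.setOutputPath(job, new Path("/usr/local/hadooptempdata/output/topk/"));
		
			System.exit(job.waitForCompletion(true) ? 0 : 1);
			
		} catch (Exception e) {
			// Report the failure instead of silently swallowing it
			e.printStackTrace();
			System.exit(1);
		}
	}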
}
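
One property of the listing above is worth keeping in mind: the TreeMap uses each number as its own key, so equal numbers overwrite one another and the result contains distinct values only. If duplicates should count, a java.util.PriorityQueue is one possible alternative. The following is a minimal plain-Java sketch of that idea, assuming the K largest values are wanted; TopKWithDuplicates and topK are illustrative names, and this is not the article's method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKWithDuplicates {
    // Returns the k largest values, duplicates included, in ascending order.
    public static List<Integer> topK(int[] values, int k) {
        // PriorityQueue is a min-heap by default: poll() removes the smallest element.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.offer(v);
            if (heap.size() > k) {
                heap.poll(); // drop the current smallest
            }
        }
        // The heap's iteration order is unspecified, so sort before returning.
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topK(new int[] {9, 9, 4, 7, 9, 1}, 3)); // prints [9, 9, 9]
    }
}
```

For the same input and k = 3, the TreeMap approach with ascending order would keep 4, 7 and 9, because the three 9s collapse into a single key.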