Enhanced MapReduce for Top N items

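This article shows how to optimize a MapReduce program by improving the mapper, reducing network traffic and speeding up the computation of the top-N items of a large dataset. By storing words and their occurrence counts in a HashMap, the mapper avoids sending the same word to the reducers over and over, which significantly improves the program's efficiency.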

In the last post we saw how to write a MapReduce program for finding the top-n items of a dataset.

The mapper in that code emits a key-value pair for every word found, passing the word as the key and 1 as the value. Since the book has roughly 38,000 words, the amount of information transmitted from mappers to reducers is proportional to that number. One way to improve the network performance of this program is to rewrite the mapper as follows:

    public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {

        // Per-mapper word counts, filled in map() and emitted in cleanup()
        private Map<String, Integer> countMap = new HashMap<>();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // Lowercase the line and replace punctuation with spaces
            String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
            StringTokenizer itr = new StringTokenizer(cleanLine);
            while (itr.hasMoreTokens()) {

                String word = itr.nextToken().trim();
                // Count locally instead of emitting (word, 1) right away
                if (countMap.containsKey(word)) {
                    countMap.put(word, countMap.get(word) + 1);
                }
                else {
                    countMap.put(word, 1);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {

            // Emit each distinct word once, with its local count
            for (String key : countMap.keySet()) {
                context.write(new Text(key), new IntWritable(countMap.get(key)));
            }
        }
    }

As we can see, we define a HashMap that uses words as keys and their number of occurrences as values. Inside the loop, instead of emitting every word to the reducer, we put it into the map: if the word is already there, we increase its count, otherwise we set it to one. We also override the cleanup method, which Hadoop calls once when the mapper has finished processing its input; in this method we can now emit the words to the reducers. This way we save a lot of network transmissions, because each mapper sends every distinct word to the reducers only once, together with its local count.
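To get a feeling for the saving, here is a minimal sketch that runs without Hadoop and compares the two strategies on a tiny input. The class `InMapperCombiningDemo` and its methods `emitPerToken` and `emitAggregated` are hypothetical names made up for this illustration; they only mimic what the two mappers would hand over to the framework and are not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical, Hadoop-free illustration: names are assumptions, not the original code.
class InMapperCombiningDemo {

    // Previous strategy: one emitted record per token (each one stands for a (word, 1) pair).
    static List<String> emitPerToken(List<String> lines) {
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                emitted.add(itr.nextToken());
            }
        }
        return emitted;
    }

    // Enhanced strategy: count locally, then emit one (word, count) pair per distinct word.
    static Map<String, Integer> emitAggregated(List<String> lines) {
        Map<String, Integer> countMap = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                // Same effect as the containsKey/put logic in the mapper above
                countMap.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return countMap;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the dog", "the dog and the bird");
        System.out.println("records emitted per token:  " + emitPerToken(lines).size());
        System.out.println("records emitted aggregated: " + emitAggregated(lines).size());
    }
}
```

The more often words repeat within a mapper's input, the bigger the gap between the two numbers. The price is memory: the map has to hold every distinct word the mapper sees until cleanup runs.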

The complete code of this class is available on my GitHub.
In the next post we'll see how to use combiners to build on this approach.

from: http://andreaiacono.blogspot.com/2014/03/enhanced-mapreduce-for-top-n-items.html