Optimizing Map/Reduce with MongoDB

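This article looks at common performance problems when running Map/Reduce on MongoDB 1.8 and earlier, and shows how setting an appropriate sort parameter can make it more efficient. In particular, it covers choosing an input sort key that matches the emit key, and why that key needs to be indexed.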


I’ve come across several users who experience poor performance when using Map/Reduce with MongoDB version 1.8 and older, and it turns out that in many cases it is easily fixable. Today I will focus on the “sort” parameter of the MapReduce command, which is often overlooked but critical.

Here is how M/R works in the general case, assuming there is no query filter:

1. mongod does a full table scan in natural order, going through all documents of the collection.
2. For each document, map() is called, which emits a document like {_id: key, value: val} that gets stored in an in-memory map (tree).
3. Every 100 records, mongod checks whether the size of the map is over 50KB; if so, it runs reduce on ALL current keys. If the size of the map is still over 100KB after that, it dumps all current documents to disk in an “incremental” (inc) collection.
4. When all mapping is done, it reads back from the inc collection sorted by _id, and does the final reduce.

Now if you have many documents and the key distribution is fairly random, the result can be the following: all docs get inserted into the map, but the intermediate reduce achieves little because few of them share a key, so most documents end up in the “inc” collection on disk and have to be read back in sorted order. The particular issue to understand is that since mongod has no idea which key you will emit on, it cannot presort the data to make this efficient.
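To make the effect concrete, here is a small toy model of the flow described above, written in plain JavaScript. It is only a sketch and not mongod's actual code: sizes are counted in emitted values rather than bytes, the thresholds (50 and 100) are illustrative stand-ins for the 50KB/100KB limits, and the names simulateMapReduce, checkEvery, softLimit and hardLimit are made up for this example.

```javascript
// Toy model of the pre-1.9 M/R flow. Not mongod's real implementation:
// sizes are counted in values, not bytes, and all names are hypothetical.
function simulateMapReduce(docs, opts) {
  const map = new Map();  // in-memory map: key -> array of emitted values
  let size = 0;           // number of values currently held in the map
  const inc = [];         // stands in for the on-disk "inc" collection

  // Run reduce on ALL current keys (here: sum the values per key).
  function reduceAll() {
    map.forEach(function (values, key) {
      const sum = values.reduce(function (a, b) { return a + b; }, 0);
      map.set(key, [sum]);
    });
    size = map.size;
  }

  // Dump everything currently in the map to the "inc" collection.
  function spill() {
    map.forEach(function (values, key) {
      values.forEach(function (v) { inc.push({ _id: key, value: v }); });
    });
    map.clear();
    size = 0;
  }

  docs.forEach(function (doc, i) {
    // map(): emit(doc.key, 1), i.e. count documents per key
    if (!map.has(doc.key)) map.set(doc.key, []);
    map.get(doc.key).push(1);
    size++;
    // Periodic check: reduce if over the soft limit, spill if still too big.
    if ((i + 1) % opts.checkEvery === 0 && size > opts.softLimit) {
      reduceAll();
      if (size > opts.hardLimit) spill();
    }
  });
  spill();  // flush whatever is left once mapping is done

  // Final pass: read "inc" back sorted by _id and do the final reduce.
  inc.sort(function (a, b) { return a._id - b._id; });
  const result = {};
  inc.forEach(function (row) {
    result[row._id] = (result[row._id] || 0) + row.value;
  });
  return { result: result, spilled: inc.length };
}

const opts = { checkEvery: 100, softLimit: 50, hardLimit: 100 };

// 10,000 documents over 1,000 keys, 10 documents per key.
const sortedDocs = [];
const scatteredDocs = [];
for (let i = 0; i < 10000; i++) {
  sortedDocs.push({ key: Math.floor(i / 10) });  // same keys arrive together
  scatteredDocs.push({ key: i % 1000 });         // same keys arrive far apart
}

const sorted = simulateMapReduce(sortedDocs, opts);
const scattered = simulateMapReduce(scatteredDocs, opts);
console.log("rows spilled, sorted input:   ", sorted.spilled);
console.log("rows spilled, scattered input:", scattered.spilled);
```

Both runs produce the same counts per key, but in this model the scattered input spills one row per document, while the sorted input spills only about one row per key, because each intermediate reduce collapses ten values into one before anything is written out.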

To fix this issue:

1. Add an input sort key for the M/R job that is the same as the emit key.
2. Make sure that key is indexed and works well with your query filter. You should run a find() with the same query and sort, call explain() on it, and make sure it uses an index.
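As a rough sketch of what these two steps could look like in the mongo shell: the collection name (events), the emit field (userId), the filter ({type: "click"}), the output collection name and the buildOptions helper are all hypothetical and only there for illustration. The shell calls are left as comments because they need a running mongod.

```javascript
// Hypothetical names throughout: collection "events", emit key "userId",
// filter field "type", output collection "clicks_per_user".
var emitKey = "userId";

// map emits on the same field the input will be sorted by.
function mapFn() { emit(this.userId, 1); }

// reduce sums the emitted counts for one key.
function reduceFn(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}

// Build the mapReduce options with an input sort on the emit key.
function buildOptions(query, sortField, outName) {
  var sort = {};
  sort[sortField] = 1;
  return { query: query, sort: sort, out: outName };
}

var options = buildOptions({ type: "click" }, emitKey, "clicks_per_user");

// In the mongo shell (commented out here, requires a running mongod):
//
// db.events.ensureIndex({ userId: 1 });  // createIndex() in newer shells
//
// Check that the same query and sort use an index before running the job.
// On older versions the "cursor" field should show a BtreeCursor rather
// than a BasicCursor.
// db.events.find(options.query).sort(options.sort).explain();
//
// db.events.mapReduce(mapFn, reduceFn, options);
```

Whether a single-field index on the emit key is enough, or a compound index that also covers the filter works better, depends on your data, which is what the explain() check is for.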

This can result in a 100x performance improvement in some cases. Note that in mongo 1.9 and above, some work has been done to improve performance: