Optimizing Map/Reduce with MongoDB
I’ve come across several users who experience poor performance when using Map/Reduce with MongoDB version 1.8 and older, and it turns out that in many cases it is easily fixable. Today I will focus on the “sort” parameter of the MapReduce command, which is often overlooked but critical.
Here is how M/R works in the general case, assuming there is no query filter:
- mongod does a full collection scan in natural order, going through every document in the collection.
- for each document, map() is called and emits documents of the form {_id: key, value: val}, which are stored in an in-memory map (tree).
- every 100 records, mongod checks whether the map has grown beyond 50KB; if it has, it runs reduce on ALL current keys. If the map is still over 100KB after that reduce, it dumps all current documents to disk in an “incremental” (inc) collection.
- when all mapping is done, it reads the documents back from the inc collection sorted by _id and does the final reduce.
Now if you have many documents and the key distribution is fairly random, the following can happen: all docs get inserted into the map, but because few of them share a key at any given moment the intermediate reduce barely shrinks it, and most documents end up in the “inc” collection on disk, which then has to be read back in order. The particular issue to understand is that since mongod has no idea what key you will emit, it cannot presort the data to make this efficient.
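To make the effect concrete, here is a toy simulation of the loop described above. It is only a sketch and not mongod's actual code: the function name simulateMapReduce and its options are made up for illustration, and the thresholds count buffered entries instead of kilobytes to keep it short. It feeds the same 10,000 documents (1,000 keys, 10 docs each) through the loop twice, once ordered by key and once interleaved, and counts how many entries get dumped to the simulated inc collection during the map phase.

```javascript
// Toy model of the map/reduce loop described above (not mongod's real code).
// Thresholds are counted in buffered entries, standing in for the 50KB/100KB sizes.
function simulateMapReduce(docs, mapFn, reduceFn, opts) {
  let buffer = new Map();   // key -> array of emitted values (the in-memory map)
  let buffered = 0;         // number of values currently held in memory
  const inc = [];           // stands in for the on-disk "inc" collection
  let spilled = 0;          // entries dumped to inc during the map phase

  const emit = (key, value) => {
    if (!buffer.has(key)) buffer.set(key, []);
    buffer.get(key).push(value);
    buffered++;
  };

  // Run reduce on ALL current keys, leaving one value per key.
  const reduceAll = () => {
    buffered = 0;
    for (const [key, values] of buffer) {
      if (values.length > 1) buffer.set(key, [reduceFn(key, values)]);
      buffered += 1;
    }
  };

  // Dump everything currently in memory to the inc collection.
  const spill = () => {
    for (const [key, values] of buffer) {
      for (const v of values) inc.push({ _id: key, value: v });
    }
    spilled += buffered;
    buffer = new Map();
    buffered = 0;
  };

  docs.forEach((doc, i) => {
    mapFn(doc, emit);
    if ((i + 1) % opts.checkEvery === 0 && buffered > opts.reduceAt) {
      reduceAll();
      if (buffered > opts.spillAt) spill();
    }
  });

  // Final pass: merge inc back in, grouped by key, and reduce once more.
  for (const { _id, value } of inc) {
    if (!buffer.has(_id)) buffer.set(_id, []);
    buffer.get(_id).push(value);
  }
  const results = new Map();
  for (const [key, values] of buffer) {
    results.set(key, values.length > 1 ? reduceFn(key, values) : values[0]);
  }
  return { results, spilled };
}

const mapFn = (doc, emit) => emit(doc.key, 1);
const reduceFn = (key, values) => values.reduce((a, b) => a + b, 0);
const opts = { checkEvery: 100, reduceAt: 50, spillAt: 100 };

// 1,000 keys with 10 docs each, in two different input orders.
const sortedDocs = [];
const interleavedDocs = [];
for (let i = 0; i < 10000; i++) {
  sortedDocs.push({ key: Math.floor(i / 10) });  // 0,0,...,0,1,1,...
  interleavedDocs.push({ key: i % 1000 });       // 0,1,2,...,999,0,1,...
}

const sortedRun = simulateMapReduce(sortedDocs, mapFn, reduceFn, opts);
const interleavedRun = simulateMapReduce(interleavedDocs, mapFn, reduceFn, opts);

console.log("spilled with sorted input:     ", sortedRun.spilled);      // 990
console.log("spilled with interleaved input:", interleavedRun.spilled); // 10000
```

Both runs produce the same counts per key. The difference is that with sorted input each intermediate reduce collapses ten values into one before anything is dumped, while with interleaved input the reduce has nothing to merge and every emitted entry goes to disk.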
To fix this issue:
- add an input sort key for the M/R job that is the same as the emit key.
- make sure that key is indexed and works well with your query filter. You should run a find() with the same query and sort, call explain() on it, and make sure it uses an index.
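As a rough illustration of the two steps above, here is what such a job could look like in the mongo shell. The collection name events, the fields type and user_id, and the output collection clicks_per_user are all hypothetical, and runJob is just a wrapper invented for this sketch; adapt them to your own schema. It assumes an index that covers both the filter and the sort, for example one on {type: 1, user_id: 1}.

```javascript
// Hypothetical schema: an "events" collection with "type" and "user_id" fields.
var query = { type: "click" };   // assumed query filter
var sort = { user_id: 1 };       // same field that map() emits as the key

var mapFn = function () {
  emit(this.user_id, 1);         // the emit key matches the sort key
};

var reduceFn = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
};

function runJob(db) {
  // Step 1: check that the same query + sort is served by an index.
  var plan = db.events.find(query).sort(sort).explain();

  // Step 2: run the job with the input sorted on the emit key.
  var result = db.events.mapReduce(mapFn, reduceFn, {
    query: query,
    sort: sort,
    out: { replace: "clicks_per_user" }  // hypothetical output collection
  });

  return { plan: plan, result: result };
}
```

In the shell you would call runJob(db) and look at the plan before trusting the timing: if explain() reports an unindexed scan or an in-memory sort, the sort is likely to cost more than it saves.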
This can result in a 100x performance improvement in some cases. Note that in mongo 1.9 and above, some work has been done to improve performance:
- the thresholds for running reduces or dumping to disk have been increased.
- there is a new “pure JS” mode that can be very fast for light jobs.
- the JS engine interface has been optimized.
But in any case mongod is still not aware of your emit key, so use sort!
cheers
AG