MongoDB 聚合

最新推荐文章于 2024-09-06 09:46:37 发布

转载最新推荐文章于 2024-09-06 09:46:37 发布 · 766 阅读

24 篇文章

订阅专栏

db.pa_yunju_orders.aggregate(
[
{$group:{_id:"$erp_order_id",total_num:{$sum:1},max_price:{$max:"$order_price"},order_sn:{$push:"$order_sn"}}},
{$match:{"total_num":{"$gt":1}}},
{$sort:{"max_price":-1}},
{$unwind:"$order_sn"}
]
)

###############################

MongoDB中聚合(aggregate)主要用于处理数据(诸如统计平均值,求和等)，并返回计算后的数据结果。有点类似sql语句中的 count(*)。

aggregate() 方法

MongoDB中聚合的方法使用aggregate()。

语法

aggregate() 方法的基本语法格式如下所示：

>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

实例

集合中的数据如下：

{
   _id: ObjectId(7df78ad8902c)
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by_user: 'w3cschool.cc',
   url: 'http://www.w3cschool.cc',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
},
{
   _id: ObjectId(7df78ad8902d)
   title: 'NoSQL Overview', 
   description: 'No sql database is very fast',
   by_user: 'w3cschool.cc',
   url: 'http://www.w3cschool.cc',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 10
},
{
   _id: ObjectId(7df78ad8902e)
   title: 'Neo4j Overview', 
   description: 'Neo4j is no sql database',
   by_user: 'Neo4j',
   url: 'http://www.neo4j.com',
   tags: ['neo4j', 'database', 'NoSQL'],
   likes: 750
},

现在我们通过以上集合计算每个作者所写的文章数，使用aggregate()计算结果如下：

> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{
   "result" : [
      {
         "_id" : "w3cschool.cc",
         "num_tutorial" : 2
      },
      {
         "_id" : "Neo4j",
         "num_tutorial" : 1
      }
   ],
   "ok" : 1
}
>

以上实例类似sql语句： select by_user, count(*) from mycol group by by_user

在上面的例子中，我们通过字段by_user字段对数据进行分组，并计算by_user字段相同值的总和。

下表展示了一些聚合的表达式:

表达式	描述	实例
$sum	计算总和。	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg	计算平均值	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min	获取集合中所有文档对应值得最小值。	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$max	获取集合中所有文档对应值得最大值。	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
$push	在结果文档中插入值到一个数组中。	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet	在结果文档中插入值到一个数组中，但不创建副本。	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first	根据资源文档的排序获取第一个文档数据。	db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
$last	根据资源文档的排序获取最后一个文档数据	db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])

管道的概念

管道在Unix和Linux中一般用于将当前命令的输出结果作为下一个命令的参数。

MongoDB的聚合管道将MongoDB文档在一个管道处理完毕后将结果传递给下一个管道处理。管道操作是可以重复的。

表达式：处理输入文档并输出。表达式是无状态的，只能用于计算当前聚合管道的文档，不能处理其它的文档。

这里我们介绍一下聚合框架中常用的几个操作：

$project：修改输入文档的结构。可以用来重命名、增加或删除域，也可以用于创建计算结果以及嵌套文档。
$match：用于过滤数据，只输出符合条件的文档。$match使用MongoDB的标准查询操作。
$limit：用来限制MongoDB聚合管道返回的文档数。
$skip：在聚合管道中跳过指定数量的文档，并返回余下的文档。
$unwind：将文档中的某一个数组类型字段拆分成多条，每条包含数组中的一个值。
$group：将集合中的文档分组，可用于统计结果。
$sort：将输入文档排序后输出。
$geoNear：输出接近某一地理位置的有序文档。

管道操作符实例

1、$project实例

db.article.aggregate(
    { $project : {
        title : 1 ,
        author : 1 ,
    }}
 );

这样的话结果中就只还有_id,tilte和author三个字段了，默认情况下_id字段是被包含的，如果要想不包含_id话可以这样:

db.article.aggregate(
    { $project : {
        _id : 0 ,
        title : 1 ,
        author : 1
    }});

2.$match实例

db.articles.aggregate( [
                        { $match : { score : { $gt : 70, $lte : 90 } } },
                        { $group: { _id: null, count: { $sum: 1 } } }
                       ] );

$match用于获取分数大于70小于或等于90记录，然后将符合条件的记录送到下一阶段$group管道操作符进行处理。

3.$skip实例

db.article.aggregate(
    { $skip : 5 });

经过$skip管道操作符处理后，前五个文档被"过滤"掉。

##############################################################

通过上一篇文章中，认识了MongoDB中四个聚合操作，提供基本功能的count、distinct和group，还有可以提供强大功能的mapReduce。

在MongoDB的2.2版本以后，聚合框架中多了一个新的成员，聚合管道，数据进入管道后就会经过一级级的处理，直到输出。

对于数据量不是特别大，逻辑也不是特别复杂的聚合操作，聚合管道还是比mapReduce有很多优势的：

相比mapReduce，聚合管道比较容易理解和使用
可以直接使用管道表达式操作符，省掉了很多自定义js function，一定程度上提高执行效率
和mapReduce一样，它也可以作用于分片集合

但是，对于数据量大，逻辑复杂的聚合操作，还是要使用mapReduce实现。

聚合管道

在聚合管道中，每一步操作（管道操作符）都是一个工作阶段（stage），所有的stage存放在一个array中。MongoDB文档中的描述如下：

db.collection.aggregate( [ { <stage> }, ... ] )

在聚合管道中，每一个stage都对应一个管道操作符，根据MongoDB文档，聚合管道可以支持以下管道操作符：

Name	Description
$geoNear	Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of $match, $sort, and $limit for geospatial data. The output documents include an additional distance field and can include a location identifier field.
$group	Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.
$limit	Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).
$match	Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).
$out	Writes the resulting documents of the aggregation pipeline to a collection. To use the $out stage, it must be the last stage in the pipeline.
$project	Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.
$redact	Reshapes each document in the stream by restricting the content for each document based on information stored in the documents themselves. Incorporates the functionality of $project and $match. Can be used to implement field level redaction. For each input document, outputs either one or zero document.
$skip	Skips the first n documents where n is the specified skip number and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first n documents).
$sort	Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.
$unwind	Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.

下面通过具体的例子看看聚合管道的基本用法：

首先，通过以下代码准备测试数据：

 1 var dataList = [
 2     { "name" : "Will0", "gender" : "Female", "age" : 22 , "classes": ["MongoDB", "C#", "C++"]},
 3     { "name" : "Will1", "gender" : "Female", "age" : 20 , "classes": ["Node", "JavaScript"]},
 4     { "name" : "Will2", "gender" : "Male", "age" : 24 , "classes": ["Java", "WPF", "C#"]},
 5     { "name" : "Will3", "gender" : "Male", "age" : 23 , "classes": ["WPF", "C",]},
 6     { "name" : "Will4", "gender" : "Male", "age" : 21 , "classes": ["SQL", "HTML"]},
 7     { "name" : "Will5", "gender" : "Male", "age" : 20 , "classes": ["DOM", "CSS", "HTML5"]},
 8     { "name" : "Will6", "gender" : "Female", "age" : 20 , "classes": ["PPT", "Word", "Excel"]},
 9     { "name" : "Will7", "gender" : "Female", "age" : 24 , "classes": ["HTML5", "C#"]},
10     { "name" : "Will8", "gender" : "Male", "age" : 21 , "classes": ["Java", "VB", "BASH"]},
11     { "name" : "Will9", "gender" : "Female", "age" : 24 , "classes": ["CSS"]}
12 ]              
13 
14 for(var i = 0; i < dataList.length; i++){
15     db.school.students.insert(dataList[i]);
16 }

$project

$project主要用于数据投影，实现字段的重命名、增加和删除。下面例子中，重命名了"name"字段，增加了"birthYear"字段，删除了"_id"字段。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$match": {"age": {"$lte": 22}}},
 5 ...                         {"$project": {"_id": 0, "studentName": "$name", "gender": 1, "birthYear": {"$subtract": [2014, "$age"]}}},
 6 ...                         {"$sort": {"birthYear":1}}
 7 ...                     ]
 8 ... })
 9 {
10         "result" : [
11                 {
12                         "gender" : "Female",
13                         "studentName" : "Will0",
14                         "birthYear" : 1992
15                 },
16                 {
17                         "gender" : "Male",
18                         "studentName" : "Will4",
19                         "birthYear" : 1993
20                 },
21                 {
22                         "gender" : "Male",
23                         "studentName" : "Will8",
24                         "birthYear" : 1993
25                 },
26                 {
27                         "gender" : "Female",
28                         "studentName" : "Will1",
29                         "birthYear" : 1994
30                 },
31                 {
32                         "gender" : "Male",
33                         "studentName" : "Will5",
34                         "birthYear" : 1994
35                 },
36                 {
37                         "gender" : "Female",
38                         "studentName" : "Will6",
39                         "birthYear" : 1994
40                 }
41         ],
42         "ok" : 1
43 }
44 >

$unwind

$unwind用来将数组拆分为独立字段。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$match": {"age": 23}},
 5 ...                         {"$project": {"name": 1, "gender": 1, "age": 1, "classes": 1}},
 6 ...                         {"$unwind": "$classes"}
 7 ...                     ]
 8 ... })
 9 {
10         "result" : [
11                 {
12                         "_id" : ObjectId("54805220e31c9e1578ed0ccc"),
13                         "name" : "Will3",
14                         "gender" : "Male",
15                         "age" : 23,
16                         "classes" : "WPF"
17                 },
18                 {
19                         "_id" : ObjectId("54805220e31c9e1578ed0ccc"),
20                         "name" : "Will3",
21                         "gender" : "Male",
22                         "age" : 23,
23                         "classes" : "C"
24                 }
25         ],
26         "ok" : 1
27 }
28 >

$group

跟单个的group用法类似，用作分组操作。

使用$group时，必须要指定一个_id域，同时也可以包含一些算术类型的表达式操作符。

下面的例子就是用聚合管道实现得到男生和女生的平均年龄。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                     {"$group":{"_id":"$gender", "avg": { "$avg": "$age" } }}
 5 ...                  ]
 6 ... })
 7 {
 8         "result" : [
 9                 {
10                         "_id" : "Male",
11                         "avg" : 21.8
12                 },
13                 {
14                         "_id" : "Female",
15                         "avg" : 22
16                 }
17         ],
18         "ok" : 1
19 }
20 >

查询男生和女生的最大年龄

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$group": {"_id": "$gender", "max": {"$max": "$age"}}}
 5 ...                     ]
 6 ... })
 7 {
 8         "result" : [
 9                 {
10                         "_id" : "Male",
11                         "max" : 24
12                 },
13                 {
14                         "_id" : "Female",
15                         "max" : 24
16                 }
17         ],
18         "ok" : 1
19 }
20 >