使用mongoDB pipeline进行聚合操作

最新推荐文章于 2025-11-01 18:52:22 发布

原创最新推荐文章于 2025-11-01 18:52:22 发布 · 2w 阅读

46 ·

CC 4.0 BY-SA版权

文章标签：

#mongodb #聚合 #aggregate #group

技术博客同时被 3 个专栏收录

15 篇文章

订阅专栏

数据库

15 篇文章

订阅专栏

翻译

2 篇文章

订阅专栏

本文详细介绍了MongoDB中的聚合操作，包括aggregation pipeline的各种阶段，如$project、$match、$redact等，并通过实例展示了如何使用这些操作来处理数据。

mongoDB中的聚合操作将多个文档中的值组合在一起，并可对分组数据执行各种操作，以返回单个结果。在SQL中的 count(*)与group by组合相当于mongodb 中的聚合功能。
mongoDB为我们提供了三种方法来实现聚合操作。分别是aggregation pipeline，Map-Reduce和Single Purpose Aggregation Operations。今天我们主要来讨论一下关于aggregation pipeline的内容。关于Map-Reduce和Single Purpose Aggregation Operations后面有时间了再讨论。

aggregation pipeline主要是用aggregate()方法来实现聚合操作。
pipeline通俗的理解就是一个管道，其中每一个操作就是管道的一个阶段，每次当前操作接受上一个阶段的输出作为输入，并把输出结果作为输入结果给下一个阶段。如果你还不是特别明白，不用担心，慢慢往下看，我们会用更多的例子来说明。语法如下：
db.collection.aggregate( [ { }, … ] )。其中stage可以是如下的内容。

1$project

$project用来重构返回值。下面举个例子来说明。

db.books.insert(
{
  "_id" : 1,
  title: "abc123",
  isbn: "0001122223334",
  author: { last: "zzz", first: "aaa" },
  copies: 5
}
)

此时我们使用如下命令来查询数据：

 db.books.aggregate([{$project:{title:1,author:1}}])

输出如下结果：

{ "_id" : 1, "title" : "abc123", "author" : { "last" : "zzz", "first" : "aaa" } }

可以看出，输出的结果中包含了_id,title,author三个字段。默认情况下_id字段是会输出的，然后在$project中可以通过指定字段的值为0或者1，来决定输出结果中包含还是不包含这个字段。为1时代表返回值包含这个字段，为0时代表返回值不包含这个字段。不想返回_id字段可以这样：

db.books.aggregate( [ { $project : { _id: 0, title : 1 , author : 1 } } ] )

$project除了可以控制是否返回某个字段，还可以给返回结果增加字段。

db.books.aggregate(
   [
      {
         $project: {
            title: 1,
            isbn: {
               prefix: { $substr: [ "$isbn", 0, 3 ] },
               group: { $substr: [ "$isbn", 3, 2 ] },
               publisher: { $substr: [ "$isbn", 5, 4 ] },
               title: { $substr: [ "$isbn", 9, 3 ] },
               checkDigit: { $substr: [ "$isbn", 12, 1] }
            },
            lastName: "$author.last",
            copiesSold: "$copies"
         }
      }
   ]
)

返回结果如下：

{ "_id" : 1, "title" : "abc123", "isbn" : { "prefix" : "000", "group" : "11", "publisher" : "2222", "title" : "333", "checkDigit" : "4" }, "lastName" : "zzz", "copiesSold" : 5 }

可以看到，我们给返回结果的isbsn字段中添加了prefix,group,publisher,title和checkDigit字段，这些字段是通过对原来isbn的字符串进行拆分得到的,另一个增加的字段copiesSold是引用了原来的copies字段。

2$match

$match通过跟查询语句相比对，来过滤集合，只返回跟查询语句相匹配的文档。

db.articles.insert([
{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }
])

执行命令：

db.articles.aggregate([{$match:{author:"dave"}}])

返回结果如下：

{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }

3$redact

感觉不好说清楚，直接上例子：

db.forecasts.insert(
{
  _id: 1,
  title: "123 Department Report",
  tags: [ "G", "STLW" ],
  year: 2014,
  subsections: [
    {
      subtitle: "Section 1: Overview",
      tags: [ "SI", "G" ],
      content:  "Section 1: This is the content of section 1."
    },
    {
      subtitle: "Section 2: Analysis",
      tags: [ "STLW" ],
      content: "Section 2: This is the content of section 2."
    },
    {
      subtitle: "Section 3: Budgeting",
      tags: [ "TK" ],
      content: {
        text: "Section 3: This is the content of section3.",
        tags: [ "HCS" ]
      }
    }
  ]
}
)

执行如下操作：

var userAccess = [ "STLW", "G" ];
db.forecasts.aggregate(
   [
     { $match: { year: 2014 } },
     { $redact: {
        $cond: {
           if: { $gt: [ { $size: { $setIntersection: [ "$tags", userAccess ] } }, 0 ] },
           then: "$$DESCEND",
           else: "$$PRUNE"
         }
       }
     }
   ]
);

返回结果如下：

{
  "_id" : 1,
  "title" : "123 Department Report",
  "tags" : [ "G", "STLW" ],
  "year" : 2014,
  "subsections" : [
    {
      "subtitle" : "Section 1: Overview",
      "tags" : [ "SI", "G" ],
      "content" : "Section 1: This is the content of section 1."
    },
    {
      "subtitle" : "Section 2: Analysis",
      "tags" : [ "STLW" ],
      "content" : "Section 2: This is the content of section 2."
    }
  ]
}

解释一下上面的例子，$redact主要有三个关键字：
1）$$DESCEND：包含当前document级别的所有fields。当前级别字段的内嵌文档将会被继续检测。
2）$$PRUNE：不包含当前文档或者内嵌文档级别的所有字段，不会继续检测此级别的其他字段，即使这些字段的内嵌文档持有相同的访问级别。
3）$$KEEP：包含当前文档或内嵌文档级别的所有字段，不再继续检测此级别的其他字段，即使这些字段的内嵌文档中持有不同的访问级别。
在上面的aggregate操作中，首先使用$match过滤掉了一部分数据，只拿到year是2014的文档。此时$matche就是第一个阶段，在这一个阶段中，我们只拿出跟查询条件相匹配的结果，并将这个结果作为输入传递个下一个阶段。在这里下一个阶段就是$redact。在$redact阶段中，我们拿到了上一个阶段的输出结果。$redact常常和$cond配合使用。$cond是英文condition的所写，代表条件。在$cond中通常会使用if–else–then的表达式。在if中传入一个判断语句，如果符合判断语句，执行then的部分，否则执行else的部分。在我们上面的例子中，$setIntersection表示对两个集合取交集。$size表示对集合取大小。所以if判断语句中的意思就是：对于当前阶段输入进来的每一个文档，都判断tags和userAccess的交集的长度是否大于0，如果大于0，就执行then，否则执行else。
在then这一部分使用了$$DESCEND关键字。这个关键字的意思我们在上面已经做了说明。在这个例子中的意思就是说，如果我满足了if条件，那么会继续递归的向下判断我的子文档。在我们的例子中，将继续去判断subsections字段里面的内容，因为它里面也含有tags字段。
在else这一部分使用了$$PRUNE关键字。这个关键字的意思在上面也已经说清楚了。在这个例子中的意思就是说，如果我不满足if条件，那么我直接就舍弃当前文档。

4$limit

用来控制当前阶段输出数据的数据。如下：

db.article.aggregate(
    { $limit : 5 }
);

返回结果只有5条，如果还有下一个阶段的操作，那么也只会向下一阶段传递5条结果。
mongoDB对limit做了优化，如果$sort(排序)刚好在limit前面，那么$sort只会在内存中维护n条记录，这个n就是limit中的数字。

5$skip

跳过一些数据。
如下：

db.article.aggregate(
    { $skip : 5 }
);

在这个例子中将会跳过5条数据，然后将结果返回。如果还有下一步操作，将结果传递给下一阶段。

6$unwind

还是直接上例子。
inventory集合中有这样一个文档。

{ "_id" : 1, "item" : "ABC1", sizes: [ "S", "M", "L"] }

使用：

db.inventory.aggregate( [ { $unwind : "$sizes" } ] )

返回结果：

{ "_id" : 1, "item" : "ABC1", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "L" }

可以看出，在这个例子中，$unwind命令将sizes字段的数组拆分成了一个个的元素。这就是$unwind的主要功能。
$unwind的完整写法如下：

{
  $unwind:
    {
      path: <field path>,
      includeArrayIndex: <string>,
      preserveNullAndEmptyArrays: <boolean>
    }
}

path:必须有,指明需要查分的字段，一般为一个数组
includeArrayIndex: 可选,如果需要拆分开的数据包含数据下标，可以指定数组下标的字段名
preserveNullAndEmptyArrays: 可选，在数组为null,缺失 ,或者空的情况下，如果为true, $unwind 同样输出当前文档. 如果为false, \$unwind不输出文档。默认是false.
下面的写法效果是一样的：

db.inventory.aggregate( [ { $unwind: "$sizes" } ] )
db.inventory.aggregate( [ { $unwind: { path: "$sizes" } } ] )

假设现在inventory中有如下文档数据：

{ "_id" : 1, "item" : "ABC", "sizes": [ "S", "M", "L"] }
{ "_id" : 2, "item" : "EFG", "sizes" : [ ] }
{ "_id" : 3, "item" : "IJK", "sizes": "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }

使用如下命令：

db.inventory.aggregate( [ { $unwind: "$sizes" } ] )

返回结果：

{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }

可以看到，在sizes是空，null或者缺失的情况下，文档结果不会输出来。(preserveNullAndEmptyArrays默认是false)
我们来将preserveNullAndEmptyArrays指定为true,看一下结果。
执行：

db.inventory.aggregate( [
   { $unwind: { path: "$sizes", preserveNullAndEmptyArrays: true } }
] )

返回结果：

{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 2, "item" : "EFG" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }

区别已经一目了然。
我们还可以通过来显示被查分的数据的数组下标。

db.inventory.aggregate( [ { $unwind: { path: "$sizes", includeArrayIndex: "arrayIndex" } } ] )

在这里，返回结果中会多出一个字段叫做：arrayIndex，这个字段包含了当前文档原来的数据在数组中的下标。上面的返回结果：

{ "_id" : 1, "item" : "ABC", "sizes" : "S", "arrayIndex" : NumberLong(0) }
{ "_id" : 1, "item" : "ABC", "sizes" : "M", "arrayIndex" : NumberLong(1) }
{ "_id" : 1, "item" : "ABC", "sizes" : "L", "arrayIndex" : NumberLong(2) }
{ "_id" : 3, "item" : "IJK", "sizes" : "M", "arrayIndex" : null }

7$sample

$sample随机返回指定大小的数据集合。标准的语法模板如下：

{ $sample: { size: <positive integer> } }

通过下面的命令从users集合中随机选出三条记录。

db.users.aggregate(
   [ { $sample: { size: 3 } } ]
)

结果将返回users中随机的三条记录。
值得注意的是，$sample的返回结果中，某一条记录可能会出现不止一次。

8$group

$group几乎是用的最为频繁的聚合函数。先来举个例子看一下效果：

 db.products.insert([
   {proId:"001",saleNum:10},
   {proId:"002",saleNum:5},
   {proId:"001",saleNum:15}
   ])

现在products集合中有三条数据。现在我们想要统计出proId为001的产品总的saleNum。

 db.products.aggregate([
    {
       $group:{_id:"$proId",
              totalNum:{$sum:"$saleNum"}
              }
    }
    ])

$group就类似与mysql中的group by，指定数据根据某个字段分组。在$group中我们是通过_id来指定分组依据。上面的例子中，我们指定数据根据proId来尽心分组，然后在每个分组中通过$sum函数来计算saleNum的总数，并将计算结果给了totalNum这个字段。
返回结果如下：

{ "_id" : "002", "totalNum" : 5 }
{ "_id" : "001", "totalNum" : 25 }

通过上面这个例子，相信大家对$group的大致使用已经有所了解。现在我们来看看$group的完整的使用模板：

{ $group: { _id: <expression>, <field1>: { <accumulator1> : <expression1> }, ... } }

其中_id指定分组依据，后面的accumulator可以是下面的函数：

$sum:返回分组中某个字段的和。如果某个字符串无法转换成数字，mongoDB将忽略它。从3.2版本之后，在$group和$project阶段都有效。
$avg:返回分组内某个字段的平均值。如果某个字符串无法转换成数字，mongoDB将忽略它。从3.2版本之后，在$group和$project阶段都有效。
$first:返回分组内的第一条文档数据。仅仅在$group阶段有效。
$last:返回分组内的最后一条文档数据。仅仅在$group阶段有效。
$max：返回分组内某个字段最大的数据。从3.2版本之后，在$group和$project阶段都有效。
$min:返回分组内某个字段最小的数据。从3.2版本之后，在$group和$project阶段都有效。
$push:将分组内某个字段的值加入到一个数组中。仅仅在$group阶段有效。
$addToSet:将分组内的某个字段加入一个集合中，无重复值。仅仅在$group阶段有效。
$stdDevPop:计算分组内某个字段的标准差。从3.2版本之后，在$group和$project阶段都有效。
$stdDevSamp:计算样本中某个字段的标准差。从3.2版本之后，在$group和$project阶段都有效。
下面我们一一举例来说明。
$sum
表达式模板：

{ $sum: <expression> }

集合sales中有如下数据：

{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z")}
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z")}
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z")}
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z")}
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z")}

执行命令：

db.sales.aggregate(
   [
     {
       $group:
         {
           _id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
           totalAmount: { $sum: { $multiply: [ "$price", "$quantity" ] } },
           count: { $sum: 1 }
         }
     }
   ]
)

解释一下这条命令，$dayOfYear是个函数取到某天是一年中的第几天，$yearz这个函数取到年份，所以在上面的例子中，我们是用一年中的第几天和年份来作为分组依据。假设某个分组内有n条数据，对于每一条数据，$multiply函数用来做乘法计算，将每一个分组内每一条数据中的price和quantity相乘。一共计算n次，得到n个乘积，最后把得到的n个乘积求和给totalAmount这个字段。
count: { $sum: 1 }跟totalAmount类似，只是在这里把price和quantity相乘直接换成了1。这样写的意思就是统计每个分组内有多少条数据。
输出的结果如下：

{ "_id" : { "day" : 46, "year" : 2014 }, "totalAmount" : 150, "count" : 2 }
{ "_id" : { "day" : 34, "year" : 2014 }, "totalAmount" : 45, "count" : 2 }
{ "_id" : { "day" : 1, "year" : 2014 }, "totalAmount" : 20, "count" : 1 }

补充：

在 $group阶段，如果表达式中的数据是数组，那么$sum把该数组看做非数字类型。

example	Field Values	Results
{ $sum : <field> }	数字	计算values的和
{ $sum : <field> }	数字和非数字混合	忽略非数字，计算数字的和
{ $sum : <field> }	非数字，或者field不存在	0

$avg
计算每一个分组内某个指定字段的平均值。
表达式：

{ $avg: <expression> }

这个跟上面的sum使用方法差不多，就不赘述了，举个例子相信大家都看得懂。
在sales集合中有如下数据：

{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z")}
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z")}
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z")}
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z")}
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:12:00Z")}

执行命令：

db.sales.aggregate(
   [
     {
       $group:
         {
           _id: "$item",
           avgAmount: { $avg: { $multiply: [ "$price", "$quantity" ] } },
           avgQuantity: { $avg: "$quantity" }
         }
     }
   ]
)

返回结果如下：

{ "_id" : "xyz", "avgAmount" : 37.5, "avgQuantity" : 7.5 }
{ "_id" : "jkl", "avgAmount" : 20, "avgQuantity" : 1 }
{ "_id" : "abc", "avgAmount" : 60, "avgQuantity" : 6 }

补充：
在 $group阶段，如果表达式中的数据是数组，那么$avg把该数组看做非数字类型。
$first
返回每个分组中第一条数据，注意只有在组内数据有一定的顺序的时候，这个操作才具有意义。
语法如下：

{ $first: <expression> }

假定sales集合中有如下数据：

{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z")}
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z")}
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z")}
{ "_id" : 4,"item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z")}
{ "_id" : 5,"item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z")}
{ "_id" : 6, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-15T12:05:10Z") }
{ "_id" : 7, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T14:12:12Z")}

执行如下操作：

db.sales.aggregate(
   [
     { $sort: { item: 1, date: 1 } },
     {
       $group:
         {
           _id: "$item",
           firstSalesDate: { $first: "$date" }
         }
     }
   ]
)

首先根据item和date做正向排序，然后根据item字段分组，并将每一组内第一条数据给firstSalesDate字段，这样就拿出了每种item最早的销售日期。
返回结果：

{ "_id" : "xyz", "firstSalesDate" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : "jkl", "firstSalesDate" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : "abc", "firstSalesDate" : ISODate("2014-01-01T08:00:00Z") }

$last
跟$first用法一样，不再赘述。
$max
返回最大值。例子一看大家都明白。

db.sales.aggregate(
   [
     {
       $group:
         {
           _id: "$item",
           maxTotalAmount: { $max: { $multiply: [ "$price", "$quantity" ] } },
           maxQuantity: { $max: "$quantity" }
         }
     }
   ]
)

返回结果：

{ "_id" : "xyz", "maxTotalAmount" : 50, "maxQuantity" : 10 }
{ "_id" : "jkl", "maxTotalAmount" : 20, "maxQuantity" : 1 }
{ "_id" : "abc", "maxTotalAmount" : 100, "maxQuantity" : 10 }

$min
跟$max类似，不在赘述。
$push
将指定的表达式的值添加到一个数组中。
还是以上面的sales集合为例。

db.sales.aggregate(
   [
     {
       $group:
         {
           _id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
           itemsSold: { $push:  { item: "$item", quantity: "$quantity" } }
         }
     }
   ]
)

在这个例子中我们将item: “$item”, quantity: “$quantity”这样形式的数据每次都放入一个数组中，数组名字我们取名为itemSold.
返回结果：

{
   "_id" : { "day" : 46, "year" : 2014 },
   "itemsSold" : [
      { "item" : "abc", "quantity" : 10 },
      { "item" : "xyz", "quantity" : 10 },
      { "item" : "xyz", "quantity" : 5 },
      { "item" : "xyz", "quantity" : 10 }
   ]
}
{
   "_id" : { "day" : 34, "year" : 2014 },
   "itemsSold" : [
      { "item" : "jkl", "quantity" : 1 },
      { "item" : "xyz", "quantity" : 5 }
   ]
}
{
   "_id" : { "day" : 1, "year" : 2014 },
   "itemsSold" : [ { "item" : "abc", "quantity" : 2 } ]
}

$addToSet
用法同$push，只是不能有重复数据。
依然用sales集合做例子：

db.sales.aggregate(
   [
     {
       $group:
         {
           _id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
           itemsSold: { $addToSet: "$item" }
         }
     }
   ]
)

返回结果：

{ "_id" : { "day" : 46, "year" : 2014 }, "itemsSold" : [ "xyz", "abc" ] }
{ "_id" : { "day" : 34, "year" : 2014 }, "itemsSold" : [ "xyz", "jkl" ] }
{ "_id" : { "day" : 1, "year" : 2014 }, "itemsSold" : [ "abc" ] }

$stdDevPop
返回标准差。
user集合中有如下数据：

{ "_id" : 1, "name" : "dave123", "quiz" : 1, "score" : 85 }
{ "_id" : 2, "name" : "dave2", "quiz" : 1, "score" : 90 }
{ "_id" : 3, "name" : "ahn", "quiz" : 1, "score" : 71 }
{ "_id" : 4, "name" : "li", "quiz" : 2, "score" : 96 }
{ "_id" : 5, "name" : "annT", "quiz" : 2, "score" : 77 }
{ "_id" : 6, "name" : "ty", "quiz" : 2, "score" : 82 }

执行命令：

db.users.aggregate([
   { $group: { _id: "$quiz", stdDev: { $stdDevPop: "$score" } } }
])

返回结果：

{ "_id" : 2, "stdDev" : 8.04155872120988 }
{ "_id" : 1, "stdDev" : 8.04155872120988 }

可以看出，$stdDevPop求出了每个分组内score字段的标准差。
$stdDevSamp
用法同$stdDevPop，只是$stdDevPop是在有全部数据的情况下做运算计算标准差，$stdDevSamp是在只有样本的情况下计算标准差。
users中有如下数据。

{_id: 0, username: "user0", age: 20}
{_id: 1, username: "user1", age: 42}
{_id: 2, username: "user2", age: 28}
...

执行计算：

db.users.aggregate(
   [
      { $sample: { size: 100 } },
      { $group: { _id: null, ageStdDev: { $stdDevSamp: "$age" } } }
   ]
)

返回结果：

{ "_id" : null, "ageStdDev" : 7.811258386185771 }

先用$sample选取了100个样本，然后计算出了这一百个样本的标准偏差。
值得注意的是，在这里$group分组的时候，_id指定了null,意思就是不分组，所有的数据都作为一组参与当前运算。

9$sort

对所有的输入数据进行排序，并输出排序好的数据。
模板如下：

{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }

其中sort order可以取值为1，-1，或者textScore.关于textScore这里先不做说明，后面有时间再讨论这个。
当sort order取值为1时，代表升序。
当sort order取值为-1时，代表降序。
将user表先按照age进行降序再按照posts字段进行升序排序。

db.users.aggregate(
   [
     { $sort : { age : -1, posts: 1 } }
   ]
)

如果$sort后面紧跟着$limit,$sort能够只维持前n条记录。n就是limit限定的条数。这样能够提高效率并且节省内存。
一般来说 $sort阶段有100M的内存限制。如果在运算的过程中超过这个大小，就会报错。为了能够处理更大规模的数据，可以再做聚合操作的时候将allowDiskUse设置为true，默认是false。就想下面这样：

db.stocks.aggregate([
                    { $project : { cusip: 1, date: 1, price: 1, _id: 0 } },
                    { $sort : { cusip : 1, date: 1 } }
                    ],
                    {
                      allowDiskUse: true
                    }
                   )

$sort操作可以利用索引以提高性能。但是必须在aggregate管道开始的地方使用$sort或者在$project,$unwind和$group的前面使用。如果在$project,$unwind和$group的后面使用$sort，那么系统将不能使用任何索引，这一点要引起注意。

10$lookup

先看一下语法模板：

{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}

$lookup将用localField字段中的内容和from集合中的foreignField字段进行比较，如果比较相同，就将from集合中的这条记录加入当前文档中，字段名为as的值。
举个例子。
orders集合内容如下：

{ "_id" : 1, "item" : "abc", "price" : 12, "quantity" : 2 }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1 }
{ "_id" : 3  }

inventory集合内容如下：

{ "_id" : 1, "sku" : "abc", description: "product 1", "instock" : 120 }
{ "_id" : 2, "sku" : "def", description: "product 2", "instock" : 80 }
{ "_id" : 3, "sku" : "ijk", description: "product 3", "instock" : 60 }
{ "_id" : 4, "sku" : "jkl", description: "product 4", "instock" : 70 }
{ "_id" : 5, "sku": null, description: "Incomplete" }
{ "_id" : 6 }

执行如下命令：

db.orders.aggregate([
    {
      $lookup:
        {
          from: "inventory",
          localField: "item",
          foreignField: "sku",
          as: "inventory_docs"
        }
   }
])

输出结果如下：

{
  "_id" : 1,
   "item" : "abc",
  "price" : 12,
  "quantity" : 2,
  "inventory_docs" : [
    { "_id" : 1, "sku" : "abc", description: "product 1", "instock" : 120 }
  ]
}
{
  "_id" : 2,
  "item" : "jkl",
  "price" : 20,
  "quantity" : 1,
  "inventory_docs" : [
    { "_id" : 4, "sku" : "jkl", "description" : "product 4", "instock" : 70 }
  ]
}
{
  "_id" : 3,
  "inventory_docs" : [
    { "_id" : 5, "sku" : null, "description" : "Incomplete" },
    { "_id" : 6 }
  ]
}

如果localField是一个数组而foreignField单独的元素，那么你需要使用$unwind来拆分数组。
orders有如下数据：

{ "_id" : 1, "item" : "MON1003", "price" : 350, "quantity" : 2, "specs" :
[ "27 inch", "Retina display", "1920x1080" ], "type" : "Monitor" }

inventory有如下数据：

{ "_id" : 1, "sku" : "MON1003", "type" : "Monitor", "instock" : 120,
"size" : "27 inch", "resolution" : "1920x1080" }
{ "_id" : 2, "sku" : "MON1012", "type" : "Monitor", "instock" : 85,
"size" : "23 inch", "resolution" : "1280x800" }
{ "_id" : 3, "sku" : "MON1031", "type" : "Monitor", "instock" : 60,
"size" : "23 inch", "display_type" : "LED" }

执行命令：

db.orders.aggregate([
   {
      $unwind: "$specs"
   },
   {
      $lookup:
         {
            from: "inventory",
            localField: "specs",
            foreignField: "size",
            as: "inventory_docs"
        }
   },
   {
      $match: { "inventory_docs": { $ne: [] } }
   }
])

首先将orders中的specs数组拆分成一个个的元素，然后执行$lookup操作，执行完毕后使用$match来匹配到”inventory_docs”非空的字段。
返回结果如下：

{
   "_id" : 1,
   "item" : "MON1003",
   "price" : 350,
   "quantity" : 2,
   "specs" : "27 inch",
   "type" : "Monitor",
   "inventory_docs" : [
      {
         "_id" : 1,
         "sku" : "MON1003",
         "type" : "Monitor",
         "instock" : 120,
         "size" : "27 inch",
         "resolution" : "1920x1080"
      }
   ]
}

11$out

指定将结果输出到哪里。
book包含如下数据：

{ "_id" : 8751, "title" : "The Banquet", "author" : "Dante", "copies" : 2 }
{ "_id" : 8752, "title" : "Divine Comedy", "author" : "Dante", "copies" : 1 }
{ "_id" : 8645, "title" : "Eclogues", "author" : "Dante", "copies" : 2 }
{ "_id" : 7000, "title" : "The Odyssey", "author" : "Homer", "copies" : 10 }
{ "_id" : 7020, "title" : "Iliad", "author" : "Homer", "copies" : 10 }

在books集合上执行如下操作：

db.books.aggregate( [
                      { $group : { _id : "$author", books: { $push: "$title" } } },
                      { $out : "authors" }
                  ] )

执行完毕后authors集合中就会有如下数据，可以通过db.authors.find()来进行查询。

{ "_id" : "Homer", "books" : [ "The Odyssey", "Iliad" ] }
{ "_id" : "Dante", "books" : [ "The Banquet", "Divine Comedy", "Eclogues" ] }

12$facet

在一个stage里处理多条pipeline。
语法模板如下：

{ $facet:
   {
      <outputField1>: [ <stage1>, <stage2>, ... ],
      <outputField2>: [ <stage1>, <stage2>, ... ],
      ...
   }
}

每一个outputField对应一个pipeline。
artwork中含有如下数据：

{ "_id" : 1, "title" : "The Pillars of Society", "artist" : "Grosz", "year" : 1926,
  "price" : NumberDecimal("199.99"),
  "tags" : [ "painting", "satire", "Expressionism", "caricature" ] }
{ "_id" : 2, "title" : "Melancholy III", "artist" : "Munch", "year" : 1902,
  "price" : NumberDecimal("280.00"),
  "tags" : [ "woodcut", "Expressionism" ] }
{ "_id" : 3, "title" : "Dancer", "artist" : "Miro", "year" : 1925,
  "price" : NumberDecimal("76.04"),
  "tags" : [ "oil", "Surrealism", "painting" ] }
{ "_id" : 4, "title" : "The Great Wave off Kanagawa", "artist" : "Hokusai",
  "price" : NumberDecimal("167.30"),
  "tags" : [ "woodblock", "ukiyo-e" ] }
{ "_id" : 5, "title" : "The Persistence of Memory", "artist" : "Dali", "year" : 1931,
  "price" : NumberDecimal("483.00"),
  "tags" : [ "Surrealism", "painting", "oil" ] }
{ "_id" : 6, "title" : "Composition VII", "artist" : "Kandinsky", "year" : 1913,
  "price" : NumberDecimal("385.00"),
  "tags" : [ "oil", "painting", "abstract" ] }
{ "_id" : 7, "title" : "The Scream", "artist" : "Munch", "year" : 1893,
  "tags" : [ "Expressionism", "painting", "oil" ] }
{ "_id" : 8, "title" : "Blue Flower", "artist" : "O'Keefe", "year" : 1918,
  "price" : NumberDecimal("118.42"),
  "tags" : [ "abstract", "painting" ] }

执行命令：

db.artwork.aggregate( [
  {
    $facet: {
      "categorizedByTags": [
        { $unwind: "$tags" },
        { $sortByCount: "$tags" }
      ],
      "categorizedByPrice": [
        // Filter out documents without a price e.g., _id: 7
        { $match: { price: { $exists: 1 } } },
        {
          $bucket: {
            groupBy: "$price",
            boundaries: [  0, 150, 200, 300, 400 ],
            default: "Other",
            output: {
              "count": { $sum: 1 },
              "titles": { $push: "$title" }
            }
          }
        }
      ],
      "categorizedByYears(Auto)": [
        {
          $bucketAuto: {
            groupBy: "$year",
            buckets: 4
          }
        }
      ]
    }
  }
])

在这个例子中，我们有三条pipeline，分别为：categorizedByTags,categorizedByPrice,categorizedByYears。
最终返回的结果如下：
关于$bucket， $bucketAuto, $sortByCount的具体用法请参阅链接。
$bucket官方文档， $bucketAuto官方文档, $sortByCount官方文档

{
  "categorizedByYears(Auto)" : [
    // First bucket includes the document without a year, e.g., _id: 4
    { "_id" : { "min" : null, "max" : 1902 }, "count" : 2 },
    { "_id" : { "min" : 1902, "max" : 1918 }, "count" : 2 },
    { "_id" : { "min" : 1918, "max" : 1926 }, "count" : 2 },
    { "_id" : { "min" : 1926, "max" : 1931 }, "count" : 2 }
  ],
  "categorizedByPrice" : [
    {
      "_id" : 0,
      "count" : 2,
      "titles" : [
        "Dancer",
        "Blue Flower"
      ]
    },
    {
      "_id" : 150,
      "count" : 2,
      "titles" : [
        "The Pillars of Society",
        "The Great Wave off Kanagawa"
      ]
    },
    {
      "_id" : 200,
      "count" : 1,
      "titles" : [
        "Melancholy III"
      ]
    },
    {
      "_id" : 300,
      "count" : 1,
      "titles" : [
        "Composition VII"
      ]
    },
    {
      // Includes document price outside of bucket boundaries, e.g., _id: 5
      "_id" : "Other",
      "count" : 1,
      "titles" : [
        "The Persistence of Memory"
      ]
    }
  ],
  "categorizedByTags" : [
    { "_id" : "painting", "count" : 6 },
    { "_id" : "oil", "count" : 4 },
    { "_id" : "Expressionism", "count" : 3 },
    { "_id" : "Surrealism", "count" : 2 },
    { "_id" : "abstract", "count" : 2 },
    { "_id" : "woodblock", "count" : 1 },
    { "_id" : "woodcut", "count" : 1 },
    { "_id" : "ukiyo-e", "count" : 1 },
    { "_id" : "satire", "count" : 1 },
    { "_id" : "caricature", "count" : 1 }
  ]
}

13$count

scores集合中有如下数据：

{ "_id" : 1, "subject" : "History", "score" : 88 }
{ "_id" : 2, "subject" : "History", "score" : 92 }
{ "_id" : 3, "subject" : "History", "score" : 97 }
{ "_id" : 4, "subject" : "History", "score" : 71 }
{ "_id" : 5, "subject" : "History", "score" : 79 }
{ "_id" : 6, "subject" : "History", "score" : 83 }

执行命令：

db.scores.aggregate(
  [
    {
      $match: {
        score: {
          $gt: 80
        }
      }
    },
    {
      $count: "passing_scores"
    }
  ]
)

在这个例子中，首先使用match得到score大于80的文档集合，然后计算当前文档集合中文档的数量。并且将这个数量给”passing_scores”这个字段。
返回结果：