Name | Description |
---|---|
$project | Reshapes a document stream. $project can rename, add, or remove fields as well as create computed values and sub-documents. |
$match | Filters the document stream, and only allows matching documents to pass into the next pipeline stage.$match uses standard MongoDB queries. |
$limit | Restricts the number of documents in an aggregation pipeline. |
$skip | Skips over a specified number of documents from the pipeline and returns the rest. |
$unwind | Takes an array of documents and returns them as a stream of documents. |
$group | Groups documents together for the purpose of calculating aggregate values based on a collection of documents. |
$sort | Takes all input documents and returns them in a stream of sorted documents. |
$geoNear | Returns an ordered stream of documents based on proximity to a geospatial point. |
db.article.aggregate( { $project : { title : 1 , author : 1 , }} );
db.article.aggregate( { $project : { _id : 0 , title : 1 , author : 1 }});
db.article.aggregate( { $project : { title : 1, doctoredPageViews : { $add:["$pageViews", 10] } }});
db.article.aggregate( { $project : { title : 1 , page_views : "$pageViews" , bar : "$other.foo" }});
db.article.aggregate( { $project : { title : 1 , stats : { pv : "$pageViews", foo : "$other.foo", dpv : { $add:["$pageViews", 10] } } }});
db.articles.aggregate( [ { $match : { score : { $gt : 70, $lte : 90 } } }, { $group: { _id: null, count: { $sum: 1 } } } ] );
db.article.aggregate( { $limit : 5 });
db.article.aggregate( { $skip : 5 });
{ "_id" : ObjectId("528751b0e7f3eea3d1412ce2"),
{
"result" : [
{
"_id" : ObjectId("528751b0e7f3eea3d1412ce2"),
"author" : "Jone",
"title" : "A book",
"tags" : "good"
},
{
"_id" : ObjectId("528751b0e7f3eea3d1412ce2"),
"author" : "Jone",
"title" : "A book",
"tags" : "fun"
},
{
"_id" : ObjectId("528751b0e7f3eea3d1412ce2"),
"author" : "Jone",
"title" : "A book",
"tags" : "good"
}
],
"ok" : 1
}
{ "result" : [ ], "ok" : 1 }
将$tags改为$tag因不存在该字段,该文档被忽略,输出的结果为空
at printStackTrace (src/mongo/shell/utils.js:37:15)
at DBCollection.aggregate (src/mongo/shell/collection.js:897:9)
at (shell):1:12
Sat Nov 16 19:16:54.488 JavaScript execution failed: aggregate failed: {
"errmsg" : "exception: $unwind: value at end of field path must be an array",
"code" : 15978,
"ok" : 0
} at src/mongo/shell/collection.js:L898
db.article.aggregate( { $group : { _id : "$author", docsPerAuthor : { $sum : 1 }, viewsPerAuthor : { $sum : "$pageViews" } }});
db.users.aggregate( { $sort : { age : -1, posts: 1 } });
Field | Type | Description |
---|---|---|
near | GeoJSON point orlegacy coordinate pairs | The point for which to find the closest documents. |
distanceField | string | The output field that contains the calculated distance. To specify a field within a subdocument, use dot notation. |
limit | number | Optional. The maximum number of documents to return. The default value is 100. See also the num option. |
num | number | Optional. The num option provides the same function as the limitoption. Both define the maximum number of documents to return. If both options are included, the num value overrides the limit value. |
maxDistance | number | Optional. A distance from the center point. Specify the distance in radians. MongoDB limits the results to those documents that fall within the specified distance from the center point. |
query | document | Optional. Limits the results to the documents that match the query. The query syntax is the usual MongoDB read operation query syntax. |
spherical | Boolean | Optional. If true, MongoDB references points using a spherical surface. The default value is false. |
distanceMultiplier | number | Optional. The factor to multiply all distances returned by the query. For example, use the distanceMultiplier to convert radians, as returned by a spherical query, to kilometers by multiplying by the radius of the Earth. |
includeLocs | string | Optional. This specifies the output field that identifies the location used to calculate the distance. This option is useful when a location field contains multiple locations. To specify a field within a subdocument, usedot notation. |
uniqueDocs | Boolean | Optional. If this value is true, the query returns a matching document once, even if more than one of the document’s location fields match the query. If this value is false, the query returns a document multiple times if the document has multiple matching location fields. See $uniqueDocsfor more information. |
db.places.aggregate([ { $geoNear: { near: [40.724, -73.997], distanceField: "dist.calculated", maxDistance: 0.008, query: { type: "public" }, includeLocs: "dist.location", uniqueDocs: true, num: 5 } } ])
{ "result" : [ { "_id" : 7, "name" : "Washington Square", "type" : "public", "location" : [ [ 40.731, -73.999 ], [ 40.732, -73.998 ], [ 40.730, -73.995 ], [ 40.729, -73.996 ] ], "dist" : { "calculated" : 0.0050990195135962296, "location" : [ 40.729, -73.996 ] } }, { "_id" : 8, "name" : "Sara D. Roosevelt Park", "type" : "public", "location" : [ [ 40.723, -73.991 ], [ 40.723, -73.990 ], [ 40.715, -73.994 ], [ 40.715, -73.994 ] ], "dist" : { "calculated" : 0.006082762530298062, "location" : [ 40.723, -73.991 ] } } ], "ok" : 1}
管道操作符作为“键”,所对应的“值”叫做管道表达式。例如上面例子中{$match:{status:"A"}},$match称为管道操作符,而{status:"A"}称为管道表达式,它可以看做是管道操作符的操作数(Operand),每个管道表达式是一个文档结构,它是由字段名、字段值、和一些表达式操作符组成的,例如上面例子中管道表达式就包含了一个表达式操作符$sum进行累加求和。
每个管道表达式只能作用于处理当前正在处理的文档,而不能进行跨文档的操作。管道表达式对文档的处理都是在内存中进行的。除了能够进行累加计算的管道表达式外,其他的表达式都是无状态的,也就是不会保留上下文的信息。累加性质的表达式操作符通常和$group操作符一起使用,来统计该组内最大值、最小值等,例如上面的例子中我们在$group管道操作符中使用了具有累加的$sum来计算总和。
Name | Description |
---|---|
$addToSet | Returns an array of all the unique values for the selected field among for each document in that group. |
$first | Returns the first value in a group. |
$last | Returns the last value in a group. |
$max | Returns the highest value in a group. |
$min | Returns the lowest value in a group. |
$avg | Returns an average of all the values in a group. |
$push | Returns an array of all values for the selected field among for each document in that group. |
$sum | Returns the sum of all the values in a group. |
Name | Description |
---|---|
$and | Returns true only when all values in its input array are true. |
$or | Returns true when any value in its input array are true. |
$not | Returns the boolean value that is the opposite of the input value. |
Name | Description |
---|---|
$cmp | Compares two values and returns the result of the comparison as an integer. |
$eq | Takes two values and returns true if the values are equivalent. |
$gt | Takes two values and returns true if the first is larger than the second. |
$gte | Takes two values and returns true if the first is larger than or equal to the second. |
$lt | Takes two values and returns true if the second value is larger than the first. |
$lte | Takes two values and returns true if the second value is larger than or equal to the first. |
$ne | Takes two values and returns true if the values are not equivalent. |
Name | Description |
---|---|
$add | Computes the sum of an array of numbers. |
$divide | Takes two numbers and divides the first number by the second. |
$mod | Takes two numbers and calcualtes the modulo of the first number divided by the second. |
$multiply | Computes the product of an array of numbers. |
$subtract | Takes two numbers and subtracts the second number from the first. |
Name | Description |
---|---|
$concat | Concatenates two strings. |
$strcasecmp | Compares two strings and returns an integer that reflects the comparison. |
$substr | Takes a string and returns portion of that string. |
$toLower | Converts a string to lowercase. |
$toUpper | Converts a string to uppercase. |
Name | Description |
---|---|
$dayOfYear | Converts a date to a number between 1 and 366. |
$dayOfMonth | Converts a date to a number between 1 and 31. |
$dayOfWeek | Converts a date to a number between 1 and 7. |
$year | Converts a date to the full year. |
$month | Converts a date into a number between 1 and 12. |
$week | Converts a date into a number between 0 and 53 |
$hour | Converts a date into a number between 0 and 23. |
$minute | Converts a date into a number between 0 and 59. |
$second | Converts a date into a number between 0 and 59. May be 60 to account for leap seconds. |
$millisecond | Returns the millisecond portion of a date as an integer between 0 and 999. |
Name | Description |
---|---|
$cond | A ternary operator that evaluates one expression, and depending on the result returns the value of one following expressions. |
$ifNull | Evaluates an expression and returns a value. |
以面向对象的思想去理解,整个流水线,可以理解为一个数据传输的管道;该管道中的每一个工作线程,可以理解为一个整个流水线的一个工作阶段stage,这些工作线程之间的合作是一环扣一环的。靠输入口越近的工作线程,是时序较早的工作阶段stage,它的工作成果会影响下一个工作线程阶段(stage)的工作结果,即下个阶段依赖于上一个阶段的输出,上一个阶段的输出成为本阶段的输入。这也是pipeline的一个共有特点!
对于大多数的聚合操作,聚合管道可以提供很好的性能和一致的接口,使用起来比较简单, 和MapReduce一样,它也可以作用于分片集合,但是输出的结果只能保留在一个文档中,要遵守BSON Document大小限制(当前是16M)。
MongoDB 2.1 多了新Feature - Aggregation Framework。最近工作需要就稍微看了下,Mark之。
Overview
Aggregation 提供的功能map-reduce也能做(诸如统计平均值,求和等)。官方那个大胖子说这东西比map-reduce简单, map-reduce 我没用过, 不过从使用Aggregation的情况来看, 进行统计等操作还是蛮方便的。
总体而言,Aggregation就是类似 Unix-like中的 管道 的概念,可以将很多数据流串起来,不同的数据处理阶段可以再上一个阶段的基础上再次加工。
Pipeline-Operator
比较常用的有:
•$sort - 排序
Usage - Java
我在db中造了些数据(数据时随机生成的, 能用即可),没有建索引,文档结构如下:
01 | Document结构: |
02 | { |
03 | "_id" : ObjectId( "509944545" ), |
04 | "province" : "海南" , |
05 | "age" : 21 , |
06 | "subjects" : [ |
07 | { |
08 | "name" : "语文" , |
09 | "score" : 53 |
10 | }, |
11 | { |
12 | "name" : "数学" , |
13 | "score" : 27 |
14 | }, |
15 | { |
16 | "name" : "英语" , |
17 | "score" : 35 |
18 | } |
19 | ], |
20 | "name" : "刘雨" |
21 | } |
- 统计上海学生平均年龄
-
统计每个省各科平均成绩
接下来一一道来
统计上海学生平均年龄
从这个需求来讲,要实现功能要有几个步骤: 1. 找出上海的学生. 2. 统计平均年龄 (当然也可以先算出所有省份的平均值再找出上海的)。如此思路也就清晰了
首先上 $match, 取出上海学生
1
{$match:{
'province'
:
'上海'
}}
1
{$group:{_id:’$province’,$avg:’$age’}}
上面两个命令等价于
1
select
province,
avg
(age)
2
from
student
3
where
province =
'上海'
4
group
by
province
下面是Java代码
01
Mongo m =
new
Mongo(
"localhost"
,
27017
);
02
DB db = m.getDB(
"test"
);
03
DBCollection coll = db.getCollection(
"student"
);
04
05
/*创建 $match, 作用相当于query*/
06
DBObject match = new BasicDBObject("$match", new BasicDBObject("province", "上海"));
07
08
/* Group操作*/
09
DBObject groupFields = new BasicDBObject("_id", "$province");
10
groupFields.put("AvgAge", new BasicDBObject("$avg", "$age"));
11
DBObject group = new BasicDBObject("$group", groupFields);
12
13
/* 查看Group结果 */
14
AggregationOutput output = coll.aggregate(match, group);
// 执行 aggregation命令
15
System.out.println(output.getCommandResult());
1
{
"serverUsed"
:
"localhost/127.0.0.1:27017"
,
2
"result"
: [
3
{
"_id"
:
"上海"
,
"AvgAge"
:
32.09375
}
4
] ,
5
"ok"
:
1.0
6
}
统计每个省各科平均成绩
首先更具数据库文档结构,subjects是数组形式,需要先‘劈’开,然后再进行统计
主要处理步骤如下:
1. 先用$unwind 拆数组 2. 按照 province, subject 分租并求各科目平均分
$unwind 拆数组
1
{$unwind:’$subjects’}
1
{$group:{
2
_id:{
3
subjname:”$subjects.name”,
// 指定group字段之一 subjects.name, 并重命名为 subjname
4
province:’$province’
// 指定group字段之一 province, 并重命名为 province(没变)
5
},
6
AvgScore:{
7
$avg:”$subjects.score”
// 对 subjects.score 求平均
8
}
9
}
01
Mongo m =
new
Mongo(
"localhost"
,
27017
);
02
DB db = m.getDB(
"test"
);
03
DBCollection coll = db.getCollection(
"student"
);
04
05
/* 创建 $unwind 操作, 用于切分数组*/
06
DBObject unwind = new BasicDBObject("$unwind", "$subjects");
07
08
/* Group操作*/
09
DBObject groupFields = new BasicDBObject("_id", new BasicDBObject("subjname", "$subjects.name").append("province", "$province"));
10
groupFields.put("AvgScore", new BasicDBObject("$avg", "$subjects.scores"));
11
DBObject group = new BasicDBObject("$group", groupFields);
12
13
/* 查看Group结果 */
14
AggregationOutput output = coll.aggregate(unwind, group);
// 执行 aggregation命令
15
System.out.println(output.getCommandResult());
01
{
"serverUsed"
:
"localhost/127.0.0.1:27017"
,
02
"result"
: [
03
{
"_id"
: {
"subjname"
:
"英语"
,
"province"
:
"海南"
} ,
"AvgScore"
:
58.1
} ,
04
{
"_id"
: {
"subjname"
:
"数学"
,
"province"
:
"海南"
} ,
"AvgScore"
:
60.485
} ,
05
{
"_id"
: {
"subjname"
:
"语文"
,
"province"
:
"江西"
} ,
"AvgScore"
:
55.538
} ,
06
{
"_id"
: {
"subjname"
:
"英语"
,
"province"
:
"上海"
} ,
"AvgScore"
:
57.65625
} ,
07
{
"_id"
: {
"subjname"
:
"数学"
,
"province"
:
"广东"
} ,
"AvgScore"
:
56.690
} ,
08
{
"_id"
: {
"subjname"
:
"数学"
,
"province"
:
"上海"
} ,
"AvgScore"
:
55.671875
,
09
{
"_id"
: {
"subjname"
:
"语文"
,
"province"
:
"上海"
} ,
"AvgScore"
:
56.734375
,
10
{
"_id"
: {
"subjname"
:
"英语"
,
"province"
:
"云南"
} ,
"AvgScore"
:
55.7301
} ,
11
.
12
.
13
.
14
.
15
"ok"
:
1.0
16
}
接下来进行下加强,
支线任务: 将同一省份的科目成绩统计到一起( 即,期望 'province':'xxxxx', avgscores:[ {'xxx':xxx}, ....] 这样的形式)
要做的有一件事,在前面的统计结果的基础上,先用 $project 将平均分和成绩揉到一起,即形如下面的样子
1
{
"subjinfo"
: {
"subjname"
:
"英语"
,
"AvgScores"
:
58.1
} ,
"province"
:
"海南"
}
再按省份group,将各科目的平均分push到一块,命令如下:
$project 重构group结果
1
{$project:{province:
"$_id.province"
, subjinfo:{
"subjname"
:
"$_id.subjname"
,
"avgscore"
:
"$AvgScore"
}}
1
{$group:{_id:
"$province"
, avginfo:{$push:
"$subjinfo"
}}}
01
Mongo m =
new
Mongo(
"localhost"
,
27017
);
02
DB db = m.getDB(
"test"
);
03
DBCollection coll = db.getCollection(
"student"
);
04
05
/* 创建 $unwind 操作, 用于切分数组*/
06
DBObject unwind = new BasicDBObject("$unwind", "$subjects");
07
08
/* Group操作*/
09
DBObject groupFields = new BasicDBObject("_id", new BasicDBObject("subjname", "$subjects.name").append("province", "$province"));
10
groupFields.put("AvgScore", new BasicDBObject("$avg", "$subjects.scores"));
11
DBObject group = new BasicDBObject("$group", groupFields);
12
13
/* Reshape Group Result*/
14
DBObject projectFields = new BasicDBObject();
15
projectFields.put("province", "$_id.province");
16
projectFields.put("subjinfo", new BasicDBObject("subjname","$_id.subjname").append("avgscore", "$AvgScore"));
17
DBObject project = new BasicDBObject("$project", projectFields);
18
19
/* 将结果push到一起*/
20
DBObject groupAgainFields = new BasicDBObject("_id", "$province");
21
groupAgainFields.put("avginfo", new BasicDBObject("$push", "$subjinfo"));
22
DBObject reshapeGroup = new BasicDBObject("$group", groupAgainFields);
23
24
/* 查看Group结果 */
25
AggregationOutput output = coll.aggregate(unwind, group, project, reshapeGroup);
26
System.out.println(output.getCommandResult());
结果如下:
01
{
"serverUsed"
:
"localhost/127.0.0.1:27017"
,
02
"result"
: [
03
{
"_id"
:
"辽宁"
,
"avginfo"
: [ {
"subjname"
:
"数学"
,
"avgscore"
:
56.46666666666667
} , {
"subjname"
:
"英语"
,
"avgscore"
:
52.093333333333334
} , {
"subjname"
:
"语文"
,
"avgscore"
:
50.53333333333333
}]} ,
04
{
"_id"
:
"四川"
,
"avginfo"
: [ {
"subjname"
:
"数学"
,
"avgscore"
:
52.72727272727273
} , {
"subjname"
:
"英语"
,
"avgscore"
:
55.90909090909091
} , {
"subjname"
:
"语文"
,
"avgscore"
:
57.59090909090909
}]} ,
05
{
"_id"
:
"重庆"
,
"avginfo"
: [ {
"subjname"
:
"语文"
,
"avgscore"
:
56.077922077922075
} , {
"subjname"
:
"英语"
,
"avgscore"
:
54.84415584415584
} , {
"subjname"
:
"数学"
,
"avgscore"
:
55.33766233766234
}]} ,
06
{
"_id"
:
"安徽"
,
"avginfo"
: [ {
"subjname"
:
"英语"
,
"avgscore"
:
55.458333333333336
} , {
"subjname"
:
"数学"
,
"avgscore"
:
54.47222222222222
} , {
"subjname"
:
"语文"
,
"avgscore"
:
52.80555555555556
}]}
07
.
08
.
09
.
10
] ,
"ok"
:
1.0
}