Aggregation Pipeline
The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results.
The aggregation pipeline provides an alternative to map-reduce and may be the preferred solution for aggregation tasks where the complexity of map-reduce may be unwarranted.
Example with the Zip Code Data Set
The examples in this document use the zipcodes collection. This collection is available at: media.mongodb.org/zips.json. Use mongoimport to load this data set into your mongod instance.
wget https://www.dropbox.com/s/sq0hiodvqmu1t0s/zips.json
mongoimport --db test --collection zipcodes --file zips.json
2018-07-24T07:32:33.917+0000 connected to: localhost
2018-07-24T07:32:34.270+0000 imported 29353 documents
Data Model
Each document in the zipcodes collection has the following form:
{
"_id": "10280",
"city": "NEW YORK",
"state": "NY",
"pop": 5574,
"loc": [-74.016323, 40.710537]
}
aggregate() Method
All of the following examples use the aggregate() helper in the mongo shell.
The aggregate() method uses the aggregation pipeline to processes documents into aggregated results. An aggregation pipeline consists of stages with each stage processing the documents as they pass along the pipeline. Documents pass through the stages in sequence.
The aggregate() method in the mongo shell provides a wrapper around the aggregate database command.
Return States with Populations above 10 Million
mongo
db.zipcodes.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }
] )
{ "_id" : "CA", "totalPop" : 29754890 }
{ "_id" : "FL", "totalPop" : 12686644 }
{ "_id" : "PA", "totalPop" : 11881643 }
{ "_id" : "NY", "totalPop" : 17990402 }
{ "_id" : "OH", "totalPop" : 10846517 }
{ "_id" : "IL", "totalPop" : 11427576 }
{ "_id" : "TX", "totalPop" : 16984601 }
- The
$groupstage groups the documents of the zipcode collection by the state field, calculates the totalPop field for each state, and outputs a document for each unique state. - The
$matchstage filters these grouped documents to output only those documents whose totalPop value is greater than or equal to 10 million.
The equivalent SQL for this aggregation operation is:
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
Return Average City Population by State
db.zipcodes.aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
] )
{ "_id" : "DC", "avgCityPop" : 303450 }
{ "_id" : "DE", "avgCityPop" : 14481.91304347826 }
{ "_id" : "RI", "avgCityPop" : 19292.653846153848 }
{ "_id" : "NJ", "avgCityPop" : 15775.89387755102 }
{ "_id" : "MT", "avgCityPop" : 2593.987012987013 }
{ "_id" : "CA", "avgCityPop" : 27756.42723880597 }
{ "_id" : "KS", "avgCityPop" : 3819.884259259259 }
{ "_id" : "MO", "avgCityPop" : 5672.195338512764 }
{ "_id" : "NH", "avgCityPop" : 5232.320754716981 }
{ "_id" : "OK", "avgCityPop" : 6155.743639921722 }
{ "_id" : "NE", "avgCityPop" : 3034.882692307692 }
{ "_id" : "CO", "avgCityPop" : 9981.075757575758 }
{ "_id" : "LA", "avgCityPop" : 10465.496277915632 }
{ "_id" : "ID", "avgCityPop" : 4320.811158798283 }
{ "_id" : "IL", "avgCityPop" : 9954.334494773519 }
{ "_id" : "AL", "avgCityPop" : 7907.2152641878665 }
{ "_id" : "OR", "avgCityPop" : 8262.561046511628 }
{ "_id" : "MD", "avgCityPop" : 12615.775725593667 }
{ "_id" : "AR", "avgCityPop" : 4175.355239786856 }
{ "_id" : "FL", "avgCityPop" : 27400.958963282937 }
Type "it" for more
- The first
$groupstage groups the documents by the combination of city and state, uses the $sum expression to calculate the population for each combination, and outputs a document for each city and state combination.
After this stage in the pipeline, the documents resemble the following:
{
"_id": {
"state": "CO",
"city": "EDGEWATER"
},
"pop": 13154
}
- A second
$groupstage groups the documents in the pipeline by the _id.state field (i.e. the state field inside the _id document), uses the $avg expression to calculate the average city population (avgCityPop) for each state, and outputs a document for each state.
Return Largest and Smallest Cities by State
db.zipcodes.aggregate( [
{ $group:
{
_id: { state: "$state", city: "$city" },
pop: { $sum: "$pop" }
}
},
{ $sort: { pop: 1 } },
{ $group:
{
_id : "$_id.state",
biggestCity: { $last: "$_id.city" },
biggestPop: { $last: "$pop" },
smallestCity: { $first: "$_id.city" },
smallestPop: { $first: "$pop" }
}
},
{ $project:
{ _id: 0,
state: "$_id",
biggestCity: { name: "$biggestCity", pop: "$biggestPop" },
smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
}
}
] )
{ "biggestCity" : { "name" : "NEWARK", "pop" : 111674 }, "smallestCity" : { "name" : "BETHEL", "pop" : 108 }, "state" : "DE" }
{ "biggestCity" : { "name" : "JACKSON", "pop" : 204788 }, "smallestCity" : { "name" : "CHUNKY", "pop" : 79 }, "state" : "MS" }
{ "biggestCity" : { "name" : "CRANSTON", "pop" : 176404 }, "smallestCity" : { "name" : "CLAYVILLE", "pop" : 45 }, "state" : "RI" }
{ "biggestCity" : { "name" : "SAINT LOUIS", "pop" : 397802 }, "smallestCity" : { "name" : "BENDAVIS", "pop" : 44 }, "state" : "MO" }
{ "biggestCity" : { "name" : "MIAMI", "pop" : 825232 }, "smallestCity" : { "name" : "CECIL FIELD NAS", "pop" : 0 }, "state" : "FL" }
{ "biggestCity" : { "name" : "LITTLE ROCK", "pop" : 192895 }, "smallestCity" : { "name" : "TOMATO", "pop" : 0 }, "state" : "AR" }
{ "biggestCity" : { "name" : "ATLANTA", "pop" : 609591 }, "smallestCity" : { "name" : "FORT STEWART", "pop" : 0 }, "state" : "GA" }
{ "biggestCity" : { "name" : "BURLINGTON", "pop" : 39127 }, "smallestCity" : { "name" : "UNIV OF VERMONT", "pop" : 0 }, "state" : "VT" }
{ "biggestCity" : { "name" : "ALBUQUERQUE", "pop" : 449584 }, "smallestCity" : { "name" : "MONUMENT", "pop" : 0 }, "state" : "NM" }
{ "biggestCity" : { "name" : "PHILADELPHIA", "pop" : 1610956 }, "smallestCity" : { "name" : "HAMILTON", "pop" : 0 }, "state" : "PA" }
{ "biggestCity" : { "name" : "WICHITA", "pop" : 295115 }, "smallestCity" : { "name" : "ARNOLD", "pop" : 0 }, "state" : "KS" }
{ "biggestCity" : { "name" : "MINNEAPOLIS", "pop" : 344719 }, "smallestCity" : { "name" : "JOHNSON", "pop" : 12 }, "state" : "MN" }
{ "biggestCity" : { "name" : "LOS ANGELES", "pop" : 2102295 }, "smallestCity" : { "name" : "OREGON HOUSE", "pop" : 0 }, "state" : "CA" }
{ "biggestCity" : { "name" : "BILLINGS", "pop" : 78805 }, "smallestCity" : { "name" : "HOMESTEAD", "pop" : 7 }, "state" : "MT" }
{ "biggestCity" : { "name" : "BIRMINGHAM", "pop" : 242606 }, "smallestCity" : { "name" : "ALLEN", "pop" : 0 }, "state" : "AL" }
{ "biggestCity" : { "name" : "PORTLAND", "pop" : 518543 }, "smallestCity" : { "name" : "KENT", "pop" : 0 }, "state" : "OR" }
{ "biggestCity" : { "name" : "BALTIMORE", "pop" : 733081 }, "smallestCity" : { "name" : "ANNAPOLIS JUNCTI", "pop" : 32 }, "state" : "MD" }
{ "biggestCity" : { "name" : "DENVER", "pop" : 451182 }, "smallestCity" : { "name" : "CHEYENNE MTN AFB", "pop" : 0 }, "state" : "CO" }
{ "biggestCity" : { "name" : "PORTLAND", "pop" : 63268 }, "smallestCity" : { "name" : "BUSTINS ISLAND", "pop" : 0 }, "state" : "ME" }
{ "biggestCity" : { "name" : "WORCESTER", "pop" : 169856 }, "smallestCity" : { "name" : "BUCKLAND", "pop" : 16 }, "state" : "MA" }
Type "it" for more
- The first
$groupstage groups the documents by the combination of the city and state, calculates the sum of the pop values for each combination, and outputs a document for each city and state combination. - The
$sortstage orders the documents in the pipeline by the pop field value, from smallest to largest; i.e. by increasing order. This operation does not alter the documents. - The next
$groupstage groups the now-sorted documents by the _id.state field (i.e. the state field inside the _id document) and outputs a document for each state.- The stage also calculates the following four fields for each state. Using the
$lastexpression, the$groupoperator creates the biggestCity and biggestPop fields that store the city with the largest population and that population. Using the$firstexpression, the$groupoperator creates the smallestCity and smallestPop fields that store the city with the smallest population and that population.
- The stage also calculates the following four fields for each state. Using the
- The final
$projectstage renames the _id field to state and moves the biggestCity, biggestPop, smallestCity, and smallestPop into biggestCity and smallestCity embedded documents.