Aggregation with the Zip Code Data Set

Aggregation with the Zip Code Data Set

The examples in this document use the zipcodes collection. Thiscollection is available at: media.mongodb.org/zips.json. Use mongoimport toload this data set into your mongod instance.

Data Model

Each document in the zipcodes collection has the following form:

{
  "_id": "10280",
  "city": "NEW YORK",
  "state": "NY",
  "pop": 5574,
  "loc": [
    -74.016323,
    40.710537
  ]
}

The _id field holds the zip code as a string.
The city field holds the city name. A city can have more than onezip code associated with it as different sections of the city caneach have a different zip code.
The state field holds the two letter state abbreviation.
The pop field holds the population.
The loc field holds the location as a latitude longitude pair.

aggregate() Method

All of the following examples use the aggregate() helper in the mongo shell.

The aggregate() method uses theaggregation pipeline to processesdocuments into aggregated results. An aggregation pipeline consists of stages with each stage processingthe documents as they pass along the pipeline. Documents pass throughthe stages in sequence.

The aggregate() method in themongo shell provides a wrapper around theaggregate database command. See the documentation for yourdriver for a more idiomatic interfacefor data aggregation operations.

Return States with Populations above 10 Million

The following aggregation operation returns all states with totalpopulation greater than 10 million:

db.zipcodes.aggregate( [
   { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
   { $match: { totalPop: { $gte: 10*1000*1000 } } }
] )

In this example, the aggregation pipelineconsists of the $group stage followed by the$match stage:

The $group stage groups the documents of the zipcodecollection by the state field, calculates the totalPop fieldfor each state, and outputs a document for each unique state.

The new per-state documents have two fields: the _id field andthe totalPop field. The _id field contains the value of thestate; i.e. the group by field. The totalPop field is acalculated field that contains the total population of each state. Tocalculate the value, $group uses the $sumoperator to add the population field (pop) for each state.

After the $group stage, the documents in thepipeline resemble the following:

{
  "_id" : "AK",
  "totalPop" : 550043
}

The $match stage filters these grouped documents tooutput only those documents whose totalPop value is greater thanor equal to 10 million. The $match stage does not alterthe matching documents but outputs the matching documents unmodified.

The equivalent SQL for this aggregation operation is:

SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)

Return Average City Population by State

The following aggregation operation returns the average populations forcities in each state:

db.zipcodes.aggregate( [
   { $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
   { $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
] )

In this example, the aggregation pipelineconsists of the $group stage followed by another$group stage:

The first $group stage groups the documents by thecombination of city and state, uses the $sumexpression to calculate the population for each combination, andoutputs a document for each city and state combination.[1]

After this stage in the pipeline, the documents resemble thefollowing:

{
  "_id" : {
    "state" : "CO",
    "city" : "EDGEWATER"
  },
  "pop" : 13154
}

A second $group stage groups the documents in thepipeline by the _id.state field (i.e. the state field insidethe _id document), uses the $avg expression to calculatethe average city population (avgCityPop) for each state, andoutputs a document for each state.

The documents that result from this aggregation operation resembles thefollowing:

{
  "_id" : "MN",
  "avgCityPop" : 5335
}

Return Largest and Smallest Cities by State

The following aggregation operation returns the smallest and largestcities by population for each state:

db.zipcodes.aggregate( [
   { $group:
      {
        _id: { state: "$state", city: "$city" },
        pop: { $sum: "$pop" }
      }
   },
   { $sort: { pop: 1 } },
   { $group:
      {
        _id : "$_id.state",
        biggestCity:  { $last: "$_id.city" },
        biggestPop:   { $last: "$pop" },
        smallestCity: { $first: "$_id.city" },
        smallestPop:  { $first: "$pop" }
      }
   },
 
  // the following $project is optional, and
  // modifies the output format.
 
  { $project:
    { _id: 0,
      state: "$_id",
      biggestCity:  { name: "$biggestCity",  pop: "$biggestPop" },
      smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
    }
  }
] )

In this example, the aggregation pipelineconsists of a $group stage, a $sort stage,another $group stage, and a $project stage:

The first $group stage groups the documents by thecombination of the city and state, calculates the sum of the pop values for each combination, and outputs adocument for each city and state combination.

At this stage in the pipeline, the documents resemble the following:

{
  "_id" : {
    "state" : "CO",
    "city" : "EDGEWATER"
  },
  "pop" : 13154
}

The $sort stage orders the documents in the pipeline bythe pop field value, from smallest to largest; i.e. byincreasing order. This operation does not alter the documents.
The next $group stage groups the now-sorted documentsby the _id.state field (i.e. the state field inside the_id document) and outputs a document for each state.

The stage also calculates the following four fields for each state.Using the $last expression, the $group operatorcreates the biggestCity and biggestPop fields that store thecity with the largest population and that population. Using the$first expression, the $group operator createsthe smallestCity and smallestPop fields that store the citywith the smallest population and that population.

The documents, at this stage in the pipeline, resemble the following:

{
  "_id" : "WA",
  "biggestCity" : "SEATTLE",
  "biggestPop" : 520096,
  "smallestCity" : "BENGE",
  "smallestPop" : 2
}

The final $project stage renames the _id field tostate and moves the biggestCity, biggestPop,smallestCity, and smallestPop into biggestCity andsmallestCity embedded documents.

The output documents of this aggregation operation resemble the following:

{
  "state" : "RI",
  "biggestCity" : {
    "name" : "CRANSTON",
    "pop" : 176404
  },
  "smallestCity" : {
    "name" : "CLAYVILLE",
    "pop" : 45
  }
}

[1]	A city can have more than one zipcode associated with it as different sections of the city can eachhave a different zip code.