$bucketAuto (aggregation)
Definition
New in version 3.4.
Categorizes incoming documents into a specific number of groups,called buckets, based on a specified expression. Bucket boundariesare automatically determined in an attempt to evenly distribute thedocuments into the specified number of buckets.
Each bucket is represented as a document in the output. The documentfor each bucket contains an _id
field, whose value specifies theinclusive lower bound and the exclusive upper bound for the bucket,and a count
field that contains the number of documents in thebucket. The count
field is included by default when theoutput
is not specified.
The $bucketAuto
stage has the following form:
- {
- $bucketAuto: {
- groupBy: <expression>,
- buckets: <number>,
- output: {
- <output1>: { <$accumulator expression> },
- ...
- }
- granularity: <string>
- }
- }
FieldTypeDescriptiongroupBy
expressionAn expression to group documentsby. To specify a field path,prefix the field name with a dollar sign $
and enclose it inquotes.buckets
integerA positive 32-bit integer that specifies the number of buckets intowhich input documents are grouped.output
documentOptional. A document that specifies the fields to include inthe output documents in addition to the _id
field. Tospecify the field to include, you must use accumulatorexpressions:
- <outputfield1>: { <accumulator>: <expression1> },
- ...
The default count
field is not included in the output documentwhen output
is specified. Explicitly specify the count
expression as part of the output
document to include it:
- output: {
- <outputfield1>: { <accumulator>: <expression1> },
- ...
- count: { $sum: 1 }
- }
granularity
stringOptional. A string that specifies the preferred number series to use toensure that the calculated boundary edges end on preferredround numbers or their powers of 10.
Available only if the all groupBy
values are numeric andnone of them are NaN
.
The suppported values of granularity
are:
"R5"
"R10"
"R20"
"R40"
"R80"
"1-2-5"
"E6"
"E12"
"E24"
"E48"
"E96"
"E192"
"POWERSOF2"
Behavior
There may be less than the specified number of buckets if:
- The number of input documents is less than the specified number ofbuckets.
- The number of unique values of the
groupBy
expression is lessthan the specified number ofbuckets
. - The
granularity
has fewer intervals then the number ofbuckets
. - The
granularity
is not fine enough to evenly distribute documentsinto the specified number ofbuckets
.
If the groupBy
expression refers to an array or document, thevalues are arranged using the same ordering as in $sort
before determining the bucket boundaries.
The even distribution of documents across buckets depends on thecardinality, or the number of unique values, of the groupBy
field. Ifthe cardinality is not high enough, the $bucketAuto stage may not evenlydistribute the results across buckets.
Granularity
The $bucketAuto
accepts an optional granularity
parameter whichensures that the boundaries of all buckets adhere to a specifiedpreferred number series.Using a preferred number series provides more control on where thebucket boundaries are set among the range of values in the groupBy
expression. They may also be used to help logarithmically and evenlyset bucket boundaries when the range of the groupBy
expressionscales exponentially.
Renard Series
The Renard number series are sets of numbers derived by taking eitherthe 5 th, 10 th, 20 th,40 th, or 80 th root of 10, then includingvarious powers of the root that equate to values between 1.0 to 10.0(10.3 in the case of R80
).
Set granularity
to R5
, R10
, R20
, R40
, or R80
torestrict bucket boundaries to values in the series. The values of theseries are multiplied by a power of 10 when the groupBy
values areoutside of the 1.0 to 10.0 (10.3 for R80
) range.
Example
The R5
series is based off of the fifth root of 10, which is1.58, and includes various powers of this root (rounded) until 10 isreached. The R5
series is derived as follows:
- 10 0/5 = 1
- 10 1/5 = 1.584 ~ 1.6
- 10 2/5 = 2.511 ~ 2.5
- 10 3/5 = 3.981 ~ 4.0
- 10 4/5 = 6.309 ~ 6.3
- 10 5/5 = 10
The same approach is applied to the other Renard series to offer finergranularity, i.e., more intervals between 1.0 and 10.0 (10.3 forR80
).
E Series
The E number series are similar to theRenard series in that they subdivide theinterval from 1.0 to 10.0 by the 6 th,12 th, 24 th, 48 th,96 th, or 192 nd root of ten with aparticular relative error.
Set granularity
to E6
, E12
, E24
, E48
, E96
, orE192
to restrict bucket boundaries to values in the series. Thevalues of the series are multiplied by a power of 10 when thegroupBy
values are outside of the 1.0 to 10.0 range. To learn moreabout the E-series and their respective relative errors, seepreferred number series.
1-2-5 Series
The 1-2-5
series behaves like a three-valueRenard series, if such a series existed.
Set granularity
to 1-2-5
to restrict bucket boundaries tovarious powers of the third root of 10, rounded to one significantdigit.
Example
The following values are part of the 1-2-5
series:0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, and soon…
Powers of Two Series
Set granularity
to POWERSOF2
to restrict bucket boundaries tonumbers that are a power of two.
Example
The following numbers adhere to the power of two Series:
- 2 0 = 1
- 2 1 = 2
- 2 2 = 4
- 2 3 = 8
- 2 4 = 16
- 2 5 = 32
- and so on…
A common implementation is how various computer components, likememory, often adhere to the POWERSOF2
set of preferred numbers:
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and so on….
Comparing Different Granularities
The following operation demonstrates how specifying different valuesfor granularity
affects how $bucketAuto
determines bucketboundaries. A collection of things
have an _id
numbered from1 to 100:
- { _id: 1 }
- { _id: 2 }
- ...
- { _id: 100 }
Different values for granularity
are substituted into the followingoperation:
- db.things.aggregate( [
- {
- $bucketAuto: {
- groupBy: "$_id",
- buckets: 5,
- granularity: <granularity>
- }
- }
- ] )
The results in the following table demonstrate how different values forgranularity
yield different bucket boundaries:
Granularity | Results | Notes |
---|---|---|
No granularity | { “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 40 }, “count” : 20 }{ “_id” : { “min” : 40, “max” : 60 }, “count” : 20 }{ “_id” : { “min” : 60, “max” : 80 }, “count” : 20 }{ “_id” : { “min” : 80, “max” : 99 }, “count” : 20 } | |
R20 | { “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 40 }, “count” : 20 }{ “_id” : { “min” : 40, “max” : 63 }, “count” : 23 }{ “_id” : { “min” : 63, “max” : 90 }, “count” : 27 }{ “_id” : { “min” : 90, “max” : 100 }, “count” : 10 } | |
E24 | { “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 43 }, “count” : 23 }{ “_id” : { “min” : 43, “max” : 68 }, “count” : 25 }{ “_id” : { “min” : 68, “max” : 91 }, “count” : 23 }{ “_id” : { “min” : 91, “max” : 100 }, “count” : 9 } | |
1-2-5 | { “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 50 }, “count” : 30 }{ “_id” : { “min” : 50, “max” : 100 }, “count” : 50 } | The specified number of buckets exceeds the number of intervalsin the series. |
POWERSOF2 | { “_id” : { “min” : 0, “max” : 32 }, “count” : 32 }{ “_id” : { “min” : 32, “max” : 64 }, “count” : 32 }{ “_id” : { “min” : 64, “max” : 128 }, “count” : 36 } | The specified number of buckets exceeds the number of intervalsin the series. |
Example
Consider a collection artwork
with the following documents:
- { "_id" : 1, "title" : "The Pillars of Society", "artist" : "Grosz", "year" : 1926,
- "price" : NumberDecimal("199.99"),
- "dimensions" : { "height" : 39, "width" : 21, "units" : "in" } }
- { "_id" : 2, "title" : "Melancholy III", "artist" : "Munch", "year" : 1902,
- "price" : NumberDecimal("280.00"),
- "dimensions" : { "height" : 49, "width" : 32, "units" : "in" } }
- { "_id" : 3, "title" : "Dancer", "artist" : "Miro", "year" : 1925,
- "price" : NumberDecimal("76.04"),
- "dimensions" : { "height" : 25, "width" : 20, "units" : "in" } }
- { "_id" : 4, "title" : "The Great Wave off Kanagawa", "artist" : "Hokusai",
- "price" : NumberDecimal("167.30"),
- "dimensions" : { "height" : 24, "width" : 36, "units" : "in" } }
- { "_id" : 5, "title" : "The Persistence of Memory", "artist" : "Dali", "year" : 1931,
- "price" : NumberDecimal("483.00"),
- "dimensions" : { "height" : 20, "width" : 24, "units" : "in" } }
- { "_id" : 6, "title" : "Composition VII", "artist" : "Kandinsky", "year" : 1913,
- "price" : NumberDecimal("385.00"),
- "dimensions" : { "height" : 30, "width" : 46, "units" : "in" } }
- { "_id" : 7, "title" : "The Scream", "artist" : "Munch",
- "price" : NumberDecimal("159.00"),
- "dimensions" : { "height" : 24, "width" : 18, "units" : "in" } }
- { "_id" : 8, "title" : "Blue Flower", "artist" : "O'Keefe", "year" : 1918,
- "price" : NumberDecimal("118.42"),
- "dimensions" : { "height" : 24, "width" : 20, "units" : "in" } }
Single Facet Aggregation
In the following operation, input documents are grouped into fourbuckets according to the values in the price
field:
- db.artwork.aggregate( [
- {
- $bucketAuto: {
- groupBy: "$price",
- buckets: 4
- }
- }
- ] )
The operation returns the following documents:
- {
- "_id" : {
- "min" : NumberDecimal("76.04"),
- "max" : NumberDecimal("159.00")
- },
- "count" : 2
- }
- {
- "_id" : {
- "min" : NumberDecimal("159.00"),
- "max" : NumberDecimal("199.99")
- },
- "count" : 2
- }
- {
- "_id" : {
- "min" : NumberDecimal("199.99"),
- "max" : NumberDecimal("385.00")
- },
- "count" : 2
- }
- {
- "_id" : {
- "min" : NumberDecimal("385.00"),
- "max" : NumberDecimal("483.00")
- },
- "count" : 2
- }
Multi-Faceted Aggregation
The $bucketAuto
stage can be used within the$facet
stage to process multiple aggregation pipelines onthe same set of input documents from artwork
.
The following aggregation pipeline groups the documents from theartwork
collection into buckets based on price
, year
, andthe calculated area
:
- db.artwork.aggregate( [
- {
- $facet: {
- "price": [
- {
- $bucketAuto: {
- groupBy: "$price",
- buckets: 4
- }
- }
- ],
- "year": [
- {
- $bucketAuto: {
- groupBy: "$year",
- buckets: 3,
- output: {
- "count": { $sum: 1 },
- "years": { $push: "$year" }
- }
- }
- }
- ],
- "area": [
- {
- $bucketAuto: {
- groupBy: {
- $multiply: [ "$dimensions.height", "$dimensions.width" ]
- },
- buckets: 4,
- output: {
- "count": { $sum: 1 },
- "titles": { $push: "$title" }
- }
- }
- }
- ]
- }
- }
- ] )
The operation returns the following document:
- {
- "area" : [
- {
- "_id" : { "min" : 432, "max" : 500 },
- "count" : 3,
- "titles" : [
- "The Scream",
- "The Persistence of Memory",
- "Blue Flower"
- ]
- },
- {
- "_id" : { "min" : 500, "max" : 864 },
- "count" : 2,
- "titles" : [
- "Dancer",
- "The Pillars of Society"
- ]
- },
- {
- "_id" : { "min" : 864, "max" : 1568 },
- "count" : 2,
- "titles" : [
- "The Great Wave off Kanagawa",
- "Composition VII"
- ]
- },
- {
- "_id" : { "min" : 1568, "max" : 1568 },
- "count" : 1,
- "titles" : [
- "Melancholy III"
- ]
- }
- ],
- "price" : [
- {
- "_id" : { "min" : NumberDecimal("76.04"), "max" : NumberDecimal("159.00") },
- "count" : 2
- },
- {
- "_id" : { "min" : NumberDecimal("159.00"), "max" : NumberDecimal("199.99") },
- "count" : 2
- },
- {
- "_id" : { "min" : NumberDecimal("199.99"), "max" : NumberDecimal("385.00") },
- "count" : 2 },
- {
- "_id" : { "min" : NumberDecimal("385.00"), "max" : NumberDecimal("483.00") },
- "count" : 2
- }
- ],
- "year" : [
- { "_id" : { "min" : null, "max" : 1913 }, "count" : 3, "years" : [ 1902 ] },
- { "_id" : { "min" : 1913, "max" : 1926 }, "count" : 3, "years" : [ 1913, 1918, 1925 ] },
- { "_id" : { "min" : 1926, "max" : 1931 }, "count" : 2, "years" : [ 1926, 1931 ] }
- ]
- }