$bucketAuto (aggregation)

Definition

  • $bucketAuto

New in version 3.4.

Categorizes incoming documents into a specific number of groups,called buckets, based on a specified expression. Bucket boundariesare automatically determined in an attempt to evenly distribute thedocuments into the specified number of buckets.

Each bucket is represented as a document in the output. The documentfor each bucket contains an _id field, whose value specifies theinclusive lower bound and the exclusive upper bound for the bucket,and a count field that contains the number of documents in thebucket. The count field is included by default when theoutput is not specified.

The $bucketAuto stage has the following form:

  1. {
  2. $bucketAuto: {
  3. groupBy: <expression>,
  4. buckets: <number>,
  5. output: {
  6. <output1>: { <$accumulator expression> },
  7. ...
  8. }
  9. granularity: <string>
  10. }
  11. }

FieldTypeDescriptiongroupByexpressionAn expression to group documentsby. To specify a field path,prefix the field name with a dollar sign $ and enclose it inquotes.bucketsintegerA positive 32-bit integer that specifies the number of buckets intowhich input documents are grouped.outputdocumentOptional. A document that specifies the fields to include inthe output documents in addition to the _id field. Tospecify the field to include, you must use accumulatorexpressions:

  1. <outputfield1>: { <accumulator>: <expression1> },
  2. ...

The default count field is not included in the output documentwhen output is specified. Explicitly specify the countexpression as part of the output document to include it:

  1. output: {
  2. <outputfield1>: { <accumulator>: <expression1> },
  3. ...
  4. count: { $sum: 1 }
  5. }

granularitystringOptional. A string that specifies the preferred number series to use toensure that the calculated boundary edges end on preferredround numbers or their powers of 10.

Available only if the all groupBy values are numeric andnone of them are NaN.

The suppported values of granularity are:

  • "R5"
  • "R10"
  • "R20"
  • "R40"
  • "R80"
  • "1-2-5"
  • "E6"
  • "E12"
  • "E24"
  • "E48"
  • "E96"
  • "E192"
  • "POWERSOF2"

Behavior

There may be less than the specified number of buckets if:

  • The number of input documents is less than the specified number ofbuckets.
  • The number of unique values of the groupBy expression is lessthan the specified number of buckets.
  • The granularity has fewer intervals then the number ofbuckets.
  • The granularity is not fine enough to evenly distribute documentsinto the specified number of buckets.

If the groupBy expression refers to an array or document, thevalues are arranged using the same ordering as in $sortbefore determining the bucket boundaries.

The even distribution of documents across buckets depends on thecardinality, or the number of unique values, of the groupBy field. Ifthe cardinality is not high enough, the $bucketAuto stage may not evenlydistribute the results across buckets.

Granularity

The $bucketAuto accepts an optional granularity parameter whichensures that the boundaries of all buckets adhere to a specifiedpreferred number series.Using a preferred number series provides more control on where thebucket boundaries are set among the range of values in the groupByexpression. They may also be used to help logarithmically and evenlyset bucket boundaries when the range of the groupBy expressionscales exponentially.

Renard Series

The Renard number series are sets of numbers derived by taking eitherthe 5 th, 10 th, 20 th,40 th, or 80 th root of 10, then includingvarious powers of the root that equate to values between 1.0 to 10.0(10.3 in the case of R80).

Set granularity to R5, R10, R20, R40, or R80 torestrict bucket boundaries to values in the series. The values of theseries are multiplied by a power of 10 when the groupBy values areoutside of the 1.0 to 10.0 (10.3 for R80) range.

Example

The R5 series is based off of the fifth root of 10, which is1.58, and includes various powers of this root (rounded) until 10 isreached. The R5 series is derived as follows:

  • 10 0/5 = 1
  • 10 1/5 = 1.584 ~ 1.6
  • 10 2/5 = 2.511 ~ 2.5
  • 10 3/5 = 3.981 ~ 4.0
  • 10 4/5 = 6.309 ~ 6.3
  • 10 5/5 = 10

The same approach is applied to the other Renard series to offer finergranularity, i.e., more intervals between 1.0 and 10.0 (10.3 forR80).

E Series

The E number series are similar to theRenard series in that they subdivide theinterval from 1.0 to 10.0 by the 6 th,12 th, 24 th, 48 th,96 th, or 192 nd root of ten with aparticular relative error.

Set granularity to E6, E12, E24, E48, E96, orE192 to restrict bucket boundaries to values in the series. Thevalues of the series are multiplied by a power of 10 when thegroupBy values are outside of the 1.0 to 10.0 range. To learn moreabout the E-series and their respective relative errors, seepreferred number series.

1-2-5 Series

The 1-2-5 series behaves like a three-valueRenard series, if such a series existed.

Set granularity to 1-2-5 to restrict bucket boundaries tovarious powers of the third root of 10, rounded to one significantdigit.

Example

The following values are part of the 1-2-5 series:0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, and soon…

Powers of Two Series

Set granularity to POWERSOF2 to restrict bucket boundaries tonumbers that are a power of two.

Example

The following numbers adhere to the power of two Series:

  • 2 0 = 1
  • 2 1 = 2
  • 2 2 = 4
  • 2 3 = 8
  • 2 4 = 16
  • 2 5 = 32
  • and so on…

A common implementation is how various computer components, likememory, often adhere to the POWERSOF2 set of preferred numbers:

1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and so on….

Comparing Different Granularities

The following operation demonstrates how specifying different valuesfor granularity affects how $bucketAuto determines bucketboundaries. A collection of things have an _id numbered from1 to 100:

  1. { _id: 1 }
  2. { _id: 2 }
  3. ...
  4. { _id: 100 }

Different values for granularity are substituted into the followingoperation:

  1. db.things.aggregate( [
  2. {
  3. $bucketAuto: {
  4. groupBy: "$_id",
  5. buckets: 5,
  6. granularity: <granularity>
  7. }
  8. }
  9. ] )

The results in the following table demonstrate how different values forgranularity yield different bucket boundaries:

GranularityResultsNotes
No granularity{ “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 40 }, “count” : 20 }{ “_id” : { “min” : 40, “max” : 60 }, “count” : 20 }{ “_id” : { “min” : 60, “max” : 80 }, “count” : 20 }{ “_id” : { “min” : 80, “max” : 99 }, “count” : 20 }
R20{ “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 40 }, “count” : 20 }{ “_id” : { “min” : 40, “max” : 63 }, “count” : 23 }{ “_id” : { “min” : 63, “max” : 90 }, “count” : 27 }{ “_id” : { “min” : 90, “max” : 100 }, “count” : 10 }
E24{ “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 43 }, “count” : 23 }{ “_id” : { “min” : 43, “max” : 68 }, “count” : 25 }{ “_id” : { “min” : 68, “max” : 91 }, “count” : 23 }{ “_id” : { “min” : 91, “max” : 100 }, “count” : 9 }
1-2-5{ “_id” : { “min” : 0, “max” : 20 }, “count” : 20 }{ “_id” : { “min” : 20, “max” : 50 }, “count” : 30 }{ “_id” : { “min” : 50, “max” : 100 }, “count” : 50 }The specified number of buckets exceeds the number of intervalsin the series.
POWERSOF2{ “_id” : { “min” : 0, “max” : 32 }, “count” : 32 }{ “_id” : { “min” : 32, “max” : 64 }, “count” : 32 }{ “_id” : { “min” : 64, “max” : 128 }, “count” : 36 }The specified number of buckets exceeds the number of intervalsin the series.

Example

Consider a collection artwork with the following documents:

  1. { "_id" : 1, "title" : "The Pillars of Society", "artist" : "Grosz", "year" : 1926,
  2. "price" : NumberDecimal("199.99"),
  3. "dimensions" : { "height" : 39, "width" : 21, "units" : "in" } }
  4. { "_id" : 2, "title" : "Melancholy III", "artist" : "Munch", "year" : 1902,
  5. "price" : NumberDecimal("280.00"),
  6. "dimensions" : { "height" : 49, "width" : 32, "units" : "in" } }
  7. { "_id" : 3, "title" : "Dancer", "artist" : "Miro", "year" : 1925,
  8. "price" : NumberDecimal("76.04"),
  9. "dimensions" : { "height" : 25, "width" : 20, "units" : "in" } }
  10. { "_id" : 4, "title" : "The Great Wave off Kanagawa", "artist" : "Hokusai",
  11. "price" : NumberDecimal("167.30"),
  12. "dimensions" : { "height" : 24, "width" : 36, "units" : "in" } }
  13. { "_id" : 5, "title" : "The Persistence of Memory", "artist" : "Dali", "year" : 1931,
  14. "price" : NumberDecimal("483.00"),
  15. "dimensions" : { "height" : 20, "width" : 24, "units" : "in" } }
  16. { "_id" : 6, "title" : "Composition VII", "artist" : "Kandinsky", "year" : 1913,
  17. "price" : NumberDecimal("385.00"),
  18. "dimensions" : { "height" : 30, "width" : 46, "units" : "in" } }
  19. { "_id" : 7, "title" : "The Scream", "artist" : "Munch",
  20. "price" : NumberDecimal("159.00"),
  21. "dimensions" : { "height" : 24, "width" : 18, "units" : "in" } }
  22. { "_id" : 8, "title" : "Blue Flower", "artist" : "O'Keefe", "year" : 1918,
  23. "price" : NumberDecimal("118.42"),
  24. "dimensions" : { "height" : 24, "width" : 20, "units" : "in" } }

Single Facet Aggregation

In the following operation, input documents are grouped into fourbuckets according to the values in the price field:

  1. db.artwork.aggregate( [
  2. {
  3. $bucketAuto: {
  4. groupBy: "$price",
  5. buckets: 4
  6. }
  7. }
  8. ] )

The operation returns the following documents:

  1. {
  2. "_id" : {
  3. "min" : NumberDecimal("76.04"),
  4. "max" : NumberDecimal("159.00")
  5. },
  6. "count" : 2
  7. }
  8. {
  9. "_id" : {
  10. "min" : NumberDecimal("159.00"),
  11. "max" : NumberDecimal("199.99")
  12. },
  13. "count" : 2
  14. }
  15. {
  16. "_id" : {
  17. "min" : NumberDecimal("199.99"),
  18. "max" : NumberDecimal("385.00")
  19. },
  20. "count" : 2
  21. }
  22. {
  23. "_id" : {
  24. "min" : NumberDecimal("385.00"),
  25. "max" : NumberDecimal("483.00")
  26. },
  27. "count" : 2
  28. }

Multi-Faceted Aggregation

The $bucketAuto stage can be used within the$facet stage to process multiple aggregation pipelines onthe same set of input documents from artwork.

The following aggregation pipeline groups the documents from theartwork collection into buckets based on price, year, andthe calculated area:

  1. db.artwork.aggregate( [
  2. {
  3. $facet: {
  4. "price": [
  5. {
  6. $bucketAuto: {
  7. groupBy: "$price",
  8. buckets: 4
  9. }
  10. }
  11. ],
  12. "year": [
  13. {
  14. $bucketAuto: {
  15. groupBy: "$year",
  16. buckets: 3,
  17. output: {
  18. "count": { $sum: 1 },
  19. "years": { $push: "$year" }
  20. }
  21. }
  22. }
  23. ],
  24. "area": [
  25. {
  26. $bucketAuto: {
  27. groupBy: {
  28. $multiply: [ "$dimensions.height", "$dimensions.width" ]
  29. },
  30. buckets: 4,
  31. output: {
  32. "count": { $sum: 1 },
  33. "titles": { $push: "$title" }
  34. }
  35. }
  36. }
  37. ]
  38. }
  39. }
  40. ] )

The operation returns the following document:

  1. {
  2. "area" : [
  3. {
  4. "_id" : { "min" : 432, "max" : 500 },
  5. "count" : 3,
  6. "titles" : [
  7. "The Scream",
  8. "The Persistence of Memory",
  9. "Blue Flower"
  10. ]
  11. },
  12. {
  13. "_id" : { "min" : 500, "max" : 864 },
  14. "count" : 2,
  15. "titles" : [
  16. "Dancer",
  17. "The Pillars of Society"
  18. ]
  19. },
  20. {
  21. "_id" : { "min" : 864, "max" : 1568 },
  22. "count" : 2,
  23. "titles" : [
  24. "The Great Wave off Kanagawa",
  25. "Composition VII"
  26. ]
  27. },
  28. {
  29. "_id" : { "min" : 1568, "max" : 1568 },
  30. "count" : 1,
  31. "titles" : [
  32. "Melancholy III"
  33. ]
  34. }
  35. ],
  36. "price" : [
  37. {
  38. "_id" : { "min" : NumberDecimal("76.04"), "max" : NumberDecimal("159.00") },
  39. "count" : 2
  40. },
  41. {
  42. "_id" : { "min" : NumberDecimal("159.00"), "max" : NumberDecimal("199.99") },
  43. "count" : 2
  44. },
  45. {
  46. "_id" : { "min" : NumberDecimal("199.99"), "max" : NumberDecimal("385.00") },
  47. "count" : 2 },
  48. {
  49. "_id" : { "min" : NumberDecimal("385.00"), "max" : NumberDecimal("483.00") },
  50. "count" : 2
  51. }
  52. ],
  53. "year" : [
  54. { "_id" : { "min" : null, "max" : 1913 }, "count" : 3, "years" : [ 1902 ] },
  55. { "_id" : { "min" : 1913, "max" : 1926 }, "count" : 3, "years" : [ 1913, 1918, 1925 ] },
  56. { "_id" : { "min" : 1926, "max" : 1931 }, "count" : 2, "years" : [ 1926, 1931 ] }
  57. ]
  58. }