$substrBytes (aggregation)

Definition

$substrBytes

New in version 3.4.

Returns the substring of a string. The substring starts with the character at the specified UTF-8 byte index (zero-based) in the string and continues for the number of bytes specified.

$substrBytes has the following operator expression syntax:

{ $substrBytes: [ <string expression>, <byte index>, <byte count> ] }

Field	Type	Description
string expression	string	The string from which the substring will be extracted. string expression can be any valid expression as long as it resolves to a string. For more information on expressions, see Expressions. If the argument resolves to a value of null or refers to a field that is missing, $substrBytes returns an empty string. If the argument resolves to neither a string nor null and does not refer to a missing field, $substrBytes returns an error.
byte index	number	Indicates the starting point of the substring. byte index can be any valid expression as long as it resolves to a non-negative integer or a number that can be represented as an integer (such as 2.0). byte index cannot refer to a starting index located in the middle of a multi-byte UTF-8 character.
byte count	number	Can be any valid expression as long as it resolves to a non-negative integer or a number that can be represented as an integer (such as 2.0). byte count cannot result in an ending index that is in the middle of a UTF-8 character.

Behavior

The $substrBytes operator uses the indexes of UTF-8 encoded bytes where each code point, or character, may use between one and four bytes to encode.

For example, US-ASCII characters are encoded using one byte. Characters with diacritic markings and additional Latin alphabetical characters (i.e. Latin characters outside of the English alphabet) are encoded using two bytes. Chinese, Japanese and Korean characters typically require three bytes, and other planes of Unicode (emoji, mathematical symbols, etc.) require four bytes.
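As a rough illustration of these byte widths, the following Node.js sketch counts the UTF-8 bytes of one sample character from each group. The helper name utf8ByteLength is hypothetical and is not part of MongoDB; it only wraps the standard Node.js Buffer.byteLength function.

```javascript
// Hypothetical helper (not a MongoDB API): number of UTF-8 bytes in a string.
function utf8ByteLength(str) {
  return Buffer.byteLength(str, "utf8");
}

// One sample per group: US-ASCII, Latin with diacritic, CJK, emoji.
// "\u00e9" is the precomposed form of "é".
const samples = ["a", "\u00e9", "寿", "😀"];
for (const ch of samples) {
  console.log(ch, utf8ByteLength(ch));
}
// a 1
// é 2
// 寿 3
// 😀 4
```

Note that a decomposed "é" (the letter e followed by a combining accent, U+0301) would occupy three bytes rather than two, so the byte layout depends on how the text was normalized.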

It is important to be mindful of the content in the string expression because providing a byte index or byte count located in the middle of a UTF-8 character will result in an error.

$substrBytes differs from $substrCP in that $substrBytes counts the bytes of each character, whereas $substrCP counts the code points, or characters, regardless of how many bytes a character uses.
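The difference between counting bytes and counting code points can be sketched outside the database. The following Node.js snippet is only an approximation of the two counting schemes, not the operators themselves; the function names firstBytes and firstCodePoints are hypothetical.

```javascript
// Hypothetical helpers (not MongoDB APIs) contrasting the two counting schemes.

// Take the first n UTF-8 bytes, as $substrBytes counts.
function firstBytes(str, n) {
  return Buffer.from(str, "utf8").subarray(0, n).toString("utf8");
}

// Take the first n code points, as $substrCP counts.
// Array.from iterates a string by code point, not by UTF-16 unit.
function firstCodePoints(str, n) {
  return Array.from(str).slice(0, n).join("");
}

const word = "\u00e9clair"; // "éclair"; the "é" occupies two bytes
console.log(firstBytes(word, 3));      // éc
console.log(firstCodePoints(word, 3)); // écl
```

Unlike $substrBytes, Buffer decoding does not raise an error when a slice ends in the middle of a character; it substitutes a replacement character instead, so this sketch should not be used to predict error behavior.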

Example	Results
{ $substrBytes: [ "abcde", 1, 2 ] }	"bc"
{ $substrBytes: [ "Hello World!", 6, 5 ] }	"World"
{ $substrBytes: [ "cafétéria", 0, 5 ] }	"café"
{ $substrBytes: [ "cafétéria", 5, 4 ] }	"tér"
{ $substrBytes: [ "cafétéria", 7, 3 ] }	Errors with message: `"Error: Invalid range, starting index is a UTF-8 continuation byte."`
{ $substrBytes: [ "cafétéria", 3, 1 ] }	Errors with message: `"Error: Invalid range, ending index is in the middle of a UTF-8 character."`
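The rows above can be reproduced with a small Node.js sketch of the boundary rules. This is an illustrative emulation under stated assumptions, not the server's implementation: the function name substrBytesSketch is hypothetical, the error messages are shortened, and clamping a range that runs past the end of the string is an assumption of the sketch rather than something this page specifies.

```javascript
// Hypothetical emulation of the boundary rules described above (not a MongoDB API).
function substrBytesSketch(str, byteIndex, byteCount) {
  // null or missing input yields an empty string, as described in the parameter table.
  if (str === null || str === undefined) return "";
  if (typeof str !== "string") throw new TypeError("expected a string or null");

  const bytes = Buffer.from(str, "utf8");
  // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
  const isContinuation = (i) => i < bytes.length && (bytes[i] & 0xc0) === 0x80;

  // Assumption of this sketch: a range running past the end is clamped.
  const end = Math.min(byteIndex + byteCount, bytes.length);

  if (isContinuation(byteIndex)) {
    throw new RangeError("starting index is a UTF-8 continuation byte");
  }
  if (isContinuation(end)) {
    throw new RangeError("ending index is in the middle of a UTF-8 character");
  }
  return bytes.subarray(byteIndex, end).toString("utf8");
}

const word = "caf\u00e9t\u00e9ria"; // "cafétéria" with precomposed "é"
console.log(substrBytesSketch("abcde", 1, 2)); // bc
console.log(substrBytesSketch(word, 0, 5));    // café
console.log(substrBytesSketch(word, 5, 4));    // tér
```

Calling substrBytesSketch(word, 7, 3) or substrBytesSketch(word, 3, 1) throws a RangeError, mirroring the two error rows in the table.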

Example

Single-Byte Character Set

Consider an inventory collection with the following documents:

{ "_id" : 1, "item" : "ABC1", quarter: "13Q1", "description" : "product 1" }
{ "_id" : 2, "item" : "ABC2", quarter: "13Q4", "description" : "product 2" }
{ "_id" : 3, "item" : "XYZ1", quarter: "14Q2", "description" : null }

The following operation uses the $substrBytes operator to separate the quarter value (containing only single-byte US-ASCII characters) into a yearSubstring and a quarterSubstring. The quarterSubstring field represents the rest of the string from the specified byte index following the yearSubstring. Its byte count is calculated by subtracting the byte index from the length of the string using $strLenBytes.

db.inventory.aggregate(
  [
    {
      $project: {
        item: 1,
        yearSubstring: { $substrBytes: [ "$quarter", 0, 2 ] },
        quarterSubstring: {
          $substrBytes: [
            "$quarter", 2, { $subtract: [ { $strLenBytes: "$quarter" }, 2 ] }
          ]
        }
      }
    }
  ]
)

The operation returns the following results:

{ "_id" : 1, "item" : "ABC1", "yearSubstring" : "13", "quarterSubstring" : "Q1" }
{ "_id" : 2, "item" : "ABC2", "yearSubstring" : "13", "quarterSubstring" : "Q4" }
{ "_id" : 3, "item" : "XYZ1", "yearSubstring" : "14", "quarterSubstring" : "Q2" }
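The byte-count arithmetic in this pipeline can be traced by hand for a single value. The following Node.js sketch is only a stand-in for the $strLenBytes and $subtract steps; the variable names are illustrative and nothing here is a MongoDB API.

```javascript
// Plain Node.js walk-through of the arithmetic for the first document's quarter value.
const quarter = "13Q1";
const byteIndex = 2; // the year occupies bytes 0 and 1

// Plays the role of { $strLenBytes: "$quarter" }.
const totalBytes = Buffer.byteLength(quarter, "utf8");

// Plays the role of { $subtract: [ <length>, 2 ] }.
const remainingBytes = totalBytes - byteIndex;

const bytes = Buffer.from(quarter, "utf8");
const yearPart = bytes.subarray(0, byteIndex).toString("utf8");
const quarterPart = bytes
  .subarray(byteIndex, byteIndex + remainingBytes)
  .toString("utf8");

console.log(totalBytes, remainingBytes); // 4 2
console.log(yearPart, quarterPart);      // 13 Q1
```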

Single-Byte and Multibyte Character Set

A collection named food contains the following documents:

{ "_id" : 1, "name" : "apple" }
{ "_id" : 2, "name" : "banana" }
{ "_id" : 3, "name" : "éclair" }
{ "_id" : 4, "name" : "hamburger" }
{ "_id" : 5, "name" : "jalapeño" }
{ "_id" : 6, "name" : "pizza" }
{ "_id" : 7, "name" : "tacos" }
{ "_id" : 8, "name" : "寿司sushi" }

The following operation uses the $substrBytes operator to create a three-byte menuCode from the name value:

db.food.aggregate(
  [
    {
      $project: {
        "name": 1,
        "menuCode": { $substrBytes: [ "$name", 0, 3 ] }
      }
    }
  ]
)

The operation returns the following results:

{ "_id" : 1, "name" : "apple", "menuCode" : "app" }
{ "_id" : 2, "name" : "banana", "menuCode" : "ban" }
{ "_id" : 3, "name" : "éclair", "menuCode" : "éc" }
{ "_id" : 4, "name" : "hamburger", "menuCode" : "ham" }
{ "_id" : 5, "name" : "jalapeño", "menuCode" : "jal" }
{ "_id" : 6, "name" : "pizza", "menuCode" : "piz" }
{ "_id" : 7, "name" : "tacos", "menuCode" : "tac" }
{ "_id" : 8, "name" : "寿司sushi", "menuCode" : "寿" }