Unicode

Let me just say that this section is not an exhaustive everything-you-ever-wanted-to-know-about-Unicode resource. I want to cover what you need to know that’s changing for Unicode in ES6, but we won’t go much deeper than that. Mathias Bynens (http://twitter.com/mathias) has written/spoken extensively and brilliantly about JS and Unicode (see https://mathiasbynens.be/notes/javascript-unicode and http://fluentconf.com/javascript-html-2015/public/content/2015/02/18-javascript-loves-unicode).

The Unicode characters that range from 0x0000 to 0xFFFF contain all the standard printed characters (in various languages) that you’re likely to have seen or interacted with. This group of characters is called the Basic Multilingual Plane (BMP). The BMP even contains fun symbols like this cool snowman: ☃ (U+2603).

There are lots of other extended Unicode characters beyond this BMP set, which range up to 0x10FFFF. These symbols are often referred to as astral symbols, as that’s the name given to the set of 16 planes (e.g., layers/groupings) of characters beyond the BMP. Examples of astral symbols include 𝄞 (U+1D11E) and 💩 (U+1F4A9).

Prior to ES6, JavaScript strings could specify Unicode characters using Unicode escaping, such as:

  var snowman = "\u2603";
  console.log( snowman ); // "☃"

However, the \uXXXX Unicode escaping only supports four hexadecimal characters, so you can only represent the BMP set of characters in this way. To represent an astral character using Unicode escaping prior to ES6, you had to use a surrogate pair — basically two specially calculated Unicode-escaped characters side by side, which JS interprets together as a single astral character:

  1. var gclef = "\uD834\uDD1E";
  2. console.log( gclef ); // "?"
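
That pair isn’t arbitrary, by the way; it comes from the standard UTF-16 encoding arithmetic, which splits the part of the code point above 0x10000 into two 10-bit halves. Here’s a minimal sketch (the helper name is purely illustrative):

  function toSurrogatePair(codePoint) {
      // split the value above 0x10000 into two 10-bit halves
      var offset = codePoint - 0x10000;
      var high = 0xD800 + (offset >> 10); // leading (high) surrogate
      var low = 0xDC00 + (offset & 0x3FF); // trailing (low) surrogate
      return String.fromCharCode( high, low );
  }

  toSurrogatePair( 0x1D11E ) === "\uD834\uDD1E"; // true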

As of ES6, we now have a new form for Unicode escaping (in strings and regular expressions), called Unicode code point escaping:

  1. var gclef = "\u{1D11E}";
  2. console.log( gclef ); // "?"

As you can see, the difference is the presence of the { } in the escape sequence, which allows it to contain any number of hexadecimal characters. Because you only need six to represent the highest possible code point value in Unicode (i.e., 0x10FFFF), this is sufficient.
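
Incidentally, the new form isn’t only for astral characters; it accepts BMP code points just as happily, and the two escape styles produce identical strings:

  var snowman1 = "\u2603",
      snowman2 = "\u{2603}";

  snowman1 === snowman2; // true

  "\u{10FFFF}".length; // 2 -- still stored as a surrogate pair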

Unicode-Aware String Operations

By default, JavaScript string operations and methods are not sensitive to astral symbols in string values. So, they treat each BMP character individually, even the two surrogate halves that make up an otherwise single astral character. Consider:

  1. var snowman = "☃";
  2. snowman.length; // 1
  3. var gclef = "?";
  4. gclef.length; // 2

So, how do we accurately calculate the length of such a string? In this scenario, the following trick will work:

  1. var gclef = "?";
  2. [...gclef].length; // 1
  3. Array.from( gclef ).length; // 1

Recall from the “for..of Loops” section earlier in this chapter that ES6 strings have built-in iterators. This iterator happens to be Unicode-aware, meaning it will automatically output an astral symbol as a single value. We take advantage of that using the ... spread operator in an array literal, which creates an array of the string’s symbols. Then we just inspect the length of that resultant array. ES6’s Array.from(..) does basically the same thing as [...XYZ], but we’ll cover that utility in detail in Chapter 6.
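
If building a throwaway array bothers you, the same Unicode-aware iterator can be consumed directly with a for..of loop. Here’s a rough sketch of a counting helper (the function name is just for illustration):

  function codePointLength(str) {
      var count = 0;
      // the built-in string iterator yields one value per code point,
      // so a surrogate pair counts as a single symbol
      for (var c of str) {
          count++;
      }
      return count;
  }

  codePointLength( "𝄞" ); // 1
  codePointLength( "ab𝄞d" ); // 4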

Warning: It should be noted that constructing and exhausting an iterator just to get the length of a string is quite expensive on performance, relatively speaking, compared to what a theoretically optimized native utility/property would do.

Unfortunately, the full answer is not as simple or straightforward. In addition to the surrogate pairs (which the string iterator takes care of), there are special Unicode code points that behave in other special ways, which is much harder to account for. For example, there’s a set of code points that modify the previous adjacent character, known as Combining Diacritical Marks.

Consider these two string outputs:

  console.log( s1 ); // "é"
  console.log( s2 ); // "é"

They look the same, but they’re not! Here’s how we created s1 and s2:

  1. var s1 = "\xE9",
  2. s2 = "e\u0301";

As you can probably guess, our previous length trick doesn’t work with s2:

  [...s1].length; // 1
  [...s2].length; // 2

So what can we do? In this case, we can perform a Unicode normalization on the value before inquiring about its length, using the ES6 String#normalize(..) utility (which we’ll cover more in Chapter 6):

  1. var s1 = "\xE9",
  2. s2 = "e\u0301";
  3. s1.normalize().length; // 1
  4. s2.normalize().length; // 1
  5. s1 === s2; // false
  6. s1 === s2.normalize(); // true

Essentially, normalize(..) takes a sequence like "e\u0301" and normalizes it to "\xE9". Normalization can even combine multiple adjacent combining marks if there’s a suitable Unicode character they combine to:

  1. var s1 = "o\u0302\u0300",
  2. s2 = s1.normalize(),
  3. s3 = "ồ";
  4. s1.length; // 3
  5. s2.length; // 1
  6. s3.length; // 1
  7. s2 === s3; // true

Unfortunately, normalization isn’t fully perfect here, either. If you have multiple combining marks modifying a single character, you may not get the length count you’d expect, because there may not be a single defined normalized character that represents the combination of all the marks. For example:

  1. var s1 = "e\u0301\u0330";
  2. console.log( s1 ); // "ḛ́"
  3. s1.normalize().length; // 2

The further you go down this rabbit hole, the more you realize that it’s difficult to get one precise definition for “length.” What we see visually rendered as a single character — more precisely called a grapheme — doesn’t always strictly relate to a single “character” in the program processing sense.

Tip: If you want to see just how deep this rabbit hole goes, check out the “Grapheme Cluster Boundaries” algorithm (http://www.Unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
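
Note: If your JS environment happens to ship the later Intl.Segmenter API (it’s not part of ES6), it can split a string by grapheme cluster using those very boundary rules, which gives a count much closer to what your eyes expect. A sketch, assuming that support:

  var s1 = "e\u0301\u0330";

  // granularity "grapheme" follows the grapheme cluster boundary rules
  var seg = new Intl.Segmenter( "en", { granularity: "grapheme" } );

  [...seg.segment( s1 )].length; // 1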

Character Positioning

Similar to length complications, what does it actually mean to ask, “what is the character at position 2?” The naive pre-ES6 answer comes from charAt(..), which will not respect the atomicity of an astral character, nor will it take into account combining marks.

Consider:

  1. var s1 = "abc\u0301d",
  2. s2 = "ab\u0107d",
  3. s3 = "ab\u{1d49e}d";
  4. console.log( s1 ); // "abćd"
  5. console.log( s2 ); // "abćd"
  6. console.log( s3 ); // "ab?d"
  7. s1.charAt( 2 ); // "c"
  8. s2.charAt( 2 ); // "ć"
  9. s3.charAt( 2 ); // "" <-- unprintable surrogate
  10. s3.charAt( 3 ); // "" <-- unprintable surrogate

So, is ES6 giving us a Unicode-aware version of charAt(..)? Unfortunately, no. At the time of this writing, there’s a proposal for such a utility that’s under consideration for post-ES6.

But with what we explored in the previous section (and of course with the limitations noted thereof!), we can hack an ES6 answer:

  1. var s1 = "abc\u0301d",
  2. s2 = "ab\u0107d",
  3. s3 = "ab\u{1d49e}d";
  4. [...s1.normalize()][2]; // "ć"
  5. [...s2.normalize()][2]; // "ć"
  6. [...s3.normalize()][2]; // "?"

Warning: Reminder of an earlier warning: constructing and exhausting an iterator each time you want to get at a single character is… not very ideal, performance-wise. Let’s hope we get a built-in and optimized utility for this soon, post-ES6.
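
One way to soften that cost is to stop iterating as soon as you’ve reached the position you care about, rather than spreading the whole string into an array first. A rough sketch (illustrative name, and the same normalization caveats apply):

  function symbolAt(str,pos) {
      var i = 0;
      // bail out as soon as we reach the requested symbol, instead of
      // materializing the whole string as an array first
      for (var c of str.normalize()) {
          if (i++ === pos) return c;
      }
  }

  symbolAt( "ab\u{1d49e}d", 2 ); // "𝒞"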

What about a Unicode-aware version of the charCodeAt(..) utility? ES6 gives us codePointAt(..):

  1. var s1 = "abc\u0301d",
  2. s2 = "ab\u0107d",
  3. s3 = "ab\u{1d49e}d";
  4. s1.normalize().codePointAt( 2 ).toString( 16 );
  5. // "107"
  6. s2.normalize().codePointAt( 2 ).toString( 16 );
  7. // "107"
  8. s3.normalize().codePointAt( 2 ).toString( 16 );
  9. // "1d49e"

What about the other direction? A Unicode-aware version of String.fromCharCode(..) is ES6’s String.fromCodePoint(..):

  String.fromCodePoint( 0x107 ); // "ć"
  String.fromCodePoint( 0x1d49e ); // "𝒞"

So wait, can we just combine String.fromCodePoint(..) and codePointAt(..) to get a better version of a Unicode-aware charAt(..) from earlier? Yep!

  1. var s1 = "abc\u0301d",
  2. s2 = "ab\u0107d",
  3. s3 = "ab\u{1d49e}d";
  4. String.fromCodePoint( s1.normalize().codePointAt( 2 ) );
  5. // "ć"
  6. String.fromCodePoint( s2.normalize().codePointAt( 2 ) );
  7. // "ć"
  8. String.fromCodePoint( s3.normalize().codePointAt( 2 ) );
  9. // "?"

There’s quite a few other string methods we haven’t addressed here, including toUpperCase(), toLowerCase(), substring(..), indexOf(..), slice(..), and a dozen others. None of these have been changed or augmented for full Unicode awareness, so you should be very careful — probably just avoid them! — when working with strings containing astral symbols.

There are also several string methods that use regular expressions for their behavior, like replace(..) and match(..). Thankfully, ES6 brings Unicode awareness to regular expressions, as we covered in “Unicode Flag” earlier in this chapter.
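
As a quick taste of why that flag matters here, compare how the . pattern treats an astral symbol with and without u:

  var gclef = "𝄞";

  /^.$/.test( gclef ); // false -- "." only matches a single code unit
  /^.$/u.test( gclef ); // true -- with u, "." matches a whole code point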

OK, there we have it! With the various additions we’ve just covered, JavaScript’s Unicode string support is a significant improvement over pre-ES6 (though still not perfect).

Unicode Identifier Names

Unicode can also be used in identifier names (variables, properties, etc.). Prior to ES6, you could do this with Unicode escapes, like:

  var \u03A9 = 42;
  // same as: var Ω = 42;

As of ES6, you can also use the code point escape syntax explained earlier:

  var \u{2B400} = 42;
  // same as: var 𫐀 = 42;

There’s a complex set of rules around exactly which Unicode characters are allowed. Furthermore, some are allowed only if they’re not the first character of the identifier name.

Note: Mathias Bynens has a great post (https://mathiasbynens.be/notes/javascript-identifiers-es6) on all the nitty-gritty details.

The reasons for using such unusual characters in identifier names are rather rare and academic. You typically won’t be best served by writing code that relies on these esoteric capabilities.