Regular Expressions

Let’s face it: regular expressions haven’t changed much in JS in a long time. So it’s a great thing that they’ve finally learned a couple of new tricks in ES6. We’ll briefly cover the additions here, but the overall topic of regular expressions is so dense that you’ll need to turn to chapters/books dedicated to it (of which there are many!) if you need a refresher.

Unicode Flag

We’ll cover the topic of Unicode in more detail in “Unicode” later in this chapter. Here, we’ll just look briefly at the new u flag for ES6+ regular expressions, which turns on Unicode matching for that expression.

JavaScript strings are typically interpreted as sequences of 16-bit characters, which correspond to the characters in the Basic Multilingual Plane (BMP) (http://en.wikipedia.org/wiki/Plane_%28Unicode%29). But there are many Unicode characters that fall outside this range, and so strings may contain these extended characters, each stored as a pair of 16-bit units (a surrogate pair).

Prior to ES6, regular expressions could only match based on BMP characters, which means that those extended characters were treated as two separate characters for matching purposes. This is often not ideal.

So, as of ES6, the u flag tells a regular expression to interpret a string as a sequence of Unicode code points rather than 16-bit units, so that an extended character is matched as a single entity.

Warning: Despite the name implication, “UTF-16” doesn’t strictly mean 16 bits. Modern Unicode code points need up to 21 bits, and the numbers in names like UTF-8 and UTF-16 refer roughly to the size of the units used to represent a character, not to a fixed size per character.

An example (straight from the ES6 specification): 𝄞 (the musical symbol G-clef) is Unicode code point U+1D11E (0x1D11E).

If this character appears in a regular expression pattern (like /𝄞/), the standard BMP interpretation would be that it’s two separate characters (0xD834 and 0xDD1E) to match with. But the new ES6 Unicode-aware mode means that /𝄞/u (or the escaped Unicode form /\u{1D11E}/u) will match "𝄞" in a string as a single matched character.

You might be wondering why this matters. In non-Unicode BMP mode, the pattern is treated as two separate characters, but would still find the match in a string with the "𝄞" character in it, as you can see if you try:

  /𝄞/.test( "𝄞-clef" ); // true

The length of the match is what matters. For example:

  /^.-clef/ .test( "𝄞-clef" ); // false
  /^.-clef/u.test( "𝄞-clef" ); // true

The ^.-clef in the pattern says to match only a single character at the beginning before the normal "-clef" text. In standard BMP mode, the match fails (two characters), but with u Unicode mode flagged on, the match succeeds (one character).

It’s also important to note that u makes quantifiers like + and * apply to the entire Unicode code point as a single character, not just the low surrogate (aka rightmost half of the symbol) of the character. The same goes for Unicode characters appearing in character classes, such as the endpoints of a range like /[\u{1D11E}-\u{1D122}]/u.
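To make the difference concrete, here is a small sketch that builds the G-clef from its code point escape and compares the two modes. The variable names are only for illustration, and the patterns are built with RegExp(..) so the same source can be compiled with and without u.

```javascript
// "\u{1D11E}" is the ES6 code point escape for the G-clef symbol.
var clef = "\u{1D11E}";

// One symbol, but two UTF-16 code units (a surrogate pair).
var units = clef.length;                              // 2

// `.` consumes one code unit without `u`, one code point with it.
var dotPlain = /^.$/.test( clef );                    // false
var dotUnicode = /^.$/u.test( clef );                 // true

// The same quantified pattern, compiled in both modes.
var twicePlain = new RegExp( "^" + clef + "{2}$" );
var twiceUnicode = new RegExp( "^" + clef + "{2}$", "u" );

// Without `u`, `{2}` applies only to the trailing surrogate,
// so two whole G-clefs in a row are not matched.
var plainResult = twicePlain.test( clef + clef );     // false
var unicodeResult = twiceUnicode.test( clef + clef ); // true
```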

Note: There are plenty more nitty-gritty details about u behavior in regular expressions, which Mathias Bynens (https://twitter.com/mathias) has written extensively about (https://mathiasbynens.be/notes/es6-unicode-regex).

Sticky Flag

Another flag mode added to ES6 regular expressions is y, which is often called “sticky mode.” Sticky essentially means the regular expression has a virtual anchor at its beginning that keeps it rooted to matching at only the position indicated by the regular expression’s lastIndex property.

To illustrate, let’s consider two regular expressions, the first without sticky mode and the second with:

  var re1 = /foo/,
      str = "++foo++";
  re1.lastIndex; // 0
  re1.test( str ); // true
  re1.lastIndex; // 0 -- not updated
  re1.lastIndex = 4;
  re1.test( str ); // true -- ignored `lastIndex`
  re1.lastIndex; // 4 -- not updated

Three things to observe about this snippet:

  • test(..) doesn’t pay any attention to lastIndex’s value (the pattern has neither the g nor the y flag), and always just performs its match from the beginning of the input string.
  • Because our pattern does not have a ^ start-of-input anchor, the search for "foo" is free to move ahead through the whole string looking for a match.
  • lastIndex is not updated by test(..).

Now, let’s try a sticky mode regular expression:

  var re2 = /foo/y, // <-- notice the `y` sticky flag
      str = "++foo++";
  re2.lastIndex; // 0
  re2.test( str ); // false -- "foo" not found at `0`
  re2.lastIndex; // 0
  re2.lastIndex = 2;
  re2.test( str ); // true
  re2.lastIndex; // 5 -- updated to after previous match
  re2.test( str ); // false
  re2.lastIndex; // 0 -- reset after previous match failure

And so our new observations about sticky mode:

  • test(..) uses lastIndex as the exact and only position in str to look to make a match. There is no moving ahead to look for the match — it’s either there at the lastIndex position or not.
  • If a match is made, test(..) updates lastIndex to point to the character immediately following the match. If a match fails, test(..) resets lastIndex back to 0.

Normal non-sticky patterns that aren’t otherwise ^-rooted to the start-of-input are free to move ahead in the input string looking for a match. But sticky mode restricts the pattern to matching just at the position of lastIndex.

As I suggested at the beginning of this section, another way of looking at this is that y implies a virtual anchor at the beginning of the pattern that is relative (aka constrains the start of the match) to exactly the lastIndex position.
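One way to picture that virtual anchor is a small helper that asks "does this pattern match exactly here?" The following is only a sketch: matchesAt(..) is a hypothetical name, not a built-in, and it relies on the ES6 flags property and RegExp(..) constructor covered later in this section.

```javascript
// Hypothetical helper (not a built-in): does `pattern` match `str`
// starting exactly at index `pos`?
function matchesAt(pattern, str, pos) {
    // Build a sticky copy so the caller's regular expression and its
    // `lastIndex` are left untouched. Any `g`/`y` is dropped first so
    // that `y` is not listed twice.
    var flags = pattern.flags.replace( /[gy]/g, "" ) + "y";
    var sticky = new RegExp( pattern.source, flags );

    sticky.lastIndex = pos;
    return sticky.test( str );
}

var atStart = matchesAt( /foo/, "++foo++", 0 );  // false
var atTwo = matchesAt( /foo/, "++foo++", 2 );    // true
var atThree = matchesAt( /foo/, "++foo++", 3 );  // false
```

Because the position is part of the question, there is no move-ahead: index 3 fails even though "foo" begins just one character earlier.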

Warning: In previous literature on the topic, it has alternatively been asserted that this behavior is like y implying a ^ (start-of-input) anchor in the pattern. This is inaccurate. We’ll explain in further detail in “Anchored Sticky” later.

Sticky Positioning

It may seem strangely limiting that to use y for repeated matches, you have to manually ensure lastIndex is in the exact right position, as it has no move-ahead capability for matching.

Here’s one possible scenario: if you know that the match you care about is always going to be at a position that’s a multiple of a number (e.g., 0, 10, 20, etc.), you can just construct a limited pattern matching what you care about, but then manually set lastIndex to those fixed positions before each match.

Consider:

  var re = /f../y,
      str = "foo       far       fad";
  str.match( re ); // ["foo"]
  re.lastIndex = 10;
  str.match( re ); // ["far"]
  re.lastIndex = 20;
  str.match( re ); // ["fad"]
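The same idea can be wrapped in a loop. This is a minimal sketch, assuming every field starts at a multiple of a known width; readFixed(..) and its parameters are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: try the sticky pattern at every multiple of
// `width` and collect what it finds there (or null if nothing matches).
function readFixed(str, width) {
    var re = /f../y,
        found = [];

    for (var pos = 0; pos < str.length; pos += width) {
        re.lastIndex = pos;        // position the sticky match by hand
        var m = re.exec( str );
        found.push( m ? m[0] : null );
    }

    return found;
}

// Fields at positions 0, 10, and 20, padded with spaces in between.
var gap = " ".repeat( 7 );
var fields = readFixed( "foo" + gap + "far" + gap + "fad", 10 );
// ["foo","far","fad"]
```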

However, if you’re parsing a string that isn’t formatted in fixed positions like that, figuring out what to set lastIndex to before each match is likely going to be untenable.

There’s a saving nuance to consider here. y requires that lastIndex be in the exact position for a match to occur. But it doesn’t strictly require that you manually set lastIndex.

Instead, you can construct your expressions in such a way that they capture in each main match everything before and after the thing you care about, up to right before the next thing you’ll care to match.

Because lastIndex will be set to the next character beyond the end of a match, if you’ve matched everything up to that point, lastIndex will always be in the correct position for the y pattern to start from the next time.

Warning: If you can’t predict the structure of the input string in a sufficiently patterned way like that, this technique may not be suitable and you may not be able to use y.

Having structured string input is likely the most practical scenario where y will be capable of performing repeated matching throughout a string. Consider:

  var re = /\d+\.\s(.*?)(?:\s|$)/y,
      str = "1. foo 2. bar 3. baz";
  str.match( re ); // [ "1. foo ", "foo" ]
  re.lastIndex; // 7 -- correct position!
  str.match( re ); // [ "2. bar ", "bar" ]
  re.lastIndex; // 14 -- correct position!
  str.match( re ); // ["3. baz", "baz"]

This works because I knew something ahead of time about the structure of the input string: there is always a numeral prefix like "1. " before the desired match ("foo", etc.), and either a space after it, or the end of the string ($ anchor). So the regular expression I constructed captures all of that in each main match, and then I use a matching group ( ) so that the stuff I really care about is separated out for convenience.

After the first match ("1. foo "), the lastIndex is 7, which is already the position needed to start the next match, for "2. bar ", and so on.

If you’re going to use y sticky mode for repeated matches, you’ll probably want to look for opportunities to have lastIndex automatically positioned as we’ve just demonstrated.
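Put into a loop, that automatic positioning might look like the following sketch. collectItems(..) is a hypothetical helper name, and it assumes the input follows the numbered-list shape described above; the loop simply stops at the first position where the pattern does not match.

```javascript
// Hypothetical helper: pull the items out of a string shaped like
// "1. foo 2. bar 3. baz", letting each match leave `lastIndex`
// exactly where the next match has to begin.
function collectItems(str) {
    var re = /\d+\.\s(.*?)(?:\s|$)/y,
        items = [],
        m;

    // exec(..) returns null (and resets `lastIndex` to 0) as soon as
    // the pattern fails to match at the current position.
    while ((m = re.exec( str )) !== null) {
        items.push( m[1] );
    }

    return items;
}

var items = collectItems( "1. foo 2. bar 3. baz" );
// ["foo","bar","baz"]
```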

Sticky Versus Global

Some readers may be aware that you can emulate something like this lastIndex-relative matching with the g global match flag and the exec(..) method, like so:

  var re = /o+./g, // <-- look, `g`!
      str = "foot book more";
  re.exec( str ); // ["oot"]
  re.lastIndex; // 4
  re.exec( str ); // ["ook"]
  re.lastIndex; // 9
  re.exec( str ); // ["or"]
  re.lastIndex; // 13
  re.exec( str ); // null -- no more matches!
  re.lastIndex; // 0 -- starts over now!

While it’s true that g pattern matches with exec(..) start their matching from lastIndex’s current value, and also update lastIndex after each match (or failure), this is not the same thing as y’s behavior.

Notice in the previous snippet that "ook", located at position 6, was matched and found by the second exec(..) call, even though at the time, lastIndex was 4 (from the end of the previous match). Why? Because as we said earlier, non-sticky matches are free to move ahead in their matching. A sticky mode expression would have failed here, because it would not be allowed to move ahead.
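Running the same pattern in both modes from the same starting position shows the difference. This is just an illustrative sketch; the variable names are made up for the comparison.

```javascript
var str = "foot book more";
var reGlobal = /o+./g;
var reSticky = /o+./y;

// Global: starts looking at index 4 and moves ahead to find "ook" at 6.
reGlobal.lastIndex = 4;
var globalMatch = reGlobal.exec( str );   // ["ook"]
var globalNext = reGlobal.lastIndex;      // 9

// Sticky: must match at index 4 exactly (a space), so it fails.
reSticky.lastIndex = 4;
var stickyMatch = reSticky.exec( str );   // null
var stickyNext = reSticky.lastIndex;      // 0 -- reset after failure

// Sticky succeeds only when positioned right at the match.
reSticky.lastIndex = 6;
var stickyRetry = reSticky.exec( str );   // ["ook"]
```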

In addition to perhaps undesired move-ahead matching behavior, another downside to just using g instead of y is that g changes the behavior of some matching methods, like str.match(re).

Consider:

  var re = /o+./g, // <-- look, `g`!
      str = "foot book more";
  str.match( re ); // ["oot","ook","or"]

See how all the matches were returned at once? Sometimes that’s OK, but sometimes that’s not what you want.

The y sticky flag will give you one-at-a-time progressive matching with utilities like test(..) and match(..). Just make sure the lastIndex is always in the right position for each match!

Anchored Sticky

As we warned earlier, it’s inaccurate to think of sticky mode as implying a pattern starts with ^. The ^ anchor has a distinct meaning in regular expressions, which is not altered by sticky mode. ^ is an anchor that always refers to the beginning of the input, and is not in any way relative to lastIndex.

Besides poor/inaccurate documentation on this topic, the confusion is unfortunately strengthened further because an older pre-ES6 experiment with sticky mode in Firefox did make ^ relative to lastIndex, so that behavior has been around for years.

ES6 elected not to do it that way. ^ in a pattern means start-of-input absolutely and only.

As a consequence, a pattern like /^foo/y will always and only find a "foo" match at the beginning of a string, if it’s allowed to match there. If lastIndex is not 0, the match will fail. Consider:

  var re = /^foo/y,
      str = "foo";
  re.test( str ); // true
  re.test( str ); // false
  re.lastIndex; // 0 -- reset after failure
  re.lastIndex = 1;
  re.test( str ); // false -- failed for positioning
  re.lastIndex; // 0 -- reset after failure

Bottom line: y plus ^ plus lastIndex > 0 is an incompatible combination that will always cause a failed match.

Note: While y does not alter the meaning of ^ in any way, the m multiline mode does, such that ^ means start-of-input or start of text after a newline. So, if you combine y and m flags together for a pattern, you can find multiple ^-rooted matches in a string. But remember: because it’s y sticky, you’ll have to make sure lastIndex is pointing at the correct new line position (likely by matching to the end of the line) each subsequent time, or no subsequent matches will be made.
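As a rough sketch of that y plus m combination, the pattern below consumes each whole line, including its trailing newline, so that lastIndex lands on the start of the next line every time. The input string and variable names are assumptions made up for the example.

```javascript
var lines = "foo one\nfoo two\nfoo three";

// `m` lets `^` match after each newline; `y` pins every attempt to
// `lastIndex`. The optional `\n` at the end moves `lastIndex` past the
// line break, onto the first character of the next line.
var re = /^foo (\w+)\n?/my,
    words = [],
    m;

while ((m = re.exec( lines )) !== null) {
    words.push( m[1] );
}

words;   // ["one","two","three"]
```

If the pattern stopped short of the newline, the second exec(..) would start at the "\n" itself, where ^ cannot match, and the loop would end after the first line.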

Regular Expression flags

Prior to ES6, if you wanted to examine a regular expression object to see what flags it had applied, you needed to parse them out (ironically, probably with another regular expression) from the output of its toString() method, because the source property holds only the pattern text and not the flags. For example:

  var re = /foo/ig;
  re.toString(); // "/foo/ig"
  var flags = re.toString().match( /\/([gim]*)$/ )[1];
  flags; // "ig"

As of ES6, you can now get these values directly, with the new flags property:

  var re = /foo/ig;
  re.flags; // "gi"

It’s a small nuance, but the ES6 specification calls for the expression’s flags to be listed in this order: "gimuy", regardless of what order the original pattern was specified with. That’s the reason for the difference between /ig and "gi".

And no, the order in which flags are specified or listed doesn’t affect how the pattern matches.
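If the same code has to run in both pre-ES6 and ES6 environments, one option is a small wrapper that prefers the flags property and falls back to parsing toString(). getFlags(..) is a hypothetical helper name, and the fallback returns the flags in whatever order toString() prints them rather than in the normalized order.

```javascript
// Hypothetical helper: read a regular expression's flags, using the
// ES6 `flags` property when it exists and toString() parsing otherwise.
function getFlags(re) {
    if (typeof re.flags === "string") {
        return re.flags;
    }

    // Fallback: grab the letters after the final "/" in "/pattern/flags".
    return re.toString().match( /\/([a-z]*)$/ )[1];
}

var flagsFound = getFlags( /foo/ig );   // "gi" in an ES6+ engine
var noFlags = getFlags( /foo/ );        // ""
```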

Another tweak from ES6 is that the RegExp(..) constructor is now flags-aware if you pass it an existing regular expression:

  var re1 = /foo*/y;
  re1.source; // "foo*"
  re1.flags; // "y"
  var re2 = new RegExp( re1 );
  re2.source; // "foo*"
  re2.flags; // "y"
  var re3 = new RegExp( re1, "ig" );
  re3.source; // "foo*"
  re3.flags; // "gi"

Prior to ES6, the re3 construction would throw an error, but as of ES6 you can override the flags when duplicating.
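Combining the flags property with the flags-aware constructor makes it straightforward to derive a variant of an existing pattern. The sketch below uses a hypothetical addFlag(..) helper; the guard matters because listing the same flag twice is a syntax error.

```javascript
// Hypothetical helper: return a copy of `re` with `flag` turned on,
// leaving the original regular expression untouched.
function addFlag(re, flag) {
    if (re.flags.indexOf( flag ) !== -1) {
        // Already set: a plain copy keeps the source and flags as-is.
        return new RegExp( re );
    }

    // ES6: the second argument overrides the copied pattern's flags.
    return new RegExp( re, re.flags + flag );
}

var original = /foo*/i;
var stickyCopy = addFlag( original, "y" );

stickyCopy.source;   // "foo*"
stickyCopy.flags;    // "iy"
original.flags;      // "i" -- unchanged
```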