Pire

List of functions

  • Pire::Grep(String) -> (String?) -> Bool
  • Pire::Match(String) -> (String?) -> Bool
  • Pire::MultiGrep(String) -> (String?) -> Tuple<Bool, Bool, ...>
  • Pire::MultiMatch(String) -> (String?) -> Tuple<Bool, Bool, ...>
  • Pire::Capture(String) -> (String?) -> String?
  • Pire::Replace(String) -> (String?, String) -> String?

One way to match regular expressions in YQL is Pire (Perl Incompatible Regular Expressions), a very fast regular expression library developed at Yandex. At the lower level, it scans the input string once, without lookaheads or backtracking, spending 5 machine instructions per character (on x86 and x86_64).

This speed is achieved by accepting some reasonable restrictions:

  • Pire is primarily focused on checking whether a string matches a regular expression.
  • The matching substring can also be returned (by Capture), but with restrictions (the expression may contain only one group, and only the match for that group is returned).

By default, all functions work in the single-byte mode. However, if the regular expression is a valid UTF-8 string but is not a valid ASCII string, the UTF-8 mode is enabled automatically.

To force the UTF-8 mode for an expression that would otherwise be pure ASCII, add one non-ASCII character followed by the ? operator, for example: \\w+я?.

Call syntax

To avoid compiling a regular expression for each table row, wrap the function call in a named expression:

  $re = Pire::Grep("\\d+"); -- create a callable value to match a specific regular expression
  SELECT * FROM table WHERE $re(key); -- use it to filter the table


Alert

When escaping special characters in a regular expression, be sure to double the backslash: all standard string literals in SQL accept C-escaped strings, and \d is not a valid escape sequence (even if it were, it wouldn't match digits as intended).

You can enable the case-insensitive mode by specifying, at the beginning of the regular expression, the flag (?i).

Examples

  $value = "xaaxaaxaa";
  $match = Pire::Match("a.*");
  $grep = Pire::Grep("axa");
  $insensitive_grep = Pire::Grep("(?i)axa");
  $multi_match = Pire::MultiMatch(@@a.*
  .*a.*
  .*a
  .*axa.*@@);
  $capture = Pire::Capture(".*x(a).*");
  $capture_many = Pire::Capture(".*x(a+).*");
  $replace = Pire::Replace(".*x(a).*");
  SELECT
      $match($value) AS match,
      $grep($value) AS grep,
      $insensitive_grep($value) AS insensitive_grep,
      $multi_match($value) AS multi_match,
      $multi_match($value).0 AS some_multi_match,
      $capture($value) AS capture,
      $capture_many($value) AS capture_many,
      $replace($value, "b") AS replace;
  /*
  - match: `false`
  - grep: `true`
  - insensitive_grep: `true`
  - multi_match: `(false, true, true, true)`
  - some_multi_match: `false`
  - capture: `"a"`
  - capture_many: `"aa"`
  - replace: `"xaaxaaxba"`
  */


Grep

Checks whether any part of the string (an arbitrary substring) matches the regular expression.

Match

Matches the whole string against the regular expression.
To get a result similar to Grep (that is, to also accept substring matches), wrap the regular expression in .* on both sides. For example, use .*foo.* instead of foo.
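
The difference between the two functions can be illustrated with a minimal sketch. The variable names and the sample string are made up for illustration; the results in the comments follow from the definitions of Grep and Match given here.

```sql
$value = "a foo b";

$grep = Pire::Grep("foo");            -- substring match
$match = Pire::Match("foo");          -- whole-string match
$match_any = Pire::Match(".*foo.*");  -- whole-string match padded with .*

SELECT
    $grep($value) AS grep,            -- true: "foo" occurs inside the string
    $match($value) AS match,          -- false: the whole string is not "foo"
    $match_any($value) AS match_any;  -- true: behaves like Grep("foo")
```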

MultiGrep/MultiMatch

Pire lets you match against multiple regular expressions in a single pass through the text and get a separate response for each match.
Use the MultiGrep/MultiMatch functions to optimize the query execution speed. Be sure to do it carefully, since the size of the state machine used for matching grows exponentially with the number of regular expressions:

  • If you only need to know whether a string matches any of the listed expressions (the results are joined with “or”), it is much more efficient to combine the alternatives into a single regular expression with | and match it using regular Grep or Match.
  • Pire has a limit on the size of the state machine (YQL uses the default value set in the library). If you exceed the limit, an error is raised at the start of the query: Failed to glue up regexes, probably the finite state machine appeared to be too large.

When you call MultiGrep/MultiMatch, regular expressions are passed one per line using multiline string literals:

Examples

  $multi_match = Pire::MultiMatch(@@a.*
  .*x.*
  .*axa.*@@);
  SELECT
      $multi_match("a") AS a,
      $multi_match("axa") AS axa;
  /*
  - a: `(true, false, false)`
  - axa: `(true, true, true)`
  */
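
As a rough sketch of the advice about joining results with “or”, the two approaches below ask the same question of a string: does it contain any of three words? The patterns, variable names, and sample string are hypothetical. The first form builds one expression with |, while the second builds a combined state machine for three expressions and then discards the per-expression detail.

```sql
$line = "disk warning: 91% full";

-- Preferred when only a yes/no answer is needed: one expression with |
$any = Pire::Grep("error|warning|fatal");

-- Needed only when the per-expression answers matter
$each = Pire::MultiGrep(@@error
warning
fatal@@);

SELECT
    $any($line) AS any_hit,
    $each($line) AS per_pattern,
    $each($line).0 OR $each($line).1 OR $each($line).2 AS any_hit_via_multi;
```

Reserve MultiGrep/MultiMatch for cases where the individual tuple elements are actually used.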


Capture

If a string matches the specified regular expression, it returns a substring that matches the group enclosed in parentheses in the regular expression.
Capture is non-greedy: the shortest possible substring is returned.

Alert

The expression must contain only one group in parentheses. NULL (empty Optional) is returned in case of no match.

If the above limitations and features are unacceptable for some reason, we recommend that you consider Re2::Capture.
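
A minimal sketch of a Capture call with a single group, using a hypothetical pattern and hypothetical input strings. The group is bounded by literal text on both sides (id= before, ; after), so the result should not depend on how greedy the match is.

```sql
$capture = Pire::Capture(".*id=(\\d+);.*");

SELECT
    $capture("user id=42; ok") AS found,         -- the digits between "id=" and ";"
    $capture("no identifier here") AS missing;   -- NULL: the string does not match
```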

Replace

Pire doesn’t support replace based on a regular expression. Pire::Replace implemented in YQL is a simplified emulation using Capture. It may run incorrectly if the substring occurs more than once in the source string.

As a rule, it’s better to use Re2::Replace instead.
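
A hedged sketch of the suggested alternative, assuming Re2::Replace has the same call shape as Pire::Replace (a pattern that yields a callable taking the input string and the replacement) and that it replaces every non-overlapping match. Both points are assumptions to check against the Re2 documentation; the pattern and input are made up for illustration.

```sql
$value = "xaaxaaxaa";

-- Assumption: same call shape as Pire::Replace
$replace = Re2::Replace("a+");

SELECT
    $replace($value, "b") AS replaced;  -- every run of "a" is expected to be replaced
```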