Regular Expressions

Regular expressions are too big a topic to introduce here, but make sure that you understand these concepts. For tutorials, see perlrequick or perlretut. For the definitive documentation, see perlre.

Matches and replacements return a quantity.

In scalar context, the m// operator returns true if it matched and false if it did not, and the s/// operator returns the number of replacements it made. You can either use the value directly, or check it for truth.

    if ( $str =~ /Diggle|Shelley/ ) {
        print "We found Pete or Steve!\n";
    }

    if ( my $n = ($str =~ s/this/that/g) ) {
        print qq{Replaced $n occurrence(s) of "this"\n};
    }
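
To make the return values concrete, here is a minimal sketch. The sample string and variable names are invented for illustration. It also shows one common idiom for counting matches, since a scalar-context m// only reports success or failure.

```perl
use strict;
use warnings;

my $str = 'this and this and that';   # sample data, invented for this sketch

# In scalar context, m// reports only success or failure.
my $matched = ( $str =~ /this/ ) ? 1 : 0;

# To count matches, run m//g in list context and count the resulting list.
my $count = () = $str =~ /this/g;

# s///g returns the number of replacements it made.
my $replaced = ( $str =~ s/this/that/g );

print "matched=$matched count=$count replaced=$replaced\n";
print "$str\n";
```

This should print "matched=1 count=2 replaced=2" followed by the modified string.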

Don't use capture variables without checking that the match succeeded.

The capture variables, $1 and so on, are valid only if the match succeeded. A failed match does not clear them, so they keep whatever the last successful match set.

    # BAD: Not checked, but at least it "works".
    my $str = 'Perl 101 rocks.';
    $str =~ /(\d+)/;
    print "Number: $1"; # Prints "Number: 101";

    # WORSE: Not checked, and the result is not what you'd expect
    $str =~ /(Python|Ruby)/;
    print "Language: $1"; # Prints "Language: 101";

Instead, you must check the return value from the match:

    # GOOD: Check the results
    my $str = 'Perl 101 rocks.';
    if ( $str =~ /(\d+)/ ) {
        print "Number: $1"; # Prints "Number: 101";
    }

    if ( $str =~ /(Python|Ruby)/ ) {
        print "Language: $1"; # Never gets here
    }

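In list context, m// returns a list of what it matched: the contents of the capture groups, or, with /g, the captures (or the whole matches, if there are no groups) from every match in the string.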
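
A short sketch of m// in list context follows. The sample text and variable names are invented for illustration, and this is one way to use the feature rather than the only one. Note that a failed match returns an empty list, so a list assignment can double as the success check recommended above.

```perl
use strict;
use warnings;

my $str = 'Perl 101 rocks, and so does Perl 102.';   # invented sample text

# With capture groups and no /g, list context returns the captures.
my ( $lang, $num ) = $str =~ /(\w+) (\d+)/;

# With /g, list context returns the captures from every match in the string.
my @numbers = $str =~ /(\d+)/g;

# A failed match returns an empty list, so the assignment is false.
my $found_other = 0;
if ( my ($other) = $str =~ /(Python|Ruby)/ ) {
    $found_other = 1;
}

print "lang=$lang num=$num numbers=@numbers found_other=$found_other\n";
```

Capturing into named lexical variables this way avoids relying on $1 and friends at all.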

Common match flags

  • /i - case insensitive match
  • /g - match multiple times
    $var = "match match match";

    while ($var =~ /match/g) { $count++; }
    print "$count\n"; # prints 3

    $count = 0;
    $count++ foreach ($var =~ /match/g);
    print "$count\n"; # prints 3
  • /m - ^ and $ change meaning
    • Ordinarily, ^ means "start of string" and $, "end of string"
    • /m makes them mean start and end of line, respectively
    $str = "one\ntwo\nthree";
    @a = $str =~ /^\w+/g; # @a = ("one");
    @b = $str =~ /^\w+/gm; # @b = ("one","two","three")
    • Use \A and \z for start and end of string regardless of /m
    • \Z is the same as \z except it will ignore a final newline
  • /s - . also matches newline
    $str = "one\ntwo\nthree\n";
    $str =~ /^(.{8})/s;
    print $1; # prints "one\ntwo\n"
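
The \A, \z and \Z anchors mentioned in the list above have no example of their own, so here is a minimal sketch of how they differ from ^ and $ under /m. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $str = "one\ntwo\nthree\n";   # invented sample with a trailing newline

# Under /m, ^ matches at the start of every line...
my @line_starts = $str =~ /^(\w+)/gm;

# ...but \A still matches only at the very start of the string.
my @string_start = $str =~ /\A(\w+)/gm;

# \z demands the true end of the string; \Z tolerates one final newline.
my $ends_z     = ( $str =~ /three\z/ ) ? 1 : 0;   # a newline follows "three"
my $ends_big_z = ( $str =~ /three\Z/ ) ? 1 : 0;

print "@line_starts | @string_start | z=$ends_z Z=$ends_big_z\n";
```

This should print "one two three | one | z=0 Z=1".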

Capture variables $1 and friends

  • Sets of capturing parentheses are stored in numeric variables
  • Parentheses are numbered left to right, in order of their opening parenthesis:
    my $str = "abc";
    $str =~ /(((a)(b))(c))/;
    print "1: $1 2: $2 3: $3 4: $4 5: $5\n";
    # prints: 1: abc 2: ab 3: a 4: b 5: c
  • There is no upper limit on the number of capturing parentheses and variables

Avoid capture with ?:

  • If an opening parenthesis is followed by ?:, the group will not be captured
  • Useful if you don't want the matches to be saved
    my $str = "abc";
    $str =~ /(?:a(b)c)/;
    print "$1\n"; # prints "b"
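
The effect on capture numbering is easiest to see side by side. The following sketch uses an invented sample string and variable names. It compares the same alternation written with a capturing group and with a non-capturing one.

```perl
use strict;
use warnings;

my $str = 'color: red';   # invented sample

# Capturing group: the alternation takes $1, pushing the value to $2.
my @with_capture = $str =~ /(color|colour): (\w+)/;

# Non-capturing group: grouping still works, but only the value is captured.
my @without_capture = $str =~ /(?:color|colour): (\w+)/;

print scalar(@with_capture), " vs ", scalar(@without_capture), "\n";
print "value: $without_capture[0]\n";
```

This should print "2 vs 1" and then "value: red".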

Allow easier reading with the /x switch

  • If you're doing something tricky with a regex, comment it.
  • You can do this with the /x flag.

This ugly behemoth

    my ($num) = $ARGV[0] =~ m/^\+?((?:(?<!\+)-)?(?:\d*\.)?\d+)$/x;

is more readable with whitespace and comments, as allowed by the /x flag.

    my ($num) =
        $ARGV[0] =~ m/^ \+?        # An optional plus sign, to be discarded
            (                      # Capture...
                (?:(?<!\+)-)?      # a negative sign, if there's no plus behind it,
                (?:\d*\.)?         # optional digits followed by a decimal point,
                \d+                # then one or more digits.
            )$/x;
  • Whitespace and comments are stripped unless escaped.
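
Because /x discards unescaped whitespace, a pattern that needs a literal space or "#" has to say so explicitly. The following is a minimal sketch with invented sample strings and variable names.

```perl
use strict;
use warnings;

my $str = 'Perl 101';   # invented sample

# Under /x the space in the pattern is ignored, so this looks for "Perl101".
my $plain = ( $str =~ / Perl 101 /x ) ? 1 : 0;

# An escaped space or a bracketed character class matches a literal space.
my $escaped = ( $str =~ / Perl \  101 /x )  ? 1 : 0;
my $class   = ( $str =~ / Perl [ ] 101 /x ) ? 1 : 0;

# A literal "#" must be escaped too, or it starts a comment.
my $hash = ( 'issue #42' =~ / issue \s \# (\d+) /x ) ? $1 : undef;

print "plain=$plain escaped=$escaped class=$class hash=$hash\n";
```

This should print "plain=0 escaped=1 class=1 hash=42".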

Automatically quote your regexes with \Q and \E

  • Automatically escapes regex metacharacters
  • Won't escape dollar signs: variables inside \Q...\E are still interpolated, and their values are escaped
    my $num = '3.1415';
    print "ok 1\n" if $num =~ /\Q3.14\E/;
    $num = '3X1415';
    print "ok 2\n" if $num =~ /\Q3.14\E/;
    print "ok 3\n" if $num =~ /3.14/;

prints

    ok 1
    ok 3
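
In practice \Q is most often wrapped around an interpolated variable rather than a literal. The following sketch assumes a search string held in a variable; $needle and the other names are invented for illustration.

```perl
use strict;
use warnings;

my $needle = '3.14';   # invented search string containing a metacharacter

# Without \Q, the "." in $needle matches any character.
my $loose = ( '3X1415' =~ /$needle/ ) ? 1 : 0;

# With \Q...\E, the interpolated value is escaped and matched literally.
my $strict_no  = ( '3X1415' =~ /\Q$needle\E/ ) ? 1 : 0;
my $strict_yes = ( '3.1415' =~ /\Q$needle\E/ ) ? 1 : 0;

# quotemeta() is the function form of \Q.
my $quoted = quotemeta($needle);

print "loose=$loose strict_no=$strict_no strict_yes=$strict_yes quoted=$quoted\n";
```

This should print "loose=1 strict_no=0 strict_yes=1 quoted=3\.14".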

Execute code with /e flag to s///

  • Evaluates the replacement as Perl code and substitutes the result
    my $str = "AbCdE\n";
    $str =~ s/(\w)/lc $1/eg;
    print $str; # prints "abcde"
  • Use $1 and friends if necessary
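
As a further sketch of /e with capture variables, the replacement below is a calculation or function call on $1. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $str = 'apples: 3, pears: 10';   # invented sample

# The replacement is evaluated as Perl code, so $1 can feed a calculation.
( my $doubled = $str ) =~ s/(\d+)/$1 * 2/eg;

# Any expression works, including a function call.
( my $padded = $str ) =~ s/(\d+)/sprintf '%03d', $1/eg;

print "$doubled\n";
print "$padded\n";
```

This should print "apples: 6, pears: 20" and then "apples: 003, pears: 010". Copying into a new variable first leaves the original string untouched.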

Know when to use study

study is not helpful in the vast majority of cases, and since Perl 5.16 it is a no-op. In older perls, all it does is make a table of where the first occurrence of each of the 256 possible bytes is in the string. This means that if you have a 1,000-character string, and you search for lots of strings that begin with a constant character, then the matcher can jump right to it. For example:

"This is a very long [… 900 characters skipped…] string that I have here, ending at position 1000"

Now, if you are matching this against the regex /Icky/, the matcher will try to find the first letter "I" that matches. That may mean scanning through the first 900+ characters before it gets there. But what study does is build a table of the 256 possible bytes and where they first appear, so that in this case, the scanner can jump right to that position and start matching.

Handle multi-line regexes

Use re => debug

    -Mre=debug
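
The same pragma can be enabled inside a script with use re 'debug'. The following is a minimal sketch with an invented sample string and variable names. It assumes a reasonably modern perl, where the pragma is lexically scoped, and the trace itself is written to STDERR.

```perl
use strict;
use warnings;

my $matched;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    $matched = ( 'aaab' =~ /a+b/ ) ? 1 : 0;
}

# Regexes outside the block run without debugging output.
my $quiet = ( 'aaab' =~ /a+b/ ) ? 1 : 0;

print "matched=$matched quiet=$quiet\n";
```

Limiting the pragma to a small block keeps the trace focused on the one regex under investigation.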

Want to contribute?

Submit a PR to github.com/petdance/perl101