- publishing free software manuals
Perl Language Reference Manual
by Larry Wall and others
Paperback (6"x9"), 724 pages
ISBN 9781906966027
RRP £29.95 ($39.95)

Sales of this book support The Perl Foundation! Get a printed copy>>>

11.2.6 Capture buffers

The bracketing construct ( ... ) creates capture buffers. To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "\". (The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another part of the match is called a backreference.

There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences. If the bracketing group did not match, the associated backreference won't match either. (This can happen if the bracketing group is optional, or in a different branch of an alternation.)

In order to provide a safer and easier way to construct patterns using backreferences, Perl provides the \g{N} notation (starting with perl 5.10.0). The curly brackets are optional, however omitting them is less safe as the meaning of the pattern can be changed by text (such as digits) following it. When N is a positive integer the \g{N} notation is exactly equivalent to using normal backreferences. When N is a negative integer then it is a relative backreference referring to the previous N'th capturing group. When the bracket form is used and N is not an integer, it is treated as a reference to a named buffer.

Thus \g{-1} refers to the last buffer, \g{-2} refers to the buffer before that. For example:

/
 (Y)            # buffer 1
 (              # buffer 2
    (X)         # buffer 3
    \g{-1}      # backref to buffer 3
    \g{-3}      # backref to buffer 1
 )
/x

and would match the same as /(Y) ( (X) \3 \1 )/x.

Additionally, as of Perl 5.10.0 you may use named capture buffers and named backreferences. The notation is (?<name>...) to declare and \k<name> to reference. You may also use apostrophes instead of angle brackets to delimit the name; and you may use the bracketed \g{name} backreference syntax. It's possible to refer to a named capture buffer by absolute and relative number as well. Outside the pattern, a named capture buffer is available via the %+ hash. When different buffers within the same pattern have the same name, $+{name} and \k<name> refer to the leftmost defined group. (Thus it's possible to do things with named capture buffers that would otherwise require (??{}) code to accomplish.)

Examples:

s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
/(.)\1/                         # find first doubled char
     and print "'$1' is the first doubled character\n";
/(?<char>.)\k<char>/            # ... a different way
     and print "'$+{char}' is the first doubled character\n";
/(?'char'.)\1/                  # ... mix and match
     and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) {   # parse out values
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
}

Several special variables also refer back to portions of the previous match. $+ returns whatever the last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.) $` returns everything before the matched string. $' returns everything after the matched string. And $^N contains whatever was matched by the most-recently closed group (submatch). $^N can be used in extended patterns (see below), for example to assign a submatch to a variable.

The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See 4.6.)

NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.

As a workaround for this problem, Perl 5.10.0 introduces ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}, which are equivalent to $`, $& and $', except that they are only guaranteed to be defined after a successful match that was executed with the /p (preserve) modifier. The use of these variables incurs no global performance penalty, unlike their punctuation char equivalents, however at the trade-off that you have to tell perl when you want to use them.

Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

$pattern =~ s/(\W)/\\$1/g;

(If use locale is set, then this depends on the current locale.) Today it is more common to use the quotemeta() function or the \Q metaquoting escape sequence to disable all metacharacters' special meanings like this:

/$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside interpolated variables) between \Q and \E, double-quotish backslash interpolation may lead to confusing results. If you need to use literal backslashes within \Q...\E, consult 7.32.

ISBN 9781906966027Perl Language Reference ManualSee the print edition