- publishing free software manuals
Perl Language Reference Manual
by Larry Wall and others
Paperback (6"x9"), 724 pages
ISBN 9781906966027
RRP £29.95 ($39.95)

Sales of this book support The Perl Foundation! Get a printed copy>>>

11.8 Repeated Patterns Matching a Zero-length Substring

WARNING: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of 'foo', and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier. Another common way to create a similar cycle is with the looping modifier //g:

@matches = ( 'foo' =~ m{ o? }xg );

or

print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero-length substrings. Here's a simple example being:

@chars = split //, $string;           # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus

m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

m{   (?: NON_ZERO_LENGTH )*
   |
     (?: ZERO_LENGTH )?
 }x;

The higher level-loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. This prohibition interacts with backtracking (see 11.5), and so the second best match is chosen if the best match is of zero length.

For example:

$_ = 'bar';
s/\w??/<$&>/g;

results in <><b><><a><><r><>. At each position of the string the best match given by non-greedy ?? is the zero-length match, and the second best match is what is matched by \w. Thus zero-length matches alternate with one-character-long matches.

Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during split.

ISBN 9781906966027Perl Language Reference ManualSee the print edition