Perl Language Reference Manual
by Larry Wall and others
Paperback (6"x9"), 724 pages
ISBN 9781906966027
RRP £29.95 ($39.95)

13.2.3 Whitespace

\s matches any single character that is considered whitespace. In the ASCII range, \s matches the horizontal tab (\t), the newline (\n), the form feed (\f), the carriage return (\r), and the space. (The vertical tab, \cK, is not matched by \s.) The exact set of characters matched by \s depends on whether the source string is in UTF-8 format and on the locale or EBCDIC code page that is in effect. If the string is in UTF-8 format, \s matches what is considered whitespace in the Unicode database; the complete list is in the table below. Otherwise, if there is a locale or EBCDIC code page in effect, \s matches whatever is considered whitespace by the current locale or EBCDIC code page. Without a locale or EBCDIC code page, \s matches the five characters mentioned at the beginning of this paragraph. Perhaps the most notable possible surprise is that \s matches a non-breaking space only if the non-breaking space is in a UTF-8 encoded string, or if the locale or EBCDIC code page that is in effect includes that character. See 13.4.

Any character that isn't matched by \s will be matched by \S.
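
As a small illustration of the ASCII case described above, the following sketch checks the five listed characters against \s and \S, then collapses runs of whitespace in a sample string. The variable names and the sample string are made up for this example, and no locale is assumed to be in effect.

```perl
use strict;
use warnings;

# The five ASCII characters that the text lists as matched by \s.
my %ascii_ws = (
    'tab'             => "\t",
    'newline'         => "\n",
    'form feed'       => "\f",
    'carriage return' => "\r",
    'space'           => " ",
);

# Count how many of them match \s and how many match \S.
my $matched_s     = grep { $_ =~ /\A\s\z/ } values %ascii_ws;
my $matched_big_s = grep { $_ =~ /\A\S\z/ } values %ascii_ws;

# A typical use: collapse each run of whitespace into a single space.
my $text = "alpha\t beta\r\n\fgamma";
(my $collapsed = $text) =~ s/\s+/ /g;

print "$matched_s of 5 match \\s, $matched_big_s match \\S\n";
print "$collapsed\n";
```

The anchors \A and \z are used instead of ^ and $ so that the newline itself is tested, rather than being skipped as a trailing line ending.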

\h will match any character that is considered horizontal whitespace; this includes the space and the tab characters and 17 other characters that are listed in the table below. \H will match any character that is not considered horizontal whitespace.

\N is new in 5.12, and is experimental. Like the dot, it matches any character that is not a newline. The difference is that \N is not influenced by the single-line /s regular expression modifier. Note that \N has a second meaning when written in the form \N{...}; this form is for named characters. See "Define character names for \N{named} string literal escapes" (charnames) in the Perl Library Reference Manual (Volume 1) for those. If \N is followed by an opening brace and something that is not a quantifier, perl assumes that a character name is coming, and not this meaning of \N. For example, \N{3} means to match 3 non-newlines and \N{5,} means to match 5 or more non-newlines, but \N{4F} and \N{F4} are not legal quantifiers, and will cause perl to look for characters named 4F or F4, respectively (and it won't find them, thus raising an error, unless they have been defined using custom names).
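
The difference between the dot and \N under /s, and the quantifier reading of \N{3} and \N{5,}, can be sketched as follows. This assumes perl 5.12 or later; the sample strings and variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $two_lines = "a\nb";

# Under /s the dot also matches a newline...
my $dot_matches = $two_lines =~ /a.b/s  ? 1 : 0;
# ...but \N still refuses to match it.
my $n_matches   = $two_lines =~ /a\Nb/s ? 1 : 0;

# \N followed by a quantifier in braces: \N{3} is three non-newlines.
my ($first_three) = "abcdef\nghi" =~ /\A(\N{3})/;

# \N{5,} is five or more non-newlines; the match stops at the newline.
my ($rest_of_line) = "abcdef\nghi" =~ /\A(\N{5,})/;

print "dot under /s: $dot_matches, \\N under /s: $n_matches\n";
print "first three: $first_three, rest of line: $rest_of_line\n";
```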

\v will match any character that is considered vertical whitespace; this includes the carriage return and line feed characters (newline) plus 5 other characters listed in the table below. \V will match any character that is not considered vertical whitespace.

\R matches anything that can be considered a newline under Unicode rules. It's not a character class, as it can match a multi-character sequence. Therefore, it cannot be used inside a bracketed character class; use \v instead (vertical whitespace). Details are discussed in 12.
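
To illustrate why \R is not a character class, here is a minimal sketch that splits a string with mixed line endings. The sample string and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# One string with three different line-ending conventions.
my $mixed = "one\r\ntwo\nthree\rfour";

# \R treats the two-character sequence CR LF as a single line ending.
my @by_R = split /\R/, $mixed;

# \v matches one character at a time, so CR LF produces an empty field.
my @by_v = split /\v/, $mixed;

# \R can consume both characters of CR LF; a single \v cannot.
my $crlf_is_one_R = "\r\n" =~ /\A\R\z/ ? 1 : 0;
my $crlf_is_one_v = "\r\n" =~ /\A\v\z/ ? 1 : 0;

print scalar(@by_R), " fields with \\R, ", scalar(@by_v), " fields with \\v\n";
```

Inside a bracketed character class, only the single-character \v is available, for example [\v;] to split on vertical whitespace or semicolons.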

Note that unlike \s, \d and \w, \h and \v always match the same characters, regardless of whether the source string is in UTF-8 format or not. The set of characters they match is also not influenced by the locale or EBCDIC code page.
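
The following sketch checks that claim for two characters from the table below, once in a native string and once in a copy upgraded to the internal UTF-8 format. It assumes an ASCII platform; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Two copies of the same two characters: one stored as native bytes,
# one upgraded to Perl's internal UTF-8 representation.
my $native   = "\xA0\x85";          # NO-BREAK SPACE, NEXT LINE
my $upgraded = $native;
utf8::upgrade($upgraded);

# For each copy, record whether \h and \v find a match.
my @results;
for my $str ($native, $upgraded) {
    push @results, [
        utf8::is_utf8($str) ? 'utf8' : 'native',
        ($str =~ /\h/ ? 1 : 0),    # can match the no-break space
        ($str =~ /\v/ ? 1 : 0),    # can match the next line character
    ];
}

printf "%-6s  \\h=%d  \\v=%d\n", @$_ for @results;
```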

One might think that \s is equivalent to [\h\v]. This is not true. The vertical tab ("\x0b") is not matched by \s; it is, however, considered vertical whitespace. Furthermore, if the source string is not in UTF-8 format, and any locale or EBCDIC code page that is in effect doesn't include them, the next line ("\x85") and the no-break space ("\xA0") characters are not matched by \s, but are matched by \v and \h respectively. If the source string is in UTF-8 format, both the next line and the no-break space are matched by \s.
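
The no-break space case can be sketched as below. This is an illustration under stated assumptions: an ASCII platform, no locale in effect, no /u, /a or /l modifiers, and the unicode_strings feature switched off (it is disabled explicitly here so that the native string keeps its native semantics). The variable and key names are invented for the example.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep native semantics for non-UTF-8 strings

my $native   = "\xA0";          # NO-BREAK SPACE, not in UTF-8 format
my $upgraded = $native;
utf8::upgrade($upgraded);       # same character, now in UTF-8 format

# Record which classes match the character in each representation.
my %matches = (
    native_s   => ($native   =~ /\s/ ? 1 : 0),
    native_h   => ($native   =~ /\h/ ? 1 : 0),
    upgraded_s => ($upgraded =~ /\s/ ? 1 : 0),
    upgraded_h => ($upgraded =~ /\h/ ? 1 : 0),
);

print "$_: $matches{$_}\n" for sort keys %matches;
```

Under these assumptions only native_s is expected to be 0, which is the surprise the text describes: \h sees the character in both strings, while \s sees it only in the UTF-8 one.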

The following table is a complete listing of characters matched by \s, \h and \v as of Unicode 5.2.

The first column gives the code point of the character (in hex format), the second column gives the (Unicode) name. The third column indicates by which class(es) the character is matched (assuming no locale or EBCDIC code page is in effect that changes the \s matching).

0x00009        CHARACTER TABULATION   h s
0x0000a              LINE FEED (LF)    vs
0x0000b             LINE TABULATION    v
0x0000c              FORM FEED (FF)    vs
0x0000d        CARRIAGE RETURN (CR)    vs
0x00020                       SPACE   h s
0x00085             NEXT LINE (NEL)    vs  [1]
0x000a0              NO-BREAK SPACE   h s  [1]
0x01680            OGHAM SPACE MARK   h s
0x0180e   MONGOLIAN VOWEL SEPARATOR   h s
0x02000                     EN QUAD   h s
0x02001                     EM QUAD   h s
0x02002                    EN SPACE   h s
0x02003                    EM SPACE   h s
0x02004          THREE-PER-EM SPACE   h s
0x02005           FOUR-PER-EM SPACE   h s
0x02006            SIX-PER-EM SPACE   h s
0x02007                FIGURE SPACE   h s
0x02008           PUNCTUATION SPACE   h s
0x02009                  THIN SPACE   h s
0x0200a                  HAIR SPACE   h s
0x02028              LINE SEPARATOR    vs
0x02029         PARAGRAPH SEPARATOR    vs
0x0202f       NARROW NO-BREAK SPACE   h s
0x0205f   MEDIUM MATHEMATICAL SPACE   h s
0x03000           IDEOGRAPHIC SPACE   h s
  1. NEXT LINE and NO-BREAK SPACE only match \s if the source string is in UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.

It is worth noting that \d, \w, etc., match single characters, not complete numbers or words. To match a number (that consists of digits), use \d+; to match a word, use \w+.
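
A short sketch of the difference between the bare class and its quantified form follows; the sample text and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $text = "Order 12345 shipped to warehouse_7";

# A bare \d matches exactly one digit, the first one found.
my ($one_digit) = $text =~ /(\d)/;

# \d+ matches the whole run of digits.
my ($number) = $text =~ /(\d+)/;

# Likewise, \w is one word character and \w+ is a whole word.
my ($one_word_char) = $text =~ /(\w)/;
my @words           = $text =~ /(\w+)/g;

print "one digit: $one_digit, number: $number\n";
print "one word character: $one_word_char, words: @words\n";
```

Note that \w also matches digits and the underscore, so 12345 and warehouse_7 each count as one word here.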
