Perl Language Reference Manual
by Larry Wall and others
Paperback (6"x9"), 724 pages
ISBN 9781906966027
RRP £29.95 ($39.95)

Sales of this book support The Perl Foundation.

13.3.5 POSIX Character Classes

POSIX character classes have the form [:class:], where class is the name of the class and [: and :] are the delimiters. POSIX character classes only appear inside bracketed character classes, and are a convenient and descriptive way of listing a group of characters, though they currently suffer from portability issues (see below and 13.4). Be careful about the syntax:

# Correct:
$string =~ /[[:alpha:]]/
# Incorrect (will warn):
$string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon, and the letters a, l, p and h. These character classes can be part of a larger bracketed character class. For example,

[01[:alpha:]%]

is valid and matches '0', '1', any alphabetic character, and the percent sign.
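As a minimal sketch of the syntax described above, the following tests a few strings against the larger bracketed class. The helper name has_class_char and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains at least one character
# from the bracketed class [01[:alpha:]%].
sub has_class_char {
    my ($string) = @_;
    return $string =~ /[01[:alpha:]%]/ ? 1 : 0;
}

for my $sample ('7 + 8', '50%', 'x', '1', '---') {
    printf "%-8s %s\n", "'$sample'",
        has_class_char($sample) ? 'match' : 'no match';
}
```

Here '7 + 8' and '---' contain none of the listed characters, while '50%', 'x' and '1' each contain at least one.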

Perl recognizes the following POSIX character classes:

alpha  Any alphabetical character ("[A-Za-z]").
alnum  Any alphanumerical character ("[A-Za-z0-9]").
ascii  Any character in the ASCII character set.
blank  A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl  Any control character.  See Note [2] below.
digit  Any decimal digit ("[0-9]"), equivalent to "\d".
graph  Any printable character, excluding a space.  See Note [3] below.
lower  Any lowercase character ("[a-z]").
print  Any printable character, including a space.  See Note [4] below.
punct  Any graphical character excluding "word" characters.  See Note [5] below.
space  Any whitespace character. "\s" plus the vertical tab ("\cK").
upper  Any uppercase character ("[A-Z]").
word   A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
xdigit Any hexadecimal digit ("[0-9a-fA-F]").
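One way to get a feel for these classes is to count how many of the 128 ASCII code points each one matches. The sketch below assumes an ASCII platform with no locale in effect; the helper name ascii_count is hypothetical.

```perl
use strict;
use warnings;

my @classes = qw(alpha alnum ascii blank cntrl digit graph lower
                 print punct space upper word xdigit);

# Hypothetical helper: number of code points in 0..127 matched by the
# POSIX class with the given name.
sub ascii_count {
    my ($class) = @_;
    my $regex = qr/\A[[:${class}:]]\z/;
    return scalar grep { chr($_) =~ $regex } 0 .. 127;
}

printf "%-6s %3d\n", $_, ascii_count($_) for @classes;
```

Under those assumptions, digit should report 10, xdigit 22, punct 32 and cntrl 33, with print matching exactly one more code point (the space) than graph.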

Most POSIX character classes have two Unicode-style \p property counterparts. (They are not official Unicode properties, but Perl extensions derived from official Unicode properties.) The table below shows the relation between POSIX character classes and these counterparts.

One counterpart, in the column labelled "ASCII-range Unicode" in the table, will only match characters in the ASCII range. (On EBCDIC platforms, it matches those characters which have ASCII equivalents.)

The other counterpart, in the column labelled "Full-range Unicode", matches any appropriate characters in the full Unicode character set. For example, \p{Alpha} will match not just the ASCII alphabetic characters, but any character in the entire Unicode character set that is considered to be alphabetic.
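The difference between the two counterparts can be sketched by testing an ASCII letter and two non-ASCII letters against both forms. The helper name alpha_flags and the chosen characters are illustrative, and an ASCII (non-EBCDIC) platform is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (ASCII-range result, full-range result)
# for a single character.
sub alpha_flags {
    my ($char) = @_;
    my $posix = $char =~ /\p{PosixAlpha}/ ? 1 : 0;
    my $full  = $char =~ /\p{Alpha}/      ? 1 : 0;
    return ($posix, $full);
}

my @samples = (
    'a',          # LATIN SMALL LETTER A
    "\x{E9}",     # LATIN SMALL LETTER E WITH ACUTE
    "\x{3B1}",    # GREEK SMALL LETTER ALPHA
);

for my $char (@samples) {
    my ($posix, $full) = alpha_flags($char);
    printf "U+%04X  PosixAlpha=%d  Alpha=%d\n", ord($char), $posix, $full;
}
```

Only 'a' matches \p{PosixAlpha}, whereas all three characters match \p{Alpha}.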

(Each of the counterparts has various synonyms as well. "Properties accessible through \p{} and \P{}" (perluniprops) in the Perl Unicode and Locales Manual lists all the synonyms, plus all the characters matched by each of the ASCII-range properties. For example \p{AHex} is a synonym for \p{ASCII_Hex_Digit}, and any \p property name can be prefixed with "Is" such as \p{IsAlpha}.)
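To check that two spellings really are synonyms, one can compare them over a range of code points. This is only a sketch: the helper name disagreements is hypothetical, and the range 0 .. 0x2FF is an arbitrary sample rather than the full Unicode character set.

```perl
use strict;
use warnings;

# Hypothetical helper: count code points in a sample range on which the
# two patterns give different answers.
sub disagreements {
    my ($first, $second) = @_;
    return scalar grep {
        my $char = chr($_);
        ($char =~ $first ? 1 : 0) != ($char =~ $second ? 1 : 0);
    } 0 .. 0x2FF;
}

printf "AHex vs ASCII_Hex_Digit: %d differences\n",
    disagreements(qr/\p{AHex}/, qr/\p{ASCII_Hex_Digit}/);
printf "IsAlpha vs Alpha:        %d differences\n",
    disagreements(qr/\p{IsAlpha}/, qr/\p{Alpha}/);
```

Both counts should be zero if the names are synonyms as described.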

Both the \p forms are unaffected by any locale that is in effect, or whether the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. In contrast, the POSIX character classes are affected. If the source string is in UTF-8 format, the POSIX classes (with the exception of [[:punct:]], see Note [5]) behave like their "Full-range" Unicode counterparts. If the source string is not in UTF-8 format, and no locale is in effect, and the platform is not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. Otherwise, they behave based on the rules of the locale or EBCDIC code page.

It is proposed to change this behavior in a future release of Perl so that the UTF8ness of the source string will be irrelevant to the behavior of the POSIX character classes. This means they will always behave in strict accordance with the official POSIX standard. That is, if either locale or EBCDIC code page is present, they will behave in accordance with those; if absent, the classes will match only their ASCII-range counterparts.
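The claim that the \p forms ignore the string's internal format can be sketched as follows, assuming an ASCII platform. The helper name p_flags is hypothetical. The [[:alpha:]] form is deliberately left out, since its result depends on the internal format and may differ between Perl releases.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\p{PosixAlpha} result, \p{Alpha} result).
sub p_flags {
    my ($string) = @_;
    return (
        ($string =~ /\p{PosixAlpha}/ ? 1 : 0),
        ($string =~ /\p{Alpha}/      ? 1 : 0),
    );
}

my $native   = "\x{E9}";     # e with acute, not stored as UTF-8 internally
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored as UTF-8 internally

for my $pair ([native => $native], [upgraded => $upgraded]) {
    my ($label, $string) = @$pair;
    my ($posix, $full) = p_flags($string);
    printf "%-8s is_utf8=%d  PosixAlpha=%d  Alpha=%d\n",
        $label, utf8::is_utf8($string) ? 1 : 0, $posix, $full;
}
```

Both lines should report the same PosixAlpha and Alpha values even though is_utf8 differs.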

[[:...:]]   ASCII-range          Full-range        backslash  Note
            Unicode              Unicode           sequence
-------------------------------------------------------------------
alpha       \p{PosixAlpha}       \p{Alpha}
alnum       \p{PosixAlnum}       \p{Alnum}
ascii       \p{ASCII}
blank       \p{PosixBlank}       \p{Blank} =                  [1]
                                 \p{HorizSpace}    \h         [1]
cntrl       \p{PosixCntrl}       \p{Cntrl}                    [2]
digit       \p{PosixDigit}       \p{Digit}         \d
graph       \p{PosixGraph}       \p{Graph}                    [3]
lower       \p{PosixLower}       \p{Lower}
print       \p{PosixPrint}       \p{Print}                    [4]
punct       \p{PosixPunct}       \p{Punct}                    [5]
            \p{PerlSpace}        \p{SpacePerl}     \s         [6]
space       \p{PosixSpace}       \p{Space}                    [6]
upper       \p{PosixUpper}       \p{Upper}
word        \p{PerlWord}         \p{Word}          \w
xdigit      \p{ASCII_Hex_Digit}  \p{XDigit}

  1. \p{Blank} and \p{HorizSpace} are synonyms.
  2. Control characters don't produce output as such, but instead usually control the terminal somehow: for example newline and backspace are control characters. In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, plus 127 (DEL) are control characters. On EBCDIC platforms, it is likely that the code page will define [[:cntrl:]] to be the EBCDIC equivalents of the ASCII controls, plus the controls that in Unicode have ordinals from 128 through 159.
  3. Any character that is graphical, that is, visible. This class consists of all the alphanumerical characters and all punctuation characters.
  4. All printable characters, which is the set of all the graphical characters plus whitespace characters that are not also controls.
  5. \p{PosixPunct} and [[:punct:]] in the ASCII range match all the non-controls, non-alphanumeric, non-space characters: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~] (although if a locale is in effect, it could alter the behavior of [[:punct:]]). When the matching string is in UTF-8 format, [[:punct:]] matches the above set, plus what \p{Punct} matches. This is different than strictly matching according to \p{Punct}, because the above set includes characters that aren't considered punctuation by Unicode, but rather "symbols". Another way to say it is that for a UTF-8 string, [[:punct:]] matches all the characters that Unicode considers to be punctuation, plus all the ASCII-range characters that Unicode considers to be symbols.
  6. \p{SpacePerl} and \p{Space} differ only in that \p{Space} additionally matches the vertical tab, \cK. Same for the two ASCII-only range forms.
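
Note [5] can be illustrated with a short sketch comparing \p{PosixPunct} and \p{Punct} on a few ASCII characters. The helper name punct_flags and the sample characters are chosen for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\p{PosixPunct} result, \p{Punct} result)
# for a single character.
sub punct_flags {
    my ($char) = @_;
    return (
        ($char =~ /\p{PosixPunct}/ ? 1 : 0),
        ($char =~ /\p{Punct}/      ? 1 : 0),
    );
}

for my $char ('!', '$', '+', 'a') {
    my ($posix, $unicode) = punct_flags($char);
    printf "'%s'  PosixPunct=%d  Punct=%d\n", $char, $posix, $unicode;
}
```

The characters '$' and '+' are symbols rather than punctuation in Unicode, so they match \p{PosixPunct} but not \p{Punct}; '!' matches both and 'a' matches neither.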