Perl Language Reference Manual
by Larry Wall and others
Paperback (6"x9"), 724 pages
ISBN 9781906966027
RRP £29.95 ($39.95)

13.4 Locale, EBCDIC, Unicode and UTF-8

Some character classes behave differently depending on the internal encoding of the source string, the locale in effect, and whether the program is running on an EBCDIC platform.

\w, \d, \s and the POSIX character classes (and their negations, including \W, \D, \S) are all affected. Since the backslash sequences \b and \B are defined in terms of \w and \W, they are affected too.

The rule is that if the source string is in UTF-8 format, the character classes match according to the Unicode properties. If it isn't, the character classes match according to whatever locale or EBCDIC code page is in effect. If neither a locale nor an EBCDIC code page is in effect, they match the ASCII defaults (52 letters, 10 digits and underscore for \w; 0 to 9 for \d; etc.).
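The rule can be observed by counting how many of the code points 0 to 255 match \w under each internal representation. The following is a minimal sketch, assuming an ASCII platform, no "use locale", and no pragma that changes the matching rules (such as the unicode_strings feature found in later Perl versions); the variable names are illustrative only.

```perl
use strict;
use warnings;

# Count the code points 0..255 that match \w, once for a string held in
# the native 8-bit form and once for the same character held as UTF-8.
my ($native_count, $upgraded_count) = (0, 0);

for my $cp (0 .. 255) {
    my $native   = chr($cp);    # chr() below 256 gives a non-UTF-8 string
    my $upgraded = chr($cp);
    utf8::upgrade($upgraded);   # same character, UTF-8 internally

    $native_count++   if $native   =~ /\w/;
    $upgraded_count++ if $upgraded =~ /\w/;
}

print "native: $native_count  UTF-8: $upgraded_count\n";
```

Under those assumptions the native count should be 63 (52 letters, 10 digits and underscore), while the UTF-8 count is larger because the accented Latin-1 letters also count as word characters under the Unicode properties.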

This usually means that if you are matching against characters whose ord() values are between 128 and 255 inclusive, a character class may or may not match depending on the current locale or EBCDIC code page, and on whether the source string is in UTF-8 format. A string is always in UTF-8 format if it contains characters whose ord() value exceeds 255, but a string may also be in UTF-8 format without containing any such characters. See 'The "Unicode Bug"' (perlunicode) in the Perl Unicode and Locales Manual.
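To make the three situations concrete, the sketch below builds one string with no character above 255, one that contains such a character, and one that has been converted to UTF-8 with utf8::upgrade() although it holds nothing above 255. It assumes an ASCII platform with no locale in effect; the sample strings and the %report hash are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

my $narrow = "caf\xE9";            # every ord() value is 255 or below
my $wide   = "caf\xE9\x{263A}";    # U+263A exceeds 255, so Perl stores UTF-8
my $forced = "caf\xE9";
utf8::upgrade($forced);            # UTF-8 internally, yet nothing above 255

my %report;
for my $case ([narrow => $narrow], [wide => $wide], [forced => $forced]) {
    my ($name, $str) = @$case;
    $report{$name} = {
        is_utf8 => utf8::is_utf8($str) ? 1 : 0,   # internal representation
        e_is_w  => $str =~ /^caf\w/    ? 1 : 0,   # does chr(0xE9) match \w?
    };
    printf "%-6s  UTF-8 internally: %d  fourth character matches \\w: %d\n",
        $name, $report{$name}{is_utf8}, $report{$name}{e_is_w};
}
```

Only the narrow string should fail to match here, even though all three begin with the same four characters and $narrow and $forced compare equal with eq.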

For portability, it may be better to avoid \w, \d, \s and the POSIX character classes and to use the Unicode properties instead.
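As a rough illustration of that advice, the sketch below tests the same character against \w and against the Unicode property \p{L} (general category Letter) in both internal representations. It assumes an ASCII platform with no locale in effect. \p{L} is used only as an example of a Unicode property; it is not an exact equivalent of \w, which also covers digits and underscore.

```perl
use strict;
use warnings;

my $native   = "\xE9";             # e with acute accent, native 8-bit form
my $upgraded = "\xE9";
utf8::upgrade($upgraded);          # same character, UTF-8 internally

my (@w_result, @prop_result);
for my $str ($native, $upgraded) {
    my $form = utf8::is_utf8($str) ? "UTF-8" : "native";
    my $w    = $str =~ /\w/    ? 1 : 0;   # depends on the representation
    my $prop = $str =~ /\p{L}/ ? 1 : 0;   # matched by Unicode property
    push @w_result,    $w;
    push @prop_result, $prop;
    printf "%-6s  \\w: %d  \\p{L}: %d\n", $form, $w, $prop;
}
```

Under those assumptions \p{L} should give the same answer for both strings, whereas the \w result changes with the internal representation.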
