"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "doc/pcre2pattern.3" between
pcre2-10.35.tar.bz2 and pcre2-10.36.tar.bz2

About: The PCRE2 library implements Perl compatible regular expression pattern matching. New future PCRE version with revised API.

pcre2pattern.3  (pcre2-10.35.tar.bz2):pcre2pattern.3  (pcre2-10.36.tar.bz2)
skipping to change at line 178 skipping to change at line 178
CHARACTERS AND METACHARACTERS
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
  The quick brown fox
matches a portion of a subject string that is identical to itself.
10.35: When caseless matching is specified (the PCRE2_CASELESS option), letters are matched independently of case.
10.36: When caseless matching is specified (the PCRE2_CASELESS option or (?i) within the pattern), letters are matched independently of case. Note that there are two ASCII characters, K and S, that, in addition to their lower case ASCII equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
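The case-equivalence note added in 10.36 can be checked against the Unicode case-folding data. The following is a minimal sketch in Python, chosen only for illustration: it uses Python's built-in string case folding, not PCRE2, and the helper name case_equivalent is hypothetical.

```python
# Illustration only: Python's Unicode case folding, not a PCRE2 call.
KELVIN_SIGN = "\u212a"  # U+212A KELVIN SIGN
LONG_S = "\u017f"       # U+017F LATIN SMALL LETTER LONG S


def case_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are equal after case folding."""
    return a.casefold() == b.casefold()


# K, k and U+212A fold to the same character; so do S, s and U+017F.
print(case_equivalent("K", KELVIN_SIGN))
print(case_equivalent("S", LONG_S))
```

Under this folding, a caseless match for K or S has three candidate characters each instead of two, which is the situation the note describes for PCRE2_UTF or PCRE2_UCP.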
The power of regular expressions comes from the ability to include wild cards, character classes, alternatives, and repetitions in the pattern. These are encoded in the pattern by the use of metacharacters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets. Outside square brackets, the metacharacters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start group or control verb
  )      end group or control verb
skipping to change at line 210 skipping to change at line 213
Part of a pattern that is in square brackets is called a "character class". In a character class the only metacharacters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (if followed by POSIX syntax)
  ]      terminates the character class
10.36 (added to this section): If a pattern is compiled with the PCRE2_EXTENDED option, most white space in the pattern, other than in a character class, and characters between a # outside a character class and the next newline, inclusive, are ignored. An escaping backslash can be used to include a white space or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same applies, but in addition unescaped space and horizontal tab characters are ignored inside a character class. Note: only these two characters are ignored, not the full set of pattern white space characters that are ignored outside a character class. Option settings can be changed within a pattern; see the section entitled "Internal Option Setting" below.
The following sections describe the use of each of the metacharacters.
BACKSLASH
The backslash character has several uses. Firstly, if it is followed by a character that is not a digit or a letter, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
For example, if you want to match a * character, you must write \* in the pattern. This escaping action applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric with backslash to specify that it stands for itself. In particular, if you want to match a backslash, you write \\.
10.35: In a UTF mode, only ASCII digits and letters have any special meaning after a backslash. All other characters (in particular, those whose code points are greater than 127) are treated as literals.
10.36: Only ASCII digits and letters have any special meaning after a backslash. All other characters (in particular, those whose code points are greater than 127) are treated as literals.
10.35 (removed from this section in 10.36): If a pattern is compiled with the PCRE2_EXTENDED option, most white space in the pattern (other than in a character class), and characters between a # outside a character class and the next newline, inclusive, are ignored. An escaping backslash can be used to include a white space or # character as part of the pattern.
If you want to treat all characters in a sequence as literals, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish backslash interpolation" on any backslashes between \Q and \E which, its documentation says, "may lead to confusing results". PCRE2 treats a backslash between \Q and \E just like any other character. Note the following examples:
  Pattern            PCRE2 matches   Perl matches
  \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
  \QA\B\E            A\B             A\B
  \Q\\E              \               \\E
The \Q...\E sequence is recognized both inside and outside character classes. An isolated \E that is not preceded by \Q is ignored. If \Q is not followed by \E later in the pattern, the literal interpretation continues to the end of the pattern (that is, \E is assumed at the end). If the isolated \Q is inside a character class, this causes an error, because the character class is not terminated by a closing square bracket.
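The \Q...\E rules in the "PCRE2 matches" column can be sketched as a small pattern rewriter. This is an illustration in Python, not PCRE2's implementation: the function name expand_quotes is hypothetical, the result is checked with Python's re module rather than PCRE2, and character classes (including the error case for an isolated \Q inside one) are not modelled.

```python
import re


def expand_quotes(pattern: str) -> str:
    """Hypothetical helper: rewrite \\Q...\\E spans as escaped literals.

    Follows the PCRE2 rules described above: everything between \\Q and \\E
    is literal (including backslashes, $ and @), an isolated \\E is ignored,
    and an unterminated \\Q runs to the end of the pattern.
    """
    out = []
    i = 0
    in_quote = False
    n = len(pattern)
    while i < n:
        two = pattern[i:i + 2]
        if in_quote:
            if two == r"\E":
                in_quote = False
                i += 2
            else:
                out.append(re.escape(pattern[i]))  # one literal character
                i += 1
        elif two == r"\Q":
            in_quote = True
            i += 2
        elif two == r"\E":
            i += 2  # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)  # keep any other escape pair untouched
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


# The rows of the table above, PCRE2 column only.
for pat, subject in [
    (r"\Qabc$xyz\E", "abc$xyz"),
    (r"\Qabc\$xyz\E", r"abc\$xyz"),
    (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
    (r"\QA\B\E", r"A\B"),
    (r"\Q\\E", "\\"),
]:
    print(pat, bool(re.fullmatch(expand_quotes(pat), subject)))
```

The Perl column is deliberately left out, since it depends on variable interpolation that happens before the regex engine sees the pattern.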
Non-printing characters
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters in a pattern, but when a pattern is being prepared by text editing, it is often easier to use one of the following escape sequences instead of the binary character it represents. In an ASCII or Unicode environment, these escapes are as follows:
  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is any printable ASCII character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          linefeed (hex 0A)
  \r          carriage return (hex 0D) (but see below)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or backreference
  \o{ddd..}   character with octal code ddd..
  \xhh        character with hex code hh
  \x{hhh..}   character with hex code hhh..
  \N{U+hhh..} character with Unicode hex code point hhh..
By default, after \x that is not followed by {, from zero to two hexadecimal digits are read (letters can be in upper or lower case). Any number of hexadecimal digits may appear between \x{ and }. If a character other than a hexadecimal digit appears between \x{ and }, or if there is no terminating }, an error occurs.
Characters whose code points are less than 256 can be defined by either of the two syntaxes for \x or by an octal sequence. There is no difference in the way they are handled. For example, \xdc is exactly the same as \x{dc} or \334. However, using the braced versions does make such sequences easier to read.
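The equivalence of \xdc, \x{dc} and \334 is plain base conversion. A minimal sketch, assuming a hypothetical helper escape_value that handles only the default \x, \x{...}, \o{...} and three-digit octal forms (it ignores the backreference ambiguity of \ddd and is not how PCRE2 parses patterns):

```python
import re


def escape_value(escape: str) -> int:
    """Hypothetical helper: numeric value of a single numeric escape."""
    m = re.fullmatch(r"\\x\{([0-9A-Fa-f]+)\}", escape)
    if m:
        return int(m.group(1), 16)      # \x{hhh..}: any number of hex digits
    m = re.fullmatch(r"\\x([0-9A-Fa-f]{0,2})", escape)
    if m:
        return int(m.group(1) or "0", 16)  # \x: zero to two hex digits
    m = re.fullmatch(r"\\o\{([0-7]+)\}", escape)
    if m:
        return int(m.group(1), 8)       # \o{ddd..}
    m = re.fullmatch(r"\\([0-7]{1,3})", escape)
    if m:
        return int(m.group(1), 8)       # \ddd read as octal in this sketch
    raise ValueError(f"not a numeric escape handled by this sketch: {escape!r}")


print(escape_value(r"\xdc"), escape_value(r"\x{dc}"), escape_value(r"\334"))
```

All three calls yield the same code point, decimal 220.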
Support is available for some ECMAScript (aka JavaScript) escape sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed by { is not recognized. Only if \x is followed by two hexadecimal digits is it recognized as a character escape. Otherwise it is interpreted as a literal "x" character. In this mode, support for code points greater than 256 is provided by \u, which must be followed by four hexadecimal digits; otherwise it is interpreted as a literal "u" character.
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition, \u{hhh..} is recognized as the character specified by hexadecimal code point. There may be any number of hexadecimal digits. This syntax is from ECMAScript 6.
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not support this. Note that when \N is not followed by an opening brace (curly bracket) it has an entirely different meaning, matching any character that is not a newline.
There are some legacy applications where the escape sequence \r is expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a pattern is converted to \n so that it matches a LF (linefeed) instead of a CR (carriage return) character.
The precise effect of \cx on ASCII characters is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the code unit following \c has a value less than 32 or greater than 126, a compile-time error occurs.
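The \cx rule for ASCII is a two-step computation: upper-case a lower case letter, then flip the hex 40 bit. A minimal sketch in Python, with a hypothetical function name and a ValueError standing in for the compile-time error:

```python
def control_escape(x: str) -> int:
    """Hypothetical helper: code value produced by \\cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        # Stands in for the compile-time error described above.
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted to upper case
    return code ^ 0x40         # invert bit 6 (hex 40)


for ch in "AZa{;":
    print(repr(ch), hex(control_escape(ch)))
```

Because the operation is an XOR rather than a subtraction, characters below hex 40 such as ; move up (3B to 7B) while those above it move down.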
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c escape is processed as specified for Perl in the perlebcdic document. The only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-time error. The sequence \c@ encodes character code 0; after \c the letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
Thus, apart from \c?, these escapes generate the same character code values as they do in an ASCII environment, though the meanings of the values mostly differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC.
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but because 127 is not a control character in EBCDIC, Perl makes it generate the APC character. Unfortunately, there are several variants of EBCDIC. In most of them the APC character has the value 255 (hex FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC values, PCRE2 makes \c? generate 95; otherwise it generates 255.
After \0 up to two further octal digits are read. If there are fewer than two digits, just those that are present are used. Thus the sequence \0\x\015 specifies two binary zeros followed by a CR character (code value 13). Make sure you supply two digits after the initial zero if the pattern character that follows is itself an octal digit.
The escape \o must be followed by a sequence of octal digits, enclosed in braces. An error occurs if this is not the case. This escape is a recent addition to Perl; it provides a way of specifying character code points as octal numbers greater than 0777, and it also allows octal numbers and backreferences to be unambiguously specified.
For greater clarity and unambiguity, it is best to avoid following \ by a digit greater than zero. Instead, use \o{} or \x{} to specify numerical character code points, and \g{} to specify backreferences. The following paragraphs describe the old, ambiguous syntax.
The handling of a backslash followed by a digit other than 0 is complicated, and Perl has changed over time, causing PCRE2 also to change.
Outside a character class, PCRE2 reads the digit and any following digits as a decimal number. If the number is less than 10, begins with the digit 8 or 9, or if there are at least that many previous capture groups in the expression, the entire sequence is taken as a backreference. A description of how this works is given later, following the discussion of parenthesized groups. Otherwise, up to three octal digits are read to form a character code.
Inside a character class, PCRE2 handles \8 and \9 as the literal characters "8" and "9", and otherwise reads up to three octal digits following the backslash, using them to generate a data character. Any subsequent digits stand for themselves. For example, outside a character class:
  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40 previous capture groups
  \7     is always a backreference
  \11    might be a backreference, or another way of writing a tab
  \011   is always a tab
skipping to change at line 371 skipping to change at line 378
         character with octal code 113
  \377   might be a backreference, otherwise the value 255 (decimal)
  \81    is always a backreference
Note that octal values of 100 or greater that are specified using this syntax must not be introduced by a leading zero, because no more than three octal digits are ever read.
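The old, ambiguous rule for a backslash followed by a non-zero digit outside a character class can be written as a short decision function. This is a sketch of the rule as stated in the text, not PCRE2's parser; the function name, the tuple return shape, and the capture_groups parameter (the number of previous capture groups) are assumptions made for illustration.

```python
OCTAL_DIGITS = "01234567"


def classify_backslash_digits(digits: str, capture_groups: int):
    """Hypothetical helper for the digits that follow a backslash.

    digits: the run of decimal digits after the backslash, first digit 1-9.
    Returns ("backreference", n) or ("octal", code, rest), where rest holds
    the trailing digits that stand for themselves.
    """
    if not (digits.isascii() and digits.isdigit()) or digits[0] == "0":
        raise ValueError("expected decimal digits starting with 1-9")
    number = int(digits)
    if number < 10 or digits[0] in "89" or capture_groups >= number:
        return ("backreference", number)
    # Otherwise read up to three octal digits to form a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in OCTAL_DIGITS:
            break
        octal += ch
    return ("octal", int(octal, 8), digits[len(octal):])


print(classify_backslash_digits("40", 0))    # octal 40 is a space
print(classify_backslash_digits("40", 40))   # enough groups: backreference
print(classify_backslash_digits("81", 0))    # begins with 8: backreference
```

The dependence on capture_groups is the reason the text recommends \o{}, \x{} and \g{} instead: the same digits change meaning when groups are added to the pattern.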
Constraints on character values
Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the so-called "surrogate" code points). The check for these can be disabled by the caller of pcre2_compile() by setting the option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8 and UTF-32 modes, because these values are not representable in UTF-16.
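The limits in the table above amount to a range check that depends on the code unit width and on whether a UTF mode is active. A minimal sketch in Python; the function and parameter names are hypothetical, and allow_surrogate_escapes merely models the effect described for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
NON_UTF_LIMIT = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}
SURROGATES = range(0xD800, 0xE000)  # 0xd800 to 0xdfff inclusive


def escape_value_allowed(value: int, code_unit_width: int, utf: bool,
                         allow_surrogate_escapes: bool = False) -> bool:
    """Hypothetical check of a numeric escape against the limits above."""
    if value < 0:
        return False
    if not utf:
        return value <= NON_UTF_LIMIT[code_unit_width]
    if value > 0x10FFFF:
        return False
    if value in SURROGATES:
        # The surrogate check can be disabled only in UTF-8 and UTF-32 modes.
        return allow_surrogate_escapes and code_unit_width in (8, 32)
    return True


print(escape_value_allowed(0x100, 8, utf=False))
print(escape_value_allowed(0xD800, 8, utf=True, allow_surrogate_escapes=True))
print(escape_value_allowed(0xD800, 16, utf=True, allow_surrogate_escapes=True))
```

Note that in 16-bit non-UTF mode the surrogate range is an ordinary set of code unit values, so the sketch accepts it there.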
Escape sequences in character classes
All the sequences that define a single character value can be used both inside and outside character classes. In addition, inside a character class, \b is interpreted as the backspace character (hex 08).
When not followed by an opening brace, \N is not allowed in a character class. \B, \R, and \X are not special inside a character class. Like other unrecognized alphabetic escape sequences, they cause an error. Outside a character class, these sequences have different meanings.
Unsupported escape sequences
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string handler and used to modify the case of following characters. By default, PCRE2 does not support these escape sequences in patterns. However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U" character, and \u can be used to define a character by code point, as described above.
Absolute and relative backreferences
The sequence \g followed by a signed or unsigned number, optionally enclosed in braces, is an absolute or relative backreference. A named backreference can be coded as \g{name}. Backreferences are discussed later, following the discussion of parenthesized groups.
Absolute and relative subroutine calls
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative syntax for referencing a capture group as a subroutine. Details are discussed later. Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former is a backreference; the latter is a subroutine call.
Generic character types
Another use of backslash is for specifying generic character types:
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
The \N escape sequence has the same meaning as the "." metacharacter when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the meaning of \N. Note that when \N is followed by an opening brace it has a different meaning. See the section entitled "Non-printing characters" above for details. Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not support this.
Each pair of lower and upper case escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair. The sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, because there is no character to match.
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which are The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13) , and space (32), which are
defined as white space in the "C" locale. This list may vary if locale-sp ecific matching is taking place. defined as white space in the "C" locale. This list may vary if locale-sp ecific matching is taking place.
For example, in some locales the "non-breaking space" character (\xA0) i s recognized as white space, and For example, in some locales the "non-breaking space" character (\xA0) is recognized as white space, and
in others the VT character is not. in others the VT character is not.
A "word" character is an underscore or any character that is a letter or A "word" character is an underscore or any character that is a letter or
digit. By default, the defini- digit. By default, the defini-
tion of letters and digits is controlled by PCRE2's low-valued character tion of letters and digits is controlled by PCRE2's low-valued character
tables, and may vary if locale- tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcre2api p specific matching is taking place (see "Locale support" in the pcre2api
age). For example, in a French page). For example, in a French
locale such as "fr_FR" in Unix-like systems, or "french" in Windows, s locale such as "fr_FR" in Unix-like systems, or "french" in Windows, some
ome character codes greater than character codes greater than
127 are used for accented letters, and these are then matched by \w. The 127 are used for accented letters, and these are then matched by \w. Th
use of locales with Unicode is e use of locales with Unicode is
discouraged. discouraged.
By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match
\D, \S, and \W, although this may be different for characters in the range 128-255 when locale-specific \D, \S, and \W, although this may be different for characters in the range 128-255 when locale-specific
matching is happening. These escape sequences retain their original meanings from before Unicode support matching is happening. These escape sequences retain their original meanings from before Unicode support
was available, mainly for efficiency reasons. If the PCRE2_UCP option is set, the behaviour is changed so was available, mainly for efficiency reasons. If the PCRE2_UCP option is set, the behaviour is changed so
that Unicode properties are used to determine character types, as follows: that Unicode properties are used to determine character types, as follows:
\d any character that matches \p{Nd} (decimal digit) \d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v \s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore \w any character that matches \p{L} or \p{N}, plus underscore
The upper case escapes match the inverse sets of characters. Note that The upper case escapes match the inverse sets of characters. Note that \d
\d matches only decimal digits, matches only decimal digits,
whereas \w matches any Unicode digit, as well as any Unicode letter, an whereas \w matches any Unicode digit, as well as any Unicode letter
d underscore. Note also that , and underscore. Note also that
PCRE2_UCP affects \b, and \B because they are defined in terms of \w and PCRE2_UCP affects \b, and \B because they are defined in terms of \w and
\W. Matching these sequences is \W. Matching these sequences is
noticeably slower when PCRE2_UCP is set. noticeably slower when PCRE2_UCP is set.
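The difference between \d and \w under PCRE2_UCP can be illustrated with a minimal Python sketch. It is not PCRE2 code: it assumes Python's unicodedata module as a stand-in for PCRE2's own property tables (the two may be built from different Unicode versions), and the function names ucp_digit and ucp_word are hypothetical.

```python
import unicodedata

def ucp_digit(ch):
    # \d with PCRE2_UCP set: general category Nd (decimal digit) only.
    return unicodedata.category(ch) == "Nd"

def ucp_word(ch):
    # \w with PCRE2_UCP set: any letter (L*) or any number (N*), plus underscore.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")

# U+00B2 SUPERSCRIPT TWO has category No: a Unicode digit for \w, but not a decimal digit for \d.
print(ucp_digit("\u00b2"), ucp_word("\u00b2"))
```

A character such as U+0663 (ARABIC-INDIC DIGIT THREE, category Nd) satisfies both functions, whereas U+00B2 satisfies only ucp_word.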
The sequences \h, \H, \v, and \V, in contrast to the other sequences, whi The sequences \h, \H, \v, and \V, in contrast to the other sequences, w
ch match only ASCII characters hich match only ASCII characters
by default, always match a specific list of code points, whether or not by default, always match a specific list of code points, whether or not P
PCRE2_UCP is set. The horizontal CRE2_UCP is set. The horizontal
space characters are: space characters are:
U+0009 Horizontal tab (HT) U+0009 Horizontal tab (HT)
U+0020 Space U+0020 Space
U+00A0 Non-break space U+00A0 Non-break space
U+1680 Ogham space mark U+1680 Ogham space mark
U+180E Mongolian vowel separator U+180E Mongolian vowel separator
U+2000 En quad U+2000 En quad
U+2001 Em quad U+2001 Em quad
U+2002 En space U+2002 En space
skipping to change at line 505 skipping to change at line 512
U+000C Form feed (FF) U+000C Form feed (FF)
U+000D Carriage return (CR) U+000D Carriage return (CR)
U+0085 Next line (NEL) U+0085 Next line (NEL)
U+2028 Line separator U+2028 Line separator
U+2029 Paragraph separator U+2029 Paragraph separator
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256 are relevant. In 8-bit, non-UTF-8 mode, only the characters with code points less than 256 are relevant.
Newline sequences Newline sequences
Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In
8-bit non-UTF-8 mode \R is equivalent to the following: 8-bit non-UTF-8 mode \R is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85) (?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given below . This particular group matches This is an example of an "atomic group", details of which are given below . This particular group matches
either the two-character sequence CR followed by LF, or one of the si either the two-character sequence CR followed by LF, or one of the
ngle characters LF (linefeed, single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (carria U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage
ge return, U+000D), or NEL (next return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequenc line, U+0085). Because this is an atomic group, the two-character sequ
e is treated as a single unit ence is treated as a single unit
that cannot be split. that cannot be split.
In other modes, two additional characters whose code points are greater t han 255 are added: LS (line sep- In other modes, two additional characters whose code points are greater t han 255 are added: LS (line sep-
arator, U+2028) and PS (paragraph separator, U+2029). Unicode support is not needed for these characters arator, U+2028) and PS (paragraph separator, U+2029). Unicode support is not needed for these characters
to be recognized. to be recognized.
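As a rough illustration of the \R behaviour described above, the following Python sketch reports how many characters \R would consume at a given position. It is an assumption-laden model, not PCRE2's implementation: the function name match_newline_sequence and the eight_bit_non_utf flag are hypothetical.

```python
# Single characters that \R matches in every mode.
SINGLE_CHARS = {"\n", "\x0b", "\x0c", "\r", "\x85"}
# Added in modes other than 8-bit non-UTF-8: LS and PS.
WIDE_CHARS = {"\u2028", "\u2029"}

def match_newline_sequence(subject, pos, eight_bit_non_utf=False):
    # CRLF is tried first and is never split afterwards, which models
    # the atomic group: there is no fallback to matching CR alone.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject):
        ch = subject[pos]
        if ch in SINGLE_CHARS:
            return 1
        if not eight_bit_non_utf and ch in WIDE_CHARS:
            return 1
    return 0  # no newline sequence at this position

print(match_newline_sequence("a\r\nb", 1))
```

Because the CRLF case returns immediately, a hypothetical caller that needed a following LF after \R would fail on "\r\n" rather than retry with CR alone, which is the effect of the atomic group.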
It is possible to restrict \R to match only CR, LF, or CRLF (instead of It is possible to restrict \R to match only CR, LF, or CRLF (instead of t
the complete set of Unicode line he complete set of Unicode line
endings) by setting the option PCRE2_BSR_ANYCRLF at compile time. (BSR is endings) by setting the option PCRE2_BSR_ANYCRLF at compile time. (BSR
an abbreviation for "backslash is an abbreviation for "backslash
R".) This can be made the default when PCRE2 is built; if this is the R".) This can be made the default when PCRE2 is built; if this is the cas
case, the other behaviour can be e, the other behaviour can be
requested via the PCRE2_BSR_UNICODE option. It is also possible to specif requested via the PCRE2_BSR_UNICODE option. It is also possible to spec
y these settings by starting a ify these settings by starting a
pattern string with one of the following sequences: pattern string with one of the following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only (*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence (*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to the compiling functio n. Note that these special set- These override the default and the options given to the compiling functio n. Note that these special set-
tings, which are not Perl-compatible, are recognized only at the very sta tings, which are not Perl-compatible, are recognized only at the very s
rt of a pattern, and that they tart of a pattern, and that they
must be in upper case. If more than one of them is present, the last o must be in upper case. If more than one of them is present, the last one
ne is used. They can be combined is used. They can be combined
with a change of newline convention; for example, a pattern can start wit h: with a change of newline convention; for example, a pattern can start wit h:
(*ANY)(*BSR_ANYCRLF) (*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF) or (*UCP) special sequences. In side a character class, \R is They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a character class, \R is
treated as an unrecognized escape sequence, and causes an error. treated as an unrecognized escape sequence, and causes an error.
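The rules for the leading (*BSR_...) settings (start of pattern only, upper case only, last one wins) can be sketched in Python as follows. This is a simplified model under stated assumptions: the helper bsr_setting is hypothetical, it treats every leading item of the form (*NAME) as skippable, and it does not handle leading items that take arguments.

```python
import re

# A leading item of the form (*NAME), upper case letters and underscores only.
_LEADING_ITEM = re.compile(r"\(\*([A-Z_]+)\)")

def bsr_setting(pattern, default="UNICODE"):
    setting = default
    pos = 0
    while True:
        m = _LEADING_ITEM.match(pattern, pos)
        if m is None:
            break  # first non-item ends the leading sequence
        if m.group(1) == "BSR_ANYCRLF":
            setting = "ANYCRLF"
        elif m.group(1) == "BSR_UNICODE":
            setting = "UNICODE"
        pos = m.end()
    return setting

print(bsr_setting("(*ANY)(*BSR_ANYCRLF)a\\Rb"))
```

The default argument stands in for the build-time default and the compile options, both of which the leading settings override.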
Unicode character properties Unicode character properties
When PCRE2 is built with Unicode support (the default), three additi When PCRE2 is built with Unicode support (the default), three additional
onal escape sequences that match escape sequences that match
characters with specific properties are available. They can be used in an characters with specific properties are available. They can be used
y mode, though in 8-bit and in any mode, though in 8-bit and
16-bit non-UTF modes these sequences are of course limited to testing 16-bit non-UTF modes these sequences are of course limited to testing cha
characters whose code points are racters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, c
points greater than 0x10ffff ode points greater than 0x10ffff
(the Unicode limit) may be encountered. These are all treated as being i (the Unicode limit) may be encountered. These are all treated as being in
n the Unknown script and with an the Unknown script and with an
unassigned type. The extra escape sequences are: unassigned type. The extra escape sequences are:
\p{xx} a character with the xx property \p{xx} a character with the xx property
\P{xx} a character without the xx property \P{xx} a character without the xx property
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
The property names represented by xx above are case-sensitive. There is s upport for Unicode script names, The property names represented by xx above are case-sensitive. There is s upport for Unicode script names,
Unicode general category properties, "Any", which matches any charact Unicode general category properties, "Any", which matches any character
er (including newline), and some (including newline), and some
special PCRE2 properties (described in the next section). Other Perl pro special PCRE2 properties (described in the next section). Other Perl p
perties such as "InMusicalSym- roperties such as "InMusicalSym-
bols" are not supported by PCRE2. Note that \P{Any} does not match an bols" are not supported by PCRE2. Note that \P{Any} does not match any c
y characters, so always causes a haracters, so always causes a
match failure. match failure.
Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these Sets of Unicode characters are defined as belonging to certain script s. A character from one of these
sets can be matched using a script name. For example: sets can be matched using a script name. For example:
\p{Greek} \p{Greek}
\P{Han} \P{Han}
Unassigned characters (and in non-UTF 32-bit mode, characters with code p oints greater than 0x10FFFF) are Unassigned characters (and in non-UTF 32-bit mode, characters with code p oints greater than 0x10FFFF) are
assigned the "Unknown" script. Others that are not part of an identified script are lumped together as assigned the "Unknown" script. Others that are not part of an identifi ed script are lumped together as
"Common". The current list of scripts is: "Common". The current list of scripts is:
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balines e, Bamum, Bassa_Vah, Batak, Ben- Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Ben-
gali, Bhaiksuki, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Abo riginal, Carian, Caucasian_Alba- gali, Bhaiksuki, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Abo riginal, Carian, Caucasian_Alba-
nian, Chakma, Cham, Cherokee, Chorasmian, Common, Coptic, Cuneiform, Cyp nian, Chakma, Cham, Cherokee, Chorasmian, Common, Coptic, Cuneiform, Cypr
riot, Cyrillic, Deseret, Devana- iot, Cyrillic, Deseret, Devana-
gari, Dives_Akuru, Dogra, Duployan, Egyptian_Hieroglyphs, Elbasan, gari, Dives_Akuru, Dogra, Duployan, Egyptian_Hieroglyphs, Elbasan,
Elymaic, Ethiopic, Georgian, Elymaic, Ethiopic, Georgian,
Glagolitic, Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukh Glagolitic, Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi,
i, Han, Hangul, Hanifi_Rohingya, Han, Hangul, Hanifi_Rohingya,
Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, I Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
nscriptional_Pahlavi, Inscrip- Inscriptional_Pahlavi, Inscrip-
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharosh tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshth
thi, Khitan_Small_Script, Khmer, i, Khitan_Small_Script, Khmer,
Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Li Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B,
su, Lycian, Lydian, Mahajani, Lisu, Lycian, Lydian, Mahajani,
Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi , Medefaidrin, Meetei_Mayek, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi , Medefaidrin, Meetei_Mayek,
Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mong Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, M
olian, Mro, Multani, Myanmar, ongolian, Mro, Multani, Myanmar,
Nabataean, Nandinagari, New_Tai_Lue, Newa, Nko, Nushu, Nyakeng_Puachue_ Nabataean, Nandinagari, New_Tai_Lue, Newa, Nko, Nushu, Nyakeng_Puachue_Hm
Hmong, Ogham, Ol_Chiki, Old_Hun- ong, Ogham, Ol_Chiki, Old_Hun-
garian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sogdi garian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sogd
an, Old_South_Arabian, Old_Tur- ian, Old_South_Arabian, Old_Tur-
kic, Oriya, Osage, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_ kic, Oriya, Osage, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_P
Pa, Phoenician, Psalter_Pahlavi, a, Phoenician, Psalter_Pahlavi,
Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWrit ing, Sinhala, Sogdian, Sora_Som- Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWrit ing, Sinhala, Sogdian, Sora_Som-
peng, Soyombo, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Ta peng, Soyombo, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_L
i_Le, Tai_Tham, Tai_Viet, Takri, e, Tai_Tham, Tai_Viet, Takri,
Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ug Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta,
aritic, Unknown, Vai, Wancho, Ugaritic, Unknown, Vai, Wancho,
Warang_Citi, Yezidi, Yi, Zanabazar_Square. Warang_Citi, Yezidi, Yi, Zanabazar_Square.
Each character has exactly one Unicode general category property, specifi ed by a two-letter abbreviation. Each character has exactly one Unicode general category property, specifi ed by a two-letter abbreviation.
For compatibility with Perl, negation can be specified by including a c ircumflex between the opening For compatibility with Perl, negation can be specified by including a circumflex between the opening
brace and the property name. For example, \p{^Lu} is the same as \P{Lu}. brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the genera l category properties that start If only one letter is specified with \p or \P, it includes all the genera l category properties that start
with that letter. In this case, in the absence of negation, the curly bra ckets in the escape sequence are with that letter. In this case, in the absence of negation, the curly bra ckets in the escape sequence are
optional; these two examples have the same effect: optional; these two examples have the same effect:
\p{L} \p{L}
\pL \pL
The following general category property codes are supported: The following general category property codes are supported:
skipping to change at line 646 skipping to change at line 653
So Other symbol So Other symbol
Z Separator Z Separator
Zl Line separator Zl Line separator
Zp Paragraph separator Zp Paragraph separator
Zs Space separator Zs Space separator
The special property L& is also supported: it matches a character that ha s the Lu, Ll, or Lt property, in The special property L& is also supported: it matches a character that ha s the Lu, Ll, or Lt property, in
other words, a letter that is not classified as a modifier or "other". other words, a letter that is not classified as a modifier or "other".
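The general category rules above (two-letter codes, one-letter prefixes, circumflex negation, and L&) can be modelled with a short Python sketch. It assumes Python's unicodedata module in place of PCRE2's tables, and has_property is a hypothetical helper whose argument is the text between the braces of \p{...}.

```python
import unicodedata

def has_property(ch, prop):
    # A leading circumflex negates, so "^Lu" behaves like \P{Lu}.
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)
    if prop == "L&":
        # Lu, Ll, or Lt: a letter that is not a modifier or "other".
        result = category in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:
        # One letter covers every category that starts with it.
        result = category[0] == prop
    else:
        result = category == prop
    return result != negate

print(has_property("A", "Lu"), has_property("A", "^Lu"))
```

For example, U+02B0 (MODIFIER LETTER SMALL H, category Lm) has the L property but not L&.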
The Cs (Surrogate) property applies only to characters whose code points are in the range U+D800 to The Cs (Surrogate) property applies only to characters whose code p oints are in the range U+D800 to
U+DFFF. These characters are no different to any other character when PCR E2 is not in UTF mode (using the U+DFFF. These characters are no different to any other character when PCR E2 is not in UTF mode (using the
16-bit or 32-bit library). However, they are not valid in Unicode string s and so cannot be tested by 16-bit or 32-bit library). However, they are not valid in Unicode st rings and so cannot be tested by
PCRE2 in UTF mode, unless UTF validity checking has been turned off (see the discussion of PCRE2 in UTF mode, unless UTF validity checking has been turned off (see the discussion of
PCRE2_NO_UTF_CHECK in the pcre2api page). PCRE2_NO_UTF_CHECK in the pcre2api page).
The long synonyms for property names that Perl supports (such as \p{Lette r}) are not supported by PCRE2, The long synonyms for property names that Perl supports (such as \p{Lett er}) are not supported by PCRE2,
nor is it permitted to prefix any of these properties with "Is". nor is it permitted to prefix any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop erty. Instead, this property is No character that is in the Unicode table has the Cn (unassigned) propert y. Instead, this property is
assumed for any code point that is not in the Unicode table. assumed for any code point that is not in the Unicode table.
Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches Specifying caseless matching does not affect these escape sequences. Fo r example, \p{Lu} always matches
only upper case letters. This is different from the behaviour of current versions of Perl. only upper case letters. This is different from the behaviour of current versions of Perl.
Matching characters by Unicode property is not fast, because PCRE2 has to do a multistage table lookup in Matching characters by Unicode property is not fast, because PCRE2 has to do a multistage table lookup in
order to find a character's property. That is why the traditional escape order to find a character's property. That is why the traditional escap
sequences such as \d and \w do e sequences such as \d and \w do
not use Unicode properties in PCRE2 by default, though you can make the not use Unicode properties in PCRE2 by default, though you can make them
m do so by setting the PCRE2_UCP do so by setting the PCRE2_UCP
option or by starting the pattern with (*UCP). option or by starting the pattern with (*UCP).
Extended grapheme clusters Extended grapheme clusters
The \X escape matches any number of Unicode characters that form an "ex The \X escape matches any number of Unicode characters that form an
tended grapheme cluster", and "extended grapheme cluster", and
treats the sequence as an atomic group (see below). Unicode supports va treats the sequence as an atomic group (see below). Unicode supports var
rious kinds of composite charac- ious kinds of composite charac-
ter by giving each character a grapheme breaking property, and having rul ter by giving each character a grapheme breaking property, and having ru
es that use these properties to les that use these properties to
define the boundaries of extended grapheme clusters. The rules are defin define the boundaries of extended grapheme clusters. The rules are define
ed in Unicode Standard Annex 29, d in Unicode Standard Annex 29,
"Unicode Text Segmentation". Unicode 11.0.0 abandoned the use of some pre "Unicode Text Segmentation". Unicode 11.0.0 abandoned the use of some p
vious properties that had been revious properties that had been
used for emojis. Instead it introduced various emoji-specific properti used for emojis. Instead it introduced various emoji-specific properties
es. PCRE2 uses only the Extended . PCRE2 uses only the Extended
Pictographic property. Pictographic property.
\X always matches at least one character. Then it decides whether to add additional characters according \X always matches at least one character. Then it decides whether to add additional characters according
to the following rules for ending a cluster: to the following rules for ending a cluster:
1. End at the end of the subject string. 1. End at the end of the subject string.
2. Do not end between CR and LF; otherwise end after any control character. 2. Do not end between CR and LF; otherwise end after any control character.
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters are of five types: L, V, 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters are of five types: L, V,
T, LV, and LVT. An L character may be followed by an L, V, LV, or LVT character; an LV or V character may T, LV, and LVT. An L character may be followed by an L, V, LV, or LVT character; an LV or V character may
be followed by a V or T character; an LVT or T character may be followed only by a T character. be followed by a V or T character; an LVT or T character may be followed only by a T character.
4. Do not end before extending characters or spacing marks or the "zero-width joiner" character. Charac- 4. Do not end before extending characters or spacing marks or the "zero-width joiner" character. Charac-
ters with the "mark" property always have the "extend" grapheme breaking property. ters with the "mark" property always have the "extend" grapheme breaking property.
5. Do not end after prepend characters. 5. Do not end after prepend characters.
6. Do not break within emoji modifier sequences or emoji zwj sequences. T 6. Do not break within emoji modifier sequences or emoji zwj sequence
hat is, do not break between s. That is, do not break between
characters with the Extended_Pictographic property. Extend and ZWJ ch characters with the Extended_Pictographic property. Extend and ZWJ chara
aracters are allowed between the cters are allowed between the
characters. characters.
7. Do not break within emoji flag sequences. That is, do not break betwee n regional indicator (RI) char- 7. Do not break within emoji flag sequences. That is, do not break betwe en regional indicator (RI) char-
acters if there are an odd number of RI characters before the break point . acters if there are an odd number of RI characters before the break point .
8. Otherwise, end the cluster. 8. Otherwise, end the cluster.
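Two of the rules above lend themselves to a small table-driven sketch: rule 3 (Hangul syllable sequences) and rule 7 (regional indicator pairs). The Python below is only an illustration under stated assumptions. It works on grapheme break type names rather than on real characters, and the names MAY_FOLLOW, hangul_keeps_together, and ri_break_allowed are hypothetical.

```python
# Rule 3: which Hangul types may follow each type without ending the cluster.
MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}

def hangul_keeps_together(prev_type, next_type):
    # True when the cluster must not end between the two characters.
    return next_type in MAY_FOLLOW.get(prev_type, set())

def ri_break_allowed(ri_count_before):
    # Rule 7: no break between RI characters when an odd number of
    # RI characters precede the break point.
    return ri_count_before % 2 == 0

print(hangul_keeps_together("L", "V"), ri_break_allowed(1))
```

A complete implementation would also need the per-character grapheme break properties and the remaining rules, which this sketch leaves out.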
PCRE2's additional properties PCRE2's additional properties
As well as the standard Unicode properties described above, PCRE2 suppor ts four more that make it possi- As well as the standard Unicode properties described above, PCRE2 support s four more that make it possi-
ble to convert traditional escape sequences such as \w and \s to use Unic ode properties. PCRE2 uses these ble to convert traditional escape sequences such as \w and \s to use Unic ode properties. PCRE2 uses these
non-standard, non-Perl properties internally when PCRE2_UCP is set . However, they may also be used non-standard, non-Perl properties internally when PCRE2_UCP is set. How ever, they may also be used
explicitly. These properties are: explicitly. These properties are:
Xan Any alphanumeric character Xan Any alphanumeric character
Xps Any POSIX space character Xps Any POSIX space character
Xsp Any Perl space character Xsp Any Perl space character
Xwd Any Perl "word" character Xwd Any Perl "word" character
Xan matches characters that have either the L (letter) or the N (number) property. Xps matches the char- Xan matches characters that have either the L (letter) or the N (number) property. Xps matches the char-
acters tab, linefeed, vertical tab, form feed, or carriage return, and an y other character that has the Z acters tab, linefeed, vertical tab, form feed, or carriage return, and an y other character that has the Z
(separator) property. Xsp is the same as Xps; in PCRE1 it used to exclud e vertical tab, for Perl compat- (separator) property. Xsp is the same as Xps; in PCRE1 it used to exclud e vertical tab, for Perl compat-
ibility, but Perl changed. Xwd matches the same characters as Xan, plus u nderscore. ibility, but Perl changed. Xwd matches the same characters as Xan, plus u nderscore.
There is another non-standard property, Xuc, which matches any charac There is another non-standard property, Xuc, which matches any character
ter that can be represented by a that can be represented by a
Universal Character Name in C++ and other programming languages. These ar Universal Character Name in C++ and other programming languages. These a
e the characters $, @, ` (grave re the characters $, @, ` (grave
accent), and all characters with Unicode code points greater than or equal to U+00A0, except for the sur- accent), and all characters with Unicode code points greater than or equal to U+00A0, except for the sur-
rogates U+D800 to U+DFFF. Note that most base (ASCII) characters are excluded. (Universal Character Names rogates U+D800 to U+DFFF. Note that most base (ASCII) characters are excluded. (Universal Character Names
are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. Note that the Xuc property does not are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. Note that the Xuc property does not
match these sequences but the characters that they represent.) match these sequences but the characters that they represent.)
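The Xuc definition above is simple enough to restate as a predicate. The following Python sketch follows the description literally; the function name is_xuc is hypothetical and the code is not taken from PCRE2.

```python
def is_xuc(ch):
    # The three ASCII characters that are included.
    if ch in "$@`":
        return True
    cp = ord(ch)
    # Everything from U+00A0 upwards, except the surrogates U+D800 to U+DFFF.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)

print(is_xuc("$"), is_xuc("a"), is_xuc("\u00e9"))
```

As the text notes, the predicate applies to the character itself, not to a \uHHHH or \UHHHHHHHH spelling of it.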
Resetting the match start Resetting the match start
In normal use, the escape sequence \K causes any previously matched chara cters not to be included in the In normal use, the escape sequence \K causes any previously matched char acters not to be included in the
final matched sequence that is returned. For example, the pattern: final matched sequence that is returned. For example, the pattern:
foo\Kbar foo\Kbar
matches "foobar", but reports that it has matched "bar". \K does not int eract with anchoring in any way. matches "foobar", but reports that it has matched "bar". \K does not inte ract with anchoring in any way.
The pattern: The pattern:
^foo\Kbar ^foo\Kbar
matches only when the subject begins with "foobar" (in single line mode), matches only when the subject begins with "foobar" (in single line mo
though it again reports the de), though it again reports the
matched string as "bar". This feature is similar to a lookbehind assert matched string as "bar". This feature is similar to a lookbehind assertio
ion (described below). However, n (described below). However,
in this case, the part of the subject before the real match does not have to be of fixed length, as look- in this case, the part of the subject before the real match does not have to be of fixed length, as look-
behind assertions do. The use of \K does not interfere with the sett ing of captured substrings. For behind assertions do. The use of \K does not interfere with the setting of captured substrings. For
example, when the pattern example, when the pattern
(foo)\Kbar (foo)\Kbar
matches "foobar", the first substring is still set to "foo". matches "foobar", the first substring is still set to "foo".
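The effect of \K on the reported match can be imitated in engines that lack it. The Python sketch below assumes the standard re module, which has no \K, and marks the reset point with an empty named group instead; the group name "reset" and the helper match_with_reset are hypothetical.

```python
import re

def match_with_reset(regex, subject):
    # The empty group (?P<reset>) stands where \K would appear.
    m = re.search(regex, subject)
    if m is None:
        return None
    # Report only the text from the reset point to the end of the match;
    # capture group 1 is unaffected, as with (foo)\Kbar.
    reported = subject[m.start("reset"):m.end()]
    return reported, m.group(1)

print(match_with_reset(r"(foo)(?P<reset>)bar", "say foobar"))
```

The whole of "foobar" still has to match; only the reported start moves, and the first captured substring is still "foo".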
Perl documents that the use of \K within assertions is "not well defined" Perl used to document that the use of \K within lookaround assertions
. In PCRE2, \K is acted upon is "not well defined", but from
when it occurs inside positive assertions, but is ignored in negative a version 5.32.0 Perl does not support this usage at all. In PCRE2, \K is
ssertions. Note that when a pat- acted upon when it occurs inside
tern such as (?=ab\K) matches, the reported start of the match can be gre positive assertions, but is ignored in negative assertions. Note that
ater than the end of the match. when a pattern such as (?=ab\K)
Using \K in a lookbehind assertion at the start of a pattern can also l matches, the reported start of the match can be greater than the end of t
ead to odd effects. For example, he match. Using \K in a lookbe-
consider this pattern: hind assertion at the start of a pattern can also lead to odd effects.
For example, consider this pat-
tern:
(?<=\Kfoo)bar (?<=\Kfoo)bar
If the subject is "foobar", a call to pcre2_match() with a starting offse t of 3 succeeds and reports the If the subject is "foobar", a call to pcre2_match() with a starting offse t of 3 succeeds and reports the
matching string as "foobar", that is, the start of the reported match is earlier than where the match matching string as "foobar", that is, the start of the reported match is earlier than where the match
started. started.
Simple assertions Simple assertions
The final use of backslash is for certain simple assertions. An assertion specifies a condition that has The final use of backslash is for certain simple assertions. An assertion specifies a condition that has
skipping to change at line 945 skipping to change at line 953
For example, the character class [aeiou] matches any lower case vowel, wh ile [^aeiou] matches any charac- For example, the character class [aeiou] matches any lower case vowel, wh ile [^aeiou] matches any charac-
ter that is not a lower case vowel. Note that a circumflex is just a co nvenient notation for specifying ter that is not a lower case vowel. Note that a circumflex is just a co nvenient notation for specifying
the characters that are in the class by enumerating those that are not. A class that starts with a cir- the characters that are in the class by enumerating those that are not. A class that starts with a cir-
cumflex is not an assertion; it still consumes a character from the subject string, and therefore it cumflex is not an assertion; it still consumes a character from the subject string, and therefore it
fails if the current pointer is at the end of the string. fails if the current pointer is at the end of the string.
Characters in a class may be specified by their code points using \o, \x, or \N{U+hh..} in the usual way. Characters in a class may be specified by their code points using \o, \x, or \N{U+hh..} in the usual way.
When caseless matching is set, any letters in a class represent both thei r upper case and lower case ver- When caseless matching is set, any letters in a class represent both thei r upper case and lower case ver-
sions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not sions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. match "A", whereas a caseful version would. Note that there are two ASC
II characters, K and S, that, in
addition to their lower case ASCII equivalents, are case-equivalent with
Unicode U+212A (Kelvin sign) and
U+017F (long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
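The two case equivalences mentioned in the note can be checked with Unicode case folding. This Python sketch uses str.casefold(), which applies full Unicode case folding and is only an approximation of what PCRE2 does when PCRE2_UTF or PCRE2_UCP is set; case_equivalent is a hypothetical helper.

```python
def case_equivalent(a, b):
    # Two characters are treated as case-equivalent when they fold to the same text.
    return a.casefold() == b.casefold()

KELVIN_SIGN = "\u212a"  # U+212A
LONG_S = "\u017f"       # U+017F

print(case_equivalent("K", KELVIN_SIGN), case_equivalent("s", LONG_S))
```

In practice this means a caseless class such as [k] may also match U+212A in those modes, which can matter when a pattern is used to validate ASCII-only input.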
Characters that might indicate line breaks are never treated in any spe cial way when matching character Characters that might indicate line breaks are never treated in any spe cial way when matching character
classes, whatever line-ending sequence is in use, and whatever setting of the PCRE2_DOTALL and PCRE2_MUL- classes, whatever line-ending sequence is in use, and whatever setting of the PCRE2_DOTALL and PCRE2_MUL-
TILINE options is used. A class such as [^a] always matches one of these characters. TILINE options is used. A class such as [^a] always matches one of these characters.
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s, \ S, \v, \V, \w, and \W may appear The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s, \ S, \v, \V, \w, and \W may appear
in a character class, and add the characters that they match to the c lass. For example, [\dABCDEF] in a character class, and add the characters that they match to the c lass. For example, [\dABCDEF]
matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option affect s the meanings of \d, \s, \w and matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option affect s the meanings of \d, \s, \w and
their upper case partners, just as it does when they appear outside a cha racter class, as described in their upper case partners, just as it does when they appear outside a cha racter class, as described in
the section entitled "Generic character types" above. The escape seq uence \b has a different meaning the section entitled "Generic character types" above. The escape seq uence \b has a different meaning
skipping to change at line 2748 skipping to change at line 2758
pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3). pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3).
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge, England. Cambridge, England.
REVISION REVISION
Last updated: 24 February 2020 Last updated: 06 October 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
PCRE2 10.35 24 February 2020 PCRE2PATTERN(3) PCRE2 10.35 06 October 2020 PCRE2PATTERN(3)
 End of changes. 83 change blocks. 
267 lines changed or deleted 287 lines changed or added
