"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "doc/pcre.txt" between
pcre-8.43.tar.bz2 and pcre-8.44.tar.bz2

About: The PCRE library implements Perl compatible regular expression pattern matching.

pcre.txt  (pcre-8.43.tar.bz2):pcre.txt  (pcre-8.44.tar.bz2)
skipping to change at line 36 skipping to change at line 36
The PCRE library is a set of functions that implement regular expres- The PCRE library is a set of functions that implement regular expres-
sion pattern matching using the same syntax and semantics as Perl, with sion pattern matching using the same syntax and semantics as Perl, with
just a few differences. Some features that appeared in Python and PCRE just a few differences. Some features that appeared in Python and PCRE
before they appeared in Perl are also available using the Python syn- before they appeared in Perl are also available using the Python syn-
tax, there is some support for one or two .NET and Oniguruma syntax tax, there is some support for one or two .NET and Oniguruma syntax
items, and there is an option for requesting some minor changes that items, and there is an option for requesting some minor changes that
give better JavaScript compatibility. give better JavaScript compatibility.
Starting with release 8.30, it is possible to compile two separate PCRE Starting with release 8.30, it is possible to compile two separate PCRE
libraries: the original, which supports 8-bit character strings libraries: the original, which supports 8-bit character strings (in-
(including UTF-8 strings), and a second library that supports 16-bit cluding UTF-8 strings), and a second library that supports 16-bit char-
character strings (including UTF-16 strings). The build process allows acter strings (including UTF-16 strings). The build process allows ei-
either one or both to be built. The majority of the work to make this ther one or both to be built. The majority of the work to make this
possible was done by Zoltan Herczeg. possible was done by Zoltan Herczeg.
Starting with release 8.32 it is possible to compile a third separate Starting with release 8.32 it is possible to compile a third separate
PCRE library that supports 32-bit character strings (including UTF-32 PCRE library that supports 32-bit character strings (including UTF-32
strings). The build process allows any combination of the 8-, 16- and strings). The build process allows any combination of the 8-, 16- and
32-bit libraries. The work to make this possible was done by Christian 32-bit libraries. The work to make this possible was done by Christian
Persch. Persch.
The three libraries contain identical sets of functions, except that The three libraries contain identical sets of functions, except that
the names in the 16-bit library start with pcre16_ instead of pcre_, the names in the 16-bit library start with pcre16_ instead of pcre_,
skipping to change at line 119 skipping to change at line 119
which interprets patterns and subjects as strings of UTF-8 characters which interprets patterns and subjects as strings of UTF-8 characters
instead of individual 8-bit characters. This causes both the pattern instead of individual 8-bit characters. This causes both the pattern
and any data against which it is matched to be checked for UTF-8 valid- and any data against which it is matched to be checked for UTF-8 valid-
ity. If the data string is very long, such a check might use suffi- ity. If the data string is very long, such a check might use suffi-
ciently many resources as to cause your application to lose perfor- ciently many resources as to cause your application to lose perfor-
mance. mance.
One way of guarding against this possibility is to use the One way of guarding against this possibility is to use the
pcre_fullinfo() function to check the compiled pattern's options for pcre_fullinfo() function to check the compiled pattern's options for
UTF. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF UTF. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF
option at compile time. This causes an compile time error if a pattern option at compile time. This causes a compile time error if a pattern
contains a UTF-setting sequence. contains a UTF-setting sequence.
If your application is one that supports UTF, be aware that validity If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many checking can take time. If the same data string is to be matched many
times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
and subsequent matches to save redundant checks. and subsequent matches to save redundant checks.
Another way that performance can be hit is by running a pattern that Another way that performance can be hit is by running a pattern that
has a very large search tree against a string that will never match. has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE pro- Nested unlimited repeats in a pattern are a common example. PCRE pro-
skipping to change at line 300 skipping to change at line 300
int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output, int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
PCRE_SPTR16 input, int length, int *byte_order, PCRE_SPTR16 input, int length, int *byte_order,
int keep_boms); int keep_boms);
THE PCRE 16-BIT LIBRARY THE PCRE 16-BIT LIBRARY
Starting with release 8.30, it is possible to compile a PCRE library Starting with release 8.30, it is possible to compile a PCRE library
that supports 16-bit character strings, including UTF-16 strings, as that supports 16-bit character strings, including UTF-16 strings, as
well as or instead of the original 8-bit library. The majority of the well as or instead of the original 8-bit library. The majority of the
work to make this possible was done by Zoltan Herczeg. The two work to make this possible was done by Zoltan Herczeg. The two li-
libraries contain identical sets of functions, used in exactly the same braries contain identical sets of functions, used in exactly the same
way. Only the names of the functions and the data types of their argu- way. Only the names of the functions and the data types of their argu-
ments and results are different. To avoid over-complication and reduce ments and results are different. To avoid over-complication and reduce
the documentation maintenance load, most of the PCRE documentation the documentation maintenance load, most of the PCRE documentation de-
describes the 8-bit library, with only occasional references to the scribes the 8-bit library, with only occasional references to the
16-bit library. This page describes what is different when you use the 16-bit library. This page describes what is different when you use the
16-bit library. 16-bit library.
WARNING: A single application can be linked with both libraries, but WARNING: A single application can be linked with both libraries, but
you must take care when processing any particular pattern to use func- you must take care when processing any particular pattern to use func-
tions from just one library. For example, if you want to study a pat- tions from just one library. For example, if you want to study a pat-
tern that was compiled with pcre16_compile(), you must do so with tern that was compiled with pcre16_compile(), you must do so with
pcre16_study(), not pcre_study(), and you must free the study data with pcre16_study(), not pcre_study(), and you must free the study data with
pcre16_free_study(). pcre16_free_study().
skipping to change at line 333 skipping to change at line 333
In Unix-like systems, the 16-bit library is called libpcre16, and can In Unix-like systems, the 16-bit library is called libpcre16, and can
normally be accessed by adding -lpcre16 to the command for linking an normally be accessed by adding -lpcre16 to the command for linking an
application that uses PCRE. application that uses PCRE.
STRING TYPES STRING TYPES
In the 8-bit library, strings are passed to PCRE library functions as In the 8-bit library, strings are passed to PCRE library functions as
vectors of bytes with the C type "char *". In the 16-bit library, vectors of bytes with the C type "char *". In the 16-bit library,
strings are passed as vectors of unsigned 16-bit quantities. The macro strings are passed as vectors of unsigned 16-bit quantities. The macro
PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is de-
defined as "const PCRE_UCHAR16 *". In very many environments, "short fined as "const PCRE_UCHAR16 *". In very many environments, "short int"
int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16 is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16 as
as "unsigned short int", but checks that it really is a 16-bit data "unsigned short int", but checks that it really is a 16-bit data type.
type. If it is not, the build fails with an error message telling the If it is not, the build fails with an error message telling the main-
maintainer to modify the definition appropriately. tainer to modify the definition appropriately.
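The width check described above can be imitated in application code. The following is a minimal sketch under stated assumptions, not PCRE's actual build logic: the names app_uchar16 and app_uchar16_bits are hypothetical, and the test on USHRT_MAX is just one plausible way to detect a 16-bit type.

```c
#include <limits.h>

/* Hypothetical application-side alias; PCRE itself supplies PCRE_UCHAR16. */
typedef unsigned short int app_uchar16;

/* Stop the build if "unsigned short int" cannot hold exactly 16 value bits,
   in the spirit of the build-time check described in the text. */
#if USHRT_MAX != 65535
#error "unsigned short int is not a 16-bit type; change app_uchar16"
#endif

/* Return the storage width of app_uchar16 in bits. */
static int app_uchar16_bits(void)
{
    return (int)(sizeof(app_uchar16) * CHAR_BIT);
}
```

On a platform where the preprocessor test passes, app_uchar16_bits() would normally report 16, so a buffer of app_uchar16 can be handed to code that expects 16-bit data units.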
STRUCTURE TYPES STRUCTURE TYPES
The types of the opaque structures that are used for compiled 16-bit The types of the opaque structures that are used for compiled 16-bit
patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively. patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively.
The type of the user-accessible structure that is returned by The type of the user-accessible structure that is returned by
pcre16_study() is pcre16_extra, and the type of the structure that is pcre16_study() is pcre16_extra, and the type of the structure that is
used for passing data to a callout function is pcre16_callout_block. used for passing data to a callout function is pcre16_callout_block.
These structures contain the same fields, with the same names, as their These structures contain the same fields, with the same names, as their
8-bit counterparts. The only difference is that pointers to character 8-bit counterparts. The only difference is that pointers to character
skipping to change at line 402 skipping to change at line 402
The name-to-number translation table that is maintained for named sub- The name-to-number translation table that is maintained for named sub-
patterns uses 16-bit characters. The pcre16_get_stringtable_entries() patterns uses 16-bit characters. The pcre16_get_stringtable_entries()
function returns the length of each entry in the table as the number of function returns the length of each entry in the table as the number of
16-bit data units. 16-bit data units.
OPTION NAMES OPTION NAMES
There are two new general option names, PCRE_UTF16 and There are two new general option names, PCRE_UTF16 and
PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options de-
define the same bits in the options word. There is a discussion about fine the same bits in the options word. There is a discussion about the
the validity of UTF-16 strings in the pcreunicode page. validity of UTF-16 strings in the pcreunicode page.
For the pcre16_config() function there is an option PCRE_CONFIG_UTF16 For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
that returns 1 if UTF-16 support is configured, otherwise 0. If this that returns 1 if UTF-16 support is configured, otherwise 0. If this
option is given to pcre_config() or pcre32_config(), or if the option is given to pcre_config() or pcre32_config(), or if the
PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to pcre16_con- PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to pcre16_con-
fig(), the result is the PCRE_ERROR_BADOPTION error. fig(), the result is the PCRE_ERROR_BADOPTION error.
CHARACTER CODES CHARACTER CODES
In 16-bit mode, when PCRE_UTF16 is not set, character values are In 16-bit mode, when PCRE_UTF16 is not set, character values are
skipping to change at line 440 skipping to change at line 440
above). above).
ERROR NAMES ERROR NAMES
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre- The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
given when a compiled pattern is passed to a function that processes given when a compiled pattern is passed to a function that processes
patterns in the other mode, for example, if a pattern compiled with patterns in the other mode, for example, if a pattern compiled with
pcre_compile() is passed to pcre16_exec(). pcre_compile() is passed to pcre16_exec().
There are new error codes whose names begin with PCRE_UTF16_ERR for There are new error codes whose names begin with PCRE_UTF16_ERR for in-
invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for valid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
UTF-8 strings that are described in the section entitled "Reason codes UTF-8 strings that are described in the section entitled "Reason codes
for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
are: are:
PCRE_UTF16_ERR1 Missing low surrogate at end of string PCRE_UTF16_ERR1 Missing low surrogate at end of string
PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
PCRE_UTF16_ERR3 Isolated low surrogate PCRE_UTF16_ERR3 Isolated low surrogate
PCRE_UTF16_ERR4 Non-character PCRE_UTF16_ERR4 Non-character
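The first three reasons listed above can be illustrated with a small stand-alone scanner. This is only a sketch of the surrogate rules, not PCRE's validity checker: the function check_utf16 and the U16_* codes are hypothetical names, the code assumes that "unsigned short" is a 16-bit type, and the non-character reason is left out to keep the example short.

```c
#include <stddef.h>

/* Hypothetical reason codes that mirror the order of the list above;
   they are not the values defined in pcre.h. */
enum {
    U16_OK = 0,
    U16_MISSING_LOW = 1,   /* high surrogate is the last unit */
    U16_INVALID_LOW = 2,   /* high surrogate not followed by a low one */
    U16_ISOLATED_LOW = 3   /* low surrogate without a preceding high one */
};

/* Scan 'length' 16-bit units. On failure, store the offset of the offending
   unit in *erroroffset (when it is non-NULL) and return the reason code. */
static int check_utf16(const unsigned short *s, size_t length,
                       size_t *erroroffset)
{
    size_t i;
    for (i = 0; i < length; i++) {
        unsigned int c = s[i];
        int reason;

        if ((c & 0xf800u) != 0xd800u)
            continue;                        /* not a surrogate at all */

        if ((c & 0x0400u) != 0)
            reason = U16_ISOLATED_LOW;       /* 0xdc00..0xdfff seen first */
        else if (i + 1 == length)
            reason = U16_MISSING_LOW;
        else if ((s[i + 1] & 0xfc00u) != 0xdc00u)
            reason = U16_INVALID_LOW;
        else {
            i++;                             /* valid pair: skip the low half */
            continue;
        }

        if (erroroffset != NULL)
            *erroroffset = i;
        return reason;
    }
    return U16_OK;
}
```

A caller might run such a check once on a long subject string and, if it passes, set PCRE_NO_UTF16_CHECK for later matches against the same data.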
ERROR TEXTS ERROR TEXTS
skipping to change at line 481 skipping to change at line 481
-16 option is ignored. -16 option is ignored.
When PCRE is being built, the RunTest script that is called by "make When PCRE is being built, the RunTest script that is called by "make
check" uses the pcretest -C option to discover which of the 8-bit, check" uses the pcretest -C option to discover which of the 8-bit,
16-bit and 32-bit libraries has been built, and runs the tests appro- 16-bit and 32-bit libraries has been built, and runs the tests appro-
priately. priately.
NOT SUPPORTED IN 16-BIT MODE NOT SUPPORTED IN 16-BIT MODE
Not all the features of the 8-bit library are available with the 16-bit Not all the features of the 8-bit library are available with the 16-bit
library. The C++ and POSIX wrapper functions support only the 8-bit library. The C++ and POSIX wrapper functions support only the 8-bit li-
library, and the pcregrep program is at present 8-bit only. brary, and the pcregrep program is at present 8-bit only.
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge CB2 3QH, England.
REVISION REVISION
Last updated: 12 May 2013 Last updated: 12 May 2013
skipping to change at line 612 skipping to change at line 612
Starting with release 8.32, it is possible to compile a PCRE library Starting with release 8.32, it is possible to compile a PCRE library
that supports 32-bit character strings, including UTF-32 strings, as that supports 32-bit character strings, including UTF-32 strings, as
well as or instead of the original 8-bit library. This work was done by well as or instead of the original 8-bit library. This work was done by
Christian Persch, based on the work done by Zoltan Herczeg for the Christian Persch, based on the work done by Zoltan Herczeg for the
16-bit library. All three libraries contain identical sets of func- 16-bit library. All three libraries contain identical sets of func-
tions, used in exactly the same way. Only the names of the functions tions, used in exactly the same way. Only the names of the functions
and the data types of their arguments and results are different. To and the data types of their arguments and results are different. To
avoid over-complication and reduce the documentation maintenance load, avoid over-complication and reduce the documentation maintenance load,
most of the PCRE documentation describes the 8-bit library, with only most of the PCRE documentation describes the 8-bit library, with only
occasional references to the 16-bit and 32-bit libraries. This page occasional references to the 16-bit and 32-bit libraries. This page de-
describes what is different when you use the 32-bit library. scribes what is different when you use the 32-bit library.
WARNING: A single application can be linked with all or any of the WARNING: A single application can be linked with all or any of the
three libraries, but you must take care when processing any particular three libraries, but you must take care when processing any particular
pattern to use functions from just one library. For example, if you pattern to use functions from just one library. For example, if you
want to study a pattern that was compiled with pcre32_compile(), you want to study a pattern that was compiled with pcre32_compile(), you
must do so with pcre32_study(), not pcre_study(), and you must free the must do so with pcre32_study(), not pcre_study(), and you must free the
study data with pcre32_free_study(). study data with pcre32_free_study().
THE HEADER FILE THE HEADER FILE
skipping to change at line 639 skipping to change at line 639
In Unix-like systems, the 32-bit library is called libpcre32, and can In Unix-like systems, the 32-bit library is called libpcre32, and can
normally be accessed by adding -lpcre32 to the command for linking an normally be accessed by adding -lpcre32 to the command for linking an
application that uses PCRE. application that uses PCRE.
STRING TYPES STRING TYPES
In the 8-bit library, strings are passed to PCRE library functions as In the 8-bit library, strings are passed to PCRE library functions as
vectors of bytes with the C type "char *". In the 32-bit library, vectors of bytes with the C type "char *". In the 32-bit library,
strings are passed as vectors of unsigned 32-bit quantities. The macro strings are passed as vectors of unsigned 32-bit quantities. The macro
PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is de-
defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned fined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32 int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
as "unsigned int", but checks that it really is a 32-bit data type. If as "unsigned int", but checks that it really is a 32-bit data type. If
it is not, the build fails with an error message telling the maintainer it is not, the build fails with an error message telling the maintainer
to modify the definition appropriately. to modify the definition appropriately.
STRUCTURE TYPES STRUCTURE TYPES
The types of the opaque structures that are used for compiled 32-bit The types of the opaque structures that are used for compiled 32-bit
patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively. patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively.
The type of the user-accessible structure that is returned by The type of the user-accessible structure that is returned by
skipping to change at line 708 skipping to change at line 708
The name-to-number translation table that is maintained for named sub- The name-to-number translation table that is maintained for named sub-
patterns uses 32-bit characters. The pcre32_get_stringtable_entries() patterns uses 32-bit characters. The pcre32_get_stringtable_entries()
function returns the length of each entry in the table as the number of function returns the length of each entry in the table as the number of
32-bit data units. 32-bit data units.
OPTION NAMES OPTION NAMES
There are two new general option names, PCRE_UTF32 and There are two new general option names, PCRE_UTF32 and
PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and
PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options de-
define the same bits in the options word. There is a discussion about fine the same bits in the options word. There is a discussion about the
the validity of UTF-32 strings in the pcreunicode page. validity of UTF-32 strings in the pcreunicode page.
For the pcre32_config() function there is an option PCRE_CONFIG_UTF32 For the pcre32_config() function there is an option PCRE_CONFIG_UTF32
that returns 1 if UTF-32 support is configured, otherwise 0. If this that returns 1 if UTF-32 support is configured, otherwise 0. If this
option is given to pcre_config() or pcre16_config(), or if the option is given to pcre_config() or pcre16_config(), or if the
PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to pcre32_con- PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to pcre32_con-
fig(), the result is the PCRE_ERROR_BADOPTION error. fig(), the result is the PCRE_ERROR_BADOPTION error.
CHARACTER CODES CHARACTER CODES
In 32-bit mode, when PCRE_UTF32 is not set, character values are In 32-bit mode, when PCRE_UTF32 is not set, character values are
skipping to change at line 744 skipping to change at line 744
pcre32_utf32_to_host_byte_order() is provided to help with this (see pcre32_utf32_to_host_byte_order() is provided to help with this (see
above). above).
ERROR NAMES ERROR NAMES
The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart. The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
to a function that processes patterns in the other mode, for example, to a function that processes patterns in the other mode, for example,
if a pattern compiled with pcre_compile() is passed to pcre32_exec(). if a pattern compiled with pcre_compile() is passed to pcre32_exec().
There are new error codes whose names begin with PCRE_UTF32_ERR for There are new error codes whose names begin with PCRE_UTF32_ERR for in-
invalid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for valid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for
UTF-8 strings that are described in the section entitled "Reason codes UTF-8 strings that are described in the section entitled "Reason codes
for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
are: are:
PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff) PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
PCRE_UTF32_ERR2 Non-character PCRE_UTF32_ERR2 Non-character
PCRE_UTF32_ERR3 Character > 0x10ffff PCRE_UTF32_ERR3 Character > 0x10ffff
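The range rules behind the first and third reasons above can be sketched as follows. This is an illustration only, not PCRE's own checker: check_utf32 and the U32_* codes are hypothetical names, "unsigned int" is assumed to be a 32-bit type, and the non-character reason is omitted for brevity.

```c
#include <stddef.h>

/* Hypothetical reason codes numbered to match the list above;
   they are not the values defined in pcre.h. */
enum {
    U32_OK = 0,
    U32_SURROGATE = 1,     /* value in the range 0xd800 to 0xdfff */
    U32_TOO_LARGE = 3      /* value greater than 0x10ffff */
};

/* Scan 'length' 32-bit units. On failure, store the offset of the offending
   unit in *erroroffset (when it is non-NULL) and return the reason code. */
static int check_utf32(const unsigned int *s, size_t length,
                       size_t *erroroffset)
{
    size_t i;
    for (i = 0; i < length; i++) {
        int reason = U32_OK;

        if (s[i] >= 0xd800u && s[i] <= 0xdfffu)
            reason = U32_SURROGATE;
        else if (s[i] > 0x10ffffu)
            reason = U32_TOO_LARGE;

        if (reason != U32_OK) {
            if (erroroffset != NULL)
                *erroroffset = i;
            return reason;
        }
    }
    return U32_OK;
}
```

Because every UTF-32 code point occupies exactly one data unit, the check needs no look-ahead, unlike the surrogate-pair handling required for UTF-16.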
ERROR TEXTS ERROR TEXTS
skipping to change at line 784 skipping to change at line 784
-32 option is ignored. -32 option is ignored.
When PCRE is being built, the RunTest script that is called by "make When PCRE is being built, the RunTest script that is called by "make
check" uses the pcretest -C option to discover which of the 8-bit, check" uses the pcretest -C option to discover which of the 8-bit,
16-bit and 32-bit libraries has been built, and runs the tests appro- 16-bit and 32-bit libraries has been built, and runs the tests appro-
priately. priately.
NOT SUPPORTED IN 32-BIT MODE NOT SUPPORTED IN 32-BIT MODE
Not all the features of the 8-bit library are available with the 32-bit Not all the features of the 8-bit library are available with the 32-bit
library. The C++ and POSIX wrapper functions support only the 8-bit library. The C++ and POSIX wrapper functions support only the 8-bit li-
library, and the pcregrep program is at present 8-bit only. brary, and the pcregrep program is at present 8-bit only.
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge CB2 3QH, England.
REVISION REVISION
Last updated: 12 May 2013 Last updated: 12 May 2013
skipping to change at line 808 skipping to change at line 808
PCREBUILD(3) Library Functions Manual PCREBUILD(3) PCREBUILD(3) Library Functions Manual PCREBUILD(3)
NAME NAME
PCRE - Perl-compatible regular expressions PCRE - Perl-compatible regular expressions
BUILDING PCRE BUILDING PCRE
PCRE is distributed with a configure script that can be used to build PCRE is distributed with a configure script that can be used to build
the library in Unix-like environments using the applications known as the library in Unix-like environments using the applications known as
Autotools. Also in the distribution are files to support building Autotools. Also in the distribution are files to support building us-
using CMake instead of configure. The text file README contains general ing CMake instead of configure. The text file README contains general
information about building with Autotools (some of which is repeated information about building with Autotools (some of which is repeated
below), and also has some comments about building on various operating below), and also has some comments about building on various operating
systems. There is a lot more information about building PCRE without systems. There is a lot more information about building PCRE without
using Autotools (including information about using CMake and building using Autotools (including information about using CMake and building
"by hand") in the text file called NON-AUTOTOOLS-BUILD. You should "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
consult this file as well as the README file if you are building in a consult this file as well as the README file if you are building in a
non-Unix-like environment. non-Unix-like environment.
PCRE BUILD-TIME OPTIONS PCRE BUILD-TIME OPTIONS
skipping to change at line 833 skipping to change at line 833
lected by providing options to configure before running the make com- lected by providing options to configure before running the make com-
mand. However, the same options can be selected in both Unix-like and mand. However, the same options can be selected in both Unix-like and
non-Unix-like environments using the GUI facility of cmake-gui if you non-Unix-like environments using the GUI facility of cmake-gui if you
are using CMake instead of configure to build PCRE. are using CMake instead of configure to build PCRE.
If you are not using Autotools or CMake, option selection can be done If you are not using Autotools or CMake, option selection can be done
by editing the config.h file, or by passing parameter settings to the by editing the config.h file, or by passing parameter settings to the
compiler, as described in NON-AUTOTOOLS-BUILD. compiler, as described in NON-AUTOTOOLS-BUILD.
The complete list of options for configure (which includes the standard The complete list of options for configure (which includes the standard
ones such as the selection of the installation directory) can be ones such as the selection of the installation directory) can be ob-
obtained by running tained by running
./configure --help ./configure --help
The following sections include descriptions of options whose names The following sections include descriptions of options whose names be-
begin with --enable or --disable. These settings specify changes to the gin with --enable or --disable. These settings specify changes to the
defaults for the configure command. Because of the way that configure defaults for the configure command. Because of the way that configure
works, --enable and --disable always come in pairs, so the complemen- works, --enable and --disable always come in pairs, so the complemen-
tary option always exists as well, but as it specifies the default, it tary option always exists as well, but as it specifies the default, it
is not described. is not described.
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
By default, a library called libpcre is built, containing functions By default, a library called libpcre is built, containing functions
that take string arguments contained in vectors of bytes, either as that take string arguments contained in vectors of bytes, either as
single-byte characters, or interpreted as UTF-8 strings. You can also single-byte characters, or interpreted as UTF-8 strings. You can also
build a separate library, called libpcre16, in which strings are con- build a separate library, called libpcre16, in which strings are con-
tained in vectors of 16-bit data units and interpreted either as sin- tained in vectors of 16-bit data units and interpreted either as sin-
gle-unit characters or UTF-16 strings, by adding gle-unit characters or UTF-16 strings, by adding
--enable-pcre16 --enable-pcre16
to the configure command. You can also build yet another separate to the configure command. You can also build yet another separate li-
library, called libpcre32, in which strings are contained in vectors of brary, called libpcre32, in which strings are contained in vectors of
32-bit data units and interpreted either as single-unit characters or 32-bit data units and interpreted either as single-unit characters or
UTF-32 strings, by adding UTF-32 strings, by adding
--enable-pcre32 --enable-pcre32
to the configure command. If you do not want the 8-bit library, add to the configure command. If you do not want the 8-bit library, add
--disable-pcre8 --disable-pcre8
as well. At least one of the three libraries must be built. Note that as well. At least one of the three libraries must be built. Note that
skipping to change at line 902 skipping to change at line 902
to the configure command. to the configure command.
UTF-8, UTF-16 AND UTF-32 SUPPORT UTF-8, UTF-16 AND UTF-32 SUPPORT
To build PCRE with support for UTF Unicode character strings, add To build PCRE with support for UTF Unicode character strings, add
--enable-utf --enable-utf
to the configure command. This setting applies to all three libraries, to the configure command. This setting applies to all three libraries,
adding support for UTF-8 to the 8-bit library, support for UTF-16 to adding support for UTF-8 to the 8-bit library, support for UTF-16 to
the 16-bit library, and support for UTF-32 to the 32-bit the 16-bit library, and support for UTF-32 to the 32-bit li-
library. There are no separate options for enabling UTF-8, UTF-16 and brary. There are no separate options for enabling UTF-8, UTF-16 and
UTF-32 independently because that would allow ridiculous settings such UTF-32 independently because that would allow ridiculous settings such
as requesting UTF-16 support while building only the 8-bit library. It as requesting UTF-16 support while building only the 8-bit library. It
is not possible to build one library with UTF support and another with- is not possible to build one library with UTF support and another with-
out in the same configuration. (For backwards compatibility, --enable- out in the same configuration. (For backwards compatibility, --enable-
utf8 is a synonym of --enable-utf.) utf8 is a synonym of --enable-utf.)
Of itself, this setting does not make PCRE treat strings as UTF-8, Of itself, this setting does not make PCRE treat strings as UTF-8,
UTF-16 or UTF-32. As well as compiling PCRE with this option, you also UTF-16 or UTF-32. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as ap-
appropriate) when you call one of the pattern compiling functions. propriate) when you call one of the pattern compiling functions.
If you set --enable-utf when compiling in an EBCDIC environment, PCRE If you set --enable-utf when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the run- expects its input to be either ASCII or UTF-8 (depending on the run-
time option). It is not possible to support both EBCDIC and UTF-8 codes time option). It is not possible to support both EBCDIC and UTF-8 codes
in the same version of the library. Consequently, --enable-utf and in the same version of the library. Consequently, --enable-utf and
--enable-ebcdic are mutually exclusive. --enable-ebcdic are mutually exclusive.
UNICODE CHARACTER PROPERTY SUPPORT UNICODE CHARACTER PROPERTY SUPPORT
UTF support allows the libraries to process character codepoints up to UTF support allows the libraries to process character codepoints up to
skipping to change at line 945 skipping to change at line 945
PCRE library. Only the general category properties such as Lu and Nd PCRE library. Only the general category properties such as Lu and Nd
are supported. Details are given in the pcrepattern documentation. are supported. Details are given in the pcrepattern documentation.
JUST-IN-TIME COMPILER SUPPORT JUST-IN-TIME COMPILER SUPPORT
Just-in-time compiler support is included in the build by specifying Just-in-time compiler support is included in the build by specifying
--enable-jit --enable-jit
This support is available only for certain hardware architectures. If This support is available only for certain hardware architectures. If
this option is set for an unsupported architecture, a compile time this option is set for an unsupported architecture, a compile time er-
error occurs. See the pcrejit documentation for a discussion of JIT ror occurs. See the pcrejit documentation for a discussion of JIT us-
usage. When JIT support is enabled, pcregrep automatically makes use of age. When JIT support is enabled, pcregrep automatically makes use of
it, unless you add it, unless you add
--disable-pcregrep-jit --disable-pcregrep-jit
to the "configure" command. to the "configure" command.
CODE VALUE OF NEWLINE CODE VALUE OF NEWLINE
By default, PCRE interprets the linefeed (LF) character as indicating By default, PCRE interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like the end of a line. This is the normal newline character on Unix-like
systems. You can compile PCRE to use carriage return (CR) instead, by systems. You can compile PCRE to use carriage return (CR) instead, by
adding adding
--enable-newline-is-cr --enable-newline-is-cr
to the configure command. There is also a --enable-newline-is-lf to the configure command. There is also a --enable-newline-is-lf op-
option, which explicitly specifies linefeed as the newline character. tion, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add the two character sequence CRLF. If you want this, add
--enable-newline-is-crlf --enable-newline-is-crlf
to the configure command. There is a fourth option, specified by to the configure command. There is a fourth option, specified by
--enable-newline-is-anycrlf --enable-newline-is-anycrlf
skipping to change at line 1054 skipping to change at line 1054
If you want to build a version of PCRE that works this way, add If you want to build a version of PCRE that works this way, add
--disable-stack-for-recursion --disable-stack-for-recursion
to the configure command. With this configuration, PCRE will use the to the configure command. With this configuration, PCRE will use the
pcre_stack_malloc and pcre_stack_free variables to call memory manage- pcre_stack_malloc and pcre_stack_free variables to call memory manage-
ment functions. By default these point to malloc() and free(), but you ment functions. By default these point to malloc() and free(), but you
can replace the pointers so that your own functions are used instead. can replace the pointers so that your own functions are used instead.
Separate functions are provided rather than using pcre_malloc and Separate functions are provided rather than using pcre_malloc and
pcre_free because the usage is very predictable: the block sizes pcre_free because the usage is very predictable: the block sizes re-
requested are always the same, and the blocks are always freed in quested are always the same, and the blocks are always freed in reverse
reverse order. A calling program might be able to implement optimized order. A calling program might be able to implement optimized functions
functions that perform better than malloc() and free(). PCRE runs that perform better than malloc() and free(). PCRE runs noticeably more
noticeably more slowly when built in this way. This option affects only slowly when built in this way. This option affects only the pcre_exec()
the pcre_exec() function; it is not relevant for pcre_dfa_exec(). function; it is not relevant for pcre_dfa_exec().
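The paragraph above says that blocks requested through pcre_stack_malloc are always the same size. A minimal sketch of replacement functions that exploit this by caching freed blocks for reuse might look as follows. The names lifo_stack_malloc, lifo_stack_free and POOL_SLOTS are hypothetical; only the two function signatures come from PCRE. The sketch is not thread-safe and has not been measured against malloc() and free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; only the two function signatures are
   dictated by the pcre_stack_malloc and pcre_stack_free variables. */
#define POOL_SLOTS 64               /* assumed cap on cached blocks */

static void  *pool[POOL_SLOTS];     /* freed blocks kept for reuse */
static size_t pool_count = 0;       /* number of blocks currently cached */
static size_t pool_block_size = 0;  /* size fixed by the first request */

/* Return a cached block if one exists, otherwise fall back to malloc(). */
void *lifo_stack_malloc(size_t size)
{
    if (pool_block_size == 0)
        pool_block_size = size;
    /* Relies on the documented behaviour that every request has the
       same size; the cache would be unsafe if this did not hold. */
    assert(size == pool_block_size);
    if (pool_count > 0)
        return pool[--pool_count];  /* most recently freed block first */
    return malloc(size);
}

/* Keep the block for reuse while there is room, otherwise release it. */
void lifo_stack_free(void *block)
{
    if (block == NULL)
        return;
    if (pool_count < POOL_SLOTS)
        pool[pool_count++] = block;
    else
        free(block);
}
```

In a program built against a PCRE library configured with --disable-stack-for-recursion, these would be installed before any matching call with pcre_stack_malloc = lifo_stack_malloc; and pcre_stack_free = lifo_stack_free;.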
LIMITING PCRE RESOURCE USAGE LIMITING PCRE RESOURCE USAGE
Internally, PCRE has a function called match(), which it calls repeat- Internally, PCRE has a function called match(), which it calls repeat-
edly (sometimes recursively) when matching a pattern with the edly (sometimes recursively) when matching a pattern with the
pcre_exec() function. By controlling the maximum number of times this pcre_exec() function. By controlling the maximum number of times this
function may be called during a single matching operation, a limit can function may be called during a single matching operation, a limit can
be placed on the resources used by a single call to pcre_exec(). The be placed on the resources used by a single call to pcre_exec(). The
limit can be changed at run time, as described in the pcreapi documen- limit can be changed at run time, as described in the pcreapi documen-
tation. The default is 10 million, but this can be changed by adding a tation. The default is 10 million, but this can be changed by adding a
skipping to change at line 1081 skipping to change at line 1081
--with-match-limit=500000 --with-match-limit=500000
to the configure command. This setting has no effect on the to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function. pcre_dfa_exec() matching function.
In some environments it is desirable to limit the depth of recursive In some environments it is desirable to limit the depth of recursive
calls of match() more strictly than the total number of calls, in order calls of match() more strictly than the total number of calls, in order
to restrict the maximum amount of stack (or heap, if --disable-stack- to restrict the maximum amount of stack (or heap, if --disable-stack-
for-recursion is specified) that is used. A second limit controls this; for-recursion is specified) that is used. A second limit controls this;
it defaults to the value that is set for --with-match-limit, which it defaults to the value that is set for --with-match-limit, which im-
imposes no additional constraints. However, you can set a lower limit poses no additional constraints. However, you can set a lower limit by
by adding, for example, adding, for example,
--with-match-limit-recursion=10000 --with-match-limit-recursion=10000
to the configure command. This value can also be overridden at run to the configure command. This value can also be overridden at run
time. time.
CREATING CHARACTER TABLES AT BUILD TIME CREATING CHARACTER TABLES AT BUILD TIME
PCRE uses fixed tables for processing characters whose code values are PCRE uses fixed tables for processing characters whose code values are
less than 256. By default, PCRE is built with a set of tables that are less than 256. By default, PCRE is built with a set of tables that are
skipping to change at line 1117 skipping to change at line 1117
USING EBCDIC CODE USING EBCDIC CODE
PCRE assumes by default that it will run in an environment where the PCRE assumes by default that it will run in an environment where the
character code is ASCII (or Unicode, which is a superset of ASCII). character code is ASCII (or Unicode, which is a superset of ASCII).
This is the case for most computer operating systems. PCRE can, how- This is the case for most computer operating systems. PCRE can, how-
ever, be compiled to run in an EBCDIC environment by adding ever, be compiled to run in an EBCDIC environment by adding
--enable-ebcdic --enable-ebcdic
to the configure command. This setting implies --enable-rebuild-charta- to the configure command. This setting implies --enable-rebuild-charta-
bles. You should only use it if you know that you are in an EBCDIC bles. You should only use it if you know that you are in an EBCDIC en-
environment (for example, an IBM mainframe operating system). The vironment (for example, an IBM mainframe operating system). The --en-
--enable-ebcdic option is incompatible with --enable-utf. able-ebcdic option is incompatible with --enable-utf.
The EBCDIC character that corresponds to an ASCII LF is assumed to have The EBCDIC character that corresponds to an ASCII LF is assumed to have
the value 0x15 by default. However, in some EBCDIC environments, 0x25 the value 0x15 by default. However, in some EBCDIC environments, 0x25
is used. In such an environment you should use is used. In such an environment you should use
--enable-ebcdic-nl25 --enable-ebcdic-nl25
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
0x25 is not chosen as LF is made to correspond to the Unicode NEL char- 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
skipping to change at line 1170 skipping to change at line 1170
to the configure command. The caller of pcregrep can, however, override to the configure command. The caller of pcregrep can, however, override
this value by specifying a run-time option. this value by specifying a run-time option.
PCRETEST OPTION FOR LIBREADLINE SUPPORT PCRETEST OPTION FOR LIBREADLINE SUPPORT
If you add If you add
--enable-pcretest-libreadline --enable-pcretest-libreadline
to the configure command, pcretest is linked with the libreadline to the configure command, pcretest is linked with the libreadline li-
library, and when its input is from a terminal, it reads it using the brary, and when its input is from a terminal, it reads it using the
readline() function. This provides line-editing and history facilities. readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licensed, so if you distribute a binary of Note that libreadline is GPL-licensed, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues. pcretest linked in this way, there may be licensing issues.
Setting this option causes the -lreadline option to be added to the Setting this option causes the -lreadline option to be added to the
pcretest build. In many operating environments with a system-installed pcretest build. In many operating environments with a system-installed
libreadline this is sufficient. However, in some environments (e.g. if libreadline this is sufficient. However, in some environments (e.g. if
an unmodified distribution version of readline is in use), some extra an unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for libreadline says configuration may be necessary. The INSTALL file for libreadline says
this: this:
skipping to change at line 1201 skipping to change at line 1201
immediately before the configure command. immediately before the configure command.
DEBUGGING WITH VALGRIND SUPPORT DEBUGGING WITH VALGRIND SUPPORT
By adding the By adding the
--enable-valgrind --enable-valgrind
option to the configure command, PCRE will use valgrind annotations option to the configure command, PCRE will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to to mark certain memory regions as unaddressable. This allows it to de-
detect invalid memory accesses, and is mostly useful for debugging PCRE tect invalid memory accesses, and is mostly useful for debugging PCRE
itself. itself.
CODE COVERAGE REPORTING CODE COVERAGE REPORTING
If your C compiler is gcc, you can build a version of PCRE that can If your C compiler is gcc, you can build a version of PCRE that can
generate a code coverage report for its test suite. To enable this, you generate a code coverage report for its test suite. To enable this, you
must install lcov version 1.6 or above. Then specify must install lcov version 1.6 or above. Then specify
--enable-coverage --enable-coverage
skipping to change at line 1288 skipping to change at line 1288
NAME NAME
PCRE - Perl-compatible regular expressions PCRE - Perl-compatible regular expressions
PCRE MATCHING ALGORITHMS PCRE MATCHING ALGORITHMS
This document describes the two different algorithms that are available This document describes the two different algorithms that are available
in PCRE for matching a compiled regular expression against a given sub- in PCRE for matching a compiled regular expression against a given sub-
ject string. The "standard" algorithm is the one provided by the ject string. The "standard" algorithm is the one provided by the
pcre_exec(), pcre16_exec() and pcre32_exec() functions. These work in pcre_exec(), pcre16_exec() and pcre32_exec() functions. These work in
the same way as Perl's matching function, and provide a Perl-compatible the same way as Perl's matching function, and provide a Perl-compatible
matching operation. The just-in-time (JIT) optimization that is matching operation. The just-in-time (JIT) optimization that is de-
described in the pcrejit documentation is compatible with these func- scribed in the pcrejit documentation is compatible with these func-
tions. tions.
An alternative algorithm is provided by the pcre_dfa_exec(), An alternative algorithm is provided by the pcre_dfa_exec(),
pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a
different way, and are not Perl-compatible. This alternative has advan- different way, and are not Perl-compatible. This alternative has advan-
tages and disadvantages compared with the standard algorithm, and these tages and disadvantages compared with the standard algorithm, and these
are described below. are described below.
When there is only one possible way in which a given subject string can When there is only one possible way in which a given subject string can
match a pattern, the two algorithms give the same answer. A difference match a pattern, the two algorithms give the same answer. A difference
skipping to change at line 1361 skipping to change at line 1361
from the first matching point in the subject, it scans the subject from the first matching point in the subject, it scans the subject
string from left to right, once, character by character, and as it does string from left to right, once, character by character, and as it does
this, it remembers all the paths through the tree that represent valid this, it remembers all the paths through the tree that represent valid
matches. In Friedl's terminology, this is a kind of "DFA algorithm", matches. In Friedl's terminology, this is a kind of "DFA algorithm",
though it is not implemented as a traditional finite state machine (it though it is not implemented as a traditional finite state machine (it
keeps multiple states active simultaneously). keeps multiple states active simultaneously).
Although the general principle of this matching algorithm is that it Although the general principle of this matching algorithm is that it
scans the subject string only once, without backtracking, there is one scans the subject string only once, without backtracking, there is one
exception: when a lookaround assertion is encountered, the characters exception: when a lookaround assertion is encountered, the characters
following or preceding the current point have to be independently following or preceding the current point have to be independently in-
inspected. spected.
The scan continues until either the end of the subject is reached, or The scan continues until either the end of the subject is reached, or
there are no more unterminated paths. At this point, terminated paths there are no more unterminated paths. At this point, terminated paths
represent the different matching possibilities (if there are none, the represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match, match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the long- this algorithm finds all of them, and in particular, it finds the long-
est. The matches are returned in decreasing order of length. There is est. The matches are returned in decreasing order of length. There is
an option to stop the algorithm after the first match (which is neces- an option to stop the algorithm after the first match (which is neces-
sarily the shortest) is found. sarily the shortest) is found.
skipping to change at line 1395 skipping to change at line 1395
ple, the pattern "a\d+" is compiled as if it were "a\d++" because there ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
is no point even considering the possibility of backtracking into the is no point even considering the possibility of backtracking into the
repeated digits. For DFA matching, this means that only one possible repeated digits. For DFA matching, this means that only one possible
match is found. If you really do want multiple matches in such cases, match is found. If you really do want multiple matches in such cases,
either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS
option when compiling. option when compiling.
There are a number of features of PCRE regular expressions that are not There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows: supported by the alternative matching algorithm. They are as follows:
1. Because the algorithm finds all possible matches, the greedy or 1. Because the algorithm finds all possible matches, the greedy or un-
ungreedy nature of repetition quantifiers is not relevant. Greedy and greedy nature of repetition quantifiers is not relevant. Greedy and un-
ungreedy quantifiers are treated in exactly the same way. However, pos- greedy quantifiers are treated in exactly the same way. However, pos-
sessive quantifiers can make a difference when what follows could also sessive quantifiers can make a difference when what follows could also
match what is quantified, for example in a pattern like this: match what is quantified, for example in a pattern like this:
^a++\w! ^a++\w!
This pattern matches "aaab!" but not "aaa!", which would be matched by This pattern matches "aaab!" but not "aaa!", which would be matched by
a non-possessive quantifier. Similarly, if an atomic group is present, a non-possessive quantifier. Similarly, if an atomic group is present,
it is matched as if it were a standalone pattern at the current point, it is matched as if it were a standalone pattern at the current point,
and the longest match is then "locked in" for the rest of the overall and the longest match is then "locked in" for the rest of the overall
pattern. pattern.
2. When dealing with multiple paths through the tree simultaneously, it 2. When dealing with multiple paths through the tree simultaneously, it
is not straightforward to keep track of captured substrings for the is not straightforward to keep track of captured substrings for the
different matching possibilities, and PCRE's implementation of this different matching possibilities, and PCRE's implementation of this al-
algorithm does not attempt to do this. This means that no captured sub- gorithm does not attempt to do this. This means that no captured sub-
strings are available. strings are available.
3. Because no substrings are captured, back references within the pat- 3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered. tern are not supported, and cause errors if encountered.
4. For the same reason, conditional expressions that use a backrefer- 4. For the same reason, conditional expressions that use a backrefer-
ence as the condition or test for a specific group recursion are not ence as the condition or test for a specific group recursion are not
supported. supported.
5. Because many paths through the tree may be active, the \K escape 5. Because many paths through the tree may be active, the \K escape se-
sequence, which resets the start of the match when encountered (but may quence, which resets the start of the match when encountered (but may
be on some paths and not on others), is not supported. It causes an be on some paths and not on others), is not supported. It causes an er-
error if encountered. ror if encountered.
6. Callouts are supported, but the value of the capture_top field is 6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1. always 1, and the value of the capture_last field is always -1.
7. The \C escape sequence, which (in the standard algorithm) always 7. The \C escape sequence, which (in the standard algorithm) always
matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
not supported in these modes, because the alternative algorithm moves not supported in these modes, because the alternative algorithm moves
through the subject string one character (not data unit) at a time, for through the subject string one character (not data unit) at a time, for
all active paths through the tree. all active paths through the tree.
skipping to change at line 1619 skipping to change at line 1619
References to bytes and UTF-8 in this document should be read as refer- References to bytes and UTF-8 in this document should be read as refer-
ences to 16-bit data units and UTF-16 when using the 16-bit library, or ences to 16-bit data units and UTF-16 when using the 16-bit library, or
32-bit data units and UTF-32 when using the 32-bit library, unless 32-bit data units and UTF-32 when using the 32-bit library, unless
specified otherwise. More details of the specific differences for the specified otherwise. More details of the specific differences for the
16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages. 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
PCRE API OVERVIEW PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There PCRE has its own native API, which is described in this document. There
are also some wrapper functions (for the 8-bit library only) that cor- are also some wrapper functions (for the 8-bit library only) that cor-
respond to the POSIX regular expression API, but they do not give respond to the POSIX regular expression API, but they do not give ac-
access to all the functionality. They are described in the pcreposix cess to all the functionality. They are described in the pcreposix doc-
documentation. Both of these APIs define a set of C function calls. A umentation. Both of these APIs define a set of C function calls. A C++
C++ wrapper (again for the 8-bit library only) is also distributed with wrapper (again for the 8-bit library only) is also distributed with
PCRE. It is documented in the pcrecpp page. PCRE. It is documented in the pcrecpp page.
The native API C function prototypes are defined in the header file The native API C function prototypes are defined in the header file
pcre.h, and on Unix-like systems the (8-bit) library itself is called pcre.h, and on Unix-like systems the (8-bit) library itself is called
libpcre. It can normally be accessed by adding -lpcre to the command libpcre. It can normally be accessed by adding -lpcre to the command
for linking an application that uses PCRE. The header file defines the for linking an application that uses PCRE. The header file defines the
macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include support numbers for the library. Applications can use these to include support
for different releases of PCRE. for different releases of PCRE.
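As an illustration of how PCRE_MAJOR and PCRE_MINOR can gate release-specific code, the sketch below defines a hypothetical helper macro, PCRE_AT_LEAST, and uses it to test for release 8.32, which this document names as the first release with a direct JIT interface. In a real program the two macros come from pcre.h; the stand-in values here are an assumption so that the fragment compiles on its own.

```c
/* In a real program these come from #include <pcre.h>; the stand-in
   values below are assumptions so the sketch compiles without it. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 8
#define PCRE_MINOR 44
#endif

/* Hypothetical helper: true when the headers are release maj.min or later. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Returns 1 if the headers are recent enough for the direct JIT
   interface (release 8.32 onwards), 0 otherwise. */
int have_direct_jit_interface(void)
{
#if PCRE_AT_LEAST(8, 32)
    return 1;
#else
    return 0;
#endif
}
```

Because the check runs in the preprocessor, code that depends on newer functions can be excluded entirely when building against an older release.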
In a Windows environment, if you want to statically link an application In a Windows environment, if you want to statically link an application
program against a non-dll pcre.a file, you must define PCRE_STATIC program against a non-dll pcre.a file, you must define PCRE_STATIC be-
before including pcre.h or pcrecpp.h, because otherwise the pcre_mal- fore including pcre.h or pcrecpp.h, because otherwise the pcre_malloc()
loc() and pcre_free() exported functions will be declared and pcre_free() exported functions will be declared __declspec(dl-
__declspec(dllimport), with unwanted results. limport), with unwanted results.
The functions pcre_compile(), pcre_compile2(), pcre_study(), and The functions pcre_compile(), pcre_compile2(), pcre_study(), and
pcre_exec() are used for compiling and matching regular expressions in pcre_exec() are used for compiling and matching regular expressions in
a Perl-compatible manner. A sample program that demonstrates the sim- a Perl-compatible manner. A sample program that demonstrates the sim-
plest way of using them is provided in the file called pcredemo.c in plest way of using them is provided in the file called pcredemo.c in
the PCRE source distribution. A listing of this program is given in the the PCRE source distribution. A listing of this program is given in the
pcredemo documentation, and the pcresample documentation describes how pcredemo documentation, and the pcresample documentation describes how
to compile and run it. to compile and run it.
Just-in-time compiler support is an optional feature of PCRE that can Just-in-time compiler support is an optional feature of PCRE that can
be built in appropriate hardware environments. It greatly speeds up the be built in appropriate hardware environments. It greatly speeds up the
matching performance of many patterns. Simple programs can easily matching performance of many patterns. Simple programs can easily re-
request that it be used if available, by setting an option that is quest that it be used if available, by setting an option that is ig-
ignored when it is not relevant. More complicated programs might need nored when it is not relevant. More complicated programs might need to
to make use of the functions pcre_jit_stack_alloc(), make use of the functions pcre_jit_stack_alloc(),
pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
the JIT code's memory usage. the JIT code's memory usage.
From release 8.32 there is also a direct interface for JIT execution, From release 8.32 there is also a direct interface for JIT execution,
which gives improved performance. The JIT-specific functions are dis- which gives improved performance. The JIT-specific functions are dis-
cussed in the pcrejit documentation. cussed in the pcrejit documentation.
A second matching function, pcre_dfa_exec(), which is not Perl-compati- A second matching function, pcre_dfa_exec(), which is not Perl-compati-
ble, is also provided. This uses a different algorithm for the match- ble, is also provided. This uses a different algorithm for the match-
ing. The alternative algorithm finds all possible matches (at a given ing. The alternative algorithm finds all possible matches (at a given
skipping to change at line 1684 skipping to change at line 1684
pcre_copy_named_substring() pcre_copy_named_substring()
pcre_get_substring() pcre_get_substring()
pcre_get_named_substring() pcre_get_named_substring()
pcre_get_substring_list() pcre_get_substring_list()
pcre_get_stringnumber() pcre_get_stringnumber()
pcre_get_stringtable_entries() pcre_get_stringtable_entries()
pcre_free_substring() and pcre_free_substring_list() are also provided, pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings. to free the memory used for extracted strings.
The function pcre_maketables() is used to build a set of character The function pcre_maketables() is used to build a set of character ta-
tables in the current locale for passing to pcre_compile(), bles in the current locale for passing to pcre_compile(), pcre_exec(),
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is or pcre_dfa_exec(). This is an optional facility that is provided for
provided for specialist use. Most commonly, no special tables are specialist use. Most commonly, no special tables are passed, in which
passed, in which case internal tables that are generated when PCRE is case internal tables that are generated when PCRE is built are used.
built are used.
The function pcre_fullinfo() is used to find out information about a The function pcre_fullinfo() is used to find out information about a
compiled pattern. The function pcre_version() returns a pointer to a compiled pattern. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release. string containing the version of PCRE and its date of release.
The function pcre_refcount() maintains a reference count in a data The function pcre_refcount() maintains a reference count in a data
block containing a compiled pattern. This is provided for the benefit block containing a compiled pattern. This is provided for the benefit
of object-oriented applications. of object-oriented applications.
The global variables pcre_malloc and pcre_free initially contain the The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions, respec- entry points of the standard malloc() and free() functions, respec-
tively. PCRE calls the memory management functions via these variables, tively. PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions. calls. This should be done before calling any PCRE functions.
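One way a calling program might intercept these calls is with thin wrappers that count allocations. The following is only a sketch: counting_malloc, counting_free and outstanding_blocks are hypothetical names, and the wrappers simply delegate to malloc() and free().

```c
#include <stddef.h>
#include <stdlib.h>

static size_t alloc_calls = 0;  /* successful allocations seen so far */
static size_t free_calls  = 0;  /* non-NULL blocks released so far */

/* Same signature as the function pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        alloc_calls++;
    return block;
}

/* Same signature as the function pcre_free points to. */
void counting_free(void *block)
{
    if (block != NULL)
        free_calls++;
    free(block);
}

/* Blocks handed out through the wrappers and not yet returned. */
size_t outstanding_blocks(void)
{
    return alloc_calls - free_calls;
}
```

With pcre.h included, the wrappers would be installed by assigning pcre_malloc = counting_malloc; and pcre_free = counting_free; before the first call to any PCRE function.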
The global variables pcre_stack_malloc and pcre_stack_free are also The global variables pcre_stack_malloc and pcre_stack_free are also in-
indirections to memory management functions. These special functions directions to memory management functions. These special functions are
are used only when PCRE is compiled to use the heap for remembering used only when PCRE is compiled to use the heap for remembering data,
data, instead of recursive function calls, when running the pcre_exec() instead of recursive function calls, when running the pcre_exec() func-
function. See the pcrebuild documentation for details of how to do tion. See the pcrebuild documentation for details of how to do this. It
this. It is a non-standard way of building PCRE, for use in environ- is a non-standard way of building PCRE, for use in environments that
ments that have limited stacks. Because of the greater use of memory have limited stacks. Because of the greater use of memory management,
management, it runs more slowly. Separate functions are provided so it runs more slowly. Separate functions are provided so that special-
that special-purpose external code can be used for this case. When purpose external code can be used for this case. When used, these func-
used, these functions always allocate memory blocks of the same size. tions always allocate memory blocks of the same size. There is a dis-
There is a discussion about PCRE's stack usage in the pcrestack docu- cussion about PCRE's stack usage in the pcrestack documentation.
mentation.
The global variable pcre_callout initially contains NULL. It can be set The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the specified points during a matching operation. Details are given in the
pcrecallout documentation. pcrecallout documentation.
The global variable pcre_stack_guard initially contains NULL. It can be The global variable pcre_stack_guard initially contains NULL. It can be
set by the caller to a function that is called by PCRE whenever it set by the caller to a function that is called by PCRE whenever it
starts to compile a parenthesized part of a pattern. When parentheses starts to compile a parenthesized part of a pattern. When parentheses
are nested, PCRE uses recursive function calls, which use up the system are nested, PCRE uses recursive function calls, which use up the system
skipping to change at line 1811 skipping to change at line 1809
into which the information is placed. The returned value is zero on into which the information is placed. The returned value is zero on
success, or the negative error code PCRE_ERROR_BADOPTION if the value success, or the negative error code PCRE_ERROR_BADOPTION if the value
in the first argument is not recognized. The following information is in the first argument is not recognized. The following information is
available: available:
PCRE_CONFIG_UTF8 PCRE_CONFIG_UTF8
The output is an integer that is set to one if UTF-8 support is avail- The output is an integer that is set to one if UTF-8 support is avail-
able; otherwise it is set to zero. This value should normally be given able; otherwise it is set to zero. This value should normally be given
to the 8-bit version of this function, pcre_config(). If it is given to to the 8-bit version of this function, pcre_config(). If it is given to
the 16-bit or 32-bit version of this function, the result is the 16-bit or 32-bit version of this function, the result is PCRE_ER-
PCRE_ERROR_BADOPTION. ROR_BADOPTION.
PCRE_CONFIG_UTF16 PCRE_CONFIG_UTF16
The output is an integer that is set to one if UTF-16 support is avail- The output is an integer that is set to one if UTF-16 support is avail-
able; otherwise it is set to zero. This value should normally be given able; otherwise it is set to zero. This value should normally be given
to the 16-bit version of this function, pcre16_config(). If it is given to the 16-bit version of this function, pcre16_config(). If it is given
to the 8-bit or 32-bit version of this function, the result is to the 8-bit or 32-bit version of this function, the result is PCRE_ER-
PCRE_ERROR_BADOPTION. ROR_BADOPTION.
PCRE_CONFIG_UTF32 PCRE_CONFIG_UTF32
The output is an integer that is set to one if UTF-32 support is avail- The output is an integer that is set to one if UTF-32 support is avail-
able; otherwise it is set to zero. This value should normally be given able; otherwise it is set to zero. This value should normally be given
to the 32-bit version of this function, pcre32_config(). If it is given to the 32-bit version of this function, pcre32_config(). If it is given
to the 8-bit or 16-bit version of this function, the result is to the 8-bit or 16-bit version of this function, the result is PCRE_ER-
PCRE_ERROR_BADOPTION. ROR_BADOPTION.
PCRE_CONFIG_UNICODE_PROPERTIES PCRE_CONFIG_UNICODE_PROPERTIES
The output is an integer that is set to one if support for Unicode The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero. character properties is available; otherwise it is set to zero.
PCRE_CONFIG_JIT PCRE_CONFIG_JIT
The output is an integer that is set to one if support for just-in-time The output is an integer that is set to one if support for just-in-time
compiling is available; otherwise it is set to zero. compiling is available; otherwise it is set to zero.
PCRE_CONFIG_JITTARGET PCRE_CONFIG_JITTARGET
The output is a pointer to a zero-terminated "const char *" string. If The output is a pointer to a zero-terminated "const char *" string. If
JIT support is available, the string contains the name of the architec- JIT support is available, the string contains the name of the architec-
ture for which the JIT compiler is configured, for example "x86 32bit ture for which the JIT compiler is configured, for example "x86 32bit
(little endian + unaligned)". If JIT support is not available, the (little endian + unaligned)". If JIT support is not available, the re-
result is NULL. sult is NULL.
PCRE_CONFIG_NEWLINE PCRE_CONFIG_NEWLINE
The output is an integer whose value specifies the default character The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The values that are sequence that is recognized as meaning "newline". The values that are
supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338 supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR, for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
ANYCRLF, and ANY yield the same values. However, the value for LF is ANYCRLF, and ANY yield the same values. However, the value for LF is
normally 21, though some EBCDIC environments use 37. The corresponding normally 21, though some EBCDIC environments use 37. The corresponding
values for CRLF are 3349 and 3365. The default should normally corre- values for CRLF are 3349 and 3365. The default should normally corre-
skipping to change at line 1869 skipping to change at line 1867
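The two-character values quoted for PCRE_CONFIG_NEWLINE are consistent
with the first character code being held in the high byte and the
second in the low byte: 3338 is 13*256 + 10, 3349 is 13*256 + 21, and
3365 is 13*256 + 37. The sketch below decodes a value on that
assumption; split_newline() is a hypothetical helper, not a PCRE
function.

```c
/* Split a PCRE_CONFIG_NEWLINE value into character codes.
   Returns 2 for a two-character sequence (codes in *first, *second),
   1 for a single character (code in *first), and 0 for the special
   values -1 (ANY) and -2 (ANYCRLF), which name no fixed sequence. */
static int split_newline(int value, int *first, int *second)
{
    if (value < 0)
        return 0;
    if (value > 255) {
        *first = value >> 8;      /* high byte: first character  */
        *second = value & 0xff;   /* low byte: second character  */
        return 2;
    }
    *first = value;
    return 1;
}
```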
PCRE_CONFIG_BSR PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R the \R escape sequence matches by default. A value of 0 means that \R
matches any Unicode line ending sequence; a value of 1 means that \R matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat- matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched. tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE PCRE_CONFIG_LINK_SIZE
The output is an integer that contains the number of bytes used for The output is an integer that contains the number of bytes used for in-
internal linkage in compiled regular expressions. For the 8-bit ternal linkage in compiled regular expressions. For the 8-bit library,
library, the value can be 2, 3, or 4. For the 16-bit library, the value the value can be 2, 3, or 4. For the 16-bit library, the value is ei-
is either 2 or 4 and is still a number of bytes. For the 32-bit ther 2 or 4 and is still a number of bytes. For the 32-bit library, the
library, the value is either 2 or 4 and is still a number of bytes. The value is either 2 or 4 and is still a number of bytes. The default
default value of 2 is sufficient for all but the most massive patterns, value of 2 is sufficient for all but the most massive patterns, since
since it allows the compiled pattern to be up to 64K in size. Larger it allows the compiled pattern to be up to 64K in size. Larger values
values allow larger regular expressions to be compiled, at the expense allow larger regular expressions to be compiled, at the expense of
of slower matching. slower matching.
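The 64K figure follows from the arithmetic: a link held in n bytes can
address 2^(8n) distinct offsets, so 2 bytes give 65536. The sketch
below only works through that calculation; link_capacity() is a
hypothetical helper, and the usable pattern size in a real build may be
somewhat lower than this upper bound.

```c
/* Number of distinct offsets that fit in link_size bytes,
   that is 2^(8 * link_size). Returns 0 for sizes other than
   the documented 2, 3, and 4. */
static unsigned long long link_capacity(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}
```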
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
The output is an integer that contains the threshold above which the The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation. given in the pcreposix documentation.
PCRE_CONFIG_PARENS_LIMIT PCRE_CONFIG_PARENS_LIMIT
The output is a long integer that gives the maximum depth of nesting of The output is a long integer that gives the maximum depth of nesting of
skipping to change at line 1905 skipping to change at line 1903
PCRE_CONFIG_MATCH_LIMIT PCRE_CONFIG_MATCH_LIMIT
The output is a long integer that gives the default limit for the num- The output is a long integer that gives the default limit for the num-
ber of internal matching function calls in a pcre_exec() execution. ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below. Further details are given with pcre_exec() below.
PCRE_CONFIG_MATCH_LIMIT_RECURSION PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth The output is a long integer that gives the default limit for the depth
of recursion when calling the internal matching function in a of recursion when calling the internal matching function in a
pcre_exec() execution. Further details are given with pcre_exec() pcre_exec() execution. Further details are given with pcre_exec() be-
below. low.
PCRE_CONFIG_STACKRECURSE PCRE_CONFIG_STACKRECURSE
The output is an integer that is set to one if internal recursion when The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use running pcre_exec() is implemented by recursive function calls that use
the stack to remember their state. This is the usual way that PCRE is the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data compiled. The output is zero if PCRE was compiled to use blocks of data
on the heap instead of recursive function calls. In this case, on the heap instead of recursive function calls. In this case,
pcre_stack_malloc and pcre_stack_free are called to manage memory pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack. blocks on the heap, thus avoiding the use of the stack.
skipping to change at line 1937 skipping to change at line 1935
const unsigned char *tableptr); const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be called Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument, the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned. To errorcodeptr, via which a numerical error code can be returned. To
avoid too much repetition, we refer just to pcre_compile() below, but avoid too much repetition, we refer just to pcre_compile() below, but
the information applies equally to pcre_compile2(). the information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is the pattern argument. A pointer to a single block of memory that is ob-
obtained via pcre_malloc is returned. This contains the compiled code tained via pcre_malloc is returned. This contains the compiled code and
and related data. The pcre type is defined for the returned block; this related data. The pcre type is defined for the returned block; this is
is a typedef for a structure whose contents are not externally defined. a typedef for a structure whose contents are not externally defined. It
It is up to the caller to free the memory (via pcre_free) when it is no is up to the caller to free the memory (via pcre_free) when it is no
longer required. longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not does not depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr argu- fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below). ment, which is an address (see below).
The options argument contains various bit settings that affect the com- The options argument contains various bit settings that affect the com-
pilation. It should be zero if no options are required. The available pilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that options are described below. Some of them (in particular, those that
skipping to change at line 1984 skipping to change at line 1982
Note that the offset is in data units, not characters, even in a UTF Note that the offset is in data units, not characters, even in a UTF
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
acter. acter.
If pcre_compile2() is used instead of pcre_compile(), and the error- If pcre_compile2() is used instead of pcre_compile(), and the error-
codeptr argument is not NULL, a non-zero error code number is returned codeptr argument is not NULL, a non-zero error code number is returned
via this argument in the event of an error. This is in addition to the via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below. textual error message. Error codes and messages are listed below.
If the final argument, tableptr, is NULL, PCRE uses a default set of If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the character tables that are built when PCRE is compiled, using the de-
default C locale. Otherwise, tableptr must be an address that is the fault C locale. Otherwise, tableptr must be an address that is the re-
result of a call to pcre_maketables(). This value is stored with the sult of a call to pcre_maketables(). This value is stored with the com-
compiled pattern, and used again by pcre_exec() and pcre_dfa_exec() piled pattern, and used again by pcre_exec() and pcre_dfa_exec() when
when the pattern is matched. For more discussion, see the section on the pattern is matched. For more discussion, see the section on locale
locale support below. support below.
This code fragment shows a typical straightforward call to pcre_com- This code fragment shows a typical straightforward call to pcre_com-
pile(): pile():
pcre *re; pcre *re;
const char *error; const char *error;
int erroffset; int erroffset;
re = pcre_compile( re = pcre_compile(
"^A.*Z", /* the pattern */ "^A.*Z", /* the pattern */
0, /* default options */ 0, /* default options */
skipping to change at line 2073 skipping to change at line 2071
PCRE_DUPNAMES PCRE_DUPNAMES
If this bit is set, names used to identify capturing subpatterns need If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it not be unique. This can be helpful for certain types of pattern when it
is known that only one instance of the named subpattern can ever be is known that only one instance of the named subpattern can ever be
matched. There are more details of named subpatterns below; see also matched. There are more details of named subpatterns below; see also
the pcrepattern documentation. the pcrepattern documentation.
PCRE_EXTENDED PCRE_EXTENDED
If this bit is set, most white space characters in the pattern are If this bit is set, most white space characters in the pattern are to-
totally ignored except when escaped or inside a character class. How- tally ignored except when escaped or inside a character class. However,
ever, white space is not allowed within sequences such as (?> that white space is not allowed within sequences such as (?> that introduce
introduce various parenthesized subpatterns, nor within a numerical various parenthesized subpatterns, nor within a numerical quantifier
quantifier such as {1,3}. However, ignorable white space is permitted such as {1,3}. However, ignorable white space is permitted between an
between an item and a following quantifier and between a quantifier and item and a following quantifier and between a quantifier and a follow-
a following + that indicates possessiveness. ing + that indicates possessiveness.
White space did not used to include the VT character (code 11), because White space did not used to include the VT character (code 11), because
Perl did not treat this character as white space. However, Perl changed Perl did not treat this character as white space. However, Perl changed
at release 5.18, so PCRE followed at release 8.34, and VT is now at release 5.18, so PCRE followed at release 8.34, and VT is now
treated as white space. treated as white space.
PCRE_EXTENDED also causes characters between an unescaped # outside a PCRE_EXTENDED also causes characters between an unescaped # outside a
character class and the next newline, inclusive, to be ignored. character class and the next newline, inclusive, to be ignored.
PCRE_EXTENDED is equivalent to Perl's /x option, and it can be changed PCRE_EXTENDED is equivalent to Perl's /x option, and it can be changed
within a pattern by a (?x) option setting. within a pattern by a (?x) option setting.
Which characters are interpreted as newlines is controlled by the Which characters are interpreted as newlines is controlled by the op-
options passed to pcre_compile() or by a special sequence at the start tions passed to pcre_compile() or by a special sequence at the start of
of the pattern, as described in the section entitled "Newline conven- the pattern, as described in the section entitled "Newline conventions"
tions" in the pcrepattern documentation. Note that the end of this type in the pcrepattern documentation. Note that the end of this type of
of comment is a literal newline sequence in the pattern; escape comment is a literal newline sequence in the pattern; escape sequences
sequences that happen to represent a newline do not count. that happen to represent a newline do not count.
This option makes it possible to include comments inside complicated This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters. patterns. Note, however, that this applies only to data characters.
White space characters may never appear within special character White space characters may never appear within special character se-
sequences in a pattern, for example within the sequence (?( that intro- quences in a pattern, for example within the sequence (?( that intro-
duces a conditional subpattern. duces a conditional subpattern.
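To make the rule for data characters concrete, the sketch below copies
a pattern while dropping what PCRE_EXTENDED would ignore: unescaped
white space and "#" comments outside character classes. It is a
simplified illustration, not PCRE's own code. It assumes LF newlines
and does not handle \Q...\E quoting, a "]" placed first in a class, or
POSIX classes such as [[:alpha:]]. The name strip_extended() is
hypothetical.

```c
#include <stddef.h>

/* Copy pattern into out without ignorable white space and comments.
   out must have room for strlen(pattern) + 1 bytes. */
static void strip_extended(const char *pattern, char *out)
{
    int in_class = 0;      /* non-zero while inside [...] */
    size_t i = 0, o = 0;

    while (pattern[i] != '\0') {
        char c = pattern[i];
        if (c == '\\' && pattern[i + 1] != '\0') {
            out[o++] = c;               /* keep the backslash ...      */
            out[o++] = pattern[i + 1];  /* ... and the escaped char    */
            i += 2;
        } else if (in_class) {
            if (c == ']')
                in_class = 0;
            out[o++] = c;               /* everything in a class stays */
            i++;
        } else if (c == '[') {
            in_class = 1;
            out[o++] = c;
            i++;
        } else if (c == '#') {
            /* Skip the comment up to and including the newline. */
            while (pattern[i] != '\0' && pattern[i] != '\n')
                i++;
            if (pattern[i] == '\n')
                i++;
        } else if (c == ' ' || c == '\t' || c == '\n' ||
                   c == '\r' || c == '\f' || c == '\v') {
            i++;                        /* ignorable white space       */
        } else {
            out[o++] = c;
            i++;
        }
    }
    out[o] = '\0';
}
```

For the input "a b # note" followed by a newline and "[ #] \ c", the
comment and the unescaped spaces are dropped, while the space and "#"
inside the class and the escaped space are kept.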
PCRE_EXTRA PCRE_EXTRA
This option was invented in order to turn on additional functionality This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a backslash followed by a letter with no special meaning is treated as a
literal. (Perl can, however, be persuaded to give an error for this, by literal. (Perl can, however, be persuaded to give an error for this, by
running it with the -w option.) There are at present no other features running it with the -w option.) There are at present no other features
controlled by this option. It can also be set by a (?X) option setting controlled by this option. It can also be set by a (?X) option setting
within a pattern. within a pattern.
PCRE_FIRSTLINE PCRE_FIRSTLINE
If this option is set, an unanchored pattern is required to match If this option is set, an unanchored pattern is required to match be-
before or at the first newline in the subject string, though the fore or at the first newline in the subject string, though the matched
matched text may continue over the newline. text may continue over the newline.
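Put another way, the constraint is on where a match starts, not where
it ends. The sketch below checks a candidate start offset against that
rule, assuming LF is the newline convention; starts_on_first_line() is
a hypothetical helper and does not call PCRE.

```c
#include <string.h>

/* Returns 1 if a match beginning at match_start begins before or at
   the first newline of subject, 0 otherwise. Assumes LF newlines. */
static int starts_on_first_line(const char *subject, size_t match_start)
{
    const char *newline = strchr(subject, '\n');
    if (newline == NULL)
        return 1;   /* no newline: the whole subject is the first line */
    return match_start <= (size_t)(newline - subject);
}
```

With the subject "abc" followed by a newline and "def", start offsets 0
to 3 are acceptable and offset 4 is not, even though a match starting
at offset 2 may extend past the newline.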
PCRE_JAVASCRIPT_COMPAT PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that If this option is set, PCRE's behaviour is changed in some ways so that
it is compatible with JavaScript rather than Perl. The changes are as it is compatible with JavaScript rather than Perl. The changes are as
follows: follows:
(1) A lone closing square bracket in a pattern causes a compile-time (1) A lone closing square bracket in a pattern causes a compile-time
error, because this is illegal in JavaScript (by default it is treated error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this as a data character). Thus, the pattern AB]CD becomes illegal when this
skipping to change at line 2422 skipping to change at line 2420
The returned value from pcre_study() can be passed directly to The returned value from pcre_study() can be passed directly to
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con- pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
tains other fields that can be set by the caller before the block is tains other fields that can be set by the caller before the block is
passed; these are described below in the section on matching a pattern. passed; these are described below in the section on matching a pattern.
If studying the pattern does not produce any useful information, If studying the pattern does not produce any useful information,
pcre_study() returns NULL by default. In that circumstance, if the pcre_study() returns NULL by default. In that circumstance, if the
calling program wants to pass any of the other fields to pcre_exec() or calling program wants to pass any of the other fields to pcre_exec() or
pcre_dfa_exec(), it must set up its own pcre_extra block. However, if pcre_dfa_exec(), it must set up its own pcre_extra block. However, if
pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it re-
returns a pcre_extra block even if studying did not find any additional turns a pcre_extra block even if studying did not find any additional
information. It may still return NULL, however, if an error occurs in information. It may still return NULL, however, if an error occurs in
pcre_study(). pcre_study().
The second argument of pcre_study() contains option bits. There are The second argument of pcre_study() contains option bits. There are
three further options in addition to PCRE_STUDY_EXTRA_NEEDED: three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
PCRE_STUDY_JIT_COMPILE PCRE_STUDY_JIT_COMPILE
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
skipping to change at line 2450 skipping to change at line 2448
JIT compilation is a heavyweight optimization. It can take some time JIT compilation is a heavyweight optimization. It can take some time
for patterns to be analyzed, and for one-off matches and simple pat- for patterns to be analyzed, and for one-off matches and simple pat-
terns the benefit of faster execution might be offset by a much slower terns the benefit of faster execution might be offset by a much slower
study time. Not all patterns can be optimized by the JIT compiler. For study time. Not all patterns can be optimized by the JIT compiler. For
those that cannot be handled, matching automatically falls back to the those that cannot be handled, matching automatically falls back to the
pcre_exec() interpreter. For more details, see the pcrejit documenta- pcre_exec() interpreter. For more details, see the pcrejit documenta-
tion. tion.
The third argument for pcre_study() is a pointer for an error message. The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it is set to point to a textual points to is set to NULL. Otherwise it is set to point to a textual er-
error message. This is a static string that is part of the library. You ror message. This is a static string that is part of the library. You
must not try to free it. You should test the error pointer for NULL must not try to free it. You should test the error pointer for NULL af-
after calling pcre_study(), to be sure that it has run successfully. ter calling pcre_study(), to be sure that it has run successfully.
When you are finished with a pattern, you can free the memory used for When you are finished with a pattern, you can free the memory used for
the study data by calling pcre_free_study(). This function was added to the study data by calling pcre_free_study(). This function was added to
the API for release 8.20. For earlier versions, the memory could be the API for release 8.20. For earlier versions, the memory could be
freed with pcre_free(), just like the pattern itself. This will still freed with pcre_free(), just like the pattern itself. This will still
work in cases where JIT optimization is not used, but it is advisable work in cases where JIT optimization is not used, but it is advisable
to change to the new function when convenient. to change to the new function when convenient.
This is a typical way in which pcre_study() is used (except that in a This is a typical way in which pcre_study() is used (except that in a
real application there should be tests for errors): real application there should be tests for errors):
skipping to change at line 2527 skipping to change at line 2525
port, all characters can be tested with \p and \P, or, alternatively, port, all characters can be tested with \p and \P, or, alternatively,
the PCRE_UCP option can be set when a pattern is compiled; this causes the PCRE_UCP option can be set when a pattern is compiled; this causes
\w and friends to use Unicode property support instead of the built-in \w and friends to use Unicode property support instead of the built-in
tables. tables.
The use of locales with Unicode is discouraged. If you are handling The use of locales with Unicode is discouraged. If you are handling
characters with code points greater than 128, you should either use characters with code points greater than 128, you should either use
Unicode support, or use locales, but not try to mix the two. Unicode support, or use locales, but not try to mix the two.
PCRE contains an internal set of tables that are used when the final PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many argument of pcre_compile() is NULL. These are sufficient for many ap-
applications. Normally, the internal tables recognize only ASCII char- plications. Normally, the internal tables recognize only ASCII charac-
acters. However, when PCRE is built, it is possible to cause the inter- ters. However, when PCRE is built, it is possible to cause the internal
nal tables to be rebuilt in the default "C" locale of the local system, tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different. which may cause them to be different.
The internal tables can always be overridden by tables supplied by the The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using Uni- from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away. code, the need for this locale support is expected to die away.
External tables are built by calling the pcre_maketables() function, External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() as often as necessary. For example, to build passed to pcre_compile() as often as necessary. For example, to build
and use tables that are appropriate for the French locale (where and use tables that are appropriate for the French locale (where ac-
accented characters with values greater than 128 are treated as let- cented characters with values greater than 128 are treated as letters),
ters), the following code could be used: the following code could be used:
setlocale(LC_CTYPE, "fr_FR"); setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables(); tables = pcre_maketables();
re = pcre_compile(..., tables); re = pcre_compile(..., tables);
The locale name "fr_FR" is used on Linux and other Unix-like systems; The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french". if you are using Windows, the name for the French locale is "french".
When pcre_maketables() runs, the tables are built in memory that is When pcre_maketables() runs, the tables are built in memory that is ob-
obtained via pcre_malloc. It is the caller's responsibility to ensure tained via pcre_malloc. It is the caller's responsibility to ensure
that the memory containing the tables remains available for as long as that the memory containing the tables remains available for as long as
it is needed. it is needed.
The pointer that is passed to pcre_compile() is saved with the compiled The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study() pattern, and the same tables are used via this pointer by pcre_study()
and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single pat- and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single pat-
tern, compilation, studying and matching all happen in the same locale, tern, compilation, studying and matching all happen in the same locale,
but different patterns can be processed in different locales. but different patterns can be processed in different locales.
It is possible to pass a table pointer or NULL (indicating the use of It is possible to pass a table pointer or NULL (indicating the use of
skipping to change at line 2597 skipping to change at line 2595
success, or one of the following negative numbers: success, or one of the following negative numbers:
PCRE_ERROR_NULL the argument code was NULL PCRE_ERROR_NULL the argument code was NULL
the argument where was NULL the argument where was NULL
PCRE_ERROR_BADMAGIC the "magic number" was not found PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADENDIANNESS the pattern was compiled with different PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
endianness endianness
PCRE_ERROR_BADOPTION the value of what was invalid PCRE_ERROR_BADOPTION the value of what was invalid
PCRE_ERROR_UNSET the requested field is not set PCRE_ERROR_UNSET the requested field is not set
The "magic number" is placed at the start of each compiled pattern as The "magic number" is placed at the start of each compiled pattern as a
an simple check against passing an arbitrary memory pointer. The endi- simple check against passing an arbitrary memory pointer. The endian-
anness error can occur if a compiled pattern is saved and reloaded on a ness error can occur if a compiled pattern is saved and reloaded on a
different host. Here is a typical call of pcre_fullinfo(), to obtain different host. Here is a typical call of pcre_fullinfo(), to obtain
the length of the compiled pattern: the length of the compiled pattern:
int rc; int rc;
size_t length; size_t length;
rc = pcre_fullinfo( rc = pcre_fullinfo(
re, /* result of pcre_compile() */ re, /* result of pcre_compile() */
sd, /* result of pcre_study(), or NULL */ sd, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */ PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */ &length); /* where to put the data */
skipping to change at line 2636 skipping to change at line 2634
Return a pointer to the internal default character tables within PCRE. Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func- information call is provided for internal use by the pcre_study() func-
tion. External callers can cause PCRE to use its internal tables by tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer. passing a NULL table pointer.
PCRE_INFO_FIRSTBYTE (deprecated) PCRE_INFO_FIRSTBYTE (deprecated)
Return information about the first data unit of any matched string, for Return information about the first data unit of any matched string, for
a non-anchored pattern. The name of this option refers to the 8-bit a non-anchored pattern. The name of this option refers to the 8-bit li-
library, where data units are bytes. The fourth argument should point brary, where data units are bytes. The fourth argument should point to
to an int variable. Negative values are used for special cases. How- an int variable. Negative values are used for special cases. However,
ever, this means that when the 32-bit library is in non-UTF-32 mode, this means that when the 32-bit library is in non-UTF-32 mode, the full
the full 32-bit range of characters cannot be returned. For this rea- 32-bit range of characters cannot be returned. For this reason, this
son, this value is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and value is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and
PCRE_INFO_FIRSTCHARACTER instead. PCRE_INFO_FIRSTCHARACTER instead.
If there is a fixed first value, for example, the letter "c" from a If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), its value is returned. In the 8-bit pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
library, the value is always less than 256. In the 16-bit library the library, the value is always less than 256. In the 16-bit library the
value can be up to 0xffff. In the 32-bit library the value can be up to value can be up to 0xffff. In the 32-bit library the value can be up to
0x10ffff. 0x10ffff.
If there is no fixed first value, and if either If there is no fixed first value, and if either
skipping to change at line 2665 skipping to change at line 2663
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored), set (if it were set, the pattern would be anchored),
-1 is returned, indicating that the pattern matches only at the start -1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned. -2 is returned. For anchored patterns, -2 is returned.
PCRE_INFO_FIRSTCHARACTER PCRE_INFO_FIRSTCHARACTER
Return the value of the first data unit (non-UTF character) of any Return the value of the first data unit (non-UTF character) of any
matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS re-
returns 1; otherwise return 0. The fourth argument should point to an turns 1; otherwise return 0. The fourth argument should point to a
uint32_t variable. uint32_t variable.
In the 8-bit library, the value is always less than 256. In the 16-bit In the 8-bit library, the value is always less than 256. In the 16-bit
library the value can be up to 0xffff. In the 32-bit library in UTF-32 library the value can be up to 0xffff. In the 32-bit library in UTF-32
mode the value can be up to 0x10ffff, and up to 0xffffffff when not mode the value can be up to 0x10ffff, and up to 0xffffffff when not us-
using UTF-32 mode. ing UTF-32 mode.
PCRE_INFO_FIRSTCHARACTERFLAGS PCRE_INFO_FIRSTCHARACTERFLAGS
Return information about the first data unit of any matched string, for Return information about the first data unit of any matched string, for
a non-anchored pattern. The fourth argument should point to an int a non-anchored pattern. The fourth argument should point to an int
variable. variable.
If there is a fixed first value, for example, the letter "c" from a If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the character pattern such as (cat|cow|coyote), 1 is returned, and the character
value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no
skipping to change at line 2744 skipping to change at line 2742
Return the value of the rightmost literal data unit that must exist in Return the value of the rightmost literal data unit that must exist in
any matched string, other than at its start, if such a value has been any matched string, other than at its start, if such a value has been
recorded. The fourth argument should point to an int variable. If there recorded. The fourth argument should point to an int variable. If there
is no such value, -1 is returned. For anchored patterns, a last literal is no such value, -1 is returned. For anchored patterns, a last literal
value is recorded only if it follows something of variable length. For value is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1. /^a\dz\d/ the returned value is -1.
Since for the 32-bit library using the non-UTF-32 mode, this function Since for the 32-bit library using the non-UTF-32 mode, this function
is unable to return the full 32-bit range of characters, this value is is unable to return the full 32-bit range of characters, this value is
deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_RE-
PCRE_INFO_REQUIREDCHAR values should be used. QUIREDCHAR values should be used.
PCRE_INFO_MATCH_EMPTY PCRE_INFO_MATCH_EMPTY
Return 1 if the pattern can match an empty string, otherwise 0. The Return 1 if the pattern can match an empty string, otherwise 0. The
fourth argument should point to an int variable. fourth argument should point to an int variable.
PCRE_INFO_MATCHLIMIT PCRE_INFO_MATCHLIMIT
If the pattern set a match limit by including an item of the form If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth ar-
argument should point to an unsigned 32-bit integer. If no such value gument should point to an unsigned 32-bit integer. If no such value has
has been set, the call to pcre_fullinfo() returns the error been set, the call to pcre_fullinfo() returns the error PCRE_ERROR_UN-
PCRE_ERROR_UNSET. SET.
PCRE_INFO_MAXLOOKBEHIND PCRE_INFO_MAXLOOKBEHIND
Return the number of characters (NB not data units) in the longest Return the number of characters (NB not data units) in the longest
lookbehind assertion in the pattern. This information is useful when lookbehind assertion in the pattern. This information is useful when
doing multi-segment matching using the partial matching facilities. doing multi-segment matching using the partial matching facilities.
Note that the simple assertions \b and \B require a one-character look- Note that the simple assertions \b and \B require a one-character look-
behind. \A also registers a one-character lookbehind, though it does behind. \A also registers a one-character lookbehind, though it does
not actually inspect the previous character. This is to ensure that at not actually inspect the previous character. This is to ensure that at
least one character from the old segment is retained when a new segment least one character from the old segment is retained when a new segment
skipping to change at line 2794 skipping to change at line 2792
PCRE_INFO_NAMEENTRYSIZE PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE PCRE_INFO_NAMETABLE
PCRE supports the use of named as well as numbered capturing parenthe- PCRE supports the use of named as well as numbered capturing parenthe-
ses. The names are just an additional way of identifying the parenthe- ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as ses, which still acquire numbers. Several convenience functions such as
pcre_get_named_substring() are provided for extracting captured sub- pcre_get_named_substring() are provided for extracting captured sub-
strings by name. It is also possible to extract the data directly, by strings by name. It is also possible to extract the data directly, by
first converting the name to a number in order to access the correct first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do pointers in the output vector (described with pcre_exec() below). To do
the conversion, you need to use the name-to-number map, which is the conversion, you need to use the name-to-number map, which is de-
described by these three values. scribed by these three values.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size of each entry; both of these return an int value. The entry size de-
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns pends on the length of the longest name. PCRE_INFO_NAMETABLE returns a
a pointer to the first entry of the table. This is a pointer to char in pointer to the first entry of the table. This is a pointer to char in
the 8-bit library, where the first two bytes of each entry are the num- the 8-bit library, where the first two bytes of each entry are the num-
ber of the capturing parenthesis, most significant byte first. In the ber of the capturing parenthesis, most significant byte first. In the
16-bit library, the pointer points to 16-bit data units, the first of 16-bit library, the pointer points to 16-bit data units, the first of
which contains the parenthesis number. In the 32-bit library, the which contains the parenthesis number. In the 32-bit library, the
pointer points to 32-bit data units, the first of which contains the pointer points to 32-bit data units, the first of which contains the
parenthesis number. The rest of the entry is the corresponding name, parenthesis number. The rest of the entry is the corresponding name,
zero terminated. zero terminated.
The names are in alphabetical order. If (?| is used to create multiple The names are in alphabetical order. If (?| is used to create multiple
groups with the same number, as described in the section on duplicate groups with the same number, as described in the section on duplicate
subpattern numbers in the pcrepattern page, the groups may be given the subpattern numbers in the pcrepattern page, the groups may be given the
same name, but there is only one entry in the table. Different names same name, but there is only one entry in the table. Different names
for groups of the same number are not permitted. Duplicate names for for groups of the same number are not permitted. Duplicate names for
subpatterns with different numbers are permitted, but only if PCRE_DUP- subpatterns with different numbers are permitted, but only if PCRE_DUP-
NAMES is set. They appear in the table in the order in which they were NAMES is set. They appear in the table in the order in which they were
found in the pattern. In the absence of (?| this is the order of found in the pattern. In the absence of (?| this is the order of in-
increasing number; when (?| is used this is not necessarily the case creasing number; when (?| is used this is not necessarily the case be-
because later subpatterns may have lower numbers. cause later subpatterns may have lower numbers.
As a simple example of the name/number table, consider the following As a simple example of the name/number table, consider the following
pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
set, so white space - including newlines - is ignored): set, so white space - including newlines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) - (?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) ) (?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows, each entry in the table is eight bytes long. The table is as follows,
skipping to change at line 2846 skipping to change at line 2844
00 02 y e a r 00 ?? 00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern. to be different for each compiled pattern.
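As an illustration of the entry layout described above, here is a small sketch of a name lookup for the 8-bit library. The function name name_to_number is hypothetical; it expects the table pointer, entry count and entry size that would be obtained from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE. The table is treated as unsigned bytes (an assumption made here to avoid sign extension when assembling the group number).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: search an 8-bit name table for `name`.
   Each entry is `entry_size` bytes: two bytes holding the group number
   (most significant byte first), then the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
static int name_to_number(const unsigned char *table, int name_count,
                          int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);

    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

The entry size is passed in rather than hard-coded because it differs between compiled patterns. A linear scan keeps the sketch short; since the names are in alphabetical order, a binary search would also work. With PCRE_DUPNAMES this sketch returns only the first entry for a duplicated name.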
PCRE_INFO_OKPARTIAL PCRE_INFO_OKPARTIAL
Return 1 if the pattern can be used for partial matching with Return 1 if the pattern can be used for partial matching with
pcre_exec(), otherwise 0. The fourth argument should point to an int pcre_exec(), otherwise 0. The fourth argument should point to an int
variable. From release 8.00, this always returns 1, because the variable. From release 8.00, this always returns 1, because the re-
restrictions that previously applied to partial matching have been strictions that previously applied to partial matching have been
lifted. The pcrepartial documentation gives details of partial match- lifted. The pcrepartial documentation gives details of partial match-
ing. ing.
PCRE_INFO_OPTIONS PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was compiled. The Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In by any top-level option settings at the start of the pattern itself. In
other words, they are the options that will be in force when matching other words, they are the options that will be in force when matching
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED. and PCRE_EXTENDED.
A pattern is automatically anchored by PCRE if all of its top-level A pattern is automatically anchored by PCRE if all of its top-level al-
alternatives begin with one of the following: ternatives begin with one of the following:
^ unless PCRE_MULTILINE is set ^ unless PCRE_MULTILINE is set
\A always \A always
\G always \G always
.* if PCRE_DOTALL is set and there are no back .* if PCRE_DOTALL is set and there are no back
references to the subpattern in which .* appears references to the subpattern in which .* appears
For such patterns, the PCRE_ANCHORED bit is set in the options returned For such patterns, the PCRE_ANCHORED bit is set in the options returned
by pcre_fullinfo(). by pcre_fullinfo().
PCRE_INFO_RECURSIONLIMIT PCRE_INFO_RECURSIONLIMIT
If the pattern set a recursion limit by including an item of the form If the pattern set a recursion limit by including an item of the form
(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
argument should point to an unsigned 32-bit integer. If no such value argument should point to an unsigned 32-bit integer. If no such value
has been set, the call to pcre_fullinfo() returns the error has been set, the call to pcre_fullinfo() returns the error PCRE_ER-
PCRE_ERROR_UNSET. ROR_UNSET.
PCRE_INFO_SIZE PCRE_INFO_SIZE
Return the size of the compiled pattern in bytes (for all three Return the size of the compiled pattern in bytes (for all three li-
libraries). The fourth argument should point to a size_t variable. This braries). The fourth argument should point to a size_t variable. This
value does not include the size of the pcre structure that is returned value does not include the size of the pcre structure that is returned
by pcre_compile(). The value that is passed as the argument to by pcre_compile(). The value that is passed as the argument to
pcre_malloc() when pcre_compile() is getting memory in which to place pcre_malloc() when pcre_compile() is getting memory in which to place
the compiled data is the value returned by this option plus the size of the compiled data is the value returned by this option plus the size of
the pcre structure. Studying a compiled pattern, with or without JIT, the pcre structure. Studying a compiled pattern, with or without JIT,
does not alter the value returned by this option. does not alter the value returned by this option.
PCRE_INFO_STUDYSIZE PCRE_INFO_STUDYSIZE
Return the size in bytes (for all three libraries) of the data block Return the size in bytes (for all three libraries) of the data block
skipping to change at line 2915 skipping to change at line 2913
PCRE_INFO_REQUIREDCHARFLAGS PCRE_INFO_REQUIREDCHARFLAGS
Returns 1 if there is a rightmost literal data unit that must exist in Returns 1 if there is a rightmost literal data unit that must exist in
any matched string, other than at its start. The fourth argument should any matched string, other than at its start. The fourth argument should
point to an int variable. If there is no such value, 0 is returned. If point to an int variable. If there is no such value, 0 is returned. If
returning 1, the character value itself can be retrieved using returning 1, the character value itself can be retrieved using
PCRE_INFO_REQUIREDCHAR. PCRE_INFO_REQUIREDCHAR.
For anchored patterns, a last literal value is recorded only if it fol- For anchored patterns, a last literal value is recorded only if it fol-
lows something of variable length. For example, for the pattern lows something of variable length. For example, for the pattern
/^a\d+z\d+/ the returned value is 1 (with "z" returned from /^a\d+z\d+/ the returned value is 1 (with "z" returned from PCRE_INFO_RE-
PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0. QUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
PCRE_INFO_REQUIREDCHAR PCRE_INFO_REQUIREDCHAR
Return the value of the rightmost literal data unit that must exist in Return the value of the rightmost literal data unit that must exist in
any matched string, other than at its start, if such a value has been any matched string, other than at its start, if such a value has been
recorded. The fourth argument should point to an uint32_t variable. If recorded. The fourth argument should point to a uint32_t variable. If
there is no such value, 0 is returned. there is no such value, 0 is returned.
REFERENCE COUNTS REFERENCE COUNTS
int pcre_refcount(pcre *code, int adjust); int pcre_refcount(pcre *code, int adjust);
The pcre_refcount() function is used to maintain a reference count in The pcre_refcount() function is used to maintain a reference count in
the data block that contains a compiled pattern. It is provided for the the data block that contains a compiled pattern. It is provided for the
benefit of applications that operate in an object-oriented manner, benefit of applications that operate in an object-oriented manner,
where different parts of the application may be using the same compiled where different parts of the application may be using the same compiled
skipping to change at line 2954 skipping to change at line 2952
whose byte-order is different. (This seems a highly unlikely scenario.) whose byte-order is different. (This seems a highly unlikely scenario.)
MATCHING A PATTERN: THE TRADITIONAL FUNCTION MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra, int pcre_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset, const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize); int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a The function pcre_exec() is called to match a subject string against a
compiled pattern, which is passed in the code argument. If the pattern compiled pattern, which is passed in the code argument. If the pattern
was studied, the result of the study should be passed in the extra was studied, the result of the study should be passed in the extra ar-
argument. You can call pcre_exec() with the same code and extra argu- gument. You can call pcre_exec() with the same code and extra arguments
ments as many times as you like, in order to match different subject as many times as you like, in order to match different subject strings
strings with the same pattern. with the same pattern.
This function is the main matching facility of the library, and it This function is the main matching facility of the library, and it op-
operates in a Perl-like manner. For specialist use there is also an erates in a Perl-like manner. For specialist use there is also an al-
alternative matching function, which is described below in the section ternative matching function, which is described below in the section
about the pcre_dfa_exec() function. about the pcre_dfa_exec() function.
In most applications, the pattern will have been compiled (and option- In most applications, the pattern will have been compiled (and option-
ally studied) in the same process that calls pcre_exec(). However, it ally studied) in the same process that calls pcre_exec(). However, it
is possible to save compiled patterns and study data, and then use them is possible to save compiled patterns and study data, and then use them
later in different processes, possibly even on different hosts. For a later in different processes, possibly even on different hosts. For a
discussion about this, see the pcreprecompile documentation. discussion about this, see the pcreprecompile documentation.
Here is an example of a simple call to pcre_exec(): Here is an example of a simple call to pcre_exec():
skipping to change at line 3031 skipping to change at line 3029
should not set these yourself, but you may add to the block by setting should not set these yourself, but you may add to the block by setting
other fields and their corresponding flag bits. other fields and their corresponding flag bits.
The match_limit field provides a means of preventing PCRE from using up The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their match, but which have a very large number of possibilities in their
search trees. The classic example is a pattern that uses nested unlim- search trees. The classic example is a pattern that uses nested unlim-
ited repeats. ited repeats.
Internally, pcre_exec() uses a function called match(), which it calls Internally, pcre_exec() uses a function called match(), which it calls
repeatedly (sometimes recursively). The limit set by match_limit is repeatedly (sometimes recursively). The limit set by match_limit is im-
imposed on the number of times this function is called during a match, posed on the number of times this function is called during a match,
which has the effect of limiting the amount of backtracking that can which has the effect of limiting the amount of backtracking that can
take place. For patterns that are not anchored, the count restarts from take place. For patterns that are not anchored, the count restarts from
zero for each position in the subject string. zero for each position in the subject string.
When pcre_exec() is called with a pattern that was successfully studied When pcre_exec() is called with a pattern that was successfully studied
with a JIT option, the way that the matching is executed is entirely with a JIT option, the way that the matching is executed is entirely
different. However, there is still the possibility of runaway matching different. However, there is still the possibility of runaway matching
that goes on for a very long time, and so the match_limit value is also that goes on for a very long time, and so the match_limit value is also
used in this case (but in a different way) to limit how long the match- used in this case (but in a different way) to limit how long the match-
ing can continue. ing can continue.
The default value for the limit can be set when PCRE is built; the The default value for the limit can be set when PCRE is built; the de-
default default is 10 million, which handles all but the most extreme fault default is 10 million, which handles all but the most extreme
cases. You can override the default by supplying pcre_exec() with a cases. You can override the default by supplying pcre_exec() with a
pcre_extra block in which match_limit is set, and pcre_extra block in which match_limit is set, and PCRE_EX-
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is TRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded,
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
A value for the match limit may also be supplied by an item at the A value for the match limit may also be supplied by an item at the
start of a pattern of the form start of a pattern of the form
(*LIMIT_MATCH=d) (*LIMIT_MATCH=d)
where d is a decimal number. However, such a setting is ignored unless where d is a decimal number. However, such a setting is ignored unless
d is less than the limit set by the caller of pcre_exec() or, if no d is less than the limit set by the caller of pcre_exec() or, if no
such limit is set, less than the default. such limit is set, less than the default.
skipping to change at line 3075 skipping to change at line 3073
Limiting the recursion depth limits the amount of machine stack that Limiting the recursion depth limits the amount of machine stack that
can be used, or, when PCRE has been compiled to use memory on the heap can be used, or, when PCRE has been compiled to use memory on the heap
instead of the stack, the amount of heap memory that can be used. This instead of the stack, the amount of heap memory that can be used. This
limit is not relevant, and is ignored, when matching is done using JIT limit is not relevant, and is ignored, when matching is done using JIT
compiled code. compiled code.
The default value for match_limit_recursion can be set when PCRE is The default value for match_limit_recursion can be set when PCRE is
built; the default default is the same value as the default for built; the default default is the same value as the default for
match_limit. You can override the default by supplying pcre_exec() with match_limit. You can override the default by supplying pcre_exec() with
a pcre_extra block in which match_limit_recursion is set, and a pcre_extra block in which match_limit_recursion is set, and PCRE_EX-
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the TRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
A value for the recursion limit may also be supplied by an item at the A value for the recursion limit may also be supplied by an item at the
start of a pattern of the form start of a pattern of the form
(*LIMIT_RECURSION=d) (*LIMIT_RECURSION=d)
where d is a decimal number. However, such a setting is ignored unless where d is a decimal number. However, such a setting is ignored unless
d is less than the limit set by the caller of pcre_exec() or, if no d is less than the limit set by the caller of pcre_exec() or, if no
such limit is set, less than the default. such limit is set, less than the default.
skipping to change at line 3154 skipping to change at line 3152
sequence matches. The choice is either to match only CR, LF, or CRLF, sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. These options override the or to match any Unicode newline sequence. These options override the
choice that was made or defaulted when the pattern was compiled. choice that was made or defaulted when the pattern was compiled.
PCRE_NEWLINE_CR PCRE_NEWLINE_CR
PCRE_NEWLINE_LF PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY PCRE_NEWLINE_ANY
These options override the newline definition that was chosen or These options override the newline definition that was chosen or de-
defaulted when the pattern was compiled. For details, see the descrip- faulted when the pattern was compiled. For details, see the description
tion of pcre_compile() above. During matching, the newline choice of pcre_compile() above. During matching, the newline choice affects
affects the behaviour of the dot, circumflex, and dollar metacharac- the behaviour of the dot, circumflex, and dollar metacharacters. It may
ters. It may also alter the way the match position is advanced after a also alter the way the match position is advanced after a match failure
match failure for an unanchored pattern. for an unanchored pattern.
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the cur- set, and a match attempt for an unanchored pattern fails when the cur-
rent position is at a CRLF sequence, and the pattern contains no rent position is at a CRLF sequence, and the pattern contains no ex-
explicit matches for CR or LF characters, the match position is plicit matches for CR or LF characters, the match position is advanced
advanced by two characters instead of one, in other words, to after the by two characters instead of one, in other words, to after the CRLF.
CRLF.
The above rule is a compromise that makes the most common cases work as The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE_DOTALL expected. For example, if the pattern is .+A (and the PCRE_DOTALL op-
option is not set), it does not match the string "\r\nA" because, after tion is not set), it does not match the string "\r\nA" because, after
failing at the start, it skips both the CR and the LF before retrying. failing at the start, it skips both the CR and the LF before retrying.
However, the pattern [\r\n]A does match that string, because it con- However, the pattern [\r\n]A does match that string, because it con-
tains an explicit CR or LF reference, and so advances only by one char- tains an explicit CR or LF reference, and so advances only by one char-
acter after the first failure. acter after the first failure.
An explicit match for CR or LF is either a literal appearance of one of An explicit match for CR or LF is either a literal appearance of one of
those characters, or one of the \r or \n escape sequences. Implicit those characters, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches). LF in the characters that it matches).
Notwithstanding the above, anomalous effects may still occur when CRLF Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the is a valid newline sequence and explicit \r or \n escapes appear in the
pattern. pattern.
PCRE_NOTBOL PCRE_NOTBOL
This option specifies that the first character of the subject string is not This option specifies that the first character of the subject string is not
the beginning of a line, so the circumflex metacharacter should not the beginning of a line, so the circumflex metacharacter should not
match before it. Setting this without PCRE_MULTILINE (at compile time) match before it. Setting this without PCRE_MULTILINE (at compile time)
causes circumflex never to match. This option affects only the behav- causes circumflex never to match. This option affects only the behav-
iour of the circumflex metacharacter. It does not affect \A. iour of the circumflex metacharacter. It does not affect \A.
PCRE_NOTEOL PCRE_NOTEOL
This option specifies that the end of the subject string is not the end This option specifies that the end of the subject string is not the end
of a line, so the dollar metacharacter should not match it nor (except of a line, so the dollar metacharacter should not match it nor (except
in multiline mode) a newline immediately before it. Setting this with- in multiline mode) a newline immediately before it. Setting this with-
out PCRE_MULTILINE (at compile time) causes dollar never to match. This out PCRE_MULTILINE (at compile time) causes dollar never to match. This
option affects only the behaviour of the dollar metacharacter. It does option affects only the behaviour of the dollar metacharacter. It does
not affect \Z or \z. not affect \Z or \z.
PCRE_NOTEMPTY PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is An empty string is not considered to be a valid match if this option is
set. If there are alternatives in the pattern, they are tried. If all set. If there are alternatives in the pattern, they are tried. If all
the alternatives match the empty string, the entire match fails. For the alternatives match the empty string, the entire match fails. For
example, if the pattern example, if the pattern
a?b? a?b?
is applied to a string not beginning with "a" or "b", it matches an is applied to a string not beginning with "a" or "b", it matches an
empty string at the start of the subject. With PCRE_NOTEMPTY set, this empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for occur- match is not valid, so PCRE searches further into the string for occur-
rences of "a" or "b". rences of "a" or "b".
PCRE_NOTEMPTY_ATSTART PCRE_NOTEMPTY_ATSTART
This is like PCRE_NOTEMPTY, except that an empty string match that is This is like PCRE_NOTEMPTY, except that an empty string match that is
not at the start of the subject is permitted. If the pattern is not at the start of the subject is permitted. If the pattern is an-
anchored, such a match can occur only if the pattern contains \K. chored, such a match can occur only if the pattern contains \K.
Perl has no direct equivalent of PCRE_NOTEMPTY or Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_AT-
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern START, but it does make a special case of a pattern match of the empty
match of the empty string within its split() function, and when using string within its split() function, and when using the /g modifier. It
the /g modifier. It is possible to emulate Perl's behaviour after is possible to emulate Perl's behaviour after matching a null string by
matching a null string by first trying the match again at the same off- first trying the match again at the same offset with PCRE_NOTEMPTY_AT-
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that START and PCRE_ANCHORED, and then if that fails, by advancing the
fails, by advancing the starting offset (see below) and trying an ordi- starting offset (see below) and trying an ordinary match again. There
nary match again. There is some code that demonstrates how to do this is some code that demonstrates how to do this in the pcredemo sample
in the pcredemo sample program. In the most general case, you have to program. In the most general case, you have to check to see if the new-
check to see if the newline convention recognizes CRLF as a newline, line convention recognizes CRLF as a newline, and if so, and the cur-
and if so, and the current character is CR followed by LF, advance the rent character is CR followed by LF, advance the starting offset by two
starting offset by two characters instead of one. characters instead of one.
PCRE_NO_START_OPTIMIZE PCRE_NO_START_OPTIMIZE
There are a number of optimizations that pcre_exec() uses at the start There are a number of optimizations that pcre_exec() uses at the start
of a match, in order to speed up the process. For example, if it is of a match, in order to speed up the process. For example, if it is
known that an unanchored match must start with a specific character, it known that an unanchored match must start with a specific character, it
searches the subject for that character, and fails immediately if it searches the subject for that character, and fails immediately if it
cannot find it, without actually running the main matching function. cannot find it, without actually running the main matching function.
This means that a special item such as (*COMMIT) at the start of a pat- This means that a special item such as (*COMMIT) at the start of a pat-
tern is not considered until after a suitable starting point for the tern is not considered until after a suitable starting point for the
match has been found. Also, when callouts or (*MARK) items are in use, match has been found. Also, when callouts or (*MARK) items are in use,
these "start-up" optimizations can cause them to be skipped if the pat- these "start-up" optimizations can cause them to be skipped if the pat-
tern is never actually used. The start-up optimizations are in effect a tern is never actually used. The start-up optimizations are in effect a
pre-scan of the subject that takes place before the pattern is run. pre-scan of the subject that takes place before the pattern is run.
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases possibly causing performance to suffer, but ensuring that in cases
where the result is "no match", the callouts do occur, and that items where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting such as (*COMMIT) and (*MARK) are considered at every possible starting
position in the subject string. If PCRE_NO_START_OPTIMIZE is set at position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
compile time, it cannot be unset at matching time. The use of compile time, it cannot be unset at matching time. The use of
PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to
pcre_exec()) disables JIT execution; in this situation, matching is pcre_exec()) disables JIT execution; in this situation, matching is al-
always done interpretively. ways done interpretively.
Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching op-
operation. Consider the pattern eration. Consider the pattern
(*COMMIT)ABC (*COMMIT)ABC
When this is compiled, PCRE records the fact that a match must start When this is compiled, PCRE records the fact that a match must start
with the character "A". Suppose the subject string is "DEFABC". The with the character "A". Suppose the subject string is "DEFABC". The
start-up optimization scans along the subject, finds "A" and runs the start-up optimization scans along the subject, finds "A" and runs the
first match attempt from there. The (*COMMIT) item means that the pat- first match attempt from there. The (*COMMIT) item means that the pat-
tern must match the current starting position, which in this case, it tern must match the current starting position, which in this case, it
does. However, if the same match is run with PCRE_NO_START_OPTIMIZE does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
set, the initial scan along the subject string does not happen. The set, the initial scan along the subject string does not happen. The
first match attempt is run starting from "D" and when this fails, first match attempt is run starting from "D" and when this fails,
(*COMMIT) prevents any further matches being tried, so the overall (*COMMIT) prevents any further matches being tried, so the overall re-
result is "no match". If the pattern is studied, more start-up opti- sult is "no match". If the pattern is studied, more start-up optimiza-
mizations may be used. For example, a minimum length for the subject tions may be used. For example, a minimum length for the subject may be
may be recorded. Consider the pattern recorded. Consider the pattern
(*MARK:A)(X|Y) (*MARK:A)(X|Y)
The minimum length for a match is one character. If the subject is The minimum length for a match is one character. If the subject is
"ABC", there will be attempts to match "ABC", "BC", "C", and then "ABC", there will be attempts to match "ABC", "BC", "C", and then fi-
finally an empty string. If the pattern is studied, the final attempt nally an empty string. If the pattern is studied, the final attempt
does not take place, because PCRE knows that the subject is too short, does not take place, because PCRE knows that the subject is too short,
and so the (*MARK) is never encountered. In this case, studying the and so the (*MARK) is never encountered. In this case, studying the
pattern does not affect the overall match result, which is still "no pattern does not affect the overall match result, which is still "no
match", but it does affect the auxiliary information that is returned. match", but it does affect the auxiliary information that is returned.
PCRE_NO_UTF8_CHECK PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set at compile time, the validity of the subject as a When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The entire string is checked before any other processing takes called. The entire string is checked before any other processing takes
place. The value of startoffset is also checked to ensure that it place. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the pcreunicode page. If an invalid the validity of UTF-8 strings in the pcreunicode page. If an invalid
sequence of bytes is found, pcre_exec() returns the error sequence of bytes is found, pcre_exec() returns the error PCRE_ER-
PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a ROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a trun-
truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In cated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
both cases, information about the precise nature of the error may also both cases, information about the precise nature of the error may also
be returned (see the descriptions of these errors in the section enti- be returned (see the descriptions of these errors in the section enti-
tled Error return values from pcre_exec() below). If startoffset con- tled Error return values from pcre_exec() below). If startoffset con-
tains a value that does not point to the start of a UTF-8 character (or tains a value that does not point to the start of a UTF-8 character (or
to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned. to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
If you already know that your subject is valid, and you want to skip If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset string. However, you should be sure that the value of startoffset
points to the start of a character (or the end of the subject). When points to the start of a character (or the end of the subject). When
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
subject or an invalid value of startoffset is undefined. Your program subject or an invalid value of startoffset is undefined. Your program
may crash or loop. may crash or loop.
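Before setting PCRE_NO_UTF8_CHECK, a caller may want to confirm for itself that startoffset is acceptable. The sketch below checks only the start-of-character condition described above, using the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. It does not validate the subject as a whole, and the function name is hypothetical rather than part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if offset is the end
   of the subject or points at a byte that can start a UTF-8 character,
   and 0 otherwise. It does not check that the subject is valid UTF-8. */
int is_utf8_char_start(const unsigned char *subject, size_t length,
                       size_t offset)
{
    if (offset > length)
        return 0;                 /* beyond the subject */
    if (offset == length)
        return 1;                 /* end of subject is allowed */
    /* Continuation bytes look like 10xxxxxx; anything else may start
       a character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```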
PCRE_PARTIAL_HARD PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT PCRE_PARTIAL_SOFT
These options turn on the partial matching feature. For backwards com- These options turn on the partial matching feature. For backwards com-
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
match occurs if the end of the subject string is reached successfully, match occurs if the end of the subject string is reached successfully,
but there are not enough subject characters to complete the match. If but there are not enough subject characters to complete the match. If
this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
matching continues by testing any remaining alternatives. Only if no matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE_ERROR_PARTIAL returned instead of complete match can be found is PCRE_ERROR_PARTIAL returned instead of
PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
caller is prepared to handle a partial match, but only if no complete caller is prepared to handle a partial match, but only if no complete
match can be found. match can be found.
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
case, if a partial match is found, pcre_exec() immediately returns case, if a partial match is found, pcre_exec() immediately returns
PCRE_ERROR_PARTIAL, without considering any other alternatives. In PCRE_ERROR_PARTIAL, without considering any other alternatives. In
other words, when PCRE_PARTIAL_HARD is set, a partial match is consid- other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
ered to be more important than an alternative complete match. ered to be more important than an alternative complete match.
In both cases, the portion of the string that was inspected when the In both cases, the portion of the string that was inspected when the
partial match was found is set as the first matching string. There is a partial match was found is set as the first matching string. There is a
more detailed discussion of partial and multi-segment matching, with more detailed discussion of partial and multi-segment matching, with
examples, in the pcrepartial documentation. examples, in the pcrepartial documentation.
The string to be matched by pcre_exec() The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a The subject string is passed to pcre_exec() as a pointer in subject, a
length in length, and a starting offset in startoffset. The units for length in length, and a starting offset in startoffset. The units for
length and startoffset are bytes for the 8-bit library, 16-bit data length and startoffset are bytes for the 8-bit library, 16-bit data
items for the 16-bit library, and 32-bit data items for the 32-bit items for the 16-bit library, and 32-bit data items for the 32-bit li-
library. brary.
If startoffset is negative or greater than the length of the subject, If startoffset is negative or greater than the length of the subject,
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
zero, the search for a match starts at the beginning of the subject, zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case. In UTF-8 or UTF-16 mode, the and this is by far the most common case. In UTF-8 or UTF-16 mode, the
offset must point to the start of a character, or the end of the sub- offset must point to the start of a character, or the end of the sub-
ject (in UTF-32 mode, one data unit equals one character, so all off- ject (in UTF-32 mode, one data unit equals one character, so all off-
sets are valid). Unlike the pattern string, the subject may contain sets are valid). Unlike the pattern string, the subject may contain bi-
binary zeroes. nary zeroes.
A non-zero starting offset is useful when searching for another match A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous suc- in the same subject by calling pcre_exec() again after a previous suc-
cess. Setting startoffset differs from just passing over a shortened cess. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern with any kind of lookbehind. For example, consider the pattern
\Biss\B \Biss\B
which finds occurrences of "iss" in the middle of words. (\B matches which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.) only if the current position in the subject is not a word boundary.)
When applied to the string "Mississipi" the first call to pcre_exec() When applied to the string "Mississipi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issipi", it does not match, the remainder of the subject, namely "issipi", it does not match, be-
because \B is always false at the start of the subject, which is deemed cause \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur- string again, but with startoffset set to 4, it finds the second occur-
rence of "iss" because it is able to look behind the starting point to rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter. discover that it is preceded by a letter.
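The difference between the two calls comes down to what lies before the position being tested. The following is a simplified, ASCII-only model of the \B test, written only to illustrate that point. It is an assumption-laden sketch, not PCRE's implementation, and the function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Simplified notion of a "word" character: ASCII letters, digits and
   underscore. PCRE's own definition depends on its character tables. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical model of \B: returns 1 if position pos in the subject is
   NOT a word boundary. Positions before the start and after the end of
   the subject count as non-word, which is why \B fails at offset 0 of a
   subject that begins with a letter. */
int not_word_boundary(const char *subject, size_t length, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

With the full subject "Mississipi" and pos set to 4, the function can see the preceding "s" and reports a non-boundary. With the shortened subject "issipi" and pos set to 0, nothing precedes the position, so it reports a boundary.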
Finding all the matches in a subject is tricky when the pattern can Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by match an empty string. It is possible to emulate Perl's /g behaviour by
first trying the match again at the same offset, with the first trying the match again at the same offset, with the
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
fails, advancing the starting offset and trying an ordinary match fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the pcre- again. There is some code that demonstrates how to do this in the pcre-
demo sample program. In the most general case, you have to check to see demo sample program. In the most general case, you have to check to see
if the newline convention recognizes CRLF as a newline, and if so, and if the newline convention recognizes CRLF as a newline, and if so, and
the current character is CR followed by LF, advance the starting offset the current character is CR followed by LF, advance the starting offset
by two characters instead of one. by two characters instead of one.
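The advancing step described above can be sketched for the 8-bit library as follows. The function name and its flag parameters are hypothetical. The caller is assumed to have already determined whether the newline convention in force recognizes CRLF and whether the pattern was compiled in UTF-8 mode. The pcredemo program remains the reference for the complete loop.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: computes the next starting
   offset after an anchored, non-empty retry at 'offset' has failed. */
size_t advance_start_offset(const unsigned char *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
    if (offset >= length)
        return length;            /* nothing left to advance over */

    /* Step over CRLF as a unit when it is a recognized newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;

    /* In UTF-8 mode, skip continuation bytes (10xxxxxx) so that the
       new offset points to the start of a character. */
    if (utf8) {
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```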
If a non-zero starting offset is passed when the pattern is anchored, If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the if the pattern does not require the match to be at the start of the
subject. subject.
How pcre_exec() returns captured substrings How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book, parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a sub- subpattern" is used for a fragment of a pattern that picks out a sub-
string. PCRE supports several other kinds of parenthesized subpattern string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured. that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers Captured substrings are returned to the caller via a vector of integers
whose address is passed in ovector. The number of elements in the vec- whose address is passed in ovector. The number of elements in the vec-
tor is passed in ovecsize, which must be a non-negative number. Note: tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes. this argument is NOT the size of ovector in bytes.
The first two-thirds of the vector is used to pass back captured sub- The first two-thirds of the vector is used to pass back captured sub-
strings, each substring using a pair of integers. The remaining third strings, each substring using a pair of integers. The remaining third
of the vector is used as workspace by pcre_exec() while matching cap- of the vector is used as workspace by pcre_exec() while matching cap-
turing subpatterns, and is not available for passing back information. turing subpatterns, and is not available for passing back information.
The number passed in ovecsize should always be a multiple of three. If The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down. it is not, it is rounded down.
When a match is successful, information about captured substrings is When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector, returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the offset of the first character in a element of each pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character substring, and the second is set to the offset of the first character
after the end of a substring. These values are always data unit off- after the end of a substring. These values are always data unit off-
sets, even in UTF mode. They are byte offsets in the 8-bit library, sets, even in UTF mode. They are byte offsets in the 8-bit library,
16-bit data item offsets in the 16-bit library, and 32-bit data item 16-bit data item offsets in the 16-bit library, and 32-bit data item
offsets in the 32-bit library. Note: they are not character counts. offsets in the 32-bit library. Note: they are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify the The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair value from a successful match is 1, indicating that just the first pair
of offsets has been set. of offsets has been set.
If a capturing subpattern is matched repeatedly, it is the last portion If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned. of the string that it matched that is returned.
If the vector is too small to hold all the captured substring offsets, If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If neither the actual string matched function returns a value of zero. If neither the actual string matched
nor any captured substrings are of interest, pcre_exec() may be called nor any captured substrings are of interest, pcre_exec() may be called
with ovector passed as NULL and ovecsize as zero. However, if the pat- with ovector passed as NULL and ovecsize as zero. However, if the pat-
tern contains back references and the ovector is not big enough to tern contains back references and the ovector is not big enough to re-
remember the related substrings, PCRE has to get additional memory for member the related substrings, PCRE has to get additional memory for
use during matching. Thus it is usually advisable to supply an ovector use during matching. Thus it is usually advisable to supply an ovector
of reasonable size. of reasonable size.
There are some cases where zero is returned (indicating vector over- There are some cases where zero is returned (indicating vector over-
flow) when in fact the vector is exactly the right size for the final flow) when in fact the vector is exactly the right size for the final
match. For example, consider the pattern match. For example, consider the pattern
(a)(?:(b)c|bd) (a)(?:(b)c|bd)
If a vector of 6 elements (allowing for only 1 captured substring) is If a vector of 6 elements (allowing for only 1 captured substring) is
given with subject string "abd", pcre_exec() will try to set the second given with subject string "abd", pcre_exec() will try to set the second
captured string, thereby recording a vector overflow, before failing to captured string, thereby recording a vector overflow, before failing to
match "c" and backing up to try the second alternative. The zero match "c" and backing up to try the second alternative. The zero re-
return, however, does correctly indicate that the maximum number of turn, however, does correctly indicate that the maximum number of slots
slots (namely 2) have been filled. In similar cases where there is tem- (namely 2) have been filled. In similar cases where there is temporary
porary overflow, but the final number of used slots is actually less overflow, but the final number of used slots is actually less than the
than the maximum, a non-zero value is returned. maximum, a non-zero value is returned.
The pcre_fullinfo() function can be used to find out how many capturing The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3. offsets of the substring matched by the whole pattern, is (n+1)*3.
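The sizing arithmetic above can be written out as two small helpers. Both names are hypothetical and are not part of the PCRE API. The second helper reflects the statements that ovecsize is rounded down to a multiple of three and that only the first two-thirds of the vector holds offset pairs.

```c
/* Hypothetical helpers, not part of PCRE. */

/* Smallest ovecsize that holds the whole-match pair plus n captured
   substrings: (n+1)*3. */
int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs (whole match included) that a vector of
   ovecsize elements can return. Integer division performs the
   rounding down to a multiple of three. */
int ovector_usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a vector of 6 elements yields 2 usable pairs, that is, the whole match plus one captured substring.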
It is possible for capturing subpattern number n+1 to match some part It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example, of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the if the string "abc" is matched against the pattern (a|(z))(bc) the re-
return from the function is 4, and subpatterns 1 and 3 are matched, but turn from the function is 4, and subpatterns 1 and 3 are matched, but 2
2 is not. When this happens, both values in the offset pairs corre- is not. When this happens, both values in the offset pairs correspond-
sponding to unused subpatterns are set to -1. ing to unused subpatterns are set to -1.
Offset values that correspond to unused subpatterns at the end of the Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used matched. The return from the function is 2, because the highest used
capturing subpattern number is 1, and the offsets for the second capturing subpattern number is 1, and the offsets for the second
and third capturing subpatterns (assuming the vector is large enough, and third capturing subpatterns (assuming the vector is large enough,
of course) are set to -1. of course) are set to -1.
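A minimal sketch of how a caller might test whether a given subpattern took part in a match, based on the -1 convention described above. The function name group_is_set() is hypothetical. It assumes the vector was filled by a successful match and that n does not exceed the number of capturing subpatterns in the pattern, because pairs beyond that are never written by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). Returns 1 if capturing
   subpattern n has a recorded match in ovector, 0 if it is unset or
   if its offset pair does not fit in the vector. */
int group_is_set(const int *ovector, int ovecsize, int n)
{
    assert(ovector != NULL && n >= 0);

    /* Only the first two-thirds of the vector hold offset pairs. */
    if (2 * n + 1 >= (ovecsize / 3) * 2)
        return 0;

    /* Unset subpatterns have both offsets set to -1. */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

With the offsets that the "abc" against (a|(z))(bc) example would produce, that is {0,3, 0,1, -1,-1, 1,3} in a 12-element vector, subpatterns 1 and 3 are reported as set and subpattern 2 as unset.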
Note: Elements in the first two-thirds of ovector that do not corre- Note: Elements in the first two-thirds of ovector that do not corre-
spond to capturing parentheses in the pattern are never changed. That spond to capturing parentheses in the pattern are never changed. That
is, if a pattern contains n capturing parentheses, no more than ovec- is, if a pattern contains n capturing parentheses, no more than ovec-
tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
the first two-thirds) retain whatever values they previously had. the first two-thirds) retain whatever values they previously had.
Some convenience functions are provided for extracting the captured Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below. substrings as separate strings. These are described below.
Error return values from pcre_exec() Error return values from pcre_exec()
If pcre_exec() fails, it returns a negative number. The following are If pcre_exec() fails, it returns a negative number. The following are
defined in the header file: defined in the header file:
PCRE_ERROR_NOMATCH (-1) PCRE_ERROR_NOMATCH (-1)
The subject string did not match the pattern. The subject string did not match the pattern.
PCRE_ERROR_NULL (-2) PCRE_ERROR_NULL (-2)
Either code or subject was passed as NULL, or ovector was NULL and Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero. ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3) PCRE_ERROR_BADOPTION (-3)
An unrecognized bit was set in the options argument. An unrecognized bit was set in the options argument.
PCRE_ERROR_BADMAGIC (-4) PCRE_ERROR_BADMAGIC (-4)
PCRE stores a 4-byte "magic number" at the start of the compiled code, PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present. gives when the magic number is not present.
PCRE_ERROR_UNKNOWN_OPCODE (-5) PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered in the While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern. overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6) PCRE_ERROR_NOMEMORY (-6)
If a pattern contains back references, but the ovector that is passed If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings, to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching. memory is automatically freed at the end of matching.
This error is also given if pcre_stack_malloc() fails in pcre_exec(). This error is also given if pcre_stack_malloc() fails in pcre_exec().
This can happen only when PCRE has been compiled with --disable-stack- This can happen only when PCRE has been compiled with --disable-stack-
for-recursion. for-recursion.
PCRE_ERROR_NOSUBSTRING (-7) PCRE_ERROR_NOSUBSTRING (-7)
This error is used by the pcre_copy_substring(), pcre_get_substring(), This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never and pcre_get_substring_list() functions (see below). It is never re-
returned by pcre_exec(). turned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8) PCRE_ERROR_MATCHLIMIT (-8)
The backtracking limit, as specified by the match_limit field in a The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description pcre_extra structure (or defaulted) was reached. See the description
above. above.
PCRE_ERROR_CALLOUT (-9) PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code. use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details. See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10) PCRE_ERROR_BADUTF8 (-10)
A string that contains an invalid UTF-8 byte sequence was passed as a A string that contains an invalid UTF-8 byte sequence was passed as a
subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
the output vector (ovecsize) is at least 2, the byte offset to the the output vector (ovecsize) is at least 2, the byte offset to the
start of the invalid UTF-8 character is placed in the first element, start of the invalid UTF-8 character is placed in the first element,
and a reason code is placed in the second element. The reason and a reason code is placed in the second element. The reason
codes are listed in the following section. For backward compatibility, codes are listed in the following section. For backward compatibility,
if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char- if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
acter at the end of the subject (reason codes 1 to 5), acter at the end of the subject (reason codes 1 to 5), PCRE_ER-
PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8. ROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
PCRE_ERROR_BADUTF8_OFFSET (-11) PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was checked and The UTF-8 byte sequence that was passed as a subject was checked and
found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
value of startoffset did not point to the beginning of a UTF-8 charac- value of startoffset did not point to the beginning of a UTF-8 charac-
ter or the end of the subject. ter or the end of the subject.
PCRE_ERROR_PARTIAL (-12) PCRE_ERROR_PARTIAL (-12)
The subject string did not match, but it did match partially. See the The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching. pcrepartial documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13) PCRE_ERROR_BADPARTIAL (-13)
This code is no longer in use. It was formerly returned when the This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing items PCRE_PARTIAL option was used with a compiled pattern containing items
that were not supported for partial matching. From release 8.00 that were not supported for partial matching. From release 8.00 on-
onwards, there are no restrictions on partial matching. wards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14) PCRE_ERROR_INTERNAL (-14)
An unexpected internal error has occurred. This error could be caused An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern. by a bug in PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_BADCOUNT (-15) PCRE_ERROR_BADCOUNT (-15)
This error is given if the value of the ovecsize argument is negative. This error is given if the value of the ovecsize argument is negative.
PCRE_ERROR_RECURSIONLIMIT (-21) PCRE_ERROR_RECURSIONLIMIT (-21)
The internal recursion limit, as specified by the match_limit_recursion The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the field in a pcre_extra structure (or defaulted) was reached. See the de-
description above. scription above.
PCRE_ERROR_BADNEWLINE (-23) PCRE_ERROR_BADNEWLINE (-23)
An invalid combination of PCRE_NEWLINE_xxx options was given. An invalid combination of PCRE_NEWLINE_xxx options was given.
PCRE_ERROR_BADOFFSET (-24) PCRE_ERROR_BADOFFSET (-24)
The value of startoffset was negative or greater than the length of the The value of startoffset was negative or greater than the length of the
subject, that is, the value in length. subject, that is, the value in length.
PCRE_ERROR_SHORTUTF8 (-25) PCRE_ERROR_SHORTUTF8 (-25)
This error is returned instead of PCRE_ERROR_BADUTF8 when the subject This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
option is set. Information about the failure is returned as for option is set. Information about the failure is returned as for
PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
this special error code for PCRE_PARTIAL_HARD precedes the implementa- this special error code for PCRE_PARTIAL_HARD precedes the implementa-
tion of returned information; it is retained for backwards compatibil- tion of returned information; it is retained for backwards compatibil-
ity. ity.
PCRE_ERROR_RECURSELOOP (-26) PCRE_ERROR_RECURSELOOP (-26)
This error is returned when pcre_exec() detects a recursion loop within This error is returned when pcre_exec() detects a recursion loop within
the pattern. Specifically, it means that either the whole pattern or a the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same subpattern has been called recursively for the second time at the same
position in the subject string. Some simple patterns that might do this position in the subject string. Some simple patterns that might do this
are detected and faulted at compile time, but more complicated cases, are detected and faulted at compile time, but more complicated cases,
in particular mutual recursions between two different subpatterns, can- in particular mutual recursions between two different subpatterns, can-
not be detected until run time. not be detected until run time.
PCRE_ERROR_JIT_STACKLIMIT (-27) PCRE_ERROR_JIT_STACKLIMIT (-27)
This error is returned when a pattern that was successfully studied This error is returned when a pattern that was successfully studied us-
using a JIT compile option is being matched, but the memory available ing a JIT compile option is being matched, but the memory available for
for the just-in-time processing stack is not large enough. See the the just-in-time processing stack is not large enough. See the pcrejit
pcrejit documentation for more details. documentation for more details.
PCRE_ERROR_BADMODE (-28) PCRE_ERROR_BADMODE (-28)
This error is given if a pattern that was compiled by the 8-bit library This error is given if a pattern that was compiled by the 8-bit library
is passed to a 16-bit or 32-bit library function, or vice versa. is passed to a 16-bit or 32-bit library function, or vice versa.
PCRE_ERROR_BADENDIANNESS (-29) PCRE_ERROR_BADENDIANNESS (-29)
This error is given if a pattern that was compiled and saved is This error is given if a pattern that was compiled and saved is
reloaded on a host with different endianness. The utility function reloaded on a host with different endianness. The utility function
pcre_pattern_to_host_byte_order() can be used to convert such a pattern pcre_pattern_to_host_byte_order() can be used to convert such a pattern
so that it runs on the new host. so that it runs on the new host.
PCRE_ERROR_JIT_BADOPTION (-31) PCRE_ERROR_JIT_BADOPTION (-31)
This error is returned when a pattern that was successfully studied This error is returned when a pattern that was successfully studied us-
using a JIT compile option is being matched, but the matching mode ing a JIT compile option is being matched, but the matching mode (par-
(partial or complete match) does not correspond to any JIT compilation tial or complete match) does not correspond to any JIT compilation
mode. When the JIT fast path function is used, this error may be also mode. When the JIT fast path function is used, this error may be also
given for invalid options. See the pcrejit documentation for more given for invalid options. See the pcrejit documentation for more de-
details. tails.
PCRE_ERROR_BADLENGTH (-32) PCRE_ERROR_BADLENGTH (-32)
This error is given if pcre_exec() is called with a negative value for This error is given if pcre_exec() is called with a negative value for
the length argument. the length argument.
Error numbers -16 to -20, -22, and -30 are not used by pcre_exec(). Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
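For logging, it can help to turn a negative return value into its symbolic name. The following is a sketch only: it does not include pcre.h, the table is copied from the numbers listed in this section, and exec_error_name() and exec_error_code() are hypothetical helpers rather than PCRE functions. A real program would normally use the macros from the header instead of literal numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct exec_error { int code; const char *name; };

/* Codes that pcre_exec() can return, as listed in the text. */
static const struct exec_error exec_errors[] = {
    {  -1, "PCRE_ERROR_NOMATCH" },        {  -2, "PCRE_ERROR_NULL" },
    {  -3, "PCRE_ERROR_BADOPTION" },      {  -4, "PCRE_ERROR_BADMAGIC" },
    {  -5, "PCRE_ERROR_UNKNOWN_OPCODE" }, {  -6, "PCRE_ERROR_NOMEMORY" },
    {  -7, "PCRE_ERROR_NOSUBSTRING" },    {  -8, "PCRE_ERROR_MATCHLIMIT" },
    {  -9, "PCRE_ERROR_CALLOUT" },        { -10, "PCRE_ERROR_BADUTF8" },
    { -11, "PCRE_ERROR_BADUTF8_OFFSET" }, { -12, "PCRE_ERROR_PARTIAL" },
    { -13, "PCRE_ERROR_BADPARTIAL" },     { -14, "PCRE_ERROR_INTERNAL" },
    { -15, "PCRE_ERROR_BADCOUNT" },       { -21, "PCRE_ERROR_RECURSIONLIMIT" },
    { -23, "PCRE_ERROR_BADNEWLINE" },     { -24, "PCRE_ERROR_BADOFFSET" },
    { -25, "PCRE_ERROR_SHORTUTF8" },      { -26, "PCRE_ERROR_RECURSELOOP" },
    { -27, "PCRE_ERROR_JIT_STACKLIMIT" }, { -28, "PCRE_ERROR_BADMODE" },
    { -29, "PCRE_ERROR_BADENDIANNESS" },
    { -31, "PCRE_ERROR_JIT_BADOPTION" },  /* value taken from pcre.h */
    { -32, "PCRE_ERROR_BADLENGTH" }
};

#define EXEC_ERROR_COUNT (sizeof exec_errors / sizeof exec_errors[0])

/* Returns the symbolic name for rc, or a placeholder string. */
const char *exec_error_name(int rc)
{
    size_t i;
    for (i = 0; i < EXEC_ERROR_COUNT; i++)
        if (exec_errors[i].code == rc)
            return exec_errors[i].name;
    return rc >= 0 ? "(not an error)" : "(unused or unknown error code)";
}

/* Reverse lookup: returns the code for a name, or 0 if not found. */
int exec_error_code(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < EXEC_ERROR_COUNT; i++)
        if (strcmp(exec_errors[i].name, name) == 0)
            return exec_errors[i].code;
    return 0;
}
```

Returning 0 from the reverse lookup is safe as a "not found" marker because every error code is negative.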
Reason codes for invalid UTF-8 strings Reason codes for invalid UTF-8 strings
This section applies only to the 8-bit library. The corresponding This section applies only to the 8-bit library. The corresponding in-
information for the 16-bit and 32-bit libraries is given in the pcre16 formation for the 16-bit and 32-bit libraries is given in the pcre16
and pcre32 pages. and pcre32 pages.
When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT- When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
UTF8, and the size of the output vector (ovecsize) is at least 2, the UTF8, and the size of the output vector (ovecsize) is at least 2, the
offset of the start of the invalid UTF-8 character is placed in the offset of the start of the invalid UTF-8 character is placed in the
first output vector element (ovector[0]) and a reason code is placed in first output vector element (ovector[0]) and a reason code is placed in
the second element (ovector[1]). The reason codes are given names in the second element (ovector[1]). The reason codes are given names in
the pcre.h header file: the pcre.h header file:
PCRE_UTF8_ERR1 PCRE_UTF8_ERR1
PCRE_UTF8_ERR2 PCRE_UTF8_ERR2
PCRE_UTF8_ERR3 PCRE_UTF8_ERR3
PCRE_UTF8_ERR4 PCRE_UTF8_ERR4
PCRE_UTF8_ERR5 PCRE_UTF8_ERR5
The string ends with a truncated UTF-8 character; the code specifies The string ends with a truncated UTF-8 character; the code specifies
how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
characters to be no longer than 4 bytes, the encoding scheme (origi- characters to be no longer than 4 bytes, the encoding scheme (origi-
nally defined by RFC 2279) allows for up to 6 bytes, and this is nally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes. checked first; hence the possibility of 4 or 5 missing bytes.
PCRE_UTF8_ERR6 PCRE_UTF8_ERR6
PCRE_UTF8_ERR7 PCRE_UTF8_ERR7
PCRE_UTF8_ERR8 PCRE_UTF8_ERR8
PCRE_UTF8_ERR9 PCRE_UTF8_ERR9
PCRE_UTF8_ERR10 PCRE_UTF8_ERR10
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
the character do not have the binary value 0b10 (that is, either the the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1). most significant bit is 0, or the next bit is 1).
PCRE_UTF8_ERR11 PCRE_UTF8_ERR11
PCRE_UTF8_ERR12 PCRE_UTF8_ERR12
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629. long; these code points are excluded by RFC 3629.
PCRE_UTF8_ERR13 PCRE_UTF8_ERR13
A 4-byte character has a value greater than 0x10ffff; these code points A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629. are excluded by RFC 3629.
PCRE_UTF8_ERR14 PCRE_UTF8_ERR14
A 3-byte character has a value in the range 0xd800 to 0xdfff; this A 3-byte character has a value in the range 0xd800 to 0xdfff; this
range of code points is reserved by RFC 3629 for use with UTF-16, and range of code points is reserved by RFC 3629 for use with UTF-16, and
so are excluded from UTF-8. so are excluded from UTF-8.
PCRE_UTF8_ERR15 PCRE_UTF8_ERR15
PCRE_UTF8_ERR16 PCRE_UTF8_ERR16
PCRE_UTF8_ERR17 PCRE_UTF8_ERR17
PCRE_UTF8_ERR18 PCRE_UTF8_ERR18
PCRE_UTF8_ERR19 PCRE_UTF8_ERR19
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
for a value that can be represented by fewer bytes, which is invalid. for a value that can be represented by fewer bytes, which is invalid.
For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor- For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
rect coding uses just one byte. rect coding uses just one byte.
PCRE_UTF8_ERR20 PCRE_UTF8_ERR20
The two most significant bits of the first byte of a character have the The two most significant bits of the first byte of a character have the
binary value 0b10 (that is, the most significant bit is 1 and the sec- binary value 0b10 (that is, the most significant bit is 1 and the sec-
ond is 0). Such a byte can only validly occur as the second or subse- ond is 0). Such a byte can only validly occur as the second or subse-
quent byte of a multi-byte character. quent byte of a multi-byte character.
PCRE_UTF8_ERR21 PCRE_UTF8_ERR21
The first byte of a character has the value 0xfe or 0xff. These values The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string. can never occur in a valid UTF-8 string.
PCRE_UTF8_ERR22 PCRE_UTF8_ERR22
This error code was formerly used when the presence of a so-called This error code was formerly used when the presence of a so-called
"non-character" caused an error. Unicode corrigendum #9 makes it clear "non-character" caused an error. Unicode corrigendum #9 makes it clear
that such characters should not cause a string to be rejected, and so that such characters should not cause a string to be rejected, and so
this code is no longer in use and is never returned. this code is no longer in use and is never returned.
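The reason codes fall into a small number of groups, which a caller might want to collapse before reporting a problem. The sketch below assumes that PCRE_UTF8_ERRn has the numeric value n, which agrees with the phrase "reason codes 1 to 5" used earlier for truncated characters. The enum and both function names are hypothetical and do not come from pcre.h. The second function works through the overlong example from the text.

```c
#include <assert.h>

/* Hypothetical grouping of the reason codes described above. */
enum utf8_problem {
    UTF8_TRUNCATED,         /* codes 1 to 5   */
    UTF8_BAD_CONTINUATION,  /* codes 6 to 10  */
    UTF8_TOO_LONG,          /* codes 11, 12   */
    UTF8_OUT_OF_RANGE,      /* code 13        */
    UTF8_SURROGATE,         /* code 14        */
    UTF8_OVERLONG,          /* codes 15 to 19 */
    UTF8_STRAY_CONTINUATION,/* code 20        */
    UTF8_INVALID_LEAD,      /* code 21        */
    UTF8_NONCHARACTER,      /* code 22, no longer returned */
    UTF8_UNKNOWN
};

/* Maps a reason code (assumed to equal n for PCRE_UTF8_ERRn). */
enum utf8_problem utf8_classify(int reason)
{
    if (reason >= 1 && reason <= 5)   return UTF8_TRUNCATED;
    if (reason >= 6 && reason <= 10)  return UTF8_BAD_CONTINUATION;
    if (reason == 11 || reason == 12) return UTF8_TOO_LONG;
    if (reason == 13)                 return UTF8_OUT_OF_RANGE;
    if (reason == 14)                 return UTF8_SURROGATE;
    if (reason >= 15 && reason <= 19) return UTF8_OVERLONG;
    if (reason == 20)                 return UTF8_STRAY_CONTINUATION;
    if (reason == 21)                 return UTF8_INVALID_LEAD;
    if (reason == 22)                 return UTF8_NONCHARACTER;
    return UTF8_UNKNOWN;
}

/* Decodes a 2-byte sequence with no overlong check: five payload
   bits come from the first byte and six from the second. */
unsigned decode_two_bytes(unsigned char b0, unsigned char b1)
{
    assert((b0 & 0xe0) == 0xc0 && (b1 & 0xc0) == 0x80);
    return ((unsigned)(b0 & 0x1f) << 6) | (unsigned)(b1 & 0x3f);
}
```

Calling decode_two_bytes(0xc0, 0xae) yields 0x2e, a value below 0x80 that fits in a single byte, which is why that sequence is rejected as overlong.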
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_copy_substring(const char *subject, int *ovector, int pcre_copy_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, char *buffer, int stringcount, int stringnumber, char *buffer,
int buffersize); int buffersize);
int pcre_get_substring(const char *subject, int *ovector, int pcre_get_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, int stringcount, int stringnumber,
const char **stringptr); const char **stringptr);
int pcre_get_substring_list(const char *subject, int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr); int *ovector, int stringcount, const char ***listptr);
Captured substrings can be accessed directly by using the offsets Captured substrings can be accessed directly by using the offsets re-
returned by pcre_exec() in ovector. For convenience, the functions turned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
string_list() are provided for extracting captured substrings as new, string_list() are provided for extracting captured substrings as new,
separate, zero-terminated strings. These functions identify substrings separate, zero-terminated strings. These functions identify substrings
by number. The next section describes functions for extracting named by number. The next section describes functions for extracting named
substrings. substrings.
A substring that contains a binary zero is correctly extracted and has A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and pcre_get_sub- length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is string(). Unfortunately, the interface to pcre_get_substring_list() is
not adequate for handling strings containing binary zeros, because the not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated. end of the final string is not independently indicated.
The first three arguments are the same for all three of these func- The first three arguments are the same for all three of these func-
tions: subject is the subject string that has just been successfully tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that passed to pcre_exec(), and stringcount is the number of substrings that
were captured by the match, including the substring that matched the were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if entire regular expression. This is the value returned by pcre_exec() if
it is greater than zero. If pcre_exec() returned zero, indicating that it is greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount should it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three. be the number of elements in the vector divided by three.
The functions pcre_copy_substring() and pcre_get_substring() extract a The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For pcre_copy_sub- higher values extract the captured substrings. For pcre_copy_sub-
string(), the string is placed in buffer, whose length is given by string(), the string is placed in buffer, whose length is given by
buffersize, while for pcre_get_substring() a new block of memory is buffersize, while for pcre_get_substring() a new block of memory is ob-
obtained via pcre_malloc, and its address is returned via stringptr. tained via pcre_malloc, and its address is returned via stringptr. The
The yield of the function is the length of the string, not including yield of the function is the length of the string, not including the
the terminating zero, or one of these error codes: terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6) PCRE_ERROR_NOMEMORY (-6)
The buffer was too small for pcre_copy_substring(), or the attempt to The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring(). get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7) PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber. There is no substring whose number is stringnumber.
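To make the return-value contract concrete, here is a sketch that mimics the documented behaviour of pcre_copy_substring() using nothing but the offsets in ovector. It is not the library's implementation: copy_substring_sketch() and the two SKETCH_ macros are hypothetical, with the macros standing in for the error values -6 and -7 quoted above. Treating an unset substring as an empty string is taken from the description of these functions in this section.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)  /* stands in for PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)  /* stands in for PCRE_ERROR_NOSUBSTRING */

/* Copies substring number stringnumber into buffer and returns its
   length (excluding the terminating zero), or a negative error value. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    /* An unset substring (offsets of -1) is returned as empty. */
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;

    if (buffersize < length + 1)   /* need room for the final zero */
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Because the length is returned separately, a caller can still process a substring that contains a binary zero, which is the point made earlier about not relying on C string functions for such data.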
The pcre_get_substring_list() function extracts all available sub- The pcre_get_substring_list() function extracts all available sub-
strings and builds a list of pointers to them. All this is done in a strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of single block of memory that is obtained via pcre_malloc. The address of
the memory block is returned via listptr, which is also the start of the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or the pointer. The yield of the function is zero if all went well, or the er-
error code ror code
PCRE_ERROR_NOMEMORY (-6) PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed. if the attempt to get the memory block failed.
When any of these functions encounters a substring that is unset, which When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, they return an the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub- empty string. This can be distinguished from a genuine zero-length sub-
string by inspecting the appropriate offset in ovector, which is nega- string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings. tive for unset substrings.
The two convenience functions pcre_free_substring() and pcre_free_sub- The two convenience functions pcre_free_substring() and pcre_free_sub-
string_list() can be used to free the memory returned by a previous string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec- call of pcre_get_substring() or pcre_get_substring_list(), respec-
tively. They do nothing more than call the function pointed to by tively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program. pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a spe- However, PCRE is used in some situations where it is linked via a spe-
cial interface to another programming language that cannot use cial interface to another programming language that cannot use
pcre_free directly; it is for these cases that the functions are pro- pcre_free directly; it is for these cases that the functions are pro-
vided. vided.
EXTRACTING CAPTURED SUBSTRINGS BY NAME EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre_get_stringnumber(const pcre *code, int pcre_get_stringnumber(const pcre *code,
const char *name); const char *name);
int pcre_copy_named_substring(const pcre *code, int pcre_copy_named_substring(const pcre *code,
const char *subject, int *ovector, const char *subject, int *ovector,
int stringcount, const char *stringname, int stringcount, const char *stringname,
char *buffer, int buffersize); char *buffer, int buffersize);
int pcre_get_named_substring(const pcre *code, int pcre_get_named_substring(const pcre *code,
const char *subject, int *ovector, const char *subject, int *ovector,
int stringcount, const char *stringname, int stringcount, const char *stringname,
const char **stringptr); const char **stringptr);
To extract a substring by name, you first have to find the associated To extract a substring by name, you first have to find the associated
number. For example, for this pattern number. For example, for this pattern
(a+)b(?<xxx>\d+)... (a+)b(?<xxx>\d+)...
the number of the subpattern called "xxx" is 2. If the name is known to the number of the subpattern called "xxx" is 2. If the name is known to
be unique (PCRE_DUPNAMES was not set), you can find the number from the be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the com- name by calling pcre_get_stringnumber(). The first argument is the com-
piled pattern, and the second is the name. The yield of the function is piled pattern, and the second is the name. The yield of the function is
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name. subpattern of that name.
Given the number, you can extract the substring directly, or use one of Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there the functions described in the previous section. For convenience, there
are also two functions that do the whole job. are also two functions that do the whole job.
Most of the arguments of pcre_copy_named_substring() and Most of the arguments of pcre_copy_named_substring() and
pcre_get_named_substring() are the same as those for the similarly pcre_get_named_substring() are the same as those for the similarly
named functions that extract by number. As these are described in the named functions that extract by number. As these are described in the
previous section, they are not re-described here. There are just two previous section, they are not re-described here. There are just two
differences: differences:
First, instead of a substring number, a substring name is given. Sec- First, instead of a substring number, a substring name is given. Sec-
ond, there is an extra argument, given at the start, which is a pointer ond, there is an extra argument, given at the start, which is a pointer
to the compiled pattern. This is needed in order to gain access to the to the compiled pattern. This is needed in order to gain access to the
name-to-number translation table. name-to-number translation table.
These functions call pcre_get_stringnumber(), and if it succeeds, they These functions call pcre_get_stringnumber(), and if it succeeds, they
then call pcre_copy_substring() or pcre_get_substring(), as appropri- then call pcre_copy_substring() or pcre_get_substring(), as appropri-
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section). behaviour may not be what you want (see the next section).
Warning: If the pattern uses the (?| feature to set up multiple subpat- Warning: If the pattern uses the (?| feature to set up multiple subpat-
terns with the same number, as described in the section on duplicate terns with the same number, as described in the section on duplicate
subpattern numbers in the pcrepattern page, you cannot use names to subpattern numbers in the pcrepattern page, you cannot use names to
distinguish the different subpatterns, because names are not included distinguish the different subpatterns, because names are not included
in the compiled code. The matching process uses only numbers. For this in the compiled code. The matching process uses only numbers. For this
reason, the use of different names for subpatterns of the same number reason, the use of different names for subpatterns of the same number
causes an error at compile time. causes an error at compile time.
DUPLICATE SUBPATTERN NAMES DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code, int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last); const char *name, char **first, char **last);
When a pattern is compiled with the PCRE_DUPNAMES option, names for When a pattern is compiled with the PCRE_DUPNAMES option, names for
subpatterns are not required to be unique. (Duplicate names are always subpatterns are not required to be unique. (Duplicate names are always
allowed for subpatterns with the same number, created by using the (?| allowed for subpatterns with the same number, created by using the (?|
feature. Indeed, if such subpatterns are named, they are required to feature. Indeed, if such subpatterns are named, they are required to
use the same names.) use the same names.)
Normally, patterns with duplicate names are such that in any one match, Normally, patterns with duplicate names are such that in any one match,
only one of the named subpatterns participates. An example is shown in only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation. the pcrepattern documentation.
When duplicates are present, pcre_copy_named_substring() and When duplicates are present, pcre_copy_named_substring() and
pcre_get_named_substring() return the first substring corresponding to pcre_get_named_substring() return the first substring corresponding to
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
(-7) is returned; no data is returned. The pcre_get_stringnumber() (-7) is returned; no data is returned. The pcre_get_stringnumber()
function returns one of the numbers that are associated with the name, function returns one of the numbers that are associated with the name,
but it is not defined which it is. but it is not defined which it is.
If you want to get full details of all captured substrings for a given If you want to get full details of all captured substrings for a given
name, you must use the pcre_get_stringtable_entries() function. The name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The first argument is the compiled pattern, and the second is the name. The
third and fourth are pointers to variables which are updated by the third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in function. After it has run, they point to the first and last entries in
the name-to-number table for the given name. The function itself the name-to-number table for the given name. The function itself re-
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if turns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if there
there are none. The format of the table is described above in the sec- are none. The format of the table is described above in the section en-
tion entitled Information about a pattern above. Given all the rele- titled Information about a pattern above. Given all the relevant en-
vant entries for the name, you can extract each of their numbers, and tries for the name, you can extract each of their numbers, and hence
hence the captured data, if any. the captured data, if any.
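The walk over the relevant entries can be sketched as follows. This is a minimal sketch for the 8-bit library only, assuming the table layout given in the section on information about a pattern: each entry starts with the group number in two bytes, most significant first, followed by the zero-terminated name. The helper names entry_group_number() and first_set_group() are hypothetical and not part of PCRE. The first and last pointers and the entry size are assumed to come from pcre_get_stringtable_entries(), and ovector and rc from a successful pcre_exec() call.

```c
#include <assert.h>

/* Group number held in the first two bytes of a name-table entry
   (8-bit library: most significant byte first). */
static int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Hypothetical helper: scan the entries from first to last (inclusive),
   each entry_size bytes long, and return the number of the first group
   that was set in the match, or -1 if none was set.
   rc is the positive yield of pcre_exec(); ovector is its output vector. */
static int first_set_group(const char *first, const char *last,
                           int entry_size, const int *ovector, int rc)
{
    const char *entry;

    assert(entry_size > 2);
    for (entry = first; entry <= last; entry += entry_size) {
        int n = entry_group_number(entry);
        /* Groups numbered rc or higher were not captured; an unset group
           below rc has both of its offsets set to -1. */
        if (n < rc && ovector[2 * n] >= 0)
            return n;
    }
    return -1;
}
```

Given the group number, the captured data lies between ovector[2*n] and ovector[2*n+1] in the subject.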
FINDING ALL POSSIBLE MATCHES FINDING ALL POSSIBLE MATCHES
The traditional matching function uses a similar algorithm to Perl, The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in which stops when it finds the first match, starting at a given point in
the subject. If you want to find all possible matches, or the longest the subject. If you want to find all possible matches, or the longest
possible match, consider using the alternative matching function (see possible match, consider using the alternative matching function (see
below) instead. If you cannot use the alternative function, but still below) instead. If you cannot use the alternative function, but still
need to find all possible matches, you can kludge it up by making use need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen- of the callout facility, which is described in the pcrecallout documen-
tation. tation.
What you have to do is to insert a callout right at the end of the pat- What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur- tern. When your callout function is called, extract and save the cur-
rent matched substring. Then return 1, which forces pcre_exec() to rent matched substring. Then return 1, which forces pcre_exec() to
backtrack and try other alternatives. Ultimately, when it runs out of backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
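The callout body for this kludge might look like the sketch below. To keep it compilable without the library, it uses a stand-in structure, callout_view, that holds only the fields of the callout block the sketch reads; with PCRE itself you would include pcre.h and take a pcre_callout_block pointer instead. The match_list collector, its size limits, and the function name save_and_backtrack() are all hypothetical, and the collector is assumed to reach the callout through the callout_data field.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the fields of pcre_callout_block that this sketch reads.
   With the real library, include <pcre.h> and use pcre_callout_block. */
typedef struct {
    const char *subject;
    int start_match;
    int current_position;
    void *callout_data;
} callout_view;

#define MAX_SAVED 16   /* hypothetical limit on saved matches */
#define MAX_LEN   64   /* hypothetical limit on match length  */

/* Hypothetical collector handed to the callout via callout_data. */
typedef struct {
    int count;
    char text[MAX_SAVED][MAX_LEN];
} match_list;

/* Save the current matched substring, then return 1 so that the
   matcher backtracks and tries other alternatives. */
static int save_and_backtrack(callout_view *cb)
{
    match_list *list = (match_list *)cb->callout_data;
    int len = cb->current_position - cb->start_match;

    assert(list != NULL && len >= 0);
    if (list->count < MAX_SAVED && len < MAX_LEN) {
        memcpy(list->text[list->count], cb->subject + cb->start_match,
               (size_t)len);
        list->text[list->count][len] = '\0';
        list->count++;
    }
    return 1;
}
```

Matches that do not fit the fixed-size collector are silently dropped in this sketch; a real application would probably grow the storage instead.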
OBTAINING AN ESTIMATE OF STACK USAGE OBTAINING AN ESTIMATE OF STACK USAGE
Matching certain patterns using pcre_exec() can use a lot of process Matching certain patterns using pcre_exec() can use a lot of process
stack, which in certain environments can be rather limited in size. stack, which in certain environments can be rather limited in size.
Some users find it helpful to have an estimate of the amount of stack Some users find it helpful to have an estimate of the amount of stack
that is used by pcre_exec(), to help them set recursion limits, as that is used by pcre_exec(), to help them set recursion limits, as de-
described in the pcrestack documentation. The estimate that is output scribed in the pcrestack documentation. The estimate that is output by
by pcretest when called with the -m and -C options is obtained by call- pcretest when called with the -m and -C options is obtained by calling
ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
first five arguments. first five arguments.
Normally, if its first argument is NULL, pcre_exec() immediately Normally, if its first argument is NULL, pcre_exec() immediately re-
returns the negative error code PCRE_ERROR_NULL, but with this special turns the negative error code PCRE_ERROR_NULL, but with this special
combination of arguments, it returns instead a negative number whose combination of arguments, it returns instead a negative number whose
absolute value is the approximate stack frame size in bytes. (A nega- absolute value is the approximate stack frame size in bytes. (A nega-
tive number is used so that it is clear that no match has happened.) tive number is used so that it is clear that no match has happened.)
The value is approximate because in some cases, recursive calls to The value is approximate because in some cases, recursive calls to
pcre_exec() occur when there are one or two additional variables on the pcre_exec() occur when there are one or two additional variables on the
stack. stack.
If PCRE has been compiled to use the heap instead of the stack for If PCRE has been compiled to use the heap instead of the stack for re-
recursion, the value returned is the size of each block that is cursion, the value returned is the size of each block that is obtained
obtained from the heap. from the heap.
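One way to use the estimate is sketched below, under stated assumptions. The helpers frame_size_from_probe() and recursion_limit_for() are hypothetical names, not PCRE functions, and probe_result is assumed to be the value returned by the special pcre_exec() call described above. The sketch simply divides a stack budget chosen by the application by the approximate frame size.

```c
#include <assert.h>

/* Hypothetical helper: turn the negative value returned by the special
   pcre_exec() probe call into a frame size in bytes (0 if the value is
   not negative). */
static long frame_size_from_probe(int probe_result)
{
    return probe_result < 0 ? -(long)probe_result : 0;
}

/* Hypothetical helper: largest recursion depth whose frames fit into
   stack_budget bytes; returns 0 when no estimate is available. */
static unsigned long recursion_limit_for(unsigned long stack_budget,
                                         int probe_result)
{
    long frame = frame_size_from_probe(probe_result);

    assert(stack_budget > 0);
    return frame > 0 ? stack_budget / (unsigned long)frame : 0;
}
```

Because the frame size is only approximate, a result obtained this way is best treated as an upper bound and reduced by a safety margin before it is stored in the match_limit_recursion field.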
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset, const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize, int options, int *ovector, int ovecsize,
int *workspace, int wscount); int *workspace, int wscount);
The function pcre_dfa_exec() is called to match a subject string The function pcre_dfa_exec() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the against a compiled pattern, using a matching algorithm that scans the
subject string just once, and does not backtrack. This has different subject string just once, and does not backtrack. This has different
characteristics to the normal algorithm, and is not compatible with characteristics to the normal algorithm, and is not compatible with
Perl. Some of the features of PCRE patterns are not supported. Never- Perl. Some of the features of PCRE patterns are not supported. Never-
theless, there are times when this kind of matching can be useful. For theless, there are times when this kind of matching can be useful. For
a discussion of the two matching algorithms, and a list of features a discussion of the two matching algorithms, and a list of features
that pcre_dfa_exec() does not support, see the pcrematching documenta- that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion. tion.
The arguments for the pcre_dfa_exec() function are the same as for The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ- pcre_exec(), plus two extras. The ovector argument is used in a differ-
ent way, and this is described below. The other common arguments are ent way, and this is described below. The other common arguments are
used in the same way as for pcre_exec(), so their description is not used in the same way as for pcre_exec(), so their description is not
repeated here. repeated here.
The two additional arguments provide workspace for the function. The The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More keeping track of multiple paths through the pattern tree. More
workspace will be needed for patterns and subjects where there are a workspace will be needed for patterns and subjects where there are a
lot of potential matches. lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec(): Here is an example of a simple call to pcre_dfa_exec():
int rc; int rc;
int ovector[10]; int ovector[10];
int wspace[20]; int wspace[20];
rc = pcre_dfa_exec( rc = pcre_dfa_exec(
re, /* result of pcre_compile() */ re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */ NULL, /* we didn't study the pattern */
skipping to change at line 4009 skipping to change at line 4006
11, /* the length of the subject string */ 11, /* the length of the subject string */
0, /* start at offset 0 in the subject */ 0, /* start at offset 0 in the subject */
0, /* default options */ 0, /* default options */
ovector, /* vector of integers for substring information */ ovector, /* vector of integers for substring information */
10, /* number of elements (NOT size in bytes) */ 10, /* number of elements (NOT size in bytes) */
wspace, /* working space vector */ wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */ 20); /* number of elements (NOT size in bytes) */
Option bits for pcre_dfa_exec() Option bits for pcre_dfa_exec()
The unused bits of the options argument for pcre_dfa_exec() must be The unused bits of the options argument for pcre_dfa_exec() must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_AT-
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, START, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE,
PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR- PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT,
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last four of
four of these are exactly the same as for pcre_exec(), so their these are exactly the same as for pcre_exec(), so their description is
description is not repeated here. not repeated here.
PCRE_PARTIAL_HARD PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT PCRE_PARTIAL_SOFT
These have the same general effect as they do for pcre_exec(), but the These have the same general effect as they do for pcre_exec(), but the
details are slightly different. When PCRE_PARTIAL_HARD is set for details are slightly different. When PCRE_PARTIAL_HARD is set for
pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub- pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
ject is reached and there is still at least one matching possibility ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
of the subject is reached, there have been no complete matches, but of the subject is reached, there have been no complete matches, but
there is still at least one matching possibility. The portion of the there is still at least one matching possibility. The portion of the
string that was inspected when the longest partial match was found is string that was inspected when the longest partial match was found is
set as the first matching string in both cases. There is a more set as the first matching string in both cases. There is a more de-
detailed discussion of partial and multi-segment matching, with exam- tailed discussion of partial and multi-segment matching, with examples,
ples, in the pcrepartial documentation. in the pcrepartial documentation.
PCRE_DFA_SHORTEST PCRE_DFA_SHORTEST
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna- stop as soon as it has found one match. Because of the way the alterna-
tive algorithm works, this is necessarily the shortest possible match tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string. at the first possible matching point in the subject string.
PCRE_DFA_RESTART PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it When pcre_dfa_exec() returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with again, with additional subject characters, and have it continue with
the same match. The PCRE_DFA_RESTART option requests this action; when the same match. The PCRE_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in them vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the after a partial match. There is more discussion of this facility in the
pcrepartial documentation. pcrepartial documentation.
Successful returns from pcre_dfa_exec() Successful returns from pcre_dfa_exec()
When pcre_dfa_exec() succeeds, it may have matched more than one sub- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run string in the subject. Note, however, that all the matches from one run
of the function start at the same point in the subject. The shorter of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example, matches are all initial substrings of the longer matches. For example,
if the pattern if the pattern
<.*> <.*>
is matched against the string is matched against the string
This is <something> <something else> <something further> no more This is <something> <something else> <something further> no more
the three matched strings are the three matched strings are
<something> <something>
<something> <something else> <something> <something else>
<something> <something else> <something further> <something> <something else> <something further>
On success, the yield of the function is a number greater than zero, On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The substrings themselves which is the number of matched substrings. The substrings themselves
are returned in ovector. Each string uses two elements; the first is are returned in ovector. Each string uses two elements; the first is
the offset to the start, and the second is the offset to the end. In the offset to the start, and the second is the offset to the end. In
fact, all the strings have the same start offset. (Space could have fact, all the strings have the same start offset. (Space could have
been saved by giving this only once, but it was decided to retain some been saved by giving this only once, but it was decided to retain some
compatibility with the way pcre_exec() returns data, even though the compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.) meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long- The strings are returned in reverse order of length; that is, the long-
est matching string is given first. If there were too many matches to est matching string is given first. If there were too many matches to
fit into ovector, the yield of the function is zero, and the vector is fit into ovector, the yield of the function is zero, and the vector is
filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec() filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
can use the entire ovector for returning matched strings. can use the entire ovector for returning matched strings.
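Reading the results back can be sketched as follows. The helper names dfa_match_count() and dfa_copy_match() are hypothetical and not part of PCRE; rc is assumed to be the yield of pcre_dfa_exec(), and ovector and ovecsize the values passed to it.

```c
#include <assert.h>
#include <string.h>

/* Number of (start, end) pairs that pcre_dfa_exec() left in ovector.
   rc > 0: that many matches; rc == 0: ovector was filled completely
   with the longest matches; rc < 0: nothing to read. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0) return rc;
    if (rc == 0) return ovecsize / 2;
    return 0;
}

/* Copy match number i (0 is the longest) into buf as a C string.
   Returns the match length, or -1 if buf is too small. */
static int dfa_copy_match(const char *subject, const int *ovector, int i,
                          char *buf, int bufsize)
{
    int start = ovector[2 * i];
    int len = ovector[2 * i + 1] - start;

    assert(len >= 0);
    if (len >= bufsize) return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the <.*> example above, the three pairs would be (8,56), (8,36) and (8,19), so index 2 yields the shortest string, <something>.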
NOTE: PCRE's "auto-possessification" optimization usually applies to NOTE: PCRE's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++" because example, the pattern "a\d+" is compiled as if it were "a\d++" because
there is no point even considering the possibility of backtracking into there is no point even considering the possibility of backtracking into
the repeated digits. For DFA matching, this means that only one possi- the repeated digits. For DFA matching, this means that only one possi-
ble match is found. If you really do want multiple matches in such ble match is found. If you really do want multiple matches in such
cases, either use an ungreedy repeat ("a\d+?") or set the cases, either use an ungreedy repeat ("a\d+?") or set the
PCRE_NO_AUTO_POSSESS option when compiling. PCRE_NO_AUTO_POSSESS option when compiling.
Error returns from pcre_dfa_exec() Error returns from pcre_dfa_exec()
The pcre_dfa_exec() function returns a negative number when it fails. The pcre_dfa_exec() function returns a negative number when it fails.
Many of the errors are the same as for pcre_exec(), and these are Many of the errors are the same as for pcre_exec(), and these are de-
described above. There are in addition the following errors that are scribed above. There are in addition the following errors that are
specific to pcre_dfa_exec(): specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16) PCRE_ERROR_DFA_UITEM (-16)
This return is given if pcre_dfa_exec() encounters an item in the pat- This return is given if pcre_dfa_exec() encounters an item in the pat-
tern that it does not support, for instance, the use of \C or a back tern that it does not support, for instance, the use of \C or a back
reference. reference.
PCRE_ERROR_DFA_UCOND (-17) PCRE_ERROR_DFA_UCOND (-17)
This return is given if pcre_dfa_exec() encounters a condition item This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported. in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18) PCRE_ERROR_DFA_UMLIMIT (-18)
This return is given if pcre_dfa_exec() is called with an extra block This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit or match_limit_recursion that contains a setting of the match_limit or match_limit_recursion
fields. This is not supported (these fields are meaningless for DFA fields. This is not supported (these fields are meaningless for DFA
matching). matching).
PCRE_ERROR_DFA_WSSIZE (-19) PCRE_ERROR_DFA_WSSIZE (-19)
This return is given if pcre_dfa_exec() runs out of space in the This return is given if pcre_dfa_exec() runs out of space in the
workspace vector. workspace vector.
PCRE_ERROR_DFA_RECURSE (-20) PCRE_ERROR_DFA_RECURSE (-20)
When a recursive subpattern is processed, the matching function calls When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace. itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used. should be extremely rare, as a vector of size 1000 is used.
PCRE_ERROR_DFA_BADRESTART (-30) PCRE_ERROR_DFA_BADRESTART (-30)
When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
plausibility checks are made on the contents of the workspace, which plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these should contain data about the previous partial match. If any of these
checks fail, this error is given. checks fail, this error is given.
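A caller might map these codes to diagnostics along the following lines. This is only a sketch: dfa_error_text() is a hypothetical name, the message strings are paraphrases of the descriptions above, and the numeric values are the ones listed in this section. The guard lets the fragment compile with or without pcre.h.

```c
#include <assert.h>

/* Values as listed above; skipped when <pcre.h> already defines them. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM      (-16)
#define PCRE_ERROR_DFA_UCOND      (-17)
#define PCRE_ERROR_DFA_UMLIMIT    (-18)
#define PCRE_ERROR_DFA_WSSIZE     (-19)
#define PCRE_ERROR_DFA_RECURSE    (-20)
#define PCRE_ERROR_DFA_BADRESTART (-30)
#endif

/* Hypothetical helper: short text for the errors that are specific to
   pcre_dfa_exec(); any other negative code gets a generic answer. */
static const char *dfa_error_text(int rc)
{
    assert(rc < 0);
    switch (rc) {
    case PCRE_ERROR_DFA_UITEM:      return "unsupported pattern item";
    case PCRE_ERROR_DFA_UCOND:      return "unsupported condition";
    case PCRE_ERROR_DFA_UMLIMIT:    return "match limits not supported";
    case PCRE_ERROR_DFA_WSSIZE:     return "workspace vector too small";
    case PCRE_ERROR_DFA_RECURSE:    return "recursion output vector too small";
    case PCRE_ERROR_DFA_BADRESTART: return "bad restart data in workspace";
    default:                        return "not specific to pcre_dfa_exec()";
    }
}
```

Of these, PCRE_ERROR_DFA_WSSIZE is the one a caller can usually recover from, by retrying with a larger workspace vector.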
SEE ALSO SEE ALSO
pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre- pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
sample(3), pcrestack(3). sample(3), pcrestack(3).
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge CB2 3QH, England.
REVISION REVISION
skipping to change at line 4183 skipping to change at line 4180
DESCRIPTION DESCRIPTION
PCRE provides a feature called "callout", which is a means of temporar- PCRE provides a feature called "callout", which is a means of temporar-
ily passing control to the caller of PCRE in the middle of pattern ily passing control to the caller of PCRE in the middle of pattern
matching. The caller of PCRE provides an external function by putting matching. The caller of PCRE provides an external function by putting
its entry point in the global variable pcre_callout (pcre16_callout for its entry point in the global variable pcre_callout (pcre16_callout for
the 16-bit library, pcre32_callout for the 32-bit library). By default, the 16-bit library, pcre32_callout for the 32-bit library). By default,
this variable contains NULL, which disables all calling out. this variable contains NULL, which disables all calling out.
Within a regular expression, (?C) indicates the points at which the Within a regular expression, (?C) indicates the points at which the ex-
external function is to be called. Different callout points can be ternal function is to be called. Different callout points can be iden-
identified by putting a number less than 256 after the letter C. The tified by putting a number less than 256 after the letter C. The de-
default value is zero. For example, this pattern has two callout fault value is zero. For example, this pattern has two callout points:
points:
(?C1)abc(?C2)def (?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled, If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled,
PCRE automatically inserts callouts, all with number 255, before each PCRE automatically inserts callouts, all with number 255, before each
item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
pattern pattern
A(\d{2}|--) A(\d{2}|--)
it is processed as if it were it is processed as if it were
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
Notice that there is a callout before and after each parenthesis and Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose con- alternation bar. If the pattern contains a conditional group whose con-
dition is an assertion, an automatic callout is inserted immediately dition is an assertion, an automatic callout is inserted immediately
before the condition. Such a callout may also be inserted explicitly, before the condition. Such a callout may also be inserted explicitly,
for example: for example:
(?(?C9)(?=a)ab|de) (?(?C9)(?=a)ab|de)
This applies only to assertion conditions (because they are themselves This applies only to assertion conditions (because they are themselves
independent groups). independent groups).
Automatic callouts can be used for tracking the progress of pattern Automatic callouts can be used for tracking the progress of pattern
matching. The pcretest program has a pattern qualifier (/C) that sets matching. The pcretest program has a pattern qualifier (/C) that sets
automatic callouts; when it is used, the output indicates how the pat- automatic callouts; when it is used, the output indicates how the pat-
tern is being matched. This is useful information when you are trying tern is being matched. This is useful information when you are trying
to optimize the performance of a particular pattern. to optimize the performance of a particular pattern.
MISSING CALLOUTS MISSING CALLOUTS
You should be aware that, because of optimizations in the way PCRE com- You should be aware that, because of optimizations in the way PCRE com-
piles and matches patterns, callouts sometimes do not happen exactly as piles and matches patterns, callouts sometimes do not happen exactly as
you might expect. you might expect.
At compile time, PCRE "auto-possessifies" repeated items when it knows At compile time, PCRE "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcretest output when this pattern compiled as if it were a++[bc]. The pcretest output when this pattern
is anchored and then applied with automatic callouts to the string is anchored and then applied with automatic callouts to the string
"aaaa" is: "aaaa" is:
--->aaaa --->aaaa
+0 ^ ^ +0 ^ ^
+1 ^ a+ +1 ^ a+
+3 ^ ^ [bc] +3 ^ ^ [bc]
No match No match
This indicates that when matching [bc] fails, there is no backtracking This indicates that when matching [bc] fails, there is no backtracking
into a+ and therefore the callouts that would be taken for the back- into a+ and therefore the callouts that would be taken for the back-
tracks do not occur. You can disable the auto-possessify feature by tracks do not occur. You can disable the auto-possessify feature by
passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern
with (*NO_AUTO_POSSESS). If this is done in pcretest (using the /O with (*NO_AUTO_POSSESS). If this is done in pcretest (using the /O
qualifier), the output changes to this: qualifier), the output changes to this:
--->aaaa --->aaaa
+0 ^ ^ +0 ^ ^
+1 ^ a+ +1 ^ a+
+3 ^ ^ [bc] +3 ^ ^ [bc]
+3 ^ ^ [bc] +3 ^ ^ [bc]
+3 ^ ^ [bc] +3 ^ ^ [bc]
+3 ^^ [bc] +3 ^^ [bc]
No match No match
This time, when matching [bc] fails, the matcher backtracks into a+ and This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails. tries again, repeatedly, until a+ itself fails.
Other optimizations that provide fast "no match" results also affect Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is callouts. For example, if the pattern is
ab(?C4)cd ab(?C4)cd
PCRE knows that any matching string must contain the letter "d". If the PCRE knows that any matching string must contain the letter "d". If the
subject string is "abyz", the lack of "d" means that matching doesn't subject string is "abyz", the lack of "d" means that matching doesn't
ever start, and the callout is never reached. However, with "abyd", ever start, and the callout is never reached. However, with "abyd",
though the result is still no match, the callout is obeyed. though the result is still no match, the callout is obeyed.
If the pattern is studied, PCRE knows the minimum length of a matching If the pattern is studied, PCRE knows the minimum length of a matching
string, and will immediately give a "no match" return without actually string, and will immediately give a "no match" return without actually
running a match if the subject is not long enough, or, for unanchored running a match if the subject is not long enough, or, for unanchored
patterns, if it has been scanned far enough. patterns, if it has been scanned far enough.
You can disable these optimizations by passing the PCRE_NO_START_OPTI- You can disable these optimizations by passing the PCRE_NO_START_OPTI-
MIZE option to the matching function, or by starting the pattern with MIZE option to the matching function, or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure (*NO_START_OPT). This slows down the matching process, but does ensure
that callouts such as the example above are obeyed. that callouts such as the example above are obeyed.
THE CALLOUT INTERFACE THE CALLOUT INTERFACE
During matching, when PCRE reaches a callout point, the external func- During matching, when PCRE reaches a callout point, the external func-
tion defined by pcre_callout or pcre[16|32]_callout is called (if it is tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
set). This applies to both normal and DFA matching. The only argument set). This applies to both normal and DFA matching. The only argument
to the callout function is a pointer to a pcre_callout or to the callout function is a pointer to a pcre_callout or
pcre[16|32]_callout block. These structures contain the following pcre[16|32]_callout block. These structures contain the following
fields: fields:
int version; int version;
int callout_number; int callout_number;
int *offset_vector; int *offset_vector;
const char *subject; (8-bit version) const char *subject; (8-bit version)
PCRE_SPTR16 subject; (16-bit version) PCRE_SPTR16 subject; (16-bit version)
PCRE_SPTR32 subject; (32-bit version) PCRE_SPTR32 subject; (32-bit version)
int subject_length; int subject_length;
int start_match; int start_match;
int current_position; int current_position;
int capture_top; int capture_top;
int capture_last; int capture_last;
void *callout_data; void *callout_data;
int pattern_position; int pattern_position;
int next_item_length; int next_item_length;
const unsigned char *mark; (8-bit version) const unsigned char *mark; (8-bit version)
const PCRE_UCHAR16 *mark; (16-bit version) const PCRE_UCHAR16 *mark; (16-bit version)
const PCRE_UCHAR32 *mark; (32-bit version) const PCRE_UCHAR32 *mark; (32-bit version)
The version field is an integer containing the version number of the The version field is an integer containing the version number of the
block format. The initial version was 0; the current version is 2. The block format. The initial version was 0; the current version is 2. The
version number will change again in future if additional fields are version number will change again in future if additional fields are
added, but the intention is never to remove any of the existing fields. added, but the intention is never to remove any of the existing fields.
The callout_number field contains the number of the callout, as com- The callout_number field contains the number of the callout, as com-
piled into the pattern (that is, the number after ?C for manual call- piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts). outs, and 255 for automatically generated callouts).
The offset_vector field is a pointer to the vector of offsets that was The offset_vector field is a pointer to the vector of offsets that was
passed by the caller to the matching function. When pcre_exec() or passed by the caller to the matching function. When pcre_exec() or
pcre[16|32]_exec() is used, the contents can be inspected, in order to pcre[16|32]_exec() is used, the contents can be inspected, in order to
extract substrings that have been matched so far, in the same way as extract substrings that have been matched so far, in the same way as
for extracting substrings after a match has completed. For the DFA for extracting substrings after a match has completed. For the DFA
matching functions, this field is not useful. matching functions, this field is not useful.
The subject and subject_length fields contain copies of the values that The subject and subject_length fields contain copies of the values that
were passed to the matching function. were passed to the matching function.
The start_match field normally contains the offset within the subject The start_match field normally contains the offset within the subject
at which the current match attempt started. However, if the escape at which the current match attempt started. However, if the escape se-
sequence \K has been encountered, this value is changed to reflect the quence \K has been encountered, this value is changed to reflect the
modified starting point. If the pattern is not anchored, the callout modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern function may be called several times from the same point in the pattern
for different starting points in the subject. for different starting points in the subject.
The current_position field contains the offset within the subject of The current_position field contains the offset within the subject of
the current match pointer. the current match pointer.
When pcre_exec() or pcre[16|32]_exec() is used, the capture_top When pcre_exec() or pcre[16|32]_exec() is used, the capture_top
field contains one more than the number of the highest numbered cap- field contains one more than the number of the highest numbered cap-
tured substring so far. If no substrings have been captured, the value tured substring so far. If no substrings have been captured, the value
of capture_top is one. This is always the case when the DFA functions of capture_top is one. This is always the case when the DFA functions
are used, because they do not support captured substrings. are used, because they do not support captured substrings.
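The offset_vector and capture_top fields can be combined to look at what has been captured so far. The following is a minimal sketch, not part of PCRE: the helper name demo_capture_so_far is hypothetical, and it assumes the usual pair layout of the offset vector, in which group n occupies elements 2n and 2n+1 and an unset group holds -1. It is only meaningful for callouts from pcre_exec() or pcre[16|32]_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: report the substring captured so far by group n.
 * Returns the length of the capture and stores its start offset in *start,
 * or returns -1 if group n is out of range or currently unset. */
int demo_capture_so_far(const int *offset_vector, int capture_top,
                        int n, int *start)
{
    assert(offset_vector != NULL && start != NULL);

    /* capture_top is one more than the highest numbered capture so far,
     * so valid group numbers are 1 .. capture_top - 1. */
    if (n < 1 || n >= capture_top)
        return -1;

    int s = offset_vector[2 * n];      /* start offset of group n */
    int e = offset_vector[2 * n + 1];  /* end offset of group n */

    /* A group below capture_top may still be unset (both offsets -1). */
    if (s < 0 || e < s)
        return -1;

    *start = s;
    return e - s;
}
```

Inside a real callout function the arguments would come from the callout block (its offset_vector and capture_top fields), and the returned start and length would index into the subject field.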
The capture_last field contains the number of the most recently cap- The capture_last field contains the number of the most recently cap-
tured substring. However, when a recursion exits, the value reverts to tured substring. However, when a recursion exits, the value reverts to
what it was outside the recursion, as do the values of all captured what it was outside the recursion, as do the values of all captured
substrings. If no substrings have been captured, the value of cap- substrings. If no substrings have been captured, the value of cap-
ture_last is -1. This is always the case for the DFA matching func- ture_last is -1. This is always the case for the DFA matching func-
tions. tions.
The callout_data field contains a value that is passed to a matching The callout_data field contains a value that is passed to a matching
function specifically so that it can be passed back in callouts. It is function specifically so that it can be passed back in callouts. It is
passed in the callout_data field of a pcre_extra or pcre[16|32]_extra passed in the callout_data field of a pcre_extra or pcre[16|32]_extra
data structure. If no such data was passed, the value of callout_data data structure. If no such data was passed, the value of callout_data
in a callout block is NULL. There is a description of the pcre_extra in a callout block is NULL. There is a description of the pcre_extra
structure in the pcreapi documentation. structure in the pcreapi documentation.
The pattern_position field is present from version 1 of the callout The pattern_position field is present from version 1 of the callout
structure. It contains the offset to the next item to be matched in the structure. It contains the offset to the next item to be matched in the
pattern string. pattern string.
The next_item_length field is present from version 1 of the callout The next_item_length field is present from version 1 of the callout
structure. It contains the length of the next item to be matched in the structure. It contains the length of the next item to be matched in the
pattern string. When the callout immediately precedes an alternation pattern string. When the callout immediately precedes an alternation
bar, a closing parenthesis, or the end of the pattern, the length is bar, a closing parenthesis, or the end of the pattern, the length is
zero. When the callout precedes an opening parenthesis, the length is zero. When the callout precedes an opening parenthesis, the length is
that of the entire subpattern. that of the entire subpattern.
The pattern_position and next_item_length fields are intended to help The pattern_position and next_item_length fields are intended to help
in distinguishing between different automatic callouts, which all have in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts. the same callout number. However, they are set for all callouts.
The mark field is present from version 2 of the callout structure. In The mark field is present from version 2 of the callout structure. In
callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer
to the zero-terminated name of the most recently passed (*MARK), to the zero-terminated name of the most recently passed (*MARK),
(*PRUNE), or (*THEN) item in the match, or NULL if no such items have (*PRUNE), or (*THEN) item in the match, or NULL if no such items have
been passed. Instances of (*PRUNE) or (*THEN) without a name do not been passed. Instances of (*PRUNE) or (*THEN) without a name do not
obliterate a previous (*MARK). In callouts from the DFA matching func- obliterate a previous (*MARK). In callouts from the DFA matching func-
tions this field always contains NULL. tions this field always contains NULL.
RETURN VALUES RETURN VALUES
The external callout function returns an integer to PCRE. If the value The external callout function returns an integer to PCRE. If the value
is zero, matching proceeds as normal. If the value is greater than is zero, matching proceeds as normal. If the value is greater than
zero, matching fails at the current point, but the testing of other zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value. matching function returns the negative value.
Negative values should normally be chosen from the set of Negative values should normally be chosen from the set of PCRE_ER-
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan- ROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a standard "no
dard "no match" failure. The error number PCRE_ERROR_CALLOUT is match" failure. The error number PCRE_ERROR_CALLOUT is reserved for
reserved for use by callout functions; it will never be used by PCRE use by callout functions; it will never be used by PCRE itself.
itself.
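The three kinds of return value can be illustrated with a small sketch. To stay compilable without the PCRE headers, it declares a hypothetical stand-in structure, demo_callout_block, holding only a subset of the fields listed earlier; real code would receive a pointer to the callout block type declared in pcre.h. The constant DEMO_ERROR_CALLOUT is likewise a placeholder for PCRE_ERROR_CALLOUT, and the callout number 4 and the limit of 16 are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for PCRE_ERROR_CALLOUT; real code should take the value
 * from pcre.h rather than hard-coding it. */
#define DEMO_ERROR_CALLOUT (-9)

/* Hypothetical stand-in for the callout block (subset of fields only). */
typedef struct {
    int   version;
    int   callout_number;
    int   subject_length;
    int   start_match;
    int   current_position;
    void *callout_data;
} demo_callout_block;

/* Sketch of a callout function showing all three kinds of return value. */
int demo_callout(demo_callout_block *cb)
{
    assert(cb != NULL);

    /* callout_data, if supplied, is assumed to point to an int holding the
     * number of callouts still permitted. Once it is used up, return a
     * negative value: the match is abandoned with that value. */
    int *budget = (int *)cb->callout_data;
    if (budget != NULL && --(*budget) < 0)
        return DEMO_ERROR_CALLOUT;

    /* At callout number 4, reject match attempts that have already consumed
     * more than 16 code units. A positive value fails at this point only,
     * and other matching possibilities are still tried. */
    if (cb->callout_number == 4 &&
        cb->current_position - cb->start_match > 16)
        return 1;

    /* Zero: matching proceeds as normal. */
    return 0;
}
```

With the real library, a function of this shape would be assigned to pcre_callout, and the budget counter would be passed through the callout_data field of a pcre_extra block.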
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge CB2 3QH, England.
REVISION REVISION
Last updated: 12 November 2013 Last updated: 12 November 2013
skipping to change at line 4410 skipping to change at line 4405
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
PCRECOMPAT(3) Library Functions Manual PCRECOMPAT(3) PCRECOMPAT(3) Library Functions Manual PCRECOMPAT(3)
NAME NAME
PCRE - Perl-compatible regular expressions PCRE - Perl-compatible regular expressions
DIFFERENCES BETWEEN PCRE AND PERL DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl This document describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with handle regular expressions. The differences described here are with re-
respect to Perl versions 5.10 and above. spect to Perl versions 5.10 and above.
1. PCRE has only a subset of Perl's Unicode support. Details of what it 1. PCRE has only a subset of Perl's Unicode support. Details of what it
does have are given in the pcreunicode page. does have are given in the pcreunicode page.
2. PCRE allows repeat quantifiers only on parenthesized assertions, but 2. PCRE allows repeat quantifiers only on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not they do not mean what you might think. For example, (?!a){3} does not
assert that the next three characters are not "a". It just asserts that assert that the next three characters are not "a". It just asserts that
the next character is not "a" three times (in principle: PCRE optimizes the next character is not "a" three times (in principle: PCRE optimizes
this to run the assertion just once). Perl allows repeat quantifiers on this to run the assertion just once). Perl allows repeat quantifiers on
other assertions such as \b, but these do not seem to have any use. other assertions such as \b, but these do not seem to have any use.
skipping to change at line 4482 skipping to change at line 4477
tern matching. See the pcrecallout documentation for details. tern matching. See the pcrecallout documentation for details.
9. Subpatterns that are called as subroutines (whether or not recur- 9. Subpatterns that are called as subroutines (whether or not recur-
sively) are always treated as atomic groups in PCRE. This is like sively) are always treated as atomic groups in PCRE. This is like
Python, but unlike Perl. Captured values that are set outside a sub- Python, but unlike Perl. Captured values that are set outside a sub-
routine call can be referenced from inside in PCRE, but not in Perl. routine call can be referenced from inside in PCRE, but not in Perl.
There is a discussion that explains these differences in more detail in There is a discussion that explains these differences in more detail in
the section on recursion differences from Perl in the pcrepattern page. the section on recursion differences from Perl in the pcrepattern page.
10. If any of the backtracking control verbs are used in a subpattern 10. If any of the backtracking control verbs are used in a subpattern
that is called as a subroutine (whether or not recursively), their that is called as a subroutine (whether or not recursively), their ef-
effect is confined to that subpattern; it does not extend to the sur- fect is confined to that subpattern; it does not extend to the sur-
rounding pattern. This is not always the case in Perl. In particular, rounding pattern. This is not always the case in Perl. In particular,
if (*THEN) is present in a group that is called as a subroutine, its if (*THEN) is present in a group that is called as a subroutine, its
action is limited to that group, even if the group does not contain any action is limited to that group, even if the group does not contain any
| characters. Note that such subpatterns are processed as anchored at | characters. Note that such subpatterns are processed as anchored at
the point where they are tested. the point where they are tested.
11. If a pattern contains more than one backtracking control verb, the 11. If a pattern contains more than one backtracking control verb, the
first one that is backtracked onto acts. For example, in the pattern first one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
it is the same as PCRE, but there are examples where it differs. it is the same as PCRE, but there are examples where it differs.
12. Most backtracking verbs in assertions have their normal actions. 12. Most backtracking verbs in assertions have their normal actions.
They are not confined to the assertion. They are not confined to the assertion.
13. There are some differences that are concerned with the settings of 13. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example, captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
unset, but in PCRE it is set to "b". set, but in PCRE it is set to "b".
14. PCRE's handling of duplicate subpattern numbers and duplicate sub- 14. PCRE's handling of duplicate subpattern numbers and duplicate sub-
pattern names is not as general as Perl's. This is a consequence of the pattern names is not as general as Perl's. This is a consequence of the
fact that PCRE works internally just with numbers, using an external fact that PCRE works internally just with numbers, using an external
table to translate between numbers and names. In particular, a pattern table to translate between numbers and names. In particular, a pattern
such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
the same number but different names, is not supported, and causes an the same number but different names, is not supported, and causes an
error at compile time. If it were allowed, it would not be possible to error at compile time. If it were allowed, it would not be possible to
distinguish which parentheses matched, because both names map to cap- distinguish which parentheses matched, because both names map to cap-
turing subpattern number 1. To avoid this confusing situation, an error turing subpattern number 1. To avoid this confusing situation, an error
is given at compile time. is given at compile time.
15. Perl recognizes comments in some places that PCRE does not, for 15. Perl recognizes comments in some places that PCRE does not, for ex-
example, between the ( and ? at the start of a subpattern. If the /x ample, between the ( and ? at the start of a subpattern. If the /x mod-
modifier is set, Perl allows white space between ( and ? (though cur- ifier is set, Perl allows white space between ( and ? (though current
rent Perls warn that this is deprecated) but PCRE never does, even if Perls warn that this is deprecated) but PCRE never does, even if the
the PCRE_EXTENDED option is set. PCRE_EXTENDED option is set.
16. Perl, when in warning mode, gives warnings for character classes 16. Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
als. PCRE has no warning features, so it gives an error in these cases als. PCRE has no warning features, so it gives an error in these cases
because they are almost certainly user mistakes. because they are almost certainly user mistakes.
17. In PCRE, the upper/lower case character properties Lu and Ll are 17. In PCRE, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example, not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in \p{Lu} always matches an upper case letter. I think Perl has changed in
this respect; in the release at the time of writing (5.16), \p{Lu} and this respect; in the release at the time of writing (5.16), \p{Lu} and
skipping to change at line 4609 skipping to change at line 4604
The syntax and semantics of the regular expressions that are supported The syntax and semantics of the regular expressions that are supported
by PCRE are described in detail below. There is a quick-reference syn- by PCRE are described in detail below. There is a quick-reference syn-
tax summary in the pcresyntax page. PCRE tries to match Perl syntax and tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
semantics as closely as it can. PCRE also supports some alternative semantics as closely as it can. PCRE also supports some alternative
regular expression syntax (which does not conflict with the Perl syn- regular expression syntax (which does not conflict with the Perl syn-
tax) in order to provide some compatibility with regular expressions in tax) in order to provide some compatibility with regular expressions in
Python, .NET, and Oniguruma. Python, .NET, and Oniguruma.
Perl's regular expressions are described in its own documentation, and Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some regular expressions in general are covered in a number of books, some
of which have copious examples. Jeffrey Friedl's "Mastering Regular of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
Expressions", published by O'Reilly, covers regular expressions in pressions", published by O'Reilly, covers regular expressions in great
great detail. This description of PCRE's regular expressions is detail. This description of PCRE's regular expressions is intended as
intended as reference material. reference material.
This document discusses the patterns that are supported by PCRE when This document discusses the patterns that are supported by PCRE when
one of its main matching functions, pcre_exec() (8-bit) or one of its main matching functions, pcre_exec() (8-bit) or
pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has alternative pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has alternative
matching functions, pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which matching functions, pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which
match using a different algorithm that is not Perl-compatible. Some of match using a different algorithm that is not Perl-compatible. Some of
the features discussed below are not available when DFA matching is the features discussed below are not available when DFA matching is
used. The advantages and disadvantages of the alternative functions, used. The advantages and disadvantages of the alternative functions,
and how they differ from the normal functions, are discussed in the and how they differ from the normal functions, are discussed in the
pcrematching page. pcrematching page.
SPECIAL START-OF-PATTERN ITEMS SPECIAL START-OF-PATTERN ITEMS
A number of options that can be passed to pcre_compile() can also be A number of options that can be passed to pcre_compile() can also be
set by special items at the start of a pattern. These are not Perl-com- set by special items at the start of a pattern. These are not Perl-com-
patible, but are provided to make these options accessible to pattern patible, but are provided to make these options accessible to pattern
writers who are not able to change the program that processes the pat- writers who are not able to change the program that processes the pat-
tern. Any number of these items may appear, but they must all be tern. Any number of these items may appear, but they must all be to-
together right at the start of the pattern string, and the letters must gether right at the start of the pattern string, and the letters must
be in upper case. be in upper case.
UTF support UTF support
The original operation of PCRE was on strings of one-byte characters. The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 strings in the original However, there is now also support for UTF-8 strings in the original
library, an extra library that supports 16-bit and UTF-16 character library, an extra library that supports 16-bit and UTF-16 character
strings, and a third library that supports 32-bit and UTF-32 character strings, and a third library that supports 32-bit and UTF-32 character
strings. To use these features, PCRE must be built to include appropri- strings. To use these features, PCRE must be built to include appropri-
ate support. When using UTF strings you must either call the compiling ate support. When using UTF strings you must either call the compiling
function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the
pattern must start with one of these special sequences: pattern must start with one of these special sequences:
(*UTF8) (*UTF8)
(*UTF16) (*UTF16)
(*UTF32) (*UTF32)
(*UTF) (*UTF)
(*UTF) is a generic sequence that can be used with any of the (*UTF) is a generic sequence that can be used with any of the li-
libraries. Starting a pattern with such a sequence is equivalent to braries. Starting a pattern with such a sequence is equivalent to set-
setting the relevant option. How setting a UTF mode affects pattern ting the relevant option. How setting a UTF mode affects pattern match-
matching is mentioned in several places below. There is also a summary ing is mentioned in several places below. There is also a summary of
of features in the pcreunicode page. features in the pcreunicode page.
Some applications that allow their users to supply patterns may wish to Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the restrict them to non-UTF data for security reasons. If the
PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are not PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are not al-
allowed, and their appearance causes an error. lowed, and their appearance causes an error.
Unicode property support Unicode property support
Another special sequence that may appear at the start of a pattern is Another special sequence that may appear at the start of a pattern is
(*UCP). This has the same effect as setting the PCRE_UCP option: it (*UCP). This has the same effect as setting the PCRE_UCP option: it
causes sequences such as \d and \w to use Unicode properties to deter- causes sequences such as \d and \w to use Unicode properties to deter-
mine character types, instead of recognizing only characters with codes mine character types, instead of recognizing only characters with codes
less than 128 via a lookup table. less than 128 via a lookup table.
Disabling auto-possessification Disabling auto-possessification
skipping to change at line 4703 skipping to change at line 4698
It is also possible to specify a newline convention by starting a pat- It is also possible to specify a newline convention by starting a pat-
tern string with one of the following five sequences: tern string with one of the following five sequences:
(*CR) carriage return (*CR) carriage return
(*LF) linefeed (*LF) linefeed
(*CRLF) carriage return, followed by linefeed (*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above (*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences (*ANY) all Unicode newline sequences
These override the default and the options given to the compiling func- These override the default and the options given to the compiling func-
tion. For example, on a Unix system where LF is the default newline tion. For example, on a Unix system where LF is the default newline se-
sequence, the pattern quence, the pattern
(*CR)a.b (*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because LF is changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. If more than one of these settings is present, the no longer a newline. If more than one of these settings is present, the
last one is used. last one is used.
The newline convention affects where the circumflex and dollar asser- The newline convention affects where the circumflex and dollar asser-
tions are true. It also affects the interpretation of the dot metachar- tions are true. It also affects the interpretation of the dot metachar-
acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
skipping to change at line 4769 skipping to change at line 4764
matches a portion of a subject string that is identical to itself. When matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In a UTF mode, PCRE always understands matched independently of case. In a UTF mode, PCRE always understands
the concept of case for characters whose values are less than 128, so the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher val- caseless matching is always possible. For characters with higher val-
ues, the concept of case is supported if PCRE is compiled with Unicode ues, the concept of case is supported if PCRE is compiled with Unicode
property support, but not otherwise. If you want to use caseless property support, but not otherwise. If you want to use caseless
matching for characters 128 and above, you must ensure that PCRE is matching for characters 128 and above, you must ensure that PCRE is
compiled with Unicode property support as well as with UTF support. compiled with Unicode property support as well as with UTF support.
The power of regular expressions comes from the ability to include The power of regular expressions comes from the ability to include al-
alternatives and repetitions in the pattern. These are encoded in the ternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way. but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recog- There are two different sets of metacharacters: those that are recog-
nized anywhere in the pattern except within square brackets, and those nized anywhere in the pattern except within square brackets, and those
that are recognized within square brackets. Outside square brackets, that are recognized within square brackets. Outside square brackets,
the metacharacters are as follows: the metacharacters are as follows:
\ general escape character with several uses \ general escape character with several uses
^ assert start of string (or line, in multiline mode) ^ assert start of string (or line, in multiline mode)
skipping to change at line 4833 skipping to change at line 4828
codepoints are greater than 127) are treated as literals. codepoints are greater than 127) are treated as literals.
If a pattern is compiled with the PCRE_EXTENDED option, most white If a pattern is compiled with the PCRE_EXTENDED option, most white
space in the pattern (other than in a character class), and characters space in the pattern (other than in a character class), and characters
between a # outside a character class and the next newline, inclusive, between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space are ignored. An escaping backslash can be used to include a white space
or # character as part of the pattern. or # character as part of the pattern.
If you want to remove the special meaning from a sequence of charac- If you want to remove the special meaning from a sequence of charac-
ters, you can do so by putting them between \Q and \E. This is differ- ters, you can do so by putting them between \Q and \E. This is differ-
ent from Perl in that $ and @ are handled as literals in \Q...\E ent from Perl in that $ and @ are handled as literals in \Q...\E se-
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- quences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
tion. Note the following examples: Note the following examples:
Pattern PCRE matches Perl matches Pattern PCRE matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the \Qabc$xyz\E abc$xyz abc followed by the
contents of $xyz contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz \Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
The \Q...\E sequence is recognized both inside and outside character The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q classes. An isolated \E that is not preceded by \Q is ignored. If \Q
skipping to change at line 4857 skipping to change at line 4852
continues to the end of the pattern (that is, \E is assumed at the continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated. error, because the character class is not terminated.
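The \Q...\E rules above can be sketched outside PCRE. The following Python function is a hypothetical helper (expand_quoted is not part of PCRE or of Python's re module) that rewrites \Q...\E spans into individually escaped literals, so that the examples in the table can be tried with Python's re. It is a minimal sketch that assumes the pattern contains no character classes, so the "isolated \Q inside a class" error case is not modelled.

```python
import re

def expand_quoted(pattern):
    # Hypothetical helper: rewrite \Q...\E spans as escaped literals.
    # Assumes no character classes appear in the pattern.
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False      # \E closes the literal span
                i += 2
            else:
                out.append(re.escape(pattern[i]))
                i += 1
        elif two == "\\Q":
            quoting = True           # an unterminated \Q runs to the end
            i += 2
        elif two == "\\E":
            i += 2                   # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)          # keep any other escape intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# The three rows of the table, checked against the "PCRE matches" column.
print(bool(re.fullmatch(expand_quoted(r"\Qabc$xyz\E"), "abc$xyz")))
print(bool(re.fullmatch(expand_quoted(r"\Qabc\$xyz\E"), "abc\\$xyz")))
print(bool(re.fullmatch(expand_quoted(r"\Qabc\E\$\Qxyz\E"), "abc$xyz")))
```

Because Python does not interpolate $ or @ in patterns, the result follows the PCRE column of the table rather than the Perl column.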
Non-printing characters Non-printing characters
A second use of backslash provides a way of encoding non-printing char- A second use of backslash provides a way of encoding non-printing char-
acters in patterns in a visible manner. There is no restriction on the acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text terminates a pattern, but when a pattern is being prepared by text
editing, it is often easier to use one of the following escape editing, it is often easier to use one of the following escape se-
sequences than the binary character it represents. In an ASCII or Uni- quences than the binary character it represents. In an ASCII or Uni-
code environment, these escapes are as follows: code environment, these escapes are as follows:
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character \cx "control-x", where x is any ASCII character
\e escape (hex 1B) \e escape (hex 1B)
\f form feed (hex 0C) \f form feed (hex 0C)
\n linefeed (hex 0A) \n linefeed (hex 0A)
\r carriage return (hex 0D) \r carriage return (hex 0D)
\t tab (hex 09) \t tab (hex 09)
\0dd character with octal code 0dd \0dd character with octal code 0dd
skipping to change at line 4887 skipping to change at line 4882
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
has a value greater than 127, a compile-time error occurs. This locks has a value greater than 127, a compile-time error occurs. This locks
out non-ASCII characters in all modes. out non-ASCII characters in all modes.
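The arithmetic behind \cx in an ASCII environment can be checked with a few lines of Python. This is only an illustrative sketch: control_code is a hypothetical name, and the lower-case folding step is taken from the part of the paragraph that the diff skips, so treat it as an assumption here.

```python
def control_code(x):
    # Hypothetical helper mirroring the ASCII-mode \cx rule.
    code = ord(x)
    if code > 127:
        # PCRE reports a compile-time error for values above 127.
        raise ValueError("non-ASCII character after \\c")
    if "a" <= x <= "z":
        code -= 0x20          # assumed step: fold lower case to upper case
    return code ^ 0x40        # invert bit 6 (hex 40)

for ch in "AZ{;?":
    print(ch, hex(control_code(ch)))
```

The loop covers the cases named in the text: A and Z give hex 01 and hex 1A, { gives hex 3B, ; gives hex 7B, and ? gives hex 7F (DEL).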
When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gener- When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gener-
ate the appropriate EBCDIC code values. The \c escape is processed as ate the appropriate EBCDIC code values. The \c escape is processed as
specified for Perl in the perlebcdic document. The only characters that specified for Perl in the perlebcdic document. The only characters that
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
Any other character provokes a compile-time error. The sequence \c@ Any other character provokes a compile-time error. The sequence \c@ en-
encodes character code 0; after \c the letters (in either case) encode codes character code 0; after \c the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
(hex 5F). (hex 5F).
Thus, apart from \c?, these escapes generate the same character code Thus, apart from \c?, these escapes generate the same character code
values as they do in an ASCII environment, though the meanings of the values as they do in an ASCII environment, though the meanings of the
values mostly differ. For example, \cG always generates code value 7, values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC. which is BEL in ASCII but DEL in EBCDIC.
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
but because 127 is not a control character in EBCDIC, Perl makes it but because 127 is not a control character in EBCDIC, Perl makes it
generate the APC character. Unfortunately, there are several variants generate the APC character. Unfortunately, there are several variants
of EBCDIC. In most of them the APC character has the value 255 (hex of EBCDIC. In most of them the APC character has the value 255 (hex
FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
certain other characters have POSIX-BC values, PCRE makes \c? generate certain other characters have POSIX-BC values, PCRE makes \c? generate
95; otherwise it generates 255. 95; otherwise it generates 255.
After \0 up to two further octal digits are read. If there are fewer After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the than two digits, just those that are present are used. Thus the se-
sequence \0\x\015 specifies two binary zeros followed by a CR character quence \0\x\015 specifies two binary zeros followed by a CR character
(code value 13). Make sure you supply two digits after the initial zero (code value 13). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit. if the pattern character that follows is itself an octal digit.
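The "up to two further octal digits" rule can be illustrated with a small parser. The Python function below is a hypothetical sketch, not PCRE's own code; it only handles the \0 form and returns the character code together with the index of the first unread pattern character.

```python
def parse_backslash_zero(pattern, pos):
    # Hypothetical helper: pos must index the backslash of a "\0" escape.
    if pattern[pos:pos + 2] != "\\0":
        raise ValueError("expected \\0 at this position")
    end = pos + 2
    # Read at most two further octal digits after the initial zero.
    while end < len(pattern) and end < pos + 4 and pattern[end] in "01234567":
        end += 1
    return int(pattern[pos + 1:end], 8), end

print(parse_backslash_zero(r"\015", 0))    # CR: all three digits are used
print(parse_backslash_zero(r"\0\x", 0))    # a bare \0 is a binary zero
print(parse_backslash_zero(r"\0123", 0))   # "12" is consumed, "3" is left over
```

The last call shows why two digits should be supplied when an octal digit follows: \0123 is read as octal 012 followed by a literal "3", not as a binary zero followed by "123".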
The escape \o must be followed by a sequence of octal digits, enclosed The escape \o must be followed by a sequence of octal digits, enclosed
in braces. An error occurs if this is not the case. This escape is a in braces. An error occurs if this is not the case. This escape is a
recent addition to Perl; it provides a way of specifying character code recent addition to Perl; it provides a way of specifying character code
points as octal numbers greater than 0777, and it also allows octal points as octal numbers greater than 0777, and it also allows octal
numbers and back references to be unambiguously specified. numbers and back references to be unambiguously specified.
For greater clarity and unambiguity, it is best to avoid following \ by For greater clarity and unambiguity, it is best to avoid following \ by
skipping to change at line 4999 skipping to change at line 4994
called "surrogate" codepoints), and 0xffef. called "surrogate" codepoints), and 0xffef.
Escape sequences in character classes Escape sequences in character classes
All the sequences that define a single character value can be used both All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08). class, \b is interpreted as the backspace character (hex 08).
\N is not allowed in a character class. \B, \R, and \X are not special \N is not allowed in a character class. \B, \R, and \X are not special
inside a character class. Like other unrecognized escape sequences, inside a character class. Like other unrecognized escape sequences,
they are treated as the literal characters "B", "R", and "X" by they are treated as the literal characters "B", "R", and "X" by de-
default, but cause an error if the PCRE_EXTRA option is set. Outside a fault, but cause an error if the PCRE_EXTRA option is set. Outside a
character class, these sequences have different meanings. character class, these sequences have different meanings.
Unsupported escape sequences Unsupported escape sequences
In Perl, the sequences \l, \L, \u, and \U are recognized by its string In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By handler and used to modify the case of following characters. By de-
default, PCRE does not support these escape sequences. However, if the fault, PCRE does not support these escape sequences. However, if the
PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
\u can be used to define a character by code point, as described in the \u can be used to define a character by code point, as described in the
previous section. previous section.
Absolute and relative back references Absolute and relative back references
The sequence \g followed by an unsigned or a negative number, option- The sequence \g followed by an unsigned or a negative number, option-
ally enclosed in braces, is an absolute or relative back reference. A ally enclosed in braces, is an absolute or relative back reference. A
named back reference can be coded as \g{name}. Back references are dis- named back reference can be coded as \g{name}. Back references are dis-
cussed later, following the discussion of parenthesized subpatterns. cussed later, following the discussion of parenthesized subpatterns.
skipping to change at line 5058 skipping to change at line 5053
Each pair of lower and upper case escape sequences partitions the com- Each pair of lower and upper case escape sequences partitions the com-
plete set of characters into two disjoint sets. Any given character plete set of characters into two disjoint sets. Any given character
matches one, and only one, of each pair. The sequences can appear both matches one, and only one, of each pair. The sequences can appear both
inside and outside character classes. They each match one character of inside and outside character classes. They each match one character of
the appropriate type. If the current matching point is at the end of the appropriate type. If the current matching point is at the end of
the subject string, all of them fail, because there is no character to the subject string, all of them fail, because there is no character to
match. match.
For compatibility with Perl, \s formerly did not match the VT character For compatibility with Perl, \s formerly did not match the VT character
(code 11), which made it different from the POSIX "space" class. (code 11), which made it different from the POSIX "space" class.
However, Perl added VT at release 5.18, and PCRE followed suit at However, Perl added VT at release 5.18, and PCRE followed suit at re-
release 8.34. The default \s characters are now HT (9), LF (10), VT lease 8.34. The default \s characters are now HT (9), LF (10), VT (11),
(11), FF (12), CR (13), and space (32), which are defined as white FF (12), CR (13), and space (32), which are defined as white space in
space in the "C" locale. This list may vary if locale-specific matching the "C" locale. This list may vary if locale-specific matching is tak-
is taking place. For example, in some locales the "non-breaking space" ing place. For example, in some locales the "non-breaking space" char-
character (\xA0) is recognized as white space, and in others the VT acter (\xA0) is recognized as white space, and in others the VT charac-
character is not. ter is not.
A "word" character is an underscore or any character that is a letter A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is con- or digit. By default, the definition of letters and digits is con-
trolled by PCRE's low-valued character tables, and may vary if locale- trolled by PCRE's low-valued character tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcreapi specific matching is taking place (see "Locale support" in the pcreapi
page). For example, in a French locale such as "fr_FR" in Unix-like page). For example, in a French locale such as "fr_FR" in Unix-like
systems, or "french" in Windows, some character codes greater than 127 systems, or "french" in Windows, some character codes greater than 127
are used for accented letters, and these are then matched by \w. The are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged. use of locales with Unicode is discouraged.
skipping to change at line 5141 skipping to change at line 5136
256 are relevant. 256 are relevant.
Newline sequences Newline sequences
Outside a character class, by default, the escape sequence \R matches Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
to the following: to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85) (?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given This is an example of an "atomic group", details of which are given be-
below. This particular group matches either the two-character sequence low. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed, CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
riage return, U+000D), or NEL (next line, U+0085). The two-character riage return, U+000D), or NEL (next line, U+0085). The two-character
sequence is treated as a single unit that cannot be split. sequence is treated as a single unit that cannot be split.
In other modes, two additional characters whose codepoints are greater In other modes, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
rator, U+2029). Unicode character property support is not needed for rator, U+2029). Unicode character property support is not needed for
these characters to be recognized. these characters to be recognized.
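Python's re module has no \R, so the following sketch spells out the equivalent alternation by hand. It is an approximation under stated assumptions: a plain alternation is used instead of an atomic group (atomic groups are only available in recent Python versions), which is sufficient for splitting because \r\n is listed first. The names NEWLINE_8BIT, NEWLINE_UTF and split_lines are hypothetical.

```python
import re

# Newline sequences matched by \R in 8-bit non-UTF-8 mode.
NEWLINE_8BIT = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85")

# In other modes LS (U+2028) and PS (U+2029) are added.
NEWLINE_UTF = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029")

def split_lines(text, newline=NEWLINE_UTF):
    # CRLF is tried first, so it is consumed as one unit and not split.
    return newline.split(text)

print(split_lines("a\r\nb\nc\x85d\re"))
print(split_lines("x\u2028y", NEWLINE_8BIT))
```

The second call leaves the string in one piece, because U+2028 is not in the 8-bit set.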
It is possible to restrict \R to match only CR, LF, or CRLF (instead of It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default (BSR is an abbreviation for "backslash R".) This can be made the default
when PCRE is built; if this is the case, the other behaviour can be when PCRE is built; if this is the case, the other behaviour can be re-
requested via the PCRE_BSR_UNICODE option. It is also possible to quested via the PCRE_BSR_UNICODE option. It is also possible to spec-
specify these settings by starting a pattern string with one of the ify these settings by starting a pattern string with one of the follow-
following sequences: ing sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only (*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence (*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to the compiling func- These override the default and the options given to the compiling func-
tion, but they can themselves be overridden by options given to a tion, but they can themselves be overridden by options given to a
matching function. Note that these special settings, which are not matching function. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern, Perl-compatible, are recognized only at the very start of a pattern,
and that they must be in upper case. If more than one of them is and that they must be in upper case. If more than one of them is
present, the last one is used. They can be combined with a change of present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with: newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF) (*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
or (*UCP) special sequences. Inside a character class, \R is treated as or (*UCP) special sequences. Inside a character class, \R is treated as
an unrecognized escape sequence, and so matches the letter "R" by an unrecognized escape sequence, and so matches the letter "R" by de-
default, but causes an error if PCRE_EXTRA is set. fault, but causes an error if PCRE_EXTRA is set.
Unicode character properties Unicode character properties
When PCRE is built with Unicode character property support, three addi- When PCRE is built with Unicode character property support, three addi-
tional escape sequences that match characters with specific properties tional escape sequences that match characters with specific properties
are available. When in 8-bit non-UTF-8 mode, these sequences are of are available. When in 8-bit non-UTF-8 mode, these sequences are of
course limited to testing characters whose codepoints are less than course limited to testing characters whose codepoints are less than
256, but they do work in this mode. The extra escape sequences are: 256, but they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property \p{xx} a character with the xx property
\P{xx} a character without the xx property \P{xx} a character without the xx property
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
The property names represented by xx above are limited to the Unicode The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE properties character (including newline), and some special PCRE properties (de-
(described in the next section). Other Perl properties such as "InMu- scribed in the next section). Other Perl properties such as "InMusi-
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} calSymbols" are not currently supported by PCRE. Note that \P{Any} does
does not match any characters, so always causes a match failure. not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts. Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name. A character from one of these sets can be matched using a script name.
For example: For example:
\p{Greek} \p{Greek}
\P{Han} \P{Han}
Those that are not part of an identified script are lumped together as Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is: "Common". The current list of scripts is:
Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car- Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei- ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero- form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha,
Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Im-
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- perial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin- Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin-
ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic,
Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hi-
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, eroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, New_Tai_Lue,
New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Old_Permic,
Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pa-
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, hawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha- Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac,
Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,
Yi. Yi.
Each character has exactly one Unicode general category property, spec- Each character has exactly one Unicode general category property, spec-
ified by a two-letter abbreviation. For compatibility with Perl, nega- ified by a two-letter abbreviation. For compatibility with Perl, nega-
tion can be specified by including a circumflex between the opening tion can be specified by including a circumflex between the opening
brace and the property name. For example, \p{^Lu} is the same as brace and the property name. For example, \p{^Lu} is the same as
skipping to change at line 5336 skipping to change at line 5331
(?>\PM\pM*) (?>\PM\pM*)
That is, it matched a character without the "mark" property, followed That is, it matched a character without the "mark" property, followed
by zero or more characters with the "mark" property. Characters with by zero or more characters with the "mark" property. Characters with
the "mark" property are typically non-spacing accents that affect the the "mark" property are typically non-spacing accents that affect the
preceding character. preceding character.
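The older meaning of \X, a non-mark character followed by any number of mark characters, can be approximated in Python with the standard unicodedata module. This is a rough sketch: legacy_clusters is a hypothetical name, and Python's Unicode tables may be of a different version from those built into PCRE.

```python
import unicodedata

def is_mark(ch):
    # General categories Mn, Mc and Me make up the "mark" property.
    return unicodedata.category(ch).startswith("M")

def legacy_clusters(text):
    # Emulates repeatedly matching \PM\pM* from left to right.
    clusters = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            i += 1            # a mark with no base cannot start a match
            continue
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        clusters.append(text[i:j])
        i = j
    return clusters

# "e" followed by U+0301 (combining acute accent) stays together.
print(legacy_clusters("e\u0301a"))
```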
This simple definition was extended in Unicode to include more compli- This simple definition was extended in Unicode to include more compli-
cated kinds of composite character by giving each character a grapheme cated kinds of composite character by giving each character a grapheme
breaking property, and creating rules that use these properties to breaking property, and creating rules that use these properties to de-
define the boundaries of extended grapheme clusters. In releases of fine the boundaries of extended grapheme clusters. In releases of PCRE
PCRE later than 8.31, \X matches one of these clusters. later than 8.31, \X matches one of these clusters.
\X always matches at least one character. Then it decides whether to \X always matches at least one character. Then it decides whether to
add additional characters according to the following rules for ending a add additional characters according to the following rules for ending a
cluster: cluster:
1. End at the end of the subject string. 1. End at the end of the subject string.
2. Do not end between CR and LF; otherwise end after any control char- 2. Do not end between CR and LF; otherwise end after any control char-
acter. acter.
skipping to change at line 5366 skipping to change at line 5361
with the "mark" property always have the "extend" grapheme breaking with the "mark" property always have the "extend" grapheme breaking
property. property.
5. Do not end after prepend characters. 5. Do not end after prepend characters.
6. Otherwise, end the cluster. 6. Otherwise, end the cluster.
PCRE's additional properties PCRE's additional properties
As well as the standard Unicode properties described above, PCRE sup- As well as the standard Unicode properties described above, PCRE sup-
ports four more that make it possible to convert traditional escape ports four more that make it possible to convert traditional escape se-
sequences such as \w and \s to use Unicode properties. PCRE uses these quences such as \w and \s to use Unicode properties. PCRE uses these
non-standard, non-Perl properties internally when PCRE_UCP is set. How- non-standard, non-Perl properties internally when PCRE_UCP is set. How-
ever, they may also be used explicitly. These properties are: ever, they may also be used explicitly. These properties are:
Xan Any alphanumeric character Xan Any alphanumeric character
Xps Any POSIX space character Xps Any POSIX space character
Xsp Any Perl space character Xsp Any Perl space character
Xwd Any Perl "word" character Xwd Any Perl "word" character
Xan matches characters that have either the L (letter) or the N (num- Xan matches characters that have either the L (letter) or the N (num-
ber) property. Xps matches the characters tab, linefeed, vertical tab, ber) property. Xps matches the characters tab, linefeed, vertical tab,
form feed, or carriage return, and any other character that has the Z form feed, or carriage return, and any other character that has the Z
(separator) property. Xsp is the same as Xps; it used to exclude ver- (separator) property. Xsp is the same as Xps; it used to exclude ver-
tical tab, for Perl compatibility, but Perl changed, and so PCRE fol- tical tab, for Perl compatibility, but Perl changed, and so PCRE fol-
lowed at release 8.34. Xwd matches the same characters as Xan, plus lowed at release 8.34. Xwd matches the same characters as Xan, plus un-
underscore. derscore.
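The definitions of Xan, Xps, Xsp and Xwd translate directly into predicates on the Unicode general category. The Python functions below are hypothetical illustrations built on the standard unicodedata module; they follow the wording above, but Python's Unicode tables may differ in version from PCRE's.

```python
import unicodedata

def is_xan(ch):
    # L (letter) or N (number) general category.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"

# Xsp is the same as Xps since release 8.34.
is_xsp = is_xps

def is_xwd(ch):
    # Same characters as Xan, plus underscore.
    return is_xan(ch) or ch == "_"

print(is_xan("5"), is_xps("\u00a0"), is_xwd("_"), is_xwd("-"))
```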
There is another non-standard property, Xuc, which matches any charac- There is another non-standard property, Xuc, which matches any charac-
ter that can be represented by a Universal Character Name in C++ and ter that can be represented by a Universal Character Name in C++ and
other programming languages. These are the characters $, @, ` (grave other programming languages. These are the characters $, @, ` (grave
accent), and all characters with Unicode code points greater than or accent), and all characters with Unicode code points greater than or
equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
most base (ASCII) characters are excluded. (Universal Character Names most base (ASCII) characters are excluded. (Universal Character Names
are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
Note that the Xuc property does not match these sequences but the char- Note that the Xuc property does not match these sequences but the char-
acters that they represent.) acters that they represent.)
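The Xuc description amounts to a simple code point test, sketched here in Python. The function name is_xuc is hypothetical, and the check is written only from the wording above.

```python
def is_xuc(ch):
    # $, @ and grave accent are the only low code points included.
    if ch in "$@`":
        return True
    cp = ord(ch)
    # Everything from U+00A0 upwards, except the surrogates U+D800 to U+DFFF.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)

print(is_xuc("$"), is_xuc("A"), is_xuc("\u00e9"))
```

As the text notes, the test applies to the character itself and not to the \uHHHH spelling of it.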
skipping to change at line 5412 skipping to change at line 5407
is similar to a lookbehind assertion (described below). However, in is similar to a lookbehind assertion (described below). However, in
this case, the part of the subject before the real match does not have this case, the part of the subject before the real match does not have
to be of fixed length, as lookbehind assertions do. The use of \K does to be of fixed length, as lookbehind assertions do. The use of \K does
not interfere with the setting of captured substrings. For example, not interfere with the setting of captured substrings. For example,
when the pattern when the pattern
(foo)\Kbar (foo)\Kbar
matches "foobar", the first substring is still set to "foo". matches "foobar", the first substring is still set to "foo".
Perl documents that the use of \K within assertions is "not well Perl documents that the use of \K within assertions is "not well de-
defined". In PCRE, \K is acted upon when it occurs inside positive fined". In PCRE, \K is acted upon when it occurs inside positive asser-
assertions, but is ignored in negative assertions. Note that when a tions, but is ignored in negative assertions. Note that when a pattern
pattern such as (?=ab\K) matches, the reported start of the match can such as (?=ab\K) matches, the reported start of the match can be
be greater than the end of the match. greater than the end of the match.
Simple assertions Simple assertions
The final use of backslash is for certain simple assertions. An asser- The final use of backslash is for certain simple assertions. An asser-
tion specifies a condition that has to be met at a particular point in tion specifies a condition that has to be met at a particular point in
a match, without consuming any characters from the subject string. The a match, without consuming any characters from the subject string. The
use of subpatterns for more complicated assertions is described below. use of subpatterns for more complicated assertions is described below.
The backslashed assertions are: The backslashed assertions are:
\b matches at a word boundary \b matches at a word boundary
\B matches when not at a word boundary \B matches when not at a word boundary
\A matches at the start of the subject \A matches at the start of the subject
\Z matches at the end of the subject \Z matches at the end of the subject
also matches before a newline at the end of the subject also matches before a newline at the end of the subject
\z matches only at the end of the subject \z matches only at the end of the subject
\G matches at the first matching position in the subject \G matches at the first matching position in the subject
Inside a character class, \b has a different meaning; it matches the Inside a character class, \b has a different meaning; it matches the
backspace character. If any other of these assertions appears in a backspace character. If any other of these assertions appears in a
character class, by default it matches the corresponding literal char- character class, by default it matches the corresponding literal char-
acter (for example, \B matches the letter B). However, if the acter (for example, \B matches the letter B). However, if the PCRE_EX-
PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- TRA option is set, an "invalid escape sequence" error is generated in-
ated instead. stead.
A word boundary is a position in the subject string where the current A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e. character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively. In a string if the first or last character matches \w, respectively. In a
UTF mode, the meanings of \w and \W can be changed by setting the UTF mode, the meanings of \w and \W can be changed by setting the
PCRE_UCP option. When this is done, it also affects \b and \B. Neither PCRE_UCP option. When this is done, it also affects \b and \B. Neither
PCRE nor Perl has a separate "start of word" or "end of word" metase- PCRE nor Perl has a separate "start of word" or "end of word" metase-
quence. However, whatever follows \b normally determines which it is. quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word. For example, the fragment \ba matches "a" at the start of a word.
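The word-boundary definition above can be sketched directly. This is an illustration in Python, not PCRE code: is_word_char() and word_boundaries() are hypothetical helper names, \w is assumed to mean ASCII letters, digits, and underscore, and Python's own re module is used only as a cross-check.

```python
import re

def is_word_char(ch):
    # Assumed ASCII-only meaning of \w: letters, digits, underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def word_boundaries(subject):
    """Return every position where exactly one neighbour matches \\w."""
    positions = []
    for pos in range(len(subject) + 1):
        # Outside the string counts as "not a word character".
        before = pos > 0 and is_word_char(subject[pos - 1])
        after = pos < len(subject) and is_word_char(subject[pos])
        if before != after:
            positions.append(pos)
    return positions

subject = "an apple, a banana"
boundaries = word_boundaries(subject)

# Cross-check against the \b assertion of Python's re module.
regex_boundaries = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]

# \b followed by "a" only finds an "a" that starts a word.
starts_with_a = re.findall(r"\ba\w*", subject)
```

Here the "a" characters inside "banana" are skipped because no boundary precedes them.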
skipping to change at line 5502 skipping to change at line 5497
Circumflex need not be the first character of the pattern if a number Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is, branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the sub- if the pattern is constrained to match only at the start of the sub-
ject, it is said to be an "anchored" pattern. (There are also other ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.) constructs that can cause a pattern to be anchored.)
The dollar character is an assertion that is true only if the current The dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately matching point is at the end of the subject string, or immediately be-
before a newline at the end of the string (by default). Note, however, fore a newline at the end of the string (by default). Note, however,
that it does not actually match the newline. Dollar need not be the that it does not actually match the newline. Dollar need not be the
last character of the pattern if a number of alternatives are involved, last character of the pattern if a number of alternatives are involved,
but it should be the last item in any branch in which it appears. Dol- but it should be the last item in any branch in which it appears. Dol-
lar has no special meaning in a character class. lar has no special meaning in a character class.
The meaning of dollar can be changed so that it matches only at the The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion. compile time. This does not affect the \Z assertion.
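A minimal sketch of the default dollar behaviour, using Python's re module rather than PCRE. The assumption is that Python's $ behaves like the default PCRE dollar for this case. Note that Python's \Z matches only at the very end of the string, so it corresponds to PCRE's \z (and to dollar under PCRE_DOLLAR_ENDONLY), not to PCRE's \Z.

```python
import re

subject = "abc\n"

# Dollar is true just before the final newline; the newline itself
# is not part of the match.
default_dollar = re.search(r"abc$", subject)
dollar_span = default_dollar.span()

# Python's \Z insists on the very end of the string, which is the
# behaviour the text describes for PCRE_DOLLAR_ENDONLY.
end_only = re.search(r"abc\Z", subject)
end_only_no_newline = re.search(r"abc\Z", "abc")
```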
The meanings of the circumflex and dollar characters are changed if the The meanings of the circumflex and dollar characters are changed if the
skipping to change at line 5549 skipping to change at line 5544
fies the end of a line. fies the end of a line.
When a line ending is defined as a single character, dot never matches When a line ending is defined as a single character, dot never matches
that character; when the two-character sequence CRLF is used, dot does that character; when the two-character sequence CRLF is used, dot does
not match CR if it is immediately followed by LF, but otherwise it not match CR if it is immediately followed by LF, but otherwise it
matches all characters (including isolated CRs and LFs). When any Uni- matches all characters (including isolated CRs and LFs). When any Uni-
code line endings are being recognized, dot does not match CR or LF or code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters. any of the other line ending characters.
The behaviour of dot with regard to newlines can be changed. If the The behaviour of dot with regard to newlines can be changed. If the
PCRE_DOTALL option is set, a dot matches any one character, without PCRE_DOTALL option is set, a dot matches any one character, without ex-
exception. If the two-character sequence CRLF is present in the subject ception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it. string, it takes two dots to match it.
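The effect of a "dot matches everything" option can be sketched with Python's re module, whose re.DOTALL flag is assumed here to play the role of PCRE_DOTALL. Python recognizes only LF as a line ending for this purpose, so the configurable newline conventions described above are not reproduced.

```python
import re

# By default a dot does not match the line-ending character.
default_match = re.fullmatch(r"a.b", "a\nb")

# With the dotall flag, a dot matches any one character.
dotall_match = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so a single dot cannot cover it.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```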
The handling of dot is entirely independent of the handling of circum- The handling of dot is entirely independent of the handling of circum-
flex and dollar, the only relationship being that they both involve flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class. newlines. Dot has no special meaning in a character class.
The escape sequence \N behaves like a dot, except that it is not The escape sequence \N behaves like a dot, except that it is not af-
affected by the PCRE_DOTALL option. In other words, it matches any fected by the PCRE_DOTALL option. In other words, it matches any char-
character except one that signifies the end of a line. Perl also uses acter except one that signifies the end of a line. Perl also uses \N to
\N to match characters by name; PCRE does not support this. match characters by name; PCRE does not support this.
MATCHING A SINGLE DATA UNIT MATCHING A SINGLE DATA UNIT
Outside a character class, the escape sequence \C matches any one data Outside a character class, the escape sequence \C matches any one data
unit, whether or not a UTF mode is set. In the 8-bit library, one data unit, whether or not a UTF mode is set. In the 8-bit library, one data
unit is one byte; in the 16-bit library it is a 16-bit unit; in the unit is one byte; in the 16-bit library it is a 16-bit unit; in the
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
line-ending characters. The feature is provided in Perl in order to line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can use- match individual bytes in UTF-8 mode, but it is unclear how it can use-
fully be used. Because \C breaks up characters into individual data fully be used. Because \C breaks up characters into individual data
skipping to change at line 5594 skipping to change at line 5589
a lookahead to check the length of the next character, as in this pat- a lookahead to check the length of the next character, as in this pat-
tern, which could be used with a UTF-8 string (ignore white space and tern, which could be used with a UTF-8 string (ignore white space and
line breaks): line breaks):
(?| (?=[\x00-\x7f])(\C) | (?| (?=[\x00-\x7f])(\C) |
(?=[\x80-\x{7ff}])(\C)(\C) | (?=[\x80-\x{7ff}])(\C)(\C) |
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
A group that starts with (?| resets the capturing parentheses numbers A group that starts with (?| resets the capturing parentheses numbers
in each alternative (see "Duplicate Subpattern Numbers" below). The in each alternative (see "Duplicate Subpattern Numbers" below). The as-
assertions at the start of each branch check the next UTF-8 character sertions at the start of each branch check the next UTF-8 character for
for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The char-
character's individual bytes are then captured by the appropriate num- acter's individual bytes are then captured by the appropriate number of
ber of groups. groups.
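The four code point ranges tested by the lookaheads can be written out as a small function. This is a sketch in Python, not part of PCRE; utf8_length() is a hypothetical name, and the sample characters are chosen only to hit each range once.

```python
def utf8_length(code_point):
    """Number of bytes in the UTF-8 encoding, using the ranges above."""
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges in the pattern")

# One sample per branch: "A", e-acute, the euro sign, and U+1F600.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
predicted = [utf8_length(cp) for cp in samples]

# Cross-check against Python's own UTF-8 encoder.
actual = [len(chr(cp).encode("utf-8")) for cp in samples]
```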
SQUARE BRACKETS AND CHARACTER CLASSES SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe- closing square bracket. A closing square bracket on its own is not spe-
cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
a lone closing square bracket causes a compile-time error. If a closing a lone closing square bracket causes a compile-time error. If a closing
square bracket is required as a member of the class, it should be the square bracket is required as a member of the class, it should be the
first data character in the class (after an initial circumflex, if first data character in the class (after an initial circumflex, if
present) or escaped with a backslash. present) or escaped with a backslash.
skipping to change at line 5643 skipping to change at line 5638
match "A", whereas a caseful version would. In a UTF mode, PCRE always match "A", whereas a caseful version would. In a UTF mode, PCRE always
understands the concept of case for characters whose values are less understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use with Unicode property support, but not otherwise. If you want to use
caseless matching in a UTF mode for characters 128 and above, you must caseless matching in a UTF mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as ensure that PCRE is compiled with Unicode property support as well as
with UTF support. with UTF support.
Characters that might indicate line breaks are never treated in any Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending special way when matching character classes, whatever line-ending se-
sequence is in use, and whatever setting of the PCRE_DOTALL and quence is in use, and whatever setting of the PCRE_DOTALL and PCRE_MUL-
PCRE_MULTILINE options is used. A class such as [^a] always matches one TILINE options is used. A class such as [^a] always matches one of
of these characters. these characters.
The minus (hyphen) character can be used to specify a range of charac- The minus (hyphen) character can be used to specify a range of charac-
ters in a character class. For example, [d-m] matches any letter ters in a character class. For example, [d-m] matches any letter be-
between d and m, inclusive. If a minus character is required in a tween d and m, inclusive. If a minus character is required in a class,
class, it must be escaped with a backslash or appear in a position it must be escaped with a backslash or appear in a position where it
where it cannot be interpreted as indicating a range, typically as the cannot be interpreted as indicating a range, typically as the first or
first or last character in the class, or immediately after a range. For last character in the class, or immediately after a range. For example,
example, [b-d-z] matches letters in the range b to d, a hyphen charac- [b-d-z] matches letters in the range b to d, a hyphen character, or z.
ter, or z.
It is not possible to have the literal character "]" as the end charac- It is not possible to have the literal character "]" as the end charac-
ter of a range. A pattern such as [W-]46] is interpreted as a class of ter of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it two characters ("W" and "-") followed by a literal string "46]", so it
would match "W46]" or "-46]". However, if the "]" is escaped with a would match "W46]" or "-46]". However, if the "]" is escaped with a
backslash it is interpreted as the end of range, so [W-\]46] is inter- backslash it is interpreted as the end of range, so [W-\]46] is inter-
preted as a class containing a range followed by two other characters. preted as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end The octal or hexadecimal representation of "]" can also be used to end
a range. a range.
An error is generated if a POSIX character class (see below) or an An error is generated if a POSIX character class (see below) or an es-
escape sequence other than one that defines a single character appears cape sequence other than one that defines a single character appears at
at a point where a range ending character is expected. For example, a point where a range ending character is expected. For example,
[z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not. [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
Ranges operate in the collating sequence of character values. They can Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example also be used for characters specified numerically, for example
[\000-\037]. Ranges can include any characters that are valid for the [\000-\037]. Ranges can include any characters that are valid for the
current mode. current mode.
If a range that includes letters is used when caseless matching is set, If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF modes, PCRE supports the accented E characters in both cases. In UTF modes, PCRE supports the
concept of case for characters with values greater than 127 only when concept of case for characters with values greater than 127 only when
it is compiled with Unicode property support. it is compiled with Unicode property support.
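The claimed equivalence for [W-c] can be checked by enumeration. The sketch below uses Python's re module with ASCII-only case folding as a stand-in for PCRE's caseless matching in a non-UTF mode; that correspondence is an assumption, and only code points 0 to 127 are examined.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

range_class = re.compile(r"[W-c]", FLAGS)
spelled_out = re.compile(r"[][\\^_`wxyzabc]", FLAGS)

def matched_ascii(pattern):
    # Collect every 7-bit character that the class accepts.
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}

by_range = matched_ascii(range_class)
by_list = matched_ascii(spelled_out)
```

Both sets contain the thirteen characters from W to c in code order, plus the other-case letters A, B, C, w, x, y, and z.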
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
\w, and \W may appear in a character class, and add the characters that \w, and \W may appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any hexadeci- they match to the class. For example, [\dABCDEF] matches any hexadeci-
mal digit. In UTF modes, the PCRE_UCP option affects the meanings of mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
\d, \s, \w and their upper case partners, just as it does when they \d, \s, \w and their upper case partners, just as it does when they ap-
appear outside a character class, as described in the section entitled pear outside a character class, as described in the section entitled
"Generic character types" above. The escape sequence \b has a different "Generic character types" above. The escape sequence \b has a different
meaning inside a character class; it matches the backspace character. meaning inside a character class; it matches the backspace character.
The sequences \B, \N, \R, and \X are not special inside a character The sequences \B, \N, \R, and \X are not special inside a character
class. Like any other unrecognized escape sequences, they are treated class. Like any other unrecognized escape sequences, they are treated
as the literal characters "B", "N", "R", and "X" by default, but cause as the literal characters "B", "N", "R", and "X" by default, but cause
an error if the PCRE_EXTRA option is set. an error if the PCRE_EXTRA option is set.
A circumflex can conveniently be used with the upper case character A circumflex can conveniently be used with the upper case character
types to specify a more restricted set of characters than the matching types to specify a more restricted set of characters than the matching
lower case type. For example, the class [^\W_] matches any letter or lower case type. For example, the class [^\W_] matches any letter or
digit, but not underscore, whereas [\w] includes underscore. A positive digit, but not underscore, whereas [\w] includes underscore. A positive
character class should be read as "something OR something OR ..." and a character class should be read as "something OR something OR ..." and a
negative class as "NOT something AND NOT something AND NOT ...". negative class as "NOT something AND NOT something AND NOT ...".
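A short sketch of the [^\W_] idiom, using Python's re module with the re.ASCII flag so that \w is limited to ASCII letters, digits, and underscore. The subject string is made up for the illustration.

```python
import re

subject = "a_B-9!"

# NOT a non-word character AND NOT underscore: letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", subject, re.ASCII)

# The positive class [\w] still admits the underscore.
word_chars = re.findall(r"[\w]", subject, re.ASCII)
```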
The only metacharacters that are recognized in character classes are The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name, or for a when it can be interpreted as introducing a POSIX class name, or for a
special compatibility feature - see the next two sections), and the special compatibility feature - see the next two sections), and the
terminating closing square bracket. However, escaping other non- terminating closing square bracket. However, escaping other non-al-
alphanumeric characters does no harm. phanumeric characters does no harm.
POSIX CHARACTER CLASSES POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example, supports this notation. For example,
[01[:alpha:]%] [01[:alpha:]%]
matches "0", "1", any alphabetic character, or "%". The supported class matches "0", "1", any alphabetic character, or "%". The supported class
names are: names are:
alnum letters and digits alnum letters and digits
alpha letters alpha letters
ascii character codes 0 - 127 ascii character codes 0 - 127
skipping to change at line 5738 skipping to change at line 5732
digit decimal digits (same as \d) digit decimal digits (same as \d)
graph printing characters, excluding space graph printing characters, excluding space
lower lower case letters lower lower case letters
print printing characters, including space print printing characters, including space
punct printing characters, excluding letters and digits and space punct printing characters, excluding letters and digits and space
space white space (the same as \s from PCRE 8.34) space white space (the same as \s from PCRE 8.34)
upper upper case letters upper upper case letters
word "word" characters (same as \w) word "word" characters (same as \w)
xdigit hexadecimal digits xdigit hexadecimal digits
The default "space" characters are HT (9), LF (10), VT (11), FF (12), The default "space" characters are HT (9), LF (10), VT (11), FF (12),
CR (13), and space (32). If locale-specific matching is taking place, CR (13), and space (32). If locale-specific matching is taking place,
the list of space characters may be different; there may be fewer or the list of space characters may be different; there may be fewer or
more of them. "Space" used to be different to \s, which did not include more of them. "Space" used to be different to \s, which did not include
VT, for Perl compatibility. However, Perl changed at release 5.18, and VT, for Perl compatibility. However, Perl changed at release 5.18, and
PCRE followed at release 8.34. "Space" and \s now match the same set PCRE followed at release 8.34. "Space" and \s now match the same set
of characters. of characters.
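The default list of six space characters can be compared with another engine's ASCII notion of white space. The sketch below uses \s from Python's re module under re.ASCII; it says nothing about locale-specific tables or about PCRE builds, and is only a cross-check of the character codes listed above.

```python
import re

ascii_space = re.compile(r"\s", re.ASCII)

# Enumerate the 7-bit codes that count as white space.
space_codes = [c for c in range(128) if ascii_space.fullmatch(chr(c))]

# HT, LF, VT, FF, CR, and space, as listed in the text.
documented = [9, 10, 11, 12, 13, 32]
```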
The name "word" is a Perl extension, and "blank" is a GNU extension The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example, by a ^ character after the colon. For example,
[12[:^digit:]] [12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered. these are not supported, and an error is given if they are encountered.
By default, characters with values greater than 127 do not match any of By default, characters with values greater than 127 do not match any of
the POSIX character classes. However, if the PCRE_UCP option is passed the POSIX character classes. However, if the PCRE_UCP option is passed
to pcre_compile(), some of the classes are changed so that Unicode to pcre_compile(), some of the classes are changed so that Unicode
character properties are used. This is achieved by replacing certain character properties are used. This is achieved by replacing certain
POSIX classes by other sequences, as follows: POSIX classes by other sequences, as follows:
[:alnum:] becomes \p{Xan} [:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L} [:alpha:] becomes \p{L}
[:blank:] becomes \h [:blank:] becomes \h
[:digit:] becomes \p{Nd} [:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll} [:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps} [:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu} [:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd} [:word:] becomes \p{Xwd}
Negated versions, such as [:^alpha:] use \P instead of \p. Three other Negated versions, such as [:^alpha:] use \P instead of \p. Three other
POSIX classes are handled specially in UCP mode: POSIX classes are handled specially in UCP mode:
[:graph:] This matches characters that have glyphs that mark the page [:graph:] This matches characters that have glyphs that mark the page
when printed. In Unicode property terms, it matches all char- when printed. In Unicode property terms, it matches all char-
acters with the L, M, N, P, S, or Cf properties, except for: acters with the L, M, N, P, S, or Cf properties, except for:
U+061C Arabic Letter Mark U+061C Arabic Letter Mark
U+180E Mongolian Vowel Separator U+180E Mongolian Vowel Separator
U+2066 - U+2069 Various "isolate"s U+2066 - U+2069 Various "isolate"s
[:print:] This matches the same characters as [:graph:] plus space [:print:] This matches the same characters as [:graph:] plus space
characters that are not controls, that is, characters with characters that are not controls, that is, characters with
the Zs property. the Zs property.
[:punct:] This matches all characters that have the Unicode P (punctua- [:punct:] This matches all characters that have the Unicode P (punctua-
tion) property, plus those characters whose code points are tion) property, plus those characters whose code points are
less than 128 that have the S (Symbol) property. less than 128 that have the S (Symbol) property.
The other POSIX classes are unchanged, and match only characters with The other POSIX classes are unchanged, and match only characters with
code points less than 128. code points less than 128.
COMPATIBILITY FEATURE FOR WORD BOUNDARIES COMPATIBILITY FEATURE FOR WORD BOUNDARIES
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
and "end of word". PCRE treats these items as follows: and "end of word". PCRE treats these items as follows:
[[:<:]] is converted to \b(?=\w) [[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w) [[:>:]] is converted to \b(?<=\w)
Only these exact character sequences are recognized. A sequence such as Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
support is not compatible with Perl. It is provided to help migrations support is not compatible with Perl. It is provided to help migrations
from other environments, and is best not used in any new patterns. Note from other environments, and is best not used in any new patterns. Note
that \b matches at the start and the end of a word (see "Simple asser- that \b matches at the start and the end of a word (see "Simple asser-
tions" above), and in a Perl-style pattern the preceding or following tions" above), and in a Perl-style pattern the preceding or following
character normally shows which is wanted, without the need for the character normally shows which is wanted, without the need for the as-
assertions that are used above in order to give exactly the POSIX be- sertions that are used above in order to give exactly the POSIX behav-
haviour. iour.
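The two replacement sequences are ordinary Perl-style assertions, so their effect can be sketched in any engine that supports \b and lookaround. The example below uses Python's re module, which does not accept [[:<:]] or [[:>:]] itself; START_OF_WORD and END_OF_WORD are names chosen for the illustration.

```python
import re

START_OF_WORD = r"\b(?=\w)"    # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"     # what [[:>:]] is converted to

subject = "hi there"

# Both assertions match empty strings; record where they succeed.
starts = [m.start() for m in re.finditer(START_OF_WORD, subject)]
ends = [m.start() for m in re.finditer(END_OF_WORD, subject)]

# A plain \b succeeds at both kinds of position.
either = [m.start() for m in re.finditer(r"\b", subject)]
```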
VERTICAL BAR VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For Vertical bar characters are used to separate alternative patterns. For
example, the pattern example, the pattern
gilbert|sullivan gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern. rest of the main pattern as well as the alternative in the subpattern.
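The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be sketched as follows. Python's re module is used here on the assumption that its backtracking matcher orders alternatives the same way; the patterns a|ab and (a|ab)c are made up for the illustration.

```python
import re

# The first alternative that succeeds is used, even if a later one
# would have matched more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c" here, so "ab" is chosen.
with_tail = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "")
```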
INTERNAL OPTION SETTING INTERNAL OPTION SETTING
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed within the pattern by a sequence of Perl option letters enclosed be-
between "(?" and ")". The option letters are tween "(?" and ")". The option letters are
i for PCRE_CASELESS i for PCRE_CASELESS
m for PCRE_MULTILINE m for PCRE_MULTILINE
s for PCRE_DOTALL s for PCRE_DOTALL
x for PCRE_EXTENDED x for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It is also possi- For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the is also permitted. If a letter appears both before and after the hy-
hyphen, the option is unset. phen, the option is unset.
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively. the characters J, U and X respectively.
When one of these option changes occurs at top level (that is, not When one of these option changes occurs at top level (that is, not in-
inside subpattern parentheses), the change applies to the remainder of side subpattern parentheses), the change applies to the remainder of
the pattern that follows. An option change within a subpattern (see the pattern that follows. An option change within a subpattern (see be-
below for a description of subpatterns) affects only that part of the low for a description of subpatterns) affects only that part of the
subpattern that follows it, so subpattern that follows it, so
(a(?i)b)c (a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For do carry on into subsequent branches within the same subpattern. For
example, example,
(a(?i)b|c) (a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when matching "C" the matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise. some very weird behaviour otherwise.
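A rough analogue of localized option setting, written for Python's re module. Recent Python versions reject a bare (?i) in the middle of a pattern, so this sketch uses the scoped form (?i:...) instead; it therefore illustrates only the idea of caseless matching for part of a pattern, not the (a(?i)b|c) carry-over behaviour described above.

```python
import re

# Only the "b" is matched caselessly; "a" and "c" stay caseful.
pattern = re.compile(r"(a(?i:b))c")

candidates = ["abc", "aBc", "Abc", "abC", "ABC"]
accepted = [s for s in candidates if pattern.fullmatch(s)]
```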
Note: There are other PCRE-specific options that can be set by the Note: There are other PCRE-specific options that can be set by the ap-
application when the compiling or matching functions are called. In plication when the compiling or matching functions are called. In some
some cases the pattern can contain special leading sequences such as cases the pattern can contain special leading sequences such as (*CRLF)
(*CRLF) to override what the application has set or what has been to override what the application has set or what has been defaulted.
defaulted. Details are given in the section entitled "Newline Details are given in the section entitled "Newline sequences" above.
sequences" above. There are also the (*UTF8), (*UTF16),(*UTF32), and There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading se-
(*UCP) leading sequences that can be used to set UTF and Unicode prop- quences that can be used to set UTF and Unicode property modes; they
erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, are equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the
PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence PCRE_UCP options, respectively. The (*UTF) sequence is a generic ver-
is a generic version that can be used with any of the libraries. How- sion that can be used with any of the libraries. However, the applica-
ever, the application can set the PCRE_NEVER_UTF option, which locks tion can set the PCRE_NEVER_UTF option, which locks out the use of the
out the use of the (*UTF) sequences. (*UTF) sequences.
SUBPATTERNS SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things: nested. Turning part of a pattern into a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern 1. It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|) cat(aract|erpillar|)
matches "cataract", "caterpillar", or "cat". Without the parentheses, matches "cataract", "caterpillar", or "cat". Without the parentheses,
it would match "cataract", "erpillar" or an empty string. it would match "cataract", "erpillar" or an empty string.
2. It sets up the subpattern as a capturing subpattern. This means 2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the string that matched the subpattern is passed back to the caller via the
ovector argument of the matching function. (This applies only to the ovector argument of the matching function. (This applies only to the
traditional matching functions; the DFA matching functions do not sup- traditional matching functions; the DFA matching functions do not sup-
port capturing.) port capturing.)
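The first effect, localizing a set of alternatives, can be checked with a short sketch. It uses Python's standard re module rather than PCRE itself, on the assumption that the two behave identically for plain alternation; the helper name full and the sample strings are illustrative only.

```python
import re

# With parentheses, the alternation is confined to the suffix after "cat".
grouped = re.compile(r"cat(aract|erpillar|)")
# Without them, the alternation splits the whole pattern into three branches:
# "cataract", "erpillar", and the empty string.
ungrouped = re.compile(r"cataract|erpillar|")

def full(pattern, subject):
    """Return True when the pattern matches the entire subject."""
    return pattern.fullmatch(subject) is not None

samples = ("cataract", "caterpillar", "cat", "erpillar", "")
grouped_hits = [s for s in samples if full(grouped, s)]
ungrouped_hits = [s for s in samples if full(ungrouped, s)]

print(grouped_hits)
print(ungrouped_hits)
```

Whole-string matching is used so that the empty alternative does not make every subject succeed trivially.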
Opening parentheses are counted from left to right (starting from 1) to Opening parentheses are counted from left to right (starting from 1) to
obtain numbers for the capturing subpatterns. For example, if the obtain numbers for the capturing subpatterns. For example, if the
string "the red king" is matched against the pattern string "the red king" is matched against the pattern
the ((red|white) (king|queen)) the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are num- the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively. bered 1, 2, and 3, respectively.
The fact that plain parentheses fulfil two functions is not always The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any captur- by a question mark and a colon, the subpattern does not do any captur-
ing, and is not counted when computing the number of any subsequent ing, and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern matched against the pattern
the ((?:red|white) (king|queen)) the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535. 1 and 2. The maximum number of capturing subpatterns is 65535.
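The numbering rules for capturing and non-capturing parentheses can be reproduced with the two patterns given in this section. The sketch below uses Python's re module as a stand-in for PCRE, on the assumption that group numbering works the same way in both; groups() returns the captured substrings in order, starting with number 1.

```python
import re

# Three opening parentheses, counted left to right: groups 1, 2 and 3.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering,
# so (king|queen) becomes group 2 instead of group 3.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())      # ('red king', 'red', 'king')
print(non_capturing.groups())  # ('white queen', 'queen')
```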
As a convenient shorthand, if any option settings are required at the As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear start of a non-capturing subpattern, the option letters may appear be-
between the "?" and the ":". Thus the two patterns tween the "?" and the ":". Thus the two patterns
(?i:saturday|sunday) (?i:saturday|sunday)
(?:(?i)saturday|sunday) (?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday". "Saturday".
DUPLICATE SUBPATTERN NUMBERS DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example, starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern: consider this pattern:
(?|(Sat)ur|(Sun))day (?|(Sat)ur|(Sun))day
Because the two alternatives are inside a (?| group, both sets of cap- Because the two alternatives are inside a (?| group, both sets of cap-
turing parentheses are numbered one. Thus, when the pattern matches, turing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren- not all, of one of a number of alternatives. Inside a (?| group, paren-
theses are numbered as usual, but the number is reset at the start of theses are numbered as usual, but the number is reset at the start of
each branch. The numbers of any capturing parentheses that follow the each branch. The numbers of any capturing parentheses that follow the
subpattern start after the highest number used in any branch. The fol- subpattern start after the highest number used in any branch. The fol-
lowing example is taken from the Perl documentation. The numbers under- lowing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored. neath show in which buffer the captured content will be stored.
# before  ---------------branch-reset----------- after      # before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x   / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4          # 1            2         2  3        2     3     4
A back reference to a numbered subpattern uses the most recent value A back reference to a numbered subpattern uses the most recent value
that is set for that number by any subpattern. The following pattern that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef": matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/ /(?|(abc)|(def))\1/
In contrast, a subroutine call to a numbered subpattern always refers In contrast, a subroutine call to a numbered subpattern always refers
to the first one in the pattern with the given number. The following to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc": pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/ /(?|(abc)|(def))(?1)/
If a condition test for a subpattern's having matched refers to a non- If a condition test for a subpattern's having matched refers to a non-
unique number, the test is true if any of the subpatterns of that num- unique number, the test is true if any of the subpatterns of that num-
ber have matched. ber have matched.
An alternative approach to using this "branch reset" feature is to use An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section. duplicate named subpatterns, as described in the next section.
NAMED SUBPATTERNS NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular expres- very hard to keep track of the numbers in complicated regular expres-
sions. Furthermore, if an expression is modified, the numbers may sions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of sub- change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python patterns. This feature was not added to Perl until release 5.10. Python
had the feature earlier, and PCRE introduced it at release 4.0, using had the feature earlier, and PCRE introduced it at release 4.0, using
the Python syntax. PCRE now supports both the Perl and the Python syn- the Python syntax. PCRE now supports both the Perl and the Python syn-
tax. Perl allows identically numbered subpatterns to have different tax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not. names, but PCRE does not.
In PCRE, a subpattern can be named in one of three ways: (?<name>...) In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back to capturing parentheses from other parts of the pattern, such as back
references, recursion, and conditions, can be made by name as well as references, recursion, and conditions, can be made by name as well as
by number. by number.
Names consist of up to 32 alphanumeric characters and underscores, but Names consist of up to 32 alphanumeric characters and underscores, but
must start with a non-digit. Named capturing parentheses are still must start with a non-digit. Named capturing parentheses are still al-
allocated numbers as well as names, exactly as if the names were not located numbers as well as names, exactly as if the names were not
present. The PCRE API provides function calls for extracting the name- present. The PCRE API provides function calls for extracting the name-
to-number translation table from a compiled pattern. There is also a to-number translation table from a compiled pattern. There is also a
convenience function for extracting a captured substring by name. convenience function for extracting a captured substring by name.
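As a rough illustration of names coexisting with numbers, the following sketch uses Python's re module, which accepts only the (?P<name>...) form. Its groupindex attribute plays a role comparable to the name-to-number table mentioned above, but it is not the PCRE API; for the real functions, see the pcreapi documentation. The pattern and the subject string are made up for the example.

```python
import re

# Two named groups followed by one unnamed group. Named groups are still
# numbered exactly as if the names were absent.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Name-to-number mapping for the compiled pattern.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2020-02-12")
by_name = m.group("year")   # extraction by name
by_number = m.group(1)      # the same substring, by number

print(name_to_number)
print(by_name, by_number, m.group(3))
```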
By default, a name must be unique within a pattern, but it is possible By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. (Duplicate names are also always permitted for subpatterns with time. (Duplicate names are also always permitted for subpatterns with
the same number, set up as described in the previous section.) Dupli- the same number, set up as described in the previous section.) Dupli-
cate names can be useful for patterns where only one instance of the cate names can be useful for patterns where only one instance of the
named parentheses can match. Suppose you want to match the name of a named parentheses can match. Suppose you want to match the name of a
weekday, either as a 3-letter abbreviation or as the full name, and in weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job: the line breaks) does the job:
(?<DN>Mon|Fri|Sun)(?:day)?| (?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?| (?<DN>Tue)(?:sday)?|
(?<DN>Wed)(?:nesday)?| (?<DN>Wed)(?:nesday)?|
(?<DN>Thu)(?:rsday)?| (?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)? (?<DN>Sat)(?:urday)?
There are five capturing substrings, but only one is ever set after a There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.) reset" subpattern, as described in the previous section.)
The convenience function for extracting the data by name returns the The convenience function for extracting the data by name returns the
substring for the first (and in this example, the only) subpattern of substring for the first (and in this example, the only) subpattern of
that name that matched. This saves searching to find which numbered that name that matched. This saves searching to find which numbered
subpattern it was. subpattern it was.
If you make a back reference to a non-unique named subpattern from If you make a back reference to a non-unique named subpattern from
elsewhere in the pattern, the subpatterns to which the name refers are elsewhere in the pattern, the subpatterns to which the name refers are
checked in the order in which they appear in the overall pattern. The checked in the order in which they appear in the overall pattern. The
first one that is set is used for the reference. For example, this pat- first one that is set is used for the reference. For example, this pat-
tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo": tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
(?:(?<n>foo)|(?<n>bar))\k<n> (?:(?<n>foo)|(?<n>bar))\k<n>
If you make a subroutine call to a non-unique named subpattern, the one If you make a subroutine call to a non-unique named subpattern, the one
that corresponds to the first occurrence of the name is used. In the that corresponds to the first occurrence of the name is used. In the
absence of duplicate numbers (see the previous section) this is the one absence of duplicate numbers (see the previous section) this is the one
with the lowest number. with the lowest number.
If you use a named reference in a condition test (see the section about If you use a named reference in a condition test (see the section about
conditions below), either to check whether a subpattern has matched, or conditions below), either to check whether a subpattern has matched, or
to check for recursion, all subpatterns with the same name are tested. to check for recursion, all subpatterns with the same name are tested.
If the condition is true for any one of them, the overall condition is If the condition is true for any one of them, the overall condition is
true. This is the same behaviour as testing by number. For further true. This is the same behaviour as testing by number. For further de-
details of the interfaces for handling named subpatterns, see the tails of the interfaces for handling named subpatterns, see the pcreapi
pcreapi documentation. documentation.
Warning: You cannot use different names to distinguish between two sub- Warning: You cannot use different names to distinguish between two sub-
patterns with the same number because PCRE uses only the numbers when patterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if differ- matching. For this reason, an error is given at compile time if differ-
ent names are given to subpatterns with the same number. However, you ent names are given to subpatterns with the same number. However, you
can always give the same name to subpatterns with the same number, even can always give the same name to subpatterns with the same number, even
when PCRE_DUPNAMES is not set. when PCRE_DUPNAMES is not set.
REPETITION REPETITION
Repetition is specified by quantifiers, which can follow any of the Repetition is specified by quantifiers, which can follow any of the
following items: following items:
a literal data character a literal data character
the dot metacharacter the dot metacharacter
the \C escape sequence the \C escape sequence
the \X escape sequence the \X escape sequence
the \R escape sequence the \R escape sequence
an escape such as \d or \pL that matches a single character an escape such as \d or \pL that matches a single character
a character class a character class
a back reference (see next section) a back reference (see next section)
a parenthesized subpattern (including assertions) a parenthesized subpattern (including assertions)
a subroutine call to a subpattern (recursive or otherwise) a subroutine call to a subpattern (recursive or otherwise)
The general repetition quantifier specifies a minimum and maximum num- The general repetition quantifier specifies a minimum and maximum num-
ber of permitted matches, by giving the two numbers in curly brackets ber of permitted matches, by giving the two numbers in curly brackets
(braces), separated by a comma. The numbers must be less than 65536, (braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example: and the first must be less than or equal to the second. For example:
z{2,4} z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required are both omitted, the quantifier specifies an exact number of required
matches. Thus matches. Thus
[aeiou]{3,} [aeiou]{3,}
matches at least 3 successive vowels, but may match many more, while matches at least 3 successive vowels, but may match many more, while
\d{8} \d{8}
matches exactly 8 digits. An opening curly bracket that appears in a matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For exam- the syntax of a quantifier, is taken as a literal character. For exam-
ple, {,6} is not a quantifier, but a literal string of four characters. ple, {,6} is not a quantifier, but a literal string of four characters.
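The three quantifier forms above can be tried out with the sketch below, written against Python's re module on the assumption that it agrees with PCRE for these cases. The {,6} case is deliberately left out: Python treats {,6} as {0,6}, which differs from the PCRE behaviour described here. The helper name full and the sample subjects are illustrative.

```python
import re

def full(pattern, subject):
    """Return True when the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit; the greedy match takes all five.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_digits = full(r"\d{8}", "20200212")
seven_digits = full(r"\d{8}", "2020021")

print(z_hits, vowel_run, eight_digits, seven_digits)
```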
In UTF modes, quantifiers apply to characters rather than to individual In UTF modes, quantifiers apply to characters rather than to individual
data units. Thus, for example, \x{100}{2} matches two characters, each data units. Thus, for example, \x{100}{2} matches two characters, each
of which is represented by a two-byte sequence in a UTF-8 string. Simi- of which is represented by a two-byte sequence in a UTF-8 string. Simi-
larly, \X{3} matches three Unicode extended grapheme clusters, each of larly, \X{3} matches three Unicode extended grapheme clusters, each of
which may be several data units long (and they may be of different which may be several data units long (and they may be of different
lengths). lengths).
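The difference between characters and data units can be seen in a small sketch. Python's str patterns always operate on code points, which is assumed here to be a fair analogue of PCRE's UTF mode; Python writes the character as \u0100 rather than \x{100}.

```python
import re

subject = "\u0100\u0100"

# The {2} applies to the character U+0100, not to its individual bytes.
m = re.fullmatch(r"\u0100{2}", subject)

matched_characters = m.end() - m.start()
utf8_bytes = len(subject.encode("utf-8"))  # U+0100 needs two bytes in UTF-8

print(matched_characters, utf8_bytes)
```

Two characters are matched, although the same text occupies four bytes when encoded as UTF-8.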
The quantifier {0} is permitted, causing the expression to behave as if The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use- the previous item and the quantifier were not present. This may be use-
ful for subpatterns that are referenced as subroutines from elsewhere ful for subpatterns that are referenced as subroutines from elsewhere
in the pattern (but see also the section entitled "Defining subpatterns in the pattern (but see also the section entitled "Defining subpatterns
for use by reference only" below). Items other than subpatterns that for use by reference only" below). Items other than subpatterns that
have a {0} quantifier are omitted from the compiled pattern. have a {0} quantifier are omitted from the compiled pattern.
For convenience, the three most common quantifiers have single-charac- For convenience, the three most common quantifiers have single-charac-
ter abbreviations: ter abbreviations:
* is equivalent to {0,} * is equivalent to {0,}
+ is equivalent to {1,} + is equivalent to {1,}
? is equivalent to {0,1} ? is equivalent to {0,1}
It is possible to construct infinite loops by following a subpattern It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit, that can match no characters with a quantifier that has no upper limit,
for example: for example:
(a?)* (a?)*
Earlier versions of Perl and PCRE used to give an error at compile time Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, because there are cases where this can be for such patterns. However, because there are cases where this can be
useful, such patterns are now accepted, but if any repetition of the useful, such patterns are now accepted, but if any repetition of the
subpattern does in fact match no characters, the loop is forcibly bro- subpattern does in fact match no characters, the loop is forcibly bro-
ken. ken.
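Python's re module also accepts such patterns and terminates on them, so it can be used for a rough check of the behaviour described above. This is a sketch only; it says nothing about how PCRE breaks the loop internally.

```python
import re

loop = re.compile(r"(a?)*")

# The group consumes the three leading "a" characters; a further iteration
# could only match the empty string, so the repetition stops there.
end_after_as = loop.match("aaab").end()

# With no "a" at the start, the whole pattern matches the empty string.
end_on_empty = loop.match("b").end()

print(end_after_as, end_on_empty)
```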
By default, the quantifiers are "greedy", that is, they match as much By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These this gives problems is in trying to match comments in C programs. These
appear between /* and */ and within the comment, individual * and / appear between /* and */ and within the comment, individual * and /
characters may appear. An attempt to match C comments by applying the characters may appear. An attempt to match C comments by applying the
pattern pattern
/\*.*\*/ /\*.*\*/
to the string to the string
/* first comment */ not comment /* second comment */ /* first comment */ not comment /* second comment */
fails, because it matches the entire string owing to the greediness of fails, because it matches the entire string owing to the greediness of
the .* item. the .* item.
However, if a quantifier is followed by a question mark, it ceases to However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so be greedy, and instead matches the minimum number of times possible, so
the pattern the pattern
/\*.*?\*/ /\*.*?\*/
does the right thing with the C comments. The meaning of the various does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of quantifiers is not otherwise changed, just the preferred number of
matches. Do not confuse this use of question mark with its use as a matches. Do not confuse this use of question mark with its use as a
quantifier in its own right. Because it has two uses, it can sometimes quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in appear doubled, as in
\d??\d \d??\d
which matches one digit by preference, but can match two if that is the which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches. only way the rest of the pattern matches.
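The C comment example and the doubled question mark can be checked with the following sketch, which assumes that Python's re module follows the same greedy and minimizing rules as PCRE for these patterns.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the match covers the entire string.
greedy = re.search(r"/\*.*\*/", source).group()

# Minimizing: .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == source)
print(lazy)
print(one_digit, two_digits)
```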
If the PCRE_UNGREEDY option is set (an option that is not available in If the PCRE_UNGREEDY option is set (an option that is not available in
Perl), the quantifiers are not greedy by default, but individual ones Perl), the quantifiers are not greedy by default, but individual ones
can be made greedy by following them with a question mark. In other can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour. words, it inverts the default behaviour.
When a parenthesized subpattern is quantified with a minimum repeat When a parenthesized subpattern is quantified with a minimum repeat
count that is greater than 1 or with a limited maximum, more memory is count that is greater than 1 or with a limited maximum, more memory is
required for the compiled pattern, in proportion to the size of the required for the compiled pattern, in proportion to the size of the
minimum or maximum. minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
alent to Perl's /s) is set, thus allowing the dot to match newlines, alent to Perl's /s) is set, thus allowing the dot to match newlines,
the pattern is implicitly anchored, because whatever follows will be the pattern is implicitly anchored, because whatever follows will be
tried against every character position in the subject string, so there tried against every character position in the subject string, so there
is no point in retrying the overall match at any position after the is no point in retrying the overall match at any position after the
first. PCRE normally treats such a pattern as though it were preceded first. PCRE normally treats such a pattern as though it were preceded
by \A. by \A.
In cases where it is known that the subject string contains no new- In cases where it is known that the subject string contains no new-
lines, it is worth setting PCRE_DOTALL in order to obtain this opti- lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
mization, or alternatively using ^ to indicate anchoring explicitly. mization, or alternatively using ^ to indicate anchoring explicitly.
However, there are some cases where the optimization cannot be used. However, there are some cases where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail where reference elsewhere in the pattern, a match at the start may fail where
a later one succeeds. Consider, for example: a later one succeeds. Consider, for example:
(.*)abc\1 (.*)abc\1
If the subject is "xyz123abc123" the match point is the fourth charac- If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored. ter. For this reason, such a pattern is not implicitly anchored.
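The following sketch shows why anchoring would give the wrong answer for this pattern. It uses Python's re module, where re.match tries only the start of the subject and re.search tries every position; whether Python applies any comparable optimization internally is not something the example relies on.

```python
import re

pattern = re.compile(r"(.*)abc\1")
subject = "xyz123abc123"

# Tried only at the start, the pattern fails: "xyz123" would have to be
# repeated after "abc".
anchored = pattern.match(subject)

# Tried at every position, it succeeds at offset 3, the fourth character.
unanchored = pattern.search(subject)

print(anchored)
print(unanchored.start(), unanchored.group(1), unanchored.group())
```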
Another case where implicit anchoring is not applied is when the lead- Another case where implicit anchoring is not applied is when the lead-
ing .* is inside an atomic group. Once again, a match at the start may ing .* is inside an atomic group. Once again, a match at the start may
fail where a later one succeeds. Consider this pattern: fail where a later one succeeds. Consider this pattern:
(?>.*?a)b (?>.*?a)b
It matches "ab" in the subject "aab". The use of the backtracking con- It matches "ab" in the subject "aab". The use of the backtracking con-
trol verbs (*PRUNE) and (*SKIP) also disables this optimization. trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
When a capturing subpattern is repeated, the value captured is the sub- When a capturing subpattern is repeated, the value captured is the sub-
string that matched the final iteration. For example, after string that matched the final iteration. For example, after
(tweedle[dume]{3}\s*)+ (tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring has matched "tweedledum tweedledee" the value of the captured substring
is "tweedledee". However, if there are nested capturing subpatterns, is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous itera- the corresponding captured values may have been set in previous itera-
tions. For example, after tions. For example, after
/(a|(b))+/ /(a|(b))+/
matches "aba" the value of the second captured substring is "b". matches "aba" the value of the second captured substring is "b".
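Both cases can be reproduced with Python's re module, which is assumed here to keep captured values across iterations in the same way as PCRE. Other regex flavours may reset nested groups on each iteration, so this should not be taken as universal.

```python
import re

# The outer group is overwritten on each iteration; the last one wins.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# Group 2 is set in the second iteration ("b") and is left untouched by the
# third iteration, which matches "a" through the first alternative.
nested = re.match(r"(a|(b))+", "aba")

print(last_iteration)
print(nested.groups())
```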
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item repetition, failure of what follows normally causes the repeated item
to be re-evaluated to see if a different number of repeats allows the to be re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent this, rest of the pattern to match. Sometimes it is useful to prevent this,
either to change the nature of the match, or to cause it to fail earlier either to change the nature of the match, or to cause it to fail earlier
than it otherwise might, when the author of the pattern knows there is than it otherwise might, when the author of the pattern knows there is
no point in carrying on. no point in carrying on.
Consider, for example, the pattern \d+foo when applied to the subject Consider, for example, the pattern \d+foo when applied to the subject
line line
123456bar 123456bar
After matching all 6 digits and then failing to match "foo", the normal After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing. \d+ item, and then with 4, and so on, before ultimately failing.
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
the means for specifying that once a subpattern has matched, it is not the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way. to be re-evaluated in this way.
If we use atomic grouping for the previous example, the matcher gives If we use atomic grouping for the previous example, the matcher gives
up immediately on failing to match "foo" the first time. The notation up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example: is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo (?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern it con- This kind of parenthesis "locks up" the part of the pattern it con-
tains once it has matched, and a failure further into the pattern is tains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal. items, however, works as normal.
An alternative description is that a subpattern of this type matches An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string. match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are pre- must swallow everything it can. So, while both \d+ and \d+? are pre-
pared to adjust the number of digits they match in order to make the pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of rest of the pattern match, (?>\d+) can only match an entire sequence of
digits. digits.
Atomic groups in general can of course contain arbitrarily complicated Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier" can be used. This simpler notation, called a "possessive quantifier" can be used. This
consists of an additional + character following a quantifier. Using consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as this notation, the previous example can be rewritten as
\d++foo \d++foo
Note that a possessive quantifier can be used with an entire group, for Note that a possessive quantifier can be used with an entire group, for
example: example:
(abc|xyz){2,3}+ (abc|xyz){2,3}+
Possessive quantifiers are always greedy; the setting of the Possessive quantifiers are always greedy; the setting of the PCRE_UN-
PCRE_UNGREEDY option is ignored. They are a convenient notation for the GREEDY option is ignored. They are a convenient notation for the sim-
simpler forms of atomic group. However, there is no difference in the pler forms of atomic group. However, there is no difference in the
meaning of a possessive quantifier and the equivalent atomic group, meaning of a possessive quantifier and the equivalent atomic group,
though there may be a performance difference; possessive quantifiers though there may be a performance difference; possessive quantifiers
should be slightly faster. should be slightly faster.
The possessive quantifier syntax is an extension to the Perl 5.8 syn- The possessive quantifier syntax is an extension to the Perl 5.8 syn-
tax. Jeffrey Friedl originated the idea (and the name) in the first tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he edition of his book. Mike McCloskey liked it, so implemented it when he
built Sun's Java package, and PCRE copied it from there. It ultimately built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10. found its way into Perl at release 5.10.
PCRE has an optimization that automatically "possessifies" certain sim- PCRE has an optimization that automatically "possessifies" certain sim-
ple pattern constructs. For example, the sequence A+B is treated as ple pattern constructs. For example, the sequence A+B is treated as
A++B because there is no point in backtracking into a sequence of A's A++B because there is no point in backtracking into a sequence of A's
when B must follow. when B must follow.
When a pattern contains an unlimited repeat inside a subpattern that When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern very long time indeed. The pattern
(\D+|<\d+>)*[!?] (\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist of non- matches an unlimited number of substrings that either consist of non-
digits, or digits enclosed in <>, followed by either ! or ?. When it digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is because the it takes a long time before reporting failure. This is because the
string can be divided between the internal \D+ repeat and the external string can be divided between the internal \D+ repeat and the external
* repeat in a large number of ways, and all have to be tried. (The * repeat in a large number of ways, and all have to be tried. (The ex-
example uses [!?] rather than a single character at the end, because ample uses [!?] rather than a single character at the end, because both
both PCRE and Perl have an optimization that allows for fast failure PCRE and Perl have an optimization that allows for fast failure when a
when a single character is used. They remember the last single charac- single character is used. They remember the last single character that
ter that is required for a match, and fail early if it is not present is required for a match, and fail early if it is not present in the
in the string.) If the pattern is changed so that it uses an atomic string.) If the pattern is changed so that it uses an atomic group,
group, like this: like this:
((?>\D+)|<\d+>)*[!?] ((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens quickly. sequences of non-digits cannot be broken, and failure happens quickly.
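To get a feel for the "large number of ways" mentioned above, one can count how many ways a run of n non-digits can be cut into consecutive non-empty pieces, each piece standing for one iteration of the outer * with \D+ inside it. This is a simplified model and not a trace of PCRE's matcher; the function name count_splits is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1  # nothing left to split: exactly one (empty) way
    # Choose the length k of the first piece, then split what remains.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (1, 2, 3, 4, 10, 52):
    print(n, count_splits(n))
```

In this model the count doubles with each extra character (it equals 2**(n-1)), which is why the 52-character subject is so slow to fail. With the atomic group, each run of non-digits can be taken in only one way.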
BACK REFERENCES BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub- 0 (and possibly further digits) is a back reference to a capturing sub-
pattern earlier (that is, to its left) in the pattern, provided there pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses. have been that many previous capturing left parentheses.
However, if the decimal number following the backslash is less than 10, However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire pat- there are not that many capturing left parentheses in the entire pat-
tern. In other words, the parentheses that are referenced need not be tern. In other words, the parentheses that are referenced need not be
to the left of the reference for numbers less than 10. A "forward back to the left of the reference for numbers less than 10. A "forward back
reference" of this type can make sense when a repetition is involved reference" of this type can make sense when a repetition is involved
and the subpattern to the right has participated in an earlier itera- and the subpattern to the right has participated in an earlier itera-
tion. tion.
It is not possible to have a numerical "forward back reference" to a It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more using this syntax because a subpattern whose number is 10 or more using this syntax because a se-
sequence such as \50 is interpreted as a character defined in octal. quence such as \50 is interpreted as a character defined in octal. See
See the subsection entitled "Non-printing characters" above for further the subsection entitled "Non-printing characters" above for further de-
details of the handling of digits following a backslash. There is no tails of the handling of digits following a backslash. There is no such
such problem when named parentheses are used. A back reference to any problem when named parentheses are used. A back reference to any sub-
subpattern is possible using named parentheses (see below). pattern is possible using named parentheses (see below).
Another way of avoiding the ambiguity inherent in the use of digits Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence. This escape following a backslash is to use the \g escape sequence. This escape
must be followed by an unsigned number or a negative number, optionally must be followed by an unsigned number or a negative number, optionally
enclosed in braces. These examples are all identical: enclosed in braces. These examples are all identical:
(ring), \1 (ring), \1
(ring), \g1 (ring), \g1
(ring), \g{1} (ring), \g{1}
An unsigned number specifies an absolute reference without the ambigu- An unsigned number specifies an absolute reference without the ambigu-
ity that is present in the older syntax. It is also useful when literal ity that is present in the older syntax. It is also useful when literal
digits follow the reference. A negative number is a relative reference. digits follow the reference. A negative number is a relative reference.
Consider this example: Consider this example:
(abc(def)ghi)\g{-1} (abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started captur- The sequence \g{-1} is a reference to the most recently started captur-
ing subpattern before \g, that is, it is equivalent to \2 in this exam- ing subpattern before \g, that is, it is equivalent to \2 in this exam-
ple. Similarly, \g{-2} would be equivalent to \1. The use of relative ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
references can be helpful in long patterns, and also in patterns that references can be helpful in long patterns, and also in patterns that
are created by joining together fragments that contain references are created by joining together fragments that contain references
within themselves. within themselves.
A back reference matches whatever actually matched the capturing sub- A back reference matches whatever actually matched the capturing sub-
pattern in the current subject string, rather than anything matching pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern of doing that). So the pattern
(sens|respons)e and \1ibility (sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For exam- time of the back reference, the case of letters is relevant. For exam-
ple, ple,
((?i)rah)\s+\1 ((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly. original capturing subpattern is matched caselessly.
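The two examples above can be tried out as follows. This is a sketch using Python's re module, which is an assumption of the example and not PCRE. Python does not accept (?i) in the middle of a pattern, so the scoped form (?i:rah) is used to make only the group caseless.

```python
import re

# The back reference must repeat the text that group 1 actually matched.
pattern = re.compile(r"(sens|respons)e and \1ibility")

results = {}
for text in ("sense and sensibility",
             "response and responsibility",
             "sense and responsibility"):
    results[text] = pattern.fullmatch(text) is not None
    print(text, "->", results[text])

# Group 1 is matched caselessly, but the back reference lies outside the
# (?i:...) scope, so it compares the captured text casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {t: rah.fullmatch(t) is not None
               for t in ("rah rah", "RAH RAH", "RAH rah")}
print(rah_results)
```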
There are several different ways of writing back references to named There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric unified back reference syntax, in which \g can be used for both numeric
and named references, is also supported. We could rewrite the above and named references, is also supported. We could rewrite the above ex-
example in any of the following ways: ample in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1> (?<p1>(?i)rah)\s+\k<p1>
(?'p1'(?i)rah)\s+\k{p1} (?'p1'(?i)rah)\s+\k{p1}
(?P<p1>(?i)rah)\s+(?P=p1) (?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1} (?<p1>(?i)rah)\s+\g{p1}
A subpattern that is referenced by name may appear in the pattern A subpattern that is referenced by name may appear in the pattern be-
before or after the reference. fore or after the reference.
There may be more than one back reference to the same subpattern. If a There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern references to it always fail by default. For example, the pattern
(a|(bc))\2 (a|(bc))\2
always fails if it starts to match "a" rather than "bc". However, if always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer- the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
ence to an unset value matches an empty string. ence to an unset value matches an empty string.
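The default behaviour for a reference to an unset group can be seen in a small sketch. It uses Python's re module as a stand-in, which is an assumption of this example; Python's re also fails a reference to a group that did not take part in the match. The PCRE_JAVASCRIPT_COMPAT behaviour is not shown, as the sketch has no equivalent for it.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 can repeat "bc".
with_bc = pattern.fullmatch("bcbc")

# Group 1 matches "a" but group 2 is unset, so \2 fails.
with_a = pattern.fullmatch("a")

# When searching, the attempt at "a" fails and the match is found later.
found = pattern.search("abcbc")

print(with_bc is not None, with_a is not None, found.span())
```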
Because there may be many capturing parentheses in a pattern, all dig- Because there may be many capturing parentheses in a pattern, all dig-
its following a backslash are taken as part of a potential back refer- its following a backslash are taken as part of a potential back refer-
ence number. If the pattern continues with a digit character, some ence number. If the pattern continues with a digit character, some de-
delimiter must be used to terminate the back reference. If the limiter must be used to terminate the back reference. If the PCRE_EX-
PCRE_EXTENDED option is set, this can be white space. Otherwise, the TENDED option is set, this can be white space. Otherwise, the \g{ syn-
\g{ syntax or an empty comment (see "Comments" below) can be used. tax or an empty comment (see "Comments" below) can be used.
Recursive back references Recursive back references
A back reference that occurs inside the parentheses to which it refers A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated sub- matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern patterns. For example, the pattern
(a|b\1)+ (a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
ation of the subpattern, the back reference matches the character ation of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero. the example above, or by a quantifier with a minimum of zero.
Back references of this type cause the group that they reference to be Back references of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle subsequent matching failure cannot cause backtracking into the middle
of the group. of the group.
ASSERTIONS ASSERTIONS
An assertion is a test on the characters following or preceding the An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters. current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are de-
described above. scribed above.
More complicated assertions are coded as subpatterns. There are two More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current matched in the normal way, except that it does not cause the current
matching position to be changed. matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an asser- Assertion subpatterns are not capturing subpatterns. If such an asser-
tion contains capturing subpatterns within it, these are counted for tion contains capturing subpatterns within it, these are counted for
the purposes of numbering the capturing subpatterns in the whole pat- the purposes of numbering the capturing subpatterns in the whole pat-
tern. However, substring capturing is carried out only for positive tern. However, substring capturing is carried out only for positive as-
assertions. (Perl sometimes, but not always, does do capturing in nega- sertions. (Perl sometimes, but not always, does do capturing in nega-
tive assertions.) tive assertions.)
WARNING: If a positive assertion containing one or more capturing sub- WARNING: If a positive assertion containing one or more capturing sub-
patterns succeeds, but failure to match later in the pattern causes patterns succeeds, but failure to match later in the pattern causes
backtracking over this assertion, the captures within the assertion are backtracking over this assertion, the captures within the assertion are
reset only if no higher numbered captures are already set. This is, reset only if no higher numbered captures are already set. This is, un-
unfortunately, a fundamental limitation of the current implementation, fortunately, a fundamental limitation of the current implementation,
and as PCRE1 is now in maintenance-only status, it is unlikely ever to and as PCRE1 is now in maintenance-only status, it is unlikely ever to
change. change.
For compatibility with Perl, assertion subpatterns may be repeated; For compatibility with Perl, assertion subpatterns may be repeated;
though it makes no sense to assert the same thing several times, the though it makes no sense to assert the same thing several times, the
side effect of capturing parentheses may occasionally be useful. In side effect of capturing parentheses may occasionally be useful. In
practice, there are only three cases: practice, there are only three cases:
(1) If the quantifier is {0}, the assertion is never obeyed during (1) If the quantifier is {0}, the assertion is never obeyed during
matching. However, it may contain internal capturing parenthesized matching. However, it may contain internal capturing parenthesized
groups that are called from elsewhere via the subroutine mechanism. groups that are called from elsewhere via the subroutine mechanism.
(2) If the quantifier is {0,n} where n is greater than zero, it is treated (2) If the quantifier is {0,n} where n is greater than zero, it is treated
as if it were {0,1}. At run time, the rest of the pattern match is as if it were {0,1}. At run time, the rest of the pattern match is
tried with and without the assertion, the order depending on the greed- tried with and without the assertion, the order depending on the greed-
iness of the quantifier. iness of the quantifier.
(3) If the minimum repetition is greater than zero, the quantifier is (3) If the minimum repetition is greater than zero, the quantifier is
ignored. The assertion is obeyed just once when encountered during ignored. The assertion is obeyed just once when encountered during
matching. matching.
Lookahead assertions Lookahead assertions
Lookahead assertions start with (?= for positive assertions and (?! for Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example, negative assertions. For example,
\w+(?=;) \w+(?=;)
matches a word followed by a semicolon, but does not include the semi- matches a word followed by a semicolon, but does not include the semi-
colon in the match, and colon in the match, and
foo(?!bar) foo(?!bar)
matches any occurrence of "foo" that is not followed by "bar". Note matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern that the apparently similar pattern
(?!foo)bar (?!foo)bar
does not find an occurrence of "bar" that is preceded by something does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect. "bar". A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string most convenient way to do it is with (?!) because an empty string al-
always matches, so an assertion that requires there not to be an empty ways matches, so an assertion that requires there not to be an empty
string must always fail. The backtracking control verb (*FAIL) or (*F) string must always fail. The backtracking control verb (*FAIL) or (*F)
is a synonym for (?!). is a synonym for (?!).
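The lookahead examples in this section can be checked with a few lines of code. The sketch below uses Python's re module, which is an assumption of this example rather than PCRE; these particular patterns use only syntax that both accept. The subject strings are made up for illustration.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
m1 = re.search(r"\w+(?=;)", "alpha beta; gamma")

# "foo" not followed by "bar": the first "foo" is skipped.
m2 = re.search(r"foo(?!bar)", "foobar foobaz")

# (?!foo) is tested where "bar" starts, so it is always true there.
m3 = re.search(r"(?!foo)bar", "foobar")

# An empty negative lookahead can never succeed.
m4 = re.search(r"(?!)", "anything")

print(m1.group(), m2.start(), m3.start(), m4)
```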
Lookbehind assertions Lookbehind assertions
Lookbehind assertions start with (?<= for positive assertions and (?<! Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example, for negative assertions. For example,
(?<!foo)bar (?<!foo)bar
does find an occurrence of "bar" that is not preceded by "foo". The does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev- strings it matches must have a fixed length. However, if there are sev-
eral top-level alternatives, they do not all have to have the same eral top-level alternatives, they do not all have to have the same
fixed length. Thus fixed length. Thus
(?<=bullock|donkey) (?<=bullock|donkey)
is permitted, but is permitted, but
(?<!dogs?|cats?) (?<!dogs?|cats?)
causes an error at compile time. Branches that match different length causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion. strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl, which requires all branches to This is an extension compared with Perl, which requires all branches to
match the same length of string. An assertion such as match the same length of string. An assertion such as
(?<=ab(c|de)) (?<=ab(c|de))
is not permitted, because its single top-level branch can match two is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches: top-level branches:
(?<=abc|abde) (?<=abc|abde)
In some cases, the escape sequence \K (see above) can be used instead In some cases, the escape sequence \K (see above) can be used instead
of a lookbehind assertion to get round the fixed-length restriction. of a lookbehind assertion to get round the fixed-length restriction.
The implementation of lookbehind assertions is, for each alternative, The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur- then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails. rent position, the assertion fails.
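A brief sketch of the lookbehind behaviour follows, using Python's re module as a stand-in (an assumption of this example, not PCRE). Note that Python is stricter than PCRE: it requires every alternative in a lookbehind to have the same length, so it rejects (?<=abc|abde) as well as (?<!dogs?|cats?). The helper name compiles is hypothetical.

```python
import re

neg = re.compile(r"(?<!foo)bar")

# "bar" directly after "foo" is rejected; the later "bar" is found.
only_foobar = neg.search("foobar")
second_bar = neg.search("foobar bazbar")

# With fewer than three characters before "bar", the lookbehind itself
# fails, so the negative assertion succeeds.
at_start = neg.search("bar")


def compiles(pattern: str) -> bool:
    """Report whether Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(only_foobar, second_bar.start(), at_start.start())
print(compiles(r"(?<!dogs?|cats?)"), compiles(r"(?<=abc|abde)"))
```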
In a UTF mode, PCRE does not allow the \C escape (which matches a sin- In a UTF mode, PCRE does not allow the \C escape (which matches a sin-
gle data unit even in a UTF mode) to appear in lookbehind assertions, gle data unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbe- because it makes it impossible to calculate the length of the lookbe-
hind. The \X and \R escapes, which can match different numbers of data hind. The \X and \R escapes, which can match different numbers of data
units, are also not permitted. units, are also not permitted.
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
lookbehinds, as long as the subpattern matches a fixed-length string. lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported. Recursion, however, is not supported.
Possessive quantifiers can be used in conjunction with lookbehind Possessive quantifiers can be used in conjunction with lookbehind as-
assertions to specify efficient matching of fixed-length strings at the sertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as end of subject strings. Consider a simple pattern such as
abcd$ abcd$
when applied to a long string that does not match. Because matching when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the and then see if what follows matches the rest of the pattern. If the
pattern is specified as pattern is specified as
^.*abcd$ ^.*abcd$
the initial .* matches the entire string at first, but when this fails the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the (because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left, again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as so we are no better off. However, if the pattern is written as
^.*+(?<=abcd) ^.*+(?<=abcd)
there can be no backtracking for the .*+ item; it can match only the there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately. on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the For long strings, this approach makes a significant difference to the
processing time. processing time.
Using multiple assertions Using multiple assertions
Several assertions (of any sort) may occur in succession. For example, Several assertions (of any sort) may occur in succession. For example,
(?<=\d{3})(?<!999)foo (?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999". Notice that matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre- three characters are not "999". This pattern does not match "foo" pre-
ceded by six characters, the first three of which are digits and the last ceded by six characters, the first three of which are digits and the last
three of which are not "999". For example, it doesn't match "123abc- three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo (?<=\d{3}...)(?<!999)foo
This time the first assertion looks at the preceding six characters, This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999". checks that the preceding three characters are not "999".
Assertions can be nested in any combination. For example, Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz (?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar" which in turn matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo (?<=\d{3}(?!999)...)foo
is another pattern that matches "foo" preceded by three digits and any is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999". three characters that are not "999".
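The combined assertions above can be compared side by side. The sketch below uses Python's re module, which is an assumption of this example and not PCRE; all the lookbehinds here have a single fixed length, which both accept. The subject strings are made up for illustration.

```python
import re

# Both assertions look at the same three characters before "foo".
three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# "baz" preceded by "bar" which is not itself preceded by "foo".
nested = re.compile(r"(?<=(?<!foo)bar)baz")

checks = {
    "three/123foo": three.search("123foo") is not None,
    "three/999foo": three.search("999foo") is not None,
    "three/123abcfoo": three.search("123abcfoo") is not None,
    "six/123abcfoo": six.search("123abcfoo") is not None,
    "six/123999foo": six.search("123999foo") is not None,
    "nested/xxxbarbaz": nested.search("xxxbarbaz") is not None,
    "nested/foobarbaz": nested.search("foobarbaz") is not None,
}
for name, ok in checks.items():
    print(name, ok)
```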
CONDITIONAL SUBPATTERNS CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a subpattern con- It is possible to cause the matching process to obey a subpattern con-
ditionally or to choose between two alternative subpatterns, depending ditionally or to choose between two alternative subpatterns, depending
on the result of an assertion, or whether a specific capturing subpat- on the result of an assertion, or whether a specific capturing subpat-
tern has already been matched. The two possible forms of conditional tern has already been matched. The two possible forms of conditional
subpattern are: subpattern are:
(?(condition)yes-pattern) (?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern) (?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; otherwise the If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two alterna- no-pattern (if present) is used. If there are more than two alterna-
tives in the subpattern, a compile-time error occurs. Each of the two tives in the subpattern, a compile-time error occurs. Each of the two
alternatives may itself contain nested subpatterns of any form, includ- alternatives may itself contain nested subpatterns of any form, includ-
ing conditional subpatterns; the restriction to two alternatives ing conditional subpatterns; the restriction to two alternatives ap-
applies only at the level of the condition. This pattern fragment is an plies only at the level of the condition. This pattern fragment is an
example where the alternatives are complex: example where the alternatives are complex:
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
There are four kinds of condition: references to subpatterns, refer- There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions. ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number Checking for a used subpattern by number
If the text between the parentheses consists of a sequence of digits, If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre- the condition is true if a capturing subpattern of that number has pre-
viously matched. If there is more than one capturing subpattern with viously matched. If there is more than one capturing subpattern with
the same number (see the earlier section about duplicate subpattern the same number (see the earlier section about duplicate subpattern
numbers), the condition is true if any of them have matched. An alter- numbers), the condition is true if any of them have matched. An alter-
native notation is to precede the digits with a plus or minus sign. In native notation is to precede the digits with a plus or minus sign. In
this case, the subpattern number is relative rather than absolute. The this case, the subpattern number is relative rather than absolute. The
most recently opened parentheses can be referenced by (?(-1), the next most recently opened parentheses can be referenced by (?(-1), the next
most recent by (?(-2), and so on. Inside loops it can also make sense most recent by (?(-2), and so on. Inside loops it can also make sense
to refer to subsequent groups. The next parentheses to be opened can be to refer to subsequent groups. The next parentheses to be opened can be
referenced as (?(+1), and so on. (The value zero in any of these forms referenced as (?(+1), and so on. (The value zero in any of these forms
is not used; it provokes a compile-time error.) is not used; it provokes a compile-time error.)
Consider the following pattern, which contains non-significant white Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion: divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) ) ( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis, and if that The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec- character is present, sets it as the first captured substring. The sec-
ond part matches one or more characters that are not parentheses. The ond part matches one or more characters that are not parentheses. The
third part is a conditional subpattern that tests whether or not the third part is a conditional subpattern that tests whether or not the
first set of parentheses matched. If they did, that is, if the subject first set of parentheses matched. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required. Other- yes-pattern is executed and a closing parenthesis is required. Other-
wise, since no-pattern is not present, the subpattern matches nothing. wise, since no-pattern is not present, the subpattern matches nothing.
In other words, this pattern matches a sequence of non-parentheses, In other words, this pattern matches a sequence of non-parentheses, op-
optionally enclosed in parentheses. tionally enclosed in parentheses.
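As an illustration only, the pattern above can be tried in Python's standard re module. That is a different engine from PCRE, but it accepts the same (?(1)...) syntax. In this sketch re.VERBOSE stands in for PCRE_EXTENDED, and the names OPTIONAL_PARENS and accepts are made up for the example.

```python
import re

# Sketch using Python's re module (not PCRE itself); re.VERBOSE plays the
# role of PCRE_EXTENDED. OPTIONAL_PARENS is a hypothetical name.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


print(accepts("abc"))    # group 1 unset, so no closing parenthesis is required
print(accepts("(abc)"))  # group 1 set, so the closing parenthesis is required
print(accepts("(abc"))   # group 1 set, but the closing parenthesis is missing
```

The whole-subject test (fullmatch) is a choice made for this sketch; the pattern itself is unanchored.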
If you were embedding this pattern in a larger one, you could use a If you were embedding this pattern in a larger one, you could use a
relative reference: relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
This makes the fragment independent of the parentheses in the larger This makes the fragment independent of the parentheses in the larger
pattern. pattern.
Checking for a used subpattern by name Checking for a used subpattern by name
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. also recognized.
Rewriting the above example to use a named subpattern gives this: Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
If the name used in a condition of this kind is a duplicate, the test If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one is applied to all subpatterns of the same name, and is true if any one
of them has matched. of them has matched.
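The named form can be sketched the same way in Python's re module, with one caveat: that module does not accept the Perl spellings (?<name>...) or (?(<name>)...). It uses (?P<name>...) for the group and the bare (?(name)...) condition, which corresponds to the earlier PCRE syntax mentioned above. The names NAMED and accepts_named are hypothetical.

```python
import re

# Python's re spelling of the named example: (?P<OPEN>...) defines the
# group and (?(OPEN)...) tests whether it took part in the match.
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def accepts_named(subject):
    """Return True if the whole subject matches the named-group pattern."""
    return NAMED.fullmatch(subject) is not None


print(accepts_named("(abc)"))
print(accepts_named("abc)"))
```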
Checking for pattern recursion Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper- or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example: sand follow the letter R, for example:
(?(R3)...) or (?(R&name)...) (?(R3)...) or (?(R&name)...)
the condition is true if the most recent recursion is into a subpattern the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire whose number or name is given. This condition does not check the entire
recursion stack. If the name used in a condition of this kind is a recursion stack. If the name used in a condition of this kind is a du-
duplicate, the test is applied to all subpatterns of the same name, and plicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion. is true if any one of them is the most recent recursion.
At "top level", all these recursion test conditions are false. The At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below. syntax for recursive patterns is described below.
Defining subpatterns for use by reference only Defining subpatterns for use by reference only
If the condition is the string (DEFINE), and there is no subpattern If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case, with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of skipped if control reaches this point in the pattern; the idea of DE-
DEFINE is that it can be used to define subroutines that can be refer- FINE is that it can be used to define subroutines that can be refer-
enced from elsewhere. (The use of subroutines is described below.) For enced from elsewhere. (The use of subroutines is described below.) For
example, a pattern to match an IPv4 address such as "192.168.23.245" example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore white space and line breaks): could be written like this (ignore white space and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b \b (?&byte) (\.(?&byte)){3} \b
The first part of the pattern is a DEFINE group inside which another The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place, an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address, insist- to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end. ing on a word boundary at each end.
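Python's standard re module has neither DEFINE nor (?&name) calls, so the following sketch only emulates the idea: the body of the "byte" group is kept in a string and interpolated wherever the PCRE pattern would call it. The names BYTE, IPV4 and is_ipv4 are assumptions of this example, as is the decision to test the whole subject.

```python
import re

# BYTE holds the body of the "byte" group from the pattern above. Reusing
# the string imitates (?&byte); it is an emulation, not a DEFINE group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text):
    """Return True if text consists of four dot-separated byte values."""
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.23.245"))
print(is_ipv4("256.1.1.1"))
```

One difference worth keeping in mind: an interpolated copy can be backtracked into like any other group, whereas a real subroutine call in PCRE cannot.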
Assertion conditions Assertion conditions
If the condition is not in any of the above formats, it must be an If the condition is not in any of the above formats, it must be an as-
assertion. This may be a positive or negative lookahead or lookbehind sertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line: white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z]) (?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that matches an The condition is a positive lookahead assertion that matches an op-
optional sequence of non-letters followed by a letter. In other words, tional sequence of non-letters followed by a letter. In other words, it
it tests for the presence of at least one letter in the subject. If a tests for the presence of at least one letter in the subject. If a let-
letter is found, the subject is matched against the first alternative; ter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits. letters and dd are digits.
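For engines that lack assertion conditions (Python's re module is one), a condition of the form (?(?=A)X|Y) can usually be rewritten as (?:(?=A)X|(?!A)Y). The sketch below applies that rewrite to the pattern above; it is an equivalent formulation for illustration, not PCRE syntax, and the names DATE and is_date are hypothetical.

```python
import re

# Rewrite of the conditional: the positive lookahead guards the first
# alternative and the matching negative lookahead guards the second.
DATE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)


def is_date(text):
    """Return True if text has the form dd-aaa-dd or dd-dd-dd."""
    return DATE.fullmatch(text) is not None


print(is_date("12-abc-34"))
print(is_date("12-34-56"))
print(is_date("12-ab-34"))
```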
COMMENTS COMMENTS
There are two ways of including comments in patterns that are processed There are two ways of including comments in patterns that are processed
by PCRE. In both cases, the start of the comment must not be in a char- by PCRE. In both cases, the start of the comment must not be in a char-
acter class, nor in the middle of any other sequence of related charac- acter class, nor in the middle of any other sequence of related charac-
ters such as (?: or a subpattern name or number. The characters that ters such as (?: or a subpattern name or number. The characters that
make up a comment play no part in the pattern matching. make up a comment play no part in the pattern matching.
The sequence (?# marks the start of a comment that continues up to the The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. If the next closing parenthesis. Nested parentheses are not permitted. If the
PCRE_EXTENDED option is set, an unescaped # character also introduces a PCRE_EXTENDED option is set, an unescaped # character also introduces a
comment, which in this case continues to immediately after the next comment, which in this case continues to immediately after the next
newline character or character sequence in the pattern. Which charac- newline character or character sequence in the pattern. Which charac-
ters are interpreted as newlines is controlled by the options passed to ters are interpreted as newlines is controlled by the options passed to
a compiling function or by a special sequence at the start of the pat- a compiling function or by a special sequence at the start of the pat-
tern, as described in the section entitled "Newline conventions" above. tern, as described in the section entitled "Newline conventions" above.
Note that the end of this type of comment is a literal newline sequence Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do in the pattern; escape sequences that happen to represent a newline do
not count. For example, consider this pattern when PCRE_EXTENDED is not count. For example, consider this pattern when PCRE_EXTENDED is
set, and the default newline convention is in force: set, and the default newline convention is in force:
abc #comment \n still comment abc #comment \n still comment
On encountering the # character, pcre_compile() skips along, looking On encountering the # character, pcre_compile() skips along, looking
for a newline in the pattern. The sequence \n is still literal at this for a newline in the pattern. The sequence \n is still literal at this
stage, so it does not terminate the comment. Only an actual character stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so. with the code value 0x0a (the default newline) does so.
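Python's re module treats # comments in verbose mode the same way in this respect, so it can serve as a quick demonstration. This sketch assumes re.VERBOSE as a stand-in for PCRE_EXTENDED; the variable names are made up.

```python
import re

# In a raw string, \n is a backslash followed by "n": two literal
# characters, so the "#" comment swallows them with the rest of the line.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Here the Python string contains a real newline (0x0a), which ends the
# comment, so "def" is part of the pattern again.
real_newline = re.compile("abc #comment\n def", re.VERBOSE)

# The (?# ... ) form needs no option and ends at the next ")".
inline = re.compile(r"ab(?#note)c")

print(escaped.fullmatch("abc") is not None)
print(real_newline.fullmatch("abcdef") is not None)
print(inline.fullmatch("abc") is not None)
```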
RECURSIVE PATTERNS RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses, allowing for Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting depth of nesting. It is not possible to handle an arbitrary nesting
depth. depth.
For some time, Perl has provided a facility that allows regular expres- For some time, Perl has provided a facility that allows regular expres-
sions to recurse (amongst other things). It does this by interpolating sions to recurse (amongst other things). It does this by interpolating
Perl code in the expression at run time, and the code can refer to the Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this: parentheses problem can be created like this:
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
The (?p{...}) item interpolates Perl code at run time, and in this case The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears. refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead, Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was subsequently introduced PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10. into Perl at release 5.10.
A special item that consists of (? followed by a number greater than A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive subroutine call of the zero and a closing parenthesis is a recursive subroutine call of the
subpattern of the given number, provided that it occurs inside that subpattern of the given number, provided that it occurs inside that
subpattern. (If not, it is a non-recursive subroutine call, which is subpattern. (If not, it is a non-recursive subroutine call, which is
described in the next section.) The special item (?R) or (?0) is a described in the next section.) The special item (?R) or (?0) is a re-
recursive call of the entire regular expression. cursive call of the entire regular expression.
This PCRE pattern solves the nested parentheses problem (assume the This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored): PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \) \( ( [^()]++ | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a substrings which can either be a sequence of non-parentheses, or a re-
recursive match of the pattern itself (that is, a correctly parenthe- cursive match of the pattern itself (that is, a correctly parenthesized
sized substring). Finally there is a closing parenthesis. Note the use substring). Finally there is a closing parenthesis. Note the use of a
of a possessive quantifier to avoid backtracking into sequences of non- possessive quantifier to avoid backtracking into sequences of non-
parentheses. parentheses.
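To make the shape of the recursion concrete, here is a hand-written Python function that accepts the same strings as the pattern above when matching starts at a given position. It is a sketch of the idea, not how PCRE is implemented, and the name match_parens is hypothetical.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at s[pos],
    or -1 if there is none. Mirrors \\( ( [^()]++ | (?R) )* \\)."""
    if pos >= len(s) or s[pos] != "(":
        return -1                      # the leading \( failed
    i = pos + 1
    while i < len(s):
        if s[i] == ")":
            return i + 1               # the trailing \) matched
        if s[i] == "(":
            i = match_parens(s, i)     # plays the role of (?R)
            if i == -1:
                return -1
        else:
            i += 1                     # one more character of a [^()]++ run
    return -1                          # ran out of subject before \)


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```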
If this were part of a larger pattern, you would not want to recurse If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this: the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) ) ( \( ( [^()]++ | (?1) )* \) )
We have put the pattern into parentheses, and caused the recursion to We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern. refer to them instead of the whole pattern.
In a larger pattern, keeping track of parenthesis numbers can be In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references. Instead tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other most recently opened parentheses preceding the recursion. In other
words, a negative number counts capturing parentheses leftwards from words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered. the point at which it is encountered.
It is also possible to refer to subsequently opened parentheses, by It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are refer- because the reference is not inside the parentheses that are refer-
enced. They are always non-recursive subroutine calls, as described in enced. They are always non-recursive subroutine calls, as described in
the next section. the next section.
An alternative approach is to use named parentheses instead. The Perl An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows: supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) (?<pn> \( ( [^()]++ | (?&pn) )* \) )
If there is more than one subpattern with the same name, the earliest If there is more than one subpattern with the same name, the earliest
one is used. one is used.
This particular example pattern that we have been looking at contains This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat- matching strings of non-parentheses is important when applying the pat-
tern to strings that do not match. For example, when this pattern is tern to strings that do not match. For example, when this pattern is
applied to applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a possessive quantifier is it yields "no match" quickly. However, if a possessive quantifier is
not used, the match runs for a very long time indeed because there are not used, the match runs for a very long time indeed because there are
so many different ways the + and * repeats can carve up the subject, so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported. and all have to be tested before failure can be reported.
At the end of a match, the values of capturing parentheses are those At the end of a match, the values of capturing parentheses are those
from the outermost level. If you want to obtain intermediate values, a from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcrecallout documenta- callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against tion). If the pattern above is matched against
(ab(cd)ef) (ab(cd)ef)
the value for the inner capturing parentheses (numbered 2) is "ef", the value for the inner capturing parentheses (numbered 2) is "ef",
which is the last value taken on at the top level. If a capturing sub- which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final captured value is pattern is not matched at the top level, its final captured value is
unset, even if it was (temporarily) set at a deeper level during the unset, even if it was (temporarily) set at a deeper level during the
matching process. matching process.
If there are more than 15 capturing parentheses in a pattern, PCRE has If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
Do not confuse the (?R) item with the condition (R), which tests for Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brack- recursion. Consider this pattern, which matches text in angle brack-
ets, allowing for arbitrary nesting. Only digits are allowed in nested ets, allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permit- brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level. ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
In this pattern, (?(R) is the start of a conditional subpattern, with In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases. two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call. The (?R) item is the actual recursive call.
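The two roles can be separated in a hand-written model: a function call stands for (?R), and a flag passed to it stands for the (R) condition. The Python sketch below accepts the same strings as the pattern above when matching starts at a given position; match_angle and recursing are hypothetical names, and only ASCII digits are treated as digits.

```python
def match_angle(s, pos=0, recursing=False):
    """Return the index just past a bracketed group starting at s[pos],
    or -1. The recursing flag models the (R) condition."""
    if pos >= len(s) or s[pos] != "<":
        return -1
    i = pos + 1
    while i < len(s):
        c = s[i]
        if c == ">":
            return i + 1
        if c == "<":
            # The call models (?R); inside it the (R) condition is true.
            i = match_angle(s, i, recursing=True)
            if i == -1:
                return -1
        elif recursing and c not in "0123456789":
            return -1                  # only digits are allowed when nested
        else:
            i += 1
    return -1


print(match_angle("<ab<12>cd>"))
print(match_angle("<ab<1x>cd>"))
```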
Differences in recursion processing between PCRE and Perl Differences in recursion processing between PCRE and Perl
Recursion processing in PCRE differs from Perl in two important ways. Recursion processing in PCRE differs from Perl in two important ways.
In PCRE (like Python, but unlike Perl), a recursive subpattern call is In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a palin- illustrated by the following pattern, which purports to match a palin-
dromic string that contains an odd number of characters (for example, dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"): "a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$ ^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the subject string is longer than three characters. in PCRE it does not if the subject string is longer than three characters.
Consider the subject string "abcba": Consider the subject string "abcba":
At the top level, the first character is matched, but as it is not at At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna- the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat- tive is taken and the recursion kicks in. The recursive call to subpat-
tern 1 successfully matches the next character ("b"). (Note that the tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion). beginning and end of line tests are not part of the recursion).
Back at the top level, the next character ("c") is compared with what Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points, is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re- and so the entire match fails. (Perl is able, at this point, to re-en-
enter the recursion and try the second alternative.) However, if the ter the recursion and try the second alternative.) However, if the pat-
pattern is written with the alternatives in the other order, things are tern is written with the alternatives in the other order, things are
different: different:
^((.)(?1)\2|.)$ ^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the higher level. That is the big difference: in the previous case the re-
remaining alternative is at a deeper recursion level, which PCRE cannot maining alternative is at a deeper recursion level, which PCRE cannot
use. use.
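The effect of atomic recursion on the first pattern, ^(.|(.)(?1)\2)$, can be modelled in a few lines of Python. This is a simplified model written for this explanation, not PCRE or Perl code: a generator yields every end position that group 1 could reach, and an atomic call keeps only the first of them. The names group1_ends and matches are hypothetical, and subjects are assumed to contain no newlines.

```python
import itertools


def group1_ends(s, pos, atomic):
    """Yield, in regex order, each end position group 1 can reach from pos."""
    if pos >= len(s):
        return
    yield pos + 1                                   # alternative 1: "."
    inner = group1_ends(s, pos + 1, atomic)         # alternative 2: (.)(?1)\2
    if atomic:
        inner = itertools.islice(inner, 1)          # no re-entry after success
    for mid in inner:
        if mid < len(s) and s[mid] == s[pos]:       # \2 must repeat (.)
            yield mid + 1


def matches(s, atomic):
    """Model ^(...)$: some end position must be the end of the subject."""
    return any(end == len(s) for end in group1_ends(s, 0, atomic))


print(matches("aba", atomic=True))
print(matches("abcba", atomic=True))     # models the PCRE behaviour
print(matches("abcba", atomic=False))    # models the Perl behaviour
```

In this model the atomic variant accepts only subjects of one character, or of three characters whose first and last characters agree, because every recursive call stops after matching a single character.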
To change the pattern so that it matches all palindromic strings, not To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change just those with an odd number of characters, it is tempting to change
the pattern to this: the pattern to this:
^((.)(?1)\2|.?)$ ^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE, and for the same reason. Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alter- separate the two cases, and write out the odd and even cases as alter-
natives at the higher level: natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$ ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
If you want to match typical palindromic phrases, the pattern has to If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this: ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
ing into sequences of non-word characters. Without this, PCRE takes a ing into sequences of non-word characters. Without this, PCRE takes a
great deal longer (ten times or more) to match typical phrases, and great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop. Perl takes so long that you think it has gone into a loop.
WARNING: The palindrome-matching patterns above work only if the sub- WARNING: The palindrome-matching patterns above work only if the sub-
ject string does not start with a palindrome that is shorter than the ject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba" is correctly matched, if entire string. For example, although "abcba" is correctly matched, if
the subject is "ababa", PCRE finds the palindrome "aba" at the start, the subject is "ababa", PCRE finds the palindrome "aba" at the start,
then fails at top level because the end of the string does not follow. then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alter- Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails. natives, so the entire match fails.
The second way in which PCRE and Perl differ in their recursion pro- The second way in which PCRE and Perl differ in their recursion pro-
cessing is in the handling of captured values. In Perl, when a subpat- cessing is in the handling of captured values. In Perl, when a subpat-
tern is called recursively or as a subroutine (see the next section), tern is called recursively or as a subroutine (see the next section),
it has no access to any values that were captured outside the recur- it has no access to any values that were captured outside the recur-
sion, whereas in PCRE these values can be referenced. Consider this sion, whereas in PCRE these values can be referenced. Consider this
pattern: pattern:
^(.)(\1|a(?2)) ^(.)(\1|a(?2))
In PCRE, this pattern matches "bab". The first capturing parentheses In PCRE, this pattern matches "bab". The first capturing parentheses
match "b", then in the second group, when the back reference \1 fails match "b", then in the second group, when the back reference \1 fails
to match "b", the second alternative matches "a" and then recurses. In to match "b", the second alternative matches "a" and then recurses. In
the recursion, \1 does now match "b" and so the whole match succeeds. the recursion, \1 does now match "b" and so the whole match succeeds.
In Perl, the pattern fails to match because inside the recursive call In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value. \1 cannot access the externally set value.
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. The called subpattern may
be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
(...(?+1)...(relative)...
An earlier example pointed out that the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.
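
The difference between the back reference and the subroutine call can
be tried out with a rough sketch. Python's standard re module is not
PCRE and has no (?1) syntax, so the call is only emulated here (an
assumption of this sketch) by repeating the text of the subpattern in a
non-capturing group. The names GROUP, call_like, and whole_match are
hypothetical and exist only for illustration.

```python
import re

# Text of the subpattern inside group 1.
GROUP = r"sens|respons"

# Back reference: \1 must match the same text that group 1 captured.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so either
# alternative may match. (PCRE writes this as (?1); Python's re cannot.)
call_like = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")


def whole_match(regex, subject):
    """Return True if the regex matches the entire subject string."""
    return regex.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("sense and sensibility",
                    "response and responsibility",
                    "sense and responsibility"):
        print(subject, whole_match(backref, subject),
              whole_match(call_like, subject))
```

Only the third subject separates the two forms: the back reference
rejects it, while the emulated call accepts it.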
All subroutine calls, whether recursive or not, are always treated as
atomic groups. That is, once a subroutine has matched some of the
subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. Any capturing
parentheses that are set during the subroutine call revert to their
previous values afterwards.
Processing options such as case-independence are fixed when a
subpattern is defined, so if it is used as a subroutine, such options
cannot be changed for different calls. For example, consider this
pattern:
(abc)(?i:(?-1))
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above,
rewritten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout (8-bit library) or pcre[16|32]_callout (16-bit or 32-bit
library). By default, this variable contains NULL, which disables all
calling out.
Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,
callouts are automatically installed before each item in the pattern.
They are all numbered 255. If there is a conditional group in the
pattern whose condition is an assertion, an additional callout is
inserted just before the condition. An explicit callout may also be set
at this position, as in this example:
(?(?C9)(?=a)abc|def)
Note that this applies only to assertion conditions, not to other types
of condition.
During matching, when PCRE reaches a callout point, the external
function is called. It is provided with the number of the callout, the
position in the pattern, and, optionally, one item of data originally
supplied by the caller of the matching function. The callout function
may cause matching to proceed, to backtrack, or to fail altogether.
By default, PCRE implements a number of optimizations at compile time
and matching time, and one side-effect is that sometimes callouts are
skipped. If you need all possible callouts to happen, you need to set
options that disable the relevant optimizations. More details, and a
complete description of the interface to the callout function, are
given in the pcrecallout documentation.
BACKTRACKING CONTROL
Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are still described in the Perl documentation as "experimental
and subject to change or removal in a future version of Perl". It goes
on to say: "Their usage in production code should be noted to avoid
problems during upgrades." The same remarks apply to the PCRE features
described in this section.
The new verbs make use of what was previously invalid syntax: an
opening parenthesis followed by an asterisk. They are generally of the
form (*VERB) or (*VERB:NAME). Some may take either form, possibly
behaving differently depending on whether or not a name is present. A
name is any sequence of characters that does not include a closing
parenthesis. The maximum length of a name is 255 in the 8-bit library
and 65535 in the 16-bit and 32-bit libraries. If the name is empty, that
is, if the closing parenthesis immediately follows the colon, the effect
is as if the colon were not there. Any number of these verbs may occur
in a pattern.
Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using one of
the traditional matching functions, because these use a backtracking
algorithm. With the exception of (*FAIL), which behaves like a failing
negative assertion, the backtracking control verbs cause an error if
encountered by a DFA matching function.
The behaviour of these verbs in repeated groups, assertions, and in
subpatterns called as subroutines (whether or not recursively) is
documented below.
Optimizations that affect backtracking verbs
PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
may know the minimum length of matching subject, or that a particular
character must be present. When one of these optimizations bypasses the
running of a match, any included backtracking verbs will not, of
course, be processed. You can suppress the start-of-match optimizations
by setting the PCRE_NO_START_OPTIMIZE option when calling
pcre_compile() or pcre_exec(), or by starting the pattern with
(*NO_START_OPT). There is more discussion of this option in the section
entitled "Option bits for pcre_exec()" in the pcreapi documentation.
Experiments with Perl suggest that it too has similar optimizations,
sometimes leading to anomalous results.
Verbs that act immediately
The following verbs act as soon as they are encountered. They may not
be followed by a name.
(*ACCEPT)
This verb causes the match to end successfully, skipping the remainder
of the pattern. However, when it is inside a subpattern that is called
as a subroutine, only that subpattern is ended successfully. Matching
then continues at the outer level. If (*ACCEPT) is triggered in a
positive assertion, the assertion succeeds; in a negative assertion, the
assertion fails.
If (*ACCEPT) is inside capturing parentheses, the data so far is
captured. For example:
A((?:A|B(*ACCEPT)|C)D)
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
captured by the outer parentheses.
(*FAIL) or (*F)
This verb causes a matching failure, forcing backtracking to occur. It
is equivalent to (?!) but easier to read. The Perl documentation notes
that it is probably useful only when combined with (?{}) or (??{}).
Those are, of course, Perl features that are not present in PCRE. The
nearest equivalent is the callout feature, as for example in this
pattern:
a+(?C)(*FAIL)
A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
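
The figure of 10 callouts can be checked with a toy model. The sketch
below does not call PCRE; it hard-codes the behaviour of a+(?C)(*FAIL)
and assumes that a match attempt is made at every starting position
(that is, that no start-of-match optimization skips any of them). The
function name count_callouts is hypothetical.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) in this toy model."""
    calls = 0
    for start in range(len(subject)):      # "bumpalong" over starting positions
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one character
        # at a time. Each of the `run` possible lengths reaches (?C) once
        # before (*FAIL) forces the next backtrack.
        calls += run
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))
```

For "aaaa" the four starting positions contribute 4, 3, 2, and 1
callouts respectively, which gives the total of 10.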
Recording which path was taken
There is one verb whose main purpose is to track how a match was
arrived at, though it also has a secondary use in conjunction with
advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
A name is always required with this verb. There may be as many
instances of (*MARK) as you like in a pattern, and their names do not
have to be unique.
When a match succeeds, the name of the last-encountered (*MARK:NAME),
(*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
the caller as described in the section entitled "Extra data for
pcre_exec()" in the pcreapi documentation. Here is an example of
pcretest output, where the /K modifier requests the retrieval and
outputting of (*MARK) data:
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XY
0: XY
MK: A
XZ
0: XZ
MK: B
The (*MARK) name is tagged with "MK:" in this output, and in this
example it indicates which of the two alternatives matched. This is a
more efficient way of obtaining this information than putting each
alternative in its own capturing parentheses.
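
For comparison, an engine without (*MARK) has to use the
capturing-parentheses approach that the paragraph above describes as
less efficient. The following is a minimal sketch using Python's
standard re module, which is not PCRE; the group names A and B are
chosen only to mirror the mark names, and which_alternative is a
hypothetical helper.

```python
import re

# Each alternative gets its own named capturing group, standing in for the
# (*MARK:A) and (*MARK:B) of the PCRE pattern.
regex = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the group that matched, or None if nothing matched."""
    m = regex.search(subject)
    return m.lastgroup if m else None


if __name__ == "__main__":
    for subject in ("XY", "XZ", "XP"):
        print(subject, which_alternative(subject))
```

Unlike (*MARK), this approach reports nothing after a failed match,
because no match object exists to carry a name.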
If a verb with a name is encountered in a positive assertion that is
true, the name is recorded and passed back if it is the
last-encountered. This does not happen for negative assertions or
failing positive assertions.
After a partial match or a failed match, the last encountered name in
the entire match process is returned. For example:
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XP
No match, mark = B
Note that in this unanchored example the mark is retained from the
match attempt that started at the letter "X" in the subject. Subsequent
match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.
If you are interested in (*MARK) values after failed matches, you
should probably set the PCRE_NO_START_OPTIMIZE option (see above) to
ensure that the match is always attempted.
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching
continues with what follows, but if there is no subsequent match,
causing a backtrack to the verb, a failure is forced. That is,
backtracking cannot pass to the left of the verb. However, when one of
these verbs appears inside an atomic group or an assertion that is true,
its effect is confined to that group, because once the group has been
matched, there is never any backtracking into it. In this situation,
backtracking can "jump back" to the left of the entire atomic group or
assertion. (Remember, as stated above, that this localization also
applies in subroutine calls.)
These verbs differ in exactly what kind of failure occurs when
backtracking reaches them. The behaviour described below is what happens
when the verb is not in a subroutine or an assertion. Subsequent
sections cover these special cases.
(*COMMIT)
This verb, which may not be followed by a name, causes the whole match
to fail outright if there is a later matching failure that causes
backtracking to reach it. Even if the pattern is unanchored, no further
attempts to find a match by advancing the starting point take place. If
(*COMMIT) is the only backtracking verb that is encountered, once it
has been passed pcre_exec() is committed to finding a match at the
current starting point, or not at all. For example:
a+(*COMMIT)b
This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish." The name of the
most recently passed (*MARK) in the path is passed back when (*COMMIT)
forces a match failure.
If there is more than one backtracking verb in a pattern, a different
one that follows (*COMMIT) may be triggered first, so merely passing
(*COMMIT) during a match does not always guarantee that a match must be
at this starting point.
Note that (*COMMIT) at the start of a pattern is not the same as an
anchor, unless PCRE's start-of-match optimizations are turned off, as
shown in this output from pcretest:
re> /(*COMMIT)abc/
data> xyzabc
0: abc
data> xyzabc\Y
No match
For this pattern, PCRE knows that any match must start with "a", so the
optimization skips along the subject to "a" before applying the pattern
to the first set of data. The match attempt then succeeds. In the
second set of data, the escape sequence \Y is interpreted by the
pcretest program. It causes the PCRE_NO_START_OPTIMIZE option to be set
when pcre_exec() is called. This disables the optimization that skips
along to the first character. The pattern is now applied starting at
"x", and so the (*COMMIT) causes the match to fail without trying any
other starting points.
(*PRUNE) or (*PRUNE:NAME)
This verb causes the match to fail at the current starting position in
the subject if there is a later matching failure that causes
backtracking to reach it. If the pattern is unanchored, the normal
"bumpalong" advance to the next starting character then happens.
Backtracking can occur as usual to the left of (*PRUNE), before it is
reached, or when matching to the right of (*PRUNE), but if there is no
match to the right, backtracking cannot cross (*PRUNE). In simple cases,
the use of (*PRUNE) is just an alternative to an atomic group or
possessive quantifier, but there are some uses of (*PRUNE) that cannot
be expressed in any other way. In an anchored pattern (*PRUNE) has the
same effect as (*COMMIT).
The behaviour of (*PRUNE:NAME) is not the same as
(*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
remembered for passing back to the caller. However, (*SKIP:NAME)
searches only for names set with (*MARK).
(*SKIP)
This verb, when given without a name, is like (*PRUNE), except that if
the pattern is unanchored, the "bumpalong" advance is not to the next
character, but to the position in the subject where (*SKIP) was
encountered. (*SKIP) signifies that whatever text was matched leading up
to it cannot be part of a successful match. Consider:
a+(*SKIP)b
If the subject is "aaaac...", after the first match attempt fails
(starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive
quantifier does not have the same effect as this example; although it
would suppress backtracking during the first match attempt, the second
attempt would start at the second character instead of skipping on to
"c".
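
The different "bumpalong" behaviour of (*COMMIT), (*PRUNE), and (*SKIP)
can be compared with a small model. The sketch below does not call
PCRE; it hard-codes the single pattern a+(*VERB)b, assumes that no
start-of-match optimization is in effect, and omits the attempt at the
very end of the subject, where this pattern can never match. The
function name try_a_plus_b and its verb argument are hypothetical.

```python
def try_a_plus_b(subject, verb=None):
    """Model an unanchored match of a+(*VERB)b.

    verb is None (plain a+b), "COMMIT", "PRUNE", or "SKIP".
    Returns (match, starts): the matched text or None, and the list of
    starting positions at which a match attempt was made.
    """
    starts = []
    start = 0
    while start < len(subject):
        starts.append(start)
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        if run == 0:
            start += 1              # a+ failed, so the verb was never reached
            continue
        verb_pos = start + run      # subject position where the verb is passed
        if subject[verb_pos:verb_pos + 1] == "b":
            return subject[start:verb_pos + 1], starts
        # "b" failed after the greedy run. Without a verb the matcher would
        # retry shorter runs, but each is followed by "a", so they fail too.
        if verb == "COMMIT":
            return None, starts     # no further starting points are tried
        if verb == "SKIP":
            start = verb_pos        # resume where (*SKIP) was encountered
        else:
            start += 1              # None or "PRUNE": normal bumpalong
    return None, starts


if __name__ == "__main__":
    print(try_a_plus_b("xxaab", "COMMIT"))
    print(try_a_plus_b("aacaab", "COMMIT"))
    print(try_a_plus_b("aaaac", "PRUNE"))
    print(try_a_plus_b("aaaac", "SKIP"))
```

In this model the "PRUNE" case also stands in for a possessive
quantifier: both try every starting position in "aaaac", whereas "SKIP"
jumps from the first position straight to "c".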
(*SKIP:NAME)
When (*SKIP) has an associated name, its behaviour is modified. When it
is triggered, the previous path through the pattern is searched for the
most recent (*MARK) that has the same name. If one is found, the
"bumpalong" advance is to the subject position that corresponds to that
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
a matching name is found, the (*SKIP) is ignored.
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
(*THEN) or (*THEN:NAME)
This verb causes a skip to the next innermost alternative when
backtracking reaches it. That is, it cancels any further backtracking
within the current alternative. Its name comes from the observation
that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
If the COND1 pattern matches, FOO is tried (and possibly further items
after the end of the group if FOO succeeds); on failure, the matcher
skips to the second alternative and tries COND2, without backtracking
into COND1. If that succeeds and BAR fails, COND3 is tried. If
subsequently BAZ fails, there are no more alternatives, so there is a
backtrack to whatever came before the entire group. If (*THEN) is not
inside an alternation, it acts like (*PRUNE).
The behaviour of (*THEN:NAME) is not the same as
(*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is
remembered for passing back to the caller. However, (*SKIP:NAME)
searches only for names set with (*MARK).
A subpattern that does not contain a | character is just a part of the
enclosing alternative; it is not a nested alternation with only one
alternative. The effect of (*THEN) extends beyond such a subpattern to
the enclosing alternative. Consider this pattern, where A, B, etc. are
complex pattern fragments that do not contain any | characters at this
level:
A (B(*THEN)C) | D
If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the subpattern containing (*THEN) is given an alternative,
it behaves differently:
A (B(*THEN)C | (*FAIL)) | D
The effect of (*THEN) is now confined to the inner subpattern. After a
failure in C, matching moves to (*FAIL), which causes the whole
subpattern to fail because there are no more alternatives to try. In
this case, matching does now backtrack into A.
Note that a conditional subpattern is not considered as having two
alternatives, because only one is ever used. In other words, the |
character in a conditional subpattern has a different meaning. Ignoring
white space, consider: white space, consider:
^.*? (?(?=a) a | b(*THEN)c ) ^.*? (?(?=a) a | b(*THEN)c )
If the subject is "ba", this pattern does not match. Because .*? is If the subject is "ba", this pattern does not match. Because .*? is un-
ungreedy, it initially matches zero characters. The condition (?=a) greedy, it initially matches zero characters. The condition (?=a) then
then fails, the character "b" is matched, but "c" is not. At this fails, the character "b" is matched, but "c" is not. At this point,
point, matching does not backtrack to .*? as might perhaps be expected matching does not backtrack to .*? as might perhaps be expected from
from the presence of the | character. The conditional subpattern is the presence of the | character. The conditional subpattern is part of
part of the single alternative that comprises the whole pattern, and so the single alternative that comprises the whole pattern, and so the
the match fails. (If there was a backtrack into .*?, allowing it to match fails. (If there was a backtrack into .*?, allowing it to match
match "b", the match would succeed.) "b", the match would succeed.)
The verbs just described provide four different "strengths" of control The verbs just described provide four different "strengths" of control
when subsequent matching fails. (*THEN) is the weakest, carrying on the when subsequent matching fails. (*THEN) is the weakest, carrying on the
match at the next alternative. (*PRUNE) comes next, failing the match match at the next alternative. (*PRUNE) comes next, failing the match
at the current starting position, but allowing an advance to the next at the current starting position, but allowing an advance to the next
character (for an unanchored pattern). (*SKIP) is similar, except that character (for an unanchored pattern). (*SKIP) is similar, except that
the advance may be more than one character. (*COMMIT) is the strongest, the advance may be more than one character. (*COMMIT) is the strongest,
causing the entire match to fail. causing the entire match to fail.
More than one backtracking verb More than one backtracking verb
If more than one backtracking verb is present in a pattern, the one If more than one backtracking verb is present in a pattern, the one
that is backtracked onto first acts. For example, consider this pat- that is backtracked onto first acts. For example, consider this pat-
tern, where A, B, etc. are complex pattern fragments: tern, where A, B, etc. are complex pattern fragments:
(A(*COMMIT)B(*THEN)C|ABD) (A(*COMMIT)B(*THEN)C|ABD)
If A matches but B fails, the backtrack to (*COMMIT) causes the entire If A matches but B fails, the backtrack to (*COMMIT) causes the entire
match to fail. However, if A and B match, but C fails, the backtrack to match to fail. However, if A and B match, but C fails, the backtrack to
(*THEN) causes the next alternative (ABD) to be tried. This behaviour (*THEN) causes the next alternative (ABD) to be tried. This behaviour
is consistent, but is not always the same as Perl's. It means that if is consistent, but is not always the same as Perl's. It means that if
two or more backtracking verbs appear in succession, all but the last two or more backtracking verbs appear in succession, all but the last
of them have no effect. Consider this example: of them have no effect. Consider this example:
...(*COMMIT)(*PRUNE)... ...(*COMMIT)(*PRUNE)...
If there is a matching failure to the right, backtracking onto (*PRUNE) If there is a matching failure to the right, backtracking onto (*PRUNE)
causes it to be triggered, and its action is taken. There can never be causes it to be triggered, and its action is taken. There can never be
a backtrack onto (*COMMIT). a backtrack onto (*COMMIT).
Backtracking verbs in repeated groups Backtracking verbs in repeated groups
PCRE differs from Perl in its handling of backtracking verbs in PCRE differs from Perl in its handling of backtracking verbs in re-
repeated groups. For example, consider: peated groups. For example, consider:
/(a(*COMMIT)b)+ac/ /(a(*COMMIT)b)+ac/
If the subject is "abac", Perl matches, but PCRE fails because the If the subject is "abac", Perl matches, but PCRE fails because the
(*COMMIT) in the second repeat of the group acts. (*COMMIT) in the second repeat of the group acts.
Backtracking verbs in assertions Backtracking verbs in assertions
(*FAIL) in an assertion has its normal effect: it forces an immediate (*FAIL) in an assertion has its normal effect: it forces an immediate
backtrack. backtrack.
(*ACCEPT) in a positive assertion causes the assertion to succeed with- (*ACCEPT) in a positive assertion causes the assertion to succeed with-
out any further processing. In a negative assertion, (*ACCEPT) causes out any further processing. In a negative assertion, (*ACCEPT) causes
the assertion to fail without any further processing. the assertion to fail without any further processing.
The other backtracking verbs are not treated specially if they appear The other backtracking verbs are not treated specially if they appear
in a positive assertion. In particular, (*THEN) skips to the next in a positive assertion. In particular, (*THEN) skips to the next al-
alternative in the innermost enclosing group that has alternations, ternative in the innermost enclosing group that has alternations,
whether or not this is within the assertion. whether or not this is within the assertion.
Negative assertions are, however, different, in order to ensure that Negative assertions are, however, different, in order to ensure that
changing a positive assertion into a negative assertion changes its changing a positive assertion into a negative assertion changes its re-
result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg- sult. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a nega-
ative assertion to be true, without considering any further alternative tive assertion to be true, without considering any further alternative
branches in the assertion. Backtracking into (*THEN) causes it to skip branches in the assertion. Backtracking into (*THEN) causes it to skip
to the next enclosing alternative within the assertion (the normal be- to the next enclosing alternative within the assertion (the normal be-
haviour), but if the assertion does not have such an alternative, haviour), but if the assertion does not have such an alternative,
(*THEN) behaves like (*PRUNE). (*THEN) behaves like (*PRUNE).
Backtracking verbs in subroutines Backtracking verbs in subroutines
These behaviours occur whether or not the subpattern is called recur- These behaviours occur whether or not the subpattern is called recur-
sively. Perl's treatment of subroutines is different in some cases. sively. Perl's treatment of subroutines is different in some cases.
(*FAIL) in a subpattern called as a subroutine has its normal effect: (*FAIL) in a subpattern called as a subroutine has its normal effect:
it forces an immediate backtrack. it forces an immediate backtrack.
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
match to succeed without any further processing. Matching then contin- match to succeed without any further processing. Matching then contin-
ues after the subroutine call. ues after the subroutine call.
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
cause the subroutine match to fail. cause the subroutine match to fail.
(*THEN) skips to the next alternative in the innermost enclosing group (*THEN) skips to the next alternative in the innermost enclosing group
within the subpattern that has alternatives. If there is no such group within the subpattern that has alternatives. If there is no such group
within the subpattern, (*THEN) causes the subroutine match to fail. within the subpattern, (*THEN) causes the subroutine match to fail.
SEE ALSO SEE ALSO
pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3),
pcre16(3), pcre32(3). pcre16(3), pcre32(3).
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge CB2 3QH, England. Cambridge CB2 3QH, England.
REVISION REVISION
skipping to change at line 7636 skipping to change at line 7630
Perl and POSIX space are now the same. Perl added VT to its space char- Perl and POSIX space are now the same. Perl added VT to its space char-
acter set at release 5.18 and PCRE changed at release 8.34. acter set at release 5.18 and PCRE changed at release 8.34.
SCRIPT NAMES FOR \p AND \P SCRIPT NAMES FOR \p AND \P
Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car- Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei- ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero- form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha,
Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Im-
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- perial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin- Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin-
ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic,
Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hi-
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, eroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, New_Tai_Lue,
New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Old_Permic,
Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pa-
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, hawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha- Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac,
Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,
Yi. Yi.
CHARACTER CLASSES CHARACTER CLASSES
[...] positive character class [...] positive character class
[^...] negative character class [^...] negative character class
skipping to change at line 7751 skipping to change at line 7745
OPTION SETTING OPTION SETTING
(?i) caseless (?i) caseless
(?J) allow duplicate names (?J) allow duplicate names
(?m) multiline (?m) multiline
(?s) single line (dotall) (?s) single line (dotall)
(?U) default ungreedy (lazy) (?U) default ungreedy (lazy)
(?x) extended (ignore white space) (?x) extended (ignore white space)
(?-...) unset option(s) (?-...) unset option(s)
The following are recognized only at the very start of a pattern or The following are recognized only at the very start of a pattern or af-
after one of the newline or \R options with similar syntax. More than ter one of the newline or \R options with similar syntax. More than one
one of them may appear. of them may appear.
(*LIMIT_MATCH=d) set the match limit to d (decimal number) (*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
(*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS) (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
(*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
(*UTF) set appropriate UTF mode for the library in use (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE_UCP (use Unicode properties for \d etc) (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
the limits set by the caller of pcre_exec(), not increase them. the limits set by the caller of pcre_exec(), not increase them.
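The "reduce only" rule above can be modelled as taking the smaller of the two values. The following is a minimal sketch, not PCRE's own code; the function name effective_limit and its parameter names are hypothetical and chosen only for illustration.

```c
/* Hypothetical model of how a pattern-level (*LIMIT_MATCH=d) or
   (*LIMIT_RECURSION=d) setting combines with the caller's limit:
   the pattern may lower the limit but never raise it. */
static unsigned long effective_limit(unsigned long caller_limit,
                                     unsigned long pattern_limit)
{
    /* Keep whichever of the two limits is smaller. */
    return pattern_limit < caller_limit ? pattern_limit : caller_limit;
}
```

With a caller limit of 1000, a pattern value of 500 takes effect, whereas a pattern value of 5000 is ignored.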
NEWLINE CONVENTION NEWLINE CONVENTION
These are recognized only at the very start of the pattern or after These are recognized only at the very start of the pattern or after op-
option settings with a similar syntax. tion settings with a similar syntax.
(*CR) carriage return only (*CR) carriage return only
(*LF) linefeed only (*LF) linefeed only
(*CRLF) carriage return followed by linefeed (*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above (*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence (*ANY) any Unicode newline sequence
WHAT \R MATCHES WHAT \R MATCHES
These are recognized only at the very start of the pattern or after These are recognized only at the very start of the pattern or after op-
option setting with a similar syntax. tion setting with a similar syntax.
(*BSR_ANYCRLF) CR, LF, or CRLF (*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence (*BSR_UNICODE) any Unicode newline sequence
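As an illustration of what \R accepts under (*BSR_ANYCRLF), the sketch below reports the length of the newline sequence at a given position, preferring CRLF as a single unit. It is a standalone model rather than anything taken from the PCRE sources; the name bsr_anycrlf_length is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the newline sequence that
   starts at s[pos] under the (*BSR_ANYCRLF) convention.
   Returns 2 for CRLF, 1 for a lone CR or LF, and 0 otherwise. */
static size_t bsr_anycrlf_length(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;
    if (s[pos] == '\r') {
        /* CR followed by LF is treated as one two-byte sequence. */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    }
    if (s[pos] == '\n')
        return 1;
    return 0;
}
```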
LOOKAHEAD AND LOOKBEHIND ASSERTIONS LOOKAHEAD AND LOOKBEHIND ASSERTIONS
(?=...) positive look ahead (?=...) positive look ahead
(?!...) negative look ahead (?!...) negative look ahead
(?<=...) positive look behind (?<=...) positive look behind
(?<!...) negative look behind (?<!...) negative look behind
skipping to change at line 7956 skipping to change at line 7950
which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The
current check allows only values in the range U+0 to U+10FFFF, exclud- current check allows only values in the range U+0 to U+10FFFF, exclud-
ing the surrogate area. (From release 8.33 the so-called "non-charac- ing the surrogate area. (From release 8.33 the so-called "non-charac-
ter" code points are no longer excluded because Unicode corrigendum #9 ter" code points are no longer excluded because Unicode corrigendum #9
makes it clear that they should not be.) makes it clear that they should not be.)
Characters in the "Surrogate Area" of Unicode are reserved for use by Characters in the "Surrogate Area" of Unicode are reserved for use by
UTF-16, where they are used in pairs to encode codepoints with values UTF-16, where they are used in pairs to encode codepoints with values
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 and UTF-32 encodings. (In are available independently in the UTF-8 and UTF-32 encodings. (In
other words, the whole surrogate thing is a fudge for UTF-16 which other words, the whole surrogate thing is a fudge for UTF-16 which un-
unfortunately messes up UTF-8 and UTF-32.) fortunately messes up UTF-8 and UTF-32.)
If an invalid UTF-8 string is passed to PCRE, an error return is given. If an invalid UTF-8 string is passed to PCRE, an error return is given.
At compile time, the only additional information is the offset to the At compile time, the only additional information is the offset to the
first byte of the failing character. The run-time functions pcre_exec() first byte of the failing character. The run-time functions pcre_exec()
and pcre_dfa_exec() also pass back this information, as well as a more and pcre_dfa_exec() also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do detailed reason code if the caller has provided memory in which to do
this. this.
In some situations, you may already know that your strings are valid, In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor- and therefore want to skip these checks in order to improve perfor-
skipping to change at line 8058 skipping to change at line 8052
the alternative matching function pcre[16|32]_dfa_exec(), nor is it the alternative matching function pcre[16|32]_dfa_exec(), nor is it
supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If
JIT optimization is requested for a UTF pattern that contains \C, it JIT optimization is requested for a UTF pattern that contains \C, it
will not succeed, and so the matching will be carried out by the normal will not succeed, and so the matching will be carried out by the normal
interpretive function. interpretive function.
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that test characters of any code value, but, by default, the characters that
PCRE recognizes as digits, spaces, or word characters remain the same PCRE recognizes as digits, spaces, or word characters remain the same
set as in non-UTF mode, all with values less than 256. This remains set as in non-UTF mode, all with values less than 256. This remains
true even when PCRE is built to include Unicode property support, true even when PCRE is built to include Unicode property support, be-
because to do otherwise would slow down PCRE in many common cases. Note cause to do otherwise would slow down PCRE in many common cases. Note
in particular that this applies to \b and \B, because they are defined in particular that this applies to \b and \B, because they are defined
in terms of \w and \W. If you really want to test for a wider sense of, in terms of \w and \W. If you really want to test for a wider sense of,
say, "digit", you can use explicit Unicode property tests such as say, "digit", you can use explicit Unicode property tests such as
\p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
character escapes work is changed so that Unicode properties are used character escapes work is changed so that Unicode properties are used
to determine which characters match. There are more details in the sec- to determine which characters match. There are more details in the sec-
tion on generic character types in the pcrepattern documentation. tion on generic character types in the pcrepattern documentation.
7. Similarly, characters that match the POSIX named character classes 7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set. are all low-valued characters, unless the PCRE_UCP option is set.
skipping to change at line 8142 skipping to change at line 8136
ARM v5, v7, and Thumb2 ARM v5, v7, and Thumb2
Intel x86 32-bit and 64-bit Intel x86 32-bit and 64-bit
MIPS 32-bit MIPS 32-bit
Power PC 32-bit and 64-bit Power PC 32-bit and 64-bit
SPARC 32-bit (experimental) SPARC 32-bit (experimental)
If --enable-jit is set on an unsupported platform, compilation fails. If --enable-jit is set on an unsupported platform, compilation fails.
A program that is linked with PCRE 8.20 or later can tell if JIT sup- A program that is linked with PCRE 8.20 or later can tell if JIT sup-
port is available by calling pcre_config() with the PCRE_CONFIG_JIT port is available by calling pcre_config() with the PCRE_CONFIG_JIT op-
option. The result is 1 when JIT is available, and 0 otherwise. How- tion. The result is 1 when JIT is available, and 0 otherwise. However,
ever, a simple program does not need to check this in order to use JIT. a simple program does not need to check this in order to use JIT. The
The normal API is implemented in a way that falls back to the interpre- normal API is implemented in a way that falls back to the interpretive
tive code if JIT is not available. For programs that need the best pos- code if JIT is not available. For programs that need the best possible
sible performance, there is also a "fast path" API that is JIT-spe- performance, there is also a "fast path" API that is JIT-specific.
cific.
If your program may sometimes be linked with versions of PCRE that are If your program may sometimes be linked with versions of PCRE that are
older than 8.20, but you want to use JIT when it is available, you can older than 8.20, but you want to use JIT when it is available, you can
test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
macro such as PCRE_CONFIG_JIT, for compile-time control of your code. macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
Also beware that the pcre_jit_exec() function was not available at all Also beware that the pcre_jit_exec() function was not available at all
before 8.32, and may not be available at all if PCRE isn't compiled before 8.32, and may not be available at all if PCRE isn't compiled
with --enable-jit. See the "JIT FAST PATH API" section below for with --enable-jit. See the "JIT FAST PATH API" section below for de-
details. tails.
SIMPLE USE OF JIT SIMPLE USE OF JIT
You have to do two things to make use of the JIT support in the sim- You have to do two things to make use of the JIT support in the sim-
plest way: plest way:
(1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
each compiled pattern, and pass the resulting pcre_extra block to each compiled pattern, and pass the resulting pcre_extra block to
pcre_exec(). pcre_exec().
(2) Use pcre_free_study() to free the pcre_extra block when it is (2) Use pcre_free_study() to free the pcre_extra block when it is
no longer needed, instead of just freeing it yourself. This no longer needed, instead of just freeing it yourself. This en-
ensures that sures that
any JIT data is also freed. any JIT data is also freed.
For a program that may be linked with pre-8.20 versions of PCRE, you For a program that may be linked with pre-8.20 versions of PCRE, you
can insert can insert
#ifndef PCRE_STUDY_JIT_COMPILE #ifndef PCRE_STUDY_JIT_COMPILE
#define PCRE_STUDY_JIT_COMPILE 0 #define PCRE_STUDY_JIT_COMPILE 0
#endif #endif
so that no option is passed to pcre_study(), and then use something so that no option is passed to pcre_study(), and then use something
like this to free the study data: like this to free the study data:
#ifdef PCRE_CONFIG_JIT #ifdef PCRE_CONFIG_JIT
pcre_free_study(study_ptr); pcre_free_study(study_ptr);
#else #else
pcre_free(study_ptr); pcre_free(study_ptr);
#endif #endif
PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate code for PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate code for
complete matches. If you want to run partial matches using the complete matches. If you want to run partial matches using the
PCRE_PARTIAL_HARD or PCRE_PARTIAL_SOFT options of pcre_exec(), you PCRE_PARTIAL_HARD or PCRE_PARTIAL_SOFT options of pcre_exec(), you
should set one or both of the following options in addition to, or should set one or both of the following options in addition to, or in-
instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study(): stead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
If using pcre_jit_exec() and supporting a pre-8.32 version of PCRE, you If using pcre_jit_exec() and supporting a pre-8.32 version of PCRE, you
can insert: can insert:
#if PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 32) #if PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 32)
pcre_jit_exec(...); pcre_jit_exec(...);
#else #else
pcre_exec(...); pcre_exec(...);
#endif #endif
but, as described in the "JIT FAST PATH API" section below, this assumes but, as described in the "JIT FAST PATH API" section below, this assumes
that versions 8.32 and later are built with --enable-jit; otherwise that versions 8.32 and later are built with --enable-jit; otherwise
pcre_jit_exec() may be missing and the build may break. pcre_jit_exec() may be missing and the build may break.
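A version test of this kind has to compare the major number before the minor number, because a later major release with a small minor number must still count as newer. The sketch below shows one way to write such a comparison. VERSION_AT_LEAST and version_at_least are hypothetical names that pcre.h does not define; only PCRE_MAJOR and PCRE_MINOR come from the library header.

```c
/* Hypothetical helper macro: true when major.minor is at least
   want_major.want_minor. It uses only integer comparisons, so it can
   also appear in a preprocessor condition, for example
       #if VERSION_AT_LEAST(PCRE_MAJOR, PCRE_MINOR, 8, 32)
   once pcre.h has been included. */
#define VERSION_AT_LEAST(major, minor, want_major, want_minor) \
    ((major) > (want_major) || \
     ((major) == (want_major) && (minor) >= (want_minor)))

/* Function form of the same test, convenient for run-time checks. */
static int version_at_least(int major, int minor,
                            int want_major, int want_minor)
{
    return VERSION_AT_LEAST(major, minor, want_major, want_minor);
}
```

Testing the two numbers independently, as in "major >= 8 && minor >= 32", would reject a version numbered 9.0; the form above accepts it.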
The JIT compiler generates different optimized code for each of the The JIT compiler generates different optimized code for each of the
three modes (normal, soft partial, hard partial). When pcre_exec() is three modes (normal, soft partial, hard partial). When pcre_exec() is
called, the appropriate code is run if it is available. Otherwise, the called, the appropriate code is run if it is available. Otherwise, the
pattern is matched using interpretive code. pattern is matched using interpretive code.
In some circumstances you may need to call additional functions. These In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack" are described in the section entitled "Controlling the JIT stack" be-
below. low.
If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are ig-
ignored, and no JIT data is created. Otherwise, the compiled pattern is nored, and no JIT data is created. Otherwise, the compiled pattern is
passed to the JIT compiler, which turns it into machine code that exe- passed to the JIT compiler, which turns it into machine code that exe-
cutes much faster than the normal interpretive code. When pcre_exec() cutes much faster than the normal interpretive code. When pcre_exec()
is passed a pcre_extra block containing a pointer to JIT code of the is passed a pcre_extra block containing a pointer to JIT code of the
appropriate mode (normal or hard/soft partial), it obeys that code appropriate mode (normal or hard/soft partial), it obeys that code in-
instead of running the interpreter. The result is identical, but the stead of running the interpreter. The result is identical, but the com-
compiled JIT code runs much faster. piled JIT code runs much faster.
There are some pcre_exec() options that are not supported for JIT exe- There are some pcre_exec() options that are not supported for JIT exe-
cution. There are also some pattern items that JIT cannot handle. cution. There are also some pattern items that JIT cannot handle. De-
Details are given below. In both cases, execution automatically falls tails are given below. In both cases, execution automatically falls
back to the interpretive code. If you want to know whether JIT was back to the interpretive code. If you want to know whether JIT was ac-
actually used for a particular match, you should arrange for a JIT tually used for a particular match, you should arrange for a JIT call-
callback function to be set up as described in the section entitled back function to be set up as described in the section entitled "Con-
"Controlling the JIT stack" below, even if you do not need to supply a trolling the JIT stack" below, even if you do not need to supply a non-
non-default JIT stack. Such a callback function is called whenever JIT default JIT stack. Such a callback function is called whenever JIT code
code is about to be obeyed. If the execution options are not right for is about to be obeyed. If the execution options are not right for JIT
JIT execution, the callback function is not obeyed. execution, the callback function is not obeyed.
If the JIT compiler finds an unsupported item, no JIT data is gener- If the JIT compiler finds an unsupported item, no JIT data is gener-
ated. You can find out if JIT execution is available after studying a ated. You can find out if JIT execution is available after studying a
pattern by calling pcre_fullinfo() with the PCRE_INFO_JIT option. A pattern by calling pcre_fullinfo() with the PCRE_INFO_JIT option. A re-
result of 1 means that JIT compilation was successful. A result of 0 sult of 1 means that JIT compilation was successful. A result of 0
means that JIT support is not available, or the pattern was not studied means that JIT support is not available, or the pattern was not studied
with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not able to with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not able to
handle the pattern. handle the pattern.
Once a pattern has been studied, with or without JIT, it can be used as Once a pattern has been studied, with or without JIT, it can be used as
many times as you like for matching different subject strings. many times as you like for matching different subject strings.
UNSUPPORTED OPTIONS AND PATTERN ITEMS UNSUPPORTED OPTIONS AND PATTERN ITEMS
The only pcre_exec() options that are supported for JIT execution are The only pcre_exec() options that are supported for JIT execution are
PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT- PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
BOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PAR- BOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PAR-
TIAL_HARD, and PCRE_PARTIAL_SOFT. TIAL_HARD, and PCRE_PARTIAL_SOFT.
The only unsupported pattern items are \C (match a single data unit) The only unsupported pattern items are \C (match a single data unit)
when running in a UTF mode, and a callout immediately before an asser- when running in a UTF mode, and a callout immediately before an asser-
tion condition in a conditional group. tion condition in a conditional group.
RETURN VALUES FROM JIT EXECUTION RETURN VALUES FROM JIT EXECUTION
When a pattern is matched using JIT execution, the return values are When a pattern is matched using JIT execution, the return values are
the same as those given by the interpretive pcre_exec() code, with the the same as those given by the interpretive pcre_exec() code, with the
addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
that the memory used for the JIT stack was insufficient. See "Control- that the memory used for the JIT stack was insufficient. See "Control-
ling the JIT stack" below for a discussion of JIT stack usage. For com- ling the JIT stack" below for a discussion of JIT stack usage. For com-
patibility with the interpretive pcre_exec() code, no more than two- patibility with the interpretive pcre_exec() code, no more than two-
thirds of the ovector argument is used for passing back captured sub- thirds of the ovector argument is used for passing back captured sub-
strings. strings.
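The two-thirds rule can be made concrete with a little arithmetic. The sketch below assumes the behaviour given in the pcre_exec() documentation, namely that ovecsize is rounded down to a multiple of three and that each captured substring occupies a pair of integers. The function names are hypothetical and exist only for this illustration.

```c
/* Hypothetical helpers illustrating the ovector layout. */

/* Number of (start, end) pairs that can be passed back, including the
   pair for the whole match. */
int max_captured_pairs(int ovecsize)
{
    if (ovecsize < 0)
        return 0;
    return ovecsize / 3;    /* integer division rounds down */
}

/* Number of ovector elements used for passing back substrings
   (the first two-thirds of the vector). */
int usable_ovector_ints(int ovecsize)
{
    return 2 * max_captured_pairs(ovecsize);
}
```

For example, an ovector of 30 integers allows 10 pairs to be returned, occupying the first 20 elements.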
The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error what is counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
code is never returned by JIT execution. code is never returned by JIT execution.
SAVING AND RESTORING COMPILED PATTERNS SAVING AND RESTORING COMPILED PATTERNS
The code that is generated by the JIT compiler is architecture-spe- The code that is generated by the JIT compiler is architecture-spe-
cific, and is also position dependent. For those reasons it cannot be cific, and is also position dependent. For those reasons it cannot be
saved (in a file or database) and restored later like the bytecode and saved (in a file or database) and restored later like the bytecode and
other data of a compiled pattern. Saving and restoring compiled pat- other data of a compiled pattern. Saving and restoring compiled pat-
terns is not something many people do. More detail about this facility terns is not something many people do. More detail about this facility
is given in the pcreprecompile documentation. It should be possible to is given in the pcreprecompile documentation. It should be possible to
run pcre_study() on a saved and restored pattern, and thereby recreate run pcre_study() on a saved and restored pattern, and thereby recreate
the JIT data, but because JIT compilation uses significant resources, the JIT data, but because JIT compilation uses significant resources,
it is probably not worth doing this; you might as well recompile the it is probably not worth doing this; you might as well recompile the
original pattern. original pattern.
CONTROLLING THE JIT STACK CONTROLLING THE JIT STACK
When the compiled JIT code runs, it needs a block of memory to use as a When the compiled JIT code runs, it needs a block of memory to use as a
stack. By default, it uses 32K on the machine stack. However, some stack. By default, it uses 32K on the machine stack. However, some
large or complicated patterns need more than this. The error large or complicated patterns need more than this. The error PCRE_ER-
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
Three functions are provided for managing blocks of memory for use as tions are provided for managing blocks of memory for use as JIT stacks.
JIT stacks. There is further discussion about the use of JIT stacks in There is further discussion about the use of JIT stacks in the section
the section entitled "JIT stack FAQ" below. entitled "JIT stack FAQ" below.
The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
are a starting size and a maximum size, and it returns a pointer to an are a starting size and a maximum size, and it returns a pointer to an
opaque structure of type pcre_jit_stack, or NULL if there is an error. opaque structure of type pcre_jit_stack, or NULL if there is an error.
The pcre_jit_stack_free() function can be used to free a stack that is The pcre_jit_stack_free() function can be used to free a stack that is
no longer needed. (For the technically minded: the address space is no longer needed. (For the technically minded: the address space is al-
allocated by mmap or VirtualAlloc.) located by mmap or VirtualAlloc.)
JIT uses far less memory for recursion than the interpretive code, and JIT uses far less memory for recursion than the interpretive code, and
a maximum stack size of 512K to 1M should be more than enough for any a maximum stack size of 512K to 1M should be more than enough for any
pattern. pattern.
The pcre_assign_jit_stack() function specifies which stack JIT code The pcre_assign_jit_stack() function specifies which stack JIT code
should use. Its arguments are as follows: should use. Its arguments are as follows:
pcre_extra *extra pcre_extra *extra
pcre_jit_callback callback pcre_jit_callback callback
void *data void *data
The extra argument must be the result of studying a pattern with The extra argument must be the result of studying a pattern with
PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
other two arguments: other two arguments:
(1) If callback is NULL and data is NULL, an internal 32K block (1) If callback is NULL and data is NULL, an internal 32K block
on the machine stack is used. on the machine stack is used.
(2) If callback is NULL and data is not NULL, data must be (2) If callback is NULL and data is not NULL, data must be
a valid JIT stack, the result of calling pcre_jit_stack_alloc(). a valid JIT stack, the result of calling pcre_jit_stack_alloc().
(3) If callback is not NULL, it must point to a function that is (3) If callback is not NULL, it must point to a function that is
called with data as an argument at the start of matching, in called with data as an argument at the start of matching, in
order to set up a JIT stack. If the return from the callback order to set up a JIT stack. If the return from the callback
function is NULL, the internal 32K stack is used; otherwise the function is NULL, the internal 32K stack is used; otherwise the
return value must be a valid JIT stack, the result of calling return value must be a valid JIT stack, the result of calling
pcre_jit_stack_alloc(). pcre_jit_stack_alloc().
A callback function is obeyed whenever JIT code is about to be run; it A callback function is obeyed whenever JIT code is about to be run; it
is not obeyed when pcre_exec() is called with options that are incom- is not obeyed when pcre_exec() is called with options that are incom-
patible for JIT execution. A callback function can therefore be used to patible for JIT execution. A callback function can therefore be used to
determine whether a match operation was executed by JIT or by the determine whether a match operation was executed by JIT or by the in-
interpreter. terpreter.
You may safely use the same JIT stack for more than one pattern (either You may safely use the same JIT stack for more than one pattern (either
by assigning directly or by callback), as long as the patterns are all by assigning directly or by callback), as long as the patterns are all
matched sequentially in the same thread. In a multithread application, matched sequentially in the same thread. In a multithread application,
if you do not specify a JIT stack, or if you assign or pass back NULL if you do not specify a JIT stack, or if you assign or pass back NULL
from a callback, that is thread-safe, because each thread has its own from a callback, that is thread-safe, because each thread has its own
machine stack. However, if you assign or pass back a non-NULL JIT machine stack. However, if you assign or pass back a non-NULL JIT
stack, this must be a different stack for each thread so that the stack, this must be a different stack for each thread so that the ap-
application is thread-safe. plication is thread-safe.
Strictly speaking, even more is allowed. You can assign the same non- Strictly speaking, even more is allowed. You can assign the same non-
NULL stack to any number of patterns as long as they are not used for NULL stack to any number of patterns as long as they are not used for
matching by multiple threads at the same time. For example, you can matching by multiple threads at the same time. For example, you can as-
assign the same stack to all compiled patterns, and use a global mutex sign the same stack to all compiled patterns, and use a global mutex in
in the callback to wait until the stack is available for use. However, the callback to wait until the stack is available for use. However,
this is an inefficient solution, and not recommended. this is an inefficient solution, and not recommended.
This is a suggestion for how a multithreaded program that needs to set This is a suggestion for how a multithreaded program that needs to set
up non-default JIT stacks might operate: up non-default JIT stacks might operate:
During thread initialization During thread initialization
thread_local_var = pcre_jit_stack_alloc(...) thread_local_var = pcre_jit_stack_alloc(...)
During thread exit During thread exit
pcre_jit_stack_free(thread_local_var) pcre_jit_stack_free(thread_local_var)
Use a one-line callback function Use a one-line callback function
return thread_local_var return thread_local_var
All the functions described in this section do nothing if JIT is not All the functions described in this section do nothing if JIT is not
available, and pcre_assign_jit_stack() does nothing unless the extra available, and pcre_assign_jit_stack() does nothing unless the extra
argument is non-NULL and points to a pcre_extra block that is the argument is non-NULL and points to a pcre_extra block that is the re-
result of a successful study with PCRE_STUDY_JIT_COMPILE etc. sult of a successful study with PCRE_STUDY_JIT_COMPILE etc.
JIT STACK FAQ JIT STACK FAQ
(1) Why do we need JIT stacks? (1) Why do we need JIT stacks?
PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
where the local data of the current node is pushed before checking its where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack on some platforms is diffi- child nodes. Allocating real machine stack on some platforms is diffi-
cult. For example, the stack chain needs to be updated every time if we cult. For example, the stack chain needs to be updated every time if we
extend the stack on PowerPC. Although it is possible, its updating extend the stack on PowerPC. Although it is possible, its updating