"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "doc/pcre2api.3" between
pcre2-10.35.tar.bz2 and pcre2-10.36.tar.bz2

About: The PCRE2 library implements Perl compatible regular expression pattern matching. New future PCRE version with revised API.

pcre2api.3  (pcre2-10.35.tar.bz2):pcre2api.3  (pcre2-10.36.tar.bz2)
skipping to change at line 433
A pointer to the compiled form of a pattern is returned to the user when pcre2_compile() is successful.
The data in the compiled pattern is fixed, and does not change when the pattern is matched. Therefore, it
is thread-safe, that is, the same compiled pattern can be used by more than one thread simultaneously.
For example, an application can compile all its patterns at the start, before forking off multiple
threads that use them. However, if the just-in-time (JIT) optimization feature is being used, it needs
separate memory stack areas for each thread. See the pcre2jit documentation for more details.
pcre2api.3 (pcre2-10.35.tar.bz2):

In a more complicated situation, where patterns are compiled only when they are first needed, but are
still shared between threads, pointers to compiled patterns must be protected from simultaneous writing
by multiple threads, at least until a pattern has been compiled. The logic can be something like this:
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    pointer = pcre2_compile(...
    }
  Release the lock
  Use pointer in pcre2_match()
Of course, testing for compilation errors should also be included in the code.
If JIT is being used, but the JIT compilation is not being done immediately, (perhaps waiting to see if
the pattern is used often enough) similar logic is required. JIT compilation updates a pointer within the
compiled code block, so a thread must gain unique write access to the pointer before calling
pcre2_jit_compile(). Alternatively, pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
obtain a private copy of the compiled code before calling the JIT compiler.

pcre2api.3 (pcre2-10.36.tar.bz2):

In a more complicated situation, where patterns are compiled only when they are first needed, but are
still shared between threads, pointers to compiled patterns must be protected from simultaneous writing
by multiple threads. This is somewhat tricky to do correctly. If you know that writing to a pointer is
atomic in your environment, you can use logic like this:
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    if (pointer == NULL) pointer = pcre2_compile(...
    }
  Release the lock
  Use pointer in pcre2_match()
Of course, testing for compilation errors should also be included in the code.
The reason for checking the pointer a second time is as follows: Several threads may have acquired the
shared lock and tested the pointer for being NULL, but only one of them will be given the write lock,
with the rest kept waiting. The winning thread will compile the pattern and store the result. After this
thread releases the write lock, another thread will get it, and if it does not retest pointer for being
NULL, will recompile the pattern and overwrite the pointer, creating a memory leak and possibly causing
other issues.
In an environment where writing to a pointer may not be atomic, the above logic is not sufficient. The
thread that is doing the compiling may be descheduled after writing only part of the pointer, which could
cause other threads to use an invalid value. Instead of checking the pointer itself, a separate "pointer
is valid" flag (that can be updated atomically) must be used:
  Get a read-only (shared) lock (mutex) for pointer
  if (!pointer_is_valid)
    {
    Get a write (unique) lock for pointer
    if (!pointer_is_valid)
      {
      pointer = pcre2_compile(...
      pointer_is_valid = TRUE
      }
    }
  Release the lock
  Use pointer in pcre2_match()
If JIT is being used, but the JIT compilation is not being done immediately (perhaps waiting to see if
the pattern is used often enough), similar logic is required. JIT compilation updates a value within the
compiled code block, so a thread must gain unique write access to the pointer before calling
pcre2_jit_compile(). Alternatively, pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
obtain a private copy of the compiled code before calling the JIT compiler.
Context blocks
The next main section below introduces the idea of "contexts" in which PCRE2 functions are called. A
context is nothing more than a collection of parameters that control the way PCRE2 operates. Grouping a
number of parameters together in a context is a convenient way of passing them to a PCRE2 function without
using lots of arguments. The parameters that are stored in contexts are in some sense "advanced features"
of the API. Many straightforward applications will not need to use contexts.
In a multithreaded application, if the parameters in a context are values that are never changed, the
same context can be used by all the threads. However, if any thread needs to change any value in a
context, it must make its own thread-specific copy.
Match blocks
The matching functions need a block of memory for storing the results of a match. This includes details
of what was matched, as well as additional information such as the name of a (*MARK) setting. Each thread
must provide its own copy of this memory.
PCRE2 CONTEXTS
Some PCRE2 functions have a lot of parameters, many of which are used only by specialist applications,
for example, those that use custom memory management or non-standard character tables. To keep function
argument lists at a reasonable size, and at the same time to keep the API extensible, "uncommon"
parameters are passed to certain functions in a context instead of directly. A context is just a block of
memory that holds the parameter values. Applications that do not need to adjust any of the context
parameters can pass NULL when a context pointer is required.
There are three different types of context: a general context that is relevant for several PCRE2
operations, a compile-time context, and a match-time context.
The general context
At present, this context just contains pointers to (and data for) external memory management functions
that are called from several places in the PCRE2 library. The context is named `general' rather than
specifically `memory' because in future other fields may be added. If you do not want to supply your own
custom memory management functions, you do not need to bother with a general context. A general context
is created by:
  pcre2_general_context *pcre2_general_context_create(
    void *(*private_malloc)(PCRE2_SIZE, void *),
    void (*private_free)(void *, void *), void *memory_data);
The two function pointers specify custom memory management functions, whose prototypes are:
  void *private_malloc(PCRE2_SIZE, void *);
  void  private_free(void *, void *);
Whenever code in PCRE2 calls these functions, the final argument is the value of memory_data. Either of
the first two arguments of the creation function may be NULL, in which case the system memory management
functions malloc() and free() are used. (This is not currently useful, as there are no other fields in a
general context, but in future there might be.) The private_malloc() function is used (if supplied) to
obtain memory for storing the context, and all three values are saved as part of the context.
Whenever PCRE2 creates a data block of any kind, the block contains a pointer to the free() function that
matches the malloc() function that was used. When the time comes to free the block, this function is
called.
A general context can be copied by calling:
  pcre2_general_context *pcre2_general_context_copy(
    pcre2_general_context *gcontext);
The memory used for a general context should be freed by calling:
  void pcre2_general_context_free(pcre2_general_context *gcontext);
If this function is passed a NULL argument, it returns immediately without doing anything.
The compile context
A compile context is required if you want to provide an external function for stack checking during
compilation or to change the default values of any of the following compile-time parameters:
  What \R matches (Unicode newlines or CR, LF, CRLF only)
  PCRE2's character tables
  The newline character sequence
  The compile time nested parentheses limit
  The maximum length of the pattern string
  The extra options bits (none set by default)
A compile context is also required if you are using custom memory management. If none of these apply,
just pass NULL as the context argument of pcre2_compile().
A compile context is created, copied, and freed by the following functions:
  pcre2_compile_context *pcre2_compile_context_create(
    pcre2_general_context *gcontext);
  pcre2_compile_context *pcre2_compile_context_copy(
    pcre2_compile_context *ccontext);
  void pcre2_compile_context_free(pcre2_compile_context *ccontext);
A compile context is created with default values for its parameters. These can be changed by calling the
following functions, which return 0 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
  int pcre2_set_bsr(pcre2_compile_context *ccontext,
    uint32_t value);
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF, or CRLF, or
PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line ending sequence. The value is used by the
JIT compiler and by the two interpreted matching functions, pcre2_match() and pcre2_dfa_match().
  int pcre2_set_character_tables(pcre2_compile_context *ccontext,
    const uint8_t *tables);
The value must be the result of a call to pcre2_maketables(), whose only argument is a general context.
This function builds a set of character tables in the current locale.
  int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
    uint32_t extra_options);
As PCRE2 has developed, almost all the 32 option bits that are available in the options argument of
pcre2_compile() have been used up. To avoid running out, the compile context contains a set of extra
option bits which are used for some newer, assumed rarer, options. This function sets those bits. It
always sets all the bits (either on or off). It does not modify any existing setting. The available
options are defined in the section entitled "Extra compile options" below.
  int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
    PCRE2_SIZE value);
This sets a maximum length, in code units, for any pattern string that is compiled with this context. If
the pattern is longer, an error is generated. This facility is provided so that applications that accept
patterns from external sources can limit their size. The default is the largest number that a PCRE2_SIZE
variable can hold, which is effectively unlimited.
  int pcre2_set_newline(pcre2_compile_context *ccontext,
    uint32_t value);
This specifies which characters or character sequences are to be recognized as newlines. The value must
be one of PCRE2_NEWLINE_CR (carriage return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF
(the two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above),
PCRE2_NEWLINE_ANY (any Unicode newline sequence), or PCRE2_NEWLINE_NUL (the NUL character, that is a
binary zero).
A pattern can override the value set in the compile context by starting with a sequence such as (*CRLF).
See the pcre2pattern page for details.
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option, the newline convention
affects the recognition of the end of internal comments starting with #. The value is saved with the
compiled pattern for subsequent use by the JIT compiler and by the two interpreted matching functions,
pcre2_match() and pcre2_dfa_match().
  int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
    uint32_t value);
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the depth of parenthesis
nesting in a pattern. This limit stops rogue patterns using up too much system stack when being compiled.
The limit applies to parentheses of all kinds, not just capturing parentheses.
  int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
    int (*guard_function)(uint32_t, void *), void *user_data);
There is at least one application that runs PCRE2 in threads with very limited system stack, where
running out of stack is to be avoided at all costs. The parenthesis limit above cannot take account of how
much stack is actually available during compilation. For a finer control, you can supply a function that
is called whenever pcre2_compile() starts to compile a parenthesized part of a pattern. This function can
check the actual stack size (or anything else that it wants to, of course).
The first argument to the callout function gives the current depth of nesting, and the second is user
data that is set up by the last argument of pcre2_set_compile_recursion_guard(). The callout function
should return zero if all is well, or non-zero to force an error.
The match context
A match context is required if you want to:
  Set up a callout function
  Set an offset limit for matching an unanchored pattern
  Change the limit on the amount of heap used when matching
  Change the backtracking match limit
  Change the backtracking depth limit
  Set custom memory management specifically for the match
If none of these apply, just pass NULL as the context argument of pcre2_match(), pcre2_dfa_match(), or
pcre2_jit_match().
A match context is created, copied, and freed by the following functions:
  pcre2_match_context *pcre2_match_context_create(
    pcre2_general_context *gcontext);
  pcre2_match_context *pcre2_match_context_copy(
    pcre2_match_context *mcontext);
  void pcre2_match_context_free(pcre2_match_context *mcontext);
A match context is created with default values for its parameters. These can be changed by calling the
following functions, which return 0 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
  int pcre2_set_callout(pcre2_match_context *mcontext,
    int (*callout_function)(pcre2_callout_block *, void *),
    void *callout_data);
This sets up a callout function for PCRE2 to call at specified points during a matching operation.
Details are given in the pcre2callout documentation.
  int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
    int (*callout_function)(pcre2_substitute_callout_block *, void *),
    void *callout_data);
This sets up a callout function for PCRE2 to call after each substitution made by pcre2_substitute().
Details are given in the section entitled "Creating a new string with substitutions" below.
  int pcre2_set_offset_limit(pcre2_match_context *mcontext,
    PCRE2_SIZE value);
The offset_limit parameter limits how far an unanchored search can advance in the subject string. The
default value is PCRE2_UNSET. The pcre2_match() and pcre2_dfa_match() functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given offset is not found. The
pcre2_substitute() function makes no more substitutions.
For example, if the pattern /abc/ is matched against "123abc" with an offset limit less than 3, the
result is PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset argument of pcre2_match(),
pcre2_dfa_match(), or pcre2_substitute() is greater than the offset limit set in the match context.
When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when calling pcre2_compile() so
that when JIT is in use, different code can be compiled. If a match is started with a non-default match
limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
The offset limit facility can be used to track progress when searching large subject strings or to limit
the extent of global substitutions. See also the PCRE2_FIRSTLINE option, which requires a match to start
before or at the first newline that follows the start of matching in the subject. If this is set with an
offset limit, a match must occur in the first line and also within the offset limit. In other words,
whichever limit comes first is used.
  int pcre2_set_heap_limit(pcre2_match_context *mcontext,
    uint32_t value);
The heap_limit parameter specifies, in units of kibibytes (1024 bytes), the maximum amount of heap memory
that pcre2_match() may use to hold backtracking information when running an interpretive match. This
limit also applies to pcre2_dfa_match(), which may use the heap when processing patterns with a lot of
nested pattern recursion or lookarounds or atomic groups. This limit does not apply to matching with the
JIT optimization, which has its own memory control arrangements (see the pcre2jit documentation for more
details). If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
limit can be set when PCRE2 is built; if it is not, the default is set very large and is essentially
"unlimited".
A value for the heap limit may also be supplied by an item at the start of a pattern of the form
  (*LIMIT_HEAP=ddd)
where ddd is a decimal number. However, such a setting is ignored unless ddd is less than the limit set
by the caller of pcre2_match() or, if no such limit is set, less than the default.
The pcre2_match() function starts out using a 20KiB vector on the system stack for recording backtracking
points. The more nested backtracking points there are (that is, the deeper the search tree), the more
memory is needed. Heap memory is used only if the initial vector is too small. If the heap limit is set
to a value less than 21 (in particular, zero) no heap memory will be used. In this case, only patterns
that do not have a lot of nested backtracking can be successfully processed.
Similarly, for pcre2_dfa_match(), a vector on the system stack is used when processing pattern
recursions, lookarounds, or atomic groups, and only if this is not big enough is heap memory used. In this
case, too, setting a value of zero disables the use of the heap.
int pcre2_set_match_limit(pcre2_match_context *mcontext,
uint32_t value);
The match_limit parameter provides a means of preventing PCRE2 from using up too many computing resources
when processing patterns that are not going to match, but which have a very large number of possibilities
in their search trees. The classic example is a pattern that uses nested unlimited repeats.
There is an internal counter in pcre2_match() that is incremented each time round its main matching loop.
If this value reaches the match limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
This has the effect of limiting the amount of backtracking that can take place. For patterns that are not
anchored, the count restarts from zero for each position in the subject string. This limit also applies
to pcre2_dfa_match(), though the counting is done in a different way.
When pcre2_match() is called with a pattern that was successfully processed by pcre2_jit_compile(), the
way in which matching is executed is entirely different. However, there is still the possibility of
runaway matching that goes on for a very long time, and so the match_limit value is also used in this case
(but in a different way) to limit how long the matching can continue.
The default value for the limit can be set when PCRE2 is built; if it is not, the default is 10 million,
which handles all but the most extreme cases. A value for the match limit may also be supplied by an item
at the start of a pattern of the form
(*LIMIT_MATCH=ddd)
where ddd is a decimal number. However, such a setting is ignored unless ddd is less than the limit set
by the caller of pcre2_match() or pcre2_dfa_match() or, if no such limit is set, less than the default.
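The rule that a pattern item can lower a limit but never raise it is stated for each (*LIMIT_...=ddd)
item. As a minimal sketch, it can be written as a small function. The function effective_limit() and
its parameter names are hypothetical and are not part of the PCRE2 API; the sketch only models the rule
as described in the text and says nothing about how the library implements it internally.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of PCRE2: models the documented precedence
 * between the build-time default, a limit set by the caller in a match
 * context, and a (*LIMIT_...=ddd) item at the start of a pattern.
 *
 *   default_limit : the build-time default for this limit
 *   caller_set    : nonzero if the caller set a limit in the match context
 *   caller_limit  : the caller's value (used only when caller_set is nonzero)
 *   pattern_set   : nonzero if the pattern starts with a (*LIMIT_...=ddd) item
 *   pattern_limit : the ddd value taken from the pattern
 */
uint32_t effective_limit(uint32_t default_limit,
                         int caller_set, uint32_t caller_limit,
                         int pattern_set, uint32_t pattern_limit)
{
    /* A value set by the caller replaces the build-time default. */
    uint32_t limit = caller_set ? caller_limit : default_limit;

    /* The pattern item is ignored unless it is less than that value. */
    if (pattern_set && pattern_limit < limit)
        limit = pattern_limit;

    return limit;
}
```

With a default of 10 million and no caller setting, a pattern that begins with (*LIMIT_MATCH=20000000)
would still be subject to the limit of 10 million under this model, whereas (*LIMIT_MATCH=1000) would
reduce the limit to 1000.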
int pcre2_set_depth_limit(pcre2_match_context *mcontext,
uint32_t value);
This parameter limits the depth of nested backtracking in pcre2_match(). Each time a nested backtracking
point is passed, a new memory "frame" is used to remember the state of matching at that point. Thus, this
parameter indirectly limits the amount of memory that is used in a match. However, because the size of
each memory "frame" depends on the number of capturing parentheses, the actual memory limit varies from
pattern to pattern. This limit was more useful in versions before 10.30, where function recursion was
used for backtracking.
The depth limit is not relevant, and is ignored, when matching is done using JIT compiled code. However,
it is supported by pcre2_dfa_match(), which uses it to limit the depth of nested internal recursive
function calls that implement atomic groups, lookaround assertions, and pattern recursions. This limits,
indirectly, the amount of system stack that is used. It was more useful in versions before 10.32, when
stack memory was used for local workspace vectors for recursive function calls. From version 10.32, only
local variables are allocated on the stack and as each call uses only a few hundred bytes, even a small
stack can support quite a lot of recursion.
If the depth of internal recursive function calls is great enough, local workspace vectors are allocated
on the heap from version 10.32 onwards, so the depth limit also indirectly limits the amount of heap
memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string using
pcre2_dfa_match(), can use a great deal of memory. However, it is probably better to limit heap usage
directly by calling pcre2_set_heap_limit().
The default value for the depth limit can be set when PCRE2 is built; if it is not, the default is set to
the same value as the default for the match limit. If the limit is exceeded, pcre2_match() or
pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
item at the start of a pattern of the form
(*LIMIT_DEPTH=ddd)
where ddd is a decimal number. However, such a setting is ignored unless ddd is less than the limit set
by the caller of pcre2_match() or pcre2_dfa_match() or, if no such limit is set, less than the default.
CHECKING BUILD-TIME OPTIONS
int pcre2_config(uint32_t what, void *where);
The function pcre2_config() makes it possible for a PCRE2 client to find the value of certain
configuration parameters and to discover which optional features have been compiled into the PCRE2
library. The pcre2build documentation has more details about these features.
The first argument for pcre2_config() specifies which information is required. The second argument is a
pointer to memory into which the information is placed. If NULL is passed, the function returns the
amount of memory that is needed for the requested information. For calls that return numerical values,
the value is in bytes; when requesting these values, where should point to appropriately aligned memory.
For calls that return strings, the required length is given in code units, not counting the terminating
zero.
When requesting information, the returned value from pcre2_config() is non-negative on success, or the
negative error code PCRE2_ERROR_BADOPTION if the value in the first argument is not recognized. The
following information is available:
PCRE2_CONFIG_BSR
The output is a uint32_t integer whose value indicates what character sequences the \R escape sequence
matches by default. A value of PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence;
a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The default can be overridden
when a pattern is compiled.
PCRE2_CONFIG_COMPILED_WIDTHS
The output is a uint32_t integer whose lower bits indicate which code unit widths were selected when
PCRE2 was built. The 1-bit indicates 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit
support, respectively.
PCRE2_CONFIG_DEPTHLIMIT
The output is a uint32_t integer that gives the default limit for the depth of nested backtracking in
pcre2_match() or the depth of nested recursions, lookarounds, and atomic groups in pcre2_dfa_match().
Further details are given with pcre2_set_depth_limit() above.
PCRE2_CONFIG_HEAPLIMIT
The output is a uint32_t integer that gives, in kibibytes, the default limit for the amount of heap
memory used by pcre2_match() or pcre2_dfa_match(). Further details are given with pcre2_set_heap_limit()
above.
PCRE2_CONFIG_JIT
The output is a uint32_t integer that is set to one if support for just-in-time compiling is available;
otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET
The where argument should point to a buffer that is at least 48 code units long. (The exact length
required can be found by calling pcre2_config() with where set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is configured, for example
"x86 32bit (little endian + unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
returned, otherwise the number of code units used is returned. This is the length of the string, plus one
unit for the terminating zero.
PCRE2_CONFIG_LINKSIZE
The output is a uint32_t integer that contains the number of bytes used for internal linkage in compiled
regular expressions. When PCRE2 is configured, the value can be set to 2, 3, or 4, with the default being
2. This is the value that is returned by pcre2_config(). However, when the 16-bit library is compiled, a
value of 3 is rounded up to 4, and when the 32-bit library is compiled, internal linkages always use 4
bytes, so the configured value is not relevant.
The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all but the most massive
patterns, since it allows the size of the compiled pattern to be up to 65535 code units. Larger values allow
larger regular expressions to be compiled by those two libraries, but at the expense of slower matching.
PCRE2_CONFIG_MATCHLIMIT
The output is a uint32_t integer that gives the default match limit for pcre2_match(). Further details
are given with pcre2_set_match_limit() above.
PCRE2_CONFIG_NEWLINE
The output is a uint32_t integer whose value specifies the default character sequence that is recognized
as meaning "newline". The values are:
PCRE2_NEWLINE_CR       Carriage return (CR)
PCRE2_NEWLINE_LF       Linefeed (LF)
PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY      Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
PCRE2_NEWLINE_NUL      The NUL character (binary zero)
The default should normally correspond to the standard sequence for your operating system.
PCRE2_CONFIG_NEVER_BACKSLASH_C
The output is a uint32_t integer that is set to one if the use of \C was permanently disabled when PCRE2
was built; otherwise it is set to zero.
PCRE2_CONFIG_PARENSLIMIT
The output is a uint32_t integer that gives the maximum depth of nesting of parentheses (of any kind) in
a pattern. This limit is imposed to cap the amount of system stack used when a pattern is compiled. It is
specified when PCRE2 is built; the default is 250. This limit does not take into account the stack that
may already be used by the calling application. For finer control over compilation stack usage, see
pcre2_set_compile_recursion_guard().
PCRE2_CONFIG_STACKRECURSE
This parameter is obsolete and should not be used in new code. The output is a uint32_t integer that is
always set to zero.
PCRE2_CONFIG_TABLES_LENGTH
The output is a uint32_t integer that gives the length of PCRE2's character processing tables in bytes.
For details of these tables see the section on locale support below.
PCRE2_CONFIG_UNICODE_VERSION
The where argument should point to a buffer that is at least 24 code units long. (The exact length
required can be found by calling pcre2_config() with where set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not supported". Otherwise, the Unicode
version string (for example, "8.0.0") is inserted. The number of code units used is returned. This is the
length of the string plus one unit for the terminating zero.
PCRE2_CONFIG_UNICODE
The output is a uint32_t integer that is set to one if Unicode support is available; otherwise it is set
to zero. Unicode support implies UTF support.
PCRE2_CONFIG_VERSION
The where argument should point to a buffer that is at least 24 code units long. (The exact length
required can be found by calling pcre2_config() with where set to NULL.) The buffer is filled with the
PCRE2 version string, zero-terminated. The number of code units used is returned. This is the length of
the string plus one unit for the terminating zero.
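Two of the numerical items above need a little interpretation once the value has been obtained: the bit
mask reported for PCRE2_CONFIG_COMPILED_WIDTHS, and the link size reported for PCRE2_CONFIG_LINKSIZE,
which is the configured value and not necessarily the number of bytes a given library uses. The
following is a minimal sketch of both interpretations. The functions library_has_width() and
effective_link_size() are hypothetical helpers, not part of PCRE2; they take the value already written
by pcre2_config() and only model the rules stated in the text.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of PCRE2: tests the bit mask reported for
 * PCRE2_CONFIG_COMPILED_WIDTHS. Returns 1 if the library for the given code
 * unit width (8, 16 or 32) was built, 0 otherwise.
 *
 * Typical use, assuming pcre2.h is included and PCRE2 is linked:
 *   uint32_t widths;
 *   if (pcre2_config(PCRE2_CONFIG_COMPILED_WIDTHS, &widths) >= 0 &&
 *       library_has_width(widths, 16)) { ... }
 */
int library_has_width(uint32_t mask, int width)
{
    switch (width) {
    case 8:  return (mask & 1u) != 0;   /* the 1-bit: 8-bit support  */
    case 16: return (mask & 2u) != 0;   /* the 2-bit: 16-bit support */
    case 32: return (mask & 4u) != 0;   /* the 4-bit: 32-bit support */
    default: return 0;                  /* no other widths exist     */
    }
}

/* Hypothetical helper, NOT part of PCRE2: given the configured link size
 * (2, 3 or 4, as reported for PCRE2_CONFIG_LINKSIZE) and a code unit width,
 * returns the number of bytes used for internal linkage according to the
 * rules in the text, or -1 if either argument is out of range.
 */
int effective_link_size(int configured, int width)
{
    if (configured < 2 || configured > 4)
        return -1;

    switch (width) {
    case 8:  return configured;                     /* used as configured   */
    case 16: return configured == 3 ? 4 : configured; /* 3 is rounded up    */
    case 32: return 4;                              /* always 4 bytes       */
    default: return -1;
    }
}
```

For example, a mask of 5 means that the 8-bit and 32-bit libraries were built but the 16-bit library was
not, and a configured link size of 3 corresponds to 3 bytes in the 8-bit library but 4 bytes in the other
two.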
COMPILING A PATTERN
pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
pcre2_compile_context *ccontext);
void pcre2_code_free(pcre2_code *code);
pcre2_code *pcre2_code_copy(const pcre2_code *code);
pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
The pcre2_compile() function compiles a pattern into an internal form. The pattern is defined by a
pointer to a string of code units and a length (in code units). If the pattern is zero-terminated, the
length can be specified as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory
that contains the compiled pattern and related data, or NULL if an error occurred.
If the compile context argument ccontext is NULL, memory for the compiled pattern is obtained by calling
malloc(). Otherwise, it is obtained from the same memory function that was used for the compile context.
The caller must free the memory by calling pcre2_code_free() when it is no longer needed. If
pcre2_code_free() is called with a NULL argument, it returns immediately, without doing anything.
The function pcre2_code_copy() makes a copy of the compiled code in new memory, using the same memory
allocator as was used for the original. However, if the code has been processed by the JIT compiler (see
below), the JIT information cannot be copied (because it is position-dependent). The new copy can
initially be used only for non-JIT matching, though it can be passed to pcre2_jit_compile() if required. If
pcre2_code_copy() is called with a NULL argument, it returns NULL.
The pcre2_code_copy() function provides a way for individual threads in a multithreaded application to
acquire a private copy of shared compiled code. However, it does not make a copy of the character tables
used by the compiled pattern; the new pattern code points to the same tables as the original code. (See
"Locale Support" below for details of these character tables.) In many applications the same tables are
used throughout, so this behaviour is appropriate. Nevertheless, there are occasions when copies of both
a compiled pattern and the relevant tables are needed. The pcre2_code_copy_with_tables() function provides
this facility. Copies of both the code and the tables are made, with the new code pointing to the new tables.
The memory for the new tables is automatically freed when pcre2_code_free() is called for the new copy of
the compiled code. If pcre2_code_copy_with_tables() is called with a NULL argument, it returns NULL.
NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject
string are set in the match data block so that they can be referenced by the substring extraction
functions after a successful match. After running a match, you must not free a compiled pattern or a subject
string until after all operations on the match data block have taken place, unless, in the case of the
subject string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section
entitled "Option bits for pcre2_match()" below.
The options argument for pcre2_compile() contains various bit settings that affect the compilation. It
should be zero if none of them are required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can also be set and unset from
within the pattern (see the detailed description in the pcre2pattern documentation).
For those options that can be different in different parts of the pattern, the contents of the options
argument specify their settings at the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
PCRE2_NO_UTF_CHECK options can be set at the time of matching as well as at compile time.
Some additional options and less frequently required compile-time parameters (for example, the newline
setting) can be provided in a compile context (as described above).
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL immediately. Otherwise, the variables
to which these point are set to an error code and an offset (number of code units) within the pattern,
respectively, when pcre2_compile() returns NULL because a compilation error has occurred. The values are
not defined when compilation is successful and pcre2_compile() returns a non-NULL value.
There are nearly 100 positive error codes that pcre2_compile() may return if it finds an error in the
pattern. There are also some negative error codes that are used for invalid UTF strings when validity
checking is in force. These are the same as given by pcre2_match() and pcre2_dfa_match(), and are
described in the pcre2unicode documentation. There is no separate documentation for the positive error
codes, because the textual error messages that are obtained by calling the pcre2_get_error_message()
function (see "Obtaining a textual error message" below) should be self-explanatory. Macro names starting
with PCRE2_ERROR_ are defined for both positive and negative error codes in pcre2.h.
The value returned in erroroffset is an indication of where in the pattern the error occurred. It is not
necessarily the furthest point in the pattern that was read. For example, after the error "lookbehind
assertion is not fixed length", the error offset points to the start of the failing assertion. For an
invalid UTF-8 or UTF-16 string, the offset is that of the first code unit of the failing character.
Some errors are not detected until the whole pattern has been scanned; in these cases, the offset passed
back is the length of the pattern. Note that the offset is in code units, not characters, even in a UTF
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 character.
This code fragment shows a typical straightforward call to pcre2_compile( ): This code fragment shows a typical straightforward call to pcre2_compile( ):
pcre2_code *re;
PCRE2_SIZE erroffset;
int errorcode;
re = pcre2_compile(
  "^A.*Z",                /* the pattern */
  PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
  0,                      /* default options */
  &errorcode,             /* for error code */
  &erroffset,             /* for error offset */
  NULL);                  /* no compile context */
Main compile options
The following names for option bits are defined in the pcre2.h header file:
PCRE2_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it is constrained to match only at
the first matching point in the string that is being searched (the "subject string"). This effect can
also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
PCRE2_ALLOW_EMPTY_CLASS
By default, for compatibility with Perl, a closing square bracket that immediately follows an opening one
is treated as a data character for the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
class, which therefore contains no characters and so can never match.
PCRE2_ALT_BSUX
This option requests alternative handling of three escape sequences, which makes PCRE2's behaviour more
like ECMAscript (aka JavaScript). When it is set:
(1) \U matches an upper case "U" character; by default \U causes a compile time error (Perl uses \U to
upper case subsequent characters).
(2) \u matches a lower case "u" character unless it is followed by four hexadecimal digits, in which case
the hexadecimal number defines the code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).
(3) \x matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case
the hexadecimal number defines the code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so, for example, \xz matches a binary
zero character followed by z).
ECMAscript 6 added additional functionality to \u. This can be accessed using the PCRE2_EXTRA_ALT_BSUX
extra option (see "Extra compile options" below). Note that this alternative escape handling applies
only to patterns. Neither of these options affects the processing of replacement strings passed to
pcre2_substitute().
PCRE2_ALT_CIRCUMFLEX
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter matches at the start of the
subject (unless PCRE2_NOTBOL is set), and also after any internal newline. However, it does not match
after a newline at the end of the subject, for compatibility with Perl. If you want a multiline
circumflex also to match after a terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.
PCRE2_ALT_VERBNAMES
By default, for compatibility with Perl, the name in any verb sequence such as (*MARK:NAME) is any
sequence of characters that does not include a closing parenthesis. The name is not processed in any way,
and it is not possible to include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an unescaped closing
parenthesis terminates the name. A closing parenthesis can be included in a name either as \) or between
\Q and \E. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
whitespace in verb names is skipped and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_AUTO_CALLOUT
If this bit is set, pcre2_compile() automatically inserts callout items, all with number 255, before each
pattern item, except immediately before or after an explicit callout in the pattern. For discussion of
the callout facility, see the pcre2callout documentation.
PCRE2_CASELESS
If this bit is set, letters in the pattern match both upper and lower case letters in the subject. It is
equivalent to Perl's /i option, and it can be changed within a pattern by a (?i) option setting. If
either PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all characters with more than one
other case, and for all characters whose code points are greater than U+007F. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII equivalents, are case-equivalent
with U+212A (Kelvin sign) and U+017F (long S) respectively. For lower valued characters with only one
other case, a lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table
is used for all code points less than 256, and higher code points (available only in 16-bit or 32-bit
mode) are treated as not having another case.
PCRE2_DOLLAR_ENDONLY
If this bit is set, a dollar metacharacter in the pattern matches only at the end of the subject string.
Without this option, a dollar also matches immediately before a newline at the end of the string (but not
before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set. There
is no equivalent to this option in Perl, and no way to set it within a pattern.
PCRE2_DOTALL
If this bit is set, a dot metacharacter in the pattern matches any character, including one that
indicates a newline. However, it only ever matches one character, even if newlines are coded as CRLF.
Without this option, a dot does not match when the current position in the subject is at a newline. This
option is equivalent to Perl's /s option, and it can be changed within a pattern by a (?s) option
setting. A negative class such as [^a] always matches newline characters, and the \N escape sequence
always matches a non-newline character, independent of the setting of PCRE2_DOTALL.
PCRE2_DUPNAMES
If this bit is set, names used to identify capture groups need not be unique. This can be helpful for
certain types of pattern when it is known that only one instance of the named group can ever be matched.
There are more details of named capture groups below; see also the pcre2pattern documentation.
PCRE2_ENDANCHORED
If this bit is set, the end of any pattern match must be right at the end of the string being searched
(the "subject string"). If the pattern match succeeds by reaching (*ACCEPT), but does not reach the end
of the subject, the match fails at the current starting point. For unanchored patterns, a new match is
then tried at the next starting point. However, if the match succeeds by reaching the end of the pattern,
but not the end of the subject, backtracking occurs and an alternative match may be found. Consider these
two patterns:
  .(*ACCEPT)|..
  .|..
If matched against "abc" with PCRE2_ENDANCHORED set, the first matches "c" whereas the second matches
"bc". The effect of PCRE2_ENDANCHORED can also be achieved by appropriate constructs in the pattern
itself, which is the only way to do it in Perl.
For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only to the first (that is, the
longest) matched string. Other parallel matches, which are necessarily substrings of the first one, must
obviously end before the end of the subject.
PCRE2_EXTENDED
If this bit is set, most white space characters in the pattern are totally ignored except when escaped or
inside a character class. However, white space is not allowed within sequences such as (?> that introduce
various parenthesized groups, nor within numerical quantifiers such as {1,3}. Ignorable white space is
permitted between an item and a following quantifier and between a quantifier and a following + that
indicates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
a pattern by a (?x) option setting.
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as white space only those
characters with code points less than 256 that are flagged as white space in its low-character table. The
table is normally created by pcre2_maketables(), which uses the isspace() function to identify space
characters. In most ASCII environments, the relevant characters are those with code points 0x0009 (tab),
0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage return), and 0x0020
(space).
When PCRE2 is compiled with Unicode support, in addition to these characters, five more Unicode "Pattern
White Space" characters are recognized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E
(left-to-right mark), U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x option. Note that the
horizontal and vertical space characters that are matched by the \h and \v escapes in patterns are a much
bigger set.
As well as ignoring most white space, PCRE2_EXTENDED also causes characters between an unescaped #
outside a character class and the next newline, inclusive, to be ignored, which makes it possible to
include comments inside complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a newline do not count.
Which characters are interpreted as newlines can be specified by a setting in the compile context that is
passed to pcre2_compile() or by a special sequence at the start of the pattern, as described in the
section entitled "Newline conventions" in the pcre2pattern documentation. A default is defined when PCRE2
is built.
PCRE2_EXTENDED_MORE
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space and horizontal tab
characters are ignored inside a character class. Note: only these two characters are ignored, not the
full set of pattern white space characters that are ignored outside a character class.
PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option, and it can be changed within a pattern by a (?xx)
option setting.
PCRE2_FIRSTLINE
If this option is set, the start of an unanchored pattern match must be before or at the first newline in
the subject string following the start of matching, though the matched text may continue over the
newline. If startoffset is non-zero, the limiting newline is not necessarily the first newline in the
subject. For example, if the subject string is "abc\nxyz" (where \n represents a single-character
newline) a pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is greater than 3. See
also PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting facility. If PCRE2_FIRSTLINE is set
with an offset limit, a match must occur in the first line and also within the offset limit. In other
words, whichever limit comes first is used.
PCRE2_LITERAL
If this option is set, all meta-characters in the pattern are disabled, and it is treated as a literal
string. Matching literal strings with a regular expression engine is not the most efficient way of doing
it. If you are doing a lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with PCRE2_LITERAL are:
PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
PCRE2_MATCH_INVALID_UTF, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also
supported. Any other options cause an error.
PCRE2_MATCH_INVALID_UTF
This option forces PCRE2_UTF (see below) and also enables support for matching by pcre2_match() in
subject strings that contain invalid UTF sequences. This facility is not supported for DFA matching. For
details, see the pcre2unicode documentation.
PCRE2_MATCH_UNSET_BACKREF
If this option is set, a backreference to an unset capture group matches an empty string (by default this
causes the current matching alternative to fail). A pattern such as (\1)(a) succeeds when this option is
set (assuming it can find an "a" in the subject), whereas it fails by default, for Perl compatibility.
Setting this option makes PCRE2 behave more like ECMAscript (aka JavaScript).
PCRE2_MULTILINE
By default, for the purposes of matching "start of line" and "end of line", PCRE2 treats the subject
string as consisting of a single line of characters, even if it actually contains newlines. The "start of
line" metacharacter (^) matches only at the start of the string, and the "end of line" metacharacter ($)
matches only at the end of the string, or before a terminating newline (except when PCRE2_DOLLAR_ENDONLY
is set). Note, however, that unless PCRE2_DOTALL is set, the "any character" metacharacter (.) does not
match at a newline. This behaviour (for ^, $, and dot) is the same as Perl.
When PCRE2_MULTILINE is set, the "start of line" and "end of line" constructs match immediately
following or immediately before internal newlines in the subject string, respectively, as well as at the
very start and end. This is equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. Note that the "start of line" metacharacter does not match after a newline at the
end of the subject, for compatibility with Perl. However, you can change this by setting the
PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a subject string, or no occurrences of ^ or $ in
a pattern, setting PCRE2_MULTILINE has no effect.
PCRE2_NEVER_BACKSLASH_C
This option locks out the use of \C in the pattern that is being compiled. This escape can cause
unpredictable behaviour in UTF-8 or UTF-16 modes, because it may leave the current matching point in the
middle of a multi-code-unit character. This option may be useful in applications that process patterns
from external sources. Note that there is also a build-time option that permanently locks out the use of
\C.
PCRE2_NEVER_UCP
This option locks out the use of Unicode properties for handling \B, \b, \D, \d, \S, \s, \W, \w, and some
of the POSIX character classes, as described for the PCRE2_UCP option below. In particular, it prevents
the creator of the pattern from enabling this facility by starting the pattern with (*UCP). This option
may be useful in applications that process patterns from external sources. The option combination
PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
PCRE2_NEVER_UTF
This option locks out interpretation of the pattern as UTF-8, UTF-16, or UTF-32, depending on which
library is in use. In particular, it prevents the creator of the pattern from switching to UTF
interpretation by starting the pattern with (*UTF). This option may be useful in applications that
process patterns from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
PCRE2_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing parentheses in the pattern. Any opening
parenthesis that is not followed by ? behaves as if it were followed by ?: but named parentheses can
still be used for capturing (and they acquire numbers in the usual way). This is the same as Perl's /n
option. Note that, when this option is set, references to capture groups (backreferences or
recursion/subroutine calls) may only refer to named groups, though the reference can be by name or by
number.
PCRE2_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification", which is an optimization that, for example,
turns a+b into a++b in order to avoid backtracks into a+ that can never be successful. However, if
callouts are in use, auto-possessification means that some callouts are never taken. You can set this
option if you want the matching functions to do a full unoptimized search and run all the callouts, but
it is mainly provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when .* is the first significant item
in a top-level branch of a pattern, and all the other branches also start with .* or with \A or \G or ^.
The optimization is automatically disabled for .* if it is inside an atomic group or a capture group that
is the subject of a backreference, or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
is not disabled, such a pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match must start either at the
start of the subject or following a newline is remembered. Like other optimizations, this can cause
callouts to be skipped.
PCRE2_NO_START_OPTIMIZE
This is an option whose main effect is at matching time. It does not change what pcre2_compile()
generates, but it does affect the output of the JIT compiler.
There are a number of optimizations that may occur at the start of a match, in order to speed up the
process. For example, if it is known that an unanchored match must start with a specific code unit value,
the matching code searches the subject for that value, and fails immediately if it cannot find it,
without actually running the main matching function. This means that a special item such as (*COMMIT) at
the start of a pattern is not considered until after a suitable starting point for the match has been
found. Also, when callouts or (*MARK) items are in use, these "start-up" optimizations can cause them to
be skipped if the pattern is never actually used. The start-up optimizations are in effect a pre-scan of
the subject that takes place before the pattern is run.
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, possibly causing performance to
suffer, but ensuring that in cases where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting position in the subject string.
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation. Consider the pattern
  (*COMMIT)ABC
When this is compiled, PCRE2 records the fact that a match must start with the character "A". Suppose the
subject string is "DEFABC". The start-up optimization scans along the subject, finds "A" and runs the
first match attempt from there. The (*COMMIT) item means that the pattern must match the current starting
position, which in this case, it does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
set, the initial scan along the subject string does not happen. The first match attempt is run starting
from "D" and when this fails, (*COMMIT) prevents any further matches being tried, so the overall result
is "no match".
Another start-up optimization makes use of a minimum length for a matching subject, which is recorded
when possible. Consider the pattern
  (*MARK:1)B(*MARK:2)(X|Y)
The minimum length for a match is two characters. If the subject is "XXBB", the "starting character"
optimization skips "XX", then tries to match "BB", which is long enough. In the process, (*MARK:2) is
encountered and remembered. When the match attempt fails, the next "B" is found, but there is only one
character left, so there are no more attempts, and "no match" is returned with the "last mark seen" set
to "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at every possible starting
position, including at the end of the subject, where (*MARK:1) is encountered, but there is no "B", so
the "last mark seen" that is returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary information that is returned.
PCRE2_NO_UTF_CHECK
When PCRE2_UTF is set, the validity of the pattern as a UTF string is automatically checked. There are
discussions about the validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
document. If an invalid UTF sequence is found, pcre2_compile() returns a negative error code.
If you know that your pattern is a valid UTF string, and you want to skip this check for performance
reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an invalid UTF
string as a pattern is undefined. It may cause your program to crash or loop.
Note that this option can also be passed to pcre2_match() and pcre2_dfa_match(), to suppress UTF validity
checking of the subject string.
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is given if an
escape sequence for an invalid Unicode code point is encountered in the pattern. In particular, the
so-called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you want to allow escape sequences
such as \x{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
section entitled "Extra compile options" below. However, this is possible only in UTF-8 and UTF-32 modes,
because these values are not representable in UTF-16.
PCRE2_UCP
This option has two effects. Firstly, it changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, \w,
and some of the POSIX character classes. By default, only ASCII characters are recognized, but if
PCRE2_UCP is set, Unicode properties are used instead to classify characters. More details are given in
the section on generic character types in the pcre2pattern page. If you set PCRE2_UCP, matching one of
the items it affects takes much longer.
The second effect of PCRE2_UCP is to force the use of Unicode properties for upper/lower casing
operations on characters with code points greater than 127, even when PCRE2_UTF is not set. This makes it
possible, for example, to process strings in the 16-bit UCS-2 code. This option is available only if
PCRE2 has been compiled with Unicode support (which is the default).
PCRE2_UNGREEDY
This option inverts the "greediness" of the quantifiers so that they are not greedy by default, but
become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) option
setting within the pattern.
PCRE2_USE_OFFSET_LIMIT
This option must be set for pcre2_compile() if pcre2_set_offset_limit() is going to be used to set a
non-default offset limit in a match context for matches that use this pattern. An error is generated if
an offset limit is set without this option. For more details, see the description of
pcre2_set_offset_limit() in the section that describes match contexts. See also the PCRE2_FIRSTLINE
option above.
PCRE2_UTF
This option causes PCRE2 to regard both the pattern and the subject strings that are subsequently
processed as strings of UTF characters instead of single-code-unit strings. It is available when PCRE2 is
built to include Unicode support (which is the default). If Unicode support is not available, the use of
this option provokes an error. Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in the
pcre2unicode page. In particular, note that it changes the way PCRE2_CASELESS handles characters with
code points greater than 127.
Extra compile options
The option bits that can be set in a compile context by calling the pcre2_set_compile_extra_options()
function are as follows:
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is forbidden in UTF-16 mode, and
ignored in non-UTF modes. Unicode "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
in UTF-16 to encode code points with values in the range 0x10000 to 0x10ffff. The surrogates cannot
therefore be represented in UTF-16. They can be represented in UTF-8 and UTF-32, but are defined as
invalid code points, and cause errors if encountered in a UTF-8 or UTF-32 string that is being checked
for validity by PCRE2.
These values also cause errors if encountered in escape sequences such as \x{d912} within a pattern.
However, it seems that some applications, when using PCRE2 to check for unwanted characters in UTF-8
strings, explicitly test for the surrogates using escape sequences. The PCRE2_NO_UTF_CHECK option does
not disable the error that occurs, because it applies only to the testing of input strings for UTF
validity.
If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code point values in UTF-8 and
UTF-32 patterns no longer provoke errors and are incorporated in the compiled pattern. However, they can
only match subject characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
PCRE2_EXTRA_ALT_BSUX
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in the way that ECMAscript (aka
JavaScript) does. Additional functionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has
the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal character code,
where hhh.. is any number of hexadecimal digits.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
This is a dangerous option. Use with care. By default, an unrecognized escape such as \j or a malformed
one such as \x{2z} causes a compile-time error when detected by pcre2_compile(). Perl is somewhat
inconsistent in handling such items: for example, \j is treated as a literal "j", and non-hexadecimal
digits in \x{} are just ignored, though warnings are given in both cases if Perl's warning switch is
enabled. However, a malformed octal number after \o{ always causes an error in Perl.
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to pcre2_compile(), all unrecognized or
malformed escape sequences are treated as single-character escapes. For example, \j is a literal "j" and
\x{2z} is treated as the literal string "x{2z}". Setting this option means that typos in patterns may go
undetected and have unexpected results. Also note that a sequence such as [\N{] is interpreted as a
malformed attempt at [\N{...}] and so is treated as [N{] whereas [\N] gives an error because an
unqualified \N is a valid escape sequence but is not supported in a character class. To reiterate: this
is a dangerous option. Use with great care.
PCRE2_EXTRA_ESCAPED_CR_IS_LF
There are some legacy applications where the escape sequence \r in a pattern is expected to match a
newline. If this option is set, \r in a pattern is converted to \n so that it matches a LF (linefeed)
instead of a CR (carriage return) character. The option does not affect a literal CR in the pattern, nor
does it affect CR specified as an explicit code point such as \x{0D}.
PCRE2_EXTRA_MATCH_LINE
This option is provided for use by the -x option of pcre2grep. It causes the pattern only to match
complete lines. This is achieved by automatically inserting the code for "^(?:" at the start of the
compiled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched line may be in the
middle of the subject string. This option can be used with PCRE2_LITERAL.
PCRE2_EXTRA_MATCH_WORD
This option is provided for use by the -w option of pcre2grep. It causes the pattern only to match
strings that have a word boundary at the start and the end. This is achieved by automatically inserting
the code for "\b(?:" at the start of the compiled pattern and ")\b" at the end. The option may be used
with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is also set.
JUST-IN-TIME (JIT) COMPILATION
int pcre2_jit_compile(pcre2_code *code, uint32_t options);
int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext);
skipping to change at line 1430 skipping to change at line 1458
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
  PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
  pcre2_jit_callback callback_function, void *callback_data);
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
These functions provide support for JIT compilation, which, if the just-in-time compiler is available,
further processes a compiled pattern into machine code that executes much faster than the pcre2_match()
interpretive matching function. Full details are given in the pcre2jit documentation.
JIT compilation is a heavyweight optimization. It can take some time for patterns to be analyzed, and for
one-off matches and simple patterns the benefit of faster execution might be offset by a much slower
compilation time. Most (but not all) patterns can be optimized by the JIT compiler.
LOCALE SUPPORT
const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
void pcre2_maketables_free(pcre2_general_context *gcontext,
  const uint8_t *tables);
PCRE2 handles caseless matching, and determines whether characters are letters, digits, or whatever, by
reference to a set of tables, indexed by character code point. However, this applies only to characters
whose code points are less than 256. By default, higher-valued code points never match escapes such as \w
or \d.
When PCRE2 is built with Unicode support (the default), the Unicode properties of all characters can be
tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled;
this causes \w and friends to use Unicode property support instead of the built-in tables. PCRE2_UCP
also causes upper/lower casing operations on characters with code points greater than 127 to use Unicode
properties. These effects apply even when PCRE2_UTF is not set.
The use of locales with Unicode is discouraged. If you are handling characters with code points greater
than 127, you should either use Unicode support, or use locales, but not try to mix the two.
PCRE2 contains a built-in set of character tables that are used by default. These are sufficient for
many applications. Normally, the internal tables recognize only ASCII characters. However, when PCRE2 is
built, it is possible to cause the internal tables to be rebuilt in the default "C" locale of the local
system, which may cause them to be different.
The built-in tables can be overridden by tables supplied by the application that calls PCRE2. These may
be created in a different locale from the default. As more and more applications change to using
Unicode, the need for this locale support is expected to die away.
External tables are built by calling the pcre2_maketables() function, in the relevant locale. The only
argument to this function is a general context, which can be used to pass a custom memory allocator. If
the argument is NULL, the system malloc() is used. The result can be passed to pcre2_compile() as often
as necessary, by creating a compile context and calling pcre2_set_character_tables() to set the tables
pointer therein.
For example, to build and use tables that are appropriate for the French locale (where accented
characters with values greater than 127 are treated as letters), the following code could be used:
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre2_maketables(NULL);
  ccontext = pcre2_compile_context_create(NULL);
  pcre2_set_character_tables(ccontext, tables);
  re = pcre2_compile(..., ccontext);
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you are using Windows, the name
for the French locale is "french".
The pointer that is passed (via the compile context) to pcre2_compile() is saved with the compiled
pattern, and the same tables are used by the matching functions. Thus, for any single pattern,
compilation and matching both happen in the same locale, but different patterns can be processed in
different locales.
It is the caller's responsibility to ensure that the memory containing the tables remains available while
they are still in use. When they are no longer needed, you can discard them using
pcre2_maketables_free(), which should be passed as its first parameter the same general context that was
used to create the tables.
Saving locale tables
The tables described above are just a sequence of binary bytes, which makes them independent of hardware characteristics such as endianness or whether the processor is 32-bit or 64-bit. A copy of the result of pcre2_maketables() can therefore be saved in a file or elsewhere and re-used later, even in a different program or on another computer. The size of the tables (number of bytes) must be obtained by calling pcre2_config() with the PCRE2_CONFIG_TABLES_LENGTH option because pcre2_maketables() does not return this value. Note that the pcre2_dftables program, which is part of the PCRE2 build system, can be used stand-alone to create a file that contains a set of binary tables. See the pcre2build documentation for details.
INFORMATION ABOUT A COMPILED PATTERN
int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
The pcre2_pattern_info() function returns general information about a compiled pattern. For information about callouts, see the next section. The first argument for pcre2_pattern_info() is a pointer to the compiled pattern. The second argument specifies which piece of information is required, and the third argument is a pointer to a variable to receive the data. If the third argument is NULL, the first argument is ignored, and the function returns the size in bytes of the variable that is required for the information requested. Otherwise, the yield of the function is zero for success, or one of the following negative numbers:
  PCRE2_ERROR_NULL           the argument code was NULL
  PCRE2_ERROR_BADMAGIC       the "magic number" was not found
  PCRE2_ERROR_BADOPTION      the value of what was invalid
  PCRE2_ERROR_UNSET          the requested field is not set
The "magic number" is placed at the start of each compiled pattern as a simple check against passing an arbitrary memory pointer. Here is a typical call of pcre2_pattern_info(), to obtain the length of the compiled pattern:
  int rc;
  size_t length;
  rc = pcre2_pattern_info(
    re,               /* result of pcre2_compile() */
    PCRE2_INFO_SIZE,  /* what is required */
    &length);         /* where to put the data */
The possible values for the second argument are defined in pcre2.h, and are as follows:
  PCRE2_INFO_ALLOPTIONS
  PCRE2_INFO_ARGOPTIONS
  PCRE2_INFO_EXTRAOPTIONS
Return copies of the pattern's options. The third argument should point to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOPTIONS returns the compile options as modified by any top-level (*XXX) option settings such as (*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the compile context by calling the pcre2_set_compile_extra_options() function.
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can change within a pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the pattern. (This was different in some earlier releases.)
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if the first significant item in every top-level branch is one of the following:
  ^     unless PCRE2_MULTILINE is set
  \A    always
  \G    always
  .*    sometimes - see below
When .* is the first significant item, anchoring is possible only when all the following are true:
  .* is not in an atomic group
  .* is not in a capture group that is the subject of a backreference
  PCRE2_DOTALL is in force for .*
  Neither (*PRUNE) nor (*SKIP) appears in the pattern
  PCRE2_NO_DOTSTAR_ANCHOR is not set
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the options returned for PCRE2_INFO_ALLOPTIONS.
  PCRE2_INFO_BACKREFMAX
Return the number of the highest backreference in the pattern. The third argument should point to a uint32_t variable. Named capture groups acquire numbers as well as names, and these count towards the highest backreference. Backreferences such as \4 or \g{12} match the captured characters of the given group, but in addition, the check that a capture group is set in a conditional group such as (?(3)a|b) is also a backreference. Zero is returned if there are no backreferences.
  PCRE2_INFO_BSR
The output is a uint32_t integer whose value indicates what character sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
  PCRE2_INFO_CAPTURECOUNT
Return the highest capture group number in the pattern. In patterns where (?| is not used, this is also the total number of capture groups. The third argument should point to a uint32_t variable.
  PCRE2_INFO_DEPTHLIMIT
If the pattern set a backtracking depth limit by including an item of the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument should point to a uint32_t integer. If no such value has been set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET. Note that this limit will only be used during matching if it is less than the limit set or defaulted by the caller of the match function.
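The rule in the last sentence (a limit set in the pattern only takes effect when it is lower than the caller's limit) can be written out as a small function. This is a sketch that restates the rule, not library code; effective_limit() and its parameters are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper restating the rule above.
   has_pattern_limit is 0 when pcre2_pattern_info() returned
   PCRE2_ERROR_UNSET, that is, the pattern set no limit of its own. */
uint32_t effective_limit(uint32_t caller_limit, int has_pattern_limit,
                         uint32_t pattern_limit)
{
    /* A pattern can lower the caller's limit but never raise it. */
    if (has_pattern_limit && pattern_limit < caller_limit)
        return pattern_limit;
    return caller_limit;
}
```

The same comparison applies to the heap and match limits described later in this section.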
  PCRE2_INFO_FIRSTBITMAP
In the absence of a single first code unit for a non-anchored pattern, pcre2_compile() may construct a 256-bit table that defines a fixed set of values for the first code unit in any match. For example, a pattern that starts with [abc] results in a table with three bits set. When code unit values greater than 255 are supported, the flag bit for 255 means "any code unit of value 255 or above". If such a table was constructed, a pointer to it is returned. Otherwise NULL is returned. The third argument should point to a const uint8_t * variable.
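A caller that has obtained such a table might test a code unit against it as sketched below. The bit layout used here (bit c & 7 of byte c >> 3) is an assumption taken from how the library's own source sets these bits; it is not stated in this text. The function name can_start_match() is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if `code_unit` may be the first code unit
   of a match according to a 256-bit (32-byte) first-code-unit table.
   Assumed layout: bit (c & 7) of byte (c >> 3) stands for code unit c. */
int can_start_match(const uint8_t *bitmap, uint32_t code_unit)
{
    if (bitmap == NULL)
        return 1;               /* no table: nothing can be ruled out */

    /* Values of 255 or above all share the flag bit for 255. */
    uint32_t c = code_unit > 255 ? 255 : code_unit;
    return (bitmap[c >> 3] >> (c & 7)) & 1;
}
```

With a table built for a pattern that starts with [abc], the bits for "a", "b" and "c" are set and every other code unit is rejected.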
  PCRE2_INFO_FIRSTCODETYPE
Return information about the first code unit of any matched string, for a non-anchored pattern. The third argument should point to a uint32_t variable. If there is a fixed first value, for example, the letter "c" from a pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is known that a match can occur only at the start of the subject or following a newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0 is returned.
  PCRE2_INFO_FIRSTCODEUNIT
Return the value of the first code unit of any matched string for a pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third argument should point to a uint32_t variable. In the 8-bit library, the value is always less than 256. In the 16-bit library the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
  PCRE2_INFO_FRAMESIZE
Return the size (in bytes) of the data frames that are used to remember backtracking positions when the pattern is processed by pcre2_match() without the use of JIT. The third argument should point to a size_t variable. The frame size depends on the number of capturing parentheses in the pattern. Each additional capture group adds two PCRE2_SIZE variables.
  PCRE2_INFO_HASBACKSLASHC
Return 1 if the pattern contains any instances of \C, otherwise 0. The third argument should point to a uint32_t variable.
  PCRE2_INFO_HASCRORLF
Return 1 if the pattern contains any explicit matches for CR or LF characters, otherwise 0. The third argument should point to a uint32_t variable. An explicit match is either a literal CR or LF character, or \r or \n or one of the equivalent hexadecimal or octal escape sequences.
  PCRE2_INFO_HEAPLIMIT
If the pattern set a heap memory limit by including an item of the form (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument should point to a uint32_t integer. If no such value has been set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET. Note that this limit will only be used during matching if it is less than the limit set or defaulted by the caller of the match function.
  PCRE2_INFO_JCHANGED
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise 0. The third argument should point to a uint32_t variable. (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
  PCRE2_INFO_JITSIZE
If the compiled pattern was successfully processed by pcre2_jit_compile(), return the size of the JIT compiled code, otherwise return zero. The third argument should point to a size_t variable.
  PCRE2_INFO_LASTCODETYPE
Returns 1 if there is a rightmost literal code unit that must exist in any matched string, other than at its start. The third argument should point to a uint32_t variable. If there is no such value, 0 is returned. When 1 is returned, the code unit value itself can be retrieved using PCRE2_INFO_LASTCODEUNIT.
For anchored patterns, a last literal value is recorded only if it follows something of variable length. For example, for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
  PCRE2_INFO_LASTCODEUNIT
Return the value of the rightmost literal code unit that must exist in any matched string, other than at its start, for a pattern where PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument should point to a uint32_t variable.
  PCRE2_INFO_MATCHEMPTY
Return 1 if the pattern might match an empty string, otherwise 0. The third argument should point to a uint32_t variable. When a pattern contains recursive subroutine calls it is not always possible to determine whether or not it can match an empty string. PCRE2 takes a cautious approach and returns 1 in such cases.
  PCRE2_INFO_MATCHLIMIT
If the pattern set a match limit by including an item of the form (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument should point to a uint32_t integer. If no such value has been set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET. Note that this limit will only be used during matching if it is less than the limit set or defaulted by the caller of the match function.
  PCRE2_INFO_MAXLOOKBEHIND
A lookbehind assertion moves back a certain number of characters (not code units) when it starts to process each of its branches. This request returns the largest of these backward moves. The third argument should point to a uint32_t integer. The simple assertions \b and \B require a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything longer. \A also registers a one-character lookbehind, though it does not actually inspect the previous character.
Note that this information is useful for multi-segment matching only if the pattern contains no nested lookbehinds. For example, the pattern (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the first lookbehind moves back by two characters, matches one character, then the nested lookbehind also moves back by two characters. This puts the matching point three characters earlier than it was at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the pcre2partial documentation for a discussion of multi-segment matching.
  PCRE2_INFO_MINLENGTH
If a minimum length for matching subject strings was computed, its value is returned. Otherwise the returned value is 0. This value is not computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in UTF mode may be different from the number of code units. The third argument should point to a uint32_t variable. The value is a lower bound to the length of any matching string. There may not be any strings of that length that do actually match, but every string that does match is at least that long.
  PCRE2_INFO_NAMECOUNT
  PCRE2_INFO_NAMEENTRYSIZE
  PCRE2_INFO_NAMETABLE
PCRE2 supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers. Several convenience functions such as pcre2_substring_get_byname() are provided for extracting captured substrings by name. It is also possible to extract the data directly, by first converting the name to a number in order to access the correct pointers in the output vector (described with pcre2_match() below). To do the conversion, you need to use the name-to-number map, which is described by these three values.
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each entry in code units; both of these return a uint32_t value. The entry size depends on the length of the longest name.
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first two bytes of each entry are the number of the capturing parenthesis, most significant byte first. In the 16-bit library, the pointer points to 16-bit code units, the first of which contains the parenthesis number. In the 32-bit library, the pointer points to 32-bit code units, the first of which contains the parenthesis number. The rest of the entry is the corresponding name, zero terminated.
The names are in alphabetical order. If (?| is used to create multiple capture groups with the same number, as described in the section on duplicate group numbers in the pcre2pattern page, the groups may be given the same name, but there is only one entry in the table. Different names for groups of the same number are not permitted.
Duplicate names for capture groups with different numbers are permitted, but only if PCRE2_DUPNAMES is set. They appear in the table in the order in which they were found in the pattern. In the absence of (?| this is the order of increasing number; when (?| is used this is not necessarily the case because later capture groups may have lower numbers.
As a simple example of the name/number table, consider the following pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white space - including newlines - is ignored):
  (?<date> (?<year>(\d\d)?\d\d) -
  (?<month>\d\d) - (?<day>\d\d) )
There are four named capture groups, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shown in hexadecimal, and undefined bytes shown as ??:
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
When writing code to extract data from named capture groups using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
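The 8-bit entry layout can be walked by hand, as in the sketch below. It needs no PCRE2 calls: it works on a byte array that mirrors the example table, with the undefined ?? bytes written as zero. The names group_number() and example_table are assumptions made for this example. In real code the table pointer, entry count and entry size would come from PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: look up `name` in an 8-bit name table.
   Returns the capture group number, or 0 if the name is not present
   (group numbers start at 1, so 0 is free to mean "not found"). */
uint32_t group_number(const uint8_t *table, uint32_t count,
                      uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;

        /* First two bytes: group number, most significant byte first. */
        uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];

        /* Rest of the entry: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return 0;
}

/* The four 8-byte entries from the example table (?? bytes set to 0). */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};
```

Calling group_number(example_table, 4, 8, "month") yields 4. When duplicate names are in use, this linear scan returns only the first matching entry.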
  PCRE2_INFO_NEWLINE
The output is one of the following uint32_t values:
  PCRE2_NEWLINE_CR        Carriage return (CR)
  PCRE2_NEWLINE_LF        Linefeed (LF)
  PCRE2_NEWLINE_CRLF      Carriage return, linefeed (CRLF)
  PCRE2_NEWLINE_ANY       Any Unicode line ending
  PCRE2_NEWLINE_ANYCRLF   Any of CR, LF, or CRLF
  PCRE2_NEWLINE_NUL       The NUL character (binary zero)
This identifies the character sequence that will be recognized as meaning "newline" while matching.
  PCRE2_INFO_SIZE
Return the size of the compiled pattern in bytes (for all three libraries). The third argument should point to a size_t variable. This value includes the size of the general data block that precedes the code units of the compiled pattern itself. The value that is used when pcre2_compile() is getting memory in which to place the compiled pattern may be slightly larger than the value returned by this option, because there are cases where the code that calculates the size has to over-estimate. Processing a pattern with the JIT compiler does not alter the value returned by this option.
INFORMATION ABOUT A PATTERN'S CALLOUTS
int pcre2_callout_enumerate(const pcre2_code *code,
  int (*callback)(pcre2_callout_enumerate_block *, void *),
  void *user_data);
A script language that supports the use of string arguments in callouts might like to scan all the
callouts in a pattern before running the match. This can be done by calling pcre2_callout_enumerate(). The
first argument is a pointer to a compiled pattern, the second points to a callback function, and the
third is arbitrary user data. The callback function is called for every callout in the pattern in the
order in which they appear. Its first argument is a pointer to a callout enumeration block, and its
second argument is the user_data value that was passed to pcre2_callout_enumerate(). The contents of the
callout enumeration block are described in the pcre2callout documentation, which also gives further
details about callouts.
SERIALIZATION AND PRECOMPILING
It is possible to save compiled patterns on disc or elsewhere, and reload them later, subject to a number
of restrictions. The host on which the patterns are reloaded must be running the same version of PCRE2,
with the same code unit width, and must also have the same endianness, pointer width, and PCRE2_SIZE
type. Before compiled patterns can be saved, they must be converted to a "serialized" form, which in the
case of PCRE2 is really just a bytecode dump. The functions whose names begin with pcre2_serialize_ are
used for converting to and from the serialized form. They are described in the pcre2serialize
documentation. Note that PCRE2 serialization does not convert compiled patterns to an abstract format like
Java or .NET serialization.
THE MATCH DATA BLOCK
pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
  pcre2_general_context *gcontext);
pcre2_match_data *pcre2_match_data_create_from_pattern(
  const pcre2_code *code, pcre2_general_context *gcontext);
void pcre2_match_data_free(pcre2_match_data *match_data);
Information about a successful or unsuccessful match is placed in a match data block, which is an opaque
structure that is accessed by function calls. In particular, the match data block contains a vector of
offsets into the subject string that define the matched part of the subject and any substrings that were
captured. This is known as the ovector.
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match() you must create a match data block
by calling one of the creation functions above. For pcre2_match_data_create(), the first argument is the
number of pairs of offsets in the ovector. One pair of offsets is required to identify the string that
matched the whole pattern, with an additional pair for each captured substring. For example, a value of 4
creates enough space to record the matched portion of the subject plus three captured substrings. A
minimum of one pair is imposed by pcre2_match_data_create(), so it is always possible to return the
overall matched string.
The second argument of pcre2_match_data_create() is a pointer to a general context, which can specify
custom memory management for obtaining the memory for the match data block. If you are not using custom
memory management, pass NULL, which causes malloc() to be used.
For pcre2_match_data_create_from_pattern(), the first argument is a pointer to a compiled pattern. The
ovector is created to be exactly the right size to hold all the substrings a pattern might capture. The
second argument is again a pointer to a general context, but in this case if NULL is passed, the memory
is obtained using the same allocator that was used for the compiled pattern (custom or default).
A match data block can be used many times, with the same or different compiled patterns. You can extract
information from a match data block after a match operation has finished, using functions that are
described in the sections on matched strings and other match data below.
When a call of pcre2_match() fails, valid data is available in the match data block only when the error is
PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF string. Exactly
what is available depends on the error, and is detailed below.
When one of the matching functions is called, pointers to the compiled pattern and the subject string are
set in the match data block so that they can be referenced by the extraction functions after a successful
match. After running a match, you must not free a compiled pattern or a subject string until after all
operations on the match data block (for that match) have taken place, unless, in the case of the subject
string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
"Option bits for pcre2_match()" below.
When a match data block itself is no longer needed, it should be freed by calling
pcre2_match_data_free(). If this function is called with a NULL argument, it returns immediately, without
doing anything.
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext);
The function pcre2_match() is called to match a subject string against a compiled pattern, which is
passed in the code argument. You can call pcre2_match() with the same code argument as many times as you
like, in order to find multiple matches in the subject string or to match different subject strings with
the same pattern.
This function is the main matching facility of the library, and it operates in a Perl-like manner. For
specialist use there is also an alternative matching function, which is described below in the section
about the pcre2_dfa_match() function.
Here is an example of a simple call to pcre2_match():
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_match(
    re,                         /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string",  /* the subject string */
    11,                         /* the length of the subject string */
    0,                          /* start at offset 0 in the subject */
    0,                          /* default options */
    md,                         /* the match data block */
    NULL);                      /* a match context; NULL means use defaults */
If the subject string is zero-terminated, the length can be given as PCRE2_ZERO_TERMINATED. A match
context must be provided if certain less common matching parameters are to be changed. For details, see
the section on the match context above.
The string to be matched by pcre2_match()
The subject string is passed to pcre2_match() as a pointer in subject, a length in length, and a starting
offset in startoffset. The length and offset are in code units, not characters. That is, they are in
bytes for the 8-bit library, 16-bit code units for the 16-bit library, and 32-bit code units for the
32-bit library, whether or not UTF processing is enabled.
If startoffset is greater than the length of the subject, pcre2_match() returns PCRE2_ERROR_BADOFFSET.
When the starting offset is zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset must point to the start of a
character, or to the end of the subject (in UTF-32 mode, one code unit equals one character, so all
offsets are valid). Like the pattern string, the subject may contain binary zeros.
A non-zero starting offset is useful when searching for another match in the same subject by calling
pcre2_match() again after a previous success. Setting startoffset differs from passing over a shortened
string and setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of lookbehind. For
example, consider the pattern
  \Biss\B
which finds occurrences of "iss" in the middle of words. (\B matches only if the current position in the
subject is not a word boundary.) When applied to the string "Mississippi" the first call to pcre2_match()
finds the first occurrence. If pcre2_match() is called again with just the remainder of the subject,
namely "issippi", it does not match, because \B is always false at the start of the subject, which is
deemed to be a word boundary. However, if pcre2_match() is passed the entire string again, but with
startoffset set to 4, it finds the second occurrence of "iss" because it is able to look behind the
starting point to discover that it is preceded by a letter.
Finding all the matches in a subject is tricky when the pattern can match an empty string. It is possible
to emulate Perl's /g behaviour by first trying the match again at the same offset, with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
and trying an ordinary match again. There is some code that demonstrates how to do this in the pcre2demo
sample program. In the most general case, you have to check to see if the newline convention recognizes
CRLF as a newline, and if so, and the current character is CR followed by LF, advance the starting offset
by two characters instead of one.
If a non-zero starting offset is passed when the pattern is anchored, a single attempt to match at the
given offset is made. This can only succeed if the pattern does not require the match to be at the start
of the subject. In other words, the anchoring must be the result of setting the PCRE2_ANCHORED option or
the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \A.
Option bits for pcre2_match()
The unused bits of the options argument for pcre2_match() must be zero. The only bits that may be set are
PCRE2_ANCHORED, PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
PCRE2_PARTIAL_SOFT. Their action is described below.
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by the just-in-time (JIT)
compiler. If it is set, JIT matching is disabled and the interpretive code in pcre2_match() is run. Apart
from PCRE2_NO_JIT (obviously), the remaining options are supported for JIT matching.
PCRE2_ANCHORED
The PCRE2_ANCHORED option limits pcre2_match() to matching at the first matching position. If a pattern
was compiled with PCRE2_ANCHORED, or turned out to be anchored by virtue of its contents, it cannot be
made unanchored at matching time. Note that setting the option at match time disables JIT matching.
PCRE2_COPY_MATCHED_SUBJECT
By default, a pointer to the subject is remembered in the match data block so that, after a successful
match, it can be referenced by the substring extraction functions. This means that the subject's memory
must not be freed until all such operations are complete. For some applications where the lifetime of the
subject string is not guaranteed, it may be necessary to make a copy of the subject string, but it is
wasteful to do this unless the match is successful. After a successful match, if
PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and the new pointer is remembered in the match
data block instead of the original subject pointer. The memory allocator that was used for the match data
block itself is used. The copy is automatically freed when pcre2_match_data_free() is called to free the
match data block. It is also automatically freed if the match data block is re-used for another match
operation.
PCRE2_ENDANCHORED
If the PCRE2_ENDANCHORED option is set, any string that pcre2_match() matches must be right at the end of
the subject string. Note that setting the option at match time disables JIT matching.
PCRE2_NOTBOL
This option specifies that the first character of the subject string is not the beginning of a line, so
the circumflex metacharacter should not match before it. Setting this without having set PCRE2_MULTILINE
at compile time causes circumflex never to match. This option affects only the behaviour of the
circumflex metacharacter. It does not affect \A.
PCRE2_NOTEOL
This option specifies that the end of the subject string is not the end of a line, so the dollar
metacharacter should not match it nor (except in multiline mode) a newline immediately before it. Setting
this without having set PCRE2_MULTILINE at compile time causes dollar never to match. This option affects
only the behaviour of the dollar metacharacter. It does not affect \Z or \z.
PCRE2_NOTEMPTY
An empty string is not considered to be a valid match if this option is set. If there are alternatives in
the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For
example, if the pattern
  a?b?
is applied to a string not beginning with "a" or "b", it matches an empty string at the start of the
subject. With PCRE2_NOTEMPTY set, this match is not valid, so pcre2_match() searches further into the
string for occurrences of "a" or "b".
PCRE2_NOTEMPTY_ATSTART
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match only at the first matching
position, that is, at the start of the subject plus the starting offset. An empty string match later in
the subject is permitted. If the pattern is anchored, such a match can occur only if the pattern
contains \K.
PCRE2_NO_JIT
By default, if a pattern has been successfully processed by pcre2_jit_compile(), JIT is automatically
used when pcre2_match() is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
of JIT; it forces matching to be done by the interpreter.
PCRE2_NO_UTF_CHECK
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF string is checked unless
PCRE2_NO_UTF_CHECK is passed to pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
The latter special case is discussed in detail in the pcre2unicode documentation.
In the default case, if a non-zero starting offset is given, the check is applied only to that part of
the subject that could be inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there are no lookbehind assertions
in the pattern, the check starts at the starting offset. Otherwise, it starts at the length of the
longest lookbehind before the starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are one-character lookbehinds.
The check is carried out before any other processing takes place, and a negative error code is returned
if the check fails. There are several UTF error codes for each code unit width, corresponding to
different problems with the code unit sequence. There are discussions about the validity of UTF-8 strings,
UTF-16 strings, and UTF-32 strings in the pcre2unicode documentation.
If you know that your subject is valid, and you want to skip this check for performance reasons, you can
set the PCRE2_NO_UTF_CHECK option when calling pcre2_match(). You might want to do this for the second
and subsequent calls to pcre2_match() if you are making repeated calls to find multiple matches in the
same subject string.
Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when PCRE2_NO_UTF_CHECK is set at match
time the effect of passing an invalid string as a subject, or an invalid value of startoffset, is
undefined. Your program may crash or loop indefinitely or give wrong results.
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
These options turn on the partial matching feature. A partial match These options turn on the partial matching feature. A partial match occur
occurs if the end of the subject s if the end of the subject
string is reached successfully, but there are not enough subject characte string is reached successfully, but there are not enough subject char
rs to complete the match. In acters to complete the match. In
addition, either at least one character must have been inspected or th addition, either at least one character must have been inspected or the p
e pattern must contain a lookbe- attern must contain a lookbe-
hind, or the pattern must be one that could match an empty string. hind, or the pattern must be one that could match an empty string.
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues
by testing any remaining alternatives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the caller is
prepared to handle a partial match, but only if no complete match can be found.
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if a partial match is found,
pcre2_match() immediately returns PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
other words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more important than an
alternative complete match.
There is a more detailed discussion of partial and multi-segment matching, with examples, in the
pcre2partial documentation.
NEWLINE HANDLING WHEN MATCHING
When PCRE2 is built, a default newline convention is set; this is usually the standard convention for the
operating system. The default can be overridden in a compile context by calling pcre2_set_newline(). It
can also be overridden by starting a pattern string with, for example, (*CRLF), as described in the
section on newline conventions in the pcre2pattern page. During matching, the newline choice affects the
behaviour of the dot, circumflex, and dollar metacharacters. It may also alter the way the match starting
position is advanced after a match failure for an unanchored pattern.
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as the newline convention,
and a match attempt for an unanchored pattern fails when the current starting position is at a CRLF
sequence, and the pattern contains no explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the CRLF.
The above rule is a compromise that makes the most common cases work as expected. For example, if the
pattern is .+A (and the PCRE2_DOTALL option is not set), it does not match the string "\r\nA" because,
after failing at the start, it skips both the CR and the LF before retrying. However, the pattern [\r\n]A
does match that string, because it contains an explicit CR or LF reference, and so advances only by one
character after the first failure.
An explicit match for CR or LF is either a literal appearance of one of those characters in the pattern,
or one of the \r or \n or equivalent octal or hexadecimal escape sequences. Implicit matches such as [^X]
do not count, nor does \s, even though it includes CR and LF in the characters that it matches.
Notwithstanding the above, anomalous effects may still occur when CRLF is a valid newline sequence and
explicit \r or \n escapes appear in the pattern.
HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from
the subject may be picked out by parenthesized parts of the pattern. Following the usage in Jeffrey
Friedl's book, this is called "capturing" in what follows, and the phrase "capture group" (Perl
terminology) is used for a fragment of a pattern that picks out a substring. PCRE2 supports several other
kinds of parenthesized group that do not cause substrings to be captured. The pcre2_pattern_info()
function can be used to find out how many capture groups there are in a compiled pattern.
You can use auxiliary functions for accessing captured substrings by number or by name, as described in
sections below.
Alternatively, you can make direct use of the vector of PCRE2_SIZE values, called the ovector, which
contains the offsets of captured strings. It is part of the match data block. The function
pcre2_get_ovector_pointer() returns the address of the ovector, and pcre2_get_ovector_count() returns the
number of pairs of values it contains.
Within the ovector, the first in each pair of values is set to the offset of the first code unit of a
substring, and the second is set to the offset of the first code unit after the end of a substring. These
values are always code unit offsets, not character offsets. That is, they are byte offsets in the 8-bit
library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit library.
After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair of offsets (that is,
ovector[0] and ovector[1]) are set. They identify the part of the subject that was partially matched. See
the pcre2partial documentation for details of partial matching.
After a fully successful match, the first pair of offsets identifies the portion of the subject string
that was matched by the entire pattern. The next pair is used for the first captured substring, and so
on. The value returned by pcre2_match() is one more than the highest numbered pair that has been set. For
example, if two substrings have been captured, the returned value is 3. If there are no captured
substrings, the return value from a successful match is 1, indicating that just the first pair of offsets
has been set.
If a pattern uses the \K escape sequence within a positive assertion, the reported start of a successful
match can be greater than the end of the match. For example, if the pattern (?=ab\K) is matched against
"ab", the start and end offset values for the match are 2 and 0.
If a capture group is matched repeatedly within a single match operation, it is the last portion of the
subject that it matched that is returned.
If the ovector is too small to hold all the captured substring offsets, as much as possible is filled in,
and the function returns a value of zero. If captured substrings are not of interest, pcre2_match() may
be called with a match data block whose ovector is of minimum length (that is, one pair).
It is possible for capture group number n+1 to match some part of the subject when group n has not been
used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from
the function is 4, and groups 1 and 3 are matched, but 2 is not. When this happens, both values in the
offset pairs corresponding to unused groups are set to PCRE2_UNSET.
Offset values that correspond to unused groups at the end of the expression are also set to PCRE2_UNSET.
For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are not
matched. The return from the function is 2, because the highest used capture group number is 1. The
offsets for the second and third capture groups (assuming the vector is large enough, of course) are
set to PCRE2_UNSET.
Elements in the ovector that do not correspond to capturing parentheses in the pattern are never changed.
That is, if a pattern contains n capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
by pcre2_match(). The other elements retain whatever values they previously had. After a failed match
attempt, the contents of the ovector are unchanged.
OTHER INFORMATION ABOUT A MATCH
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
As well as the offsets in the ovector, other information about a match is retained in the match data
block and can be retrieved by the above functions in appropriate circumstances. If they are called at
other times, the result is undefined.
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure to match
(PCRE2_ERROR_NOMATCH), a mark name may be available. The function pcre2_get_mark() can be called to
access this name, which can be specified in the pattern by any of the backtracking control verbs, not
just (*MARK). The same function applies to all the verbs. It returns a pointer to the zero-terminated
name, which is within the compiled pattern. If no name is available, NULL is returned. The length of the
name (excluding the terminating zero) is stored in the code unit that precedes the name. You should use
this length instead of relying on the terminating zero if the name might contain a binary zero.
After a successful match, the name that is returned is the last mark name encountered on the matching
path through the pattern. Instances of backtracking verbs without names do not count. Thus, for example,
if the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
partial match, the last encountered name is returned. For example, consider this pattern:
  ^(*MARK:A)((*MARK:B)a|b)c
When it matches "bc", the returned name is A. The B mark is "seen" in the first branch of the group, but
it is not on the matching path. On the other hand, when this pattern fails to match "bx", the returned
name is B.
Warning: By default, certain start-of-match optimizations are used to give a fast "no match" result in
some situations. For example, if the anchoring is removed from the pattern above, there is an initial
check for the presence of "c" in the subject before running the matching engine. This check fails for
"bx", causing a match failure without seeing any marks. You can disable the start-of-match optimizations
by setting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or by starting the pattern with
(*NO_START_OPT).
After a successful match, a partial match, or one of the invalid UTF errors (for example,
PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can be called. After a successful or partial match it
returns the code unit offset of the character at which the match started. For a non-partial match, this
can be different to the value of ovector[0] if the pattern contains the \K escape sequence. After a
partial match, however, this value is always the same as ovector[0] because \K does not affect the result
of a partial match.
After a UTF check failure, pcre2_get_startchar() can be used to obtain the code unit offset of the
invalid UTF character. Details are given in the pcre2unicode page.
ERROR RETURNS FROM pcre2_match()
If pcre2_match() fails, it returns a negative number. This can be converted to a text string by calling
the pcre2_get_error_message() function (see "Obtaining a textual error message" below). Negative error
codes are also returned by other functions, and are documented with them. The codes are given names in
the header file. If UTF checking is in force and an invalid UTF subject string is detected, one of a
number of UTF-specific negative error codes is returned. Details are given in the pcre2unicode page. The
following are the other errors that may be returned by pcre2_match():
PCRE2_ERROR_NOMATCH
The subject string did not match the pattern.
PCRE2_ERROR_PARTIAL
The subject string did not match, but it did match partially. See the pcre2partial documentation for
details of partial matching.
PCRE2_ERROR_BADMAGIC
PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to catch the case when it is
passed a junk pointer. This is the error that is returned when the magic number is not present.
PCRE2_ERROR_BADMODE
This error is given when a compiled pattern is passed to a function in a library of a different code unit
width, for example, a pattern compiled by the 8-bit library is passed to a 16-bit or 32-bit library
function.
PCRE2_ERROR_BADOFFSET
The value of startoffset was greater than the length of the subject.
PCRE2_ERROR_BADOPTION
An unrecognized bit was set in the options argument.
PCRE2_ERROR_BADUTFOFFSET
The UTF code unit sequence that was passed as a subject was checked and found to be valid (the
PCRE2_NO_UTF_CHECK option was not set), but the value of startoffset did not point to the beginning of a
UTF character or the end of the subject.
PCRE2_ERROR_CALLOUT
This error is never generated by pcre2_match() itself. It is provided for use by callout functions that
want to cause pcre2_match() or pcre2_callout_enumerate() to return a distinctive error code. See the
pcre2callout documentation for details.
PCRE2_ERROR_DEPTHLIMIT
The nested backtracking depth limit was reached.
PCRE2_ERROR_HEAPLIMIT
The heap limit was reached.
PCRE2_ERROR_INTERNAL
An unexpected internal error has occurred. This error could be caused by a bug in PCRE2 or by overwriting
of the compiled pattern.
PCRE2_ERROR_JIT_STACKLIMIT
This error is returned when a pattern that was successfully studied using JIT is being matched, but the
memory available for the just-in-time processing stack is not large enough. See the pcre2jit
documentation for more details.
PCRE2_ERROR_MATCHLIMIT
The backtracking match limit was reached.
PCRE2_ERROR_NOMEMORY
If a pattern contains many nested backtracking points, heap memory is used to remember them. This error
is given when the memory allocation function (default or custom) fails. Note that a different error,
PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds the heap limit.
PCRE2_ERROR_NOMEMORY is also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
PCRE2_ERROR_NULL
Either the code, subject, or match_data argument was passed as NULL.
PCRE2_ERROR_RECURSELOOP
This error is returned when pcre2_match() detects a recursion loop within the pattern. Specifically, it
means that either the whole pattern or a capture group has been called recursively for the second time at
the same position in the subject string. Some simple patterns that might do this are detected and faulted
at compile time, but more complicated cases, in particular mutual recursions between two different
groups, cannot be detected until matching is attempted.
OBTAINING A TEXTUAL ERROR MESSAGE
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
  PCRE2_SIZE bufflen);
A text message for an error code from any PCRE2 function (compile, match, or auxiliary) can be obtained
by calling pcre2_get_error_message(). The code is passed as the first argument, with the remaining two
arguments specifying a code unit buffer and its length in code units, into which the text message is
placed. The message is returned in code units of the appropriate width for the library that is being
used.
The returned message is terminated with a trailing zero, and the function returns the number of code
units used, excluding the trailing zero. If the error number is unknown, the negative error code
PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the message is truncated (but still with a
trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
very long; a buffer size of 120 code units is ample.
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_SIZE *length);
int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_UCHAR *buffer,
  PCRE2_SIZE *bufflen);
int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_UCHAR **bufferptr,
  PCRE2_SIZE *bufflen);
void pcre2_substring_free(PCRE2_UCHAR *buffer);
Captured substrings can be accessed directly by using the ovector as described above. For convenience,
auxiliary functions are provided for extracting captured substrings as new, separate, zero-terminated
strings. A substring that contains a binary zero is correctly extracted and has a further zero added on
the end, but the result is not, of course, a C string.
The functions in this section identify substrings by number. The number zero refers to the entire matched
substring, with higher numbers referring to substrings captured by parenthesized groups. After a partial
match, only substring zero is available. An attempt to extract any other substring gives the error
PCRE2_ERROR_PARTIAL. The next section describes similar functions for extracting captured substrings by
name.
If a pattern uses the \K escape sequence within a positive assertion, the reported start of a successful
match can be greater than the end of the match. For example, if the pattern (?=ab\K) is matched against
"ab", the start and end offset values for the match are 2 and 0. In this situation, calling these
functions with a zero substring number extracts a zero-length empty string.
You can find the length in code units of a captured substring without extracting it by calling
pcre2_substring_length_bynumber(). The first argument is a pointer to the match data block, the second is
the group number, and the third is a pointer to a variable into which the length is placed. If you just
want to know whether or not the substring has been captured, you can pass the third argument as NULL.
The pcre2_substring_copy_bynumber() function copies a captured substring into a supplied buffer, whereas
pcre2_substring_get_bynumber() copies it into new memory, obtained using the same memory allocation
function that was used for the match data block. The first two arguments of these functions are a pointer
to the match data block and a capture group number.
The final arguments of pcre2_substring_copy_bynumber() are a pointer to the buffer and a pointer to a
variable that contains its length in code units. This is updated to contain the actual number of code
units used for the extracted substring, excluding the terminating zero.
For pcre2_substring_get_bynumber() the third and fourth arguments point to variables that are updated
with a pointer to the new memory and the number of code units that comprise the substring, again
excluding the terminating zero. When the substring is no longer needed, the memory should be freed by
calling pcre2_substring_free().
The return value from all these functions is zero for success, or a negative error code. If the pattern
match failed, the match failure code is returned. If a substring number greater than zero is used after
a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
  PCRE2_ERROR_NOMEMORY
The buffer was too small for pcre2_substring_copy_bynumber(), or the attempt to get memory failed for
pcre2_substring_get_bynumber().
  PCRE2_ERROR_NOSUBSTRING
There is no substring with that number in the pattern, that is, the number is greater than the number of
capturing parentheses.
  PCRE2_ERROR_UNAVAILABLE
The substring number, though not greater than the number of captures in the pattern, is greater than the
number of slots in the ovector, so the substring could not be captured.
  PCRE2_ERROR_UNSET
The substring did not participate in the match. For example, if the pattern is (abc)|(def) and the
subject is "def", and the ovector contains at least two capturing slots, substring number 1 is unset.
EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
int pcre2_substring_list_get(pcre2_match_data *match_data,
  PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
void pcre2_substring_list_free(PCRE2_SPTR *list);
The pcre2_substring_list_get() function extracts all available substrings and builds a list of pointers
to them. It also (optionally) builds a second list that contains their lengths (in code units), excluding
a terminating zero that is added to each of them. All this is done in a single block of memory that is
obtained using the same memory allocation function that was used to get the match data block.
This function must be called only after a successful match. If called after a partial match, the error
code PCRE2_ERROR_PARTIAL is returned.
The address of the memory block is returned via listptr, which is also the start of the list of string
pointers. The end of the list is marked by a NULL pointer. The address of the list of lengths is returned
via lengthsptr. If your strings do not contain binary zeros and you do not therefore need the lengths,
you may supply NULL as the lengthsptr argument to disable the creation of a list of lengths. The yield of
the function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block could not be obtained.
When the list is no longer needed, it should be freed by calling pcre2_substring_list_free().
If this function encounters a substring that is unset, which can happen when capture group number n+1
matches some part of the subject, but group n has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the appropriate offset in the
ovector, which contains PCRE2_UNSET for unset substrings, or by calling pcre2_substring_length_bynumber().
EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre2_substring_number_from_name(const pcre2_code *code,
  PCRE2_SPTR name);
int pcre2_substring_length_byname(pcre2_match_data *match_data,
  PCRE2_SPTR name, PCRE2_SIZE *length);
skipping to change at line 2424 skipping to change at line 2452
int pcre2_substring_get_byname(pcre2_match_data *match_data,
  PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
void pcre2_substring_free(PCRE2_UCHAR *buffer);
To extract a substring by name, you first have to find the associated number. For example, for this pattern:
  (a+)b(?<xxx>\d+)...
the number of the capture group called "xxx" is 2. If the name is known to be unique (PCRE2_DUPNAMES was
not set), you can find the number from the name by calling pcre2_substring_number_from_name(). The first
argument is the compiled pattern, and the second is the name. The yield of the function is the group
number, PCRE2_ERROR_NOSUBSTRING if there is no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
there is more than one group with that name. Given the number, you can extract the substring directly
from the ovector, or use one of the "bynumber" functions described above.
For convenience, there are also "byname" functions that correspond to the "bynumber" functions, the only
difference being that the second argument is a name instead of a number. If PCRE2_DUPNAMES is set and
there are duplicate names, these functions scan all the groups with the given name, and return the
captured substring from the first named group that is set.
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is returned. If all groups with the
name have numbers that are greater than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
returned. If there is at least one group with a slot in the ovector, but no group is found to be set,
PCRE2_ERROR_UNSET is returned.
Warning: If the pattern uses the (?| feature to set up multiple capture groups with the same number, as
described in the section on duplicate group numbers in the pcre2pattern page, you cannot use names to
distinguish the different capture groups, because names are not included in the compiled code. The
matching process uses only numbers. For this reason, the use of different names for groups with the same
number causes an error at compile time.
CREATING A NEW STRING WITH SUBSTITUTIONS
int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext, PCRE2_SPTR replacement,
  PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
  PCRE2_SIZE *outlengthptr);
This function optionally calls pcre2_match() and then makes a copy of the subject string in outputbuffer,
replacing parts that were matched with the replacement string, whose length is supplied in rlength. This
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an option (see
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the replacement string(s). The default action is
to perform just one replacement if the pattern matches, but there is an option that requests multiple
replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
If successful, pcre2_substitute() returns the number of substitutions that were carried out. This may be
zero if no match was found, and is never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A
negative value is returned if an error is detected.
Matches in which a \K item in a lookahead in the pattern causes the match to end before it starts are not
supported, and give rise to an error return. For global replacements, matches in which \K in a lookbehind
causes the match to start earlier than the point that was reached in the previous iteration are also not
supported.
The first seven arguments of pcre2_substitute() are the same as for pcre2_match(), except that the
partial matching options are not permitted, and match_data may be passed as NULL, in which case a match
data block is obtained and freed within this function, using memory management functions from the match
context, if provided, or else those that were used to allocate memory for the compiled code.
If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the provided block is used for all
calls to pcre2_match(), and its contents afterwards are the result of the final call. For global changes,
this will always be a no-match error. The contents of the ovector within the match data block may or may
not have been changed.
As well as the usual options for pcre2_match(), a number of additional options can be set in the options
argument of pcre2_substitute(). One such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an
external match_data block must be provided, and it must have been used for an external call to
pcre2_match(). The data in the match_data block (return code, offset vector) is used for the first
substitution instead of calling pcre2_match() from within pcre2_substitute(). This allows an application
to check for a match before choosing to substitute, without having to repeat the match.
The contents of the externally supplied match data block are not changed when PCRE2_SUBSTITUTE_MATCHED is
set. If PCRE2_SUBSTITUTE_GLOBAL is also set, pcre2_match() is called after the first substitution to
check for further matches, but this is done using an internally obtained match data block, thus always
leaving the external block unchanged.
The code argument is not used for matching before the first substitution when PCRE2_SUBSTITUTE_MATCHED is
set, but it must be provided, even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains
information such as the UTF setting and the number of capturing parentheses in the pattern.
The default action of pcre2_substitute() is to return a copy of the subject string with matched
substrings replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement substrings
are returned. In the global case, multiple replacements are concatenated in the output buffer.
Substitution callouts (see below) can be used to separate them if necessary.
The outlengthptr argument of pcre2_substitute() must point to a variable that contains the length, in
code units, of the output buffer. If the function is successful, the value is updated to contain the
length in code units of the new string, excluding the trailing zero that is automatically added.
If the function is not successful, the value set via outlengthptr depends on the type of error. For
syntax errors in the replacement string, the value is the offset in the replacement string where the error
was detected. For other errors, the value is PCRE2_UNSET by default. This includes the case of the output
buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is too small. The default
action is to return PCRE2_ERROR_NOMEMORY immediately. If this option is set, however, pcre2_substitute()
continues to go through the motions of matching and substituting (without, of course, writing anything)
in order to compute the size of buffer that is needed. This value is passed back via the outlengthptr
variable, with the result of the function still being PCRE2_ERROR_NOMEMORY.
Passing a buffer size of zero is a permitted way of finding out how much memory is needed for a given
substitution. However, this does mean that the entire operation is carried out twice. Depending on the
application, it may be more efficient to allocate a large buffer and free the excess afterwards, instead
of using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
The replacement string, which is interpreted as a UTF string in UTF mode, is checked for UTF validity
unless PCRE2_NO_UTF_CHECK is set. An invalid UTF replacement string causes an immediate return with the
relevant UTF error code.
If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not interpreted in any way. By default,
however, a dollar character is an escape character that can specify the insertion of characters from
capture groups and names from (*MARK) or other control verbs in the pattern. The following forms are
always recognized:
  $$                  insert a dollar character
  $<n> or ${<n>}      insert the contents of group <n>
  $*MARK or ${*MARK}  insert a control verb name
Either a group number or a group name can be given for <n>. Curly brackets are required only if the
following character would be interpreted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "=+babcb+=".
$*MARK inserts the name from the last encountered backtracking control verb on the matching path that has
a name. (*MARK) must always include a name, but the other verbs need not. For example, in the case of
(*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B". This
facility can be used to perform simple simultaneous substitutions, as this pcre2test example shows:
  /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
      apple lemon
   2: pear orange
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string, replacing every matching
substring. If this option is not set, only the first matching substring is replaced. The search for
matches takes place in the original subject string (that is, previous replacements do not affect it).
Iteration is implemented by advancing the startoffset value for each search, which is always passed the
entire subject string. If an offset limit is set in the match context, searching stops when that limit is
reached.
You can restrict the effect of a global substitution to a portion of the subject string by setting either
or both of startoffset and an offset limit. Here is a pcre2test example:
  /B/g,replace=!,use_offset_limit
      ABC ABC ABC ABC\=offset=3,offset_limit=12
   2: ABC A!C A!C ABC
When continuing with global substitutions after matching a substring with zero length, an attempt to find
a non-empty match at the same offset is performed. If this is not successful, the offset is advanced by
one character except when CRLF is a valid newline sequence and the next two characters are CR, LF. In
this case, the offset is advanced by two characters.
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do not appear in the pattern to
be treated as unset groups. This option should be used with care, because it means that a typo in a group
name or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including unknown groups when
PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty strings when inserted as described above.
If this option is not set, an attempt to insert an unset group causes the PCRE2_ERROR_UNSET error. This
option does not influence the extended substitution syntax described below.
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the replacement string. Without this
option, only the dollar character is special, and only the group insertion forms listed above are valid.
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
Firstly, backslash in a replacement string is interpreted as an escape character. The usual forms such as
\n or \x{ddd} can be used to specify particular character codes, and backslash followed by any
non-alphanumeric character quotes that character. Extended quoting can be coded using \Q...\E, exactly as
in pattern strings.
There are also four escape sequences for forcing the case of inserted letters. The insertion
mechanism has three states: no case forcing, force upper case, and force lower case. The escape
sequences change the current state: \U and \L change to upper or lower case forcing, respectively,
and \E (when not terminating a \Q quoted sequence) reverts to no case forcing. The sequences \u and
\l force the next character (if it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to all inserted characters,
including those from capture groups and letters within \Q...\E quoted sequences. If either PCRE2_UTF
or PCRE2_UCP was set when the pattern was compiled, Unicode properties are used for case forcing
characters whose code points are greater than 127.
Note that case forcing sequences such as \U...\E do not nest. For example, the result of processing
"\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no effect. Note also that the PCRE2_ALT_BSUX and
PCRE2_EXTRA_ALT_BSUX options do not apply to replacement strings.
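
The three-state mechanism described above can be modelled in a few lines of standalone C. This is only a sketch of the documented behaviour, not PCRE2's implementation. The names force_state and apply_case_forcing are hypothetical, only ASCII letters are handled, and any escape other than \U, \L, \E, \u and \l is simply copied as the escaped character. It also assumes that the one-shot state set by \u or \l reverts after the next character whether or not that character is a letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical model, not PCRE2 code: ASCII-only case forcing. */
enum force_state { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Copies `in` to `out`, interpreting \U \L \E \u \l.
   Returns the number of characters written (excluding the final NUL). */
size_t apply_case_forcing(const char *in, char *out, size_t outsize)
{
    enum force_state state = FORCE_NONE;
    int one_shot = 0;   /* set by \u and \l: revert after one character */
    size_t n = 0;

    assert(in != NULL && out != NULL && outsize > 0);
    for (; *in != '\0' && n + 1 < outsize; in++) {
        char c = *in;
        if (c == '\\' && in[1] != '\0') {
            c = *++in;
            switch (c) {
            case 'U': state = FORCE_UPPER; one_shot = 0; continue;
            case 'L': state = FORCE_LOWER; one_shot = 0; continue;
            case 'E': state = FORCE_NONE;  one_shot = 0; continue;
            case 'u': state = FORCE_UPPER; one_shot = 1; continue;
            case 'l': state = FORCE_LOWER; one_shot = 1; continue;
            default: break;   /* simplification: copy the escaped character */
            }
        }
        if (state == FORCE_UPPER) c = (char)toupper((unsigned char)c);
        else if (state == FORCE_LOWER) c = (char)tolower((unsigned char)c);
        out[n++] = c;
        if (one_shot) { state = FORCE_NONE; one_shot = 0; }
    }
    out[n] = '\0';
    return n;
}
```

Because there is a single current state rather than a stack, the \E after "BB" already returns to no case forcing, which is why the trailing \E in "\Uaa\LBB\Ecc\E" changes nothing.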
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more flexibility to capture group
substitution. The syntax is similar to that used by Bash:
  ${<n>:-<string>}
  ${<n>:+<string1>:<string2>}
As before, <n> may be a group number or a name. The first form specifies a default value. If group
<n> is set, its value is inserted; if not, <string> is expanded and the result inserted. The second
form specifies strings that are expanded and inserted when group <n> is set or unset, respectively.
The first form is just a convenient shorthand for
  ${<n>:+${<n>}:<string>}
Backslash can be used to escape colons and closing curly brackets in the replacement strings. A
change of the case forcing state within a replacement string remains in force afterwards, as shown
in this pcre2test example:
  /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
      body
   1: hello
      somebody
   1: HELLO
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended substitutions. However,
PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown groups in the extended syntax forms to be treated
as unset.
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, PCRE2_SUBSTITUTE_UNSET_EMPTY,
and PCRE2_SUBSTITUTE_EXTENDED are irrelevant and are ignored.
Substitution errors
In the event of an error, pcre2_substitute() returns a negative error code. Except for
PCRE2_ERROR_NOMATCH (which is never returned), errors from pcre2_match() are passed straight back.
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion, unless
PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an unknown substring when
PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple (non-extended) syntax is used and
PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is needed is returned via
outlengthptr. Note that this does not happen by default.
PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the match_data argument is NULL.
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the replacement string, with
more particular errors being PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended
before it started or the match started earlier than the current position in the subject, which can
happen if \K is used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be obtained by calling the
pcre2_get_error_message() function (see "Obtaining a textual error message" above).
Substitution callouts
  int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
    int (*callout_function)(pcre2_substitute_callout_block *, void *),
    void *callout_data);
The pcre2_set_substitute_callout() function can be used to specify a callout function for
pcre2_substitute(). This information is passed in a match context. The callout function is called
after each substitution has been processed, but it can cause the replacement not to happen. The
callout function is not called for simulated substitutions that happen as a result of the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
The first argument of the callout function is a pointer to a substitute callout block structure,
which contains the following fields, not necessarily in this order:
  uint32_t    version;
  uint32_t    subscount;
  PCRE2_SPTR  input;
  PCRE2_SPTR  output;
  PCRE2_SIZE *ovector;
  uint32_t    oveccount;
  PCRE2_SIZE  output_offsets[2];
The version field contains the version number of the block format. The current version is 0. The
version number will increase in future if more fields are added, but the intention is never to
remove any of the existing fields.
The subscount field is the number of the current match. It is 1 for the first callout, 2 for the
second, and so on. The input and output pointers are copies of the values passed to
pcre2_substitute().
The ovector field points to the ovector, which contains the result of the most recent match. The
oveccount field contains the number of pairs that are set in the ovector, and is always greater
than zero.
The output_offsets vector contains the offsets of the replacement in the output string. This has
already been processed for dollar and (if requested) backslash substitutions as described above.
The second argument of the callout function is the value passed as callout_data when the function
was registered. The value returned by the callout function is interpreted as follows:
If the value is zero, the replacement is accepted, and, if PCRE2_SUBSTITUTE_GLOBAL is set,
processing continues with a search for the next match. If the value is not zero, the current
replacement is not accepted. If the value is greater than zero, processing continues when
PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero or PCRE2_SUBSTITUTE_GLOBAL
is not set), the rest of the input is copied to the output and the call to pcre2_substitute()
exits, returning the number of matches so far.
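
The return-value rules above can be summarised as a small decision function. The following standalone sketch is a model written for illustration, not PCRE2 code. The type demo_callout_block stands in for pcre2_substitute_callout_block and mirrors only its subscount field, and the names demo_callout, demo_outcome and interpret_callout_result are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the one field used here; a real callout would receive a
   pcre2_substitute_callout_block from PCRE2 instead. */
typedef struct { uint32_t subscount; } demo_callout_block;

/* Example policy: reject every second replacement, accept the others. */
int demo_callout(demo_callout_block *block, void *callout_data)
{
    (void)callout_data;              /* unused in this sketch */
    assert(block != NULL);
    return (block->subscount % 2 == 0) ? 1 : 0;
}

typedef struct {
    int accepted;        /* nonzero if the replacement is used */
    int keep_searching;  /* nonzero if the search for matches continues */
} demo_outcome;

/* Models how the text above says a callout result is interpreted;
   `global` is nonzero when PCRE2_SUBSTITUTE_GLOBAL is set. */
demo_outcome interpret_callout_result(int rc, int global)
{
    demo_outcome o;
    o.accepted = (rc == 0);
    o.keep_searching = (global && rc >= 0);
    return o;
}
```

With a real match context, a function with the documented callout signature would be registered through pcre2_set_substitute_callout(); a positive return such as the one in demo_callout leaves the matched text in place and, under PCRE2_SUBSTITUTE_GLOBAL, moves on to the next match.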
DUPLICATE CAPTURE GROUP NAMES
  int pcre2_substring_nametable_scan(const pcre2_code *code,
    PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
When a pattern is compiled with the PCRE2_DUPNAMES option, names for capture groups are not
required to be unique. Duplicate names are always allowed for groups with the same number, created
by using the (?| feature. Indeed, if such groups are named, they are required to use the same
names.
Normally, patterns that use duplicate names are such that in any one match, only one of each set of
identically-named groups participates. An example is shown in the pcre2pattern documentation.
When duplicates are present, pcre2_substring_copy_byname() and pcre2_substring_get_byname() return
the first substring corresponding to the given name that is set. Only if none are set is
PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() function returns the error
PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
If you want to get full details of all captured substrings for a given name, you must use the
pcre2_substring_nametable_scan() function. The first argument is the compiled pattern, and the
second is the name. If the third and fourth arguments are NULL, the function returns a group number
for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers to variables that are
updated by the function. After it has run, they point to the first and last entries in the
name-to-number table for the given name, and the function returns the length of each entry in code
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given
name.
The format of the name table is described above in the section entitled Information about a
pattern. Given all the relevant entries for the name, you can extract each of their numbers, and
hence the captured data.
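
As an illustration of that last step, the sketch below walks the entries between first and last and collects the group numbers. It assumes the 8-bit library, where each name table entry starts with the group number in two bytes, most significant first, followed by the zero-terminated name. The function name collect_group_numbers is hypothetical, and the code takes plain uint8_t pointers so that it stands alone; in real use the pointers and the entry length would come from pcre2_substring_nametable_scan().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: walks name table entries from `first` to `last`
   (both inclusive), each `entry_length` code units long, and stores up
   to `max` group numbers in `numbers`. Returns the number of entries seen. */
size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                             size_t entry_length,
                             uint32_t *numbers, size_t max)
{
    size_t count = 0;
    const uint8_t *entry;

    assert(first != NULL && last != NULL && entry_length > 2);
    for (entry = first; entry <= last; entry += entry_length) {
        /* 8-bit format: group number in two bytes, high byte first */
        if (count < max)
            numbers[count] = ((uint32_t)entry[0] << 8) | entry[1];
        count++;
    }
    return count;
}
```

Each collected number can then be passed to the by-number substring functions to find out which of the identically named groups actually took part in the match.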
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
The traditional matching function uses a similar algorithm to Perl, which stops when it finds the
first match at a given point in the subject. If you want to find all possible matches, or the
longest possible match at a given position, consider using the alternative matching function (see
below) instead. If you cannot use the alternative function, you can kludge it up by making use of
the callout facility, which is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pattern. When your callout
function is called, extract and save the current matched substring. Then return 1, which forces
pcre2_match() to backtrack and try other alternatives. Ultimately, when it runs out of matches,
pcre2_match() will yield PCRE2_ERROR_NOMATCH.
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
  int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
    PCRE2_SIZE length, PCRE2_SIZE startoffset,
    uint32_t options, pcre2_match_data *match_data,
    pcre2_match_context *mcontext,
    int *workspace, PCRE2_SIZE wscount);
The function pcre2_dfa_match() is called to match a subject string against a compiled pattern,
using a matching algorithm that scans the subject string just once (not counting lookaround
assertions), and does not backtrack. This has different characteristics to the normal algorithm,
and is not compatible with Perl. Some of the features of PCRE2 patterns are not supported.
Nevertheless, there are times when this kind of matching can be useful. For a discussion of the two
matching algorithms, and a list of features that pcre2_dfa_match() does not support, see the
pcre2matching documentation.
The arguments for the pcre2_dfa_match() function are the same as for pcre2_match(), plus two
extras. The ovector within the match data block is used in a different way, and this is described
below. The other common arguments are used in the same way as for pcre2_match(), so their
description is not repeated here.
The two additional arguments provide workspace for the function. The workspace vector should
contain at least 20 elements. It is used for keeping track of multiple paths through the pattern
tree. More workspace is needed for patterns and subjects where there are a lot of potential
matches.
Here is an example of a simple call to pcre2_dfa_match():
  int wspace[20];
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_dfa_match(
    re,                         /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string",  /* the subject string */
    11,                         /* the length of the subject string */
    0,                          /* start at offset 0 in the subject */
    0,                          /* default options */
    md,                         /* the match data block */
    NULL,                       /* a match context; NULL means use defaults */
    wspace,                     /* working space vector */
    20);                        /* number of elements (NOT size in bytes) */
Option bits for pcre2_dfa_match()
The unused bits of the options argument for pcre2_dfa_match() must be zero. The only bits that may
be set are PCRE2_ANCHORED, PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of these are
exactly the same as for pcre2_match(), so their description is not repeated here.
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
These have the same general effect as they do for pcre2_match(), but the details are slightly
different. When PCRE2_PARTIAL_HARD is set for pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if
the end of the subject is reached and there is still at least one matching possibility that
requires additional characters. This happens even if some complete matches have already been found.
When PCRE2_PARTIAL_SOFT is set, the return code PCRE2_ERROR_NOMATCH is converted into
PCRE2_ERROR_PARTIAL if the end of the subject is reached, there have been no complete matches, but
there is still at least one matching possibility. The portion of the string that was inspected when
the longest partial match was found is set as the first matching string in both cases. There is a
more detailed discussion of partial and multi-segment matching, with examples, in the pcre2partial
documentation.
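
The difference between the two partial options can be written out as a decision table. The function below is a standalone model of the rules stated above, not PCRE2 code. The name dfa_partial_outcome and its parameters are hypothetical, and DEMO_NOMATCH and DEMO_PARTIAL are local stand-ins that use the values pcre2.h gives to PCRE2_ERROR_NOMATCH (-1) and PCRE2_ERROR_PARTIAL (-2).

```c
#include <assert.h>

/* Local stand-ins for PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL */
enum { DEMO_NOMATCH = -1, DEMO_PARTIAL = -2 };

/* complete: number of complete matches found so far
   at_end:   nonzero if the end of the subject was reached
   pending:  nonzero if some path through the pattern still needs characters
   hard:     nonzero if PCRE2_PARTIAL_HARD is set
   soft:     nonzero if PCRE2_PARTIAL_SOFT is set */
int dfa_partial_outcome(int complete, int at_end, int pending,
                        int hard, int soft)
{
    assert(complete >= 0);
    if (hard && at_end && pending)
        return DEMO_PARTIAL;   /* hard: partial wins over complete matches */
    if (complete > 0)
        return complete;       /* otherwise complete matches are reported */
    if (soft && at_end && pending)
        return DEMO_PARTIAL;   /* soft: only replaces a no-match result */
    return DEMO_NOMATCH;
}
```

The ordering of the tests is the point of the sketch: the hard option is checked before complete matches are considered, whereas the soft option is consulted only when there would otherwise be no match.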
  PCRE2_DFA_SHORTEST
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as soon as it has found
one match. Because of the way the alternative algorithm works, this is necessarily the shortest
possible match at the first possible matching point in the subject string.
  PCRE2_DFA_RESTART
When pcre2_dfa_match() returns a partial match, it is possible to call it again, with additional
subject characters, and have it continue with the same match. The PCRE2_DFA_RESTART option requests
this action; when it is set, the workspace and wscount arguments must reference the same vector as
before because data about the match so far is left in them after a partial match. There is more
discussion of this facility in the pcre2partial documentation.
Successful returns from pcre2_dfa_match()
When pcre2_dfa_match() succeeds, it may have matched more than one substring in the subject. Note,
however, that all the matches from one run of the function start at the same point in the subject.
The shorter matches are all initial substrings of the longer matches. For example, if the pattern
  <.*>
is matched against the string
  This is <something> <something else> <something further> no more
the three matched strings are
  <something> <something else> <something further>
  <something> <something else>
  <something>
On success, the yield of the function is a number greater than zero, which is the number of matched
substrings. The offsets of the substrings are returned in the ovector, and can be extracted by
number in the same way as for pcre2_match(), but the numbers bear no relation to any capture groups
that may exist in the pattern, because DFA matching does not support capturing.
Calls to the convenience functions that extract substrings by name return the error
PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a DFA match. The convenience functions
that extract substrings by number never return PCRE2_ERROR_NOSUBSTRING.
The matched strings are stored in the ovector in reverse order of length; that is, the longest
matching string is first. If there were too many matches to fit into the ovector, the yield of the
function is zero, and the vector is filled with the longest matches.
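
To make the ovector layout concrete, the sketch below copies the match with a given index out of a subject string, and works out how many pairs are valid for a given return value. It is a standalone model. The names copy_dfa_match and dfa_stored_matches are hypothetical, and a plain size_t array stands in for the ovector that would normally be obtained from the match data block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies match number `index` (0 is the longest) into `buf`, truncating
   if necessary. Returns the number of characters copied. */
size_t copy_dfa_match(const char *subject, const size_t *ovector,
                      size_t index, char *buf, size_t bufsize)
{
    size_t start = ovector[2 * index];       /* same for every match */
    size_t end = ovector[2 * index + 1];
    size_t len;

    assert(buf != NULL && bufsize > 0 && end >= start);
    len = end - start;
    if (len >= bufsize) len = bufsize - 1;   /* truncate to fit */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return len;
}

/* Number of valid pairs in the ovector for a non-negative return value:
   rc > 0 is the match count; rc == 0 means the ovector overflowed and
   holds the `oveccount` longest matches. */
size_t dfa_stored_matches(int rc, size_t oveccount)
{
    if (rc > 0) return (size_t)rc;
    if (rc == 0) return oveccount;
    return 0;                                /* negative: an error code */
}
```

For the <.*> example above, every pair would start at offset 8 and the end offsets would be 56, 36 and 19, so index 0 yields the 48-character match and index 2 yields "<something>".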
NOTE: PCRE2's "auto-possessification" optimization usually applies to character repeats at the end
of a pattern (as well as internally). For example, the pattern "a\d+" is compiled as if it were
"a\d++". For DFA matching, this means that only one possible match is found. If you really do want
multiple matches in such cases, either use an ungreedy repeat such as "a\d+?" or set the
PCRE2_NO_AUTO_POSSESS option when compiling.
Error returns from pcre2_dfa_match() Error returns from pcre2_dfa_match()
The pcre2_dfa_match() function returns a negative number when it fails. The pcre2_dfa_match() function returns a negative number when it fails.
Many of the errors are the same Many of the errors are the same
as for pcre2_match(), as described above. There are in addition the foll as for pcre2_match(), as described above. There are in addition the fo
owing errors that are specific llowing errors that are specific
to pcre2_dfa_match(): to pcre2_dfa_match():
PCRE2_ERROR_DFA_UITEM PCRE2_ERROR_DFA_UITEM
This return is given if pcre2_dfa_match() encounters an item in the patte rn that it does not support, for This return is given if pcre2_dfa_match() encounters an item in the patte rn that it does not support, for
instance, the use of \C in a UTF mode or a backreference. instance, the use of \C in a UTF mode or a backreference.
PCRE2_ERROR_DFA_UCOND PCRE2_ERROR_DFA_UCOND
This return is given if pcre2_dfa_match() encounters a condition item tha t uses a backreference for the This return is given if pcre2_dfa_match() encounters a condition item t hat uses a backreference for the
condition, or a test for recursion in a specific capture group. These are not supported. condition, or a test for recursion in a specific capture group. These are not supported.
PCRE2_ERROR_DFA_UINVALID_UTF PCRE2_ERROR_DFA_UINVALID_UTF
This return is given if pcre2_dfa_match() is called for a pattern that was compiled with This return is given if pcre2_dfa_match() is called for a pattern that was compiled with
PCRE2_MATCH_INVALID_UTF. This is not supported for DFA matching. PCRE2_MATCH_INVALID_UTF. This is not supported for DFA matching.
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
This return is given if pcre2_dfa_match() runs out of space in the workspace vector. This return is given if pcre2_dfa_match() runs out of space in the workspace vector.
PCRE2_ERROR_DFA_RECURSE PCRE2_ERROR_DFA_RECURSE
When a recursion or subroutine call is processed, the matching function calls itself recursively, using When a recursion or subroutine call is processed, the matching function calls itself recursively, using
private memory for the ovector and workspace. This error is given if the internal ovector is not large private memory for the ovector and workspace. This error is given if the internal ovector is not large
enough. This should be extremely rare, as a vector of size 1000 is used. enough. This should be extremely rare, as a vector of size 1000 is used.
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, some plausibility checks are made on When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, some plausibility checks are made on
the contents of the workspace, which should contain data about the previous partial match. If any of the contents of the workspace, which should contain data about the previous partial match. If any of
these checks fail, this error is given. these checks fail, this error is given.
SEE ALSO SEE ALSO
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), pcre2partial(3), pcre2posix(3), pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), pcre2partial(3), pcre2posix(3),
pcre2sample(3), pcre2unicode(3). pcre2sample(3), pcre2unicode(3).
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service University Computing Service
Cambridge, England. Cambridge, England.
REVISION REVISION
Last updated: 19 March 2020 Last updated: 04 November 2020
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
PCRE2 10.35 19 March 2020 PCRE2API(3) PCRE2 10.36 04 November 2020 PCRE2API(3)
 End of changes. 403 change blocks. 
1520 lines changed or deleted 1558 lines changed or added
