"Fossies" - the Fresh Open Source Software Archive

Member "vfu-4.18/vslib/pcre2/pcre2-10.20/doc/html/pcre2test.html" (2 Jul 2015, 61460 Bytes) of package /linux/privat/vfu-4.18.tar.gz:

    1 <html>
    2 <head>
    3 <title>pcre2test specification</title>
    4 </head>
    5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
    6 <h1>pcre2test man page</h1>
    7 <p>
    8 Return to the <a href="index.html">PCRE2 index page</a>.
    9 </p>
   10 <p>
   11 This page is part of the PCRE2 HTML documentation. It was generated
   12 automatically from the original man page. If there is any nonsense in it,
   13 please consult the man page, in case the conversion went wrong.
   14 <br>
   15 <ul>
   16 <li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
   17 <li><a name="TOC2" href="#SEC2">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
   18 <li><a name="TOC3" href="#SEC3">INPUT ENCODING</a>
   19 <li><a name="TOC4" href="#SEC4">COMMAND LINE OPTIONS</a>
   20 <li><a name="TOC5" href="#SEC5">DESCRIPTION</a>
   21 <li><a name="TOC6" href="#SEC6">COMMAND LINES</a>
   22 <li><a name="TOC7" href="#SEC7">MODIFIER SYNTAX</a>
   23 <li><a name="TOC8" href="#SEC8">PATTERN SYNTAX</a>
   24 <li><a name="TOC9" href="#SEC9">SUBJECT LINE SYNTAX</a>
   25 <li><a name="TOC10" href="#SEC10">PATTERN MODIFIERS</a>
   26 <li><a name="TOC11" href="#SEC11">SUBJECT MODIFIERS</a>
   27 <li><a name="TOC12" href="#SEC12">THE ALTERNATIVE MATCHING FUNCTION</a>
   28 <li><a name="TOC13" href="#SEC13">DEFAULT OUTPUT FROM pcre2test</a>
   29 <li><a name="TOC14" href="#SEC14">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a>
   30 <li><a name="TOC15" href="#SEC15">RESTARTING AFTER A PARTIAL MATCH</a>
   31 <li><a name="TOC16" href="#SEC16">CALLOUTS</a>
   32 <li><a name="TOC17" href="#SEC17">NON-PRINTING CHARACTERS</a>
   33 <li><a name="TOC18" href="#SEC18">SAVING AND RESTORING COMPILED PATTERNS</a>
   34 <li><a name="TOC19" href="#SEC19">SEE ALSO</a>
   35 <li><a name="TOC20" href="#SEC20">AUTHOR</a>
   36 <li><a name="TOC21" href="#SEC21">REVISION</a>
   37 </ul>
   38 <br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
   39 <P>
   40 <b>pcre2test [options] [input file [output file]]</b>
   41 <br>
   42 <br>
   43 <b>pcre2test</b> is a test program for the PCRE2 regular expression libraries,
   44 but it can also be used for experimenting with regular expressions. This
   45 document describes the features of the test program; for details of the regular
   46 expressions themselves, see the
   47 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
   48 documentation. For details of the PCRE2 library function calls and their
   49 options, see the
   50 <a href="pcre2api.html"><b>pcre2api</b></a>
   51 documentation.
   52 </P>
   53 <P>
   54 The input for <b>pcre2test</b> is a sequence of regular expression patterns and
   55 subject strings to be matched. There are also command lines for setting
   56 defaults and controlling some special actions. The output shows the result of
   57 each match attempt. Modifiers on external or internal command lines, the
   58 patterns, and the subject lines specify PCRE2 function options, control how the
   59 subject is processed, and determine what output is produced.
   60 </P>
   61 <P>
   62 As the original fairly simple PCRE library evolved, it acquired many different
   63 features, and as a result, the original <b>pcretest</b> program ended up with a
   64 lot of options in a messy, arcane syntax, for testing all the features. The
   65 move to the new PCRE2 API provided an opportunity to re-implement the test
   66 program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
   67 are still many obscure modifiers, some of which are specifically designed for
   68 use in conjunction with the test script and data files that are distributed as
   69 part of PCRE2. All the modifiers are documented here, some without much
   70 justification, but many of them are unlikely to be of use except when testing
   71 the libraries.
   72 </P>
   73 <br><a name="SEC2" href="#TOC1">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
   74 <P>
   75 Different versions of the PCRE2 library can be built to support character
   76 strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
   77 all three of these libraries may be simultaneously installed. The
   78 <b>pcre2test</b> program can be used to test all the libraries. However, its own
   79 input and output are always in 8-bit format. When testing the 16-bit or 32-bit
   80 libraries, patterns and subject strings are converted to 16- or 32-bit format
   81 before being passed to the library functions. Results are converted back to
   82 8-bit code units for output.
   83 </P>
   84 <P>
   85 In the rest of this document, the names of library functions and structures
   86 are given in generic form, for example, <b>pcre2_compile()</b>. The actual
   87 names used in the libraries have a suffix _8, _16, or _32, as appropriate.
   88 </P>
   89 <br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
   90 <P>
   91 Input to <b>pcre2test</b> is processed line by line, either by calling the C
   92 library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
   93 below). The input is processed using C's string functions, so must not
   94 contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
   95 treats any bytes other than newline as data characters. In some Windows
   96 environments character 26 (hex 1A) causes an immediate end of file, and no
   97 further data is read.
   98 </P>
   99 <P>
  100 For maximum portability, therefore, it is safest to avoid non-printing
  101 characters in <b>pcre2test</b> input files. There is a facility for specifying a
  102 pattern's characters as hexadecimal pairs, thus making it possible to include
  103 binary zeroes in a pattern for testing purposes. Subject lines are processed
  104 for backslash escapes, which makes it possible to include any data value.
  105 </P>
  106 <br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
  107 <P>
  108 <b>-8</b>
  109 If the 8-bit library has been built, this option causes it to be used (this is
  110 the default). If the 8-bit library has not been built, this option causes an
  111 error.
  112 </P>
  113 <P>
  114 <b>-16</b>
  115 If the 16-bit library has been built, this option causes it to be used. If only
  116 the 16-bit library has been built, this is the default. If the 16-bit library
  117 has not been built, this option causes an error.
  118 </P>
  119 <P>
  120 <b>-32</b>
  121 If the 32-bit library has been built, this option causes it to be used. If only
  122 the 32-bit library has been built, this is the default. If the 32-bit library
  123 has not been built, this option causes an error.
  124 </P>
  125 <P>
  126 <b>-b</b>
  127 Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
  128 internal binary form of the pattern is output after compilation.
  129 </P>
  130 <P>
  131 <b>-C</b>
  132 Output the version number of the PCRE2 library, and all available information
  133 about the optional features that are included, and then exit with zero exit
  134 code. All other options are ignored.
  135 </P>
  136 <P>
  137 <b>-C</b> <i>option</i>
  138 Output information about a specific build-time option, then exit. This
  139 functionality is intended for use in scripts such as <b>RunTest</b>. The
  140 following options output the value and set the exit code as indicated:
  141 <pre>
  142   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
  143                0x15 or 0x25
  144                0 if used in an ASCII environment
  145                exit code is always 0
  146   linksize   the configured internal link size (2, 3, or 4)
  147                exit code is set to the link size
  148   newline    the default newline setting:
  149                CR, LF, CRLF, ANYCRLF, or ANY
  150                exit code is always 0
  151   bsr        the default setting for what \R matches:
  152                ANYCRLF or ANY
  153                exit code is always 0
  154 </pre>
  155 The following options output 1 for true or 0 for false, and set the exit code
  156 to the same value:
  157 <pre>
  158   ebcdic     compiled for an EBCDIC environment
  159   jit        just-in-time support is available
  160   pcre2-16   the 16-bit library was built
  161   pcre2-32   the 32-bit library was built
  162   pcre2-8    the 8-bit library was built
  163   unicode    Unicode support is available
  164 </pre>
  165 If an unknown option is given, an error message is output; the exit code is 0.
  166 </P>
  167 <P>
  168 <b>-d</b>
  169 Behave as if each pattern has the <b>debug</b> modifier; the internal
  170 form and information about the compiled pattern is output after compilation;
  171 <b>-d</b> is equivalent to <b>-b -i</b>.
  172 </P>
  173 <P>
  174 <b>-dfa</b>
  175 Behave as if each subject line has the <b>dfa</b> modifier; matching is done
  176 using the <b>pcre2_dfa_match()</b> function instead of the default
  177 <b>pcre2_match()</b>.
  178 </P>
  179 <P>
  180 <b>-help</b>
  181 Output a brief summary of these options and then exit.
  182 </P>
  183 <P>
  184 <b>-i</b>
  185 Behave as if each pattern has the <b>/info</b> modifier; information about the
  186 compiled pattern is given after compilation.
  187 </P>
  188 <P>
  189 <b>-jit</b>
  190 Behave as if each pattern line has the <b>jit</b> modifier; after successful
  191 compilation, each pattern is passed to the just-in-time compiler, if available.
  192 </P>
  193 <P>
  194 <b>-pattern</b> <i>modifier-list</i>
  195 Behave as if each pattern line contains the given modifiers.
  196 </P>
  197 <P>
  198 <b>-q</b>
  199 Do not output the version number of <b>pcre2test</b> at the start of execution.
  200 </P>
  201 <P>
  202 <b>-S</b> <i>size</i>
  203 On Unix-like systems, set the size of the run-time stack to <i>size</i>
  204 megabytes.
  205 </P>
  206 <P>
  207 <b>-subject</b> <i>modifier-list</i>
  208 Behave as if each subject line contains the given modifiers.
  209 </P>
  210 <P>
  211 <b>-t</b>
  212 Run each compile and match many times with a timer, and output the resulting
  213 times per compile or match. When JIT is used, separate times are given for the
  214 initial compile and the JIT compile. You can control the number of iterations
  215 that are used for timing by following <b>-t</b> with a number (as a separate
  216 item on the command line). For example, "-t 1000" iterates 1000 times. The
  217 default is to iterate 500,000 times.
  218 </P>
  219 <P>
  220 <b>-tm</b>
  221 This is like <b>-t</b> except that it times only the matching phase, not the
  222 compile phase.
  223 </P>
  224 <P>
  225 <b>-T</b> <b>-TM</b>
  226 These behave like <b>-t</b> and <b>-tm</b>, but in addition, at the end of a run,
  227 the total times for all compiles and matches are output.
  228 </P>
  229 <P>
  230 <b>-version</b>
  231 Output the PCRE2 version number and then exit.
  232 </P>
  233 <br><a name="SEC5" href="#TOC1">DESCRIPTION</a><br>
  234 <P>
  235 If <b>pcre2test</b> is given two filename arguments, it reads from the first and
  236 writes to the second. If the first name is "-", input is taken from the
  237 standard input. If <b>pcre2test</b> is given only one argument, it reads from
  238 that file and writes to stdout. Otherwise, it reads from stdin and writes to
  239 stdout.
  240 </P>
  241 <P>
  242 When <b>pcre2test</b> is built, a configuration option can specify that it
  243 should be linked with the <b>libreadline</b> or <b>libedit</b> library. When this
  244 is done, if the input is from a terminal, it is read using the <b>readline()</b>
  245 function. This provides line-editing and history facilities. The output from
  246 the <b>-help</b> option states whether or not <b>readline()</b> will be used.
  247 </P>
  248 <P>
  249 The program handles any number of tests, each of which consists of a set of
  250 input lines. Each set starts with a regular expression pattern, followed by any
  251 number of subject lines to be matched against that pattern. In between sets of
  252 test data, command lines that begin with # may appear. This file format, with
  253 some restrictions, can also be processed by the <b>perltest.sh</b> script that
  254 is distributed with PCRE2 as a means of checking that the behaviour of PCRE2
  255 and Perl is the same.
  256 </P>
  257 <P>
  258 When the input is a terminal, <b>pcre2test</b> prompts for each line of input,
  259 using "re&#62;" to prompt for regular expression patterns, and "data&#62;" to prompt
  260 for subject lines. Command lines starting with # can be entered only in
  261 response to the "re&#62;" prompt.
  262 </P>
  263 <P>
  264 Each subject line is matched separately and independently. If you want to do
  265 multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
  266 etc., depending on the newline setting) in a single line of input to encode the
  267 newline sequences. There is no limit on the length of subject lines; the input
  268 buffer is automatically extended if it is too small. There is a replication
  269 feature that makes it possible to generate long subject lines without having to
  270 supply them explicitly.
  271 </P>
  272 <P>
  273 An empty line or the end of the file signals the end of the subject lines for a
  274 test, at which point a new pattern or command line is expected if there is
  275 still input to be read.
  276 </P>
  277 <br><a name="SEC6" href="#TOC1">COMMAND LINES</a><br>
  278 <P>
  279 In between sets of test data, a line that begins with # is interpreted as a
  280 command line. If the first character is followed by white space or an
  281 exclamation mark, the line is treated as a comment, and ignored. Otherwise, the
  282 following commands are recognized:
  283 <pre>
  284   #forbid_utf
  285 </pre>
  286 Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
  287 options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
  288 the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
  289 an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
  290 which are still supported when PCRE2_UTF is not set, but which require Unicode
  291 property support to be included in the library.
  292 </P>
  293 <P>
  294 This is a trigger guard that is used in test files to ensure that UTF or
  295 Unicode property tests are not accidentally added to files that are used when
  296 Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
  297 PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
  298 the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
  299 options are not displayed in pattern information, to avoid cluttering up test
  300 output.
  301 <pre>
  302   #load &#60;filename&#62;
  303 </pre>
  304 This command is used to load a set of precompiled patterns from a file, as
  305 described in the section entitled "Saving and restoring compiled patterns"
  306 <a href="#saverestore">below.</a>
  307 <pre>
  308   #pattern &#60;modifier-list&#62;
  309 </pre>
  310 This command sets a default modifier list that applies to all subsequent
  311 patterns. Modifiers on a pattern can change these settings.
  312 <pre>
  313   #perltest
  314 </pre>
  315 The appearance of this line causes all subsequent modifier settings to be
  316 checked for compatibility with the <b>perltest.sh</b> script, which is used to
  317 confirm that Perl gives the same results as PCRE2. Also, apart from comment
  318 lines, none of the other command lines are permitted, because they and many
  319 of the modifiers are specific to <b>pcre2test</b>, and should not be used in
  320 test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
  321 command helps detect tests that are accidentally put in the wrong file.
  322 <pre>
  323   #pop [&#60;modifiers&#62;]
  324 </pre>
  325 This command is used to manipulate the stack of compiled patterns, as described
  326 in the section entitled "Saving and restoring compiled patterns"
  327 <a href="#saverestore">below.</a>
  328 <pre>
  329   #save &#60;filename&#62;
  330 </pre>
  331 This command is used to save a set of compiled patterns to a file, as described
  332 in the section entitled "Saving and restoring compiled patterns"
  333 <a href="#saverestore">below.</a>
  334 <pre>
  335   #subject &#60;modifier-list&#62;
  336 </pre>
  337 This command sets a default modifier list that applies to all subsequent
  338 subject lines. Modifiers on a subject line can change these settings.
  339 </P>
  340 <br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
  341 <P>
  342 Modifier lists are used with both pattern and subject lines. Items in a list
  343 are separated by commas and optional white space. Some modifiers may be given
  344 for both patterns and subject lines, whereas others are valid for one or the
  345 other only. Each modifier has a long name, for example "anchored", and some of
  346 them must be followed by an equals sign and a value, for example, "offset=12".
  347 Modifiers that do not take values may be preceded by a minus sign to turn off a
  348 previous setting.
  349 </P>
  350 <P>
  351 A few of the more common modifiers can also be specified as single letters, for
  352 example "i" for "caseless". In documentation, following the Perl convention,
  353 these are written with a slash ("the /i modifier") for clarity. Abbreviated
  354 modifiers must all be concatenated in the first item of a modifier list. If the
  355 first item is not recognized as a long modifier name, it is interpreted as a
  356 sequence of these abbreviations. For example:
  357 <pre>
  358   /abc/ig,newline=cr,jit=3
  359 </pre>
  360 This is a pattern line whose modifier list starts with two one-letter modifiers
  361 (/i and /g). The lower-case abbreviated modifiers are the same as used in Perl.
  362 </P>
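<P>
The rules above can be sketched in a few lines of Python. This is an
illustration only, not how <b>pcre2test</b> itself parses modifiers: the
function name <b>parse_modifier_list</b> is hypothetical, and LONG_NAMES is a
small assumed subset of the real long modifier names.
</P>
<pre>
```python
# Assumed, illustrative subset of long modifier names; the real list is longer.
LONG_NAMES = {"anchored", "caseless", "newline", "jit", "offset", "info"}


def parse_modifier_list(text):
    """Split a modifier list into (abbreviations, long_modifiers)."""
    # Items are separated by commas and optional white space.
    items = [item.strip() for item in text.split(",") if item.strip()]
    abbreviations = []
    long_modifiers = {}
    for index, item in enumerate(items):
        negated = item.startswith("-")
        name, _, value = item.lstrip("-").partition("=")
        if index == 0 and name not in LONG_NAMES:
            # First item is not a long name: read it as single letters.
            abbreviations = list(item)
            continue
        # A leading minus turns a setting off; "name=value" keeps the value.
        long_modifiers[name] = False if negated else (value or True)
    return abbreviations, long_modifiers


print(parse_modifier_list("ig,newline=cr,jit=3"))
```
</pre>
<P>
For the modifier list of the example pattern line, the sketch reports the
abbreviations i and g, followed by the long modifiers newline=cr and jit=3.
</P>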
  363 <br><a name="SEC8" href="#TOC1">PATTERN SYNTAX</a><br>
  364 <P>
  365 A pattern line must start with one of the following characters (common symbols,
  366 excluding pattern meta-characters):
  367 <pre>
  368   / ! " ' ` - = _ : ; , % & @ ~
  369 </pre>
  370 This is interpreted as the pattern's delimiter. A regular expression may be
  371 continued over several input lines, in which case the newline characters are
  372 included within it. It is possible to include the delimiter within the pattern
  373 by escaping it with a backslash, for example
  374 <pre>
  375   /abc\/def/
  376 </pre>
  377 If you do this, the escape and the delimiter form part of the pattern, but
  378 since the delimiters are all non-alphanumeric, this does not affect its
  379 interpretation. If the terminating delimiter is immediately followed by a
  380 backslash, for example,
  381 <pre>
  382   /abc/\
  383 </pre>
  384 then a backslash is added to the end of the pattern. This is done to provide a
  385 way of testing the error condition that arises if a pattern finishes with a
  386 backslash, because
  387 <pre>
  388   /abc\/
  389 </pre>
  390 is interpreted as the first line of a pattern that starts with "abc/", causing
  391 pcre2test to read the next line as a continuation of the regular expression.
  392 </P>
  393 <P>
  394 A pattern can be followed by a modifier list (details below).
  395 </P>
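<P>
As a rough illustration of the delimiter rules described above, the following
Python sketch splits a single input line into a pattern and the text that
follows the closing delimiter. The function name <b>split_pattern_line</b> is
hypothetical, and treating anything after a trailing backslash as a modifier
list is an assumption; <b>pcre2test</b> itself is written in C and may differ in
detail.
</P>
<pre>
```python
DELIMITERS = "/!\"'`-=_:;,%&@~"


def split_pattern_line(line):
    """Return (pattern, rest), or None if the pattern continues on the next line."""
    delimiter = line[0]
    if delimiter not in DELIMITERS:
        raise ValueError("not a valid pattern delimiter")
    position = 1
    while position < len(line):
        if line[position] == "\\":
            # The escape and the escaped character both stay in the pattern.
            position += 2
        elif line[position] == delimiter:
            pattern, rest = line[1:position], line[position + 1:]
            if rest.startswith("\\"):
                # Delimiter immediately followed by a backslash:
                # the backslash is added to the end of the pattern.
                return pattern + "\\", rest[1:]
            return pattern, rest
        else:
            position += 1
    return None


print(split_pattern_line(r"/abc\/def/"))
print(split_pattern_line("/abc/\\"))
print(split_pattern_line(r"/abc\/"))
```
</pre>
<P>
The third call returns None because the only other delimiter on the line is
escaped, so the line is just the first line of a longer pattern.
</P>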
  396 <br><a name="SEC9" href="#TOC1">SUBJECT LINE SYNTAX</a><br>
  397 <P>
  398 Before each subject line is passed to <b>pcre2_match()</b> or
  399 <b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
  400 line is scanned for backslash escapes. The following provide a means of
  401 encoding non-printing characters in a visible way:
  402 <pre>
  403   \a         alarm (BEL, \x07)
  404   \b         backspace (\x08)
  405   \e         escape (\x1b)
  406   \f         form feed (\x0c)
  407   \n         newline (\x0a)
  408   \r         carriage return (\x0d)
  409   \t         tab (\x09)
  410   \v         vertical tab (\x0b)
  411   \nnn       octal character (up to 3 octal digits); always
  412                a byte unless &#62; 255 in UTF-8 or 16-bit or 32-bit mode
  413   \o{dd...}  octal character (any number of octal digits)
  414   \xhh       hexadecimal byte (up to 2 hex digits)
  415   \x{hh...}  hexadecimal character (any number of hex digits)
  416 </pre>
  417 The use of \x{hh...} is not dependent on the use of the <b>utf</b> modifier on
  418 the pattern. It is recognized always. There may be any number of hexadecimal
  419 digits inside the braces; invalid values provoke error messages.
  420 </P>
  421 <P>
  422 Note that \xhh specifies one byte rather than one character in UTF-8 mode;
  423 this makes it possible to construct invalid UTF-8 sequences for testing
  424 purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in
  425 UTF-8 mode, generating more than one byte if the value is greater than 127.
  426 When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte
  427 for values less than 256, and causes an error for greater values.
  428 </P>
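<P>
The difference between the two notations in UTF-8 mode can be seen with a
small Python sketch. It uses only Python's own string handling to show the
bytes involved and does not call PCRE2; the value 0xE9 is an arbitrary example
greater than 127.
</P>
<pre>
```python
code_point = 0xE9  # arbitrary example value greater than 127

# \xe9 stands for exactly one byte, even in UTF-8 mode.
raw_byte = bytes([code_point])

# \x{e9} stands for the character, which UTF-8 encodes as two bytes.
utf8_char = chr(code_point).encode("utf-8")

# A lone 0xE9 byte is not a valid UTF-8 sequence.
try:
    raw_byte.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False

print(raw_byte.hex(), utf8_char.hex(), valid)
```
</pre>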
  429 <P>
  430 In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
  431 possible to construct invalid UTF-16 sequences for testing purposes.
  432 </P>
  433 <P>
  434 In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it
  435 possible to construct invalid UTF-32 sequences for testing purposes.
  436 </P>
  437 <P>
  438 There is a special backslash sequence that specifies replication of one or more
  439 characters:
  440 <pre>
  441   \[&#60;characters&#62;]{&#60;count&#62;}
  442 </pre>
  443 This makes it possible to test long strings without having to provide them as
  444 part of the file. For example:
  445 <pre>
  446   \[abc]{4}
  447 </pre>
  448 is converted to "abcabcabcabc". This feature does not support nesting. To
  449 include a closing square bracket in the characters, code it as \x5D.
  450 </P>
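<P>
The replication feature can be approximated in Python with a regular
expression substitution. This is a simplified sketch under stated assumptions:
the name <b>expand_replication</b> is hypothetical, and escapes inside the
square brackets (such as \x5D) are not interpreted here, although
<b>pcre2test</b> does process them.
</P>
<pre>
```python
import re

# A backslash, then [characters], then {count}; nesting is not supported.
REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(subject):
    """Expand every replication item in a subject line."""
    return REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), subject)


print(expand_replication(r"\[abc]{4}"))
```
</pre>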
  451 <P>
  452 A backslash followed by an equals sign marks the end of the subject string and
  453 the start of a modifier list. For example:
  454 <pre>
  455   abc\=notbol,notempty
  456 </pre>
  457 A backslash followed by any other non-alphanumeric character just escapes that
  458 character. A backslash followed by anything else causes an error. However, if
  459 the very last character in the line is a backslash (and there is no modifier
  460 list), it is ignored. This gives a way of passing an empty line as data, since
  461 a real empty line terminates the data input.
  462 </P>
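<P>
A minimal Python sketch of how a subject line divides into data and a modifier
list is shown here. The function name <b>split_subject_line</b> is
hypothetical. The sketch only locates the first backslash that is followed by
an equals sign; it does not interpret the other escapes or strip white space.
</P>
<pre>
```python
def split_subject_line(line):
    """Return (subject, modifier_items), split at backslash-equals if present."""
    index = line.find("\\")
    while index != -1 and index + 1 != len(line):
        if line[index + 1] == "=":
            items = line[index + 2:].split(",")
            return line[:index], [m.strip() for m in items if m.strip()]
        # Any other escape: skip the backslash and the escaped character.
        index = line.find("\\", index + 2)
    return line, []


print(split_subject_line("abc\\=notbol,notempty"))
```
</pre>
<P>
An escaped backslash followed by an equals sign is left alone, because the
scan skips both characters of each escape before looking for the next one.
</P>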
  463 <br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
  464 <P>
  465 There are three types of modifier that can appear in pattern lines, two of
  466 which may also be used in a <b>#pattern</b> command. A pattern's modifier list
  467 can add to or override default modifiers that were set by a previous
  468 <b>#pattern</b> command.
  469 <a name="optionmodifiers"></a></P>
  470 <br><b>
  471 Setting compilation options
  472 </b><br>
  473 <P>
  474 The following modifiers set options for <b>pcre2_compile()</b>. The most common
  475 ones have single-letter abbreviations. See
  476 <a href="pcre2api.html"><b>pcre2api</b></a>
  477 for a description of their effects.
  478 <pre>
  479       allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
  480       alt_bsux                  set PCRE2_ALT_BSUX
  481       alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
  482       anchored                  set PCRE2_ANCHORED
  483       auto_callout              set PCRE2_AUTO_CALLOUT
  484   /i  caseless                  set PCRE2_CASELESS
  485       dollar_endonly            set PCRE2_DOLLAR_ENDONLY
  486   /s  dotall                    set PCRE2_DOTALL
  487       dupnames                  set PCRE2_DUPNAMES
  488   /x  extended                  set PCRE2_EXTENDED
  489       firstline                 set PCRE2_FIRSTLINE
  490       match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
  491   /m  multiline                 set PCRE2_MULTILINE
  492       never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
  493       never_ucp                 set PCRE2_NEVER_UCP
  494       never_utf                 set PCRE2_NEVER_UTF
  495       no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
  496       no_auto_possess           set PCRE2_NO_AUTO_POSSESS
  497       no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
  498       no_start_optimize         set PCRE2_NO_START_OPTIMIZE
  499       no_utf_check              set PCRE2_NO_UTF_CHECK
  500       ucp                       set PCRE2_UCP
  501       ungreedy                  set PCRE2_UNGREEDY
  502       utf                       set PCRE2_UTF
  503 </pre>
  504 As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
  505 non-printing characters in output strings to be printed using the \x{hh...}
  506 notation. Otherwise, those less than 0x100 are output in hex without the curly
  507 brackets.
  508 <a name="controlmodifiers"></a></P>
  509 <br><b>
  510 Setting compilation controls
  511 </b><br>
  512 <P>
  513 The following modifiers affect the compilation process or request information
  514 about the pattern:
  515 <pre>
  516       bsr=[anycrlf|unicode]     specify \R handling
  517   /B  bincode                   show binary code without lengths
  518       callout_info              show callout information
  519       debug                     same as info,fullbincode
  520       fullbincode               show binary code with lengths
  521   /I  info                      show info about compiled pattern
  522       hex                       pattern is coded in hexadecimal
  523       jit[=&#60;number&#62;]            use JIT
  524       jitfast                   use JIT fast path
  525       jitverify                 verify JIT use
  526       locale=&#60;name&#62;             use this locale
  527       memory                    show memory used
  528       newline=&#60;type&#62;            set newline type
  529       parens_nest_limit=&#60;n&#62;     set maximum parentheses depth
  530       posix                     use the POSIX API
  531       push                      push compiled pattern onto the stack
  532       stackguard=&#60;number&#62;       test the stackguard feature
  533       tables=[0|1|2]            select internal tables
  534 </pre>
  535 The effects of these modifiers are described in the following sections.
  536 </P>
  537 <br><b>
  538 Newline and \R handling
  539 </b><br>
  540 <P>
  541 The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
  542 set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
  543 \R matches any Unicode newline sequence. The default is specified when PCRE2
  544 is built; unless changed at build time, it is Unicode.
  545 </P>
  546 <P>
  547 The <b>newline</b> modifier specifies which characters are to be interpreted as
  548 newlines, both in the pattern and in subject lines. The type must be one of CR,
  549 LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
  550 </P>
  551 <br><b>
  552 Information about a pattern
  553 </b><br>
  554 <P>
  555 The <b>debug</b> modifier is a shorthand for <b>info,fullbincode</b>, requesting
  556 all available information.
  557 </P>
  558 <P>
  559 The <b>bincode</b> modifier causes a representation of the compiled code to be
  560 output after compilation. This information does not contain length and offset
  561 values, which ensures that the same output is generated for different internal
  562 link sizes and different code unit widths. By using <b>bincode</b>, the same
  563 regression tests can be used in different environments.
  564 </P>
  565 <P>
  566 The <b>fullbincode</b> modifier, by contrast, <i>does</i> include length and
  567 offset values. This is used in a few special tests that run only for specific
  568 code unit widths and link sizes, and is also useful for one-off tests.
  569 </P>
  570 <P>
  571 The <b>info</b> modifier requests information about the compiled pattern
  572 (whether it is anchored, has a fixed first character, and so on). The
  573 information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
  574 some typical examples:
  575 <pre>
  576     re&#62; /(?i)(^a|^b)/m,info
  577   Capturing subpattern count = 1
  578   Compile options: multiline
  579   Overall options: caseless multiline
  580   First code unit at start or follows newline
  581   Subject length lower bound = 1
  583     re&#62; /(?i)abc/info
  584   Capturing subpattern count = 0
  585   Compile options: &#60;none&#62;
  586   Overall options: caseless
  587   First code unit = 'a' (caseless)
  588   Last code unit = 'c' (caseless)
  589   Subject length lower bound = 3
  590 </pre>
  591 "Compile options" are those specified by modifiers; "overall options" have
  592 added options that are taken or deduced from the pattern. If both sets of
  593 options are the same, just a single "options" line is output; if there are no
  594 options, the line is omitted. "First code unit" is where any match must start;
  595 if there is more than one they are listed as "starting code units". "Last code
  596 unit" is the last literal code unit that must be present in any match. This is
  597 not necessarily the last character. These lines are omitted if no starting or
  598 ending code units are recorded.
  599 </P>
  600 <P>
  601 The <b>callout_info</b> modifier requests information about all the callouts in
  602 the pattern. A list of them is output at the end of any other information that
  603 is requested. For each callout, either its number or string is given, followed
  604 by the item that follows it in the pattern.
  605 </P>
  606 <br><b>
  607 Specifying a pattern in hex
  608 </b><br>
  609 <P>
  610 The <b>hex</b> modifier specifies that the characters of the pattern are to be
  611 interpreted as pairs of hexadecimal digits. White space is permitted between
  612 pairs. For example:
  613 <pre>
  614   /ab 32 59/hex
  615 </pre>
  616 This feature is provided as a way of creating patterns that contain binary zero
  617 and other non-printing characters. By default, <b>pcre2test</b> passes patterns
  618 as zero-terminated strings to <b>pcre2_compile()</b>, giving the length as
  619 PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the
  620 actual length of the pattern is passed.
  621 </P>
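<P>
To see which bytes a hexadecimal pattern stands for, the pairs can be decoded
with a short Python sketch. The helper name <b>decode_hex_pattern</b> is
hypothetical, and the sketch is slightly more permissive than the description
above because it removes white space wherever it occurs, not only between
pairs.
</P>
<pre>
```python
def decode_hex_pattern(text):
    """Turn pairs of hex digits, optionally separated by white space, into bytes."""
    digits = "".join(text.split())
    return bytes.fromhex(digits)


pattern = decode_hex_pattern("ab 32 59")
print(pattern, len(pattern))

# A binary zero is an ordinary byte here, which is why the actual length
# has to be passed instead of relying on a terminating zero.
with_zero = decode_hex_pattern("61 00 62")
print(with_zero, len(with_zero))
```
</pre>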
  622 <br><b>
  623 JIT compilation
  624 </b><br>
  625 <P>
  626 The <b>/jit</b> modifier may optionally be followed by an equals sign and a
  627 number in the range 0 to 7:
  628 <pre>
  629   0  disable JIT
  630   1  use JIT for normal match only
  631   2  use JIT for soft partial match only
  632   3  use JIT for normal match and soft partial match
  633   4  use JIT for hard partial match only
  634   6  use JIT for soft and hard partial match
  635   7  all three modes
  636 </pre>
  637 If no number is given, 7 is assumed. If JIT compilation is successful, the
  638 compiled JIT code will automatically be used when <b>pcre2_match()</b> is run
  639 for the appropriate type of match, except when incompatible run-time options
  640 are specified. For more details, see the
  641 <a href="pcre2jit.html"><b>pcre2jit</b></a>
  642 documentation. See also the <b>jitstack</b> modifier below for a way of
  643 setting the size of the JIT stack.
  644 </P>
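<P>
The table of <b>jit</b> values is consistent with a three-bit mask in which 1
selects normal matching, 2 soft partial matching, and 4 hard partial matching.
The Python sketch below decodes a value on that assumption; the name
<b>describe_jit</b> is hypothetical. On the same reading, the value 5, which
the table does not list, would select normal and hard partial matching, but
that is an inference and is not stated in this document.
</P>
<pre>
```python
# Assumed bit meanings, inferred from the table of jit values.
JIT_MODES = {
    1: "normal match",
    2: "soft partial match",
    4: "hard partial match",
}


def describe_jit(number):
    """List the match types selected by a jit=number value (0 to 7)."""
    if number not in range(8):
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if number & bit]


for value in (0, 3, 6, 7):
    print(value, describe_jit(value))
```
</pre>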
  645 <P>
  646 If the <b>jitfast</b> modifier is specified, matching is done using the JIT
  647 "fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
  648 checks that are done by <b>pcre2_match()</b>, and of course does not work when
  649 JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
  650 assumed.
  651 </P>
  652 <P>
  653 If the <b>jitverify</b> modifier is specified, information about the compiled
  654 pattern shows whether JIT compilation was or was not successful. If
  655 <b>jitverify</b> is specified without <b>jit</b>, jit=7 is assumed. If JIT
  656 compilation is successful when <b>jitverify</b> is set, the text "(JIT)" is
  657 added to the first output line after a match or non match when JIT-compiled
  658 code was actually used in the match.
  659 </P>
  660 <br><b>
  661 Setting a locale
  662 </b><br>
  663 <P>
  664 The <b>/locale</b> modifier must specify the name of a locale, for example:
  665 <pre>
  666   /pattern/locale=fr_FR
  667 </pre>
  668 The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
  669 character tables for the locale, and this is then passed to
  670 <b>pcre2_compile()</b> when compiling the regular expression. The same tables
  671 are used when matching the following subject lines. The <b>/locale</b> modifier
  672 applies only to the pattern on which it appears, but can be given in a
  673 <b>#pattern</b> command if a default is needed. Setting a locale and alternate
  674 character tables are mutually exclusive.
  675 </P>
  676 <br><b>
  677 Showing pattern memory
  678 </b><br>
  679 <P>
  680 The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
  681 the compiled pattern to be output. This does not include the size of the
  682 <b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
  683 subsequently passed to the JIT compiler, the size of the JIT compiled code is
  684 also output. Here is an example:
  685 <pre>
  686     re&#62; /a(b)c/jit,memory
  687   Memory allocation (code space): 21
  688   Memory allocation (JIT code): 1910
  690 </PRE>
  691 </P>
  692 <br><b>
  693 Limiting nested parentheses
  694 </b><br>
  695 <P>
  696 The <b>parens_nest_limit</b> modifier sets a limit on the depth of nested
  697 parentheses in a pattern. Breaching the limit causes a compilation error.
  698 The default for the library is set when PCRE2 is built, but <b>pcre2test</b>
  699 sets its own default of 220, which is required for running the standard test
  700 suite.
  701 </P>
  702 <br><b>
  703 Using the POSIX wrapper API
  704 </b><br>
  705 <P>
  706 The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX
  707 wrapper API rather than its native API. This supports only the 8-bit library.
  708 When the POSIX API is being used, the following pattern modifiers set options
  709 for the <b>regcomp()</b> function:
  710 <pre>
  711   caseless           REG_ICASE
  712   multiline          REG_NEWLINE
  713   no_auto_capture    REG_NOSUB
  714   dotall             REG_DOTALL     )
  715   ungreedy           REG_UNGREEDY   ) These options are not part of
  716   ucp                REG_UCP        )   the POSIX standard
  717   utf                REG_UTF8       )
  718 </pre>
  719 The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
  720 below. All other modifiers cause an error.
  721 </P>
  722 <br><b>
  723 Testing the stack guard feature
  724 </b><br>
  725 <P>
  726 The <b>/stackguard</b> modifier is used to test the use of
  727 <b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
  728 enable stack availability to be checked during compilation (see the
  729 <a href="pcre2api.html"><b>pcre2api</b></a>
  730 documentation for details). If the number specified by the modifier is greater
  731 than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up a
  732 callback from <b>pcre2_compile()</b> to a local function. The argument it
  733 receives is the current nesting parenthesis depth; if this is greater than the
  734 value given by the modifier, non-zero is returned, causing the compilation to
  735 be aborted.
  736 </P>
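<P>
The decision made by the local guard function can be sketched as follows. This
is a toy model in Python, not the code used by <b>pcre2test</b>: the names
<b>make_guard</b> and <b>nesting_ok</b> are hypothetical, and the toy scanner
ignores escapes and character classes. It only shows the rule "return non-zero
when the depth exceeds the modifier value".
</P>

```python
def make_guard(limit: int):
    """Build a guard that objects when the nesting depth exceeds 'limit'."""
    def guard(depth: int) -> int:
        # Non-zero means "abort the compilation".
        return 1 if depth > limit else 0
    return guard


def nesting_ok(pattern: str, guard) -> bool:
    """Toy stand-in for a compiler: call the guard at each opening parenthesis."""
    depth = 0
    for ch in pattern:
        if ch == "(":
            depth += 1
            if guard(depth) != 0:
                return False
        elif ch == ")":
            depth -= 1
    return True


guard = make_guard(2)
print(nesting_ok("((a))", guard), nesting_ok("(((a)))", guard))
```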
  737 <br><b>
  738 Using alternative character tables
  739 </b><br>
  740 <P>
  741 The value specified for the <b>/tables</b> modifier must be one of the digits 0,
  742 1, or 2. It causes a specific set of built-in character tables to be passed to
  743 <b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
  744 different character tables. The digit specifies the tables as follows:
  745 <pre>
  746   0   do not pass any special character tables
  747   1   the default ASCII tables, as distributed in
  748         pcre2_chartables.c.dist
  749   2   a set of tables defining ISO 8859 characters
  750 </pre>
  751 In table 2, some characters whose codes are greater than 128 are identified as
  752 letters, digits, spaces, etc. Setting alternate character tables and a locale
  753 are mutually exclusive.
  754 </P>
  755 <br><b>
  756 Setting certain match controls
  757 </b><br>
  758 <P>
  759 The following modifiers are really subject modifiers, and are described below.
  760 However, they may be included in a pattern's modifier list, in which case they
  761 are applied to every subject line that is processed with that pattern. They do
  762 not affect the compilation process.
  763 <pre>
  764       aftertext           show text after match
  765       allaftertext        show text after captures
  766       allcaptures         show all captures
  767       allusedtext         show all consulted text
  768   /g  global              global matching
  769       mark                show mark values
  770       replace=&#60;string&#62;    specify a replacement string
  771       startchar           show starting character when relevant
  772 </pre>
  773 These modifiers may not appear in a <b>#pattern</b> command. If you want them as
  774 defaults, set them in a <b>#subject</b> command.
  775 </P>
  776 <br><b>
  777 Saving a compiled pattern
  778 </b><br>
  779 <P>
  780 When a pattern with the <b>push</b> modifier is successfully compiled, it is
  781 pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
  782 line to contain a new pattern (or a command) instead of a subject line. This
  783 facility is used when saving compiled patterns to a file, as described in the
  784 section entitled "Saving and restoring compiled patterns"
  785 <a href="#saverestore">below.</a>
  786 The <b>push</b> modifier is incompatible with pattern-line modifiers such as
  787 <b>global</b> that act at match time. Any that are specified are ignored, with a
  788 warning message, except for <b>replace</b>, which causes an error. Note that
  789 <b>jitverify</b>, which is allowed, does not carry through to any subsequent
  790 matching that uses this pattern.
  791 </P>
  792 <br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
  793 <P>
  794 The modifiers that can appear in subject lines and the <b>#subject</b>
  795 command are of two types.
  796 </P>
  797 <br><b>
  798 Setting match options
  799 </b><br>
  800 <P>
  801 The following modifiers set options for <b>pcre2_match()</b> or
  802 <b>pcre2_dfa_match()</b>. See
  803 <a href="pcre2api.html"><b>pcre2api</b></a>
  804 for a description of their effects.
  805 <pre>
  806       anchored                  set PCRE2_ANCHORED
  807       dfa_restart               set PCRE2_DFA_RESTART
  808       dfa_shortest              set PCRE2_DFA_SHORTEST
  809       no_utf_check              set PCRE2_NO_UTF_CHECK
  810       notbol                    set PCRE2_NOTBOL
  811       notempty                  set PCRE2_NOTEMPTY
  812       notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
  813       noteol                    set PCRE2_NOTEOL
  814       partial_hard (or ph)      set PCRE2_PARTIAL_HARD
  815       partial_soft (or ps)      set PCRE2_PARTIAL_SOFT
  816 </pre>
  817 The partial matching modifiers are provided with abbreviations because they
  818 appear frequently in tests.
  819 </P>
  820 <P>
  821 If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
  822 wrapper API to be used, the only option-setting modifiers that have any effect
  823 are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
  824 REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
  825 Any other modifiers cause an error.
  826 </P>
  827 <br><b>
  828 Setting match controls
  829 </b><br>
  830 <P>
  831 The following modifiers affect the matching process or request additional
  832 information. Some of them may also be specified on a pattern line (see above),
  833 in which case they apply to every subject line that is matched against that
  834 pattern.
  835 <pre>
  836       aftertext                 show text after match
  837       allaftertext              show text after captures
  838       allcaptures               show all captures
  839       allusedtext               show all consulted text (non-JIT only)
  840       altglobal                 alternative global matching
  841       callout_capture           show captures at callout time
  842       callout_data=&#60;n&#62;          set a value to pass via callouts
  843       callout_fail=&#60;n&#62;[:&#60;m&#62;]    control callout failure
  844       callout_none              do not supply a callout function
  845       copy=&#60;number or name&#62;     copy captured substring
  846       dfa                       use <b>pcre2_dfa_match()</b>
  847       find_limits               find match and recursion limits
  848       get=&#60;number or name&#62;      extract captured substring
  849       getall                    extract all captured substrings
  850   /g  global                    global matching
  851       jitstack=&#60;n&#62;              set size of JIT stack
  852       mark                      show mark values
  853       match_limit=&#60;n&#62;           set a match limit
  854       memory                    show memory usage
  855       offset=&#60;n&#62;                set starting offset
  856       ovector=&#60;n&#62;               set size of output vector
  857       recursion_limit=&#60;n&#62;       set a recursion limit
  858       replace=&#60;string&#62;          specify a replacement string
  859       startchar                 show startchar when relevant
  860       zero_terminate            pass the subject as zero-terminated
  861 </pre>
  862 The effects of these modifiers are described in the following sections.
  863 </P>
  864 <br><b>
  865 Showing more text
  866 </b><br>
  867 <P>
  868 The <b>aftertext</b> modifier requests that as well as outputting the part of
  869 the subject string that matched the entire pattern, <b>pcre2test</b> should in
  870 addition output the remainder of the subject string. This is useful for tests
  871 where the subject contains multiple copies of the same substring. The
  872 <b>allaftertext</b> modifier requests the same action for captured substrings as
  873 well as the main matched substring. In each case the remainder is output on the
  874 following line with a plus character following the capture number.
  875 </P>
  876 <P>
  877 The <b>allusedtext</b> modifier requests that all the text that was consulted
  878 during a successful pattern match by the interpreter should be shown. This
  879 feature is not supported for JIT matching, and if requested with JIT it is
  880 ignored (with a warning message). Setting this modifier affects the output if
  881 there is a lookbehind at the start of a match, or a lookahead at the end, or if
  882 \K is used in the pattern. Characters that precede or follow the start and end
  883 of the actual match are indicated in the output by '&#60;' or '&#62;' characters
  884 underneath them. Here is an example:
  885 <pre>
  886     re&#62; /(?&#60;=pqr)abc(?=xyz)/
  887   data&#62; 123pqrabcxyz456\=allusedtext
  888    0: pqrabcxyz
  889       &#60;&#60;&#60;   &#62;&#62;&#62;
  890 </pre>
  891 This shows that the matched string is "abc", with the preceding and following
  892 strings "pqr" and "xyz" having been consulted during the match (when processing
  893 the assertions).
  894 </P>
  895 <P>
  896 The <b>startchar</b> modifier requests that the starting character for the match
  897 be indicated, if it is different to the start of the matched string. The only
  898 time when this occurs is when \K has been processed as part of the match. In
  899 this situation, the output for the matched string is displayed from the
  900 starting character instead of from the match point, with circumflex characters
  901 under the earlier characters. For example:
  902 <pre>
  903     re&#62; /abc\Kxyz/
  904   data&#62; abcxyz\=startchar
  905    0: abcxyz
  906       ^^^
  907 </pre>
  908 Unlike <b>allusedtext</b>, the <b>startchar</b> modifier can be used with JIT.
  909 However, these two modifiers are mutually exclusive.
  910 </P>
  911 <br><b>
  912 Showing the value of all capture groups
  913 </b><br>
  914 <P>
  915 The <b>allcaptures</b> modifier requests that the values of all potential
  916 captured parentheses be output after a match. By default, only those up to the
  917 highest one actually used in the match are output (corresponding to the return
  918 code from <b>pcre2_match()</b>). Groups that did not take part in the match
  919 are output as "&#60;unset&#62;".
  920 </P>
  921 <br><b>
  922 Testing callouts
  923 </b><br>
  924 <P>
  925 A callout function is supplied when <b>pcre2test</b> calls the library matching
  926 functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
  927 set, the current captured groups are output when a callout occurs.
  928 </P>
  929 <P>
  930 The <b>callout_fail</b> modifier can be given one or two numbers. If there is
  931 only one number, 1 is returned instead of 0 when a callout of that number is
  932 reached. If two numbers are given, 1 is returned when callout &#60;n&#62; is reached
  933 for the &#60;m&#62;th time. Note that callouts with string arguments are always given
  934 the number zero. See "Callouts" below for a description of the output when a
  935 callout is taken.
  936 </P>
  937 <P>
  938 The <b>callout_data</b> modifier can be given an unsigned or a negative number.
  939 This is set as the "user data" that is passed to the matching function, and
  940 passed back when the callout function is invoked. Any value other than zero is
  941 used as a return from <b>pcre2test</b>'s callout function.
  942 </P>
  943 <br><b>
  944 Finding all matches in a string
  945 </b><br>
  946 <P>
  947 Searching for all possible matches within a subject can be requested by the
  948 <b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
  949 function is called again to search the remainder of the subject. The difference
  950 between <b>global</b> and <b>altglobal</b> is that the former uses the
  951 <i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
  952 to start searching at a new point within the entire string (which is what Perl
  953 does), whereas the latter passes over a shortened subject. This makes a
  954 difference to the matching process if the pattern begins with a lookbehind
  955 assertion (including \b or \B).
  956 </P>
  957 <P>
  958 If an empty string is matched, the next match is done with the
  959 PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
  960 another, non-empty, match at the same point in the subject. If this match
  961 fails, the start offset is advanced, and the normal match is retried. This
  962 imitates the way Perl handles such cases when using the <b>/g</b> modifier or
  963 the <b>split()</b> function. Normally, the start offset is advanced by one
  964 character, but if the newline convention recognizes CRLF as a newline, and the
  965 current character is CR followed by LF, an advance of two characters occurs.
  966 </P>
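<P>
The rule for advancing the start offset after the retry fails can be written
down as a small Python sketch. The function name and its arguments are
hypothetical, and for simplicity the subject is treated as a Python string
rather than as a sequence of code units.
</P>

```python
def next_start_offset(subject: str, offset: int, crlf_is_newline: bool) -> int:
    """Advance past one character, or past CRLF when CRLF counts as a newline."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    return offset + 1


print(next_start_offset("a\r\nb", 1, True))   # CR followed by LF: skip both
print(next_start_offset("a\r\nb", 1, False))  # CRLF not recognized: skip one
```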
  967 <br><b>
  968 Testing substring extraction functions
  969 </b><br>
  970 <P>
  971 The <b>copy</b> and <b>get</b> modifiers can be used to test the
  972 <b>pcre2_substring_copy_xxx()</b> and <b>pcre2_substring_get_xxx()</b> functions.
  973 They can be given more than once, and each can specify a group name or number,
  974 for example:
  975 <pre>
  976    abcd\=copy=1,copy=3,get=G1
  977 </pre>
  978 If the <b>#subject</b> command is used to set default copy and/or get lists,
  979 these can be unset by specifying a negative number to cancel all numbered
  980 groups and an empty name to cancel all named groups.
  981 </P>
  982 <P>
  983 The <b>getall</b> modifier tests <b>pcre2_substring_list_get()</b>, which
  984 extracts all captured substrings.
  985 </P>
  986 <P>
  987 If the subject line is successfully matched, the substrings extracted by the
  988 convenience functions are output with C, G, or L after the string number
  989 instead of a colon. This is in addition to the normal full list. The string
  990 length (that is, the return from the extraction function) is given in
  991 parentheses after each substring, followed by the name when the extraction was
  992 by name.
  993 </P>
  994 <br><b>
  995 Testing the substitution function
  996 </b><br>
  997 <P>
  998 If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
  999 called instead of one of the matching functions. Unlike subject strings,
 1000 <b>pcre2test</b> does not process replacement strings for escape sequences. In
 1001 UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
 1002 If so, it is correctly converted to a UTF string of the appropriate code unit
 1003 width. If it is not a valid UTF-8 string, the individual code units are copied
 1004 directly. This provides a means of passing an invalid UTF-8 string for testing
 1005 purposes.
 1006 </P>
 1007 <P>
 1008 If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
 1009 <b>pcre2_substitute()</b>. After a successful substitution, the modified string
 1010 is output, preceded by the number of replacements. This may be zero if there
 1011 were no matches. Here is a simple example of a substitution test:
 1012 <pre>
 1013   /abc/replace=xxx
 1014       =abc=abc=
 1015    1: =xxx=abc=
 1016       =abc=abc=\=global
 1017    2: =xxx=xxx=
 1018 </pre>
 1019 Subject and replacement strings should be kept relatively short for
 1020 substitution tests, as fixed-size buffers are used. To make it easy to test for
 1021 buffer overflow, if the replacement string starts with a number in square
 1022 brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
 1023 output buffer, with the replacement string starting at the next character. Here
 1024 is an example that tests the edge case:
 1025 <pre>
 1026   /abc/
 1027       123abc123\=replace=[10]XYZ
 1028    1: 123XYZ123
 1029       123abc123\=replace=[9]XYZ
 1030   Failed: error -47: no more memory
 1031 </pre>
 1032 A replacement string is ignored with POSIX and DFA matching. Specifying partial
 1033 matching provokes an error return ("bad option value") from
 1034 <b>pcre2_substitute()</b>.
 1035 </P>
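<P>
The handling of a size prefix such as "[10]" can be illustrated with a short
Python sketch. The helper names are hypothetical. The <b>fits</b> check assumes
that the output buffer must also hold a terminating zero, which is consistent
with the example above (nine characters fit in a buffer of 10 but not in a
buffer of 9).
</P>

```python
import re

# A number in square brackets at the very start, then the real replacement.
_SIZE_PREFIX = re.compile(r"\[(\d+)\](.*)", re.DOTALL)


def split_replacement(text: str):
    """Return (buffer_size or None, replacement string)."""
    m = _SIZE_PREFIX.fullmatch(text)
    if m:
        return int(m.group(1)), m.group(2)
    return None, text


def fits(result: str, size: int) -> bool:
    """Assumption: one extra code unit is needed for the terminating zero."""
    return len(result) + 1 <= size


print(split_replacement("[10]XYZ"))
print(fits("123XYZ123", 10), fits("123XYZ123", 9))
```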
 1036 <br><b>
 1037 Setting the JIT stack size
 1038 </b><br>
 1039 <P>
 1040 The <b>jitstack</b> modifier provides a way of setting the maximum stack size
 1041 that is used by the just-in-time optimization code. It is ignored if JIT
 1042 optimization is not being used. The value is a number of kilobytes. Providing a
 1043 stack that is larger than the default 32K is necessary only for very
 1044 complicated patterns.
 1045 </P>
 1046 <br><b>
 1047 Setting match and recursion limits
 1048 </b><br>
 1049 <P>
 1050 The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
 1051 limits in the match context. These values are ignored when the
 1052 <b>find_limits</b> modifier is specified.
 1053 </P>
 1054 <br><b>
 1055 Finding minimum limits
 1056 </b><br>
 1057 <P>
 1058 If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
 1059 <b>pcre2_match()</b> several times, setting different values in the match
 1060 context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
 1061 until it finds the minimum values for each parameter that allow
 1062 <b>pcre2_match()</b> to complete without error.
 1063 </P>
 1064 <P>
 1065 If JIT is being used, only the match limit is relevant. If DFA matching is
 1066 being used, neither limit is relevant, and this modifier is ignored (with a
 1067 warning message).
 1068 </P>
 1069 <P>
 1070 The <i>match_limit</i> number is a measure of the amount of backtracking
 1071 that takes place, and learning the minimum value can be instructive. For most
 1072 simple matches, the number is quite small, but for patterns with very large
 1073 numbers of matching possibilities, it can become large very quickly with
 1074 increasing length of subject string. The <i>recursion_limit</i> number is
 1075 a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
 1076 heap) memory is needed to complete the match attempt.
 1077 </P>
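<P>
One plausible way to search for a minimum limit is to double a trial value
until the match completes and then bisect. The Python sketch below shows that
idea only; it is not claimed to be the search used by <b>pcre2test</b>. The
callable <b>succeeds</b> is a hypothetical stand-in for "run the match with
this limit and report whether it completed without a limit error".
</P>

```python
def find_min_limit(succeeds, upper: int = 1 << 31) -> int:
    """Smallest limit in 1..upper for which succeeds(limit) is true."""
    lo, hi = 0, 1  # a limit of 0 is assumed never to succeed
    while not succeeds(hi):
        if hi >= upper:
            raise ValueError("no limit up to 'upper' allows the match")
        lo = hi
        hi = min(hi * 2, upper)
    # Here succeeds(hi) is true and every value up to lo is known to fail.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if succeeds(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Pretend that the match needs a limit of at least 37 to complete.
print(find_min_limit(lambda limit: limit >= 37))
```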
 1078 <br><b>
 1079 Showing MARK names
 1080 </b><br>
 1081 <P>
 1082 The <b>mark</b> modifier causes the names from backtracking control verbs that
 1083 are returned from calls to <b>pcre2_match()</b> to be displayed. If a mark is
 1084 returned for a match, non-match, or partial match, <b>pcre2test</b> shows it.
 1085 For a match, it is on a line by itself, tagged with "MK:". Otherwise, it
 1086 is added to the non-match message.
 1087 </P>
 1088 <br><b>
 1089 Showing memory usage
 1090 </b><br>
 1091 <P>
 1092 The <b>memory</b> modifier causes <b>pcre2test</b> to log all memory allocation
 1093 and freeing calls that occur during a match operation.
 1094 </P>
 1095 <br><b>
 1096 Setting a starting offset
 1097 </b><br>
 1098 <P>
 1099 The <b>offset</b> modifier sets an offset in the subject string at which
 1100 matching starts. Its value is a number of code units, not characters.
 1101 </P>
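<P>
The difference between code units and characters matters when the subject
contains characters that need more than one code unit in a UTF encoding. The
following Python sketch converts a character index into a code unit offset for
each code unit width; the function name is hypothetical.
</P>

```python
def code_unit_offset(subject: str, char_index: int, width: int = 8) -> int:
    """Number of code units that precede the character at char_index."""
    prefix = subject[:char_index]
    if width == 8:
        return len(prefix.encode("utf-8"))
    if width == 16:
        return len(prefix.encode("utf-16-le")) // 2
    if width == 32:
        return len(prefix)
    raise ValueError("width must be 8, 16 or 32")


# "h" is one UTF-8 code unit and "\u00e9" (e with acute accent) is two, so the
# third character starts at code unit 3 in the 8-bit library.
print(code_unit_offset("h\u00e9llo", 2, 8))
```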
 1102 <br><b>
 1103 Setting the size of the output vector
 1104 </b><br>
 1105 <P>
 1106 The <b>ovector</b> modifier applies only to the subject line in which it
 1107 appears, though of course it can also be used to set a default in a
 1108 <b>#subject</b> command. It specifies the number of pairs of offsets that are
 1109 available for storing matching information. The default is 15.
 1110 </P>
 1111 <P>
 1112 A value of zero is useful when testing the POSIX API because it causes
 1113 <b>regexec()</b> to be called with a NULL capture vector. When not testing the
 1114 POSIX API, a value of zero is used to cause
 1115 <b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
 1116 match block of exactly the right size for the pattern. (It is not possible to
 1117 create a match block with a zero-length ovector; there is always at least one
 1118 pair of offsets.)
 1119 </P>
 1120 <br><b>
 1121 Passing the subject as zero-terminated
 1122 </b><br>
 1123 <P>
 1124 By default, the subject string is passed to a native API matching function with
 1125 its correct length. In order to test the facility for passing a zero-terminated
 1126 string, the <b>zero_terminate</b> modifier is provided. It causes the length to
 1127 be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
 1128 this modifier has no effect, as there is no facility for passing a length.)
 1129 </P>
 1130 <P>
 1131 When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
 1132 passing the replacement string as zero-terminated.
 1133 </P>
 1134 <br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
 1135 <P>
 1136 By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
 1137 <b>pcre2_match()</b> to match each subject line. PCRE2 also supports an
 1138 alternative matching function, <b>pcre2_dfa_match()</b>, which operates in a
 1139 different way, and has some restrictions. The differences between the two
 1140 functions are described in the
 1141 <a href="pcre2matching.html"><b>pcre2matching</b></a>
 1142 documentation.
 1143 </P>
 1144 <P>
 1145 If the <b>dfa</b> modifier is set, the alternative matching function is used.
 1146 This function finds all possible matches at a given point in the subject. If,
 1147 however, the <b>dfa_shortest</b> modifier is set, processing stops after the
 1148 first match is found. This is always the shortest possible match.
 1149 </P>
 1150 <br><a name="SEC13" href="#TOC1">DEFAULT OUTPUT FROM pcre2test</a><br>
 1151 <P>
 1152 This section describes the output when the normal matching function,
 1153 <b>pcre2_match()</b>, is being used.
 1154 </P>
 1155 <P>
 1156 When a match succeeds, <b>pcre2test</b> outputs the list of captured substrings,
 1157 starting with number 0 for the string that matched the whole pattern.
 1158 Otherwise, it outputs "No match" when the return is PCRE2_ERROR_NOMATCH, or
 1159 "Partial match:" followed by the partially matching substring when the
 1160 return is PCRE2_ERROR_PARTIAL. (Note that this is the
 1161 entire substring that was inspected during the partial match; it may include
 1162 characters before the actual match start if a lookbehind assertion, \K, \b,
 1163 or \B was involved.)
 1164 </P>
 1165 <P>
 1166 For any other return, <b>pcre2test</b> outputs the PCRE2 negative error number
 1167 and a short descriptive phrase. If the error is a failed UTF string check, the
 1168 code unit offset of the start of the failing character is also output. Here is
 1169 an example of an interactive <b>pcre2test</b> run.
 1170 <pre>
 1171   $ pcre2test
 1172   PCRE2 version 9.00 2014-05-10
 1174     re&#62; /^abc(\d+)/
 1175   data&#62; abc123
 1176    0: abc123
 1177    1: 123
 1178   data&#62; xyz
 1179   No match
 1180 </pre>
 1181 Unset capturing substrings that are not followed by one that is set are not
 1182 shown by <b>pcre2test</b> unless the <b>allcaptures</b> modifier is specified. In
 1183 the following example, there are two capturing substrings, but when the first
 1184 data line is matched, the second, unset substring is not shown. An "internal"
 1185 unset substring is shown as "&#60;unset&#62;", as for the second data line.
 1186 <pre>
 1187     re&#62; /(a)|(b)/
 1188   data&#62; a
 1189    0: a
 1190    1: a
 1191   data&#62; b
 1192    0: b
 1193    1: &#60;unset&#62;
 1194    2: b
 1195 </pre>
 1196 If the strings contain any non-printing characters, they are output as \xhh
 1197 escapes if the value is less than 256 and UTF mode is not set. Otherwise they
 1198 are output as \x{hh...} escapes. See below for the definition of non-printing
 1199 characters. If the <b>/aftertext</b> modifier is set, the output for substring
 1200 0 is followed by the rest of the subject string, identified by "0+" like
 1201 this:
 1202 <pre>
 1203     re&#62; /cat/aftertext
 1204   data&#62; cataract
 1205    0: cat
 1206    0+ aract
 1207 </pre>
 1208 If global matching is requested, the results of successive matching attempts
 1209 are output in sequence, like this:
 1210 <pre>
 1211     re&#62; /\Bi(\w\w)/g
 1212   data&#62; Mississippi
 1213    0: iss
 1214    1: ss
 1215    0: iss
 1216    1: ss
 1217    0: ipp
 1218    1: pp
 1219 </pre>
 1220 "No match" is output only if the first match attempt fails. Here is an example
 1221 of a failure message (the offset 4 that is specified by the <b>offset</b>
 1222 modifier is past the end of the subject string):
 1223 <pre>
 1224     re&#62; /xyz/
 1225   data&#62; xyz\=offset=4
 1226   Error -24 (bad offset value)
 1227 </PRE>
 1228 </P>
 1229 <P>
 1230 Note that whereas patterns can be continued over several lines (a plain "&#62;"
 1231 prompt is used for continuations), subject lines may not. However, newlines can
 1232 be included in a subject by means of the \n escape (or \r, \r\n, etc.,
 1233 depending on the newline sequence setting).
 1234 </P>
 1235 <br><a name="SEC14" href="#TOC1">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a><br>
 1236 <P>
 1237 When the alternative matching function, <b>pcre2_dfa_match()</b>, is used, the
 1238 output consists of a list of all the matches that start at the first point in
 1239 the subject where there is at least one match. For example:
 1240 <pre>
 1241     re&#62; /(tang|tangerine|tan)/
 1242   data&#62; yellow tangerine\=dfa
 1243    0: tangerine
 1244    1: tang
 1245    2: tan
 1246 </pre>
 1247 Using the normal matching function on this data finds only "tang". The
 1248 longest matching string is always given first (and numbered zero). After a
 1249 PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
 1250 partially matching substring. Note that this is the entire substring that was
 1251 inspected during the partial match; it may include characters before the actual
 1252 match start if a lookbehind assertion, \b, or \B was involved. (\K is not
 1253 supported for DFA matching.)
 1254 </P>
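<P>
The shape of this output can be imitated for the simple case of literal
alternatives. The Python sketch below is only a model of the result, not of the
DFA algorithm: it finds the first position where any alternative matches and
lists every match at that position, longest first. The function name is
hypothetical.
</P>

```python
def dfa_style_matches(alternatives, subject: str):
    """All literal alternatives matching at the first matching position."""
    for start in range(len(subject)):
        found = [alt for alt in alternatives if subject.startswith(alt, start)]
        if found:
            # The longest match comes first and would be numbered zero.
            return sorted(found, key=len, reverse=True)
    return []


for number, text in enumerate(
        dfa_style_matches(["tang", "tangerine", "tan"], "yellow tangerine")):
    print(f"{number:2d}: {text}")
```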
 1255 <P>
 1256 If global matching is requested, the search for further matches resumes
 1257 at the end of the longest match. For example:
 1258 <pre>
 1259     re&#62; /(tang|tangerine|tan)/g
 1260   data&#62; yellow tangerine and tangy sultana\=dfa
 1261    0: tangerine
 1262    1: tang
 1263    2: tan
 1264    0: tang
 1265    1: tan
 1266    0: tan
 1267 </pre>
 1268 The alternative matching function does not support substring capture, so the
 1269 modifiers that are concerned with captured substrings are not relevant.
 1270 </P>
 1271 <br><a name="SEC15" href="#TOC1">RESTARTING AFTER A PARTIAL MATCH</a><br>
 1272 <P>
 1273 When the alternative matching function has given the PCRE2_ERROR_PARTIAL
 1274 return, indicating that the subject partially matched the pattern, you can
 1275 restart the match with additional subject data by means of the
 1276 <b>dfa_restart</b> modifier. For example:
 1277 <pre>
 1278     re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 1279   data&#62; 23ja\=ps,dfa
 1280   Partial match: 23ja
 1281   data&#62; n05\=dfa,dfa_restart
 1282    0: n05
 1283 </pre>
 1284 For further information about partial matching, see the
 1285 <a href="pcre2partial.html"><b>pcre2partial</b></a>
 1286 documentation.
 1287 </P>
 1288 <br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
 1289 <P>
 1290 If the pattern contains any callout requests, <b>pcre2test</b>'s callout
 1291 function is called during matching unless <b>callout_none</b> is specified.
 1292 This works with both matching functions.
 1293 </P>
 1294 <P>
 1295 The callout function in <b>pcre2test</b> returns zero (carry on matching) by
 1296 default, but you can use a <b>callout_fail</b> modifier in a subject line (as
 1297 described above) to change this and other parameters of the callout.
 1298 </P>
 1299 <P>
 1300 Inserting callouts can be helpful when using <b>pcre2test</b> to check
 1301 complicated regular expressions. For further information about callouts, see
 1302 the
 1303 <a href="pcre2callout.html"><b>pcre2callout</b></a>
 1304 documentation.
 1305 </P>
 1306 <P>
 1307 The output for callouts with numerical arguments and those with string
 1308 arguments is slightly different.
 1309 </P>
 1310 <br><b>
 1311 Callouts with numerical arguments
 1312 </b><br>
 1313 <P>
 1314 By default, the callout function displays the callout number, the start and
 1315 current positions in the subject text at the callout time, and the next pattern
 1316 item to be tested. For example:
 1317 <pre>
 1318   ---&#62;pqrabcdef
 1319     0    ^  ^     \d
 1320 </pre>
 1321 This output indicates that callout number 0 occurred for a match attempt
 1322 starting at the fourth character of the subject string, when the pointer was at
 1323 the seventh character, and when the next pattern item was \d. Just
 1324 one circumflex is output if the start and current positions are the same.
 1325 </P>
 1326 <P>
 1327 Callouts numbered 255 are assumed to be automatic callouts, inserted as a
 1328 result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
 1329 showing the callout number, the offset in the pattern, preceded by a plus, is
 1330 output. For example:
 1331 <pre>
 1332     re&#62; /\d?[A-E]\*/auto_callout
 1333   data&#62; E*
 1334   ---&#62;E*
 1335    +0 ^      \d?
 1336    +3 ^      [A-E]
 1337    +8 ^^     \*
 1338   +10 ^ ^
 1339    0: E*
 1340 </pre>
 1341 If a pattern contains (*MARK) items, an additional line is output whenever
 1342 a change of latest mark is passed to the callout function. For example:
 1343 <pre>
 1344     re&#62; /a(*MARK:X)bc/auto_callout
 1345   data&#62; abc
 1346   ---&#62;abc
 1347    +0 ^       a
 1348    +1 ^^      (*MARK:X)
 1349   +10 ^^      b
 1350   Latest Mark: X
 1351   +11 ^ ^     c
 1352   +12 ^  ^
 1353    0: abc
 1354 </pre>
 1355 The mark changes between matching "a" and "b", but stays the same for the rest
 1356 of the match, so nothing more is output. If, as a result of backtracking, the
 1357 mark reverts to being unset, the text "&#60;unset&#62;" is output.
 1358 </P>
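<P>
The offsets shown in the <b>/auto_callout</b> examples above are the cumulative
lengths of the pattern items. The following Python sketch reproduces that
arithmetic for an item list that has been split by hand. The function name
<b>auto_callout_offsets</b> is hypothetical; the code does not parse patterns
and does not call PCRE2.
</P>

```python
from itertools import accumulate


def auto_callout_offsets(items):
    """Return the pattern offset of each item, plus the end-of-pattern offset.

    `items` is the pattern already split into items by hand; this sketch
    does not parse regular expressions.
    """
    return [0] + list(accumulate(len(item) for item in items))


# Items of the pattern \d?[A-E]\* from the first example above.
items = [r"\d?", r"[A-E]", r"\*"]
labels = [format(offset, "+d") for offset in auto_callout_offsets(items)]
print(labels)
```

<P>
The last value is the offset of the end of the pattern, which corresponds to
the final callout line that has no pattern item after the position indicators.
</P>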
 1359 <br><b>
 1360 Callouts with string arguments
 1361 </b><br>
 1362 <P>
 1363 The output for a callout with a string argument is similar, except that instead
 1364 of outputting a callout number before the position indicators, the callout
 1365 string and its offset in the pattern string are output before the reflection of
 1366 the subject string, and the subject string is reflected for each callout. For
 1367 example:
 1368 <pre>
 1369     re&#62; /^ab(?C'first')cd(?C"second")ef/
 1370   data&#62; abcdefg
 1371   Callout (7): 'first'
 1372   ---&#62;abcdefg
 1373       ^ ^         c
 1374   Callout (20): "second"
 1375   ---&#62;abcdefg
 1376       ^   ^       e
 1377    0: abcdef
 1379 </pre>
 1380 </P>
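<P>
In the example above, the number in parentheses matches the offset of the first
character of the callout string, that is, the character after the opening
delimiter. The Python sketch below recomputes those offsets by plain string
searching. The function name <b>string_callout_offsets</b> is hypothetical, and
the sketch makes simplifying assumptions: it recognises only the ' and "
delimiters and does not handle doubled delimiters, escapes, or character
classes.
</P>

```python
def string_callout_offsets(pattern):
    """Return (offset, quoted string) for each (?C'...') or (?C"...") item.

    Assumptions for this sketch: only ' and " are recognised as delimiters,
    and the string contains no doubled (escaped) delimiter.
    """
    found = []
    pos = pattern.find("(?C")
    while pos != -1:
        delim_at = pos + 3
        delim = pattern[delim_at:delim_at + 1]
        if delim in ("'", '"'):
            text_at = delim_at + 1              # first character of the string
            end = pattern.find(delim, text_at)  # closing delimiter
            if end != -1:
                found.append((text_at, pattern[delim_at:end + 1]))
        pos = pattern.find("(?C", pos + 3)
    return found


for offset, text in string_callout_offsets("^ab(?C'first')cd(?C\"second\")ef"):
    print(f"Callout ({offset}): {text}")
```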
 1381 <br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
 1382 <P>
 1383 When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
 1384 bytes other than 32-126 are always treated as non-printing characters and are
 1385 therefore shown as hex escapes.
 1386 </P>
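<P>
As an illustration of the rule above, this Python sketch passes bytes 32-126
through unchanged and renders every other byte as a hex escape. The function
name <b>show_bytes</b> is hypothetical, and the exact spelling of the escape
(<b>\x</b> followed by two hex digits) is an assumption; <b>pcre2test</b>'s
actual output format may differ, for example in UTF mode.
</P>

```python
def show_bytes(data):
    """Render bytes 32-126 as themselves and all other bytes as hex escapes."""
    out = []
    for byte in data:
        if byte in range(32, 127):
            out.append(chr(byte))
        else:
            # Assumed spelling of the escape; pcre2test's exact format may differ.
            out.append("\\x%02x" % byte)
    return "".join(out)


print(show_bytes(b"tab\there\n"))
```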
 1387 <P>
 1388 When <b>pcre2test</b> is outputting text that is a matched part of a subject
 1389 string, it behaves in the same way, unless a different locale has been set for
 1390 the pattern (using the <b>/locale</b> modifier). In this case, the
 1391 <b>isprint()</b> function is used to distinguish printing and non-printing
 1392 characters.
 1393 <a name="saverestore"></a></P>
 1394 <br><a name="SEC18" href="#TOC1">SAVING AND RESTORING COMPILED PATTERNS</a><br>
 1395 <P>
 1396 It is possible to save compiled patterns on disc or elsewhere, and reload them
 1397 later, subject to a number of restrictions. JIT data cannot be saved. The host
 1398 on which the patterns are reloaded must be running the same version of PCRE2,
 1399 with the same code unit width, and must also have the same endianness, pointer
 1400 width and PCRE2_SIZE type. Before compiled patterns can be saved they must be
 1401 serialized, that is, converted to a stream of bytes. A single byte stream may
 1402 contain any number of compiled patterns, but they must all use the same
 1403 character tables. A single copy of the tables is included in the byte stream
 1404 (its size is 1088 bytes).
 1405 </P>
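<P>
The figure of 1088 bytes can be checked with a little arithmetic. The breakdown
below is an assumption based on the usual layout of PCRE2's character tables
(two 256-byte case tables, ten 32-byte class bitmaps, and a 256-byte character
type table); it is not stated on this page, and the labels are descriptive
rather than identifiers from the library.
</P>

```python
# Sizes in bytes of the four parts of a character table set, as assumed here.
TABLE_PARTS = {
    "lower-casing table": 256,
    "case-flipping table": 256,
    "character class bitmaps": 10 * 32,  # ten classes, one 256-bit map each
    "character type table": 256,
}

total = sum(TABLE_PARTS.values())
print(total)
```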
 1406 <P>
 1407 The functions whose names begin with <b>pcre2_serialize_</b> are used
 1408 for serializing and de-serializing. They are described in the
 1409 <a href="pcre2serialize.html"><b>pcre2serialize</b></a>
 1410 documentation. In this section we describe the features of <b>pcre2test</b> that
 1411 can be used to test these functions.
 1412 </P>
 1413 <P>
 1414 When a pattern with the <b>push</b> modifier is successfully compiled, it is pushed
 1415 onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
 1416 contain a new pattern (or command) instead of a subject line. By this means, a
 1417 number of patterns can be compiled and retained. The <b>push</b> modifier is
 1418 incompatible with <b>posix</b>, and control modifiers that act at match time are
 1419 ignored (with a message). The <b>jitverify</b> modifier applies only at compile
 1420 time. The command
 1421 <pre>
 1422   #save &#60;filename&#62;
 1423 </pre>
 1424 causes all the stacked patterns to be serialized and the result written to the
 1425 named file. Afterwards, all the stacked patterns are freed. The command
 1426 <pre>
 1427   #load &#60;filename&#62;
 1428 </pre>
 1429 reads the data in the file, and then arranges for it to be de-serialized, with
 1430 the resulting compiled patterns added to the pattern stack. The pattern on the
 1431 top of the stack can be retrieved by the #pop command, which must be followed
 1432 by lines of subjects that are to be matched with the pattern, terminated as
 1433 usual by an empty line or end of file. This command may be followed by a
 1434 modifier list containing only
 1435 <a href="#controlmodifiers">control modifiers</a>
 1436 that act after a pattern has been compiled. In particular, <b>hex</b>,
 1437 <b>posix</b>, and <b>push</b> are not allowed, nor are any
 1438 <a href="#optionmodifiers">option-setting modifiers.</a>
 1439 The JIT modifiers are, however, permitted. Here is an example that saves and
 1440 reloads two patterns.
 1441 <pre>
 1442   /abc/push
 1443   /xyz/push
 1444   #save tempfile
 1445   #load tempfile
 1446   #pop info
 1447   xyz
 1448
 1449   #pop jit,bincode
 1450   abc
 1451 </pre>
 1452 If <b>jitverify</b> is used with #pop, it does not automatically imply
 1453 <b>jit</b>, which is different behaviour from when it is used on a pattern.
 1454 </P>
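<P>
The stack discipline described in this section can be modelled in a few lines
of Python. This is a toy sketch only: the class <b>PatternStack</b> is
hypothetical, plain strings stand in for compiled patterns, <b>pickle</b>
stands in for the <b>pcre2_serialize_</b> functions, and an in-memory byte
string stands in for the saved file.
</P>

```python
import pickle


class PatternStack:
    """Toy model of the push / #save / #load / #pop behaviour described above."""

    def __init__(self):
        self.stack = []

    def push(self, pattern):
        self.stack.append(pattern)

    def save(self):
        # Serialize every stacked pattern, then free the stack.
        blob = pickle.dumps(self.stack)
        self.stack = []
        return blob

    def load(self, blob):
        # De-serialize and add the patterns to the stack.
        self.stack.extend(pickle.loads(blob))

    def pop(self):
        # Retrieve the pattern on the top of the stack.
        return self.stack.pop()


stack = PatternStack()
stack.push("abc")
stack.push("xyz")
blob = stack.save()
stack.load(blob)
print(stack.pop(), stack.pop())
```

<P>
As in the example above, the pattern pushed last (xyz) is the first one
retrieved after reloading.
</P>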
 1455 <br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
 1456 <P>
 1457 <b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
 1458 <b>pcre2jit</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3),
 1459 <b>pcre2pattern</b>(3), <b>pcre2serialize</b>(3).
 1460 </P>
 1461 <br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
 1462 <P>
 1463 Philip Hazel
 1464 <br>
 1465 University Computing Service
 1466 <br>
 1467 Cambridge, England.
 1468 <br>
 1469 </P>
 1470 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 1471 <P>
 1472 Last updated: 20 May 2015
 1473 <br>
 1474 Copyright &copy; 1997-2015 University of Cambridge.
 1475 <br>
 1476 <p>
 1477 Return to the <a href="index.html">PCRE2 index page</a>.
 1478 </p>