Member "pcre-8.42/doc/html/pcresyntax.html" (20 Mar 2018, 16694 Bytes) of package /linux/misc/pcre-8.42.tar.bz2:


    1 <html>
    2 <head>
    3 <title>pcresyntax specification</title>
    4 </head>
    5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
    6 <h1>pcresyntax man page</h1>
    7 <p>
    8 Return to the <a href="index.html">PCRE index page</a>.
    9 </p>
   10 <p>
   11 This page is part of the PCRE HTML documentation. It was generated automatically
   12 from the original man page. If there is any nonsense in it, please consult the
   13 man page, in case the conversion went wrong.
   14 <br>
   15 <ul>
   16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
   17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
   18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
   19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
   20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
   21 <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
   22 <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
   23 <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
   24 <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
   25 <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
   26 <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
   27 <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
   28 <li><a name="TOC13" href="#SEC13">CAPTURING</a>
   29 <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
   30 <li><a name="TOC15" href="#SEC15">COMMENT</a>
   31 <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
   32 <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
   33 <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
   34 <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
   35 <li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
   36 <li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
   37 <li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
   38 <li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
   39 <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
   40 <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
   41 <li><a name="TOC26" href="#SEC26">AUTHOR</a>
   42 <li><a name="TOC27" href="#SEC27">REVISION</a>
   43 </ul>
   44 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
   45 <P>
   46 The full syntax and semantics of the regular expressions that are supported by
   47 PCRE are described in the
   48 <a href="pcrepattern.html"><b>pcrepattern</b></a>
   49 documentation. This document contains a quick-reference summary of the syntax.
   50 </P>
   51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
   52 <P>
   53 <pre>
   54   \x         where x is non-alphanumeric is a literal x
   55   \Q...\E    treat enclosed characters as literal
   56 </PRE>
   57 </P>
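<P>
A minimal sketch of the backslash-quoting rule. It uses Python's <b>re</b>
module as a stand-in, which is an assumption: <b>re</b> is not PCRE, but it
treats a backslash before a non-alphanumeric character the same way. Python's
<b>re</b> has no \Q...\E, so re.escape() is shown as a rough analogue only.
</P>

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
literal_plus = re.search(r"1\+1", "1+1=2")

# Unescaped, "+" is a quantifier: one or more "1" followed by another "1".
quantified = re.search(r"1+1", "1+1=2")

# re.escape() quotes every special character in a string,
# roughly what enclosing the text in \Q...\E does in PCRE.
quoted = re.search(re.escape("a.b*c"), "xx a.b*c yy")

print(literal_plus.group(), quantified, quoted.group())
```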
   58 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
   59 <P>
   60 <pre>
   61   \a         alarm, that is, the BEL character (hex 07)
   62   \cx        "control-x", where x is any ASCII character
   63   \e         escape (hex 1B)
   64   \f         form feed (hex 0C)
   65   \n         newline (hex 0A)
   66   \r         carriage return (hex 0D)
   67   \t         tab (hex 09)
   68   \0dd       character with octal code 0dd
   69   \ddd       character with octal code ddd, or backreference
   70   \o{ddd..}  character with octal code ddd..
   71   \xhh       character with hex code hh
   72   \x{hhh..}  character with hex code hhh..
   73 </pre>
   74 Note that \0dd is always an octal code, and that \8 and \9 are the literal
   75 characters "8" and "9".
   76 </P>
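<P>
The escapes that Python's <b>re</b> module shares with PCRE can be tried as
below. This is a sketch under that assumption, not a PCRE session: \e, \cx,
\o{ddd..} and \x{hhh..} are not accepted by <b>re</b> and are left out.
</P>

```python
import re

# \xhh: character with hex code 41, that is "A".
by_hex = re.fullmatch(r"\x41", "A")

# \ddd with three octal digits: octal 101 is also "A".
by_octal = re.fullmatch(r"\101", "A")

# \0dd is always an octal code: octal 012 is the newline character.
leading_zero = re.fullmatch(r"\012", "\n")

# \a (hex 07), \t (hex 09), \r (hex 0D), \n (hex 0A), \f (hex 0C).
controls = re.fullmatch(r"\a\t\r\n\f", "\x07\x09\x0d\x0a\x0c")

print(all(m is not None for m in (by_hex, by_octal, leading_zero, controls)))
```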
   77 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
   78 <P>
   79 <pre>
   80   .          any character except newline;
   81                in dotall mode, any character whatsoever
   82   \C         one data unit, even in UTF mode (best avoided)
   83   \d         a decimal digit
   84   \D         a character that is not a decimal digit
   85   \h         a horizontal white space character
   86   \H         a character that is not a horizontal white space character
   87   \N         a character that is not a newline
   88   \p{<i>xx</i>}     a character with the <i>xx</i> property
   89   \P{<i>xx</i>}     a character without the <i>xx</i> property
   90   \R         a newline sequence
   91   \s         a white space character
   92   \S         a character that is not a white space character
   93   \v         a vertical white space character
   94   \V         a character that is not a vertical white space character
   95   \w         a "word" character
   96   \W         a "non-word" character
   97   \X         a Unicode extended grapheme cluster
   98 </pre>
   99 By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
  100 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
  101 happening, \s and \w may also match characters with code points in the range
  102 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
  103 is changed to use Unicode properties and they match many more characters.
  104 </P>
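<P>
The ASCII-versus-Unicode distinction can be illustrated by analogy with
Python's <b>re</b> module. The mapping is an assumption made for illustration:
the re.ASCII flag behaves like PCRE's default, and Python's default for str
patterns behaves roughly like PCRE with PCRE_UCP set.
</P>

```python
import re

# ASCII letter and digit, a space, then ARABIC-INDIC DIGIT THREE (U+0663).
text = "x7 \u0663"

# ASCII-only matching: \d sees only the ASCII digit.
ascii_digits = re.findall(r"\d", text, re.ASCII)

# Unicode-property matching: \d also sees the Arabic-Indic digit.
unicode_digits = re.findall(r"\d", text)

# \D matches every character that is not a decimal digit.
non_digits = re.findall(r"\D", "a1 b2")

print(ascii_digits, unicode_digits, non_digits)
```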
  105 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
  106 <P>
  107 <pre>
  108   C          Other
  109   Cc         Control
  110   Cf         Format
  111   Cn         Unassigned
  112   Co         Private use
  113   Cs         Surrogate
  114 
  115   L          Letter
  116   Ll         Lower case letter
  117   Lm         Modifier letter
  118   Lo         Other letter
  119   Lt         Title case letter
  120   Lu         Upper case letter
  121   L&         Ll, Lu, or Lt
  122 
  123   M          Mark
  124   Mc         Spacing mark
  125   Me         Enclosing mark
  126   Mn         Non-spacing mark
  127 
  128   N          Number
  129   Nd         Decimal number
  130   Nl         Letter number
  131   No         Other number
  132 
  133   P          Punctuation
  134   Pc         Connector punctuation
  135   Pd         Dash punctuation
  136   Pe         Close punctuation
  137   Pf         Final punctuation
  138   Pi         Initial punctuation
  139   Po         Other punctuation
  140   Ps         Open punctuation
  141 
  142   S          Symbol
  143   Sc         Currency symbol
  144   Sk         Modifier symbol
  145   Sm         Mathematical symbol
  146   So         Other symbol
  147 
  148   Z          Separator
  149   Zl         Line separator
  150   Zp         Paragraph separator
  151   Zs         Space separator
  152 </PRE>
  153 </P>
  154 <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
  155 <P>
  156 <pre>
  157   Xan        Alphanumeric: union of properties L and N
  158   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  159   Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  160   Xuc        Universally-named character: one that can be
  161                represented by a Universal Character Name
  162   Xwd        Perl word: property Xan or underscore
  163 </pre>
  164 Perl and POSIX space are now the same. Perl added VT to its space character set
  165 at release 5.18 and PCRE changed at release 8.34.
  166 </P>
  167 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
  168 <P>
  169 Arabic,
  170 Armenian,
  171 Avestan,
  172 Balinese,
  173 Bamum,
  174 Bassa_Vah,
  175 Batak,
  176 Bengali,
  177 Bopomofo,
  178 Brahmi,
  179 Braille,
  180 Buginese,
  181 Buhid,
  182 Canadian_Aboriginal,
  183 Carian,
  184 Caucasian_Albanian,
  185 Chakma,
  186 Cham,
  187 Cherokee,
  188 Common,
  189 Coptic,
  190 Cuneiform,
  191 Cypriot,
  192 Cyrillic,
  193 Deseret,
  194 Devanagari,
  195 Duployan,
  196 Egyptian_Hieroglyphs,
  197 Elbasan,
  198 Ethiopic,
  199 Georgian,
  200 Glagolitic,
  201 Gothic,
  202 Grantha,
  203 Greek,
  204 Gujarati,
  205 Gurmukhi,
  206 Han,
  207 Hangul,
  208 Hanunoo,
  209 Hebrew,
  210 Hiragana,
  211 Imperial_Aramaic,
  212 Inherited,
  213 Inscriptional_Pahlavi,
  214 Inscriptional_Parthian,
  215 Javanese,
  216 Kaithi,
  217 Kannada,
  218 Katakana,
  219 Kayah_Li,
  220 Kharoshthi,
  221 Khmer,
  222 Khojki,
  223 Khudawadi,
  224 Lao,
  225 Latin,
  226 Lepcha,
  227 Limbu,
  228 Linear_A,
  229 Linear_B,
  230 Lisu,
  231 Lycian,
  232 Lydian,
  233 Mahajani,
  234 Malayalam,
  235 Mandaic,
  236 Manichaean,
  237 Meetei_Mayek,
  238 Mende_Kikakui,
  239 Meroitic_Cursive,
  240 Meroitic_Hieroglyphs,
  241 Miao,
  242 Modi,
  243 Mongolian,
  244 Mro,
  245 Myanmar,
  246 Nabataean,
  247 New_Tai_Lue,
  248 Nko,
  249 Ogham,
  250 Ol_Chiki,
  251 Old_Italic,
  252 Old_North_Arabian,
  253 Old_Permic,
  254 Old_Persian,
  255 Old_South_Arabian,
  256 Old_Turkic,
  257 Oriya,
  258 Osmanya,
  259 Pahawh_Hmong,
  260 Palmyrene,
  261 Pau_Cin_Hau,
  262 Phags_Pa,
  263 Phoenician,
  264 Psalter_Pahlavi,
  265 Rejang,
  266 Runic,
  267 Samaritan,
  268 Saurashtra,
  269 Sharada,
  270 Shavian,
  271 Siddham,
  272 Sinhala,
  273 Sora_Sompeng,
  274 Sundanese,
  275 Syloti_Nagri,
  276 Syriac,
  277 Tagalog,
  278 Tagbanwa,
  279 Tai_Le,
  280 Tai_Tham,
  281 Tai_Viet,
  282 Takri,
  283 Tamil,
  284 Telugu,
  285 Thaana,
  286 Thai,
  287 Tibetan,
  288 Tifinagh,
  289 Tirhuta,
  290 Ugaritic,
  291 Vai,
  292 Warang_Citi,
  293 Yi.
  294 </P>
  295 <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
  296 <P>
  297 <pre>
  298   [...]       positive character class
  299   [^...]      negative character class
  300   [x-y]       range (can be used for hex characters)
  301   [[:xxx:]]   positive POSIX named set
  302   [[:^xxx:]]  negative POSIX named set
  303 
  304   alnum       alphanumeric
  305   alpha       alphabetic
  306   ascii       0-127
  307   blank       space or tab
  308   cntrl       control character
  309   digit       decimal digit
  310   graph       printing, excluding space
  311   lower       lower case letter
  312   print       printing, including space
  313   punct       printing, excluding alphanumeric
  314   space       white space
  315   upper       upper case letter
  316   word        same as \w
  317   xdigit      hexadecimal digit
  318 </pre>
  319 In PCRE, POSIX character set names recognize only ASCII characters by default,
  320 but some of them use Unicode properties if PCRE_UCP is set. You can use
  321 \Q...\E inside a character class.
  322 </P>
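<P>
A short sketch of positive and negative classes and ranges, run with Python's
<b>re</b> module as a stand-in. That choice is an assumption: <b>re</b> does
not support the POSIX named sets such as [[:alpha:]], so only the bracket
forms are shown.
</P>

```python
import re

# [x-y] ranges: runs of hexadecimal digits.
hex_runs = re.findall(r"[0-9A-Fa-f]+", "ff 10 zz 7E")

# [^...] negative class: runs of anything except white space and commas.
fields = re.findall(r"[^\s,]+", "a, b,c")

print(hex_runs, fields)
```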
  323 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
  324 <P>
  325 <pre>
  326   ?           0 or 1, greedy
  327   ?+          0 or 1, possessive
  328   ??          0 or 1, lazy
  329   *           0 or more, greedy
  330   *+          0 or more, possessive
  331   *?          0 or more, lazy
  332   +           1 or more, greedy
  333   ++          1 or more, possessive
  334   +?          1 or more, lazy
  335   {n}         exactly n
  336   {n,m}       at least n, no more than m, greedy
  337   {n,m}+      at least n, no more than m, possessive
  338   {n,m}?      at least n, no more than m, lazy
  339   {n,}        n or more, greedy
  340   {n,}+       n or more, possessive
  341   {n,}?       n or more, lazy
  342 </PRE>
  343 </P>
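<P>
The difference between greedy and lazy quantifiers is easiest to see on a
small subject. The sketch below uses Python's <b>re</b> module, an assumption
made so the example is runnable without PCRE; the greedy and lazy forms shown
are written the same way in both. Possessive forms are omitted.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what it must.
greedy = re.search(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can.
lazy = re.search(r"<.+?>", subject).group()

# {n,m} greedy prefers m repetitions; {n,m}? lazy prefers n.
greedy_range = re.findall(r"\d{2,3}", "1 22 333 4444")
lazy_range = re.findall(r"\d{2,3}?", "1 22 333 4444")

print(greedy, lazy, greedy_range, lazy_range)
```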
  344 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
  345 <P>
  346 <pre>
  347   \b          word boundary
  348   \B          not a word boundary
  349   ^           start of subject
  350                also after internal newline in multiline mode
  351   \A          start of subject
  352   $           end of subject
  353                also before newline at end of subject
  354                also before internal newline in multiline mode
  355   \Z          end of subject
  356                also before newline at end of subject
  357   \z          end of subject
  358   \G          first matching position in subject
  359 </PRE>
  360 </P>
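<P>
A sketch of how the anchors behave with and without multiline mode, using
Python's <b>re</b> module as an assumed stand-in. \Z is deliberately not
shown: in Python it matches only at the very end of the subject, which is what
\z does in PCRE.
</P>

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject by default.
caret_default = re.findall(r"^\w+", subject)

# In multiline mode ^ also matches after each internal newline.
caret_multiline = re.findall(r"(?m)^\w+", subject)

# \A matches only at the start of the subject, even in multiline mode.
start_only = re.findall(r"(?m)\A\w+", subject)

# $ also matches before a newline at the end of the subject.
dollar = re.search(r"two$", subject)

# \b requires a word boundary on each side of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(caret_default, caret_multiline, start_only, dollar is not None, whole_words)
```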
  361 <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
  362 <P>
  363 <pre>
  364   \K          reset start of match
  365 </pre>
  366 \K is honoured in positive assertions, but ignored in negative ones.
  367 </P>
  368 <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
  369 <P>
  370 <pre>
  371   expr|expr|expr...
  372 </PRE>
  373 </P>
  374 <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
  375 <P>
  376 <pre>
  377   (...)           capturing group
  378   (?&#60;name&#62;...)    named capturing group (Perl)
  379   (?'name'...)    named capturing group (Perl)
  380   (?P&#60;name&#62;...)   named capturing group (Python)
  381   (?:...)         non-capturing group
  382   (?|...)         non-capturing group; reset group numbers for
  383                    capturing groups in each alternative
  384 </PRE>
  385 </P>
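<P>
Numbered, named, and non-capturing groups can be combined in one pattern. The
sketch below assumes Python's <b>re</b> module, which accepts the Python-style
(?P&#60;name&#62;...) form listed above; the (?|...) form is not available there
and is not shown.
</P>

```python
import re

# One named group, one non-capturing group, one plain capturing group.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2014-01-08")

year = m.group("year")   # the named group is also group 1
groups = m.groups()      # the non-capturing group takes no number

print(year, groups)
```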
  386 <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
  387 <P>
  388 <pre>
  389   (?&#62;...)         atomic, non-capturing group
  390 </PRE>
  391 </P>
  392 <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
  393 <P>
  394 <pre>
  395   (?#....)        comment (not nestable)
  396 </PRE>
  397 </P>
  398 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
  399 <P>
  400 <pre>
  401   (?i)            caseless
  402   (?J)            allow duplicate names
  403   (?m)            multiline
  404   (?s)            single line (dotall)
  405   (?U)            default ungreedy (lazy)
  406   (?x)            extended (ignore white space)
  407   (?-...)         unset option(s)
  408 </pre>
  409 The following are recognized only at the very start of a pattern or after one
  410 of the newline or \R options with similar syntax. More than one of them may
  411 appear.
  412 <pre>
  413   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  414   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  415   (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
  416   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  417   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  418   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  419   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  420   (*UTF)          set appropriate UTF mode for the library in use
  421   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
  422 </pre>
  423 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
  424 limits set by the caller of pcre_exec(), not increase them.
  425 </P>
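<P>
A sketch of the (?i), (?s) and (?x) settings, assuming Python's <b>re</b>
module as a stand-in. Python accepts these three only at the start of a
pattern, and has no (?J) or (?U); none of the (*...) items above exist there.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "PCRE")

# (?s): dotall, so "." also matches a newline.
dotall = re.search(r"(?s)a.b", "a\nb")
no_dotall = re.search(r"a.b", "a\nb")

# (?x): extended, so white space in the pattern is ignored.
extended = re.search(r"(?x) \d+  \s*  kg ", "12 kg")

print(caseless.group(), dotall is not None, no_dotall, extended.group())
```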
  426 <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
  427 <P>
  428 These are recognized only at the very start of the pattern or after option
  429 settings with a similar syntax.
  430 <pre>
  431   (*CR)           carriage return only
  432   (*LF)           linefeed only
  433   (*CRLF)         carriage return followed by linefeed
  434   (*ANYCRLF)      all three of the above
  435   (*ANY)          any Unicode newline sequence
  436 </PRE>
  437 </P>
  438 <br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
  439 <P>
  440 These are recognized only at the very start of the pattern or after option
  441 settings with a similar syntax.
  442 <pre>
  443   (*BSR_ANYCRLF)  CR, LF, or CRLF
  444   (*BSR_UNICODE)  any Unicode newline sequence
  445 </PRE>
  446 </P>
  447 <br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
  448 <P>
  449 <pre>
  450   (?=...)         positive look ahead
  451   (?!...)         negative look ahead
  452   (?&#60;=...)        positive look behind
  453   (?&#60;!...)        negative look behind
  454 </pre>
  455 Each top-level branch of a look behind must be of a fixed length.
  456 </P>
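<P>
The four assertions can be compared side by side. This is a minimal sketch
using Python's <b>re</b> module, an assumption made for runnability; the
syntax of all four is the same there, and every look behind below has a fixed
length of one character.
</P>

```python
import re

# Positive look ahead: digits that are followed by " USD".
ahead = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR")

# Negative look ahead: "foo" that is not followed by "bar".
not_ahead = re.findall(r"foo(?!bar)", "foobar foobaz")

# Positive look behind: digits that are preceded by "$".
behind = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative look behind: a number that does not start right after "$".
not_behind = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

print(ahead, not_ahead, behind, not_behind)
```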
  457 <br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
  458 <P>
  459 <pre>
  460   \n              reference by number (can be ambiguous)
  461   \gn             reference by number
  462   \g{n}           reference by number
  463   \g{-n}          relative reference by number
  464   \k&#60;name&#62;        reference by name (Perl)
  465   \k'name'        reference by name (Perl)
  466   \g{name}        reference by name (Perl)
  467   \k{name}        reference by name (.NET)
  468   (?P=name)       reference by name (Python)
  469 </PRE>
  470 </P>
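<P>
A common use of backreferences is finding a doubled word. The sketch assumes
Python's <b>re</b> module, which supports the numbered form and the
Python-style (?P=name) form; the \g and \k forms above are not available
there.
</P>

```python
import re

# \1 must match the same text that group 1 matched.
by_number = re.search(r"\b(\w+) \1\b", "this is is a test").group()

# (?P=w) refers back to the group named "w".
by_name = re.search(r"\b(?P<w>\w+) (?P=w)\b", "it was was fine").group()

print(by_number, by_name)
```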
  471 <br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
  472 <P>
  473 <pre>
  474   (?R)            recurse whole pattern
  475   (?n)            call subpattern by absolute number
  476   (?+n)           call subpattern by relative number
  477   (?-n)           call subpattern by relative number
  478   (?&name)        call subpattern by name (Perl)
  479   (?P&#62;name)       call subpattern by name (Python)
  480   \g&#60;name&#62;        call subpattern by name (Oniguruma)
  481   \g'name'        call subpattern by name (Oniguruma)
  482   \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  483   \g'n'           call subpattern by absolute number (Oniguruma)
  484   \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  485   \g'+n'          call subpattern by relative number (PCRE extension)
  486   \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  487   \g'-n'          call subpattern by relative number (PCRE extension)
  488 </PRE>
  489 </P>
  490 <br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
  491 <P>
  492 <pre>
  493   (?(condition)yes-pattern)
  494   (?(condition)yes-pattern|no-pattern)
  495 
  496   (?(n)...        absolute reference condition
  497   (?(+n)...       relative reference condition
  498   (?(-n)...       relative reference condition
  499   (?(&#60;name&#62;)...   named reference condition (Perl)
  500   (?('name')...   named reference condition (Perl)
  501   (?(name)...     named reference condition (PCRE)
  502   (?(R)...        overall recursion condition
  503   (?(Rn)...       specific group recursion condition
  504   (?(R&name)...   specific recursion condition
  505   (?(DEFINE)...   define subpattern for reference
  506   (?(assert)...   assertion condition
  507 </PRE>
  508 </P>
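<P>
An absolute reference condition can require a closing parenthesis only when an
opening one was matched. The sketch below assumes Python's <b>re</b> module,
which supports the (?(n)...) and (?(name)...) conditions but not the
recursion, DEFINE, or assertion conditions.
</P>

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then demands ")"
# only if group 1 took part in the match; the no-pattern is empty.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

subjects = ["12", "(12)", "(12", "12)"]
results = {s: bool(pattern.match(s)) for s in subjects}

print(results)
```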
  509 <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
  510 <P>
  511 The following act as soon as they are reached:
  512 <pre>
  513   (*ACCEPT)       force successful match
  514   (*FAIL)         force backtrack; synonym (*F)
  515   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
  516 </pre>
  517 The following act only when a subsequent match failure causes a backtrack to
  518 reach them. They all force a match failure, but they differ in what happens
  519 afterwards. Those that advance the start-of-match point do so only if the
  520 pattern is not anchored.
  521 <pre>
  522   (*COMMIT)       overall failure, no advance of starting point
  523   (*PRUNE)        advance to next starting character
  524   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  525   (*SKIP)         advance to current matching position
  526   (*SKIP:NAME)    advance to position corresponding to an earlier
  527                   (*MARK:NAME); if not found, the (*SKIP) is ignored
  528   (*THEN)         local failure, backtrack to next alternation
  529   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
  530 </PRE>
  531 </P>
  532 <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
  533 <P>
  534 <pre>
  535   (?C)      callout
  536   (?Cn)     callout with data n
  537 </PRE>
  538 </P>
  539 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
  540 <P>
  541 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
  542 <b>pcrematching</b>(3), <b>pcre</b>(3).
  543 </P>
  544 <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
  545 <P>
  546 Philip Hazel
  547 <br>
  548 University Computing Service
  549 <br>
  550 Cambridge CB2 3QH, England.
  551 <br>
  552 </P>
  553 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
  554 <P>
  555 Last updated: 08 January 2014
  556 <br>
  557 Copyright &copy; 1997-2014 University of Cambridge.
  558 <br>
  559 <p>
  560 Return to the <a href="index.html">PCRE index page</a>.
  561 </p>