"Fossies" - the Fresh Open Source Software Archive

Member "scalpel-2.0/tre-0.7.5-win32/doc/tre-syntax.html" (20 Apr 2011, 14466 Bytes) of archive /linux/misc/scalpel-2.0.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) HTML source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 <h1>TRE Regexp Syntax</h1>
    2 
    3 <p>
    4 This document describes the POSIX 1003.2 extended RE (ERE) syntax and
    5 the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
    6 to the ERE syntax.  A simple Extended Backus-Naur Form (EBNF) style
    7 notation is used to describe the grammar.
    8 </p>
    9 
   10 <h2>ERE Syntax</h2>
   11 
   12 <h3>Alternation operator</h3>
   13 <a name="alternation"></a>
   14 <a name="extended-regexp"></a>
   15 
   16 <table bgcolor="#e0e0f0" cellpadding="10">
   17 <tr><td>
   18 <pre>
   19 <i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
   20                 |   <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
   21 </pre>
   22 </td></tr>
   23 </table>
   24 <p>
   25 An extended regexp (ERE) is one or more <i>branches</i>, separated by
   26 <tt>|</tt>.  An ERE matches anything that matches one or more of the
   27 branches.
   28 </p>
   29 
   30 <h3>Catenation of REs</h3>
   31 <a name="catenation"></a>
   32 <a name="branch"></a>
   33 
   34 <table bgcolor="#e0e0f0" cellpadding="10">
   35 <tr><td>
   36 <pre>
   37 <i>branch</i> ::= <i>piece</i>
   38        |   <i>branch</i> <i>piece</i>
   39 </pre>
   40 </td></tr>
   41 </table>
   42 <p>
   43 A branch is one or more <i>pieces</i> concatenated.  It matches a
   44 match for the first piece, followed by a match for the second piece,
   45 and so on.
   46 </p>
   47 
   48 
   49 <table bgcolor="#e0e0f0" cellpadding="10">
   50 <tr><td>
   51 <pre>
   52 <i>piece</i> ::= <i>atom</i>
   53       |   <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
   54       |   <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
   55 </pre>
   56 </td></tr>
   57 </table>
   58 <p>
   59 A piece is an <i>atom</i> possibly followed by a repeat operator or an
   60 expression controlling approximate matching parameters for the <i>atom</i>.
   61 </p>
   62 
   63 
   64 <table bgcolor="#e0e0f0" cellpadding="10">
   65 <tr><td>
   66 <pre>
   67 <i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
   68      |   <a href="#bracket-expression"><i>bracket-expression</i></a>
   69      |   <b>"."</b>
   70      |   <a href="#assertion"><i>assertion</i></a>
   71      |   <a href="#literal"><i>literal</i></a>
   72      |   <a href="#backref"><i>back-reference</i></a>
   73      |   <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
   74      |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
   75      |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
   76 </pre>
   77 </td></tr>
   78 </table>
   79 <p>
   80 An atom is either an ERE enclosed in parenthesis, a bracket
   81 expression, a <tt>.</tt> (period), an assertion, or a literal.
   82 </p>
   83 
   84 <p>
   85 The dot (<tt>.</tt>) matches any single character.
   86 If the <code>REG_NEWLINE</code> compilation flag (see <a
   87 href="api.html">API manual</a>) is specified, the newline
   88 character is not matched.
   89 </p>
   90 
   91 <p>
   92 <tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
   93 completely ignored by the regex parser and it used solely for readability purposes.
   94 </p>
   95 
   96 <h3>Repeat operators</h3>
   97 <a name="repeat-operator"></a>
   98 
   99 <table bgcolor="#e0e0f0" cellpadding="10">
  100 <tr><td>
  101 <pre>
  102 <i>repeat-operator</i> ::= <b>"*"</b>
  103                 |   <b>"+"</b>
  104                 |   <b>"?"</b>
  105                 |   <i>bound</i>
  106                 |   <b>"*?"</b>
  107                 |   <b>"+?"</b>
  108                 |   <b>"??"</b>
  109                 |   <i>bound</i> <b>?</b>
  110 </pre>
  111 </td></tr>
  112 </table>
  113 
  114 <p>
  115 An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
  116 of the atom.  <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
  117 of 1 or more matches of the atom.  An atom followed by <tt>?</tt>
  118 matches a sequence of 0 or 1 matches of the atom.
  119 </p>
  120 
  121 <p>
  122 A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
  123 are unsigned decimal integers between <tt>0</tt> and
  124 <tt>RE_DUP_MAX</tt>:
  125 </p>
  126 
  127 <ol>
  128 <li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
  129 <li><tt>{</tt><i>m</i><tt>,}</tt></li>
  130 <li><tt>{</tt><i>m</i><tt>}</tt></li>
  131 </ol>
  132 
  133 <p>
  134 An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
  135 (inclusive) matches of the atom.  An atom followed by [2]
  136 matches a sequence of <i>m</i> or more matches of the atom.  An atom
  137 followed by [3] matches a sequence of exactly <i>m</i> matches of the
  138 atom.
  139 </p>
  140 
  141 
  142 <p>
  143 Adding a <tt>?</tt> to a repeat operator makes the subexpression
  144 minimal, or non-greedy.  Normally a repeated expression is greedy,
  145 that is, it matches as many characters as possible.  A non-greedy
  146 subexpression matches as few characters as possible.  Note that this
  147 does not (always) mean the same thing as matching as many or few
  148 repetitions as possible.
  149 </p>
  150 
  151 <h3>Approximate matching settings</h3>
  152 <a name="approx-settings"></a>
  153 
  154 <table bgcolor="#e0e0f0" cellpadding="10">
  155 <tr><td>
  156 <pre>
  157 <i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
  158 
  159 <i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
  160              |   <b>"-"</b> <i>number</i>?
  161              |   <b>"#"</b> <i>number</i>?
  162              |   <b>"~"</b> <i>number</i>?
  163 
  164 <i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"&lt;"</b> <i>number</i>
  165 
  166 <i>cost-term</i> ::= <i>number</i> <b>"i"</b>
  167           |   <i>number</i> <b>"d"</b>
  168           |   <i>number</i> <b>"s"</b>
  169 
  170 </pre>
  171 </td></tr>
  172 </table>
  173 
  174 <p>
  175 The approximate matching settings for a subpattern can be changed
  176 by appending <i>approx-settings</i> to the subpattern.  Limits for
  177 the number of errors can be set and an expression for specifying and
  178 limiting the costs can be given.
  179 </p>
  180 
  181 <p>
  182 The <i>count-limits</i> can be used to set limits for the number of
  183 insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
  184 (<tt>#</tt>), and total number of errors (<tt>~</tt>).  If the
  185 <i>number</i> part is omitted, the specified error count will be
  186 unlimited.
  187 </p>
  188 
  189 <p>
  190 The <i>cost-equation</i> can be thought of as a mathematical equation,
  191 where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
  192 insertions, deletions, and substitutions, respectively.  The equation
  193 can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
  194 <tt>s</tt>.  The multiplier is the cost of the error, and the number
  195 after <tt>&lt;</tt> is the maximum allowed cost of a match.  Spaces
  196 and pluses can be inserted to make the equation readable.
  197 </p>
  198 
  199 <p>
  200 Examples:
  201 <dl>
  202 <dt><tt>{~}</tt></dt>
  203 <dd>Sets the maximum number of errors to unlimited.</dd>
  204 <dt><tt>{~3}</tt></dt>
  205 <dd>Sets the maximum number of errors to three.</dd>
  206 <dt><tt>{+2~5}</tt></dt>
  207 <dd>Sets the maximum number of errors to five, and the maximum number
  208 of insertions to two.</dd>
  209 <dt><tt>{&lt;3}</tt></dt>
  210 <dd>Sets the maximum cost to three.
  211 <dt><tt>{2i + d + 2s &lt; 5}</tt></dt>
  212 <dd>Sets the cost of an insertion to two, a deletion to one, a
  213 substitution to two, and the maximum cost to five.
  214 </dl>
  215 
  216 
  217 <h3>Bracket expressions</h3>
  218 <a name="bracket-expression"></a>
  219 
  220 <table bgcolor="#e0e0f0" cellpadding="10">
  221 <tr><td>
  222 <pre>
  223 <i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
  224                    |   <b>"[^"</b> <i>item</i>+ <b>"]"</b>
  225 </pre>
  226 </td></tr>
  227 </table>
  228 
  229 <p>
  230 A bracket expression specifies a set of characters by enclosing a
  231 nonempty list of items in brackets.  Normally anything matching any
  232 item in the list is matched.  If the list begins with <tt>^</tt> the
  233 meaning is negated; any character matching no item in the list is
  234 matched.
  235 </p>
  236 
  237 <p>
  238 An item is any of the following:
  239 </p>
  240 <ul>
  241 <li>A single character, matching that character.</li>
  242 <li>Two characters separated by <tt>-</tt>.  This is shorthand for the
  243 full range of characters  between those two (inclusive) in the
  244 collating sequence.  For example, <tt>[0-9]</tt> in ASCII matches any
  245 decimal digit.</li>
  246 <li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
  247 matching the collating element.  This can be used to include a literal
  248 <tt>-</tt> or a multi-character collating element in the list.</li>
  249 <li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
  250 equivalence class), matching all collating elements with the same
  251 primary collation weight as that element, including the element
  252 itself.</li>
  253 <li>The name of a character class enclosed in <tt>[:</tt> and
  254 <tt>:]</tt>, matching any character belonging to the class.  The set
  255 of valid names depends on the <code>LC_CTYPE</code> category of the
  256 current locale, but the following names are valid in all locales:
  257 <ul>
  258 <li><tt>alnum</tt> - alphanumeric characters</li>
  259 <li><tt>alpha</tt> - alphabetic characters</li>
  260 <li><tt>blank</tt> - blank characters</li>
  261 <li><tt>cntrl</tt> - control characters</li>
  262 <li><tt>digit</tt> - decimal digits (0 through 9)</li>
  263 <li><tt>graph</tt> - all printable characters except space</li>
  264 <li><tt>lower</tt> - lower-case letters</li>
  265 <li><tt>print</tt> - printable characters including space</li>
  266 <li><tt>punct</tt> - printable characters not space or alphanumeric</li>
  267 <li><tt>space</tt> - white-space characters</li>
  268 <li><tt>upper</tt> - upper case letters</li>
  269 <li><tt>xdigit</tt> - hexadecimal digits</li>
  270 </ul>
  271 </ul>
  272 <p>
  273 To include a literal <tt>-</tt> in the list, make it either the first
  274 or last item, the second endpoint of a range, or enclose it in
  275 <tt>[.</tt> and <tt>.]</tt> to make it a collating element.  To
  276 include a literal <tt>]</tt> in the list, make it either the first
  277 item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
  278 <tt>.]</tt>.  To use a literal <tt>-</tt> as the first
  279 endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
  280 </p>
  281 
  282 
  283 <h3>Assertions</h3>
  284 <a name="assertion"></a>
  285 
  286 <table bgcolor="#e0e0f0" cellpadding="10">
  287 <tr><td>
  288 <pre>
  289 <i>assertion</i> ::= <b>"^"</b>
  290           |   <b>"$"</b>
  291           |   <b>"\"</b> <i>assertion-character</i>
  292 </pre>
  293 </td></tr>
  294 </table>
  295 
  296 <p>
  297 The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
  298 "right anchor", respectively.  The left anchor matches the empty
  299 string at the beginning of the string.  The right anchor matches the
  300 empty string at the end of the string.  The behaviour of both anchors
  301 can be varied by specifying certain execution and compilation flags;
  302 see the <a href="api.html">API manual</a>.
  303 </p>
  304 
  305 <p>
  306 An assertion-character can be any of the following:
  307 </p>
  308 
  309 <ul>
  310 <li><tt>&lt;</tt> - Beginning of word
  311 <li><tt>&gt;</tt> - End of word
  312 <li><tt>b</tt> - Word boundary
  313 <li><tt>B</tt> - Non-word boundary
  314 <li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
  315 <li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
  316 <li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
  317 <li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
  318 <li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
  319 <li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
  320 </ul>
  321 
  322 
  323 <h3>Literals</h3>
  324 <a name="literal"></a>
  325 
  326 <table bgcolor="#e0e0f0" cellpadding="10">
  327 <tr><td>
  328 <pre>
  329 <i>literal</i> ::= <i>ordinary-character</i>
  330         |   <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
  331         |   <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
  332         |   <b>"\"</b> <i>character</i>
  333 </pre>
  334 </td></tr>
  335 </table>
  336 <p>
  337 A literal is either an ordinary character (a character that has no
  338 other significance in the context), an 8 bit hexadecimal encoded
  339 character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
  340 (e.g. <tt>\x{263a}</tt>), or an escaped character.  An escaped
  341 character is a <tt>\</tt> followed by any character, and matches that
  342 character.  Escaping can be used to match characters which have a
  343 special meaning in regexp syntax.  A <tt>\</tt> cannot be the last
  344 character of an ERE.  Escaping also allows you to include a few
  345 non-printable characters in the regular expression.  These special
  346 escape sequences include:
  347 </p>
  348 
  349 <ul>
  350 <li><tt>\a</tt> - Bell character (ASCII code 7)
  351 <li><tt>\e</tt> - Escape character (ASCII code 27)
  352 <li><tt>\f</tt> - Form-feed character (ASCII code 12)
  353 <li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
  354 <li><tt>\r</tt> - Carriage return character (ASCII code 13)
  355 <li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
  356 </ul>
  357 
  358 <p>
  359 An ordinary character is just a single character with no other
  360 significance, and matches that character.  A <tt>{</tt> followed by
  361 something else than a digit is considered an ordinary character.
  362 </p>
  363 
  364 
  365 <h3>Back references</h3>
  366 <a name="backref"></a>
  367 
  368 <table bgcolor="#e0e0f0" cellpadding="10">
  369 <tr><td>
  370 <pre>
  371 <i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
  372 </pre>
  373 </td></tr>
  374 </table>
  375 <p>
  376 A back reference is a backslash followed by a single non-zero decimal
  377 digit <i>d</i>.  It matches the same sequence of characters
  378 matched by the <i>d</i>th parenthesized subexpression.
  379 </p>
  380 
  381 <p>
  382 Back references are not defined for POSIX EREs (for BREs they are),
  383 but many matchers, including TRE, implement back references for both
  384 EREs and BREs.
  385 </p>
  386 
  387 <h3>Options</h3>
  388 <a name="options"></a>
  389 <table bgcolor="#e0e0f0" cellpadding="10">
  390 <tr><td>
  391 <pre>
  392 <i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
  393 </pre>
  394 </td></tr>
  395 </table>
  396 
  397 Options allow compile time options to be turned on/off for particular parts of the 
  398 regular expression. The options equate to several compile time options specified to
  399 the regcomp API function. If the option is specified in the first section, it is
  400 turned on. If it is specified in the second section (after the <tt>-</tt>), it is
  401 turned off.
  402 <ul>
  403 <li>i - Case insensitive.
  404 <li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in 
  405 the <a href="tre-api.html">API Manual</a>.
  406 <li>r - Causes the regex to be matched in a right associative manner rather than the normal
  407 left associative manner.
  408 <li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
  409 </ul>
  410 <h2>BRE Syntax</h2>
  411 
  412 <p>
  413 The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
  414 follows:
  415 </p>
  416 
  417 <ul>
  418 <li><tt>|</tt> is an ordinary character, and there is no equivalent
  419 for its functionality.  <tt>+</tt>, and <tt>?</tt> are ordinary
  420 characters.</li>
  421 <li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
  422 <tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
  423 <li>The parentheses for nested subexpressions are <tt>\(</tt> and
  424 <tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
  425 characters.</li>
  426 <li><tt>^</tt> is an ordinary character except at the beginning of the
  427 RE or the beginning of a parenthesized subexpression.  Similarly,
  428 <tt>$</tt> is an ordinary character except at the end of the
  429 RE or the end of a parenthesized subexpression.</li>
  430 </ul>