"Fossies" - the Fresh Open Source Software Archive

Member "scalpel-2.0/tre-0.7.5-win32/README" (20 Apr 2011, 10116 Bytes) of archive /linux/misc/scalpel-2.0.tar.gz:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 Introduction
    2 
    3    TRE is a lightweight, robust, and efficient POSIX compliant regexp
    4    matching library with some exciting features such as approximate
    5    (fuzzy) matching.
    6 
    7    The matching algorithm used in TRE uses linear worst-case time in
    8    the length of the text being searched, and quadratic worst-case
    9    time in the length of the used regular expression. In other words,
   10    the time complexity of the algorithm is O(M^2N), where M is the
   11    length of the regular expression and N is the length of the
   12    text. The used space is also quadratic on the length of the regex,
   13    but does not depend on the searched string. This quadratic
   14    behaviour occurs only on pathological cases which are probably very
   15    rare in practice.
   16 
   17 Features
   18 
   19    TRE is not just yet another regexp matcher. TRE has some features
   20    which are not there in most free POSIX compatible
   21    implementations. Most of these features are not present in non-free
   22    implementations either, for that matter.
   23 
   24 Approximate matching
   25 
   26    Approximate pattern matching allows matches to be approximate, that
   27    is, allows the matches to be close to the searched pattern under
   28    some measure of closeness. TRE uses the edit-distance measure (also
   29    known as the Levenshtein distance) where characters can be
   30    inserted, deleted, or substituted in the searched text in order to
   31    get an exact match. Each insertion, deletion, or substitution adds
   32    the distance, or cost, of the match. TRE can report the matches
   33    which have a cost lower than some given threshold value. TRE can
   34    also be used to search for matches with the lowest cost.
   35 
   36    TRE includes a version of the agrep (approximate grep) command line
   37    tool for approximate regexp matching in the style of grep. Unlike
   38    other agrep implementations (like the one by Sun Wu and Udi Manber
   39    from University of Arizona available here) TRE agrep allows full
   40    regexps of any length, any number of errors, and non-uniform costs
   41    for insertion, deletion and substitution.
   42 
   43 Strict standard conformance
   44 
   45    POSIX defines the behaviour of regexp functions precisely. TRE
   46    attempts to conform to these specifications as strictly as
   47    possible. TRE always returns the correct matches for subpatterns,
   48    for example. Very few other implementations do this correctly. In
   49    fact, the only other implementations besides TRE that I am aware of
   50    (free or not) that get it right are Rx by Tom Lord, Regex++ by John
   51    Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.
   52 
   53    The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
   54    or Open Group Base Specifications Issue 6, commonly referred to as
   55    "POSIX".  It can be found online here. The relevant parts are the
   56    base specifications on regular expressions (and the rationale) and
   57    the description of the regcomp() API.
   58 
   59    For an excellent survey on POSIX regexp matchers, see the testregex
   60    pages by Glenn Fowler of AT&T Labs Research.
   61 
   62 Predictable matching speed
   63 
   64    Because of the matching algorithm used in TRE, the maximum time
   65    consumed by any regexec() call is always directly proportional to
   66    the length of the searched string. There is one exception: if back
   67    references are used, the matching may take time that grows
   68    exponentially with the length of the string. This is because
   69    matching back references is an NP complete problem, and almost
   70    certainly requires exponential time to match in the worst case.
   71 
   72 Predictable and modest memory consumption
   73 
   74    A regexec() call never allocates memory from the heap. TRE
   75    allocates all the memory it needs during a regcomp() call, and some
   76    temporary working space from the stack frame for the duration of
   77    the regexec() call. The amount of temporary space needed is
   78    constant during matching and does not depend on the searched
   79    string. For regexps of reasonable size TRE needs less than 50K of
   80    dynamically allocated memory during the regcomp() call, less than
   81    20K for the compiled pattern buffer, and less than two kilobytes of
   82    temporary working space from the stack frame during a regexec()
   83    call. There is no time/memory tradeoff. TRE is also small in code
   84    size; statically linking with TRE increases the executable size
   85    less than 30K (gcc-3.2, x86, GNU/Linux).
   86 
   87 Wide character and multibyte character set support
   88 
   89    TRE supports multibyte character sets. This makes it possible to
   90    use regexps seamlessly with, for example, Japanese locales. TRE
   91    also provides a wide character API.
   92 
   93 Binary pattern and data support
   94 
   95    TRE provides APIs which allow binary zero characters both in
   96    regexps and searched strings. The standard API cannot be easily
   97    used to, for example, search for printable words from binary data
   98    (although it is possible with some hacking). Searching for patterns
   99    which contain binary zeroes embedded is not possible at all with
  100    the standard API.
  101 
  102 Completely thread safe
  103 
  104    TRE is completely thread safe. All the exported functions are
  105    re-entrant, and a single compiled regexp object can be used
  106    simultaneously in multiple contexts; e.g. in main() and a signal
  107    handler, or in many threads of a multithreaded application.
  108 
  109 Portable
  110 
  111    TRE is portable across multiple platforms. Here's a table of
  112    platforms and compilers that have been successfully used to compile
  113    and run TRE:
  114 
  115       Platform(s)                       | Compiler(s)
  116       ----------------------------------+------------
  117       AIX 4.3.2 - 5.3.0                 | GCC, C for AIX compiler version 5
  118       Compaq Tru64 UNIX V5.1A/B         | Compaq C V6.4-014 - V6.5-011
  119       Cygwin 1.3 - 1.5                  | GCC
  120       Digital UNIX V4.0                 | DEC C V5.9-005
  121       FreeBSD 4 and above               | GCC
  122       GNU/Linux systems on x86, x86_64, | GCC
  123       ppc64, s390			|
  124       HP-UX 10.20- 11.00                | GCC, HP C Compiler
  125       IRIX 6.5                          | GCC, MIPSpro Compilers 7.3.1.3m
  126       Max OS X				|
  127       NetBSD 1.5 and above              | GCC, egcs
  128       OpenBSD 3.3 and above             | GCC
  129       Solaris 2.7-10 sparc/x86          | GCC, Sun Workshop 6 compilers
  130       Windows 98 - XP                   | Microsoft Visual C++ 6.0
  131 
  132    TRE 0.7.5 should compile without changes on all of the above
  133    platforms.  Tell me if you are using TRE on a platform that is not
  134    listed above, and I'll add it to the list. Also let me know if TRE
  135    does not work on a listed platform.
  136 
  137    Depending on the platform, you may need to install libutf8 to get
  138    wide character and multibyte character set support.
  139 
  140  Free
  141 
  142    TRE is free; you can redistribute it and/or modify it under the
  143    terms of the GNU Lesser General Public License as published by the
  144    Free Software Foundation. See the file COPYING, included in the
  145    distribution packages, for the complete license.
  146 
  147    If you cannot accept the GNU LGPL, it may be possible that you can
  148    persuade me to give or sell you a version which is licensed
  149    differently.  Please contact me for more information.
  150 
  151 Roadmap
  152 
  153    There are currently two features, both related to collating
  154    elements, missing from 100% POSIX compliance. These are:
  155 
  156      * Support for collating elements (e.g. [[.<X>.]], where <X> is a
  157        collating element). It is not possible to support
  158        multi-character collating elements portably, since POSIX does
  159        not define a way to determine whether a character sequence is a
  160        multi-character collating element or not.
  161 
  162      * Support for equivalence classes, for example [[=<X>=]], where
  163        <X> is a collating element. An equivalence class matches any
  164        character which has the same primary collation weight as
  165        <X>. Again, POSIX provides no portable mechanism for
  166        determining the primary collation weight of a collating
  167        element.
  168 
  169    Note that other portable regexp implementations don't support
  170    collating elements either. The single exception is Regex++, which
  171    comes with its own database for collating elements for different
  172    locales. Support for collating elements and equivalence classes has
  173    not been widely requested and is not very high on the TODO list at
  174    the moment.
  175 
  176    These are other features I'm planning to implement real soon now:
  177 
  178      * All the missing GNU extensions enabled in GNU regex, such as
  179        [[:<:]] and [[:>:]]
  180 
  181      * A REG_SHORTEST regexec() flag for returning the shortest match
  182        instead of the longest match.
  183 
  184      * Perl-compatible syntax
  185 
  186             [:^class:]
  187                Matches anything but the characters in class. Note that
  188                [^[:class:]] works already, this would be just a
  189                convenience shorthand.
  190 
  191             \A
  192                Match only at beginning of string
  193 
  194             \Z
  195                Match only at end of string, or before newline at the end
  196 
  197             \z
  198                Match only at end of string
  199 
  200             \l
  201                Lowercase next char (think vi)
  202 
  203             \u
  204                Uppercase next char (think vi)
  205 
  206             \L
  207                Lowercase till \E (think vi)
  208 
  209             \U
  210                Uppercase till \E (think vi)
  211 
  212             (?=pattern)
  213                Zero-width positive look-ahead assertions.
  214 
  215             (?!pattern)
  216                Zero-width negative look-ahead assertions.
  217 
  218             (?<=pattern)
  219                Zero-width positive look-behind assertions.
  220 
  221             (?<!pattern)
  222                Zero-width negative look-behind assertions.
  223 
  224    Documentation especially for the nonstandard features of TRE, such
  225    as approximate matching, is a work in progress (with "progress"
  226    loosely defined...)
  227 
  228    Mailing lists
  229 
  230    tre-general@lists.laurikari.net
  231       This list is for any discussion on the TRE software, including
  232       reporting bugs, feature requests, requests for help, and other
  233       things.
  234 
  235    tre-announce@lists.laurikari.net
  236       Subscribe to this list to get announcements of new releases of
  237       TRE.  Alternatively, you can subscribe to the freshmeat.net
  238       project and get similar announcements that way, if you prefer.
  239 
  240 Ville Laurikari    <vl@iki.fi>