"Fossies" - the Fresh Open Source Software Archive
Member "scalpel-2.0/tre-0.7.5-win32/README" (20 Apr 2011, 10116 Bytes) of archive /linux/misc/scalpel-2.0.tar.gz:
As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard
) with prefixed line numbers.
Alternatively you can here view
the uninterpreted source code file.
3 TRE is a lightweight, robust, and efficient POSIX compliant regexp
4 matching library with some exciting features such as approximate
5 (fuzzy) matching.
7 The matching algorithm used in TRE uses linear worst-case time in
8 the length of the text being searched, and quadratic worst-case
9 time in the length of the used regular expression. In other words,
10 the time complexity of the algorithm is O(M^2N), where M is the
11 length of the regular expression and N is the length of the
12 text. The used space is also quadratic on the length of the regex,
13 but does not depend on the searched string. This quadratic
14 behaviour occurs only on pathological cases which are probably very
15 rare in practice.
19 TRE is not just yet another regexp matcher. TRE has some features
20 which are not there in most free POSIX compatible
21 implementations. Most of these features are not present in non-free
22 implementations either, for that matter.
24 Approximate matching
26 Approximate pattern matching allows matches to be approximate, that
27 is, allows the matches to be close to the searched pattern under
28 some measure of closeness. TRE uses the edit-distance measure (also
29 known as the Levenshtein distance) where characters can be
30 inserted, deleted, or substituted in the searched text in order to
31 get an exact match. Each insertion, deletion, or substitution adds
32 the distance, or cost, of the match. TRE can report the matches
33 which have a cost lower than some given threshold value. TRE can
34 also be used to search for matches with the lowest cost.
36 TRE includes a version of the agrep (approximate grep) command line
37 tool for approximate regexp matching in the style of grep. Unlike
38 other agrep implementations (like the one by Sun Wu and Udi Manber
39 from University of Arizona available here) TRE agrep allows full
40 regexps of any length, any number of errors, and non-uniform costs
41 for insertion, deletion and substitution.
43 Strict standard conformance
45 POSIX defines the behaviour of regexp functions precisely. TRE
46 attempts to conform to these specifications as strictly as
47 possible. TRE always returns the correct matches for subpatterns,
48 for example. Very few other implementations do this correctly. In
49 fact, the only other implementations besides TRE that I am aware of
50 (free or not) that get it right are Rx by Tom Lord, Regex++ by John
51 Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.
53 The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
54 or Open Group Base Specifications Issue 6, commonly referred to as
55 "POSIX". It can be found online here. The relevant parts are the
56 base specifications on regular expressions (and the rationale) and
57 the description of the regcomp() API.
59 For an excellent survey on POSIX regexp matchers, see the testregex
60 pages by Glenn Fowler of AT&T Labs Research.
62 Predictable matching speed
64 Because of the matching algorithm used in TRE, the maximum time
65 consumed by any regexec() call is always directly proportional to
66 the length of the searched string. There is one exception: if back
67 references are used, the matching may take time that grows
68 exponentially with the length of the string. This is because
69 matching back references is an NP complete problem, and almost
70 certainly requires exponential time to match in the worst case.
72 Predictable and modest memory consumption
74 A regexec() call never allocates memory from the heap. TRE
75 allocates all the memory it needs during a regcomp() call, and some
76 temporary working space from the stack frame for the duration of
77 the regexec() call. The amount of temporary space needed is
78 constant during matching and does not depend on the searched
79 string. For regexps of reasonable size TRE needs less than 50K of
80 dynamically allocated memory during the regcomp() call, less than
81 20K for the compiled pattern buffer, and less than two kilobytes of
82 temporary working space from the stack frame during a regexec()
83 call. There is no time/memory tradeoff. TRE is also small in code
84 size; statically linking with TRE increases the executable size
85 less than 30K (gcc-3.2, x86, GNU/Linux).
87 Wide character and multibyte character set support
89 TRE supports multibyte character sets. This makes it possible to
90 use regexps seamlessly with, for example, Japanese locales. TRE
91 also provides a wide character API.
93 Binary pattern and data support
95 TRE provides APIs which allow binary zero characters both in
96 regexps and searched strings. The standard API cannot be easily
97 used to, for example, search for printable words from binary data
98 (although it is possible with some hacking). Searching for patterns
99 which contain binary zeroes embedded is not possible at all with
100 the standard API.
102 Completely thread safe
104 TRE is completely thread safe. All the exported functions are
105 re-entrant, and a single compiled regexp object can be used
106 simultaneously in multiple contexts; e.g. in main() and a signal
107 handler, or in many threads of a multithreaded application.
111 TRE is portable across multiple platforms. Here's a table of
112 platforms and compilers that have been successfully used to compile
113 and run TRE:
115 Platform(s) | Compiler(s)
117 AIX 4.3.2 - 5.3.0 | GCC, C for AIX compiler version 5
118 Compaq Tru64 UNIX V5.1A/B | Compaq C V6.4-014 - V6.5-011
119 Cygwin 1.3 - 1.5 | GCC
120 Digital UNIX V4.0 | DEC C V5.9-005
121 FreeBSD 4 and above | GCC
122 GNU/Linux systems on x86, x86_64, | GCC
123 ppc64, s390 |
124 HP-UX 10.20- 11.00 | GCC, HP C Compiler
125 IRIX 6.5 | GCC, MIPSpro Compilers 22.214.171.124m
126 Max OS X |
127 NetBSD 1.5 and above | GCC, egcs
128 OpenBSD 3.3 and above | GCC
129 Solaris 2.7-10 sparc/x86 | GCC, Sun Workshop 6 compilers
130 Windows 98 - XP | Microsoft Visual C++ 6.0
132 TRE 0.7.5 should compile without changes on all of the above
133 platforms. Tell me if you are using TRE on a platform that is not
134 listed above, and I'll add it to the list. Also let me know if TRE
135 does not work on a listed platform.
137 Depending on the platform, you may need to install libutf8 to get
138 wide character and multibyte character set support.
142 TRE is free; you can redistribute it and/or modify it under the
143 terms of the GNU Lesser General Public License as published by the
144 Free Software Foundation. See the file COPYING, included in the
145 distribution packages, for the complete license.
147 If you cannot accept the GNU LGPL, it may be possible that you can
148 persuade me to give or sell you a version which is licensed
149 differently. Please contact me for more information.
153 There are currently two features, both related to collating
154 elements, missing from 100% POSIX compliance. These are:
156 * Support for collating elements (e.g. [[.<X>.]], where <X> is a
157 collating element). It is not possible to support
158 multi-character collating elements portably, since POSIX does
159 not define a way to determine whether a character sequence is a
160 multi-character collating element or not.
162 * Support for equivalence classes, for example [[=<X>=]], where
163 <X> is a collating element. An equivalence class matches any
164 character which has the same primary collation weight as
165 <X>. Again, POSIX provides no portable mechanism for
166 determining the primary collation weight of a collating
169 Note that other portable regexp implementations don't support
170 collating elements either. The single exception is Regex++, which
171 comes with its own database for collating elements for different
172 locales. Support for collating elements and equivalence classes has
173 not been widely requested and is not very high on the TODO list at
174 the moment.
176 These are other features I'm planning to implement real soon now:
178 * All the missing GNU extensions enabled in GNU regex, such as
179 [[:<:]] and [[:>:]]
181 * A REG_SHORTEST regexec() flag for returning the shortest match
182 instead of the longest match.
184 * Perl-compatible syntax
187 Matches anything but the characters in class. Note that
188 [^[:class:]] works already, this would be just a
189 convenience shorthand.
192 Match only at beginning of string
195 Match only at end of string, or before newline at the end
198 Match only at end of string
201 Lowercase next char (think vi)
204 Uppercase next char (think vi)
207 Lowercase till \E (think vi)
210 Uppercase till \E (think vi)
213 Zero-width positive look-ahead assertions.
216 Zero-width negative look-ahead assertions.
219 Zero-width positive look-behind assertions.
222 Zero-width negative look-behind assertions.
224 Documentation especially for the nonstandard features of TRE, such
225 as approximate matching, is a work in progress (with "progress"
226 loosely defined...)
228 Mailing lists
231 This list is for any discussion on the TRE software, including
232 reporting bugs, feature requests, requests for help, and other
236 Subscribe to this list to get announcements of new releases of
237 TRE. Alternatively, you can subscribe to the freshmeat.net
238 project and get similar announcements that way, if you prefer.
240 Ville Laurikari <firstname.lastname@example.org>