ocrad.info (ocrad-0.25-pre5) | : | ocrad.info (ocrad-0.25-pre6) | ||
---|---|---|---|---|
File: ocrad.info, Node: Top, Next: Introduction, Up: (dir) | File: ocrad.info, Node: Top, Next: Introduction, Up: (dir) | |||
GNU Ocrad Manual | GNU Ocrad Manual | |||
**************** | **************** | |||
This manual is for GNU Ocrad (version 0.25-pre5, 8 January 2015). | This manual is for GNU Ocrad (version 0.25-pre6, 19 January 2015). | |||
* Menu: | * Menu: | |||
* Introduction:: Purpose and features of GNU Ocrad | * Introduction:: Purpose and features of GNU Ocrad | |||
* Character sets:: Input charsets and output formats | * Character sets:: Input charsets and output formats | |||
* Invoking ocrad:: Command line interface | * Invoking ocrad:: Command line interface | |||
* Filters:: Postprocessing the produced text | * Filters:: Postprocessing the produced text | |||
* Library version:: Checking library version | * Library version:: Checking library version | |||
* Library functions:: Descriptions of the library functions | * Library functions:: Descriptions of the library functions | |||
* Library error codes:: Meaning of codes returned by functions | * Library error codes:: Meaning of codes returned by functions | |||
skipping to change at line 220 | skipping to change at line 220 | |||
Filters don't enable the recognition of characters, just filter them | Filters don't enable the recognition of characters, just filter them | |||
from the output. Use '--charset' to enable the recognition of a | from the output. Use '--charset' to enable the recognition of a | |||
character set different from the default ISO-8859-15. | character set different from the default ISO-8859-15. | |||
Ocrad provides both built-in filters and user-defined filters. | Ocrad provides both built-in filters and user-defined filters. | |||
4.1 User-defined filters | 4.1 User-defined filters | |||
======================== | ======================== | |||
The format of a user-defined filter file (*note --user-filter::) is very | The format of a user-defined filter file (*note --user-filter::) is very | |||
simple. Each line contains a comma-separated list of quoted characters | simple. Each line contains either a character conversion or a word that | |||
specifies the default behaviour for unlisted characters. | ||||
A character conversion is a comma-separated list of quoted characters | ||||
('c'), character sets ([0-9A-Z]), character codes (U0063), or character | ('c'), character sets ([0-9A-Z]), character codes (U0063), or character | |||
ranges (U0000 - UFFFF), and an optional conversion (an equal sign (=) | ranges (U0000 - UFFFF), and an optional conversion (an equal sign (=) | |||
followed by a quoted character or a character code). The characters in | followed by a quoted character or a character code). The characters in | |||
the list are converted to the character in the conversion. If no | the list are converted to the character in the conversion. If no | |||
conversion is specified, the character is left unmodified (converted to | conversion is specified, the character is left unmodified (converted to | |||
itself). | itself). | |||
Any character not appearing in the file, either by itself or | The default behaviour is to discard unlisted characters, i.e. those | |||
included in a set or range, will be discarded. The destination | characters not appearing in the file, either by themselves or included | |||
character of a conversion is considered as listed by default. Every | in a set or range. If a line containing just the word 'leave' is found | |||
character may be listed more than once, even as part of different | in the file, unlisted characters are left unmodified. If the word is | |||
conversions. The last conversion affecting a given character is the one | 'mark', unlisted characters are marked as unrecognized. | |||
that is performed. | ||||
The destination character of a conversion is considered as listed by | ||||
default. Every character may be listed more than once, even as part of | ||||
different conversions. The last conversion affecting a given character | ||||
is the one that is performed. | ||||
Character sets and quoted characters may contain escape sequences. | Character sets and quoted characters may contain escape sequences. | |||
The character '#' at the beginning of a line or after whitespace starts a | The character '#' at the beginning of a line or after whitespace starts a | |||
comment that extends to the end of the line. | comment that extends to the end of the line. | |||
Ranges of characters may be specified in character sets by writing | Ranges of characters may be specified in character sets by writing | |||
the starting and ending characters with a '-' between them. Thus, | the starting and ending characters with a '-' between them. Thus, | |||
'[A-Z]' matches any ASCII uppercase letter. '-' may be specified by | '[A-Z]' matches any ASCII uppercase letter. '-' may be specified by | |||
placing it first or last. ']' may be specified by placing it first. If | placing it first or last. ']' may be specified by placing it first. If | |||
skipping to change at line 260 | skipping to change at line 267 | |||
capital letter y with diaeresis' is specified in a set as '[\xBE]', but | capital letter y with diaeresis' is specified in a set as '[\xBE]', but | |||
its code is 'U0178'. | its code is 'U0178'. | |||
Spaces and control characters are unaffected by filters, except that | Spaces and control characters are unaffected by filters, except that | |||
leading, trailing, and duplicate spaces produced by the removal of other | leading, trailing, and duplicate spaces produced by the removal of other | |||
characters will be themselves removed. | characters will be themselves removed. | |||
Here is an example user-defined filter file equivalent to the built-in | Here is an example user-defined filter file equivalent to the built-in | |||
filter 'numbers': | filter 'numbers': | |||
U0000 - U00FF # remove this line to get 'numbers_only' | leave # remove this line to get 'numbers_only' | |||
'D', 'O', 'Q', 'o' = '0' | 'D', 'O', 'Q', 'o' = '0' | |||
'I', 'L', 'l', '|' = '1' | 'I', 'L', 'l', '|' = '1' | |||
'Z', 'z' = '2' | 'Z', 'z' = '2' | |||
'3' | '3' | |||
'A', 'q' = '4' | 'A', 'q' = '4' | |||
'S', 's' = '5' | 'S', 's' = '5' | |||
'G', 'b', U00F3 = '6' # latin small letter o with acute | 'G', 'b', U00F3 = '6' # latin small letter o with acute | |||
'J', 'T' = '7' | 'J', 'T' = '7' | |||
'&', 'B' = '8' | '&', 'B' = '8' | |||
'g' = '9' | 'g' = '9' | |||
End of changes. 4 change blocks. | ||||
9 lines changed or deleted | 16 lines changed or added |
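
The behavioural change in this diff is the new default-behaviour word ('leave' or 'mark') for unlisted characters. The following Python sketch models the semantics described in the 0.25-pre6 column; it is not Ocrad's implementation. The names `build_table`, `apply_filter`, `NUMBERS`, and `MARK` are hypothetical, the `_` placeholder used for "marked as unrecognized" is an assumption, and the removal of leading, trailing, and duplicate spaces is deliberately not modelled.

```python
# Simplified model of the user-defined filter rules described above.
# Illustrative only: all names here are assumptions, not Ocrad API.

MARK = "_"  # assumption: placeholder standing in for "marked as unrecognized"


def build_table(rules):
    """Build a char -> char table from (sources, dest) pairs.

    dest=None means no conversion was given (character maps to itself).
    """
    table = {}
    for sources, dest in rules:
        for ch in sources:
            # The last conversion affecting a character is the one performed.
            table[ch] = ch if dest is None else dest
        if dest is not None:
            # The destination of a conversion is considered listed by default.
            table.setdefault(dest, dest)
    return table


def apply_filter(text, rules, default="discard"):
    """Apply the rules; default is 'discard', 'leave', or 'mark'."""
    table = build_table(rules)
    out = []
    for ch in text:
        if ch.isspace() or ord(ch) < 32:
            out.append(ch)          # spaces and control characters pass through
        elif ch in table:
            out.append(table[ch])   # listed character: convert
        elif default == "leave":
            out.append(ch)          # unlisted, left unmodified
        elif default == "mark":
            out.append(MARK)        # unlisted, marked as unrecognized
        # default == "discard": unlisted character is dropped
    return "".join(out)


# The conversions from the example filter file, minus the 'leave' line.
NUMBERS = [
    ("DOQo", "0"), ("ILl|", "1"), ("Zz", "2"), ("3", None),
    ("Aq", "4"), ("Ss", "5"), ("Gb\u00f3", "6"), ("JT", "7"),
    ("&B", "8"), ("g", "9"),
]

print(apply_filter("lOZ3x", NUMBERS, "leave"))    # like 'numbers'
print(apply_filter("lOZ3x", NUMBERS, "discard"))  # like 'numbers_only'
print(apply_filter("lOZ3x", NUMBERS, "mark"))
```

Under this model, passing `"leave"` corresponds to the example file as shown in the new column, while omitting it (the `"discard"` default) corresponds to removing the 'leave' line.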