"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "doc/ocrad.texi" between
ocrad-0.24.tar.gz and ocrad-0.25.tar.gz

About: GNU Ocrad is an OCR (Optical Character Recognition) program.

ocrad.texi  (ocrad-0.24):ocrad.texi  (ocrad-0.25)
\input texinfo @c -*-texinfo-*- \input texinfo @c -*-texinfo-*-
@c %**start of header @c %**start of header
@setfilename ocrad.info @setfilename ocrad.info
@documentencoding ISO-8859-15 @documentencoding ISO-8859-15
@settitle GNU Ocrad Manual @settitle GNU Ocrad Manual
@finalout @finalout
@c %**end of header @c %**end of header
@set UPDATED 3 October 2014 @set UPDATED 31 March 2015
@set VERSION 0.24 @set VERSION 0.25
@dircategory GNU Packages @dircategory GNU Packages
@direntry @direntry
* Ocrad: (ocrad). The GNU OCR program * Ocrad: (ocrad). The GNU OCR program
@end direntry @end direntry
@ifnothtml @ifnothtml
@titlepage @titlepage
@title GNU Ocrad @title GNU Ocrad
@subtitle The GNU OCR Program @subtitle The GNU OCR Program
skipping to change at line 36 skipping to change at line 36
@end titlepage @end titlepage
@contents @contents
@end ifnothtml @end ifnothtml
@node Top @node Top
@top @top
This manual is for GNU Ocrad (version @value{VERSION}, @value{UPDATED}). This manual is for GNU Ocrad (version @value{VERSION}, @value{UPDATED}).
@sp 1
GNU Ocrad is an OCR (Optical Character Recognition) program and library
based on a feature extraction method. It reads images in pbm (bitmap),
pgm (greyscale) or ppm (color) formats and produces text in @w{byte
(8-bit)} or UTF-8 formats. The pbm, pgm and ppm formats are collectively
known as pnm.
Ocrad includes a layout analyser able to separate the columns or blocks
of text normally found on printed pages.
@menu @menu
* Introduction:: Purpose and features of GNU Ocrad
* Character sets:: Input charsets and output formats * Character sets:: Input charsets and output formats
* Invoking ocrad:: Command line interface * Invoking ocrad:: Command line interface
* Filters:: Postprocessing the produced text
* Library version:: Checking library version * Library version:: Checking library version
* Library functions:: Descriptions of the library functions * Library functions:: Descriptions of the library functions
* Library error codes:: Meaning of codes returned by functions * Library error codes:: Meaning of codes returned by functions
* Image format conversion:: How to convert other formats to pnm * Image format conversion:: How to convert other formats to pnm
* Algorithm:: How ocrad does its job * Algorithm:: How ocrad does its job
* OCR results file:: Description of the ORF file format * OCR results file:: Description of the ORF file format
* Problems:: Reporting bugs * Problems:: Reporting bugs
* Concept index:: Index of concepts * Concept index:: Index of concepts
@end menu @end menu
@sp 1 @sp 1
Copyright @copyright{} 2003-2014 Antonio Diaz Diaz. Copyright @copyright{} 2003-2015 Antonio Diaz Diaz.
This manual is free documentation: you have unlimited permission This manual is free documentation: you have unlimited permission
to copy, distribute and modify it. to copy, distribute and modify it.
@node Introduction
@chapter Introduction
@cindex introduction
GNU Ocrad is an OCR (Optical Character Recognition) program and library
based on a feature extraction method. It reads images in pbm (bitmap),
pgm (greyscale) or ppm (color) formats and produces text in @w{byte
(8-bit)} or UTF-8 formats. The pbm, pgm and ppm formats are collectively
known as pnm.
Ocrad includes a layout analyser able to separate the columns or blocks
of text normally found on printed pages.
For best results the characters should be at least 20 pixels high. If
they are smaller, try the @samp{--scale} option. Scanning the image at
300 dpi usually produces a character size good enough for ocrad.
@node Character sets @node Character sets
@chapter Character sets @chapter Character sets
@cindex input charsets @cindex input charsets
@cindex output format @cindex output format
The character set internally used by ocrad is ISO 10646, also known as The character set internally used by ocrad is ISO 10646, also known as
UCS (Universal Character Set), which can represent over two thousand UCS (Universal Character Set), which can represent over two thousand
million characters (2^31). million characters (2^31).
As it is unpractical to try to recognize one among so many different As it is unpractical to try to recognize one among so many different
skipping to change at line 129 skipping to change at line 138
Append generated text to the output file instead of overwriting it. Append generated text to the output file instead of overwriting it.
@item -c @var{name} @item -c @var{name}
@itemx --charset=@var{name} @itemx --charset=@var{name}
Enable recognition of the characters belonging to the given character set. Enable recognition of the characters belonging to the given character set.
You can repeat this option multiple times with different names for You can repeat this option multiple times with different names for
processing a page with characters from different character sets.@* processing a page with characters from different character sets.@*
If no charset is specified, @w{@samp{iso-8859-15}} (latin9) is assumed.@* If no charset is specified, @w{@samp{iso-8859-15}} (latin9) is assumed.@*
Try @w{@samp{--charset=help}} for a list of valid charset names. Try @w{@samp{--charset=help}} for a list of valid charset names.
@anchor{--filter}
@item -e @var{name} @item -e @var{name}
@itemx --filter=@var{name} @itemx --filter=@var{name}
Pass the output text through the given postprocessing filter. Several Pass the output text through the given built-in postprocessing filter
filters can be applied in sequence using more than one @samp{--filter} (@pxref{Filters}). Several filters can be applied in sequence using as
option. The filters are applied in the order they appear on the command many @samp{--filter} and @samp{--user-filter} options as needed. The
line. filters are applied in the order they appear on the command line.@*
@w{@samp{--filter=letters}} forces every character that resembles a
letter to be recognized as a letter. Other characters will be output
without change.@*
@w{@samp{--filter=letters_only}}, same as @w{@samp{--filter=letters}},
but other characters will be discarded.@*
@w{@samp{--filter=numbers}} forces every character that resembles a
number to be recognized as a number. Other characters will be output
without change.@*
@w{@samp{--filter=numbers_only}}, same as @w{@samp{--filter=numbers}}
but other characters will be discarded.@*
@w{@samp{--filter=same_height}} discards any character (or noise) whose
height differs by more than 13 percent from the median height of the
characters in the line.@*
@w{@samp{--filter=upper_num}} forces every character that resembles an
uppercase letter or a number to be recognized as such. Other characters
will be output without change.@*
@w{@samp{--filter=upper_num_only}}, same as @w{@samp{--filter=upper_num}},
but other characters will be discarded.@*
Try @w{@samp{--filter=help}} for a list of valid filter names. Try @w{@samp{--filter=help}} for a list of valid filter names.
@anchor{--user-filter}
@item -E @var{file}
@itemx --user-filter=@var{file}
Pass the output text through the postprocessing filter defined in
@var{file}. See the chapter @samp{Filters} (@pxref{Filters}) for a
description of the format of @var{file}. Several filters can be applied
in sequence using as many @samp{--filter} and @samp{--user-filter}
options as needed. The filters are applied in the order they appear on
the command line.
@item -f @item -f
@itemx --force @itemx --force
Force overwrite of output files. Force overwrite of output files.
@item -F @var{name} @item -F @var{name}
@itemx --format=@var{name} @itemx --format=@var{name}
Select the output format. The valid names are @samp{byte} and @samp{utf8}.@* Select the output format. The valid names are @samp{byte} and @samp{utf8}.@*
If no output format is specified, @samp{byte} (8 bit) is assumed. If no output format is specified, @samp{byte} (8 bit) is assumed.
@item -i @item -i
skipping to change at line 191 skipping to change at line 192
@item -s @var{value} @item -s @var{value}
@itemx --scale=@var{value} @itemx --scale=@var{value}
Scale up the input image by @var{value} before layout analysis and Scale up the input image by @var{value} before layout analysis and
recognition. If @var{value} is negative, the input image is scaled down recognition. If @var{value} is negative, the input image is scaled down
by @var{-value}. by @var{-value}.
@item -t @var{name} @item -t @var{name}
@itemx --transform=@var{name} @itemx --transform=@var{name}
Perform given transformation (rotation or mirroring) on the input image Perform given transformation (rotation or mirroring) on the input image
before scaling, layout analysis and recognition.@* before scaling, layout analysis and recognition. Rotations are made
counter-clockwise.@*
Try @w{@samp{--transform=help}} for a list of valid transformation names. Try @w{@samp{--transform=help}} for a list of valid transformation names.
@item -T @var{value} @item -T @var{value}
@itemx --threshold=@var{value} @itemx --threshold=@var{value}
Set binarization threshold for pgm or ppm files or for @samp{--scale} Set binarization threshold for pgm or ppm files or for @samp{--scale}
option (only for scaled down images). @var{value} should be a rational option (only for scaled down images). @var{value} should be a rational
number between 0 and 1, and may be given as a percentage (50%), a number between 0 and 1, and may be given as a percentage (50%), a
fraction (1/2), or a decimal value (0.5). Image values greater than fraction (1/2), or a decimal value (0.5). Image values greater than
threshold are converted to white. The default value is 0.5. threshold are converted to white. The default value is 0.5.
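The three accepted spellings of the threshold, and the rule that values greater than the threshold become white, can be sketched in Python. This is an illustration only, not ocrad's code: the names parse_threshold and binarize are hypothetical, and the sketch assumes grey values range from 0 (black) to maxval (white) as in pgm files.

```python
from fractions import Fraction

def parse_threshold(text):
    """Parse '50%', '1/2' or '0.5' into a Fraction between 0 and 1."""
    text = text.strip()
    if text.endswith("%"):
        value = Fraction(text[:-1]) / 100
    else:
        value = Fraction(text)  # accepts both '1/2' and '0.5'
    if not 0 <= value <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return value

def binarize(pixels, maxval, threshold):
    """Return True (white) for values greater than threshold * maxval."""
    limit = threshold * maxval
    return [value > limit for value in pixels]
```

With the default threshold of 0.5 and maxval 255, a grey value of 127 stays black and 128 becomes white.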
skipping to change at line 233 skipping to change at line 235
@w{@samp{-x -}} writes to stdout, overriding text output except if @w{@samp{-x -}} writes to stdout, overriding text output except if
output has been also redirected with the @samp{-o} option. output has been also redirected with the @samp{-o} option.
@end table @end table
Exit status: 0 for a normal exit, 1 for environmental problems (file not Exit status: 0 for a normal exit, 1 for environmental problems (file not
found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or
invalid input file, 3 for an internal consistency error (eg, bug) which invalid input file, 3 for an internal consistency error (eg, bug) which
caused ocrad to panic. caused ocrad to panic.
@node Filters
@chapter Postprocessing the produced text
@cindex filters
Filters replace some characters in the text output with different
characters and remove some other characters from the output. For
example, when recognizing a text that is known to contain just numbers,
any character recognized as a @samp{Z} will probably be a @samp{2}.
Filters don't enable the recognition of characters, just filter them
from the output. Use @samp{--charset} to enable the recognition of a
character set different from the default ISO-8859-15.
Ocrad provides both built-in filters and user-defined filters.
@section User-defined filters
The format of a user-defined filter file (@pxref{--user-filter}) is very
simple. Each line contains either a character conversion or a word that
specifies the default behaviour for unlisted characters.
A character conversion is a comma-separated list of quoted characters
('c'), character sets ([0-9A-Z]), character codes (U0063), or character
ranges (U0000 - UFFFF), and an optional conversion (an equal sign (=)
followed by a quoted character or a character code). The characters in
the list are converted to the character in the conversion. If no
conversion is specified, the character is left unmodified (converted to
itself).
The default behaviour is to discard unlisted characters, i.e. those
characters not appearing in the file, either by themselves or included
in a set or range. If a line containing just the word @samp{leave} is
found in the file, unlisted characters are left unmodified. If the word
is @samp{mark}, unlisted characters are marked as unrecognized.
The destination character of a conversion is considered as listed by
default. Every character may be listed more than once, even as part of
different conversions. The last conversion affecting a given character
is the one that is performed.
Character sets and quoted characters may contain escape sequences.
The character @samp{#} at the beginning of a line or after whitespace starts a
comment that extends to the end of the line.
Ranges of characters may be specified in character sets by writing the
starting and ending characters with a @samp{-} between them. Thus,
@samp{[A-Z]} matches any ASCII uppercase letter. @samp{-} may be
specified by placing it first or last. @samp{]} may be specified by
placing it first. If the first character after the left bracket is
@samp{^}, it indicates a "complemented set", which matches any character
except the ones between the brackets.
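The character set rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: parse_set and in_set are hypothetical names, the argument is the text between the brackets, and escape sequences are not handled.

```python
def parse_set(body):
    """Return (complemented, members) for the text between '[' and ']'."""
    complemented = body.startswith("^")
    if complemented:
        body = body[1:]
    members = set()
    i = 0
    while i < len(body):
        # A '-' between two characters denotes a range;
        # placed first or last it is taken literally.
        if i + 2 < len(body) and body[i + 1] == "-":
            first, last = ord(body[i]), ord(body[i + 2])
            members.update(chr(c) for c in range(first, last + 1))
            i += 3
        else:
            members.add(body[i])
            i += 1
    return complemented, members

def in_set(ch, body):
    """Tell whether character ch matches the set described by body."""
    complemented, members = parse_set(body)
    return (ch in members) != complemented
```

For example, in_set("Q", "A-Z") is true, while in_set("5", "^0-9") is false because the set is complemented.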
Literals (quoted characters and character sets) are decoded as
ISO-8859-15. Character codes are decoded as UCS2. Thus, a @samp{latin
capital letter y with diaeresis} is specified in a set as @samp{[\xBE]},
but its code is @samp{U0178}.
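The relation between a literal and its character code can be checked with Python's standard codecs. The helper names decode_literal and decode_code are hypothetical and only serve this illustration.

```python
def decode_literal(raw):
    """Decode the bytes of a quoted character or set as ISO-8859-15."""
    return raw.decode("iso8859-15")

def decode_code(token):
    """Decode a character code such as 'U0178' (letter U plus hex digits)."""
    return chr(int(token[1:], 16))

# Byte 0xBE in ISO-8859-15 and code U0178 name the same character.
assert decode_literal(b"\xbe") == decode_code("U0178")
```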
Spaces and control characters are unaffected by filters, except that
leading, trailing, and duplicate spaces produced by the removal of other
characters are themselves removed.
@noindent
Here is an example user-defined filter file equivalent to the built-in
filter @samp{numbers}:
@example
leave # remove this line to get @samp{numbers_only}
'D', 'O', 'Q', 'o' = '0'
'I', 'L', 'l', '|' = '1'
'Z', 'z' = '2'
'3'
'A', 'q' = '4'
'S', 's' = '5'
'G', 'b', U00F3 = '6' # latin small letter o with acute
'J', 'T' = '7'
'&', 'B' = '8'
'g' = '9'
@end example
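The effect of such a filter file on a line of text can be approximated in Python. This is a rough sketch, not ocrad's implementation: build_table and apply_filter are hypothetical names, the "_" used for marked characters is an assumption, and all duplicate spaces are collapsed, not only those produced by removed characters.

```python
import re

# Conversions copied from the example filter file (sources, destination).
RULES = [("DOQo", "0"), ("ILl|", "1"), ("Zz", "2"), ("3", "3"),
         ("Aq", "4"), ("Ss", "5"), ("Gb\u00f3", "6"),
         ("JT", "7"), ("&B", "8"), ("g", "9")]

def build_table(rules):
    table = dict()
    for sources, target in rules:
        # The destination of a conversion counts as listed (maps to itself)
        # unless some rule converts it explicitly.
        table.setdefault(target, target)
        for ch in sources:
            table[ch] = target  # the last conversion for a character wins
    return table

def apply_filter(line, table, unlisted="discard"):
    out = []
    for ch in line:
        if ch.isspace():
            out.append(ch)          # spaces and control characters pass through
        elif ch in table:
            out.append(table[ch])
        elif unlisted == "leave":
            out.append(ch)
        elif unlisted == "mark":
            out.append("_")         # assumed marker for unrecognized characters
        # unlisted == "discard": the character is dropped
    # Remove leading, trailing and duplicate spaces.
    return re.sub(r"  +", " ", "".join(out)).strip(" ")

NUMBERS = build_table(RULES)
```

With unlisted="leave" the sketch turns "l.OOO" into "1.000". With the default "discard" it turns "Z5 x I0" into "25 10".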
@section Built-in filters
Ocrad provides the following built-in filters (@pxref{--filter}):
@table @samp
@item --filter=letters
Forces every character that resembles a letter to be recognized as a
letter. Other characters will be output without change.
@item --filter=letters_only
Same as @samp{--filter=letters}, but other characters will be discarded.
@item --filter=numbers
Forces every character that resembles a number to be recognized as a
number. Other characters will be output without change.
@item --filter=numbers_only
Same as @samp{--filter=numbers} but other characters will be discarded.
@item --filter=same_height
Discards any character (or noise) whose height differs by more than 10
percent from the median height of the characters in the line.
@item --filter=text_block
Discards any character (or noise) outside of a rectangular block of text
lines.
@item --filter=upper_num
Forces every character that resembles an uppercase letter or a number to
be recognized as such. Other characters will be output without change.
@item --filter=upper_num_mark
Same as @samp{--filter=upper_num}, but other characters will be marked
as unrecognized.
@item --filter=upper_num_only
Same as @samp{--filter=upper_num}, but other characters will be
discarded.
@end table
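The rule behind the same_height filter in the table above can be sketched in Python. This is a minimal illustration assuming the heights of the characters in one line are already known; the function name and the tolerance parameter are hypothetical.

```python
from statistics import median

def same_height(heights, tolerance=0.10):
    """Return the indices of the characters a same_height-style filter keeps."""
    mid = median(heights)
    # Keep a character when its height is within tolerance of the median.
    return [i for i, h in enumerate(heights)
            if abs(h - mid) <= tolerance * mid]
```

For the heights 20, 21, 20, 5, 19, 30 the median is 20, so the characters of height 5 and 30 are discarded.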
@node Library version @node Library version
@chapter Library version @chapter Library version
@cindex library version @cindex library version
@deftypefun {const char *} OCRAD_version ( void ) @deftypefun {const char *} OCRAD_version ( void )
Returns the library version as a string. Returns the library version as a string.
@end deftypefun @end deftypefun
@deftypevr Constant {const char *} OCRAD_version_string @deftypevr Constant {const char *} OCRAD_version_string
This constant is defined in the header file @samp{ocradlib.h}. This constant is defined in the header file @samp{ocradlib.h}.
skipping to change at line 461 skipping to change at line 584
understanding about OCR issues. understanding about OCR issues.
The overall working of ocrad may be described as follows:@* The overall working of ocrad may be described as follows:@*
1) Read the image.@* 1) Read the image.@*
2) Optionally, perform some transformations (cut, rotate, scale, etc).@* 2) Optionally, perform some transformations (cut, rotate, scale, etc).@*
3) Optionally, perform layout detection.@* 3) Optionally, perform layout detection.@*
4) Remove frames and pictures.@* 4) Remove frames and pictures.@*
5) Detect characters and group them in lines.@* 5) Detect characters and group them in lines.@*
6) Recognize characters (very ad hoc; one algorithm per character).@* 6) Recognize characters (very ad hoc; one algorithm per character).@*
7) Correct some ambiguities (transform l.OOO into 1.000, etc).@* 7) Correct some ambiguities (transform l.OOO into 1.000, etc).@*
8) Output result. 8) Optionally, apply one or more filters to the text.@*
9) Output text result.
@sp 1 @sp 1
Ocrad recognizes characters by their shape, and the reason it is so fast Ocrad recognizes characters by their shape, and the reason it is so fast
is that it does not compare the shape of every character against some is that it does not compare the shape of every character against some
sort of database of shapes and then choose the best match. Instead of sort of database of shapes and then choose the best match. Instead of
this, ocrad only compares the shape differences that are relevant to this, ocrad only compares the shape differences that are relevant to
choose between two character categories, mostly like a binary search. choose between two character categories, mostly like a binary search.
As there is no such thing as a free lunch, this approach has some As there is no such thing as a free lunch, this approach has some
drawbacks. It makes ocrad very sensitive to character defects, and makes drawbacks. It makes ocrad very sensitive to character defects, and makes
it difficult to modify ocrad to recognize new characters. it difficult to modify ocrad to recognize new characters.
For best results, the characters should be at least 20 pixels high. If
they are smaller, try the --scale option. Scanning the image at 300 dpi
usually produces a character size good enough for ocrad.
@node OCR results file @node OCR results file
@chapter OCR results file @chapter OCR results file
@cindex OCR results file @cindex OCR results file
Calling ocrad with option @samp{-x} produces an OCR results file (ORF), Calling ocrad with option @samp{-x} produces an OCR results file (ORF),
that is, a parsable file containing the OCR results. The ORF format is that is, a parsable file containing the OCR results. The ORF format is
as follows: as follows:
@itemize @minus @itemize @minus
@item @item
skipping to change at line 541 skipping to change at line 661
@var{w} is the width of the bounding box.@* @var{w} is the width of the bounding box.@*
@var{h} is the height of the bounding box.@* @var{h} is the height of the bounding box.@*
@var{g} is the number of different recognition guesses for this character.@* @var{g} is the number of different recognition guesses for this character.@*
The result characters follow after the number of guesses in the form of The result characters follow after the number of guesses in the form of
a comma-separated list of pairs. Every pair is formed by the actual a comma-separated list of pairs. Every pair is formed by the actual
recognised char @var{c} enclosed in single quotes, followed by the recognised char @var{c} enclosed in single quotes, followed by the
confidence value @var{v}, without space between them. The higher the confidence value @var{v}, without space between them. The higher the
value of confidence, the more confident is the result. value of confidence, the more confident is the result.
@end itemize @end itemize
Running @code{./ocrad -x test.orf examples/test.pbm} in the source directory Running @code{./ocrad -x test.orf testsuite/test.pbm} in the source
will give you an example ORF file. directory will give you an example ORF file.
@node Problems @node Problems
@chapter Reporting bugs @chapter Reporting bugs
@cindex bugs @cindex bugs
@cindex getting help @cindex getting help
There are probably bugs in ocrad. There are certainly errors and There are probably bugs in ocrad. There are certainly errors and
omissions in this manual. If you report them, they will get fixed. If omissions in this manual. If you report them, they will get fixed. If
you don't, no one will ever know about them and they will remain unfixed you don't, no one will ever know about them and they will remain unfixed
for all eternity, if not longer. for all eternity, if not longer.
If you find a bug in GNU Ocrad, please send electronic mail to If you find a bug in GNU Ocrad, please send electronic mail to
@email{bug-ocrad@@gnu.org}. Include the version number, which you can @email{bug-ocrad@@gnu.org}. Include the version number, which you can
find by running @w{@samp{ocrad --version}}. find by running @w{@code{ocrad --version}}.
@node Concept index @node Concept index
@unnumbered Concept index @unnumbered Concept index
@printindex cp @printindex cp
@bye @bye
 End of changes. 15 change blocks. 
45 lines changed or deleted 165 lines changed or added
