"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "docs/advanced.rst" between
OCRmyPDF-8.0.1.tar.gz and OCRmyPDF-8.1.0.tar.gz

About: OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

advanced.rst  (OCRmyPDF-8.0.1):advanced.rst  (OCRmyPDF-8.1.0)
Advanced features
=================

Control of unpaper
------------------
OCRmyPDF uses ``unpaper`` to provide the implementation of the ``--clean`` and
``--clean-final`` arguments.
`unpaper <https://github.com/Flameeyes/unpaper/blob/master/doc/basic-concepts.md>`_
provides a variety of image processing filters to improve images.

By default, OCRmyPDF uses only ``unpaper`` arguments that were found to be safe
to use on almost all files without having to inspect every page of the file
afterwards. This is particularly true when only ``--clean`` is used, since that
instructs OCRmyPDF to only clean the image before OCR and not the final image.

However, if you wish to use the more aggressive options in ``unpaper``, you may
use ``--unpaper-args '...'`` to override OCRmyPDF's defaults and forward other
arguments to ``unpaper``. OCRmyPDF forwards these arguments without any
knowledge of what ``unpaper`` considers to be valid arguments. The string of
arguments must be quoted as shown in the examples below. No filename arguments
may be included. OCRmyPDF will assume it can append the input and output
filenames of intermediate images to the ``--unpaper-args`` string.

In this example, we tell ``unpaper`` to expect two pages of text on a sheet
(image), such as occurs when two facing pages of a book are scanned.
``unpaper`` uses this information to deskew each independently and clean up the
margins of both.

.. code-block:: bash

    ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
    ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf

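The forwarding rule described above (one quoted string, split into words, with the intermediate filenames appended at the end) can be pictured with a small shell sketch. This is an illustration only, not OCRmyPDF's actual code: the function name ``build_unpaper_command`` and the file names ``page.pnm`` and ``cleaned.pnm`` are hypothetical, and the real command line may carry additional arguments.

```shell
# Illustration only: mimics the documented behavior (arguments forwarded
# as-is, file names appended last). Not OCRmyPDF's real implementation.
build_unpaper_command() {
    unpaper_args=$1      # the single quoted string given to --unpaper-args
    input_image=$2       # hypothetical intermediate input file name
    output_image=$3      # hypothetical intermediate output file name
    # Left unquoted on purpose so the string splits into separate words.
    set -- unpaper $unpaper_args "$input_image" "$output_image"
    printf '%s\n' "$*"
}

build_unpaper_command '--layout double --no-noisefilter' page.pnm cleaned.pnm
# prints: unpaper --layout double --no-noisefilter page.pnm cleaned.pnm
```

The sketch also suggests why filenames must not appear in the string: anything placed there would end up ahead of the appended input and output names and be read by ``unpaper`` as extra files.
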
.. warning::

    Some ``unpaper`` features will reposition text within the image.
    ``--clean-final`` is recommended to avoid this issue.

.. warning::

    Some ``unpaper`` features cause multiple input or output files to be
    consumed or produced. OCRmyPDF requires ``unpaper`` to consume one file and
    produce one file. Any deviation from that condition will result in errors.

.. note::

    ``unpaper`` uses uncompressed PBM/PGM/PPM files for its intermediate files.
    For large images or documents, it can take a lot of temporary disk space.

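To get a feel for the disk space involved, the size of one uncompressed color page can be estimated from its pixel dimensions. The sketch below assumes an 8-bit RGB PPM (3 bytes per pixel) and ignores the small file header. The helper name ``ppm_bytes`` is hypothetical, and how many intermediate images exist at the same time is not stated here.

```shell
# Rough size in bytes of one uncompressed 8-bit RGB (PPM) page image,
# ignoring the small file header.
ppm_bytes() {
    width_px=$1
    height_px=$2
    echo $(( width_px * height_px * 3 ))   # 3 bytes per pixel
}

# US Letter (8.5 x 11 in) scanned at 300 DPI is 2550 x 3300 pixels.
ppm_bytes 2550 3300    # prints 25245000, about 24 MiB per page
```

Grayscale (PGM) pages at 8 bits per pixel would need roughly a third of that, and bilevel (PBM) pages far less.
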
Control of OCR options
----------------------

OCRmyPDF provides many features to control the behavior of the OCR engine,
Tesseract.

When OCR is skipped
"""""""""""""""""""

If a page in a PDF seems to have text, by default OCRmyPDF will exit without
modifying the PDF. This is to ensure that PDFs that were previously OCRed or
were "born digital" rather than scanned are not processed.

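In a batch script, this default means a run can end without producing new output, so it may help to tell that case apart from a real failure. The sketch below is a hypothetical wrapper, ``ocr_or_skip``, and it assumes that exit status 6 is the one OCRmyPDF reports for input that already contains text. Check that value against the exit codes documented for your installed version before relying on it.

```shell
# Hypothetical wrapper: run OCRmyPDF and treat "already has text" as a skip.
# ASSUMPTION: exit status 6 means the input already contained text.
ocr_or_skip() {
    input_pdf=$1
    output_pdf=$2
    status=0
    ocrmypdf "$input_pdf" "$output_pdf" || status=$?
    case $status in
        0) echo "ocr added: $input_pdf" ;;
        6) echo "skipped, already has text: $input_pdf" ;;
        *) echo "failed with status $status: $input_pdf" >&2
           return "$status" ;;
    esac
}

# Example call (requires ocrmypdf to be installed):
# ocr_or_skip input.pdf output.pdf
```
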
 End of changes. 1 change blocks. 
0 lines changed or deleted 49 lines changed or added