"Fossies" - the Fresh Open Source Software Archive

Member "cb2bib-2.0.1/src/c2bPdfImport.cpp" (12 Feb 2021, 21082 Bytes) of package /linux/privat/cb2bib-2.0.1.tar.gz:



    1 /***************************************************************************
    2  *   Copyright (C) 2004-2021 by Pere Constans
    3  *   constans@molspaces.com
    4  *   cb2Bib version 2.0.1. Licensed under the GNU GPL version 3.
    5  *   See the LICENSE file that comes with this distribution.
    6  ***************************************************************************/
    7 #include "c2bPdfImport.h"
    8 
    9 #include "c2b.h"
   10 #include "c2bFileDialog.h"
   11 #include "c2bSettings.h"
   12 #include "c2bUtils.h"
   13 
   14 #include <document.h>
   15 
   16 #include <QDropEvent>
   17 #include <QMimeData>
   18 #include <QPushButton>
   19 #include <QTimer>
   20 #include <QUrl>
   21 
   22 
   23 /** \page pdfimport PDF Reference Import
   24 
   25 <p>GET_TABLE_OF_CONTENTS</p>
   26 
   27 
   28   \section intro_automatic_extraction Introduction
   29 
   30   Articles in PDF or other formats that can be converted to plain text can be
   31   processed and indexed by cb2Bib. Files can be selected using the Select Files
   32   button, or by dragging them from the desktop or the file manager to the
   33   PDFImport dialog panel. Files are converted to plain text by using any
   34   external translation tool or script. This tool, and optionally its
   35   parameters, are set in the cb2Bib configure dialog. See the \ref
   36   c2bconf_utilities section for details.
   37 
   38   Once the file is converted, the text and, optionally, the preparsed metadata
   39   are sent to cb2Bib for reference recognition. This is the usual, two-step
   40   process. First, text is optionally preprocessed, using a simple set of rules
   41   and/or any external script or tool. See \ref c2bconf_clipboard. Second, text
   42   is processed for reference extraction. cb2Bib so far uses two methods. One
   43   considers the text as a full pattern, which is checked against the user's set
   44   of regular expressions. The better designed these rules are, the better and
   45   more reliable the extraction will be. The second method, used when no regular
   46   expression matches the text, considers instead a set of predefined
   47   subpatterns. See \ref heuristics.
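
  The two extraction methods can be pictured with the following minimal sketch.
  It is an illustration only and not cb2Bib's implementation: the names
  <tt>Reference</tt>, <tt>UserPattern</tt>, <tt>matchFullPattern</tt>,
  <tt>guessFromSubpatterns</tt>, and <tt>extractReference</tt> are hypothetical,
  and the two subpatterns (a DOI and a year) are assumptions standing in for
  the predefined heuristics.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types for illustration only; cb2Bib's own classes differ.
typedef std::map<std::string, std::string> Reference;

struct UserPattern
{
    std::vector<std::string> fields; // field names, one per capture group
    std::regex expression;           // must match the whole text
};

// Method 1: treat the text as a full pattern and try each user expression.
bool matchFullPattern(const std::string& text, const std::vector<UserPattern>& patterns, Reference& ref)
{
    for (const UserPattern& p : patterns)
    {
        // Each field name needs its own capture group.
        assert(p.expression.mark_count() >= p.fields.size());
        std::smatch m;
        if (!std::regex_match(text, m, p.expression))
            continue;
        for (std::size_t i = 0; i < p.fields.size(); ++i)
            ref[p.fields[i]] = m[i + 1].str();
        return true;
    }
    return false;
}

// Method 2: fall back to predefined subpatterns (assumed here: DOI and year).
Reference guessFromSubpatterns(const std::string& text)
{
    static const std::regex doi(R"re(\b(10\.\d{4,9}/[^\s"<>]+))re");
    static const std::regex year(R"re(\b((?:19|20)\d{2})\b)re");
    Reference ref;
    std::smatch m;
    if (std::regex_search(text, m, doi))
        ref["doi"] = m[1].str();
    if (std::regex_search(text, m, year))
        ref["year"] = m[1].str();
    return ref;
}

// The second method is used only when no user expression matches.
Reference extractReference(const std::string& text, const std::vector<UserPattern>& patterns)
{
    Reference ref;
    if (matchFullPattern(text, patterns, ref))
        return ref;
    return guessFromSubpatterns(text);
}
```

  In this sketch a text that matches a user expression yields exactly the
  fields named by that expression, while any other text yields only what the
  subpatterns happen to find.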
   48 
   49   At this point users can review and correct their references, right before
   50   saving them. Allowing user intervention is, and has been, a design goal in
   51   cb2Bib, so the main panel helps users to check their references. Poorly
   52   translated characters, accented letters, 'forgotten' words, or minor
   53   formatting issues in the titles might be worth checking. See
   54     \htmlonly
   55     <a href="https://www.glyphandcog.com/textext.html" target="_blank">Glyph & Cog's Text Extraction</a>
   56     \endhtmlonly
   57   for a description of the intricacies of PDF to text conversion. In addition,
   58   if too few fields were extracted, one might perform a network query. If, say,
   59   only the DOI was caught, then chances are that such a query will fill in
   60   the remaining fields.
   61 
   62   The references are saved from the cb2Bib main panel. Once Save is pressed,
   63   and depending on the configuration, see \ref c2bconf_documents, the document
   64   file will be either renamed, copied, moved, or simply linked to the
   65   <tt>file</tt> field of the reference. If <b>Insert BibTeX metadata to
   66   document files</b> is checked, the current reference will also be inserted
   67   into the document itself.
   68 
   69 
   70   When several files are going to be indexed, the sequence can be as follows:
   71 
   72   - <b>Process next after saving</b>\n Once files are loaded and Process is
   73   pressed, the PDFImport dialog can be minimized (but not closed) for
   74   convenience. All operations required to completely fill the desired fields
   75   (e.g. dynamic bookmarks, open DOI, etc., which might be needed if the data
   76   in the document is not complete) are at this point accessible from the main
   77   panel. The link in the <tt>file</tt> field <b>will be permanent</b>, without
   78   regard to which operations (e.g. clipboard copying) are needed, until the
   79   reference is saved. The source file can be opened at any time by right
   80   clicking the <tt>file</tt> line edit. Once the reference is saved, the next
   81   file will be automatically processed. To skip a given document file without
   82   saving its reference, press the Process button.
   83 
   84 
   85   - <b>Unsupervised processing</b>\n In this operation mode, all files will be
   86   sequentially processed, following the chosen steps and rules. <b>If the
   87   process is successful</b>, the reference is automatically saved, and the
   88   next file is processed. <b>If it is not</b>, the file is skipped and no
   89   reference is saved. While processing, the clipboard is disabled for safety.
   90   Once finished, this box is unchecked, to avoid a possible accidental saving
   91   of a void reference. Network queries that require intervention, i.e., whose
   92   result is launching a given page, are skipped. The process continues until
   93   all files are processed. However, it will stop to avoid a file being
   94   overwritten, as a result of a repeated key. In this case, it will resume
   95   after manual renaming and saving. See also \ref commandline, commands
   96   <tt>--txt2bib</tt> and <tt>--doc2bib</tt>.
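
  The decisions taken in unsupervised mode can be summarized with the
  following minimal sketch. It is not the code used by cb2Bib, which drives
  this sequence through Qt signals and timers. The types
  <tt>Extraction</tt> and <tt>BatchReport</tt> and the function
  <tt>runUnsupervised</tt> are hypothetical, and the sketch assumes that each
  document has already been reduced to a cite key and a success flag.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical outcome of processing one document; not a cb2Bib type.
struct Extraction
{
    std::string file;
    std::string citeKey;
    bool successful; // true if the extraction step succeeded
};

struct BatchReport
{
    std::vector<std::string> saved;
    std::vector<std::string> skipped;
    std::string stoppedAt; // non-empty if a repeated key halted the run
};

BatchReport runUnsupervised(const std::vector<Extraction>& queue)
{
    BatchReport report;
    std::set<std::string> keys;
    for (const Extraction& e : queue)
    {
        assert(!e.file.empty()); // every queued entry must name a file
        if (!e.successful)
        {
            report.skipped.push_back(e.file); // no reference is saved
            continue;
        }
        if (!keys.insert(e.citeKey).second)
        {
            // Repeated key: stop rather than overwrite a file. The user
            // renames and saves manually before the run resumes.
            report.stoppedAt = e.file;
            break;
        }
        report.saved.push_back(e.file);
    }
    return report;
}
```

  Files listed as skipped stay untouched, which makes it easy to see
  afterwards which documents still need manual attention.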
   97 
   98 
   99   <p>&nbsp;</p>
  100   \section faq_automatic_extraction Automatic Extraction: Questions and Answers
  101 
  102   - <b>When does cb2Bib do automatic extractions?</b>\n
  103     cb2Bib is conceived as a lightweight tool to extract references and manage
  104     bibliographies in a simple, fast, and accurate way. Accuracy is better
  105     achieved in semi-automatic extractions. Such extractions are handy, and
  106     allow user intervention and verification. However, in cases where one has
  107     accumulated a large number of unindexed documents, automatic processing can
  108     be convenient. cb2Bib does automatic extraction when, in PDFImport mode,
  109     'Unsupervised processing' is checked, or, in command line mode, when typing
  110     <tt>cb2bib --doc2bib *.pdf tmp_references.bib</tt>, or, on Windows,
  111     <tt>c2bconsole.exe</tt> instead of <tt>cb2bib</tt>.
  112 
  113   - <b>Are PDFImport and command line modes equivalent?</b>\n
  114     Yes. There are, however, two minor differences. First, PDFImport adds each
  115     reference to the current BibTeX file, as this behavior is the normal one in
  116     cb2Bib. On the other hand, command line mode will, instead, overwrite
  117     <tt>tmp_references.bib</tt> if it exists, as this is the expected behavior
  118     for almost all command line tools. Second, as of now, command line mode
  119     does not follow the configuration option 'Check Repeated On Save'.
  120 
  121   - <b>How do I do automatic extraction?</b>\n
  122     To test and learn about automatic extractions, the cb2Bib distribution
  123     includes a set of four PDF files that mimic a paper title page. For these
  124     files, the distribution also includes a regular expression, in file
  125     <tt>regexps.txt</tt>, capable of extracting the reference fields, provided
  126     the <tt>pdftotext</tt> flags are set to their default values. Processing
  127     these files should, therefore, be automatic, and four messages stating
  128     <tt>Processed as 'PDF Import Example'</tt> should be seen in the logs. Note
  129     that extractions are configurable. A reading of \ref configuration will
  130     provide additional, useful information.
  131 
  132   - <b>Why are some entries not saved and files not renamed?</b>\n
  133     Once you move from the fabricated examples to real cases, you will realize
  134     that some of the files, while being processed, are not renamed and their
  135     corresponding BibTeX data is not written. For each document file, cb2Bib
  136     converts its first page to text, and from this text it attempts to extract
  137     the bibliographic reference. By design, when extraction fails, cb2Bib does
  138     nothing: no file is moved, no BibTeX is written. This way, you know that
  139     the remaining files in the origin directory need special, manual attention.
  140     <b>Extractions are seen as failed, unless reliable data is found in the
  141     text</b>.
  142 
  143   - <b>What is <em>reliable data</em>?</b>\n
  144     Note that computer processing of natural texts, such as extracting the
  145     bibliographic data from a title page, is nowadays an approximate
  146     procedure. cb2Bib tries several strategies: <b>1)</b> allow for including
  147     user regular expressions very specific to the extraction at hand, <b>2)</b>
  148     use metadata if available, <b>3)</b> guess what is reasonable, and, based
  149     on this, make customized queries. cb2Bib then considers that extracted <b>data
  150     is reliable if i)</b> data comes from a match to a user-supplied regular
  151     expression, <b>ii)</b> the document contains BibTeX metadata, or <b>iii)</b> a
  152     guess is transformed through a query to formatted bibliographic data. As
  153     formatted bibliographic data, cb2Bib understands BibTeX, PubMed XML, arXiv
  154     XML, and CR JSON data. In addition, it allows external processing if
  155     needed. Other data, metadata, guesses, and guesses on query results are
  156     considered unreliable data.
  157 
  158   - <b>Is metadata reliable data?</b>\n
  159     No. Only author, title, and keywords in standard PDF metadata can be mapped
  160     to their corresponding bibliographic fields. Furthermore, publishers most
  161     often misuse these three keys, placing, for instance, DOI in title, or
  162     setting author to, perhaps, the document typesetter. Only BibTeX XMP
  163     metadata is considered reliable. If you consider that a set of PDF files
  164     does contain reliable data, you may force cb2Bib to accept it using the
  165     command line switch <tt>--sloppy</tt> together with <tt>--doc2bib</tt>.
  166 
  167   - <b>How successful is automatic extraction?</b>\n
  168     As follows from the given definition of reliable data, running automatic
  169     extractions without ad hoc <tt>regexps.txt</tt> and <tt>netqinf.txt</tt>
  170     files will certainly give a zero success ratio. In practice, scenario 3)
  171     often applies: cb2Bib guesses several fields, and, based on the
  172     out-of-the-box <tt>netqinf.txt</tt> file, it obtains from the web either
  173     BibTeX, PubMed XML, arXiv XML, or CR JSON data.
  174 
  175   - <b>What can I do to increase the success ratio?</b>\n
  176     First, set your favorite journals in file <tt>abbreviations.txt</tt>.
  177     Besides increasing the chances of journal name recognition, it will provide
  178     consistency across your BibTeX database. In general, do not write regular
  179     expressions to extract directly from the PDF text. Conversion is often
  180     poor. Special characters often break lines, thus breaking your regular
  181     expressions too. Write customized queries instead. For instance, if your
  182     PDFs have a DOI on the title page, set the simple query
  183 \verbatim
  184 journal=The Journal of Everything|
  185 query=https://dx.doi.org/<<doi>>
  186 capture_from_query=
  187 referenceurl_prefix=
  188 referenceurl_sufix=
  189 pdfurl_prefix=
  190 pdfurl_sufix=
  191 action=htm2txt_query
  192 \endverbatim
  193     Then, if it is feasible to extract the reference from the document's web
  194     page using a regular expression, include it in file <tt>regexps.txt</tt>.
  195     Note that querying in cb2Bib has been designed with minority fields of
  196     research in mind, for which established databases might not be
  197     available. If cb2Bib fails to make reasonable guesses, then you might
  198     consider writing very simple regular expressions to extract directly from
  199     the PDF text, for instance, to obtain the title only. The subsequent query
  200     step can then provide the remaining information. Note also that, especially
  201     for old documents, the journal name is often missing from the paper title
  202     page. If you need to process a series of such papers, consider using a
  203     simple script that, in the cb2Bib preprocessing step, adds this missing
  204     information.
  205 
  206   - <b>Does successful extraction mean accurate extraction?</b>\n
  207     No. An extraction is successful if reliable data, as defined above, is
  208     found in the text, in the metadata, or in the text returned by a query.
  209     Reference accuracy relies on whether or not user regular expressions are
  210     robust, BibTeX metadata is correct, a guess is appropriate, a set of
  211     queries can correct a partially incorrect guess, and the text returned by a
  212     query is accurate. In general, well designed sets of regular expressions
  213     are accurate. Publishers' abstract pages and PubMed are accurate. But some
  214     publishers are still using images for non-ASCII characters, and PubMed
  215     algorithms may drop author middle names if a given author has 'too many
  216     names'. Expect convenience over accuracy on other sources.
  217 
  218 
  219   - <b>Can I use cb2Bib to extract comma-separated value (CSV) references?</b>\n
  220     Yes. To automatically import multiple CSV references you will need one
  221     regular expression. If you can control the CSV export, choose | as separator,
  222     since commas might be used, for instance, in titles. The regular expression
  223     for
  224 \verbatim
  225 AuthName1, AuthName2 | Title | 2010
  226 \endverbatim
  227     will simply be
  228 \verbatim
  229 author title year
  230 ^([^|]*)\|([^|]*)\|([^|]*)$
  231 \endverbatim
  232     The reference file <tt>references.csv</tt> can then be split into
  233     single-line files by typing
  234 \verbatim
  235 split -l 1 references.csv slineref
  236 \endverbatim
  237     and the command
  238 \verbatim
  239 cb2bib --txt2bib slineref* references.bib
  240 rm -f slineref*
  241 \endverbatim
  242     will convert <tt>references.csv</tt> to BibTeX file <tt>references.bib</tt>.
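
  To check what the CSV regular expression above captures before running a
  full import, it can be tried on a sample line outside cb2Bib. The following
  is a minimal sketch using <tt>std::regex</tt>. The names
  <tt>CsvReference</tt>, <tt>trimmed</tt>, and <tt>parseCsvLine</tt> are
  hypothetical, and trimming the blanks around each field is an assumption
  made for this illustration, not a statement about how cb2Bib cleans fields.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical container for the three captured fields.
struct CsvReference
{
    std::string author;
    std::string title;
    std::string year;
};

// Removes leading and trailing blanks and tabs.
std::string trimmed(const std::string& s)
{
    const std::string::size_type first = s.find_first_not_of(" \t");
    if (first == std::string::npos)
        return std::string();
    const std::string::size_type last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Applies the three-field expression to one line of the CSV file.
bool parseCsvLine(const std::string& line, CsvReference& ref)
{
    static const std::regex rx(R"re(^([^|]*)\|([^|]*)\|([^|]*)$)re");
    std::smatch m;
    if (!std::regex_match(line, m, rx))
        return false;
    assert(m.size() == 4); // whole match plus three capture groups
    ref.author = trimmed(m[1].str());
    ref.title = trimmed(m[2].str());
    ref.year = trimmed(m[3].str());
    return true;
}
```

  With the sample line <tt>AuthName1, AuthName2 | Title | 2010</tt>, the
  comma stays inside the author field because only | separates the fields. A
  line with more or fewer than two separators does not match at all.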
  243 
  244 */
  245 c2bPdfImport::c2bPdfImport(QWidget* parentw) : QDialog(parentw)
  246 {
  247     ui.setupUi(this);
  248     setWindowFlags(windowFlags() & ~Qt::WindowContextHelpButtonHint);
  249     settings = c2bSettingsP;
  250     loadSettings();
  251 
  252     buttonSelectFiles = new QPushButton(tr("&Select Files"));
  253     buttonSelectFiles->setStatusTip(tr("Select PDF files. Hint: Files can be dragged and dropped to this window"));
  254     buttonSelectFiles->setMouseTracking(true);
  255     ui.buttonBox->addButton(buttonSelectFiles, QDialogButtonBox::ActionRole);
  256     buttonProcess = new QPushButton(tr("&Process"));
  257     ui.buttonBox->addButton(buttonProcess, QDialogButtonBox::ActionRole);
  258     ui.buttonBox->button(QDialogButtonBox::Help)->setAutoDefault(false);
  259     buttonProcess->setAutoDefault(true);
  260     buttonProcess->setDefault(true);
  261     buttonProcess->setEnabled(false);
  262     buttonSelectFiles->setAutoDefault(true);
  263     buttonSelectFiles->setDefault(true);
  264     buttonSelectFiles->setFocus();
  265     ui.buttonBox->button(QDialogButtonBox::Abort)->setAutoDefault(false);
  266     ui.buttonBox->button(QDialogButtonBox::Abort)->setEnabled(false);
  267     m_aborted = false;
  268 
  269     connect(ui.buttonBox->button(QDialogButtonBox::Abort), SIGNAL(clicked()), this, SLOT(abort()));
  270     connect(ui.buttonBox, SIGNAL(helpRequested()), this, SLOT(help()));
  271     connect(buttonSelectFiles, SIGNAL(clicked()), this, SLOT(selectFiles()));
  272     connect(buttonProcess, SIGNAL(clicked()), this, SLOT(processOneFile()));
  273     connect(ui.DoAll, SIGNAL(toggled(bool)), this, SIGNAL(setClipboardDisabled(bool)));
  274     connect(c2b::instance(), SIGNAL(statusMessage(QString)), this, SLOT(showMessage(QString)));
  275 
  276     ui.Log->appendPlainText(
  277         tr("PDF to Text converter: %1\nArguments: %2\n")
  278         .arg(settings->fileName("c2bPdfImport/Pdf2TextBin"), settings->value("c2bPdfImport/Pdf2TextArg").toString()));
  279     showMessage(tr("See cb2Bib install directory for demo c2bPdfImport files."));
  280 }
  281 
  282 c2bPdfImport::~c2bPdfImport()
  283 {
  284     emit setClipboardDisabled(false);
  285     saveSettings();
  286 }
  287 
  288 
  289 void c2bPdfImport::processOneFile()
  290 {
  291     // Converting PDF to Text
  292     buttonProcess->setEnabled(false);
  293     m_aborted = false;
  294     ui.buttonBox->button(QDialogButtonBox::Abort)->setEnabled(ui.DoAll->isChecked());
  295     settings->setValue("networkQuery/isSupervised", !ui.DoAll->isChecked());
  296     settings->setValue("cb2Bib/AutomaticQuery", ui.AutomaticQuery->isChecked());
  297 
  298     if (ui.PDFlist->currentItem() == 0)
  299         return;
  300     processedFile = ui.PDFlist->currentItem()->text();
  301     if (ui.OpenFiles->isChecked())
  302         c2bUtils::openFile(processedFile, this);
  303 
  304     QCoreApplication::processEvents();
  305     processDocument();
  306 }
  307 
  308 void c2bPdfImport::processNext()
  309 {
  310     processedFile.clear();
  311     if (m_aborted)
  312     {
  313         m_aborted = false;
  314         return;
  315     }
  316     if (ui.PDFlist->currentItem() == 0)
  317         return;
  318     if (ui.DoNextAfterSaving->isChecked() || ui.DoAll->isChecked())
  319         processOneFile();
  320 }
  321 
  322 void c2bPdfImport::processDocument()
  323 {
  324     document doc(processedFile, document::FirstPage);
  325     QString text(doc.toString());
  326     const QString log(doc.logString());
  327     if (!log.isEmpty())
  328         ui.Log->appendPlainText(log);
  329     const QString error(doc.errorString());
  330     if (!error.isEmpty())
  331         ui.Log->appendPlainText(tr("[cb2bib] %1.").arg(error));
  332 
  333     QListWidgetItem* item = ui.PDFlist->currentItem();
  334     delete item;
  335     if (ui.PDFlist->currentItem() == 0)
  336     {
  337         buttonProcess->setEnabled(false);
  338         ui.buttonBox->button(QDialogButtonBox::Close)->setFocus();
  339     }
  340     else
  341     {
  342         buttonProcess->setEnabled(true);
  343         buttonProcess->setFocus();
  344     }
  345 
  346     QString metadata;
  347     if (settings->value("cb2Bib/AddMetadata").toBool())
  348         metadata = c2b::documentMetadata(processedFile);
  349     if (text.isEmpty() && metadata.isEmpty())
  350     {
  351         if (ui.DoAll->isChecked())
  352             QTimer::singleShot(500, this, SLOT(processNext()));
  353     }
  354     else
  355     {
  356         if (settings->value("cb2Bib/PreAppendMetadata").toString() == "prepend")
  357             text = metadata + text;
  358         else
  359             text = text + '\n' + metadata;
  360         ui.Log->appendPlainText(tr("[cb2bib] Conversion completed for file %1.").arg(processedFile));
  361         emit textProcessed(text);
  362         emit fileProcessed(processedFile);
  363     }
  364 }
  365 
  366 void c2bPdfImport::referenceExtacted(bool status)
  367 {
  368     if (!ui.DoAll->isChecked())
  369         return;
  370     if (ui.PDFlist->currentItem() == 0)
  371     {
  372         ui.DoAll->setChecked(false);
  373         ui.buttonBox->button(QDialogButtonBox::Abort)->setEnabled(false);
  374     }
  375 
  376     // Delay request to make sure fileProcessed has finished
  377     if (status)
  378         QTimer::singleShot(500, this, SIGNAL(saveReferenceRequest()));
  379     else
  380         QTimer::singleShot(500, this, SLOT(processNext()));
  381 }
  382 
  383 void c2bPdfImport::dropEvent(QDropEvent* qevent)
  384 {
  385     const QList<QUrl> fns(qevent->mimeData()->urls());
  386     for (int i = 0; i < fns.count(); ++i)
  387     {
  388         QString scheme(fns.at(i).scheme());
  389         QString fn;
  390         if (scheme == "file")
  391             fn = fns.at(i).toLocalFile();
  392         if (!fn.isEmpty())
  393         {
  394             QListWidgetItem* item(new QListWidgetItem(fn, ui.PDFlist));
  395             if (ui.PDFlist->currentItem() == 0)
  396                 ui.PDFlist->setCurrentItem(item);
  397         }
  398     }
  399     qevent->acceptProposedAction();
  400     c2bUtils::setWidgetOnTop(this);
  401     buttonProcess->setEnabled(true);
  402     buttonProcess->setFocus();
  403     showMessage(tr("%1 files selected.").arg(ui.PDFlist->count()));
  404 }
  405 
  406 void c2bPdfImport::dragEnterEvent(QDragEnterEvent* qevent)
  407 {
  408     if (qevent->mimeData()->hasUrls())
  409         qevent->acceptProposedAction();
  410 }
  411 
  412 bool c2bPdfImport::event(QEvent* qevent)
  413 {
  414     if (qevent->type() == QEvent::StatusTip)
  415     {
  416         ui.statusBar->showMessage(static_cast<QStatusTipEvent*>(qevent)->tip());
  417         return true;
  418     }
  419     else
  420         return QWidget::event(qevent);
  421 }
  422 
  423 void c2bPdfImport::selectFiles()
  424 {
  425     const QStringList fns(c2bFileDialog::getOpenFilenames(this, QString(), settings->fileName("c2bPdfImport/LastFile"),
  426                           tr("Portable Document Format (*.pdf);;All (*)")));
  427     if (fns.isEmpty())
  428         return;
  429     settings->setFilename("c2bPdfImport/LastFile", fns.last());
  430 
  431     for (QStringList::const_iterator i = fns.constBegin(); i != fns.constEnd(); ++i)
  432     {
  433         QListWidgetItem* item(new QListWidgetItem(*i, ui.PDFlist));
  434         if (ui.PDFlist->currentItem() == 0)
  435             ui.PDFlist->setCurrentItem(item);
  436     }
  437     buttonProcess->setEnabled(true);
  438     buttonProcess->setFocus();
  439     showMessage(tr("%1 files selected.").arg(ui.PDFlist->count()));
  440 }
  441 
  442 void c2bPdfImport::show()
  443 {
  444     c2bUtils::setWidgetOnTop(this);
  445     if (buttonProcess->isEnabled())
  446         buttonProcess->setFocus();
  447     else
  448         buttonSelectFiles->setFocus();
  449     QDialog::show();
  450 }
  451 
  452 void c2bPdfImport::showMessage(const QString& ms)
  453 {
  454     ui.statusBar->showMessage(ms, C2B_MESSAGE_TIME);
  455     if (ms.startsWith(tr("Processed as")) || ms.startsWith(tr("Unable")))
  456         ui.Log->appendPlainText("[cb2bib] " + ms);
  457 }
  458 
  459 void c2bPdfImport::loadSettings()
  460 {
  461     c2bAutomaticQuery = settings->value("cb2Bib/AutomaticQuery").toBool();
  462     ui.AutomaticQuery->setChecked(settings->value("c2bPdfImport/AutomaticQuery").toBool());
  463     ui.DoNextAfterSaving->setChecked(settings->value("c2bPdfImport/DoNextAfterSaving", true).toBool());
  464     ui.OpenFiles->setChecked(settings->value("c2bPdfImport/OpenFiles", false).toBool());
  465 }
  466 
  467 void c2bPdfImport::saveSettings()
  468 {
  469     settings->setValue("c2bPdfImport/AutomaticQuery", ui.AutomaticQuery->isChecked());
  470     settings->setValue("c2bPdfImport/DoNextAfterSaving", ui.DoNextAfterSaving->isChecked());
  471     settings->setValue("c2bPdfImport/OpenFiles", ui.OpenFiles->isChecked());
  472     settings->setValue("cb2Bib/AutomaticQuery", c2bAutomaticQuery);
  473     settings->setValue("networkQuery/isSupervised", true);
  474 }
  475 
  476 void c2bPdfImport::abort()
  477 {
  478     m_aborted = true;
  479     ui.DoAll->setChecked(false);
  480     ui.buttonBox->button(QDialogButtonBox::Abort)->setEnabled(false);
  481 }
  482 
  483 void c2bPdfImport::help()
  484 {
  485     c2bUtils::displayHelp("https://www.molspaces.com/cb2bib/doc/pdfimport/");
  486 }