Member "cb2bib-2.0.1/src/c2b/metadataParser.cpp" (12 Feb 2021, 19690 Bytes) of package /linux/privat/cb2bib-2.0.1.tar.gz:


/***************************************************************************
 *   Copyright (C) 2004-2021 by Pere Constans
 *   constans@molspaces.com
 *   cb2Bib version 2.0.1. Licensed under the GNU GPL version 3.
 *   See the LICENSE file that comes with this distribution.
 ***************************************************************************/
#include "metadataParser.h"

#include "coreBibParser.h"
#include "settings.h"

#include <QDate>
#include <QProcess>
#include <QXmlStreamReader>


/** \page metadata Reading and Writing Bibliographic Metadata

GET_TABLE_OF_CONTENTS


  \section metadata_read Reading Metadata

  For decades, metadata in scientific documents was rarely appreciated or
  used, and no format specification for bibliographic metadata had been widely
  accepted. Back in 2008, cb2Bib adapted the predefined PDF metadata
  capabilities to set BibTeX bibliographic keys in document files.

  cb2Bib reads all XMP (a specific XML standard devised for metadata storage)
  packets found in the document. It then parses the XML strings looking for
  nodes and attributes with key names meaningful to bibliographic references.
  If a given bibliographic field is found in multiple packets, cb2Bib takes
  the last one, which, according to the PDF specification, is most often the
  most up-to-date one. The fields <tt>file</tt>, which would be the document
  itself, and <tt>pages</tt>, which is usually the actual number of pages, are
  skipped.

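  The last-packet-wins rule described above can be sketched as follows. This
  is a minimal illustration in plain C++, not the cb2Bib implementation: the
  <tt>Packet</tt> type and the <tt>mergePackets</tt> function are hypothetical
  names, and each packet is assumed to be already parsed into (field, value)
  pairs. Given two packets that both carry a title, the merged reference keeps
  the title of the second one.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One parsed XMP packet, reduced to (field, value) pairs in document order.
// This flat representation is an assumption made for the example.
using Packet = std::vector<std::pair<std::string, std::string>>;

// Merge packets in file order. A field seen again in a later packet
// overwrites the earlier value, so the last packet wins.
std::map<std::string, std::string> mergePackets(const std::vector<Packet>& packets)
{
    static const std::set<std::string> skipped{"file", "pages"};
    std::map<std::string, std::string> reference;
    for (const Packet& packet : packets)
        for (const auto& field : packet)
        {
            // Empty values and skipped fields never enter the reference
            if (field.second.empty() || skipped.count(field.first) > 0)
                continue;
            reference[field.first] = field.second;
        }
    return reference;
}
```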
  The metadata is then summarized in the cb2Bib clipboard panel as, for instance,

\verbatim
[Bibliographic Metadata
<title>arXiv:0705.0751v1  [cs.IR]  5 May 2007</title>
/Bibliographic Metadata]
\endverbatim

  This data, whenever the user considers it to be correct, can be easily
  imported by the built-in 'Heuristic Guess' capability. On the other hand, if
  keys are found with the prefix <tt>bibtex</tt>, cb2Bib will assume the
  document does contain bibliographic metadata, and it will only consider the
  keys having this prefix. Assuming therefore that the metadata is
  bibliographic, cb2Bib will automatically import the reference. This way, if
  using PDFImport, BibTeX-aware documents will be processed as successfully
  recognized, without requiring any user-supplied regular expression.

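  The prefix rule described above can be sketched as follows. This is a
  simplified illustration in plain C++ rather than the cb2Bib code:
  <tt>Field</tt>, <tt>Selection</tt>, <tt>isBibtexKey</tt>, and
  <tt>selectFields</tt> are hypothetical names, keys are assumed to be already
  extracted as qualified names, and the additional check that cb2Bib makes for
  its own namespace before importing automatically is left out.

```cpp
#include <string>
#include <utility>
#include <vector>

using Field = std::pair<std::string, std::string>; // (qualified key, value)

struct Selection
{
    std::vector<Field> fields; // kept fields, keys stored without the prefix
    bool autoImport = false;   // true if any key carried the prefix
};

// True if the qualified key starts with the bibtex prefix
bool isBibtexKey(const std::string& key)
{
    static const std::string prefix("bibtex:");
    return key.compare(0, prefix.size(), prefix) == 0;
}

Selection selectFields(const std::vector<Field>& found)
{
    static const std::string::size_type prefix_length = std::string("bibtex:").size();
    Selection selection;
    // First pass: does the document carry any prefixed key at all?
    for (const Field& f : found)
        selection.autoImport = selection.autoImport || isBibtexKey(f.first);
    // Second pass: with prefixed keys present, generic keys are ignored
    for (const Field& f : found)
    {
        const bool prefixed = isBibtexKey(f.first);
        if (selection.autoImport && !prefixed)
            continue;
        selection.fields.emplace_back(prefixed ? f.first.substr(prefix_length) : f.first, f.second);
    }
    return selection;
}
```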
  See also \ref relnotes100, \ref c2bconf_clipboard, and \ref pdfimport.
  <p>&nbsp;</p>


  \section metadata_write Writing Metadata

  Once an extracted reference is saved and there is a document attached to it,
  cb2Bib will optionally insert the bibliographic metadata into the document
  itself. cb2Bib writes an XMP packet as, for instance,

\verbatim
<bibtex:author>P. Constans</bibtex:author>
<bibtex:journal>arXiv 0705.0751</bibtex:journal>
<bibtex:title>Approximate textual retrieval</bibtex:title>
<bibtex:type>article</bibtex:type>
<bibtex:year>2007</bibtex:year>
\endverbatim

  The BibTeX fields <tt>file</tt> and <tt>id</tt> are skipped when writing: the
  former for the reason mentioned above, and the latter because it is easily
  generated by specialized BibTeX software according to each user's
  preferences. LaTeX escaped characters for non-ASCII letters are converted to
  UTF-8, as XMP already specifies this encoding.

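  How the packet entries are composed can be sketched as follows. This is a
  minimal illustration in plain C++, not the cb2Bib writer itself:
  <tt>Field</tt> and <tt>bibtexTags</tt> are hypothetical names, the values are
  assumed to be already converted to UTF-8, and no XML escaping is attempted.

```cpp
#include <string>
#include <utility>
#include <vector>

using Field = std::pair<std::string, std::string>; // (BibTeX key, UTF-8 value)

// Compose the bibtex elements for one reference. The entry type is written
// first, then the fields in the given order; file and id are skipped.
std::string bibtexTags(const std::string& type, const std::vector<Field>& fields)
{
    std::string tags("<bibtex:type>" + type + "</bibtex:type>\n");
    for (const Field& f : fields)
    {
        if (f.first == "file" || f.first == "id")
            continue;
        tags += "<bibtex:" + f.first + ">" + f.second + "</bibtex:" + f.first + ">\n";
    }
    return tags;
}
```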
  The actual writing of the packet into the document is performed by ExifTool,
  an excellent Perl program written by Phil Harvey. See
  \htmlonly
  <a href="https://exiftool.org/" target="_blank">https://exiftool.org</a>.
  \endhtmlonly
  ExifTool supports several document formats for writing. The most relevant
  here are PostScript and PDF. For PDF documents, metadata is written as an
  incremental update of the document. This exactly preserves the binary
  structure of the document, and changes can be easily reversed or modified if
  so desired. Whenever ExifTool is unable to insert metadata, e.g., because the
  document format is not supported or it has structural errors, cb2Bib will
  issue an information message, and the document will remain untouched.


  See also \ref c2bconf_documents and \ref update_metadata.

*/
metadataParser::metadataParser(QObject* parento) : QObject(parento)
{
    _cbpP = new coreBibParser(this);
    init();
}

metadataParser::metadataParser(coreBibParser* cbp, QObject* parento) : QObject(parento), _cbpP(cbp)
{
    Q_ASSERT_X(_cbpP, "metadataParser", "coreBibParser was not instantiated");
    init();
}


void metadataParser::init()
{
    _settingsP = settings::instance();
    // Set bibliographic fields
    // Remove fields file (it is itself) and pages (usually number of pages) from list
    _fields = QRegExp("\\b(?:abstract|address|annote|author|authors|booktitle|chapter|"
                      "doi|edition|editor|eprint|institution|isbn|issn|journal|"
                      "keyword|keywords|key words|month|note|number|organization|"
                      "pagerange|publicationname|publisher|school|series|title|url|volume|year)\\b");
    _fields.setCaseSensitivity(Qt::CaseInsensitive);
    // Recognition from BibTeX entries
    _bibtex_fields = QRegExp("\\bbibtex:(?:abstract|address|annote|author|booktitle|chapter|"
                             "doi|edition|editor|eprint|institution|isbn|issn|journal|"
                             "keywords|month|note|number|organization|pages|publisher|"
                             "school|series|title|url|volume|year)\\b");
    _bibtex_fields.setCaseSensitivity(Qt::CaseInsensitive);
    // Set field keys equivalences
    const QStringList& bibliographicFields = _cbpP->bibliographicFields();
    for (int i = 0; i < bibliographicFields.count(); ++i)
        _bibtex_key.insert(bibliographicFields.at(i), bibliographicFields.at(i));
    _bibtex_key.insert("authors", "author");
    _bibtex_key.insert("key words", "keywords");
    _bibtex_key.insert("keyword", "keywords");
    _bibtex_key.insert("pagerange", "pages");
    _bibtex_key.insert("publicationname", "journal");
}

const QString metadataParser::metadata(const QString& fn)
{
    if (!_metadata(fn))
        return QString();
    QString data;
    if (_has_bibtex)
        data = _cbpP->referenceToBibTeX(_ref);
    else
    {
        const QStringList& bibliographicFields(_cbpP->bibliographicFields());
        if (_ref.contains("type"))
            data += QString("<type>%1</type>\n").arg(_ref.value("type"));
        for (int i = 0; i < bibliographicFields.count(); ++i)
        {
            const QString key(bibliographicFields.at(i));
            if (_ref.contains(key))
                data += QString("<%1>%2</%1>\n").arg(key, _ref.value(key));
        }
    }
    data = QString("[Bibliographic Metadata\n%1/Bibliographic Metadata]\n").arg(data);
    return data;
}

bool metadataParser::metadata(const QString& fn, bibReference* ref)
{
    ref->clearReference();
    bool has_reference(_metadata(fn));
    has_reference = has_reference && _has_bibtex && _has_cb2bib;
    if (has_reference)
        (*ref) = _ref;
    return has_reference;
}

bool metadataParser::_metadata(const QString& fn)
{
    QByteArray raw_contents;
    QFile f(fn);
    if (f.open(QIODevice::ReadOnly))
    {
        raw_contents = f.readAll();
        f.close();
    }
    else
        return false;

    _ref.clearReference();
    _ref.typeName = "article";
    _has_bibtex = false;
    _has_cb2bib = false;
    _has_prism = false;

    QStringList xmls;
    _metadataXmp(fn, raw_contents, &xmls);
    // Last in list should be the most updated, parse it last
    for (int i = 0; i < xmls.count(); ++i)
        _fuzzyParser(xmls.at(i));
    QMutableHashIterator<QString, QString> it(_ref);
    while (it.hasNext())
    {
        it.next();
        it.value() = c2bUtils::fromQtXmlString(it.value());
    }
    if (!_has_cb2bib)
        _miscellaneousData(fn, raw_contents);
    if (_ref.count() == 0)
        return false;

    if (_has_bibtex)
        if (_ref.contains("type"))
            _ref.typeName = _ref.value("type");

    return true;
}

void metadataParser::_metadataXmp(const QString& fn, const QByteArray& raw_contents, QStringList* xmls)
{
    xmls->clear();
    int pos(0);
    while (pos > -1)
    {
        // Scan all packets, and do not trust "=''  " etc, as producers encode differently
        pos = raw_contents.indexOf("<?xpacket begin", pos);
        if (pos > -1)
        {
            int posn(raw_contents.indexOf("<?xpacket end", pos));
            if (posn > pos)
            {
                xmls->append(c2bUtils::toQtXmlString(QString::fromUtf8(raw_contents.mid(pos, posn - pos + 19))));
                _has_bibtex = _has_bibtex || xmls->last().contains("bibtex:");
                _has_cb2bib = _has_cb2bib || xmls->last().contains("www.molspaces.com/cb2bib");
                _has_prism = _has_prism || xmls->last().contains("prismstandard.org/namespaces/basic/2.0");
                pos = posn;
            }
            else
                pos = -1;
        }
    }
    if (xmls->count() == 0)
        _metadataXmpExifTool(fn, xmls);
}

void metadataParser::_miscellaneousData(const QString& fn, const QByteArray& raw_contents)
{
    // Get title, author, and keywords from here whenever no cb2Bib BibTeX data is available
    QString data;
    QRegExp pdf_author_rx;
    QRegExp pdf_title_rx;
    QRegExp pdf_keywords_rx;
    const QString exiftool_bin(_settingsP->fileName("cb2Bib/ExifToolBin"));
    bool is_exiftool_available = !exiftool_bin.isEmpty();
    if (is_exiftool_available)
    {
        QProcess exiftool;
        QStringList arglist;
        arglist.append(fn);
        exiftool.start(exiftool_bin, arglist);
        if (!exiftool.waitForFinished(90000))
            exiftool.kill();
        data = QString::fromUtf8(exiftool.readAllStandardOutput());
        if (exiftool.error() == QProcess::UnknownError) // No error
        {
            pdf_author_rx.setPattern("Author\\s*:\\s+(.*)\\n");
            pdf_title_rx.setPattern("Title\\s*:\\s+(.*)\\n");
            pdf_keywords_rx.setPattern("Subject\\s*:\\s+(.*)\\n");
        }
        else
            is_exiftool_available = false;
    }
    if (!is_exiftool_available)
    {
        if (!raw_contents.startsWith("%PDF"))
            return;
        data = _pdfDictionary(raw_contents);
        pdf_author_rx.setPattern("\\Author\\s*\\((.*)\\)");
        pdf_title_rx.setPattern("\\Title\\s*\\((.*)\\)");
    }

    pdf_author_rx.setMinimal(true);
    pdf_author_rx.setCaseSensitivity(Qt::CaseSensitive);
    if (pdf_author_rx.indexIn(data) > -1)
        if (!pdf_author_rx.cap(1).trimmed().isEmpty())
            _ref["author"] = pdf_author_rx.cap(1);

    // Dublin Core Metadata keywords if exiftool is available
    if (_has_prism && is_exiftool_available)
    {
        pdf_keywords_rx.setMinimal(true);
        pdf_keywords_rx.setCaseSensitivity(Qt::CaseSensitive);
        if (pdf_keywords_rx.indexIn(data) > -1)
            if (!pdf_keywords_rx.cap(1).trimmed().isEmpty())
                _ref["keywords"] = pdf_keywords_rx.cap(1);
    }

    // Done if BibTeX, otherwise try checking dictionary for title
    if (_has_bibtex)
        return;
    pdf_title_rx.setMinimal(true);
    pdf_title_rx.setCaseSensitivity(Qt::CaseSensitive);
    if (pdf_title_rx.indexIn(data) > -1)
        if (!pdf_title_rx.cap(1).trimmed().isEmpty())
            _ref["title"] = pdf_title_rx.cap(1);
}

const QString metadataParser::_pdfDictionary(const QByteArray& rawpdf)
{
    // Heuristic to locate the Pdf dictionary
    const int pos(rawpdf.lastIndexOf("/Producer"));
    if (pos > -1)
    {
        const int pos0(rawpdf.lastIndexOf("<<", pos));
        if (pos0 > -1)
        {
            const int posn(rawpdf.indexOf(">>", pos));
            if (posn > pos0)
                return QString::fromLatin1(rawpdf.mid(pos0, posn - pos0 + 2));
        }
    }
    return QString();
}

void metadataParser::_metadataXmpExifTool(const QString& fn, QStringList* xmls)
{
    // Not strictly needed; called for XMP packets that are not directly visible.
    // It is slower than the full scan in _metadataXmp().
    const QString exiftool_bin(_settingsP->fileName("cb2Bib/ExifToolBin"));
    if (exiftool_bin.isEmpty())
        return;
    QProcess exiftool;
    QStringList arglist;
    arglist.append("-xmp");
    arglist.append("-b");
    arglist.append(fn);
    exiftool.start(exiftool_bin, arglist);
    if (!exiftool.waitForFinished(90000))
        exiftool.kill();
    QString xmp(c2bUtils::toQtXmlString(QString::fromUtf8(exiftool.readAllStandardOutput())));
    if (xmp.startsWith("<?xpacket begin"))
    {
        xmls->append(xmp);
        _has_bibtex = _has_bibtex || xmls->last().contains("bibtex:");
        _has_cb2bib = _has_cb2bib || xmls->last().contains("www.molspaces.com/cb2bib");
        _has_prism = _has_prism || xmls->last().contains("prismstandard.org/namespaces/basic/2.0");
    }
}

void metadataParser::_fuzzyParser(const QString& data)
{
    if (data.isEmpty())
        return;
    QXmlStreamReader parser;
    parser.addData(data);
    QRegExp* fields;
    if (_has_bibtex)
        fields = &_bibtex_fields;
    else
        fields = &_fields;
    QString field;
    QString key;
    QString value;
    while (!parser.atEnd())
    {
        parser.readNext();
        if (parser.isStartElement())
        {
            // Do attributes (seemingly how poppler composes XML)
            QXmlStreamAttributes att = parser.attributes();
            for (int i = 0; i < att.count(); ++i)
            {
                field = att.at(i).qualifiedName().toString();
                key = att.at(i).name().toString().toLower();
                value = att.at(i).value().toString();
                if (value.isEmpty())
                    continue;
                if (field.contains(*fields))
                    _ref[_bibtex_key.value(key)] = value;
                else if (QString::compare(field, "summary", Qt::CaseInsensitive) == 0 ||
                         QString::compare(field, "subject", Qt::CaseInsensitive) == 0)
                {
                    if (!_ref.contains("abstract")) // Prefer the BibTeX field key, if it exists, over synonyms
                        _ref["abstract"] = value;
                }
                else if (QString::compare(field, "bibtex:type", Qt::CaseInsensitive) == 0 ||
                         QString::compare(field, "bibtex:entrytype", Qt::CaseInsensitive) == 0)
                    _ref["type"] = value.toLower();
            }

            // Do elements (how exiftool and exempi compose XML)
            field = parser.qualifiedName().toString();
            key = parser.name().toString().toLower();
            if (field.contains(*fields))
            {
                parser.readNext();
                value = parser.text().toString().trimmed();
                if (!value.isEmpty())
                    _ref[_bibtex_key.value(key)] = value;
            }
            else if (!_has_bibtex && QString::compare(field, "prism:coverDate", Qt::CaseSensitive) == 0)
            {
                parser.readNext();
                value = parser.text().toString().trimmed();
                const QDate pdate(QDate::fromString(value, Qt::ISODate));
                const QString pyear(pdate.toString("yyyy"));
                // Prefer BibTeX date over Prism
                if (!pyear.isEmpty() && !_ref.contains("year"))
                    _ref["year"] = pyear;
                const QString pmonth(pdate.toString("d MMMM"));
                if (!pmonth.isEmpty() && !_ref.contains("month"))
                    _ref["month"] = pmonth;
            }
            else if (QString::compare(field, "summary", Qt::CaseInsensitive) == 0 ||
                     QString::compare(field, "subject", Qt::CaseInsensitive) == 0)
            {
                parser.readNext();
                value = parser.text().toString().trimmed();
                if (!value.isEmpty() && !_ref.contains("abstract")) // Prefer the BibTeX field key, if it exists, over synonyms
                    _ref["abstract"] = value;
            }
            else if (QString::compare(field, "bibtex:type", Qt::CaseInsensitive) == 0 ||
                     QString::compare(field, "bibtex:entrytype", Qt::CaseInsensitive) == 0)
            {
                parser.readNext();
                value = parser.text().toString().trimmed();
                if (!value.isEmpty())
                    _ref["type"] = value.toLower();
            }
        }
    }
    if (parser.hasError())
        c2bUtils::debug(tr("metadataParser: Error while parsing XML packets"));
}

bool metadataParser::insertMetadata(const bibReference& ref, const QString& fn, QString* error, const bool has_reference)
{
    if (error)
        error->clear();
    const QString exiftool_bin(_settingsP->fileName("cb2Bib/ExifToolBin"));
    if (exiftool_bin.isEmpty())
    {
        if (error)
            *error = tr("Metadata writer: ExifTool location has not been specified.");
        else
            emit showMessage(tr("Warning - cb2Bib"), tr("Metadata writer: ExifTool location has not been specified."));
        return false;
    }
    if (ref.count() == 0)
        return false;

    QString bibtags;
    QString key;
    QString value;
    const QString entry("<bibtex:%1>%2</bibtex:%1>\n");
    bibtags += entry.arg("type", ref.typeName);
    const QStringList& bibliographicFields = _cbpP->bibliographicFields();
    for (int i = 0; i < bibliographicFields.count(); ++i)
    {
        key = bibliographicFields.at(i);
        value = ref.value(key);
        if (key == "file")
            continue;
        else if (key == "id")
            continue;
        c2bUtils::fullBibToC2b(value);
        if (key == "title" || key == "booktitle")
            c2bUtils::cleanTitle(value);
        bibtags += entry.arg(key, value);
    }
    QString bibtags_xmp(c2bUtils::fileToString(":/xml/xml/cb2bib.xmp"));
    bibtags_xmp.replace("GET_BIBTEX_TAGS", bibtags);
    bibtags_xmp.replace("GET_FORMATTED_AUTHOR", formattedAuthor(ref.value("author")));

    const QString workdir(QFileInfo(fn).absolutePath());
    const QString bibtags_file(workdir + "/bibtags.xmp");
    c2bUtils::stringToFile(bibtags_xmp, bibtags_file);

    QProcess exiftool;
    QStringList arglist;
    arglist.append("-overwrite_original");
    arglist.append("-m");
    if (has_reference)
        arglist.append("-P");
    arglist.append("-TagsFromFile");
    arglist.append(bibtags_file);
    arglist.append("-all:all");
    arglist.append("-pdf:all<all");
    arglist.append("-postscript:all<all");
    arglist.append(fn);

    QStringList envlist(QProcess::systemEnvironment());
    envlist.prepend("EXIFTOOL_HOME=" + workdir);
    exiftool.setEnvironment(envlist);
    const QString exiftoolconf_file(workdir + "/.ExifTool_config");
    c2bUtils::stringToFile(c2bUtils::fileToString(":/xml/xml/ExifTool_config"), exiftoolconf_file);

    exiftool.start(exiftool_bin, arglist);
    if (!exiftool.waitForStarted())
    {
        if (error)
            *error =
                tr("Metadata writer: '%1' could not be started. Check file permissions and path.").arg(exiftool_bin);
        else
            emit showMessage(
                tr("Warning - cb2Bib"),
                tr("Metadata writer: '%1' could not be started. Check file permissions and path.").arg(exiftool_bin));
    }
    if (!exiftool.waitForFinished(90000))
        exiftool.kill();
    const QString exiftool_error(exiftool.readAllStandardError().trimmed());
    const bool inserted(exiftool.error() == QProcess::UnknownError && exiftool.exitCode() == 0 &&
                        exiftool_error.isEmpty());
    if (!inserted && error)
        if (!exiftool_error.isEmpty())
            *error = exiftool_error;
    QFile::remove(bibtags_file);
    QFile::remove(exiftoolconf_file);
    return inserted;
}