Member "xapian-core-1.4.14/docs/collapsing.html" (23 Nov 2019, 14439 Bytes) of package /linux/www/xapian-core-1.4.14.tar.xz:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.15.2: http://docutils.sourceforge.net/" />
<title>Collapsing of Search Results</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 7952 2016-07-26 18:15:59Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

.subscript {
  vertical-align: sub;
  font-size: smaller }

.superscript {
  vertical-align: super;
  font-size: smaller }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
  overflow: hidden;
}

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title, .code .error {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left, .figure.align-left, object.align-left, table.align-left {
  clear: left ;
  float: left ;
  margin-right: 1em }

img.align-right, .figure.align-right, object.align-right, table.align-right {
  clear: right ;
  float: right ;
  margin-left: 1em }

img.align-center, .figure.align-center, object.align-center {
  display: block;
  margin-left: auto;
  margin-right: auto;
}

table.align-center {
  margin-left: auto;
  margin-right: auto;
}

.align-left {
  text-align: left }

.align-center {
  clear: both ;
  text-align: center }

.align-right {
  text-align: right }

/* reset inner alignment in figures */
div.align-right {
  text-align: inherit }

/* div.align-center * { */
/*   text-align: left } */

.align-top    {
  vertical-align: top }

.align-middle {
  vertical-align: middle }

.align-bottom {
  vertical-align: bottom }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font: inherit }

pre.literal-block, pre.doctest-block, pre.math, pre.code {
  margin-left: 2em ;
  margin-right: 2em }

pre.code .ln { color: grey; } /* line numbers */
pre.code, code { background-color: #eeeeee }
pre.code .comment, code .comment { color: #5C6576 }
pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
pre.code .literal.string, code .literal.string { color: #0C5404 }
pre.code .name.builtin, code .name.builtin { color: #352B84 }
pre.code .deleted, code .deleted { background-color: #DEB0A1}
pre.code .inserted, code .inserted { background-color: #A3D289}

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

/* "booktabs" style (no vertical lines) */
table.docutils.booktabs {
  border: 0px;
  border-top: 2px solid;
  border-bottom: 2px solid;
  border-collapse: collapse;
}
table.docutils.booktabs * {
  border: 0px;
}
table.docutils.booktabs th {
  border-bottom: thin solid;
  text-align: left;
}

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document" id="collapsing-of-search-results">
<h1 class="title">Collapsing of Search Results</h1>

<!-- Copyright (C) 2009,2011 Olly Betts -->
<div class="contents topic" id="table-of-contents">
<p class="topic-title first">Table of contents</p>
<ul class="simple">
<li><a class="reference internal" href="#introduction" id="id1">Introduction</a></li>
<li><a class="reference internal" href="#performance" id="id2">Performance</a></li>
<li><a class="reference internal" href="#api" id="id3">API</a></li>
<li><a class="reference internal" href="#statistics" id="id4">Statistics</a></li>
<li><a class="reference internal" href="#examples" id="id5">Examples</a><ul>
<li><a class="reference internal" href="#duplicate-elimination" id="id6">Duplicate Elimination</a></li>
<li><a class="reference internal" href="#restricting-the-number-of-matches-per-source" id="id7">Restricting the Number of Matches per Source</a></li>
</ul>
</li>
</ul>
</div>
<div class="section" id="introduction">
<h1><a class="toc-backref" href="#id1">Introduction</a></h1>
<p>Xapian provides the ability to eliminate &quot;duplicate&quot; documents from the MSet.
This feature is known as &quot;collapsing&quot; - think of a pile of duplicates being
collapsed down to leave a single result (or a small number of results).</p>
<p>Collapsing always removes the worse-ranked documents (if ranking by
relevance, those with the lowest weight; if ranking by sorting, those which
sort lowest).</p>
<p>Whether two documents count as duplicates of one another is determined by their
&quot;collapse key&quot;.  A document with an empty collapse key is never collapsed;
otherwise, documents with the same collapse key are collapsed together.</p>
<p>Currently the collapse key is taken from a value slot you specify (via the
method <tt class="docutils literal"><span class="pre">Enquire::set_collapse_key()</span></tt>), but in the future you should be able
to build collapse keys dynamically using <tt class="docutils literal"><span class="pre">Xapian::KeyMaker</span></tt> as you already
can for sort keys.</p>
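<p>Since the key comes from a value slot, it has to be stored when the document
is indexed.  The following is only a minimal sketch of how that might look: the
helper <tt class="docutils literal">index_document</tt>, the constant <tt class="docutils literal">COLLAPSE_SLOT</tt> and the choice of
slot 4 are assumptions made for illustration and are not part of Xapian.</p>
<pre class="literal-block">
#include &lt;xapian.h&gt;
#include &lt;string&gt;

// Hypothetical slot number chosen for this sketch.  Any unused slot works,
// but it must match the slot later passed to Enquire::set_collapse_key().
static const Xapian::valueno COLLAPSE_SLOT = 4;

// Index one document and store its collapse key in COLLAPSE_SLOT.
// A document with no value in the slot has an empty collapse key, so it
// is never collapsed.
void index_document(Xapian::WritableDatabase&amp; db,
                    const std::string&amp; text,
                    const std::string&amp; collapse_key)
{
    Xapian::Document doc;
    Xapian::TermGenerator indexer;
    indexer.set_document(doc);
    indexer.index_text(text);
    doc.set_data(text);
    if (!collapse_key.empty())
        doc.add_value(COLLAPSE_SLOT, collapse_key);
    db.add_document(doc);
}
</pre>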
</div>
<div class="section" id="performance">
<h1><a class="toc-backref" href="#id2">Performance</a></h1>
<p>Collapsing is performed during the match process, so it is efficient.
In particular, this approach is much better than generating a larger MSet and
post-processing it.</p>
<p>However, if collapsing eliminates a lot of documents then the collapsed
search will typically take longer than the uncollapsed search because
the matcher has to consider many more potential matches.</p>
</div>
<div class="section" id="api">
<h1><a class="toc-backref" href="#id3">API</a></h1>
<p>To enable collapsing, call the method <tt class="docutils literal"><span class="pre">Enquire::set_collapse_key()</span></tt> with the
value slot, and optionally the number of matches with each collapse key to keep
(this defaults to 1 if not specified), e.g.:</p>
<pre class="literal-block">
// Collapse on value slot 4, leaving at most 2 documents with each
// collapse key.
enquire.set_collapse_key(4, 2);
</pre>
<p>Once you have the <tt class="docutils literal">MSet</tt> object, you can read the collapse key for each
match with <tt class="docutils literal"><span class="pre">MSetIterator::get_collapse_key()</span></tt>, and also the &quot;collapse count&quot;
with <tt class="docutils literal"><span class="pre">MSetIterator::get_collapse_count()</span></tt>.  The latter is a lower bound on
the number of documents with the same collapse key which collapsing eliminated.</p>
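<p>Putting these calls together, a search with collapsing enabled could be
sketched roughly as follows.  The function name <tt class="docutils literal">show_collapsed</tt> and the MSet
size of 10 are assumptions for illustration; the slot and limit mirror the
example above.</p>
<pre class="literal-block">
#include &lt;xapian.h&gt;
#include &lt;iostream&gt;

// Run a collapsed search and print, for each match, its document id,
// its collapse key and the lower bound on how many documents with the
// same key were eliminated.
void show_collapsed(const Xapian::Database&amp; db, const Xapian::Query&amp; query)
{
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    // Collapse on value slot 4, keeping at most 2 documents per key.
    enquire.set_collapse_key(4, 2);

    Xapian::MSet mset = enquire.get_mset(0, 10);
    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
        std::cout &lt;&lt; &quot;docid &quot; &lt;&lt; *i
                  &lt;&lt; &quot; key=&quot; &lt;&lt; i.get_collapse_key()
                  &lt;&lt; &quot; collapsed&gt;=&quot; &lt;&lt; i.get_collapse_count()
                  &lt;&lt; '\n';
    }
}
</pre>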
<p>Beware that if you have a percentage cutoff active, then the collapse count
will (at least in the current implementation) always be either 0 or 1,
as it is hard to tell if the collapsed documents would have failed the cutoff.</p>
</div>
<div class="section" id="statistics">
<h1><a class="toc-backref" href="#id4">Statistics</a></h1>
<p>As well as the usual bounds and estimate of the &quot;full&quot; MSet size (i.e. the
size if you'd asked for enough matches to get them all), the matcher also
calculates bounds and an estimate for what the MSet size would be if collapsing
had not been used - you can obtain these using the following <tt class="docutils literal">MSet</tt> methods:</p>
<pre class="literal-block">
Xapian::doccount get_uncollapsed_matches_lower_bound() const;
Xapian::doccount get_uncollapsed_matches_estimated() const;
Xapian::doccount get_uncollapsed_matches_upper_bound() const;
</pre>
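<p>As an illustration, both sets of figures can be reported side by side for an
<tt class="docutils literal">MSet</tt> returned by a collapsed search.  This is only a sketch; the helper name
<tt class="docutils literal">report_sizes</tt> and the output wording are assumptions.</p>
<pre class="literal-block">
#include &lt;xapian.h&gt;
#include &lt;iostream&gt;

// Print the bounds and estimate with collapsing applied, followed by
// the bounds and estimate for the same search without collapsing.
void report_sizes(const Xapian::MSet&amp; mset)
{
    std::cout &lt;&lt; &quot;With collapsing: at least &quot;
              &lt;&lt; mset.get_matches_lower_bound() &lt;&lt; &quot;, about &quot;
              &lt;&lt; mset.get_matches_estimated() &lt;&lt; &quot;, at most &quot;
              &lt;&lt; mset.get_matches_upper_bound() &lt;&lt; '\n';
    std::cout &lt;&lt; &quot;Without collapsing: at least &quot;
              &lt;&lt; mset.get_uncollapsed_matches_lower_bound() &lt;&lt; &quot;, about &quot;
              &lt;&lt; mset.get_uncollapsed_matches_estimated() &lt;&lt; &quot;, at most &quot;
              &lt;&lt; mset.get_uncollapsed_matches_upper_bound() &lt;&lt; '\n';
}
</pre>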
</div>
<div class="section" id="examples">
<h1><a class="toc-backref" href="#id5">Examples</a></h1>
<p>Here are some ways this feature can be used:</p>
<div class="section" id="duplicate-elimination">
<h2><a class="toc-backref" href="#id6">Duplicate Elimination</a></h2>
<p>If your document collection includes some identical documents, it's unhelpful
when these show up in the search results.  Sometimes it is possible to
eliminate them at index time, but this isn't always feasible.</p>
<p>If you calculate a checksum (e.g. SHA1 or MD5) of the document contents and
store this in a document value then you can collapse on this to eliminate such
duplicates.</p>
<p>If the document files will be identical, then the checksum can just be of the
file, but sometimes it makes sense to extract and normalise the text, then
calculate the checksum of this.</p>
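<p>The &quot;normalise, then checksum&quot; idea might be sketched as below.  Everything
here is an assumption made for illustration: the helper names are hypothetical,
the normalisation handles ASCII only, and 64-bit FNV-1a is used purely as a
stand-in because the C++ standard library has no SHA1 or MD5.  It is not
collision-resistant, so a real application would more likely use a proper
digest from a cryptographic library.</p>
<pre class="literal-block">
#include &lt;xapian.h&gt;
#include &lt;cctype&gt;
#include &lt;cstdint&gt;
#include &lt;string&gt;

// Lower-case the text and reduce each run of non-alphanumeric bytes to
// a single space, so case, spacing and punctuation don't affect the key.
std::string normalise(const std::string&amp; text)
{
    std::string out;
    bool gap = false;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            if (gap &amp;&amp; !out.empty()) out += ' ';
            out += static_cast&lt;char&gt;(std::tolower(ch));
            gap = false;
        } else {
            gap = true;
        }
    }
    return out;
}

// 64-bit FNV-1a hash, standing in for SHA1 or MD5 in this sketch.
std::string checksum(const std::string&amp; s)
{
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char ch : s) {
        h ^= ch;
        h *= 1099511628211ULL;
    }
    return std::to_string(h);
}

// Store the checksum of the normalised text in the given value slot,
// ready to be used as the collapse key at search time.
void add_duplicate_key(Xapian::Document&amp; doc, const std::string&amp; text,
                       Xapian::valueno slot)
{
    doc.add_value(slot, checksum(normalise(text)));
}
</pre>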
</div>
<div class="section" id="restricting-the-number-of-matches-per-source">
<h2><a class="toc-backref" href="#id7">Restricting the Number of Matches per Source</a></h2>
<p>It's sometimes desirable to avoid one source dominating the results.  For
example, in a web search application, you might want to show at most three
matches from any website, in which case you could collapse on the hostname
with <tt class="docutils literal">collapse_max</tt> set to 3.</p>
<p>When displaying the results, you can use the collapse count of each match
to inform the user that there are at least that many other matches for this
host (unless you are also using a percentage cutoff - see above).  If it is
non-zero, you can usefully provide a &quot;show all documents for host
&lt;get_collapse_key()&gt;&quot; button which reruns the search without collapsing and
with a boolean filter for a prefixed term containing the hostname.  Note
that this will not always give a button when there are collapsed documents,
because the collapse count is a lower bound and may be zero even when
matches with the same key were collapsed.</p>
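<p>A rough sketch of both searches follows.  It assumes, purely for illustration,
that the hostname is stored in value slot 0 and is also indexed as a boolean
term with the prefix <tt class="docutils literal">H</tt>; the function names and the MSet size of 10 are
likewise hypothetical.</p>
<pre class="literal-block">
#include &lt;xapian.h&gt;
#include &lt;string&gt;

// Assumed conventions for this sketch, not fixed by Xapian.
static const Xapian::valueno HOST_SLOT = 0;
static const char HOST_PREFIX[] = &quot;H&quot;;

// Initial search: keep at most three matches per host.
Xapian::MSet search_collapsed(const Xapian::Database&amp; db,
                              const Xapian::Query&amp; query)
{
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    enquire.set_collapse_key(HOST_SLOT, 3);
    return enquire.get_mset(0, 10);
}

// &quot;Show all documents for host&quot;: the same query without collapsing,
// plus a boolean filter which restricts the matches to one host
// without changing their weights.
Xapian::MSet search_one_host(const Xapian::Database&amp; db,
                             const Xapian::Query&amp; query,
                             const std::string&amp; host)
{
    Xapian::Query filter(HOST_PREFIX + host);
    Xapian::Enquire enquire(db);
    enquire.set_query(Xapian::Query(Xapian::Query::OP_FILTER, query, filter));
    return enquire.get_mset(0, 10);
}
</pre>
<p>The value returned by <tt class="docutils literal"><span class="pre">get_collapse_key()</span></tt> for a match from the first
search can be passed as <tt class="docutils literal">host</tt> to the second.</p>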
<p>This approach isn't just useful for web search - the &quot;source&quot; can be defined
usefully in many applications.  For example, a forum or mailing list search
could collapse on a topic or thread identifier, an index at the chapter level
could collapse on a book identifier (such as an ISBN), etc.</p>
</div>
</div>
</div>
</body>
</html>