"Fossies" - the Fresh Open Source Software Archive

Member "xapian-core-1.4.14/docs/matcherdesign.html" (23 Nov 2019, 15621 Bytes) of package /linux/www/xapian-core-1.4.14.tar.xz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) HTML source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 <?xml version="1.0" encoding="utf-8" ?>
    2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    3 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    4 <head>
    5 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    6 <meta name="generator" content="Docutils 0.15.2: http://docutils.sourceforge.net/" />
    7 <title>Matcher Design Notes</title>
    8 <style type="text/css">
    9 
   10 /*
   11 :Author: David Goodger (goodger@python.org)
   12 :Id: $Id: html4css1.css 7952 2016-07-26 18:15:59Z milde $
   13 :Copyright: This stylesheet has been placed in the public domain.
   14 
   15 Default cascading style sheet for the HTML output of Docutils.
   16 
   17 See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
   18 customize this style sheet.
   19 */
   20 
   21 /* used to remove borders from tables and images */
   22 .borderless, table.borderless td, table.borderless th {
   23   border: 0 }
   24 
   25 table.borderless td, table.borderless th {
   26   /* Override padding for "table.docutils td" with "! important".
   27      The right padding separates the table cells. */
   28   padding: 0 0.5em 0 0 ! important }
   29 
   30 .first {
   31   /* Override more specific margin styles with "! important". */
   32   margin-top: 0 ! important }
   33 
   34 .last, .with-subtitle {
   35   margin-bottom: 0 ! important }
   36 
   37 .hidden {
   38   display: none }
   39 
   40 .subscript {
   41   vertical-align: sub;
   42   font-size: smaller }
   43 
   44 .superscript {
   45   vertical-align: super;
   46   font-size: smaller }
   47 
   48 a.toc-backref {
   49   text-decoration: none ;
   50   color: black }
   51 
   52 blockquote.epigraph {
   53   margin: 2em 5em ; }
   54 
   55 dl.docutils dd {
   56   margin-bottom: 0.5em }
   57 
   58 object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
   59   overflow: hidden;
   60 }
   61 
   62 /* Uncomment (and remove this text!) to get bold-faced definition list terms
   63 dl.docutils dt {
   64   font-weight: bold }
   65 */
   66 
   67 div.abstract {
   68   margin: 2em 5em }
   69 
   70 div.abstract p.topic-title {
   71   font-weight: bold ;
   72   text-align: center }
   73 
   74 div.admonition, div.attention, div.caution, div.danger, div.error,
   75 div.hint, div.important, div.note, div.tip, div.warning {
   76   margin: 2em ;
   77   border: medium outset ;
   78   padding: 1em }
   79 
   80 div.admonition p.admonition-title, div.hint p.admonition-title,
   81 div.important p.admonition-title, div.note p.admonition-title,
   82 div.tip p.admonition-title {
   83   font-weight: bold ;
   84   font-family: sans-serif }
   85 
   86 div.attention p.admonition-title, div.caution p.admonition-title,
   87 div.danger p.admonition-title, div.error p.admonition-title,
   88 div.warning p.admonition-title, .code .error {
   89   color: red ;
   90   font-weight: bold ;
   91   font-family: sans-serif }
   92 
   93 /* Uncomment (and remove this text!) to get reduced vertical space in
   94    compound paragraphs.
   95 div.compound .compound-first, div.compound .compound-middle {
   96   margin-bottom: 0.5em }
   97 
   98 div.compound .compound-last, div.compound .compound-middle {
   99   margin-top: 0.5em }
  100 */
  101 
  102 div.dedication {
  103   margin: 2em 5em ;
  104   text-align: center ;
  105   font-style: italic }
  106 
  107 div.dedication p.topic-title {
  108   font-weight: bold ;
  109   font-style: normal }
  110 
  111 div.figure {
  112   margin-left: 2em ;
  113   margin-right: 2em }
  114 
  115 div.footer, div.header {
  116   clear: both;
  117   font-size: smaller }
  118 
  119 div.line-block {
  120   display: block ;
  121   margin-top: 1em ;
  122   margin-bottom: 1em }
  123 
  124 div.line-block div.line-block {
  125   margin-top: 0 ;
  126   margin-bottom: 0 ;
  127   margin-left: 1.5em }
  128 
  129 div.sidebar {
  130   margin: 0 0 0.5em 1em ;
  131   border: medium outset ;
  132   padding: 1em ;
  133   background-color: #ffffee ;
  134   width: 40% ;
  135   float: right ;
  136   clear: right }
  137 
  138 div.sidebar p.rubric {
  139   font-family: sans-serif ;
  140   font-size: medium }
  141 
  142 div.system-messages {
  143   margin: 5em }
  144 
  145 div.system-messages h1 {
  146   color: red }
  147 
  148 div.system-message {
  149   border: medium outset ;
  150   padding: 1em }
  151 
  152 div.system-message p.system-message-title {
  153   color: red ;
  154   font-weight: bold }
  155 
  156 div.topic {
  157   margin: 2em }
  158 
  159 h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
  160 h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  161   margin-top: 0.4em }
  162 
  163 h1.title {
  164   text-align: center }
  165 
  166 h2.subtitle {
  167   text-align: center }
  168 
  169 hr.docutils {
  170   width: 75% }
  171 
  172 img.align-left, .figure.align-left, object.align-left, table.align-left {
  173   clear: left ;
  174   float: left ;
  175   margin-right: 1em }
  176 
  177 img.align-right, .figure.align-right, object.align-right, table.align-right {
  178   clear: right ;
  179   float: right ;
  180   margin-left: 1em }
  181 
  182 img.align-center, .figure.align-center, object.align-center {
  183   display: block;
  184   margin-left: auto;
  185   margin-right: auto;
  186 }
  187 
  188 table.align-center {
  189   margin-left: auto;
  190   margin-right: auto;
  191 }
  192 
  193 .align-left {
  194   text-align: left }
  195 
  196 .align-center {
  197   clear: both ;
  198   text-align: center }
  199 
  200 .align-right {
  201   text-align: right }
  202 
  203 /* reset inner alignment in figures */
  204 div.align-right {
  205   text-align: inherit }
  206 
  207 /* div.align-center * { */
  208 /*   text-align: left } */
  209 
  210 .align-top    {
  211   vertical-align: top }
  212 
  213 .align-middle {
  214   vertical-align: middle }
  215 
  216 .align-bottom {
  217   vertical-align: bottom }
  218 
  219 ol.simple, ul.simple {
  220   margin-bottom: 1em }
  221 
  222 ol.arabic {
  223   list-style: decimal }
  224 
  225 ol.loweralpha {
  226   list-style: lower-alpha }
  227 
  228 ol.upperalpha {
  229   list-style: upper-alpha }
  230 
  231 ol.lowerroman {
  232   list-style: lower-roman }
  233 
  234 ol.upperroman {
  235   list-style: upper-roman }
  236 
  237 p.attribution {
  238   text-align: right ;
  239   margin-left: 50% }
  240 
  241 p.caption {
  242   font-style: italic }
  243 
  244 p.credits {
  245   font-style: italic ;
  246   font-size: smaller }
  247 
  248 p.label {
  249   white-space: nowrap }
  250 
  251 p.rubric {
  252   font-weight: bold ;
  253   font-size: larger ;
  254   color: maroon ;
  255   text-align: center }
  256 
  257 p.sidebar-title {
  258   font-family: sans-serif ;
  259   font-weight: bold ;
  260   font-size: larger }
  261 
  262 p.sidebar-subtitle {
  263   font-family: sans-serif ;
  264   font-weight: bold }
  265 
  266 p.topic-title {
  267   font-weight: bold }
  268 
  269 pre.address {
  270   margin-bottom: 0 ;
  271   margin-top: 0 ;
  272   font: inherit }
  273 
  274 pre.literal-block, pre.doctest-block, pre.math, pre.code {
  275   margin-left: 2em ;
  276   margin-right: 2em }
  277 
  278 pre.code .ln { color: grey; } /* line numbers */
  279 pre.code, code { background-color: #eeeeee }
  280 pre.code .comment, code .comment { color: #5C6576 }
  281 pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
  282 pre.code .literal.string, code .literal.string { color: #0C5404 }
  283 pre.code .name.builtin, code .name.builtin { color: #352B84 }
  284 pre.code .deleted, code .deleted { background-color: #DEB0A1}
  285 pre.code .inserted, code .inserted { background-color: #A3D289}
  286 
  287 span.classifier {
  288   font-family: sans-serif ;
  289   font-style: oblique }
  290 
  291 span.classifier-delimiter {
  292   font-family: sans-serif ;
  293   font-weight: bold }
  294 
  295 span.interpreted {
  296   font-family: sans-serif }
  297 
  298 span.option {
  299   white-space: nowrap }
  300 
  301 span.pre {
  302   white-space: pre }
  303 
  304 span.problematic {
  305   color: red }
  306 
  307 span.section-subtitle {
  308   /* font-size relative to parent (h1..h6 element) */
  309   font-size: 80% }
  310 
  311 table.citation {
  312   border-left: solid 1px gray;
  313   margin-left: 1px }
  314 
  315 table.docinfo {
  316   margin: 2em 4em }
  317 
  318 table.docutils {
  319   margin-top: 0.5em ;
  320   margin-bottom: 0.5em }
  321 
  322 table.footnote {
  323   border-left: solid 1px black;
  324   margin-left: 1px }
  325 
  326 table.docutils td, table.docutils th,
  327 table.docinfo td, table.docinfo th {
  328   padding-left: 0.5em ;
  329   padding-right: 0.5em ;
  330   vertical-align: top }
  331 
  332 table.docutils th.field-name, table.docinfo th.docinfo-name {
  333   font-weight: bold ;
  334   text-align: left ;
  335   white-space: nowrap ;
  336   padding-left: 0 }
  337 
  338 /* "booktabs" style (no vertical lines) */
  339 table.docutils.booktabs {
  340   border: 0px;
  341   border-top: 2px solid;
  342   border-bottom: 2px solid;
  343   border-collapse: collapse;
  344 }
  345 table.docutils.booktabs * {
  346   border: 0px;
  347 }
  348 table.docutils.booktabs th {
  349   border-bottom: thin solid;
  350   text-align: left;
  351 }
  352 
  353 h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
  354 h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  355   font-size: 100% }
  356 
  357 ul.auto-toc {
  358   list-style-type: none }
  359 
  360 </style>
  361 </head>
  362 <body>
  363 <div class="document" id="matcher-design-notes">
  364 <h1 class="title">Matcher Design Notes</h1>
  365 
  366 <p>This document is incomplete at present. It lacks explanation of the
  367 min-heap used to keep the best N M-set items (Managing Gigabytes
  368 describes this technique well), the check() method isn't discussed, and
  369 probably some other things.</p>
  370 <div class="section" id="the-postlist-tree">
  371 <h1>The PostList Tree</h1>
  372 <p>The QueryOptimiser class builds a tree structure of PostList objects
  373 from the query. At the leaf level, a PostList object is created for each
  374 term, and for other leaf-level subqueries, like PostingSource objects
  375 and value ranges. Then pairs or groups of PostLists are combined using
  376 2-way or n-way branching tree elements for AND, OR, etc - these are
  377 virtual PostLists whose class names reflect the operation
  378 (MultiAndPostList, OrPostList, etc). See below for a full list.</p>
  379 <div class="section" id="or">
  380 <h2>OR</h2>
  381 <p>For a group of OR operations, each OrPostList has two children, job).
  382 The OR tree is built up in a similar way to how an optimal huffman code
  383 is constructed, so the sub-PostLists with the fewest entries are
  384 furthest down the tree, and those with most nearest the top (this is
  385 more efficient than an n-ary tree in terms of the number of comparisons
  386 which need to be performed, ignoring various optimisations which the
  387 matcher can perform - it may actually be the case that a MultiOrPostList
  388 could do a better job in practice though).</p>
  389 <p>OR is coded for maximum efficiency when the right branch has fewer
  390 postings in than the left branch.</p>
  391 <p>When an OR gets &quot;at end&quot;, it autoprunes, replacing itself with the
  392 branch that still has postings - see below for full details.</p>
  393 </div>
  394 <div class="section" id="and">
  395 <h2>AND</h2>
  396 <p>For a multi-way AND operation, we have MultiAndPostList, which tries the
  397 sub-postlists in order from least frequent to most frequent (two-way AND
  398 is handled the same way). This will generally minimise the number of
  399 posting list entries we read and maximises the size of each skip_to.</p>
  400 <p>When one of a sub-trees of AND operations runs out, the sub-query will
  401 signal &quot;at end&quot;, and this causes the AND to signal &quot;at end&quot; too.</p>
  402 <p>The OP_FILTER query operator is actually treated as AND in the postlist
  403 tree - the boolean-ness is pushed down to the leaf query, where it is
  404 handled by the Weight object.</p>
  405 </div>
  406 <div class="section" id="other-operations">
  407 <h2>Other operations</h2>
  408 <p>The other operations also handle &quot;at end&quot; either like OR or AND (for
  409 asymmetric operations like AND_MAYBE, which happens may depend which
  410 branch has run out).</p>
  411 </div>
  412 </div>
  413 <div class="section" id="running-the-match">
  414 <h1>running the match</h1>
  415 <p>Once the tree is built, the matcher repeatedly asks the root of the tree
  416 for the next matching document and compares it to those in the
  417 proto-mset it maintains. Once the proto-mset is of the desired final
  418 size, the candidate needs to score more highly that the lowest scoring
  419 document in the proto-mset (either by weight, or in sort order if
  420 sorting is used) to be interesting. If it is, the lowest scoring
  421 document is removed (which is easy as we store the proto-mset as a min
  422 heap) and the candidate is added.</p>
  423 <p>When the matcher itself gets &quot;at end&quot; from the postlist tree, the match
  424 process ends.</p>
  425 <p>The matcher also passes the lowest weight currently needed make the
  426 proto-mset into the tree, and each node may adjust this weight and pass
  427 it on to its subtrees. Each PostList can report a minimum weight it
  428 could contribute - so if the left branch of an AND will always return a
  429 weight of 2 or more, then if the whole AND needs to return at least 6,
  430 the right branch is told it needs to return at least 4.</p>
  431 <p>For example, an OR knows that if its left branch can contribute at most
  432 a weight of 4 and its right branch at most 7, then if the minimum weight
  433 is 8, only documents matching both branches are now of interest so it
  434 mutates into an AND. If the minimum weight is 6 it changes into an
  435 AND_MAYBE (A AND_MAYBE B matches documents which which match A, but B
  436 contributes to the weight - in most search engines query syntax, that's
  437 expressed as <cite>+A B</cite>). See the &quot;Operator Decay&quot; section below for full
  438 details of these mutations. If the minimum weight needed is 12, no
  439 document is good enough, and the OR returns &quot;end of list&quot;.</p>
  440 </div>
  441 <div class="section" id="phrase-and-near-matching">
  442 <h1>Phrase and near matching</h1>
  443 <p>The way phrase and near matching works is to perform an AND query for
  444 all the terms, with a filter node in front which only returns documents
  445 whose positional information fulfils the phrase requirements.</p>
  446 <p>Because checking the positional information can be quite costly compared
  447 to matching postlist trees, we hoist the position check higher up the
  448 tree in cases when the phrase operation is below an AND. So A AND (B
  449 NEAR C) will actually filter the results of (A AND B AND C) through a
  450 check for B NEAR C, which means we never need to check positions for
  451 documents which don't match A.</p>
  452 </div>
  453 <div class="section" id="virtual-postlist-types">
  454 <h1>virtual postlist types</h1>
  455 <p>There are several types of virtual PostList. Each type can be used in a
  456 weighted or unweighted (boolean) context - the only difference is whether the
  457 weights are used or not. The types are:</p>
  458 <ul class="simple">
  459 <li>OrPostList: returns documents which match either branch</li>
  460 <li>MultiAndPostList: returns documents which match all branches</li>
  461 <li>MultiXorPostList: returns documents which match an odd number of
  462 branches</li>
  463 <li>AndNotPostList: returns documents which match the left branch, but
  464 not the right (the weights of documents from the right branch are
  465 ignored).  &quot;X ANDNOT Y&quot; implements what some search engines offer
  466 as &quot;X -Y&quot; in their query syntax.</li>
  467 <li>AndMaybePostList: returns documents which match the left branch with
  468 weights added on from the right branch where that also matches (so
  469 an unweighted AndMaybePostList isn't very useful).  &quot;X ANDMAYBE Y&quot;
  470 implements what some search engines offer as &quot;+X Y&quot; in their
  471 query syntax.</li>
  472 <li>FIXME: this list is no longer complete...</li>
  473 </ul>
  474 <p>There are two main optimisations which the best match performs:
  475 autoprune and operator decay.</p>
  476 </div>
  477 <div class="section" id="autoprune">
  478 <h1>autoprune</h1>
  479 <p>For example, if a branch in the match tree is &quot;A OR B&quot;, when A runs out
  480 then &quot;A OR B&quot; is replaced by &quot;B&quot;. Similar reductions occur for XOR,
  481 ANDNOT, and ANDMAYBE (if the right branch runs out). Other operators
  482 (AND, FILTER, and ANDMAYBE (when the left branch runs out) simply return
  483 &quot;at_end&quot; and this is dealt with somewhere further up the tree as
  484 appropriate.</p>
  485 <p>An autoprune is indicated by the next or skip_to method returning a
  486 pointer to the PostList object to replace the postlist being read with.</p>
  487 </div>
  488 <div class="section" id="operator-decay">
  489 <h1>operator decay</h1>
  490 <p>The matcher tracks the minimum weight needed for a document to make it
  491 into the m-set (this decreases monotonically as the m-set forms). This
  492 can be used to replace on boolean operator with a stricter one. E.g.
  493 consider A OR B - when maxweight(A) &lt; minweight and maxweight(B) &lt;
  494 minweight then only documents matching both A and B can make it into the
  495 m-set so we can replace the OR with an AND. Operator decay is flagged
  496 using the same mechanism as autoprune, by returning the replacement
  497 operator from next or skip_to.</p>
  498 <p>Possible decays:</p>
  499 <ul class="simple">
  500 <li>OR → AND</li>
  501 <li>OR → ANDMAYBE</li>
  502 <li>ANDMAYBE → AND</li>
  503 <li>XOR → ANDNOT</li>
  504 </ul>
  505 <p>A related optimisation is that the Match object may terminate early if
  506 maxweight for the whole tree is less than the smallest weight in the
  507 mset.</p>
  508 </div>
  509 </div>
  510 </body>
  511 </html>