"Fossies" - the Fresh Open Source Software Archive

Member "xapian-core-1.4.14/docs/scalability.html" (23 Nov 2019, 13701 Bytes) of package /linux/www/xapian-core-1.4.14.tar.xz:


<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.15.2: http://docutils.sourceforge.net/" />
<title>Scalability</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 7952 2016-07-26 18:15:59Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

.subscript {
  vertical-align: sub;
  font-size: smaller }

.superscript {
  vertical-align: super;
  font-size: smaller }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
  overflow: hidden;
}

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title, .code .error {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left, .figure.align-left, object.align-left, table.align-left {
  clear: left ;
  float: left ;
  margin-right: 1em }

img.align-right, .figure.align-right, object.align-right, table.align-right {
  clear: right ;
  float: right ;
  margin-left: 1em }

img.align-center, .figure.align-center, object.align-center {
  display: block;
  margin-left: auto;
  margin-right: auto;
}

table.align-center {
  margin-left: auto;
  margin-right: auto;
}

.align-left {
  text-align: left }

.align-center {
  clear: both ;
  text-align: center }

.align-right {
  text-align: right }

/* reset inner alignment in figures */
div.align-right {
  text-align: inherit }

/* div.align-center * { */
/*   text-align: left } */

.align-top    {
  vertical-align: top }

.align-middle {
  vertical-align: middle }

.align-bottom {
  vertical-align: bottom }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font: inherit }

pre.literal-block, pre.doctest-block, pre.math, pre.code {
  margin-left: 2em ;
  margin-right: 2em }

pre.code .ln { color: grey; } /* line numbers */
pre.code, code { background-color: #eeeeee }
pre.code .comment, code .comment { color: #5C6576 }
pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
pre.code .literal.string, code .literal.string { color: #0C5404 }
pre.code .name.builtin, code .name.builtin { color: #352B84 }
pre.code .deleted, code .deleted { background-color: #DEB0A1}
pre.code .inserted, code .inserted { background-color: #A3D289}

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

/* "booktabs" style (no vertical lines) */
table.docutils.booktabs {
  border: 0px;
  border-top: 2px solid;
  border-bottom: 2px solid;
  border-collapse: collapse;
}
table.docutils.booktabs * {
  border: 0px;
}
table.docutils.booktabs th {
  border-bottom: thin solid;
  text-align: left;
}

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document" id="scalability">
<h1 class="title">Scalability</h1>

<p>People often want to know how Xapian will scale. The short answer is
&quot;very well&quot;: an early version of the software powered the (now defunct)
Webtop search engine, which offered a search over around 500 million web
pages (around 1.5 terabytes of database files). Searches took less than
a second.</p>
<p>In terms of current deployments, <a class="reference external" href="http://search.gmane.org/">gmane</a>
indexes and searches nearly 100 million mail messages on a single server
at the time of writing (2012), and we've had user reports of systems with
more than 250 million documents.</p>
<div class="section" id="benchmarking">
<h1>Benchmarking</h1>
<p>One effect to be aware of when designing benchmarks is that queries are
much slower when nothing is cached. The first few queries on a database
which hasn't been searched recently will therefore be unrepresentatively
slow compared to the typical case.</p>
<p>In real use, almost all the non-leaf blocks from the B-trees being
used for the search will be cached quickly, as will many commonly used
leaf blocks.</p>
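<p>One way to keep the cold-cache effect out of the headline figure is to
time the first few queries separately. The following Python sketch shows the
idea; <tt class="docutils literal">run_query</tt> is a hypothetical callable
standing in for whatever executes one search in your application, and the
default of five warm-up queries is an arbitrary assumption.</p>

```python
import statistics
import time


def benchmark(run_query, queries, warmup=5):
    """Time each query, reporting warm-up and steady-state runs separately.

    run_query is a hypothetical callable that executes one search.  The
    first `warmup` queries are timed but kept out of the steady-state
    figure, because they mostly measure a cold cache.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    cold, warm = timings[:warmup], timings[warmup:]
    return {
        "cold_median": statistics.median(cold) if cold else None,
        "warm_median": statistics.median(warm) if warm else None,
        "warm_count": len(warm),
    }
```

<p>Comparing <tt class="docutils literal">cold_median</tt> with
<tt class="docutils literal">warm_median</tt> gives a rough feel for how much
of the measured time is down to caching rather than to the search itself.</p>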
</div>
<div class="section" id="general-scalability-considerations">
<h1>General Scalability Considerations</h1>
<p>In a large search application, I/O will end up being the limiting
factor. So you want a RAID setup optimised for fast reading, and lots of
RAM in the box so the OS can cache lots of disk blocks. The access patterns
typically mean that you only need to cache a few percent of the database
to eliminate most disk cache misses.</p>
<p>It also means that reducing the database size is usually a win.  Xapian's
disk-based databases compress the information in the tables in ways which
work well given the nature of the data but aren't too expensive to
unpack (e.g. lists of sorted docids are stored as differences, with
smaller values encoded in fewer bytes). There is further potential for
improving the encodings used.</p>
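<p>The idea of storing sorted docids as differences can be sketched in a few
lines of Python. This is an illustration of the general technique (gaps
written as variable-length integers, seven bits per byte) and is not
claimed to match the byte format Xapian actually writes to disk; the
function names are made up for the example.</p>

```python
def encode_docids(docids):
    """Encode a sorted list of docids as variable-length gaps."""
    out = bytearray()
    previous = 0
    for docid in docids:
        gap = docid - previous
        previous = docid
        # Emit 7 bits per byte, low bits first.  Adding 128 sets the
        # high bit, which means "more bytes follow".
        while gap >= 128:
            gap, low = divmod(gap, 128)
            out.append(low + 128)
        out.append(gap)
    return bytes(out)


def decode_docids(data):
    """Reverse encode_docids, rebuilding the original docid list."""
    docids = []
    current = 0
    gap = 0
    scale = 1
    for byte in data:
        gap += (byte % 128) * scale
        if byte >= 128:
            # Continuation byte: the next byte holds higher-order bits.
            scale *= 128
        else:
            current += gap
            docids.append(current)
            gap, scale = 0, 1
    return docids
```

<p>With this scheme the list 3, 5, 6, 130, 100000 takes 7 bytes rather than
the 20 bytes that five fixed-width 32-bit values would need, because the
small gaps each fit in a single byte.</p>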
<p>Another way to reduce disk I/O is to run databases through
xapian-compact. The B-tree manager usually leaves some spare space in
each block so that updates are more efficient. (There are heuristics
which fill blocks fuller when they detect a long run of sequential
insertions, so adding documents to the end of an empty database will
produce fairly compact tables, apart from the postlist table.) Compacting
makes all blocks as full as possible, and so reduces the size of the
database. It also produces a database with revision 1 which is inherently
faster to search. The penalty is that updates will be slow for a while,
as they'll result in a lot of block splitting when all blocks are full.</p>
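<p>To see why updates to a freshly compacted database cause so much block
splitting, here is a toy model in Python. It is only a sketch: the
list-of-lists store, the capacity of 8 keys per block and the name
<tt class="docutils literal">count_splits</tt> are all made up for
illustration, and none of it reflects how the glass B-tree code is
actually written.</p>

```python
import bisect
import random


def count_splits(fill, capacity=8, blocks=100, inserts=200, seed=1):
    """Count block splits caused by random inserts into a toy sorted store.

    Each block starts with `fill` of its `capacity` slots in use, so
    fill == capacity models a compacted table.
    """
    rng = random.Random(seed)
    # Keys are spaced out so there is always room to insert between them.
    store = [[(b * capacity + i) * 1000 for i in range(fill)]
             for b in range(blocks)]
    splits = 0
    for _ in range(inserts):
        key = rng.randrange(blocks * capacity * 1000)
        # Pick the last block whose first key is not greater than the key.
        firsts = [block[0] for block in store]
        index = max(bisect.bisect_right(firsts, key) - 1, 0)
        block = store[index]
        bisect.insort(block, key)
        if len(block) > capacity:
            # Overfull: split the block into two roughly equal halves.
            half = len(block) // 2
            store[index:index + 1] = [block[:half], block[half:]]
            splits += 1
    return splits
```

<p>In this model <tt class="docutils literal">count_splits(fill=8)</tt>
stands for a compacted table and
<tt class="docutils literal">count_splits(fill=6)</tt> for one with spare
space; for the same inserts the compacted case should report noticeably more
splits, since the first insert into any full block forces a split.</p>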
<p>Splitting the data over several databases is generally a good strategy.
Once each has finished being updated, compact it to make it smaller and
faster to search.</p>
<p>A multiple-database scheme works particularly well if you want a rolling
web index where the contents of the oldest database can be rechecked and
live links put back into a new database which, once built, replaces the
oldest database. It's also good for a news-type application where older
documents should expire from the index.</p>
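<p>The bookkeeping for such a rolling scheme can be very small. The Python
sketch below only tracks which databases are live; the class name
<tt class="docutils literal">RollingIndex</tt> and the week-based names are
hypothetical, and building, compacting and opening the databases is left to
your own code.</p>

```python
from collections import deque


class RollingIndex:
    """Track a fixed-size window of database names, oldest first.

    Purely illustrative: the names stand in for database directories
    and nothing here touches Xapian itself.
    """

    def __init__(self, names):
        self.active = deque(names)

    def rotate(self, new_name):
        """Swap the oldest database for a freshly built one.

        Returns the retired name so the caller can recheck its contents
        or delete it.
        """
        oldest = self.active.popleft()
        self.active.append(new_name)
        return oldest

    def search_set(self):
        """Databases a search should currently cover."""
        return list(self.active)
```

<p>For example, starting from three weekly databases, calling
<tt class="docutils literal">rotate</tt> with the name of a newly built
fourth one retires the first and leaves searches covering the latest
three.</p>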
</div>
<div class="section" id="size-limits-in-xapian">
<h1>Size Limits in Xapian</h1>
<p>The glass backend (which is currently the default and recommended
backend) stores the indexes in several files containing B-tree tables. If
you're indexing with positional information (for phrase searching) the
term positions table is usually the largest.</p>
<p>The current limits are:</p>
<ul>
<li><p class="first">Xapian uses unsigned 32-bit integers for document ids by default, which
means a limit of just over 4 billion documents in a database.  Xapian 1.4
can be built to use 64-bit document ids and term counts, and the glass
backend will then handle 64-bit document ids (and the databases are
compatible with a standard build provided you don't actually use docids &gt;=
2<sup>32</sup>).</p>
</li>
<li><p class="first">If you search many databases concurrently, you may hit the
per-process file-descriptor limit: each glass database uses between
1 and 6 fds depending on which tables are present. Some Unix-like OSes
allow this limit to be raised. Another way to avoid it (and to spread
the search load) is to use the remote backend to search databases on
a cluster of machines.</p>
</li>
<li><p class="first">If the OS has a filesize limit, that obviously applies to Xapian (a
2GB limit used to be common for older operating systems). The
xapian-core configure script will attempt to detect and automatically
enable support for &quot;LARGE FILES&quot; where possible.</p>
<p>So what is the limit for a modern OS? Taking Linux 2.6 as an example,
ext4 allows files up to 16TB and filesystems up to 1EB, while btrfs
allows files and filesystems up to 16EB (<a class="reference external" href="https://en.wikipedia.org/wiki/Comparison_of_file_systems">figures from
Wikipedia</a>).</p>
</li>
<li><p class="first">The B-trees use a 32-bit unsigned block count. The default blocksize
is 8K, which limits you to 32TB tables. You can increase the blocksize
if this is a problem, but it's best to do so before you create the
database, as otherwise you need to use xapian-compact to make a
compacted copy of the database with the new blocksize, and that will
take a while for such a large database. The maximum blocksize
currently allowed is 64K, which limits you to 256TB tables.</p>
</li>
<li><p class="first">Xapian stores the total length (i.e. number of terms) of all the
documents in a database so it can calculate the average document
length. This is currently handled as an unsigned 64-bit quantity so
it's not likely to be a limit you'll hit. It's listed here for
completeness.</p>
</li>
</ul>
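<p>The arithmetic behind some of these limits is easy to check. The Python
sketch below reproduces the table-size figures from the block count and
blocksize, and gives a rough worst-case estimate of how many glass databases
one process could open given its file-descriptor limit. The function names
and the allowance of 16 descriptors for other uses are assumptions made for
the example, and the <tt class="docutils literal">resource</tt> module is
only available on Unix-like systems.</p>

```python
import resource

KIB = 1024
TIB = 1024 ** 4


def max_table_bytes(blocksize, block_count_bits=32):
    """Largest table addressable with the given block size."""
    return blocksize * 2 ** block_count_bits


def max_open_databases(fds_per_db=6, reserved=16):
    """Rough worst-case count of glass databases one process can open.

    `reserved` is an assumed allowance for stdin/stdout, sockets and
    log files; 6 fds per database is the upper figure quoted above.
    Returns None if the soft limit is unlimited.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max((soft_limit - reserved) // fds_per_db, 0)


print(max_table_bytes(8 * KIB) // TIB)    # 32
print(max_table_bytes(64 * KIB) // TIB)   # 256
print(2 ** 32)                            # 4294967296
```

<p>The value returned by <tt class="docutils literal">max_open_databases</tt>
depends on the limits in force on the machine where it runs, so treat it as
a planning aid rather than a guarantee.</p>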
<p>If you have further questions about scalability, ask on the mailing lists;
people using Xapian to search large databases may be able to make
further suggestions.</p>
</div>
</div>
</body>
</html>