"Fossies" - the Fresh Open Source Software Archive

Member "memcached-1.6.15/doc/storage.txt" (21 Feb 2022, 5428 Bytes) of package /linux/www/memcached-1.6.15.tar.gz:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure, but
indicates the caller can try again another time, i.e. the engine might be busy
freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal retry
loop, so it can give spurious failures (which is good for testing integration).

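As an illustration of that contract, here is a minimal retry sketch. Only the
function name extstore_write() comes from the text above; the prototype, the
return convention, and the wrapper are assumptions for the example, not the
real declarations from extstore.h.

    /* Sketch: retry a soft extstore_write() failure a few times.
     * The prototype below is assumed so the example is self-contained;
     * consult extstore.h for the real one. */
    #include <stddef.h>
    #include <unistd.h>

    extern int extstore_write(void *engine, unsigned int bucket,
                              void *buf, size_t len);  /* assumed prototype */

    static int write_with_retry(void *engine, unsigned int bucket,
                                void *buf, size_t len) {
        for (int tries = 0; tries < 10; tries++) {
            if (extstore_write(engine, bucket, buf, len) == 0)
                return 0;  /* copied into an active page's write buffer */
            usleep(1000);  /* soft failure: engine busy freeing/assigning pages */
        }
        return -1;         /* still failing; keep the item in RAM for now */
    }
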
extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from its page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration. This allows
the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing in them to a minimum. Callbacks may be issued out of order, and it
is the caller's responsibility to know when its stack has been fully processed
so it may reclaim the memory.

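Since completions can arrive in any order, one way to track a stack is a
simple outstanding-IO counter, sketched below. The io_ctx structure, callback
name, and hookup are assumptions for illustration; the real IO object layout
lives in extstore.h, and a real integration would typically resume work from
its event loop rather than block on a condition variable.

    /* Sketch of completion tracking for an asynchronous read stack. */
    #include <pthread.h>

    struct io_ctx {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int pending;             /* IO objects still outstanding */
    };

    /* Invoked once per IO object from an IO thread, possibly out of order.
     * Kept minimal, as the text above recommends. */
    static void read_done_cb(void *udata) {
        struct io_ctx *ctx = udata;
        pthread_mutex_lock(&ctx->lock);
        if (--ctx->pending == 0)     /* last one: whole stack is finished */
            pthread_cond_signal(&ctx->cond);
        pthread_mutex_unlock(&ctx->lock);
    }

    /* The submitter waits until every callback has fired before touching
     * or freeing the submitted objects again. */
    static void wait_for_stack(struct io_ctx *ctx) {
        pthread_mutex_lock(&ctx->lock);
        while (ctx->pending > 0)
            pthread_cond_wait(&ctx->cond, &ctx->lock);
        pthread_mutex_unlock(&ctx->lock);
    }
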
With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.

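For reference, allocating such a buffer looks like the following.
posix_memalign() is standard POSIX; the 4096-byte alignment and 1MB buffer
size are example values, not values taken from extstore.

    /* Allocate a buffer aligned for O_DIRECT-style IO. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        void *buf = NULL;
        size_t align = 4096;         /* e.g. the device's logical block size */
        size_t len = 1024 * 1024;    /* one example 1MB buffer */

        int err = posix_memalign(&buf, align, len);
        if (err != 0) {
            fprintf(stderr, "posix_memalign failed: %d\n", err);
            return 1;
        }
        /* ... submit buf for DIRECT_IO reads/writes ... */
        free(buf);
        return 0;
    }
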
Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) An idea of "high TTL" and "low TTL" being two buckets: TTL < 86400 goes
into bucket 0, the rest into bucket 1. Co-locating low TTL items means those
pages can reach zero objects and free up more easily (see the sketch just
after this list).

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed sized objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page, and we can indicate to the caller somehow which pages are up for
probation. E.g. issue a read against page 3 version 1 for byte range 0->1MB,
then chunk it and look up objects, then read the next 1MB chunk, etc. If
there's anything we want to keep, pull it back into RAM before the page is
freed.

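A minimal sketch of the split described in example 1. The 86400-second
threshold comes from the text; the bucket numbers and helper name are made up
for illustration.

    /* Route writes to a "low TTL" or "high TTL" bucket (example 1 above). */
    #define LOW_TTL_BUCKET  0
    #define HIGH_TTL_BUCKET 1

    static unsigned int pick_bucket(long ttl_seconds) {
        /* Items expiring within a day are co-located so their pages can
         * empty out, and be reclaimed, quickly. */
        if (ttl_seconds < 86400)
            return LOW_TTL_BUCKET;
        return HIGH_TTL_BUCKET;
    }
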
Pages are assigned into buckets on demand, so if you create 30 buckets but
only ever use one, there will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread writes items to storage if all
memory has been allocated out to slab classes and less than a configured
amount of memory is free. Original objects are swapped with items marked with
the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key and
most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which describes enough information to retrieve the
original object from storage.

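The header roughly needs to carry the fields below. This is an illustrative
sketch only; the struct name, field names, and types are assumptions, and the
real item_hdr definition lives in memcached's source.

    /* Illustrative stand-in for the (item_hdr *) stored in ITEM_data():
     * just enough to locate the original object on flash. */
    typedef struct {
        unsigned int page_id;       /* which extstore page holds the object */
        unsigned int page_version;  /* detects pages recycled since the write */
        unsigned int offset;        /* byte offset of the object in the page */
    } example_item_hdr;
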
To get the best performance it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IO's are
submitted, and once IO's are done processing a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time is within ~60s of the
first time, it has a chance of getting recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of every
1000 instances of an item being hit twice within ~60s, the item will be
recached into memory. Very hot items will get pulled out of storage relatively
quickly.

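A sketch of that check, assuming a single shared counter; the variable and
function names are illustrative rather than memcached's.

    /* One in every recache_rate qualifying double-hits wins recaching. */
    static unsigned int recache_counter = 0;
    static unsigned int recache_rate = 1000;   /* 1-in-1000 */

    static int should_recache(void) {
        /* Caller has already established: second hit within ~60s of the first. */
        return (++recache_counter % recache_rate) == 0;
    }
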
Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates once half of the pages are used. The percentage of free pages is
multiplied against the target fragmentation limit, e.g. with a limit of 0.9
and 50% of pages free: 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages
with less than 28.8 megabytes used would be targeted for compaction. If 0
pages are free, anything less than 90% used is targeted, which means up to 10
pages may need to be rewritten to free one page.

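Working the numbers from that example (the values are the ones given above;
the standalone program and names are purely illustrative):

    /* Worked example of the slewed compaction threshold described above. */
    #include <stdio.h>

    int main(void) {
        double frag_limit = 0.9;        /* target fragmentation limit */
        double slew = 0.5;              /* the 50%-free case from the text */
        double page_size_mb = 64.0;     /* 64 megabyte pages */

        /* 0.9 * 0.5 * 64 = 28.8: pages using fewer bytes than this are
         * compaction candidates. */
        double threshold_mb = frag_limit * slew * page_size_mb;
        printf("compact pages using less than %.1fMB\n", threshold_mb);
        return 0;
    }
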
In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Objects that have stuck around long enough to get compacted
might well continue to stick around, so co-locating them could reduce
fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, possibly more throttling. It will still
evict pages if the compactor gets behind.