Member "memcached-1.6.15/doc/storage.txt" (21 Feb 2022, 5428 Bytes) of package /linux/www/memcached-1.6.15.tar.gz:

Storage system notes

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure, but
indicates the caller can try again another time, i.e. the engine might be busy
freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal retry
loop, so it can return spurious failures (good for testing integration).
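
Caller-side handling can be sketched as a simple retry loop. This is a
minimal, self-contained illustration with a stub standing in for the real
engine; the names and signatures here are hypothetical, not the actual
extstore API:

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for extstore_write(): fails the first two calls to
 * mimic a transient "busy" condition (no free write buffer yet). */
static int fail_countdown = 2;
static int stub_extstore_write(const void *buf, size_t len) {
    (void)buf; (void)len;
    if (fail_countdown > 0) {
        fail_countdown--;
        return -1;            /* transient failure: try again later */
    }
    return 0;                 /* copied into the active page's buffer */
}

/* A failure is not fatal, just "not right now": retry a bounded
 * number of times before giving up. */
static int write_with_retry(const void *buf, size_t len, int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        if (stub_extstore_write(buf, len) == 0)
            return 0;
        /* a real caller would requeue the item or back off briefly */
    }
    return -1;
}
```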

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from that page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration. This allows
the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.
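
One way to track "fully processed" under out-of-order callbacks is a
pending-count on the stack, decremented from each callback; the structure and
names below are a hypothetical sketch, not memcached's actual bookkeeping:

```c
#include <assert.h>

/* Hypothetical IO stack bookkeeping: each submitted stack carries a
 * count of IOs not yet called back. */
struct io_stack {
    int pending;   /* IOs still outstanding */
    int done;      /* set once it is safe to reclaim the memory */
};

/* Callback body kept minimal, as the doc advises: only bookkeeping.
 * Order of completion doesn't matter; only the count reaching zero. */
static void io_complete_cb(struct io_stack *s) {
    if (--s->pending == 0)
        s->done = 1;   /* caller may now free the submitted objects */
}
```

In real code the decrement would need to be atomic (or lock-protected),
since callbacks arrive on IO threads.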

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
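
For example, allocating a buffer suitable for direct IO; the 4096-byte
alignment below is a typical logical block size, chosen for illustration:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Direct IO requires the buffer address (and usually length/offset)
 * to be aligned to the device's block size. posix_memalign() returns
 * 0 on success and fills in an aligned pointer. */
static void *alloc_direct_buf(size_t align, size_t len) {
    void *buf = NULL;
    if (posix_memalign(&buf, align, len) != 0)
        return NULL;
    return buf;
}
```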

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) an idea of "high TTL" and "low TTL" being two buckets. TTL < 86400
goes into bucket 0, the rest into bucket 1. Co-locating low-TTL items means
those pages can reach zero objects and free up more easily.

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted as items should fit nearly exactly into write buffers and
pages. It also means items can be blindly read back if the system wants to
free a page, with the engine indicating to the caller which pages are up for
probation: e.g. issue a read against page 3 version 1 for byte range 0->1MB,
then chunk and look up objects, then read the next 1MB chunk, etc. If there's
anything we want to keep, pull it back into RAM before the page is freed.
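
The bucket choice in scheme 2 can be sketched as a tiny selector; the
constants and function name are hypothetical, not memcached's actual code:

```c
#include <assert.h>

/* Scheme 2 from above: low-TTL items (< one day) share bucket 0 so
 * their pages drain to zero together; higher-TTL items get one bucket
 * per slab class so same-sized items pack tightly on chunk barriers. */
static int pick_bucket(unsigned int ttl, unsigned int slab_class) {
    if (ttl < 86400)
        return 0;                   /* shared low-TTL bucket */
    return 1 + (int)slab_class;     /* one bucket per slab class */
}
```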

Pages are assigned into buckets on demand, so if you create 30 buckets but
use 1, there will only be a single active page with write buffers.

Memcached integration

With the POC: items.c's lru_maintainer_thread calls writes to storage if all
memory has been allocated out to slab classes, and there is less than a
configured amount of memory free. Original objects are swapped with items
marked with the ITEM_HDR flag. An ITEM_HDR item contains copies of the
original key and most of the header data. The ITEM_data() section of an
ITEM_HDR object contains an (item_hdr *), which describes enough information
to retrieve the original object from storage.

To get the best performance it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once IOs are done processing, a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

If a header is hit twice overall, and the second time is within ~60s of the
first, it has a chance of being recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of
every 1000 instances of an item being hit twice within ~60s, the item will be
recached into memory. Very hot items will get pulled out of storage
relatively quickly.
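
The check described above can be sketched as follows (function and parameter
names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* "counter % rate == 0": with recache_rate = 1000, roughly one in a
 * thousand qualifying hits (a second hit within ~60s of the first)
 * pulls the item back into memory. */
static int should_recache(uint64_t counter, unsigned int rate,
                          int hit_twice_within_window) {
    if (!hit_twice_within_window || rate == 0)
        return 0;                  /* not a qualifying hit */
    return counter % rate == 0;    /* sampled recache decision */
}
```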

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates once half of the pages are in use. The fraction of used pages is
multiplied against the target fragmentation limit, ie: with a limit of 0.9
and 50% of pages free -> 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages
with less than 28.8 megabytes used would be targeted for compaction. If 0
pages are free, anything less than 90% used is targeted, which means it has
to rewrite 10 pages to free one page.
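
The slewing can be sketched as below. Note this is an assumption reconstructed
from the worked examples (at 50% free the threshold is 0.45, at 0 free the
full 0.9 applies), i.e. the limit scales with the used fraction; names are
hypothetical:

```c
#include <assert.h>

/* Slewed compaction threshold: the target fragmentation limit scaled
 * by the fraction of pages in use. Returns the "bytes used" level
 * below which a page becomes a compaction candidate. */
static double compact_threshold(double limit, double free_frac,
                                double page_size) {
    return limit * (1.0 - free_frac) * page_size;
}
```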

In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Objects that stay around long enough to get compacted might
continue to stick around, so co-locating them could reduce fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.
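
Iterating a read-back chunk can be sketched as below. The record layout here
(a 4-byte length followed by the payload) is purely illustrative, not
extstore's actual on-flash format:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Walk a write-buffer-sized chunk read back from a page, counting the
 * records it contains. Assumed layout for the sketch: [4-byte length]
 * [payload bytes], repeated; a zero length marks the end of data. */
static int count_objects(const uint8_t *buf, size_t len) {
    size_t off = 0;
    int count = 0;
    while (off + 4 <= len) {
        uint32_t olen;
        memcpy(&olen, buf + off, 4);
        if (olen == 0 || off + 4 + olen > len)
            break;              /* end of valid data in this chunk */
        off += 4 + olen;        /* step to the next record */
        count++;
    }
    return count;
}
```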

This needs a fair amount of tuning, and possibly more throttling. It will
still evict pages if the compactor gets behind.