## "Fossies" - the Fresh Open Source Software Archive

### Member "ncc-2.8/doc/dbstree.tex" (12 Apr 2002, 13473 Bytes) of package /linux/privat/old/ncc-2.8.tar.gz:

As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) TeX and LaTeX source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 \documentclass[twocolumn]{article}
2
3 \title{The Dynamic Binary search tree}
4 \author{}
5
6 \begin{document}
7
8 \maketitle
9
10 \section{Introduction}
11
12 Binary trees are an important data organization in which searching for
13 an element is performed in $O(\log{n})$ steps. Dynamic trees were new
14 elements can be added and removed may become unbalanced and thus more
15 sophisticated binary tree implementations have been suggested to keep
16 the tree in $O(\log{n})$ limits.
17
18 Our goal is controlling the height of a tree to small values. A balanced
19 tree fulfills this goal but a tree need not necessarily be balanced to have
20 its height controlled.
21
22 The algorithm for an efficient implementation of dynamic tree is described,
23 along with comparisons with existing dynamic tree implementations.
24
25 \section{Description of the tree}
26
27 The basic idea is using a simple binary search tree and setting a limit on
28 it's height. Let us use for the rest of this study the number 32 as height
29 limit. This restricts the total number of nodes to no more than $2^{32}$ as
30 long as duplicate elements are not allowed.
31 We should notice that due to the logarithmic nature of complexity,
32 in real-life applications a balanced tree will never reach such values for
33 it's height.
34
35 Supposing we can efficiently fulfill this condition, locating an element in
36 the tree will be performed in no more than 32 steps. Typically
37 $O(32)<O(\log{n})$. Actually it may result in a few more steps.
38 In practice it is an unimportant
39 difference and therefore it is as fast as any balanced tree.
40
41 Removing an element from the tree can be easily done without the necessity
42 of a balancing operation as long as by removing an element the
43 height of the tree is not increased. This condition can be implemented.
44
45 If the node to be removed has two child nodes then the procedure is to find
46 the biggest element from the smaller ones or the smallest from the bigger
47 ones and replace the node to be removed with this element. A sample code is
48 illustrated in \ref{app:rmv}
49
50 Adding a new element to the tree is the most important operation. While
51 inserting the element the number of comparisons performed is counted. The
52 simple rule, when this number reaches 32 the tree must be rebalanced, is
53 very efficient.
54
55 Balancing the tree is accomplished in $n$ steps. However, the frequency at
56 which the tree needs to be balanced again is a function of $\frac{1}{2^n}$.
57 As a result adding a new element is in the great majority of the cases done
58 in less than 32 steps; but rebalancing is more rare as the number of nodes
59 increases
60 \footnote{A wosre-case behavior were a lot of elements have to be inserted
61 in an empty tree and they arrive sorted is solved in \ref{init}}.
62
63 \subsection{Balancing the tree}
64
65 The binary tree can be entirely reconstructed in $n$ steps using a simple
66 algorithm. First, an array of $n$ pointers is allocated an all the elements
67 of the tree are placed in this array in sorted order. Let us remember than an
68 always-balanced tree implementation requires extra memory for the node
69 structures and therefore the extra allocation of the array is yet efficient;
70 furthermore, in modern multithreaded database systems an operation which
71 alters part of the data is given full control of the hardware and the memory
72 usage burst will in this case be safely performed.
73
74 The next step is to reset the root of the tree and start inserting elements
75 in the order at which they will result in a perfectly balanced tree. This is
76 done by first inserting the middle element $N/2$ of the array, then
77 elements $N/4$ and $3 N / 4$, then $N/8$, $3 N / 8$, $5 N / 8 \ldots$
78 and so on.
79
80 By this procedure $2^k - 1$ where $k$ the integer for which
81 $2^k \leq N < 2^{k + 1}$ nodes can be inserted. For the remaining $N-(2^k -1)$
82 nodes, if any, various algorithms of $O(n)$ steps can be applied. One way
83 that is used in the sample code is to proceed normally as if there were
84 $2^{k+1}$ nodes and exclude the elements $i$ for which the modulo of the
85 division $i * 2^k / N$ is greater or equal to $N-(2^k -1)$ or zero; in other
86 words excluding those numbers $a$ for which there is a number $b$ so that
87 \begin{equation}
88 \left\lfloor \frac{a N}{2^{k+1}} \right\rfloor =
89 \left\lfloor \frac{b N}{2^k} \right\rfloor
90 \: , \;\; 0 < b < 2^k
91 \end{equation}
92
93 The complexity of the above algorithm will be linear if when inserting nodes
94 in the new tree redundant comparisons are avoided.
95
96 A sample implementation is provided at \ref{code:balance}.
97
98 \section{Characteristics of the tree}
99
100 The main feature of the this tree is that after is has been balanced, the
101 more nodes it contains, the more new elements be distributed in external
102 nodes and thus it will be much more unlikely to become unbalanced''.
103
104 \begin{itemize}
105 \item
106    Memory requirements for this tree are minimal. The standard memory needed
107    is as much as any simple binary tree and the same as a double linked
108    list. Memory usage bursts occur when
109    balancing is performed but this is generally a rare and very fast action.
110    Memory due to recursion is not trivial since tree height is limited
111    to 32.
112 \item
113    Locating elements will happen in less than 32 steps. Generally after
114    balancing has been performed locating elements is a matter of $\log{n}$.
115 \item
116    Removing elements is also a matter of less than 32 steps.
117    However, there are no rotations performed and therefore removing an
118    element is possibly performed faster than in an always-balanced tree.
119 \item
120    Adding elements in most of the cases is a matter of less than 32 very
121    fast steps.
122    Balancing is rare and becomes even more rare as the tree grows and it
123    requires $n$ simple steps.
124
125    In long term performance, for a growing tree the number of nodes $n$
126    divided with the number of steps spent for balancing, tends to be
127    independent of $n$.
128 \end{itemize}
129
130 Tests have shown that the tree will need to balance 2 times for
131 1,000,000 elements arriving in random order, once after 20,000 elements
132 and once again at about 400,000.
133
134 \section{Database initialization}
135 \label{init}
136
137 One case were the suggested tree would prove inefficient is inserting a lot
138 of elements in sorted order in an \emph{empty} tree.
139 If the tree is not empty and has been balanced before then the arriving
140 elements will be distributed on the external nodes.
141
142 This is a case which would occur when initially loading the data from
143 storage devices. The workaround to this problem is similar to the way the
144 tree is balanced. The -sorted- elements will have to be inserted in the tree
145 in the order that they will make a balanced tree, first the middle element,
146 $N/4$, $3*N/4$ and so on.
147
148 Once a perfectly balance tree is constructed there should be no worry for
149 this worse-case occurring unless most of the elements are removed.
150
151 Generally, in cases were a long sequence of elements is to be inserted
152 in the tree, if the tree has few nodes it may be preferable to sort the
153 sequence of elements and then insert the right elements first.
154 This procedure is worth the tradeoff if the arriving elements are almost
155 sorted, and in this case special sorting algorithms can be applied.
156
157 Another solution is to make the elements really random, by using a random
158 tree. In a \emph{random tree}, the comparison of elements is heads or tails.
159 After the sequence of elements has been inserted in a random tree (which
160 will never be unbalanced), they can be read in-order and sent to the dbs
161 tree.
162
163 \section{Conclusions}
164
165 The tree structure with the characteristics that have been presented is
166 efficient for specific database requirements.
167
168 When the database is idle or the specific tree is not used, is consumes the
169 least memory compared to other balanced tree structures.
170 That is the primary reason for the design of this structure.
171
172 When the tree is just accessed for locating an element this is as fast as
173 any balanced tree.
174
175 Theoretically, balanced tree implementations which perform rotations after
176 a node is inserted or removed from the tree are better. In practice, and
177 more specificly in a database environment limiting the height of the tree to
178 a small' number is more than adequate.
179 Except from memory efficiency, the suggested tree structure may possibly
180 behave faster as well; the worse case scenario will either occur at the
181 initialization of the database or else is can be easilly prevented.
182
183 \onecolumn
184 \section{Sample code}
185
186 A sample implementation in C of the presented tree is provided.
187 For the sample code the simple definition of a node with a pointer to
188 data is used. The function {\tt compare ()} is the application-specific
189 function which compares the data of two nodes.
190
191 \begin{verbatim}
192 #define MAX_HEIGHT 32
193 typedef struct tnode node;
194 struct tnode {
195    char *data;
196    node *less, *more;
197 };
198 node *root;
199 \end{verbatim}
200
201 \subsection{Removing a node}
202 \label{app:rmv}
203
204 When removing an element which has two children,
205 both the smallest from the bigger elements and the
206 biggest from the smaller elements are located. By convention the deepest one
207 is selected to replace the node to be removed.
208
209 \begin{verbatim}
210 void remove (node *n)
211 {
212     node *np, *nl, *nr, *nlp, *nrp;
213     int isroot;
214
215     if (!(isroot = (np = root) == n))
216         for (;;)
217             if (compare (np->data, n->data) > 0)
218                 if (np->less == n) break;
219                 else np = np->less;
220             else
221                 if (np->more == n) break;
222                 else np = np->more;
223
224     if (!(n->less && n->more))
225     {
226         if (isroot) root = (n->less) ? n->less : n->more;
227         else
228             if (np->less == n) np->less = (n->less) ? n->less : n->more;
229             else np->more = (n->less) ? n->less : n->more;
230         return;
231     }
232     for (i = 0, nlp = NULL, nl = n->less; nl->more; i++)
233         nlp = nl, nl = nl->more;
234     for (j = 0, nrp = NULL, nr = n->more; nr->less; j++)
235         nrp = nr, nr = nr->less;
236     if (i >= j)
237     {
238         if (isroot) root = nl;
239         else
240             if (np->less == n) np->less = nl;
241             else               np->more = nl;
242         if (nlp)
243         {
244              nlp->more = nl->less
245              nl->less = n->less;
246         }
247         nl->more = n->more;
248     } else {
249         if (isroot) root = nr;
250         else
251             if (np->less == n) np->less = nr;
252             else               np->more = nr;
253         if (nrp)
254         {
255             nrp->less = nr->more
256             nr->more = n->more;
257         }
258         nr->less = n->less;
259     }
260 }
261 \end{verbatim}
262
263 \subsection{Balancing}
264 \label{code:balance}
265
266 The algorithm for reconstructing a balanced tree.
267 {\tt alloca} is a stack allocation mechanism.
268
269 Special care should be taken that the variables {\tt i, j, D, k} in
270 {\tt balance()} are 64 bit integers. If not then the algorithm will fail for
271 trees with more than about 65000 nodes.
272
273 \begin{verbatim}
274 unsigned int suint;
275 node **snpp;
276
277 void count_nodes (node *n)
278 {
279     ++suint;
280     if (n->less) count_nodes (n->less);
281     if (n->more) count_nodes (n->more);
282 }
283 void tree_to_array (node *n)
284 {
285     if (!n) return;
286     tree_to_array (n->less);
287     *snpp++ = n;
288     tree_to_array (n->more);
289 }
290 void balance ()
291 {
292     node **npp;
293     long int i, j, D, k, k2, k3;
294
295     /** Generate a sorted array **/
296     suint = 0;
297     count_nodes (root);
298     npp = snpp = (node**) alloca (sizeof (node*) * suint);
299     tree_to_array (root);
300     /** Insert the top 2^k -1 nodes **/
301     root = npp [suint / 2];
302     for (i = 4; i <= suint + 1; i *= 2)
303         for (j = 2; j < i; j += 4)
304         {
305             k2 = suint * j / i;
306             npp [k2]->less = npp [suint * (j - 1) / i];
307             npp [k2]->more = npp [suint * (j + 1) / i];
308         }
309     /** Test whether there are nodes left **/
310     if ((k = suint + 1 - i / 2)) == 0)
311     {
312         for (i /= 2, j = 1; j < i; j += 2)
313             k2 = suint * j / i,
314             npp [k2]->less = npp [k2]->more = NULL;
315         return;
316     }
317     /** Proceed normally but for specific nodes **/
318     for (j = 2; j < i; j += 4)
319     {
320         k3 = suint * j / i;
321         D = (k2 = suint * (j - 1) / i) * (i / 2) % suint;
322         if (D >= k || D == 0)    /* k2 Excluded */
323             npp [k3]->less = NULL;
324         else {                   /* k2 Inserted */
325             npp [k3]->less = npp [k2];
326             npp [k2]->less = npp [k2]->more = NULL;
327         }
328         D = (k2 = suint * (j + 1) / i) * (i / 2) % suint;
329         if (D >= k || D == 0)
330             npp [k3]->more = NULL;
331         else {
332             npp [k3]->more = npp [k2];
333             npp [k2]->less = npp [k2]->more = NULL;
334         }
335     }
336     insert (npp [0]);
337 }
338 \end{verbatim}
339
340 \subsection{Inserting a node}
341
342 The procedure is similar to that of a simple binary search tree.
343
344 \begin{verbatim}
345 int insert (node *n)
346 {
347     node *np;
348     int h;
349
350     n->less = n->more = NULL;
351     if (!(np = root))
352     {
353         root = n;
354         return 0;
355     }
356     for (h = 0;; h++)
357         if (compare (np->data, n->data) > 0)
358             if (!np->less) { np->less = n; break; }
359             else np = np->less;
360         else
361             if (!np->more) { np->more = n; break; }
362             else np = np->more;
363      if (h == MAX_HEIGHT)
364      {
365          balance ();
366          return 0;
367      } else if (h > MAX_HEIGHT) {
368          // If that ever happens please send me a mail and I'll send
369          // you 10.000\$ for having a tree with 4bill nodes
370          // or make MAX_HEIGHT 34 and go to 16bills
371          printf ("Tree full. Reached 2^%i nodes\n", MAX_HEIGHT);
372          return -1;
373      }
374      return h;
375 }
376 \end{verbatim}
377
378 \end{document}
`