"Fossies" - the Fresh Open Source Software Archive

Member "ncc-2.8/doc/dbstree.tex" (12 Apr 2002, 13473 Bytes) of package /linux/privat/old/ncc-2.8.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) TeX and LaTeX source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

\documentclass[twocolumn]{article}

\title{The Dynamic Binary Search Tree}
\author{}

\begin{document}

\maketitle

\section{Introduction}

Binary search trees are an important data organization in which searching for
an element is performed in $O(\log{n})$ steps. Dynamic trees, where new
elements can be added and removed, may become unbalanced, and thus more
sophisticated binary tree implementations have been suggested to keep
the tree within $O(\log{n})$ limits.

Our goal is to keep the height of the tree small. A balanced tree fulfills
this goal, but a tree need not necessarily be balanced to have its height
controlled.

The algorithm for an efficient implementation of a dynamic tree is described,
along with comparisons with existing dynamic tree implementations.

\section{Description of the tree}

The basic idea is to use a simple binary search tree and to set a limit on
its height. Let us use the number 32 as the height limit for the rest of this
study. This restricts the total number of nodes to no more than $2^{32}$, as
long as duplicate elements are not allowed.
We should notice that, due to the logarithmic nature of the complexity,
in real-life applications a balanced tree will never reach such values for
its height.
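For instance, a perfectly balanced tree holding $n = 10^{6}$ elements has
height $\lceil \log_2 (n + 1) \rceil = 20$, well below the limit of 32.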

Supposing we can efficiently fulfill this condition, locating an element in
the tree will be performed in no more than 32 steps. This may be a few steps
more than the $\log{n}$ comparisons of a perfectly balanced tree, but in
practice the difference is unimportant, and therefore searching is as fast
as in any balanced tree.
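
As an illustration, a minimal lookup routine could look as follows. It is
only a sketch and not part of the sample code section; the name {\tt find()}
is hypothetical, and it assumes the node layout, the global {\tt root} and
the {\tt compare()} function defined there.

\begin{verbatim}
/* Locate the node holding `data', or
   return NULL. Never needs more than
   MAX_HEIGHT comparisons.            */
node *find (char *data)
{
    node *np = root;
    int c;

    while (np)
    {
        c = compare (np->data, data);
        if (c == 0) return np;
        np = (c > 0) ? np->less : np->more;
    }
    return NULL;
}
\end{verbatim}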

Removing an element from the tree can easily be done without the necessity
of a balancing operation, as long as removing the element does not increase
the height of the tree. This condition can be implemented.

If the node to be removed has two child nodes, the procedure is to find
either the biggest of the smaller elements or the smallest of the bigger
elements and replace the node to be removed with it. Sample code is
given in \ref{app:rmv}.

Adding a new element to the tree is the most important operation. While
inserting the element, the number of comparisons performed is counted. The
simple rule that the tree must be rebalanced when this number reaches 32 is
very efficient.

Balancing the tree is accomplished in $n$ steps. However, the frequency at
which the tree needs to be balanced again drops off as a function of
$\frac{1}{2^n}$.
As a result, adding a new element is in the great majority of cases done
in fewer than 32 steps, and rebalancing becomes rarer as the number of nodes
increases%
\footnote{The worst-case behavior, where a lot of elements arriving in
sorted order have to be inserted into an empty tree, is dealt with in
\ref{init}}.
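
A rough amortized estimate can be made under the simplifying assumption that
consecutive rebalances happen at tree sizes which at least double, an
assumption consistent with the test figures reported later. If rebalances
occur at sizes $n_1 < n_2 < \ldots < n_m \leq n$ with $n_{i+1} \geq 2 n_i$,
the total work spent on balancing is
\begin{equation}
\sum_{i=1}^{m} n_i \; \leq \; n_m \sum_{j=0}^{\infty} \frac{1}{2^j}
\; = \; 2 n_m \; \leq \; 2 n \: ,
\end{equation}
that is, a constant number of balancing steps per inserted element.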

\subsection{Balancing the tree}

The binary tree can be entirely reconstructed in $n$ steps using a simple
algorithm. First, an array of $n$ pointers is allocated and all the elements
of the tree are placed in this array in sorted order. Let us remember that an
always-balanced tree implementation requires extra memory for the node
structures, and therefore the temporary allocation of the array is still
efficient; furthermore, in modern multithreaded database systems an operation
which alters part of the data is given full control of the hardware, and the
memory usage burst can in this case be performed safely.

The next step is to reset the root of the tree and start inserting elements
in the order in which they will result in a perfectly balanced tree. This is
done by first inserting the middle element $N/2$ of the array, then
elements $N/4$ and $3 N / 4$, then $N/8$, $3 N / 8$, $5 N / 8 \ldots$
and so on.
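
To make this order concrete, the small fragment below prints the zero-based
array indices in the order just described, for the case $N = 2^k - 1$.
It is only an illustration and not part of the sample code; the helper
{\tt print\_balanced\_order()} is hypothetical and {\tt printf()} requires
{\tt <stdio.h>}.

\begin{verbatim}
/* For N = 2^k - 1, print the array
   indices in balanced-tree order:
   N/2, then N/4 and 3N/4, then
   N/8, 3N/8, 5N/8, 7N/8, ...      */
void print_balanced_order (long N)
{
    long i, j;

    for (i = 2; i <= N + 1; i *= 2)
        for (j = 1; j < i; j += 2)
            printf ("%ld ", N * j / i);
    printf ("\n");
}
\end{verbatim}

For $N = 7$ the output is {\tt 3 1 5 0 2 4 6}: the middle element first,
then the middles of the two halves, then the four remaining leaves.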

By this procedure $2^k - 1$ nodes can be inserted, where $k$ is the integer
for which $2^k \leq N < 2^{k + 1}$. For the remaining $N-(2^k -1)$ nodes,
if any, various algorithms of $O(n)$ steps can be applied. One way, used in
the sample code, is to proceed normally as if there were $2^{k+1}$ nodes and
to exclude the elements $i$ for which the remainder of the division
$i \cdot 2^k / N$ is greater than or equal to $N-(2^k -1)$, or zero; in
other words, to exclude those numbers $a$ for which there is a number $b$
such that
\begin{equation}
\left\lfloor \frac{a N}{2^{k+1}} \right\rfloor =
\left\lfloor \frac{b N}{2^k} \right\rfloor
\: , \;\; 0 < b < 2^k
\end{equation}

The complexity of the above algorithm will be linear if, when inserting nodes
into the new tree, redundant comparisons are avoided.

A sample implementation is provided in \ref{code:balance}.

\section{Characteristics of the tree}

The main feature of this tree is that after it has been balanced, the
more nodes it contains, the more new elements are distributed among external
nodes, and thus it becomes much more unlikely to become ``unbalanced''.

\begin{itemize}
\item
   Memory requirements for this tree are minimal. The standard memory needed
   is as much as for any simple binary tree and the same as for a doubly
   linked list. Memory usage bursts occur when
   balancing is performed, but this is generally a rare and very fast action.
   Memory used by recursion is not an issue since the tree height is limited
   to 32.
\item
   Locating an element takes fewer than 32 steps. Generally, after
   balancing has been performed, locating an element is a matter of $\log{n}$.
\item
   Removing an element is also a matter of fewer than 32 steps.
   However, no rotations are performed, and therefore removing an
   element is possibly faster than in an always-balanced tree.
\item
   Adding an element is in most cases a matter of fewer than 32 very
   fast steps.
   Balancing requires $n$ simple steps, is rare, and becomes even rarer as
   the tree grows.

   In the long term, for a growing tree, the number of nodes $n$
   divided by the number of steps spent on balancing tends to be
   independent of $n$.
\end{itemize}

Tests have shown that the tree needed to be balanced twice for
1,000,000 elements arriving in random order: once after about 20,000 elements
and once again at about 400,000.

\section{Database initialization}
\label{init}

One case where the suggested tree would prove inefficient is inserting a lot
of elements in sorted order into an \emph{empty} tree.
If the tree is not empty and has been balanced before, then the arriving
elements will be distributed among the external nodes.

This is a case which could occur when initially loading the data from
storage devices. The workaround to this problem is similar to the way the
tree is balanced. The sorted elements have to be inserted into the tree
in the order in which they make a balanced tree: first the middle element,
then elements $N/4$ and $3N/4$, and so on.
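
The recursive fragment below sketches this bulk-loading step. It is only an
illustration and not part of the sample code: {\tt insert\_sorted()} is a
hypothetical helper which assumes the sorted elements are available in an
array and uses the {\tt insert()} routine of the sample code section.

\begin{verbatim}
/* Insert the nodes of a sorted array
   in balanced order: the middle of
   each range before its two halves. */
void insert_sorted (node **arr,
                    long lo, long hi)
{
    long mid;

    if (lo > hi) return;
    mid = lo + (hi - lo) / 2;
    insert (arr [mid]);
    insert_sorted (arr, lo, mid - 1);
    insert_sorted (arr, mid + 1, hi);
}
/* Usage: insert_sorted (arr, 0, N - 1); */
\end{verbatim}

The recursion proceeds subtree by subtree rather than level by level, but
every element is still inserted after all of its ancestors, so the resulting
tree is the same perfectly balanced tree.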

Once a perfectly balanced tree has been constructed, there is no need to
worry about this worst case occurring, unless most of the elements are
removed.

Generally, in cases where a long sequence of elements is to be inserted
into the tree and the tree has few nodes, it may be preferable to sort the
sequence of elements and then insert them in the right order.
This is worth the tradeoff if the arriving elements are almost sorted,
in which case special sorting algorithms can be applied.

Another solution is to make the order of the elements truly random by using
a random tree. In a \emph{random tree}, the comparison of elements is
replaced by a coin toss.
After the sequence of elements has been inserted into a random tree (which
will never be unbalanced), they can be read in order and sent to the dbs
tree.
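
A minimal sketch of such a random insertion follows; it is not part of the
original code. It reuses the node structure of the sample code, keeps the
auxiliary tree under its own root pointer, and decides the descent direction
with {\tt rand()} from {\tt <stdlib.h>} purely for illustration.

\begin{verbatim}
/* Insert into a random tree: the
   direction taken at each node is a
   coin toss, so the shape does not
   depend on the arrival order.      */
void random_insert (node **rt, node *n)
{
    node *np;

    n->less = n->more = NULL;
    if (!(np = *rt)) { *rt = n; return; }
    for (;;)
        if (rand () & 1)
            if (!np->less) { np->less = n; break; }
            else np = np->less;
        else
            if (!np->more) { np->more = n; break; }
            else np = np->more;
}
\end{verbatim}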

\section{Conclusions}

The tree structure with the characteristics that have been presented is
efficient for specific database requirements.

When the database is idle or the specific tree is not used, it consumes the
least memory compared to other balanced tree structures.
That is the primary reason for the design of this structure.

When the tree is only accessed for locating an element, this is as fast as
in any balanced tree.

Theoretically, balanced tree implementations which perform rotations after
a node is inserted or removed from the tree are better. In practice, and
more specifically in a database environment, limiting the height of the tree
to a `small' number is more than adequate.
Apart from memory efficiency, the suggested tree structure may possibly be
faster as well; the worst-case scenario will either occur at the
initialization of the database or else it can easily be prevented.

\onecolumn
\section{Sample code}

A sample implementation in C of the presented tree is provided.
For the sample code the simple definition of a node with a pointer to
data is used. The function {\tt compare ()} is the application-specific
function which compares the data of two nodes.

\begin{verbatim}
#define MAX_HEIGHT 32
typedef struct tnode node;
struct tnode {
   char *data;
   node *less, *more;
};
node *root;
\end{verbatim}

\subsection{Removing a node}
\label{app:rmv}

When removing an element which has two children,
both the smallest of the bigger elements and the
biggest of the smaller elements are located. By convention, the deeper of
the two is selected to replace the node to be removed.

\begin{verbatim}
void remove (node *n)
{
    node *np, *nl, *nr, *nlp, *nrp;
    int isroot, i, j;

    /* Locate the parent np of n (unless n is the root) */
    if (!(isroot = (np = root) == n))
        for (;;)
            if (compare (np->data, n->data) > 0)
                if (np->less == n) break;
                else np = np->less;
            else
                if (np->more == n) break;
                else np = np->more;

    /* At most one child: link that child to the parent */
    if (!(n->less && n->more))
    {
        if (isroot) root = (n->less) ? n->less : n->more;
        else
            if (np->less == n) np->less = (n->less) ? n->less : n->more;
            else np->more = (n->less) ? n->less : n->more;
        return;
    }
    /* Two children: nl is the biggest of the smaller elements,
       nr the smallest of the bigger ones; nlp/nrp are their
       parents and i/j their depths below n                    */
    for (i = 0, nlp = NULL, nl = n->less; nl->more; i++)
        nlp = nl, nl = nl->more;
    for (j = 0, nrp = NULL, nr = n->more; nr->less; j++)
        nrp = nr, nr = nr->less;
    if (i >= j)
    {
        /* Replace n with nl */
        if (isroot) root = nl;
        else
            if (np->less == n) np->less = nl;
            else               np->more = nl;
        if (nlp)
        {
             nlp->more = nl->less;
             nl->less = n->less;
        }
        nl->more = n->more;
    } else {
        /* Replace n with nr */
        if (isroot) root = nr;
        else
            if (np->less == n) np->less = nr;
            else               np->more = nr;
        if (nrp)
        {
            nrp->less = nr->more;
            nr->more = n->more;
        }
        nr->less = n->less;
    }
}
\end{verbatim}

\subsection{Balancing}
\label{code:balance}

The algorithm for reconstructing a balanced tree follows.
{\tt alloca} is a stack allocation mechanism.

Special care should be taken that the variables {\tt i, j, D, k} in
{\tt balance()} are 64-bit integers. If they are not, the algorithm will
fail for trees with more than about 65,000 nodes.

\begin{verbatim}
unsigned int suint;
node **snpp;

void count_nodes (node *n)
{
    ++suint;
    if (n->less) count_nodes (n->less);
    if (n->more) count_nodes (n->more);
}
void tree_to_array (node *n)
{
    if (!n) return;
    tree_to_array (n->less);
    *snpp++ = n;
    tree_to_array (n->more);
}
void balance ()
{
    node **npp;
    long int i, j, D, k, k2, k3;   /* must be 64-bit integers */

    /** Generate a sorted array **/
    suint = 0;
    count_nodes (root);
    npp = snpp = (node**) alloca (sizeof (node*) * suint);
    tree_to_array (root);
    /** Insert the top 2^k - 1 nodes **/
    root = npp [suint / 2];
    for (i = 4; i <= suint + 1; i *= 2)
        for (j = 2; j < i; j += 4)
        {
            k2 = suint * j / i;
            npp [k2]->less = npp [suint * (j - 1) / i];
            npp [k2]->more = npp [suint * (j + 1) / i];
        }
    /** Test whether there are nodes left **/
    if ((k = suint + 1 - i / 2) == 0)
    {
        for (i /= 2, j = 1; j < i; j += 2)
            k2 = suint * j / i,
            npp [k2]->less = npp [k2]->more = NULL;
        return;
    }
    /** Proceed normally but for specific nodes **/
    for (j = 2; j < i; j += 4)
    {
        k3 = suint * j / i;
        D = (k2 = suint * (j - 1) / i) * (i / 2) % suint;
        if (D >= k || D == 0)    /* k2 Excluded */
            npp [k3]->less = NULL;
        else {                   /* k2 Inserted */
            npp [k3]->less = npp [k2];
            npp [k2]->less = npp [k2]->more = NULL;
        }
        D = (k2 = suint * (j + 1) / i) * (i / 2) % suint;
        if (D >= k || D == 0)
            npp [k3]->more = NULL;
        else {
            npp [k3]->more = npp [k2];
            npp [k2]->less = npp [k2]->more = NULL;
        }
    }
    insert (npp [0]);
}
\end{verbatim}

\subsection{Inserting a node}

The procedure is similar to that of a simple binary search tree.

\begin{verbatim}
int insert (node *n)
{
    node *np;
    int h;

    n->less = n->more = NULL;
    if (!(np = root))
    {
        root = n;
        return 0;
    }
    /* Descend, counting the comparisons in h */
    for (h = 0;; h++)
        if (compare (np->data, n->data) > 0)
            if (!np->less) { np->less = n; break; }
            else np = np->less;
        else
            if (!np->more) { np->more = n; break; }
            else np = np->more;
    if (h == MAX_HEIGHT)
    {
        balance ();
        return 0;
    } else if (h > MAX_HEIGHT) {
        /* If that ever happens please send me a mail and I'll send
           you 10.000$ for having a tree with 4 billion nodes,
           or make MAX_HEIGHT 34 and go to 16 billion.             */
        printf ("Tree full. Reached 2^%i nodes\n", MAX_HEIGHT);
        return -1;
    }
    return h;
}
\end{verbatim}

\end{document}