8.1 ScapegoatTree: A Binary Search Tree with Partial Rebuilding

A ScapegoatTree is a BinarySearchTree that, in addition to keeping track of the number, $ \mathtt{n}$, of nodes in the tree, also keeps a counter, $ \mathtt{q}$, that maintains an upper bound on the number of nodes.

    int q;
At all times, $ \mathtt{n}$ and $ \mathtt{q}$ obey the following inequalities:

$\displaystyle \mathtt{q}/2 \le \mathtt{n} \le \mathtt{q} \enspace .$

In addition, a ScapegoatTree has logarithmic height; at all times, the height of the ScapegoatTree does not exceed:

$\displaystyle \log_{3/2} \mathtt{q} \le \log_{3/2} 2\mathtt{n} < \log_{3/2} \mathtt{n} + 2 \enspace .$ (8.1)

(The first inequality holds because $ \mathtt{q}\le 2\mathtt{n}$, and the second because $ \log_{3/2} 2 < 2$.)

Even with this constraint, a ScapegoatTree can look surprisingly unbalanced. The tree in Figure 8.1 has $ \ensuremath{\mathtt{q}}=\ensuremath{\mathtt{n}}=10$ and height $ 5<\log_{3/2}10 \approx 5.679$.

Figure 8.1: A ScapegoatTree with 10 nodes and height 5.

Implementing the $ \mathtt{find(x)}$ operation in a ScapegoatTree is done using the standard algorithm for searching in a BinarySearchTree (see Section 6.2). This takes time proportional to the height of the tree, which, by (8.1), is $ O(\log \ensuremath{\mathtt{n}})$.

To implement the $ \mathtt{add(x)}$ operation, we first increment $ \mathtt{n}$ and $ \mathtt{q}$ and then use the usual algorithm for adding $ \mathtt{x}$ to a binary search tree; we search for $ \mathtt{x}$ and then add a new leaf $ \mathtt{u}$ with $ \ensuremath{\mathtt{u.x}}=\ensuremath{\mathtt{x}}$. At this point, we may get lucky and the depth of $ \mathtt{u}$ might not exceed $ \log_{3/2}\ensuremath{\mathtt{q}}$. If so, then we leave well enough alone and don't do anything else.

Unfortunately, it will sometimes happen that $ \mathtt{depth(u)} > \log_{3/2}\mathtt{q}$. In this case we need to do something to reduce the height. This isn't a big job; there is only one node, namely $ \mathtt{u}$, whose depth exceeds $ \log_{3/2}\mathtt{q}$. To fix $ \mathtt{u}$, we walk from $ \mathtt{u}$ back up to the root looking for a scapegoat, $ \mathtt{w}$. The scapegoat, $ \mathtt{w}$, is a very unbalanced node. It has the property that

$\displaystyle \frac{\ensuremath{\mathtt{size(w.child)}}}{\ensuremath{\mathtt{size(w)}}} > \frac{2}{3} \enspace ,$ (8.2)

where $ \mathtt{w.child}$ is the child of $ \mathtt{w}$ on the path from the root to $ \mathtt{u}$. We'll very shortly prove that a scapegoat exists. For now, we can take it for granted. Once we've found the scapegoat $ \mathtt{w}$, we completely destroy the subtree rooted at $ \mathtt{w}$ and rebuild it into a perfectly balanced binary search tree. We know, from (8.2), that, even before the addition of $ \mathtt{u}$, $ \mathtt{w}$'s subtree was not a complete binary tree. Therefore, when we rebuild $ \mathtt{w}$, the height decreases by at least 1 so that the height of the ScapegoatTree is once again at most $ \log_{3/2}\ensuremath{\mathtt{q}}$.

    boolean add(T x) {
        // first do basic insertion keeping track of depth
        Node<T> u = newNode(x);
        int d = addWithDepth(u);
        if (d > log32(q)) {
            // depth exceeded, find scapegoat
            Node<T> w = u.parent;
            while (3*size(w) <= 2*size(w.parent))
                w = w.parent;
            rebuild(w.parent);
        }
        return d >= 0;
    }
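
The test $ \mathtt{d > log32(q)}$ in this code compares the depth of the new node against $ \lfloor\log_{3/2}\mathtt{q}\rfloor$. The helper itself is not shown in this section; the following is a minimal sketch of one way it could be written (the floating-point formulation is an assumption made for illustration, not necessarily the implementation used elsewhere):

    // Illustrative sketch only: floor of the base-3/2 logarithm of q.
    // Ignoring floating-point rounding, an integer depth d satisfies
    // d > log32(q) exactly when d > log_{3/2}(q).
    static int log32(int q) {
        return (int) Math.floor(Math.log(q) / Math.log(1.5));
    }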

Figure 8.2: Inserting 3.5 into a ScapegoatTree increases its depth to 6, which violates (8.1) since $ 6 > \log_{3/2} 11 \approx 5.914$. A scapegoat is found at the node containing 5.
If we ignore the cost of finding the scapegoat $ \mathtt{w}$ and rebuilding the subtree rooted at $ \mathtt{w}$, then the running time of $ \mathtt{add(x)}$ is dominated by the initial search, which takes $ O(\log \ensuremath{\mathtt{q}}) = O(\log \ensuremath{\mathtt{n}})$ time. We will account for the cost of finding the scapegoat and rebuilding using amortized analysis in the next section.
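
The rebuilding step itself can be done in $ O(\ensuremath{\mathtt{size(w)}})$ time by flattening the subtree rooted at $ \mathtt{w}$ into a sorted array with an in-order traversal and then building a perfectly balanced subtree from that array. The following is a minimal sketch of this idea; the helper names $ \mathtt{packIntoArray}$ and $ \mathtt{buildBalanced}$, and the use of the fields $ \mathtt{r}$ (the root) and $ \mathtt{nil}$, are assumptions about the surrounding BinarySearchTree code rather than a definitive implementation:

    // Sketch: rebuild the subtree rooted at u into a perfectly balanced subtree.
    void rebuild(Node<T> u) {
        int ns = size(u);
        Node<T> p = u.parent;
        Node<T>[] a = (Node<T>[]) new Node[ns];
        packIntoArray(u, a, 0);               // in-order traversal: a is sorted
        if (p == nil) {                       // u was the root of the whole tree
            r = buildBalanced(a, 0, ns);
            r.parent = nil;
        } else if (p.right == u) {
            p.right = buildBalanced(a, 0, ns);
            p.right.parent = p;
        } else {
            p.left = buildBalanced(a, 0, ns);
            p.left.parent = p;
        }
    }

    // Store the nodes of the subtree rooted at u into a[i..] in sorted order;
    // returns the next free position in a.
    int packIntoArray(Node<T> u, Node<T>[] a, int i) {
        if (u == nil) return i;
        i = packIntoArray(u.left, a, i);
        a[i++] = u;
        return packIntoArray(u.right, a, i);
    }

    // Build a perfectly balanced subtree out of a[i],...,a[i+ns-1];
    // the middle element becomes the root of the subtree.
    Node<T> buildBalanced(Node<T>[] a, int i, int ns) {
        if (ns == 0) return nil;
        int m = ns / 2;
        a[i + m].left = buildBalanced(a, i, m);
        if (a[i + m].left != nil) a[i + m].left.parent = a[i + m];
        a[i + m].right = buildBalanced(a, i + m + 1, ns - m - 1);
        if (a[i + m].right != nil) a[i + m].right.parent = a[i + m];
        return a[i + m];
    }

Note that $ \mathtt{buildBalanced}$ places the middle array element at the root of each subtree, so the rebuilt subtree has the minimum possible height for its number of nodes.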

The implementation of $ \mathtt{remove(x)}$ in a ScapegoatTree is very simple. We search for $ \mathtt{x}$ and remove it using the usual algorithm for removing a node from a BinarySearchTree. (Note that this can never increase the height of the tree.) Next, we decrement $ \mathtt{n}$ but leave $ \mathtt{q}$ unchanged. Finally, we check if $ \ensuremath{\mathtt{q}} > 2\ensuremath{\mathtt{n}}$ and, if so, we rebuild the entire tree into a perfectly balanced binary search tree and set $ \ensuremath{\mathtt{q}}=\ensuremath{\mathtt{n}}$.

    boolean remove(T x) {
        if (super.remove(x)) {
            if (2*n < q) {
                rebuild(r);
                q = n;
            }
            return true;
        }
        return false;
    }
Again, if we ignore the cost of rebuilding, the running time of the $ \mathtt{remove(x)}$ operation is proportional to the height of the tree, and is therefore $ O(\log \ensuremath{\mathtt{n}})$.

8.1.1 Analysis of Correctness and Running-Time

In this section we analyze the correctness and amortized running time of operations on a ScapegoatTree. We first prove correctness by showing that, when the $ \mathtt{add(x)}$ operation results in a node that violates Condition (8.1), we can always find a scapegoat:

Lemma 8.1   Let $ \mathtt{u}$ be a node of depth $ h>\log_{3/2} \mathtt{q}$ in a ScapegoatTree. Then there exists a node $ \mathtt{w}$ on the path from $ \mathtt{u}$ to the root such that

$\displaystyle \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} > 2/3 \enspace .$

Proof. Suppose, for the sake of contradiction, that this is not the case, and

$\displaystyle \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} \le 2/3$

for all nodes $ \mathtt{w}$ on the path from $ \mathtt{u}$ to the root. Denote the path from the root to $ \mathtt{u}$ as $ \mathtt{r}=\mathtt{u}_0,\ldots,\mathtt{u}_h=\mathtt{u}$. Then, we have $ \mathtt{size(u_0)}=\mathtt{n}$, $ \mathtt{size(u_1)}\le\frac{2}{3}\mathtt{n}$, $ \mathtt{size(u_2)}\le\frac{4}{9}\mathtt{n}$ and, more generally,

$\displaystyle \mathtt{size(u_i)}\le\left(\frac{2}{3}\right)^i\mathtt{n} \enspace .$

But this gives a contradiction, since $ \mathtt{size(u)}\ge 1$, hence

$\displaystyle 1 \le \mathtt{size(u)} \le \left(\frac{2}{3}\right)^h\mathtt{n} < \left(\frac{2}{3}\right)^{\log_{3/2}\mathtt{q}}\mathtt{n} \le \left(\frac{2}{3}\right)^{\log_{3/2}\mathtt{n}}\mathtt{n} = \left(\frac{1}{\mathtt{n}}\right)\mathtt{n} = 1 \enspace .$

$ \qedsymbol$

Next, we analyze the parts of the running time that we have not yet accounted for. There are two parts: the cost of calls to $ \mathtt{size(u)}$ when searching for scapegoat nodes, and the cost of calls to $ \mathtt{rebuild(w)}$ when we find a scapegoat $ \mathtt{w}$. The cost of calls to $ \mathtt{size(u)}$ can be related to the cost of calls to $ \mathtt{rebuild(w)}$, as follows:
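
The analysis below charges each call to $ \mathtt{size(u)}$ a cost proportional to the size of the subtree it examines; this corresponds to computing subtree sizes by traversal rather than storing them at the nodes. A minimal sketch of such a recursive computation (shown only to make the cost model concrete; the actual code may differ):

    // Count the nodes in the subtree rooted at u by traversing it.
    // Each call costs time proportional to the size of that subtree.
    int size(Node<T> u) {
        if (u == nil) return 0;
        return 1 + size(u.left) + size(u.right);
    }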

Lemma 8.2   During a call to $ \mathtt{add(x)}$ in a ScapegoatTree, the cost of finding the scapegoat $ \mathtt{w}$ and rebuilding the subtree rooted at $ \mathtt{w}$ is $ O(\ensuremath{\mathtt{size(w)}})$.

Proof. The cost of rebuilding the scapegoat node $ \mathtt{w}$, once we find it, is $ O(\ensuremath{\mathtt{size(w)}})$. When searching for the scapegoat node, we call $ \mathtt{size(u)}$ on a sequence of nodes $ \ensuremath{\mathtt{u}}_0,\ldots,\ensuremath{\mathtt{u}}_k$ until we find the scapegoat $ \ensuremath{\mathtt{u}}_k=\ensuremath{\mathtt{w}}$. However, since $ \ensuremath{\mathtt{u}}_k$ is the first node in this sequence that is a scapegoat, we know that

$\displaystyle \mathtt{size(u_{i})} < \frac{2}{3}\mathtt{size(u_{i+1})}$

for all $ i\in\{0,\ldots,k-2\}$. Therefore, the cost of all calls to $ \mathtt{size(u)}$ is
$\displaystyle \begin{aligned}
O\left( \sum_{i=0}^k \mathtt{size(u_{k-i})} \right)
  &= O\left( \mathtt{size(u_k)} + \sum_{i=0}^{k-1} \mathtt{size(u_{k-i-1})} \right) \\
  &= O\left( \mathtt{size(u_k)} + \sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i \mathtt{size(u_k)} \right) \\
  &= O\left( \mathtt{size(u_k)}\left(1 + \sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i\right) \right) \\
  &= O(\mathtt{size(u_k)}) = O(\mathtt{size(w)}) \enspace ,
\end{aligned}$

where the last line follows from the fact that the sum is a geometrically decreasing series: $ \sum_{i=0}^{k-1}\left(\frac{2}{3}\right)^i < \sum_{i=0}^{\infty}\left(\frac{2}{3}\right)^i = 3$. $ \qedsymbol$

All that remains is to prove an upper bound on the cost of calls to $ \mathtt{rebuild(u)}$:

Lemma 8.3   Starting with an empty ScapegoatTree, any sequence of $ m$ $ \mathtt{add(x)}$ and $ \mathtt{remove(x)}$ operations causes at most $ O(m\log m)$ time to be used by $ \mathtt{rebuild(u)}$ operations.

Proof. To prove this, we will use a credit scheme. Each node stores a number of credits. Each credit can pay for some constant, $ c$, units of time spent rebuilding. The scheme gives out a total of $ O(m\log m)$ credits and every call to $ \mathtt{rebuild(u)}$ is paid for with credits stored at $ \mathtt{u}$.

During an insertion or deletion, we give one credit to each node on the path to the inserted node, or deleted node, $ \mathtt{u}$. In this way we hand out at most $ \log_{3/2}\ensuremath{\mathtt{q}}\le \log_{3/2}m$ credits per operation. During a deletion we also store one additional credit ``on the side.'' Thus, in total we give out at most $ O(m\log m)$ credits. All that remains is to show that these credits are sufficient to pay for all calls to $ \mathtt{rebuild(u)}$.

If we call $ \mathtt{rebuild(u)}$ during an insertion, it is because $ \mathtt{u}$ is a scapegoat. Suppose, without loss of generality, that

$\displaystyle \frac{\mathtt{size(u.left)}}{\mathtt{size(u)}} > \frac{2}{3} \enspace .$

Using the fact that

$\displaystyle \mathtt{size(u)} = 1 + \mathtt{size(u.left)} + \mathtt{size(u.right)}$

we deduce that

$\displaystyle \frac{1}{2}\mathtt{size(u.left)} > \mathtt{size(u.right)}$

and therefore

$\displaystyle \mathtt{size(u.left)} - \mathtt{size(u.right)} > \frac{1}{2}\mathtt{size(u.left)} > \frac{1}{3}\mathtt{size(u)} \enspace .$

Now, the last time a subtree containing $ \mathtt{u}$ was rebuilt (or when $ \mathtt{u}$ was inserted, if a subtree containing $ \mathtt{u}$ was never rebuilt), we had

$\displaystyle \mathtt{size(u.left)} - \mathtt{size(u.right)} \le 1 \enspace .$

Therefore, since each such operation changes $ \mathtt{size(u.left)} - \mathtt{size(u.right)}$ by at most one, the number of $ \mathtt{add(x)}$ or $ \mathtt{remove(x)}$ operations that have affected $ \mathtt{u.left}$ or $ \mathtt{u.right}$ since then is at least

$\displaystyle \frac{1}{3}\mathtt{size(u)} - 1 \enspace ,$

and there are therefore at least this many credits stored at $ \mathtt{u}$ that are available to pay for the $ O(\ensuremath{\mathtt{size(u)}})$ time it takes to call $ \mathtt{rebuild(u)}$.

If we call $ \mathtt{rebuild(u)}$ during a deletion, it is because $ \ensuremath{\mathtt{q}} > 2\ensuremath{\mathtt{n}}$. In this case, we have $ \ensuremath{\mathtt{q}}-\ensuremath{\mathtt{n}}> \ensuremath{\mathtt{n}}$ credits stored ``on the side'' and we use these to pay for the $ O(\ensuremath{\mathtt{n}})$ time it takes to rebuild the root. This completes the proof. $ \qedsymbol$

8.1.2 Summary

The following theorem summarizes the performance of the ScapegoatTree data structure:

Theorem 8.1   A ScapegoatTree implements the SSet interface. Ignoring the cost of $ \mathtt{rebuild(u)}$ operations, a ScapegoatTree supports the operations $ \mathtt{add(x)}$, $ \mathtt{remove(x)}$, and $ \mathtt{find(x)}$ in $ O(\log \ensuremath{\mathtt{n}})$ time per operation.

Furthermore, beginning with an empty ScapegoatTree, any sequence of $ m$ $ \mathtt{add(x)}$ and $ \mathtt{remove(x)}$ operations results in a total of $ O(m\log m)$ time spent during all calls to $ \mathtt{rebuild(u)}$.
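
As a small, hypothetical usage example (the class name follows the text above, but the constructor and the exact SSet return convention of $ \mathtt{find(x)}$ are assumptions):

    // Illustrative only: exercise the three operations discussed above.
    void example() {
        ScapegoatTree<Integer> t = new ScapegoatTree<>();
        for (int x : new int[] {7, 3, 11, 1, 5, 9, 13})
            t.add(x);          // O(log n) amortized per insertion
        t.remove(3);           // a full rebuild is triggered only once q > 2n
        Integer y = t.find(6); // under the usual SSet convention, the smallest stored value >= 6 (here 7)
    }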
