Open Data Structures
\(\newcommand{\E}{\mathrm{E}} \DeclareMathOperator{\ddiv}{div} \)
Contents
8 Scapegoat Trees

In this chapter, we study a binary search tree data structure, the ScapegoatTree. This structure is based on the common wisdom that, when something goes wrong, the first thing people tend to do is find someone to blame (the scapegoat). Once blame is firmly established, we can leave the scapegoat to fix the problem.

A ScapegoatTree keeps itself balanced by partial rebuilding operations. During a partial rebuilding operation, an entire subtree is deconstructed and rebuilt into a perfectly balanced subtree. There are many ways of rebuilding a subtree rooted at node u into a perfectly balanced tree. One of the simplest is to traverse u's subtree, gathering all its nodes into an array, a, and then to recursively build a balanced subtree using a. If we let \(\texttt{m}=\texttt{a.length}/2\), then the element a[m] becomes the root of the new subtree, \(\texttt{a}[0],\ldots,\texttt{a}[\texttt{m}-1]\) get stored recursively in the left subtree and \(\texttt{a}[\texttt{m}+1],\ldots,\texttt{a}[\texttt{a.length}-1]\) get stored recursively in the right subtree.

    void rebuild(Node<T> u) {
        int ns = size(u);
        Node<T> p = u.parent;
        Node<T>[] a = (Node<T>[]) Array.newInstance(Node.class, ns);
        packIntoArray(u, a, 0);
        if (p == nil) {
            r = buildBalanced(a, 0, ns);
            r.parent = nil;
        } else if (p.right == u) {
            p.right = buildBalanced(a, 0, ns);
            p.right.parent = p;
        } else {
            p.left = buildBalanced(a, 0, ns);
            p.left.parent = p;
        }
    }
    int packIntoArray(Node<T> u, Node<T>[] a, int i) {
        if (u == nil) {
            return i;
        }
        i = packIntoArray(u.left, a, i);
        a[i++] = u;
        return packIntoArray(u.right, a, i);
    }
    Node<T> buildBalanced(Node<T>[] a, int i, int ns) {
        if (ns == 0)
            return nil;
        int m = ns / 2;
        a[i + m].left = buildBalanced(a, i, m);
        if (a[i + m].left != nil)
            a[i + m].left.parent = a[i + m];
        a[i + m].right = buildBalanced(a, i + m + 1, ns - m - 1);
        if (a[i + m].right != nil)
            a[i + m].right.parent = a[i + m];
        return a[i + m];
    }
A call to rebuild(u) takes \(O(\texttt{size(u)})\) time. The resulting subtree has minimum height; there is no tree of smaller height that has size(u) nodes.

8.1 ScapegoatTree: A Binary Search Tree with Partial Rebuilding

A ScapegoatTree is a BinarySearchTree that, in addition to keeping track of the number, n, of nodes in the tree also keeps a counter, q, that maintains an upper-bound on the number of nodes.

    int q;
At all times, n and q obey the following inequalities: \begin{equation*} \texttt{q}/2 \le \texttt{n} \le \texttt{q} \enspace . \end{equation*} In addition, a ScapegoatTree has logarithmic height; at all times, the height of the scapegoat tree does not exceed \begin{equation} \log_{3/2} \texttt{q} \le \log_{3/2} 2\texttt{n} < \log_{3/2} \texttt{n} + 2\enspace . \end{equation} Even with this constraint, a ScapegoatTree can look surprisingly unbalanced. The tree in Figure 8.1 has \(\texttt{q}=\texttt{n}=10\) and height \(5<\log_{3/2}10 \approx 5.679\).

Figure 8.1 A ScapegoatTree with 10 nodes and height 5.

Implementing the find(x) operation in a ScapegoatTree is done using the standard algorithm for searching in a BinarySearchTree (see Section 6.2). This takes time proportional to the height of the tree which, by Equation 8.1 is \(O(\log \texttt{n})\).

To implement the add(x) operation, we first increment n and q and then use the usual algorithm for adding x to a binary search tree; we search for x and then add a new leaf u with \(\texttt{u.x}=\texttt{x}\). At this point, we may get lucky and the depth of u might not exceed \(\log_{3/2}\texttt{q}\). If so, then we leave well enough alone and don't do anything else.

Unfortunately, it will sometimes happen that \(\texttt{depth(u)} > \log_{3/2} \texttt{q}\). In this case, we need to reduce the height. This isn't a big job; there is only one node, namely u, whose depth exceeds \(\log_{3/2} \texttt{q}\). To fix u, we walk from u back up to the root looking for a scapegoat, w. The scapegoat, w, is a very unbalanced node. It has the property that \begin{equation} \frac{\texttt{size(w.child)}}{\texttt{size(w)}} > \frac{2}{3} \enspace , \end{equation} where w.child is the child of w on the path from the root to u. We'll very shortly prove that a scapegoat exists. For now, we can take it for granted. Once we've found the scapegoat w, we completely destroy the subtree rooted at w and rebuild it into a perfectly balanced binary search tree. We know, from Equation 8.2, that, even before the addition of u, w's subtree was not a complete binary tree. Therefore, when we rebuild w, the height decreases by at least 1 so that the height of the ScapegoatTree is once again at most \(\log_{3/2}\texttt{q}\).

    boolean add(T x) {
        // first do basic insertion keeping track of depth
        Node<T> u = newNode(x);
        int d = addWithDepth(u);
        if (d > log32(q)) {
            // depth exceeded, find scapegoat
            Node<T> w = u.parent;
            while (3*size(w) <= 2*size(w.parent))
                w = w.parent;
            rebuild(w.parent);
        }
        return d >= 0;
    }

Figure 8.2 Inserting 3.5 into a ScapegoatTree increases its height to 6, which violates Equation 8.1 since \(6 > \log_{3/2} 11 \approx 5.914\). A scapegoat is found at the node containing 5.
If we ignore the cost of finding the scapegoat w and rebuilding the subtree rooted at w, then the running time of add(x) is dominated by the initial search, which takes \(O(\log \texttt{q}) = O(\log \texttt{n})\) time. We will account for the cost of finding the scapegoat and rebuilding using amortized analysis in the next section.

The implementation of remove(x) in a ScapegoatTree is very simple. We search for x and remove it using the usual algorithm for removing a node from a BinarySearchTree. (Note that this can never increase the height of the tree.) Next, we decrement n, but leave q unchanged. Finally, we check if \(\texttt{q} > 2\texttt{n}\) and, if so, then we rebuild the entire tree into a perfectly balanced binary search tree and set \(\texttt{q}=\texttt{n}\).

    boolean remove(T x) {
        if (super.remove(x)) {
            if (2*n < q) {
                rebuild(r);
                q = n;
            }
            return true;
        }
        return false;
    }
Again, if we ignore the cost of rebuilding, the running time of the remove(x) operation is proportional to the height of the tree, and is therefore \(O(\log \texttt{n})\).

8.1.1 Analysis of Correctness and Running-Time

In this section, we analyze the correctness and amortized running time of operations on a ScapegoatTree. We first prove the correctness by showing that, when the add(x) operation results in a node that violates Condition Equation 8.1, then we can always find a scapegoat:

Lemma 8.1 Let u be a node of depth \(h>\log_{3/2} \texttt{q}\) in a ScapegoatTree. Then there exists a node \(\texttt{w}\) on the path from u to the root such that \begin{equation*} \frac{\texttt{size(w)}}{\texttt{size(parent(w))}} > 2/3 \enspace . \end{equation*}

Proof Suppose, for the sake of contradiction, that this is not the case, and \begin{equation*} \frac{\texttt{size(w)}}{\texttt{size(parent(w))}} \le 2/3 \enspace . \end{equation*} for all nodes w on the path from u to the root. Denote the path from the root to u as \(\texttt{r}=\texttt{u}_0,\ldots,\texttt{u}_h=\texttt{u}\). Then, we have \(\texttt{size(u}_0\texttt{)}=\texttt{n}\), \(\texttt{size(u}_1\texttt{)}\le\frac{2}{3}\texttt{n}\), \(\texttt{size(u}_2\texttt{)}\le\frac{4}{9}\texttt{n}\) and, more generally, \begin{equation*} \texttt{size(u}_i\texttt{)}\le\left(\frac{2}{3}\right)^i\texttt{n} \enspace . \end{equation*} But this gives a contradiction, since \(\texttt{size(u)}\ge 1\), hence \begin{equation*} 1 \le \texttt{size(u)} \le \left(\frac{2}{3}\right)^h\texttt{n} < \left(\frac{2}{3}\right)^{\log_{3/2} \texttt{q}}\texttt{n} \le \left(\frac{2}{3}\right)^{\log_{3/2} \texttt{n}}\texttt{n} = \left(\frac{1}{\texttt{n}}\right) \texttt{n} = 1 \enspace . \end{equation*}

Next, we analyze the parts of the running time that are not yet accounted for. There are two parts: The cost of calls to size(u) when searching for scapegoat nodes, and the cost of calls to rebuild(w) when we find a scapegoat w. The cost of calls to size(u) can be related to the cost of calls to rebuild(w), as follows:

Lemma 8.2 During a call to add(x) in a ScapegoatTree, the cost of finding the scapegoat w and rebuilding the subtree rooted at w is \(O(\texttt{size(w)})\).

Proof The cost of rebuilding the scapegoat node w, once we find it, is \(O(\texttt{size(w)})\). When searching for the scapegoat node, we call size(u) on a sequence of nodes \(\texttt{u}_0,\ldots,\texttt{u}_k\) until we find the scapegoat \(\texttt{u}_k=\texttt{w}\). However, since \(\texttt{u}_k\) is the first node in this sequence that is a scapegoat, we know that \begin{equation*} \texttt{size(u}_{i}\texttt{)} < \frac{2}{3}\texttt{size(u}_{i+1}\texttt{)} \end{equation*} for all \(i\in\{0,\ldots,k-2\}\). Therefore, the cost of all calls to size(u) is \begin{eqnarray*} O\left( \sum_{i=0}^k \texttt{size(u}_{k-i}\texttt{)} \right) &=& O\left( \texttt{size(u}_k\texttt{)} + \sum_{i=0}^{k-1} \texttt{size(u}_{k-i-1}\texttt{)} \right) \\
&=& O\left( \texttt{size(u}_k\texttt{)} + \sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i\texttt{size(u}_{k}\texttt{)} \right) \\
&=& O\left( \texttt{size(u}_k\texttt{)}\left(1+ \sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i \right)\right) \\
&=& O(\texttt{size(u}_k\texttt{)}) = O(\texttt{size(w)}) \enspace , \end{eqnarray*} where the last line follows from the fact that the sum is a geometrically decreasing series.

All that remains is to prove an upper-bound on the cost of all calls to rebuild(u) during a sequence of \(m\) operations:

Lemma 8.3 Starting with an empty ScapegoatTree any sequence of \(m\) add(x) and remove(x) operations causes at most \(O(m\log m)\) time to be used by rebuild(u) operations.

Proof To prove this, we will use a credit scheme. We imagine that each node stores a number of credits. Each credit can pay for some constant, \(c\), units of time spent rebuilding. The scheme gives out a total of \(O(m\log m)\) credits and every call to rebuild(u) is paid for with credits stored at u.

During an insertion or deletion, we give one credit to each node on the path to the inserted node, or deleted node, u. In this way we hand out at most \(\log_{3/2}\texttt{q}\le \log_{3/2}m\) credits per operation. During a deletion we also store an additional credit “on the side.” Thus, in total we give out at most \(O(m\log m)\) credits. All that remains is to show that these credits are sufficient to pay for all calls to rebuild(u).

If we call rebuild(u) during an insertion, it is because u is a scapegoat. Suppose, without loss of generality, that \begin{equation*} \frac{\texttt{size(u.left)}}{\texttt{size(u)}} > \frac{2}{3} \enspace . \end{equation*} Using the fact that \begin{equation*} \texttt{size(u)} = 1 + \texttt{size(u.left)} + \texttt{size(u.right)} \end{equation*} we deduce that \begin{equation*} \frac{1}{2}\texttt{size(u.left)} > \texttt{size(u.right)} \enspace \end{equation*} and therefore \begin{equation*} \texttt{size(u.left)} - \texttt{size(u.right)} > \frac{1}{2}\texttt{size(u.left)} > \frac{1}{3}\texttt{size(u)} \enspace . \end{equation*} Now, the last time a subtree containing u was rebuilt (or when u was inserted, if a subtree containing u was never rebuilt), we had \begin{equation*} \texttt{size(u.left)} - \texttt{size(u.right)} \le 1 \enspace . \end{equation*} Therefore, the number of add(x) or remove(x) operations that have affected u.left or u.right since then is at least \begin{equation*} \frac{1}{3}\texttt{size(u)} - 1 \enspace . \end{equation*} and there are therefore at least this many credits stored at u that are available to pay for the \(O(\texttt{size(u)})\) time it takes to call rebuild(u).

If we call rebuild(u) during a deletion, it is because \(\texttt{q} > 2\texttt{n}\). In this case, we have \(\texttt{q}-\texttt{n}> \texttt{n}\) credits stored “on the side,” and we use these to pay for the \(O(\texttt{n})\) time it takes to rebuild the root. This completes the proof.

8.1.2 Summary

The following theorem summarizes the performance of the ScapegoatTree data structure:

Theorem 8.1 A ScapegoatTree implements the SSet interface. Ignoring the cost of rebuild(u) operations, a ScapegoatTree supports the operations add(x), remove(x), and find(x) in \(O(\log \texttt{n})\) time per operation.

Furthermore, beginning with an empty ScapegoatTree, any sequence of \(m\) add(x) and remove(x) operations results in a total of \(O(m\log m)\) time spent during all calls to rebuild(u).

8.2 Discussion and Exercises

The term scapegoat tree is due to Galperin and Rivest [33], who define and analyze these trees. However, the same structure was discovered earlier by Andersson [5,7], who called them general balanced trees since they can have any shape as long as their height is small.

Experimenting with the ScapegoatTree implementation will reveal that it is often considerably slower than the other SSet implementations in this book. This may be somewhat surprising, since height bound of \begin{equation*} \log_{3/2}\texttt{q} \approx 1.709\log \texttt{n} + O(1) \end{equation*} is better than the expected length of a search path in a Skiplist and not too far from that of a Treap. The implementation could be optimized by storing the sizes of subtrees explicitly at each node or by reusing already computed subtree sizes (Exercises Exercise 8.5 and Exercise 8.6). Even with these optimizations, there will always be sequences of add(x) and delete(x) operation for which a ScapegoatTree takes longer than other SSet implementations.

This gap in performance is due to the fact that, unlike the other SSet implementations discussed in this book, a ScapegoatTree can spend a lot of time restructuring itself. Exercise 8.3 asks you to prove that there are sequences of n operations in which a ScapegoatTree will spend on the order of \(\texttt{n}\log \texttt{n}\) time in calls to rebuild(u). This is in contrast to other SSet implementations discussed in this book, which only make \(O(\texttt{n})\) structural changes during a sequence of n operations. This is, unfortunately, a necessary consequence of the fact that a ScapegoatTree does all its restructuring by calls to rebuild(u) [20].

Despite their lack of performance, there are applications in which a ScapegoatTree could be the right choice. This would occur any time there is additional data associated with nodes that cannot be updated in constant time when a rotation is performed, but that can be updated during a rebuild(u) operation. In such cases, the ScapegoatTree and related structures based on partial rebuilding may work. An example of such an application is outlined in Exercise 8.11.

Exercise 8.1 Illustrate the addition of the values 1.5 and then 1.6 on the ScapegoatTree in Figure 8.1.

Exercise 8.2 Illustrate what happens when the sequence \(1,5,2,4,3\) is added to an empty ScapegoatTree, and show where the credits described in the proof of Lemma 8.3 go, and how they are used during this sequence of additions.

Exercise 8.3 Show that, if we start with an empty ScapegoatTree and call add(x) for \(\texttt{x}=1,2,3,\ldots,\texttt{n}\), then the total time spent during calls to rebuild(u) is at least \(c\texttt{n}\log \texttt{n}\) for some constant \(c>0\).

Exercise 8.4 The ScapegoatTree, as described in this chapter, guarantees that the length of the search path does not exceed \(\log_{3/2}\texttt{q}\).
  1. Design, analyze, and implement a modified version of ScapegoatTree where the length of the search path does not exceed \(\log_{\texttt{b}} \texttt{q}\), where b is a parameter with \(1<\texttt{b}<2\).
  2. What does your analysis and/or your experiments say about the amortized cost of find(x), add(x) and remove(x) as a function of n and b?

Exercise 8.5 Modify the add(x) method of the ScapegoatTree so that it does not waste any time recomputing the sizes of subtrees that have already been computed. This is possible because, by the time the method wants to compute size(w), it has already computed one of size(w.left) or size(w.right). Compare the performance of your modified implementation with the implementation given here.

Exercise 8.6 Implement a second version of the ScapegoatTree data structure that explicitly stores and maintains the sizes of the subtree rooted at each node. Compare the performance of the resulting implementation with that of the original ScapegoatTree implementation as well as the implementation from Exercise 8.5.

Exercise 8.7 Reimplement the rebuild(u) method discussed at the beginning of this chapter so that it does not require the use of an array to store the nodes of the subtree being rebuilt. Instead, it should use recursion to first connect the nodes into a linked list and then convert this linked list into a perfectly balanced binary tree. (There are very elegant recursive implementations of both steps.)

Exercise 8.8 Analyze and implement a WeightBalancedTree. This is a tree in which each node u, except the root, maintains the balance invariant that \(\texttt{size(u)} \le (2/3)\texttt{size(u.parent)}\). The add(x) and remove(x) operations are identical to the standard BinarySearchTree operations, except that any time the balance invariant is violated at a node u, the subtree rooted at u.parent is rebuilt. Your analysis should show that operations on a WeightBalancedTree run in \(O(\log\texttt{n})\) amortized time.

Exercise 8.9 Analyze and implement a CountdownTree. In a CountdownTree each node u keeps a timer u.t. The add(x) and remove(x) operations are exactly the same as in a standard BinarySearchTree except that, whenever one of these operations affects u's subtree, u.t is decremented. When \(\texttt{u.t}=0\) the entire subtree rooted at u is rebuilt into a perfectly balanced binary search tree. When a node u is involved in a rebuilding operation (either because u is rebuilt or one of u's ancestors is rebuilt) u.t is reset to \(\texttt{size(u)}/3\).

Your analysis should show that operations on a CountdownTree run in \(O(\log \texttt{n})\) amortized time. (Hint: First show that each node u satisfies some version of a balance invariant.)

Exercise 8.10 Analyze and implement a DynamiteTree. In a DynamiteTree each node u keeps tracks of the size of the subtree rooted at u in a variable u.size. The add(x) and remove(x) operations are exactly the same as in a standard BinarySearchTree except that, whenever one of these operations affects a node u's subtree, u explodes with probability \(1/\texttt{u.size}\). When u explodes, its entire subtree is rebuilt into a perfectly balanced binary search tree.

Your analysis should show that operations on a DynamiteTree run in \(O(\log \texttt{n})\) expected time.

Exercise 8.11 Design and implement a Sequence data structure that maintains a sequence (list) of elements. It supports these operations: The first two operations should run in \(O(\log \texttt{n})\) amortized time. The third operation should run in constant time.

The Sequence data structure can be implemented by storing the elements in something like a ScapegoatTree, in the same order that they occur in the sequence. To implement testBefore(e1,e2) in constant time, each element e is labelled with an integer that encodes the path from the root to e. In this way, testBefore(e1,e2) can be implemented by comparing the labels of e1 and e2.