5.1 ChainedHashTable: Hashing with Chaining

A ChainedHashTable data structure uses hashing with chaining to store data as an array, $ \mathtt{t}$, of lists. An integer, $ \mathtt{n}$, keeps track of the total number of items in all lists:

    List<T>[] t;
    int n;
The hash value of a data item $\mathtt{x}$, denoted $\mathtt{hash(x)}$, is a value in the range $\{0,\ldots,\mathtt{t.length}-1\}$. All items with hash value $\mathtt{i}$ are stored in the list at $\mathtt{t[i]}$. To ensure that lists don't get too long, we maintain the invariant

$\displaystyle \mathtt{n} \le \mathtt{t.length}$

so that the average number of elements stored in one of these lists is $ \ensuremath{\mathtt{n}}/\ensuremath{\mathtt{t.length}} \le 1$.

To add an element, $\mathtt{x}$, to the hash table, we first check that $\mathtt{x}$ is not already stored in the table. Next, we check if the length of $\mathtt{t}$ needs to be increased and, if so, we grow $\mathtt{t}$. With this out of the way, we hash $\mathtt{x}$ to get an integer, $\mathtt{i}$, in the range $\{0,\ldots,\mathtt{t.length}-1\}$, and we append $\mathtt{x}$ to the list $\mathtt{t[i]}$:

    boolean add(T x) {
        if (find(x) != null) return false;
        if (n+1 > t.length) resize();
        t[hash(x)].add(x);
        n++;
        return true;
    }
Growing the table, if necessary, involves doubling the length of $\mathtt{t}$ and reinserting all elements into the new table. This is exactly the same strategy used in the implementation of ArrayStack and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions (see Lemma 2.1).

Besides growing, the only other work done when adding $ \mathtt{x}$ to a ChainedHashTable involves appending $ \mathtt{x}$ to the list $ \mathtt{t[hash(x)]}$. For any of the list implementations described in Chapters 2 or 3, this takes only constant time.
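The growing step itself is not shown in this section. The sketch below is one way it might look, assuming the fields $\mathtt{t}$ and $\mathtt{n}$ from above; a simple modular hash stands in for the multiplicative hash described in Section 5.1.1, and all names here are illustrative rather than the book's exact code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the growing step: allocate a table twice as long
// and rehash every element into it. Each element must be rehashed because
// the hash value depends on t.length.
class ResizeSketch<T> {
    List<T>[] t = newTable(2);
    int n = 0;

    @SuppressWarnings("unchecked")
    static <T> List<T>[] newTable(int size) {
        List<T>[] t = (List<T>[]) new List[size];
        for (int i = 0; i < size; i++) t[i] = new ArrayList<>();
        return t;
    }

    // Placeholder hash: simple modular hashing, not the multiplicative
    // hash of Section 5.1.1.
    int hash(T x) {
        return (x.hashCode() & 0x7fffffff) % t.length;
    }

    void resize() {
        List<T>[] old = t;
        t = newTable(2 * old.length);   // double the table length
        for (List<T> list : old)        // reinsert every element;
            for (T x : list)            // each is rehashed for the new length
                t[hash(x)].add(x);
    }
}
```

Because every element is reinserted, `resize()` takes $O(\mathtt{n} + \mathtt{t.length})$ time, which is what makes the amortized analysis of Lemma 2.1 apply.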

To remove an element $ \mathtt{x}$ from the hash table we iterate over the list $ \mathtt{t[hash(x)]}$ until we find $ \mathtt{x}$ so that we can remove it:

    T remove(T x) {
        Iterator<T> it = t[hash(x)].iterator();
        while (it.hasNext()) {
            T y = it.next();
            if (y.equals(x)) {
                it.remove();
                n--;
                return y;
            }
        }
        return null;
    }
This takes $O(\mathtt{n}_{\mathtt{hash(x)}})$ time, where $\mathtt{n}_{\mathtt{i}}$ denotes the length of the list stored at $\mathtt{t[i]}$.

Searching for the element $ \mathtt{x}$ in a hash table is similar. We perform a linear search on the list $ \mathtt{t[hash(x)]}$:

    T find(Object x) {
        for (T y : t[hash(x)])
            if (y.equals(x))
                return y;
        return null;
    }
Again, this takes time proportional to the length of the list $ \mathtt{t[hash(x)]}$.

The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the $\mathtt{t.length}$ lists, so that the expected size of the list $\mathtt{t[hash(x)]}$ is $O(\mathtt{n}/\mathtt{t.length}) = O(1)$. On the other hand, a bad hash function will hash all values (including $\mathtt{x}$) to the same table location, in which case the size of the list $\mathtt{t[hash(x)]}$ will be $\mathtt{n}$. In the next section we describe a good hash function.
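The gap between these two extremes is easy to see in a toy experiment. The snippet below (names and constants are invented for illustration) distributes 1000 consecutive integers into $2^{\mathtt{d}}$ lists, once with a multiplicative-style hash using a fixed odd multiplier and once with a degenerate constant hash that sends everything to bucket 0:

```java
// Toy illustration (not from the book): bucket occupancy under a spreading
// hash versus a constant hash, for a table of 2^d buckets.
class HashSpread {
    static int[] bucketSizes(int nItems, int d, boolean constantHash) {
        int[] sizes = new int[1 << d];          // 2^d buckets
        final int z = 0x9E3779B1;               // a fixed odd multiplier
        for (int x = 0; x < nItems; x++) {
            // Constant hash: every item lands in list 0, so one list has
            // length n. Multiplicative-style hash: items spread evenly.
            int i = constantHash ? 0 : (z * x) >>> (32 - d);
            sizes[i]++;
        }
        return sizes;
    }
}
```

With $\mathtt{d}=7$ (128 buckets), the constant hash produces one list of length 1000, while the spreading hash keeps every list close to the average length $1000/128 \approx 8$.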


5.1.1 Multiplicative Hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic and integer division. It uses the $\ddiv$ operator, which calculates the integral part of a quotient while discarding the remainder. Formally, for any integers $a\ge 0$ and $b\ge 1$, $a\ddiv b = \lfloor a/b\rfloor$.
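As a quick sanity check of the definition (in Java, the expression `a / b` computes exactly $a\ddiv b$ for non-negative integers):

```latex
7 \ddiv 2 = \lfloor 7/2 \rfloor = 3 , \qquad
100 \ddiv 2^5 = \lfloor 100/32 \rfloor = 3 .
```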

In multiplicative hashing, we use a hash table of size $ 2^{\ensuremath{\mathtt{d}}}$ for some integer $ \mathtt{d}$ (called the dimension). The formula for hashing an integer $ \ensuremath{\mathtt{x}}\in\{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$ is

$\displaystyle \mathtt{hash}(\mathtt{x}) = ((\mathtt{z}\cdot\mathtt{x})\bmod 2^{\mathtt{w}}) \ddiv 2^{\mathtt{w}-\mathtt{d}} \enspace .
$

Here, $ \mathtt{z}$ is a randomly chosen odd integer in $ \{1,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo $ 2^{\ensuremath{\mathtt{w}}}$ where $ \ensuremath{\mathtt{w}}$ is the number of bits in an integer. (See Figure 5.1.) Furthermore, integer division by $ 2^{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}}$ is equivalent to dropping the rightmost $ \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$ bits in a binary representation (which is implemented by shifting the bits right by $ \ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}$). In this way, the code that implements the above formula is simpler than the formula itself:
    int hash(Object x) {
        return (z * x.hashCode()) >>> (w-d);
    }

Figure 5.1: The operation of the multiplicative hash function with $ \ensuremath {\mathtt {w}}=32$ and $ \ensuremath {\mathtt {d}}=8$.
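The computation that Figure 5.1 traces on bit patterns can be reproduced directly in Java. In the demo below, the multiplier $\mathtt{z}$ is an arbitrary odd constant chosen for illustration (not a value from the book); with $\mathtt{w}=32$ and $\mathtt{d}=8$, the hash keeps the highest-order 8 bits of the 32-bit product:

```java
// Worked example of the multiplicative hash with w = 32 and d = 8.
// The odd multiplier z is a hypothetical choice for illustration.
class MultiplicativeHashDemo {
    static final int w = 32, d = 8;
    static final int z = 0x2058cc51;   // an arbitrary odd 32-bit multiplier

    static int hash(int x) {
        // Java int multiplication overflows mod 2^w automatically, so
        // (z * x) is already z*x mod 2^32; the unsigned shift >>> drops
        // the low w-d = 24 bits, i.e., divides by 2^(w-d).
        return (z * x) >>> (w - d);
    }
}
```

Every result lies in $\{0,\ldots,2^{\mathtt{d}}-1\} = \{0,\ldots,255\}$, exactly the index range of a table of size $2^{\mathtt{d}}$.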

The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:

Lemma 5.1   Let $\mathtt{x}$ and $\mathtt{y}$ be any two values in $\{0,\ldots,2^{\mathtt{w}}-1\}$ with $\mathtt{x}\neq \mathtt{y}$. Then $\Pr\{\mathtt{hash(x)}=\mathtt{hash(y)}\} \le 2/2^{\mathtt{d}}$.

With Lemma 5.1, the performance of $\mathtt{remove(x)}$ and $\mathtt{find(x)}$ is easy to analyze:

Lemma 5.2   For any data value $\mathtt{x}$, the expected length of the list $\mathtt{t[hash(x)]}$ is at most $\mathtt{n}_{\mathtt{x}} + 2$, where $\mathtt{n}_{\mathtt{x}}$ is the number of occurrences of $\mathtt{x}$ in the hash table.

Proof. Let $ S$ be the (multi-)set of elements stored in the hash table that are not equal to $ \mathtt{x}$. For an element $ \ensuremath{\mathtt{y}}\in S$, define the indicator variable

$\displaystyle I_{\mathtt{y}} = \left\{\begin{array}{ll}
1 & \mbox{if $\mathtt{hash(x)} = \mathtt{hash(y)}$} \\
0 & \mbox{otherwise}
\end{array}\right.
$

and notice that, by Lemma 5.1, $\mathrm{E}[I_{\mathtt{y}}] \le 2/2^{\mathtt{d}}=2/\mathtt{t.length}$. The expected length of the list $\mathtt{t[hash(x)]}$ is given by

$\displaystyle \begin{array}{rcl}
\mathrm{E}\left[\mathtt{t[hash(x)].size()}\right]
 &=& \mathrm{E}\left[\mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} I_{\mathtt{y}}\right] \\
 &=& \mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} \mathrm{E}[I_{\mathtt{y}}] \\
 &\le& \mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} 2/\mathtt{t.length} \\
 &\le& \mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} 2/\mathtt{n} \\
 &\le& \mathtt{n}_{\mathtt{x}} + (\mathtt{n}-\mathtt{n}_{\mathtt{x}})2/\mathtt{n} \\
 &\le& \mathtt{n}_{\mathtt{x}} + 2 \enspace ,
\end{array}$
as required. $ \qedsymbol$

Now, we want to prove Lemma 5.1, but first we need a result from number theory. In the following proof, we use the notation $ (b_r,\ldots,b_0)_2$ to denote $ \sum_{i=0}^r b_i2^i$, where each $ b_i$ is a bit, either 0 or 1. In other words, $ (b_r,\ldots,b_0)_2$ is the integer whose binary representation is given by $ b_r,\ldots,b_0$. We use $ \star$ to denote a bit of unknown value.

Lemma 5.3   Let $S$ be the set of odd integers in $\{1,\ldots,2^{\mathtt{w}}-1\}$, and let $q$ and $i$ be any two elements of $S$. Then there is exactly one value $\mathtt{z}\in S$ such that $\mathtt{z}q\bmod 2^{\mathtt{w}} = i$.

Proof. Since the number of choices for $ \ensuremath{\mathtt{z}}$ and $ i$ is the same, it is sufficient to prove that there is at most one value $ \ensuremath{\mathtt{z}}\in S$ that satisfies $ \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = i$.

Suppose, for the sake of contradiction, that there are two such values $ \mathtt{z}$ and $ \mathtt{z'}$, with $ \ensuremath{\mathtt{z}}>\ensuremath{\mathtt{z}}'$. Then

$\displaystyle \ensuremath{\mathtt{z}}q\bmod 2^{\ensuremath{\mathtt{w}}} = \ensuremath{\mathtt{z}}'q \bmod 2^{\ensuremath{\mathtt{w}}} = i
$

So

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q\bmod 2^{\ensuremath{\mathtt{w}}} = 0
$

But this means that

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = k 2^{\ensuremath{\mathtt{w}}}$ (5.1)

for some integer $ k$. Thinking in terms of binary numbers, we have

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = k\cdot(1,\underbrace{0,\ldots,0}_{\ensuremath{\mathtt{w}}})_2 \enspace ,
$

so that the $ \mathtt{w}$ trailing bits in the binary representation of $ (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q$ are all 0's.

Furthermore $ k\neq 0$ since $ q\neq 0$ and $ \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'\neq 0$. Since $ q$ is odd, it has no trailing 0's in its binary representation:

$\displaystyle q = (\star,\ldots,\star,1)_2 \enspace .
$

Since $ \vert\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'\vert < 2^{\ensuremath{\mathtt{w}}}$, $ \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}'$ has fewer than $ \mathtt{w}$ trailing 0's in its binary representation:

$\displaystyle \ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}' = (\star,\ldots,\star,1,\underbrace{0,\ldots,0}_{<\ensuremath{\mathtt{w}}})_2
\enspace .
$

Therefore, the product $ (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q$ has fewer than $ \mathtt{w}$ trailing 0's in its binary representation:

$\displaystyle (\ensuremath{\mathtt{z}}-\ensuremath{\mathtt{z}}')q = (\star,\cdots,\star,1,\underbrace{0,\ldots,0}_{<\ensuremath{\mathtt{w}}})_2
\enspace .
$

Therefore $(\mathtt{z}-\mathtt{z}')q$ cannot satisfy (5.1), yielding a contradiction and completing the proof. $\qedsymbol$

The utility of Lemma 5.3 comes from the following observation: If $\mathtt{z}$ is chosen uniformly at random from $S$, then $\mathtt{z}q\bmod 2^{\mathtt{w}}$ is uniformly distributed over $S$. In the following proof, it helps to think of the binary representation of $\mathtt{z}q\bmod 2^{\mathtt{w}}$, which consists of $\mathtt{w}-1$ random bits followed by a 1.

Proof. [Proof of Lemma 5.1] First we note that the condition $ \ensuremath{\mathtt{hash(x)}}=\ensuremath{\mathtt{hash(y)}}$ is equivalent to the statement ``the highest-order $ \mathtt{d}$ bits of $ \ensuremath{\mathtt{z}} \ensuremath{\mathtt{x}}\bmod2^{\ensuremath{\mathtt{w}}}$ and the highest-order $ \mathtt{d}$ bits of $ \ensuremath{\mathtt{z}} \ensuremath{\mathtt{y}}\bmod 2^{\ensuremath{\mathtt{w}}}$ are the same.'' A necessary condition of that statement is that the highest-order $ \mathtt{d}$ bits in the binary representation of $ \ensuremath{\mathtt{z}}(\ensuremath{\mathtt{x}}-\ensuremath{\mathtt{y}})\bmod 2^{\ensuremath{\mathtt{w}}}$ are either all 0's or all 1's. That is,

$\displaystyle \mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}} = (\underbrace{0,\ldots,0}_{\mathtt{d}},\underbrace{\star,\ldots,\star}_{\mathtt{w}-\mathtt{d}})_2$ (5.2)

when $ \ensuremath{\mathtt{zx}}\bmod 2^{\ensuremath{\mathtt{w}}} > \ensuremath{\mathtt{zy}}\bmod 2^{\ensuremath{\mathtt{w}}}$ or

$\displaystyle \mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}} = (\underbrace{1,\ldots,1}_{\mathtt{d}},\underbrace{\star,\ldots,\star}_{\mathtt{w}-\mathtt{d}})_2 \enspace .$ (5.3)

when $\mathtt{zx}\bmod 2^{\mathtt{w}} < \mathtt{zy}\bmod 2^{\mathtt{w}}$. Therefore, we only have to bound the probability that $\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$ looks like (5.2) or (5.3).

Let $q$ be the unique odd integer such that $(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}=q2^r$ for some integer $r\ge 0$. By Lemma 5.3, the binary representation of $\mathtt{z}q\bmod 2^{\mathtt{w}}$ has $\mathtt{w}-1$ random bits, followed by a 1:

$\displaystyle \mathtt{z}q\bmod 2^{\mathtt{w}} = (\underbrace{b_{\mathtt{w}-1},\ldots,b_{1}}_{\mathtt{w}-1},1)_2
$

Therefore, the binary representation of $\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}=\mathtt{z}q2^r\bmod 2^{\mathtt{w}}$ has $\mathtt{w}-r-1$ random bits, followed by a 1, followed by $r$ 0's:

$\displaystyle \mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}} = \mathtt{z}q2^{r}\bmod 2^{\mathtt{w}} = (\underbrace{b_{\mathtt{w}-r-1},\ldots,b_{1}}_{\mathtt{w}-r-1},1,\underbrace{0,0,\ldots,0}_{r})_2
$

We can now finish the proof: If $r > \mathtt{w}-\mathtt{d}$, then the $\mathtt{d}$ highest-order bits of $\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$ contain both 0's and 1's, so the probability that $\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$ looks like (5.2) or (5.3) is 0. If $r=\mathtt{w}-\mathtt{d}$, then the probability of looking like (5.2) is 0, but the probability of looking like (5.3) is $1/2^{\mathtt{d}-1}=2/2^{\mathtt{d}}$ (since we must have $b_1,\ldots,b_{\mathtt{d}-1}=1,\ldots,1$). If $r < \mathtt{w}-\mathtt{d}$, then we must have $b_{\mathtt{w}-r-1},\ldots,b_{\mathtt{w}-r-\mathtt{d}}=0,\ldots,0$ or $b_{\mathtt{w}-r-1},\ldots,b_{\mathtt{w}-r-\mathtt{d}}=1,\ldots,1$. The probability of each of these cases is $1/2^{\mathtt{d}}$ and they are mutually exclusive, so the probability of either of these cases is $2/2^{\mathtt{d}}$. This completes the proof. $\qedsymbol$
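The bound of Lemma 5.1 is easy to check empirically. The toy experiment below (not part of the data structure; the values of $\mathtt{x}$, $\mathtt{y}$, the trial count, and the seed are arbitrary) draws many random odd multipliers $\mathtt{z}$ and measures how often two fixed distinct values collide under the $\mathtt{d}$ highest-order bits, with $\mathtt{w}=32$ and $\mathtt{d}=8$:

```java
import java.util.Random;

// Empirical check of Lemma 5.1: for fixed x != y, the fraction of random
// odd multipliers z with hash(x) == hash(y) should be at most roughly
// 2/2^d. A fixed seed makes the experiment repeatable.
class CollisionCheck {
    static double collisionRate(int x, int y, int d, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int z = rng.nextInt() | 1;           // random odd 32-bit integer
            if ((z * x) >>> (32 - d) == (z * y) >>> (32 - d))
                collisions++;
        }
        return (double) collisions / trials;
    }
}
```

With enough trials, the measured rate stays comfortably within the $2/2^{\mathtt{d}}$ bound of Lemma 5.1.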

5.1.2 Summary

The following theorem summarizes the performance of the ChainedHashTable data structure:

Theorem 5.1   A ChainedHashTable implements the USet interface. Ignoring the cost of calls to $\mathtt{resize()}$, a ChainedHashTable supports the operations $\mathtt{add(x)}$, $\mathtt{remove(x)}$, and $\mathtt{find(x)}$ in $O(1)$ expected time per operation.

Furthermore, beginning with an empty ChainedHashTable, any sequence of $m$ $\mathtt{add(x)}$ and $\mathtt{remove(x)}$ operations results in a total of $O(m)$ time spent during all calls to $\mathtt{resize()}$.
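Putting the pieces of this section together, here is a compact, self-contained sketch of the whole structure. It combines the code fragments above with a multiplicative hash whose dimension $\mathtt{d}$ tracks the table size ($\mathtt{t.length} = 2^{\mathtt{d}}$); details such as the fixed seed and the initial dimension are simplifications for illustration, not the book's exact implementation:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Minimal sketch of a ChainedHashTable assembled from the fragments in
// this section. t.length is always 2^d, and hash(x) keeps the d
// highest-order bits of z * x.hashCode() (w = 32).
class ChainedHashTable<T> {
    List<T>[] t;
    int n, d = 1;
    final int z = new Random(0).nextInt() | 1;  // random odd multiplier

    ChainedHashTable() { t = allocTable(1 << d); }

    @SuppressWarnings("unchecked")
    static <T> List<T>[] allocTable(int size) {
        List<T>[] t = (List<T>[]) new List[size];
        for (int i = 0; i < size; i++) t[i] = new ArrayList<>();
        return t;
    }

    int hash(Object x) { return (z * x.hashCode()) >>> (32 - d); }

    void resize() {                     // double the table and rehash
        d++;
        List<T>[] old = t;
        t = allocTable(1 << d);
        for (List<T> list : old)
            for (T x : list) t[hash(x)].add(x);
    }

    boolean add(T x) {
        if (find(x) != null) return false;
        if (n + 1 > t.length) resize();
        t[hash(x)].add(x);
        n++;
        return true;
    }

    T remove(T x) {
        Iterator<T> it = t[hash(x)].iterator();
        while (it.hasNext()) {
            T y = it.next();
            if (y.equals(x)) { it.remove(); n--; return y; }
        }
        return null;
    }

    T find(Object x) {
        for (T y : t[hash(x)])
            if (y.equals(x)) return y;
        return null;
    }
}
```

Note how `resize()` increments $\mathtt{d}$ before rehashing, so the shift amount $32-\mathtt{d}$ in `hash(x)` automatically matches the new table length.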

opendatastructures.org