A ChainedHashTable data structure uses hashing with chaining to store
data as an array, t, of lists. An integer, n, keeps track of the
total number of items in all lists:

array<List> t;
int n;

The hash value of a data item x, denoted hash(x), is an integer in the
range {0,...,t.length-1}; all items whose hash value is i are stored in
the list t[i].
To add an element, x, to the hash table, we first check if the length
of t needs to be increased and, if so, we grow t. With this out of the
way, we hash x to get an integer, i, in the range {0,...,t.length-1},
and we append x to the list t[i]:
bool add(T x) {
    if (find(x) != null) return false;
    if (n+1 > t.length) resize();
    t[hash(x)].add(x);
    n++;
    return true;
}
Growing the table, if necessary, involves doubling the length of t and
reinserting all of the elements into the new table; amortized over a
sequence of insertions, this costs only constant time per insertion.
Besides growing, the only other work done when adding a new value x to
a ChainedHashTable involves appending x to the list t[hash(x)]. For any
of the list implementations described in Chapters 2 or 3, this takes
only constant time.
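The resize() routine itself does not appear in this excerpt. The
following is a rough sketch of the grow step just described, assuming
std::vector and std::list as stand-ins for the array and list types;
hashInto and resizeSketch are illustrative names, not the book's code,
and the modulo hash is a placeholder for the multiplicative hash
described later in this section.

#include <cstdio>
#include <list>
#include <vector>

// Illustrative sketch of the grow step: double the number of lists and
// reinsert every stored element into the larger table. hashInto is a
// stand-in that reduces the key modulo the current table size.
static std::size_t hashInto(unsigned x, std::size_t slots) { return x % slots; }

static void resizeSketch(std::vector<std::list<unsigned>> &t) {
    std::vector<std::list<unsigned>> nt(2 * t.size());   // double the length
    for (auto &lst : t)
        for (unsigned x : lst)                           // rehash everything
            nt[hashInto(x, nt.size())].push_back(x);
    t.swap(nt);
}

int main() {
    std::vector<std::list<unsigned>> t(2);
    for (unsigned x = 0; x < 10; x++) t[hashInto(x, t.size())].push_back(x);
    resizeSketch(t);
    std::printf("table now has %zu lists\n", t.size());
    return 0;
}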
To remove an element, x, from the hash table, we iterate over the list
t[hash(x)] until we find x so that we can remove it:
T remove(T x) {
    int j = hash(x);
    for (int i = 0; i < t[j].size(); i++) {
        T y = t[j].get(i);
        if (x == y) {
            t[j].remove(i);
            n--;
            return y;
        }
    }
    return null;
}
This takes time proportional to the length of the list t[hash(x)].
Searching for the element x in a hash table is similar. We perform
a linear search on the list t[hash(x)]:
T find(T x) {
    int j = hash(x);
    for (int i = 0; i < t[j].size(); i++)
        if (x == t[j].get(i))
            return t[j].get(i);
    return null;
}
Again, this takes time proportional to the length of the list t[hash(x)].
The performance of a hash table depends critically on the choice of the
hash function. A good hash function will spread the elements evenly
among the t.length lists, so that the expected size of the list
t[hash(x)] is O(n/t.length) = O(1), since add(x) grows the table
whenever n+1 would exceed t.length. On the other hand, a bad hash
function will hash all values (including x) to the same table location,
in which case the size of the list t[hash(x)] will be n.
In the next section we describe a good hash function.
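To see the effect of the hash function concretely, here is a small
experiment (illustrative only, not from the book): it inserts the same
random keys into two tables, one with a deliberately bad hash function
that sends every key to slot 0 and one with a simple modulo hash, and
reports the longest list in each.

#include <algorithm>
#include <cstdio>
#include <list>
#include <random>
#include <vector>

// Illustrative experiment: compare the longest chain produced by a bad
// hash function (everything maps to slot 0) with a modulo hash on
// random keys.
int main() {
    const std::size_t slots = 1024, m = 1024;
    std::vector<std::list<unsigned>> bad(slots), good(slots);
    std::mt19937 rng(1);
    for (std::size_t i = 0; i < m; i++) {
        unsigned x = rng();
        bad[0].push_back(x);            // bad hash: every key collides
        good[x % slots].push_back(x);   // reasonable spread for random keys
    }
    std::size_t worstBad = 0, worstGood = 0;
    for (std::size_t j = 0; j < slots; j++) {
        worstBad = std::max(worstBad, bad[j].size());
        worstGood = std::max(worstGood, good[j].size());
    }
    std::printf("longest list: bad hash = %zu, good hash = %zu\n",
                worstBad, worstGood);
    return 0;
}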
Multiplicative hashing is an efficient method of generating hash
values based on modular arithmetic and integer division. It uses the
div operator, which calculates the integral part of a quotient while
discarding the remainder. Formally, for any integers a >= 0 and b >= 1,
a div b = floor(a/b); for example, 7 div 3 = 2.

In multiplicative hashing, we use a hash table of size 2^d for some
integer d (called the dimension). The formula for hashing a w-bit
integer x is

    hash(x) = ((z·x) mod 2^w) div 2^(w-d),

where z is a randomly chosen odd w-bit integer. Since arithmetic on
w-bit unsigned integers is already performed modulo 2^w, and dividing
by 2^(w-d) is the same as shifting right by w-d bits, the hash value
can be computed with one multiplication and one shift (hashCode(x)
first converts a general data item x into a w-bit integer):
int hash(T x) {
    return ((unsigned)(z * hashCode(x))) >> (w-d);
}
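A quick way to convince oneself of that equivalence is to compare the
two forms directly. The following sketch (the choices w = 32, d = 8,
and the random inputs are arbitrary assumptions, not from the book)
checks that ((z·x) mod 2^w) div 2^(w-d) and the shift-based hash agree:

#include <cstdint>
#include <cstdio>
#include <random>

// Illustrative check: for w = 32, ((z*x) mod 2^w) div 2^(w-d) equals the
// shift-based hash (z*x) >> (w-d), because unsigned 32-bit multiplication
// already wraps modulo 2^w and dividing by a power of two drops low bits.
int main() {
    const int w = 32, d = 8;
    std::mt19937 rng(42);
    std::uint32_t z = rng() | 1;                 // random odd multiplier
    for (int i = 0; i < 5; i++) {
        std::uint32_t x = rng();
        std::uint32_t prod = z * x;              // == (z*x) mod 2^w
        std::uint32_t byDiv = prod / (std::uint32_t(1) << (w - d));
        std::uint32_t byShift = prod >> (w - d);
        std::printf("x=%u  hash=%u  div==shift: %s\n",
                    (unsigned)x, (unsigned)byShift,
                    byDiv == byShift ? "yes" : "no");
    }
    return 0;
}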
The following lemma, whose proof is deferred until later in this
section, shows that multiplicative hashing does a good job of avoiding
collisions:

Lemma 5.1: Let x and y be any two values in {0,...,2^w - 1} with
x != y. Then Pr{hash(x) = hash(y)} <= 2/2^d.
With Lemma 5.1, the performance of remove(x) and find(x) is easy to
analyze: for any data value x, the expected length of the list
t[hash(x)] is at most n_x + 2, where n_x is the number of occurrences
of x in the hash table (each of the at most n elements different from x
collides with x with probability at most 2/2^d by Lemma 5.1, and
n <= t.length = 2^d), so both operations run in constant expected time.
Now, we want to prove Lemma 5.1, but first we need a result from number
theory. In the following proof, we use the notation (b_r,...,b_0)_2 to
denote the integer b_r·2^r + ... + b_1·2 + b_0, where each b_i is a
bit, either 0 or 1. In other words, (b_r,...,b_0)_2 is the integer
whose binary representation is given by b_r,...,b_0. We use ⋆ to denote
a bit of unknown value.

Lemma 5.3: Let S be the set of odd integers in {1,...,2^w - 1}, and let
q and i be any two elements of S. Then there is exactly one z in S such
that z·q mod 2^w = i.
Proof: Since the number of choices for z is the same as the number of
choices for i, it is enough to show that there is at most one z in S
with z·q mod 2^w = i. Suppose, for the sake of contradiction, that
there are two such values z and z', with z > z'. Then

    z·q mod 2^w = z'·q mod 2^w = i ,

so (z-z')·q mod 2^w = 0; that is, (z-z')·q = k·2^w for some integer k,
and the binary representation of (z-z')·q ends in at least w 0's:

    (z-z')·q = (⋆,...,⋆,0,...,0)_2   (w trailing 0's).

Furthermore z-z' != 0, since z != z'. Since q is odd, it has no
trailing 0's in its binary representation:

    q = (⋆,...,⋆,1)_2 ,

and since 0 < z-z' < 2^w, z-z' has fewer than w trailing 0's:

    z-z' = (⋆,...,⋆,1,0,...,0)_2   (fewer than w trailing 0's).

Therefore the product (z-z')·q also has fewer than w trailing 0's in
its binary representation, so it cannot be a multiple of 2^w, yielding
a contradiction and completing the proof.
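Lemma 5.3 is easy to check exhaustively for a small word size. The
following sketch (w = 8 is an arbitrary choice, not from the book)
confirms that, for every pair of odd values q and i, exactly one odd z
satisfies z·q mod 2^w = i:

#include <cstdio>

// Exhaustive check of Lemma 5.3 for w = 8 (illustrative only):
// for all odd q and odd i in {1,...,2^w - 1}, exactly one odd z
// satisfies z*q mod 2^w == i.
int main() {
    const int w = 8, M = 1 << w;
    bool ok = true;
    for (int q = 1; q < M; q += 2)
        for (int i = 1; i < M; i += 2) {
            int count = 0;
            for (int z = 1; z < M; z += 2)
                if ((z * q) % M == i) count++;
            if (count != 1) ok = false;
        }
    std::printf("Lemma 5.3 holds for w=8: %s\n", ok ? "yes" : "no");
    return 0;
}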
The utility of Lemma 5.3 comes from the following observation: if z is
chosen uniformly at random from S, then for any fixed q in S, the
product z·q mod 2^w is uniformly distributed over S. In the following
proof, it helps to think of the binary representation of z, which
consists of w-1 random bits followed by a 1.
Proof of Lemma 5.1: The condition hash(x) = hash(y) says exactly that
the highest-order d bits of z·x mod 2^w and of z·y mod 2^w agree. A
necessary condition for this is that the highest-order d bits of
z·(x-y) mod 2^w are either all 0's (when z·x mod 2^w > z·y mod 2^w) or
all 1's (when z·x mod 2^w < z·y mod 2^w), so it suffices to bound the
probability of these two patterns.

Let q be the unique odd integer such that (x-y) mod 2^w = q·2^r for
some integer r >= 0. By Lemma 5.3, the binary representation of
z·q mod 2^w has w-1 random bits, followed by a 1; therefore the binary
representation of z·(x-y) mod 2^w = z·q·2^r mod 2^w has w-r-1 random
bits, followed by a 1, followed by r 0's:

    z·(x-y) mod 2^w = (b_{w-r-1},...,b_1,1,0,...,0)_2   (r trailing 0's).

If r > w-d, then the highest-order d bits contain both 0's and 1's, so
the probability of either pattern is 0. If r = w-d, then the
probability of the all-0's pattern is 0, but the probability of the
all-1's pattern (which requires b_1,...,b_{d-1} to all equal 1) is
1/2^(d-1) = 2/2^d. If r < w-d, then the highest-order d bits are all
random, so each pattern occurs with probability 1/2^d; the two patterns
are mutually exclusive, so the probability of one or the other is
2/2^d. In every case, Pr{hash(x) = hash(y)} <= 2/2^d.
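As an empirical sanity check of this bound (illustrative only; the
particular pair x, y and the parameters w = 32 and d = 8 are arbitrary
choices, not from the book), the following program estimates
Pr{hash(x) = hash(y)} by sampling random odd multipliers z and compares
the estimate with 2/2^d:

#include <cstdint>
#include <cstdio>
#include <random>

// Illustrative only: estimate the collision probability of multiplicative
// hashing for one fixed pair x != y over random odd multipliers z, and
// compare it with the 2/2^d bound of Lemma 5.1.
int main() {
    const int w = 32, d = 8;
    const std::uint32_t x = 123456789u, y = 987654321u;
    const int trials = 1000000;
    std::mt19937 rng(7);
    int collisions = 0;
    for (int i = 0; i < trials; i++) {
        std::uint32_t z = rng() | 1;             // random odd z
        if (((z * x) >> (w - d)) == ((z * y) >> (w - d))) collisions++;
    }
    std::printf("estimated Pr = %.6f, bound 2/2^d = %.6f\n",
                (double)collisions / trials, 2.0 / (1 << d));
    return 0;
}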
The following theorem summarizes the performance of the ChainedHashTable
data structure:

Theorem: A ChainedHashTable implements the USet interface. Ignoring the
cost of calls to resize(), a ChainedHashTable supports the operations
add(x), remove(x), and find(x) in O(1) expected time per operation.
Furthermore, beginning with an empty ChainedHashTable, any sequence of
m add(x) and remove(x) operations results in a total of O(m) time spent
during all calls to resize().
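Putting the pieces together, here is a minimal self-contained sketch of
hashing with chaining combined with multiplicative hashing for 32-bit
unsigned keys. The class name ChainedHashTableSketch and its use of
std::vector and std::list are assumptions for illustration; the book's
ChainedHashTable is generic and built on its own array and list types.

#include <cstdint>
#include <cstdio>
#include <list>
#include <random>
#include <vector>

// A minimal sketch of hashing with chaining plus multiplicative hashing
// for 32-bit unsigned keys. Illustrative only.
class ChainedHashTableSketch {
    std::vector<std::list<std::uint32_t>> t;  // table of chains
    int d;                                    // dimension: t.size() == 2^d
    int n = 0;                                // number of stored items
    std::uint32_t z;                          // random odd multiplier
    static constexpr int w = 32;

    int hash(std::uint32_t x) const { return (z * x) >> (w - d); }

    void resize() {                           // double the table, rehash all
        d++;
        std::vector<std::list<std::uint32_t>> nt(std::size_t(1) << d);
        for (auto &lst : t)
            for (std::uint32_t x : lst)
                nt[hash(x)].push_back(x);     // hash() already uses the new d
        t.swap(nt);
    }

public:
    ChainedHashTableSketch() : t(2), d(1) {
        std::mt19937 rng(std::random_device{}());
        z = rng() | 1;                        // choose a random odd z
    }

    bool find(std::uint32_t x) const {
        for (std::uint32_t y : t[hash(x)]) if (y == x) return true;
        return false;
    }

    bool add(std::uint32_t x) {
        if (find(x)) return false;            // keep elements distinct
        if (n + 1 > (int)t.size()) resize();  // maintain n <= t.size()
        t[hash(x)].push_back(x);
        n++;
        return true;
    }

    bool remove(std::uint32_t x) {
        auto &lst = t[hash(x)];
        for (auto it = lst.begin(); it != lst.end(); ++it)
            if (*it == x) { lst.erase(it); n--; return true; }
        return false;
    }
};

int main() {
    ChainedHashTableSketch s;
    for (std::uint32_t x = 1; x <= 100; x++) s.add(x * 2654435761u);
    std::printf("find: %d\n", (int)s.find(3u * 2654435761u));
    std::printf("remove: %d, find after remove: %d\n",
                (int)s.remove(3u * 2654435761u), (int)s.find(3u * 2654435761u));
    return 0;
}

The sketch preserves the invariant n <= t.size() by doubling the table
inside add before inserting, mirroring the resize() call in the add(x)
code above.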