5.2 $\mathtt{LinearHashTable}$ : Linear Probing

The $\mathtt{ChainedHashTable}$ data structure uses an array of lists, where the $\mathtt{i}$ th list stores all elements $\mathtt{x}$ such that $\mathtt{hash(x)}=\mathtt{i}$ . An alternative, called open addressing, is to store the elements directly in an array, $\mathtt{t}$ , with each array location in $\mathtt{t}$ storing at most one value. This approach is taken by the $\mathtt{LinearHashTable}$ described in this section. In some places, this data structure is described as open addressing with linear probing.

The main idea behind a $\mathtt{LinearHashTable}$ is that we would, ideally, like to store the element $\mathtt{x}$ with hash value $\mathtt{i=hash(x)}$ in the table location $\mathtt{t[i]}$ . If we cannot do this (because some element is already stored there) then we try to store it at location $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+1)\bmod\ensuremath{\mathtt{t.length}}]$ ; if that's not possible, then we try $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+2)\bmod\ensuremath{\mathtt{t.length}}]$ , and so on, until we find a place for $\mathtt{x}$ .

To summarize, a $\mathtt{LinearHashTable}$ contains an array, $\mathtt{t}$ , that stores data elements, and integers $\mathtt{n}$ and $\mathtt{q}$ that keep track of the number of data elements and non- $\mathtt{null}$ values of $\mathtt{t}$ , respectively. Here a non- $\mathtt{null}$ value is either a data element or a $\mathtt{del}$ marker left behind by $\mathtt{remove(x)}$ , so $\mathtt{q}$ equals $\mathtt{n}$ plus the number of $\mathtt{del}$ values in $\mathtt{t}$ . Because many hash functions only work for table sizes that are a power of 2, we also keep an integer $\mathtt{d}$ and maintain the invariant that $\mathtt{t.length}=2^{\mathtt{d}}$ .

The $\mathtt{find(x)}$ operation in a $\mathtt{LinearHashTable}$ is simple. We start at array entry $\mathtt{t[i]}$ where $\ensuremath{\mathtt{i}}=\ensuremath{\mathtt{hash(x)}}$ and search entries $\mathtt{t[i]}$ , $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+1)\bmod\ensuremath{\mathtt{t.length}}]$ , $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+2)\bmod\ensuremath{\mathtt{t.length}}]$ , and so on, until we find an index $\mathtt{i'}$ such that, either, $\mathtt{t[i']=x}$ , or $\mathtt{t[i']=null}$ . In the former case we return $\mathtt{t[i']}$ . In the latter case, we conclude that $\mathtt{x}$ is not contained in the hash table and return $\mathtt{null}$ .

The $\mathtt{add(x)}$ operation is also fairly easy to implement. After checking that $\mathtt{x}$ is not already stored in the table (using $\mathtt{find(x)}$ ), we search $\mathtt{t[i]}$ , $\mathtt{t}[(\mathtt{i}+1)\bmod\mathtt{t.length}]$ , $\mathtt{t}[(\mathtt{i}+2)\bmod\mathtt{t.length}]$ , and so on, until we find a $\mathtt{null}$ or $\mathtt{del}$ . We store $\mathtt{x}$ at that location and increment $\mathtt{n}$ ; we also increment $\mathtt{q}$ if the location previously held $\mathtt{null}$ rather than $\mathtt{del}$ .

By now, the implementation of the $\mathtt{remove(x)}$ operation should be obvious. We search $\mathtt{t[i]}$ , $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+1)\bmod\ensuremath{\mathtt{t.length}}]$ , $\ensuremath{\mathtt{t}}[(\ensuremath{\mathtt{i}}+2)\bmod\ensuremath{\mathtt{t.length}}]$ , and so on until we find an index $\mathtt{i'}$ such that $\mathtt{t[i']=x}$ or $\mathtt{t[i']=null}$ . In the former case, we set $\mathtt{t[i']=del}$ and return $\mathtt{true}$ . In the latter case we conclude that $\mathtt{x}$ was not stored in the table (and therefore cannot be deleted) and return $\mathtt{false}$ .

The correctness of the $\mathtt{find(x)}$ , $\mathtt{add(x)}$ , and $\mathtt{remove(x)}$ methods is easy to verify, though it relies on the use of $\mathtt{del}$ values. Notice that none of these operations ever sets a non- $\mathtt{null}$ entry to $\mathtt{null}$ . Therefore, when we reach an index $\mathtt{i'}$ such that $\mathtt{t[i']=null}$ , this is a proof that the element, $\mathtt{x}$ , that we are searching for is not stored in the table; $\mathtt{t[i']}$ has always been $\mathtt{null}$ , so there is no reason that a previous $\mathtt{add(x)}$ operation would have proceeded beyond index $\mathtt{i'}$ .
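The role of $\mathtt{del}$ in this argument can be illustrated with a small sketch. The class name `DelMarkerDemo`, the helper `contains`, the table size 8, and the premise that both keys hash to index 3 are assumptions made only for this illustration; they are not part of the data structure described above.

```java
final class DelMarkerDemo {
    // Stand-in for the text's del value: a unique object that equals no key.
    static final Object DEL = new Object();

    // Linear-probing search that starts at a caller-supplied hash value i
    // and stops at the first null entry.
    static boolean contains(Object[] t, int i, Object x) {
        while (t[i] != null) {
            if (t[i] != DEL && t[i].equals(x)) return true;
            i = (i + 1) % t.length;
        }
        return false;
    }

    // Returns { result when "a" is removed with DEL,
    //           result when "a" is removed by writing null }.
    static boolean[] run() {
        // Assumption: "a" and "b" both hash to index 3, so "b" was pushed to index 4.
        Object[] withDel = new Object[8];
        withDel[3] = "a";
        withDel[4] = "b";
        Object[] withNull = withDel.clone();

        withDel[3] = DEL;    // removal as described in the text
        withNull[3] = null;  // naive removal that cuts the probe sequence

        return new boolean[] { contains(withDel, 3, "b"), contains(withNull, 3, "b") };
    }
}
```

With the $\mathtt{del}$ marker the search for `"b"` walks past index 3 and succeeds; with the naive removal it stops at index 3 and wrongly reports that `"b"` is absent.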

The $\mathtt{resize()}$ method is called by $\mathtt{add(x)}$ when the number of non- $\mathtt{null}$ entries exceeds $\ensuremath{\mathtt{t.length}}/2$ or by $\mathtt{remove(x)}$ when the number of data entries is less than $\mathtt{t.length/8}$ . The $\mathtt{resize()}$ method works like the $\mathtt{resize()}$ methods in other array-based data structures. We find the smallest non-negative integer $\mathtt{d}$ such that $2^{\ensuremath{\mathtt{d}}} \ge 3\ensuremath{\mathtt{n}}$ . We reallocate the array $\mathtt{t}$ so that it has size $2^{\ensuremath{\mathtt{d}}}$ , and then we insert all the elements in the old version of $\mathtt{t}$ into the newly-resized copy of $\mathtt{t}$ . While doing this, we reset $\mathtt{q}$ equal to $\mathtt{n}$ since the newly-allocated $\mathtt{t}$ contains no $\mathtt{del}$ values.
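The operations described so far can be sketched in Java roughly as follows. This is a minimal sketch, not a definitive implementation: the `Object` sentinel `DEL`, the `size()` helper, and the use of multiplicative hashing with the constant `0x9E3779B9` are assumptions made for this example (the text leaves the hash function open at this point), and `null` keys are not supported.

```java
final class LinearHashTable<T> {
    private static final Object DEL = new Object(); // the text's del value

    private Object[] t = new Object[2]; // invariant: t.length == 2^d
    private int d = 1;
    private int n = 0; // number of data elements
    private int q = 0; // number of non-null entries (data elements plus DEL)

    // Assumption: multiplicative hashing stands in for hash(x); keeps the top d bits.
    private int hash(Object x) {
        return (x.hashCode() * 0x9E3779B9) >>> (32 - d);
    }

    @SuppressWarnings("unchecked")
    T find(T x) {
        int i = hash(x);
        while (t[i] != null) {                       // stop at the first null
            if (t[i] != DEL && x.equals(t[i])) return (T) t[i];
            i = (i + 1) & (t.length - 1);            // (i + 1) mod t.length
        }
        return null;
    }

    boolean add(T x) {
        if (find(x) != null) return false;           // already present
        if (2 * (q + 1) > t.length) resize();        // keep q <= t.length / 2
        int i = hash(x);
        while (t[i] != null && t[i] != DEL) i = (i + 1) & (t.length - 1);
        if (t[i] == null) q++;                       // reusing a DEL slot leaves q unchanged
        n++;
        t[i] = x;
        return true;
    }

    boolean remove(T x) {
        int i = hash(x);
        while (t[i] != null) {
            if (t[i] != DEL && x.equals(t[i])) {
                t[i] = DEL;                          // never reset an entry to null
                n--;
                if (8 * n < t.length) resize();
                return true;
            }
            i = (i + 1) & (t.length - 1);
        }
        return false;
    }

    private void resize() {
        d = 1;
        while ((1 << d) < 3 * n) d++;                // smallest d >= 1 with 2^d >= 3n
        Object[] old = t;
        t = new Object[1 << d];
        q = n;                                       // the new array holds no DEL values
        for (Object y : old) {
            if (y != null && y != DEL) {
                int i = hash(y);
                while (t[i] != null) i = (i + 1) & (t.length - 1);
                t[i] = y;
            }
        }
    }

    int size() { return n; }
}
```

The sketch keeps $\mathtt{d}\ge 1$ so that the shift in `hash` is always by fewer than 32 bits, and it uses a bit mask in place of $\bmod$ because $\mathtt{t.length}$ is a power of 2.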

5.2.1 Analysis of Linear Probing

Notice that each operation, $\mathtt{add(x)}$ , $\mathtt{remove(x)}$ , or $\mathtt{find(x)}$ , finishes as soon as (or before) it discovers the first $\mathtt{null}$ entry in $\mathtt{t}$ . The intuition behind the analysis of linear probing is that, since at least half the elements in $\mathtt{t}$ are equal to $\mathtt{null}$ , an operation should not take long to complete because it will very quickly come across a $\mathtt{null}$ entry. We shouldn't rely too heavily on this intuition, though, because it would lead us to (the incorrect) conclusion that the expected number of locations in $\mathtt{t}$ examined by an operation is at most 2.
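A quick experiment can show why this intuition should not be trusted. The sketch below is only an illustration under the idealized assumption of uniform random hash values; the class name `ProbeCountDemo`, the method `averageProbes`, and the chosen table size and seed are hypothetical. It fills half of a table by linear probing and then averages, over every possible starting index, the number of locations an unsuccessful $\mathtt{find(x)}$ would examine.

```java
import java.util.Random;

final class ProbeCountDemo {
    // Fills half of a table of size 2^d using linear probing with uniformly
    // random hash values, then returns the average number of locations
    // examined by an unsuccessful search, taken over all start positions.
    static double averageProbes(int d, long seed) {
        int length = 1 << d;
        boolean[] used = new boolean[length];
        Random rng = new Random(seed);
        for (int j = 0; j < length / 2; j++) {
            int i = rng.nextInt(length);               // idealized uniform hash value
            while (used[i]) i = (i + 1) & (length - 1);
            used[i] = true;
        }
        long total = 0;
        for (int start = 0; start < length; start++) {
            int i = start;
            int probes = 1;                            // the final null entry is examined too
            while (used[i]) {
                i = (i + 1) & (length - 1);
                probes++;
            }
            total += probes;
        }
        return (double) total / length;
    }

    public static void main(String[] args) {
        System.out.println(averageProbes(16, 42L));
    }
}
```

When the table is exactly half full, the average typically comes out noticeably above 2 (the classical analysis of linear probing gives roughly 2.5 at this load), because long runs are more likely to be hit and are more expensive to traverse.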

For the rest of this section, we will assume that all hash values are independently and uniformly distributed in $\{0,\ldots,\ensuremath{\mathtt{t.length}}-1\}$ . This is not a realistic assumption, but it will make it possible for us to analyze linear probing. Later in this section we will describe a method, called tabulation hashing, that produces a hash function that is ``good enough'' for linear probing. We will also assume that all indices into the positions of $\mathtt{t}$ are taken modulo $\mathtt{t.length}$ , so that $\mathtt{t[i]}$ is really a shorthand for $\ensuremath{\mathtt{t}}[\ensuremath{\mathtt{i}}\bmod\ensuremath{\mathtt{t.length}}]$ .

We say that a run of length $k$ that starts at $\mathtt{i}$ occurs when all the table entries $\mathtt{t[i]}, \mathtt{t[i+1]},\ldots,\mathtt{t}[\mathtt{i}+k-1]$ are non- $\mathtt{null}$ and $\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}$ . The number of non- $\mathtt{null}$ elements of $\mathtt{t}$ is exactly $\mathtt{q}$ and the $\mathtt{add(x)}$ method ensures that, at all times, $\mathtt{q}\le\mathtt{t.length}/2$ . There are $\mathtt{q}$ elements $\mathtt{x}_1,\ldots,\mathtt{x}_{\mathtt{q}}$ that have been inserted into $\mathtt{t}$ since the last $\mathtt{resize()}$ operation. By our assumption, each of these has a hash value, $\mathtt{hash}(\mathtt{x}_j)$ , that is uniform and independent of the rest. With this setup, we can prove the main lemma required to analyze linear probing.

Lemma 5.4 Fix a value $\mathtt{i}\in\{0,\ldots,\mathtt{t.length}-1\}$ . Then the probability that a run of length $k$ starts at $\mathtt{i}$ is $O(c^k)$ for some constant $0<c<1$ .

Proof. If a run of length $k$ starts at $\mathtt{i}$ , then there are exactly $k$ elements $\mathtt{x}_j$ such that $\mathtt{hash}(\mathtt{x}_j)\in\{\mathtt{i},\ldots,\mathtt{i}+k-1\}$ . The probability that this occurs is exactly

$\displaystyle p_k = \binom{\mathtt{q}}{k}\left(\frac{k}{\mathtt{t.length}}\right)^k\left(\frac{\mathtt{t.length}-k}{\mathtt{t.length}}\right)^{\mathtt{q}-k} \enspace ,$

since, for each choice of $k$ elements, these $k$ elements must hash to one of the $k$ locations and the remaining $\mathtt{q}-k$ elements must hash to the other $\mathtt{t.length}-k$ table locations. (Note that $p_k$ is only an upper bound on the probability that a run of length $k$ starts at $\mathtt{i}$ , since it ignores the requirement that $\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}$ .)

In the following derivation we will cheat a little and replace $r!$ with $(r/e)^r$ . Stirling's Approximation (Section 1.3.2) shows that this is only a factor of $O(\sqrt{r})$ from the truth. This is just done to make the derivation simpler; Exercise 5.4 asks the reader to redo the calculation more rigorously using Stirling's Approximation in its entirety.

The value of $p_k$ is maximized when $\mathtt{t.length}$ is minimum, and the data structure maintains the invariant that $\mathtt{t.length}\ge 2\mathtt{q}$ , so

$\displaystyle p_k$	$\displaystyle \le \binom{\mathtt{q}}{k}\left(\frac{k}{2\mathtt{q}}\right)^k\left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}$
	$\displaystyle = \left(\frac{\mathtt{q}!}{(\mathtt{q}-k)!\,k!}\right)\left(\frac{k}{2\mathtt{q}}\right)^k\left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}$
	$\displaystyle \approx \left(\frac{\mathtt{q}^{\mathtt{q}}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k}\right)\left(\frac{k}{2\mathtt{q}}\right)^k\left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}$	[Stirling's approximation]
	$\displaystyle = \left(\frac{\mathtt{q}^{k}\mathtt{q}^{\mathtt{q}-k}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k}\right)\left(\frac{k}{2\mathtt{q}}\right)^k\left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}$
	$\displaystyle = \left(\frac{\mathtt{q}k}{2\mathtt{q}k}\right)^k\left(\frac{\mathtt{q}(2\mathtt{q}-k)}{2\mathtt{q}(\mathtt{q}-k)}\right)^{\mathtt{q}-k}$
	$\displaystyle = \left(\frac{1}{2}\right)^k \left(\frac{(2\mathtt{q}-k)}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}$
	$\displaystyle = \left(\frac{1}{2}\right)^k \left(1+\frac{k}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}$
	$\displaystyle \le \left(\frac{\sqrt{e}}{2}\right)^k \enspace .$

(In the last step, we use the inequality $(1+1/x)^x \le e$ , which holds for all $x>0$ .) Since $\sqrt{e}/{2}< 0.824360636 < 1$ , this completes the proof. $\qedsymbol$
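The last step of the derivation can be spelled out as follows. This is only a worked restatement, assuming $0<k<\mathtt{q}$ ; the symbol $x$ is the one from the inequality $(1+1/x)^x\le e$ , instantiated here as $x=2(\mathtt{q}-k)/k$ , so that $x\cdot k/2=\mathtt{q}-k$ :

$\displaystyle \left(1+\frac{k}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k} = \left(\left(1+\frac{1}{x}\right)^{x}\right)^{k/2} \le e^{k/2} = \left(\sqrt{e}\right)^k \enspace .$

Multiplying by $(1/2)^k$ gives the stated bound $(\sqrt{e}/2)^k$ .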

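Using Lemma 5.4 to prove upper-bounds on the expected running time of $\mathtt{find(x)}$ , $\mathtt{add(x)}$ , and $\mathtt{remove(x)}$ is now fairly straightforward. Consider the simplest case, where we execute $\mathtt{find(x)}$ for some value $\mathtt{x}$ that has never been stored in the $\mathtt{LinearHashTable}$ . In this case, $\mathtt{i}=\mathtt{hash(x)}$ is a random value in $\{0,\ldots,\mathtt{t.length}-1\}$ independent of the contents of $\mathtt{t}$ . If $\mathtt{i}$ is part of a run of length $k$ , then the time it takes to execute the $\mathtt{find(x)}$ operation is at most $O(1+k)$ . Thus, the expected running time can be upper-bounded by $O\left(1 + \left(\frac{1}{\mathtt{t.length}}\right)\sum_{i=1}^{\mathtt{t.length}}\sum_{k=0}^{\infty} k\Pr\{\mbox{$\mathtt{i}$ is part of a run of length $k$}\}\right)$ . Each run of length $k$ contributes to the inner sum $k$ times, for a total contribution of $k^2$ , so this bound is at most $O\left(1+\sum_{k=0}^{\infty}k^2p_k\right)$ , which is $O(1)$ because Lemma 5.4 gives $p_k=O(c^k)$ for a constant $0<c<1$ .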

If we ignore the cost of the $\mathtt{resize()}$ operation, then the above analysis gives us all we need to analyze the cost of operations on a $\mathtt{LinearHashTable}$ .

First of all, the analysis of $\mathtt{find(x)}$ given above applies to the $\mathtt{add(x)}$ operation when $\mathtt{x}$ is not contained in the table. To analyze the $\mathtt{find(x)}$ operation when $\mathtt{x}$ is contained in the table, we need only note that this is the same as the cost of the $\mathtt{add(x)}$ operation that previously added $\mathtt{x}$ to the table. Finally, the cost of a $\mathtt{remove(x)}$ operation is the same as the cost of a $\mathtt{find(x)}$ operation.

In summary, if we ignore the cost of calls to $\mathtt{resize()}$ , all operations on a $\mathtt{LinearHashTable}$ run in $O(1)$ expected time. Accounting for the cost of $\mathtt{resize()}$ can be done using the same type of amortized analysis performed for the $\mathtt{ArrayStack}$ data structure in Section 2.1.

5.2.2 Summary

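The following theorem summarizes the performance of the $\mathtt{LinearHashTable}$ data structure:

Theorem 5.2 Ignoring the cost of calls to $\mathtt{resize()}$ , a $\mathtt{LinearHashTable}$ supports the operations $\mathtt{add(x)}$ , $\mathtt{remove(x)}$ , and $\mathtt{find(x)}$ in $O(1)$ expected time per operation. Furthermore, beginning with an empty $\mathtt{LinearHashTable}$ , any sequence of $m$ $\mathtt{add(x)}$ and $\mathtt{remove(x)}$ operations results in a total of $O(m)$ time spent during all calls to $\mathtt{resize()}$ .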

5.2.3 Tabulation Hashing

While analyzing the $\mathtt{LinearHashTable}$ structure, we made a very strong assumption: that for any set of elements, $\{\mathtt{x_1},\ldots,\mathtt{x_n}\}$ , the hash values $\mathtt{hash(x_1)},\ldots,\mathtt{hash(x_n)}$ are independently and uniformly distributed over the set $\{0,\ldots,\mathtt{t.length}-1\}$ . One way to achieve this is to store a giant array, $\mathtt{tab}$ , of length $2^{\mathtt{w}}$ , where each entry is a random $\mathtt{w}$ -bit integer, independent of all the other entries. In this way, we could implement $\mathtt{hash(x)}$ by extracting a $\mathtt{d}$ -bit integer from $\mathtt{tab[x.hashCode()]}$ .
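The idealized scheme can be sketched in Java if the word size is kept small enough for $\mathtt{tab}$ to fit in memory. In the sketch below, the class name `IdealHashDemo`, the choice of a 16-bit word size `W`, the seeded `java.util.Random`, and the decision to keep the top $\mathtt{d}$ bits of the table entry are all assumptions made for illustration.

```java
import java.util.Random;

final class IdealHashDemo {
    static final int W = 16;             // assumed word size, small so that tab fits in memory
    final int d;                         // the hash table has 2^d slots, 1 <= d <= W
    final int[] tab = new int[1 << W];   // one random W-bit entry per possible hash code

    IdealHashDemo(int d, long seed) {
        this.d = d;
        Random rng = new Random(seed);
        for (int i = 0; i < tab.length; i++) {
            tab[i] = rng.nextInt(1 << W); // uniform in {0, ..., 2^W - 1}
        }
    }

    // Looks up the random entry for the (W-bit) hash code and keeps its top d bits.
    int hash(int hashCode) {
        return tab[hashCode & ((1 << W) - 1)] >>> (W - d);
    }
}
```

Even at 16 bits the table already has 65536 entries; at a realistic word size the same construction is out of reach.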

Unfortunately, storing an array of size $2^{\mathtt{w}}$ is prohibitive in terms of memory usage. The approach used by tabulation hashing is to, instead, treat $\mathtt{w}$ -bit integers as being composed of $\mathtt{w}/\mathtt{r}$ integers, each having only $\mathtt{r}$ bits. In this way, tabulation hashing only needs $\mathtt{w}/\mathtt{r}$ arrays each of length $2^{\mathtt{r}}$ . All the entries in these arrays are independent $\mathtt{w}$ -bit integers. To obtain the value of $\mathtt{hash(x)}$ we split $\mathtt{x.hashCode()}$ up into $\mathtt{w}/\mathtt{r}$ $\mathtt{r}$ -bit integers and use these as indices into these arrays. We then combine all these values with the bitwise exclusive-or operator to obtain $\mathtt{hash(x)}$ . For example, when $\mathtt{w}=32$ and $\mathtt{r}=4$ , this uses 8 arrays, each of length $2^4=16$ .
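A minimal Java sketch of tabulation hashing with $\mathtt{w}=32$ and $\mathtt{r}=4$ might look as follows. The class name `TabulationHash`, the seeded `java.util.Random` used to fill the arrays, and the choice to keep the top $\mathtt{d}$ bits of the combined value are assumptions for this example rather than details fixed by the text.

```java
import java.util.Random;

final class TabulationHash {
    static final int W = 32;   // bits in a hash code
    static final int R = 4;    // bits per chunk
    final int d;               // the result has d bits, 1 <= d <= 32
    final int[][] tab = new int[W / R][1 << R]; // 8 arrays of 16 random 32-bit integers

    TabulationHash(int d, long seed) {
        this.d = d;
        Random rng = new Random(seed);
        for (int[] row : tab) {
            for (int j = 0; j < row.length; j++) row[j] = rng.nextInt();
        }
    }

    int hash(Object x) {
        int h = x.hashCode();
        int v = 0;
        for (int k = 0; k < W / R; k++) {
            int chunk = (h >>> (k * R)) & ((1 << R) - 1); // k-th R-bit piece of h
            v ^= tab[k][chunk];                           // combine with exclusive-or
        }
        return v >>> (W - d);                             // keep the top d bits
    }
}
```

The whole structure stores only $8\cdot 16=128$ random integers, compared with the $2^{32}$ entries the single giant array would need.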

One can easily verify that, for any $\mathtt{x}$ , $\mathtt{hash(x)}$ is uniformly distributed over $\{0,\ldots,2^{\ensuremath{\mathtt{d}}}-1\}$ . With a little work, one can even verify that any pair of values have independent hash values. This implies tabulation hashing could be used in place of multiplicative hashing for the $\mathtt{ChainedHashTable}$ implementation.

However, it is not true that any set of $\mathtt{n}$ distinct values gives a set of $\mathtt{n}$ independent hash values. Nevertheless, when tabulation hashing is used, the bound of Theorem 5.2 still holds. References for this are provided at the end of this chapter.

Written out in full, the bound on the expected running time of $\mathtt{find(x)}$ from Section 5.2.1 is:

	$\displaystyle { } O\left(1 + \left(\frac{1}{\mathtt{t.length}}\right)\sum_{i=1}^{\mathtt{t.length}}\sum_{k=0}^{\infty} k^2\Pr\{\mbox{$\mathtt{i}$ starts a run of length $k$}\}\right)$
	$\displaystyle \le O\left(1 + \left(\frac{1}{\mathtt{t.length}}\right)\sum_{i=1}^{\mathtt{t.length}}\sum_{k=0}^{\infty} k^2p_k\right)$
	$\displaystyle = O\left(1 + \sum_{k=0}^{\infty} k^2p_k\right)$
	$\displaystyle = O\left(1 + \sum_{k=0}^{\infty} k^2\cdot O(c^k)\right)$
	$\displaystyle = O(1) \enspace .$
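The final step uses the fact that $\sum_{k} k^2c^k$ converges for any constant $0<c<1$ . One way to see this, offered here as a supporting calculation rather than as part of the original argument, is the standard closed form obtained by differentiating the geometric series twice:

$\displaystyle \sum_{k=0}^{\infty} k^2c^k = \frac{c(1+c)}{(1-c)^3} \enspace , \qquad 0<c<1 \enspace .$

The right-hand side depends only on $c$ , not on $\mathtt{n}$ or $\mathtt{t.length}$ , which is why the whole expression is $O(1)$ .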
