5.4 Discussion and Exercises

Hash tables and hash codes represent an enormous and active field of research that is just touched upon in this chapter. The online Bibliography on Hashing [10] contains nearly 2000 entries.

A variety of different hash table implementations exist. The one described in Section 5.1 is known as hashing with chaining (each array entry contains a chain ( $ \mathtt{List}$) of elements). Hashing with chaining dates back to an internal IBM memorandum authored by H. P. Luhn and dated January 1953. This memorandum also seems to be one of the earliest references to linked lists.

An alternative to hashing with chaining is that used by open addressing schemes, where all data is stored directly in an array. These schemes include the $ \mathtt{LinearHashTable}$ structure of Section 5.2. This idea was also proposed, independently, by a group at IBM in the 1950s. Open addressing schemes must deal with the problem of collision resolution: the case where two values hash to the same array location. Different strategies exist for collision resolution; these provide different performance guarantees and often require more sophisticated hash functions than the ones described here.

Yet another category of hash table implementations consists of the so-called perfect hashing methods. These are methods in which $ \mathtt{find(x)}$ operations take $ O(1)$ time in the worst case. For static data sets, this can be accomplished by finding perfect hash functions for the data; these are functions that map each piece of data to a unique array location. For data that changes over time, perfect hashing methods include FKS two-level hash tables [31,24] and cuckoo hashing [55].

The hash functions presented in this chapter are probably among the most practical methods currently known that can be proven to work well for any set of data. Other provably good methods date back to the pioneering work of Carter and Wegman, who introduced the notion of universal hashing and described several hash functions for different scenarios [14]. Tabulation hashing, described in Section 5.2.3, is due to Carter and Wegman [14], but its analysis, when applied to linear probing (and several other hash table schemes), is due to Pătrașcu and Thorup [58].

The idea of multiplicative hashing is very old and seems to be part of the hashing folklore [48, Section 6.4]. However, the idea of choosing the multiplier $ \mathtt{z}$ to be a random odd number, and the analysis in Section 5.1.1, are due to Dietzfelbinger et al. [23]. This version of multiplicative hashing is one of the simplest, but its collision probability of $ 2/2^{\ensuremath{\mathtt{d}}}$ is a factor of two larger than what one could expect with a random function from $ 2^{\ensuremath{\mathtt{w}}}\to 2^{\ensuremath{\mathtt{d}}}$. The multiply-add hashing method uses the function

$\displaystyle h(\ensuremath{\mathtt{x}}) = ((\ensuremath{\mathtt{z}}\ensuremath{\mathtt{x}} + \ensuremath{\mathtt{b}}) \bmod 2^{\ensuremath{\mathtt{2w}}}) \ddiv 2^{\ensuremath{\mathtt{2w}}-\ensuremath{\mathtt{d}}}
$

where $ \mathtt{z}$ and $ \mathtt{b}$ are each randomly chosen from $ \{0,\ldots,2^{\ensuremath{\mathtt{2w}}}-1\}$. Multiply-add hashing has a collision probability of only $ 1/2^{\ensuremath{\mathtt{d}}}$ [21], but requires $ 2\ensuremath{\mathtt{w}}$-bit precision arithmetic.
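As a rough illustration, multiply-add hashing can be sketched in C++ for the case $ \ensuremath{\mathtt{w}}=32$, where 64-bit unsigned arithmetic supplies the required $ 2\ensuremath{\mathtt{w}}$-bit precision. The struct name `MultiplyAddHash`, the choice $ \ensuremath{\mathtt{w}}=32$, and the use of `std::mt19937_64` as the source of randomness are assumptions made for this sketch and do not come from the book's code.

```cpp
#include <cstdint>
#include <random>

// Sketch of multiply-add hashing, assuming w = 32 so that 2w = 64.
// Names and the random source are illustrative, not the book's API.
struct MultiplyAddHash {
    static constexpr unsigned w = 32;
    uint64_t z;   // random 2w-bit multiplier
    uint64_t b;   // random 2w-bit additive term
    unsigned d;   // number of output bits, must satisfy 1 <= d <= w

    MultiplyAddHash(unsigned d_, std::mt19937_64 &rng)
        : z(rng()), b(rng()), d(d_) {}

    uint32_t operator()(uint32_t x) const {
        // Unsigned 64-bit overflow wraps around, which performs
        // the "mod 2^{2w}" step implicitly.
        uint64_t v = z * static_cast<uint64_t>(x) + b;
        // Keeping the top d bits is the "div 2^{2w-d}" step.
        return static_cast<uint32_t>(v >> (2 * w - d));
    }
};
```

The returned value always lies in $ \{0,\ldots,2^{\ensuremath{\mathtt{d}}}-1\}$. The sketch requires $ \ensuremath{\mathtt{d}}\ge 1$, since shifting a 64-bit value by 64 positions is undefined in C++.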

There are a number of methods of obtaining hash codes from fixed-length sequences of $ \mathtt{w}$-bit integers. One particularly fast method [11] is the function

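\begin{displaymath}\begin{array}{l}
h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}) = \left(\sum_{i=0}^{r/2-1} ((\ensuremath{\mathtt{x}}_{2i}+\ensuremath{\mathtt{a}}_{2i})\bmod 2^{\ensuremath{\mathtt{w}}})((\ensuremath{\mathtt{x}}_{2i+1}+\ensuremath{\mathtt{a}}_{2i+1})\bmod 2^{\ensuremath{\mathtt{w}}})\right) \bmod 2^{2\ensuremath{\mathtt{w}}}
\end{array}\end{displaymath}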

where $ r$ is even and $ \ensuremath{\mathtt{a}}_0,\ldots,\ensuremath{\mathtt{a}}_{r-1}$ are randomly chosen from $ \{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$. This yields a $ 2\ensuremath{\mathtt{w}}$-bit hash code that has collision probability $ 1/2^{\ensuremath{\mathtt{w}}}$. This code can be reduced to a $ \mathtt{w}$-bit hash code using multiplicative (or multiply-add) hashing. This method is fast because it requires only $ r/2$ $ 2\ensuremath{\mathtt{w}}$-bit multiplications, whereas the method described in Section 5.3.2 requires $ r$ multiplications. (The $ \bmod$ operations occur implicitly by using $ \mathtt{w}$-bit and $ 2\ensuremath{\mathtt{w}}$-bit arithmetic for the additions and multiplications, respectively.)
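A minimal C++ sketch of this pairwise scheme, assuming $ \ensuremath{\mathtt{w}}=32$, might look as follows. Each pair of inputs is offset by its random key, and the two $ \mathtt{w}$-bit sums are multiplied into a $ 2\ensuremath{\mathtt{w}}$-bit product. The function name `pairHash` and the use of `std::vector` are assumptions for illustration; the caller is assumed to supply `x` and `a` with the same even length $ r$, with `a` filled with random values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the pairwise multiply hash, assuming w = 32.
// x holds the r input words and a holds the r random keys (r even).
uint64_t pairHash(const std::vector<uint32_t> &x,
                  const std::vector<uint32_t> &a) {
    uint64_t h = 0;
    for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
        uint32_t s0 = x[i] + a[i];           // w-bit add, wraps mod 2^32
        uint32_t s1 = x[i + 1] + a[i + 1];   // w-bit add, wraps mod 2^32
        // One 2w-bit multiplication per pair; the running sum
        // wraps mod 2^64 through unsigned overflow.
        h += static_cast<uint64_t>(s0) * s1;
    }
    return h;
}
```

The loop performs $ r/2$ multiplications and no explicit $ \bmod$ operations, which is the source of the speed advantage described above.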

The method from Section 5.3.3 of using polynomials over prime fields to hash variable-length arrays and strings is due to Dietzfelbinger et al. [22]. Because it uses the $ \bmod$ operator, which relies on a costly machine instruction, it is, unfortunately, not very fast. Some variants of this method choose the prime $ \mathtt{p}$ to be one of the form $ 2^{\ensuremath{\mathtt{w}}}-1$, in which case the $ \bmod$ operator can be replaced with addition ( $ \mathtt{+}$) and bitwise-and ( $ \mathtt{\&}$) operations [47, Section 3.6]. Another option is to apply one of the fast methods for fixed-length strings to blocks of length $ c$ for some constant $ c>1$ and then apply the prime field method to the resulting sequence of $ \lceil r/c\rceil$ hash codes.
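The reduction modulo a prime of the form $ 2^{\ensuremath{\mathtt{w}}}-1$ can be sketched in C++ as follows. The function name `modMersenne` and the default $ \ensuremath{\mathtt{w}}=31$ (for which $ 2^{31}-1$ is prime) are assumptions chosen for illustration. Besides addition and bitwise-and, this sketch also uses a right shift to extract the high-order bits.

```cpp
#include <cstdint>

// Sketch: compute x mod (2^w - 1) without a division instruction.
// Assumes 1 <= w < 64; the default w = 31 gives the prime 2^31 - 1.
uint64_t modMersenne(uint64_t x, unsigned w = 31) {
    const uint64_t p = (uint64_t{1} << w) - 1;   // also the low-bits mask
    while (x > p) {
        // 2^w is congruent to 1 modulo p, so the high bits can be
        // folded onto the low w bits without changing the residue.
        x = (x & p) + (x >> w);
    }
    return (x == p) ? 0 : x;   // p itself is congruent to 0
}
```

Each pass through the loop strictly decreases `x`, so the loop terminates after a small number of iterations for any 64-bit input.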

Exercise 5.1   A certain university assigns each of its students student numbers the first time they register for any course. These numbers are sequential integers that started at 0 many years ago and are now in the millions. Suppose we have a class of one hundred first-year students and we want to assign them hash codes based on their student numbers. Does it make more sense to use the first two digits or the last two digits of their student number? Justify your answer.

Exercise 5.2   Consider the hashing scheme in Section 5.1.1, and suppose $ \ensuremath{\mathtt{n}}=2^{\ensuremath{\mathtt{d}}}$ and $ \ensuremath{\mathtt{d}}\le \ensuremath{\mathtt{w}}/2$.
  1. Show that, for any choice of the multiplier, $ \mathtt{z}$, there exist $ \mathtt{n}$ values that all have the same hash code. (Hint: This is easy, and doesn't require any number theory.)
  2. Given the multiplier, $ \mathtt{z}$, describe $ \mathtt{n}$ values that all have the same hash code. (Hint: This is harder, and requires some basic number theory.)

Exercise 5.3   Prove that the bound $ 2/2^{\ensuremath{\mathtt{d}}}$ in Lemma 5.1 is the best possible bound by showing that, if $ \ensuremath{\mathtt{x}}=2^{\ensuremath{\mathtt{w}}-\ensuremath{\mathtt{d}}-2}$ and $ \ensuremath{\mathtt{y}}=3\ensuremath{\mathtt{x}}$, then $ \Pr\{\ensuremath{\mathtt{hash(x)}}=\ensuremath{\mathtt{hash(y)}}\}=2/2^{\ensuremath{\mathtt{d}}}$. (Hint: Look at the binary representations of $ \ensuremath{\mathtt{zx}}$ and $ \ensuremath{\mathtt{z}}3\ensuremath{\mathtt{x}}$ and use the fact that $ \ensuremath{\mathtt{z}}3\ensuremath{\mathtt{x}} = \ensuremath{\mathtt{zx}}+2\ensuremath{\mathtt{zx}}$.)

Exercise 5.4   Reprove Lemma 5.4 using the full version of Stirling's Approximation given in Section 1.3.2.

Exercise 5.5   Consider the following simplified version of the code for adding an element $ \mathtt{x}$ to a $ \mathtt{LinearHashTable}$, which simply stores $ \mathtt{x}$ in the first $ \mathtt{null}$ array entry it finds. Explain why this could be very slow by giving an example of a sequence of $ O(\ensuremath{\mathtt{n}})$ $ \mathtt{add(x)}$, $ \mathtt{remove(x)}$, and $ \mathtt{find(x)}$ operations that would take on the order of $ \ensuremath{\mathtt{n}}^2$ time to execute.
  bool addSlow(T x) {
    if (2*(q+1) > t.length) resize();   // max 50% occupancy
    int i = hash(x);
    while (t[i] != null) {
        if (t[i] != del && x.equals(t[i])) return false;
        i = (i == t.length-1) ? 0 : i + 1; // increment i
    }
    t[i] = x;
    n++; q++;
    return true;
  }

Exercise 5.6   Early versions of the Java $ \mathtt{hashCode()}$ method for the $ \mathtt{String}$ class worked by not using all of the characters found in long strings. For example, for a sixteen-character string, the hash code was computed using only the eight even-indexed characters. Explain why this was a very bad idea by giving an example of a large set of strings that all have the same hash code.

Exercise 5.7   Suppose you have an object made up of two $ \mathtt{w}$-bit integers, $ \mathtt{x}$ and $ \mathtt{y}$. Show why $ \ensuremath{\mathtt{x}}\oplus\ensuremath{\mathtt{y}}$ does not make a good hash code for your object. Give an example of a large set of objects that would all have hash code 0.

Exercise 5.8   Suppose you have an object made up of two $ \mathtt{w}$-bit integers, $ \mathtt{x}$ and $ \mathtt{y}$. Show why $ \ensuremath{\mathtt{x}}+\ensuremath{\mathtt{y}}$ does not make a good hash code for your object. Give an example of a large set of objects that would all have the same hash code.

Exercise 5.9   Suppose you have an object made up of two $ \mathtt{w}$-bit integers, $ \mathtt{x}$ and $ \mathtt{y}$. Suppose that the hash code for your object is defined by some deterministic function $ h(\ensuremath{\mathtt{x}},\ensuremath{\mathtt{y}})$ that produces a single $ \mathtt{w}$-bit integer. Prove that there exists a large set of objects that have the same hash code.

Exercise 5.10   Let $ p=2^{\ensuremath{\mathtt{w}}}-1$ for some positive integer $ \mathtt{w}$. Explain why, for a positive integer $ x$,

$\displaystyle (x\bmod 2^{\ensuremath{\mathtt{w}}}) + (x\ddiv 2^{\ensuremath{\mathtt{w}}}) \equiv x \bmod (2^{\ensuremath{\mathtt{w}}}-1) \enspace .
$

(This gives an algorithm for computing $ x \bmod (2^{\ensuremath{\mathtt{w}}}-1)$ by repeatedly setting

$\displaystyle \ensuremath{\mathtt{x = (x\&((1<<w)-1)) + (x>>w)}}
$

until $ \ensuremath{\mathtt{x}} \le 2^{\ensuremath{\mathtt{w}}}-1$.)

Exercise 5.11   Find some commonly used hash table implementation, such as the C++ STL $ \mathtt{unordered\_map}$ or the $ \mathtt{HashTable}$ or $ \mathtt{LinearHashTable}$ implementations in this book, and design a program that stores integers in this data structure so that there are integers, $ \mathtt{x}$, such that $ \mathtt{find(x)}$ takes linear time. That is, find a set of $ \mathtt{n}$ integers for which there are $ c\ensuremath{\mathtt{n}}$ elements that hash to the same table location.

Depending on how good the implementation is, you may be able to do this just by inspecting the code for the implementation, or you may have to write some code that does trial insertions and searches, timing how long it takes to add and find particular values. (This can be, and has been, used to launch denial of service attacks on web servers [17].)

opendatastructures.org