A data structure uses hashing with chaining to store data as an array, , of lists. An integer, , keeps track of the total number of items in all lists (see Figure 5.1):
array<List> t; int n;The hash value of a data item , denoted is a value in the range . All items with hash value are stored in the list at . To ensure that lists don't get too long, we maintain the invariant
To add an element, , to the hash table, we first check if the length of needs to be increased and, if so, we grow . With this out of the way we hash to get an integer, , in the range and we append to the list :
bool add(T x) { if (find(x) != null) return false; if (n+1 > t.length) resize(); t[hash(x)].add(x); n++; return true; }Growing the table, if necessary, involves doubling the length of and reinserting all elements into the new table. This is exactly the same strategy used in the implementation of and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions (see Lemma 2.1 on page ).
Besides growing, the only other work done when adding to a involves appending to the list . For any of the list implementations described in Chapters 2 or 3, this takes only constant time.
To remove an element from the hash table we iterate over the list until we find so that we can remove it:
T remove(T x) { int j = hash(x); for (int i = 0; i < t[j].size(); i++) { T y = t[j].get(i); if (x == y) { t[j].remove(i); n--; return y; } } return null; }This takes time, where denotes the length of the list stored at .
Searching for the element in a hash table is similar. We perform a linear search on the list :
T find(T x) { int j = hash(x); for (int i = 0; i < t[j].size(); i++) if (x == t[j].get(i)) return t[j].get(i); return null; }Again, this takes time proportional to the length of the list .
The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the lists, so that the expected size of the list is . On the other hand, a bad hash function will hash all values (including ) to the same table location, in which case the size of the list will be . In the next section we describe a good hash function.
Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic (discussed in Section 2.3) and integer division. It uses the operator, which calculates the integral part of a quotient, while discarding the remainder. Formally, for any integers and , .
In multiplicative hashing, we use a hash table of size for some integer (called the dimension). The formula for hashing an integer is
int hash(T x) { return ((unsigned)(z * hashCode(x))) >> (w-d); }
The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:
With Lemma 5.1, the performance of , and are easy to analyze:
Now, we want to prove Lemma 5.1, but first we need a result from number theory. In the following proof, we use the notation to denote , where each is a bit, either 0 or 1. In other words, is the integer whose binary representation is given by . We use to denote a bit of unknown value.
Suppose, for the sake of contradiction, that there are two such values and , with . Then
Furthermore since and . Since is odd, it has no trailing 0's in its binary representation:
The utility of Lemma 5.3 comes from the following observation: If is chosen uniformly at random from , then is uniformly distributed over . In the following proof, it helps to think of the binary representation of , which consists of random bits followed by a 1.
Let be the unique odd integer such that for some integer . By Lemma 5.3, the binary representation of has random bits, followed by a 1:
The following theorem summarizes the performance of the data structure:
Furthermore, beginning with an empty , any sequence of and operations results in a total of time spent during all calls to .
opendatastructures.org