Subsections

5.3 Hash Codes

The hash tables discussed in the previous section are used to associate data with integer keys consisting of $ \mathtt{w}$ bits. In many cases, we have keys that are not integers. They may be strings, objects, arrays, or other compound structures. To use hash tables for these types of data, we must map these data types to $ \mathtt{w}$-bit hash codes. Hash code mappings should have the following properties:

  1. If $ \mathtt{x}$ and $ \mathtt{y}$ are equal, then $ \mathtt{x.hashCode()}$ and $ \mathtt{y.hashCode()}$ are equal.

  2. If $ \mathtt{x}$ and $ \mathtt{y}$ are not equal, then the probability that $ \ensuremath{\mathtt{x.hashCode()}}=\ensuremath{\mathtt{y.hashCode()}}$ should be small (close to $ 1/2^{\ensuremath{\mathtt{w}}}$).

The first property ensures that if we store $ \mathtt{x}$ in a hash table and later look up a value $ \mathtt{y}$ equal to $ \mathtt{x}$, then we will find $ \mathtt{x}$--as we should. The second property minimizes the loss from converting our objects to integers. It ensures that unequal objects usually have different hash codes and so are likely to be stored at different locations in our hash table.

5.3.1 Hash Codes for Primitive Data Types

Small primitive data types like $ \mathtt{char}$, $ \mathtt{byte}$, $ \mathtt{int}$, and $ \mathtt{float}$ are usually easy to find hash codes for. These data types always have a binary representation and this binary representation usually consists of $ \mathtt{w}$ or fewer bits. (For example, in Java, $ \mathtt{byte}$ is an 8-bit type and $ \mathtt{float}$ is a 32-bit type.) In these cases, we just treat these bits as the representation of an integer in the range $ \{0,\ldots,2^\ensuremath{\mathtt{w}}-1\}$. If two values are different, they get different hash codes. If they are the same, they get the same hash code.

A few primitive data types are made up of more than $ \mathtt{w}$ bits, usually $ c\ensuremath{\mathtt{w}}$ bits for some constant integer $ c$. (Java's $ \mathtt{long}$ and $ \mathtt{double}$ types are examples of this with $ c=2$.) These data types can be treated as compound objects made of $ c$ parts, as described in the next section.


5.3.2 Hash Codes for Compound Objects

For a compound object, we want to create a hash code by combining the individual hash codes of the object's constituent parts. This is not as easy as it sounds. Although one can find many hacks for this (for example, combining the hash codes with bitwise exclusive-or operations), many of these hacks turn out to be easy to foil (see Exercises 5.7-5.9). However, if one is willing to do arithmetic with $ 2\ensuremath{\mathtt{w}}$ bits of precision, then there are simple and robust methods available. Suppose we have an object made up of several parts $ P_0,\ldots,P_{r-1}$ whose hash codes are $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}$. Then we can choose mutually independent random $ \mathtt{w}$-bit integers $ \ensuremath{\mathtt{z}}_0,\ldots,\ensuremath{\mathtt{z}}_{r-1}$ and a random $ 2\ensuremath{\mathtt{w}}$-bit odd integer $ \mathtt{z}$ and compute a hash code for our object with

$\displaystyle h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})...
...2\ensuremath{\mathtt{w}}}\right)
\ddiv 2^{\ensuremath{\mathtt{w}}} \enspace .
$

Note that this hash code has a final step (multiplying by $ \mathtt{z}$ and dividing by $ 2^{\ensuremath{\mathtt{w}}}$) that uses the multiplicative hash function from Section 5.1.1 to take the $ 2\ensuremath{\mathtt{w}}$-bit intermediate result and reduce it to a $ \mathtt{w}$-bit final result. Here is an example of this method applied to a simple compound object with three parts $ \mathtt{x0}$, $ \mathtt{x1}$, and $ \mathtt{x2}$:
    int hashCode() {
        // random numbers from rand.org
        long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL}; 
        long zz = 0xbea0107e5067d19dL;

        // convert (unsigned) hashcodes to long
        long h0 = x0.hashCode() & ((1L<<32)-1);
        long h1 = x1.hashCode() & ((1L<<32)-1);
        long h2 = x2.hashCode() & ((1L<<32)-1);
        
        return (int)(((z[0]*h0 + z[1]*h1 + z[2]*h2)*zz)
                     >>> 32);
    }
The following theorem shows that, in addition to being straightforward to implement, this method is provably good:

Theorem 5..3   Let $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}$ and $ \ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1}$ each be sequences of $ \mathtt{w}$ bit integers in $ \{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$ and assume $ \ensuremath{\mathtt{x}}_i \neq \ensuremath{\mathtt{y}}_i$ for at least one index $ i\in\{0,\ldots,r-1\}$. Then

$\displaystyle \Pr\{ h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_...
...suremath{\mathtt{y}}_{r-1}) \}
\le 3/2^{\ensuremath{\mathtt{w}}} \enspace .
$

Proof. We will first ignore the final multiplicative hashing step and see how that step contributes later. Define:

$\displaystyle h'(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}...
...\ensuremath{\mathtt{x}}_j\right)\bmod 2^{2\ensuremath{\mathtt{w}}} \enspace .
$

Suppose that $ h'(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}) = h'(\ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})$. We can rewrite this as:

$\displaystyle \ensuremath{\mathtt{z}}_i(\ensuremath{\mathtt{x}}_i-\ensuremath{\mathtt{y}}_i) \bmod 2^{2\ensuremath{\mathtt{w}}} = t$ (5.4)

where

$\displaystyle t = \left(\sum_{j=0}^{i-1} \ensuremath{\mathtt{z}}_j(\ensuremath{...
...tt{y}}_j-\ensuremath{\mathtt{x}}_j)\right) \bmod 2^{2\ensuremath{\mathtt{w}}}
$

If we assume, without loss of generality that $ \ensuremath{\mathtt{x}}_i> \ensuremath{\mathtt{y}}_i$, then (5.4) becomes

$\displaystyle \ensuremath{\mathtt{z}}_i(\ensuremath{\mathtt{x}}_i-\ensuremath{\mathtt{y}}_i) = t\enspace ,$ (5.5)

since each of $ \ensuremath{\mathtt{z}}_i$ and $ (\ensuremath{\mathtt{x}}_i-\ensuremath{\mathtt{y}}_i)$ is at most $ 2^{\ensuremath{\mathtt{w}}}-1$, so their product is at most $ 2^{2\ensuremath{\mathtt{w}}}-2^{\ensuremath{\mathtt{w}}+1}+1 < 2^{2\ensuremath{\mathtt{w}}}-1$. By assumption, $ \ensuremath{\mathtt{x}}_i-\ensuremath{\mathtt{y}}_i\neq 0$, so (5.5) has at most one solution in $ \ensuremath{\mathtt{z}}_i$. Therefore, since $ \ensuremath{\mathtt{z}}_i$ and $ t$ are independent ( $ \ensuremath{\mathtt{z}}_0,\ldots,\ensuremath{\mathtt{z}}_{r-1}$ are mutually independent), the probability that we select $ \ensuremath{\mathtt{z}}_i$ so that $ h'(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})=h'(\ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})$ is at most $ 1/2^{\ensuremath{\mathtt{w}}}$.

The final step of the hash function is to apply multiplicative hashing to reduce our $ 2\ensuremath{\mathtt{w}}$-bit intermediate result $ h'(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})$ to a $ \mathtt{w}$-bit final result $ h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})$. By Theorem 5.3, if $ h'(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})\neq h'(\ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})$, then $ \Pr\{h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}) = h(\en...
...y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})\} \le 2/2^{\ensuremath{\mathtt{w}}}$.

To summarize,

  $\displaystyle \Pr\left\{\begin{array}{l} h(\ensuremath{\mathtt{x}}_0,\ldots,\en...
...suremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})\end{array}\right\}$    
  $\displaystyle = \Pr\left\{\begin{array}{ll} \mbox{$h'(\ensuremath{\mathtt{x}}_0...
...emath{\mathtt{y}}_{r-1})\ddiv 2^{\ensuremath{\mathtt{w}}}$} \end{array}\right\}$    
  $\displaystyle \le 1/2^{\ensuremath{\mathtt{w}}} + 2/2^{\ensuremath{\mathtt{w}}} = 3/2^{\ensuremath{\mathtt{w}}} \enspace . \qedhere$    

$ \qedsymbol$


5.3.3 Hash Codes for Arrays and Strings

The method from the previous section works well for objects that have a fixed, constant, number of components. However, it breaks down when we want to use it with objects that have a variable number of components, since it requires a random $ \mathtt{w}$-bit integer $ \ensuremath{\mathtt{z}}_i$ for each component. We could use a pseudorandom sequence to generate as many $ \ensuremath{\mathtt{z}}_i$'s as we need, but then the $ \ensuremath{\mathtt{z}}_i$'s are not mutually independent, and it becomes difficult to prove that the pseudorandom numbers don't interact badly with the hash function we are using. In particular, the values of $ t$ and $ \ensuremath{\mathtt{z}}_i$ in the proof of Theorem 5.3 are no longer independent.

A more rigorous approach is to base our hash codes on polynomials over prime fields; these are just regular polynomials that are evaluated modulo some prime number, $ \mathtt{p}$. This method is based on the following theorem, which says that polynomials over prime fields behave pretty-much like usual polynomials:

Theorem 5..4   Let $ \ensuremath{\mathtt{p}}$ be a prime number, and let $ f(\ensuremath{\mathtt{z}}) = \ensuremath{\mathtt{x}}_0\ensuremath{\mathtt{z}}^...
...tt{z}}^1 +
\cdots + \ensuremath{\mathtt{x}}_{r-1}\ensuremath{\mathtt{z}}^{r-1}$ be a non-trivial polynomial with coefficients $ \ensuremath{\mathtt{x}}_i\in\{0,\ldots,\ensuremath{\mathtt{p}}-1\}$. Then the equation $ f(\ensuremath{\mathtt{z}})\bmod \ensuremath{\mathtt{p}} = 0$ has at most $ r-1$ solutions for $ \ensuremath{\mathtt{z}}\in\{0,\ldots,p-1\}$.

To use Theorem 5.4, we hash a sequence of integers $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}$ with each $ \ensuremath{\mathtt{x}}_i\in \{0,\ldots,\ensuremath{\mathtt{p}}-2\}$ using a random integer $ \ensuremath{\mathtt{z}}\in\{0,\ldots,\ensuremath{\mathtt{p}}-1\}$ via the formula

$\displaystyle h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1})...
...}}-1)\ensuremath{\mathtt{z}}^r \right)\bmod \ensuremath{\mathtt{p}} \enspace .
$

Note the extra $ (\ensuremath{\mathtt{p}}-1)\ensuremath{\mathtt{z}}^r$ term at the end of the formula. It helps to think of $ (\ensuremath{\mathtt{p}}-1)$ as the last element, $ \ensuremath{\mathtt{x}}_r$, in the sequence $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r}$. Note that this element differs from every other element in the sequence (each of which is in the set $ \{0,\ldots,\ensuremath{\mathtt{p}}-2\}$). We can think of $ \ensuremath{\mathtt{p}}-1$ as an end-of-sequence marker.

The following theorem, which considers the case of two sequences of the same length, shows that this hash function gives a good return for the small amount of randomization needed to choose $ \mathtt{z}$:

Theorem 5..5   Let $ \ensuremath{\mathtt{p}}>2^{\ensuremath{\mathtt{w}}}+1$ be a prime, let $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}$ and $ \ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1}$ each be sequences of $ \mathtt{w}$-bit integers in $ \{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$, and assume $ \ensuremath{\mathtt{x}}_i \neq \ensuremath{\mathtt{y}}_i$ for at least one index $ i\in\{0,\ldots,r-1\}$. Then

$\displaystyle \Pr\{ h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_...
...math{\mathtt{y}}_{r-1}) \}
\le (r-1)/\ensuremath{\mathtt{p}} \} \enspace .
$

Proof. The equation $ h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}) = h(\ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r-1})$ can be rewritten as

$\displaystyle \left( (\ensuremath{\mathtt{x}}_0-\ensuremath{\mathtt{y}}_0)\ensu...
...}_{r-1})\ensuremath{\mathtt{z}}^{r-1} \right)\bmod \ensuremath{\mathtt{p}} = 0.$ (5.6)

Since $ \ensuremath{\mathtt{x}}_\ensuremath{\mathtt{i}}\neq \ensuremath{\mathtt{y}}_\ensuremath{\mathtt{i}}$, this polynomial is non-trivial. Therefore, by Theorem 5.4, it has at most $ r-1$ solutions in $ \mathtt{z}$. The probability that we pick $ \mathtt{z}$ to be one of these solutions is therefore at most $ (r-1)/\ensuremath{\mathtt{p}}$. $ \qedsymbol$

Note that this hash function also deals with the case in which two sequences have different lengths, even when one of the sequences is a prefix of the other. This is because this function effectively hashes the infinite sequence

$\displaystyle \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}, \ensuremath{\mathtt{p}}-1,0,0,\ldots \enspace .
$

This guarantees that if we have two sequences of length $ r$ and $ r'$ with $ r > r'$, then these two sequences differ at index $ i=r$. In this case, (5.6) becomes

$\displaystyle \left(
\sum_{i=0}^{i=r'-1}(\ensuremath{\mathtt{x}}_i-\ensuremath...
...nsuremath{\mathtt{z}}^{r}
\right)\bmod \ensuremath{\mathtt{p}} = 0 \enspace ,
$

which, by Theorem 5.4, has at most $ r$ solutions in $ \ensuremath{\mathtt{z}}$. This combined with Theorem 5.5 suffice to prove the following more general theorem:

Theorem 5..6   Let $ \ensuremath{\mathtt{p}}>2^{\ensuremath{\mathtt{w}}}+1$ be a prime, let $ \ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_{r-1}$ and $ \ensuremath{\mathtt{y}}_0,\ldots,\ensuremath{\mathtt{y}}_{r'-1}$ be distinct sequences of $ \mathtt{w}$-bit integers in $ \{0,\ldots,2^{\ensuremath{\mathtt{w}}}-1\}$. Then

$\displaystyle \Pr\{ h(\ensuremath{\mathtt{x}}_0,\ldots,\ensuremath{\mathtt{x}}_...
...{\mathtt{y}}_{r-1}) \}
\le \max\{r,r'\}/\ensuremath{\mathtt{p}} \enspace .
$

The following example code shows how this hash function is applied to an object that contains an array, $ \mathtt{x}$, of values:

    int hashCode() {
        long p = (1L<<32)-5;   // prime: 2^32 - 5
        long z = 0x64b6055aL;  // 32 bits from random.org
        int z2 = 0x5067d19d;   // random odd 32 bit number
        long s = 0;
        long zi = 1;
        for (int i = 0; i < x.length; i++) {
            // reduce to 31 bits
            long xi = (x[i].hashCode() * z2) >>> 1; 
            s = (s + zi * xi) % p;
            zi = (zi * z) % p;    
        }
        s = (s + zi * (p-1)) % p;
        return (int)s;
    }

The preceding code sacrifices some collision probability for implementation convenience. In particular, it applies the multiplicative hash function from Section 5.1.1, with $ \ensuremath{\mathtt{d}}=31$ to reduce $ \mathtt{x[i].hashCode()}$ to a 31-bit value. This is so that the additions and multiplications that are done modulo the prime $ \ensuremath{\mathtt{p}}=2^{32}-5$ can be carried out using unsigned 63-bit arithmetic. Thus the probability of two different sequences, the longer of which has length $ r$, having the same hash code is at most

$\displaystyle 2/2^{31} + r/(2^{32}-5)
$

rather than the $ r/(2^{32}-5)$ specified in Theorem 5.6.

opendatastructures.org