14. External Memory Searching

Throughout this book, we have been using the $ \ensuremath{\ensuremath{\mathit{w}}}$-bit word-RAM model of computation defined in Section 1.4. An implicit assumption of this model is that our computer has a large enough random access memory to store all of the data in the data structure. In some situations, this assumption is not valid. There exist collections of data so large that no computer has enough memory to store them. In such cases, the application must resort to storing the data on some external storage medium such as a hard disk, a solid state disk, or even a network file server (which has its own external storage).

Accessing an item from external storage is extremely slow. The hard disk attached to the computer on which this book was written has an average access time of 19ms and the solid state drive attached to the computer has an average access time of 0.3ms. In contrast, the random access memory in the computer has an average access time of less than 0.000113ms. Accessing RAM is more than 2500 times faster than accessing the solid state drive and more than 160000 times faster than accessing the hard drive.

These speeds are fairly typical; accessing a random byte from RAM is thousands of times faster than accessing a random byte from a hard disk or solid-state drive. Access time, however, does not tell the whole story. When we access a byte from a hard disk or solid state disk, an entire block of the disk is read. Each of the drives attached to the computer has a block size of 4096; each time we read one byte, the drive gives us a block containing 4096 bytes. If we organize our data structure carefully, this means that each disk access could yield 4096 bytes that are helpful in completing whatever operation we are doing.

This is the idea behind the external memory model of computation, illustrated schematically in Figure 14.1. In this model, the computer has access to a large external memory in which all of the data resides. This memory is divided into memory blocks each containing $ B$ words. The computer also has limited internal memory on which it can perform computations. Transferring a block between internal memory and external memory takes constant time. Computations performed within the internal memory are free; they take no time at all. The fact that internal memory computations are free may seem a bit strange, but it simply emphasizes the fact that external memory is so much slower than RAM.

Figure 14.1: In the external memory model, accessing an individual item, $ \ensuremath{\ensuremath{\mathit{x}}}$, in the external memory requires reading the entire block containing $ \ensuremath{\ensuremath{\mathit{x}}}$ into RAM.
\includegraphics[width=\textwidth ]{figs-python/em}

In the full-blown external memory model, the size of the internal memory is also a parameter. However, for the data structures described in this chapter, it is sufficient to have an internal memory of size $ O(B+\log_B \ensuremath{\ensuremath{\ensuremath{\mathit{n}}}})$. That is, the memory needs to be capable of storing a constant number of blocks and a recursion stack of height $ O(\log_B
\ensuremath{\ensuremath{\ensuremath{\mathit{n}}}})$. In most cases, the $ O(B)$ term dominates the memory requirement. For example, even with the relatively small value $ B=32$, $ B\ge \log_B
\ensuremath{\ensuremath{\ensuremath{\mathit{n}}}}$ for all $ \ensuremath{\ensuremath{\ensuremath{\mathit{n}}}}\le 2^{160}$. In decimal, $ B\ge \log_B \ensuremath{\ensuremath{\ensuremath{\mathit{n}}}}$ for any

$\displaystyle \ensuremath{\ensuremath{\ensuremath{\mathit{n}}}} \le 1\,461\,501...
...7\,330\,902\,918\,203\,684\,832\,716\,283\,019\,655\,932\,542\,976 \enspace
. $