"Describe a simple implementation of a hash table."

At first blush this seems a simple enough request. My python brain immediately jumped to the thought, "well of course dict is python's implementation of a hash table...", but given a world where dict doesn't exist, how would you define its behavior? I decided to cozy up to Wikipedia, dig into the article on hash tables, and get familiar with the fundamentals. The ultimate goal was to trace the CPython implementation of dict and replicate something similar in pure python. It turns out that this was not at all a bad introduction to some basic data structures concepts and was a great way to spend a lazy Sunday.

What is a hash table?

In the simplest terms possible, a hash table is a data structure that maps keys to values. The word "hash" in the definition is the biggest clue to what makes this possible. Given that an array-like/indexable data type exists, we can simply use a hashing function to generate an index for a given key and insert the value into an array-like structure at that generated index. Looking up a value by its key should therefore just require the inverse operation: compute the hash of the given key and fetch the data at that index. For example's sake, let's make a super primitive implementation.

# A toy "hash function" that only knows three specific keys
def hasher(key):
    if key == 'foo':
        return 0
    elif key == 'bar':
        return 1
    elif key == 'baz':
        return 2

buckets = list()

# Convert the key 'foo' to a numeric value and insert the value 'hello_world'
# at that numeric index
buckets.insert(hasher('foo'), 'hello world')

# Retrieve the value by getting the data at the hashed index
buckets[hasher('foo')]  # Returns 'hello world'

Proper Hashing

The behavior above is a very basic example of how hash tables work. Obviously there are some huge flaws with this example though: our hasher function is almost entirely worthless since it is hardcoded for 3 specific input values; we rely on the returned indexes starting at 0 and growing by 1 from there; and you can't just insert values at arbitrary positions in python lists, so attempting to insert 'bar' before 'foo' would land the value at the wrong index.

Fortunately for python devs there is the handy builtin hash function that is perfect for this application (this is in fact the same hash used by the default implementation of the dict type). One caveat: on modern Pythons, string hashes are randomized per process, so the exact values shown below may differ on your machine.

Unfortunately, the returned value will be a big integer and, as discussed above, python lists are not sparse, so we can't simply insert at arbitrary indexes without growing the list to be at least that big. To insert a value for the hash of the given key 'foo', the list would need a length of over 4,177,197,833,195,190,597 - not exactly efficient.

The CPython implementation works around this by operating on the bits of the hash directly: take the last n bits of the hash, convert those to an integer value, and insert the value at that index. So instead of needing a list with a length of four quintillion, we can use just the final three bits of the hash and know that the largest index we will ever need to support is 7. Thus, buckets only needs to have 8 available slots at initialization.

Now that there is a known, fixed number of slots to support, the problem of lists not being sparse becomes easy to solve. Simply construct a list full of dummy objects (in this case we will use None) and inserting at arbitrary positions will maintain correct indices.

# Generate a hash given the word 'foo'
key = 'foo'
value = 'hello world'
hash(key)  # -4177197833195190597

# Convert to binary and slice just the last 3 bits to keep buckets small
bits = bin(hash(key))[-3:]  # '101'

# Convert bits to integer for insertion
idx = int(bits, 2)  # 5

# Create a buckets list big enough for 3 bits worth of indices
buckets = [None] * 8
buckets[idx] = value

# Throw it all together to get that data back out of the buckets
buckets[int(bin(hash('foo'))[-3:], 2)]  # 'hello world'
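One caveat before moving on: the class implementations below don't slice bin() strings; they apply a bitmask directly with hash & mask (where mask = size - 1 for a power-of-two size). For a negative hash the two approaches can disagree, because Python's bitwise operators act on the two's-complement value rather than on the sign-and-magnitude digits that bin() prints:

# Masking the hash('foo') value observed above
h = -4177197833195190597
print(h & 0b111)  # 3, not 5 - bitwise ops use two's complement

Either way, the important property holds: masking with size - 1 always yields a valid index between 0 and size - 1.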

Collision Handling

That really is about 90% of what goes into a hash table implementation; the final 10% is where things can get a bit more complicated. Since we are choosing from such a small pool of possible hashes, what happens when two keys hash to the same bit sequence? This is what is referred to as a collision, and there are many formal ways of handling one. CPython uses a combination of two techniques, known as open addressing and random probing, to implement the dictionary type.

Open addressing simply means that there is a single collection of buckets, and rather than storing only the value at the computed index, we store a composite of the full computed hash, the key, and the value itself. In CPython that is handled by storing a struct; in the pure python implementation we will just use a tuple.

When a new record is inserted, the bucket at the computed index is checked; if it is empty the value is simply inserted. If the bucket is NOT empty, however, the table is probed in a pseudo-random order determined by mixing in the higher bits of the hash that were initially ignored - hence the term "random probing". Lookups run the same operations in the same order: the computed index is checked, and if it is not empty but the key does not match the requested key, the buckets are randomly probed until a match or an empty slot is found.
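
To make the probing concrete before burying it in a class, here is the recurrence in isolation - a minimal sketch of the loop described in the dictobject.c comments (PERTURB_SHIFT is CPython's name for the constant):

def probe_sequence(h, mask):
    # Yield candidate slots for hash h, mimicking CPython's recurrence
    PERTURB_SHIFT = 5
    idx = h & mask
    perturb = h
    while True:
        yield idx
        idx = ((idx << 2) + idx + perturb + 1) & mask  # i.e. 5*idx + perturb + 1
        perturb >>= PERTURB_SHIFT

# The first few slots probed for hash 12 in an 8-slot table
seq = probe_sequence(12, 0b111)
print([next(seq) for _ in range(4)])  # [4, 1, 6, 7]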

Putting these concepts into action and cleaning up the API a bit, here is a slightly-more-working implementation.

(Try not to get terribly stuck on the bitwise operations (the &, << and >> operators in this case) if you aren't familiar. Just know that under the covers these are doing the same work as the simple example above. If you need a refresher, check out this article on the PSF wiki.)
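
For a quick intuition, here is what those operators do with small numbers (the values are chosen arbitrarily):

size = 8
mask = size - 1   # 7, i.e. 0b111 - this trick works because size is a power of two
print(37 & mask)  # 5 - same as 37 % 8, keeps only the low 3 bits
print(37 >> 2)    # 9 - shift right, drops the low 2 bits
print(37 << 2)    # 148 - shift left, same as 37 * 4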

class HashTable(object):

    def __init__(self, **initial_values):
        self.size = 8
        self._buckets = [None] * self.size

        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]

        # Do a naive probe; if the bucket at idx is present but NOT
        # the correct key, use _locate to probe randomly.
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)

        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        h = hash(key)
        idx = h & self.mask

        # If the initial probe hit an occupied bucket for a DIFFERENT key,
        # probe randomly for the next available slot. (Checking the key here
        # keeps re-inserting an existing key from creating a duplicate.)
        if self._buckets[idx] and self._buckets[idx][1] != key:
            idx = self._locate(key, h)

        self._buckets[idx] = (h, key, value)

    def _locate(self, key, h):
        idx = h & self.mask

        perturb = h
        while True:
            idx = ((idx << 2) + idx + perturb + 1) & self.mask  # i.e. 5*idx + perturb + 1
            bucket = self._buckets[idx]

            if not bucket or bucket[1] == key:
                return idx

            perturb >>= 5  # per the cpython impl, 5 is the optimal shift

    @property
    def mask(self):
        return len(self._buckets) - 1

h = HashTable(foo='hello world', bar='don\'t panic!')
print(h.get('foo'))  # 'hello world'
print(h.get('bar'))  # "don't panic!"
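
A deterministic way to watch the collision handling fire is to use small integer keys, since CPython hashes small ints to themselves - so 1 and 9 both mask to slot 1 in an 8-slot table:

h = HashTable()
h.insert(1, 'first')
h.insert(9, 'second')  # hash(9) & 0b111 is also 1 - collision!
print(h.get(1))  # 'first'
print(h.get(9))  # 'second' - found via random probing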

Bucket Sizing

This is all well and good for storing a handful of values, but it's easy to see that with such a small storage space collisions become very likely and it doesn't take much to run out of bucket slots entirely. To handle this, our implementation grows its buckets store by a factor of 4 once it is more than 75% full (CPython's real threshold is slightly different - it resizes when the table is two-thirds full). Conversely, in order to save on RAM utilization, the table will auto-shrink by the same factor if it becomes less than 1/6th full.

Resizing the buckets list itself is very easy, but the data must actually be re-hashed and re-stored, both to use the new space efficiently and to keep the data reachable at all, since the bitmask applied to the hash is tied directly to the size of the buckets list.
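
To see why, mask the same hash with the old and new table sizes - the computed index moves:

# The hash('foo') value observed earlier
h = -4177197833195190597
print(h & (8 - 1))   # 3
print(h & (32 - 1))  # 27 - the old index no longer finds this key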

class HashTable(object):

    def __init__(self, **initial_values):
        self._buckets = [None] * 8

        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        h = hash(key)
        idx = h & self.mask

        # Do a naive probe; if the bucket at idx is present but NOT
        # the correct key, use _locate to probe randomly.
        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)

        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        if self.utilization >= 0.75:
            self._resize(len(self._buckets) * 4)

        h = hash(key)
        idx = h & self.mask

        # If the initial probe hit an occupied bucket for a different key,
        # probe randomly for the next available slot.
        if self._buckets[idx] and self._buckets[idx][1] != key:
            idx = self._locate(key, h)

        self._buckets[idx] = (h, key, value)

    def _locate(self, key, h):
        idx = h & self.mask

        perturb = h
        while True:
            idx = ((idx << 2) + idx + perturb + 1) & self.mask  # i.e. 5*idx + perturb + 1
            bucket = self._buckets[idx]

            if not bucket or bucket[1] == key:
                return idx

            perturb >>= 5  # per the cpython impl, 5 is the optimal shift

    def _resize(self, size):
        old_buckets = self._buckets
        self._buckets = [None] * size

        # Make sure the old data gets re-indexed into the new store
        for bucket in [b for b in old_buckets if b]:
            self.insert(bucket[1], bucket[2])

    @property
    def mask(self):
        return len(self._buckets) - 1

    @property
    def utilization(self):
        ''' Calculate the fraction of buckets that are actually populated '''
        try:
            return float(len(self)) / float(len(self._buckets))
        except ZeroDivisionError:
            return 0

    def __len__(self):
        return len([x for x in self._buckets if x])

h = HashTable(one="one", two="two", three="three", four="four", five="five",
              six="six", seven="seven", eight="eight", nine="nine")

print(h.get('nine'))  # 'nine' - and now there are >8 items stored so resize worked!
print(len(h))  # 9 - side-effect, the class now supports len!

Item Deletion

We are rounding the final corner here, but there is one last implementation detail left unhandled: deleting items. Simply removing items from the buckets list can actually cause other keys to become unreachable due to collision handling. If the computed hash inserts baz at index 4 and bar is added next, the indexes collide and bar gets inserted at index 2 per the above algorithm. If a process now comes along and deletes baz from the table, any effort to retrieve the key bar will fail when the lookup sees an empty slot at the expected index 4. The easiest way to work around this limitation is to leave a kind of dummy item in place when deleting. This forces future lookups to bounce off of the dummy and proceed to the next expected slot.
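
Here is a toy illustration of that failure mode; the home slot, probe step, and indices are hardcoded stand-ins rather than real hashes:

# Both keys "hash" to home slot 4; 'bar' was probed onward to slot 2
slots = [None] * 8
slots[4] = ('baz', 1)
slots[2] = ('bar', 2)

def lookup(key, home):
    idx = home
    while slots[idx] is not None:  # an empty slot ends the probe chain
        if slots[idx][0] == key:
            return slots[idx][1]
        idx = (idx + 6) % 8        # stand-in probe step: 4 -> 2
    return None

print(lookup('bar', 4))  # 2 - found by probing through occupied slot 4

slots[4] = None                    # delete 'baz' by simply emptying its slot
print(lookup('bar', 4))  # None! the probe chain broke at slot 4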

Now is a good time to implement a couple more magic methods to make the API a little cleaner: __getitem__ to support square bracket access, __delitem__ to support the del keyword, and __setitem__ to support square bracket item assignment. Assuming this will be the last pass over the code, I've thrown some comments and docstrings in for good measure. There is obviously room for more optimization, but for science's sake this was about as far as I wanted to go. Feel free to tell me where I'm wrong in the comments section!

class HashTable(object):
    '''Rudimentary implementation of the CPython dictionary object in pure python.

    Supports a relatively dictionary-like API, except it's worse in every way!
    The one thing this may do is give you some nice analogies for the CPython
    implementation without having to dig through the source.

    >>> foo = HashTable(bar="baz", boo="doh")
    >>> foo.get("bar")
    "baz"
    >>> foo["addtl"] = "things"
    >>> assert len(foo) == 3
    >>> del foo["addtl"]
    >>> assert len(foo) == 2
    '''

    TOMBSTONE = '<Tombstone>'
    MINSIZE = 8

    def __init__(self, **initial_values):
        self._buckets = [None] * self.MINSIZE

        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        ''' Simple getter method to retrieve buckets by key '''

        h = hash(key)
        idx = h & self.mask

        # Do a naive probe; if the bucket at idx is present but NOT
        # the correct key, use _locate to probe randomly.
        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)

        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        ''' Simple insert method to insert buckets by key '''

        # If the buckets seem over-utilized, grow and re-index
        if self.utilization >= 0.75:
            self._resize(len(self._buckets) * 4)

        h = hash(key)
        idx = h & self.mask

        # If the initial probe hit an occupied bucket for a different key,
        # probe randomly for the next available slot. (Checking the key
        # keeps re-inserting an existing key from creating a duplicate.)
        if self._buckets[idx] and self._buckets[idx][1] != key:
            idx = self._locate(key, h)

        self._buckets[idx] = (h, key, value)

    def delete(self, key):
        ''' Simple delete method to delete buckets by key '''

        h = hash(key)
        idx = h & self.mask

        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)

        self._buckets[idx] = self.TOMBSTONE

        # If the buckets seem under-utilized, shrink and re-index
        if 0 < self.utilization <= 0.16 and len(self._buckets) > self.MINSIZE:
            self._resize(len(self._buckets) // 4)

    def _locate(self, key, h):
        ''' Very basic implementation of the CPython lookdict method. This is where
        the magic happens.

        The algorithm itself is described in much greater detail at:
            https://hg.python.org/cpython/file/52f68c95e025/Objects/dictobject.c#l97
        '''

        idx = h & self.mask

        perturb = h
        for _ in range(self.mask):
            idx = ((idx << 2) + idx + perturb + 1) & self.mask  # i.e. 5*idx + perturb + 1
            bucket = self._buckets[idx]

            if not bucket or bucket[1] == key:
                return idx

            perturb >>= 5  # per the cpython impl, 5 is the optimal shift
        raise Exception("something went wrong with {} {}".format(key, idx))

    def _resize(self, size):
        '''Allow the buckets list to be grown or shrunk to a passed size.

        This method recreates the buckets list and reinserts all old records
        at their computed hashes. Also cleans up Tombstones in the process.
        '''

        old_buckets = self._buckets
        self._buckets = [None] * size

        # Make sure the old data gets re-indexed into the new store
        for bucket in [b for b in old_buckets if b and b != self.TOMBSTONE]:
            self.insert(bucket[1], bucket[2])

    @property
    def mask(self):
        '''The length of the buckets list is also applied as the bit mask to hash
        for item indexing.'''

        return len(self._buckets) - 1

    @property
    def utilization(self):
        '''Calculate the fraction of buckets populated by live (non-tombstoned) records'''
        try:
            return float(len(self)) / float(len(self._buckets))
        except ZeroDivisionError:
            return 0

    def __len__(self):
        '''Len should return the number of non-tombstoned and populated records'''
        return len([b for b in self._buckets if b and b != self.TOMBSTONE])

    def __setitem__(self, key, val):
        self.insert(key, val)

    def __getitem__(self, key):
        val = self.get(key)
        if val is not None:  # NB: get() still can't distinguish a stored None
            return val
        else:
            raise KeyError(key)

    def __delitem__(self, key):
        self.delete(key)


h = HashTable(one="one", two="two", three="three", four="four", five="five",
              six="six", seven="seven", eight="eight", nine="nine")

print(len(h._buckets))  # 32 - buckets have grown to accommodate items

# Remove a bunch of the keys
del h['one']
del h['two']
del h['three']
del h['four']
del h['five']
del h['six']
del h['seven']
del h['eight']

print(len(h._buckets))  # 8 - buckets has now shrunk due to utilization
h['ten'] = 10
print(h.get('one'))  # None - 'one' was deleted
print(h.get('ten'))  # 10

Conclusion

All in all this was a great excuse to pore over the CPython code base and learn a bit about data structures in the process. Massive shoutout to the CPython maintainers for going to great lengths to comment the C itself. There is a single ~100 line comment at the top of dictobject.c that bends over backwards to explain the implementation and even a few corner cases of the way they ultimately chose to do it.

One of the most interesting discoveries I made during this process was how this actually explains why dictionaries are not "ordered", at least not in the way that you might expect (this describes Python 2 dicts; since CPython 3.6/3.7, dict preserves insertion order). When iterating over a dictionary you may have noticed that the order seems arbitrary. The iteration actually processes the entries in bucket-index order, not insertion order! So if foo hashes to an index of 100 and bar hashes to 101, foo will be iterated first regardless of which was inserted first. You can even see evidence of collisions in the iteration order. The keys bar and baz have colliding indices with a 3-bit mask; watch what happens to the output of .keys() when their insertion order is changed.

a = dict(bar="world", foo="hello")
a.keys()  # ['foo', 'bar']

b = dict()

b['bar'] = 'hello'
b['baz'] = 'world'

c = dict()

c['baz'] = 'hello'
c['bar'] = 'world'

b.keys()  # ['baz', 'bar']

c.keys()  # ['bar', 'baz'] !!


References

There are, of course, many, many resources on these subjects online, but I found that the following two really contained 99% of what I wanted to learn. The dictobject.c comments are absolutely crucial to understanding the implementation and I cannot recommend them highly enough. The PyCon 2010 talk by Brandon Rhodes ("The Mighty Dictionary") was also a great help in understanding the why behind the implementation, and in particular the probing algorithm.