# Python HashTable Implementation

"Describe a simple implementation of a hash table."

At first blush this seems a simple enough request. My Python brain immediately jumped to the thought, "well, of course `dict` is Python's implementation of a hash table...", but given a world where `dict` doesn't exist, how would you define its behavior? I decided to cozy up to Wikipedia, dig into the article on hash tables, and get familiar with some basic data structures concepts. The ultimate goal was to trace the CPython implementation of `dict` and replicate something similar in pure Python. It turned out to be not at all a bad introduction to some basic data structures concepts and a great way to spend a lazy Sunday.

## What is a hash table?

In the simplest terms possible, a hash table is a data structure that maps keys to values. The word "hash" in the definition is the biggest clue to what makes this possible. Given that an array-like/indexable data type exists, we can simply use a hashing function to generate an index for a given key and insert the value into an array-like structure at that generated index. Looking up a value by its key should therefore just require the inverse operation: compute the hash of the given key and fetch the data at that index. For example's sake, let's make a super primitive implementation.

```python
def hasher(key):
    if key == 'foo':
        return 0
    elif key == 'bar':
        return 1
    elif key == 'baz':
        return 2

buckets = list()
# Convert the key 'foo' to a numeric value and insert the value 'hello world'
# at that numeric index
buckets.insert(hasher('foo'), 'hello world')
# Retrieve the value by getting the data at the hashed index
buckets[hasher('foo')]  # Returns 'hello world'
```

## Proper Hashing

The behavior above is a very basic example of how hash tables work. Obviously there are some huge flaws with this
example, though: our `hasher` function is almost entirely worthless as it is hardcoded for three specific input values,
we rely on the returned indexes starting at 0 and growing by one from there, and you can't just insert values at arbitrary
positions in Python lists, so attempting to insert *'bar'* before *'foo'* would land the value at the wrong index.

Fortunately for Python devs there is the handy builtin `hash` function, which is perfect for this application (this is the same hashing used by the default implementation of the `dict` type).

Unfortunately, the returned value will be a big integer and, as discussed above, Python lists are not sparse, so we can't simply
insert at arbitrary indexes without growing the list to be at least that big. Therefore, to insert a value for the hash
of the given key *'foo'* the list would need to have a length of at least *4,177,197,833,195,190,596* - not exactly efficient.
(Note that since Python 3.3 string hashes are randomized per process, so the exact hash values shown below will differ from run to run; the technique is unchanged.)

The CPython implementation works around this by operating on the bits of the hash directly: take the last n bits of the hash, convert those
to an integer value, and insert at that index. So instead of needing a list with a length of four *quintillion*, we can use just the final
three bits of the hash and know that the largest index we will ever need to support is 7. Thus, buckets only needs to have 8 available
slots at initialization.
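In code, the "last n bits" trick is just a bitwise AND against `size - 1`. A quick sketch, using a made-up stand-in value since real `hash()` results vary per process:

```python
size = 8         # number of buckets; must be a power of two
mask = size - 1  # 0b111

h = 0b110010101  # stand-in for a (positive) hash value
idx = h & mask   # keep only the final three bits -> 0b101 == 5
# Equivalent to slicing the last three characters of the binary string
assert idx == int(bin(h)[-3:], 2)
```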

Now that there is a known fixed number of slots that need to be supported, the problem of lists not being sparse becomes easy to solve. Simply construct a list full of dummy objects - in this case we will use `None` - and inserting at arbitrary positions will now maintain correct indices.

```python
# Generate a hash for the key 'foo' (exact value varies per process)
key = 'foo'
value = 'hello world'
h = hash(key)  # e.g. -4177197833195190597
# Convert to binary, slice just the last 3 bits to keep buckets small
bits = bin(h)[-3:]  # e.g. '101'
# Convert the bits to an integer for insertion
idx = int(bits, 2)  # e.g. 5
# Create a buckets list that is big enough for 3 bits worth of indices
buckets = [None] * 8
buckets[idx] = value
# Throw it all together to get that data back out of the buckets
buckets[int(bin(hash('foo'))[-3:], 2)]  # 'hello world'
```

## Collision Handling

That really is about 90% of what goes into a hash table implementation; the final 10% is where things can get a bit more
complicated. Since we are choosing from such a small pool of possible indices, what happens when two keys hash to the
same bit sequence? This is what is referred to as a *collision*, and there are many formal ways of handling them. CPython uses
a combination of two techniques known as **Open Addressing** and **Random Probing** to implement the dictionary type.

Open addressing simply means that there is a single collection of buckets, and rather than storing only the value at the computed index in the buckets, we actually store a composite of the full computed hash, the key, and the value itself. In CPython that is handled by storing a struct; in the pure Python implementation we will just use a tuple.

When a new record is inserted, the bucket at the computed index is checked; if it is empty, the value is simply inserted. If the bucket is NOT empty, however, the table is probed in a pseudo-random order determined by folding in the higher bits of the hash that were initially ignored - hence the term "random probing". Lookups run in the same order: the computed index is checked, and if it's not empty but the key does not match the requested key, the buckets are randomly probed until a match is found.
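To make the probing concrete, here is a small sketch that traces the recurrence described in the dictobject.c comments - `i = 5*i + perturb + 1`, with `perturb` (the full hash) shifted right as the probe proceeds. The `probe_sequence` helper is hypothetical, purely for illustration:

```python
def probe_sequence(h, mask, steps=5):
    # Trace the slots a lookup would visit for hash h in a table whose
    # bit mask is size - 1
    idx = h & mask
    perturb = h
    seq = [idx]
    for _ in range(steps):
        # (idx << 2) + idx is 5 * idx, matching the CPython recurrence
        idx = ((idx << 2) + idx + perturb + 1) & mask
        seq.append(idx)
        perturb >>= 5  # fold in progressively higher bits of the hash
    return seq

print(probe_sequence(0b101, 0b111))  # [5, 7, 4, 5, 2, 3]
```

Note that slots can repeat early on; the sequence only settles into a full cycle of the table once `perturb` has been shifted down to zero.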

Putting these concepts into action and cleaning up the API a bit, here is a slightly-more-working implementation.

*(Try not to get terribly stuck on the bitwise operations (the `&`, `<<`, and `>>` operators in this case) if you aren't familiar.
Just know that under the covers these are doing the same work as the simple example above. If you need a refresher,
check out this article on the PSF wiki.)*
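As a minimal illustration of the two operators doing most of the work here:

```python
# & keeps only the bits set in both operands - used to mask a hash
# down to a valid bucket index
assert 0b101101 & 0b000111 == 0b101   # 45 & 7 == 5
# >> shifts bits to the right, discarding the lowest bits - used to
# fold the hash's higher bits into the probe sequence
assert 0b101101 >> 2 == 0b1011        # 45 >> 2 == 11
```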

```python
class HashTable(object):

    def __init__(self, **initial_values):
        self.size = 8
        self._buckets = [None] * self.size
        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]
        # Do a naive probe; if the bucket at idx is populated but NOT
        # the correct key, use _locate to probe randomly.
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]
        # If the initial probe found a different key's bucket, probe
        # randomly for the next available slot.
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        self._buckets[idx] = (h, key, value)

    def _locate(self, key, h):
        idx = h & self.mask
        # Treat the hash as unsigned, as CPython does, so the shift
        # below eventually reaches zero even for negative hashes
        perturb = h & 0xFFFFFFFFFFFFFFFF
        while True:
            # (idx << 2) + idx is 5 * idx, per the CPython recurrence
            idx = ((idx << 2) + idx + perturb + 1) & self.mask
            bucket = self._buckets[idx]
            if not bucket or bucket[1] == key:
                return idx
            perturb >>= 5  # per the CPython impl, 5 is the optimal shift

    @property
    def mask(self):
        return len(self._buckets) - 1


h = HashTable(foo='hello world', bar="don't panic!")
print(h.get('foo'))  # hello world
print(h.get('bar'))  # don't panic!
```

## Bucket Sizing

This is all well and good for storing a handful of values, but it's easy to see that with such a small storage space collisions become very likely and running out of bucket slots is easy to do. To handle this, our implementation will grow its buckets store by a factor of 4 once it is more than 75% full. Conversely, to save on RAM, it will auto-shrink by the same factor if it becomes less than 1/6th full. (CPython's actual thresholds differ a bit, and `dict` only compacts during a resize rather than shrinking on deletion, but the idea is the same.)

Resizing the buckets list itself is very easy, but the data must actually be re-hashed and re-stored, both to use the new space efficiently and to keep the data reachable, since the bitmask applied to the hash is tied directly to the size of the buckets list.
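A quick sketch of why re-hashing is unavoidable - the same (made-up) hash maps to different indices under the old and new masks:

```python
h = 0b1101101  # stand-in hash value
# With 8 buckets the mask keeps three bits...
assert h & (8 - 1) == 0b101     # index 5
# ...but with 32 buckets it keeps five, yielding a different index
assert h & (32 - 1) == 0b01101  # index 13
```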

```python
class HashTable(object):

    def __init__(self, **initial_values):
        self._buckets = [None] * 8
        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        h = hash(key)
        idx = h & self.mask
        # Do a naive probe; if the bucket at idx is populated but NOT
        # the correct key, use _locate to probe randomly.
        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        # If the buckets seem over-utilized, grow and re-index
        if self.utilization >= 0.75:
            self._resize(len(self._buckets) * 4)
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]
        # If the initial probe found a different key's bucket, probe
        # randomly for the next available slot.
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        self._buckets[idx] = (h, key, value)

    def _locate(self, key, h):
        idx = h & self.mask
        # Treat the hash as unsigned, as CPython does, so the shift
        # below eventually reaches zero even for negative hashes
        perturb = h & 0xFFFFFFFFFFFFFFFF
        while True:
            # (idx << 2) + idx is 5 * idx, per the CPython recurrence
            idx = ((idx << 2) + idx + perturb + 1) & self.mask
            bucket = self._buckets[idx]
            if not bucket or bucket[1] == key:
                return idx
            perturb >>= 5  # per the CPython impl, 5 is the optimal shift

    def _resize(self, size):
        old_buckets = self._buckets
        self._buckets = [None] * size
        # Make sure the old data gets re-indexed into the new store
        for bucket in [b for b in old_buckets if b]:
            self.insert(bucket[1], bucket[2])

    @property
    def mask(self):
        return len(self._buckets) - 1

    @property
    def utilization(self):
        '''Calculate the fraction of buckets that are actually populated'''
        try:
            return float(len(self)) / float(len(self._buckets))
        except ZeroDivisionError:
            return 0

    def __len__(self):
        return len([x for x in self._buckets if x])


h = HashTable(one="one", two="two", three="three", four="four", five="five",
              six="six", seven="seven", eight="eight", nine="nine")
print(h.get('nine'))  # nine - and now there are >8 items stored, so resize worked!
print(len(h))         # 9 - side effect: the class now supports len()!
```

## Item Deletion

We are rounding the final corner here, but there is one last implementation detail left unhandled: deleting items. Simply
removing items from the buckets list can actually cause keys to become unreachable due to collision handling. Say *baz*
lands at its computed index 4, and *bar* also hashes to index 4; the collision sends *bar* to index 2 per the probing
algorithm above. If a process now comes along and deletes *baz* from the table, any effort to retrieve the key *bar* will
fail when it sees an empty slot at the expected index 4. The easiest way to work around this limitation is to leave a kind
of dummy item - a tombstone - in its place when deleting. This forces future lookups to bounce off of the expected slot
and proceed to the next one in the probe sequence.
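Here is a toy sketch of that failure mode. The probe chain `[4, 2]` and the `lookup` helper are contrived for illustration; in the real table the chain comes from the hash recurrence:

```python
TOMBSTONE = '<Tombstone>'

def lookup(buckets, key, chain):
    # Walk a (contrived) probe chain, stopping at the first truly
    # empty slot - exactly what the real lookup does
    for idx in chain:
        bucket = buckets[idx]
        if bucket is None:
            return None  # an empty slot ends the search
        if bucket is not TOMBSTONE and bucket[0] == key:
            return bucket[1]
    return None

buckets = [None] * 8
buckets[4] = ('baz', 'first')   # 'baz' occupies the shared home slot
buckets[2] = ('bar', 'second')  # 'bar' collided and was probed to slot 2

buckets[4] = None  # naive delete of 'baz'...
assert lookup(buckets, 'bar', [4, 2]) is None  # ...and 'bar' is unreachable!

buckets[4] = TOMBSTONE  # a tombstone keeps the probe chain intact
assert lookup(buckets, 'bar', [4, 2]) == 'second'
```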

Now is a good time to implement a couple more magic methods to make the API a little bit cleaner - `__getitem__` to support
square bracket access, `__delitem__` to support the *del* keyword, and `__setitem__` to support square bracket item assignment.
Assuming this will be the last pass over the code, I've thrown some comments and docstrings in for good measure. There is obviously
room for more optimizations, but for science's sake this was about as far as I wanted to go. Feel free to tell me where
I'm wrong in the comments section!

```python
class HashTable(object):
    '''Rudimentary implementation of the CPython dictionary object in pure
    Python. Supports a relatively dictionary-like API, except it's worse in
    every way! The one thing this may do is give you some nice analogies for
    the CPython implementation without having to dig through the source.

    >>> foo = HashTable(bar="baz", boo="doh")
    >>> foo.get("bar")
    'baz'
    >>> foo["addtl"] = "things"
    >>> assert len(foo) == 3
    >>> del foo["addtl"]
    >>> assert len(foo) == 2
    '''
    TOMBSTONE = '<Tombstone>'
    MINSIZE = 8

    def __init__(self, **initial_values):
        self._buckets = [None] * self.MINSIZE
        for k, v in initial_values.items():
            self.insert(k, v)

    def get(self, key):
        '''Simple getter method to retrieve buckets by key'''
        h = hash(key)
        idx = h & self.mask
        # Do a naive probe; if the bucket at idx is populated but NOT
        # the correct key, use _locate to probe randomly.
        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        return self._buckets[idx][2] if self._buckets[idx] else None

    def insert(self, key, value):
        '''Simple insert method to insert buckets by key'''
        # If the buckets seem over-utilized, grow and re-index
        if self.utilization >= 0.75:
            self._resize(len(self._buckets) * 4)
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]
        # If the initial probe found a different key's bucket, probe
        # randomly for the next available slot.
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        self._buckets[idx] = (h, key, value)

    def delete(self, key):
        '''Simple delete method to delete buckets by key'''
        h = hash(key)
        idx = h & self.mask
        bucket = self._buckets[idx]
        if bucket and bucket[1] != key:
            idx = self._locate(key, h)
        # Leave a tombstone rather than an empty slot so that probe
        # chains passing through this bucket stay intact
        if self._buckets[idx]:
            self._buckets[idx] = self.TOMBSTONE
        # If the buckets seem under-utilized, shrink and re-index
        if 0 < self.utilization <= 0.16 and len(self._buckets) > self.MINSIZE:
            self._resize(len(self._buckets) // 4)

    def _locate(self, key, h):
        '''Very basic implementation of the CPython lookdict method. This is
        where the magic happens.

        The algorithm itself is described in much greater detail at:
        https://hg.python.org/cpython/file/52f68c95e025/Objects/dictobject.c#l97
        '''
        idx = h & self.mask
        # Treat the hash as unsigned, as CPython does, so the shift
        # below eventually reaches zero even for negative hashes
        perturb = h & 0xFFFFFFFFFFFFFFFF
        for _ in range(len(self._buckets) * 8):
            # (idx << 2) + idx is 5 * idx, per the CPython recurrence
            idx = ((idx << 2) + idx + perturb + 1) & self.mask
            bucket = self._buckets[idx]
            if not bucket or bucket[1] == key:
                return idx
            perturb >>= 5  # per the CPython impl, 5 is the optimal shift
        raise Exception("something went wrong with {} {}".format(key, idx))

    def _resize(self, size):
        '''Allow the buckets list to be grown or shrunk to a passed size.

        This method recreates the buckets list and reinserts all old records
        at their newly computed indices. Also cleans up tombstones in the
        process.
        '''
        old_buckets = self._buckets
        self._buckets = [None] * size
        # Make sure the old data gets re-indexed into the new store
        for bucket in [b for b in old_buckets if b and b != self.TOMBSTONE]:
            self.insert(bucket[1], bucket[2])

    @property
    def mask(self):
        '''The length of the buckets list, minus one, is applied as the bit
        mask to the hash for item indexing.'''
        return len(self._buckets) - 1

    @property
    def utilization(self):
        '''Calculate the fraction of buckets that are populated (tombstones
        excluded)'''
        try:
            return float(len(self)) / float(len(self._buckets))
        except ZeroDivisionError:
            return 0

    def __len__(self):
        '''len() should return the number of populated, non-tombstoned
        records'''
        return len([b for b in self._buckets if b and b != self.TOMBSTONE])

    def __setitem__(self, key, val):
        self.insert(key, val)

    def __getitem__(self, key):
        val = self.get(key)
        if val is None:
            raise KeyError(key)
        return val

    def __delitem__(self, key):
        self.delete(key)


h = HashTable(one="one", two="two", three="three", four="four", five="five",
              six="six", seven="seven", eight="eight", nine="nine")
print(len(h._buckets))  # 32 - buckets have grown to accommodate the items
# Remove a bunch of the keys
del h['one']
del h['two']
del h['three']
del h['four']
del h['five']
del h['six']
del h['seven']
del h['eight']
print(len(h._buckets))  # 8 - buckets have now shrunk due to low utilization
h['ten'] = 10
print(h.get('one'))  # None - 'one' is gone
print(h.get('ten'))  # 10
```

## Conclusion

All in all this was a great excuse to pore over the CPython code base and learn a bit about data structures in the process. Massive shoutout to the CPython maintainers for going to great lengths to comment the C itself. There is a single ~100 line comment at the top of dictobject.c that bends over backwards to explain the implementation and even a few corner cases of the way they ultimately chose to do it.

One of the most interesting discoveries I made during this process was how this actually explains why dictionaries are
not "ordered", at least not in the way that you might expect. (Note: since CPython 3.6 the dict implementation keeps a
separate, densely packed entry array and iterates in insertion order, so what follows describes the older implementation
traced in this post.) When iterating over one of these dictionaries you may have noticed that the order seems arbitrary.
The iteration actually processes the entries in the hashed order of the keys! So if *foo* hashes to 100 and *bar* hashes
to 101 then the keys will be iterated *foo* first. You can even see evidence of collisions in the iteration order. The
keys *bar* and *baz* have colliding indices with a 3-bit mask; watch what happens to the output of .keys() when their
insertion order is changed.

```python
# A Python 2 session - modern dicts preserve insertion order instead
a = dict(bar="world", foo="hello")
a.keys()  # ['foo', 'bar']
b = dict()
b['bar'] = 'hello'
b['baz'] = 'world'
c = dict()
c['baz'] = 'hello'
c['bar'] = 'world'
b.keys()  # ['baz', 'bar']
c.keys()  # ['bar', 'baz'] !!
```

### References

There are, of course, many, many resources on these subjects online, but I found that the following two really contained 99% of what I wanted to learn. The dictobject.c comments are absolutely crucial to understanding the implementation and I cannot recommend them highly enough. The PyCon 2010 talk by Brandon Rhodes was also of great help in understanding the why behind the implementation, and in particular the probing algorithm.