An unfinished draft of Linearly Probed Robin Hood Hash Tables

I was working on a blog article a few months ago but I never finished it. Here’s the first draft, nonetheless.

Open Addressing: To store elements in the table itself rather than in nodes pointed-to by the table.

Linear Probing:  When collisions occur, scan rightwards until finding an unoccupied bucket.

An Explanation

So what is Robin Hood hashing anyway? Well, Wikipedia gives the following quote:

One interesting variation on double-hashing collision resolution is Robin Hood hashing. The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table.

But what the hell does that mean? How does that apply to linear probing?

To get a better idea of what’s happening, let’s try visualizing the insertion algorithm.

A Microsoft Paint Visualization

Lots of diagrams use numbers to explain things, but I find numbers to be confusing so let’s use colors instead.

Our hash table will contain 12 buckets, of which we will focus on the four leftmost: the red bucket, the green bucket, the yellow bucket, and the blue bucket. I’ve listed these in ascending order, such that red < green < yellow < blue.

To start, let's try inserting a bunch of elements that hash to these four buckets, first using plain old linear probing: linearprobegif

Once the first four buckets fill up, each additional insertion appends an element onto the leftmost unoccupied bucket. This results in a long chain of elements matching the order of insertion.

That’s without Robin Hood. Here’s with it:robin

Notice something?

The inserted elements are sorted! In fact, they’re sorted by the bucket that they hash to. If you were to iterate a linearly probed Robin Hood hash table from the first bucket to the last bucket, you’d be iterating elements that hash to an ascending bucket order. With linear probing, Robin Hood hash tables are nothing more than sorted arrays interspersed with unoccupied buckets.

Faster Failed Searches

The primary advantage of Robin Hood hashing is that it enable failed searches to terminate early.

Consider the case of searching for a green element that doesn’t exist in a regular linearly probed hash table:linearlookup

With regular linear probing, the only way to determine whether an element doesn’t exist is to probe the table until reaching an unoccupied bucket.

With Robin Hood hashing, the search is allowed to terminate early due to the properties of sorted arrays. The search will terminate as soon as it reaches an element that hashes to a bucket higher than the bucket being searched for:robinlookup.png

This will, on average, reduce the number of iterations by half.

Non-faster, Non-failed Searches

Unfortunately, Robin Hood hashing provides no speedup for searches that succeed in finding the requested element. On average, the successful probe count of regular linear probing and Robin Hood hashing is exactly the same.

This can be illustrated by showing the probe count of each element in the table. First, a diagram illustrating regular linear probing:linearprobe

The probe count of each element is equal to the number of elements that must be iterated in order to find the element in a search.

Here’s Robin Hood:robinprobe.png

There’s a bit less variance in the probe counts of this table. Note that there’s only one probe count marked ‘1’, and that the maximum probe count is ‘9’, compared to ’12’ on the other table. This is an oft-cited advantage of Robin Hood hash tables: their probe counts have low variance.

Anyway, let’s try adding up the total probe counts of both tables, as that will be proportional to the average:

Regular Linear = 1+1+3+1+3+5+6+5+6+8+9+12 = 60 probes
    Robin Hood = 1+2+3+3+4+5+5+6+7+7+8+9  = 60 probes

Hmm, they’re both 60. That confirms that these two tables have the same average probe count, and thus the same average performance for successful searches. Note that there’s no way to improve these results by swapping elements around. Any gains achieved by swapping one element will be canceled-out by losses of the element it swapped with. Any configuration of this table will always have a total probe count of 60.

Keep this in mind when using Robin Hood hashing.

I’m bolding this point because too often it’s forgotten.

For example, take a look at Malte Skarupke’s recent blog post called I Wrote The Fastest Hashtable. In it, Skarupke implements and benchmarks an optimized version of linearly probed Robin Hood hashing, but unfortunately neglects to test regular linear probing alongside it. One such benchmark is the K-Nucleotides test; a benchmark where all searches succeed and none can possibly fail.

Testing K-Nucleotides, I get the following results:

Skarupke's "Fastest Hashtable Ever"        : 6248.724800 task-clock:u (msec)
A trivial implementation of linear probing : 5609.471389 task-clock:u (msec)

It should be clear from these numbers that Robin Hood is not the “Fastest Ever” for situations where searches always succeed; it needs searches to fail in order to pull ahead in performance.

Insertion and Erasure

Inserting and erasing from a linearly probed Robin Hood hash table require shifting elements around to preserve the sorted order. This is no different than inserting an element into a sorted array.

Example: Inserting a third red element into the table requires shifting elements to the right to make space.robininsert1.png

Once space is made the new element can be inserted:robininsert2.png

It’s worth noting that not every element needs to be shifted. Rather, only the first element of each color region needs shifting:

robininsert3.png

The shifting stops once it reaches an unoccupied bucket. Elements on the other side of the table (such as the purple ones) will not be affected:

robininsert4.png

Erasure

Erasure works the same way as insertion, though elements will be shifted left rather than right:

robinerasurepng.png

The first shifted element will either be to the left of an unoccupied bucket, or to the left of an element with probe count of ‘1’:robinerase2.png

The above erasure results in the following:robinerase3.png

Sadly, because inserting and erasing from Robin Hood hash tables requires shifting elements around, the performance of these operations, should they succeed, are worse than that of regular linearly probed hash tables.

Circular Tables

A common implementation of open addressed hash tables is to implement the table circularly. That is, if a collision occurs on the very last, rightmost bucket, the probe sequence will “wrap around” to the first, leftmost bucket and continue on.

Without circular tables, a different strategy is needed: Allocate extra space at the end of the table, and if a collision occurs on the last, rightmost bucket, continue probing into the extra allocated space. In the unlikely chance that the extra space gets filled, additional space will be allocated.

In practice, the second strategy performs better when the table fits in the cache. Otherwise, the first strategy holds a small advantage for using less memory.

Unfortunately, the idea of Robin Hood tables being sorted arrays is slightly complicated when circular tables are involved. This makes it is necessary to go back to the Wikipedia definition for the implementation.

Recall:

One interesting variation on double-hashing collision resolution is Robin Hood hashing. The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table.

When implementing circular tables, one must compare probe counts rather than hashed-to bucket positions to maintain sortedness. This is because probe counts are a relative measurement, while bucket positions are not. Bucket positions rely on a global ordering of buckets (red < green < yellow < blue), which does not exist in a circular table.

Example: An incorrect circular table created by comparing bucket positions. Note the grey seam separating the lowest bucket in memory from the highest, and how it splits the purple elements in two.
circular1

Example: A correct circular table created by comparing probe counts. Purple elements remain contiguous.
circular2


Implementation

Part 1: Bookkeeping

Implementing an efficient linearly probed Robin Hood hash table requires solving two problems:

  1. How to distinguish occupied buckets from unoccupied buckets.
  2. How to efficiently sort and compare the order elements without having to rehash elements.

Idea: Store hashes

One possible solution is to store a copy of the element’s hash in the table alongside the element. It’s easy to maintain Robin Hood sortedness this way by calculating buckets from the stored hashes.

To distinguish unoccupied buckets from occupied buckets, a hash of ‘0’ can be used to mark unoccupied buckets. This of course requires that hash function never actually return ‘0’.

Below is C++ code illustrating how searching works with such tables:

key_type* hash_table::find(key_type key)
{
	unsigned int initial_bucket = hash(element) % table_size;
	for(unsigned int b = initial_bucket;; ++b)
	{
		if(table[b].stored_hash == 0)
			return nullptr;       // Unoccupied (failed search)
		if(initial_bucket < table[b].stored_hash % table_size)
			return nullptr;       // Early exit (failed search)
		if(key == table[b].key)
			return &table[b].key; // Successful match
	}
}

Storing hashes comes with other benefits, too. One such benefit is that hash values do not need to be recalculated every time the table is enlarged; a costly operation if the hash function is expensive. Another benefit is that stored hashes provide a way of comparing elements without comparing the elements themselves. If the hashes of two elements differ then the elements differ and you do not need to bother comparing them at all.

Example Implementation: Rust’s HashMap

Idea: Store probe counts

An alternative to storing hashes is to store probe counts in the table instead.
robinprobe
To represent unoccupied buckets, a probe count of ‘0’ can be used. This is possible because the minimum probe count of any element is ‘1’.

Interestingly, using ‘0’ to mark unoccupied elements makes it possible to combine the check for unoccupied elements with the check for early exit into a single conditional statement; an obvious boon for performance.

key_type* hash_table::find(key_type key)
{
	unsigned int initial_bucket = hash(element) % table_size;
	for(unsigned int b = initial_bucket, probes_counted = 1;; ++b, ++probes_counted)
	{
		if(table[b].probe_count < probes_counted)
			return nullptr;       // Handles both unoccupied and early exit (failed search)
		if(key == table[b].key) TODO TODO TODO TODO TODO
			return &table[b].key; // Successful match
	}
}

Storing probe counts is especially nice because it doesn’t require changing the hash function. Sadly, stored probe counts lack the auxiliary benefits that stored hashes provide, though they do allow hash functions to return ‘0’.

Example Implementation: Malte Skarupke’s “World’s Fastest Hashtable”

Idea: Combine the two previous methods

There are two properties of probe counts worth thinking about:

  1. If the table isn’t 100% full, a probe count will always be smaller than table size.
  2. Given a probe count and an element’s position, it is possible to calculate the element’s hash modulo the table size using simple arithmetic.

So what the hell does that mean? Well, things get interesting when focusing on table sizes that are powers of two.

For such sizes the above properties can be rewritten:

  1. A table of size 2^N will require probe counts of N bits.
  2. A probe count of N bits can be used to calculate the lower N bits of the element’s hash using simple arithmetic.

Still confused? Well here’s where this is going:

The high bits of an element’s hash can be bitwise OR’d to an element’s probe count to result in a value that contains the information of both.

And it’s only a little bit awkward to calculate!

// First use the table's size (2^N) to compute
// an N bit mask:
table_size = 2048;      // 2 ^ 11
mask = table_size - 1;  // An 11 bit mask

// Then use the mask to remove the lower N bits
// from an element's hash:
hash        = 0xDEADBEEF;   // A 32-bit hash.
masked_hash = hash & ~mask; // The hash's upper 21 bits.

// Finally calculate the element's probe count
// (which will be N bits or less)
// and bitwise OR it with masked_hash.
resulting_value = masked_hash | probe_count;

Recalculating the complete hash using this value is not hard, but you’ll need to know the element’s position in the table:

unsigned_int hash;
hash = position - (table[position].key & mask) + 1;
hash |= table[position].key & ~mask;

Example Implementation: Amble and Knuth’s Sorted Hash Tables

Zero Memory Overhead

Do you know what the hash function for integers is in most languages?

The hash function for integers is just the identity function!

unsigned int hash(unsigned int i)
{
    return i;
}

Now, if an integer and its hash are exactly the same value, doesn’t it seem a little wasteful to store both a copy of the integer, and a copy of its hash in the table? That’s duplication! It’s better off to store just one.

This idea can be combined with the bitwise trick of the previous section to result in a space-efficient hash table. Such tables are nice for using less memory, but also for being capable of holding all integers representable in the integer’s type. This is unlike other linearly probed tables that require specifying an integer (usually ‘0’ or ‘-1’) to mark unoccupied buckets.

It’s worth mentioning that this trick is a bit more general than tables with integer keys. To be precise, this technique will work whenever the hash function is perfect.

Advertisements
This entry was posted in Programming. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s