Bitsets & FREE software consulting announcement!

Announcements

Today, we’re going to motivate ourselves as to why we would want to use a bitset and the space savings we can get. Let’s jump into it!

Hash tables

For example, we could have a mapping from a string to another string. The key could represent an individual, like John Smith, and the value could represent a phone number as we have in the above photo.

In some languages like Java, the standard library comes with classes for more specialized cases. One common case is a mapping between a key and a boolean where the boolean denotes whether or not a key is a member of some collection.

Rather than using a hashmap, it’s more idiomatic to use a hash set which simply takes in a key. Again, we’re making use of hash sets to determine membership of a given element.

Here’s a snippet showing the difference between the two approaches:

So what’s a bitset?

What would we use a bitset for?

Imagine you’re running a very large popular restaurant and you’ve been tasked with developing a ticketing system.

Since the restaurant is always so packed, when new customers arrive, they are assigned a ticket number. When a table becomes available, they turn in their ticket number and place it back into the basket to be reused by another customer.

This ticket number doesn’t necessarily denote their place in line because of party size constraints. Some parties might get served earlier than others because they have a smaller or larger party.

We want to design a system to efficiently keep track of the tickets. We have a N number of tickets and don’t want to give away anymore tickets if all tickets are currently taken.

Initial iteration

We can have a hash set to represent the basket of tickets that we have. When a ticket is taken out, we remove it from the set. Once the party is seated and puts the ticket back in, we can add that value back into the set.

If we run out integers in the set, we know we’ve allocated all of our tickets.

Problems with this approach

Since we’re in a pandemic, say we wanted to reuse this ticketing system to host a large virtual concert where we need to limit the number of attendees at a hard cap of 10 million. We’ll also let attendees watch with their friends and they’ll be admitted to the concert at the same time.

This is similar to the restaurant situation so we can use the same approach right? We can use a ticket to represent a single party and as people enter the concert, they can virtually put their ticket back into the pool.

Space issues

Using what we’ve learned from the other emails talking about int sizes, we can ballpark the amount of memory this approach would use:

10,000,000 integers * sizeof(int)

10,000,000 integers * 4 bytes = 40,000,000 bytes

And looking up a converter online, 40,000,000 bytes is equal to 40 megabytes.

Now, this might be acceptable for some use cases, maybe not for some others. Nevertheless, let’s say for the sake of this application that 40 megabytes of memory is too much for us to allocate. Is there a more compact way to represent these integers?

Bitset to the rescue!

Pictorially, this looks like:

Each of the blocks is an individual byte. Say I want to see if ticket number 12 is in the set or not. Internally, the bitset would divide the index requested, in this case 12, by the size of the byte to get the byte index (in this case, 12 / 8 = 1 per integer truncation).

Now I need to get the bit offset to figure out what bit in the byte I need to look at. So in this case, I’ll use the mod operator to get the remainder from 12 % 8 = 4. With this, I can now create a bitwise mask and filter out all other bits to look at except the one I’m interested in which is the fourth bit in the byte.

How do we determine an open ticket?

We could keep a cache of recently returned tickets numbers, but we would run into the same issue as before. In the worst case, we could have a cache of all of the returned tickets which would mean we could have 10 million integers.

This is a classic example of a space vs time tradeoff we often have to make as engineers. We can either use more space to make the searching for an open ticket faster, or we can spend more time instead to find an open ticket number.

What are our space savings?

10,000,000 million integers / 8 bits in a byte = 1,250,000 bytes

Which is equal to 1.25 megabytes. Comparing this to using the hashset of numbers approach of 40 megabytes before, this is a 96.8% decrease in space taken. Crazy right!?

Using a bitset in code

What’s next?

Thanks all! As always, here is the Discord link if you would like to chat :).

CS @ UW • SDE @ Amazon • http://www.justinharjanto.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store