Before we get into the meat of the content today, if you want to connect on a 1:1 consultation session about making the switch into software, resume reviews, or anything related to software engineering, I’m starting to open up 30 minute slots on Calendly free of charge! Click here to sign up!
Today, we’re going to motivate ourselves as to why we would want to use a bitset and the space savings we can get. Let’s jump into it!
Before we talk about what a bitset is, let’s take a step back to make sure we’re on the same page for hash tables. Recall that a hash table is an efficient data structure that establishes a mapping between two different types.
For example, we could have a mapping from a string to another string. The key could represent an individual, like John Smith, and the value could represent a phone number as we have in the above photo.
In some languages like Java, the standard library comes with classes for more specialized cases. One common case is a mapping between a key and a boolean where the boolean denotes whether or not a key is a member of some collection.
Rather than using a hashmap, it’s more idiomatic to use a hash set which simply takes in a key. Again, we’re making use of hash sets to determine membership of a given element.
Here’s a snippet showing the difference between the two approaches:
So what’s a bitset?
At a high level, a bitset is an extremely space efficient hash set for positive integers. It exploits the structure of how bits are stored so we can easily ask questions to tell us whether or not a given integer is in a set.
What would we use a bitset for?
To give us some underlying motivation as to when we would use a bitset, let’s take an actual interview question that I was asked.
Imagine you’re running a very large popular restaurant and you’ve been tasked with developing a ticketing system.
Since the restaurant is always so packed, when new customers arrive, they are assigned a ticket number. When a table becomes available, they turn in their ticket number and place it back into the basket to be reused by another customer.
This ticket number doesn’t necessarily denote their place in line because of party size constraints. Some parties might get served earlier than others because they have a smaller or larger party.
We want to design a system to efficiently keep track of the tickets. We have a N number of tickets and don’t want to give away anymore tickets if all tickets are currently taken.
This seems like a good use case for a hash set!
We can have a hash set to represent the basket of tickets that we have. When a ticket is taken out, we remove it from the set. Once the party is seated and puts the ticket back in, we can add that value back into the set.
If we run out integers in the set, we know we’ve allocated all of our tickets.
Problems with this approach
While this is a simple and straightforward approach, one drawback is that as this restaurant expands and grows, the number of tickets we need to keep track of in the set grows linearly as well.
Since we’re in a pandemic, say we wanted to reuse this ticketing system to host a large virtual concert where we need to limit the number of attendees at a hard cap of 10 million. We’ll also let attendees watch with their friends and they’ll be admitted to the concert at the same time.
This is similar to the restaurant situation so we can use the same approach right? We can use a ticket to represent a single party and as people enter the concert, they can virtually put their ticket back into the pool.
One of the issues that we run into when we want to store a large amount of integers in memory is the fact that it might take up too much space. Say we want to store 10 million integers in memory. We need to make sure that the hardware that we’re running on can actually store that many numbers!
Using what we’ve learned from the other emails talking about int sizes, we can ballpark the amount of memory this approach would use:
10,000,000 integers * sizeof(int)
10,000,000 integers * 4 bytes = 40,000,000 bytes
And looking up a converter online, 40,000,000 bytes is equal to 40 megabytes.
Now, this might be acceptable for some use cases, maybe not for some others. Nevertheless, let’s say for the sake of this application that 40 megabytes of memory is too much for us to allocate. Is there a more compact way to represent these integers?
Bitset to the rescue!
One way we can drastically reduce the amount of space is to leverage using a bitset. A bitset internally is represented with an array of bytes. Recall that a byte has 8 bits inside of it. Each bit in an individual byte will represent whether or not that number is in the bitset, 0 if it’s not in the set, 1 if it is.
Pictorially, this looks like:
Each of the blocks is an individual byte. Say I want to see if ticket number 12 is in the set or not. Internally, the bitset would divide the index requested, in this case 12, by the size of the byte to get the byte index (in this case, 12 / 8 = 1 per integer truncation).
Now I need to get the bit offset to figure out what bit in the byte I need to look at. So in this case, I’ll use the mod operator to get the remainder from 12 % 8 = 4. With this, I can now create a bitwise mask and filter out all other bits to look at except the one I’m interested in which is the fourth bit in the byte.
How do we determine an open ticket?
One downside of this approach is the fact that it’s more difficult for us to determine whether or not we have any open tickets to give. We would need to iterate over the bitset everytime we want to find an open ticket and do some bitwise manipulations to figure out whether or not a particular byte has any bits. Thankfully, the bitwise operations are very fast so it shouldn’t take long to iterate over all of the indexes.
We could keep a cache of recently returned tickets numbers, but we would run into the same issue as before. In the worst case, we could have a cache of all of the returned tickets which would mean we could have 10 million integers.
This is a classic example of a space vs time tradeoff we often have to make as engineers. We can either use more space to make the searching for an open ticket faster, or we can spend more time instead to find an open ticket number.
What are our space savings?
This is where the bit set really shines. Each byte can store 8 bits which means we can easily keep track of 8 numbers with a single byte. So with our new calculations:
10,000,000 million integers / 8 bits in a byte = 1,250,000 bytes
Which is equal to 1.25 megabytes. Comparing this to using the hashset of numbers approach of 40 megabytes before, this is a 96.8% decrease in space taken. Crazy right!?
Using a bitset in code
If you want to make use of bitsets in your code, rather than implementing your own, most programming languages generally already have built in classes so that you don’t need to interface with low level bits and bitwise operators. Some languages like Java even have it included in their standard library. Amazing!
This article is getting a little long, so we’ll go ahead and wrap up the module next week! We’ll go over some actual concrete code making use of bit sets as well as several exercises to help solidify your understanding. Afterwards, we’ll be moving onto our next module, computer memory!
Thanks all! As always, here is the Discord link if you would like to chat :).