I first heard of Bloom filters from my professor at IIT Madras when talking about caching. I was young and new to the world of computer science then, but even so, I remember being fascinated by how simple yet powerful bloom filters are. After close to 9 years, I realize that this simple yet powerful data structure is still relegated to the side when someone thinks of mainstream data structures. This article is to correct that.
Bloom filters are a probabilistic data structure used for membership checks. They give true negatives and probable positives i.e., when I ask a bloom filter if an element exists in it, its answer would be ‘definitely not’ or ‘maybe yes’. The advantage is they do this in near-constant time using every little extra space. Curious, isn’t it? I will now explain how a bloom filter works, and you will understand the nature of their answers.
A bloom filter has 2 components:
When we add an element E to the bloom filter, we do the following:
If this is too complex to follow, let me explain the addition with a simple example using a small bloom filter with 2 hash functions and just 16 bits (2 bytes) of storage.
Bloom Filter construction:
hash1(‘ganesh’) = 2
hash2(‘ganesh’) = 7
So, we set the 2nd and 7th bits in our bit array to 1.
So, we set the 5th and 12th bits in our bit array to 1.
After adding both these elements, the bits 2, 7, 5, and 12 are set to 1.
To check if an element Q is present in a filter, we do the following:
Bloom filters are very useful as a data structure to represent sets in memory-constrained environments. They are often used in cache firmware to decide if an object is present in the cache or not before trying to access the object. Database engines like Postgres use Bloom Filters to reduce disk reads. Disk reads are the slowest operations in most systems so reducing them helps to significantly speed up queries.
To analyze the speed vs memory footprint of Bloom filters, I prepared a small script in Python.
from bit array import bit array
class BloomFilter: def
def _hash(self, item, seed):
return hash(item + str(seed)) % self.size
def add(self, item):
for i in range(self.num_hashes):
index = self._hash(item, i)
self.bit_array[index] = 1
def check(self, item):
for i in range(self.num_hashes):
index = self._hash(item, i)
if not self.bit_array[index]:
return False
return True
import random import string import time from bloom_filter import BloomFilter
def generate_random_string(length): return ''.join(random.choice(string.ascii_letters) for _ in range(length))
num_items = 1000000 num_checks = 10000 bloom_size = 10000000 num_hashes = 7
bloom_filter = BloomFilter(bloom_size, num_hashes) standard_set = set()
print("Adding items...") bloom_add_time = 0 set_add_time = 0 for _ in range(num_items): item = generate_random_string(10) start_time = time.time() bloom_filter.add(item) bloom_add_time += time.time() - start_time start_time = time.time() standard_set.add(item) set_add_time += time.time() - start_time print(f"Bloom Filter add time: {bloom_add_time/num_items:.6f} seconds") print(f"Standard Set add time: {set_add_time/num_items:.6f} seconds")
print(
Performing membership checks...") bloom_time = 0 set_time = 0
for _ in range(num_checks): item = generate_random_string(10)
start = time.time()
bloom_result = bloom_filter.check(item)
bloom_time += time.time() - start
start = time.time()
set_result = item in standard_set
set_time += time.time() - start
print(f"Bloom Filter average check time: {bloom_time/num_checks:.6f} seconds") print(f"Standard Set average check time: {set_time/num_checks:.6f} seconds")
false_positives = sum(1 for _ in range(num_checks) if bloom_filter.check(generate_random_string(10))) false_positive_rate = false_positives / num_checks
print(f
False Positive Rate: {false_positive_rate:.2%}")
import sys bloom_memory = sys.getsizeof(bloom_filter.bit_array) set_memory = sys.getsizeof(standard_set)
print(f
Bloom Filter memory usage: {bloom_memory/1024:.2f} KB") print(f"Standard Set memory usage: {set_memory/1024:.2f} KB") print(f"Memory saved: {(set_memory - bloom_memory)/1024:.2f} KB")
Adding items... Bloom Filter add time: 0.000004 seconds Standard Set add time: 0.000000 seconds
.
Performing membership checks... Bloom Filter average check time: 0.000002 seconds Standard Set average check time: 0.000000 seconds
False Positive Rate: 0.84%
Bloom Filter memory usage: 1220.78 KB Standard Set memory usage: 32768.21 KB Memory saved: 31547.43 KB.
As you can see, there is a false positive rate of < 1%. The membership checks are also slightly less performant. However, the bloom filter uses just 3% of the space used by the set. In an execution environment where memory is at a premium, Bloom filters provide huge savings in space with a moderate reduction in performance.
Bloom filters are an efficient data structure for use in memory-starved environments. They have a lot of practical applications and are even used in popular tools like CDNs and database engines. Do you have any interesting use cases in which bloom filters would be helpful? Do let me know in the comments!