Bloom Filters - Power in Simplicity

Bloom Filters - Power in Simplicity
2024-11-27 11:4:5 Author: hackernoon.com(查看原文) 阅读量:0 收藏

I first heard of Bloom filters from my professor at IIT Madras when talking about caching. I was young and new to the world of computer science then, but even so, I remember being fascinated by how simple yet powerful bloom filters are. After close to 9 years, I realize that this simple yet powerful data structure is still relegated to the side when someone thinks of mainstream data structures. This article is to correct that.

What Are Bloom Filters?

Bloom filters are a probabilistic data structure used for membership checks. They give true negatives and probable positives i.e., when I ask a bloom filter if an element exists in it, its answer would be ‘definitely not’ or ‘maybe yes’. The advantage is they do this in near-constant time using every little extra space. Curious, isn’t it? I will now explain how a bloom filter works, and you will understand the nature of their answers.

A bloom filter has 2 components:

k hash functions: A hash function is just a function that takes a string and outputs a mathematical ‘fingerprint’ of it.
bit array of size N: Nothing fancy, just N bits.

Adding an Element

When we add an element E to the bloom filter, we do the following:

Hash E with each of the k hash functions and generate k hashes h1, h2…hk.
Find the reminder when each of these hashes is divided by N. Let these be r1, r2, …, rk.
Turn on the bits corresponding to r1, r2 … rk in the bloom filter’s bit array.

If this is too complex to follow, let me explain the addition with a simple example using a small bloom filter with 2 hash functions and just 16 bits (2 bytes) of storage.

Example: Add

Bloom Filter construction:

2 hash functions: Here, I am going to use some imaginary function that gives me the values I need to explain.

16 bits storage, all set to 0 initially. The bits are indexed from 0 to 15.

Step 1: Add ‘Ganesh’

hash1(‘ganesh’) = 2

hash2(‘ganesh’) = 7

So, we set the 2nd and 7th bits in our bit array to 1.

Step 2: Add ‘palanikumar’
hash1(‘palanikumar’) = 5
hash2(‘palanikumar’) = 12

So, we set the 5th and 12th bits in our bit array to 1.

After adding both these elements, the bits 2, 7, 5, and 12 are set to 1.

Membership Check

To check if an element Q is present in a filter, we do the following:

Find the indexes that correspond to Q using the same method described above.
If any one of these indices is 0, that means Q is definitely not in the set. If the element had been added to the set, every one of these bits must have been turned on.

If all of them are 1, it means Q ‘maybe’ in the set. Why the maybe? It is possible that these bits were turned on at separate times because of multiple other elements.

Example: Check

Following the same example above, let us check for ‘animal’.
hash1(‘animal’) = 6
hash2(‘animal’) = 2
But in our bit array, index 6 is not set to 1. So, we can say that ‘animal’ definitely does not exist in the set.
Now, let us check for '‘lottery’
hash1(‘lottery’) = 2
hash2(‘lottery’) = 12
In our bitarray, index 2 and 12 are indeed set to 1. But we know that ‘lottery’ is not in the set since we inserted only 2 words ‘Ganesh’ and ‘palanikumar’. This is an example of the bloom filter returning a false positive.

Why Use Bloom Filters?

Bloom filters are very useful as a data structure to represent sets in memory-constrained environments. They are often used in cache firmware to decide if an object is present in the cache or not before trying to access the object. Database engines like Postgres use Bloom Filters to reduce disk reads. Disk reads are the slowest operations in most systems so reducing them helps to significantly speed up queries.

Bloom Filter vs Set: Evaluating Time and Space

To analyze the speed vs memory footprint of Bloom filters, I prepared a small script in Python.

Step 1: Define the Bloom Filter

from bit array import bit array

class BloomFilter: def init(self, size, num_hashes): self.size = size self.num_hashes = num_hashes self.bit_array = bitarray(size) self.bit_array.setall(0)

def _hash(self, item, seed):
    return hash(item + str(seed)) % self.size

def add(self, item):
    for i in range(self.num_hashes):
        index = self._hash(item, i)
        self.bit_array[index] = 1

def check(self, item):
    for i in range(self.num_hashes):
        index = self._hash(item, i)
        if not self.bit_array[index]:
            return False
    return True

Step 2: Write the Test Script

import random import string import time from bloom_filter import BloomFilter

def generate_random_string(length): return ''.join(random.choice(string.ascii_letters) for _ in range(length))

num_items = 1000000 num_checks = 10000 bloom_size = 10000000 num_hashes = 7

bloom_filter = BloomFilter(bloom_size, num_hashes) standard_set = set()

print("Adding items...") bloom_add_time = 0 set_add_time = 0 for _ in range(num_items): item = generate_random_string(10) start_time = time.time() bloom_filter.add(item) bloom_add_time += time.time() - start_time start_time = time.time() standard_set.add(item) set_add_time += time.time() - start_time print(f"Bloom Filter add time: {bloom_add_time/num_items:.6f} seconds") print(f"Standard Set add time: {set_add_time/num_items:.6f} seconds")

print(
Performing membership checks...") bloom_time = 0 set_time = 0

for _ in range(num_checks): item = generate_random_string(10)

start = time.time()
bloom_result = bloom_filter.check(item)
bloom_time += time.time() - start

start = time.time()
set_result = item in standard_set
set_time += time.time() - start

print(f"Bloom Filter average check time: {bloom_time/num_checks:.6f} seconds") print(f"Standard Set average check time: {set_time/num_checks:.6f} seconds")

false_positives = sum(1 for _ in range(num_checks) if bloom_filter.check(generate_random_string(10))) false_positive_rate = false_positives / num_checks

print(f
False Positive Rate: {false_positive_rate:.2%}")

import sys bloom_memory = sys.getsizeof(bloom_filter.bit_array) set_memory = sys.getsizeof(standard_set)

print(f
Bloom Filter memory usage: {bloom_memory/1024:.2f} KB") print(f"Standard Set memory usage: {set_memory/1024:.2f} KB") print(f"Memory saved: {(set_memory - bloom_memory)/1024:.2f} KB")

Output

Adding items... Bloom Filter add time: 0.000004 seconds Standard Set add time: 0.000000 seconds

Performing membership checks... Bloom Filter average check time: 0.000002 seconds Standard Set average check time: 0.000000 seconds

False Positive Rate: 0.84%

Bloom Filter memory usage: 1220.78 KB Standard Set memory usage: 32768.21 KB Memory saved: 31547.43 KB.

Analysis

As you can see, there is a false positive rate of < 1%. The membership checks are also slightly less performant. However, the bloom filter uses just 3% of the space used by the set. In an execution environment where memory is at a premium, Bloom filters provide huge savings in space with a moderate reduction in performance.

Conclusion

Bloom filters are an efficient data structure for use in memory-starved environments. They have a lot of practical applications and are even used in popular tools like CDNs and database engines. Do you have any interesting use cases in which bloom filters would be helpful? Do let me know in the comments!

文章来源: https://hackernoon.com/bloom-filters-power-in-simplicity?source=rss
如有侵权请联系:admin#unsafe.sh