Once, when introducing myself to a new team, I jokingly said one of my lifelong ambitions was to solve a production issue with a Bloom filter.

So far I’ve never realized that ambition, but I keep hoping.

Bloom filters allow you to quickly test if an element is a member of a (typically very large) set without the need to store (and search) each element. The cost of this speed and storage efficiency is a certain number of false positives.

They’re often used to reduce the need for expensive database lookups. For example, the Chrome web browser uses a Bloom filter to locally check if a given URL is a known phishing or malware site before making a more expensive remote query.

Basic structure #

block-beta columns 1 block 0["1"] 1["1"] 2["0"] 3["0"] 4["0"] 5["0"] 6["1"] 7["0"] end space block h1("h₁(key)") --> 0 h2("h₂(key)") --> 1 h3("h₃(key)") --> 6 h4("h₄(key)") --> 6 end

A Bloom filter is an array of m bits that are set by hashing the desired key with k unique hash functions. The value of each hash is the index of a single bit in a bit array. It lets you quickly check if an element is probably in a set, or definitely not in a set by hashing a key and checking that all of the corresponding bits are set. If even one is unset, then the key is definitely not in the set. If all are set, then it’s probably in the set (but there’s a chance of a false positive).

type Filter struct {
    m, k int
    bits []byte
}

// New makes a bloom filter with m bits and k hash functions.
func New(m, k int) *Filter {
    // Round m up to the nearest 8-bit boundary.
    if m%8 != 0 {
        m += 8 - m%8
    }
    return &Filter{m, k, make([]byte, m/8)}
}

Adding and testing #

To get or set a bit n in the array, you divide n by 8 to get the byte and use the remainder to find the bit inside that byte. We’ll define a couple of utility methods to get and set bits. (We won’t need one to clear a bit, since it’s not possible to remove an element from a regular Bloom filter once it’s added.)

func (f *Filter) bit(n uint) bool {
    return f.bits[n/8]&(1<<(n%8)) != 0
}

func (f *Filter) setBit(n uint) {
    f.bits[n/8] |= 1 << (n % 8)
}

With these functions in hand, we’re ready to add and check elements. To add an element, hash it with each of the k hash functions, and set the corresponding bit in the array. (More on hash functions later.)

func (f *Filter) Add(elt []byte) {
    for i := 0; i < f.k; i++ {
        f.setBit(f.hash(i, elt))
    }
}

Checking if an element is in the filter works similarly. If all of the bits that elt hashes to are 1, then it is probably in the set. But if even one bit is 0 then elt is definitely not in the set.

func (f *Filter) Check(elt []byte) bool {
    for i := 0; i < f.k; i++ {
        if !f.bit(f.hash(i, elt)) {
            return false
        }
    }
    return true
}

Hash functions #

While a Google search will turn up a variety of different hash functions, it’s not necessarily clear how to turn that list into an arbitrary number of functions to use in a Bloom filter. Fortunately, you don’t have to scour the Internet for a bunch of different functions and hope you’ve found enough to ensure you never run out. There’s a simple trick for turning a single n-bit hash function into $2^{n/2}$ $n/2$-bit hash functions.

To turn a hash function, $h(\mathrm{msg})$ into k hash functions, $h_i(\mathrm{msg})$, you split the bits of $h(\mathrm{msg}) = x$ into two numbers and calculate $ix_\mathrm{upper} + x_\mathrm{lower}$. In this case, we’ll use a 64-bit FNV-1a hash and turn it into 2³² 32-bit hashes. Finally, we take the mod of the number of bits so that the resulting value maps to one of the bits in our array.

There are several approaches to mapping hash values onto the much smaller space of array indexes. Using modulo m is the simplest, but it’s far from the best. However, that’s a discussion for another time.

func (f *Filter) hash(i int, elt []byte) uint {
    h := fnv.New64a()
    h.Write(elt)
    x := h.Sum64()
    return (uint(i)*uint(x>>32) + uint(x)) % uint(f.m)
}

Choosing parameters #

Bloom filters take two parameters: the size of the bit array, m, and the number of hash functions, k. How you choose these parameters determines how likely you are to get a false positive. (Remember, false negatives are impossible.)

In general, for a filter of a given size, the chance of a false positive increases as you add elements. In fact, if you add enough elements all of the bits will be 1, and every element will match. To choose m and k, you need to decide how many elements you need to store and what the false positive rate should be when it’s filled to that capacity.

Given a desired false positive probability $0 < p \le 1$ and expected number of elements n, the size of the bit array should be $$ m = \left\lceil -\frac{n\ln{p}}{\ln^2 2} \right\rceil $$ The optimal number of hash functions is given by $$ k = \left\lceil \frac{m}{n}\ln{2} \right\rceil $$

For example, if you want to store 200,000 elements with a false positive probability of 5%, the number of bits in your array should be $$ m = \left\lceil -\frac{200,\!000\ln{0.05}}{\ln^22} \right\rceil = 1,\!247,\!045\; \mathrm{bits}$$ or 152 kB. And the number of hash functions should be $$ k = \left\lceil \frac{1,\!247,\!045}{200,\!000}\ln{2} \right\rceil = 5 $$

Finally, we can write a function to choose the size and number of hash functions.

func OptimalSize(n int, p float64) (m, k int) {
    // The natural log of 2.
    const ln2 = 0.693147180559945

    m = int(math.Ceil(-(float64(n) * math.Log(p)) / (ln2 * ln2)))
    k = int(math.Ceil((float64(m) * ln2) / float64(n)))
    return m, k
}

Using it #

Now that we’ve got the basic functionality, it’s time to try it out.

set := [][]byte{[]byte("foo"), []byte("bar")}
f := New(OptimalSize(len(set), 0.1))
for _, x := range set {
    f.Add(x)
}

for _, x := range set {
    fmt.Println(f.Check(x)) // Always true
}

// In general, tests for elements not in the set will be false
// 90% of the time.
fmt.Println(f.Check([]byte("not in set")))

Bloom Filters

Basic structure #

Adding and testing #

Hash functions #

Choosing parameters #

Using it #

Further reading #