# Bloom Filters

So far I’ve never realized that ambition, but I keep hoping.

Bloom filters allow you to quickly test if an element is a member of a (typically very large) set without the need to store (and search) each element. The cost of this speed and storage efficiency is a certain number of false positives.

They’re often used to reduce the need for expensive database lookups. For example, the Chrome web browser uses a Bloom filter to locally check if a given URL is a known phishing or malware site before making a more expensive remote query.

# Basic structure #

_{1}(key)") --> 0 h2("h

_{2}(key)") --> 1 h3("h

_{3}(key)") --> 6 h4("h

_{4}(key)") --> 6 end

A Bloom filter is an array of m bits that are set by hashing the
desired key with k unique hash functions. The value of each hash is the
index of a single bit in a bit array. It lets you quickly check if
an element is *probably in* a set, or *definitely not* in a set by hashing
a key and checking that all of the corresponding bits are set.
If even one is unset, then the key is definitely not in the set. If all
are set, then it’s probably in the set (but there’s a chance of a false
positive).

```
type Filter struct {
m, k int
bits []byte
}
// New makes a bloom filter with m bits and k hash functions.
func New(m, k int) *Filter {
// Round m up to the nearest 8-bit boundary.
if m%8 != 0 {
m += 8 - m%8
}
return &Filter{m, k, make([]byte, m/8)}
}
```

# Adding and testing #

To get or set a bit n in the array, you divide n by 8 to get the byte and use the remainder to find the bit inside that byte. We’ll define a couple of utility methods to get and set bits. (We won’t need one to clear a bit, since it’s not possible to remove an element from a regular Bloom filter once it’s added.)

```
func (f *Filter) bit(n uint) bool {
return f.bits[n/8]&(1<<(n%8)) != 0
}
func (f *Filter) setBit(n uint) {
f.bits[n/8] |= 1 << (n % 8)
}
```

With these functions in hand, we’re ready to add and check elements. To add an element, hash it with each of the k hash functions, and set the corresponding bit in the array. (More on hash functions later.)

```
func (f *Filter) Add(elt []byte) {
for i := 0; i < f.k; i++ {
f.setBit(f.hash(i, elt))
}
}
```

Checking if an element is in the filter works similarly. If all of the
bits that elt hashes to are 1, then it is *probably* in the set. But if
even one bit is 0 then elt is *definitely not* in the set.

```
func (f *Filter) Check(elt []byte) bool {
for i := 0; i < f.k; i++ {
if !f.bit(f.hash(i, elt)) {
return false
}
}
return true
}
```

# Hash functions #

While a Google search will turn up a variety of different hash functions, it’s not necessarily clear how to turn that list into an arbitrary number of functions to use in a Bloom filter. Fortunately, you don’t have to scour the Internet for a bunch of different functions and hope you’ve found enough to ensure you never run out. There’s a simple trick for turning a single n-bit hash function into \(2^{n/2}\) \(n/2\)-bit hash functions.

To turn a hash function, \(h(\mathrm{msg})\) into k
hash functions, \(h_i(\mathrm{msg})\), you split the bits of
\(h(\mathrm{msg}) = x\) into two numbers and calculate
\(ix_\mathrm{upper} + x_\mathrm{lower}\). In this case, we’ll
use a 64-bit FNV-1a hash and turn it into 2^{32} 32-bit hashes.
Finally, we take the mod of the number of bits so that the resulting
value maps to one of the bits in our array.

```
func (f *Filter) hash(i int, elt []byte) uint {
h := fnv.New64a()
h.Write(elt)
x := h.Sum64()
return (uint(i)*uint(x>>32) + uint(x)) % uint(f.m)
}
```

# Choosing parameters #

Bloom filters take two parameters: the size of the bit array, m, and the number of hash functions, k. How you choose these parameters determines how likely you are to get a false positive. (Remember, false negatives are impossible.)

In general, for a filter of a given size, the chance of a false positive
increases as you add elements. In fact, if you add enough elements all
of the bits will be 1, and *every* element will match. To choose m and
k, you need to decide how many elements you need to store and what the
false positive rate should be when it’s filled to that capacity.

Given a desired false positive probability \(0 < p \le 1\) and expected number of elements n, the size of the bit array should be $$ m = \left\lceil -\frac{n\ln{p}}{\ln^2 2} \right\rceil $$ The optimal number of hash functions is given by $$ k = \left\lceil \frac{m}{n}\ln{2} \right\rceil $$

For example, if you want to store 200,000 elements with a false positive probability of 5%, the number of bits in your array should be $$ m = \left\lceil -\frac{200,\!000\ln{0.05}}{\ln^22} \right\rceil = 1,\!247,\!045\; \mathrm{bits}$$ or 152 kB. And the number of hash functions should be $$ k = \left\lceil \frac{1,\!247,\!045}{200,\!000}\ln{2} \right\rceil = 5 $$

Finally, we can write a function to choose the size and number of hash functions.

```
func OptimalSize(n int, p float64) (m, k int) {
// The natural log of 2.
const ln2 = 0.693147180559945
m = int(math.Ceil(-(float64(n) * math.Log(p)) / (ln2 * ln2)))
k = int(math.Ceil((float64(m) * ln2) / float64(n)))
return m, k
}
```

# Using it #

Now that we’ve got the basic functionality, it’s time to try it out.

```
set := [][]byte{[]byte("foo"), []byte("bar")}
f := New(OptimalSize(len(set), 0.1))
for _, x := range set {
f.Add(x)
}
for _, x := range set {
fmt.Println(f.Check(x)) // Always true
}
// In general, tests for elements not in the set will be false
// 90% of the time.
fmt.Println(f.Check([]byte("not in set")))
```