Go package implementing Bloom filters

Overview

Bloom filters


A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.

A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.
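To make the mechanics above concrete, here is a minimal, self-contained sketch of the idea (not this package's actual code: it uses Go's standard-library FNV-1a and the h1 + i*h2 split-hash trick, whereas this package uses murmurhash):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toyBloom is an illustrative Bloom filter: m bits and k bit locations
// per key, derived from a single 64-bit FNV-1a hash split in two.
type toyBloom struct {
	m    uint64
	k    uint64
	bits []uint64
}

func newToyBloom(m, k uint64) *toyBloom {
	return &toyBloom{m: m, k: k, bits: make([]uint64, (m+63)/64)}
}

// locations returns the k bit positions for a key: g_i = h1 + i*h2 (mod m).
func (f *toyBloom) locations(data []byte) []uint64 {
	h := fnv.New64a()
	h.Write(data)
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	locs := make([]uint64, f.k)
	for i := uint64(0); i < f.k; i++ {
		locs[i] = (h1 + i*h2) % f.m
	}
	return locs
}

// Add sets the bits at each of the key's k locations.
func (f *toyBloom) Add(data []byte) {
	for _, l := range f.locations(data) {
		f.bits[l/64] |= 1 << (l % 64)
	}
}

// Test reports whether all k bits are set: true positives are guaranteed,
// false positives are possible.
func (f *toyBloom) Test(data []byte) bool {
	for _, l := range f.locations(data) {
		if f.bits[l/64]&(1<<(l%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newToyBloom(20*1000, 5)
	f.Add([]byte("Love"))
	fmt.Println(f.Test([]byte("Love"))) // always true for an added key
}
```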

In this implementation, the hashing function used is murmurhash, a non-cryptographic hashing function.

This implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

n := uint(1000)
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

if filter.Test([]byte("Love")) {
	// "Love" is probably in the set
}

For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Finally, there is a method to estimate the false positive rate of a particular bloom filter for a set of size n:

if filter.EstimateFalsePositiveRate(1000) > 0.001 {
	// false positive rate is higher than desired
}

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.
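Being empirical is wise, but the textbook sizing formulas give a good starting point: m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. This sketch computes them directly (a standalone illustration; the package's own parameter estimation may round differently):

```go
package main

import (
	"fmt"
	"math"
)

// estimate returns the textbook Bloom filter parameters for n items and a
// target false positive rate p:
//   m = ceil(-n*ln(p) / (ln 2)^2)   bits
//   k = round((m/n) * ln 2)         hash functions
func estimate(n uint, p float64) (m, k uint) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	return uint(mf), uint(kf)
}

func main() {
	// For 10000 items at p = 0.000001 this yields 287552 bits and 20 hashes.
	m, k := estimate(10000, 0.000001)
	fmt.Println(m, k)
}
```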

Discussion here: Bloom filter

Godoc documentation: https://godoc.org/github.com/willf/bloom

Installation

go get -u github.com/willf/bloom

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa
Issues
  • Applying the bloom filter for HASH160 addresses


    Example Address List.

    0af80f98e511577e5c87d56069896c70fe4a6263
    0af81021340c23bb20429e47c3bdd62137c322d3
    0af811994346cdca9d6dbe8e7da4f3393130fc71
    0af81250856d377cf7f6914ebe196a9570ada313
    0af814b46c4ef5c57dbc53ae03c812eea37d4aac
    0af816bb57c0adac8bca0401d295724cd80b0956
    64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
    64dfb8c23802f3aa67c9078c0e7970fa17584892
    64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
    64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
    64dfba34a39823c273a33fadf49112c063ec3724

    It gives quite a lot of false positive results, and for this reason I cannot use it. I want to use a Bloom filter to filter about 25 million HASH160 addresses. How can I configure the settings more efficiently?

    I want to use it for this project. https://github.com/oguzdelioglu/odelbrain

    opened by oguzdelioglu 11
  • Incorrect tests


    Some of the tests are faulty; TestDirect20_5, for example. If we use the formula from estimateParameters (which is the same as the formula on Wikipedia), we get that for 10000 entries and a 0.0001% precision (p = 0.000001, because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range and estimateParameters uses a 1-0 range) the filter should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) therefore result in a much higher false positive chance of 0.03%.

    Even with the #20 fix the tests still pass, but this is only because the FNV hashing function used results in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many others result in a much higher false positive rate than expected.

    For testing I quickly hacked a version which uses the SHA-256 hashing function and the research from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my Redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.

    opened by erikdubbelboer 11
  • WriteTo() variant that exports only the bitset


    Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

    I'm eager to submit a PR to change that if we could agree on the design.

    Off the top of my head, we could do one of these things:

    • expose BloomFilter.b – the simplest solution. However, it will allow developers to shoot themselves in the foot.
    • add a function that would return a copy of BloomFilter.b. *BloomFilter.BitSetCopy?
    • add a *BloomFilter.WriteBitSetTo() function?
    • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
    type BitSetWriter interface {
      WriterTo
    }
    
    opened by maciej 9
  • Problem with uniformity of FNV implementation


    Hi Will,

    Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

    The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.

    I tested five different methods of finding bit locations:

    1. your current implementation
    2. an implementation using Murmur3
    3. your implementation preceding changeset ed0c3ad552eb74b37999995a9678f3723ca44454
    4. an implementation using Go's native FNV-1
    5. an implementation using Go's native FNV-1a

    The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

    I think the problem stems from your addition of the index on line 77. That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when i is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

    After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

    Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher: "Our k hash functions will be of the form gi(x) = h1(x) + i*h2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1." I.e.: uniformity is critical or the math doesn't hold up.

    I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

    I hope this was clear enough, and you can find it useful. Thank you!

    opened by turgon 7
  • Concurrent


    I use the package for filtering some data in web requests. Since the BloomFilter structure uses shared locBuff and hasher fields, it is not safe to call the Test function from multiple goroutines (I included a test case to demonstrate the problem). I have added another method (TestConcurrent) that does not use these shared fields. Feel free to reject if you don't like the idea.

    opened by cenkalti 7
  • Is one hash enough?


    Hi Will, is one hash really enough for a bloom filter?

    Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

    Non-fnv-collisions (same bucket chosen from a different hash) do benefit still though, and there will be more of those.

    I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

    Cheers, Bruce

    opened by Bwooce 5
  • Issue when importing package from master branch


    When importing bloom from the master branch with go get -u github.com/bits-and-blooms/[email protected], an error appears that the bitset package has an invalid go.mod: module github.com/willf/bitset.

    opened by moredure 4
  • Heap-free hashing


    There is no good reason why hashing a block of data should require heap allocation or even a non-trivial dynamic buffer.

    Unfortunately, the murmur library makes this a bit difficult. Thankfully, we can just roll our own thin functions. It is possible to do while preserving exactly the same hash values. That's what this PR does.

    The goal of this PR is not to improve speed. Of course, it does so... On my machine...

    Before:

    BenchmarkEstimated
    BenchmarkEstimated-8                    1000000000               0.1333 ns/op          0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            2706183               508.2 ns/op           194 B/op          4 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            3312842               437.6 ns/op            97 B/op          2 allocs/op
    PASS
    

    After...

    BenchmarkEstimated-8                    1000000000               0.05285 ns/op         0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            6219651               451.9 ns/op             0 B/op          0 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            5010103               393.2 ns/op             0 B/op          0 allocs/op
    PASS
    

    So we see about a 10% boost in performance in this instance on my system. Results will vary.

    The important part is that it does away entirely with heap allocation.

    The PR includes tests to ensure that the hash values are preserved.

    Note that if we break backward compatibility (with exact hash value reproduction), there are many great optimization opportunities. But that should be addressed in other PRs.

    Related to PRs https://github.com/willf/bloom/pull/51 and https://github.com/willf/bloom/pull/57

    opened by lemire 4
  • bloom: add func to create bloom with data and m


    Ordinarily the From function will create a new zeroed data slice.

    If we want to create a BloomFilter that is exactly identical to one we had previously, it's not possible.

    Add a new function FromWithM which exposes the data slice as an argument so that we can reconstruct a BloomFilter that is identical in every way (data slice, m, k) to a remote BloomFilter.

    opened by paralin 2
  • False positive rate increases dramatically on big sets


    I wrote a simple code to test the FP%, and found that FP% was good (~2%) on small sets, but suddenly began increasing at some scale, and reached 100% very quickly.

    func data(i uint) []byte {
    	return []byte(fmt.Sprintf("%x", md5.Sum([]byte(string(i)))))
    }
    
    func falsePositive(n, b, k uint) float64 {
    	f := bloom.New(uint(n * b), k)
    	for i := uint(0); i < n; i++ {
    		f.Add(data(i))
    	}
    	fp, rounds := 0, 10000
    	for i := 0; i < rounds; i++ {
    		if f.Test(data(uint(i) + n)) {
    			fp++
    		}
    	}
    	return float64(fp)/float64(rounds)
    }
    
    func main() {
    	fmt.Println(falsePositive(1000, 8, 4))
    	fmt.Println(falsePositive(10000, 8, 4))
    	fmt.Println(falsePositive(100000, 8, 4))
    	fmt.Println(falsePositive(1000000, 8, 4))
    	fmt.Println(falsePositive(1104000, 8, 4))
    	fmt.Println(falsePositive(1108000, 8, 4))
    	fmt.Println(falsePositive(1112000, 8, 4))
    	fmt.Println(falsePositive(1116000, 8, 4))
    }
    

    On a 64-bit MacBook Pro, the output was:

    0.0222
    0.0237
    0.0213
    0.0225
    0.0241
    0.4028
    0.7934
    1
    

    My project needs a filter over tens of million items, but the filter fails at scale of ~1 million due to 100% false positives.
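One likely contributor to the sudden breakdown, offered as an observation rather than a confirmed diagnosis: the test data above uses string(i) on a uint, and in Go converting an integer to a string yields the UTF-8 encoding of that code point. Every value above the maximum code point 0x10FFFF (1,114,111) converts to the replacement character "\uFFFD", so data(i) returns identical bytes for all such i, and that threshold matches the scale where the reported rate jumps:

```go
package main

import "fmt"

func main() {
	// Converting an integer to string yields the UTF-8 of that code point;
	// anything above 0x10FFFF becomes the replacement character "\uFFFD".
	a := string(rune(1114112)) // 0x10FFFF + 1
	b := string(rune(1115000)) // a different value, same resulting string
	fmt.Println(a == b, a == "\uFFFD")
}
```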

    opened by dai-shuo 2
  • General improvements and fixes


    • New reports via Makefile;
    • Minor code cleanup;

    @willf Please check the target/report/errcheck.txt report; there are 3 ignored errors (also reported by AST scan). You may want to report them back eventually. I haven't fixed these so as not to change the functions' signatures.

    opened by nicolaasuni 2
  • 2x 128 bit hash instead of 2x 64 bit hash. Why?


    I noticed that for computing multiple hashes, you make use of the work of Less Hashing, Same Performance, which is calculated by gi(x) = h1(x) + i*h2(x). So I wonder why you decided to generate a 256-bit hash partitioned into 4 uint64s instead of a 128-bit hash partitioned into 2 uint64s 🤔 Wouldn't it be the same regarding hashing, but with a performance improvement?

    opened by aratz-lasa 0
  • Create SECURITY.md


    Hey there!

    I belong to an open source security research community, and a member (@akincibor) has found an issue, but doesn’t know the best way to disclose it.

    If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

    Thank you for your consideration, and I look forward to hearing from you!

    (cc @huntr-helper)

    opened by JamieSlome 0
  • module declares its path as github.com/bits-and-blooms/bitset but require github.com/willf/bitset


    go get -u github.com/willf/bloom

    NOTICE:

    go: downloading github.com/willf/bloom v1.0.0
    go: downloading github.com/willf/bloom v2.0.3+incompatible
    go: downloading github.com/willf/bitset v1.2.2
    go: downloading github.com/spaolacci/murmur3 v1.1.0
    go get: github.com/willf/[email protected] updating to
            github.com/willf/[email protected]: parsing go.mod:
            module declares its path as: github.com/bits-and-blooms/bitset
                    but was required as: github.com/willf/bitset
    
    opened by baipangbai 0
  • Error in readme.md of v3


    The link to the reference on pkg.go.dev should be https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3 rather than https://pkg.go.dev/github.com/bits-and-blooms/bloom . The latter is the doc for v2; the former is for v3.

    opened by lonelyevil 0
  • Backward compatibillity break test failures on s390x


    With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

    --- FAIL: TestHashBasic (0.00s)
        murmur_test.go:29: Backward compatibillity break.
        murmur_test.go:29: Backward compatibillity break.
    [...]
    --- FAIL: TestHashRandom (0.02s)
        murmur_test.go:60: Backward compatibillity break.
        murmur_test.go:60: Backward compatibillity break.
    [...]
    FAIL
    exit status 1
    FAIL	github.com/willf/bloom	0.217s
    

    You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

    opened by davide125 1
Releases: v3.2.0
Owner: Will Fitzgerald