Go package implementing Bloom filters

Related tags

go bloom bloom-filters
Overview

Bloom filters

Master Build Status Coverage Status Go Report Card GoDoc

A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.

A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

This implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

n := uint(1000)
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

if filter.Test([]byte("Love"))

For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Finally, there is a method to estimate the false positive rate of a particular bloom filter for a set of size n:

if filter.EstimateFalsePositiveRate(1000) > 0.001

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Discussion here: Bloom filter

Godoc documentation: https://godoc.org/github.com/willf/bloom

Installation

go get -u github.com/willf/bloom

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project include a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa
Issues
  • Applying the bloom filter for HASH160 addresses

    Applying the bloom filter for HASH160 addresses

    Example Address List.

    0af80f98e511577e5c87d56069896c70fe4a6263 0af81021340c23bb20429e47c3bdd62137c322d3 0af811994346cdca9d6dbe8e7da4f3393130fc71 0af81250856d377cf7f6914ebe196a9570ada313 0af814b46c4ef5c57dbc53ae03c812eea37d4aac 0af816bb57c0adac8bca0401d295724cd80b0956 64dfb83dcca1f516d2bac93c8c9e4eb6513fd364 64dfb8c23802f3aa67c9078c0e7970fa17584892 64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239 64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec 64dfba34a39823c273a33fadf49112c063ec3724

    It gives quite a lot of positive and false results and for this reason I cannot use it. I want to use bloom filter to filter about 25 million hash160 addresses. How can I use the settings more efficiently?

    I want to use it for this project. https://github.com/oguzdelioglu/odelbrain

    opened by oguzdelioglu 11
  • Incorrect tests

    Incorrect tests

    Some of the tests are faulty. TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on wikipedia). We get that for 10000 entries and a 0.0001% precision (p = 0.000001 because EstimateFalsePositiveRate (after fix #20) returns a 100-0 range and estimateParameters uses a 1-0 range) the hash should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) there for result in a much higher false positive chance of 0.03%.

    Even with the #20 fix the tests still pass but this is only because the Fnv hashing functions used results in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many other result in a much higher false positive rate than expected.

    For testing I quickly hacked a version which uses the Sha256 hashing function and the research from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.

    opened by erikdubbelboer 11
  • Problem with uniformity of FNV implementation

    Problem with uniformity of FNV implementation

    Hi Will,

    Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

    The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.

    I tested five different methods of finding bit locations:

    1. your current implementation
    2. an implementation using Murmur3
    3. your implementation preceding changeset ed0c3ad552eb74b37999995a9678f3723ca44454
    4. an implementation using Go's native FNV-1
    5. an implementation using Go's native FNV-1a

    The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

    I think the problem stems from your addition of the index on line 77. That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

    After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

    Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher, Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1. I.e.: Uniformity is critical or the math doesn't hold up.

    I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

    I hope this was clear enough, and you can find it useful. Thank you!

    opened by turgon 7
  • Concurrent

    Concurrent

    I use the package for filtering some data in web requests. Since BloomFilter structure uses shared locBuff and hasher fields it is not safe to call Test function from multiple goroutines (included a test case to demonstrate the problem). I have added another method (TestConcurrent) that does not use these shared fileds. Feel free to reject if you don't like the idea.

    opened by cenkalti 7
  • Reduce allocations

    Reduce allocations

    Hello @willf. Please check! Same as https://github.com/willf/bloom/pull/51 but less code, maybe not as fast as 51.

    opened by moredure 6
  • Looking for new maintainer

    Looking for new maintainer

    I am not writing Go code anymore. Would someone like to take over as maintainer? @lemire ?

    opened by willf 5
  • Is one hash enough?

    Is one hash enough?

    Hi Will, is one hash really enough for a bloom filter?

    Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.

    Non-fnv-collisions (same bucket chosen from a different hash) do benefit still though, and there will be more of those.

    I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.

    Cheers, Bruce

    opened by Bwooce 5
  • Heap-free hashing

    Heap-free hashing

    There is no good reason why hashing a block of data should require heap allocation or even a non-trivial dynamic buffer.

    Unfortunately, the murmur library makes this a bit difficult. Thankfully, we can just roll our own thin functions. It is possible to do while preserving exactly the same hash values. That's what this PR does.

    The goal of this PR is not to improve speed. Of course, it does so... On my machine...

    Before:

    BenchmarkEstimated
    BenchmarkEstimated-8                    1000000000               0.1333 ns/op          0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            2706183               508.2 ns/op           194 B/op          4 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            3312842               437.6 ns/op            97 B/op          2 allocs/op
    PASS
    

    After...

    BenchmarkEstimated-8                    1000000000               0.05285 ns/op         0 B/op          0 allocs/op
    BenchmarkSeparateTestAndAdd
    BenchmarkSeparateTestAndAdd-8            6219651               451.9 ns/op             0 B/op          0 allocs/op
    BenchmarkCombinedTestAndAdd
    BenchmarkCombinedTestAndAdd-8            5010103               393.2 ns/op             0 B/op          0 allocs/op
    PASS
    

    So we see about a 10% boost in performance in this instance on my system. Results will vary.

    The important part is that it does away entirely with heap allocation.

    The PR includes tests to ensure that the hash values are preserved.

    Note that if we break backward compatibility (with exact hash value reproduction), there are many great optimization opportunities. But that should be addressed in other PRs.

    Related to PRs https://github.com/willf/bloom/pull/51 and https://github.com/willf/bloom/pull/57

    opened by lemire 4
  • Issue when importing package from master branch

    Issue when importing package from master branch

    when I importing bloom from master branch go get -u github.com/bits-and-blooms/[email protected] error appear that bitset package has invalid go.mod module github.com/willf/bitset

    opened by moredure 4
  • Add TestOrAdd, TestOrAddString

    Add TestOrAdd, TestOrAddString

    @lemire please review https://github.com/bits-and-blooms/bloom/pull/69

    opened by moredure 0
  • Update bloom.go

    Update bloom.go

    null

    opened by moredure 2
  • Backward compatibillity break test failures on s390x

    Backward compatibillity break test failures on s390x

    With version 3.0.1 I'm seeing TestHashBasic and TestHashRandom fail on s390x:

    --- FAIL: TestHashBasic (0.00s)
        murmur_test.go:29: Backward compatibillity break.
        murmur_test.go:29: Backward compatibillity break.
    [...]
    --- FAIL: TestHashRandom (0.02s)
        murmur_test.go:60: Backward compatibillity break.
        murmur_test.go:60: Backward compatibillity break.
    [...]
    FAIL
    exit status 1
    FAIL	github.com/willf/bloom	0.217s
    

    You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).

    opened by davide125 0
  • bloom: expose bitset with a func

    bloom: expose bitset with a func

    to access the underlying BitSet inside the bloom filter

    In some cases it's necessary to access the []uint64 representation within.

    opened by paralin 2
  • Question for hashing strings and order inside it

    Question for hashing strings and order inside it

    Hey,

    I have a question if it is possible to hash strings not depending on the order of strings? I mean

    hash("black cat") = hash("cat black")

    this generates false negatives because changing order of sentence results in totally different hash. Is there any option for it? Maybe splitting sentences and combining hashed strings?

    Thank you for any response.

    opened by robsztal 2
  • WriteTo() variant that exports only the bitset

    WriteTo() variant that exports only the bitset

    Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.

    I'm eager to submit a PR to change that if we could agree on the design.

    Of the top of my head we could do one of those things:

    • expose BloomFilter.b – the simplest solution. However will allow developers to shoot themselves in the foot.
    • add a function that would return a copy of BloomFilter.b. *BloomFilter.BisetCopy?
    • add a *BloomFilter.WriteBitSetTo() function?
    • and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
    type BitSetWriter interface {
      WriterTo
    }
    
    opened by maciej 8
Releases(v3.0.1)
Owner
Will Fitzgerald
Words and numbers
Will Fitzgerald
go/golang: fast bit set Bloom filter

package implements a fast bloom filter with real 'bitset' and JSONMarshal/JSONUnmarshal to store/reload the Bloom filter.

Andreas Briese 115 Jul 28, 2021
Go package implementing Bloom filters

Bloom filters A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item

Will Fitzgerald 1.3k Sep 10, 2021
Probabilistic set data structure

Your basic Bloom filter Golang probabilistic set data structure A Bloom filter is a fast and space-efficient probabilistic data structure used to test

Algorithms to Go 63 Aug 1, 2021
Go library implementing xor filters

xorfilter: Go library implementing xor filters Bloom filters are used to quickly check whether an element is part of a set. Xor filters are a faster a

null 499 Sep 16, 2021
Leaked password check library with bloom filter

Leaked password check With this library you can check the password is probably leaked or not. Pre generated bitset DB includes 6 Million leaked passwo

Kaan Karakaya 32 Sep 22, 2021
Cuckoo Filter: Practically Better Than Bloom

Cuckoo Filter Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient

Seif Lotfy 823 Sep 18, 2021
Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Tanner Ryan 119 Aug 17, 2021
A simple Bloom Filter implementation in Go

This is a simple Bloom filter implementation written in Go. For the theory behind Bloom filters, read http://en.wikipedia.org/wiki/Bloom_filter This

Damian Gryski 16 Apr 26, 2018
Data structure and algorithm library for go, designed to provide functions similar to C++ STL

GoSTL English | 简体中文 Introduction GoSTL is a data structure and algorithm library for go, designed to provide functions similar to C++ STL, but more p

stirlingx 308 Sep 23, 2021
an R-Tree library for Go

rtreego A library for efficiently storing and querying spatial data in the Go programming language. About The R-tree is a popular data structure for e

Daniel Connelly 487 Sep 14, 2021
Go datastructures.

Go Data Structures by Tim Henderson ([email protected]) Copyright 2013, Licensed under the GPL version 2. Please reach out to me directly if you requ

Tim Henderson 346 Aug 17, 2021
:rowboat: Raft implementation in Go

?? Raft This is an instructional implementation of the Raft distributed consensus algorithm in Go. It's accompanied by a series of blog posts: Part 0:

Eli Bendersky 497 Sep 14, 2021
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).

Slim - surprisingly space efficient data types in Golang Slim is collection of surprisingly space efficient data types, with corresponding serializati

null 1.7k Sep 15, 2021
Go collections

collections Summary Basic data structures represent a great opportunity to learn a new langauge. These are some of the data structures that I have bui

Cosmin Nicolaescu 28 May 6, 2021
高性能的分布式的专门空间优化的 Bitmap 服务, 高效检查数据是否存在,日活统计,签到,打点等等

基于Raft的数据一致性分布式的 bitmap 服务 bitmap(位图)技术是数据库、大数据和互联网业务等场景下经常使用的一种技术。 存在性判断 爬虫url去重 垃圾邮件过滤 用户已阅读 用户已赞 ... 去重 数据库 大数据计算 架构图 支持三种协议读写: HTTP、redcis和rpcx。 写

rpcx.io 133 Sep 10, 2021
101+ coding interview problems in Go

116+ Coding Interview Problems with Detailed Solutions The Ultimate Go Study Guide eBook version → Join my mailing list to get the latest updates here

Hoanh An 3.1k Sep 17, 2021
Go framework to simplify CRUD of structured data using Graph operations

gocrud Go framework to simplify creating, reading, updating, and deleting arbitrary depth structured data — to make building REST services fast and ea

Manish R Jain 310 Aug 11, 2021
☔️ A complete Go cache library that brings you multiple ways of managing your caches

Gocache Guess what is Gocache? a Go cache library. This is an extendable cache library that brings you a lot of features for caching data. Overview He

Vincent Composieux 950 Sep 15, 2021
Gota: DataFrames and data wrangling in Go (Golang)

Gota: DataFrames, Series and Data Wrangling for Go This is an implementation of DataFrames, Series and data wrangling methods for the Go programming l

null 1.8k Sep 22, 2021