HyperLogLog and HyperLogLog++ implementation in Go/Golang.

Overview

Build Status Coverage Status GoDoc

HyperLogLog and HyperLogLog++

Implements the HyperLogLog and HyperLogLog++ algorithms.

HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

HyperLogLog++ paper: http://research.google.com/pubs/pub40671.html

Documentation

Documentation can be found here.

Comparison of Algorithms

The HyperLogLog++ algorithm has much lower error for small cardinalities. This is because it uses a different representation of data for small sets of data. Data generated using this library shows the difference for N < 10000:

N < 10000

HyperLogLog++ also has bias correction which helps offset estimation errors in the original HyperLogLog algorithm. This correction can be seen here, again using data generated using this library:

N < 80000

Future Improvements

  • Right now HLL++ uses 8 bits per register. It could use 6 bits and take less memory.
  • The list compression algorithm could be improved, allowing the sparse representation to be used longer.
Comments
  • Large errors hll++ at high cardinality

    Large errors hll++ at high cardinality

    Once I set my cardinality to 100k using hll++, the error rate becomes massive:

    approx:  10506
    actual:  100000
    delta:   89494
    

    My sample code is pretty much cribbed from your test/bench code. Set the truth variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

    Complete repro code:

    package main
    
    import (
        "github.com/clarkduvall/hyperloglog"
        "fmt"
        "hash/fnv"
        "hash"
    )
    
    func hash64(s string) hash.Hash64 {
        h := fnv.New64()
        h.Write([]byte(s))
        return h
    }
    
    func main() {
        truth := 100000
    
        h, err := hyperloglog.NewPlus(16);
        if err != nil {
            fmt.Println(err)
            return
        }
    
        for i := 0; i < truth; i++ {
                istr := fmt.Sprintf("%s", i)
                h.Add(hash64(istr))
        }
        epp := h.Count()
    
        fmt.Println("approx: ", epp)
        fmt.Println("actual: ", truth)
        fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
    }
    
    opened by nieksand 5
  • Streaming serialization with gob.

    Streaming serialization with gob.

    This is an alternative PR which build around io writers/readers rather than a []byte.

    It uses gob, but things are complicated a bit because:

    1. hll and hllp structs don't export their members
    2. there possible nil pointers at times

    I added the complication handling in marshal/unmarshal rather than make a more invasive package-wide change to play nice with gob.

    One other benefit over the other PR... this maintains the sparse state for hllp when applicable.

    opened by nieksand 4
  • Another case of really large Count value

    Another case of really large Count value

    Similar to #18 I have another case where the Count method returns an insanely large value for a very specific set of inputs.

    Here is the input data: hll.txt

    And here is the script to test it:

    package main
    
    import (
    	"fmt"
    	"hash/fnv"
    	"io"
    	"log"
    
    	"github.com/clarkduvall/hyperloglog"
    )
    
    func main() {
    	a, _ := hyperloglog.NewPlus(12)
    
    	for {
    		var l string
    		if _, err := fmt.Scanln(&l); err == io.EOF {
    			break
    		} else if err != nil {
    			panic(err)
    		}
    		h := fnv.New64a()
    		h.Write([]byte(l))
    		a.Add(h)
    		//log.Printf("added val %q to hll, count is now: %d", l, a.Count())
    	}
    
    	log.Printf("a.Count: %d", a.Count())
    }
    

    Like in #18 I narrowed down the dataset so that omitting the last line causes the large number to not be output:

     /tmp ▻ cat hll.txt | ./hlltest                                                   
    2017/05/22 14:49:54 a.Count: 18446744073709551604                                 
     /tmp ▻ cat hll.txt | head -n-1| ./hlltest                                        
    2017/05/22 14:50:04 a.Count: 20478
    

    Let me know if there's anything I can do to help debug more, thanks!

    opened by mediocregopher 3
  • Make sure toNormal is eventually done even if Count is called often

    Make sure toNormal is eventually done even if Count is called often

    Our app, for various reasons, always calls Count after adding a bunch of items. This has the unfortunate side effect that toNormal never gets called - because Count calls mergeSparse and this resets the size of tmpSet to zero. Thus the size of tmpSet is never big enough for maybeToNormal to trigger toNormal.

    The end result is we have a sparseList of millions of items, and Count is very slow (hundreds of milliseconds).

    This patch fixes this by checking if the threshold is reached during Count.

    opened by sdsykes 2
  • Weird results post-merge

    Weird results post-merge

    I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:

    func main() {
    	add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
    		h := fnv.New64a()
    		h.Write([]byte(strconv.Itoa(i)))
    		hll.Add(h)
    	}
    
    	a, _ := hyperloglog.NewPlus(12)
    	for i := 0; i < 10000; i++ {
    		add(a, i%100)
    	}
    	fmt.Printf("a: %d\n", a.Count())
    
    	b, _ := hyperloglog.NewPlus(12)
    	for i := 0; i < 10000; i++ {
    		add(b, (i%100)+5)
    	}
    	fmt.Printf("b: %d\n", b.Count())
    
    	a.Merge(b)
    	fmt.Printf("a+b: %d\n", a.Count())
    }
    

    I get the output:

    a: 100
    b: 100
    a+b: 6
    

    So the Counts for a and b individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!

    opened by mediocregopher 2
  • Extremely large Count() value after a Merge

    Extremely large Count() value after a Merge

    So this is a super specific test case, and unfortunately I haven't had much luck in boiling it down farther. But for a very specific input we're seeing extremely high Count() values after Merge'ing a populated HLL+ into an empty one.

    Attached is the input data: hll.txt

    This is the test code:

    package main
    
    import (
    	"fmt"
    	"hash/fnv"
    	"io"
    	"log"
    
    	"github.com/clarkduvall/hyperloglog"
    )
    
    func main() {
    	a, _ := hyperloglog.NewPlus(12)
    
    	for {
    		var l string
    		if _, err := fmt.Scanln(&l); err == io.EOF {
    			break
    		} else if err != nil {
    			panic(err)
    		}
    		h := fnv.New64a()
    		h.Write([]byte(l))
    		a.Add(h)
    		log.Printf("added val %q to hll, count is now: %d", l, a.Count())
    	}
    
    	b, _ := hyperloglog.NewPlus(12)
    	b.Merge(a)
    	log.Printf("b.Count: %d", b.Count())
    }
    

    To run:

    cat hll.txt | ./hlltest
    

    The output should end with something like:

    2016/12/12 14:20:06 b.Count: 18446744073709551615
    

    If you omit the last line like this:

    cat hll.txt | head -n-1 | ./hlltest
    

    You get a much more reasonable output:

    2016/12/12 14:26:46 b.Count: 3099
    

    Let me know if there's any other information you need to help debug this, thanks!

    opened by mediocregopher 2
  • Count error rate very high after merge (hll++)

    Count error rate very high after merge (hll++)

    every time I merge 2 hll++ with precision = 18. I find that the count result is totally wrong

    
    func HashFnv1(str string) hash.Hash64 {
        hashed := fnv.New64a()
        hashed.Write([]byte(str))
        return hashed
    }
    
    func BenchmarkHLLPP(b *testing.B) {
        hll1,_ := hyperloglog.NewPlus(18)
        hll2,_ := hyperloglog.NewPlus(18)
    
        for i := 0; i < 1000; i++ {
            hll1.Add(HashFnv1("entry_1_"+strconv.Itoa(i)))
            hll2.Add(HashFnv1("entry_2_"+strconv.Itoa(i)))
        }
    
        b.Log("count hll1 = ", hll1.Count())
        b.Log("count hll2 = ", hll2.Count())
    
        for i := 0; i < b.N; i++ {
            hll1.Merge(hll2)
        }
        b.Log("count hll1 | hll2 =", hll1.Count())
    }
    

    gives me this

    BenchmarkHLLPP-4       50000         36267 ns/op
    --- BENCH: BenchmarkHLLPP-4
        operation_test.go:160: count hll1 =  1000
        operation_test.go:161: count hll2 =  1000
        operation_test.go:166: count hll1 | hll2 = 244
        operation_test.go:160: count hll1 =  1000
        operation_test.go:161: count hll2 =  1000
        operation_test.go:166: count hll1 | hll2 = 244
        operation_test.go:160: count hll1 =  1000
        operation_test.go:161: count hll2 =  1000
        operation_test.go:166: count hll1 | hll2 = 244
        operation_test.go:160: count hll1 =  1000
    
    opened by tuxlinuxien 2
  • Updated benchmarks to use the fnv-a variant.

    Updated benchmarks to use the fnv-a variant.

    This improves the avalanche characteristic and avoids the issue found in: https://github.com/clarkduvall/hyperloglog/issues/10

    I ran the benchmark. Code compiles fine; outputs look reasonable.

    opened by nieksand 2
  • Be less restrictive when adding items

    Be less restrictive when adding items

    When adding items to sets, it's often impractical to use structs that implement full hash.Hash32/hash.Hash64 method set. Most of my implementations contain a bunch of empty methods, pretty much what you have done in your tests with fakeHash32/fakeHash64.

    opened by dim 2
  • add new encode/decode mechanism to support external serialization

    add new encode/decode mechanism to support external serialization

    The existing gob encoder is easy to use but really slow. I'm using protobuf for my project which is much faster, but depends on serializing to/from generated data structures. Rather than introduce a dependency to the library, I've added this interface to translate from hll's internal data to the external struct with as little extra copying and allocation as possible. I think this is sufficiently generic to be used with a variety of communication libraries.

    opened by ianwilkes 1
  • added GobEncoder/GobDecoder interfaces to HyperLogLog and HyperLogLog+

    added GobEncoder/GobDecoder interfaces to HyperLogLog and HyperLogLog+

    I see that other efforts to add serialization/deserialization to HLL are not merged yet, so let me offer another one, because I find it critical to be able to save HLL counter and restore it if app is restarted.

    Basically, I've added gob-compatible interfaces, so HLL and HLLP inside custom structs can be saved/restored seamlessly.

    opened by zserge 1
  • Fix function comments based on best practices from Effective Go

    Fix function comments based on best practices from Effective Go

    Every exported function in a program should have a doc comment. The first sentence should be a summary that starts with the name being declared. From effective go.

    I generated this with CodeLingo and I'm keen to get some feedback, but this is automated so feel free to close it and just say "opt out" to opt out of future CodeLingo outreach PRs.

    opened by BlakeMScurr 1
  • Add load function

    Add load function

    What this does

    • Adds a load function so we can make a new hyperloglog directly from the bytes instead of needing the gob format.

    Why is this needed?

    I needed to deserialize HLLs that were serialized from another language.

    #14 also tried to address this issue, but lacked tests and seemed to be abandoned.

    Instead of setting a state, this returns a new HLL, which I think is cleaner.

    opened by msanterre 0
An app with Trie tree and Breve search Implementation CLI and HTTP both 🥳

Introduction LifeLongLearner project consists of two different parts. My English Vocabulary My Technical Book Notes All of them provided by me within

A.Samet İleri 15 Jul 1, 2022
A simple Set data structure implementation in Go (Golang) using LinkedHashMap.

Set Set is a simple Set data structure implementation in Go (Golang) using LinkedHashMap. This library allow you to get a set of int64 or string witho

Studio Sol Comunicação Digital Ltda 22 Sep 26, 2022
Golang implementation of Radix trees

go-radix Provides the radix package that implements a radix tree. The package only provides a single Tree implementation, optimized for sparse nodes.

Armon Dadgar 768 Dec 30, 2022
Golang FrodoKEM implementation

FrodoKEM in Golang Golang implementation of FrodoKEM: a Practical quantum-secure key encapsulation from generic lattices (https://frodokem.org). This

Ed Riccardi 39 Mar 3, 2022
Trie data structure implementation in Golang 🌳

Gotri Gotri is an Unicode character based Trie/prefix tree implementation in Go, with the suggestion/auto-complete feature for character searching. Si

Monir Zaman 9 Jun 17, 2022
An immutable radix tree implementation in Golang

go-immutable-radix Provides the iradix package that implements an immutable radix tree. The package only provides a single Tree implementation, optimi

HashiCorp 859 Dec 29, 2022
Access LeetCode problems via id, Golang implementation

LCid-Go Introduction This naive toy is based on bunnyxt/lcid, and implemented in Golang for practice. They are same in program logic and static files.

Yunchuan Zheng 0 Jan 15, 2022
Go-merkle - Merkle tree implementation in Golang

go-merkle go-merkle implements a simple merkle tree in Golang. It allows to obta

atomixwap 2 Aug 8, 2022
Go implementation of C++ STL iterators and algorithms.

iter Go implementation of C++ STL iterators and algorithms. Less hand-written loops, more expressive code. README translations: 简体中文 Motivation Althou

disksing 170 Dec 19, 2022
Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Tanner Ryan 130 Nov 20, 2022
Data Structure Libraries and Algorithms implementation

Algorithms Data Structure Libraries and Algorithms implementation in C++ Disclaimer This repository is meant to be used as a reference to learn data s

Priyank Chheda 642 Dec 8, 2022
Go implementation of SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein.

SipHash (Go) Go implementation of SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein (http://131002.net/sip

Dmitry Chestnykh 248 Dec 25, 2022
Data structures and algorithms implementation from ZeroToMastery course

ZeroToMastery Data Structures & Algorithms course This repo includes all the data structure and algorithm exercises solutions and implementations. Ins

Fabio Bozzo 3 Jul 4, 2022
Implementation of various data structures and algorithms in Go

GoDS (Go Data Structures) Implementation of various data structures and algorithms in Go. Data Structures Containers Lists ArrayList SinglyLinkedList

TopXeQ 0 Jan 25, 2022
A Go implementation of an in-memory bloom filter, with support for boltdb and badgerdb as optional data persistent storage.

Sprout A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set. Bloom filters are fast and space

Samsondeen 26 Jul 4, 2022
A Go implementation of a radix tree, that uses binary searches to speed up insert, retrieve and delete operations on dense trees

radixs A Go implementation of a radix tree, that uses binary searches to speed up insert, retrieve and delete operations on dense trees. This implemen

Bruno Moura 0 Feb 14, 2022
Data Structures and Algorithms implementation in Go

Data Structures and Algorithms Clean and simple implementation in Go Implementation There are several data structures and algorithms implemented in th

Mykyta Paliienko 2.6k Jan 2, 2023
Go implementation of Count-Min-Log

Count-Min-Log Count-Min-Log sketch: Approximately counting with approximate counters - Guillaume Pitel & Geoffroy Fouquier TL;DR: Count-Min-Log Sketch

Seif Lotfy 59 Jan 4, 2023
A Go implementation of the Elias-Fano encoding

go-ef A Go implementation of the Elias-Fano encoding Example package main import ( "fmt" "github.com/amallia/go-ef" "os" ) func main() {

Antonio Mallia 27 Nov 23, 2022