# HyperLogLog and HyperLogLog++ implementation in Go/Golang.

### Related tags

Data Structures hyperloglog

# HyperLogLog and HyperLogLog++

Implements the HyperLogLog and HyperLogLog++ algorithms.

HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

## Documentation

Documentation can be found here.

## Comparison of Algorithms

The HyperLogLog++ algorithm has much lower error for small cardinalities. This is because it uses a different representation of data for small sets of data. Data generated using this library shows the difference for N < 10000:

HyperLogLog++ also has bias correction which helps offset estimation errors in the original HyperLogLog algorithm. This correction can be seen here, again using data generated using this library:

## Future Improvements

• Right now HLL++ uses 8 bits per register. It could use 6 bits and take less memory.
• The list compression algorithm could be improved, allowing the sparse representation to be used longer.
• #### Large errors hll++ at high cardinality

Once I set my cardinality to 100k using hll++, the error rate becomes massive:

``````approx:  10506
actual:  100000
delta:   89494
``````

My sample code is pretty much cribbed from your test/bench code. Set the `truth` variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

Complete repro code:

``````package main

import (
"github.com/clarkduvall/hyperloglog"
"fmt"
"hash/fnv"
"hash"
)

func hash64(s string) hash.Hash64 {
h := fnv.New64()
h.Write([]byte(s))
return h
}

func main() {
truth := 100000

h, err := hyperloglog.NewPlus(16);
if err != nil {
fmt.Println(err)
return
}

for i := 0; i < truth; i++ {
istr := fmt.Sprintf("%s", i)
}
epp := h.Count()

fmt.Println("approx: ", epp)
fmt.Println("actual: ", truth)
fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
}
``````
opened by nieksand 5
• #### Streaming serialization with gob.

This is an alternative PR which build around io writers/readers rather than a []byte.

It uses gob, but things are complicated a bit because:

1. hll and hllp structs don't export their members
2. there possible nil pointers at times

I added the complication handling in marshal/unmarshal rather than make a more invasive package-wide change to play nice with gob.

One other benefit over the other PR... this maintains the sparse state for hllp when applicable.

opened by nieksand 4
• #### Another case of really large Count value

Similar to #18 I have another case where the `Count` method returns an insanely large value for a very specific set of inputs.

Here is the input data: hll.txt

And here is the script to test it:

``````package main

import (
"fmt"
"hash/fnv"
"io"
"log"

"github.com/clarkduvall/hyperloglog"
)

func main() {
a, _ := hyperloglog.NewPlus(12)

for {
var l string
if _, err := fmt.Scanln(&l); err == io.EOF {
break
} else if err != nil {
panic(err)
}
h := fnv.New64a()
h.Write([]byte(l))
//log.Printf("added val %q to hll, count is now: %d", l, a.Count())
}

log.Printf("a.Count: %d", a.Count())
}
``````

Like in #18 I narrowed down the dataset so that omitting the last line causes the large number to not be output:

`````` /tmp ▻ cat hll.txt | ./hlltest
2017/05/22 14:49:54 a.Count: 18446744073709551604
/tmp ▻ cat hll.txt | head -n-1| ./hlltest
2017/05/22 14:50:04 a.Count: 20478
``````

Let me know if there's anything I can do to help debug more, thanks!

opened by mediocregopher 3
• #### Make sure toNormal is eventually done even if Count is called often

Our app, for various reasons, always calls Count after adding a bunch of items. This has the unfortunate side effect that toNormal never gets called - because Count calls mergeSparse and this resets the size of tmpSet to zero. Thus the size of tmpSet is never big enough for maybeToNormal to trigger toNormal.

The end result is we have a sparseList of millions of items, and Count is very slow (hundreds of milliseconds).

This patch fixes this by checking if the threshold is reached during Count.

opened by sdsykes 2
• #### Weird results post-merge

I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:

``````func main() {
add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
h := fnv.New64a()
h.Write([]byte(strconv.Itoa(i)))
}

a, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
}
fmt.Printf("a: %d\n", a.Count())

b, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
}
fmt.Printf("b: %d\n", b.Count())

a.Merge(b)
fmt.Printf("a+b: %d\n", a.Count())
}
``````

I get the output:

``````a: 100
b: 100
a+b: 6
``````

So the Counts for `a` and `b` individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!

opened by mediocregopher 2
• #### Extremely large Count() value after a Merge

So this is a super specific test case, and unfortunately I haven't had much luck in boiling it down farther. But for a very specific input we're seeing extremely high Count() values after Merge'ing a populated HLL+ into an empty one.

Attached is the input data: hll.txt

This is the test code:

``````package main

import (
"fmt"
"hash/fnv"
"io"
"log"

"github.com/clarkduvall/hyperloglog"
)

func main() {
a, _ := hyperloglog.NewPlus(12)

for {
var l string
if _, err := fmt.Scanln(&l); err == io.EOF {
break
} else if err != nil {
panic(err)
}
h := fnv.New64a()
h.Write([]byte(l))
log.Printf("added val %q to hll, count is now: %d", l, a.Count())
}

b, _ := hyperloglog.NewPlus(12)
b.Merge(a)
log.Printf("b.Count: %d", b.Count())
}
``````

To run:

``````cat hll.txt | ./hlltest
``````

The output should end with something like:

``````2016/12/12 14:20:06 b.Count: 18446744073709551615
``````

If you omit the last line like this:

``````cat hll.txt | head -n-1 | ./hlltest
``````

You get a much more reasonable output:

``````2016/12/12 14:26:46 b.Count: 3099
``````

Let me know if there's any other information you need to help debug this, thanks!

opened by mediocregopher 2
• #### Count error rate very high after merge (hll++)

every time I merge 2 hll++ with precision = 18. I find that the count result is totally wrong

``````
func HashFnv1(str string) hash.Hash64 {
hashed := fnv.New64a()
hashed.Write([]byte(str))
return hashed
}

func BenchmarkHLLPP(b *testing.B) {
hll1,_ := hyperloglog.NewPlus(18)
hll2,_ := hyperloglog.NewPlus(18)

for i := 0; i < 1000; i++ {
}

b.Log("count hll1 = ", hll1.Count())
b.Log("count hll2 = ", hll2.Count())

for i := 0; i < b.N; i++ {
hll1.Merge(hll2)
}
b.Log("count hll1 | hll2 =", hll1.Count())
}
``````

gives me this

``````BenchmarkHLLPP-4       50000         36267 ns/op
--- BENCH: BenchmarkHLLPP-4
operation_test.go:160: count hll1 =  1000
operation_test.go:161: count hll2 =  1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 =  1000
operation_test.go:161: count hll2 =  1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 =  1000
operation_test.go:161: count hll2 =  1000
operation_test.go:166: count hll1 | hll2 = 244
operation_test.go:160: count hll1 =  1000
``````
opened by tuxlinuxien 2
• #### Updated benchmarks to use the fnv-a variant.

This improves the avalanche characteristic and avoids the issue found in: https://github.com/clarkduvall/hyperloglog/issues/10

I ran the benchmark. Code compiles fine; outputs look reasonable.

opened by nieksand 2
• #### Be less restrictive when adding items

When adding items to sets, it's often impractical to use structs that implement full `hash.Hash32/hash.Hash64` method set. Most of my implementations contain a bunch of empty methods, pretty much what you have done in your tests with `fakeHash32/fakeHash64`.

opened by dim 2
• #### add new encode/decode mechanism to support external serialization

The existing gob encoder is easy to use but really slow. I'm using protobuf for my project which is much faster, but depends on serializing to/from generated data structures. Rather than introduce a dependency to the library, I've added this interface to translate from hll's internal data to the external struct with as little extra copying and allocation as possible. I think this is sufficiently generic to be used with a variety of communication libraries.

opened by ianwilkes 1
• #### added GobEncoder/GobDecoder interfaces to HyperLogLog and HyperLogLog+

I see that other efforts to add serialization/deserialization to HLL are not merged yet, so let me offer another one, because I find it critical to be able to save HLL counter and restore it if app is restarted.

Basically, I've added gob-compatible interfaces, so HLL and HLLP inside custom structs can be saved/restored seamlessly.

opened by zserge 1
• #### Fix function comments based on best practices from Effective Go

Every exported function in a program should have a doc comment. The first sentence should be a summary that starts with the name being declared. From effective go.

I generated this with CodeLingo and I'm keen to get some feedback, but this is automated so feel free to close it and just say "opt out" to opt out of future CodeLingo outreach PRs.

opened by BlakeMScurr 1

# What this does

• Adds a load function so we can make a new hyperloglog directly from the bytes instead of needing the `gob` format.

# Why is this needed?

I needed to deserialize HLLs that were serialized from another language.

#14 also tried to address this issue, but lacked tests and seemed to be abandoned.

Instead of setting a state, this returns a new HLL, which I think is cleaner.

opened by msanterre 0
###### An app with Trie tree and Breve search Implementation CLI and HTTP both 🥳

Introduction LifeLongLearner project consists of two different parts. My English Vocabulary My Technical Book Notes All of them provided by me within

15 Jul 1, 2022
###### A simple Set data structure implementation in Go (Golang) using LinkedHashMap.

Set Set is a simple Set data structure implementation in Go (Golang) using LinkedHashMap. This library allow you to get a set of int64 or string witho

22 Sep 26, 2022
###### Golang implementation of Radix trees

go-radix Provides the radix package that implements a radix tree. The package only provides a single Tree implementation, optimized for sparse nodes.

768 Dec 30, 2022
###### Golang FrodoKEM implementation

FrodoKEM in Golang Golang implementation of FrodoKEM: a Practical quantum-secure key encapsulation from generic lattices (https://frodokem.org). This

39 Mar 3, 2022
###### Trie data structure implementation in Golang 🌳

Gotri Gotri is an Unicode character based Trie/prefix tree implementation in Go, with the suggestion/auto-complete feature for character searching. Si

9 Jun 17, 2022
###### An immutable radix tree implementation in Golang

go-immutable-radix Provides the iradix package that implements an immutable radix tree. The package only provides a single Tree implementation, optimi

859 Dec 29, 2022
###### Access LeetCode problems via id, Golang implementation

LCid-Go Introduction This naive toy is based on bunnyxt/lcid, and implemented in Golang for practice. They are same in program logic and static files.

0 Jan 15, 2022
###### Go-merkle - Merkle tree implementation in Golang

go-merkle go-merkle implements a simple merkle tree in Golang. It allows to obta

2 Aug 8, 2022
###### Go implementation of C++ STL iterators and algorithms.

iter Go implementation of C++ STL iterators and algorithms. Less hand-written loops, more expressive code. README translations: 简体中文 Motivation Althou

170 Dec 19, 2022
###### Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

130 Nov 20, 2022
###### Data Structure Libraries and Algorithms implementation

Algorithms Data Structure Libraries and Algorithms implementation in C++ Disclaimer This repository is meant to be used as a reference to learn data s

642 Dec 8, 2022
###### Go implementation of SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein.

SipHash (Go) Go implementation of SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein (http://131002.net/sip

248 Dec 25, 2022
###### Data structures and algorithms implementation from ZeroToMastery course

ZeroToMastery Data Structures & Algorithms course This repo includes all the data structure and algorithm exercises solutions and implementations. Ins

3 Jul 4, 2022
###### Implementation of various data structures and algorithms in Go

GoDS (Go Data Structures) Implementation of various data structures and algorithms in Go. Data Structures Containers Lists ArrayList SinglyLinkedList

0 Jan 25, 2022
###### A Go implementation of an in-memory bloom filter, with support for boltdb and badgerdb as optional data persistent storage.

Sprout A bloom filter is a probabilistic data structure that is used to determine if an element is present in a set. Bloom filters are fast and space

26 Jul 4, 2022
###### A Go implementation of a radix tree, that uses binary searches to speed up insert, retrieve and delete operations on dense trees

radixs A Go implementation of a radix tree, that uses binary searches to speed up insert, retrieve and delete operations on dense trees. This implemen

0 Feb 14, 2022
###### Data Structures and Algorithms implementation in Go

Data Structures and Algorithms Clean and simple implementation in Go Implementation There are several data structures and algorithms implemented in th

2.6k Jan 2, 2023
###### Go implementation of Count-Min-Log

Count-Min-Log Count-Min-Log sketch: Approximately counting with approximate counters - Guillaume Pitel & Geoffroy Fouquier TL;DR: Count-Min-Log Sketch

59 Jan 4, 2023
###### A Go implementation of the Elias-Fano encoding

go-ef A Go implementation of the Elias-Fano encoding Example package main import ( "fmt" "github.com/amallia/go-ef" "os" ) func main() {

27 Nov 23, 2022