An effective time-series data compression/decompression method based on Facebook's Gorilla.

Overview

Gorilla

This package provides the effective time-series data compression method based on Facebook's Gorilla.. In a nutshell, it uses delta-of-delta timestamps and XOR'd floating-points values because most data points arrived at a fixed interval and the value in most time-series does not change significantly to reduce the data size.

Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., & Veeraraghavan, K. (2015). Gorilla. Proceedings of the VLDB Endowment, 8(12), 1816–1827. https://doi.org/10.14778/2824032.2824078

Notes

Use timestamp seconds to keep high compression rate

Using milliseconds as a timestamp increases the amount of bits because timestamps arrives at no longer evenly interval. A delta of delta second timestamp which arrives at a certain interval is basically 0, 1 or -1, so it can keep small data size.

As the value fraction increases, the compression ratio decreases

This method converts each value to a 64-bit sequence of floating-point numbers and XORed with one previous data point. If the values are the same, they will be zero, so encoding them will only require one bit of zero. Also, instead of recording the result of the XOR in 64 bits each time, we compress the bits from the beginning that are followed by 0 (LeadingZeros) and the bits from the end that are followed by 0 (TrailingZeros) and record the rest of the bit sequence. In addition, if there are more digits in the LeadingZeros and TrailingZeros than in the previous value, they are left alone, and only the rest of the bit sequence is recorded. Otherwise, the new LeadingZeros and TrailingZeros values and the rest of the bit sequence are recorded. This compression scheme requires fewer bits if the value is 63.0, 63.5, or some other floating-point number with many zero bits from the middle to the end of the mantissa part. However, for a number like 0.1, many bits in the mantissa are non-zero, so the LeadingZeros and TrailingZeros values become small, and it takes a lot of bits to record the rest of the bit sequence.

Compression window should be longer than 2 hours

Using the same encoding scheme over larger time periods allows us to achieve better compression ratios. However, queries that wish to read data over a short time range might need to expend additional computational resources on decoding data. The paper shows the average compression ratio for the time series as we change the block size. One can see that blocks that extend longer than two hours provide diminishing returns for compressed size. A two-hour block allows us to achieve a compression ratio of 1.37 bytes per data point.

Usage

Installing

go get github.com/kei6u/gorilla

Compressor

buf := new(bytes.Buffer)
header := uint32(time.Now().Unix())

c, finish, err := gorilla.NewCompressor(buf, header)
if err != nil {
    return err
}

if err := c.Compress(uint32(time.Now().Unix()), 10.0); err != nil {
    return err
}
if err := c.Compress(uint32(time.Now().Unix()), 10.5); err != nil {
    return err
}

return finish()

Decompressor

buf := new(bytes.Buffer)

// Compressing time-series data to buf ...

d, h, err := gorilla.NewDecompressor(buf)
if err != nil {
    return err
}

fmt.Printf("header: %v\n", h)

iter := d.Iterator()
for iter.HasNext() {
    t, v := iter.Next()
    fmt.Println(t, v)
}

return iter.Err()
Issues
  • compression does not work well

    compression does not work well

    func Test_gorilla(t *testing.T) {
    	var buf bytes.Buffer
    	c, finish, err := gorilla.NewCompressor(&buf, 27550378*60)
    	assert.NoError(t, err)
    
    	for i := 27550378; i < 27550388; i++ {
    		t.Log("write: ", uint32(i*60), float64(i))
    		assert.NoError(t, c.Compress(uint32(i*60), float64(i)))
    	}
    	assert.NoError(t, finish())
    
    	d, _, err := gorilla.NewDecompressor(bytes.NewReader(buf.Bytes()))
    	assert.NoError(t, err)
    
    	itr := d.Iterator()
    	for itr.Next() {
    		fmt.Println(itr.At())
    	}
    }
    

    the output is

    1653022680 2.7550378e+07
    1653022740 3.190115827897793e+84
    1653022800 2.054801063973621e-147
    1653022860 2.3792970816456578e-70
    1653022920 1.5325406469947978e-301
    1653022980 1.7745608335663593e-224
    1653023040 1.774561542092724e-224
    1653023100 1.7745614776812363e-224
    1653023160 1.7745616709156994e-224
    1653023220 1.7745616065042117e-224
    
    opened by CodingCrush 1
Releases(v0.3.0)
Owner
Keisuke Umegaki
Keisuke Umegaki
A toolkit for replaying time series data.

Replay Toolkit The replay package provides some simple tools for replaying captured data at realtime. I use this in various tools that take logged dat

Dustin Sallings 16 Aug 13, 2019
The Gorilla Programming Language

Gorilla Programming Language Gorilla is a tiny, dynamically typed, multi-engine programming language It has flexible syntax, a compiler, as well as an

null 29 Apr 16, 2022
Seekable ZSTD compression format implemented in Golang.

ZSTD seekable compression format implementation in Go Seekable ZSTD compression format implemented in Golang. This library provides a random access re

Alexey Ivanov 17 Jun 22, 2022
Provide Go Statistics Handler, Struct, Measure Method

Go Statistics Handler About The gosh is an abbreviation for Go Statistics Handler. This Repository is provided following functions. Go runtime statist

Osamu TONOMORI 29 Jul 24, 2022
Go implementation Welford’s method for one-pass variance computation

Variance and standard deviation caluculation using variance's algorithm Table of Contents Introduction Installation Usage Contributing License Introdu

Axiom, Inc. 7 Jun 5, 2022
Converts NFAs (and DFAs) to a regular expressions using the state removal method

nfa2regex: convert NFAs (and DFAs) to regular expressions An implementation of the state removal technique for converting an NFA to a regular expressi

David Wolever 3 Apr 29, 2022
This project is an implementation of Fermat's factorization method in which multiples of prime numbers are factored into their constituent primes

This project is an implementation of Fermat's factorization method in which multiples of prime numbers are factored into their constituent primes. It is a vanity attempt to break RSA Encryption which relies on prime multiples for encryption.

Lih Ingabo 1 Jun 3, 2022
A method dispatcher written in go powered by reflection.

go-dispatcher A single-file dispatcher written in Golang. Description Inspired by the JSON-RPC module of polygon-edge and geth, this package provides

null 6 Feb 21, 2022
Exercise for solve problem data processing, performance and something wrong in passing data

Citcall Exercise Exercise for solve problem data processing, performance and something wrong in passing data Pengolahan data data processing - Readme

Muhammad Fazri 0 Nov 25, 2021
Stackoverflow-Tag-Recommender - Calculate similarity of different tags based on the data of stackoverflow

Recommender System with Stackoverflow's Data This project is written to recommen

Roozbeh Sayadi 8 Apr 4, 2022
Real-time Charging System for Telecom & ISP environments

Real-time Online/Offline Charging System (OCS) for Telecom & ISP environments Features Real-time Online/Offline Charging System (OCS). Account Balance

null 356 Aug 4, 2022
A simple Cron library for go that can execute closures or functions at varying intervals, from once a second to once a year on a specific date and time. Primarily for web applications and long running daemons.

Cron.go This is a simple library to handle scheduled tasks. Tasks can be run in a minimum delay of once a second--for which Cron isn't actually design

Robert K 210 Jul 27, 2022
Weaviate is a cloud-native, modular, real-time vector search engine

Weaviate is a cloud-native, real-time vector search engine (aka neural search engine or deep search engine). There are modules for specific use cases such as semantic search, plugins to integrate Weaviate in any application of your choice, and a console to visualize your data.

SeMI Technologies 2.6k Aug 5, 2022
Visualize plant growth over time with Go, WebDAV and WASM; @pojntfx's entry for #growlab.

Growlapse Visualize plant growth over time with Go, WebDAV and WASM; @pojntfx's entry for #growlab. Installation Containerized You can get the Docker

Felix Pojtinger 7 Feb 21, 2022
Code snippets by first time open source contributors

Introduction Golang code snippets by first time open source contributors Rules How to contribute Add a folder and create your desired code snippet fil

Luigi Morel 1 Oct 6, 2021
Solution to elevator test problem but this time recursive and in go

Synopsis A multi-floor building has a Lift in it. People are queued on different floors waiting for the Lift. Some people want to go up. Some people w

Alex Piemont 0 Nov 8, 2021
Typesafe lazy instantiation to improve service start time

Package lazy is a light wrapper around sync.Once providing support for return values. It removes the burden of capturing return values via closures from the caller.

Jeremy Loy 1 May 30, 2022
FlameScope is a visualization tool for exploring different time ranges as Flame Graphs.

FlameScope FlameScope is a visualization tool for exploring different time ranges as Flame Graphs, allowing quick analysis of performance issues such

Netflix, Inc. 2.7k Aug 5, 2022
Advent of Code 2021 - Time to learn Go

aoc2021 Advent of Code 2021 - Time to learn Go Will contain my solutions for aoc2021, so avoid reading the files in .src/aoc2021/ unless you want spoi

null 1 Dec 22, 2021