A disk-backed key-value store.

Overview

What is diskv?

Diskv (disk-vee) is a simple, persistent key-value store written in the Go language. It starts with an incredibly simple API for storing arbitrary data on a filesystem by key, and builds several layers of performance-enhancing abstraction on top. The end result is a conceptually simple, but highly performant, disk-backed storage system.

Build Status

Installing

Install Go 1, either from source or with a prepackaged binary. Then,

$ go get github.com/peterbourgon/diskv

Usage

package main

import (
	"fmt"
	"github.com/peterbourgon/diskv"
)

func main() {
	// Simplest transform function: put all the data files into the base dir.
	flatTransform := func(s string) []string { return []string{} }

	// Initialize a new diskv store, rooted at "my-data-dir", with a 1MB cache.
	d := diskv.New(diskv.Options{
		BasePath:     "my-data-dir",
		Transform:    flatTransform,
		CacheSizeMax: 1024 * 1024,
	})

	// Write three bytes to the key "alpha".
	key := "alpha"
	d.Write(key, []byte{'1', '2', '3'})

	// Read the value back out of the store.
	value, _ := d.Read(key)
	fmt.Printf("%v\n", value)

	// Erase the key+value from the store (and the disk).
	d.Erase(key)
}

More complex examples can be found in the "examples" subdirectory.

Theory

Basic idea

At its core, diskv is a map of a key (string) to arbitrary data ([]byte). The data is written to a single file on disk, with the same name as the key. The key determines where that file will be stored, via a user-provided TransformFunc, which takes a key and returns a slice ([]string) corresponding to a path list where the key file will be stored. The simplest TransformFunc,

func SimpleTransform (key string) []string {
    return []string{}
}

will place all keys in the same, base directory. The design is inspired by Redis diskstore; a TransformFunc which emulates the default diskstore behavior is available in the content-addressable-storage example.

Note that your TransformFunc should ensure that one valid key doesn't transform to a subset of another valid key. That is, it shouldn't be possible to construct valid keys that resolve to directory names. As a concrete example, if your TransformFunc splits on every 3 characters, then

d.Write("abcabc", val) // OK: written to <base>/abc/abc/abcabc
d.Write("abc", val)    // Error: attempted write to <base>/abc/abc, but it's a directory

This will be addressed in an upcoming version of diskv.

Probably the most important design principle behind diskv is that your data is always flatly available on the disk. diskv will never do anything that would prevent you from accessing, copying, backing up, or otherwise interacting with your data via common UNIX commandline tools.

Advanced path transformation

If you need more control over the file name written to disk or if you want to support slashes in your key name or special characters in the keys, you can use the AdvancedTransform property. You must supply a function that returns a special PathKey structure, which is a breakdown of a path and a file name. Strings returned must be clean of any slashes or special characters:

func AdvancedTransformExample(key string) *diskv.PathKey {
	path := strings.Split(key, "/")
	last := len(path) - 1
	return &diskv.PathKey{
		Path:     path[:last],
		FileName: path[last] + ".txt",
	}
}

// If you provide an AdvancedTransform, you must also provide its
// inverse:

func InverseTransformExample(pathKey *diskv.PathKey) (key string) {
	txt := pathKey.FileName[len(pathKey.FileName)-4:]
	if txt != ".txt" {
		panic("Invalid file found in storage folder!")
	}
	return strings.Join(pathKey.Path, "/") + pathKey.FileName[:len(pathKey.FileName)-4]
}

func main() {
	d := diskv.New(diskv.Options{
		BasePath:          "my-data-dir",
		AdvancedTransform: AdvancedTransformExample,
		InverseTransform:  InverseTransformExample,
		CacheSizeMax:      1024 * 1024,
	})
	// Write some text to the key "alpha/beta/gamma".
	key := "alpha/beta/gamma"
	d.WriteString(key, "¡Hola!") // will be stored in "<basedir>/alpha/beta/gamma.txt"
	fmt.Println(d.ReadString("alpha/beta/gamma"))
}

Adding a cache

An in-memory caching layer is provided by combining the BasicStore functionality with a simple map structure, and keeping it up-to-date as appropriate. Since the map structure in Go is not threadsafe, it's combined with a RWMutex to provide safe concurrent access.

Adding order

diskv is a key-value store and therefore inherently unordered. An ordering system can be injected into the store by passing something which satisfies the diskv.Index interface. (A default implementation, using Google's btree package, is provided.) Basically, diskv keeps an ordered (by a user-provided Less function) index of the keys, which can be queried.

Adding compression

Something which implements the diskv.Compression interface may be passed during store creation, so that all Writes and Reads are filtered through a compression/decompression pipeline. Several default implementations, using stdlib compression algorithms, are provided. Note that data is cached compressed; the cost of decompression is borne with each Read.

Streaming

diskv also now provides ReadStream and WriteStream methods, to allow very large data to be handled efficiently.

Future plans

  • Needs plenty of robust testing: huge datasets, etc...
  • More thorough benchmarking
  • Your suggestions for use-cases I haven't thought of

Credits and contributions

Original idea, design and implementation: Peter Bourgon Other collaborations: Javier Peletier (Epic Labs)

Issues
  • Switch to actively maintained Red-Black tree implementation?

    Switch to actively maintained Red-Black tree implementation?

    opened by onlyjob 15
  • completeFilename causes invalid memory address or nil pointer dereference

    completeFilename causes invalid memory address or nil pointer dereference

    Hi

    We're running a component in Kubernetes that uses diskv under the hood. The problem is that the process occasionally crashes when it attempts to remove the key from the store. Here is the relevant stack trace:

    github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc42041c2d0, 0x0, 0x1b, 0x0, 0x0) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
    github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc42041c2d0, 0x0, 0x1b, 0x1b, 0x27fdf00) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
    path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11) /usr/lib/go/src/path/filepath/path.go:210
    path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0) /usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
    strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2) /usr/lib/go/src/strings/strings.go:424
    

    Data directory is mounted as a regular host path (/opt/spm/agent) and file names are ksuid-compatible identifiers.

    diskv is initialized with following configuration:

    d := diskv.New(diskv.Options{
    		BasePath:     c.Dir,
    		Transform:    func(s string) []string { return []string{} },
    		CacheSizeMax: 1024 * 1024,
    })
    

    Do you have any pointers or ideas why this would happen?

    bug 
    opened by rabbitstack 11
  • (*File).Chmod on Windows

    (*File).Chmod on Windows

    https://github.com/golang/go/blob/bba88467f86472764a656e61f5f3265ed6853692/src/syscall/syscall_windows.go#L1094

    os.Chmod always return syscall.EWINDOWS error on Windows.

    opened by mchtech 8
  • ensureCacheSpaceWithLock panics with variable sized keys

    ensureCacheSpaceWithLock panics with variable sized keys

    If variable sized keys are being used, then ensureCacheSpaceWithLock() can panic during the cache cleanup routine which loops through safe(). This occurs if the key being inserted is larger than the last key removed.

    It would be better if either this key were inserted anyway (and thus exceeding the cache threshold) or the routine was modified so that it cleared some percent of the total cache size before proceeding.

    bug helpwanted 
    opened by WinstonPrivacy 7
  • Cache expiration

    Cache expiration

    I've used this awesome library in a couple of projects now and I think it is a great tool. One thing I'm wondering about is a way to set a cache expiration policy. The policy would dictate when cache keys become stale and should expire and perhaps be erased. Have you considered any additional features to this library, or perhaps strategies to employ regarding this topic?

    Thanks!

    opened by adampresley 6
  • CAS feature request/offer?

    CAS feature request/offer?

    I was looking for a CAS[1].

    Seems like it could be a pretty small change to diskv. You would need a cryptographic hash function, like say skein or sha256. Then something like:

    d := diskv.New(diskv.Options{
            BasePath:     "my-data-dir",
            Transform:    flatTransform,
            CryptoHash: sha256,
            CacheSizeMax: 1024 * 1024,
        })
    (key, err) := d.put([]byte{'1', '2', '3'}))
    ...
    (value,err) := d.get(key)
    

    I use put/get because they imply (to me anyways) atomic operations, that read/write do not.

    Opinions? Alternatives?

    If I did something like the above and send a pull request, would you consider it?

    [1] http://en.wikipedia.org/wiki/Content-addressable_storage

    opened by spikebike 6
  • Implement atomic write to disk

    Implement atomic write to disk

    Implement atomic write by writing into a temporary file (within the same directory to guarantee that it won't move accross disks), with the same permission and then rename at the last moment to make it available.

    opened by apelisse 5
  • please tag formal releases

    please tag formal releases

    Please consider assigning version numbers and tagging releases. Tags/releases are useful for downstream package maintainers (in Debian and other distributions) to export source tarballs, automatically track new releases and to declare dependencies between packages. Read more in the Debian Upstream Guide.

    Versioning provides additional benefits to encourage vendoring of a particular (e.g. latest stable) release contrary to random unreleased snapshots.

    Versioning provides safety margin for situations like "ops, we've made a mistake but reverted problematic commit in master". Presumably tagged version/release is better tested/reviewed than random snapshot of "master" branch.

    Thank you.

    See also

    • http://semver.org/
    • https://en.wikipedia.org/wiki/Software_versioning
    opened by onlyjob 5
  • Canceling a Keys() walk

    Canceling a Keys() walk

    We need a done channel on the Keys and KeysPrefix iterator. That will be a breaking API to the library however.

    I am happy to send a PR but want to get agreement on the API breakage problem first. Based on discussion from https://github.com/coreos/rocket/pull/209#issuecomment-66217292

    opened by philips 5
  • concurrent access

    concurrent access

    I've a use-case where I have two applications reading/writing to the disk cache. When reading from the disk cache, an application will first read from the in-memory cache, which in some scenario's will be dirty. Is it possible to always read from the physical disk first?

    opened by asanami 5
  • Fix issue #40

    Fix issue #40

    When we call ReadStream, it checks if the desired value is in the cache and if not calls readWithRLock, which returns a diskv.siphon. The siphon will write to the cache when it gets an EOF from the file on disk.

    If we call ReadStream twice on the same key, we get two siphon objects, both of which will write to the cache when they finish reading. However, the cacheWithLock function does not check if the key it has been told to cache already exists in the cache. If the key already exists, cacheWithLock blithely overwrites it (d.cache[key] = val) and then improperly adds to the cache size (d.cacheSize += valueSize).

    This patch makes cacheWithLock blow away any existing cached value before it writes into the cache. Benchmarks show no noticeable difference.

    Fixes #40

    opened by floren 4
  • Implement a fix for #66, excessive memory use in siphon.

    Implement a fix for #66, excessive memory use in siphon.

    The siphon will now stop writing to its internal buffer once the size of the buffer exceeds the maximum cache size. Because we write until we exceed the max cache size, we're safe to attempt the cache update even if the buffer only contains partial data, because it's still over the limit & will be rejected.

    opened by floren 1
  • ReadStream with a very large value results in excessive memory use when cache is enabled

    ReadStream with a very large value results in excessive memory use when cache is enabled

    If the cache is enabled, readWithRLock always reads the file using a siphon.

    The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

    However, if the underlying file is e.g. a gigabyte in size, the siphon will end up with a bytes.Buffer containing that entire gigabyte. Unless you've set your cache size to over a gigabyte, this gets thrown away as soon as the ReadStream is done.

    The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

    opened by floren 2
  • ReadStream/WriteStream can lead to data races

    ReadStream/WriteStream can lead to data races

    If someone calls ReadStream, then proceeds to read from it slowly, then someone else calls WriteStream, the reader will start to get the new resource contents part-way through. This is because createKeyFileWithNoLock calls os.OpenFile with O_TRUNC set when updating a file, and the existing reader ends up pointing at the new data.

    opened by floren 0
  • [Question] Performance of writes

    [Question] Performance of writes

    We use diskv in one of the services I help maintain at work. Attached is a screenshot of the write performance (in seconds) across a few instances. Have you ever seen this kind of behavior before? Screen Shot 2019-05-30 at 8 53 08 AM

    Most of the time, the writes are extremely fast, but every once in a while, the write time blows up.

    Please close the issue is this is not the right place, just not sure where else to post.

    opened by wild-endeavor 2
Releases(v3.0.0)
Owner
Peter Bourgon
The official GitHub account of Dwayne 'The Rock' Johnson
Peter Bourgon
Simple Distributed key-value database (in-memory/disk) written with Golang.

Kallbaz DB Simple Distributed key-value store (in-memory/disk) written with Golang. Installation go get github.com/msam1r/kallbaz-db Usage API // Get

Mohamed Samir 5 Jan 18, 2022
Distributed reliable key-value store for the most critical data of a distributed system

etcd Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order

etcd-io 40.3k Jun 24, 2022
Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service.

Olric Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service. With

Burak Sezer 2.2k Jul 1, 2022
a persistent real-time key-value store, with the same redis protocol with powerful features

a fast NoSQL DB, that uses the same RESP protocol and capable to store terabytes of data, also it integrates with your mobile/web apps to add real-time features, soon you can use it as a document store cause it should become a multi-model db. Redix is used in production, you can use it in your apps with no worries.

Mohammed Al Ashaal 1.1k Jun 13, 2022
GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

GhostDB is designed to speed up dynamic database or API driven websites by storing data in RAM in order to reduce the number of times an external data source such as a database or API must be read. GhostDB provides a very large hash table that is distributed across multiple machines and stores large numbers of key-value pairs within the hash table.

Jake Grogan 721 Jun 9, 2022
Pogreb is an embedded key-value store for read-heavy workloads written in Go.

Embedded key-value store for read-heavy workloads written in Go

Artem Krylysov 900 Jun 26, 2022
CrankDB is an ultra fast and very lightweight Key Value based Document Store.

CrankDB is a ultra fast, extreme lightweight Key Value based Document Store.

Shrey Batra 30 Apr 12, 2022
yakv is a simple, in-memory, concurrency-safe key-value store for hobbyists.

yakv (yak-v. (originally intended to be "yet-another-key-value store")) is a simple, in-memory, concurrency-safe key-value store for hobbyists. yakv provides persistence by appending transactions to a transaction log and restoring data from the transaction log on startup.

Aadhav Vignesh 5 Feb 24, 2022
Multithreaded key value pair store using thread safe locking mechanism allowing concurrent reads

Project Amnesia A Multi-threaded key-value pair store using thread safe locking mechanism allowing concurrent reads. Curious to Try it out?? Check out

Nikhil Nayak 6 Apr 7, 2022
ShockV is a simple key-value store with RESTful API

ShockV is a simple key-value store based on badgerDB with RESTful API. It's best suited for experimental project which you need a lightweight data store.

delihiros 2 Sep 26, 2021
A rest-api that works with golang as an in-memory key value store

In Store A rest-api that works with golang as an in-memory key value store Usage Fist of all, clone the repo with the command below. You must have gol

Eyüp Arslan 0 Oct 24, 2021
Distributed key-value store

Keva Distributed key-value store General Demo Start the server docker-compose up --build Insert data curl -XPOST http://localhost:5555/storage/test1

Yaroslav Gaponov 0 Nov 15, 2021
Simple in memory key-value store.

Simple in memory key-value store. Development This project is written in Go. Make sure you have Go installed (download). Version 1.17 or higher is req

Mustafa Navruz 0 Nov 6, 2021
A simple in-memory key-value store application

vtec vtec, is a simple in-memory key-value store application. vtec provides persistence by appending transactions to a json file and restoring data fr

Ahmet Tek 3 Jun 22, 2022
Biscuit is a multi-region HA key-value store for your AWS infrastructure secrets.

Biscuit Biscuit is a simple key-value store for your infrastructure secrets. Is Biscuit right for me? Biscuit is most useful to teams already using AW

Doug 558 Jun 17, 2022
An in-memory key:value store/cache (similar to Memcached) library for Go, suitable for single-machine applications.

go-cache go-cache is an in-memory key:value store/cache similar to memcached that is suitable for applications running on a single machine. Its major

Patrick Mylund Nielsen 6.3k Jun 25, 2022
NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.

A simple, fast, embeddable, persistent key/value store written in pure Go. It supports fully serializable transactions and many data structures such as list, set, sorted set.

徐佳军 2.3k Jun 24, 2022
KV - a toy in-memory key value store built primarily in an effort to write more go and check out grpc

KV KV is a toy in-memory key value store built primarily in an effort to write more go and check out grpc. This is still a work in progress. // downlo

Ali Mir 0 Dec 30, 2021
PrimeKV is a Secure, REST API driven Key/Value store.

PrimeKV PrimeKV is a Secure, REST API driven Key/Value store. Features 100% In-memory. Encrypted by default, All stored values are bi-directionally en

Josh Burns 0 Jan 10, 2022