RocksDB/LevelDB inspired key-value database in Go


Pebble

Nightly benchmarks

Pebble is a LevelDB/RocksDB inspired key-value store focused on performance and internal usage by CockroachDB. Pebble inherits the RocksDB file formats and a few extensions such as range deletion tombstones, table-level bloom filters, and updates to the MANIFEST format.

Pebble intentionally does not aspire to include every feature in RocksDB and specifically targets the use cases and feature set needed by CockroachDB:

  • Block-based tables
  • Checkpoints
  • Indexed batches
  • Iterator options (lower/upper bound, table filter)
  • Level-based compaction
  • Manual compaction
  • Merge operator
  • Prefix bloom filters
  • Prefix iteration
  • Range deletion tombstones
  • Reverse iteration
  • SSTable ingestion
  • Single delete
  • Snapshots
  • Table-level bloom filters

RocksDB has a large number of features that are not implemented in Pebble:

  • Backups
  • Column families
  • Delete files in range
  • FIFO compaction style
  • Forward iterator / tailing iterator
  • Hash table format
  • Memtable bloom filter
  • Persistent cache
  • Pin iterator key / value
  • Plain table format
  • SSTable ingest-behind
  • Sub-compactions
  • Transactions
  • Universal compaction style

WARNING: Pebble may silently corrupt data or behave incorrectly if used with a RocksDB database that uses a feature Pebble doesn't support. Caveat emptor!

Production Ready

Pebble was introduced as an alternative storage engine to RocksDB in CockroachDB v20.1 (released May 2020) and was used in production successfully at that time. Pebble was made the default storage engine in CockroachDB v20.2 (released Nov 2020). Pebble is being used in production by users of CockroachDB at scale and is considered stable and production ready.


Pebble offers several improvements over RocksDB:

  • Faster reverse iteration via backwards links in the memtable's skiplist.
  • Faster commit pipeline that achieves better concurrency.
  • Seamless merged iteration of indexed batches. The mutations in the batch conceptually occupy another memtable level.
  • Smaller, more approachable code base.

See the Pebble vs RocksDB: Implementation Differences doc for more details on implementation differences.

RocksDB Compatibility

Pebble strives for forward compatibility with RocksDB 6.2.1 (the latest version of RocksDB used by CockroachDB). Forward compatibility means that a DB generated by RocksDB can be used by Pebble. Currently, Pebble provides bidirectional compatibility with RocksDB (a Pebble generated DB can be used by RocksDB), but that will change in the future as new functionality is introduced to Pebble. In general, Pebble only provides compatibility with the subset of functionality and configuration used by CockroachDB. The scope of RocksDB functionality and configuration is too large to adequately test and document all the incompatibilities. The list below contains known incompatibilities.

  • Pebble's use of WAL recycling is only compatible with RocksDB's kTolerateCorruptedTailRecords WAL recovery mode. Older versions of RocksDB would automatically map incompatible WAL recovery modes to kTolerateCorruptedTailRecords. New versions of RocksDB will disable WAL recycling.
  • Column families. Pebble does not support column families, nor does it attempt to detect their usage when opening a DB that may contain them.
  • Hash table format. Pebble does not support the hash table sstable format.
  • Plain table format. Pebble does not support the plain table sstable format.
  • SSTable format version 3 and 4. Pebble does not currently support version 3 and version 4 format sstables. The sstable format version is controlled by the BlockBasedTableOptions::format_version option. See #97.


Pebble is based on the incomplete Go version of LevelDB:

The Go version of LevelDB is based on the C++ original:

Optimizations and inspiration were drawn from RocksDB:

Getting Started

Example Code

package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	key := []byte("hello")
	if err := db.Set(key, []byte("world"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	value, closer, err := db.Get(key)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s %s\n", key, value)
	if err := closer.Close(); err != nil {
		log.Fatal(err)
	}
	if err := db.Close(); err != nil {
		log.Fatal(err)
	}
}
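Building on the example above, a sketch of bounded iteration using the IterOptions lower/upper bounds mentioned in the feature list. Note that the NewIter signature varies across Pebble versions (recent versions return an (*Iterator, error) pair); this sketch uses the older single-return form.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scan every key in ["a", "z"). Newer Pebble versions also
	// return an error from NewIter.
	iter := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("a"),
		UpperBound: []byte("z"),
	})
	for iter.First(); iter.Valid(); iter.Next() {
		fmt.Printf("%s=%s\n", iter.Key(), iter.Value())
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```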
  • panic on arm64


    [5058]: unexpected fault address 0x6c1ed94dad07336
    [5058]: fatal error: fault
    [5058]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x6c1ed94dad07336 pc=0x601e70]
    [5058]: goroutine 30 [running]:
    [5058]: runtime.throw / runtime.sigpanic
    [5058]: internal/cache.(*shard).metaAdd              internal/cache/clockpro.go:310
    [5058]: internal/cache.(*shard).Set                  internal/cache/clockpro.go:141
    [5058]: internal/cache.(*Cache).Set                  internal/cache/clockpro.go:658
    [5058]: sstable.(*Reader).readBlock                  sstable/reader.go:1519
    [5058]: sstable.(*singleLevelIterator).loadBlock / skipForward / Next
    [5058]: sstable.(*twoLevelCompactionIterator).Next   sstable/reader.go:913
    [5058]: pebble.(*mergingIter).nextEntry / Next       merging_iter.go:495, 981
    [5058]: pebble.(*compactionIter).nextInStripe / Next compaction_iter.go:409, 247
    [5058]: pebble.(*DB).runCompaction                   compaction.go:2234
    [5058]: pebble.(*DB).compact1 / compact              compaction.go:1744, 1702
    [5058]: created by pebble.(*DB).maybeScheduleCompactionPicker  compaction.go:1499
    (remaining goroutines elided)
    opened by uschen 40
  • db: pebble internal error


    Hi @petermattis

    I got the following error when trying to open the attached DB using the latest version of Pebble. The attached DB is generated using RocksDB.

    pebble: internal error: L0 flushed file 000019 overlaps with the largest seqnum of a preceding flushed file: 8212-11042 vs 8416

    Both this one and #566 were observed while testing whether Pebble can operate correctly on a RocksDB-generated DB and vice versa. The RocksDB/Pebble features used are pretty basic:

    • Write to DB via WriteBatch with sync set to true
    • Read from DB via Get and forward iterator
    • Delete a single record using the regular delete operation
    • Range delete
    • Manual compaction
    opened by lni 33
  • perf: read-based compaction heuristic


    Re-introduce a read-based compaction heuristic. LevelDB had a heuristic to trigger a compaction of a table if that table was being read frequently and there were tables from multiple levels involved in the read. This essentially reduced read-amplification in read heavy workloads. A read-based compaction is only selected if a size-based compaction is not needed.

    In addition to reducing read-amplification, a read compaction would also squash deletion tombstones. So we'd get a scan performance benefit in not needing to skip the deletion tombstones.
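LevelDB's version of this heuristic gave each table a seek allowance proportional to its size, scheduling the table for compaction once the allowance was exhausted. A minimal sketch of that idea (the size/16KB allowance mirrors LevelDB; this is illustrative, not Pebble code):

```go
package main

import "fmt"

// file models an sstable under a read-triggered compaction heuristic:
// each table may absorb a number of seeks proportional to its size
// before compacting it is considered worthwhile.
type file struct {
	size         int64
	allowedSeeks int64
}

func newFile(size int64) *file {
	f := &file{size: size}
	// One seek costs roughly as much as compacting ~16KB of data, so a
	// table of N bytes earns about N/16KB seeks before compaction pays off.
	f.allowedSeeks = size / 16384
	if f.allowedSeeks < 100 {
		f.allowedSeeks = 100
	}
	return f
}

// recordSeek charges one read against the table and reports whether it
// should now be scheduled for a read-based compaction.
func (f *file) recordSeek() bool {
	f.allowedSeeks--
	return f.allowedSeeks <= 0
}

func main() {
	f := newFile(2 << 20) // a 2MB table earns 128 allowed seeks
	trigger := 0
	for i := 0; i < 200; i++ {
		if f.recordSeek() && trigger == 0 {
			trigger = i + 1
		}
	}
	fmt.Println(trigger) // 128
}
```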

    opened by petermattis 31
  • db: incorporate range tombstones into compaction heuristics


    Mentioned in

    Note that CRDB now also marks any sstable containing a range deletion tombstone as requiring compaction, which will tend to result in range deletion tombstones quickly compacting down to the bottom of the LSM and thus freeing up space. Incorporating better heuristics around range deletion tombstones into compaction is worth investigating.

    @ajkr and I both recall discussing this in the past and feel that something akin to the "compensated size" adjustment RocksDB performs for point deletions should be done for range tombstones. For a tombstone at Ln, we could estimate the number of bytes that tombstone covers in Ln+1 using something like RocksDB's GetApproximateSizes. This would be cheapish and would only need to be done when an sstable is created.

    opened by petermattis 30
  • DeleteRange truncation issues


    There are a couple issues with range tombstone truncation exposed in these test cases:

    I believe the current approach (implicit truncation) works nicely for point lookups and iterator next/prev. But the above test cases shows it doesn't handle all cases of iterator seek or writing compaction outputs.

    For iterator seek, it should be enough to limit the seek over shadowed data to the sstable boundaries.

    For writing compaction outputs it is a bit more difficult. We need to truncate tombstones before writing them, but we can't simply use the input sstables' boundaries in all cases. In RocksDB we solved this by truncating to the compaction unit boundary. I believe the analogous thing should work here, even though compaction units are computed differently.

    opened by ajkr 29
  • Blocked on alloc under macOS


    I'm running a local roachprod cluster on macOS with 8 nodes (cockroachdb/cockroach@ce3b29b71f), and creating a table with some data hangs forever:

    $ roachprod create local -n 8
    $ roachprod start local --racks 3
    ALTER DATABASE test CONFIGURE ZONE USING num_replicas = 5, range_min_bytes = 1e6, range_max_bytes=10e6;
    USE test;
    CREATE TABLE data AS SELECT id, REPEAT('x', 1024) FROM generate_series(1, 1e6) AS id;

    The logs kept repeating these messages over and over:

    E210407 14:01:42.440401 2322884 kv/kvserver/queue.go:1093 ⋮ [n6,raftsnapshot,s6,r78/2:‹/{Table/53/1/-…-Max}›] 5294  (n8,s8):3: remote couldn't accept VIA_SNAPSHOT_QUEUE snapshot ‹c40b7a30› at applied index 13 with error: ‹[n8,s8],r78: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n8,s8,r69/3:/{Table/53/1/-…-Max}] (incoming /{Table/53/1/-9222246136947506714-Max})›

    Range 69 is stuck applying Raft entries before a range split, which is why the GC never goes through:

    Screenshot 2021-04-07 at 16 06 09

    Looking at the goroutine stacks, we found this Pebble call which has been blocked on alloc for 67 minutes:

    goroutine 250 [syscall, 67 minutes]:
    (cgo allocation via pebble/internal/manual)
    internal/cache.(*Cache).Alloc
    sstable.(*Reader).readBlock
    sstable.(*Reader).readMetaindex
    pebble.(*DB).Ingest
    storage.(*Pebble).IngestExternalFiles
    kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch
    kvserver.(*replicaAppBatch).Stage
    apply.(*Task).applyOneBatch
    apply.(*Task).ApplyCommittedEntries
    kvserver.(*Replica).handleRaftReadyRaftMuLocked
    kvserver.(*Replica).handleRaftReady
    kvserver.(*Store).processReady
    kvserver.(*raftScheduler).worker
    created by stop.(*Stopper).RunAsyncTask
    (frame arguments and file paths elided)
    opened by erikgrinaker 26
  • internal/metamorphic: TestMeta failed


    internal/metamorphic.TestMeta failed with artifacts on master @ 634eb34a29e146116a53e77e1fea8f67bce8a686:

    db.Merge("lwaptkkqtrk", "yfjsnmvj")
    iter203.SetBounds("eyrwxpljf", "eyrwxpljf")
    iter203.SeekGE("eyrwxpljf", "")
    db.Set("wbcwwpaez", "jjhyaoiboqwmlglcgqz")
    db.Set("xksmkbla", "gsvggqbwuamjrqvbuh")
    iter202.SetBounds("wnmzeeiq", "ztpabkh")
    iter202.SeekLT("ztpabkh", "")
    iter202.SetBounds("ztpabkh", "ztpabkh")
    iter202.SeekGE("ztpabkh", "")
    snap148 = db.NewSnapshot()
    iter202.SetBounds("ztpabkh", "ztpabkh")
    iter202.SeekLT("ztpabkh", "")
    iter203.SeekLT("nhmcoe", "")
    db.Set("wykrmx", "lrjajwnwqzfvrp")
    iter202.SetBounds("ztpabkh", "ztpabkh")
    iter202.SeekGE("ztpabkh", "")
    db.Merge("mlxwyjny", "tavwbaucawnaqwblk")
    ERROR: exit status 1
    10 runs completed, 1 failures, over 6m36s
    context canceled

    To reproduce, try:

    go test -mod=vendor -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run TestMeta$ ./internal/metamorphic -seed 1632284396633596427


    O-robot C-test-failure metamorphic-failure branch-master 
    opened by cockroach-teamcity 24
  • perf: pacing user writes, flushes and compactions


    Add a mechanism to pace (rate limit) user writes, flushes and compactions. Some level of rate limiting of user writes is necessary to prevent user writes from blowing up memory (if flushes can't keep up) or creating too many L0 tables (if compactions can't keep up). The existing control mechanisms mirror those present in RocksDB: Options.MemTableStopWritesThreshold, Options.L0CompactionThreshold and Options.L0SlowdownWritesThreshold. These control mechanisms are blunt, resulting in undesirable hiccups in write performance. The controller object was an initial attempt at providing smoother write throughput; it achieved some success, but is too fragile.

    The problem here is akin to the problem with pacing a concurrent garbage collector. The Go GC pacer design should be inspiration. We want to balance the rate of dirty data arriving in memtables with the rate of flushes and compactions. Flushes and compactions should also be throttled to run at the rate necessary to keep up with incoming user writes and no faster so as to leave CPU available for user reads and other operations. A challenge here is to adjust quickly to changing user load.

    opened by petermattis 24
  • db: provide option to flush WAL in DB.Checkpoint


    I wrote a simple test case to check whether the checkpoint has the same data, as below:

    func TestCheckpointData(t *testing.T) {
    	// some init, xxxx
    	eng, err := pebble.Open(dataDir, opts)
    	wb := eng.NewBatch()
    	wb.Set(key, value, writeOpts)
    	eng.Apply(wb, writeOpts)
    	eng.Checkpoint(ckPath) // take the checkpoint, then reopen it
    	eng2, err := pebble.Open(ckPath, opts)
    	v1, closer1, err := eng.Get(key)
    	v2, closer2, err := eng2.Get(key)
    	assert.Equal(t, v1, v2)    // failed
    	assert.Equal(t, v2, value) // failed
    }
    opened by absolute8511 21
  • sstable: add parallelized RewriteKeySuffixes function


    This adds a function to the sstable package that can replace some suffix with another suffix in every key of an sstable, enforcing that the sstable consists only of Sets.

    If the replacement is passed a filter policy, rather than building new filters, the filter blocks from the original sstable are copied over, but only after checking that it used the same filter policy and the same splitter, and that the splitter splits off the same suffix length as is being replaced, i.e. that the pre-replacement filters remain valid post-replacement.

    Data blocks are rewritten in parallel. This informs a few choices in how it does so: rewritten keys are not passed to addPoint, but rather are directly handed to a worker-local blockWriter. Finished blocks are not immediately flushed to the output or added to the index, but rather are added to a buffer, which is then iterated and flushed at the end.

    A significant implication of not passing each key to addPoint is that individual KVs are not passed to the table and block property collectors during rewriting. Instead, when flushing the finished blocks, each table collector is passed just the first key of the table and each block prop collector is passed the first key of each block. For the collectors that CockroachDB uses, to which only the timestamp key suffix is relevant, this approach may be sufficient, as during replacement all keys in all blocks have the same suffix. However extending the collector API could allow this to be more general, e.g. either by making a separate API for this approach, like AddOneKeyForBlock, or by making the collectors able to collect and aggregate partial results, so each rewriter could have a local partial collector and then merge their results. However as the simple approach of just passing one key per block to the existing API appears sufficient for CockroachDB, this is left for later exploration.

    Benchmarking this implementation against simply opening a writer and feeding each key from a reader to it shows modest gains even without the addition of concurrency, just from skipping rebuilding filters and props from every key, with a 12M SST rewriting about 33% faster. With concurrency=8, this jumps to about a 7.5x speedup over the naive read/write iteration, and 5x over concurrency=1.

        writer_test.go:357: rewriting a 12 M SSTable
    BenchmarkRewriter/RewriteKeySuffixes,concurrency=1-10                130          91251942 ns/op
    BenchmarkRewriter/RewriteKeySuffixes,concurrency=8-10                666          18205387 ns/op
    BenchmarkRewriter/read-and-write-new-sst-10                           88         134926796 ns/op
    opened by dt 20
  • internal/cache: untangle mutual recursion in cache eviction


    • untangle mutually recursive calls in runHand{Hot,Cold,Test}
    • separate cache eviction into reclaiming and balancing
    • fix clock hands going past other hands

    Fix #526

    opened by chanyoung 20
  • .github: use Go 1.19


    Use Go 1.19 for the default workflows (excluding the explicit Go version compatibility tests, which run on Go 1.17 through 1.19). Go 1.19 introduced changes to gofmt that result in significant churn to our comments, so also update the lint check to use crlfmt (--fast).

    opened by jbowens 1
  • replay: implement Stringer on workload collector's Cleaner


    Pebble encodes the Options.Cleaner used within the OPTIONS file. During workload collection, a custom cleaner is installed to defer the deletion of files until they've been collected. Previously, the *pebble.WorkloadCollector itself implemented the Cleaner interface, resulting in the %s serialization of the entire WorkloadCollector structure being recorded within the OPTIONS file. This commit adjusts the collector to pass a Cleaner that implements fmt.Stringer, so that a consistent, stable string representation is encoded in the OPTIONS file.

    opened by jbowens 1
  • db: require validity check before calling HasPointAndRange, RangeKeyChanged


    Document that both HasPointAndRange and RangeKeyChanged require that the iterator already be known to be valid before being called. This is already true of CockroachDB's use of these methods:

    • cockroachdb/cockroach#94220

    These methods are called in the hot path for every KV, so eliding unnecessary validity checks slightly improves the MVCCScan benchmarks.

        name                                                                   old time/op    new time/op    delta
        MVCCScan_Pebble/rows=1/versions=1/valueSize=8/numRangeKeys=0-24          6.54µs ± 1%    6.47µs ± 1%  -1.13%  (p=0.001 n=9+10)
        MVCCScan_Pebble/rows=1/versions=2/valueSize=8/numRangeKeys=0-24          7.74µs ± 1%    7.65µs ± 2%  -1.25%  (p=0.004 n=9+10)
        MVCCScan_Pebble/rows=1/versions=10/valueSize=8/numRangeKeys=0-24         25.8µs ± 2%    25.0µs ± 1%  -3.06%  (p=0.000 n=10+9)
        MVCCScan_Pebble/rows=1/versions=100/valueSize=8/numRangeKeys=0-24        93.2µs ± 2%    90.2µs ± 1%  -3.21%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=10/versions=1/valueSize=8/numRangeKeys=0-24         10.9µs ± 1%    10.7µs ± 1%  -1.69%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=10/versions=2/valueSize=8/numRangeKeys=0-24         13.3µs ± 1%    13.1µs ± 1%  -1.52%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=10/versions=10/valueSize=8/numRangeKeys=0-24        41.6µs ± 2%    40.9µs ± 2%  -1.83%  (p=0.001 n=10+10)
        MVCCScan_Pebble/rows=10/versions=100/valueSize=8/numRangeKeys=0-24        149µs ± 2%     146µs ± 2%  -1.99%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=100/versions=1/valueSize=8/numRangeKeys=0-24        44.1µs ± 2%    42.8µs ± 1%  -2.82%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=100/versions=2/valueSize=8/numRangeKeys=0-24        58.1µs ± 1%    57.2µs ± 1%  -1.59%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=100/versions=10/valueSize=8/numRangeKeys=0-24        170µs ± 0%     169µs ± 1%  -0.98%  (p=0.000 n=8+8)
        MVCCScan_Pebble/rows=100/versions=100/valueSize=8/numRangeKeys=0-24       636µs ± 1%     629µs ± 1%  -1.09%  (p=0.002 n=10+10)
        MVCCScan_Pebble/rows=1000/versions=1/valueSize=8/numRangeKeys=0-24        348µs ± 2%     339µs ± 1%  -2.72%  (p=0.000 n=10+10)
        MVCCScan_Pebble/rows=1000/versions=2/valueSize=8/numRangeKeys=0-24        469µs ± 0%     464µs ± 1%  -1.08%  (p=0.000 n=9+9)
        MVCCScan_Pebble/rows=1000/versions=10/valueSize=8/numRangeKeys=0-24      1.37ms ± 1%    1.36ms ± 1%    ~     (p=0.077 n=9+9)
        MVCCScan_Pebble/rows=1000/versions=100/valueSize=8/numRangeKeys=0-24     5.41ms ± 2%    5.33ms ± 1%  -1.50%  (p=0.003 n=10+10)
        MVCCScan_Pebble/rows=10000/versions=1/valueSize=8/numRangeKeys=0-24      3.16ms ± 2%    3.09ms ± 2%  -2.10%  (p=0.002 n=10+10)
        MVCCScan_Pebble/rows=10000/versions=2/valueSize=8/numRangeKeys=0-24      4.37ms ± 2%    4.31ms ± 1%  -1.21%  (p=0.001 n=10+9)
        MVCCScan_Pebble/rows=10000/versions=10/valueSize=8/numRangeKeys=0-24     13.4ms ± 3%    13.2ms ± 1%    ~     (p=0.075 n=10+10)
        MVCCScan_Pebble/rows=10000/versions=100/valueSize=8/numRangeKeys=0-24    53.6ms ± 4%    52.1ms ± 4%    ~     (p=0.053 n=9+10)
        MVCCScan_Pebble/rows=50000/versions=1/valueSize=8/numRangeKeys=0-24      16.7ms ± 1%    16.5ms ± 1%  -1.62%  (p=0.000 n=9+9)
        MVCCScan_Pebble/rows=50000/versions=2/valueSize=8/numRangeKeys=0-24      22.9ms ± 3%    22.6ms ± 2%  -1.33%  (p=0.035 n=10+9)
        MVCCScan_Pebble/rows=50000/versions=10/valueSize=8/numRangeKeys=0-24     67.5ms ± 3%    67.0ms ± 2%    ~     (p=0.280 n=10+10)
        MVCCScan_Pebble/rows=50000/versions=100/valueSize=8/numRangeKeys=0-24     270ms ± 5%     259ms ± 7%  -4.09%  (p=0.004 n=10+9)
    opened by jbowens 1
  • db: allow switching directions using NextPrefix


    db: factor out iterator repositioning to first/last key

    Pull out the logic to reposition the internal iterator to the first or last visible key into helper functions.

    db: allow switching directions using NextPrefix

    Previously, NextPrefix had undefined behavior when called on an Iterator positioned in the reverse direction. This commit expands NextPrefix to support being called when oriented in the reverse direction, in which case it will correctly advance to the first key with a greater prefix. The behavior while positioned at an IterAtLimit position continues to be undefined (but in practice, is implemented as a simple Next for metamorphic test determinism).

    This will simplify code in Cockroach's MVCC scanner, which itself needs to support switching directions.

    The internal iterator interface continues to prohibit switching directions through InternalIterator.NextPrefix.

    opened by jbowens 1
  • db: Qualify fatal commit error message


    This is a trivial change to add some additional context in the fatal log line when a commit fails.

    I've been diagnosing a PebbleDB crash with sync: negative WaitGroup counter originating from this callsite (at the top of the stack). Since panics are recovered and the error is reproduced literally here, it was difficult to narrow down the source of this crash. This additional qualification in the message should make it easier to trace down in the future.

    (I may open a separate issue for the commit failure crash once I am able to create a reliably reproducible test case)

    opened by LINKIWI 1