Distributed RWMutex in Go

Overview

The default Go implementation of sync.RWMutex does not scale well to multiple cores, as all readers contend on the same memory location when they all try to atomically increment it. This repository provides an n-way RWMutex, also known as a "big reader" lock, which gives each CPU core its own RWMutex. Readers take only a read lock local to their core, whereas writers must take all locks in order.
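
To make the design concrete, here is a minimal sketch of the n-way scheme, assuming a 64-byte cache line; paddedRWMutex and cpuIndex are illustrative names, not the repository's exact code:

    package drwmutex

    import (
    	"runtime"
    	"sync"
    )

    // paddedRWMutex keeps each per-core lock on its own cache line so
    // that readers on different cores do not contend on the same memory
    // (64 bytes is an assumed cache-line size).
    type paddedRWMutex struct {
    	mu sync.RWMutex
    	_  [64]byte
    }

    // DRWMutex is an n-way "big reader" lock: one RWMutex per core.
    type DRWMutex []paddedRWMutex

    // New allocates one padded lock per CPU.
    func New() DRWMutex {
    	return make(DRWMutex, runtime.NumCPU())
    }

    // Lock takes all write locks in order, excluding every reader.
    func (mx DRWMutex) Lock() {
    	for i := range mx {
    		mx[i].mu.Lock()
    	}
    }

    // Unlock releases all write locks.
    func (mx DRWMutex) Unlock() {
    	for i := range mx {
    		mx[i].mu.Unlock()
    	}
    }

    // RLocker returns the read lock belonging to the core the caller is
    // believed to be running on.
    func (mx DRWMutex) RLocker() sync.Locker {
    	return mx[cpuIndex()%len(mx)].mu.RLocker()
    }

    // cpuIndex is a placeholder for the CPUID-based core lookup
    // described below; it must be cheap, and it need not be exact
    // for correctness.
    func cpuIndex() int { return 0 }

Because writers take the per-core locks in a fixed order, two concurrent writers cannot deadlock against each other.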

Note that the current implementation only supports x86 processors on Linux; other combinations automatically fall back to plain sync.RWMutex behaviour. To support other architectures and OSes, the appropriate cpu_GOARCH.go and cpus_GOOS.go files need to be written. If you have a different setup available and the time to write one of these, I'll happily accept patches.

Finding the current CPU

To determine which lock to take, readers use the CPUID instruction, which gives the APICID of the currently active CPU without having to issue a system call or modify the runtime. This instruction is supported on both Intel and AMD processors; ARM CPUs should use the CPU ID register instead. For systems with more than 256 processors, x2APIC must be used, and the APIC ID should instead be read from the EDX register after CPUID with EAX=0xb. A mapping from APICID to CPU index is constructed (using CPU affinity syscalls) when the program starts, as it is static for the lifetime of a process. Since the CPUID instruction can be fairly expensive, goroutines also only periodically update their estimate of what core they are running on. More frequent updates lead to less inter-core lock traffic, but also increase the time spent on CPUID instructions relative to the actual locking.
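
For illustration, here is a sketch of how such a mapping might be built on Linux, assuming golang.org/x/sys/unix for the affinity syscalls; mapCPUs and probe are illustrative stand-ins (probe plays the role of the repository's assembly cpu() function), and the real cpus_linux.go may differ in details:

    //go:build linux

    package drwmutex

    import (
    	"fmt"
    	"os"
    	"runtime"

    	"golang.org/x/sys/unix"
    )

    // mapCPUs pins the calling thread to each CPU in turn and records
    // the APICID that probe reports there.
    func mapCPUs(probe func() uint64) map[uint64]int {
    	cpus := make(map[uint64]int)

    	// Pin the goroutine to one OS thread so the affinity changes
    	// below apply to the thread we are actually running on.
    	runtime.LockOSThread()
    	defer runtime.UnlockOSThread()

    	// Remember the original affinity mask so it can be restored.
    	var orig unix.CPUSet
    	if err := unix.SchedGetaffinity(0, &orig); err != nil {
    		return cpus
    	}
    	defer unix.SchedSetaffinity(0, &orig)

    	for i := 0; i < runtime.NumCPU(); i++ {
    		var set unix.CPUSet
    		set.Set(i)
    		if err := unix.SchedSetaffinity(0, &set); err != nil {
    			continue // this CPU is not available to the process
    		}
    		id := probe()
    		if prev, ok := cpus[id]; ok {
    			// The "cpu x == y -- both have CPUID c" warning in
    			// the issue list below corresponds to this case.
    			fmt.Fprintf(os.Stderr, "cpu %d == %d -- both have CPUID %d\n", i, prev, id)
    		}
    		cpus[id] = i
    	}
    	return cpus
    }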

Stale CPU information. The information about which CPU a goroutine is running on might be stale by the time the lock is taken (the goroutine could have been moved to another core), but this only affects performance, not correctness, as long as the reader remembers which lock it took. Such moves are also unlikely, as the OS kernel tries to keep threads on the same core to preserve cache locality.
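
In practice this means a reader must hold on to the locker returned by RLocker() and release that same lock, rather than resolving its core again at unlock time. A minimal usage sketch, reusing the DRWMutex sketch from the overview:

    // readSide is an illustrative helper, not part of the drwmutex API.
    func readSide(mx DRWMutex, read func()) {
    	l := mx.RLocker() // resolve this core's read lock once
    	l.Lock()
    	read()
    	l.Unlock() // release the exact lock we took, even if we migrated
    }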

Performance

There are many parameters that affect the performance characteristics of this scheme. In particular, the frequency of CPUID checking, the number of readers, the ratio of readers to writers, and the time readers hold their locks are all important. Since only a single writer is active at a time, the duration a writer holds a lock for does not affect the difference in performance between sync.RWMutex and DRWMutex.

Experiments show that DRWMutex performs better the more cores the system has, in particular when the fraction of writers is <1% and CPUID is called at most every 10 locks (this changes depending on how long a lock is held). Even with few cores, DRWMutex outperforms sync.RWMutex under these conditions, which are common for applications that elect to use sync.RWMutex over sync.Mutex.

The plot below shows mean performance across 30 runs (using the experiment script from the repository) as the number of cores increases, using:

drwmutex-bench -i 5000 -p 0.0001 -n 500 -w 1 -r 100 -c 100

[Figure: DRWMutex and sync.RWMutex performance comparison]

Error bars denote the 25th and 75th percentiles. Note the drops every 10th core: 10 cores constitute a NUMA node on the machine the benchmarks were run on, so each added NUMA node makes cross-core traffic more expensive. DRWMutex performance nonetheless increases with the number of cores, as more readers can work in parallel than with sync.RWMutex.

See the go-nuts thread for further discussion.

Issues
  • Question about CPUs map

    I'm wondering why the cpus map, which maps CPUID to an RWMutex index, is strictly needed. I.e., wouldn't you get the same effect using mx[cpu()].RLocker(), where mx is sized according to runtime.NumCPU()? This would remove the need for platform-specific APICID mappings, but I must be missing something.

    opened by tylertreat 10
  • `init()` does not always find all CPUs.

    Once in a while, init() misses one CPU, usually with the warning

    cpu x == y -- both have CPUID c
    

    I'm not sure why this happens yet, but it is somewhat troubling. Suggestions are welcome.

    opened by jonhoo 0
  • Apple M1 (and possibly other ARM PCs): unpopulated cpus map

    When RLock()'ing a drwmutex on an M1 Mac, cpu() returns 0, but the map itself is empty:

    // Fallback stub used on unsupported platforms: the returned map is empty.
    func map_cpus() (cpus map[uint64]int) {
    	cpus = make(map[uint64]int)
    	return
    }
    

    Perhaps a temporary solution: cpus = map[uint64]int{0: 0}

    opened by kawahwookiee 0
  • Is it reasonable to use runtime.fastrand on platforms other than linux_amd64?

    Since drwmutex mainly aims to improve the performance of read-mostly workloads, using a fast random function to spread readers evenly across the locks may also help reduce lock contention. Benchmarks show the cost of runtime.fastrandn is very low.

    package drwmutex

    import (
    	"testing"
    	_ "unsafe" // required for go:linkname
    )

    func BenchmarkCpuid(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		_ = cpu()
    	}
    }

    func BenchmarkFastrand(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		_ = fastrandn(CPU_COUNT)
    	}
    }

    // cpu is implemented in assembly (cpu_amd64.s).
    func cpu() uint64

    //go:noescape
    //go:linkname fastrandn runtime.fastrandn
    func fastrandn(x uint32) uint32
    
    goos: darwin
    goarch: amd64
    
    BenchmarkCpuid-12               16798030                66.3 ns/op
    BenchmarkFastrand-12            504372906                2.39 ns/op
    
    goos: linux
    goarch: amd64
    
    BenchmarkCpuid-4         1471983               813 ns/op
    BenchmarkFastrand-4     324570566                3.58 ns/op
    

    So, is it reasonable to use this rand method to choose the read lock on platforms other than linux_amd64?

    opened by jxskiss 2
  • No implementation of map_cpus on non-Linux OSes

    Currently, all other OSes revert to the old sync.RWMutex behaviour of sharing a single lock. Implementing map_cpus for other OSes in cpus_GOOS.go will fix this, and should be relatively straightforward if you have a machine running that OS.

    Patches are welcome.

    opened by jonhoo 3
  • Applications should not have to deal with throttling calls to `RLocker()`

    Currently, each call to DRWMutex.RLocker() causes the fairly expensive CPUID instruction (at least on some architectures) to be executed. The benchmarking tool avoids this to some extent by caching the lock across multiple iterations, but applications should not have to implement this logic themselves. Ideally, the drwmutex should handle this itself, though I'm not sure how it might do so... Suggestions are welcome.
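
    A sketch of the kind of throttling the benchmark performs and that this issue argues the library should absorb, reusing the DRWMutex sketch from the overview; refreshEvery and the loop shape are illustrative, not from the repository:

    const refreshEvery = 10 // illustrative constant; tune against lock hold times

    // reader amortizes the CPUID cost by re-resolving its per-core
    // read lock only every refreshEvery acquisitions.
    func reader(mx DRWMutex, work func(), iters int) {
    	var l sync.Locker
    	for n := 0; n < iters; n++ {
    		if n%refreshEvery == 0 {
    			l = mx.RLocker() // the expensive CPUID-backed lookup
    		}
    		l.Lock()
    		work()
    		l.Unlock() // unlock the same lock we last resolved
    	}
    }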

    opened by jonhoo 0
  • No support for >256 CPUs.

    The current implementation uses the single byte of APIC ID information in EBX from CPUID with EAX=1 to play nicely with CPUs that do not support x2APIC (such as AMD and older Intel processors). However, this limits the number of processors to 256. CPUID can also give x2APIC IDs: issue it with EAX=0xb and read the 4 bytes in EDX. cpu_amd64.s should use this feature if the CPU supports it.

    A patch to correctly use the x2APIC CPUID instruction is given below:

    diff --git a/cpu_amd64.s b/cpu_amd64.s
    index b485f31..354f8f9 100644
    --- a/cpu_amd64.s
    +++ b/cpu_amd64.s
    @@ -2,14 +2,12 @@
    
     // func cpu() uint64
     TEXT ·cpu(SB),NOSPLIT,$0-8
    -       MOVL    $0x01, AX // version information
    +       MOVL    $0x0b, AX // extended topology enumeration
            MOVL    $0x00, BX // any leaf will do
            MOVL    $0x00, CX // any subleaf will do
    
            // call CPUID
            BYTE $0x0f
            BYTE $0xa2
    -
    -       SHRQ    $24, BX // logical cpu id is put in EBX[31-24]
    -       MOVQ    BX, ret+0(FP)
    +       MOVQ    DX, ret+0(FP) // logical cpu id is put in EDX
            RET
    
    opened by jonhoo 0
  • No support for ARM CPUs

    There is currently no .s file implementing cpu() on non-x86 architectures. Adding support for ARM should be fairly straightforward by reading the CPU ID register, but I don't have a machine to test this on. Patches are welcome.

    opened by jonhoo 1
Owner

Jon Gjengset
Rust educational streamer. Rustacean at AWS. A fan of making things secure, fast, scalable, and well-documented.

Related projects

bcache - Eventually consistent distributed in-memory cache Go library. Features an LRU cache with configurable maximum keys and eventual-consistency sync.
Iwan Budi Kusnanto 88 Jun 22, 2022

CockroachDB - the open source, cloud-native distributed SQL database for building global, scalable cloud services that survive disasters.
CockroachDB 25k Jun 23, 2022

rqlite - a lightweight, distributed relational database that uses SQLite as its storage engine. Forming a cluster is very straightforward.
rqlite 10.6k Jun 26, 2022

TiDB - an open source distributed HTAP database compatible with the MySQL protocol ("Ti" stands for Titanium).
PingCAP 31.7k Jun 29, 2022

Vasto - a distributed, high-performance key-value store. On disk. Eventually consistent. HA. Able to grow or shrink without service interruption.
Chris Lu 238 Jun 16, 2022

kingbus - a distributed MySQL binlog storage system built on Raft. Kingbus can act as a slave to the real master and as a master to downstream slaves.
Fei Chen 844 Jun 16, 2022

Autocache - distributed cache with gossip peer membership enrollment; Groupcache enhanced with memberlist for distributed peer discovery.
null 95 Feb 19, 2022

LBADD - an experimental, distributed SQL database written in Go.
Tom Arrell 379 May 22, 2022

TinySQL - a course designed to teach you how to implement the SQL layer of a distributed relational database in Go.
TiDB Incubator 979 Jun 26, 2022

TalariaDB - a distributed, highly available, and low-latency time-series database for Presto that stores real-time data, built on top of Badger DB.
Grab 97 Jun 18, 2022

Redwood - a highly-configurable, distributed, realtime database that manages a state tree shared among many peers. Imagine something like a Redux store, but distributed across all users of an application, that offers offline editing and is resilient to poor connectivity.
Redwood 634 Jun 28, 2022

minikeyvalue - a distributed key-value store in under 1000 lines, with support for replication. Used in production at comma.ai.
George Hotz 2.3k Jun 29, 2022

Distributed cache and in-memory key/value data store that can be used both as an embedded Go library and as a language-independent service.
Burak Sezer 2.2k Jun 22, 2022

IceFireDB - distributed disk storage system based on Raft and the RESP (Redis) protocol.
GITSRC 876 Jul 2, 2022

dtmcli - a Go client SDK for the distributed transaction manager DTM. The library's code is kept fully in sync with the dtmcli code in the dtm repository, and it has very few dependencies, which keeps the final packaged image small.
null 12 Apr 27, 2022

P2PDB - a decentralized, distributed, peer-to-peer database that uses IPFS for data storage and IPFS Pubsub to automatically sync data with peers. P2PDB aims to be a decentralized distributed database for decentralized applications (dApps) and blockchain applications.
Rock 2 Jun 9, 2022

DisGo - a distributed lock based on Redis, developed in Go.
Jinyu Chen 8 Jun 27, 2022

Couchbase - a distributed NoSQL cloud database.
Md Abu. Raihan 2 Feb 16, 2022