Tiexec - TiExec tries to alleviate the iTLB-Cache-Miss problem of the application it loaded

Overview

TiExec

This opensource project is inspired by the TiDB Hackathon 2021. Here is the RFC doc of this opensource project (only Chinese version is available currently).

I would like to name this opensource project with a prefix "Ti" to show my sincere respect for the marvelous contributions they have done to the opensource community.

Status

Prototype is ready.

You should wait for its v1.0 release if you want to use it in production.

Synopsis

$ tiexec echo -e "Hi, I am loaded by tiexec ❤️\nIt may try to make me more performant ☺\n"
Hi, I am loaded by tiexec ❤️
It may try to make me more performant ☺

$ tiexec go version
go version go1.16.4 linux/amd64

$ tiexec rustc -V
rustc 1.55.0 (c8dfcfe04 2021-09-06)

$ tiexec bin/pd-server ...
$ tiexec bin/tidb-server ...
$ tiexec bin/tikv-server ...
$ tiexec bin/tiflash/tiflash ...

$ # or even any elf you like
$ tiexec bin/prometheus/prometheus ...
$ tiexec bin/bin/grafana-server ...

Description

TiExec will try to alleviate the iTLB-Cache-Miss problem of the application it loaded, so it will bring some direct performance improvement to those applications that are being punished by iTLB-Cache-Miss problem. Generally speaking, one program may face such iTLB-Cache-Miss problem if its .text segment is too large.

For example, the .text segment size of some components in TiDB is from ~46MB to ~160MB, and a test in an OLTP scenario of TiDB with these components optimized by TiExec shows that it could bring about an overall 6-11% performance improvement directly. Here is more detailed information about this test:

In one OLTP scenario of TiDB, the tidb-server suffers 68.62% iTLB-Cache-Miss, overall TPS is 307.68/sec, medium latency is 62.22 ms. After TiExec is used, iTLB-Cache-Miss reduced to 47.1% (- ~31%), overall TPS became 341.35/sec (+10.9%), medium latency became 56.32 ms (-9.5%).

Build and Have a Try

Build & Setup

$ cd $ROOT_OF_SRC
$ go build -o tiexec-helper helper.go
$ cd c
$ gcc -I log/ tiexec.c log/log.c -o tiexec

Install (need to be root):

$ mkdir -p /root/.tiexec/bin
$ mkdir -p /root/.tiexec/log
$ cp -f $ROOT_OF_SRC/tiexec-helper /root/.tiexec/bin/
$ cp -f $ROOT_OF_SRC/c/tiexec /root/.tiexec/bin/

Setup Env (need to be root):

$ export PATH=/root/.tiexec/bin:$PATH
# tell kernel to rereserve some hugepages for us
$ echo 500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# have a check
$ cat /proc/meminfo | grep -P Huge
AnonHugePages:     49152 kB
HugePages_Total:     500 // <-- success
HugePages_Free:      500 // <-- 500 pages available
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Have a Try

Run (need to be root):

$ tiexec echo -e "Hi, I am loaded by tiexec ❤️\n"
Hi, I am loaded by tiexec ❤️

Now let's have a try on TiDB-Server:

$ tiexec ./tidb-server # args is following...

Here is the memory maps when tidb-server reaches its entry point:

$ cat /proc/$pid_of_tidb_server/maps
00400000-0579a000 r-xp 00000000 00:2d 370                                /mnt/down/tarball/tidb-server
05999000-060b3000 rw-p 05399000 00:2d 370                                /mnt/down/tarball/tidb-server
... ...
$ cat /proc/$pid_of_tidb_server/numa_maps
00400000 default file=/mnt/down/tarball/tidb-server mapped=21402 active=1 N0=21402 kernelpagesize_kB=4
05999000 default file=/mnt/down/tarball/tidb-server anon=1 dirty=1 N0=1 kernelpagesize_kB=4
... ...

We can see that the .text segment of tidb-server is very large, ~83.6 MB, and it takes about 21402 * 4KB-pages to map them.

And here is after tiexec optimized:

$ cat /proc/$pid_of_tidb_server/maps
00400000-05600000 r-xp 00000000 00:0e 454903                             /anon_hugepage (deleted)
05600000-0579a000 r-xp 00000000 00:00 0
05999000-060b3000 rw-p 05399000 00:2d 370                                /mnt/down/tarball/tidb-server
... ...
$ cat /proc/$pid_of_tidb_server/numa_maps
00400000 default file=/anon_hugepage\040(deleted) huge anon=41 dirty=41 N0=41 kernelpagesize_kB=2048
05600000 default anon=410 dirty=410 N0=410 kernelpagesize_kB=4
05999000 default file=/mnt/down/tarball/tidb-server anon=1 dirty=1 N0=1 kernelpagesize_kB=4
... ...

Now it takes only 451 pages to map them, i.e. 41 * 2MB-hugepages and 410 * 4KB-pages.

>>> 451/21402.0 - 1
-0.9789272030651341

And further more, for many occasions, these 410 * 4KB-pages still could be optimized into one 2MB-hugepage. And that would be:

>>> 42/21402.0 - 1
-0.9980375665825624

Design

Basically, TiExec try to re-mmap the .text area of one process to hugepages (as much area as possible).

For example, if one .text segment of a process has a range of 0x5ff000 - 0xc10000, TiExec will re-mmap this big area into 3 small areas:

0x5ff000 - 0x600000 # 1  * 4KB Page
0x600000 - 0xc00000 # 3  * 2MB Pages
0xc00000 - 0xc10000 # 16 * 4KB Pages

Something like this:

conservative-strategy

Here is how TiExec do such re-mmap stuff in userspace:

procedure

  1. TiExec Tracer start the program it wants to optimize as a ptrace Tracee, and Tracee blocks at its entry point.
  2. Tracer saves the registers' state of Tracee.
  3. Tracer fork and exec the TiExec Helper with pipes ready.
  4. Helper analyze the Tracee's memory layout and make snapshots on the memory areas it want to re-mmap.
  5. Helper tells the Tracer the syscall lists it want the Tracee to execute, and setup the executing environment for Tracee.
  6. Tracer controlls Tracee to make these syscalls (something like munmap 4KB pages and mmap 2MB hugepage again).
  7. Helper restores data on newly remmaped areas.
  8. Tracer restores the registers' state of Tracee which is saved at Step 2.
  9. Tracer detach Tracee and wait Tracee to exit.

The executing environment of Tracee at Step 5 is like below:

0000000000000000 <inject_hardcode>:
   0:   48 c7 c0 e7 00 00 00    mov    rax,0xe7  # exit_group
   7:   48 c7 c7 01 00 00 00    mov    rdi,0x1   # exit_group(1)
   e:   0f 05                   syscall
  10:   eb ee                   jmp    0 <inject_hardcode> # jmp rel8(-18)

The point is, every time Tracee blocks at enter-syscall, Tracer will replace the syscall exit_group as the syscall in the list which is decided by Helper in Step 4. And when Tracee block at return from syscall, Tracer would check the return value. If the syscall is sucess then everything is fine, otherwise there would be some error handling.

TODO

Suport something like runuser.

Support Linux ARM64.

Support something like GODEBUG.

Support re-mmap on dynamic loading library.

Support more aggresive strategy about memory area "splitting". For the .text segment range of 0x5ff000 - 0xc10000, we could have:

0x400000 - 0xe00000 # 5  * 2MB Pages

Something like this:

aggressive-strategy

Copyright and License

Copyright (C) 2021, by Sen Han [email protected].

Under the Apache License, Version 2.0.

See the LICENSE file for details.

Issues
Releases(prototype-0.02)
  • prototype-0.02(Jan 7, 2022)

    What's Changed

    • Add log and more docs by @hnes in https://github.com/hnes/tiexec/pull/2
    • docs: update by @hnes in https://github.com/hnes/tiexec/pull/3

    Full Changelog: https://github.com/hnes/tiexec/compare/prototype-0.01...prototype-0.02

    Source code(tar.gz)
    Source code(zip)
  • prototype-0.01(Jan 6, 2022)

Owner
Sen Han
[email protected] Open source developer. He who fights with monsters ;-)
Sen Han
Cache library for golang. It supports expirable Cache, LFU, LRU and ARC.

GCache Cache library for golang. It supports expirable Cache, LFU, LRU and ARC. Features Supports expirable Cache, LFU, LRU and ARC. Goroutine safe. S

Jun Kimura 2k Aug 4, 2022
cyhone 108 Jul 30, 2022
Package cache is a middleware that provides the cache management for Flamego.

cache Package cache is a middleware that provides the cache management for Flamego. Installation The minimum requirement of Go is 1.16. go get github.

Flamego 11 Jul 14, 2022
A mem cache base on other populator cache, add following feacture

memcache a mem cache base on other populator cache, add following feacture add lazy load(using expired data, and load it asynchronous) add singlefligh

zhq 1 Oct 28, 2021
Cache - A simple cache implementation

Cache A simple cache implementation LRU Cache An in memory cache implementation

Stanislav Petrashov 1 Jan 25, 2022
Gin-cache - Gin cache middleware with golang

Gin-cache - Gin cache middleware with golang

Anson 38 Mar 16, 2022
LevelDB style LRU cache for Go, support non GC object.

Go语言QQ群: 102319854, 1055927514 凹语言(凹读音“Wa”)(The Wa Programming Language): https://github.com/wa-lang/wa LRU Cache Install go get github.com/chai2010/c

chai2010 11 Jul 5, 2020
groupcache is a caching and cache-filling library, intended as a replacement for memcached in many cases.

groupcache Summary groupcache is a distributed caching and cache-filling library, intended as a replacement for a pool of memcached nodes in many case

Go 11.6k Aug 8, 2022
☔️ A complete Go cache library that brings you multiple ways of managing your caches

Gocache Guess what is Gocache? a Go cache library. This is an extendable cache library that brings you a lot of features for caching data. Overview He

Vincent Composieux 1.4k Aug 4, 2022
fastcache - fast thread-safe inmemory cache for big number of entries in Go

Fast thread-safe inmemory cache for big number of entries in Go. Minimizes GC overhead

VictoriaMetrics 1.5k Aug 3, 2022
An in-memory cache library for golang. It supports multiple eviction policies: LRU, LFU, ARC

GCache Cache library for golang. It supports expirable Cache, LFU, LRU and ARC. Features Supports expirable Cache, LFU, LRU and ARC. Goroutine safe. S

Jun Kimura 318 May 31, 2021
Efficient cache for gigabytes of data written in Go.

BigCache Fast, concurrent, evicting in-memory cache written to keep big number of entries without impact on performance. BigCache keeps entries on hea

Allegro Tech 5.9k Aug 4, 2022
An in-memory key:value store/cache (similar to Memcached) library for Go, suitable for single-machine applications.

go-cache go-cache is an in-memory key:value store/cache similar to memcached that is suitable for applications running on a single machine. Its major

Patrick Mylund Nielsen 6.4k Jul 28, 2022
Primer proyecto OSS en comunidad sobre cache en memoria.

GoKey ?? Concepto del proyecto: Sistema de base de datos clave valor, distribuido. En forma de cache en memoria. Especificaciones: Para conjuntar inf

Gophers LATAM 20 Feb 18, 2022
LFU Redis implements LFU Cache algorithm using Redis as data storage

LFU Redis cache library for Golang LFU Redis implements LFU Cache algorithm using Redis as data storage LFU Redis Package gives you control over Cache

Mohamed Shapan 6 Apr 23, 2021
gdcache is a pure non-intrusive distributed cache library implemented by golang

gdcache is a pure non-intrusive distributed cache library implemented by golang, you can use it to implement your own distributed cache

Jovan 9 Mar 24, 2022
entcache - An experimental cache driver for ent with variety of storage options

entcache An experimental cache driver for ent with variety of storage options, such as: A context.Context-based cache. Usually, attached to an HTTP re

ariga 95 Jul 30, 2022
Cachy is a simple and lightweight in-memory cache api.

cachy Table of Contents cachy Table of Contents Description Features Structure Configurability settings.json default values for backup_file_path Run o

Hayrullah cansu 4 Apr 24, 2022