Go library providing algorithms optimized to leverage the characteristics of modern CPUs


asm build status GoDoc

Go library providing algorithms optimized to leverage the characteristics of modern CPUs.


With the development of Cloud technologies, access to large scale compute capacity has never been easier, and running distributed systems deployed across dozens or sometimes hundreds of CPUs has become common practice. As a side effect of being provided seemingly unlimited (but somewhat expensive) compute capacity, software engineers are now in direct connections with the economical and environmental impact of running the software they develop in production; performance and efficiency of our programs matters today more than it has ever before.

Modern CPUs are complex machines with performance characteristic that may vary by orders of magnitude depending on how they are used. Features like branch prediction, instruction reordering, pipelining, or caching are all input variables that determine the compute throughput that a CPU can achieve. While compilers keep being improved, and often employ micro-optimizations that would be counter-productive for human developers to be responsible for, there are limitations to what they can do, and Assembly still has a role to play in optimizing algorithms on hot code paths of large scale applications.

SIMD instruction sets offer interesting opportunities for software engineers. Taking advantage of these instructions often requires rethinking how the program represents and manipulates data, which is beyond the realm of optimizations that can be implemented by a compiler. When renting CPU time from a Cloud provider, programs that fail to leverage the full sets of instructions available are therefore paying for features they do not use.

This package aims to provide such algorithms, optimized to leverage advanced instruction sets of modern CPUs to maximize throughput and take the best advantage of the available compute power. Users of the package will find functions that have often been designed to work on arrays of values, which is where SIMD and branchless algorithms shine.

The functions in this library have been used in high throughput production environments at Segment, we hope that they will be useful to other developers using Go in performance-sensitive software.


The library is composed of multiple Go packages intended to act as logical groups of functions sharing similar properties:

Package Purpose
ascii library of functions designed to work on ASCII inputs
base64 standard library compatible base64 encodings
bswap byte swapping algorithms working on arrays of fixed-size items
cpu definition of the ABI used to detect CPU features
mem functions operating on byte arrays
qsort quick-sort implementations for arrays of fixed-size items
slices functions performing computations on pairs of slices
sortedset functions working on sorted arrays of fixed-size items

When no assembly version of a function is available for the target platform, the package provides a generic implementation in Go which is automatically picked up by the compiler.


The purpose of this library being to improve the runtime efficiency of Go programs, we compiled a few snapshots of benchmark runs to showcase the kind of improvements that these code paths can expect from leveraging SIMD and branchless optimizations:

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
pkg: github.com/segmentio/asm/ascii
name                  old time/op    new time/op     delta
EqualFoldString/0512     276ns ± 1%       21ns ± 2%    -92.50%  (p=0.008 n=5+5)

name                  old speed      new speed       delta
EqualFoldString/0512  3.71GB/s ± 1%  49.44GB/s ± 2%  +1232.79%  (p=0.008 n=5+5)
pkg: github.com/segmentio/asm/bswap
name    old time/op    new time/op     delta
Swap64    11.2µs ± 1%      0.9µs ± 9%    -92.06%  (p=0.008 n=5+5)

name    old speed      new speed       delta
Swap64  5.83GB/s ± 1%  73.67GB/s ± 9%  +1162.98%  (p=0.008 n=5+5)
pkg: github.com/segmentio/asm/qsort
name            old time/op    new time/op     delta
Sort16/1000000     269ms ± 2%       46ms ± 3%   -83.08%  (p=0.008 n=5+5)

name            old speed      new speed       delta
Sort16/1000000  59.4MB/s ± 2%  351.2MB/s ± 3%  +491.24%  (p=0.008 n=5+5)


Generation of the assembly code is managed with AVO, and orchestrated by a Makefile which helps maintainers rebuild the assembly source code when the AVO files are modified.

The repository contains two Go modules; the main module is declared as github.com/segmentio/asm at the root of the repository, and the second module is found in the build subdirectory.

The build module is used to isolate build dependencies from programs that import the main module. Through this mechanism, AVO does not become a dependency of programs using github.com/segmentio/asm, keeping the dependency management overhead minimal for the users, and allowing maintainers to make modifications to the build package.

Versioning of the two modules is managed independently; while we aim to provide stable APIs on the main package, breaking changes may be introduced on the build package more often, as it is intended to be ground for more experimental constructs in the project.


Programs in the build module should add the following declaration:

func init() {

It instructs AVO to inject the !purego tag in the generated files, allowing compilation of the libraries without any assembly optimizations with a build command such as:

go build -tags purego ...

This is mainly useful to compare the impact of using the assembly optimized versions instead of the simpler Go-only implementations.

  •  Performance comparison between assembly and cgo

    Performance comparison between assembly and cgo

    Hi folks, This library is accelerated using assembly. Have you made any performance comparison of using cgo for corresponding acceleration? Although cgo itself has a remarkable overhead around 100ns for each call, it might be negligible given larger batch inputs. On the other hand, assembly itself also has the shortcomings such that it could not be inlined, so it would be interesting if performance comparison is available. Thank you~

    opened by yingfeng 0
  • add mem.Count function

    add mem.Count function

    The bits.CountByte function in https://github.com/segmentio/parquet-go is a good candidate to bring to https://github.com/segmentio/asm

    Here is a suggestion for the signature:

    package mem
    func Count(b []byte, v byte) int
    opened by achille-roussel 0
  • ARM64 work would lead to Apple Silicon support?

    ARM64 work would lead to Apple Silicon support?

    Apologies if this isn't the right place to ask this question, but could find any other means. Noticed ARM64 being worked on, does that imply there might be support for Apple Silicon in the future? Apple Silicon deviates from ARM64 by supporting W^X. Don't know if it's applicable to your algorithms though.

    opened by kathe 0
  • v1.2.0(May 1, 2022)

    What's Changed

    • .github: update to Go 1.18.1 by @kevinburkesegment in https://github.com/segmentio/asm/pull/76

    Full Changelog: https://github.com/segmentio/asm/compare/v1.1.5...v1.2.0

    Source code(tar.gz)
    Source code(zip)
a small form factor OpenShift/Kubernetes optimized for edge computing

Microshift Microshift is OpenShift1 Kubernetes in a small form factor and optimized for edge computing. Edge devices deployed out in the field pose ve

Red Hat Emerging Technologies 340 Jun 24, 2022
A Rancher and Kubernetes optimized immutable Linux distribution based on openSUSE

RancherOS v2 WORK IN PROGRESS RancherOS v2 is an immutable Linux distribution built to run Rancher and it's corresponding Kubernetes distributions RKE

Rancher 74 Jun 22, 2022
Tape backup software optimized for large WORM data and long-term recoverability

Mixtape Backup software for tape users with lots of WORM data. Draft design License This codebase is not open-source software (or free, or "libre") at

Dave Anderson 15 Apr 23, 2022
Kubei is a flexible Kubernetes runtime scanner, scanning images of worker and Kubernetes nodes providing accurate vulnerabilities assessment, for more information checkout:

Kubei is a vulnerabilities scanning and CIS Docker benchmark tool that allows users to get an accurate and immediate risk assessment of their kubernet

Portshift 663 Jun 27, 2022
Kubernetes operator providing CRDs to interact with NETCONF servers.

NETCONF operator This operator is meant to provide support for: RFC6241 Network Configuration Protocol (NETCONF) RFC6242 Using the NETCONF Protocol ov

Alexis de Talhouët 0 Nov 17, 2021
Kstone is an etcd management platform, providing cluster management, monitoring, backup, inspection, data migration, visual viewing of etcd data, and intelligent diagnosis.

Kstone 中文 Kstone is an etcd management platform, providing cluster management, monitoring, backup, inspection, data migration, visual viewing of etcd

TKEStack 544 Jun 27, 2022
A Go package providing a generic data type to track maximum and minimum peak values.

go-peak Overview go-peak is a Go package providing a generic data type that tracks the maximum and minimum peak values within a specific period of tim

tunabay 0 Mar 26, 2022
An Oracle Cloud (OCI) Pulumi resource package, providing multi-language access to OCI

Oracle Cloud Infrastructure Resource Provider The Oracle Cloud Infrastructure (OCI) Resource Provider lets you manage OCI resources. Installing This p

Pulumi 8 May 6, 2022
🚢 Go package providing lifecycle management for PostgreSQL Docker instances.

?? psqldocker powered by ory/dockertest. Go package providing lifecycle management for PostgreSQL Docker instances. Leverage Docker to run unit and in

Adrian Brad 10 Jun 14, 2022
:rocket: Modern cross-platform HTTP load-testing tool written in Go

English | 中文 Cassowary is a modern HTTP/S, intuitive & cross-platform load testing tool built in Go for developers, testers and sysadmins. Cassowary d

Roger Welin 588 Jun 24, 2022
Modern Job Scheduler

Kala Kala is a simplistic, modern, and performant job scheduler written in Go. Features: Single binary No dependencies JSON over HTTP API Job Stats Co

AJ Bahnken 1.8k Jun 26, 2022
Resilient, scalable Brainf*ck, in the spirit of modern systems design

Brainf*ck-as-a-Service A little BF interpreter, inspired by modern systems design trends. How to run it? docker-compose up -d bash hello.sh # Should p

Serge Zaitsev 138 Apr 22, 2022
k6 is a modern load testing tool for developers and testers in the DevOps era.

k6 is a modern load testing tool, building on our years of experience in the load and performance testing industry. It provides a clean, approachable scripting API, local and cloud execution, and flexible configuration.

k6 16.9k Jun 22, 2022
The smart virtual machines manager. A modern CLI for Vagrant Boxes.

The smart virtual machines manager Table of Contents: What is Vermin Install Vermin Usage Contributors TODO What is Vermin Vermin is a smart, simple a

Muhammad Hewedy 123 Jun 15, 2022
💧 Visual Data Preparation (VDP) is an open-source tool to seamlessly integrate Vision AI with the modern data stack

Website | Community | Blog Get Early Access Visual Data Preparation (VDP) is an open-source tool to streamline the end-to-end visual data processing p

Instill AI 17 May 30, 2022
Tigris is a modern, scalable backend for building real-time websites and apps.

Tigris Data Getting started These instructions will get you through setting up Tigris Data locally as Docker containers. Prerequisites Make sure that

Tigris Data Inc 90 Jun 23, 2022
HTTP load testing tool and library. It's over 9000!

Vegeta Vegeta is a versatile HTTP load testing tool built out of a need to drill HTTP services with a constant request rate. It can be used both as a

Tomás Senart 19.8k Jul 2, 2022
Go library to create resilient feedback loop/control controllers.

Gontroller A Go library to create feedback loop/control controllers, or in other words... a Go library to create controllers without Kubernetes resour

Spotahome 143 May 31, 2022
Orchestra is a library to manage long running go processes.

Orchestra Orchestra is a library to manage long running go processes. At the heart of the library is an interface called Player // Player is a long ru

Stephen Afam-Osemene 105 May 2, 2022