CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

Overview

CRFS is a read-only FUSE filesystem that lets you mount a container image, served directly from a container registry (such as gcr.io), without pulling it all locally first.

Background

Starting a container should be fast. Currently, however, starting a container in many environments requires doing a pull operation from a container registry to read the entire container image from the registry and write the entire container image to the local machine's disk. It's pretty silly (and wasteful) that a read operation becomes a write operation. For small containers, this problem is rarely noticed. For larger containers, though, the pull operation quickly becomes the slowest part of launching a container, especially on a cold node. Contrast this with launching a VM on major cloud providers: even with a VM image that's hundreds of gigabytes, the VM boots in seconds. That's because the hypervisors' block devices are reading from the network on demand. The cloud providers all have great internal networks. Why aren't we using those great internal networks to read our container images on demand?

Why does Go want this?

Go's continuous build system tests Go on many operating systems and architectures, using a mix of containers (mostly for Linux) and VMs (for other operating systems). We prioritize fast builds, targeting 5-minute turnaround for pre-submit tests when testing new changes. For isolation and other reasons, we run all our containers in single-use fresh VMs. Generally our containers do start quickly, but some of our containers are very large and take a long time to start. To work around that, we've automated the creation of VM images where our heavy containers are pre-pulled. This is all a silly workaround. It'd be much better if we could just read the bytes over the network from the right place, without all the hoops.

Tar files

One reason that reading the bytes directly from the source on demand is somewhat non-trivial is that container images are, somewhat regrettably, represented by tar.gz files, and tar files are unindexed, and gzip streams are not seekable. This means that trying to read 1KB out of a file named /var/lib/foo/data still involves pulling hundreds of gigabytes to uncompress the stream, to decode the entire tar file until you find the entry you're looking for. You can't look it up by its path name.
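
To make the problem concrete, here's a short Go sketch (illustrative, not CRFS code) of reading a single file out of a plain tar.gz: the only option is to decompress and scan every entry that precedes it.

    package main

    import (
        "archive/tar"
        "compress/gzip"
        "fmt"
        "io"
        "os"
    )

    func main() {
        f, err := os.Open("layer.tar.gz")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        zr, err := gzip.NewReader(f)
        if err != nil {
            panic(err)
        }
        tr := tar.NewReader(zr)
        for {
            hdr, err := tr.Next() // strictly sequential: no random access by name
            if err == io.EOF {
                panic("entry not found")
            }
            if err != nil {
                panic(err)
            }
            if hdr.Name == "var/lib/foo/data" {
                data, _ := io.ReadAll(tr)
                fmt.Printf("read %d bytes, after decompressing everything before it\n", len(data))
                return
            }
        }
    }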

Introducing Stargz

Fortunately, we can fix the fact that tar.gz files are unindexed and unseekable, while still making the file a valid tar.gz file by taking advantage of the fact that two gzip streams can be concatenated and still be a valid gzip stream. So you can just make a tar file where each tar entry is its own gzip stream.
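
Here's a quick Go demonstration of that gzip property (illustrative, not CRFS code): two independently compressed streams, concatenated byte-for-byte, still decode as one valid gzip stream.

    package main

    import (
        "bytes"
        "compress/gzip"
        "fmt"
        "io"
    )

    // gz compresses b as its own standalone gzip stream.
    func gz(b []byte) []byte {
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        zw.Write(b)
        zw.Close()
        return buf.Bytes()
    }

    func main() {
        // Two separately compressed streams, naively concatenated...
        joined := append(gz([]byte("hello, ")), gz([]byte("world"))...)

        // ...still decode as a single stream. (Go's gzip.Reader handles
        // multi-stream gzip files by default.)
        zr, err := gzip.NewReader(bytes.NewReader(joined))
        if err != nil {
            panic(err)
        }
        out, err := io.ReadAll(zr)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(out)) // prints "hello, world"
    }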

We introduce a format, Stargz, a Seekable tar.gz format that's still a valid tar.gz file for everything else that's unaware of these details.

In summary:

  • The traditional *.tar.gz format is: Gzip(TarF(file1) + TarF(file2) + TarF(file3) + TarFooter)
  • Stargz's format is: Gzip(TarF(file1)) + Gzip(TarF(file2)) + Gzip(TarF(file3_chunk1)) + Gzip(F(file3_chunk2)) + Gzip(F(index of earlier files in magic file), TarFooter), where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall stargz file.

This makes images a few percent larger (due to more gzip headers and loss of compression context between files), but it's plenty acceptable.
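
As a rough sketch of that layout (not the real stargz writer; the index.json name and tocEntry fields are invented for illustration), one might write per-file gzip streams plus a trailing index like this, and the result stays a valid tar.gz:

    package main

    import (
        "archive/tar"
        "bytes"
        "compress/gzip"
        "encoding/json"
        "os"
    )

    type tocEntry struct {
        Name   string `json:"name"`
        Offset int64  `json:"offset"` // where this entry's gzip stream starts
    }

    // gzTarEntry writes one tar entry as a standalone gzip stream, without
    // the tar footer, so the streams can be concatenated.
    func gzTarEntry(name string, data []byte) []byte {
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        tw := tar.NewWriter(zw)
        tw.WriteHeader(&tar.Header{Name: name, Mode: 0644, Size: int64(len(data))})
        tw.Write(data)
        tw.Flush() // pad the entry, but don't write the two-zero-block footer
        zw.Close()
        return buf.Bytes()
    }

    func main() {
        files := []struct {
            name string
            data []byte
        }{
            {"file1", []byte("contents of file1")},
            {"file2", []byte("contents of file2")},
        }

        var out bytes.Buffer
        var toc []tocEntry
        for _, f := range files {
            toc = append(toc, tocEntry{Name: f.name, Offset: int64(out.Len())})
            out.Write(gzTarEntry(f.name, f.data))
        }

        // Final gzip stream: the index as a tar entry plus the tar footer,
        // so the whole concatenation is still a valid .tar.gz.
        idx, _ := json.Marshal(toc)
        var last bytes.Buffer
        zw := gzip.NewWriter(&last)
        tw := tar.NewWriter(zw)
        tw.WriteHeader(&tar.Header{Name: "index.json", Mode: 0644, Size: int64(len(idx))})
        tw.Write(idx)
        tw.Close() // writes the tar footer
        zw.Close()
        out.Write(last.Bytes())

        os.WriteFile("layer.stargz", out.Bytes(), 0644)
    }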

Converting images

If you're using docker push to push to a registry, you can't use CRFS to mount the image. Maybe one day docker push will push stargz files (or something with similar properties) by default, but not yet. So for now we need to convert the storage image layers from tar.gz into stargz. There is a tool that does that. TODO: examples

Operation

When mounting an image, the FUSE filesystem makes a couple Docker Registry HTTP API requests to the container registry to get the metadata for the container and all its layers.

It then does HTTP Range requests to read just the stargz index out of the end of each layer. The index is stored much like the ZIP format's TOC: a pointer to the index sits at the very end of the file. Generally it takes 1 HTTP request to read the index, but no more than 2. In any case, we're assuming a fast network (GCE VMs to gcr.io, or similar) with low latency to the container registry. Each layer needs these 1 or 2 HTTP requests, but they can all be done in parallel.
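
A minimal sketch of that index read, assuming an already-authorized blob URL that honors Range requests (real registries also need a token, and the 64KB tail size here is just an illustrative guess). Over-fetching the tail like this is also how the footer and the index can often come back in a single request:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetchTail returns the last n bytes of the blob at url.
    func fetchTail(url string, n int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        // A "suffix" range: the final n bytes, wherever the blob ends.
        req.Header.Set("Range", fmt.Sprintf("bytes=-%d", n))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return nil, fmt.Errorf("range request not honored: %s", resp.Status)
        }
        return io.ReadAll(resp.Body)
    }

    func main() {
        // Hypothetical layer blob URL.
        tail, err := fetchTail("https://gcr.io/v2/your-proj/img/blobs/sha256:...", 64<<10)
        if err != nil {
            panic(err)
        }
        fmt.Printf("fetched %d trailing bytes (footer + index)\n", len(tail))
    }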

From that, we keep the index in memory, so readdir, stat, and friends are all served from memory. For reading data, the index contains the offset of each file's GZIP(TAR(file data)) range within the overall stargz file. To make it possible to efficiently read a small amount of data from large files, there can actually be multiple stargz index entries per large file (e.g. a new gzip stream every 16MB of a large file).
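
The chunk lookup then amounts to a binary search over the per-chunk offsets. A hypothetical sketch (field names invented; compare the sort.Search fix in the issues below):

    package main

    import (
        "fmt"
        "sort"
    )

    type chunk struct {
        fileOffset int64 // offset of this chunk within the uncompressed file
        gzOffset   int64 // offset of the chunk's gzip stream within the stargz blob
    }

    // chunkFor returns the chunk covering off, assuming chunks is sorted by
    // fileOffset, non-empty, and starts at offset 0.
    func chunkFor(chunks []chunk, off int64) chunk {
        i := sort.Search(len(chunks), func(i int) bool {
            return chunks[i].fileOffset > off
        })
        // i is the first chunk starting after off, so i-1 covers off.
        return chunks[i-1]
    }

    func main() {
        // A large file split into 16MB chunks.
        chunks := []chunk{{0, 100}, {16 << 20, 9000}, {32 << 20, 17000}}
        c := chunkFor(chunks, 20<<20) // a read 20MB into the file
        fmt.Printf("fetch gzip stream at blob offset %d\n", c.gzOffset)
    }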

Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only stargz layers, but it will stop short of trying to unify a writable filesystem layer atop. For that, you can just use the traditional Linux filesystems.

Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole scheme natively, but in the meantime a workaround is that when converting an image into stargz format, the converter tool can also produce an image variant that only has metadata (environment, entrypoints, etc) and no file contents. Then you can bind mount in the contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

Input: gcr.io/your-proj/container:v2

Output: gcr.io/your-proj/container:v2meta + gcr.io/your-proj/container:v2stargz

What you actually run on Docker or Kubernetes then is the v2meta version, so your container host's docker pull or equivalent only pulls a few KB. The gigabytes of remaining data is read lazily via CRFS from the v2stargz layer directly from the container registry.

Status

WIP. Enough parts are implemented & tested for me to realize this isn't crazy. I'm publishing this document first for discussion while I finish things up. Maybe somebody will point me to an existing implementation, which would be great.

Discussion

See https://github.com/golang/go/issues/30829

Issues
  • Support mode that refreshes contents as image tag is updated on registry?

    Hey there,

    Congrats on the idea for this project. Sounds interesting and useful.

    Question: What happens when I have mounted the image, and then a new digest of the same image tag is pushed to the registry? Does my mounted image automatically get the changes from the new digest? Does the already-running container stop and restart automatically with the new digest?

    Thanks in advance

    enhancement question 
    opened by dimitrisyields 7
  • Add support for converting images to stargzify

    A version of stargzify that works against container registries.

    We probably want a better name and different flags, but this is a start. We also might want to make layer flattening optional?

    cla: yes 
    opened by jonjohnsonjr 6
  • stargz: fix lookup for last chunk

    sort.Search fails only when the specified offset is after the last chunk. When that happens, fall back to the last chunk.

    Signed-off-by: Giuseppe Scrivano [email protected]

    cla: yes 
    opened by giuseppe 6
  • Add an option to specify size to fetch along with stargz footer

    In some cases (e.g. high bandwidth and high latency), the 2 round trips (for the footer + TOC JSON) can be a performance overhead on mount. This commit mitigates the issue by adding an option to specify the size to fetch along with the stargz footer. With that option, we can hopefully get the TOC JSON file + footer in one go, which reduces round trips.

    This also addresses one of the TODOs indicated in the source code.

    cla: yes 
    opened by ktock 5
  • comment on CVMFS

    FYI maybe you have already heard about it before, but this seems similar to using CVMFS for container distribution.

    General information on CVMFS:

    • https://cvmfs.readthedocs.io/en/stable/
    • https://cernvm.cern.ch/portal/filesystem
    • https://github.com/cvmfs/cvmfs

    Information about loading docker images on demand from CVMFS: https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html

    Information about automatically converting container images and publishing them to CVMFS (with DUCC): https://cvmfs.readthedocs.io/en/stable/cpt-ducc.html

    documentation 
    opened by rptaylor 5
  • stargz: include xattrs in the TOC

    I need it for: https://github.com/giuseppe/crfs-plugin

    Signed-off-by: Giuseppe Scrivano [email protected]

    cla: yes 
    opened by giuseppe 4
  • Truncate unnecessary data before specified offset

    Fixes: #17. Sometimes, big files are broken in CRFS.

    When we read a big file, the actual reads are split into several blocks. In such situations, a node is asked to read at a specific offset, but CRFS doesn't truncate the unnecessary data before that offset.

    This commit solves the issue by truncating the unnecessary data before the specified offset when CRFS fetches the first chunk of the required range.

    cla: yes 
    opened by ktock 4
  • Support whiteout entries in overlayfs

    Fixes: #40

    When removing entries, CRFS's current behaviour doesn't make overlayfs happy, because overlayfs expresses whiteouts with a convention different from the one defined in the OCI (docker-compliant) spec. This commit solves this issue.

    See also:

    • OCI spec: https://github.com/opencontainers/image-spec/blob/775207bd45b6cb8153ce218cc59351799217451f/layer.md#whiteouts
    • Docker spec: https://github.com/moby/moby/blob/64fd3dc0d5e0b15246dcf8d2a58baf202cc179bc/image/spec/v1.2.md#creating-an-image-filesystem-changeset
    • overlayfs: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt

    Signed-off-by: Kohei Tokunaga [email protected]

    cla: yes 
    opened by ktock 3
  • Whiteouts don't work with overlayfs

    Some container images use whiteouts to indicate "removed entries". But currently, when we use CRFS with overlayfs, these whiteouts don't work and no entry is removed.

    Assume we have the lower layer:

    lower/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── localtime
    ├── mtab -> /proc/mounts
    ├── network
    │   ├── if-down.d
    │   ├── if-post-down.d
    │   ├── if-pre-up.d
    │   └── if-up.d
    ├── passwd
    ├── resolv.conf
    └── shadow
    

    And the upper layer including whiteouts:

    upper
    └── etc
        ├── network
        │   ├── newfile
        │   └── .wh..wh..opq
        └── .wh.localtime
    

    According to "whiteout" definition in the OCI image specification, the merged directory should be the following(compatible with docker images).

    merged/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── mtab -> /proc/mounts
    ├── network
    │   └── newfile
    ├── passwd
    ├── resolv.conf
    └── shadow
    
    1 directory, 8 files
    

    But currently CRFS shows these ".wh."-prefixed whiteout files as-is. This behaviour doesn't make overlayfs happy because overlayfs has a different convention for expressing whiteouts, so we currently get the following unexpected result:

    merged/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── localtime
    ├── mtab -> /proc/mounts
    ├── network
    │   ├── if-down.d
    │   ├── if-post-down.d
    │   ├── if-pre-up.d
    │   ├── if-up.d
    │   ├── newfile
    │   └── .wh..wh..opq
    ├── passwd
    ├── resolv.conf
    ├── shadow
    └── .wh.localtime
    
    opened by ktock 3
  • Sorting in python without sort function

    A way to sort in Python without the sort function, by ujjwal pratap singh.

    cla: no 
    opened by Anoymous-ai 3