CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

Overview

CRFS is a read-only FUSE filesystem that lets you mount a container image, served directly from a container registry (such as gcr.io), without pulling it all locally first.

Background

Starting a container should be fast. Currently, however, starting a container in many environments requires doing a pull operation from a container registry to read the entire container image from the registry and write the entire container image to the local machine's disk. It's pretty silly (and wasteful) that a read operation becomes a write operation. For small containers, this problem is rarely noticed. For larger containers, though, the pull operation quickly becomes the slowest part of launching a container, especially on a cold node. Contrast this with launching a VM on major cloud providers: even with a VM image that's hundreds of gigabytes, the VM boots in seconds. That's because the hypervisors' block devices are reading from the network on demand. The cloud providers all have great internal networks. Why aren't we using those great internal networks to read our container images on demand?

Why does Go want this?

Go's continuous build system tests Go on many operating systems and architectures, using a mix of containers (mostly for Linux) and VMs (for other operating systems). We prioritize fast builds, targeting a 5-minute turnaround for pre-submit tests when testing new changes. For isolation and other reasons, we run all our containers in single-use fresh VMs. Generally our containers do start quickly, but some of our containers are very large and take a long time to start. To work around that, we've automated the creation of VM images where our heavy containers are pre-pulled. This is all a silly workaround. It'd be much better if we could just read the bytes over the network from the right place, without all the hoops.

Tar files

One reason that reading the bytes directly from the source on demand is somewhat non-trivial is that container images are, somewhat regrettably, represented by tar.gz files, and tar files are unindexed, and gzip streams are not seekable. This means that trying to read 1KB out of a file named /var/lib/foo/data still involves pulling hundreds of gigabytes and decompressing the stream, decoding the entire tar file until you find the entry you're looking for. You can't look it up by its path name.
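
To make the cost concrete, here is a minimal Go sketch (not CRFS code; readFileFromTarGz is a made-up helper) of what reading a single path out of a plain tar.gz stream entails:

package tarexample

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"io/fs"
)

// readFileFromTarGz scans a tar.gz stream from the beginning until it finds
// the named entry. There is no index, so everything before the entry must be
// decompressed and skipped.
func readFileFromTarGz(r io.Reader, name string) ([]byte, error) {
	zr, err := gzip.NewReader(r) // decompression always starts at byte 0
	if err != nil {
		return nil, err
	}
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next() // sequential scan; no way to seek by path
		if err == io.EOF {
			return nil, fs.ErrNotExist
		}
		if err != nil {
			return nil, err
		}
		if hdr.Name == name {
			return io.ReadAll(tr) // everything before this entry was decompressed for nothing
		}
	}
}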

Introducing Stargz

Fortunately, we can fix the fact that tar.gz files are unindexed and unseekable, while still keeping the file a valid tar.gz file, by taking advantage of the fact that two gzip streams can be concatenated and still be a valid gzip stream. So you can just make a tar file where each tar entry is its own gzip stream.

We introduce a format, Stargz, a Seekable tar.gz format that's still a valid tar.gz file for everything else that's unaware of these details.

In summary:

  • The traditional *.tar.gz format is: Gzip(TarF(file1) + TarF(file2) + TarF(file3) + TarFooter)
  • Stargz's format is: Gzip(TarF(file1)) + Gzip(TarF(file2)) + Gzip(TarF(file3_chunk1)) + Gzip(F(file3_chunk2)) + Gzip(F(index of earlier files in magic file), TarFooter), where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall stargz file.

This makes images a few percent larger (due to more gzip headers and loss of compression context between files), but it's plenty acceptable.
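
The concatenation property that stargz relies on is easy to verify with Go's standard library; this toy program (not part of CRFS) writes two gzip members back to back and reads them back as a single stream:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

func main() {
	var buf bytes.Buffer
	for _, part := range []string{"hello, ", "world\n"} {
		zw := gzip.NewWriter(&buf) // each part becomes its own gzip member
		zw.Write([]byte(part))
		zw.Close() // finish this member; the next iteration starts another
	}

	zr, err := gzip.NewReader(&buf) // stdlib readers handle concatenated members
	if err != nil {
		panic(err)
	}
	out, _ := io.ReadAll(zr)
	fmt.Print(string(out)) // prints "hello, world"
}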

Converting images

If you're using docker push to push to a registry, you can't use CRFS to mount the image. Maybe one day docker push will push stargz files (or something with similar properties) by default, but not yet. So for now we need to convert the storage image layers from tar.gz into stargz. There is a tool that does that. TODO: examples

Operation

When mounting an image, the FUSE filesystem makes a couple of Docker Registry HTTP API requests to the container registry to get the metadata for the container and all of its layers.

It then does HTTP Range requests to read just the stargz index out of the end of each layer. The index is stored similarly to how the ZIP format's TOC is stored: a pointer to the index lives at the very end of the file. Reading the index generally takes one HTTP request, and never more than two. In any case, we're assuming a fast network with low latency to the container registry (GCE VMs talking to gcr.io, or similar). Each layer needs these one or two HTTP requests, but they can all be done in parallel.
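
As an illustration only (none of these names come from the CRFS source), fetching a blob's tail with a suffix Range request might look roughly like this in Go:

package registryexample

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// fetchTail asks the registry for only the last tailLen bytes of a layer blob,
// which is where the stargz footer and index live. blobURL, token, and tailLen
// are assumed to be known already; real registry access also involves auth
// token negotiation and redirects.
func fetchTail(ctx context.Context, blobURL, token string, tailLen int64) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, "GET", blobURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=-%d", tailLen)) // suffix range: last tailLen bytes
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}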

From that, we keep the index in memory, so readdir, stat, and friends are all served from memory. For reading data, the index contains the offset of each file's GZIP(TAR(file data)) range within the overall stargz file. To make it possible to efficiently read a small amount of data from large files, there can actually be multiple stargz index entries for a large file (e.g. a new gzip stream every 16MB of a large file).
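
A hedged sketch, with invented type and field names, of how such an in-memory index might map a read offset to the chunk that covers it:

package indexexample

import "sort"

// chunkEntry is a made-up stand-in for a stargz index entry; the real field
// names in CRFS may differ.
type chunkEntry struct {
	fileOffset   int64 // offset within the uncompressed file where this chunk starts
	size         int64 // uncompressed length of the chunk
	stargzOffset int64 // offset of the chunk's gzip stream within the stargz blob
}

// chunkFor returns the chunk covering a read at the given file offset.
// chunks must be sorted by fileOffset (e.g. one entry per 16MB of a big file).
func chunkFor(chunks []chunkEntry, off int64) (chunkEntry, bool) {
	i := sort.Search(len(chunks), func(i int) bool { return chunks[i].fileOffset > off })
	if i == 0 {
		return chunkEntry{}, false // offset precedes the first chunk
	}
	c := chunks[i-1]
	return c, off < c.fileOffset+c.size // false if off is past the end of the file
}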

Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only stargz layers, but it will stop short of trying to unify a writable filesystem layer atop them. For that, you can just use the traditional Linux filesystems.

Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole scheme natively, but in the meantime a workaround is that when converting an image into stargz format, the converter tool can also produce an image variant that only has metadata (environment, entrypoints, etc) and no file contents. Then you can bind mount in the contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

Input: gcr.io/your-proj/container:v2

Output: gcr.io/your-proj/container:v2meta + gcr.io/your-proj/container:v2stargz

What you actually run on Docker or Kubernetes then is the v2meta version, so your container host's docker pull or equivalent only pulls a few KB. The remaining gigabytes of data are read lazily via CRFS from the v2stargz layer directly from the container registry.

Status

WIP. Enough parts are implemented & tested for me to realize this isn't crazy. I'm publishing this document first for discussion while I finish things up. Maybe somebody will point me to an existing implementation, which would be great.

Discussion

See https://github.com/golang/go/issues/30829

Issues
  • Support mode that refreshes contents as image tag is updated on registry?

    Hey there,

    Congrats on the idea for this project. Sounds interesting and useful.

    Question: What happens when I have mounted the image, and then a new digest of the same image tag is pushed to the registry? Does my mounted image automatically get the changes from the new digest? Does the already-running container automatically stop and restart with the new digest?

    Thanks in advance

    enhancement question 
    opened by dimitrisyields 7
  • stargz: fix lookup for last chunk

    sort.Search fails only when the specified offset is after the last chunk. When that happens, attempt to use the last chunk.

    Signed-off-by: Giuseppe Scrivano [email protected]

    cla: yes 
    opened by giuseppe 6
  • Add support for converting images to stargzify

    A version of stargzify that works against container registries.

    We probably want a better name and different flags, but this is a start. We also might want to make layer flattening optional?

    cla: yes 
    opened by jonjohnsonjr 6
  • Add an option to specify size to fetch along with stargz footer

    In some cases (e.g. high bandwidth but high latency), the two round trips (one for the footer, one for the TOC JSON) can be a performance overhead on mount. This commit mitigates that by adding an option to specify how many bytes to fetch along with the stargz footer. With that option we can hopefully get the TOC JSON plus the footer in one go, which reduces round trips.

    This also addresses one of the TODOs noted in the source code.

    cla: yes 
    opened by ktock 5
  • comment on CVMFS

    FYI, maybe you have already heard of it before, but this seems similar to using CVMFS for container distribution.

    General information on CVMFS: https://cvmfs.readthedocs.io/en/stable/ https://cernvm.cern.ch/portal/filesystem https://github.com/cvmfs/cvmfs

    Information about loading docker images on demand from CVMFS: https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html

    Information about automatically converting container images and publishing them to CVMFS (with DUCC) https://cvmfs.readthedocs.io/en/stable/cpt-ducc.html

    documentation 
    opened by rptaylor 5
  • Truncate unnecessary data before specified offset

    Fixes: #17. Sometimes, big files are broken in CRFS.

    When we read a big file, the actual reads are split into several blocks. In such situations, a node is asked to read at a specific offset, but CRFS doesn't truncate the unnecessary data before that offset.

    This commit solves the issue by truncating the unnecessary data before the specified offset when CRFS fetches the first chunk of the required range.

    cla: yes 
    opened by ktock 4
  • stargz: add json tag for numLink

    stargz.index.json from ghcr.io/stargz-containers/node:17.8.0-esgz

    main

    {
    	"version": 1,
    	"entries": [
    		{
    			"name": ".no.prefetch.landmark",
    			"type": "reg",
    			"size": 1,
    			"offset": 85,
    			"NumLink": 0,
    			"digest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8",
    			"chunkDigest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8"
    		},
    		...
    }
    

    pr

    {
    	"version": 1,
    	"entries": [
    		{
    			"name": ".no.prefetch.landmark",
    			"type": "reg",
    			"size": 1,
    			"offset": 85,
    			"digest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8",
    			"chunkDigest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8"
    		},
    		...
    }
    

    Better to change NumLink to numLink to keep the field names consistent.

    opened by jonyhy96 3
  • Support whiteout entries in overlayfs

    Fixes: #40

    When removing entries, CRFS's current behaviour doesn't make overlayfs happy, because overlayfs uses a different convention for expressing whiteouts than the one defined in the OCI spec (which Docker follows). This commit solves this issue.

    See also:

    • OCI spec: https://github.com/opencontainers/image-spec/blob/775207bd45b6cb8153ce218cc59351799217451f/layer.md#whiteouts
    • Docker spec: https://github.com/moby/moby/blob/64fd3dc0d5e0b15246dcf8d2a58baf202cc179bc/image/spec/v1.2.md#creating-an-image-filesystem-changeset
    • overlayfs: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt

    Signed-off-by: Kohei Tokunaga [email protected]

    cla: yes 
    opened by ktock 3
  • Whiteouts don't work with overlayfs

    Some container images use whiteouts to indicate "removed entries". But currently, when we use CRFS with overlayfs, these whiteouts don't work and no entry is removed.

    Assume we have the lower layer:

    lower/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── localtime
    ├── mtab -> /proc/mounts
    ├── network
    │   ├── if-down.d
    │   ├── if-post-down.d
    │   ├── if-pre-up.d
    │   └── if-up.d
    ├── passwd
    ├── resolv.conf
    └── shadow
    

    And the upper layer including whiteouts:

    upper
    └── etc
        ├── network
        │   ├── newfile
        │   └── .wh..wh..opq
        └── .wh.localtime
    

    According to the "whiteout" definition in the OCI image specification, the merged directory should be the following (compatible with Docker images).

    merged/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── mtab -> /proc/mounts
    ├── network
    │   └── newfile
    ├── passwd
    ├── resolv.conf
    └── shadow
    
    1 directory, 8 files
    

    But currently CRFS shows these ".wh."-prefixed whiteout files as-is. This behaviour doesn't make overlayfs happy because overlayfs has a different convention for expressing whiteouts, so we currently get the following unexpected result:

    merged/etc
    ├── group
    ├── hostname
    ├── hosts
    ├── localtime
    ├── mtab -> /proc/mounts
    ├── network
    │   ├── if-down.d
    │   ├── if-post-down.d
    │   ├── if-pre-up.d
    │   ├── if-up.d
    │   ├── newfile
    │   └── .wh..wh..opq
    ├── passwd
    ├── resolv.conf
    ├── shadow
    └── .wh.localtime
    
    opened by ktock 3
  • Enable to fetch and upload images using HTTP protocol

    Fixes: #22

    Stargzifying an image on an HTTP registry fails when the registry isn't "localhost" (or 127.0.0.1). Recently, go-containerregistry gained support for reading and writing images over HTTP, even for fat images (https://github.com/google/go-containerregistry/pull/567). So we can introduce "insecure" options to fetch and upload images over HTTP by upgrading the module dependencies and using that functionality.

    cla: yes 
    opened by ktock 2
  • Provide same Node instance for same file not to make overlayfs confused.

    Fixes: #16

    CRFS currently doesn't support merging layers using overlayfs.

    CRFS generates a different "Node" instance every time "Lookup" is called. This behaviour makes bazil/fuse assign a different "Node ID" (used by FUSE) to an inode on every lookup, even when those lookups point to the same file, because bazil/fuse caches Node IDs keyed by the Node instance (not by an inode number or the like). Most of the time (when we don't use overlayfs, etc.) this is fine.

    However, when dentry cache revalidation is executed and the dentry has expired (by default, set to 1 minute in bazil/fuse), FUSE looks up the original inode again; it doesn't allow different Node IDs for the same inode, so it concludes the cache is invalid. Unfortunately, overlayfs doesn't tolerate invalid dentry caches and returns ESTALE.

    This commit solves the issue and makes CRFS support overlayfs by caching Node instances in CRFS once they are looked up, and reusing them when the same name is looked up again.

    cla: yes 
    opened by ktock 2