Alternative archiving tool with fast performance for huge numbers of small files

Overview

fast-archiver

fast-archiver is a command-line tool, written in [Go](http://golang.org), for archiving directories and restoring those archives.

fast-archiver uses a few techniques to try to be more efficient than traditional tools:

  1. It reads a number of files concurrently and then serializes the output. Most other tools process files sequentially, where operations like open(), lstat(), and close() add significant overhead when reading huge numbers of small files. Making these operations concurrent means the tool spends more of its time actually reading and writing data.
  2. It begins archiving files before it has completed reading the directory entries that it is archiving, allowing for a fast startup time compared to tools that first create an inventory of files to transfer.

How Fast?

On a test workload of 2,089,214 files representing a total of 90 GiB of data, fast-archiver was compared with tar and rsync for reading data files and transferring them over a network. The test scenario was a PostgreSQL database, with many of the files being small, 8-24 kiB in size.

Compared with tar, fast-archiver took 33% of the execution time (27m 38s vs. 1h 23m 23s) to read the test workload and output the archive to /dev/null. The tar output had to be redirected through cat to create a comparable scenario, because tar recognizes /dev/null as its output and skips actually reading and writing the file data. Here's the raw timing output:

$ time fast-archiver -c -o /dev/null /db/data
skipping symbolic link /db/data/pg_xlog
1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
0inputs+0outputs (0major+1732minor)pagefaults 0swaps

$ time tar -cf - /db/data | cat > /dev/null
tar: Removing leading `/' from member names
tar: /db/data/base/16408/12445.2: file changed as we read it
tar: /db/data/base/16408/12464: file changed as we read it
32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
0inputs+0outputs (0major+5163minor)pagefaults 0swaps

Piped over ssh, fast-archiver can transfer the same database from one machine to another in 1h 30m, versus 3h for rsync.

Reductions this large may not be typical, but this is exactly the kind of workload that fast-archiver was designed for.

Examples

Creates an archive (-c) from the directory target1, writing it to the file target1.fast-archive (via shell redirection, or equivalently with -o):

fast-archiver -c target1 > target1.fast-archive
fast-archiver -c -o target1.fast-archive target1

Extracts the archive target1.fast-archive into the current directory:

fast-archiver -x < target1.fast-archive
fast-archiver -x -i target1.fast-archive

Creates a fast-archive remotely, and restores it locally, piping the data through ssh:

ssh [email protected] "cd /db; fast-archiver -c data --exclude=data/\*.pid" | fast-archiver -x

Installation

The fast-archiver repository contains both a command-line tool (at the root) and a package called falib, which contains the archive reading and writing code. To make the build work correctly with both the library and the command-line tool, it's necessary to set up the correct GOPATH and directory references.

Here's a quick set of steps to set up the build:

  • Install [Go](http://golang.org).
  • Set up $GOPATH, for example: export GOPATH=$HOME/go-projects. You may want to add it to your .bash_aliases or shell profile.
  • go get -u github.com/replicon/fast-archiver

_or_

  • go get -d github.com/replicon/fast-archiver && $GOPATH/src/github.com/replicon/fast-archiver/build.sh

Command-line arguments

-x Extract archive mode.
-c Create archive mode.
--multicpu Allows concurrent activities to run on the specified number of CPUs. Since archiving is dominated by I/O, additional CPUs tend to just add overhead in coordinating the concurrent goroutines, but could increase throughput in some scenarios. Defaults to 1.

Create-mode only

-o Output path for the archive. Defaults to stdout.
--exclude A colon-separated list of paths to exclude from the archive. Can include wildcards and other shell matching constructs.
--block-size Specifies the size of blocks being read from disk, in bytes. The larger the block size, the more memory fast-archiver will use, but it could result in higher I/O rates. Defaults to 4096, maximum value is 65535.
--dir-readers The maximum number of directories that will be read concurrently. Defaults to 16.
--file-readers The maximum number of files that will be read concurrently. Defaults to 16.
--queue-dir The maximum size of the queue for sub-directory paths to be processed. Defaults to 128.
--queue-read The maximum size of the queue for file paths to be processed. Defaults to 128.
--queue-write The maximum size of the block queue for archive output. Increasing this will increase the potential memory usage, as (queue-write * block-size) memory could be allocated for file reads. Defaults to 128.

Extract-mode only

-i Input path for the archive. Defaults to stdin.
--ignore-perms Do not restore permissions on files and directories.
--ignore-owners Do not restore uid and gid on files and directories.

Comments
  • Excessive goroutine creation

    The fix for issue #4 introduced a scenario where a directory tree with very many sub-directories can cause tens of thousands of goroutines to be created, affecting both the performance and stability of fast-archiver.

    opened by mfenniak 1
  • Minor fixups for fast-archiver

    Hi,

    I've added a -n flag to disable writing. In combination with -c, this will show the user what files would be created in an archive. In combination with -x, it will list the contents of an archive.

    Also, no point in allowing both -x and -c.

    opened by kbrint 1
  • Deadlock when directoryScanQueue is full

    If directoryScanQueue fills up and all directoryScanner goroutines are pending at "directoryScanQueue <- filePath", then a deadlock occurs where the workInProgress counter will never hit zero and all goroutines will get stuck shortly afterwards.

    bug 
    opened by mfenniak 1
  • readdirnames is UNIX-specific

    The implementation of readdirnames in create-archive.go, copied from golang, is UNIX-specific. This prevents fast-archiver from working on a non-UNIX system (e.g. Windows).

    bug 
    opened by mfenniak 1
  • cwd behaviour

    Hi there,

    IMHO there should be an option to specify the directory to extract to. This would be useful for archiving and extracting in a single command, e.g.

    fast-archiver -c $source | fast-archiver -x $target

    or add base directory option, default = current directory

    fast-archiver -c -b $source . | fast-archiver -x -b $target

    this would be even better

    opened by notEvil 1
  • Malicious archives may cause problems

    Leading / is prohibited, but an archive containing this file could cause problems on extract:

    ../../../../../../../../../../../../Users/you/.ssh/authorized_keys
    

    Or even:

    dir1
    dir1/somefile
    dir1/dir2 
    dir1/dir2/../../../../../../../../../../../../Users/you/.ssh/authorized_keys
    

    Any path containing ".." should be prohibited just like those starting with "/"

    Probably something like this... not sure if this is robust enough:

    for _, dir := range strings.Split(path, string(os.PathSeparator)) {
      if dir == ".." {
        return fmt.Errorf("illegal path in archive: %s", path)
      }
    }
    
    opened by kbrint 0
Owner
Replicon Inc.