Go parallel gzip (de)compression

pgzip

Go parallel gzip compression/decompression. This is a fully gzip compatible drop in replacement for "compress/gzip".

This splits the input into blocks that are compressed in parallel, which is useful when compressing large amounts of data. The output is a standard gzip file.

The gzip decompression is modified so it decompresses ahead of the current reader. This means that reads will be non-blocking if the decompressor can keep ahead of your code reading from it. CRC calculation also takes place in a separate goroutine.

You should only use this if you are (de)compressing large amounts of data, say more than 1MB at a time; otherwise you will not see any benefit, and it will likely be faster to use the standard library gzip package or github.com/klauspost/compress.

It is important to note that this library creates and reads standard gzip files. You do not have to match the compressor/decompressor to get the described speedups, and the gzip files are fully compatible with other gzip readers/writers.

A golang variant of this is bgzf, which has the same feature, as well as seeking in the resulting file. The only drawback is a slightly bigger overhead compared to this and pure gzip. See a comparison below.

Installation

go get github.com/klauspost/pgzip/...

You might need to get/update the dependencies:

go get -u github.com/klauspost/compress

Usage

Godoc Documentation

To use as a replacement for gzip, exchange import "compress/gzip" with import gzip "github.com/klauspost/pgzip".

Changes

  • Oct 6, 2016: Fixed an issue if the destination writer returned an error.
  • Oct 6, 2016: Better buffer reuse, should now generate less garbage.
  • Oct 6, 2016: Output does not change based on write sizes.
  • Dec 8, 2015: Decoder now supports the io.WriterTo interface, giving a speedup and less GC pressure.
  • Oct 9, 2015: Reduced allocations by ~35 by using sync.Pool. ~15% overall speedup.

Changes in github.com/klauspost/compress are also carried over, so see that for more changes.

Compression

The simplest way to use this is to do the same as you would with compress/gzip.

To change the block size, use the added (*pgzip.Writer).SetConcurrency(blockSize, blocks int) function. It controls the approximate size of each block and how many blocks may be processed in parallel. The default is SetConcurrency(1MB, runtime.GOMAXPROCS(0)), meaning input is split into 1 MB blocks, and up to as many blocks as there are CPU threads can be in flight before the writer blocks.

Example:

var b bytes.Buffer
w := gzip.NewWriter(&b)
w.SetConcurrency(100000, 10)
w.Write([]byte("hello, world\n"))
w.Close()

To get any performance gains, you should be compressing at least 1 megabyte of data at a time.

You should have a block size of at least 100k and at least as many blocks as the number of cores you would like to utilize; about twice that number of blocks works best.

Another side effect is that it is likely to speed up your other code, since writes to the compressor only block if the compressor is already compressing the number of blocks you have specified. This also means you don't have to worry about buffering input to the compressor.

Decompression

Decompression works similarly to compression. That means that you simply call pgzip the same way as you would call compress/gzip.

The only difference is that if you want to specify your own readahead, you have to use pgzip.NewReaderN(r io.Reader, blockSize, blocks int) to get a reader with your custom block sizes. The blockSize is the size of each decoded block, and blocks is the maximum number of blocks that are decoded ahead.

See Example on playground

Performance

Compression

See my blog post in Benchmarks of Golang Gzip.

Compression cost is usually about 0.2% with default settings with a block size of 250k.

Example with GOMAXPROCS set to 32 (16 core CPU)

Content is Matt Mahoney's 10GB corpus. Compression level 6.

Compressor          MB/sec                  Speedup   Size (bytes)   Size overhead (lower=better)
gzip (golang)       15.44MB/s (1 thread)    1.0x      4781329307     0%
gzip (klauspost)    135.04MB/s (1 thread)   8.74x     4894858258     +2.37%
pgzip (klauspost)   1573.23MB/s             101.9x    4902285651     +2.53%
bgzf (biogo)        361.40MB/s              23.4x     4869686090     +1.85%
pargzip (builder)   306.01MB/s              19.8x     4786890417     +0.12%

pgzip also contains a linear time compression mode, that will allow compression at ~250MB per core per second, independent of the content.

See the complete sheet for different content types and compression settings.

Decompression

The decompression speedup is there because it allows you to do other work while the decompression is taking place.

In the example above, the numbers are as follows on a 4 CPU machine:

Decompressor     Time       Speedup
gzip (golang)    1m28.85s   0%
pgzip (golang)   43.48s     104%

But wait, since gzip decompression is inherently single-threaded (aside from CRC calculation), how can it be more than 100% faster? Because pgzip, due to its design, also acts as a buffer. When using unbuffered gzip, you are also waiting for I/O while you are decompressing. If the gzip decoder can keep up, it will always have data ready for your reader, and you will not be waiting for input to the gzip decompressor to complete.

This is pretty much an optimal situation for pgzip, but it reflects the most common use cases for CPU-intensive gzip usage.

I haven't included bgzf in this comparison, since it can only decompress files created by a compatible encoder, and therefore cannot be considered a generic gzip decompressor. But if you are able to compress your files with a bgzf-compatible program, you can expect it to scale beyond 100%.

License

This contains large portions of code from the go repository - see GO_LICENSE for more information. The changes are released under MIT License. See LICENSE for more information.

Comments
  • Missing lines with the uncompression example

    I am trying to uncompress a large file 24GB with the provided example (https://play.golang.org/p/uHv1B5NbDh) and the number of lines computed doesn't match the number expected.

    The compressed file has ~94M lines and the output shows ~79M ...

    What am I missing here ?

    Thanks

    opened by zorino 8
  • gzip: fix memory allocs (buffers not returned to pool)

    Fix allocation leaks during gzip writes. Due to incorrect use of the dstPool (unmatched Get and Put), large amounts of memory were temporarily allocated during writes and not put back into the pool.

    This also removes some special handling code in compressCurrent that would recursively call itself for too large input buffers. This condition can never occur, because Write ensures that blocks are capped and there is no other public interface that extends currentBuffer. The recursive call that slices the buffer would have made returning byte slices to the Pool dangerous, as we could have been returning the same underlying buffer multiple times.

    This also adds a test to check allocations per Write to prevent regressions. There is further room for improvement, but this was by far the biggest leak.

    Closes #8

    ~~Additionally, this adds a go.mod for Go modules support.~~

    (Note that tests broke with recent commit a8ba21498dc99e88bfc7677aa9b3ef38ef0101cc).

    opened by wojas 6
  • Add method to determine does file is pgzip or not

    I use many compressors (zip, gzip, pgzip, bgzf) and need to know which format a given file uses. For example, if I download a bgzf file I need to take a code path that can seek inside the file, whereas for gzip/pgzip I need to do other things (such as deciding whether to enable more CPUs). Would it be possible to add such a method?

    opened by vtolstov 5
  • Unexpected, and nondeterministic, panic reading `.tar.gz`

    NOTE: Borrowing the 'bug report' template from mholt/archiver which I'm coming indirectly from.

    What version of the package or command are you using?

    v1.2.5 via mholt/archiver latest version 3.5.1.

    What are you trying to do?

    Read a .tar.gz file and check that specific files are contained within it.

    What steps did you take?

    You can find a copy of the archive file here: https://github.com/fastly/cli/blob/main/pkg/commands/compute/testdata/deploy/pkg/package.tar.gz

    Here is the code that attempts to validate the file: https://github.com/fastly/cli/blob/main/pkg/commands/compute/validate.go#L51-L105

    What did you expect to happen, and what actually happened instead?

    I noticed that my test suite would fail nondeterministically with a panic raised from this project indirectly via the mholt/archiver dependency my project uses.

    If I run my test suite with a non default -count value of 20 (default is 1), then I can reliably get the test suite to panic. What happens is the archive file is read and one of the tests will eventually try to read the file and fail. The tests are not using t.Parallel() so there is no need to synchronise access to the archive file.

    Below is the test run error stack trace (you'll notice that there are two tests that run and they pass multiple times before nondeterministically failing with a panic)...

    NOTE: From what I can see our code calls the mholt/archiver's tar.Read() method here which then triggers the panic down in klauspost/pgzip.(*Reader).Read here.

    --- PASS: TestDeploy (24.79s)
        --- PASS: TestDeploy/service_domain_error (19.51s)
        --- PASS: TestDeploy/service_backend_error (5.20s)
    
    === RUN   TestDeploy
    === RUN   TestDeploy/service_domain_error
    panic: test timed out after 30s
    
    goroutine 153 [running]:
    testing.(*M).startAlarm.func1()
            /usr/local/go/src/testing/testing.go:1788 +0xbb
    created by time.goFunc
            /usr/local/go/src/time/sleep.go:180 +0x4a
    
    goroutine 1 [chan receive]:
    testing.(*T).Run(0xc00050dba0, {0x1ffd734, 0xa}, 0x203eae0)
            /usr/local/go/src/testing/testing.go:1307 +0x752
    testing.runTests.func1(0x0)
            /usr/local/go/src/testing/testing.go:1598 +0x9a
    testing.tRunner(0xc00050dba0, 0xc0000e3bf8)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    testing.runTests(0xc00029c200, {0x28cbe60, 0x12, 0x12}, {0x0, 0xc00028c840, 0x28dc3a0})
            /usr/local/go/src/testing/testing.go:1596 +0x7cb
    testing.(*M).Run(0xc00029c200)
            /usr/local/go/src/testing/testing.go:1504 +0x9d2
    main.main()
            _testmain.go:79 +0x22c
    
    goroutine 150 [chan receive]:
    testing.(*T).Run(0xc00050dd40, {0x200777d, 0x14}, 0xc00027cf60)
            /usr/local/go/src/testing/testing.go:1307 +0x752
    github.com/fastly/cli/pkg/commands/compute_test.TestDeploy(0xc00050dd40)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy_test.go:1262 +0x9ec9
    testing.tRunner(0xc00050dd40, 0x203eae0)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    created by testing.(*T).Run
            /usr/local/go/src/testing/testing.go:1306 +0x727
    
    goroutine 151 [chan receive]:
    github.com/klauspost/pgzip.(*Reader).Read(0xc000613180, {0xc000344000, 0x28c5540, 0x2000})
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/klauspost/pgzip/gunzip.go:451 +0x134
    io.(*LimitedReader).Read(0xc00000e378, {0xc000344000, 0x2000, 0x2000})
            /usr/local/go/src/io/io.go:473 +0xc6
    io.discard.ReadFrom({}, {0x2215c40, 0xc00000e378})
            /usr/local/go/src/io/io.go:598 +0x92
    io.copyBuffer({0x2216d00, 0x290d960}, {0x2215c40, 0xc00000e378}, {0x0, 0x0, 0x0})
            /usr/local/go/src/io/io.go:409 +0x1c3
    io.Copy(...)
            /usr/local/go/src/io/io.go:382
    io.CopyN({0x2216d00, 0x290d960}, {0xc584120, 0xc000613180}, 0x22113a0)
            /usr/local/go/src/io/io.go:358 +0xcc
    archive/tar.discard({0xc584120, 0xc000613180}, 0x22113a0)
            /usr/local/go/src/archive/tar/reader.go:852 +0x150
    archive/tar.(*Reader).next(0xc0000ecd80)
            /usr/local/go/src/archive/tar/reader.go:68 +0xef
    archive/tar.(*Reader).Next(0xc0000ecd80)
            /usr/local/go/src/archive/tar/reader.go:51 +0x53
    github.com/mholt/archiver/v3.(*Tar).Read(0xc00060e940)
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/mholt/archiver/v3/tar.go:441 +0xa5
    github.com/fastly/cli/pkg/commands/compute.validate({0xc00015cba0, 0x12})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/validate.go:78 +0x54d
    github.com/fastly/cli/pkg/commands/compute.validatePackage({{{0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, 0x2, ...}, ...}, ...)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy.go:414 +0x650
    github.com/fastly/cli/pkg/commands/compute.(*DeployCommand).Exec(0xc0000b24e0, {0x2216220, 0xc00008b420}, {0x2215040, 0xc00027d050})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy.go:106 +0x5d8
    github.com/fastly/cli/pkg/app.Run({0xc0004fff80, {0xc00016e7c0, 0x4, 0x4}, {{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...}, ...})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/app/run.go:171 +0x1f95
    github.com/fastly/cli/pkg/commands/compute_test.TestDeploy.func1(0xc000183520)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy_test.go:1352 +0x1165
    testing.tRunner(0xc000183520, 0xc00027cf60)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    created by testing.(*T).Run
            /usr/local/go/src/testing/testing.go:1306 +0x727
    
    goroutine 152 [running]:
            goroutine running on other thread; stack unavailable
    created by github.com/klauspost/pgzip.(*Reader).doReadAhead
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/klauspost/pgzip/gunzip.go:379 +0x528
    FAIL    github.com/fastly/cli/pkg/commands/compute      31.614s
    ?       github.com/fastly/cli/pkg/commands/compute/setup        [no test files]
    FAIL
    make: *** [test] Error 1
    

    How do you think this should be fixed?

    I'm not sure because mholt/archiver is already using the latest version of pgzip (1.2.5). I guess ideally pgzip shouldn't panic unexpectedly. But I suspect there's something I (or mholt/archiver) is doing wrong.

    opened by Integralist 3
  •  SetConcurrency(?,?)

    If only one goroutine is used, how does the performance compare with gzip? And for SetConcurrency(?,?), what is the optimal setting for the second parameter?

    opened by ayamzh 3
  • gunzip: Reset may loose buffers

    If Reset is called on the gzip reader without reading the gzip stream until EOF, the read-ahead goroutine may lose buffers from the block pool.

    Simple way to reproduce this is doing something like.

    r, _ := pgzip.NewReader(in)
    r.Reset(in)
    io.Copy(ioutil.Discard, r)
    

    Running this will succeed most of the time, but sometimes it will deadlock. What happens is that by the time killReadAhead() is called, the read-ahead goroutine might have already taken a buffer from the block pool. This buffer is either sent into the readAhead channel and lost when that channel is reinitialized on the next call to doReadAhead, or it is lost when the goroutine exits in gunzip.go#L419. In both cases the buffer taken by the read-ahead goroutine is never returned to the block pool. If all buffers are lost, the reader will completely deadlock.

    The easiest way to fix this would probably be to reinitialize and fill the block pool in either Reset or doReadAhead

    bug 
    opened by buengese 3
  • panic: close of closed channel

    While using pgzip, I'm getting this error in some (not yet fully debugged) situations:

    panic: close of closed channel
    
    goroutine 2364 [running]:
    github.com/klauspost/pgzip.(*Writer).checkError(0xc208060100, 0x0, 0x0)
    /home/ubuntu/.go_workspace/src/github.com/klauspost/pgzip/gzip.go:254 +0xee
    github.com/klauspost/pgzip.(*Writer).Write(0xc208060100, 0xc20827a000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
    /home/ubuntu/.go_workspace/src/github.com/klauspost/pgzip/gzip.go:273 +0x6b
    [...]
    

    If it can help, the underlying writer for pgzip is a io.Pipe() writer, and the other end is copied into a socket, so it looks like the bug is related to the packetization of the data over the wire, since it's not fully reproducible.

    From my reading of the code, it looks like the panic is actually a failure to propagate an underlying error code, so I will just put a print there to see what error code was triggered in the first place. Meanwhile, the traceback might be enough to point you to the bug causing the panic.

    opened by rasky 3
  • Possibility to reduce memory consumption

    Hello. I'm from the Kopia project. We use a bunch of your compressors in our repo. Kopia has a benchmark command that, given an input file, runs all the compressors on it and reports metrics such as compression ratio, throughput and memory consumption. pgzip seems to have huge stats compared to, say, s2.

    The memory consumption figure is calculated by calling runtime.ReadMemStats() before and after the compression loop and comparing the delta. Note that this is not about a memory leak, just allocation.

    Baseline: compressing a 400MB highly compressible file just once. All compressors behave similarly
    Repeating 1 times per compression method (total 466.7 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-default                 127.1 MiB    4 GiB/s      3126   375.4 MiB
      1. s2-better                  120.1 MiB    3.4 GiB/s    2999   351.7 MiB
      2. s2-parallel-8              127.1 MiB    2.8 GiB/s    2981   362.2 MiB
      3. s2-parallel-4              127.1 MiB    2.3 GiB/s    2951   344.1 MiB
      4. pgzip-best-speed           96.7 MiB     2.1 GiB/s    4127   324.1 MiB
      5. pgzip                      86.3 MiB     1.2 GiB/s    4132   298.7 MiB
      6. lz4                        131.8 MiB    458.9 MiB/s  17     321.7 MiB
      7. zstd-fastest               79.8 MiB     356.2 MiB/s  22503  246 MiB
      8. zstd                       76.8 MiB     323.7 MiB/s  22605  237.8 MiB
      9. deflate-best-speed         96.7 MiB     220.8 MiB/s  45     310.8 MiB
     10. gzip-best-speed            94.9 MiB     165 MiB/s    40     305.2 MiB
     11. deflate-default            86.3 MiB     143.1 MiB/s  34     311 MiB
     12. zstd-better-compression    74.2 MiB     104 MiB/s    22496  251.4 MiB
     13. pgzip-best-compression     83 MiB       55.9 MiB/s   4359   299.1 MiB
     14. gzip                       83.6 MiB     40.5 MiB/s   69     304.8 MiB
     15. zstd-best-compression      68.9 MiB     19.2 MiB/s   22669  303.4 MiB
     16. deflate-best-compression   83 MiB       5.6 MiB/s    134    311 MiB
     17. gzip-best-compression      83 MiB       5.1 MiB/s    137    304.8 MiB
    
    Compressing the first 128KB of the same file but repeat 10 times, you can see the higher memory consumption of pgzip among compressors
    Repeating 10 times per compression method (total 1.2 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-default                 43.6 KiB     625.3 MiB/s  71     2.1 MiB
      1. s2-parallel-4              43.6 KiB     625.3 MiB/s  67     2.1 MiB
      2. s2-parallel-8              43.6 KiB     624.5 MiB/s  67     2.1 MiB
      3. s2-better                  41.3 KiB     416.8 MiB/s  72     2.1 MiB
      4. deflate-best-speed         34.3 KiB     208.3 MiB/s  22     874.6 KiB
      5. zstd-fastest               28.6 KiB     178.6 MiB/s  160    9.4 MiB
      6. lz4                        44.7 KiB     178.5 MiB/s  38     88.6 MiB
      7. gzip-best-speed            33.7 KiB     138.9 MiB/s  28     1.2 MiB
      8. deflate-default            31.2 KiB     125 MiB/s    22     1.1 MiB
      9. zstd                       26.8 KiB     113.6 MiB/s  174    18.4 MiB
     10. pgzip-best-speed           34.3 KiB     113.6 MiB/s  252    27.3 MiB
     11. zstd-better-compression    26.3 KiB     96.2 MiB/s   156    37.2 MiB
     12. pgzip                      31.2 KiB     74.5 MiB/s   342    31.7 MiB
     13. gzip                       30.4 KiB     39.1 MiB/s   26     874.7 KiB
     14. deflate-best-compression   30.4 KiB     25.5 MiB/s   21     1 MiB
     15. gzip-best-compression      30.4 KiB     24 MiB/s     26     874.7 KiB
     16. pgzip-best-compression     30.4 KiB     23.2 MiB/s   285    30.2 MiB
     17. zstd-best-compression      25.1 KiB     16.9 MiB/s   155    99.2 MiB
    
    Repeating 100 times. s2 has exactly same stats, while pgzip grows accordingly
    Repeating 100 times per compression method (total 12.5 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-parallel-4              43.6 KiB     833.4 MiB/s  533    2.1 MiB
      1. s2-parallel-8              43.6 KiB     833.3 MiB/s  555    2.1 MiB
      2. s2-default                 43.6 KiB     833.3 MiB/s  579    2.1 MiB
      3. s2-better                  41.3 KiB     500 MiB/s    610    2.1 MiB
      4. zstd-fastest               28.6 KiB     240.4 MiB/s  925    9.5 MiB
      5. deflate-best-speed         34.3 KiB     198.4 MiB/s  22     874.6 KiB
      6. zstd                       26.8 KiB     165.4 MiB/s  907    18.5 MiB
      7. zstd-better-compression    26.3 KiB     162.3 MiB/s  881    37.3 MiB
      8. gzip-best-speed            33.7 KiB     150.6 MiB/s  28     1.2 MiB
      9. pgzip-best-speed           34.3 KiB     143.7 MiB/s  1649   220.2 MiB
     10. deflate-default            31.2 KiB     126.3 MiB/s  22     1.1 MiB
     11. lz4                        44.7 KiB     112.6 MiB/s  435    816.7 MiB
     12. pgzip                      31.2 KiB     94.6 MiB/s   2634   277.5 MiB
     13. gzip                       30.4 KiB     39.5 MiB/s   26     874.7 KiB
     14. deflate-best-compression   30.4 KiB     25.4 MiB/s   21     1 MiB
     15. gzip-best-compression      30.4 KiB     24.5 MiB/s   27     874.9 KiB
     16. pgzip-best-compression     30.4 KiB     23.1 MiB/s   2646   281.8 MiB
     17. zstd-best-compression      25.1 KiB     19.3 MiB/s   882    99.3 MiB
    

    I did some experiments around SetConcurrency() and found that:

    1. The consumption grows slowly as blocks increases, and exponentially as blockSize increases, possibly due to z.dstPool.New = func() interface{} { return make([]byte, 0, blockSize+(blockSize)>>4) } line.
    2. Even by just creating a new writer and immediately close it, the allocation still happens, possibly due to the internal compressCurrent().

    Is there a bug here? Why allocate memory when no data is compressed? And can Reset() reuse previously allocated memory instead of creating new (like s2)?

    opened by CrendKing 2
  • pgzip.Writer causes panics in bufio.Write

    When pgzip's Writer is used as a bufio.Writer, calls to Write can result in panics like:

    bufio: writer returned negative count from Write
    

    which come from this bit of code in bufio:

    var errNegativeWrite = errors.New("bufio: writer returned negative count from Write")
    
    // writeBuf writes the Reader's buffer to the writer.
    func (b *Reader) writeBuf(w io.Writer) (int64, error) {
        n, err := w.Write(b.buf[b.r:b.w])
        if n < 0 {
            panic(errNegativeWrite)
        }
        b.r += n
        return int64(n), err
    }
    

    This seems to be due to the section of Write that actually writes the compressed data to the underlying buffer, definitely at least during the first iteration, and possibly others. The issue is with this return:

    if err := z.checkError(); err != nil {
    	return len(p) - len(q) - length, err
    
    }
    

    On the first iteration of this loop q := p, and length is a positive integer, so this will always return a negative number, causing bufio to panic rather than to propagate the error back to the caller.

    I think a simple fix here is to return the max of 0 and that value.

    opened by ajm188 2
  • Publish version tags that are compatible with Go Modules

    Go 1.12 and beyond require tags to be published in the form vN.N.N (all three numbers are required). Right now there exists v1.1 and v½.2.0 which the compiler doesn't behave as one might expect.

    For example (using Go1.11 + GO111MODULE=on),

    v1.1 does not work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.1
    )
    [p1 foobar] $ go build
    go: errors parsing go.mod:
    /tmp/foobar/go.mod:4: invalid module version "v1.1": no matching versions for query "v1.1"
    

    whereas v1.0.1 does work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.0.1
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/compress/flate latest
    go: finding github.com/klauspost/crc32 latest
    [p1 foobar] $
    

    tag v½.2.0 seems to trigger the fallback timestamp+hash pseudo version behavior:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v½.2.0
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/pgzip v½.2.0
    go: finding github.com/klauspost/compress/flate latest
    go: finding github.com/klauspost/crc32 latest
    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/compress v1.4.1 // indirect
    	github.com/klauspost/cpuid v1.2.0 // indirect
    	github.com/klauspost/crc32 v0.0.0-20170628072449-bab58d77464a // indirect
    	github.com/klauspost/pgzip v1.0.2-0.20180717084224-c4ad2ed77aec
    )
    

    Attempting to use v1.1.0 (as one might expect to be synonymous with v1.1) also doesn't work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.1.0
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/pgzip v1.1.0
    go: github.com/klauspost/pgzip@v1.1.0: unknown revision v1.1.0
    go: error loading module requirements
    
    opened by shoenig 2
  • Please document the version of Go that was used

    On go 1.11 beta2, I'm seeing the same performance from stdlib and pgzip for decompression, so it would be useful to know when your benchmarks were done.

    opened by flx42 2
  • gunzip: improve EOF handling

    This fixes #38 and #39, though I'm not entirely sure if you're happy with this approach.

    To solve #39, we switch from using a channel for the block pool and instead use a sync.Pool. This does have the downside that the read-ahead goroutine can now end up allocating more blocks than the user requested. If this is not acceptable I can try to figure out a different solution for this problem. By using sync.Pool, there is no issue of blocking on a goroutine channel send when there are no other threads reading from it.

    To solve #38, some extra io.EOF special casing was needed in both WriteTo and Read. I think that these changes are reasonable -- it seems as though z.err should never store io.EOF (and there were only a few cases where it would -- which I've now fixed), but let me know what you think.

    Fixes #38 Fixes #39

    Signed-off-by: Aleksa Sarai [email protected]

    opened by cyphar 9
  • goroutine deadlock if Read or WriteTo is called after WriteTo end of stream

    I found this when playing around with the reproducer for #38. It seems as though if you do an io.Copy of a stream (which uses z.WriteTo), followed by ReadAll (which uses z.Read) you end up with a goroutine deadlock. https://play.golang.org/p/x6u6JSoKd2t

    package main
    
    import (
    	"bytes"
    	"fmt"
    	"io"
    	"io/ioutil"
    
    	"github.com/klauspost/pgzip"
    )
    
    // echo hello | gzip -c | xxd -i
    var gzipData = []byte{
    	0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0xcb, 0x48,
    	0xcd, 0xc9, 0xc9, 0xe7, 0x02, 0x00, 0x20, 0x30, 0x3a, 0x36, 0x06, 0x00,
    	0x00, 0x00,
    }
    
    func main() {
    	buf := bytes.NewBuffer(gzipData)
    
    	rdr, err := pgzip.NewReader(buf)
    	if err != nil {
    		panic(err)
    	}
    
    	n, err := io.Copy(ioutil.Discard, rdr)
    	fmt.Printf("io.Copy at start of stream: n=%v, err=%v\n", n, err)
    
    	b, err := ioutil.ReadAll(rdr)
    	if err != nil {
    		panic(err)
    	}
    	fmt.Printf("read %q from stream\n", string(b))
    }
    
    io.Copy at start of stream: n=6, err=<nil>
    fatal error: all goroutines are asleep - deadlock!
    
    goroutine 1 [chan send]:
    github.com/klauspost/pgzip.(*Reader).Read(0xc00006ea80, 0xc000120000, 0x200, 0x200, 0xc000120000, 0x0, 0x0)
    	/tmp/gopath805285542/pkg/mod/github.com/klauspost/[email protected]/gunzip.go:473 +0xfe
    bytes.(*Buffer).ReadFrom(0xc000043e80, 0x5055a0, 0xc00006ea80, 0xc000062a90, 0xc00011e000, 0x29)
    	/usr/local/go-faketime/src/bytes/buffer.go:204 +0xb1
    io/ioutil.readAll(0x5055a0, 0xc00006ea80, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
    	/usr/local/go-faketime/src/io/ioutil/ioutil.go:36 +0xe5
    io/ioutil.ReadAll(...)
    	/usr/local/go-faketime/src/io/ioutil/ioutil.go:45
    main.main()
    	/tmp/sandbox128643491/prog.go:30 +0x1fc
    
    Program exited: status 2.
    
    opened by cyphar 1
  • io.Copy(pgzip.Reader) returns io.EOF if stream already complete due to WriteTo implementation

    It turns out that if you have a pgzip.Reader which has read to the end of the stream, if you call io.Copy on that stream you get io.EOF -- which should never happen and causes spurious errors on callers that check error values from io.Copy. I hit this when working on opencontainers/umoci#360.

    This happens because (as an optimisation) io.Copy will use the WriteTo method of the reader (or ReadFrom method of the writer) if they support that method. And in this mode, io.Copy simply returns whatever error the reader or writer give it -- meaning it doesn't hide io.EOFs returned from those methods. In the normal Read+Write mode, io.Copy does hide the error.

    It seems as though this is at some level a stdlib bug, because this requirement of io.WriterTo and io.ReaderFrom implementations (don't return io.EOF because io.Copy can't handle it) is not spelled out anywhere in the documentation. So either io.Copy should handle this, or this requirement should be documented. So I will open a parallel issue on the Go tracker for this problem.

    But for now, it seems that the WriteTo implementation should avoid returning io.EOF. If the reader reaches an io.EOF before it is expected, the error should instead be io.ErrUnexpectedEOF.

    opened by cyphar 1
  • A parallel zlib implementation?

    Hi there,

    Any chance of implementing pgzip for plain zlib? As far as I can tell, the only thing that differs between the two formats are the headers and the CRC.

    Cheers, Gabriel

    opened by gabriel-samfira 0
Releases(v1.2.5)
Owner
Klaus Post
a little app to gzip+base64 encode and decode

GO=GZIP64 A little golang console utility that reads a file and either: 1) Encodes it - gzip compress followed by base64 encode writes

Steve White 1 Oct 16, 2021
Optimized compression packages

compress This package provides various compression algorithms. zstandard compression and decompression in pure Go. S2 is a high performance replacemen

Klaus Post 3.3k Nov 22, 2022
Go wrapper for LZO compression library

This is a cgo wrapper around the LZO real-time compression library. LZO is available at http://www.oberhumer.com/opensource/lzo/ lzo.go is the go pack

Damian Gryski 13 Mar 4, 2022
Port of LZ4 lossless compression algorithm to Go

go-lz4 go-lz4 is port of LZ4 lossless compression algorithm to Go. The original C code is located at: https://github.com/Cyan4973/lz4 Status Usage go

Бранимир Караџић 209 Jun 14, 2022
LZ4 compression and decompression in pure Go

lz4 : LZ4 compression in pure Go Overview This package provides a streaming interface to LZ4 data streams as well as low level compress and uncompress

Pierre Curto 710 Nov 25, 2022
Unsigned Integer 32 Byte Packing Compression

dbp32 Unsigned Integer 32 Byte Packing Compression. Inspired by lemire/FastPFor. Package bp32 is an implementation of the binary packing integer compr

Ali Josie 2 Sep 6, 2021
Bzip2 Compression Tool written in Go

Bzip2 Compression Tool written in Go

Pedro Albanese 1 Dec 28, 2021
Slipstream is a method for lossless compression of power system data.

Slipstream Slipstream is a method for lossless compression of power system data. Design principles The protocol is designed for streaming raw measurem

Synaptec Ltd 4 Apr 14, 2022
An easy-to-use CLI-based compression tool.

Easy Compression An easy-to-use CLI-based compression tool. Usage NAME: EasyCompression - A CLI-based tool for (de)compression USAGE: EasyCompr

Tei Michael 1 Jan 1, 2022
zlib compression tool for modern multi-core machines written in Go

zlib compression tool for modern multi-core machines written in Go

Pedro F. Albanese 0 Jan 21, 2022
Parallel implementation of Gzip for modern multi-core machines written in Go

gzip Parallel implementation of gzip for modern multi-core machines written in Go Usage: gzip [OPTION]... [FILE] Compress or uncompress FILE (by defau

Pedro Albanese 0 Nov 16, 2021
parallel: a Go Parallel Processing Library

parallel: a Go Parallel Processing Library Concurrency is hard. This library doesn't aim to make it easy, but it will hopefully make it a little less

Ryan Skidmore 29 May 9, 2022
M3u8-parallel-downloader - M3u8 parallel downloader with golang

m3u8-parallel-downloader Usage ./m3u8-parallel-downloader -input http://example.

CzBiX 4 Aug 12, 2022
Gzip Middleware for Go

An out-of-the-box, also customizable gzip middleware for Gin and net/http.

LI Zhennan 130 Oct 31, 2022
Split text files into gzip files with x lines

hakgzsplit split lines of text into multiple gzip files

Luke Stephens (hakluke) 6 Jun 21, 2022
Ripgrep but for gzip-compressed files over http

Juicer It's ripgrep but for Gzip-compressed files over HTTP! This tool was primarily designed to scan thru the Common Crawl dataset for URLs without s

Boom 2 Feb 21, 2022