tl;dr: compression speed is up ~95% for small messages and ~60% for large ones, but parallelism is gone.
I like optimizing things. For one project I'm working on we wanted to support lz4 in a network compression scenario, where we are encoding small standalone sequences in a single data stream. I wrote a couple of reference implementations; flate/gzip performed quite well initially, but lz4 didn't do so well. I set about fixing that.
The first change was to remove the parallelism in the Writer. I expect this is the most controversial change in this PR and totally understand if you don't want to accept it as a result. But I think it makes sense: Go's built-in gzip implementation is single-threaded, and in most applications where lz4 support would be embedded, threading the compression itself is unnecessary because the application is already serving other requests on all available threads; concurrency comes at the request level rather than the operation level. It also enabled the other optimizations described below.
This change directly allowed me to make tweaks to the writer so that no memory allocations are necessary during compression. Previously, heap allocations were needed for memory shared between goroutines. These changes cut about 50% off the time of compressing the `lorem` string, and make up the first commit in this PR.
After that I noticed that each write was suspiciously slow given that we weren't allocating anything. It turns out most of the time was spent clearing the `hashTable`, as it was allocated on the stack. I toyed around with a few solutions to avoid this but ultimately ended up with a "generational" table: each call to `CompressBlock` gets its own unique generation ID, and lookups only match values written during the current generation. Combined with the single-threaded approach, this lets us cache the `hashTable` across calls. This change cut 75% off the remaining time.
Finally, pprof showed me that quite a bit of time was being spent in `binary.LittleEndian.Uint32`. I used a little bit of scary `unsafe` that avoids this call on the most common architectures (x86/amd64/some ARM), which are natively little-endian. This cut 50% off the remaining time.
All in all, compressing `lorem` in the included benchmark now runs about 95% faster on my machine. I was concerned that removing parallelism would hurt compression speed on large sequences, but it seems that, at least on the VM I was testing on, the optimizations it made possible were a good tradeoff.
```
benchmark                        old ns/op    new ns/op    delta
BenchmarkCompressEndToEnd        44258        2421         -94.53%
BenchmarkCompressEndToEnd-10MB   24691717     10113045     -59.04%
```
Look ma, no mallocs!
```
BenchmarkCompressEndToEnd    500000    2434 ns/op    8 B/op    0 allocs/op
```
I think I got most of the low-hanging fruit in this PR. The majority of the remaining time is spent in branch mispredictions, some memmoves, hashing, bitwise operations, and assignments.