compress/flate: make compressor faster #2726
I looked a little deeper into this, examining two files made by compressing the King James Bible with native Debian Linux gzip and with compress/gzip. The natively gzipped file contains 23 blocks encoded using dynamic Huffman trees. The compress/gzipped file contains 47 blocks encoded using dynamic Huffman trees, plus an empty block at the end containing no data at all. This seems to indicate that compress/gzip (actually compress/flate) decides to emit a new block much more frequently than native gzip. The emitting of blocks is partly controlled by maxFlateBlockTokens in deflate.go. Changing this reduces the output size (but not by enough to reach native gzip levels).
Experimenting further with maxFlateBlockTokens, it's definitely possible to reduce the number of blocks emitted. With maxFlateBlockTokens changed from 1 << 14 to 1 << 16, the same King James Bible compression results in 12 blocks. With it equal to 1 << 15, a total of 24 blocks are emitted. However, the file size has not changed greatly:

```
Standard compress/gzip = 1507879
With 1<<15             = 1506733
With 1<<16             = 1506281
```

So, this must go deeper into the actual compression itself.
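The relationship between maxFlateBlockTokens and the block counts reported above is roughly a ceiling division: the compressor flushes a block whenever its token buffer fills. A minimal sketch of that arithmetic, assuming a hypothetical total of ~770,000 tokens for the whole file (inferred only from the 47/24/12 counts above, not measured):

```go
package main

import "fmt"

// blocksFor estimates how many deflate blocks are emitted if the compressor
// flushes a block every time its token buffer of size maxBlockTokens fills.
func blocksFor(totalTokens, maxBlockTokens int) int {
	return (totalTokens + maxBlockTokens - 1) / maxBlockTokens // ceiling division
}

func main() {
	// Hypothetical token total, chosen to be consistent with the counts above.
	const totalTokens = 770000
	for _, max := range []int{1 << 14, 1 << 15, 1 << 16} {
		fmt.Printf("maxFlateBlockTokens=%d -> %d blocks\n", max, blocksFor(totalTokens, max))
	}
}
```

With that assumed total, the sketch reproduces 47, 24, and 12 blocks for the three buffer sizes, which is why doubling the buffer roughly halves the block count but barely changes the output size: the per-block overhead is small compared to the data.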
Reproduced.

```
$ hg identify
0b98ba2443b0 tip
```

I used the King James Bible from Project Gutenberg:

```
$ wget http://www.gutenberg.org/cache/epub/10/pg10.txt
4452069 pg10.txt
```

I used the following Go program:

```go
// gogzip.go
package main

import (
	"compress/gzip"
	"flag"
	"io"
	"os"
)

var n = flag.Int("n", 6, "Compression level (0-9)")

func main() {
	flag.Parse()
	c, _ := gzip.NewWriterLevel(os.Stdout, *n)
	io.Copy(c, os.Stdin)
}
```

```
$ cat pg10.txt | gzip -6 > pg10.linux.gz
$ cat pg10.txt | ./gogzip -n=6 > pg10.go.gz
1486154 pg10.go.gz
1404431 pg10.linux.gz
```

What I don't like is that my source file is not exactly the same as the topic starter's, but anyway: the issue exists. I will try to find out the source of the problem.
My program above has an error (a missing c.Close(), so bytes may get stuck in the buffer without being flushed to stdout; this is reproducible on an all-zero input). The correct program is:

```go
package main

import (
	"compress/gzip"
	"flag"
	"io"
	"os"
)

var n = flag.Int("n", 6, "Compression level (0-9)")

func main() {
	flag.Parse()
	c, _ := gzip.NewWriterLevel(os.Stdout, *n)
	io.Copy(c, os.Stdin)
	c.Close()
}
```

Test results:

```
1048576 zero.1M.bin
   1051 zero.1M.bin.linux.gz
   1055 zero.1M.bin.go.gz
4452069 pg10.txt
1404431 pg10.linux.gz
1507801 pg10.go.gz
```
The preview of the fix for lazy matching is here: http://golang.org/cl/5554066

Compression results:

```
level=6:
4452069 pg10.txt
1404431 pg10.linux.gz
1472391 pg10.go.gz

level=9:
4452069 pg10.txt
1392385 pg10.linux.gz
1467823 pg10.go.gz
```

The gap is still there, but it's getting better.
Not yet, because I see cases where Go produces slightly shorter results as well:

```
$ head -n 1000 pg10.txt > short.txt
42728 short.txt
13658 short.go.gz
13664 short.linux.gz
```

I would like to polish the fix above (by adding more DeflateInflate tests to ensure correctness) and after that understand this difference.
I see some room for improvement at the cost of some memory. Currently, the matching algorithm uses a 32k int array to store hashes, and hash collisions prevent finding a match. Naively increasing the hashHead size (and changing several related constants accordingly) gives the following result:

```
-6: 1463307 (vs 1472391 above)
-9: 1459164 (vs 1467823 above)
```

This is not immediately applicable because for some reason it slows down the encoder. Anyway, it's not a big deal to fix. Moreover, there is a good chance that using more memory would speed up the encoder, because the hash function could be simplified.

Another room for improvement is to allow decreasing the length of the previous match by 1 if its length > 3. It may make sense for text like the following:

```
zbcazzzzzzzzzzbc
```

Now, if I am correct, the algorithm would produce something like:

```
[z][b][c][a][z][z][z](zzz)(zzzz)[b][c] // 11 tokens
```

It may be improved as:

```
[z][b][c][a][z][z][z](zzz)(zzz)(zbc) // 10 tokens
```

I will probably do something about this once http://golang.org/cl/5554066 is ready and committed.
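For context on why the hashHead size matters: the matcher keeps, for each hash of the next few bytes, the most recent position (hashHead) and, per position, a link to the previous position with the same hash (hashPrev). A toy sketch of that structure follows — the names mirror deflate.go, but the hash function, sizes, and helpers here are illustrative assumptions, not the real implementation:

```go
package main

import "fmt"

const (
	hashBits = 15 // a small table, like the 32k one discussed above
	hashSize = 1 << hashBits
)

// hash maps the 3 bytes at data[pos:] to a table index. Distinct trigrams can
// collide in a small table, polluting the candidate chain with useless positions.
func hash(data []byte, pos int) uint32 {
	h := uint32(data[pos])<<16 | uint32(data[pos+1])<<8 | uint32(data[pos+2])
	return (h * 2654435761) >> (32 - hashBits)
}

// chains is a toy version of deflate's hashHead/hashPrev pair.
type chains struct {
	head [hashSize]int // most recent position for each hash, -1 if none
	prev []int         // prev[i] = previous position with the same hash
}

func newChains(n int) *chains {
	c := &chains{prev: make([]int, n)}
	for i := range c.head {
		c.head[i] = -1
	}
	return c
}

// insert records position pos, linking it in front of the existing chain.
func (c *chains) insert(data []byte, pos int) {
	h := hash(data, pos)
	c.prev[pos] = c.head[h]
	c.head[h] = pos
}

// candidates walks the chain for the trigram at pos, newest first.
func (c *chains) candidates(data []byte, pos int) []int {
	var out []int
	for p := c.head[hash(data, pos)]; p >= 0; p = c.prev[p] {
		out = append(out, p)
	}
	return out
}

func main() {
	data := []byte("abcxxabcyyabc") // "abc" occurs at 0, 5, and 10
	c := newChains(len(data))
	for i := 0; i < 10; i++ { // index everything before the final "abc"
		c.insert(data, i)
	}
	fmt.Println(c.candidates(data, 10)) // chain for "abc": earlier positions, newest first
}
```

Enlarging the table spreads trigrams over more slots, so a chain walk sees fewer colliding, unrelated positions; the price is cache pressure, which is the slowdown noted above.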
I have applied your patch to my local copy of Go and rerun my web site tests.

```
BBC News:         Linux gzip size = 21095,   original Go gzip size = 21612,   patched Go gzip size = 21307
CNN:              Linux gzip size = 19971,   original Go gzip size = 20473,   patched Go gzip size = 20073
Wikipedia:        Linux gzip size = 15149,   original Go gzip size = 15551,   patched Go gzip size = 15172
King James Bible: Linux gzip size = 1404452, original Go gzip size = 1503220, patched Go gzip size = 1472447
```

In all cases the new gzip size is smaller. Currently the sizes as a percentage of the Linux gzip size are as follows:

```
BBC              101.00%
CNN              100.51%
Wikipedia        100.15%
King James Bible 104.84%
```

Previously those figures were:

```
BBC              102.45%
CNN              102.51%
Wikipedia        102.65%
King James Bible 107.03%
```

So a marked improvement, albeit with still room for more.
I have started seriously considering changing my last name from Krasin to Krasin-Idiot. See http://golang.org/cl/5556077/

```
King James Bible, -9:
1392218 pg10.go.gz
1392385 pg10.linux.gz

King James Bible, -6:
1403709 pg10.go.gz
1404431 pg10.linux.gz
```

This means that the Go implementation now outperforms zlib on this particular example. Basically, this single "-" was clearing the history after each fillDeflate, which led to a shorter dictionary.
I've made that small change as well and rerun the numbers:

```
BBC News:         Linux gzip size = 21095,   original Go gzip size = 21612,   patched Go gzip size = 21088
CNN:              Linux gzip size = 19971,   original Go gzip size = 20473,   patched Go gzip size = 19894
Wikipedia:        Linux gzip size = 15149,   original Go gzip size = 15551,   patched Go gzip size = 15172
King James Bible: Linux gzip size = 1404452, original Go gzip size = 1503220, patched Go gzip size = 1403723
```

In three of the four cases the Go gzip size is now smaller than the Linux gzip size. The absolute differences are: BBC 7 bytes, CNN 77 bytes, Wikipedia -23 bytes, Bible 729 bytes. Given that the Linux gzip files carry the filename in the gzip header, this is a very, very satisfactory result. Marvellous!

For completeness, here are the percentages:

```
BBC               99.97%
CNN               99.61%
Wikipedia        100.15%
King James Bible  99.95%
```

Given how close this is (and often better), I'm satisfied. Thanks for working on this.
I have just checked that the aforementioned "room for improvement at the cost of memory" is no longer available -- the compressed size does not become smaller if hashHead is enlarged. At the same time, I see performance issues:

```
King James Bible, -9:
$ time -p ./gogzip -n 9 > pg10.go.gz
real 0.88
user 0.87
sys 0.00
$ time gzip -9 < pg10.txt > pg10.linux.gz
real 0m0.457s
user 0m0.448s
sys 0m0.008s

King James Bible, -6:
$ time -p ./gogzip -n 6 > pg10.go.gz
real 0.56
user 0.54
sys 0.01
$ time gzip -6 < pg10.txt > pg10.linux.gz
real 0m0.231s
user 0m0.220s
sys 0m0.008s
```

I will try to understand the reason for such a gap (it may be because the Go compiler has fewer optimizations than gcc, or it may be algorithmic slowness).
If you haven't used gopprof before, see http://blog.golang.org/2011/06/profiling-go-programs.html (Only works on Linux, not on Mac.) |
Thanks, Russ. I'm already there:

```
Total: 12110 samples
    3529  29.1%  29.1%     9979  82.4% compress/flate.(*compressor).deflate
    1686  13.9%  43.1%     1686  13.9% compress/flate.(*compressor).findMatch
    1222  10.1%  53.2%     1236  10.2% compress/flate.(*compressor).fillDeflate
     497   4.1%  57.3%     4827  39.9% compress/flate.(*huffmanBitWriter).writeBlock
     460   3.8%  61.1%      460   3.8% hash/crc32.update
     371   3.1%  64.1%     2146  17.7% runtime.mallocgc
     335   2.8%  66.9%     2500  20.6% compress/flate.(*huffmanEncoder).bitCounts
     332   2.7%  69.6%      332   2.7% compress/flate._func_001
     321   2.7%  72.3%      953   7.9% compress/flate.(*literalNodeSorter).Less
     286   2.4%  74.6%      741   6.1% sweep
```
The first trivial improvement is here: http://golang.org/cl/5555070/

Instead of filling hashPrev and hashHead with zeros at every fillDeflate, it uses a hashOffset that is adjusted at fill time. Impact (I tested on a 1.5 GB avi file), -6:

```
$ time gzip < /tmp/lala.avi > /dev/null
real 0m46.657s
user 0m46.415s
sys 0m0.200s
```

Original Go (b372a927701e tip):

```
$ 6g gogzip.go && 6l -o gogzip gogzip.6 && time -p ./gogzip < /tmp/lala.avi > /dev/null
real 120.20
user 119.63
sys 0.45
```

Patched Go:

```
$ 6g gogzip.go && 6l -o gogzip gogzip.6 && time -p ./gogzip < /tmp/lala.avi > /dev/null
real 108.85
user 108.35
sys 0.40
```

In short:

```
gzip:        46.657
original Go: 120.20 (2.57x gzip)
patched Go:  108.85 (2.33x gzip, 10% improvement)
```
I have another cleanup CL here: http://golang.org/cl/5561056/ When these two CLs are committed, I am going to optimize findMatch. Currently, it's always called with prevLength = 2. It should be called with larger prevLength if fastSkipHashing == skipNever and there was a match found at previous step. |
Three CLs committed:

http://code.google.com/p/go/source/detail?r=26dceba5c6100300d585bd5a87d479951f2bd498
http://code.google.com/p/go/source/detail?r=c2b80f5d73a6134afbd8110f8ab665a100dc7c95
http://code.google.com/p/go/source/detail?r=7a7845c5895511438239559968007c5c13c0a6a2

Current stats (on the same .avi file):

```
gzip:  46 s
Go:   107 s (2.33x)
```

I will continue looking for options to speed this up.
I have decided to increase the size of hashHead from 32768 to 131072: http://golang.org/cl/5569048/

Results:

```
amd64, Core i7 2600, 1.5 GB avi:
gzip:        46 s
Go:         107 s (2.33x)
patched Go: 80.33 s (1.75x)

ARM, Cortex A9 (first 50 MB of the avi):
gzip:       12.53 s
Go:         73 s (5.82x)
patched Go: 65.02 s (5.2x)
```

This change does not come free. It may affect systems with a small L2 cache (like some Atom processors) or older ARM systems (Cortex A8 and older). At the same time, I hope that the slowdown should not be too bad, because of the heavily reduced hash collision rate.
I have decided to include a text file in the benchmark. gutenberg.txt is a subset of Project Gutenberg texts created with:

```
$ find /disk2/gutenberg/ -name "[a-h]*.txt" | xargs cat > gutenberg.txt
```

```
amd64, Core i7 2600, 435 MB gutenberg.txt:
gzip:       26 s
Go:         57.88 s (2.23x)
patched Go: 57.62 s (2.22x)

ARM, Cortex A9, first 50 MB of gutenberg.txt:
gzip:       30.79 s
Go:         84.33 s (2.74x)
patched Go: 84.02 s (2.73x)
```

The text case is mostly unaffected by this change, which suggests that the hash function is quite good, because it worked well even with the small table size.
What I like about this change (http://golang.org/cl/5569048/) is that it changes the profile in a very clear way: it fixes non-productive findMatch calls in the case where the data is already compressed (like avi).

Before the change, avi:

```
Total: 10769 samples
    3276  30.4%  30.4%     9896  91.9% compress/flate.(*compressor).deflate
    1866  17.3%  47.7%     1866  17.3% compress/flate.(*compressor).findMatch
     473   4.4%  52.1%     4818  44.7% compress/flate.(*huffmanBitWriter).writeBlock
     425   3.9%  56.1%     2256  20.9% runtime.mallocgc
     420   3.9%  60.0%      420   3.9% hash/crc32.update
     331   3.1%  63.1%     2590  24.1% compress/flate.(*huffmanEncoder).bitCounts
     328   3.0%  66.1%      328   3.0% compress/flate._func_001
     304   2.8%  68.9%      803   7.5% sweep
     300   2.8%  71.7%      966   9.0% compress/flate.(*literalNodeSorter).Less
     238   2.2%  73.9%      793   7.4% compress/flate.literalNodeSorter.Less
```

Before the change, text:

```
Total: 5969 samples
    4594  77.0%  77.0%     4594  77.0% compress/flate.(*compressor).findMatch
     643  10.8%  87.7%     5781  96.9% compress/flate.(*compressor).deflate
     207   3.5%  91.2%      563   9.4% compress/flate.(*huffmanBitWriter).writeBlock
     127   2.1%  93.3%      127   2.1% hash/crc32.update
      95   1.6%  94.9%      167   2.8% compress/flate.(*huffmanBitWriter).writeBits
      68   1.1%  96.1%      157   2.6% compress/flate.(*huffmanBitWriter).writeCode
      40   0.7%  96.7%       77   1.3% compress/flate.(*huffmanBitWriter).flushBits
      26   0.4%  97.2%       26   0.4% compress/flate.offsetCode
      25   0.4%  97.6%       29   0.5% syscall.Syscall
      12   0.2%  97.8%       80   1.3% runtime.mallocgc
```

After the change, avi:

```
Total: 8885 samples
    2721  30.6%  30.6%     8061  90.7% compress/flate.(*compressor).deflate
     652   7.3%  38.0%      652   7.3% compress/flate.(*compressor).findMatch
     506   5.7%  43.7%     4719  53.1% compress/flate.(*huffmanBitWriter).writeBlock
     429   4.8%  48.5%     2210  24.9% runtime.mallocgc
     414   4.7%  53.1%      414   4.7% hash/crc32.update
     324   3.6%  56.8%      324   3.6% compress/flate._func_001
     309   3.5%  60.3%      806   9.1% sweep
     307   3.5%  63.7%     2489  28.0% compress/flate.(*huffmanEncoder).bitCounts
     303   3.4%  67.1%      934  10.5% compress/flate.(*literalNodeSorter).Less
     225   2.5%  69.7%      727   8.2% runtime.MCache_Alloc
```

After the change, text:

```
Total: 5875 samples
    4594  78.2%  78.2%     4594  78.2% compress/flate.(*compressor).findMatch
     585  10.0%  88.2%     5697  97.0% compress/flate.(*compressor).deflate
     198   3.4%  91.5%      541   9.2% compress/flate.(*huffmanBitWriter).writeBlock
     107   1.8%  93.3%      107   1.8% hash/crc32.update
      87   1.5%  94.8%      153   2.6% compress/flate.(*huffmanBitWriter).writeBits
      58   1.0%  95.8%      148   2.5% compress/flate.(*huffmanBitWriter).writeCode
      32   0.5%  96.4%       71   1.2% compress/flate.(*huffmanBitWriter).flushBits
      28   0.5%  96.8%       35   0.6% syscall.Syscall
      22   0.4%  97.2%       22   0.4% compress/flate.offsetCode
      21   0.4%  97.6%       91   1.5% compress/flate.(*huffmanEncoder).bitCounts
```

(I'm sorry for a lot of small updates on this issue. Feel free to unsubscribe. I would like to continue posting my progress here.)
http://golang.org/cl/5569048/ is committed.

I have evaluated a number of possible optimizations of findMatch, and all of them are effectively defeated by bounds checking. I'm currently thinking of two possible ways to go:

1. Use unsafe.Pointer and uintptr within findMatch. That would make the code less readable.
2. Put findMatch into a .c file.

I dislike both options. Russ, what is preferable? Am I missing something?
Could you post results with bounds-checking disabled? That would give us an idea how much it is currently costing us. My preference (which doesn't really matter) would be to leave it in "pure'n safe" go, and focus on improvements to the compiler, e.g. if there are ways that it could determine that bounds-checking is unneeded... |
Yes, it's an algorithmic issue with trigrams (http://en.wikipedia.org/wiki/Trigram). We spend too much time on "the", "and", and related strings, because the hash chains are really huge for them. I am currently trying to produce a solution for hot trigrams. If my idea does not work, I will try to look into gzip or kindly ask Jean-Loup Gailly to help.
Two updates:

1. A minor cleanup CL that does not affect performance (removing dead code): http://golang.org/cl/5577061/
2. A second CL that dramatically reduces the number of calls to runtime.GC from huffmanEncoder: http://golang.org/cl/5573076/ (not completely ready to submit)

The second CL depends on the first (unintentionally). Stats are given for the maximum compression level (-9):

```
gutenberg:
gzip:       37.52
Go:         76 (2.02x)
patched Go: 74.43 (1.98x, 2% improvement -- most of the time is spent in findMatch)

avi:
gzip:       46.21
Go:         90 (1.95x)
patched Go: 67.79 (1.47x, 25% improvement)
```

Obviously, GC issues only affect large files, but anyway, it's probably the right step. I will provide stats for other compression levels tomorrow.
Let's postpone any algorithmic changes to gzip until after Go 1. I'd like to let things settle so that we have some confidence in the stability of the code we're going to release. It's getting to be too late to be making significant changes, which could be harboring new bugs to find and fix (or miss). |
Russ, is http://golang.org/cl/5573076/ in the category of algorithmic changes? |
CL 5573076 was superseded by https://golang.org/cl/10006043 (revision 2fe813f4f3c2). |
Sorry about commenting on a fixed issue; I just thought this bit of information might be useful. I did some benchmarks with go1.2rc5 using the gzip/gunzip benchmarks under test/, with compress/gzip and vitess' cgzip. Compression is ~2x slower, whereas decompression is ~2.6x slower.

```
BenchmarkCGzip      5  286336917 ns/op   67.77 MB/s    32944 B/op     2 allocs/op
BenchmarkCGunzip   50   48548286 ns/op  399.70 MB/s    33056 B/op     3 allocs/op
BenchmarkGzip       2  513806722 ns/op   37.77 MB/s  1673440 B/op  4005 allocs/op
BenchmarkGunzip    10  125987530 ns/op  154.02 MB/s   112675 B/op   510 allocs/op
```
This issue was closed.
(Issue originally reported by jgrahamc.)