compress/flate: add 3-byte matching for higher compression levels #15622
I spent some time analyzing this. It seems that the ratio hit occurs because the new hash-chain algorithm only operates on 4-byte segments, rather than 3-byte segments. This tends to be pathological for bitmap-based images, which represent a pixel as a 3-byte (R, G, B) tuple. It explains why there is such a large number of {Length:3, Distance:3} references in go1.6 that cannot be generated in go.tip. Given the dramatic increase in performance from 4-byte hashing, I support keeping it even if it causes some inputs to suffer (especially RGB-based images, and possibly other inputs). @nigeltao, your thoughts?
Thanks for the analysis. It certainly sounds plausible that these are RGB 3-tuples in an uncompressed TIFF. I agree that we don't need to revert 4-byte hashing yet. For one thing, TIFF is not as common as it used to be, and more modern image formats apply 2D-image-specific filtering before passing the data to a general-purpose compressor like DEFLATE; those filters already take out any easy repeated-RGB-3-tuple wins. Re-encoding with the standard library's image/png encoder yields 501111 bytes, smaller than both the 511928 (go1.6) and 741046 (go.tip) results. WebP Lossless (generated by the cwebp tool) is 379612 bytes.
For "BestCompression" it could be feasible to have a version that can do 3-byte matches. The token writer could then be modified to choose between a length-3 match and three literal bytes, depending on which encodes smaller.
Sounds good. Adding 3-byte matching to BestCompression might not be a bad idea. Since it seems we're in consensus about keeping 4-byte matching, I'm relabeling the issue as a proposal to add 3-byte matching to the high compression levels.
I ran both the Go tip (i.e. Go 1.7) and Go 1.6 image/png encoders over a corpus of 20,000 or so PNG images, comparing the output size of a decode-and-then-re-encode. I don't think image/png changed much; it should all be due to compress/flate. Below are the summary statistics (grouped by image type) of the ratios of tip size over 1.6 size; higher means tip is worse, a value of 1.000 means no change.
pXX are the percentiles, e.g. p50 is the median ratio. It's not too surprising that the *image.RGBA and *image.NRGBA Go image types (i.e. what PNG calls RGB and RGBA images) miss the 3-byte matches the most. Nonetheless, I think the increase in compressed size, on the order of 1-2% on average and less than 1% at the median, is acceptable for Go 1.7. Sure, we can always find specific examples (such as the TIFF in the original post) that re-encode relatively badly, but I think we're OK in general, at least for PNG. In Go 1.8, we can consider re-introducing 3-byte matches and see if that brings the outliers back into the fold.
From the original report: Using 9af83462, the recent changes to compress/flate have provided pretty good improvements overall to compression speeds, but I have noticed a few extraordinary cases where the compression ratio suffers significantly. This spreadsheet compares the output sizes from go1.6 to current go.tip on a number of compression benchmarks. All tests were done using the flate.DefaultCompression setting. Most notably, the clegg.tiff image from the ACT corpus experiences a 44% increase in output size (511928 to 741046 bytes).
I think it would be worthwhile to understand what causes this fairly significant regression in compression ratio before the go1.7 release.
cc @nigeltao @klauspost