compress/zlib: NewWriterLevel(&in, zlib.BestCompression) is not as good as other languages at the same level #49780
cc @dsnet
Compression ratios are heavily dependent on the input data set, and without a reproducer there isn't much that can be done here. A few years ago, @klauspost optimized the encoder. As part of that optimization, it assumes that matches are always at least 4 bytes long, which means we never emit 3-byte matches; that may be the cause of the lower compression ratio. See #15622. However, we cannot verify this without a reproduction.
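A minimal sketch (not from the thread) of how one might probe the 3-byte-match hypothesis: build a synthetic input where most available matches are exactly 3 bytes long and sweep the levels. The input construction is an assumption for illustration, not a known worst case:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"math/rand"
)

func main() {
	// Synthetic input: a repeated 3-byte token separated by random bytes,
	// so most available back-references are only 3 bytes long.
	rng := rand.New(rand.NewSource(1))
	var in bytes.Buffer
	for i := 0; i < 100000; i++ {
		in.WriteString("abc")
		in.WriteByte(byte(rng.Intn(256)))
	}

	for level := zlib.BestSpeed; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(in.Bytes())
		w.Close()
		fmt.Printf("level %d: %d bytes\n", level, out.Len())
	}
}
```

Comparing those sizes against another zlib binding on the same bytes would show whether the missing 3-byte matches matter for a given input.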
If you want the same compression ratio as another Lempel-Ziv compression tool, you need to replicate its Lempel-Ziv encoder exactly and use the same parameters. So this is really a request to replicate the zlib encoder in Go.
Test on a large set of inputs. Your input clearly has some special properties, since it is uncommon for higher levels to compress worse than medium ones; this is not typical.

For example, this is speed vs. compression on 4 GB of web content with the Go stdlib: [chart not captured in this transcript]

For my own deflate package I have tried to make the tradeoff more "linear", with a reasonable default at level 5. It is not perfect, and some inputs will skew it, but that is the measure I use for balancing the compression levels. Typically I only find the "fastest", "default" (the reasonable tradeoff), and "best" (smallest) levels to be useful anyway, so those are the only ones I focus on in most implementations.

Short story: choose the compression level that makes sense for your use case. Don't copy+paste it from other languages.

@dsnet 3-byte vs. 4-byte matches are rarely a gain, since encoding a 3-byte match in most cases yields the same or larger output than entropy-encoding the literals themselves. You can estimate both, but that is rather expensive and relies on incomplete information. FWIW, I tried adding 3-byte matches to deflate level 9 without comparing entropy-coded sizes the way the old implementation did. Total output size went from 730389060 -> 732262492 bytes.

edit: Found a small improvement for the test, numbers updated.
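Acting on that advice generally means sweeping the levels over your own representative data and measuring both size and time. A sketch of such a sweep; the file name is a placeholder, not something from this thread:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"os"
	"time"
)

func main() {
	// "corpus.bin" is a placeholder; use data representative of your workload.
	data, err := os.ReadFile("corpus.bin")
	if err != nil {
		panic(err)
	}

	for level := zlib.BestSpeed; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		start := time.Now()
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(data)
		w.Close()
		fmt.Printf("level %d: %d bytes (%.1f%% of input) in %v\n",
			level, out.Len(), 100*float64(out.Len())/float64(len(data)), time.Since(start))
	}
}
```

Whichever level gives the best size/time tradeoff on that output is the right one for the workload, regardless of what another language's binding picks.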
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
1. I have original data that was compressed with zlib from C/C++.
2. I used Go's zlib to decompress it and then re-compress it.

What did you expect to see?

3. I expected to get the same data back, but I could not reproduce the original:
4. When I switched to Python 3, I got the correct data.

What did you see instead?

5. So I think zlib.NewWriterLevel(&in, 9) is not as good as other languages at the same compression level.
Attached results:

Original data size: 125154

Go results (level, compressed size):
0 1801955
1 138636
2 139037
3 136736
4 123597
5 120754
6 121741
7 121694
8 125187
9 125185
Python 3 results (level, compressed size) with the same original data:
0 1801950
1 133128
2 140130
3 137284
4 124565
5 121315
6 122151
7 122118
8 125165
9 125154
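A sketch of the round trip described above, for anyone who wants to reproduce the tables; the input file name is a placeholder for the C/C++-produced stream:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

func main() {
	// "origin.zlib" is a placeholder for the stream produced by C/C++ zlib.
	compressed, err := os.ReadFile("origin.zlib")
	if err != nil {
		panic(err)
	}

	// Decompress with Go's zlib reader.
	r, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	raw, err := io.ReadAll(r)
	if err != nil {
		panic(err)
	}
	r.Close()

	// Re-compress at every level (0-9) and print the resulting sizes.
	for level := zlib.NoCompression; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(raw)
		w.Close()
		fmt.Printf("%d %d\n", level, out.Len())
	}
}
```

Note that a byte-identical re-compression is not guaranteed by the DEFLATE format: different encoders can legally produce different streams for the same input, so only the decompressed bytes are required to match.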