Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/zlib: NewWriterLevel(&in, zlib.BestCompression) is not good as other lauguages at the same level #49780

Closed
walter1045 opened this issue Nov 24, 2021 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@walter1045
Copy link

walter1045 commented Nov 24, 2021

What version of Go are you using (go version)?

$ go version go1.17.3 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=auto
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\my\AppData\Local\go-build
set GOENV=C:\Users\my\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\my\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\my\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.17.3
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=E:\gopro\mygo\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=D:\cache\Temp\go-build312801471=/tmp/go-build -gno-record-gcc-switches

What did you do?

1: I have origin data which is compressed by zlib of C/C++;
2: And I use golang zlib uncompress it , then re-compress it back;

What did you expect to see?

3: Then I found that cannot get the same data with origin data:
4: I change language to Python3, then I get the correct data.

What did you see instead?

5: So, I think the function "zlib.NewWriterLevel(&in, 9)" is not good as other language at the same compress level.

Attach:

origin data size : 125154

Golang result:
0 1801955
1 138636
2 139037
3 136736
4 123597
5 120754
6 121741
7 121694
8 125187
9 125185

Python3 result: level, compress size whith same origin data
0 1801950
1 133128
2 140130
3 137284
4 124565
5 121315
6 122151
7 122118
8 125165
9 125154

@seankhliao seankhliao changed the title zlib.NewWriterLevel(&in, zlib.BestCompression) is not good as other lauguages at the same level compress/zlib: NewWriterLevel(&in, zlib.BestCompression) is not good as other lauguages at the same level Nov 24, 2021
@seankhliao seankhliao added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Nov 24, 2021
@seankhliao
Copy link
Member

cc @dsnet

@dsnet
Copy link
Member

dsnet commented Nov 24, 2021

Compression ratios are heavily dependent on the input data set. Without a reproducer, there isn't much that can be done here.

A few years ago, @klauspost optimized the encoder. As part of the optimization, it made the assumption that matches were always at least 4 bytes long. This means that we don't perform 3-byte matches, which may be the cause of lower compression ratio. See #15622. However, we cannot verify this without a reproduction.

@dsnet dsnet added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 24, 2021
@ulikunitz
Copy link
Contributor

ulikunitz commented Nov 25, 2021

If you want the same compression rate as another Lempel-Ziv compression tool, you need to replicate the Lempel-Ziv-Encoder exactly and use the same parameters. So this is actually a request to replicate the zlib Lempel-Ziv-Encoder in Go.

@klauspost
Copy link
Contributor

klauspost commented Nov 25, 2021

  1. A sample size of one says nothing about a compression scheme.

Test on a large set of inputs. Your input clearly has some special properties since it is uncommon for higher levels to have worse compression than medium. This is not typical.

  1. Size cannot be compared without considering speed.

For example, this is 4 GB web content speed vs compression with Go stdlib:

image

  1. Compression levels should not be compared across implementations. They should reflect the tradeoff between speed an compression as an average in a given implementation. Looking at the graph above anything above level 6 looks like a waste of CPU cycles, and level 4 would be the reasonable "default", and not level 6.

For my own deflate I have tried to make the tradeoff more "linear", with a reasonable default level 5.

image

It is not perfect, and some inputs will skew it, but that is the measure I use for balancing the compression levels.

Typically I only really find the "fastest", "default" (the reasonable tradeoff) and "best" (smallest) levels to be useful anyway, so for most implementations that are the only ones I focus on.

Short story: Choose the compression level that makes sense for your use case. Don't copy+paste from other languages.

@dsnet 3 vs 4 byte is rarely a gain since encoding 3 byte matches in most cases yield the same or larger output than the entropy encoding of the immediate values. You can estimate both, but it is rather expensive and relies on incomplete information.

FWIW, I tried adding 3 byte matches to deflate level 9, without comparing entropy coded sizes similar to old implementation. Total output size went from 730389060 -> 732262492 bytes on webdevdata.org-2015-01-07-subset.
enwik9 (which has a preference for small matches) went from 324205506 -> 323999691 bytes.

edit: Found a small improvement for the test, numbers updated.

@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@golang golang locked and limited conversation to collaborators Dec 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

6 participants