compress/zlib: NewWriterLevel(&in, zlib.BestCompression) is not as good as other languages at the same level #49780
cc @dsnet
Compression ratios are heavily dependent on the input data set, and without a reproducer there isn't much that can be done here. A few years ago, @klauspost optimized the encoder. As part of that optimization, it assumes that matches are always at least 4 bytes long, which means we never emit 3-byte matches; that may be the cause of the lower compression ratio. See #15622. However, we cannot verify this without a reproduction.
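A minimal sketch (not from the thread) of how one might probe the 3-byte-match hypothesis: build a synthetic input where most available matches are exactly 3 bytes long and sweep the levels. The input construction is an assumption for illustration, not a known worst case:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"math/rand"
)

func main() {
	// Synthetic input: a repeated 3-byte token separated by random bytes,
	// so most available back-references are only 3 bytes long.
	rng := rand.New(rand.NewSource(1))
	var in bytes.Buffer
	for i := 0; i < 100000; i++ {
		in.WriteString("abc")
		in.WriteByte(byte(rng.Intn(256)))
	}

	for level := zlib.BestSpeed; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(in.Bytes())
		w.Close()
		fmt.Printf("level %d: %d bytes\n", level, out.Len())
	}
}
```

Comparing those sizes against another zlib binding on the same bytes would show whether the missing 3-byte matches matter for a given input.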
If you want the same compression ratio as another Lempel-Ziv compression tool, you need to replicate its Lempel-Ziv encoder exactly and use the same parameters. So this is really a request to replicate the zlib encoder in Go.
Test on a large set of inputs. Your input clearly has some special properties, since it is uncommon for higher levels to compress worse than medium ones; this is not typical.

For example, this is speed vs. compression on 4 GB of web content with the Go stdlib: [chart not captured in this transcript]

For my own deflate package I have tried to make the tradeoff more "linear", with a reasonable default at level 5. It is not perfect, and some inputs will skew it, but that is the measure I use for balancing the compression levels. Typically I only find the "fastest", "default" (the reasonable tradeoff), and "best" (smallest) levels to be useful anyway, so those are the only ones I focus on in most implementations.

Short story: choose the compression level that makes sense for your use case. Don't copy+paste it from other languages.

@dsnet 3-byte vs. 4-byte matches are rarely a gain, since encoding a 3-byte match in most cases yields the same or larger output than entropy-encoding the literals themselves. You can estimate both, but that is rather expensive and relies on incomplete information. FWIW, I tried adding 3-byte matches to deflate level 9 without comparing entropy-coded sizes the way the old implementation did. Total output size went from 730389060 -> 732262492 bytes.

edit: Found a small improvement for the test, numbers updated.
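Acting on that advice generally means sweeping the levels over your own representative data and measuring both size and time. A sketch of such a sweep; the file name is a placeholder, not something from this thread:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"os"
	"time"
)

func main() {
	// "corpus.bin" is a placeholder; use data representative of your workload.
	data, err := os.ReadFile("corpus.bin")
	if err != nil {
		panic(err)
	}

	for level := zlib.BestSpeed; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		start := time.Now()
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(data)
		w.Close()
		fmt.Printf("level %d: %d bytes (%.1f%% of input) in %v\n",
			level, out.Len(), 100*float64(out.Len())/float64(len(data)), time.Since(start))
	}
}
```

Whichever level gives the best size/time tradeoff on that output is the right one for the workload, regardless of what another language's binding picks.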
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
1. I have original data that was compressed with zlib from C/C++.
2. I used Go's zlib to decompress it and then re-compress it.

What did you expect to see?

3. I expected to get the same data back, but I could not reproduce the original:
4. When I switched to Python 3, I got the correct data.

What did you see instead?

5. So I think zlib.NewWriterLevel(&in, 9) is not as good as other languages at the same compression level.
Attached results:

Original data size: 125154

Go results (level, compressed size):
0 1801955
1 138636
2 139037
3 136736
4 123597
5 120754
6 121741
7 121694
8 125187
9 125185
Python 3 results (level, compressed size) with the same original data:
0 1801950
1 133128
2 140130
3 137284
4 124565
5 121315
6 122151
7 122118
8 125165
9 125154
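A sketch of the round trip described above, for anyone who wants to reproduce the tables; the input file name is a placeholder for the C/C++-produced stream:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

func main() {
	// "origin.zlib" is a placeholder for the stream produced by C/C++ zlib.
	compressed, err := os.ReadFile("origin.zlib")
	if err != nil {
		panic(err)
	}

	// Decompress with Go's zlib reader.
	r, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	raw, err := io.ReadAll(r)
	if err != nil {
		panic(err)
	}
	r.Close()

	// Re-compress at every level (0-9) and print the resulting sizes.
	for level := zlib.NoCompression; level <= zlib.BestCompression; level++ {
		var out bytes.Buffer
		w, err := zlib.NewWriterLevel(&out, level)
		if err != nil {
			panic(err)
		}
		w.Write(raw)
		w.Close()
		fmt.Printf("%d %d\n", level, out.Len())
	}
}
```

Note that a byte-identical re-compression is not guaranteed by the DEFLATE format: different encoders can legally produce different streams for the same input, so only the decompressed bytes are required to match.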