compress/gzip: compression level does not work #21987

Closed
dongweigogo opened this issue Sep 23, 2017 · 9 comments
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@dongweigogo

dongweigogo commented Sep 23, 2017

go 1.8.3

the gzip compression level does not work

I changed the compression level from 1 to 9, but all the compressed files have the same size and the same compression ratio. I don't know if anything goes wrong.

code as below:
`
func CompressFile(file string) {
	data, _ := ioutil.ReadFile(file)
	fout, _ := os.Create("out.gz")
	defer fout.Close()
	gz, _ := gzip.NewWriterLevel(fout, 3) // I tried from 1 to 9
	defer gz.Close()
	gz.Write(data)
	gz.Flush()
}
`

@dsnet changed the title from "the gzip compression level does not work" to "compress/gzip: compression level does not work" Sep 23, 2017
@dsnet
Member

dsnet commented Sep 23, 2017

Without knowing what the data is, it is impossible to determine whether something is going wrong here. Not all data is compressible (compressing it may produce a file a few bytes larger than the input). In other cases the optimal representation is trivial to achieve at any level (for example, compressing a string of zero bytes).

@dsnet added the WaitingForInfo label Sep 23, 2017
@odeke-em
Member

@dsnet I've made a repro here: https://play.golang.org/p/6GjY9AHO_z. It is also inlined:

package main

import (
	"compress/gzip"
	"crypto/rand"
	"fmt"
	"io"
	"io/ioutil"
	"log"
)

func compressSize(r io.Reader, level int) int64 {
	prc, pwc := io.Pipe()
	go func() {
		defer pwc.Close()
		gzw, err := gzip.NewWriterLevel(pwc, level)
		if err != nil {
			log.Printf("level: #%d err: %v", level, err)
			return // gzw is nil when NewWriterLevel fails, so stop here
		}
		io.Copy(gzw, r)
		gzw.Flush()
		gzw.Close()
	}()
	n, _ := io.Copy(ioutil.Discard, prc)
	return n
}

func main() {
	for level := 1; level <= 9; level++ {
		size := compressSize(io.LimitReader(rand.Reader, 100000), level)
		fmt.Printf("Level: %d size: %d\n", level, size)
	}
}

The sequence of bytes is read from crypto/rand, and you can see that it changes on each run with this sample: https://play.golang.org/p/4Hdst47Jov

When I run the compression repro 11 times (runs #0 through #10), I get the same results every time:

$ for ((i=0; i <=10; i++)) do echo -e "Run #$i\n" && go run main.go && echo -e "End of Run\n";done
Run #0

Level: 1 size: 100038
Level: 2 size: 100063
Level: 3 size: 100063
Level: 4 size: 100063
Level: 5 size: 100063
Level: 6 size: 100063
Level: 7 size: 100063
Level: 8 size: 100063
Level: 9 size: 100063
End of Run

Runs #1 through #10 print exactly the same output as Run #0.

@dsnet
Member

dsnet commented Sep 23, 2017

@odeke-em, your repro is just demonstrating that random data is not compressible, which is entirely explained by information theory. If level 9 can't compress, I don't know how you would expect the lower levels to do better.

(level 1 does better because it uses an entirely different algorithm that does not always emit a trailing empty block like the more general algorithm for levels 2-9, but that's a very minor implementation detail that maybe we'll fix someday).

@odeke-em
Member

Thanks for pointing that out @dsnet. I was being dumb and using different data on each run; I should have stuck with the same data, which would make sense (I had that before). I don't know much about information theory, but I'll study up on it. Thanks for piquing my interest in the field :)
My apologies for the noise here. However, when using the same reader, level 1 doesn't compress much, but the other levels seem to, by huge factors: https://play.golang.org/p/ZglKtnpBTr.

@nussjustin
Contributor

As @dsnet already explained, random data is not really a good fit for compression.

Here is a small program that compares the output length of Go's compress/gzip with that of the locally installed gzip tool: https://play.golang.org/p/nRU119B45l (it does not run on the playground; it needs gzip in the user's $PATH)

@dongweigogo Can you try the code with your data? Just save it as "gzip.go" and run go run gzip.go < input.txt and it will print the compressed length for both go and your locally installed gzip command for compression levels 1 to 9. There shouldn't be much difference between compress/gzip and the gzip tool.

Feeding the program (compiled with Go 1.9) with random data I get:

$ cat /dev/urandom | head -c100000 | go run gzip.go
2017/09/23 10:44:52 level 1, mode   go, bytes 100033
2017/09/23 10:44:52 level 1, mode exec, bytes 100038
2017/09/23 10:44:52 level 2, mode   go, bytes 100058
2017/09/23 10:44:52 level 2, mode exec, bytes 100038
2017/09/23 10:44:52 level 3, mode   go, bytes 100058
2017/09/23 10:44:52 level 3, mode exec, bytes 100038
2017/09/23 10:44:52 level 4, mode   go, bytes 100058
2017/09/23 10:44:52 level 4, mode exec, bytes 100038
2017/09/23 10:44:52 level 5, mode   go, bytes 100058
2017/09/23 10:44:52 level 5, mode exec, bytes 100038
2017/09/23 10:44:52 level 6, mode   go, bytes 100058
2017/09/23 10:44:52 level 6, mode exec, bytes 100038
2017/09/23 10:44:52 level 7, mode   go, bytes 100058
2017/09/23 10:44:52 level 7, mode exec, bytes 100038
2017/09/23 10:44:52 level 8, mode   go, bytes 100058
2017/09/23 10:44:52 level 8, mode exec, bytes 100038
2017/09/23 10:44:52 level 9, mode   go, bytes 100058
2017/09/23 10:44:52 level 9, mode exec, bytes 100038

Using some real data (multiple *.go files concatenated) I get

$ cat $GOPATH/src/github.com/nats-io/gnatsd/server/*.go | go run gzip.go
2017/09/23 10:45:04 level 1, mode   go, bytes 106244
2017/09/23 10:45:04 level 1, mode exec, bytes 104494
2017/09/23 10:45:04 level 2, mode   go, bytes 97783
2017/09/23 10:45:04 level 2, mode exec, bytes 99273
2017/09/23 10:45:04 level 3, mode   go, bytes 95359
2017/09/23 10:45:04 level 3, mode exec, bytes 95593
2017/09/23 10:45:04 level 4, mode   go, bytes 88169
2017/09/23 10:45:04 level 4, mode exec, bytes 88797
2017/09/23 10:45:04 level 5, mode   go, bytes 85406
2017/09/23 10:45:04 level 5, mode exec, bytes 85536
2017/09/23 10:45:04 level 6, mode   go, bytes 84633
2017/09/23 10:45:04 level 6, mode exec, bytes 84106
2017/09/23 10:45:04 level 7, mode   go, bytes 84512
2017/09/23 10:45:04 level 7, mode exec, bytes 83890
2017/09/23 10:45:04 level 8, mode   go, bytes 84429
2017/09/23 10:45:04 level 8, mode exec, bytes 83782
2017/09/23 10:45:04 level 9, mode   go, bytes 84426
2017/09/23 10:45:04 level 9, mode exec, bytes 83780

@dongweigogo
Author

dongweigogo commented Sep 23, 2017

@dsnet, @nussjustin actually the data is large: a 700 MB text file.

@dongweigogo
Author

dongweigogo commented Sep 23, 2017

@nussjustin, since the input is very large and takes quite a lot of time at high levels, I truncated the input file to 60 MB. The result is below:

2017/09/23 18:32:56 level 1, mode go, bytes 21211154
2017/09/23 18:32:59 level 1, mode exec, bytes 20587438
2017/09/23 18:33:02 level 2, mode go, bytes 19613592
2017/09/23 18:33:05 level 2, mode exec, bytes 19829704
2017/09/23 18:33:09 level 3, mode go, bytes 18827023
2017/09/23 18:33:13 level 3, mode exec, bytes 19071163
2017/09/23 18:33:16 level 4, mode go, bytes 18343188
2017/09/23 18:33:19 level 4, mode exec, bytes 19026827
2017/09/23 18:33:27 level 5, mode go, bytes 17844476
2017/09/23 18:33:32 level 5, mode exec, bytes 18512737
2017/09/23 18:33:51 level 6, mode go, bytes 17238602
2017/09/23 18:34:04 level 6, mode exec, bytes 17841435
2017/09/23 18:34:29 level 7, mode go, bytes 17070190
2017/09/23 18:34:51 level 7, mode exec, bytes 17520662
2017/09/23 18:35:17 level 8, mode go, bytes 17057128
2017/09/23 18:36:15 level 8, mode exec, bytes 17068204
2017/09/23 18:36:40 level 9, mode go, bytes 17057115
2017/09/23 18:37:38 level 9, mode exec, bytes 17064092

Then I tried using the gzip command to compress with levels from 1 to 9, and it works well. I have no idea why my code does not work.

@dsnet
Member

dsnet commented Sep 23, 2017

The output of the program @nussjustin provided seems to indicate that the compressed size decreases as the level increases. So I'm not sure what you mean by the compression level having no effect.

@ianlancetaylor
Contributor

Timed out.

@golang locked and limited conversation to collaborators Mar 29, 2019