
compress/gzip: use bufio to improve compress speed #56449

Closed
YorkJ opened this issue Oct 27, 2022 · 6 comments
Labels
FrozenDueToAge, NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made.), Performance
Milestone
Backlog

Comments

@YorkJ

YorkJ commented Oct 27, 2022

What version of Go are you using (go version)?

$ go version
go1.17.6

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOVERSION="go1.17.6"

What did you do?

Use gzip to compress files into a tar.gz archive, for example:

// example of tar.gz compression
func TarGz(...) error {
    ...
    // writer
    fw, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer fw.Close()
    gw := gzip.NewWriter(fw)
    defer gw.Close()
    tw := tar.NewWriter(gw)
    defer tw.Close()
    // compress and write file
    // (writing the tar header via tw.WriteHeader is elided here)
    fr, err := os.Open(src)
    if err != nil {
        return err
    }
    defer fr.Close()
    _, err = io.Copy(tw, fr)
    if err != nil {
        return err
    }
    ...
}

In my case, the target tar file dst is on a device with high I/O latency.

What did you expect to see?

The time cost of TarGz() should be close to that of the Linux command tar -czf ....

What did you see instead?

The result shows that TarGz() is 8x slower than tar -czf ....
Tracing the code, I found that gzip writes to the file every 240 bytes, which is inefficient in my case.
Thus, I recommend using a bufio.Writer as a buffered writer to avoid issuing too many I/O requests during compression, just like what is done in compress/lzw:

// compress/lzw/writer.go
...
w.w = bufio.NewWriter(dst)
...
w.w.WriteByte(...)
...
w.w.Flush()
...

With bufio, the dst file is written every 4096 bytes (compared to 240 bytes), which reduces the number of I/O requests and brings a significant improvement in my case.
Furthermore, it makes the size of the I/O requests stable and independent of the implementation of the given compression algorithm.

@ianlancetaylor
Contributor

It's straightforward for your code to wrap fw using bufio.NewWriter to get the speedup you want. As some people use gzip to write data to a network socket, it's not clear that it would be a good idea for gzip to use bufio by default.
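Applied to the TarGz example above, that wrapping could look roughly like the sketch below (the tarGzBuffered name, the tar header handling, and the explicit close-then-flush ordering are illustrative additions, not code from this thread):

package main

import (
    "archive/tar"
    "bufio"
    "compress/gzip"
    "io"
    "log"
    "os"
)

// tarGzBuffered is a hypothetical variant of TarGz that inserts a
// bufio.Writer between gzip and the destination file, so gzip's small
// writes are coalesced before they reach the high-latency device.
func tarGzBuffered(dst, src string) error {
    fw, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer fw.Close()

    bw := bufio.NewWriter(fw) // 4096-byte buffer by default
    gw := gzip.NewWriter(bw)
    tw := tar.NewWriter(gw)

    fr, err := os.Open(src)
    if err != nil {
        return err
    }
    defer fr.Close()

    fi, err := fr.Stat()
    if err != nil {
        return err
    }
    hdr, err := tar.FileInfoHeader(fi, "")
    if err != nil {
        return err
    }
    if err := tw.WriteHeader(hdr); err != nil {
        return err
    }
    if _, err := io.Copy(tw, fr); err != nil {
        return err
    }

    // Close tar, then gzip, then flush the bufio.Writer, so all buffered
    // data reaches the file before the deferred fw.Close runs.
    if err := tw.Close(); err != nil {
        return err
    }
    if err := gw.Close(); err != nil {
        return err
    }
    return bw.Flush()
}

func main() {
    if err := tarGzBuffered("out.tar.gz", "input.txt"); err != nil {
        log.Fatal(err)
    }
}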

@dsnet
Member

dsnet commented Oct 27, 2022

I'm not sure I understand why it's flushing every 240 bytes. The compressor already buffers a relatively large window before flushing, since there's a built-in buffer of at least 32 KiB of uncompressed input. Of course, that's for the uncompressed input, not the compressed output. Is the input highly compressible such that a large chunk of data outputs as only a small amount?
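One way to check how large the writes to the underlying writer actually are is a small counting wrapper (a diagnostic sketch, not from this thread; countingWriter and the choice of 32 MiB of zero bytes as highly compressible input are illustrative assumptions):

package main

import (
    "compress/gzip"
    "fmt"
)

// countingWriter records every Write it receives, standing in for the
// slow destination file, so we can see how big gzip's writes are.
type countingWriter struct {
    writes int
    bytes  int
}

func (w *countingWriter) Write(p []byte) (int, error) {
    w.writes++
    w.bytes += len(p)
    return len(p), nil
}

func main() {
    cw := &countingWriter{}
    gw := gzip.NewWriter(cw)

    // 32 MiB of zero bytes: highly compressible input.
    data := make([]byte, 32<<20)
    if _, err := gw.Write(data); err != nil {
        panic(err)
    }
    if err := gw.Close(); err != nil {
        panic(err)
    }

    fmt.Printf("%d writes, %d bytes total, ~%d bytes per write\n",
        cw.writes, cw.bytes, cw.bytes/cw.writes)
}

On highly compressible input the per-write size should come out small (on the order of the 240-byte constant mentioned below), whereas wrapping cw in a bufio.Writer first should raise it to the bufio buffer size.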

@dsnet
Member

dsnet commented Oct 27, 2022

Assuming this issue is because the data is highly compressible, I don't think we should add an output buffer automatically since memory use of DEFLATE compressors is something that matters for servers with many concurrent writers.

@heschi added the NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made.) label Oct 27, 2022
@heschi added this to the Backlog milestone Oct 27, 2022
@ZekeLu
Contributor

ZekeLu commented Oct 28, 2022

FYI, I think 240 bytes is set here:

// compress/flate/huffman_bit_writer.go
// bufferFlushSize indicates the buffer size
// after which bytes are flushed to the writer.
// Should preferably be a multiple of 6, since
// we accumulate 6 bytes between writes to the buffer.
bufferFlushSize = 240

@YorkJ
Author

YorkJ commented Oct 28, 2022

Thank you for the reply! I'm confused that in compress/lzw/writer.go the buffer is set to 4096 bytes by bufio.NewWriter, but for compress/gzip it is effectively 240 bytes because of the huffman_bit_writer implementation in compress/flate.
It seems that the appropriate buffer size depends on the situation; for example, writing to a file every 240 bytes is not a good idea.
Maybe there is some way to expose this configuration. Otherwise, we should recommend always using bufio in I/O-sensitive cases, whether a large or a small buffer is wanted, because different compression algorithms implement buffering in different ways.
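Until then, one way a caller can make the write size explicit is a small helper along these lines (a sketch; the gzipbuf package name, NewBufferedGzip, and the 256 KiB suggestion are hypothetical, not from this thread):

package gzipbuf

import (
    "bufio"
    "compress/gzip"
    "io"
)

// NewBufferedGzip wraps w in a bufio.Writer of an explicit size before
// handing it to gzip, so the caller (not the compression algorithm)
// decides how often the underlying device is written.
// Callers must Close the *gzip.Writer and then Flush the *bufio.Writer.
func NewBufferedGzip(w io.Writer, size int) (*gzip.Writer, *bufio.Writer) {
    bw := bufio.NewWriterSize(w, size) // e.g. 256<<10 for a high-latency device
    return gzip.NewWriter(bw), bw
}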

@klauspost
Contributor

I'm not sure I understand why it's flushing every 240 bytes.

The buffer size is of course rather arbitrary, but I think, as with most other packages, it is reasonable for users to apply the buffering that matches their use case. If you are writing to a bytes.Buffer or already have a bufio.Writer, there is little benefit to adding more internal buffering: you would just allocate more memory and get more cache misses.

Deflate is already rather memory intensive. I think leaving the option to add a buffer to the user is the correct approach. The buffer is big enough that writes to in-memory output aren't affected.

@YorkJ closed this as completed Nov 14, 2022
@golang locked and limited conversation to collaborators Nov 14, 2023