
compress: add AppendCompress and AppendDecompress #54078

Open
dsnet opened this issue Jul 27, 2022 · 11 comments
@dsnet
Member

dsnet commented Jul 27, 2022

There are situations where we have the entire input in memory and want to compress or decompress it into a buffer we already have. I propose adding the following functions to gzip, flate, and zlib:

// AppendEncoded appends the compressed form of src to the end of dst
// using the specified compression level.
// It returns an error only if the specified level is invalid.
func AppendEncoded(dst, src []byte, level int) ([]byte, error)

// AppendDecoded appends the decompressed form of src to the end of dst.
// If the decompressed output exceeds maxSize, it returns an error.
// If maxSize is negative, then there is no limit on the output size.
func AppendDecoded(dst, src []byte, maxSize int) ([]byte, error)

At face value, this appears to be yet another proposal for convenience functions (which have been proposed and rejected before: #16504). However, there are CPU and memory performance benefits to this API that are impossible to achieve otherwise.

Compression is the combination of LZ77 and Huffman encoding. The memory used by LZ77 is essentially the last 32KiB of uncompressed data. When operating through an io.Reader or io.Writer, the compressor/decompressor is responsible for maintaining its own copy of the LZ77 window, which costs a fair amount of copying and extra memory. The advantage of the APIs above is that no work is needed to maintain the LZ77 window separately, since AppendEncoded's src and AppendDecoded's dst are the LZ77 windows. When decompressing, the only memory required is for the Huffman decoder, which is relatively small. When compressing, the only memory required is for the LZ77 search structure and the Huffman encoder, which is sizeable, but still smaller than the entire LZ77 window.

Other considerations: the flate and zlib APIs provide the ability to specify a preset dictionary, and the proposed APIs have no means of doing so. Dictionaries appear to be rarely used, so adding them to this API is probably not worth it. Code that uses dictionaries can continue to use the io.Reader- and io.Writer-based APIs.

@dsnet dsnet added the Proposal label Jul 27, 2022
@gopherbot gopherbot added this to the Proposal milestone Jul 27, 2022
@ianlancetaylor
Contributor

I don't see anywhere to keep state here. Is this the correct usage? In particular, are we required to always keep passing the same destination slice to each call? Is there really no other state required?

    buf := make([]byte, 4096)
    var dst []byte
    for {
        n, err := r.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }
        if n == 0 {
            break
        }
        dst, err = gzip.AppendEncoded(dst, buf[:n], 9)
        if err != nil {
            return nil, err
        }
    }
    return dst, nil

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Jul 27, 2022
@dsnet
Member Author

dsnet commented Jul 27, 2022

The intention of the proposed API is for use when you already have the entire input as a []byte and want to continue dealing with it as a []byte. Any time you start or end with an io.Reader or io.Writer, the current API is superior.

For your example, you could do:

b, err := io.ReadAll(r) // converts io.Reader to []byte
if err != nil {
    return nil, err
}
return gzip.AppendEncoded(nil, b, 9) // stays in the []byte world

but I'm not sure if that's any faster or slower than doing:

var bb bytes.Buffer
zw := gzip.NewWriter(&bb) // stays in the io.Writer world
if _, err := io.Copy(zw, r); err != nil { // connects io.Reader and io.Writer together
    return nil, err
}
if err := zw.Close(); err != nil { // flushes any buffered data
    return nil, err
}
return bb.Bytes(), nil // converts the io.Writer's contents into a []byte

Calling gzip.AppendEncoded within each iteration of a read loop would be a fairly odd use of the API. It is technically correct, since each call would append the compressed form of buf[:n] as an individual gzip file to the destination, and a sequence of gzip files concatenated together is still a valid gzip file per RFC 1952. However, since the chunk size is 4KiB, I would not expect the compression ratio to be particularly good, as we are effectively limiting the compression window to 4KiB.

@rsc
Contributor

rsc commented Jul 27, 2022

This API would remove the allocation of the window. How much can we remove of the other allocated state? Can these be implemented with no allocation other than the append? (That is, can we get everything to be stack-allocated? Or were you planning on a sync.Pool?)

@dsnet
Member Author

dsnet commented Jul 27, 2022

I was planning on using a sync.Pool. With some work, I'm fairly certain the Huffman coders can be stack-allocated (since they're of fairly constant size); I'm not sure about the LZ77 search state.
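A sketch of how a sync.Pool-backed implementation might look today in user code, using the existing Writer.Reset to reuse compressor state (the function name appendEncoded is hypothetical, and a real implementation inside the package could pool lower-level state directly):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"sync"
)

// writerPool amortizes the allocation of the compressor state
// (LZ77 search structures, Huffman encoder) across calls, the way
// an AppendEncoded implementation might do internally.
var writerPool = sync.Pool{
	New: func() any { return gzip.NewWriter(nil) },
}

func appendEncoded(dst, src []byte) ([]byte, error) {
	buf := bytes.NewBuffer(dst) // writes append to dst's contents
	zw := writerPool.Get().(*gzip.Writer)
	defer writerPool.Put(zw)
	zw.Reset(buf) // reuse pooled state against the new destination
	if _, err := zw.Write(src); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := appendEncoded(nil, []byte("some data"))
	if err != nil || len(out) == 0 {
		panic("compression failed")
	}
}
```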

@rsc
Contributor

rsc commented Jul 27, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Jul 27, 2022
@ianlancetaylor
Contributor

Is Encoded really the right word here? Should these be named AppendCompressed and AppendDecompressed?

@dsnet
Member Author

dsnet commented Aug 3, 2022

I'm fine with either AppendEncoded/AppendDecoded or AppendCompressed/AppendDecompressed. I see "compression" as a particular type of "encoding". Decoded is shorter than Decompressed and the fact that we're dealing with compression is implied by the package name: gzip.AppendDecoded.

@rsc
Contributor

rsc commented Aug 10, 2022

strconv.AppendQuote is not AppendQuoted (no d).
So it seems like we can use AppendCompress and AppendDecompress, which will be very clear.
Quoting is a kind of encoding too, of course, but it's probably clearer to say the encoding than just 'Encoded'.

Does anyone object to AppendCompress and AppendDecompress?

@rsc rsc changed the title proposal: compress: add AppendEncoded and AppendDecoded proposal: compress: add AppendCompress and AppendDecompress Aug 10, 2022
@rsc
Contributor

rsc commented Aug 17, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Sep 7, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: compress: add AppendCompress and AppendDecompress compress: add AppendCompress and AppendDecompress Sep 7, 2022
@rsc rsc modified the milestones: Proposal, Backlog Sep 7, 2022
@gopherbot

Change https://go.dev/cl/429736 mentions this issue: compress/gzip: add AppendEncoded
