compress/zlib: reading data with known length #14867
I think the safest thing we can do here is document this. Changing the behavior of Close to start doing reads, and changing the return value of Close, seems like a Go 1 compatibility promise change at this point. Btw, a more convincing play.golang.org snippet might include checking return values, such as: https://play.golang.org/p/dI-lca_rjR ... because notably, the zlib.Reader docs explicitly say it can over-read from the underlying reader, so what you should be testing isn't the number of bytes remaining, but what your desired return values are.
I don't think having Close do a 4-byte read is a valid solution. The zlib format does not encode the uncompressed size, so there is no way for Close to know that the next 4 bytes are the checksum. This happens to work for you since you know that information, but the Reader implementation does not know that. It only knows to read the checksum when it receives an EOF from the underlying flate.Reader. However, the current implementation of flate.Read returns (n, nil) and only reports io.EOF on a subsequent read. Thus, I agree with @bradfitz that the best solution is to just document this situation. On a side note, the documentation saying "The implementation buffers input and may read more data than necessary from r." should probably be updated, since it is not always true: if the reader passed in satisfies the io.ByteReader interface, it will not read more data than necessary.
If altering Close is not acceptable, could flate.Read be changed to return io.EOF on the "last" read rather than on the first read past the end? This should be possible, since the last block in a DEFLATE stream is explicitly marked as such. As an aside, what is the purpose of the Close method, actually? Looking at the implementation in both compress/zlib and compress/flate, it seems merely to return the stored error (if any), but said error will have already been returned by the call that generated it, which a well-behaved client should've also checked.
You are correct that the last block in DEFLATE is marked as such, but as I noted above, the current implementation of compress/flate does not make that easily known. We could alter flate to report EOF more aggressively, but that would be a more invasive change to flate. It's certainly doable, but it's also fragile, because it means that zlib depends on stricter behavior than what io.Reader promises.

I'm wondering: you seem to be reading a series of zlib-encoded streams back-to-back, and you know the uncompressed sizes. Can you do something similar to the following?

```go
var rd io.Reader                 // Let's say your data stream is in here
chunks := []int64{13, 1323, 251} // Uncompressed size of each chunk

// Wrap your reader with bufio.Reader to ensure that it satisfies io.ByteReader.
rd = bufio.NewReader(rd)
for _, nchk := range chunks {
	zr, err := zlib.NewReader(rd)
	if err != nil {
		panic(err)
	}
	var buf bytes.Buffer
	n, err := io.CopyN(&buf, zr, nchk+1) // Read one byte past EOF to fully consume the stream
	switch {
	case err == nil || n > nchk:
		panic("ZLIB stream is longer than expected")
	case err != io.EOF:
		panic(err)
	case n < nchk:
		panic("ZLIB stream is shorter than expected")
	}
	if err := zr.Close(); err != nil {
		panic(err)
	}
}
```

P.S. io.CopyN allocates a buffer on each use. You could use a combination of io.LimitedReader and io.CopyBuffer to avoid that allocation if performance matters.
CL https://golang.org/cl/20881 mentions this issue.
@dsnet, btw, I've made a number of changes over the years to return (n, io.EOF) instead of (n, nil) + (0, io.EOF) more aggressively. Either is valid under the io.Reader contract, but there are a number of performance reasons for returning io.EOF early. I'd be in favor of doing so in compress/flate if it doesn't already.
@bradfitz, I'd be happy to make that change in the inflate code :) What are your thoughts on my concern about having zlib depend on flate.Read always returning io.EOF eagerly? Making the change for performance reasons is one thing; relying on that change for the correctness of some feature is another. It seems brittle to me...
Define "dependent"? If we want the final zlib.Reader.Read (the (non-zero, io.EOF) read) to return an error if the 4-byte checksum that follows is bogus, that doesn't seem fragile: it's a dependency in the same repo, and we write tests.
SGTM, I'll submit CLs in the near future.
@dsnet: Yes, reading back-to-back zlib streams with known uncompressed sizes is my use case. Getting it done isn't a problem - I've just been using ioutil.ReadAll to date - but I raised this issue to see if, sometime in the future, I could maybe get it done with io.ReadFull and some appropriately pre-allocated slices.
CL https://golang.org/cl/20884 mentions this issue.
I thought about this over the weekend and I don't think compress/flate should be changed, since this changes some subtle behavior with regard to how flushing works. By convention, a DEFLATE reader stops reading when it encounters an [empty, raw block] as a signal that all data read so far should be flushed to the user. Consider a DEFLATE stream shaped like:

	[block with 10 bytes of data] [empty, raw block] ... [empty, raw, final block]

In this situation, suppose you read exactly 10 bytes (knowing that was the length of the stream). In order for the DEFLATE reader to know that EOF was encountered, it must read through all empty blocks until it either 1) hits another block with data, or 2) hits an empty block with the final bit set. However, this might break some network protocols that have subtle dependencies on flushing (for example, that the read offset of the underlying reader is left immediately after the distinctive empty, raw block).

Let's just document this situation.
CL https://golang.org/cl/21218 mentions this issue.
CL https://golang.org/cl/21291 mentions this issue.
Flip around the composition order of the http.Response.Body's gzip.Reader vs. the reader which keeps track of waiting to see the end of the HTTP/1 response framing (whether that's a Content-Length or HTTP/1.1 chunking).

Previously:

	user -> http.Response.Body
	     -> bodyEOFSignal
	     -> gzipReader
	     -> gzip.Reader
	     -> bufio.Reader
	  [ -> http/1.1 de-chunking reader ] optional
	     -> http1 framing *body

But because bodyEOFSignal was waiting to see an EOF from the underlying gzip.Reader before reusing the connection, and gzip.Reader (or more specifically: the flate.Reader) wasn't returning an early io.EOF with the final chunk, the bodyEOFSignal was never releasing the connection, because the EOF from the http1 framing was read by a party who didn't care about it yet: the helper bufio.Reader created to do byte-at-a-time reading in the flate.Reader.

Flip the read composition around to:

	user -> http.Response.Body
	     -> gzipReader
	     -> gzip.Reader
	     -> bufio.Reader
	     -> bodyEOFSignal
	  [ -> http/1.1 de-chunking reader ] optional
	     -> http1 framing *body

Now when gzip.Reader does its byte-at-a-time reading via the bufio.Reader, the bufio.Reader will do its big reads against the bodyEOFSignal reader instead, which will then see the underlying http1 framing EOF, and be able to reuse the connection.

Updates google/go-github#317
Updates #14867
And related abandoned fix to flate.Reader: https://golang.org/cl/21290

Change-Id: I3729dfdffe832ad943b84f4734b0f59b0e834749
Reviewed-on: https://go-review.googlesource.com/21291
Reviewed-by: David Symonds <dsymonds@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/21302 mentions this issue.
Rather than checking the block final bit on the next invocation of nextBlock, we check it at the termination of the current block. This ensures that we return (n, io.EOF) instead of (0, io.EOF) more frequently for most streams.

However, there are certain situations where an eager io.EOF is not done:
1) We previously returned from Read because the write buffer of the internal dictionary was full, and it just so happens that there is no more data remaining in the stream.
2) There exists a [non-final, empty, raw block] after all blocks that actually contain uncompressed data. We cannot return io.EOF eagerly here since it would break flushing semantics.

Both situations happen infrequently, but it is still important to note that this change does *not* guarantee that flate will *always* return (n, io.EOF). Furthermore, this CL makes no changes to the pattern of ReadByte calls to the underlying io.ByteReader.

Below is the motivation for this change, pulling the text from @bradfitz's CL/21290:

net/http and other things work better when io.Reader implementations return (n, io.EOF) at the end, instead of (n, nil) followed by (0, io.EOF). Both are legal, but the standard library has been moving towards n+io.EOF.

An investigation of net/http connection re-use in google/go-github#317 revealed that with gzip compression + http/1.1 chunking, the net/http package was not automatically reusing the underlying TCP connections when the final EOF bytes were already read off the wire. The net/http package only reuses the connection if the underlying Readers (many of them nested in this case) all eagerly return io.EOF.

Previous related CLs:
https://golang.org/cl/76400046 - tls.Reader
https://golang.org/cl/58240043 - http chunked reader

In addition to net/http, this behavior also helps things like ioutil.ReadAll (see comments about performance improvements in https://codereview.appspot.com/49570044)

Updates #14867
Updates google/go-github#317

Change-Id: I637c45552efb561d34b13ed918b73c660f668378
Reviewed-on: https://go-review.googlesource.com/21302
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
When the length of the compressed data is known a priori, and one reads precisely that many bytes from the ReadCloser returned by zlib.NewReader, the zlib checksum isn't read. One needs to attempt to read past the end of the compressed stream to get zlib to read it. I think this is counterintuitive, and it makes it so that one pretty much has to use ioutil.ReadAll to safely read zlibbed segments embedded in a larger stream.
Obligatory play.golang.org link: https://play.golang.org/p/KvqnDU36Q2
A good solution, I think, would be to arrange to have Close read and check the checksum if it hasn't been done yet, as it should be called when one is done reading anyway.
Tested on go1.5 linux/amd64.