Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/zlib: reading data with known length #14867

Closed
ghost opened this issue Mar 18, 2016 · 15 comments
Closed

compress/zlib: reading data with known length #14867

ghost opened this issue Mar 18, 2016 · 15 comments

Comments

@ghost
Copy link

ghost commented Mar 18, 2016

When the length of the compressed data is known a priori, and one reads precisely as many bytes from the ReadCloser returned by zlib.NewReader, the zlib checksum isn't read. One needs to attempt to read past the end of the compressed stream to get zlib to read it, which I think this is counterintuitive, and makes it so that one pretty much has to use ioutil.ReadAll to safely read zlibbed segments embedded in a larger stream.

Obligatory play.golang.org link: https://play.golang.org/p/KvqnDU36Q2

A good solution, I think, would be to arrange to have Close read and check the checksum if it hasn't been done yet, as it should be called when one is done reading anyway.

Tested on go1.5 linux/amd64.

@bradfitz
Copy link
Contributor

I think the safest thing we can do here is document this. Changing the behavior of Close to start doing reads and changing the return value from Close seems like an Go 1 compatibility promise change at this point.

Btw, a more convincing play.golang.org snippet might include checking return values, such as: https://play.golang.org/p/dI-lca_rjR ... because notably, the zlib.Reader docs explicitly says it can over-read from the underlying reader, so what you should be testing isn't the number of bytes remaining, but what your desired return values are.

@dsnet
Copy link
Member

dsnet commented Mar 18, 2016

I don't think having Close do a 4-byte read is a valid solution. The zlib format does not encode the uncompressed size, so there is no way for Close to know that the next 4 bytes is the checksum. This happens to work for you since you know that information, but the Reader implementation does not know that. It only knows to read the checksum when it receives an EOF from the underlying flate.Reader. However, the current implementation of flate.Read always returns (n, nil) if n is positive. Thus, zlib.Reader cannot get an EOF unless an attempt to read past the length of the stream is made.

Thus, I agree with @bradfitz that the best solution is to just document this situation.

On a side note, the documentation about "The implementation buffers input and may read more data than necessary from r." should probably be updated since it is not always true. If the reader passed in satisfies the io.ByteReader interface, it will not read more data than necessary.

@ghost
Copy link
Author

ghost commented Mar 18, 2016

If altering Close is not acceptable, could flate.Read be changed to return io.EOF on the "last" read and not on the first read past the end? This should be possible, since the last block in a DEFLATE stream is explicitly marked as such.

As an aside, what is the purpose of the Close method, actually? Looking at the implementation in both compress/zlib and compress/flate, it seems to merely return the stored error (if any), but said error will have already been returned by the call that generated it, which a well-behaved client should've also checked.

@dsnet
Copy link
Member

dsnet commented Mar 18, 2016

You are correct that the last block in DEFLATE is marked as such, but as I noted above, the current implementation of compress/flate does not make that easily known. We could alter flate to report EOF more agressively, but that would be a more invasive change to flate. It's certainly do-able, but it's also very fragile, because it means that zlib is dependent on more strict behavior than what io.Reader promises.

I'm wondering, you seem to be attempting to read a series of zlib encoded stream back-to-back and you know the uncompressed size. Can you do something similar to the following?

var rd io.Reader // Let's say your data stream is in here
chunks := []int64{13, 1323, 251} // Uncompressed size of each chunk

// Wrap your reader with bufio.Reader to ensure that it satisfies io.ByteReader.
rd = bufio.NewReader(rd)

for _, nchk := range chunks {
    zr, err := zlib.NewReader(rd)
    if err != nil {
        panic(err)
    }

    var buf bytes.Buffer
    n, err := io.CopyN(&buf, zr, nchk+1) // Read one byte past EOF to fully consume stream
    switch {
    case err == nil || n > nchk:
        panic("ZLIB stream is longer than expected")
    case err != io.EOF:
        panic(err)
    case n < nchk:
        panic("ZLIB stream is shorter than expected")
    }

    if err := zr.Close(); err != nil {
        panic(err)
    }
}

P.S. io.CopyN does allocate a buffer on each use. You could use a combination io.LimitedReader and io.CopyBuffer to avoid that allocation if performance matters.

@gopherbot
Copy link

CL https://golang.org/cl/20881 mentions this issue.

@bradfitz
Copy link
Contributor

@dsnet, btw, I've made a number of changes over the years to return (n, io.EOF) instead of (n, nil) + (0, io.EOF) more aggressively. Either is valid under the io.Reader contract, but there are a number of performance reasons for returning io.EOF early. I'd be in favor of doing so in compress/flate if it doesn't already.

@dsnet
Copy link
Member

dsnet commented Mar 18, 2016

@bradfitz, I'd be happy to make that change in the inflate code :)

What are your thoughts on my concern about having zlib be dependent on flate.Read always returning io.EOF aggressively? Making the change for performance changes is one thing, relying on that change for correctness of some feature is another. It seems brittle to me...

@bradfitz
Copy link
Contributor

Define "dependent"? If we want the final zlib.Reader.Read (the (non-zero, io.EOF) read) to return an error if the 4 byte checksum following is bogus, that doesn't seem fragile: it's a dependency is the same repo, and we write tests.

@dsnet
Copy link
Member

dsnet commented Mar 18, 2016

SGTM, I'll submit CLs in the near future.

@ghost
Copy link
Author

ghost commented Mar 18, 2016

@dsnet: Yes, reading back-to-back zlib streams with known uncompressed sizes is my use case. Getting it done isn't a problem - I've just been using ioutil.ReadAll to date - but I raised this issue to see if, sometime in the future, I could maybe get it done with io.ReadFull and some appropriately pre-allocated slices.

@gopherbot
Copy link

CL https://golang.org/cl/20884 mentions this issue.

@dsnet
Copy link
Member

dsnet commented Mar 28, 2016

I thought about this over the weekend and I don't think compress/flate should be changed since this changes some subtle behavior with regard to how flushing works. By convention, a DEFLATE reader stops reading when it encounters an [empty, raw block] as a signal that all data read so far should be flushed to the user.

Consider the following DEFLATE stream:

[non-empty N=10, raw block] [empty, raw block] [final, empty, fixed block]
                                              ^
                              Read must stop here, by flushing semantics;
                              But can't report EOF, since that's in the next block

In this situation, suppose you read exactly 10 bytes (knowing that was the length of the stream), in order for the DEFLATE to know that EOF was encountered, it must read through all empty blocks until it either 1) hits another block with data, or 2) hits an empty block with the final bit set. However, this might break some network protocols that have subtle dependencies on flushing (for example, that the read offset of the underlying reader was left immediately after the distinctive 0x0000ffff marker of empty raw blocks).

Let's just document this situation.

@gopherbot
Copy link

CL https://golang.org/cl/21218 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/21291 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 30, 2016
Flip around the composition order of the http.Response.Body's
gzip.Reader vs. the reader which keeps track of waiting to see the end
of the HTTP/1 response framing (whether that's a Content-Length or
HTTP/1.1 chunking).

Previously:

user -> http.Response.Body
     -> bodyEOFSignal
     -> gzipReader
     -> gzip.Reader
     -> bufio.Reader
   [ -> http/1.1 de-chunking reader ]   optional
     -> http1 framing *body

But because bodyEOFSignal was waiting to see an EOF from the
underlying gzip.Reader before reusing the connection, and gzip.Reader
(or more specifically: the flate.Reader) wasn't returning an early
io.EOF with the final chunk, the bodyEOfSignal was never releasing the
connection, because the EOF from the http1 framing was read by a party
who didn't care about it yet: the helper bufio.Reader created to do
byte-at-a-time reading in the flate.Reader.

Flip the read composition around to:

user -> http.Response.Body
     -> gzipReader
     -> gzip.Reader
     -> bufio.Reader
     -> bodyEOFSignal
   [ -> http/1.1 de-chunking reader ]   optional
     -> http1 framing *body

Now when gzip.Reader does its byte-at-a-time reading via the
bufio.Reader, the bufio.Reader will do its big reads against the
bodyEOFSignal reader instead, which will then see the underlying http1
framing EOF, and be able to reuse the connection.

Updates google/go-github#317
Updates #14867
And related abandoned fix to flate.Reader: https://golang.org/cl/21290

Change-Id: I3729dfdffe832ad943b84f4734b0f59b0e834749
Reviewed-on: https://go-review.googlesource.com/21291
Reviewed-by: David Symonds <dsymonds@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link

CL https://golang.org/cl/21302 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 31, 2016
Rather than checking the block final bit on the next invocation
of nextBlock, we check it at the termination of the current block.
This ensures that we return (n, io.EOF) instead of (0, io.EOF)
more frequently for most streams.

However, there are certain situations where an eager io.EOF is not done:
1) We previously returned from Read because the write buffer of the internal
dictionary was full, and it just so happens that there is no more data
remaining in the stream.
2) There exists a [non-final, empty, raw block] after all blocks that
actually contain uncompressed data. We cannot return io.EOF eagerly here
since it would break flushing semantics.

Both situations happen infrequently, but it is still important to note that
this change does *not* guarantee that flate will *always* return (n, io.EOF).

Furthermore, this CL makes no changes to the pattern of ReadByte calls
to the underlying io.ByteReader.

Below is the motivation for this change, pulling the text from
@bradfitz's CL/21290:

net/http and other things work better when io.Reader implementations
return (n, io.EOF) at the end, instead of (n, nil) followed by (0,
io.EOF). Both are legal, but the standard library has been moving
towards n+io.EOF.

An investigation of net/http connection re-use in
google/go-github#317 revealed that with gzip
compression + http/1.1 chunking, the net/http package was not
automatically reusing the underlying TCP connections when the final
EOF bytes were already read off the wire. The net/http package only
reuses the connection if the underlying Readers (many of them nested
in this case) all eagerly return io.EOF.

Previous related CLs:
    https://golang.org/cl/76400046 - tls.Reader
    https://golang.org/cl/58240043 - http chunked reader

In addition to net/http, this behavior also helps things like
ioutil.ReadAll (see comments about performance improvements in
https://codereview.appspot.com/49570044)

Updates #14867
Updates google/go-github#317

Change-Id: I637c45552efb561d34b13ed918b73c660f668378
Reviewed-on: https://go-review.googlesource.com/21302
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@bradfitz bradfitz added this to the Go1.7 milestone Apr 9, 2016
@golang golang locked and limited conversation to collaborators May 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants