
compress/flate: can't reuse inflate state for split data packets #48877

Open
gudvinr opened this issue Oct 9, 2021 · 7 comments
Labels: FeatureRequest, NeedsInvestigation
Milestone: Unplanned

Comments


gudvinr commented Oct 9, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="~/.cache/go-build"
GOENV="~/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="~/tools/go/path/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="~/tools/go/path"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1090573846=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The WebSocket permessage-deflate compression extension (RFC 7692) can reuse the LZ77 sliding window for future messages, which requires maintaining inflate state so that incoming messages can be fed into it.

Discord also uses zlib streaming compression in a similar manner.

In both cases the sender uses a single context for all consecutive messages, each of which ends with a Z_SYNC_FLUSH marker (0x00 0x00 0xFF 0xFF).

You would expect to be able to decompress these messages without resetting the inflate state, but that isn't possible.

Even if you try to maintain the inflate state, it still doesn't seem possible to decompress separate messages within the same context.
The io.Reader returned from flate.NewReader doesn't expose any information about the stream state.
Read calls do not indicate when flate encounters a Z_SYNC_FLUSH.

Neither inflate nor the WebSocket protocol carries the size of the decompressed payload (and neither needs it to work correctly), so you can't allocate a correctly-sized buffer for reading without crossing message boundaries.

If you call Read on the io.Reader from flate.NewReader, you'll get io.ErrUnexpectedEOF as soon as the first compressed message ends. You also can't feed more data into that reader afterwards, even if you ignore the io.ErrUnexpectedEOF (which is good). This is easily reproduced with io.ReadAll.
It is understandable why it works like that, and it is not an issue by itself.

var fr io.ReadCloser

// receiveCompressedMessageFromWebSocket stands in for the caller's framing.
data := receiveCompressedMessageFromWebSocket()
buf := bytes.NewBuffer(data)

fr = flate.NewReader(buf)
msg, err := io.ReadAll(fr)
// err will be io.ErrUnexpectedEOF
@seankhliao added the FeatureRequest and NeedsInvestigation labels on Oct 9, 2021
seankhliao (Member) commented

cc @dsnet

dsnet (Member) commented Oct 9, 2021

> The WebSocket permessage-deflate compression extension (RFC 7692) can reuse the LZ77 sliding window for future messages, which requires maintaining inflate state so that incoming messages can be fed into it.

Have you tried using NewReaderDict and NewWriterDict? If so, why are these insufficient? The "dictionary" is the LZ77 sliding window prior to the current decompression offset.
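For illustration, here is a minimal runnable sketch of that dictionary round trip with the existing API; the payload and dictionary contents are made up. Note that Writer.Flush is documented as the equivalent of zlib's Z_SYNC_FLUSH, so this reproduces the flushed-but-unfinished streams described above:

package main

import (
	"bytes"
	"compress/flate"
	"io"
	"log"
)

func main() {
	// Hypothetical shared window: data both peers have already seen.
	dict := []byte("shared history that both peers have already seen")
	var compressed bytes.Buffer

	fw, err := flate.NewWriterDict(&compressed, flate.BestSpeed, dict)
	if err != nil {
		log.Fatal(err)
	}
	fw.Write([]byte("a payload that may back-reference the dict"))
	fw.Flush() // equivalent to zlib's Z_SYNC_FLUSH, like the messages in question

	// The reader must be seeded with the same dictionary to resolve
	// back references into data that never appeared on the wire.
	fr := flate.NewReaderDict(&compressed, dict)
	out, err := io.ReadAll(fr)
	// err is io.ErrUnexpectedEOF: the stream was flushed, never finished.
	log.Printf("%q %v", out, err)
}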

gudvinr (Author) commented Oct 9, 2021

How could you extract said dictionary so you can feed it to NewReaderDict?

flate.Writer also doesn't have methods for exporting its dictionary to reuse for consecutive writes.

I am by no means a specialist in compression algorithms.
From what I know, inflate can back-reference occurrences of the same data within the sliding window, which is at most 32 KiB IIRC.
So I suppose you should maintain a circular buffer yourself, overwriting its contents with the most recently read data. Is that correct?

dsnet (Member) commented Oct 9, 2021

> So I suppose you should maintain a circular buffer yourself, overwriting its contents with the most recently read data. Is that correct?

Correct. The dictionary is literally the last 32 KiB of the most recently compressed input or decompressed output. At the very least, you could use an io.MultiWriter to siphon off the last parts of the buffer.
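A minimal sketch of that siphon: a writer that keeps only the most recent 32 KiB passed through it, which can sit behind an io.MultiWriter on the compression side or an io.TeeReader on the decompression side (the type and method names are made up for illustration):

const windowSize = 32 << 10 // maximum flate window

// slidingWindow keeps the most recent windowSize bytes written to it.
// A production version would likely use a true circular buffer instead
// of re-slicing.
type slidingWindow struct {
	buf []byte
}

func (w *slidingWindow) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	if n := len(w.buf) - windowSize; n > 0 {
		w.buf = w.buf[n:] // drop bytes that fell out of the window
	}
	return len(p), nil
}

// Dict returns the current window for flate.NewWriterDict / flate.NewReaderDict.
func (w *slidingWindow) Dict() []byte { return w.buf }

On the decompression side, wrapping the flate reader with io.TeeReader(fr, &window) keeps the window in sync with the decompressed output.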

We could add methods to directly retrieve the buffer with something like:

// ExportDictionary copies the internal dictionary into b
// and reports the number of bytes copied.
// A buffer of at least 32 KiB should be provided.
func (*Writer) ExportDictionary(b []byte) int

It's highly unfortunate that NewReader and NewWriter are asymmetrical. NewReader returns an io.ReadCloser, so it would be difficult to add new methods to the underlying decompressor type.
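As a point of reference, the package already works around this once, via an interface upgrade: the ReadCloser returned by NewReader and NewReaderDict is documented to also implement flate.Resetter, reachable with a type assertion. A sketch (the helper name and error message are made up):

import (
	"compress/flate"
	"errors"
	"io"
)

// resetReader reuses an existing flate reader for the next chunk of
// input. dict is the caller-maintained 32 KiB window (nil for none).
func resetReader(fr io.ReadCloser, next io.Reader, dict []byte) error {
	rs, ok := fr.(flate.Resetter)
	if !ok {
		return errors.New("flate reader does not implement flate.Resetter")
	}
	// Reset discards the reader's current state, so the window must be
	// supplied explicitly; it is not carried over automatically.
	return rs.Reset(next, dict)
}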

gudvinr (Author) commented Oct 9, 2021

OK, the dictionary itself isn't an issue then, thanks. Adding a new method to the stdlib isn't necessary in that case, at least not right now.

But there is still an issue with the reader (the writer is probably fine) because it's impossible to know whether it has finished reading all the decompressed data from the provided chunk. I don't think it's a good idea to rely on io.ErrUnexpectedEOF and use Resetter:
there's no way to tell whether io.ErrUnexpectedEOF is a genuinely unexpected EOF caused by e.g. corrupted data, or simply the end of the current chunk.

Edit: I tried using a fixed buffer for the incoming stream and passing that to NewReaderDict, and it seems to work as expected (see the sketch below).
However, there should be a way to reliably detect when the reader has consumed all of its input.
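A sketch of that workaround, assuming the transport frames messages for us (nextMessage and handleMessage are hypothetical); note that it still treats io.ErrUnexpectedEOF as the end-of-chunk signal, which is exactly the ambiguity described above:

import (
	"bytes"
	"compress/flate"
	"io"
)

// decompressStream decompresses a sequence of Z_SYNC_FLUSH-terminated
// chunks, carrying the last 32 KiB of output into the next chunk.
func decompressStream(nextMessage func() ([]byte, bool)) error {
	const maxWindow = 32 << 10
	var dict []byte // shared LZ77 window, empty before the first message
	for {
		msg, ok := nextMessage()
		if !ok {
			return nil
		}
		fr := flate.NewReaderDict(bytes.NewReader(msg), dict)
		out, err := io.ReadAll(fr)
		fr.Close()
		if err != nil && err != io.ErrUnexpectedEOF {
			return err // real corruption, not just a chunk boundary
		}
		dict = append(dict, out...)
		if n := len(dict) - maxWindow; n > 0 {
			dict = dict[n:]
		}
		handleMessage(out) // hypothetical consumer
	}
}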

How about some kind of NewFlushReaderWithDict(r io.Reader, dict []byte, flush int)? The flush flag would indicate whether we expect to read up until Z_SYNC_FLUSH or until Z_FINISH (as it is now), similar to what inflate in zlib does.
This reader could return io.EOF when it has consumed all of the reader's content and encountered a Z_SYNC_FLUSH at the end of a data block.
The old NewReaderDict could then just call NewFlushReaderWithDict with a Z_FINISH argument.

gudvinr (Author) commented Nov 6, 2021

@dsnet, can you please take another look and give some feedback?
I don't think shipping production code that uses io.ErrUnexpectedEOF as a way to detect an expected EOF is a good idea, so I would still appreciate the ability to reliably detect boundaries inside zlib streams.

edaniels (Contributor) commented Jul 24, 2022

In addition to the problem of relying on io.ErrUnexpectedEOF, it doesn't seem like you can "just" set the window to the last 32 KiB read. It seems the dictionary must also take care not to append data into the circular buffer that was decoded via dictionary back references. I could be very wrong about this assumption, but when I tried to create my own reader that resets the underlying one each time a Z_SYNC_FLUSH happens, the dictionary eventually became corrupted once the data (I think) started making back references. For now I'm using https://github.com/DataDog/czlib, but ideally other users wouldn't hit this unexpected behavior.

@seankhliao added this to the Unplanned milestone on Aug 20, 2022