x/net/http2: server sends RST_STREAM w/ PROTOCOL_ERROR to clients it incorrectly believes have violated max advertised num streams #42777

Closed
jared2501 opened this issue Nov 23, 2020 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

@jared2501

What version of Go are you using (go version)?

$ go version
go version go1.15.0 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What's the issue?

I'm seeing an issue with a grpc-go client talking to an http2.Server{}. I've documented the issue in grpc/grpc-go#4057. To summarize the issue:

  • I have a grpc-go client that is making requests to an http2.Server{} on the same machine. The client generates a lot of load on the server. The server is configured to allow a maximum of 320 concurrent streams on a single connection (a minimal config sketch follows this list).
  • I'm seeing sporadic, but relatively regular, RST_STREAM w/ PROTOCOL_ERROR frames from the server.
  • The client usually has upwards of 500 simultaneous requests open at a time, which means 320 concurrent streams plus roughly 200 requests waiting for a stream slot to free up. The machine that hosts the server is under heavy load.
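
For reference, a minimal sketch of a server advertising the same 320-stream limit via golang.org/x/net/http2 (my real setup serves gRPC, and the cert/key paths here are placeholders):

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{
		Addr: ":8443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
	}
	// Advertise SETTINGS_MAX_CONCURRENT_STREAMS = 320 to every client.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 320}); err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}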

I put some println statements in the client and server code and traced the error the server returns to this line. This code checks sc.curClientStreams and rejects the new stream if it believes the client is violating the advertised maximum number of concurrent streams. I added further logging to the server and the client to record the number of active streams, and found that this number never creeps above 320 on the client or the server - the grpc-go client believes it is being well behaved, which indicated to me that something is fishy in the http2.Server code.
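
For reference, the check at that line looks roughly like this (paraphrased and simplified; the real code also special-cases unacknowledged SETTINGS):

if sc.curClientStreams+1 > sc.advMaxStreams {
	// The server believes the client exceeded the advertised
	// SETTINGS_MAX_CONCURRENT_STREAMS, so the new stream is reset
	// with a PROTOCOL_ERROR.
	return streamError(id, ErrCodeProtocol)
}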

The only place in the http2.Server code that decrements sc.curClientStreams is here in sc.closeStream, which is called from sc.wroteFrame, and sc.wroteFrame is only ever called in the sc.serve loop. Writes to the client can be triggered asynchronously by sc.writeFrameAsync (i.e. see here); once the write is complete, the result is sent to sc.wroteFrameCh, which is read by the sc.serve loop (i.e. see here). The sc.serve loop contains a big select statement that reads from sc.wroteFrameCh, but also from sc.readFrameCh, which receives new frames from clients. Since these channels are read from the same select statement, if both contain a message there is no guarantee which one will be read first.
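
To convince myself of that last point, here is a self-contained demo (plain Go, no http2 involved) showing that select chooses pseudo-randomly when both channels are ready:

package main

import "fmt"

func main() {
	firstPick := map[string]int{}
	for i := 0; i < 10000; i++ {
		wroteFrameCh := make(chan struct{}, 1)
		readFrameCh := make(chan struct{}, 1)
		wroteFrameCh <- struct{}{}
		readFrameCh <- struct{}{}
		// Both cases are ready; the spec says one is chosen via a
		// uniform pseudo-random selection.
		select {
		case <-wroteFrameCh:
			firstPick["wroteFrameCh"]++
		case <-readFrameCh:
			firstPick["readFrameCh"]++
		}
	}
	fmt.Println(firstPick) // roughly a 50/50 split
}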

I believe what is happening to cause the spurious PROTOCOL_ERRORs is:

  • The server is maxed out at 320 concurrent streams.
  • One of the streams on the server finishes. This causes sc.writeFrameAsync to be spawned in a separate goroutine to write a frame f1 that ends the stream.
  • sc.writeFrameAsync writes the frame f1 to the TCP connection and sends a message to sc.wroteFrameCh.
  • The client reads frame f1 immediately, sees that the stream has completed, and therefore starts a new stream to the server by sending a frame f2.
  • The sc.readFrames loop reads frame f2 and pushes it onto the sc.readFrameCh channel.
  • Now the sc.serve loop goes through another iteration, and since both sc.readFrameCh and sc.wroteFrameCh contain messages, it's undefined which one it processes first. In this case, it processes the message on sc.readFrameCh first, which contains frame f2.
  • Since the message in sc.wroteFrameCh has not been processed yet, sc.curClientStreams has not been decremented, and thus the server believes it is still 320. When sc.processFrameFromReader is called, it returns a PROTOCOL_ERROR.

To attempt to fix this, I patched the http2.Server code:

diff --git a/vendor/golang.org/x/net/http2/server.go b/vendor/golang.org/x/net/http2/server.go
index 686501b4..a79fed98 100644
--- a/vendor/golang.org/x/net/http2/server.go
+++ b/vendor/golang.org/x/net/http2/server.go
@@ -868,6 +868,13 @@ func (sc *serverConn) serve() {
                case res := <-sc.wroteFrameCh:
                        sc.wroteFrame(res)
                case res := <-sc.readFrameCh:
+                       // Process any completed async writes before reading new frames from the
+                       // client, since a completed write may have freed a stream slot and
+                       // allowed the client to start a new stream.
+                       select {
+                       case wroteRes := <-sc.wroteFrameCh:
+                               sc.wroteFrame(wroteRes)
+                       default:
+                       }
                        if !sc.processFrameFromReader(res) {
                                return
                        }

When I ran this on the server, it significantly reduced the occurrences of the PROTOCOL_ERRORs: I was seeing about 20 every 3 minutes before the patch, and roughly 10 over the course of 2 hours after.

I believe this patch solves some of the problem, but not all of it, as the error still persists. Also, I may be totally wrong in my diagnosis above. Does the above reasoning make sense as a possible failure mode?

I'm going to try to get a minimal repro working, but wanted to create this issue first to get folks' thoughts.

@networkimprov

cc @fraenkel

@fraenkel
Contributor

I agree with your explanation; however, it is not guaranteed that the above will fix most of the racing that occurs. The problem is that between writeFrame and sending the frameWriteResult, one can receive the readFrame.
We can mitigate the select by only checking when there is an async write occurring; otherwise checking wroteFrameCh is just a waste.
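
A rough sketch of that idea as a diff against sc.serve (writingFrameAsync is the existing flag that tracks a write running on its own goroutine, if I'm reading the code right; this is the shape of the mitigation, not a tested change):

--- a/vendor/golang.org/x/net/http2/server.go
+++ b/vendor/golang.org/x/net/http2/server.go
@@ func (sc *serverConn) serve() {
 		case res := <-sc.readFrameCh:
+			if sc.writingFrameAsync {
+				// Complete the outstanding async write first so that
+				// curClientStreams is already decremented before this
+				// frame is checked against the advertised max streams.
+				sc.wroteFrame(<-sc.wroteFrameCh)
+			}
 			if !sc.processFrameFromReader(res) {
 				return
 			}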

BTW, this is the reverse of what the client does between the bodyWriter and the readLoop.

A testcase would go a long way.

@jared2501
Author

@fraenkel yeah, that makes sense - I was thinking along those lines too. The weird thing was that when I made a hacky fix that used a mutex to lock, I still saw some PROTOCOL_ERRORs... But there's also a chance my hacks were wrong, since I'm not as familiar with this part of the code.

Did you still need a repro from me, or do you think you have a sufficient understanding of the failure mode here? Happy to try any fixes on our PROD system to see if they resolve things too (and I can do a bit of debugging to see what else could be up).

@fraenkel
Contributor

fraenkel commented Dec 2, 2020

I understand the failure. The trick is coming up with a test case which may not be so simple.

@cagedmantis cagedmantis added this to the Backlog milestone Dec 2, 2020
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 2, 2020
@jared2501
Author

Quick ping on this - is this issue on the cards any time soon? We're still seeing it in production.

@gopherbot

Change https://golang.org/cl/330909 mentions this issue: http2: complete outstanding async write before reading the next frame

@jared2501
Author

@neild quick ping on this! The CL looks like it's good to go - any chance of a merge?

BTW, a colleague of mine also found a separate bug in our PROD env: #47076. Would be super appreciative of your thoughts on it. Thanks!

gopherbot pushed a commit to golang/net that referenced this issue Aug 13, 2021
http2: complete outstanding async write before reading the next frame

For golang/go#42777

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
@neild
Contributor

neild commented Aug 13, 2021

@gopherbot please open backport issues for 1.16 and 1.17.

@gopherbot

Backport issue(s) opened: #47691 (for 1.16), #47692 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

bradfitz pushed a commit to tailscale/net that referenced this issue Aug 13, 2021
http2: complete outstanding async write before reading the next frame

For golang/go#42777

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit 60bc85c)
@cherrymui
Member

Hi!
Does CL https://golang.org/cl/330909 fix this issue, or is there more work that needs to be done? If it fixes the issue, we can close this, and you can go ahead and prepare the CLs for the 1.16 and 1.17 backports. Thanks!

@neild
Contributor

neild commented Sep 1, 2021

Doing backports now. Thanks for the reminder!

@neild neild closed this as completed Sep 1, 2021
@gopherbot

Change https://golang.org/cl/346889 mentions this issue: [internal-branch.go1.16-vendor] http2: complete outstanding async write before reading the next frame

@gopherbot

Change https://golang.org/cl/346890 mentions this issue: [internal-branch.go1.17-vendor] http2: complete outstanding async write before reading the next frame

gopherbot pushed a commit to golang/net that referenced this issue Sep 1, 2021
[internal-branch.go1.16-vendor] http2: complete outstanding async write before reading the next frame

Updates golang/go#42777
Fixes golang/go#47691

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit 60bc85c)
Reviewed-on: https://go-review.googlesource.com/c/net/+/346889
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit to golang/net that referenced this issue Sep 1, 2021
[internal-branch.go1.17-vendor] http2: complete outstanding async write before reading the next frame

Updates golang/go#42777
Fixes golang/go#47692

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit 60bc85c)
Reviewed-on: https://go-review.googlesource.com/c/net/+/346890
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@gopherbot

Change https://golang.org/cl/347010 mentions this issue: all: update golang.org/x/net to pull in CL 346890

gopherbot pushed a commit that referenced this issue Sep 1, 2021
For #42777.

Change-Id: I963db8c666e8bcf0fc4f390b359db6408a0f792b
Reviewed-on: https://go-review.googlesource.com/c/go/+/347010
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@gopherbot

Change https://golang.org/cl/347033 mentions this issue: http2: make Transport not reuse conns after a stream protocol error

gopherbot pushed a commit to golang/net that referenced this issue Sep 3, 2021
http2: make Transport not reuse conns after a stream protocol error

If a server sends a stream error of type "protocol error" to a client,
that's the server saying "you're speaking http2 wrong". At that point,
regardless of whether we're in the right or not (that is, regardless of
whether the Transport is bug-free), clearly there's some confusion and
one of the two parties is either wrong or confused. There's no point
pushing on and trying to use the connection and potentially exacerbating
the confusion (as we saw in golang/go#47635).

Instead, make the client "poison" the connection by setting a new "do
not reuse" bit on it. Existing streams can finish up but new requests
won't pick that connection.

Also, mark those requests as retryable without the caller getting an error.

Given that golang/go#42777 existed, there are HTTP/2 servers in the
wild that incorrectly set RST_STREAM PROTOCOL_ERROR codes. But even
once those go away, this is still a reasonable fix for preventing
a broken connection from being stuck in the connection pool that fails
all future requests if a similar bug happens in another HTTP/2 server.

Updates golang/go#47635

Change-Id: I3f89ecd1d3710e49f7219ccb846e016eb269515b
Reviewed-on: https://go-review.googlesource.com/c/net/+/347033
Trust: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
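
In outline, the poisoning described above amounts to something like the following (a simplified model with illustrative names; the real http2.ClientConn and its pool are far more involved):

package main

import (
	"fmt"
	"sync"
)

// clientConn is a simplified stand-in for an HTTP/2 client connection.
type clientConn struct {
	mu         sync.Mutex
	doNotReuse bool // set once the server sends a stream-level PROTOCOL_ERROR
}

// canTakeNewRequest reports whether the pool may hand this connection
// to a new request.
func (cc *clientConn) canTakeNewRequest() bool {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	return !cc.doNotReuse
}

// onStreamProtocolError poisons the connection: in-flight streams may
// finish, but new requests will pick or dial another connection.
func (cc *clientConn) onStreamProtocolError() {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	cc.doNotReuse = true
}

func main() {
	cc := &clientConn{}
	fmt.Println(cc.canTakeNewRequest()) // true
	cc.onStreamProtocolError()
	fmt.Println(cc.canTakeNewRequest()) // false: the pool skips this conn
}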
@gopherbot

Change https://golang.org/cl/356978 mentions this issue: [internal-branch.go1.16-vendor] http2: make Transport not reuse conns after a stream protocol error

@gopherbot

Change https://golang.org/cl/357673 mentions this issue: [internal-branch.go1.17-vendor] http2: make Transport not reuse conns after a stream protocol error

gopherbot pushed a commit to golang/net that referenced this issue Oct 29, 2021
[internal-branch.go1.17-vendor] http2: make Transport not reuse conns after a stream protocol error

Updates golang/go#49077

Change-Id: I3f89ecd1d3710e49f7219ccb846e016eb269515b
Reviewed-on: https://go-review.googlesource.com/c/net/+/347033
Trust: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-on: https://go-review.googlesource.com/c/net/+/357673
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/net that referenced this issue Oct 29, 2021
[internal-branch.go1.16-vendor] http2: make Transport not reuse conns after a stream protocol error

Updates golang/go#49076

Change-Id: I3f89ecd1d3710e49f7219ccb846e016eb269515b
Reviewed-on: https://go-review.googlesource.com/c/net/+/347033
Trust: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-on: https://go-review.googlesource.com/c/net/+/356978
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
dteh pushed a commit to dteh/fhttp that referenced this issue Jun 22, 2022
http2: complete outstanding async write before reading the next frame

For golang/go#42777

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
dteh pushed a commit to dteh/fhttp that referenced this issue Jun 22, 2022
http2: make Transport not reuse conns after a stream protocol error

Updates golang/go#47635

Change-Id: I3f89ecd1d3710e49f7219ccb846e016eb269515b
Reviewed-on: https://go-review.googlesource.com/c/net/+/347033
Trust: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
fedosgad pushed a commit to fedosgad/oohttp that referenced this issue Jun 22, 2022
[internal-branch.go1.17-vendor] http2: complete outstanding async write before reading the next frame

Updates golang/go#42777
Fixes golang/go#47692

Change-Id: I8d7e35ded3fe508f6c285e84ba768104c3406bdf
Reviewed-on: https://go-review.googlesource.com/c/net/+/330909
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit 60bc85c4be6d32924bcfddb728394cb8713f2c78)
Reviewed-on: https://go-review.googlesource.com/c/net/+/346890
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
fedosgad pushed a commit to fedosgad/oohttp that referenced this issue Jun 22, 2022
[internal-branch.go1.17-vendor] http2: make Transport not reuse conns after a stream protocol error

Updates golang/go#49077

Change-Id: I3f89ecd1d3710e49f7219ccb846e016eb269515b
Reviewed-on: https://go-review.googlesource.com/c/net/+/347033
Trust: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-on: https://go-review.googlesource.com/c/net/+/357673
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@golang golang locked and limited conversation to collaborators Oct 21, 2022