Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/http2: retry requests rejected with REFUSED_STREAM #20985

Closed
jba opened this issue Jul 11, 2017 · 17 comments
Closed

x/net/http2: retry requests rejected with REFUSED_STREAM #20985

jba opened this issue Jul 11, 2017 · 17 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jba
Copy link
Contributor

jba commented Jul 11, 2017

What version of Go are you using (go version)?

go1.8.1.typealias

What operating system and processor architecture are you using (go env)?

Linux amd64

What did you do?

go run refused-stream.go -bucket veener-jba-doc-test-bucket -object datastore

Where refused-stream.go is https://gist.github.com/jba/9e75c3aedeb4e8b98a323424283fae88.

(veener-jba-doc-test-bucket/datastore should be publicly readable, but I'm guessing any object would do.)

What did you expect to see?

No output.

What did you see instead?

Often (not always), it fails with

2017/07/11 16:09:29 url Error &url.Error{Op:"Get", URL:"https://storage.googleapis.com/veener-jba-doc-test-bucket/datastore", Err:http.http2StreamError{StreamID:0xdd, Code:0x7, Cause:error(nil)}}, temporary=false, timeout=false
2017/07/11 16:09:29 NewReader: Get https://storage.googleapis.com/veener-jba-doc-test-bucket/datastore: stream error: stream ID 221; REFUSED_STREAM (*url.Error)
exit status 1
@odeke-em odeke-em changed the title see REFUSED_STREAM when reading from GCS x/net/http2: REFUSED_STREAM when reading from GCS Jul 12, 2017
@odeke-em
Copy link
Member

odeke-em commented Jul 12, 2017

Looks like this is another HTTP2 problem on GCS almost reminiscent of #20689 perhaps

/cc @tombergan @bradfitz @Capstan

@bradfitz
Copy link
Contributor

This isn't a GFE bug implementing HTTP/2 wrong.

This is GCS (via the GFE) telling us to slow down it seems. But I can't reproduce. Maybe their rate limiting likes me more.

Oh, if I crank it up 10x I can get the error. With GODEBUG=http2debug=2:

2017/07/12 17:45:55 http2: Transport received HEADERS flags=END_HEADERS stream=249 len=20
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=251 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=251 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=253 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=253 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=255 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=255 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=257 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=257 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=259 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=259 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=261 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=261 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=263 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=263 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=265 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=265 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=267 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=267 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=269 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=269 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=271 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=271 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=273 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=273 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=275 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=275 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=277 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=277 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=279 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=279 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=281 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=281 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=283 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=283 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Framer 0xc424630000: read RST_STREAM stream=285 len=4 ErrCode=REFUSED_STREAM
2017/07/12 17:45:55 http2: Transport received RST_STREAM stream=285 len=4 ErrCode=REFUSED_STREAM

It's not clear what the GFE wants us to do, though: if it didn't want us to open 100 concurrent streams, why did it tell us to?

$ h2i storage.googleapis.com
Connecting to storage.googleapis.com:443 ...
Connected to 173.194.197.128:443
Negotiated protocol "h2"
Sending: []
[FrameHeader SETTINGS len=18]
  [MAX_CONCURRENT_STREAMS = 100]
  [INITIAL_WINDOW_SIZE = 1048576]
  [MAX_HEADER_LIST_SIZE = 16384]
[FrameHeader WINDOW_UPDATE len=4]
  Window-Increment = 983041
[FrameHeader SETTINGS flags=ACK len=0]

So, okay, maybe it doesn't like something else, like the rate (which isn't expressible as a SETTING), so it sends us a REFUSED_STREAM instead.

What are we supposed to do? Retry in a loop forever? With some exponential backoff I guess?

Yeah, http://httpwg.org/specs/rfc7540.html#Reliability says:

The REFUSED_STREAM error code can be included in a RST_STREAM frame to indicate that the stream is being closed prior to any processing having occurred. Any request that was sent on the reset stream can be safely retried.

So, yeah, do that.

/cc @tombergan

@bradfitz bradfitz added this to the Go1.10 milestone Jul 12, 2017
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Jul 12, 2017
@Capstan
Copy link

Capstan commented Jul 12, 2017

Do I understand correctly that you are well below MAX_CONCURRENT_STREAMS and you're still getting REFUSED_STREAM?

@bradfitz
Copy link
Contributor

@Capstan, that's what it seems like, but I could verify more. I didn't analyze the full log and count. Go's net/http client does respect the MAX_CONCURRENT_STREAMS streams from the peer (the GFE here) and respects it whenever it makes a new request, accounting for how many streams are already in flight.

@tombergan tombergan changed the title x/net/http2: REFUSED_STREAM when reading from GCS x/net/http2: retry requests rejected with REFUSED_STREAM Jul 13, 2017
@tombergan tombergan self-assigned this Jul 13, 2017
@lhecker
Copy link

lhecker commented Jul 20, 2017

We're facing the exact same error in production as well when calling out to GCS. A robust retry logic inside the Go stdlib would indeed be awesome and surely prevent having some wonky code in every client library or application code doing this manually. 🙂

@tombergan
Copy link
Contributor

tombergan commented Jul 20, 2017

I have two questions:

  1. Should we retry with backoff, or immediately? My gut says to backoff. However, both Firefox and Chrome appear to retry without exponential backoff.
    https://cs.chromium.org/chromium/src/net/http/http_network_transaction.cc?rcl=f5793c4dfa7d3e80e208e756d4fbc5eb9ea7cdae&l=1540
    https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Session.cpp#2931

    Even with backoff, I worry that servers might send REFUSED_STREAM deterministically which could get stuck in an infinite retry loop (and with exponential backoff, potentially an infinite delay). One example: "Routing failures are now surfaced as REFUSED_STREAM resets". I don't know anything about that project, but I could imagine that "routing failures" might be deterministic. Given this possibility, I am hesitant to retry without a limit.

    Firefox has a limit of 6 retries per request. Chrome may have a limit, but if so, I cannot find it.
    https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpTransaction.cpp#1184
    https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpHandler.cpp#187

  2. Relatedly, if the client makes N RoundTrip calls concurrently but the server only allows N-1 concurrent requests per connection, we optimistically create a new connection for the Nth request. This seems kind of sketchy. Wouldn't it be better to stall the Nth request until one prior request completes?

    This is what both Chrome and Firefox do AFAICT:
    https://cs.chromium.org/chromium/src/net/spdy/chromium/spdy_session.cc?rcl=83b7815399e50c9f9da3434669eba2829c2857e0&l=1404
    https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Session.cpp#473

@bradfitz
Copy link
Contributor

I'd do the first retry immediately, and then with some backoff, up to a max number of retries total.

And once we get a REFUSED_STREAM on a conn, mark that conn as unsuitable for any future streams as well and put it into shutdown mode.

Relatedly, if the client makes N RoundTrip calls concurrently but the server only allows N-1 concurrent requests per connection, we optimistically create a new connection for the Nth request. This seems kind of sketchy. Wouldn't it be better to stall the Nth request until one prior request completes?

There's an existing bug for that.

@tombergan
Copy link
Contributor

That retry strategy SGTM. I'll use a fixed limit of 6 retries (unless you object), using Firefox as prior art.

And once we get a REFUSED_STREAM on a conn, mark that conn as unsuitable for any future streams as well and put it into shutdown mode.

This should not be necessary ... the server can send GOAWAY if they want us to go away entirely.

There's an existing bug for that.

Ah yes, #13774.

@bradfitz
Copy link
Contributor

This should not be necessary ... the server can send GOAWAY if they want us to go away entirely.

I want to do it anyway. It also means we're forced to open a new TCP connection and get a different backend.

Otherwise I worry that our loop will run and we'll select the same TCP connection that just told us REFUSED_STREAM. I want a better chance at being reassigned to a different backend.

See also: #20977 (comment)

@gopherbot
Copy link

CL https://golang.org/cl/50471 mentions this issue.

@tombergan
Copy link
Contributor

This should not be necessary ... the server can send GOAWAY if they want us to go away entirely.

I want to do it anyway. It also means we're forced to open a new TCP connection and get a different backend.

From ietf-http-wg, retrying on a new connection is not necessary, and again, Go would be the only client I'm aware of that would do this.
https://lists.w3.org/Archives/Public/ietf-http-wg/2016AprJun/0241.html

Another client that retries REFUSED_STREAM on the same connection (add to Firefox and Chrome):
square/okhttp#2543 (comment)

@tombergan
Copy link
Contributor

@Capstan, that's what it seems like, but I could verify more. I didn't analyze the full log and count. Go's net/http client does respect the MAX_CONCURRENT_STREAMS streams from the peer (the GFE here) and respects it whenever it makes a new request, accounting for how many streams are already in flight.

We default to MAX_CONCURRENT_STREAMS=1000, then get a SETTINGS frame with MAX_CONCURRENT_STREAMS=100 from the GFE. Is it possible that we manage to send 101 requests in the initial TCP congestion window before we get the SETTINGS from the GFE? That would explain REFUSED_STREAM.

(I checked the GFE code. It looks like the GFE sends REFUSED_STREAM in two cases: exceeded MAX_CONCURRENT_STREAMS or the GFE itself is overloaded. The latter case seems unlikely, so it's likely that we're exceeding MAX_CONCURRENT_STREAMS.)

If that is the case, we can't entirely fix this bug without fixing #13774.

@tombergan
Copy link
Contributor

Deleted my prior comment because I was wrong, this is exactly what's happening:

We default to MAX_CONCURRENT_STREAMS=1000. The GFE sends MAX_CONCURRENT_STREAMS=100. However, we create multiple HTTP/2 connections (I count 5). On one of those connections, we manage to send 200+ requests before actually reading the SETTINGS from the GFE.

I think the right fix for this bug is to backoff and retry, along with a fix for #13774.

@gopherbot
Copy link

Change https://golang.org/cl/53250 mentions this issue: http2: block RoundTrip when the Transport hits MaxConcurrentStreams

@tombergan
Copy link
Contributor

tombergan commented Aug 4, 2017

Also see internal bug number 64069455. I believe the two changes attached to this issue should fix the problem, although I have not yet tried the original repro with both changes patched at the same time.

Edit: more correctly, those two changes, plus the change to bundle them into net/http.

gopherbot pushed a commit to golang/net that referenced this issue Aug 8, 2017
RoundTrip will retry a request if it receives REFUSED_STREAM. To guard
against servers that use REFUSED_STREAM to encourage rate limiting, or
servers that return REFUSED_STREAM deterministically for some requests,
we retry after an exponential backoff and we cap the number of retries.

The exponential backoff starts on the second retry, with a backoff
sequence of 1s, 2s, 4s, etc, with 10% random jitter.

The retry cap was set to 6, somewhat arbitrarily. Rationale: this is
what Firefox does.

Updates golang/go#20985

Change-Id: I4dcac4392ac4a3220d6d839f28bf943fe6b3fea7
Reviewed-on: https://go-review.googlesource.com/50471
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot pushed a commit to golang/net that referenced this issue Aug 9, 2017
Currently if the http2.Transport hits SettingsMaxConcurrentStreams for a
server, it just makes a new TCP connection and creates the stream on the
new connection. This CL updates that behavior to instead block RoundTrip
until a new stream is available.

I also fixed a second bug, which was necessary to make some tests pass:
Previously, a stream was removed from cc.streams only if either (a) we
received END_STREAM from the server, or (b) we received RST_STREAM from
the server. This CL removes a stream from cc.streams if the request was
cancelled (via ctx.Close, req.Cancel, or resp.Body.Close) before
receiving END_STREAM or RST_STREAM from the server.

Updates golang/go#13774
Updates golang/go#20985
Updates golang/go#21229

Change-Id: I660ffd724c4c513e0f1cc587b404bedb8aff80be
Reviewed-on: https://go-review.googlesource.com/53250
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/54052 mentions this issue: net/http: update bundled http2

@tombergan
Copy link
Contributor

This should be fixed in the main repo (for Go 1.10). Please reopen if you are still seeing problems.

c3mb0 pushed a commit to c3mb0/net that referenced this issue Apr 2, 2018
RoundTrip will retry a request if it receives REFUSED_STREAM. To guard
against servers that use REFUSED_STREAM to encourage rate limiting, or
servers that return REFUSED_STREAM deterministically for some requests,
we retry after an exponential backoff and we cap the number of retries.

The exponential backoff starts on the second retry, with a backoff
sequence of 1s, 2s, 4s, etc, with 10% random jitter.

The retry cap was set to 6, somewhat arbitrarily. Rationale: this is
what Firefox does.

Updates golang/go#20985

Change-Id: I4dcac4392ac4a3220d6d839f28bf943fe6b3fea7
Reviewed-on: https://go-review.googlesource.com/50471
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
c3mb0 pushed a commit to c3mb0/net that referenced this issue Apr 2, 2018
Currently if the http2.Transport hits SettingsMaxConcurrentStreams for a
server, it just makes a new TCP connection and creates the stream on the
new connection. This CL updates that behavior to instead block RoundTrip
until a new stream is available.

I also fixed a second bug, which was necessary to make some tests pass:
Previously, a stream was removed from cc.streams only if either (a) we
received END_STREAM from the server, or (b) we received RST_STREAM from
the server. This CL removes a stream from cc.streams if the request was
cancelled (via ctx.Close, req.Cancel, or resp.Body.Close) before
receiving END_STREAM or RST_STREAM from the server.

Updates golang/go#13774
Updates golang/go#20985
Updates golang/go#21229

Change-Id: I660ffd724c4c513e0f1cc587b404bedb8aff80be
Reviewed-on: https://go-review.googlesource.com/53250
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@golang golang locked and limited conversation to collaborators Aug 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants