Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/http2: make Transport return nicer error when Amazon ALB hangs up mid-response? #18639

Open
bfallik opened this issue Jan 12, 2017 · 65 comments
Milestone

Comments

@bfallik
Copy link

bfallik commented Jan 12, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version go1.8rc1 darwin/amd64

What operating system and processor architecture are you using (go env)?

Linux AMD64

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

We have http client code that has started to return errors when the corresponding server uses HTTP2 instead of HTTP.

What did you expect to see?

Identical behavior.

What did you see instead?

http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""

@bfallik
Copy link
Author

bfallik commented Jan 12, 2017

To add some color on this, what I'm wondering is if go http clients are expected to handle the GOAWAY frames potentially returned from http2 servers or should the transports somehow manage this case similar to a remotely closed connection. Ideally our program would just re-try the request for a NO_ERROR+GOAWAY scenario.

I've asked a related question on SO and golang-nuts.

@bfallik bfallik changed the title x/net/http2: x/net/http2: http client and GOAWAY, NO_ERROR Jan 12, 2017
@bradfitz bradfitz changed the title x/net/http2: http client and GOAWAY, NO_ERROR x/net/http2: Transport not retrying on Server's GOAWAY NO_ERROR Jan 12, 2017
@bradfitz
Copy link
Contributor

You don't provide enough information in this bug report.

Are you importing golang.org/x/net/http2 at all, or only using Go's net/http package?

This was fixed in December via golang/net@8dab929

Are you perhaps using an old version of golang.org/x/net/http2 directly?

What HTTP method are these? If they're not idempotent requests, how are you creating the requests?

/cc @tombergan

@bradfitz bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 12, 2017
@bfallik
Copy link
Author

bfallik commented Jan 12, 2017

@bradfitz We're using the standard net/http package. I miscommunicated the version information. We originally observed this using go 1.7 but since then I was able to reproduce it using go1.8 beta1. Yesterday I upgraded to go 1.8rc1 but I'm not certain if I rebuilt the test case. I can retry the test case with 1.8rc1 if that includes golang/net@8dab929.

The call stack is roughly:

        request, err := http.NewRequest("GET", endpoint.String(), nil)
        if err != nil {
                return nil, err
        }
        request.Header.Set("Accept", "application/json")
        if len(jwtToken) > 0 {
                request.Header.Add("Authorization", jwtToken)
        }
        response, err := http.DefaultClient.Do(request)
        if err != nil {
                return nil, errors.WithStack(err)
        }
        return response, nil

@bfallik
Copy link
Author

bfallik commented Jan 12, 2017

I was just able to observe the same error using a test binary built with go1.8 rc1.

@bradfitz
Copy link
Contributor

Can you capture the stderr output with GODEBUG=http2debug=2 set in your environment when it occurs?

How "rough" is that "roughly" call stack? Is it always a GET with no body?

@bfallik
Copy link
Author

bfallik commented Jan 12, 2017

@bradfitz the NewRequest() line is coped verbatim, the body is always nil.

I can collect the stderr output later today or tomorrow.

@bradfitz
Copy link
Contributor

And where do you see http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""? Is that stderr output, or the error returned by a specific call? Which?

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@bradfitz that is the error returned (and logged by us) after a specific call to http.Client.Do().

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@bradfitz below is stderr with GODEBUG=http2debug=2. I removed some sensitive data, replaced with 'XXX'. Hopefully those changes won't interfere.

2017/01/13 00:30:04 http2: Transport failed to get client conn for XXX.clypd.com:443: http2: no cached connection was available
2017/01/13 00:30:04 http2: Transport creating client conn 0xc420171380 to XXX:443
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2017/01/13 00:30:04 http2: Transport encoding header ":authority" = "XXX.clypd.com"
2017/01/13 00:30:04 http2: Transport encoding header ":method" = "GET"
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2017/01/13 00:30:04 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2017/01/13 00:30:04 http2: Transport encoding header ":path" = "XXX?end_date=01%2F12%2F2022&start_date=01%2F13%2F2017"
2017/01/13 00:30:04 http2: Transport encoding header ":scheme" = "https"
2017/01/13 00:30:04 http2: Transport encoding header "accept" = "application/json"
2017/01/13 00:30:04 http2: Transport encoding header "authorization" = "XXX.XXX.XXX"
2017/01/13 00:30:04 http2: Transport encoding header "accept-encoding" = "gzip"
2017/01/13 00:30:04 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=265
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote SETTINGS flags=ACK len=0
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read WINDOW_UPDATE len=4 (conn) incr=2147418112
2017/01/13 00:30:04 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=2147418112
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read SETTINGS flags=ACK len=0
2017/01/13 00:30:04 http2: Transport received SETTINGS flags=ACK len=0
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=1 len=255
2017/01/13 00:30:11 http2: decoded hpack field header field ":status" = "200"
2017/01/13 00:30:11 http2: decoded hpack field header field "date" = "Fri, 13 Jan 2017 00:30:11 GMT"
2017/01/13 00:30:11 http2: decoded hpack field header field "content-type" = "application/json; charset=utf-8"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-frame-options" = "SAMEORIGIN"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-xss-protection" = "1; mode=block"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-content-type-options" = "nosniff"
2017/01/13 00:30:11 http2: decoded hpack field header field "etag" = "\"19a8a0bf510ccd2fc61c6f74d8bac94b\""
2017/01/13 00:30:11 http2: decoded hpack field header field "cache-control" = "max-age=0, private, must-revalidate"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-request-id" = "530d5473-f35c-4dd8-9f08-0c0e62b8f4ea"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-runtime" = "6.870694"
2017/01/13 00:30:11 http2: decoded hpack field header field "server" = "thin"
2017/01/13 00:30:11 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=255
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=396 data="XXX (140 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=396 data="XXX (140 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=7680
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=8192
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=16384
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=12128
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=8949
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=6264 data="XXX (6008 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=6264 data="XXX (6008 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA flags=END_STREAM stream=1 len=0 data=""
2017/01/13 00:30:11 http2: Transport received DATA flags=END_STREAM stream=1 len=0 data=""
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=11691
2017/01/13 00:30:12 http2: Transport encoding header ":authority" = "XXX.clypd.com"
2017/01/13 00:30:12 http2: Transport encoding header ":method" = "GET"
2017/01/13 00:30:12 http2: Transport encoding header ":path" = "XXX?end_date=01%2F12%2F2022&start_date=01%2F13%2F2017"
2017/01/13 00:30:12 http2: Transport encoding header ":scheme" = "https"
2017/01/13 00:30:12 http2: Transport encoding header "authorization" = "XXX.XXX.XXX"
2017/01/13 00:30:12 http2: Transport encoding header "accept" = "application/json"
2017/01/13 00:30:12 http2: Transport encoding header "accept-encoding" = "gzip"
2017/01/13 00:30:12 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=3 len=167
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: read GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:12 http2: Transport received GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255
2017/01/13 00:30:18 http2: decoded hpack field header field ":status" = "200"
2017/01/13 00:30:18 http2: decoded hpack field header field "date" = "Fri, 13 Jan 2017 00:30:18 GMT"
2017/01/13 00:30:18 http2: decoded hpack field header field "content-type" = "application/json; charset=utf-8"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-frame-options" = "SAMEORIGIN"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-xss-protection" = "1; mode=block"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-content-type-options" = "nosniff"
2017/01/13 00:30:18 http2: decoded hpack field header field "etag" = "\"19a8a0bf510ccd2fc61c6f74d8bac94b\""
2017/01/13 00:30:18 http2: decoded hpack field header field "cache-control" = "max-age=0, private, must-revalidate"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-request-id" = "b834b8d7-cb49-45e5-a731-22ef0784220a"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-runtime" = "6.018764"
2017/01/13 00:30:18 http2: decoded hpack field header field "server" = "thin"
2017/01/13 00:30:18 http2: Transport received HEADERS flags=END_HEADERS stream=3 len=255
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=1153 data="XXX (897 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=1153 data="XXX (897 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=7680
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=8192
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=9857
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=6527
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=2271 data="XXX (2015 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=2271 data="XXX (2015 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=21077
2017/01/13 00:30:18 http2: Transport readFrame error on conn 0xc420171380: (*net.OpError) read tcp 10.10.64.169:51702->XXX:443: read: connection reset by peer
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote RST_STREAM stream=3 len=4 ErrCode=CANCEL

@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 13, 2017
@bradfitz bradfitz added this to the Go1.8Maybe milestone Jan 13, 2017
@bradfitz
Copy link
Contributor

@tombergan, I have a theory. Can you sanity check me?

Note that the error he's getting from RoundTrip is:

server sent GOAWAY and closed the connection

That comes only from http2/transport.go's GoAwayError.

But GoAwayError is only ever used by (*clientConnReadLoop).cleanup, where it's sent to any opens streams in the two range loops.

My theory is that when we get the initial GOAWAY and decide to retry the request on a new conn, we're forgetting to remove those streams from the originally-chosen ClientConn's activeRes and/or streams maps.

Plausible?

@tombergan
Copy link
Contributor

tombergan commented Jan 13, 2017

My theory is that when we get the initial GOAWAY and decide to retry the request on a new conn, we're forgetting to remove those streams from the originally-chosen ClientConn's activeRes and/or streams maps.

You may have described a bug, but I don't think that's happening in this scenario. I see only two requests in the connection: stream=1 and stream=3. This is the second request:

2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=3 len=167
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: read GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:12 http2: Transport received GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255

Note that the GOAWAY frame comes just after sending request HEADERS on stream=3. Note also that the GOAWAY frame has LastStreamID=3. We do not close stream=3 because the server claims they might process it. We optimistically assume that we will get a response.
https://github.com/golang/net/blob/69d4b8aa71caaaa75c3dfc11211d1be495abec7c/http2/transport.go#L554

We receive response HEADERS on stream=3 after the GOAWAY (6 seconds after the GOAWAY, in fact). We then receive a sequence of DATA frames. However, we never get END_STREAM. Instead, the server closes the connection. Their server is behaving legally ... I just double-checked the spec, and it's quite clear that the server is not required to process streams with id < GOAWAY.LastStreamId. The server MAY process those streams only partially.

There's not much we can do. We cannot retry the request after the connection closes because we've already received response headers and returned from RoundTrip. I have only two ideas, neither perfect:

  1. The error message can be improved. Perhaps io.UnexpectedEOF instead of "server sent GOAWAY"?

  2. This is a GET, so we could pessimistically retry the RoundTrip immediately after receiving the GOAWAY, rather than optimistically assuming the request will complete. I could see that being suboptimal in some cases, though.

@bradfitz
Copy link
Contributor

He said that RoundTrip returned an error.

If this trace is really about streamid 3 yet we saw this:

2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255

... then RoundTrip should've returned something (END_HEADERS is set), and it's the Body.Read that would've returned the GOAWAY error.

So I still don't see a clear picture of what happened.

@tombergan
Copy link
Contributor

tombergan commented Jan 13, 2017

I am stumped. I looked for suspicious uses of cc.mu on the guess that RoundTrip may be getting deadlocked. Found this suspicious defer cc.mu.Unlock (should that instead happen just after encodeTrailers?), but that shouldn't affect requests that don't have a request body.

@bfallik, can you triple-check that the error is coming from http.Client.Do, not Response.Body.Read? Assuming the error is coming from http.Client.Do, could you insert the following code in clientConnReadLoop.cleanup just before this line:

+buf := make([]byte, 16<<10)
+buf = buf[:runtime.Stack(buf, true)]
+cc.vlogf("stack after run()\n:%s", string(buf))
 err = GoAwayError{

It would also help to add a vlogs in clientConnReadLoop.processHeaders just before this return nil

 // We'd get here if we canceled a request while the
 // server had its response still in flight. So if this
 // was just something we canceled, ignore it.
+cc.vlog("processHeaders could not find stream with id %d", f.StreamID)
 return nil

and this return nil

 // (nil, nil) special case. See handleResponse docs.
+cc.vlog("processHeaders got (nil, nil) from handleResponse")
 return nil

After making the above changes, could you run the program until it fails with that same "server sent goaway" error, then copy the stack dump here? Thanks! It is fine to remove the call frames running your code if you need to keep those private.

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@tombergan Hi. I don't have the rest of the logging you requested but I do believe that the error is propagated from Response.Body.Read() and not http.Client.Do(). I slightly modified my test harness to split apart those operations and the stack trace clearly originates from:

	if err = json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return nil, errors.WithStack(err)
	}

Do you still need that other debugging information or is there different info I can provide?

Also, just to be clear, this is the code I used to reproduce the error:

func checkResponseCode(r *http.Response, err error) (*http.Response, error) {
	if err != nil {
		return nil, err
	}
	if r.StatusCode >= http.StatusBadRequest {
		return nil, errors.WithStack(HTTPError{r})
	}
	return r, nil
}

func HTTPGet(endpoint url.URL, jwtToken string) (*http.Response, error) {
	request, err := http.NewRequest("GET", endpoint.String(), nil)
	if err != nil {
		return nil, err
	}
	request.Header.Set("Accept", "application/json")
	if len(jwtToken) > 0 {
		request.Header.Add("Authorization", jwtToken)
	}

	resp, err := checkResponseCode(http.DefaultClient.Do(request))
	if err != nil {
		return nil, err
	}

	var v interface{}
	if err = json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return nil, errors.WithStack(err)
	}
	return nil, err
}

I think this code is missing a resp.Body.Close() and the original may be missing that as well. Could that be related?

@tombergan
Copy link
Contributor

Well, that changes everything :) We're back to my earlier comment.

I think this code is missing a resp.Body.Close() and the original may be missing that as well. Could that be related?

No, I don't think that's related. I think the sequence of events is described by my linked comment above.

@bradfitz
Copy link
Contributor

So there's no bug here, then.

Except maybe in the http2 server. Which server is this? The linked SO post suggests it's AWS? Is that a new implementation? Do we have any contacts there?

@bradfitz
Copy link
Contributor

@jeffbarr, it seems the AWS ALB's new HTTP/2 support is behaving oddly, not finishing streams after sending a GOAWAY. Who's the right person on the AWS side to look into this? Thanks!

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@bradfitz yes, the server is AWS Application Load Balancer.

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@bradfitz assuming there's no bug client side do you recommend we explicitly catch the GOAWAY+NO_ERROR error and retry the request within our application logic? I'm still learning about http2 and so wasn't sure the expected behavior but now that seems like our only/best workaround until the server can be fixed.

@bradfitz
Copy link
Contributor

@bfallik, you could I suppose. I wouldn't make it specific to that error, though. Even though ALB shouldn't do that, you could just treat it as any other type of network error or interrupted request and retry N times, especially if it's just a GET request.

In this case it's too late for the Go http client to retry since the headers have already been returned. If we did retry, it's possible the headers would be different, so we couldn't stitch the second response's body together with the first response's headers.

Ideally ALB would be fixed, though.

@bfallik
Copy link
Author

bfallik commented Jan 13, 2017

@bradfitz OK, thanks.

@bradfitz bradfitz modified the milestones: Unplanned, Go1.8Maybe Jan 18, 2017
k8s-publishing-bot pushed a commit to kubernetes/apimachinery that referenced this issue Jan 26, 2019
http2 is allowed to tell us to go away, and for watch it is safe
to exit and restart in almost all cases where a connection is
forcibly closed by the upstream. This error message happens a lot
behind ELB and other http2 aware proxies.

Treat the error as "basically done" as suggested by

golang/go#18639 (comment)

Kubernetes-commit: 640caeb74f2bc5e93b1579a10d6d04a07863379f
@froodian
Copy link

Is there anything at all that can be done here? Amazon ALB isn't going away, and if it is technically spec-conforming (even if we view its behavior as sub-optimal), it would be nice for golang not to throw errors.

@bradfitz
Copy link
Contributor

@froodian, what do you want me to do?

@froodian
Copy link

Is it safe to simply retry the request? Or no because it may have gone through and may not be idempotent? I suspect that the way I'll silence this at my layer is to retry any error where the message contains both "server sent GOAWAY and closed the connection" and "ErrCode=NO_ERROR"

@bradfitz
Copy link
Contributor

@froodian, nope, what's happening is:

  • Go user code: "res, err := http.Transport.RoundTrip(someReq) ...."
  • Go net/http: "Here's HTTP request 3 on this TCP connection"
  • ALB: "I'm gracefully shutting down, and the last request I'll handle is 3" (GOAWAY LastStreamID=3)
  • ALB: "Here are the the HTTP response headers for request 3."
  • Go net/http: "Great, thanks for the response headers for request 3, I'll give that to the user now."
  • Go user code: "res, err := http.Transport.RoundTrip(someReq) completes; err == nil, res.Body non-nil & read to read, res.ContentLength set to -1 or some value...."
  • ALB: "BYE BYE TCP CONNECTION DEAD."
  • Go user code: "res.Body.Read(...)"
  • Go net/http: "Sorry, no Body to read. TCP connection is dead."

If ALB hadn't sent a response header then we could retry the request, but it's pretty weird for us to retry the request when we've already given the response headers to the user code. The only safe thing to do is retry the request and hope for exactly the "same" response headers and only if they "match", then continue acting like the original res.Body (which the Go user code is already reading from) is part of the second retried request.

But things like the server's Date header probably changed, so that at least needs to be ignored. What else?

What if ALB had already returned some of the response body bytes, but not all? Do we need to keep a checksum of bytes read and stitch together the two bodies if the body is same length and second response response's prefix bytes have the same checksum?

That would all be super sketchy.

It's better to just return an error, which we do. If the caller wants to retry, they can retry.

Do you just want a better error message? What text would sound good to you?

@froodian
Copy link

I see, yeah, thank you for that condensed write-up @bradfitz... that makes sense, and I agree it's correct from a client perspective not to retry when we've already been given response headers... I agree with @sj26 that a message like "unexpected connection close before stream end" or something long those lines might help indicate the problem a little more clearly.

But at a more root level, I also wonder if we could lean more heavily on AWS as a group to change their behavior - I agree it really seems like they should let that last response write out its whole body to EOF before they close the TCP connection... but I guess they want to have timeouts on their end too so that if the server's app code never EOFs the response body for some reason, they still clean up the TCP connection at some point, hence their current behavior...? I guess https://forums.aws.amazon.com/thread.jspa?messageID=771883#771883 appears to be the most recent public thread with AWS about this? But I also wonder if other conversations have gone on behind the scenes.

@stephanwesten
Copy link

I searched in our logs for GOAWAY and found a couple of thousand hits with the following message:
Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=17611, ErrCode=NO_ERROR, debug=""

All hits have ErrCode=NO_ERROR afaict.

This seems harmless, can this be an info message instead of error?

@bradfitz

This comment has been minimized.

@itcuihao
Copy link

hi, same error.

GODEBUG=http2debug=2 go run m.go 
2022/03/17 18:03:39 http2: Transport failed to get client conn for db.ams.op-mobile.com:443: http2: no cached connection was available
2022/03/17 18:03:40 http2: Transport creating client conn 0xc000001680 to 82.145.213.10:443
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2022/03/17 18:03:40 http2: Transport encoding header ":authority" = "db.ams.op-mobile.com"
2022/03/17 18:03:40 http2: Transport encoding header ":method" = "GET"
2022/03/17 18:03:40 http2: Transport encoding header ":path" = "/user?uids=b9bd8d50dca2e64f136aefadf99fa82111105498,……"
2022/03/17 18:03:40 http2: Transport encoding header ":scheme" = "https"
2022/03/17 18:03:40 http2: Transport encoding header "accept-encoding" = "gzip"
2022/03/17 18:03:40 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=4810
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2022/03/17 18:03:41 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: wrote SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read WINDOW_UPDATE len=4 (conn) incr=2147418112
2022/03/17 18:03:41 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=2147418112
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Transport received SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read GOAWAY len=8 LastStreamID=1 ErrCode=ENHANCE_YOUR_CALM Debug=""
2022/03/17 18:03:41 http2: Transport received GOAWAY len=8 LastStreamID=1 ErrCode=ENHANCE_YOUR_CALM Debug=""
2022/03/17 18:03:41 transport got GOAWAY with error code = ENHANCE_YOUR_CALM
2022/03/17 18:03:41 http2: Transport readFrame error on conn 0xc000001680: (*errors.errorString) EOF
2022/03/17 18:03:41 RoundTrip failure: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=ENHANCE_YOUR_CALM, debug=""

@CameronGo
Copy link

If it helps anyone else out there, we solved this problem by changing our ALB config to use HTTP1 instead of HTTP2. It is a workaround obviously and not a fix, but It is effective for now until Go gets around to changing this behavior.

@bradfitz
Copy link
Contributor

but It is effective for now until Go gets around to changing this behavior.

You misunderstand where the problem lies. See my earlier comment. This isn't something that Go can fix. ALB is hanging up on the middle of responses. This is an Amazon problem.

@CameronGo
Copy link

but It is effective for now until Go gets around to changing this behavior.

You misunderstand where the problem lies. See my earlier comment. This isn't something that Go can fix. ALB is hanging up on the middle of responses. This is an Amazon problem.

It has been a while since the long and painful process of troubleshooting this issue with AWS support so my memory may be faulty. I had 2 seemingly conflicting notes on this issue and how it was concluded. Needless to say, the result was a bit unsatisfying and resulted in our just disabling HTTP2 on our load balancers.

One note I had, which seemed to support the POV that the issue was caused by a problem with the way AWS load balancers handle the HTTP2 standard said

ALBs seem to send GOAWAY — as you linked, https://tools.ietf.org/html/rfc7540#section-6.8 — mid-stream with a last stream id, which is fine, then potentially more header/data frames for that last stream id, which is fine, but never sets the END_STREAM flag — https://tools.ietf.org/html/rfc7540#section-8.1 — on any frame in the stream before closing the connection, which is the problem. This is an error in the ALB implementation of the HTTP2 spec — it is dropping the connection with a stream in an open state — https://tools.ietf.org/html/rfc7540#section-5.1 — which golang is correctly handling as an "unexpected closed connection" error, albeit hidden beneath a "goaway" error.

But then I had some follow-up notes that appeared to imply the way the Go HTTP2 client was handling GOAWAY is what was responsible. They included these notes from AWS support.

the RFC does not state that after a GOAWAY Frame is sent, the sender (ALB) 'must' allow all Streams including-and-up-to the stream identifier included in the frame, to send END_STREAM before terminating the connection. These are separate concepts, the connection termination process 'should' allow for this to happen, but what if the GOAWAY Frame includes an error code and a stream identifier of 0 - It moves responsibility onto the client application to respond or retry in the appropriate manner.

Then I have some much less organized notes from some internal discussions at this time where there's some speculation about whether the issue is that this is being treated as an error when it should not be. I don't recall the context right now. Specifically elsewhere I've noted

Since the server is sending NO_ERROR, your client should simply try to reconnect, and not treat the message as an error.

Regardless I think some reality check is in order. AWS is not infallible, but when such a significant portion of the INET depends on AWS, an issue like this should be brought to some sort of satisfactory conclusion. AWS is either RFC compliant or not and it seems there may be differing opinions on this; however, Go applications should be able to use services from the single largest cloud provider without encountering errors like this. I think it is noteworthy that I've haven't seen this issue discussed on forums from any other languages.

@jeffbarr
Copy link

jeffbarr commented Mar 18, 2022 via email

@bradfitz
Copy link
Contributor

@jeffbarr, gladly. I'm bradfitz on Twitter (DMs open) or gmail.com/golang.org. Thanks!

@jeffbarr
Copy link

I am working to find a good connection, stay tuned...

@radutopala
Copy link

We are also seeing this issue on ELBs in front of ElasticSearch/OpenSearch clusters:

http2: server sent GOAWAY and closed the connection; LastStreamID=19999, ErrCode=NO_ERROR, debug=""

@bradfitz
Copy link
Contributor

@jeffbarr, thanks for the connection! Three of us hopped on a call the other day and were able to repro the issue on demand.

For the record, the tool we used for debugging was https://github.com/bradfitz/h2slam pointed at an ALB, and then changed certain ALB parameters on the AWS control plane and the TCP connection from AWS would fail (in up to 10 seconds), often without even a GOAWAY.

I'll let AWS folk give further updates here.

@radutopala
Copy link

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

The load balancer sends a response code of 000

With HTTP/2 connections, if the compressed length of any of the headers exceeds 8K bytes or if the number of requests served through one connection exceeds 10,000, the load balancer sends a GOAWAY frame and closes the connection with a TCP FIN.

@radutopala
Copy link

Hitting the ElasticSearch ALB with https://github.com/bradfitz/h2slam gives

./h2slam --host xx.es.amazonaws.com --path /_cluster/settings > slam.log
2022/04/04 16:13:31 Get "https://xx.es.amazonaws.com/_cluster/settings": http2: Transport received Server's graceful shutdown GOAWAY
cat slam.log | wc -l
9999

which is the expected behavior actually from AWS documentation perspective. So exactly at the 10k request on the same connection, the connection closes.

@ebarlas
Copy link

ebarlas commented Sep 29, 2023

Hi folks, was there ever a resolution here?

I'm witnessing the same problematic behavior from the ALB. The GOAWAY frame seems to preempt the response data, which is never received by the application.

@CameronGo
Copy link

CameronGo commented Sep 30, 2023

To ensure we didn't see this behavior, we ended up switching all of our ALB target groups to HTTP1. This doesn't occur then. In addition, we made this change in our http client with the following reference:

	// And also add the same thing to `Request.GetBody`, which allows
	// `net/http` to get a new body in cases like a redirect. This is
	// usually not used, but it doesn't hurt to set it in case it's
	// needed. See:
	//
	//     https://github.com/stripe/stripe-go/issues/710
	//
	req.GetBody = func() (io.ReadCloser, error) {
		reader := strings.NewReader(payload)
		return ioutil.NopCloser(reader), nil
	}

@ra-coder
Copy link

ra-coder commented Oct 4, 2023

My briff from the discussion:

This is an issue of how Go handle behaviour of aws load balancer (or other) under http 2.0 connection

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

The load balancer sends a response code of 000 With HTTP/2 connections, if the compressed length of any of the headers exceeds 8 K bytes or if the number of requests served through one connection exceeds 10,000, the load balancer sends a GOAWAY frame and closes the connection with a TCP FIN

So the root cause => is "to heavy header" or "ddos" (hight load in the moment) to the server/service where your Go app do http 2.0 requests

I suggest to
handle this error same as return 503 Busy http code in HTTP/1

The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request.

@3052
Copy link

3052 commented Apr 21, 2024

I am getting this error consistently as well with certain servers. disabling HTTP/2 seems to fix it:

client := http.Client{
   Transport: &http.Transport{
      Proxy: http.ProxyFromEnvironment,
      TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
   },
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests