Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: transport caches permanently broken persistent connections if write error happens during h2 handshake #34978

Closed
mikedanese opened this issue Oct 18, 2019 · 13 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mikedanese
Copy link
Contributor

mikedanese commented Oct 18, 2019

Copy of http://issuetracker.google.com/140973477 to github.

What version of Go are you using (go version)?

$ go version
go version go1.13.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/usr/local/google/home/mikedanese/.cache/go-build"
GOENV="/usr/local/google/home/mikedanese/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/usr/local/google/home/mikedanese/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build932575926=/tmp/go-build -gno-record-gcc-switches"        

What did you do?

Test case-ish is here:

https://gist.github.com/mikedanese/a9204f541d1b12740be2551e381b99fc

Trigger a connection write error after the TLS hanshake but during the h2 handshake.

What did you expect to see?

The persistent connection is not cached by http.Transport.

What did you see instead?

The persistent connection is cached by http.Transport.

More

This change appears to be the culprit:

43b9fcf

We noticed that a client can get into an state where it is consistently failing to make requests and dumping logs like:

W1016 23:30:50.108262       1 plugin.go:164] rpc error: code = Unauthenticated desc = transport: Post ...: write tcp PRIVATE_IP:47518->108.177.119.95:443: write: broken pipe

While inspecting the process with strace, we noticed that it wasn't making any syscalls that would correspond to an outbound http request. We collected a core dump from the process and inspected it:

(dlv) p "net/http".DefaultTransport.idleConn
map[net/http.connectMethodKey][]*net/http.persistConn [
        {proxy: "", scheme: "https", addr: "container.googleapis.com:443", onlyH1: false}: [
                *(*"net/http.persistConn")(0xc00039e6c0),
        ],
]

(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt
net/http.RoundTripper(net/http.http2erringRoundTripper) {
        err: error(*net.OpError) *{
                Op: "write",
                Net: "tcp",
                Source: net.Addr(*net.TCPAddr) ...,
                Addr: net.Addr(*net.TCPAddr) ...,
                Err: error(*os.SyscallError) ...,},}

(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt.err
error(*net.OpError) *{
        Op: "write",
        Net: "tcp",
        Source: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 4, cap: 4, [PRIVATE_IP],
                Port: 47518,
                Zone: "",},
        Addr: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 4, cap: 4, [108,177,119,95],
                Port: 443,
                Zone: "",},
        Err: error(*os.SyscallError) *{
                Syscall: "write",
                Err: error(syscall.Errno) *(*error)(0xc0003ddc90),},}

The core dump showed that http.Transport was holding onto one persistent connection with an alt http2erringRoundTripper that wrapped a failed write error. We successfully reproduced the behavior in the test case-ish linked above and saw that a failed write here:

go/src/net/http/h2_bundle.go

Lines 7157 to 7164 in 8d2ea29

cc.bw.Write(http2clientPreface)
cc.fr.WriteSettings(initialSettings...)
cc.fr.WriteWindowUpdate(0, http2transportDefaultConnFlow)
cc.inflow.add(http2transportDefaultConnFlow + http2initialWindowSize)
cc.bw.Flush()
if cc.werr != nil {
return nil, cc.werr
}

Causes the h2 upgradeFn:

go/src/net/http/h2_bundle.go

Lines 6652 to 6656 in 8d2ea29

upgradeFn := func(authority string, c *tls.Conn) RoundTripper {
addr := http2authorityAddr("https", authority)
if used, err := connPool.addConnIfNeeded(addr, t2, c); err != nil {
go c.Close()
return http2erringRoundTripper{err}

To return an http2erringRoundTripper. The persistent connection is cached by the transport storing the erring round tripper in pconn.alt. The erring round tripper is used in subsequent calls to:

// HTTP/2 path.
t.setReqCanceler(req, nil) // not cancelable with CancelRequest
resp, err = pconn.alt.RoundTrip(req)

And the pconn remains in the idle connection cache.

@agnivade
Copy link
Contributor

@bradfitz

@agnivade agnivade added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 18, 2019
@agnivade agnivade added this to the Unplanned milestone Oct 18, 2019
@bcmills
Copy link
Contributor

bcmills commented Oct 18, 2019

Does this reproduce with Go 1.13.3 (released yesterday)?

#34498 seems like it may be related, and the fix for that was backported in 1.13.3.

@mikedanese
Copy link
Contributor Author

It reproduces on go1.13.3 and master.

@bcmills
Copy link
Contributor

bcmills commented Oct 18, 2019

Is this a regression from 1.12?

CC @tombergan @fraenkel @bgentry

@mikedanese
Copy link
Contributor Author

Yes, this is a new regression. This was broken by 43b9fcf which is present in: go1.13.3 go1.13.2 go1.13.1 go1.13 go1.13rc2 go1.13rc1 go1.13beta1.

@bradfitz
Copy link
Contributor

@fraenkel, could you take a look?

I'll also try to look but it might still be in your head more than mine.

@bradfitz
Copy link
Contributor

@mikedanese, thanks! This is a great bug report.

@gopherbot
Copy link

Change https://golang.org/cl/202078 mentions this issue: net/http: don't cache http2.erringRoundTripper connections

@bradfitz
Copy link
Contributor

@gopherbot, please backport to Go 1.13.

@gopherbot
Copy link

Backport issue(s) opened: #35087 (for 1.13).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/202642 mentions this issue: [release-branch.go1.13] net/http: don't cache http2.erringRoundTripper connections

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 23, 2019
gopherbot pushed a commit that referenced this issue Oct 23, 2019
…r connections

Fixes #35087
Updates #34978

Change-Id: I3baf1392ba7366ae6628889c47c343ef702ec438
Reviewed-on: https://go-review.googlesource.com/c/go/+/202078
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
(cherry picked from commit 88186e5)
Reviewed-on: https://go-review.googlesource.com/c/go/+/202642
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
@bradfitz bradfitz self-assigned this Oct 23, 2019
@bradfitz
Copy link
Contributor

Reopening, as investigation into the test failures from this CL (#35113) aren't leaving me entirely happy yet.

@gopherbot
Copy link

Change https://golang.org/cl/220905 mentions this issue: net/http: fix handling of HTTP/2 upgrade failures

@golang golang locked and limited conversation to collaborators Mar 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants