Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/http2: pool deadlock #32388

Closed
stevenh opened this issue Jun 2, 2019 · 20 comments
Closed

x/net/http2: pool deadlock #32388

stevenh opened this issue Jun 2, 2019 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@stevenh
Copy link
Contributor

stevenh commented Jun 2, 2019

What version of Go are you using (go version)?

go version go1.12.4 freebsd/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GOARCH="amd64"
GOBIN="/data/go/bin"
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH="/data/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build232080061=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Use the net/http client to make requests to an http2 supporting server.

What did you expect to see?

Requests should continue to succeed if a single connection becomes blocked in a write.

What did you see instead?

A single blocked connection prevented new http2 connections from being established due to the http2 pool requiring the connection lock when testing to see if an existing connection can handle new requests. This results in the entire application hanging indefinitely.

The following trace was created with an older go version 1.9 from our production app, but I have proved it's still possible to reproduce in the latest version as detailed above, as the locking between connections and pool haven't been changed.

-- goroutine blocked in writeHeaders -> Flush
1 @ 0x43bddc 0x436eea 0x4364e7 0x4a201e 0x4a210d 0x4a3a04 0x577812 0x589dbd 0x5d1887 0x5d1da0 0x5d2cbb 0x662cea 0x6ad1d4 0x52bdce 0x6678dc 0x66680b 0x663215 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637
#    0x4364e6    internal/poll.runtime_pollWait+0x56                                                    /usr/local/go/src/runtime/netpoll.go:173
#    0x4a201d    internal/poll.(*pollDesc).wait+0xad                                                    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#    0x4a210c    internal/poll.(*pollDesc).waitWrite+0x3c                                                /usr/local/go/src/internal/poll/fd_poll_runtime.go:94
#    0x4a3a03    internal/poll.(*FD).Write+0x243                                                        /usr/local/go/src/internal/poll/fd_unix.go:227
#    0x577811    net.(*netFD).Write+0x51                                                            /usr/local/go/src/net/fd_unix.go:220
#    0x589dbc    net.(*conn).Write+0x6c                                                            /usr/local/go/src/net/net.go:188
#    0x5d1886    crypto/tls.(*Conn).write+0x1a6                                                        /usr/local/go/src/crypto/tls/conn.go:832
#    0x5d1d9f    crypto/tls.(*Conn).writeRecordLocked+0x35f                                                /usr/local/go/src/crypto/tls/conn.go:909
#    0x5d2cba    crypto/tls.(*Conn).Write+0x1ba                                                        /usr/local/go/src/crypto/tls/conn.go:1068
#    0x662ce9    net/http.http2stickyErrWriter.Write+0x79                                                /usr/local/go/src/net/http/h2_bundle.go:6803
#    0x6ad1d3    net/http.(*http2stickyErrWriter).Write+0x73                                                <autogenerated>:1
#    0x52bdcd    bufio.(*Writer).Flush+0x7d                                                        /usr/local/go/src/bufio/bufio.go:567
#    0x6678db    net/http.(*http2ClientConn).writeHeaders+0x1eb                                                /usr/local/go/src/net/http/h2_bundle.go:7436
#    0x66680a    net/http.(*http2ClientConn).RoundTrip+0x46a                                                /usr/local/go/src/net/http/h2_bundle.go:7289
#    0x663214    net/http.(*http2Transport).RoundTripOpt+0x144                                                /usr/local/go/src/net/http/h2_bundle.go:6858
#    0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                                /usr/local/go/src/net/http/h2_bundle.go:6820
#    0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                            /usr/local/go/src/net/http/h2_bundle.go:990
#    0x690a16    net/http.(*Transport).RoundTrip+0xd46                                                    /usr/local/go/src/net/http/transport.go:371
#    0x63b818    net/http.send+0x1a8                                                            /usr/local/go/src/net/http/client.go:249
#    0x63b4ac    net/http.(*Client).send+0xfc                                                        /usr/local/go/src/net/http/client.go:173
...

-- goroutine waiting for connection lock, held above, while holding the pool lock
1 @ 0x43bddc 0x43bece 0x44d2c4 0x44cfdd 0x48115e 0x66546f 0x6457d8 0x646dce 0x6631c7 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637 0x46b1f1
#    0x44cfdc    sync.runtime_SemacquireMutex+0x3c                                                    /usr/local/go/src/runtime/sema.go:71
#    0x48115d    sync.(*Mutex).Lock+0xed                                                            /usr/local/go/src/sync/mutex.go:134
#    0x66546e    net/http.(*http2ClientConn).CanTakeNewRequest+0x3e                                            /usr/local/go/src/net/http/h2_bundle.go:7083
#    0x6457d7    net/http.(*http2clientConnPool).getClientConn+0xe7                                            /usr/local/go/src/net/http/h2_bundle.go:739
#    0x646dcd    net/http.http2noDialClientConnPool.GetClientConn+0x4d                                            /usr/local/go/src/net/http/h2_bundle.go:927
#    0x6631c6    net/http.(*http2Transport).RoundTripOpt+0xf6                                                /usr/local/go/src/net/http/h2_bundle.go:6852
#    0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                                /usr/local/go/src/net/http/h2_bundle.go:6820
#    0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                            /usr/local/go/src/net/http/h2_bundle.go:990
#    0x690a16    net/http.(*Transport).RoundTrip+0xd46
...

-- goroutines blocked waiting for the pool lock held above
13033 @ 0x43bddc 0x43bece 0x44d2c4 0x44cfdd 0x48115e 0x645754 0x646dce 0x6631c7 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637 0x46b1f1
#   0x44cfdc    sync.runtime_SemacquireMutex+0x3c                                                   /usr/local/go/src/runtime/sema.go:71
#   0x48115d    sync.(*Mutex).Lock+0xed                                                         /usr/local/go/src/sync/mutex.go:134
#   0x645753    net/http.(*http2clientConnPool).getClientConn+0x63                                          /usr/local/go/src/net/http/h2_bundle.go:737
#   0x646dcd    net/http.http2noDialClientConnPool.GetClientConn+0x4d                                           /usr/local/go/src/net/http/h2_bundle.go:927
#   0x6631c6    net/http.(*http2Transport).RoundTripOpt+0xf6                                                /usr/local/go/src/net/http/h2_bundle.go:6852
#   0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                               /usr/local/go/src/net/http/h2_bundle.go:6820
#   0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                           /usr/local/go/src/net/http/h2_bundle.go:990
#   0x690a16    net/http.(*Transport).RoundTrip+0xd46                                                   /usr/local/go/src/net/http/transport.go:371
#   0x63b818    net/http.send+0x1a8                                                         /usr/local/go/src/net/http/client.go:249
#   0x63b4ac    net/http.(*Client).send+0xfc                                                        /usr/local/go/src/net/http/client.go:173
#   0x63cb4c    net/http.(*Client).Do+0x28c                                                     /usr/local/go/src/net/http/client.go:602

This is similar to #23559 however that bug is specifically about certain areas of the http2 connections not paying attention to timeouts whereas this bug is caused by the interaction of individual connection locks and the pool lock.

@gopherbot
Copy link

Change https://golang.org/cl/179938 mentions this issue: http2: Prevent pool blocking when testing a conn for a new request

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 3, 2019
@bcmills bcmills added this to the Unreleased milestone Jun 3, 2019
@bcmills
Copy link
Contributor

bcmills commented Jun 3, 2019

CC @bradfitz @tombergan

@bradfitz
Copy link
Contributor

bradfitz commented Jun 3, 2019

Whelp....

        // TODO(bradfitz): this Flush could potentially block (as
        // could the WriteHeaders call(s) above), which means they
        // wouldn't respond to Request.Cancel being readable. That's
        // rare, but this should probably be in a goroutine.
        cc.bw.Flush()
        return cc.werr
}

I haven't reviewed CL 179938 but it seems a bit long/invasive. I wonder why the mutex needs to be held during that write.

Rather than change everything so CanTakeRequest can work without taking the lock, can we instead not hold the lock during the write and not modify CanTakeRequest?

@stevenh
Copy link
Contributor Author

stevenh commented Jun 3, 2019

I tried that but because cc.bw.Write(...) modifies cc.werr it requires the lock.

It may be that adding a lock just around the update of werr would be possible but given the wide scope of the current lock, it would hard to prove it's not needed elsewhere too.

@fraenkel
Copy link
Contributor

fraenkel commented Jun 8, 2019

There seems to be a mix of ways the locks are acquired but it is always mu then wmu. I can see value in acquiring wmu under mu and then releasing if we are trying to guarantee some ordering.
I didn't submit any CL because I am lacking a testcase that shows the issue. When I try the testcase in the submitted CL, it fails the same way using the current code as well as when I move the lock.
If I just glance at the testcase, I don't see how this is testing the right behavior. The server is closed after the first connection and we are only allowed 1 dial, so the second request fails with too many dials.

@stevenh
Copy link
Contributor Author

stevenh commented Jun 8, 2019

The test only talks to one endpoint so the http/2 code will try to make these requests over the same TCP connection; it's designed to validate that the first request doesn't block the second request due to a stalled write; in this situation, it should just create a new connection but as the CanTakeRequest blocks it can't do that.

You are correct in that mu is always acquired before wmu if holding both as documented here:
https://github.com/golang/net/blob/master/http2/transport.go#L240

@gopherbot
Copy link

Change https://golang.org/cl/181457 mentions this issue: http2: reduce lock contention when writing

@gopherbot
Copy link

Change https://golang.org/cl/224797 mentions this issue: http2: queue new streams to reduce write lock contention

@gopherbot
Copy link

Change https://golang.org/cl/240338 mentions this issue: x/net/http2: Honor the request timeouts for writes

@bcbee
Copy link

bcbee commented Jul 14, 2020

@fraenkel Thanks so much for helping fix this bug! We've run into this quite a few times on some service to service calls under specific conditions

roidelapluie added a commit to roidelapluie/common that referenced this issue Aug 3, 2020
HTTP/2 support is golang has many problematic cornercases where dead
connections would be kept.

golang/go#32388
golang/go#39337
golang/go#39750

I suggest we disable HTTP/2 for now and enable it manually on the
blackbox exporter.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
@nqkdev
Copy link

nqkdev commented Sep 3, 2020

@fraenkel Thanks for this patch https://golang.org/cl/240338, I've manually applied the patch and the issue is gone. Can we have this merged so I can remove the manual patch?

@nqkdev
Copy link

nqkdev commented Nov 17, 2020

Any update on this @fraenkel? This patch: https://golang.org/cl/240338 has been working great for me in production for the last 2 months

@bcbee
Copy link

bcbee commented Nov 19, 2020

I believe this may be tangential which was resolved in 1.15.3 #42113

@stevenh
Copy link
Contributor Author

stevenh commented Nov 19, 2020

Given the description of that bug says "will cause all subsequent write requests to fail blindly" I think this is very different as its a definite deadlock.

@CPoirot3
Copy link

CPoirot3 commented Mar 15, 2021

@fraenkel Thanks for this patch https://golang.org/cl/240338, I've manually applied the patch and the issue is gone. Can we have this merged so I can remove the manual patch?

@fraenkel Thanks for this patch , will this patch merge into master ?

@gopherbot
Copy link

Change https://golang.org/cl/349594 mentions this issue: http2: avoid blocking while holding ClientConn.mu

@horkhe
Copy link

horkhe commented Sep 27, 2021

Thank you for fixing this issue. It has have been affecting us for quite some time. Is it possible to merge the solution to a 1.17 point release? Or we have to wait for 1.18?

@ianlancetaylor
Copy link
Contributor

@gopherbot Please open backport issues.

This appears to be an occasional deadlock in http. Should be considered for backporting depending on what @neild thinks.

@gopherbot
Copy link

Backport issue(s) opened: #48649 (for 1.16), #48650 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@fraenkel
Copy link
Contributor

I believe there is a slight issue when sending a new stream. There is a gap between obtaining the new streamID and sending it via encodeHeaders. If streamID + N is sent before any previous streamID, they will close when received.

dteh pushed a commit to dteh/fhttp that referenced this issue Jun 22, 2022
Operations which examine the state of a ClientConn--notably,
the connection pool's check to see if a conn is available to
take a new request--need to acquire mu. Blocking while holding mu,
such as when writing to the network, blocks these operations.

Remove blocking operations from the mutex.
Perform network writes with only ClientConn.wmu held.
Clarify that wmu guards the per-conn HPACK encoder and buffer.

Add a new mutex guarding request creation, covering the critical
section starting with allocating a new stream ID and continuing
until the stream is created.

Fix a locking issue where trailers were written from the HPACK
buffer with only wmu held, but headers were encoded into the buffer
with only mu held. (Now both encoding and writes occur with wmu
held.)

Fixes golang/go#32388.
Fixes golang/go#48340.

Change-Id: Ibb313424ed2f32c1aeac4645b76aedf227b597a3
Reviewed-on: https://go-review.googlesource.com/c/net/+/349594
Trust: Damien Neil <dneil@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@golang golang locked and limited conversation to collaborators Sep 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

10 participants