x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely #33425

Closed
prashantv opened this issue Aug 2, 2019 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@prashantv
Contributor

prashantv commented Aug 2, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/prashant/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/prashant/go"
GOPROXY=""
GORACE=""
GOROOT="/home/prashant/.gimme/versions/go1.12.7.linux.amd64"
GOTMPDIR=""
GOTOOLDIR="/home/prashant/.gimme/versions/go1.12.7.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build705930436=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Used the HTTP/2 library to make calls as a client. One of the backends stalled and stopped reading from the TCP connection (although it was still running and the TCP connection was active).

This caused the TCP write buffer on the client to fill up.

What did you expect to see?

I expected that all future calls to the blocked server would eventually time out, and that calls to any other server would not be affected.

What did you see instead?

All calls on the transport are blocked forever, including calls to working backends. Stack traces show the calls are blocked on a mutex, so any timeouts are ignored.

Repro

Repro: https://github.com/prashantv/go-http2-stuck-repro/

The repro sets up 2 "backends":
stuckURL: Accepts a TCP connection and does not read from the connection, to simulate a stuck backend (a minimal sketch follows this list)
normal1URL: Echoes the request body back to the caller.
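
A sketch of such a stuck backend (not the repro's exact code; the address is illustrative): a plain TCP listener that accepts connections and never reads from them, so the caller's TCP write buffer eventually fills and its Write blocks.

package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stuck backend listening on %v", ln.Addr())

	var held []net.Conn // keep references so the connections stay open
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Intentionally never read from conn; just hold it open.
		held = append(held, conn)
	}
}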

  1. A call is made to stuckURL with a payload that never completes. This causes the TCP write buffer to fill up, and the Write to block, while holding the *ClientConn.wmu (write lock on the connection). We sleep a second to make sure the TCP buffer fills up, and the write is blocked.
Stack trace
goroutine 44 [IO wait]:
internal/poll.runtime_pollWait(0x7f74ab342b98, 0x77, 0xffffffffffffffff)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc000144598, 0x77, 0x4000, 0x4800, 0xffffffffffffffff)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitWrite(...)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc000144580, 0xc0001f2000, 0x4009, 0x4800, 0x0, 0x0, 0x0)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/internal/poll/fd_unix.go:276 +0x26e
net.(*netFD).Write(0xc000144580, 0xc0001f2000, 0x4009, 0x4800, 0x3, 0x3, 0xc0001aadaa)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc00013c0a0, 0xc0001f2000, 0x4009, 0x4800, 0x0, 0x0, 0x0)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/net.go:189 +0x69
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.stickyErrWriter.Write(0x93f320, 0xc00013c0a0, 0xc000195058, 0xc0001f2000, 0x4009, 0x4800, 0x4009, 0x0, 0x0)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:372 +0x68
bufio.(*Writer).Write(0xc000130480, 0xc0001f2000, 0x4009, 0x4800, 0xc000034070, 0xc000034000, 0x0)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/bufio/bufio.go:622 +0x145
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Framer).endWrite(0xc0001ea000, 0x0, 0x0)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/frame.go:366 +0xab
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Framer).WriteDataPadded(0xc0001ea000, 0x3, 0xc0001ee000, 0x4000, 0x4000, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/frame.go:685 +0x25d
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Framer).WriteData(...)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/frame.go:643
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*clientStream).writeRequestBody(0xc000154640, 0x7f74ab343070, 0xc000138600, 0x7f74ab343090, 0xc000138600, 0x0, 0x0)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:1270 +0x337
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).getBodyWriterState.func1()
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:2462 +0xc2
created by github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.bodyWriterState.scheduleBodyWrite
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:2509 +0xa3

  2. Another call to stuckURL is made, which blocks in roundTrip trying to acquire the *ClientConn.wmu (since (1) holds this lock), while holding the *ClientConn.mu lock.
Stack trace

goroutine 46 [semacquire]:
sync.runtime_SemacquireMutex(0xc000195054, 0x40de00)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc000195050)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/sync/mutex.go:134 +0x109
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*ClientConn).roundTrip(0xc000194f00, 0xc000204100, 0x0, 0x0, 0x0, 0x0)
/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:1024 +0x467
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt(0xc00013a4e0, 0xc000204100, 0xc000118000, 0xc0001164b0, 0xc00013e03c, 0xc00013e040)
/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:447 +0x172
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTrip(0xc00013a4e0, 0xc000204100, 0xc00013a4e0, 0xbf490e456743e889, 0xe34199ec9)
/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:408 +0x3a
net/http.send(0xc000204000, 0x93f060, 0xc00013a4e0, 0xbf490e456743e889, 0xe34199ec9, 0xc6c560, 0xc00011c050, 0xbf490e456743e889, 0x1, 0x0)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:250 +0x461
net/http.(*Client).send(0xc000134db0, 0xc000204000, 0xbf490e456743e889, 0xe34199ec9, 0xc6c560, 0xc00011c050, 0x0, 0x1, 0xc000020a80)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:174 +0xfb
net/http.(*Client).do(0xc000134db0, 0xc000204000, 0x0, 0x0, 0x0)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:641 +0x279
net/http.(*Client).Do(...)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:509
net/http.(*Client).Get(0xc000134db0, 0xc000136180, 0x16, 0x8b330c, 0xf, 0x93f020)
/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:398 +0x9e
created by github.com/prashantv/go-http2-stuck-repro.TestSendLargeUnreadPayload
/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/repro_test.go:46 +0x402

  3. Another call to stuckURL grabs the connection pool lock, *clientConnPool.mu, then iterates over all connections to that address to check idleState(), which tries to grab *ClientConn.mu. This blocks since (2) is already holding that lock.
Stack trace
goroutine 45 [semacquire]:
sync.runtime_SemacquireMutex(0xc000194f54, 0xc00013e000)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc000194f50)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/sync/mutex.go:134 +0x109
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*ClientConn).idleState(0xc000194f00, 0xc000130000)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:718 +0x41
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn(0xc000134fc0, 0xc000204300, 0xc00013e080, 0xf, 0xc00013a501, 0xc0001a5a18, 0x7c1030, 0xc00013a4e0)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/client_conn_pool.go:89 +0xeb
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*clientConnPool).GetClientConn(0xc000134fc0, 0xc000204300, 0xc00013e080, 0xf, 0xc00013e080, 0xf, 0x2)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/client_conn_pool.go:47 +0x4e
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt(0xc00013a4e0, 0xc000204300, 0xc000118100, 0xc0001165d0, 0xc00013e06c, 0xc00013e070)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:440 +0x105
github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTrip(0xc00013a4e0, 0xc000204300, 0xc00013a4e0, 0xbf490e456749078a, 0xe341ebd66)
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:408 +0x3a
net/http.send(0xc000204200, 0x93f060, 0xc00013a4e0, 0xbf490e456749078a, 0xe341ebd66, 0xc6c560, 0xc00011c080, 0xbf490e456749078a, 0x1, 0x0)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:250 +0x461
net/http.(*Client).send(0xc000134db0, 0xc000204200, 0xbf490e456749078a, 0xe341ebd66, 0xc6c560, 0xc00011c080, 0x0, 0x1, 0xc000020a80)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:174 +0xfb
net/http.(*Client).do(0xc000134db0, 0xc000204200, 0x0, 0x0, 0x0)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:641 +0x279
net/http.(*Client).Do(...)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:509
net/http.(*Client).Get(0xc000134db0, 0xc000136180, 0x16, 0x8b330c, 0xf, 0x93f020)
	/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:398 +0x9e
created by github.com/prashantv/go-http2-stuck-repro.TestSendLargeUnreadPayload
	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/repro_test.go:46 +0x402
  4. Now any calls made using the same *clientConnPool (e.g., any call through the same client/transport) block trying to get *clientConnPool.mu.
Stack trace
3 @ 0x430b8f 0x440d79 0x440d4f 0x440aed 0x4663d9 0x78e6c6 0x78e62e 0x7a8b85 0x7a86ba 0x6af511 0x6aef0b 0x6b0519 0x6b004e 0x6b003b 0x45d481
#	0x440aec	sync.runtime_SemacquireMutex+0x3c										/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sema.go:71
#	0x4663d8	sync.(*Mutex).Lock+0x108											/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/sync/mutex.go:134
#	0x78e6c5	github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn+0x65	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/client_conn_pool.go:87
#	0x78e62d	github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*clientConnPool).GetClientConn+0x4d	/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/client_conn_pool.go:47
#	0x7a8b84	github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt+0x104		/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:440
#	0x7a86b9	github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2.(*Transport).RoundTrip+0x39		/home/prashant/go/src/github.com/prashantv/go-http2-stuck-repro/vendor/golang.org/x/net/http2/transport.go:408
#	0x6af510	net/http.send+0x460												/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:250
#	0x6aef0a	net/http.(*Client).send+0xfa											/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:174
#	0x6b0518	net/http.(*Client).do+0x278											/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:641
#	0x6b004d	net/http.(*Client).Do+0x9d											/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:509
#	0x6b003a	net/http.(*Client).Get+0x8a											/home/prashant/.gimme/versions/go1.12.7.linux.amd64/src/net/http/client.go:398

I've reduced the size of TCP read/write buffers to trigger the issue more quickly.

The repro includes a flag for controlling steps 2/3. Setting --num-stuck-calls to 2 or higher will reproduce the issue (the echo calls will fail), while 0 or 1 will cause the test to pass.

@prashantv prashantv changed the title x/net/http2: Blocked Write on single connection causes all calls on the transport to fail x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely Aug 2, 2019
@katiehockman katiehockman added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 5, 2019
@katiehockman
Contributor

/cc @bradfitz @tombergan

@prashantv
Contributor Author

Any other information I can provide to help speed up any investigation? I've seen some outages in production caused by this issue, and want to help get a fix out.

@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Aug 19, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@prashantv
Contributor Author

Is there anything we can contribute to help with this? Since a single bad backend can deadlock the whole transport, the impact is quite large for us.

@bradfitz
Contributor

Is this a dup of #32388?

@prashantv
Contributor Author

It looks similar to #32388, but I don't see any mention of the *ClientConn.wmu in that bug, and the stacks are different (though that could be because of different versions). The underlying cause of this issue is definitely the wmu.

I took the in-review change and applied it to my repro, and it didn't seem to help. Presumably, more changes are required.

# Ensure the patch is applied:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   vendor/golang.org/x/net/http2/transport.go
        modified:   vendor/golang.org/x/net/http2/transport_test.go

no changes added to commit (use "git add" and/or "git commit -a")
$ git diff | head
diff --git a/vendor/golang.org/x/net/http2/transport.go b/vendor/golang.org/x/net/http2/transport.go
index c0c80d8..70e4f14 100644
--- a/vendor/golang.org/x/net/http2/transport.go
+++ b/vendor/golang.org/x/net/http2/transport.go
@@ -814,23 +814,25 @@ func (cc *ClientConn) Shutdown(ctx context.Context) error {

 func (cc *ClientConn) sendGoAway() error {
        cc.mu.Lock()
-       defer cc.mu.Unlock()
-       cc.wmu.Lock()

# Run the test which hangs with and without the fix:
$ go test . -v -timeout 10s -count 1
Listening on :9999
=== RUN   TestSendLargeUnreadPayload
Making request 0
^CFAIL  github.com/prashantv/go-http2-stuck-repro       4.997s

The deadlock is caused by:

  • *clientStream.writeRequestBody holds the *ClientConn.wmu while writing. When writes are stalled, this lock is held indefinitely.
  • *ClientConn.roundTrip holds the *ClientConn.mu while trying to get the *ClientConn.wmu:
 979   cc.mu.Lock()
[...]
1016   cc.wmu.Lock()
1017   cc.mu.Unlock()
  • *clientConnPool.getClientConn holds the *clientConnPool.mu while checking the idle state for each connection:
 87   p.mu.Lock()
 88   for _, cc := range p.conns[addr] {
 89     if st := cc.idleState(); st.canTakeNewRequest {

And getting the idleState requires the *ClientConn.mu:

 717 func (cc *ClientConn) idleState() clientConnIdleState {
 718   cc.mu.Lock()

So a blocked *ClientConn.wmu causes *ClientConn.mu to be held by the next call while it waits for the wmu, which in turn causes clientConnPool.getClientConn to block as well, blocking the whole transport.
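
To make the chain concrete, here is a minimal, self-contained sketch (stand-in mutexes and goroutines, not the library's actual code) that reproduces the same shape of deadlock:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wmu, mu, poolMu sync.Mutex
	stalledWrite := make(chan struct{}) // stands in for a Write into a full TCP buffer

	// (1) writeRequestBody: holds wmu while the write never completes.
	go func() {
		wmu.Lock()
		<-stalledWrite
		wmu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond)

	// (2) roundTrip: holds mu, then waits for wmu.
	go func() {
		mu.Lock()
		wmu.Lock() // blocks behind (1)
		mu.Unlock()
		wmu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond)

	// (3) getClientConn: holds the pool lock, then calls idleState, which needs mu.
	go func() {
		poolMu.Lock()
		mu.Lock() // blocks behind (2)
		mu.Unlock()
		poolMu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond)

	// (4) Any other request on the same transport now blocks on the pool lock,
	// even if it targets a healthy backend.
	done := make(chan struct{})
	go func() {
		poolMu.Lock()
		poolMu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("not blocked")
	case <-time.After(time.Second):
		fmt.Println("blocked: the whole transport is stuck behind one connection")
	}
}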

@fraenkel
Contributor

The mentioned change doesn't fix anything because it's incomplete. All of these problems are the same, with no good solution, because eventually the write lock becomes the issue that backs up into the mu, given the relationship that exists between them.

@prashantv
Contributor Author

Yeah, I dug through the code and was unable to find any easy way to fix the issue. I think the main issue is that idleState() used by the pool requires *ClientConn.mu, which is involved in the write path. If the state could be maintained as a cached value, that might help but given the number of conditions, it doesn't seem easy.

In this scenario, it'd be nice to know if the lock was held since we can skip any connection with the lock held (probably safe to assume it's not usable right now). To prototype this idea, I used a huge hack that lets me check whether a lock is held, and return a clientConnIdleState{false, false} if so.

+type fakeMutex struct {
+       mu     sync.Mutex
+       locked int32
+}
+
+func (m *fakeMutex) Locked() bool {
+       return atomic.LoadInt32(&m.locked) > 0
+}
+
+func (m *fakeMutex) Lock() {
+       atomic.StoreInt32(&m.locked, 1)
+       m.mu.Lock()
+}
+
+func (m *fakeMutex) Unlock() {
+       atomic.StoreInt32(&m.locked, 0)
+       m.mu.Unlock()
+}
+
 // Transport is an HTTP/2 Transport.
 //
 // A Transport internally caches connections to servers. It is safe
@@ -210,7 +229,7 @@ type ClientConn struct {
        idleTimeout time.Duration // or 0 for never
        idleTimer   *time.Timer

-       mu              sync.Mutex // guards following
+       mu              fakeMutex  // guards following
        cond            *sync.Cond // hold mu; broadcast on flow/closed changes
        flow            flow       // our conn-level flow control quota (cs.flow is per stream)
        inflow          flow       // peer's conn-level flow control
@@ -715,6 +734,10 @@ type clientConnIdleState struct {
 }

 func (cc *ClientConn) idleState() clientConnIdleState {
+       if cc.mu.Locked() {
+               return clientConnIdleState{false, false}
+       }

With this change, my repro above passes every time. A less hacky version of my approach can be done using a channel with size 1 (blocking write to Lock, blocking read to Unlock, non-blocking write to check if it's locked). This would allow the idleState function to return even if the connection is blocked. Thoughts?

@fraenkel
Contributor

@prashantv While your solution fixes idleState, it breaks everything else that relies on the read mutex.
The solution is dictated by the problem we are trying to solve. Almost all the reports are regarding the Write blocking, which is difficult to solve.
I can easily add one more mutex to my code to disconnect the read/write mutex during new connections, and that might solve a majority of the cases. But I know it won't solve the write issue everyone keeps reporting.

@prashantv
Contributor Author

prashantv commented Oct 22, 2019

@prashantv While your solution fixes idleState it breaks everything else that relies on the read mutex.

Can you expand on what's broken by this? The *ClientConn.mu is currently a *sync.Mutex, and my change adds a way for idleState() to check if the mutex is locked or about to get locked before it calls .Lock(). Before accessing data protected by the lock, we still call Lock() which is exactly the same since it still uses a *sync.Mutex to Lock / Unlock.

The only thing that's affected is calls to *ClientConn.idleState, of which there's only one -- from *clientConnPool.getClientConn -- and that's where we need to avoid grabbing the lock.

In fact, I've run the HTTP2 tests multiple times with my change without issues:

$ go test -v -race -count 1  
[..]
PASS
ok      golang.org/x/net/http2  83.645s

Pushed a branch with my change if you want to test it:
https://github.com/prashantv/go-http2-stuck-repro/compare/hack_workaround?expand=1

My change is still a hack, so it doesn't fully solve the problem (it's still possible that *clientConnPool will think that the *ClientConn is safe to lock, but is then beaten to the lock by another goroutine), but it seems to help significantly without breaking other parts of the library.

@fraenkel
Contributor

fraenkel commented Oct 23, 2019

Here is a simple issue with your approach:
if two goroutines both call Lock(), locked is set to 1 and one of them blocks on the mutex. When Unlock() is invoked, the second goroutine proceeds, but locked is now 0 even though the lock is held.

@prashantv
Contributor Author

prashantv commented Oct 23, 2019

if 2 routines both perform a Lock(), locked = 1 and one blocks on the mutex. When Unlock() is invoked, the second routine will execute but locked = 0.

My mistake, I meant to set locked = 1 while holding the lock. Once that's flipped, apart from not fixing the issue 100%, are there other things that are broken? You mentioned "everything that relies on read mutex", not exactly sure what you meant by "read mutex".

I want to understand whether the problem being solved (making idleState() not block trying to get *ClientConn.mu) is the right problem to focus on. The above is still just an example solution, and it suffers from a race since the check for Locked() and Lock() is not atomic. However, that can be solved using a channel-based lock:
https://github.com/prashantv/go-http2-stuck-repro/pull/new/try_mutex

This uses a channel to implement a lock that supports a TryLock method. It's likely a little slower, but I didn't see any noticeable difference when I ran the benchmarks.
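
The channel-based lock would look roughly like this (a sketch of the idea, not the linked branch's exact code; the names are hypothetical):

package main

// chanMutex is a channel-backed lock: Lock is a blocking send into a one-slot
// channel, Unlock a receive, and TryLock a non-blocking send.
type chanMutex struct {
	ch chan struct{}
}

func newChanMutex() *chanMutex { return &chanMutex{ch: make(chan struct{}, 1)} }

func (m *chanMutex) Lock()   { m.ch <- struct{}{} }
func (m *chanMutex) Unlock() { <-m.ch }

// TryLock reports whether the lock was acquired without blocking.
func (m *chanMutex) TryLock() bool {
	select {
	case m.ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	m := newChanMutex()
	if m.TryLock() { // e.g. idleState could skip the connection when this fails
		defer m.Unlock()
		// ... read state guarded by the lock ...
	}
}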

@fraenkel
Contributor

And eventually the channel can be blocked. You can't just short-circuit idleStateLocked; it actually computes the answer, and returning an incorrect value will break other guarantees.
As I already stated, my patch attempted to break apart the read/write mutex and did so up until I hit the one place where control needs to transfer sequentially from the read lock to the write lock. I haven't determined a good way to accomplish this once the write side is blocked.
Any solution must fix the write-blocking issue, which then allows us to disentangle the read/write mutex overlaps.

@prashantv
Contributor Author

prashantv commented Oct 23, 2019

break other guarantees

I assume the guarantee you mean is that we'll end up with more connections than required, since my idleState() implementation assumes that a connection cannot be used if the lock is held. This means that any call started while there's any pending operation on a connection will end up triggering a new connection.

Looking at the change linked to the other issue, I'm not sure if there's a clear path to fixing the underlying locking issues. I can try to investigate that approach, but in the meantime, we're stuck with deadlocks. Is there a short-term workaround? E.g., write deadlines?

A workaround (even if it triggers more connections than necessary) would be preferable to a deadlock in production.
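
To make the write-deadline idea concrete, here is a rough sketch of how it could be wired up (my assumption, not a vetted fix; deadlineConn and the timeout value are hypothetical): wrap the transport's net.Conn so every Write carries a deadline, turning a stuck write into an error instead of an indefinite block.

package main

import (
	"crypto/tls"
	"net"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// deadlineConn applies a per-Write deadline to the underlying connection.
type deadlineConn struct {
	net.Conn
	writeTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if err := c.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}

func newClient() *http.Client {
	t := &http2.Transport{
		// DialTLS lets us substitute the wrapped connection; cfg is the
		// config the transport would otherwise use.
		DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
			conn, err := tls.Dial(network, addr, cfg)
			if err != nil {
				return nil, err
			}
			return &deadlineConn{Conn: conn, writeTimeout: 10 * time.Second}, nil
		},
	}
	return &http.Client{Transport: t}
}

func main() { _ = newClient() }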

@fraenkel
Contributor

fraenkel commented Oct 25, 2019

The only solution I can think of right now is to build your own client pool which tracks outstanding requests per client. Don't share the transport.
You might actually be able to build a Transport that does pooling instead.
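
A hedged sketch of that suggestion (the names are hypothetical): a RoundTripper that keeps a separate http2.Transport per backend host, so a write-blocked connection to one host cannot stall requests to the others.

package main

import (
	"net/http"
	"sync"

	"golang.org/x/net/http2"
)

// perHostRoundTripper gives each backend host its own http2.Transport
// (and therefore its own connection pool).
type perHostRoundTripper struct {
	mu         sync.Mutex
	transports map[string]*http2.Transport
}

func (rt *perHostRoundTripper) transportFor(host string) *http2.Transport {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	if rt.transports == nil {
		rt.transports = make(map[string]*http2.Transport)
	}
	t, ok := rt.transports[host]
	if !ok {
		t = &http2.Transport{}
		rt.transports[host] = t
	}
	return t
}

func (rt *perHostRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// A stuck connection only blocks the pool for its own host.
	return rt.transportFor(req.URL.Host).RoundTrip(req)
}

func main() {
	client := &http.Client{Transport: &perHostRoundTripper{}}
	_ = client
}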

@jaricftw

jaricftw commented Jun 3, 2020

Alternatively, how do we like the idea of a per-backend/address lock, instead of a shared lock for the entire client connection pool? A proof of concept: https://github.com/jaricftw/go-http2-stuck-transport/compare/client-conn-pool?expand=1, which ensures a single stuck backend won't block requests to other backends.

@fraenkel
Contributor

fraenkel commented Jun 3, 2020

@jaricftw That doesn't fix the actual issue. While there are multiple logical client connections, we only have one physical connection. The data being protected is about all clients, not individual ones.

Please see #32388. I submitted 2 patches which should resolve this issue.

harshavardhana added a commit to harshavardhana/minio that referenced this issue Feb 11, 2021

due to lots of issues with x/net/http2, as
well as the bundled h2_bundle.go in the go
runtime should be avoided for now.

golang/go#23559
golang/go#42534
golang/go#43989
golang/go#33425
golang/go#29246

With collection of such issues present, it
make sense to remove HTTP2 support for now

harshavardhana added a commit to harshavardhana/minio that referenced this issue Feb 11, 2021
harshavardhana added a commit to minio/minio that referenced this issue Feb 11, 2021
poornas added a commit to poornas/minio that referenced this issue Feb 16, 2021
harshavardhana added a commit to minio/minio that referenced this issue Feb 17, 2021
kerneltime pushed a commit to kerneltime/minio that referenced this issue Feb 18, 2021
harshavardhana pushed a commit to minio/minio that referenced this issue Feb 18, 2021
@neild
Contributor

neild commented Oct 13, 2021

I believe the problem of new requests being blocked by stuck network writes in old ones should have been fixed by CL 353390 and CL 349594. (And maybe a couple other CLs in that general range.)

A bug in ping checking could result in a permanently write-blocked connection never being flagged as dead. CL 354389 should have fixed that.

There's a remaining case where Response.Body.Close can block indefinitely on a write-blocked connection. That's tracked in #48908.

It might also be nice to be able to detect a write-blocked connection independently of a ping timeout. That's #48830.

I think the cases described in this issue are now all either fixed or covered by more specific issues, so I'm going to close this one. Let me know if there's anything I'm missing.
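
For anyone arriving here on a newer x/net/http2, the ping-based health check referenced above is opt-in on the client side; a minimal sketch (the timeout values are illustrative, not recommendations):

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t := &http2.Transport{
		ReadIdleTimeout: 15 * time.Second, // send a ping if no frames arrive for this long
		PingTimeout:     5 * time.Second,  // close the connection if the ping gets no reply
	}
	client := &http.Client{Transport: t}
	_ = client
}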

@neild neild closed this as completed Oct 13, 2021
@golang golang locked and limited conversation to collaborators Oct 13, 2022