x/net/http2, net/http: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time #23559

Closed
gwik opened this issue Jan 25, 2018 · 19 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gwik
Contributor

gwik commented Jan 25, 2018

What version of Go are you using (go version)?

go version go1.9.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/<redacted>/Dev/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cd/_7rcv5812531s2lswhn6kp680000gp/T/go-build436968147=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

Also happened on linux/amd64, see below.

What did you do?

A bug in production showed an http2 client hanging for more than 250 seconds, ignoring the request context, which was set to time out after 5 seconds.
All possible timeouts are set on the connection and TLS handshake, and I didn't see many dials (they are monitored).
Latency graph in ms (screenshot).

Clients are using net/http2.Transport directly.

What did you expect to see?

The requests should have timed out after 5s.

What did you see instead?

No timeout, or a very long one (I believe the server eventually reset the connection or TCP timed out).

The synchronous write cc.writeHeaders to the connection in ClientConn.roundTrip does not set any deadline on the connection, so it can block forever (or until TCP times it out) if the server or network hangs:
https://github.com/golang/net/blob/0ed95abb35c445290478a5348a7b38bb154135fd/http2/transport.go#L833

I wrote a test that demonstrates this:
gwik/net@e4c191a

// TestTransportTimeoutServerHangs demonstrates that the client can hang forever
// when the server hangs and the headers exceed the conn buffer size (forcing a
// flush), regardless of the context's deadline.
func TestTransportTimeoutServerHangs(t *testing.T) {
	clientDone := make(chan struct{})
	ct := newClientTester(t)
	ct.client = func() error {
		defer ct.cc.(*net.TCPConn).CloseWrite()
		defer close(clientDone)

		buf := make([]byte, 1<<19)
		_, err := rand.Read(buf)
		if err != nil {
			return fmt.Errorf("failed to generate random data: %v", err)
		}
		headerVal := hex.EncodeToString(buf)

		req, err := http.NewRequest("PUT", "https://dummy.tld/", nil)
		if err != nil {
			return err
		}

		ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
		defer cancel()
		req = req.WithContext(ctx)
		req.Header.Add("Authorization", headerVal)
		_, err = ct.tr.RoundTrip(req)
		if err == nil {
			return errors.New("error should not be nil")
		}
		if ne, ok := err.(net.Error); !ok || !ne.Timeout() {
			return fmt.Errorf("error should be a net error timeout, was: %+v", err)
		}
		return nil
	}
	ct.server = func() error {
		ct.greet()
		select {
		case <-time.After(5 * time.Second):
		case <-clientDone:
		}
		return nil
	}
	ct.run()
}

Setting the write deadline fixes the test:
gwik/net@052de95
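
For illustration only, here is a minimal sketch of the idea behind that commit, assuming a plain net.Conn and the request's context (the helper name is made up and is not part of the patch): derive the write deadline from the context and clear it once the write returns.

// writeWithDeadline is an illustrative helper, not the actual patch: it applies
// the context's deadline (if any) as a write deadline on conn, performs the
// write, then clears the deadline so later writes are not affected.
func writeWithDeadline(ctx context.Context, conn net.Conn, p []byte) (int, error) {
	if d, ok := ctx.Deadline(); ok {
		if err := conn.SetWriteDeadline(d); err != nil {
			return 0, err
		}
		defer conn.SetWriteDeadline(time.Time{}) // clear the deadline afterwards
	}
	return conn.Write(p)
}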

It seems to fail when the header value exceeds 1MB. I might be missing something here: the default buffer size of bufio.Writer is 4096 bytes, so I had expected it to fail around that value; maybe compression and/or TCP buffers explain the difference.
Also, I don't think it sent 1MB of headers when it failed in production; something else must have filled the buffer.

The buffer is on the connection and shared among streams, so it can be filled by other requests on the same connection.

Besides this particular call, which is synchronous, no write or read to the connection has a deadline set. Can't this lead to goroutine leaks and http2 streams being stuck in the background?

@odeke-em odeke-em changed the title x/net/http2: client can hang forever x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time Jan 26, 2018
@odeke-em
Member

/cc @bradfitz @tombergan

@gwik
Contributor Author

gwik commented Jan 26, 2018

Working on a patch to fix this, I realized that using the context deadline to time out the I/O calls may not be appropriate.
Timing out the I/O might leave the framing protocol in an inconsistent state, and the only safe option would be to close the whole connection. That is probably not what the caller intends when the context deadline expires. It would work if the deadline were set on the context only for this purpose, just before making the request, but that's usually not the case.
I think there should be an I/O timeout option on the transport for the purpose of timing out the I/O calls: every I/O call would set the deadline to now + timeout and reset it after the call to Read or Write.
I'm going to implement this for now. Let me know what you think about it.
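
For illustration, a minimal wrapper along those lines, purely a sketch (the type name is my own and the SetReadDeadline/SetWriteDeadline errors are ignored for brevity; this is not the actual CL):

// ioTimeoutConn sketches the proposed transport-level I/O timeout: every Read
// and Write arms a deadline of now+timeout and clears it when the call returns.
type ioTimeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *ioTimeoutConn) Read(p []byte) (int, error) {
	c.Conn.SetReadDeadline(time.Now().Add(c.timeout))
	n, err := c.Conn.Read(p)
	c.Conn.SetReadDeadline(time.Time{}) // reset after the call
	return n, err
}

func (c *ioTimeoutConn) Write(p []byte) (int, error) {
	c.Conn.SetWriteDeadline(time.Now().Add(c.timeout))
	n, err := c.Conn.Write(p)
	c.Conn.SetWriteDeadline(time.Time{}) // reset after the call
	return n, err
}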

gwik added a commit to gwik/net that referenced this issue Feb 3, 2018
Addresses the transport hanging on blocking I/O. There are many scenarios where
the roundtrip hangs on write or read and won't be unblocked by the current
cancelation mechanisms (context, Request.Cancel, ...).

This adds read and write deadline support.
The writer disables the read deadline and enables the write deadline; after the
write succeeds, it disables the write deadline and re-enables the read
deadline.
The read loop also sets its read deadline after a successful read since the next
frame is not predictable.
It guarantees that an I/O call will not time out before IOTimeout, and that a
completely blocked call will time out after at least IOTimeout.

See issue: golang/go#23559
gwik added a commit to gwik/net that referenced this issue Feb 3, 2018
@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 3, 2018
@bradfitz bradfitz added this to the Go1.11 milestone Feb 3, 2018
gwik added a commit to gwik/net that referenced this issue Mar 9, 2018
ikenchina referenced this issue in gwik/net May 14, 2018
The transport can hang forever if the server hangs after accepting the
connection and the request headers (or control frames) exceed the connection
write buffer, regardless of the context's deadline.

The ClientConn writes to the connection without setting a write deadline and thus
blocks forever if the buffer is flushed to the socket.

go test -v -run TestTransportTimeoutServerHangs ./http2
@bradfitz
Contributor

Any update here?

Keep in mind that there are 3 goroutines per Transport connection to a server:

  • the writer (which might block forever)
  • the reader (blocked by default, waiting for frames from server)
  • the control one, which never blocks, and owns all state

I think the behavior we want is:

  • if the user said 5 seconds timeout on a request, the control channel should notice and return an error to the user's RoundTrip. I'd be surprised if this part isn't already working?
  • don't accept new streams (new RoundTrips) on that ClientConn if the conn has been stuck in a write for "too long", for some definition TBD
  • if the write appears totally hung, nuke the whole connection and all streams still attached to it
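
A rough sketch of that last idea, assuming nothing beyond a plain net.Conn (the helper and parameter names are made up): close the connection if a single write stays blocked past some threshold, which unblocks the write with an error and fails every stream on the connection.

// writeOrNuke is illustrative only: if the write is still blocked after
// stuckTimeout, the timer closes the underlying conn, which forces the
// blocked Write to return an error.
func writeOrNuke(conn net.Conn, p []byte, stuckTimeout time.Duration) (int, error) {
	timer := time.AfterFunc(stuckTimeout, func() { conn.Close() })
	defer timer.Stop() // no-op if the timer already fired
	return conn.Write(p)
}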

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned Jul 12, 2018
@gwik
Contributor Author

gwik commented Jul 16, 2018

if the user said 5 seconds timeout on a request, the control channel should notice and return an error to the user's RoundTrip. I'd be surprised if this part isn't already working?

There are cases where this isn't working. I believe the graph and tests demonstrate it.

don't accept new streams (new RoundTrips) on that ClientConn if the conn has been stuck in a write for "too long", for some definition TBD

Unfortunately, the writes almost never time out. I don't know if this is specific to my particular use case or if buffering means it takes a very long time to happen.

I run this CL in production for a very specific service that only makes requests to APNS and Firebase push gateways.

The patch helps recover from stuck connections and kills them, but it's slow to converge, is incomplete, and may not work properly for the general use case.

What happens is that there are situations where the APNS gateway blocks the connection; it behaves like a firewall that drops packets and never resets the connection. Most of the time the requests time out correctly: the context unblocks the client stream goroutine and returns to the caller. However, the connection read loop never times out nor receives any other error, causing all requests to time out. (Also note that the APNS gateway doesn't allow pings.)

What I see most often is that the writes are not timing out. I believe the only way to guarantee timing out the connection is to have a way to unblock the read loop. The reads (ClientConn) are asynchronous to the writes (ClientStream), so we have to find a way to set a deadline on the reads that follow the writes (requests), or that follow a read (e.g. reading the rest of the body), from the client stream side.

The CL takes a simple approach and avoids synchronization between the client stream and the connection read loop. It resets the read deadline and sets a write deadline before every write from the client stream. After the write is done, it sets the read deadline.
This kind of works: I see that connections are no longer stuck forever, but it is slow to time out the connection, because resetting the read deadline before every write continuously pushes the deadline back.
Also, there is no read timeout when waiting for subsequent body reads, for example.

I wanted to experiment with the following algorithm (a rough sketch follows):

Upon waiting for something from the network on the ClientStream / caller side, push a deadline into a heap data structure (probably under the ClientConn.wmu lock), keeping track of the stream that pushed it.
If there was no deadline, unblock the read loop by setting one in the past, and set the deadline before the next read.
Before every read, look at the root of the deadline heap and set it.
After every read, remove the deadline of the stream that has successfully completed the read. We need some sort of reverse index to find the entry in the heap and remove it (maybe every client stream can track its unique read deadline entry in the heap).
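
A possible shape for that data structure, assuming nothing about the real transport internals (all names are illustrative, locking is simplified to a single mutex around the heap, and it uses container/heap, sync and time):

// streamDeadline is one waiting stream's read deadline.
type streamDeadline struct {
	streamID uint32
	when     time.Time
	index    int // position in the heap, maintained by Swap/Push/Pop
}

// deadlineHeap orders pending stream deadlines by time and keeps a reverse
// index so a stream's entry can be removed once its read completes.
type deadlineHeap struct {
	mu      sync.Mutex
	entries []*streamDeadline
	byID    map[uint32]*streamDeadline // reverse index: stream ID -> entry
}

// container/heap plumbing, ordered by earliest deadline first.
func (h *deadlineHeap) Len() int           { return len(h.entries) }
func (h *deadlineHeap) Less(i, j int) bool { return h.entries[i].when.Before(h.entries[j].when) }
func (h *deadlineHeap) Swap(i, j int) {
	h.entries[i], h.entries[j] = h.entries[j], h.entries[i]
	h.entries[i].index, h.entries[j].index = i, j
}
func (h *deadlineHeap) Push(x interface{}) {
	e := x.(*streamDeadline)
	e.index = len(h.entries)
	h.entries = append(h.entries, e)
}
func (h *deadlineHeap) Pop() interface{} {
	e := h.entries[len(h.entries)-1]
	h.entries = h.entries[:len(h.entries)-1]
	return e
}

// add registers a deadline for a stream before it starts waiting on the network.
func (h *deadlineHeap) add(id uint32, when time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.byID == nil {
		h.byID = make(map[uint32]*streamDeadline)
	}
	e := &streamDeadline{streamID: id, when: when}
	h.byID[id] = e
	heap.Push(h, e)
}

// earliest reports the deadline the read loop should arm before its next Read.
func (h *deadlineHeap) earliest() (time.Time, bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if len(h.entries) == 0 {
		return time.Time{}, false
	}
	return h.entries[0].when, true
}

// remove drops a stream's deadline once its read has successfully completed.
func (h *deadlineHeap) remove(id uint32) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if e, ok := h.byID[id]; ok {
		heap.Remove(h, e.index)
		delete(h.byID, id)
	}
}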

What do you think ?

@stevenh
Contributor

stevenh commented May 28, 2019

This is actually worse than just a single connection hanging; it can cause the entire app to hang.

Both cc.bw.Flush() and cc.fr.WriteHeaders(...), called within cc.writeHeaders(...), block while holding cc.mu, which is required by cc.idleState(), called by p.getClientConn() while holding the pool lock.

This means a single connection can block all new requests, potentially forever.
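
A stripped-down sketch of that lock chain (the variable and function names here are illustrative, not the real net/http internals):

// A single blocked write stalls every new request: the writer holds the
// connection lock while blocked on the network, a new request takes the pool
// lock and then waits for the connection lock, and everyone else queues up
// behind the pool lock.
var (
	poolMu sync.Mutex // guards the connection pool
	connMu sync.Mutex // guards one ClientConn's state
)

// Writer: holds connMu while flushing headers to a hung peer.
func writeHeaders(conn net.Conn, frame []byte) error {
	connMu.Lock()
	defer connMu.Unlock()
	_, err := conn.Write(frame) // may block indefinitely: no deadline set
	return err
}

// New request: takes poolMu, then needs connMu to ask "can this conn take a
// new request?". While the write above is blocked it waits here, still holding
// poolMu, so every other request blocks on poolMu too.
func canTakeNewRequest() bool {
	poolMu.Lock()
	defer poolMu.Unlock()
	connMu.Lock()
	defer connMu.Unlock()
	return true // the real code inspects stream counts here
}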

@gwik
Contributor Author

gwik commented May 29, 2019

@stevenh Could you please share a bit more context? What does your app do, what kind of services does it talk to, and when does this kind of behavior happen?

@stevenh
Contributor

stevenh commented May 29, 2019

I'd have to defer to a colleague to answer that, but the high-level basics are: it's making http2 calls and the entire app hung due to this.

The relevant stacks from the hung app are:

-- goroutine blocked in writeHeaders -> Flush
1 @ 0x43bddc 0x436eea 0x4364e7 0x4a201e 0x4a210d 0x4a3a04 0x577812 0x589dbd 0x5d1887 0x5d1da0 0x5d2cbb 0x662cea 0x6ad1d4 0x52bdce 0x6678dc 0x66680b 0x663215 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637
#    0x4364e6    internal/poll.runtime_pollWait+0x56                                                    /usr/local/go/src/runtime/netpoll.go:173
#    0x4a201d    internal/poll.(*pollDesc).wait+0xad                                                    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#    0x4a210c    internal/poll.(*pollDesc).waitWrite+0x3c                                                /usr/local/go/src/internal/poll/fd_poll_runtime.go:94
#    0x4a3a03    internal/poll.(*FD).Write+0x243                                                        /usr/local/go/src/internal/poll/fd_unix.go:227
#    0x577811    net.(*netFD).Write+0x51                                                            /usr/local/go/src/net/fd_unix.go:220
#    0x589dbc    net.(*conn).Write+0x6c                                                            /usr/local/go/src/net/net.go:188
#    0x5d1886    crypto/tls.(*Conn).write+0x1a6                                                        /usr/local/go/src/crypto/tls/conn.go:832
#    0x5d1d9f    crypto/tls.(*Conn).writeRecordLocked+0x35f                                                /usr/local/go/src/crypto/tls/conn.go:909
#    0x5d2cba    crypto/tls.(*Conn).Write+0x1ba                                                        /usr/local/go/src/crypto/tls/conn.go:1068
#    0x662ce9    net/http.http2stickyErrWriter.Write+0x79                                                /usr/local/go/src/net/http/h2_bundle.go:6803
#    0x6ad1d3    net/http.(*http2stickyErrWriter).Write+0x73                                                <autogenerated>:1
#    0x52bdcd    bufio.(*Writer).Flush+0x7d                                                        /usr/local/go/src/bufio/bufio.go:567
#    0x6678db    net/http.(*http2ClientConn).writeHeaders+0x1eb                                                /usr/local/go/src/net/http/h2_bundle.go:7436
#    0x66680a    net/http.(*http2ClientConn).RoundTrip+0x46a                                                /usr/local/go/src/net/http/h2_bundle.go:7289
#    0x663214    net/http.(*http2Transport).RoundTripOpt+0x144                                                /usr/local/go/src/net/http/h2_bundle.go:6858
#    0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                                /usr/local/go/src/net/http/h2_bundle.go:6820
#    0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                            /usr/local/go/src/net/http/h2_bundle.go:990
#    0x690a16    net/http.(*Transport).RoundTrip+0xd46                                                    /usr/local/go/src/net/http/transport.go:371
#    0x63b818    net/http.send+0x1a8                                                            /usr/local/go/src/net/http/client.go:249
#    0x63b4ac    net/http.(*Client).send+0xfc                                                        /usr/local/go/src/net/http/client.go:173
...

-- goroutine waiting for connection lock, held above, while holding the pool lock
1 @ 0x43bddc 0x43bece 0x44d2c4 0x44cfdd 0x48115e 0x66546f 0x6457d8 0x646dce 0x6631c7 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637 0x46b1f1
#    0x44cfdc    sync.runtime_SemacquireMutex+0x3c                                                    /usr/local/go/src/runtime/sema.go:71
#    0x48115d    sync.(*Mutex).Lock+0xed                                                            /usr/local/go/src/sync/mutex.go:134
#    0x66546e    net/http.(*http2ClientConn).CanTakeNewRequest+0x3e                                            /usr/local/go/src/net/http/h2_bundle.go:7083
#    0x6457d7    net/http.(*http2clientConnPool).getClientConn+0xe7                                            /usr/local/go/src/net/http/h2_bundle.go:739
#    0x646dcd    net/http.http2noDialClientConnPool.GetClientConn+0x4d                                            /usr/local/go/src/net/http/h2_bundle.go:927
#    0x6631c6    net/http.(*http2Transport).RoundTripOpt+0xf6                                                /usr/local/go/src/net/http/h2_bundle.go:6852
#    0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                                /usr/local/go/src/net/http/h2_bundle.go:6820
#    0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                            /usr/local/go/src/net/http/h2_bundle.go:990
#    0x690a16    net/http.(*Transport).RoundTrip+0xd46
...

-- goroutines blocked waiting for the pool lock held above
13033 @ 0x43bddc 0x43bece 0x44d2c4 0x44cfdd 0x48115e 0x645754 0x646dce 0x6631c7 0x662dda 0x6476b9 0x690a17 0x63b819 0x63b4ad 0x63cb4d 0xb1d000 0xe641df 0xe6213e 0xe8245c 0xe7f42e 0xe7ad28 0xe61e87 0xf598ed 0xf59637 0x46b1f1
#   0x44cfdc    sync.runtime_SemacquireMutex+0x3c                                                   /usr/local/go/src/runtime/sema.go:71
#   0x48115d    sync.(*Mutex).Lock+0xed                                                         /usr/local/go/src/sync/mutex.go:134
#   0x645753    net/http.(*http2clientConnPool).getClientConn+0x63                                          /usr/local/go/src/net/http/h2_bundle.go:737
#   0x646dcd    net/http.http2noDialClientConnPool.GetClientConn+0x4d                                           /usr/local/go/src/net/http/h2_bundle.go:927
#   0x6631c6    net/http.(*http2Transport).RoundTripOpt+0xf6                                                /usr/local/go/src/net/http/h2_bundle.go:6852
#   0x662dd9    net/http.(*http2Transport).RoundTrip+0x39                                               /usr/local/go/src/net/http/h2_bundle.go:6820
#   0x6476b8    net/http.http2noDialH2RoundTripper.RoundTrip+0x38                                           /usr/local/go/src/net/http/h2_bundle.go:990
#   0x690a16    net/http.(*Transport).RoundTrip+0xd46                                                   /usr/local/go/src/net/http/transport.go:371
#   0x63b818    net/http.send+0x1a8                                                         /usr/local/go/src/net/http/client.go:249
#   0x63b4ac    net/http.(*Client).send+0xfc                                                        /usr/local/go/src/net/http/client.go:173
#   0x63cb4c    net/http.(*Client).Do+0x28c                                                     /usr/local/go/src/net/http/client.go:602

@evanj
Contributor

evanj commented Jul 28, 2019

I believe we just hit this same issue, or an extremely similar one, with go1.12.7. I have thread stacks from the "blocked" threads from kill -ABRT below. To be clear: the caller of http.Client.Do is stuck in http2ClientConn.roundTrip on cc.wmu.Lock() so it can call cc.writeHeaders. The holder of that lock is scheduleBodyWrite/http2clientStream.writeRequestBody, which is doing an actual network write that is hung.

As a result: Client.Do ignores any context Deadlines in this code path.

Possible fixes:

  • The RoundTrip thread can't try to lock cc.wmu since it can get stuck on network I/O, so writing headers needs to be done on a separate goroutine so it can use something like awaitRequestCancel.
  • Don't attempt to reuse HTTP/2 connections while another request is using one to send

Details

Stripe Veneur periodically sends a set of HTTP2 POSTs and waits for them to complete. When it is getting CPU overloaded, we have observed these calls to Client.Do getting "stuck" for ~10 minutes, even though the HTTP requests set a 10 second deadline. I think the following is happening:

  1. One thread calls http.Client.Do, but then times out waiting for the response. However, it has already called scheduleBodyWrite.
  2. scheduleBodyWrite's goroutine holds cc.wmu while trying to send the POST body. Because the system is overloaded, this is now stuck waiting for some sort of TCP timer.
  3. Another thread starts http.Client.Do. It gets the same connection from the pool, since it is an HTTP2 connection that supports multiple streams.
  4. This eventually calls:
    cc.wmu.Lock()
    endStream := !hasBody && !hasTrailers
    werr := cc.writeHeaders(cs.ID, endStream, int(cc.maxFrameSize), hdrs)
    cc.wmu.Unlock()
  5. The client gets stuck waiting on cc.wmu.Lock(), and will wait until the I/O operation completes, ignoring the context

Stacks

The first thread is the writer, holding cc.wmu. The second is the OTHER writer, blocked waiting on cc.wmu. The third is the read loop, included for completeness (this lets us see that the http2Framer belongs to the connection, so these are in fact the threads involved in this issue).

internal/poll.runtime_pollWait(0x7f2818062bb0, 0x77, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc003c06298, 0x77, 0x100, 0x19f, 0xffffffffffffffff)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitWrite(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc003c06280, 0xc004ed7000, 0x401d, 0x401d, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:276 +0x26e
net.(*netFD).Write(0xc003c06280, 0xc004ed7000, 0x401d, 0x401d, 0x401d, 0x4018, 0x0)
    /usr/local/go/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc000013460, 0xc004ed7000, 0x401d, 0x401d, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:189 +0x69
crypto/tls.(*Conn).write(0xc005b38e00, 0xc004ed7000, 0x401d, 0x401d, 0xc009306000, 0x4000, 0x40000)
    /usr/local/go/src/crypto/tls/conn.go:896 +0x160
crypto/tls.(*Conn).writeRecordLocked(0xc005b38e00, 0x15f4217, 0xc009306000, 0x10009, 0x40000, 0x3, 0x3, 0xc000c24daa)
    /usr/local/go/src/crypto/tls/conn.go:945 +0x2a1
crypto/tls.(*Conn).Write(0xc005b38e00, 0xc0092fe000, 0x10009, 0x48000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:1114 +0x184
net/http.http2stickyErrWriter.Write(0x7f281c2393f0, 0xc005b38e00, 0xc00427bad8, 0xc0092fe000, 0x10009, 0x48000, 0x1, 0xc01f3b20c0, 0xc000c24d40)
    /usr/local/go/src/net/http/h2_bundle.go:6792 +0x68
bufio.(*Writer).Write(0xc004686480, 0xc0092fe000, 0x10009, 0x48000, 0xc000052a70, 0xc000052a00, 0x0)
    /usr/local/go/src/bufio/bufio.go:622 +0x145
net/http.(*http2Framer).endWrite(0xc00330fb20, 0x0, 0x0)
    /usr/local/go/src/net/http/h2_bundle.go:1605 +0xab
net/http.(*http2Framer).WriteDataPadded(0xc00330fb20, 0x3bccb, 0xc0084ce000, 0x10000, 0x80000, 0x0, 0x0, 0x0, 0x1525493, 0xc000c24f88)
    /usr/local/go/src/net/http/h2_bundle.go:1927 +0x25d
net/http.(*http2Framer).WriteData(...)
    /usr/local/go/src/net/http/h2_bundle.go:1885
net/http.(*http2clientStream).writeRequestBody(0xc003362000, 0x7f281c239488, 0xc000084d70, 0x7f281c2394a8, 0xc000084d70, 0x0, 0x0)
    /usr/local/go/src/net/http/h2_bundle.go:7690 +0x337
net/http.(*http2Transport).getBodyWriterState.func1()
    /usr/local/go/src/net/http/h2_bundle.go:8878 +0xc2
created by net/http.http2bodyWriterState.scheduleBodyWrite
    /usr/local/go/src/net/http/h2_bundle.go:8925 +0xa3

goroutine 3263511 [semacquire, 10 minutes]:
sync.runtime_SemacquireMutex(0xc00427bad4, 0x40df00)
    /usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc00427bad0)
    /usr/local/go/src/sync/mutex.go:134 +0x109
net/http.(*http2ClientConn).roundTrip(0xc00427b980, 0xc005448300, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/h2_bundle.go:7444 +0x467
net/http.(*http2Transport).RoundTripOpt(0xc002094180, 0xc005448300, 0xc01ea24600, 0x6dedb6, 0x0, 0xc000914240)
    /usr/local/go/src/net/http/h2_bundle.go:6867 +0x159
net/http.(*http2Transport).RoundTrip(...)
    /usr/local/go/src/net/http/h2_bundle.go:6829
net/http.http2noDialH2RoundTripper.RoundTrip(0xc002094180, 0xc005448300, 0xc000838000, 0x5, 0xc0009142c8)
    /usr/local/go/src/net/http/h2_bundle.go:8959 +0x3f
net/http.(*Transport).roundTrip(0xc00038c480, 0xc005448300, 0xc02eee6240, 0xc029bcd9bc, 0xc029bcd9c0)
    /usr/local/go/src/net/http/transport.go:430 +0xe90
net/http.(*Transport).RoundTrip(0xc00038c480, 0xc005448300, 0xc00038c480, 0xbf4650ff70c64558, 0x459b8945e32c2)
    /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc005448200, 0x1791b80, 0xc00038c480, 0xbf4650ff70c64558, 0x459b8945e32c2, 0x23e7c00, 0xc029e7a3a8, 0xbf4650ff70c64558, 0x1, 0x0)
    /usr/local/go/src/net/http/client.go:250 +0x461
net/http.(*Client).send(0xc0003aa660, 0xc005448200, 0xbf4650ff70c64558, 0x459b8945e32c2, 0x23e7c00, 0xc029e7a3a8, 0x0, 0x1, 0x0)
    /usr/local/go/src/net/http/client.go:174 +0xfb
net/http.(*Client).do(0xc0003aa660, 0xc005448200, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:641 +0x279
net/http.(*Client).Do(...)
    /usr/local/go/src/net/http/client.go:509
github.com/stripe/veneur/http.PostHelper(0x17c4ce0, 0xc000492ed0, 0xc0003aa660, 0xc000440400, 0x1515d91, 0x4, 0xc000838000, 0x50, 0x135a860, 0xc000493200, ...)
    /go/src/github.com/stripe/veneur/http/http.go:229 +0xc94
github.com/stripe/veneur/sinks/datadog.(*DatadogMetricSink).flushPart(0xc0001e2240, 0x17c4ce0, 0xc000492ed0, 0xc034f00000, 0x4ebf, 0x4ebf, 0xc00821a850)
    /go/src/github.com/stripe/veneur/sinks/datadog/datadog.go:356 +0x30f
created by github.com/stripe/veneur/sinks/datadog.(*DatadogMetricSink).Flush
    /go/src/github.com/stripe/veneur/sinks/datadog/datadog.go:144 +0x40a

goroutine 1319 [IO wait, 13 minutes]:
internal/poll.runtime_pollWait(0x7f2818062bb0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc003c06298, 0x72, 0x1900, 0x199a, 0xffffffffffffffff)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc003c06280, 0xc003c1af80, 0x199a, 0x199a, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc003c06280, 0xc003c1af80, 0x199a, 0x199a, 0x203003, 0x0, 0x198d)
    /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc000013460, 0xc003c1af80, 0x199a, 0x199a, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:177 +0x69
crypto/tls.(*atLeastReader).Read(0xc00e6373e0, 0xc003c1af80, 0x199a, 0x199a, 0xc0126b6068, 0x2, 0xc0005929e0)
    /usr/local/go/src/crypto/tls/conn.go:761 +0x60
bytes.(*Buffer).ReadFrom(0xc005b39058, 0x178fae0, 0xc00e6373e0, 0x40b805, 0x1397300, 0x14c5ee0)
    /usr/local/go/src/bytes/buffer.go:207 +0xbd
crypto/tls.(*Conn).readFromUntil(0xc005b38e00, 0x1791ae0, 0xc000013460, 0x5, 0xc000013460, 0x1)
    /usr/local/go/src/crypto/tls/conn.go:783 +0xf8
crypto/tls.(*Conn).readRecordOrCCS(0xc005b38e00, 0x15f4200, 0xc005b38f38, 0xc000592d58)
    /usr/local/go/src/crypto/tls/conn.go:590 +0x125
crypto/tls.(*Conn).readRecord(...)
    /usr/local/go/src/crypto/tls/conn.go:558
crypto/tls.(*Conn).Read(0xc005b38e00, 0xc006189000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:1236 +0x137
bufio.(*Reader).Read(0xc0003f3380, 0xc00330fb58, 0x9, 0x9, 0x407e54, 0xc028298480, 0xc000592d58)
    /usr/local/go/src/bufio/bufio.go:223 +0x23e
io.ReadAtLeast(0x178f920, 0xc0003f3380, 0xc00330fb58, 0x9, 0x9, 0x9, 0x178fcc0, 0xc02e520ad0, 0xc000084060)
    /usr/local/go/src/io/io.go:310 +0x88
io.ReadFull(...)
    /usr/local/go/src/io/io.go:329
net/http.http2readFrameHeader(0xc00330fb58, 0x9, 0x9, 0x178f920, 0xc0003f3380, 0x0, 0x0, 0xbf4650d0587ebd97, 0x4598cb66a5319)
    /usr/local/go/src/net/http/h2_bundle.go:1476 +0x88
net/http.(*http2Framer).ReadFrame(0xc00330fb20, 0xc03755a150, 0x0, 0x0, 0x1)
    /usr/local/go/src/net/http/h2_bundle.go:1734 +0xa1
net/http.(*http2clientConnReadLoop).run(0xc000592fb8, 0x15f3a90, 0xc0005427b8)
    /usr/local/go/src/net/http/h2_bundle.go:8102 +0x8f
net/http.(*http2ClientConn).readLoop(0xc00427b980)
    /usr/local/go/src/net/http/h2_bundle.go:8030 +0x76
created by net/http.(*http2Transport).newClientConn
    /usr/local/go/src/net/http/h2_bundle.go:7093 +0x637

@evanj
Contributor

evanj commented Jul 28, 2019

It looks like CL https://go-review.googlesource.com/c/net/+/181457/ mentioned on issue #32388 will probably address my issue.

harshavardhana added a commit to harshavardhana/minio that referenced this issue Feb 11, 2021
due to lots of issues with x/net/http2, as
well as the bundled h2_bundle.go in the go
runtime, HTTP/2 should be avoided for now.

golang/go#23559
golang/go#42534
golang/go#43989
golang/go#33425
golang/go#29246

With a collection of such issues present, it
makes sense to remove HTTP2 support for now
harshavardhana added a commit to harshavardhana/minio that referenced this issue Feb 11, 2021
harshavardhana added a commit to minio/minio that referenced this issue Feb 11, 2021
poornas added a commit to poornas/minio that referenced this issue Feb 16, 2021
harshavardhana added a commit to minio/minio that referenced this issue Feb 17, 2021
kerneltime pushed a commit to kerneltime/minio that referenced this issue Feb 18, 2021
harshavardhana pushed a commit to minio/minio that referenced this issue Feb 18, 2021
@TyeMcQueen

So we have a total deadlock scenario that hasn't been fixed for years now. This regularly breaks hundreds of requests for us: new requests coming in just sit behind the lock and never get serviced, so we kill the whole process and abandon any in-progress requests in it in order to stop creating even more locked requests.

I could quickly fix this if sync.Mutex supported TryLock() but that request (only a few lines of code) was rejected because a super strong use case for having it was required.

Perhaps being able to fix a very serious bug that hasn't been fixed for years is a strong enough use case? (Or should we argue about whether the connection mutex should be replaced with a channel instead?)

(TryLock proposal: #6123)

So, if there is a chance this would be accepted, I'm happy to submit a code change proposal that adds sync.Mutex.TryLock() and then uses it to not deadlock when looking for a connection to reuse (by not considering connections that are currently locked and so really can't be reused with any guarantee of timeliness anyway).

@TyeMcQueen

FWIW, I concur with the quibble about the name and would be happy to call it LockIfNoWait().

@king40or

king40or commented Mar 1, 2021

@TyeMcQueen what is this pertaining to? I'm dealing with a bug on my PC and I'm not exactly sure how to apply the correct fix

@TyeMcQueen

TyeMcQueen commented Mar 2, 2021

FWIW, I am currently testing my 14-line fix for this problem (but it is not my first priority so that testing will have to be done as I can find time over the next while).

+++ mutex.go	2021-03-01 15:50:28.000000000 -0800
@@ -81,6 +81,18 @@
 	m.lockSlow()
 }

+// LockIfNoWait either locks m immediately if it is not in use, returning
+// true, or, when a lock on m is already held, immediately returns false.
+func (m *Mutex) LockIfNoWait() bool {
+	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
+		if race.Enabled {
+			race.Acquire(unsafe.Pointer(m))
+		}
+		return true
+	}
+	return false
+}
+
 	func (m *Mutex) lockSlow() {

--- h2_bundle.go-orig	2020-02-25 10:31:46.000000000 -0800
+++ h2_bundle.go	2021-03-01 15:54:39.000000000 -0800
@@ -829,7 +829,7 @@
 func (p *http2clientConnPool) addConnIfNeeded(key string, t *http2Transport, c *tls.Conn) (used bool, err error) {
 	p.mu.Lock()
 	for _, cc := range p.conns[key] {
-		if cc.CanTakeNewRequest() {
+		if cc.CanTakeNewRequestNow() {
 			p.mu.Unlock()
 			return false, nil
 		}
@@ -7208,6 +7208,17 @@
 	return cc.canTakeNewRequestLocked()
 }

+// CanTakeNewRequestNow reports whether the connection can take a new request,
+// like CanTakeNewRequest(), except if connection's mutex is currently locked,
+// it immediately returns false rather than waiting to obtain the lock.
+func (cc *http2ClientConn) CanTakeNewRequestNow() bool {
+	if ! cc.mu.LockIfNoWait() {
+		return false
+	}
+	defer cc.mu.Unlock()
+	return cc.canTakeNewRequestLocked()
+}
+
 // clientConnIdleState describes the suitability of a client
 // connection to initiate a new RoundTrip request.
 type http2clientConnIdleState struct {

@TyeMcQueen

I have found two other places where cc.mu.Lock() can be called while p.mu.Lock() is in effect and will apply similar fixes. Though I don't have evidence of those particular flows causing lock-up in the wild. The logic is still sound for all 3 cases: Locked connections should just be skipped until the next time.

@gopherbot

Change https://golang.org/cl/347299 mentions this issue: http2: handle request cancelation when writing headers

@neild
Contributor

neild commented Sep 14, 2021

I don't believe the TryLock approach is the right fix here.

In the usual case where the server is responsive and we do not exceed the server per-connection stream limit, we want to create a single HTTP/2 connection and use it for all requests. Creating a new connection whenever we are unable to immediately lock an existing connection means that we will create new connections much more aggressively. If, for example, we make ten simultaneous requests, we will be very likely to create several new connections as all the requests contend for the connection lock.

The TryLock-based patch above also completely breaks the (http2.Transport).StrictMaxConcurrentStreams configuration knob.

Possibly we should have a way to create multiple HTTP/2 connections to a single server even before we exceed the stream limit on a single connection, but if so, this should be done more deliberately than creating new connections every time two requests briefly contend on a mutex.

The right fix to avoid connection pool deadlocks (#32388) is to avoid holding the connection mutex indefinitely during blocking operations, such as header writes. I have a proposed fix for that in CL 349594.

The right fix to avoid requests hanging past the request deadline when writing headers blocks (this bug, as distinct from #32388) is to watch for cancelation while writing. Unfortunately, this means we need to move header writes to a separate goroutine, since there is no simple way to bound a write to a net.Conn with a context. I have a change which does this in CL 347299.
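
For illustration, the general pattern behind that second change, sketched on a bare net.Conn (the function name is made up; this is not the code in CL 347299): do the write on its own goroutine and select on the context, accepting that an abandoned write leaves the connection unusable.

// writeHeadersCtx is a sketch: the blocking write runs on its own goroutine so
// the caller can give up as soon as the context is canceled. If we return
// early the connection must be closed or retired, since the write may still
// complete (or hang) in the background.
func writeHeadersCtx(ctx context.Context, conn net.Conn, hdrs []byte) error {
	errc := make(chan error, 1) // buffered so the goroutine never leaks blocked on send
	go func() {
		_, err := conn.Write(hdrs)
		errc <- err
	}()
	select {
	case err := <-errc:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}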

@gopherbot

Change https://golang.org/cl/353870 mentions this issue: all: update golang.org/x/net to pull in CL 353390

@neild
Contributor

neild commented Oct 6, 2021

@gopherbot please open backport issues.

@gopherbot

Backport issue(s) opened: #48822 (for 1.16), #48823 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 27, 2021
@dmitshur dmitshur modified the milestones: Unplanned, Go1.18 Oct 27, 2021
@dmitshur dmitshur changed the title x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time x/net/http2, net/http: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time Oct 27, 2021
@golang golang locked and limited conversation to collaborators Oct 27, 2022