
net/http: goroutine persistConn.roundTrip gets stuck when HTTP call times out #66193

Closed
rschone opened this issue Mar 8, 2024 · 5 comments
Labels: NeedsInvestigation, WaitingForInfo
Milestone: Backlog

Comments


rschone commented Mar 8, 2024

Go version

go1.22.1

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/XXX/Library/Caches/go-build'
GOENV='/Users/XXX/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/XXX/go/pkg/mod'
GONOPROXY=''
GONOSUMDB='github.com/YYY/*'
GOOS='darwin'
GOPATH='/Users/XXX/go'
GOPRIVATE=''
GOPROXY='go-proxy.oss.wandera.net'
GOROOT='/opt/homebrew/opt/go/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.1'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/gy/lwh57s5173s3m4058f2kqw_h0000gp/T/go-build1021772654=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

During load testing, thousands of HTTP requests were generated by the Go HTTP client. The server couldn't keep up with the load and requests started timing out.

What did you see happen?

We added a simple atomic counter to track the progress of processed requests and realized that when the load is too high, some client goroutines never finish, even though a context with a deadline is used (a rough sketch of this instrumentation is shown after the stack trace below). A dump from the client showed the following stuck stack traces:

#	0x7379b8	net/http.(*persistConn).roundTrip+0x978								/usr/local/go/src/net/http/transport.go:2675
#	0x72b859	net/http.(*Transport).roundTrip+0x799								/usr/local/go/src/net/http/transport.go:608
#	0x70f6b2	net/http.(*Transport).RoundTrip+0x12								/usr/local/go/src/net/http/roundtrip.go:17
#	0x1fafeb3	github.com/opentracing-contrib/go-stdlib/nethttp.(*Transport).RoundTrip+0x253
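
For reference, a minimal sketch of the kind of instrumentation described above: an atomic in-flight counter wrapped around the real transport, plus the net/http/pprof endpoint from which a goroutine dump like the one above can be taken. The type and variable names are illustrative, not the production code.

package main

import (
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/goroutine, the source of dumps like the one above
        "sync/atomic"
)

// countingTransport is a hypothetical name for the wrapper used here: it keeps
// an atomic counter of requests that have entered RoundTrip but not yet returned.
type countingTransport struct {
        inFlight atomic.Int64
        base     http.RoundTripper
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        t.inFlight.Add(1)
        defer t.inFlight.Add(-1) // never runs if RoundTrip blocks forever
        return t.base.RoundTrip(req)
}

func main() {
        // Expose pprof so a goroutine dump can be taken while the load test runs.
        go http.ListenAndServe("localhost:6060", nil)

        client := &http.Client{Transport: &countingTransport{base: http.DefaultTransport}}
        _ = client // drive the load with this client; a stuck roundTrip shows up as a counter that never drops
}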

What did you expect to see?

As we dug deeper, we saw a possible place in net/http/transport.go that could cause the issue. When readLoop is blocked reading the body of the response and pc.roundTrip is waiting on pcClosed/cancelChan/ctxDoneChan, it can happen that the select in readLoop processes rc.req.Cancel or rc.req.Context().Done(), which calls and removes the cancelKeyFn from transport.reqCanceler. The pcClosed case in pc.roundTrip will then never see canceled = true, or even pc.t.replaceReqCanceler() = true, because transport.reqCanceler no longer contains the cancelKey.

The prerequisites for this to happen are:

  • the request processing timed out or was canceled
  • rc.req.Cancel or rc.req.Context().Done() is processed by readLoop

mknyszek (Contributor) commented Mar 8, 2024

CC @neild

mknyszek added the NeedsInvestigation label on Mar 8, 2024
mknyszek added this to the Backlog milestone on Mar 8, 2024
panjf2000 (Member) commented:
This issue may have something to do with 212d385 and 854a2f8.

(Quoting @rschone's analysis and prerequisites from the issue description above.)

Nonetheless, I'm not so convinced by this analysis because if the client has made its way to here:

go/src/net/http/transport.go, lines 2272 to 2277 at 065c5d2:

case <-rc.req.Cancel:
        alive = false
        pc.t.cancelRequest(rc.cancelKey, errRequestCanceled)
case <-rc.req.Context().Done():
        alive = false
        pc.t.cancelRequest(rc.cancelKey, rc.req.Context().Err())

then a response should already have been sent out, which would free the roundTrip method from blocking in this select:

go/src/net/http/transport.go, lines 2708 to 2718 at 065c5d2:

case re := <-resc:
        if (re.res == nil) == (re.err == nil) {
                panic(fmt.Sprintf("internal error: exactly one of res or err should be set; nil=%v", re.res == nil))
        }
        if debugRoundTrip {
                req.logf("resc recv: %p, %T/%#v", re.res, re.err, re.err)
        }
        if re.err != nil {
                return nil, pc.mapRoundTripError(req, startBytesWritten, re.err)
        }
        return re.res, nil

Any chance you can write a test that can reproduce this issue? Thanks! @rschone
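
A minimal sketch of the kind of reproduction test asked for here, under the assumption that a deliberately slow server plus a short client-side context deadline is enough to trigger the hang (which, per the comments below, it apparently is not). The test name, request count, and timeouts are illustrative only.

package transport_test

import (
        "context"
        "io"
        "net/http"
        "net/http/httptest"
        "runtime"
        "strings"
        "sync"
        "testing"
        "time"
)

func TestRoundTripStuckAfterTimeout(t *testing.T) {
        // Server that drains the PUT body and answers well after the client deadline.
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                io.Copy(io.Discard, r.Body)
                time.Sleep(2 * time.Second)
        }))
        defer srv.Close()

        client := &http.Client{Transport: &http.Transport{}}

        var wg sync.WaitGroup
        for i := 0; i < 200; i++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
                        defer cancel()
                        req, err := http.NewRequestWithContext(ctx, http.MethodPut, srv.URL, strings.NewReader("payload"))
                        if err != nil {
                                t.Error(err)
                                return
                        }
                        resp, err := client.Do(req)
                        if err == nil {
                                resp.Body.Close()
                        }
                        // err is expected to be a context deadline exceeded error here.
                }()
        }
        wg.Wait()

        // Give readLoop/writeLoop/roundTrip goroutines time to unwind, then look for
        // leftovers; a real check would inspect the goroutine profile instead.
        time.Sleep(3 * time.Second)
        t.Logf("goroutines still alive: %d", runtime.NumGoroutine())
}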

panjf2000 added the WaitingForInfo label on Mar 9, 2024
rschone (Author) commented Mar 9, 2024

That was our best guess :( Hmm, you're right that the case re := <-resc has to free the roundTrip method when readLoop reaches its last select, so there must be something else.

The problem is that the situation with stuck goroutines is reproducible only in production. A gateway service (Go HTTP client) communicates with a Java service (the HTTP server here) using PUT calls. Both are HA, with 2 pods each before scaling. When the gateway is bombarded with requests generated by dozens of pods (a simulated DDoS, hundreds of requests per second from each of them), the Java service becomes slow, latencies rise, and the number of concurrently processed requests in the gateway goes from ~tens to ~thousands. We run the load test at that rate for ~15-30 seconds and then turn it off (so that we reproduce the issue while staying under the k8s memory limits, i.e. without killing the pods, and before the services scale up and spread the load). After such a load test, a metric watching the number of currently processed requests stays in the lower hundreds and never returns to the normal level of ~tens. The same number of stuck goroutines is visible in the pprof dump.

So the only information we have is the pprof dump pointing us to the roundTrip method. We're still far from writing a test and are still searching for the exact cause :(

We set ResponseHeaderTimeout, there is a deadline on the context, and we tried it without the tracing round tripper, with keep-alives disabled, and with different Go versions (1.21, 1.22), but none of this stopped the goroutines from getting stuck.
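
For reference, the client setup described above looks roughly like this; the concrete timeout values and URL are illustrative, not the production configuration:

package main

import (
        "context"
        "io"
        "net/http"
        "strings"
        "time"
)

func newClient() *http.Client {
        return &http.Client{
                Transport: &http.Transport{
                        ResponseHeaderTimeout: 5 * time.Second, // illustrative value
                        DisableKeepAlives:     true,            // one of the variants tried
                },
                // No tracing round tripper wrapped around the transport in this variant.
        }
}

func doPut(ctx context.Context, c *http.Client, url, payload string) error {
        // Deadline on the context, as in production (illustrative value).
        ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
        defer cancel()

        req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, strings.NewReader(payload))
        if err != nil {
                return err
        }
        resp, err := c.Do(req)
        if err != nil {
                return err
        }
        defer resp.Body.Close()
        _, err = io.Copy(io.Discard, resp.Body)
        return err
}

func main() {
        _ = doPut(context.Background(), newClient(), "http://backend.example/api", "payload")
}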

Any ideas on how to proceed further to localize the bug? Thank you!
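
One low-effort way to get more detail than the aggregated profile above is a full goroutine dump (debug=2), which prints every goroutine's complete stack together with how long it has been blocked, making it easier to separate goroutines that are merely slow from ones that never return. A minimal sketch:

package main

import (
        "os"
        "runtime/pprof"
)

// dumpGoroutines writes the same text form as /debug/pprof/goroutine?debug=2:
// one entry per goroutine with its full stack and, for long-blocked goroutines,
// the time spent waiting (e.g. "chan receive, 12 minutes").
func dumpGoroutines() {
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
        dumpGoroutines()
}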

panjf2000 (Member) commented:
(Quoting @rschone's previous comment in full.)

I'll keep investigating this. In the meantime, please update this issue thread if any helpful information comes to mind, thanks!

seankhliao added and removed the WaitingForInfo label on Mar 31, 2024
gopherbot commented:

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

gopherbot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 1, 2024