
net/http: goroutine persistConn.roundTrip gets stuck when HTTP call times out #66193

Closed
rschone opened this issue Mar 8, 2024 · 5 comments
Labels: NeedsInvestigation, WaitingForInfo
Milestone: Backlog

Comments


rschone commented Mar 8, 2024

Go version

go1.22.1

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/XXX/Library/Caches/go-build'
GOENV='/Users/XXX/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/XXX/go/pkg/mod'
GONOPROXY=''
GONOSUMDB='github.com/YYY/*'
GOOS='darwin'
GOPATH='/Users/XXX/go'
GOPRIVATE=''
GOPROXY='go-proxy.oss.wandera.net'
GOROOT='/opt/homebrew/opt/go/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.1'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/gy/lwh57s5173s3m4058f2kqw_h0000gp/T/go-build1021772654=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

During load testing, thousands of HTTP requests were generated by the Go HTTP client. The server couldn't keep up with the load and requests started timing out.

What did you see happen?

We added a simple atomic counter to track the progress of processed requests and realized that when the load is too high, some client goroutines never finish, even though a context with a deadline is used (a rough sketch of this instrumentation is shown after the stack trace below). A dump from the client showed the following stuck stack traces:

#	0x7379b8	net/http.(*persistConn).roundTrip+0x978								/usr/local/go/src/net/http/transport.go:2675
#	0x72b859	net/http.(*Transport).roundTrip+0x799								/usr/local/go/src/net/http/transport.go:608
#	0x70f6b2	net/http.(*Transport).RoundTrip+0x12								/usr/local/go/src/net/http/roundtrip.go:17
#	0x1fafeb3	github.com/opentracing-contrib/go-stdlib/nethttp.(*Transport).RoundTrip+0x253
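
For reference, a minimal sketch of the kind of instrumentation described above: an atomic in-flight counter wrapped around the real transport, plus the net/http/pprof endpoint from which a goroutine dump like the one above can be taken. The type and variable names are illustrative, not the production code.

package main

import (
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/goroutine, the source of dumps like the one above
        "sync/atomic"
)

// countingTransport is a hypothetical name for the wrapper used here: it keeps
// an atomic counter of requests that have entered RoundTrip but not yet returned.
type countingTransport struct {
        inFlight atomic.Int64
        base     http.RoundTripper
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        t.inFlight.Add(1)
        defer t.inFlight.Add(-1) // never runs if RoundTrip blocks forever
        return t.base.RoundTrip(req)
}

func main() {
        // Expose pprof so a goroutine dump can be taken while the load test runs.
        go http.ListenAndServe("localhost:6060", nil)

        client := &http.Client{Transport: &countingTransport{base: http.DefaultTransport}}
        _ = client // drive the load with this client; a stuck roundTrip shows up as a counter that never drops
}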

What did you expect to see?

As we dug deeper, we saw a possible place in net/http/transport.go that could cause the issue. When readLoop is blocked reading the body of the response and pc.roundTrip is waiting on pcClosed/cancelChan/ctxDoneChan, it can happen that the select in readLoop processes rc.req.Cancel or rc.req.Context().Done(), which calls and removes the cancelKeyFn from transport.reqCanceler. The pcClosed case in pc.roundTrip will then never see canceled = true, or even pc.t.replaceReqCanceler() = true, because transport.reqCanceler no longer contains the cancelKey.

The prerequisites for this to happen are:

  • the request processing timed out or was canceled
  • rc.req.Cancel or rc.req.Context().Done() is processed by readLoop

mknyszek (Contributor) commented Mar 8, 2024

CC @neild

mknyszek added the NeedsInvestigation label on Mar 8, 2024
mknyszek added this to the Backlog milestone on Mar 8, 2024
panjf2000 (Member) commented:
This issue may have something to do with 212d385 and 854a2f8.

(Quoting @rschone's analysis and prerequisites from the issue description above.)

Nonetheless, I'm not so convinced by this analysis because if the client has made its way to here:

go/src/net/http/transport.go, lines 2272 to 2277 at 065c5d2:

case <-rc.req.Cancel:
        alive = false
        pc.t.cancelRequest(rc.cancelKey, errRequestCanceled)
case <-rc.req.Context().Done():
        alive = false
        pc.t.cancelRequest(rc.cancelKey, rc.req.Context().Err())

then a response should already have been sent out, which would free the roundTrip method from blocking in this select:

go/src/net/http/transport.go, lines 2708 to 2718 at 065c5d2:

case re := <-resc:
        if (re.res == nil) == (re.err == nil) {
                panic(fmt.Sprintf("internal error: exactly one of res or err should be set; nil=%v", re.res == nil))
        }
        if debugRoundTrip {
                req.logf("resc recv: %p, %T/%#v", re.res, re.err, re.err)
        }
        if re.err != nil {
                return nil, pc.mapRoundTripError(req, startBytesWritten, re.err)
        }
        return re.res, nil

Any chance you can write a test that can reproduce this issue? Thanks! @rschone
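
A minimal sketch of the kind of reproduction test asked for here, under the assumption that a deliberately slow server plus a short client-side context deadline is enough to trigger the hang (which, per the comments below, it apparently is not). The test name, request count, and timeouts are illustrative only.

package transport_test

import (
        "context"
        "io"
        "net/http"
        "net/http/httptest"
        "runtime"
        "strings"
        "sync"
        "testing"
        "time"
)

func TestRoundTripStuckAfterTimeout(t *testing.T) {
        // Server that drains the PUT body and answers well after the client deadline.
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                io.Copy(io.Discard, r.Body)
                time.Sleep(2 * time.Second)
        }))
        defer srv.Close()

        client := &http.Client{Transport: &http.Transport{}}

        var wg sync.WaitGroup
        for i := 0; i < 200; i++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
                        defer cancel()
                        req, err := http.NewRequestWithContext(ctx, http.MethodPut, srv.URL, strings.NewReader("payload"))
                        if err != nil {
                                t.Error(err)
                                return
                        }
                        resp, err := client.Do(req)
                        if err == nil {
                                resp.Body.Close()
                        }
                        // err is expected to be a context deadline exceeded error here.
                }()
        }
        wg.Wait()

        // Give readLoop/writeLoop/roundTrip goroutines time to unwind, then look for
        // leftovers; a real check would inspect the goroutine profile instead.
        time.Sleep(3 * time.Second)
        t.Logf("goroutines still alive: %d", runtime.NumGoroutine())
}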

panjf2000 added the WaitingForInfo label on Mar 9, 2024
rschone (Author) commented Mar 9, 2024

That was our best guess :( Hmm, you're right that the case re := <-resc has to free the roundTrip method when readLoop reaches its last select, so there must be something else.

The problem is that the situation with stuck goroutines is reproducible only in production. A gateway service (Go HTTP client) communicates with a Java service (the HTTP server here) using PUT calls. Both are HA, with 2 pods each before scaling. When the gateway is bombarded with requests generated by dozens of pods (a simulated DDoS, hundreds of requests per second from each of them), the Java service becomes slow, latencies rise, and the number of concurrently processed requests in the gateway goes from ~tens to ~thousands. We run the load test at that rate for ~15-30 seconds and then turn it off (so that we reproduce the issue while staying under the k8s memory limits, i.e. without killing the pods, and before the services scale up and spread the load). After such a load test, a metric watching the number of currently processed requests stays in the lower hundreds and never returns to the normal level of ~tens. The same number of stuck goroutines is visible in the pprof dump.

So the only information we have is the pprof dump pointing us to the roundTrip method. We're still far from writing a test and are still searching for the exact cause :(

We set ResponseHeaderTimeout, there is a deadline on the context, and we tried it without the tracing round tripper, with keep-alives disabled, and with different Go versions (1.21, 1.22), but none of this stopped the goroutines from getting stuck.
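
For reference, the client setup described above looks roughly like this; the concrete timeout values and URL are illustrative, not the production configuration:

package main

import (
        "context"
        "io"
        "net/http"
        "strings"
        "time"
)

func newClient() *http.Client {
        return &http.Client{
                Transport: &http.Transport{
                        ResponseHeaderTimeout: 5 * time.Second, // illustrative value
                        DisableKeepAlives:     true,            // one of the variants tried
                },
                // No tracing round tripper wrapped around the transport in this variant.
        }
}

func doPut(ctx context.Context, c *http.Client, url, payload string) error {
        // Deadline on the context, as in production (illustrative value).
        ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
        defer cancel()

        req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, strings.NewReader(payload))
        if err != nil {
                return err
        }
        resp, err := c.Do(req)
        if err != nil {
                return err
        }
        defer resp.Body.Close()
        _, err = io.Copy(io.Discard, resp.Body)
        return err
}

func main() {
        _ = doPut(context.Background(), newClient(), "http://backend.example/api", "payload")
}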

Any ideas on how to proceed further to localize the bug? Thank you!
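
One low-effort way to get more detail than the aggregated profile above is a full goroutine dump (debug=2), which prints every goroutine's complete stack together with how long it has been blocked, making it easier to separate goroutines that are merely slow from ones that never return. A minimal sketch:

package main

import (
        "os"
        "runtime/pprof"
)

// dumpGoroutines writes the same text form as /debug/pprof/goroutine?debug=2:
// one entry per goroutine with its full stack and, for long-blocked goroutines,
// the time spent waiting (e.g. "chan receive, 12 minutes").
func dumpGoroutines() {
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
        dumpGoroutines()
}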

panjf2000 (Member) commented:
(Quoting @rschone's previous comment in full.)

I'll keep investigating this. In the meantime, please update this issue thread if any helpful information comes to mind, thanks!

seankhliao added and removed the WaitingForInfo label on Mar 31, 2024
gopherbot commented:

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

gopherbot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 1, 2024