net/http: goroutine persistConn.roundTrip gets stuck when an HTTP call times out #66193
Comments
CC @neild
This issue may have something to do with 212d385 and 854a2f8.
Nonetheless, I'm not so convinced by this analysis, because if the client has made its way to here (`transport.go` lines 2272 to 2277 at 065c5d2), a response should have been sent out, which would free the `roundTrip` method from blocking in the `select { ... }` (`transport.go` lines 2708 to 2718 at 065c5d2).
Any chance you can write a test that can reproduce this issue? Thanks!
That was our best guess :( Hmm, you're right about that case.

The problem is that the situation with stuck goroutines is reproducible only in production. A gateway service (Go, the HTTP client here) communicates with a Java service (the HTTP server here) using PUT calls. Both are HA, initially 2->2 pods before scaling. When the gateway is flooded with requests generated by dozens of pods (a simulated DDoS: hundreds of requests per second from each of them), the Java service starts to slow down, latencies rise, and the number of concurrently processed requests in the gateway goes from ~tens to ~thousands. We run the load test at that rate for ~15-30 seconds and then turn it off (enough to reproduce the issue while staying under the k8s memory limits, i.e. without killing the pods, and before the services scale up and spread the load). After the load test, a metric watching the number of currently processed requests shows values in the lower hundreds and never returns to the normal level of ~tens. The same number of stuck goroutines is visible in the pprof dump. So the only information we have is the pprof dump pointing us to the roundTrip method. Still far from writing a test; we're searching for the exact cause :(

We set ResponseHeaderTimeout, there is a deadline on the context, and we tried it without the trace round tripper, with keep-alives disabled, and with different Go versions (1.21, 1.22), but none of this stopped the goroutines from getting stuck. Any ideas on how to proceed further to localize the bug? Thank you!
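For reference, the client setup is essentially the following (a simplified sketch; the timeout values, the URL, and the `doPut` helper are illustrative, not our exact production code):

```go
// Simplified sketch of the client configuration described above.
// The concrete timeout values and URL are illustrative placeholders.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ResponseHeaderTimeout: 5 * time.Second, // set, as mentioned above
			DisableKeepAlives:     true,            // one of the variants we tried
		},
	}
}

// doPut issues the PUT call with a per-request context deadline.
func doPut(ctx context.Context, c *http.Client, url string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second) // illustrative deadline
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	c := newClient()
	// hypothetical target URL
	if err := doPut(context.Background(), c, "http://java-service:8080/endpoint"); err != nil {
		fmt.Println("PUT failed:", err)
	}
}
```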
I'll keep investigating this. In the meantime, please inform us by updating this issue thread if any helpful info pops into your head. Thanks!
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
Go version

go1.22.1

Output of `go env` in your module/workspace:

What did you do?
During load testing, thousands of HTTP requests were generated by the Go HTTP client. The server couldn't keep up with the load and requests started timing out.
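The load driver was, in shape, something like this (a simplified, self-contained sketch; the URL, request count, payload, and timeouts are illustrative, and `fire` is an invented name). It also includes the atomic in-flight counter mentioned in the next section:

```go
// Simplified sketch of the load test driver: many goroutines issuing
// PUT requests with a context deadline, plus an atomic in-flight
// counter. All names and values here are illustrative.
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"sync/atomic"
	"time"
)

var inFlight atomic.Int64 // currently processed requests

func fire(c *http.Client, url string) {
	inFlight.Add(1)
	defer inFlight.Add(-1)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, strings.NewReader("payload"))
	if err != nil {
		return
	}
	resp, err := c.Do(req)
	if err != nil {
		return // timeouts land here; the counter is still decremented
	}
	resp.Body.Close()
}

func main() {
	c := &http.Client{Transport: &http.Transport{ResponseHeaderTimeout: 5 * time.Second}}
	for i := 0; i < 5000; i++ {
		go fire(c, "http://server:8080/endpoint") // hypothetical URL
	}
	time.Sleep(60 * time.Second)
	// With the bug, this stays well above zero long after the load stops.
	fmt.Println("in-flight:", inFlight.Load())
}
```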
What did you see happen?
We added a simple atomic counter to watch the progress of processed requests and realized that when the load is too high, some client goroutines never finish, even though a context with a deadline is used. A dump from the client showed the following stuck stack traces:
What did you expect to see?
As we went deeper, we found a possible place in `net/http/transport.go` that could cause the issue. When the `readLoop` is waiting while reading the body of the response, and `pc.roundTrip` is waiting on `pcClosed`/`cancelChan`/`ctxDoneChan`, it can happen that the select in `readLoop` processes the `rc.req.Cancel` or `rc.req.Context().Done()` case and removes the `cancelKeyFn` from `transport.reqCanceler`. The `pcClosed` case in `pc.roundTrip` will then never see `canceled=true` (or even `pc.t.replaceReqCanceler()=true`), because `transport.reqCanceler` no longer contains the `cancelKey`.
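Reduced to a standalone illustration, the suspected pattern looks like this (a hedged sketch of the race shape only; all names here are invented, and this is not the actual transport code):

```go
// Standalone illustration of the suspected pattern: one goroutine
// handles cancellation and deletes the cancel key; another later sees
// the connection close, finds no key, concludes "not canceled", and
// keeps waiting for a response that will never arrive.
package main

import (
	"fmt"
	"sync"
	"time"
)

type reqCanceler struct {
	mu   sync.Mutex
	keys map[string]func()
}

// remove deletes the key and reports whether it was present,
// mirroring the "was this request known to us?" check.
func (c *reqCanceler) remove(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.keys[key]
	delete(c.keys, key)
	return ok
}

func main() {
	c := &reqCanceler{keys: map[string]func(){"req1": func() {}}}
	ctxDone := make(chan struct{})
	connClosed := make(chan struct{})
	resc := make(chan string) // never written to in this scenario

	// "readLoop": reacts to the context being done, removes the cancel
	// key, then the connection gets closed.
	go func() {
		<-ctxDone
		c.remove("req1") // key is gone from the map now
		close(connClosed)
	}()

	close(ctxDone)

	// "roundTrip": when it sees the connection close, it decides whether
	// the request was canceled by looking up the key. The key was already
	// removed, so canceled == false and it keeps waiting on resc.
	for {
		select {
		case <-connClosed:
			canceled := c.remove("req1") // false: already removed by "readLoop"
			fmt.Println("conn closed, canceled =", canceled)
			connClosed = nil // stop re-selecting on the closed channel
		case r := <-resc:
			fmt.Println("response:", r)
			return
		case <-time.After(200 * time.Millisecond):
			fmt.Println("still waiting for a response that never comes")
			return
		}
	}
}
```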
The prerequisites for this to happen are: