sync: higher mutex wait durations in 1.24 #72765
Comments
I think we'll need more information, a reproducer, and maybe try with the …
You've described how the diagnostic metrics changed when updating to go1.24.0. Can you share how the application-level performance changed, if at all? And, can you put the magnitude of delay in context, comparing it to the number of total thread-seconds or goroutine-seconds in the app? For instance, does this mean the app spends 0.1% of its time blocked, or 0.0001%? As a bonus, how does it compare to the amount of time that goroutines spend waiting to be scheduled (summing the buckets of …)?

The flame graph screenshot shows changes across the board to which functions have reportable mutex contention. The calls under the "my service" label used to have 60% of their sub-calls to a package other than "context", but that's disappeared from the new profile on the left, with 90%+ of its reported time now attributed to …

With the information available here so far, I wonder if … I'm also suspicious of the amount of time that's still attributed to …

Describing the app-level regression, if any, would be helpful here. The critical sections involved seem quite small, and the overall delay (without additional information) seems quite small. It's alarming to see something double, but a small number that's doubled is still a small number. Showing that it's large relative to the app (such as relating it to the total scheduling latency over the same time period) could help.
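For readers who want to make that comparison concretely, here is a minimal sketch (my own illustration, not code from this issue) that reads the two runtime/metrics involved: /sync/mutex/wait/total:seconds, which backs go_sync_mutex_wait_total_seconds_total, and the /sched/latencies:seconds histogram, whose buckets can be summed to approximate total scheduling latency.

```go
// Sketch: compare total mutex wait time against approximate total
// scheduling latency, both taken from runtime/metrics.
package main

import (
	"fmt"
	"math"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/sync/mutex/wait/total:seconds"}, // feeds go_sync_mutex_wait_total_seconds_total
		{Name: "/sched/latencies:seconds"},       // histogram of time goroutines spend runnable
	}
	metrics.Read(samples)

	mutexWait := samples[0].Value.Float64()

	// Approximate the total by summing bucket midpoints weighted by counts;
	// the edge buckets can be unbounded, so fall back to the finite boundary.
	hist := samples[1].Value.Float64Histogram()
	var schedWait float64
	for i, count := range hist.Counts {
		if count == 0 {
			continue
		}
		lo, hi := hist.Buckets[i], hist.Buckets[i+1]
		if math.IsInf(lo, -1) {
			lo = hi
		}
		if math.IsInf(hi, +1) {
			hi = lo
		}
		schedWait += float64(count) * (lo + hi) / 2
	}

	fmt.Printf("mutex wait: %.3fs, approx. scheduling latency: %.3fs\n", mutexWait, schedWait)
}
```

Dividing the mutex-wait delta by goroutine-seconds (or by the scheduling-latency total above) gives the kind of percentage asked for here.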
Using the …

Haven't had the time yet, but I'll try to create a small reproducer.

I couldn't conclude any meaningful effect on the application overall. Most of the service's APIs are IO-bound and their latency bottleneck is elsewhere. There is somewhat of a regression in the latency of a few internally computed APIs, but not a confident, provable change. (Though we've gone forward with the upgrade for now. As @rhysh mentioned, the latency is still a small value.) To put it in perspective, for around 15 seconds of CPU time samples for …

Relation of mutex wait and scheduler latency: …

Other calls haven't disappeared from the graph but are smaller now. Maybe visualizing the difference of the flame graphs helps to understand it better:
Not sure about the lock contention data; the profiles are generated with the standard settings of pprof + …
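For context, mutex contention profiling is opt-in. Here is a minimal sketch of the usual setup (an assumption about how such a service is typically instrumented, not something confirmed in this thread):

```go
// Sketch: enable mutex contention profiling and expose it over net/http/pprof.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/mutex
	"runtime"
)

func main() {
	// Record roughly 1 out of every 5 contended lock events.
	runtime.SetMutexProfileFraction(5)

	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```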
Sorry for the wait. Here's a similar but simple reproducer for the issue: …
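The reproducer itself did not survive in this transcript. Purely as an illustration of the kind of program that stresses the same lock, here is a sketch (my own, not the one posted) in which many goroutines repeatedly derive child contexts from a single shared parent, so they all serialize on the parent cancelCtx's mutex inside context.(*cancelCtx).propagateCancel:

```go
// Illustration only (not the reproducer posted in the issue): many goroutines
// derive child contexts from one shared parent, so they all contend on the
// parent cancelCtx's mutex inside context.(*cancelCtx).propagateCancel
// (and again when each child is canceled and removed from the parent).
package main

import (
	"context"
	"runtime"
	"sync"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0)*4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100_000; j++ {
				// WithCancel registers the child with the parent, taking the
				// parent's lock; childCancel removes it, taking the lock again.
				_, childCancel := context.WithCancel(parent)
				childCancel()
			}
		}()
	}
	wg.Wait()
}
```

With mutex profiling enabled (as in the earlier snippet), the contention from a program like this should be attributed to the unlock path under propagateCancel, which is the signature described in the issue body below.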
That was incorrect, sorry. Nothing suspicious there. (Removing that setting is part of #66999, which isn't done yet.)
Thanks for checking. That, plus my corrected understanding of why …, means I'm not sure there's a problem here. The reported contention (in particular, in …
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
Go version
1.24
Output of go env in your module/workspace:

What did you do?
Upgraded a gRPC-serving application's Go version from 1.23.3 to 1.24.0 (and 1.24.1). (The application is more on the IO-bound side.)
What did you see happen?
Mutex wait duration increased by a factor of 2 after the upgrade. The issue seems to be with an unlock in context.(*cancelCtx).propagateCancel.
Prometheus metrics:

go_sync_mutex_wait_total_seconds_total
(At 11:30 the upgraded version was deployed on yellow)
Pprof profiles:
(Left side is Go 1.24 and right side is Go 1.23)
What did you expect to see?
It's mentioned in the 1.24 release notes that CPU performance is improved via several changes, including a new runtime-internal mutex implementation [ref]. I'm wondering if that could have caused some related issues?