sync: higher mutex wait durations in 1.24 #72765


Closed
hossein-raeisi opened this issue Mar 10, 2025 · 7 comments
Labels
BugReport: Issues describing a possible bug in the Go implementation.
compiler/runtime: Issues related to the Go compiler and/or runtime.
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Performance
WaitingForInfo: Issue is not actionable because of missing required information, which needs to be provided.

@hossein-raeisi

hossein-raeisi commented Mar 10, 2025

Go version

1.24

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE='on'
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHEPROG=''
GODEBUG=''
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build322822516=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOOS='linux'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.0'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

Upgraded a gRPC-serving application's Go version from 1.23.3 to 1.24.0 (and 1.24.1). The application is mostly on the I/O-bound side.

What did you see happen?

Mutex wait duration increased by a factor of 2 after the upgrade. The issue seems to be with an unlock in context.(*cancelCtx).propagateCancel.

Prometheus metrics:
go_sync_mutex_wait_total_seconds_total
[screenshot: go_sync_mutex_wait_total_seconds_total over time]
(At 1130 the upgraded version was deployed on yellow)
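
(For reference, a minimal sketch of reading the counter that a series like this is typically built from, directly via runtime/metrics; this is illustrative only and not taken from the service's exporter setup.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /sync/mutex/wait/total:seconds: approximate cumulative time goroutines
	// have spent blocked on a sync.Mutex, sync.RWMutex, or runtime-internal
	// lock. A go_sync_mutex_wait_total_seconds_total series would track this.
	s := []metrics.Sample{{Name: "/sync/mutex/wait/total:seconds"}}
	metrics.Read(s)
	fmt.Printf("total mutex wait: %.6fs\n", s[0].Value.Float64())
}
```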

Pprof profiles:

[screenshot: pprof mutex profile flame graphs]

(Left side is go 1.24 and right side is 1.23)

What did you expect to see?

The 1.24 release notes mention that CPU performance is improved via several changes, including a new runtime-internal mutex implementation [ref]. I'm wondering whether that could have caused related issues?

@seankhliao
Member

I think we'll need more information and a reproducer; maybe also try with the GOEXPERIMENT=nospinbitmutex option.

@seankhliao seankhliao changed the title from "go 1.24: Higher mutex wait durations." to "sync: higher mutex wait durations in 1.24" Mar 10, 2025
@seankhliao seankhliao added the Performance and WaitingForInfo labels Mar 10, 2025
@gopherbot gopherbot added the compiler/runtime label Mar 10, 2025
@gabyhelp gabyhelp added the BugReport label Mar 10, 2025
@rhysh
Contributor

rhysh commented Mar 10, 2025

You've described how the diagnostic metrics changed when updating to go1.24.0. Can you share how the application-level performance changed, if at all?

And, can you put the magnitude of delay in context, comparing it to the number of total thread-seconds or goroutine-seconds in the app? For instance, does this mean the app spends 0.1% of its time blocked, or 0.0001%? As a bonus, how does it compare to the amount of time that goroutines spend waiting to be scheduled (summing the buckets of /sched/latencies:seconds)?
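
(A rough sketch, not from the thread, of how that comparison could be computed with runtime/metrics; summing the /sched/latencies:seconds histogram by bucket midpoints is only an approximation, good enough for an order-of-magnitude ratio.)

```go
package main

import (
	"fmt"
	"math"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/sync/mutex/wait/total:seconds"}, // cumulative mutex wait
		{Name: "/sched/latencies:seconds"},       // histogram of run-queue waits
	}
	metrics.Read(samples)

	mutexWait := samples[0].Value.Float64()

	// Approximate total scheduling latency: weight each bucket's count by
	// its midpoint, clamping the open-ended boundaries.
	hist := samples[1].Value.Float64Histogram()
	var schedWait float64
	for i, n := range hist.Counts {
		lo, hi := hist.Buckets[i], hist.Buckets[i+1]
		if math.IsInf(hi, +1) {
			hi = lo
		}
		if math.IsInf(lo, -1) {
			lo = hi
		}
		schedWait += float64(n) * (lo + hi) / 2
	}

	fmt.Printf("mutex wait %.3fs vs. approx. sched latency %.3fs (%.1f%%)\n",
		mutexWait, schedWait, 100*mutexWait/schedWait)
}
```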


The flame graph screenshot shows changes across the board to which functions have reportable mutex contention. The calls under the "my service" label used to have 60% of their sub-calls to a package other than "context", but that's disappeared from the new profile on the left, with 90%+ of its reported time now attributed to context.(*cancelCtx).propagateCancel via context.WithTimeout. Unless there's a slow AfterFunc call (unlikely, I'd guess), the critical section looks quite short: https://github.com/golang/go/blob/go1.24.0/src/context/context.go#L484-L494
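
(For illustration only, and not the reporter's code: a hedged sketch of the kind of pattern that funnels many goroutines through that short critical section. Every context.WithTimeout derived from one shared, cancellable parent takes the parent's lock in propagateCancel, and again when the child's cancel runs.)

```go
package main

import (
	"context"
	"sync"
	"time"
)

func main() {
	// One long-lived, cancellable parent shared by every "request".
	parent, stop := context.WithCancel(context.Background())
	defer stop()

	var wg sync.WaitGroup
	for g := 0; g < 64; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100_000; i++ {
				// WithTimeout registers the child under the parent's lock in
				// propagateCancel; cancel() takes the same lock again to
				// unregister it. With many goroutines sharing one parent,
				// this is where mutex wait accumulates.
				ctx, cancel := context.WithTimeout(parent, time.Second)
				_ = ctx
				cancel()
			}
		}()
	}
	wg.Wait()
}
```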

With the information available here so far, I wonder if GOEXPERIMENT=spinbitmutex is resulting in small changes to the order that threads and goroutines are able to execute, which leads to existing forms of delay being reclassified as "mutex wait". (A goroutine can only count as waiting on a mutex if it's been scheduled for long enough to reach the Lock call. Time it spends in a run queue before making that call doesn't count.)

I'm also suspicious of the amount of time that's still attributed to runtime._LostContendedRuntimeLock. I'd expect that to be very small: First, runtime-internal locks have small and well-controlled critical sections, so each instance of delay would usually be quite small relative to an app's other work. Second, platforms (including linux/amd64) that use GOEXPERIMENT=spinbitmutex will report the call stacks when they experience contention, even without GODEBUG=runtimecontentionstacks=1, rather than attributing them to runtime._LostContendedRuntimeLock. But I don't see call stacks in the new/left profile that show runtime internals.


Describing the app-level regression, if any, would be helpful here. The critical sections involved seem quite small, and the overall delay (without additional information) seems quite small. It's alarming to see something double, but a small number that's doubled is still a small number. Showing that it's large relative to the app (such as relating it to the total scheduling latency over the same time period) could help.

@hossein-raeisi
Author

Using the GOEXPERIMENT=nospinbitmutex flag didn't help the issue.

I haven't had the time yet, but I'll try to create a small reproducer.


I couldn't conclude any meaningful effect on the application overall. Most of the service's APIs are I/O-bound and their latency bottleneck is elsewhere. There is somewhat of a regression in the latency of a few internally computed APIs, but not a confident, provable change. (Though we've gone forward with the upgrade for now. As @rhysh mentioned, the latency is still a small value.)


To put it in perspective, for around 15 seconds of CPU-time samples in context.WithTimeout, there are 3 seconds in internal/sync.(*Mutex).Unlock.

Relation of mutex wait and scheduler latency:
[screenshot: mutex wait vs. scheduler latency]
(yellow upgraded around 1130)

Other calls haven't disappeared from the graph, but they are smaller now. Maybe visualizing the difference between the flame graphs helps to understand it better:

[screenshot: flame graph diff between the go 1.23 and go 1.24 profiles]
They just take a smaller portion of the time in 1.24.

I'm not sure about the lock contention data; the profiles are generated with the standard settings of pprof plus runtime.SetMutexProfileFraction. Are there any configurations I could include to get the full stack traces?
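
(For context, a minimal sketch of the kind of setup being described here, standard net/http/pprof plus runtime.SetMutexProfileFraction; the listen address and sampling rate are placeholders, not the service's actual configuration. Stacks for runtime-internal locks additionally require GODEBUG=runtimecontentionstacks=1, as clarified in the follow-up below.)

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/mutex
	"runtime"
)

func main() {
	// Record, on average, 1 out of every 100 mutex contention events.
	// (A rate of 1 records everything; 0 turns the profile off.)
	runtime.SetMutexProfileFraction(100)

	// Note: contention on runtime-internal locks keeps being attributed to
	// runtime._LostContendedRuntimeLock unless the process is started with
	// GODEBUG=runtimecontentionstacks=1.
	go http.ListenAndServe("localhost:6060", nil)

	select {} // stand-in for the application's real work
}
```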

@hossein-raeisi
Author

hossein-raeisi commented Mar 13, 2025

Sorry for the wait. Here's a similar but simple reproducer for the issue:

Repo

Mutex wait for builds with different Go versions:
[screenshot: mutex wait across Go versions]

@rhysh
Contributor

rhysh commented Mar 13, 2025

> Second, platforms (including linux/amd64) that use GOEXPERIMENT=spinbitmutex will report the call stacks when they experience contention, even without GODEBUG=runtimecontentionstacks=1, rather than attributing them to runtime._LostContendedRuntimeLock.

That was incorrect, sorry. Nothing suspicious there. (Removing that setting is part of #66999, which isn't done yet.)

> Using the GOEXPERIMENT=nospinbitmutex flag didn't help the issue.

Thanks for checking. That, plus my corrected understanding of why runtime._LostContendedRuntimeLock shows up (it's still used for any runtime-internal contention when GODEBUG=runtimecontentionstacks has its default setting; I thought I'd changed that for spinbitmutex platforms), makes it look like this isn't related to the runtime-internal lock changes at all.


I'm not sure there's a problem here. The reported contention (in particular, in context.(*cancelCtx).propagateCancel) went up, but it's small relative to the app (moving from 5% up to 15% of the amount of time that runnable goroutines spend waiting to be scheduled). But if it's not causing problems for the app, then I'm not sure what (if anything) there is to do about the change in the metric.

@rhysh rhysh added and removed the WaitingForInfo label Mar 13, 2025
@mknyszek mknyszek removed the WaitingForInfo label Mar 17, 2025
@cherrymui cherrymui added the NeedsInvestigation label Mar 17, 2025
@cherrymui cherrymui modified the milestones: Backlog, Go1.25 Mar 17, 2025
@mknyszek mknyszek added the WaitingForInfo label Mar 19, 2025
@gopherbot
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this as not planned Apr 19, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Apr 19, 2025