runtime: contention in runtime.gcFlushBgCredit on work.assistQueue.lock #61426

Open
rhysh opened this issue Jul 18, 2023 · 1 comment
Assignees: mknyszek
Labels: compiler/runtime, NeedsInvestigation, Performance
Milestone: Go1.23

Comments

@rhysh
Contributor

rhysh commented Jul 18, 2023

This is the first in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.

This issue is about contention in runtime.gcFlushBgCredit on work.assistQueue.lock.

I hope it'll be of interest to @mknyszek. Also CC @golang/runtime.

Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.

What version of Go are you using (go version)?

$ go version
go version go1.19.6 linux/amd64

Does this issue reproduce with the latest release?

I don't know yet whether the issue is present in the Go 1.20 series or in the release candidates of the Go 1.21 series.

What operating system and processor architecture are you using (go env)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of go env from the app's runtime environment.

What did you do?

A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:

The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
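
For reference, here's a quick back-of-the-envelope check of those per-thread figures. This is just my arithmetic on the numbers above (the 4 GiB/s, 128-byte, and 96-thread figures), not additional data from the app:

```go
package main

import "fmt"

func main() {
	const (
		bytesPerSec = 4 << 30 // ≈4 GiB/s of allocation across the whole app
		allocSize   = 128     // ≈128 B average allocation size
		threads     = 96      // GOMAXPROCS == number of hardware threads
	)
	perThread := float64(bytesPerSec) / threads // bytes per second per thread

	fmt.Printf("%.0f MiB/s per thread\n", perThread/(1<<20))               // ≈43 MiB/s (rounded to 40 above)
	fmt.Printf("%.1f µs between allocations\n", 1e6*allocSize/perThread)   // ≈2.9 µs (≈3 µs above)
	fmt.Printf("%.0f µs per 16 KiB of allocation\n", 1e6*16*1024/perThread) // ≈366 µs (≈400 µs above)
}
```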

What did you expect to see?

I expected to see very little contention on runtime-internal singleton locks.

What did you see instead?

Contention in runtime.gcFlushBgCredit makes it hard for the runtime.gcBgMarkWorker goroutines to share their scan credit with mutator goroutines. Mutator goroutines pile up in runtime.gcParkAssist, probably for longer than necessary because the scan credit that would allow them to proceed hasn't been able to get to the global shared / stealable pool.

https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L678
https://github.com/golang/go/blob/go1.19.6/src/runtime/mgcmark.go#L713
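
For context, here's a paraphrased sketch of the shape of that code path as I read it in go1.19.6 — not the runtime's actual code. bgScanCredit and assistQueueMu below are stand-ins for gcController.bgScanCredit and work.assistQueue.lock, and the helper functions are placeholders:

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// Stand-ins for runtime state, purely illustrative.
var (
	bgScanCredit  atomic.Int64 // plays the role of gcController.bgScanCredit
	assistQueueMu sync.Mutex   // plays the role of work.assistQueue.lock
)

func assistQueueIsEmpty() bool        { return true /* placeholder */ }
func satisfyQueuedAssists(work int64) { /* hand credit to parked assists and wake them */ }

// gcFlushBgCreditSketch paraphrases the flow of gcFlushBgCredit linked above.
func gcFlushBgCreditSketch(scanWork int64) {
	if assistQueueIsEmpty() {
		// Fast path: no assists are parked, so the worker's credit goes
		// to the global stealable pool with a single atomic add; no lock.
		bgScanCredit.Add(scanWork)
		return
	}

	// Slow path: assists are parked waiting for credit. Every flush now
	// takes the singleton assist-queue lock to hand credit to the queued
	// goroutines and wake them; this is where the futex time in the
	// profiles below accumulates.
	assistQueueMu.Lock()
	satisfyQueuedAssists(scanWork)
	assistQueueMu.Unlock()
}
```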

Data set 5e4664552a1df41ccb8effb833c0442a2a17f33365e909600dfc317a2dd2c086 is from an instance of the app running on a 96-thread machine. The CPU profile ran for 5.14 seconds, during which it collected samples corresponding to a total of 352 seconds of on-CPU time (an average of 70 on-CPU threads); 77.61 + 86.91 = 164.52 seconds of that (an average of 32 on-CPU threads) was in calls from runtime.gcFlushBgCredit to lock and unlock work.assistQueue.lock.

$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 5e4664552a1df41ccb8effb833c0442a2a17f33365e909600dfc317a2dd2c086 && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 11:04am (PDT)
Duration: 5.14s, Total samples = 352s (6854.68%)
Active filters:
   focus=futex
   show_from=^runtime\.
Showing nodes accounting for 164.96s, 46.86% of 352s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            77.61s 99.59% |   runtime.gcFlushBgCredit (inline)
                                             0.09s  0.12% |   runtime.(*mheap).allocSpan (inline)
                                             0.08s   0.1% |   runtime.chansend (inline)
                                             0.04s 0.051% |   runtime.newMarkBits (inline)
                                             0.04s 0.051% |   runtime.runOneTimer (inline)
                                             0.02s 0.026% |   runtime.mProf_Malloc (inline)
                                             0.02s 0.026% |   runtime.sellock (inline)
                                             0.01s 0.013% |   runtime.checkTimers (inline)
                                             0.01s 0.013% |   runtime.gcParkAssist (inline)
     0.01s 0.0028% 0.0028%     77.93s 22.14%                | runtime.lock
                                            77.92s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
                                            86.91s 99.86% |   runtime.gcFlushBgCredit (inline)
                                             0.06s 0.069% |   runtime.(*mheap).allocSpan (inline)
                                             0.04s 0.046% |   runtime.chansend (inline)
                                             0.01s 0.011% |   runtime.selparkcommit (inline)
                                             0.01s 0.011% |   runtime.selunlock (inline)
         0     0% 0.0028%     87.03s 24.72%                | runtime.unlock
                                            87.03s   100% |   runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------

[screenshot: gcFlushBgCredit-pprof]

A goroutine profile taken a few moments before the start of the CPU profile shows a few goroutines are in the GCWaiting state.

$ (cd 5e4664552a1df41ccb8effb833c0442a2a17f33365e909600dfc317a2dd2c086 && go tool pprof -peek=gcAssistAlloc '-show_from=^runtime\.' /tmp/redacted pprof/goroutine)
File: redacted
Type: goroutine
Time: Jul 1, 2023 at 11:04am (PDT)
Active filters:
   show_from=^runtime\.
Showing nodes accounting for 3235, 97.53% of 3317 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                                32   100% |   runtime.mallocgc
         0     0%     0%         32  0.96%                | runtime.gcAssistAlloc
                                                28 87.50% |   runtime.Gosched (inline)
                                                 4 12.50% |   runtime.gcParkAssist
----------------------------------------------------------+-------------

Data set 82da472df6d99764df5636f26da9f15ad61dfe4f4b75a4d3da1dccc833bd2c8b includes an execution trace, and a concurrently-collected CPU profile. It's from a slightly different environment, where the machines have 64 hardware threads. The profile covers 6.23 seconds of wall-clock time and shows an average of 54 on-CPU threads. It shows an average of 5.8 on-CPU threads in calls from runtime.gcFlushBgCredit to lock and unlock work.assistQueue.lock.

$ (cd 82da472df6d99764df5636f26da9f15ad61dfe4f4b75a4d3da1dccc833bd2c8b && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile-during-trace)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 8:50am (PDT)
Duration: 6.23s, Total samples = 333.71s (5359.38%)
Active filters:
   focus=futex
   show_from=^runtime\.
Showing nodes accounting for 41.23s, 12.36% of 333.71s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            21.56s 87.22% |   runtime.gcFlushBgCredit (inline)
                                             0.84s  3.40% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             0.69s  2.79% |   runtime.gcParkAssist (inline)
                                             0.66s  2.67% |   runtime.sellock (inline)
                                             0.43s  1.74% |   runtime.(*mheap).reclaimChunk (inline)
                                             0.43s  1.74% |   runtime.chansend (inline)
                                             0.07s  0.28% |   runtime.(*mheap).allocSpan (inline)
                                             0.01s  0.04% |   runtime.(*mheap).reclaim (inline)
                                             0.01s  0.04% |   runtime.addspecial (inline)
                                             0.01s  0.04% |   runtime.mProf_Malloc (inline)
                                             0.01s  0.04% |   runtime.traceString (inline)
     0.01s 0.003% 0.003%     24.72s  7.41%                | runtime.lock
                                            24.71s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
                                            14.59s 89.02% |   runtime.gcFlushBgCredit (inline)
                                             0.55s  3.36% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             0.41s  2.50% |   runtime.chansend (inline)
                                             0.29s  1.77% |   runtime.(*mheap).reclaimChunk (inline)
                                             0.18s  1.10% |   runtime.selunlock (inline)
                                             0.16s  0.98% |   runtime.parkunlock_c (inline)
                                             0.11s  0.67% |   runtime.selparkcommit (inline)
                                             0.03s  0.18% |   runtime.chansend.func1 (inline)
                                             0.02s  0.12% |   runtime.(*mheap).allocSpan (inline)
                                             0.02s  0.12% |   runtime.setprofilebucket (inline)
                                             0.01s 0.061% |   runtime.chanparkcommit (inline)
                                             0.01s 0.061% |   runtime.findRunnable (inline)
                                             0.01s 0.061% |   runtime.markrootSpans (inline)
         0     0% 0.003%     16.39s  4.91%                | runtime.unlock
                                            16.39s   100% |   runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------

The execution trace shows many goroutines in the GCWaiting state, and that goroutines both enter and leave that state. It shows some magenta "MARK ASSIST (unfinished)" regions and some chartreuse "MARK ASSIST" regions. Because this execution trace is large and go tool trace split it into several pieces, the Threads ribbon didn't get an initial update in this view and so shows up as absent, when in fact it's stuck at the max (64 for this app instance) for most of the displayed time window. Procs 0, 1, and 2 are running the background mark worker; their display is broken in a similar way.

[screenshot: gcFlushBgCredit-trace-end]

It seems that the pacer started the GC Mark phase too late here, but that's a separate issue with separate solutions (including moving away from Go 1.19).

Here's another view from earlier in the GC Mark phase. (Procs 0–11, 17, and 18 appear to be dedicated GC workers.) There's a change around 3700 ms -- the mutator goroutines stop showing chartreuse "MARK ASSIST" regions, and the number of goroutines in the GCWaiting state grows from 0 to several hundred.

[screenshot: gcFlushBgCredit-trace-transition]

Here's my rough analysis: it looks like the design of runtime.gcFlushBgCredit results in metastability. While assist credit is available to steal, the background mark worker goroutines take the fast path when adding additional credit to the pool. But once that pool runs dry, the background mark worker goroutines need to acquire a lock in order to replenish it, and because of the contention on that lock they spend their CPU cycles in runtime.futex instead of doing useful scan work. If the assist queue lock isn't immediately available, maybe the worker could add its credit directly to the pool and rely on some subsequent call releasing it?
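
As a very rough illustration of that last idea — hypothetical only, since the runtime's internal mutex has no try-lock today — here is the same sketch as earlier in this report, reusing its illustrative stand-ins, with the flush falling back to the global credit pool whenever the lock is contended:

```go
// Hypothetical variant of the flush path: if the assist-queue lock is
// already held, deposit the credit into the global stealable pool rather
// than blocking on the futex, and rely on a later flush (or the parked
// assists themselves) to consume it. TryLock here is sync.Mutex's; the
// real runtime lock would need an equivalent.
func gcFlushBgCreditNonBlocking(scanWork int64) {
	if assistQueueIsEmpty() || !assistQueueMu.TryLock() {
		bgScanCredit.Add(scanWork)
		return
	}
	satisfyQueuedAssists(scanWork)
	assistQueueMu.Unlock()
}
```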

Furthermore, with so many user goroutines in GCWaiting, the app's Ps end up doing GC work via gcBgMarkWorker rather than gcAssistAlloc, which means they'll want to share any excess scan credit instead of keeping it for themselves, putting even more Ps in contention for the assist queue lock.

Ideally the GC pacer would start earlier (and Go 1.20 should help), but it looks like once the runtime starts experiencing contention on this lock it has a hard time recovering until the GC Mark phase completes.

@gopherbot added the compiler/runtime label Jul 18, 2023
@heschi added the NeedsInvestigation label Jul 19, 2023
@heschi
Contributor

heschi commented Jul 19, 2023

cc @golang/runtime

@mknyszek added this to the Go1.22 milestone Jul 25, 2023
@mknyszek self-assigned this Jul 25, 2023
@mknyszek modified the milestones: Go1.22, Go1.23 Nov 28, 2023