
runtime: aggressive gc assist with many goroutines #56966

Open · nvanbenschoten opened this issue Nov 28, 2022 · 23 comments
Labels: compiler/runtime, NeedsInvestigation

@nvanbenschoten
Contributor

This is an extraction of a private thread with @mknyszek.

While working with a customer that is running CockroachDB, we found a class of workloads that are severely (and reproducibly) impacted by GC assist. The situation occurs when an app opens many (O(10k)) SQL connections to CockroachDB and applies a moderately memory-intensive workload (O(1KB) reads and writes @ 30k qps). We've found that this leads to severe tail latency blips (p50=2ms, p99=70ms) and have pinpointed the effect to GC assist.

This effect is present with go1.17.11 and go1.19.1. It is also present on master with 4179552. However, the degraded tail latency disappears if GC assist is disabled by commenting out this line.

name \ p99(ms)  17.txt     17-noassist.txt  19.txt     19-noassist.txt  19-4179552.txt
kv95            54.5 ±46%         2.5 ± 0%  73.4 ±43%         3.9 ±96%       41.9 ± 5%

Increasing GOGC does improve tail latency. However, the improvement comes from running fewer GCs; while a GC was running, the impact on tail latency appeared to be about the same.

name \ p99(ms)  17-gogc-300.txt  19-gogc-300.txt  19-4179552-gogc-300.txt
kv95                  44.2 ±47%        18.8 ±24%               22.6 ±131%

Go execution traces show GC assist kicking in across workload goroutines almost immediately (within a few ms) after the background GC starts. Assist then consumes the majority of on-CPU time on these goroutines for the remainder of the GC cycle.

[Screenshot (2022-11-28): execution trace showing GC assist dominating on-CPU time across workload goroutines during a GC cycle]

Here is a collection of gctraces from different variants of the test using GODEBUG=gctrace=1,gcpacertrace=1:

go1.17_gogc_100.txt
go1.17_gogc_300.txt
go1.17_gogc_900.txt
go1.19_gogc_100.txt
go1.19_gogc_300.txt
go1.19_gogc_900.txt

An interesting note is that the investigations began when we noticed higher tail latency when the same workload (30k qps) was split across more SQL connections. CockroachDB maintains two goroutines per active connection. An early test found the following:

vCPUs           Go GC    SQL connections  p99 latency (ms)
30 (1 socket)   Default  512              2.0
                         10,000           28.3
60 (2 sockets)  Default  512              3.1
                         10,000           67.1

To reproduce

The easiest way to reproduce is using CockroachDB's internal cluster orchestration tool called roachprod. With that tool, the reproduction steps are:

export CLUSTER="${USER}-test"
roachprod create $CLUSTER -n2 --gce-machine-type='c2-standard-30' --local-ssd=false
roachprod stage $CLUSTER release v22.2.0-rc.3
roachprod start $CLUSTER:1
roachprod run   $CLUSTER:2 -- "./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m {pgurl:1}"

If roachprod is not an option, then the steps are:

  1. create a pair of c2-standard-30 VM instances
  2. stage a CockroachDB binary on each
  3. start CockroachDB on the first VM
  4. run the following from the second VM:
./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m 'postgresql://root@<INSERT VM1 HOSTNAME HERE>:26257?sslmode=disable'

I'm happy to help get these reproduction steps working in other environments.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 28, 2022
@mknyszek mknyszek self-assigned this Nov 28, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Nov 28, 2022
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 28, 2022
@mknyszek
Contributor

Thank you!

AFAICT what's going on here is that the pacer is built to schedule riiiight on the edge of no assists in the steady-state, to avoid starting earlier than it absolutely has to. However, when there's some noise (I see that the amount of goroutine stack memory varies a good bit, but there's likely also other noise) the pacer may constantly be either ahead of or behind where it needs to be. The amount of assists here is not super high; it just hovers around 2-7%.

There are two ways we can fix this. (1) is to bound the latency of assists, but that would require a bigger change. (2) is to modify the pacer to always start a little early to hedge against noise. This could make assists consistently 0% of GC CPU time, but comes at the cost of a slightly longer GC cycle. This might lead to a slightly higher frequency of GC cycles in some cases, slightly more write barriers executed, and slightly more memory used for floating garbage. However, I suspect that the amount of hedging required to resolve this issue is small enough that it won't meaningfully change performance otherwise (especially when things aren't very noisy).

My main idea is to have the pacer always use a slightly higher cons/mark ratio. For example, instead of taking the average of the last 2 cycles as a prediction for the next one like it currently does, take the maximum of the two, or weight the larger cons/mark more heavily. (This might be too noisy, and maybe we want to track 3 GC cycles or something, but it's a decent starting point.)

@nvanbenschoten
Contributor Author

Thanks @mknyszek! Hedging against noise to avoid GC assist makes sense to me. There's a high, direct penalty to starting late and then pushing assist work onto foreground goroutines. Meanwhile, there's a lower, indirect penalty to starting early and running GC for slightly longer.

There's a tension here between reliable/predictable performance and absolute performance. Different systems will want to live at different spots on that spectrum. I'm hopeful a small degree of hedging will meaningfully move us in the direction of predictability without compromising much on absolute performance.

I'm happy to test out fixes on this workload to see if they help.

@mknyszek
Contributor

Sorry for the delay! I got a bit caught up with other things, and after some more thought and a little experimentation, I think this idea

For example, instead of taking the average of the last 2 cycles as a prediction for the next one like it currently does, take the maximum of the two, or weight the larger cons/mark more heavily. (This might be too noisy, and maybe we want to track 3 GC cycles or something, but it's a decent starting point.)

is indeed too noisy.

I'm hopeful a small degree of hedging will meaningfully move us in the direction of predictability without compromising much on absolute performance.

I agree. I suspect aggregate performance metrics like throughput tend to be less sensitive to relatively small changes in pacing, so a pure win (or close to one) feels achievable here.

I'm going to be out on vacation for a few days, just wanted to leave a ping here that I hadn't forgotten about it, and that it's still on my radar. :)

I'm happy to test out fixes on this workload to see if they help.

Thank you! I'll ping you when I have something more concrete. I have a couple more ideas I'm playing with, I'll aim to get back to you next week (I'm out for the rest of this week).

@nvanbenschoten
Contributor Author

Thanks for the update @mknyszek. Enjoy your vacation!

@mknyszek
Contributor

Okay, it's been a full month and I'm back. I experimented with a whole lot of things (the test case I'm using here is uniformly random noise in the measured cons/mark ratio), and I think the best hedging mechanism is in fact a maximum; it's just that a 2-cycle window is pretty bad. A 5-cycle window is pretty good, but it's not very responsive when the application suddenly goes idle; that's 5 consecutive cycles where the pacer is going to be too conservative. Maybe that's fine though. When you're idle these things tend not to matter so much, and it's a constant-factor hit. I'm going to try benchmarking this ASAP; I'll send a CL if it looks OK.
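
For readers following along, the windowed-maximum idea amounts to something like the sketch below. The names and structure are made up for illustration; this is not the pacer's actual code.

package main

import "fmt"

type consMarkWindow struct {
    samples [5]float64 // last 5 measured cons/mark ratios
    n       int        // number of samples recorded so far
    next    int        // index to overwrite next
}

// record adds the cons/mark ratio measured at the end of a GC cycle.
func (w *consMarkWindow) record(consMark float64) {
    w.samples[w.next] = consMark
    w.next = (w.next + 1) % len(w.samples)
    if w.n < len(w.samples) {
        w.n++
    }
}

// smoothed returns the value the pacer would plan the next cycle with: the
// maximum over the window. Overestimating cons/mark starts the GC earlier,
// hedging against assists at the cost of slightly more frequent cycles.
func (w *consMarkWindow) smoothed() float64 {
    m := 0.0
    for i := 0; i < w.n; i++ {
        if w.samples[i] > m {
            m = w.samples[i]
        }
    }
    return m
}

func main() {
    var w consMarkWindow
    for _, c := range []float64{0.9, 1.3, 1.0, 1.1, 0.8} {
        w.record(c)
    }
    fmt.Println(w.smoothed()) // 1.3
}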

@mknyszek
Contributor

Sorry again for the delay. I was out sick all last week, and I'm on release rotation this week. I'll try to find some time to actually run the experiment. I'm currently trying to put together a few ideas to avoid "pacing on the edge," which this is certainly a symptom of.

@nvanbenschoten
Contributor Author

@mknyszek thanks for the update! Please let me know if you'd like help testing out your ideas that avoid pacing on the edge. We can pretty easily reproduce this behavior (see above) and are happy to run the same tests in our environment with patches to the runtime.

@gopherbot

Change https://go.dev/cl/467875 mentions this issue: runtime: bias the pacer's cons/mark smoothing against noise

@mknyszek
Contributor

I put up https://go.dev/cl/467875 if you'd like to give it a try. Sigh... such a long time for such a small diff... 😅 Though, I think I fully reasoned out the consequences, and I feel more confident that this is a good solution.

I'll run some benchmarks overnight, too.

@mknyszek
Contributor

The benchmarks show that this CL moves the needle very little. This isn't necessarily a bad thing, because I don't believe any of these benchmarks are particularly noisy to the pacer. Where the needle did move, it moved roughly in the direction I expected, which is a good sign.

If you can patch in that CL and try a run with it, I'd appreciate it!

@mknyszek
Contributor

Friendly ping! :)

@nvanbenschoten
Contributor Author

Hi @mknyszek, I took the patch in https://go.dev/cl/467875 for a spin and it is showing a meaningful improvement to tail latency on the memory-intensive workload from above!

CockroachDB has not yet upgraded to Go 1.20, so running with the patch was a little tricky. I was able to do so by cherry-picking 4179552 (an earlier GC assist-related change only present in go1.20) onto go1.19.4. I then built one version of CockroachDB with just that earlier patch and one that also contains your new patch, https://go.dev/cl/467875.

Here are these results, averaged across a series of tests:

name  old p99(ms)  new p99(ms)  delta
kv95    26.7 ±10%    15.1 ±25%  -43.36%  (p=0.008 n=5+5)

Here is a pair of gctraces from the two different variants of the test using GODEBUG=gctrace=1,gcpacertrace=1:

go1.19_4179552_gogc.txt
go1.19_4179552_467875_gogc.txt

@mknyszek
Contributor

That's great to hear! I mean, it definitely isn't at the point of "no assists," but it's at least worth moving forward with for now.

There's clearly more denoising to be done, but at this point it looks like most of the pacer inputs and their derivatives are quite stable from the latest GC traces you shared (thanks!). This leads me to be suspicious of noisiness during the GC itself causing the assists to be "jumpy." I realize in the execution trace you showed in your original post the assists were pretty uniformly applied to all goroutines, but now I'm wondering if post-patch (the patch mostly addresses larger-scale, aggregate behavior) the assist ratio during a GC looks more spiky over time (which should cause transient global assist spikes).

I have a couple of follow-up questions in this direction (in no particular order of priority):

  1. Is your application running on a system that's perhaps sharing CPU resources with other applications (enough that you might be transiently oversubscribing the system)? Part of the purpose of the mark assists is to deal with like, the dedicated workers getting descheduled by the OS and the GC transiently falling behind as a result (raising the assist ratio, making assists more likely). This wouldn't be visible in an execution trace, but might be visible in a Linux sched trace if you can correlate thread IDs to the execution trace. (This isn't easy to do off-the-shelf, but if you're willing and/or able to hand me both I can provide you instructions that'll help me stitch things together.) If you're pretty confident that you're running this on a system that has ample headroom though I wouldn't bother.
  2. If you collect another execution trace after the latest patch, do you notice any pattern in the assists? For example, if the assists are more heavily present at the beginning of the mark phase or at the end? There's a known quirk in the pacer where at the beginning of the GC cycle we expect more assists as goroutines start with neither assist credit nor debt. Seeding them with credit could help if that's indeed the problem. (I kind of see this in the execution trace in your original post, where the individual assist blocks are definitely a lot higher toward the beginning of the trace than the end, but this could also be an artifact of the GC trying to pick up the dregs of scan work but when it goes to assist it's finding very little work to do each time. This should in theory be more pronounced post-patch.)

Also, my apologies about the ergonomics of that patch. I can prepare any future patches on top of Go 1.19 and later port them to tip.

gopherbot pushed a commit that referenced this issue Mar 21, 2023
Currently the pacer is designed to pace against the edge. Specifically,
it tries to find the sweet spot at which there are zero assists, but
simultaneously finishes each GC perfectly on time.

This pretty much works, despite the noisiness of the measurement of the
cons/mark ratio, which is central to the pacer's function. (And this
noise is basically a given; the cons/mark ratio is used as a prediction
under a steady-state assumption.) Typically, this means that the GC
might assist a little bit more because it started the GC late, or it
might execute more GC cycles because it started early. In many cases the
magnitude of this variation is small.

However, we can't possibly control for all sources of noise, especially
since some noise can come from the underlying system. Furthermore, there
are inputs to the measurement that have effectively no restrictions on
how they vary, and the pacer needs to assume that they're essentially
static when they might not be in some applications (i.e. goroutine
stacks).

The result of high noise is that the variation in when a GC starts is
much higher, leading to a significant amount of assists in some GC
cycles. While the GC cycle frequency basically averages out in the
steady-state in the face of this variation, starting a GC late has the
significant drawback of hurting application latencies.

This CL thus biases the pacer toward avoiding assists by picking a
cons/mark smoothing function that takes the maximum measured cons/mark
over 5 cycles total. I picked 5 cycles because empirically this was the
best trade-off between window size and smoothness for a uniformly
distributed jitter in the cons/mark signal. The cost here is that if
there's a significant phase change in the application that makes it less
active with the GC, then we'll be using a stale cons/mark measurement
for 5 cycles. I suspect this is fine precisely because this only happens
when the application becomes less active, i.e. when latency matters
less.

Another good reason for this particular bias is that even though the GC
might start earlier and end earlier on average, resulting in more
frequent GC cycles and potentially worse throughput, it also means that
it uses less memory on average. As a result, there's a reasonable
workaround in just turning GOGC up slightly to reduce GC cycle
frequency and bring memory (and hopefully throughput) levels back to
the same baseline. Meanwhile, there should still be fewer assists than
before which is just a clear improvement to latency.

Lastly, this CL updates the GC pacer tests to capture this bias against
assists and toward GC cycles starting earlier in the face of noise.

Sweet benchmarks didn't show any meaningful difference, but real
production applications showed a reduction in tail latencies of up
to 45%.

Updates #56966.

Change-Id: I8f03d793f9a1c6e7ef3524d18294dbc0d7de6122
Reviewed-on: https://go-review.googlesource.com/c/go/+/467875
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
@mknyszek mknyszek modified the milestones: Go1.21, Backlog Mar 27, 2023
@mknyszek
Contributor

As a heads-up, I landed the change you tried out earlier for Go 1.21.

I moved this out of the Go 1.21 milestone for now, but I'm happy to do more here if you have time and/or are able to collect more data on what's going on during the GC cycle (as per my last comment). Thanks!

@nvanbenschoten
Contributor Author

Hi @mknyszek, we've recently upgraded CockroachDB to Go 1.21 and wanted to share that we are seeing significantly better tail latency as a result. For example, here's the output of a recent test we ran before and after the upgrade with a pathological workload (many goroutines, reasonably high rate of allocation):

./cockroach workload run kv --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=20000 --warmup-conns=20000 --max-conn-lifetime=1h --max-rate=50000 --duration=3m --ramp=30s

# go1.20
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0        9003760        50020.5      3.6      1.1     16.3     67.1    260.0

# go1.21
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0        9006818        50038.2      1.9      1.0      4.1     25.2    251.7

Thank you for your help on this improvement!


At the same time, we've also recently begun to identify just how dominant a source of tail latency the Go GC still is in CockroachDB. Specifically, how impactful gc assist can be under steady-state and when running background tasks like backups or bulk writes.

We added a debug environment variable called GODEBUG=gcnoassist=1 in cockroachdb/cockroach#115585, and have been running some tests with it. Disabling gc assist has a pronounced impact on tail latency (p95 and above), reducing it by almost 50% on a common, non-pathological workload.

We also measured the effect of adjusting GOGC to run the GC less frequently while relying on GOMEMLIMIT to avoid OOMs, but nothing has been as impactful as disabling gc assist. Regardless of the specific GOGC tuning, the process had plenty of room left between the gc goal and the soft memory limit during each GC, so there was little risk of OOMing without gc assist in these tests.

This raises the question of what role gc assist plays in an application which has also configured a GOMEMLIMIT, expressing to the runtime how much memory it is willing to consume.

GC assist is a backpressure mechanism to prevent an unexpectedly fast rate of user goroutine memory allocation from exceeding memory limits and causing OOMs. Having such a mechanism feels critical, and other concurrent GCs like the JVM's CMS collector have similar concepts. On the other hand, it strikes me that gc assist in Go is a fairly expected consequence of GC. The JVM's CMS collector views foreground allocation backpressure (called a "concurrent mode failure") a little more similarly to how latency-sensitive applications do — as an exceptional case that should be avoided ("These estimates are padded for safety, because concurrent mode failure can be very costly.")

I'm interested to get your thoughts on how to think about all of this. Many of the implementation choices of the GC pacing algorithm seem to be designed around some amount of assist. For example, gcControllerState.bgScanCredit is initialized at the beginning of each gc cycle to 0, and is only accumulated as background scan work makes progress. This leads to an initial spike of assist as the GC starts. From there, the gcControllerState adjusts assistWorkPerByte and assistBytesPerWork to roughly pace concurrent allocation so that the heap reaches the heap goal right as the GC completes. Assist credits are also stashed per goroutine on g.gcAssistBytes, which feels susceptible to fragmentation.
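
For illustration, here is a heavily simplified sketch of the accounting described above, using made-up Go types rather than the runtime's actual code:

package main

type pacer struct {
    assistWorkPerByte float64 // scan work owed per byte allocated while GC runs
    bgScanCredit      int64   // scan work banked by the background workers
}

type gState struct {
    gcAssistBytes int64 // per-goroutine allocation credit (+) or debt (-)
}

// onAlloc models, conceptually, what happens when a goroutine allocates
// size bytes during the mark phase.
func (g *gState) onAlloc(p *pacer, size int64) {
    g.gcAssistBytes -= size
    if g.gcAssistBytes >= 0 {
        return // still in credit; no assist
    }
    // Convert the byte debt into scan work and try to pay it from the global
    // background credit pool first.
    debtWork := int64(float64(-g.gcAssistBytes) * p.assistWorkPerByte)
    if p.bgScanCredit >= debtWork {
        p.bgScanCredit -= debtWork
        g.gcAssistBytes = 0
        return
    }
    // Otherwise the goroutine must perform (or block waiting for) the
    // remaining scan work itself: this is the foreground assist that shows
    // up as tail latency.
    selfAssist(debtWork - p.bgScanCredit)
    p.bgScanCredit = 0
    g.gcAssistBytes = 0
}

func selfAssist(scanWork int64) { _ = scanWork /* placeholder for mark work */ }

func main() {}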

This seems to be optimizing for starting the GC as late as possible and running as few GCs as possible. For throughput-oriented applications, that is probably the right choice. For latency-sensitive applications, minimizing GC assist would be a more important goal. For example, a latency-sensitive application might want to start each GC cycle GBs of heap earlier than necessary just to minimize the chance of any assist work penalizing latency, with the expectation that assist would only kick in as a last resort to avoid OOMs.

Is there a missing knob here to allow applications to express their willingness to backpressure user goroutines? Or a knob to configure how much each cons/mark estimate should be padded to trade earlier GCs for less assist? Is the GC pacing algorithm written in a way that kicking off GC earlier would even help avoid assist?


Independent of the discussion here, I'm happy to provide reproductions (outside of cockroach) which perform similar heap allocation patterns to cockroach and demonstrate high levels of assist. For example, I've attached one such reproduction below, which commonly shows goroutines slow down by 40-50x (from ~250us to >10ms) while GC is running due to assist.

main.go.txt
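
For readers who don't want to download the attachment, a rough, independently written sketch of the same shape of reproduction (the worker count, block size, and rates below are arbitrary placeholders) might look like this:

package main

import (
    "fmt"
    "sort"
    "sync"
    "time"
)

func main() {
    const (
        workers  = 10000           // many goroutines, as in the workload above
        perOp    = 1024            // ~1KB allocated per operation
        perTick  = time.Second / 3 // ~3 ops/sec per worker, ~30k ops/sec total
        duration = 30 * time.Second
    )

    var (
        mu       sync.Mutex
        retained = make([][]byte, 0, 1<<18) // keep some allocations live so GC has real work
    )

    latencies := make([][]time.Duration, workers)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            ticker := time.NewTicker(perTick)
            defer ticker.Stop()
            deadline := time.Now().Add(duration)
            for time.Now().Before(deadline) {
                <-ticker.C
                start := time.Now()
                buf := make([]byte, perOp) // the allocation that may be forced into assist
                buf[0] = byte(id)
                mu.Lock()
                if len(retained) < cap(retained) {
                    retained = append(retained, buf)
                }
                mu.Unlock()
                latencies[id] = append(latencies[id], time.Since(start))
            }
        }(i)
    }
    wg.Wait()

    var all []time.Duration
    for _, l := range latencies {
        all = append(all, l...)
    }
    sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
    fmt.Printf("ops=%d p50=%v p99=%v\n", len(all), all[len(all)/2], all[len(all)*99/100])
}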

@mknyszek
Contributor

mknyszek commented Dec 11, 2023

Thank you for your help on this improvement!

I'm happy that I could help. :)

On the other hand, it strikes me that gc assist in Go is a fairly expected consequence of GC. The JVM's CMS collector views foreground allocation backpressure (called a "concurrent mode failure") a little more similarly to how latency-sensitive applications do — as an exceptional case that should be avoided ("These estimates are padded for safety, because concurrent mode failure can be very costly.")

This used to be true before Go 1.18 (#44167). The goal since then has very much been to try and avoid assists as much as possible. The steady-state goal for the pacer is zero assists (previously it was a target of 5%). However, I agree there's more that can be done.

I think part of the problem is that assist scheduling hasn't been touched at all in response to that change. If we can improve that and broaden the definition of "steady state," then that'll reduce assists in even more situations. More on this below.

For example, gcControllerState.bgScanCredit is initialized at the beginning of each gc cycle to 0, and is only accumulated as background scan work makes progress. This leads to an initial spike of assist as the GC starts.

That's an unfortunate known issue that I don't think we have a good sense about how to solve in the current design. See #42430. We could start bgScanCredit with some non-zero value, but it's hard to decide what a good value is until the GC actually starts running. We could maybe look at the last cycle's data for a hint, but that's going to be noisy. It'll probably help, but it's unclear by what amount.

From there, the gcControllerState adjusts assistWorkPerByte and assistBytesPerWork to roughly pace concurrent allocation so that the heap reaches the heap goal right as the GC completes.

Yeah, unfortunately both of these mechanisms are susceptible to noise. Conceptually assistWorkPerByte represents the slope of a line that the runtime is trying to stay under, and it's pretty strict about staying under that line. Assists begin the moment you cross it.

Part of the reason for this is that we want to smooth out assists across the entire GC cycle. This is counter-intuitive, but by having the line the GC can plan ahead. If it knows it's going to have a short runway, then it'll always be under the line, but everything will be equally under the line. Consider situations like a low GOMEMLIMIT or low GOGC for this. In those cases, we avoid severe latency spikes because we've already effectively dedicated some global amount of assist work.

On the other hand, this line is intended to smooth out spiky behavior to keep the application on track. For example, if a background worker stalls in the OS for some reason, assists can pick up the slack without any overall GC CPU impact, and the GC still hits its goal on time. I've long wondered if we could do a lot better with some hysteresis: assume that we'll right ourselves in the near future, and only if that doesn't happen do we kick up assists. However, this conflicts with the goal of smearing assist work across the GC cycle, which is also good for latency. I don't think this is unresolvable, but it will require some thought.
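
As a simplified rendering of that line (a made-up function, not the runtime's revise()): the assist ratio is roughly the outstanding scan work divided by the heap runway left before the goal, so the ratio, and with it assist pressure, rises whenever the runway shrinks faster than scan work completes.

package main

import "fmt"

func assistWorkPerByte(scanWorkExpected, scanWorkDone, heapGoal, heapLive int64) float64 {
    scanRemaining := scanWorkExpected - scanWorkDone
    if scanRemaining < 1 {
        scanRemaining = 1 // always assume a little work remains so assists can terminate
    }
    runway := heapGoal - heapLive
    if runway < 1 {
        runway = 1 // at or past the goal: assist as aggressively as possible
    }
    // Every byte allocated from here on must be matched by this much scan work
    // for marking to finish right as the heap reaches its goal.
    return float64(scanRemaining) / float64(runway)
}

func main() {
    // E.g. 50 MB of scan work left with 100 MB of runway: each allocated byte
    // carries half a unit of scan work.
    fmt.Println(assistWorkPerByte(100<<20, 50<<20, 600<<20, 500<<20)) // 0.5
}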

Assist credits are also stashed per goroutine on g.gcAssistBytes, which feels susceptible to fragmentation.

Assist credit being goroutine-local was also an intentional design decision. The idea is that the goroutines that allocate the most are the ones that pay the biggest assist penalty. This policy is about making performance easier to locally reason about, though the GC is already one big global component and we've considered changing this policy. Additionally, it helps keep synchronization costs down, because memory allocation only needs to do a non-atomic local update, but we can flush to a global pool often enough to bound the amount of fragmentation. It would be nice to figure out how much of a problem the fragmentation causes in practice.

TBH, I have a hunch that this goroutine-local property is a significant issue in your case. (It's not based on a lot of concrete evidence; more of a gut feeling.) The absolute worst case latency from assists is a frequently allocating goroutine repeatedly blocking in mark assist because they can't find GC work to do, but they're in debt. If some other goroutine could pay off its debt, or it could go in the red in the hope that someone else covers their debt, that would help a lot.

But, it really depends on the application; we'd first have to confirm this is actually happening for CockroachDB.

This raises the question of what role gc assist plays in an application which has also configured a GOMEMLIMIT, expressing to the runtime how much memory it is willing to consume.

I think the role of assists with GOMEMLIMIT is clear: the assists are integral to maintaining the memory limit. They are the only mechanism the runtime has to ensure the total heap size doesn't exceed the heap goal. Without them, random OOMs are much more likely.

That being said, if the GC has a lot of runway, then assists should not be kicking in in the steady state. The pacer shouldn't be doing that, and it represents some flaw in the system. Either in scheduling, or in the pacer itself. (At this point, I'm more likely to blame scheduling, but we need more data.)

Is there a missing knob here to allow applications to express their willingness to backpressure user goroutines? Or a knob to configure how much each cons/mark estimate should be padded to trade earlier GCs for less assist?

There are an endless amount of knobs we could add, but unfortunately I don't think either of these are compelling enough to me. I'm afraid they would be overfitting to the current GC design.

Instead, I think we could do better by rethinking how assists are scheduled in the first place to better fit the Go 1.18 "no assists in the steady state" policy while still retaining some of the existing properties, like decent smearing across the GC cycle.

This seems to be optimizing for starting the GC as late as possible and running the fewest GCs as possible. For throughput-oriented applications, that is probably the right choice. For latency-sensitive applications, minimizing GC assist would be a more important goal. For example, a latency-sensitive application might want to start each GC cycle GBs of heap earlier than necessary just to minimize the chance of any assist work penalizing latency, with the expectation that assist would only kick in at a last resort to avoid OOMs.
...
Is the GC pacing algorithm written in a way that kicking off GC earlier would even help avoid assist?

Yes. In fact, that's what the pacer is fundamentally already doing. It starts the GC earlier to avoid assists, and before Go 1.21 tried to find the sweet spot of "minimal GC length" and "no assists". (Shorter GC cycles are also good for throughput, latency, and memory use, since write barriers are disabled and there's less 'floating garbage' that inflates the live heap size.) Unfortunately in real-world noisy environments, pacing for that sweet spot can result in bad guesses and spikes in assist when the GC is wrong. My patch for Go 1.21 specifically changes the pacer to be more conservative in its estimate of the cons/mark ratio. It is biased against assists in favor of longer GC mark phases, though it'll still hit the sweet spot in low-noise situations.

One reason I have been fixated on assist scheduling in this reply is that the pacer, for the most part, should already be doing a good job at eliminating assist from its end. As of Go 1.21, the pacer should already be giving the rest of the runtime plenty of space to avoid assists from this perspective by starting the GC conservatively early. But fundamentally the fine-grained assist scheduling (i.e. when a goroutine actually calls into an assist, and how long it stays there) is still sensitive and has a really long worst case tail latency.

(The one exception to "the pacer should be giving the rest of the runtime plenty of space" is phase changes, which is a different problem. However, from what I recall of your previous descriptions, I don't think that's the main thing going on here. If you are aware of latency spikes from assists that spawn around the time a background task starts running, that's probably this problem. That is less clear how to solve.)

Independent of the discussion here, I'm happy to provide reproductions (outside of cockroach) which perform similar heap allocation patterns to cockroach and demonstrate high levels of assist.

Thanks. Would you also be willing to collect and share some execution traces? (What's produced by the runtime/trace package.) I'm curious about some properties of assists in your program and those traces would have all the information about assists I could ever need. As I mentioned above, assist scheduling hasn't had a very close look in a while, and it's possible there's an issue that's hitting CockroachDB particularly hard.

Go 1.22 will release with a new tracer with partitioned traces, allowing us to analyze traces in a streaming manner. I don't know how difficult this would be for you to do, but taking a very long trace containing many GC cycles would help answer a lot of questions.

@mknyszek
Contributor

This is a little off-topic but I think it's a good time to bring it up: it would be nice to add this CockroachDB workload (and possibly others) to our benchmark suite that we run continuously against commits to tip.

The suite is not very opinionated as long as the result ends up in the Go benchmark format and we can get basic diagnostics out of the program (CPU profiles, memory profiles, traces, etc.). It would be great for us to be running this continuously (https://perf.golang.org/dashboard/) on our side so we can watch for both improvements and regressions.

Do you have any recommendations on what configurations are useful and/or representative to run and the best way to run them? I don't mind doing the work of integrating into the benchmark suite, that part is pretty easy. The biggest hurdle to adding new benchmarks is generally identifying representative workloads.

Also, there's support in the benchmarking suite for seeding programs with interesting data sets (even if they're really big) and running multiple instances of a program generally isn't a problem either, as long as it can all run on the same machine. (I recognize leaving out network latency for something like a distributed database makes it less representative, but usually that just serves to magnify improvements or regressions in the compiler/runtime. That's more useful for spotting trends anyway.)

Let me know what you think!

@nvanbenschoten
Contributor Author

Thank you @mknyszek for the thorough response, as always.

That's an unfortunate known issue that I don't think we have a good sense about how to solve in the current design. See #42430. We could start bgScanCredit with some non-zero value, but it's hard to decide what a good value is until the GC actually starts running. We could maybe look at the last cycle's data for a hint, but that's going to be noisy. It'll probably help, but it's unclear by what amount.

Is this because of the goal to smooth out assists across the entire GC cycle? One can imagine a different goal that trades average latency for tail latency (hopefully far into the tail). We would run with no assist-provided pacing until the heap goal is reached and then slam on the brakes with a stop-the-world pause in cases where the goal is reached before the cycle completes. In such a world, we would instead initialize bgScanCredit with the full amount of remaining heap at the start of a GC cycle and then never populate it in background GC workers.

There are clear downsides to that approach as well, many of which you alluded to in #44167, but I find it to be a useful comparison point. It hints at an intermediate approach where bgScanCredit starts with some non-zero value to avoid assist at the start of a GC cycle until the ratio between the rate of GC work and the rate of allocation is more clearly established. You can formulate this as giving the background GC workers a "head start" to reduce the chance that the allocations momentarily catch up and need to be slowed down.

This is similar but not identical to the hysteresis approach you described to let the allocations/scan work rate temporarily exceed the sustainable rate but then pace more aggressively if the rates don't right themselves quickly.

TBH, I have a hunch that this goroutine-local property is a significant issue in your case. (It's not based on a lot of concrete evidence; more of a gut feeling.) The absolute worst case latency from assists is a frequently allocating goroutine repeatedly blocking in mark assist because they can't find GC work to do, but they're in debt. If some other goroutine could pay off its debt, or it could go in the red in the hope that someone else covers their debt, that would help a lot.

This would line up with the observation that the larger the pool of goroutines that a fixed amount of work is distributed across, the worse the assist penalty appears to be on latency.

Two different related observations I had which may be useful here (but probably aren't) are:

  1. goroutines flush their assist credits to the global pool on exit, but not after yielding when waiting on a channel/network call/disk IO/etc. This might allow assist credit to get locked up in sleeping goroutines while others are starved of credit and blocked in mark assist or the assist queue.
  2. background scan credits that are flushed and only partially satisfy a waiting assist cause the assist to be pushed to the back of the assist queue. For cases where many goroutines perform large allocations, this could degrade into a regime where no single goroutine's debt ever gets paid off. A more basic FIFO queuing discipline would benefit some workloads, at the expense of others (a FIFO variant is sketched just below).
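
To make observation 2 concrete, here is a sketch of what a FIFO partial-fill discipline could look like (hypothetical types, not the runtime's assist queue): flushed background credit is applied to the head of the queue until it runs out, and a partially satisfied assist stays at the head instead of being requeued at the back.

package main

import "fmt"

type assistWaiter struct {
    remainingWork int64 // scan work still owed before the goroutine can run again
}

// distributeCredit pays waiters strictly in arrival order and returns the
// waiters that remain blocked.
func distributeCredit(queue []*assistWaiter, credit int64) []*assistWaiter {
    for len(queue) > 0 && credit > 0 {
        head := queue[0]
        pay := credit
        if pay > head.remainingWork {
            pay = head.remainingWork
        }
        head.remainingWork -= pay
        credit -= pay
        if head.remainingWork == 0 {
            queue = queue[1:] // fully paid: wake this goroutine (elided)
        }
        // A partially paid waiter keeps its place at the head of the queue.
    }
    return queue
}

func main() {
    q := []*assistWaiter{{remainingWork: 100}, {remainingWork: 50}}
    q = distributeCredit(q, 120)            // fully pays the first waiter, partially pays the second
    fmt.Println(len(q), q[0].remainingWork) // 1 30
}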

This raises the question of what role gc assist plays in an application which has also configured a GOMEMLIMIT, expressing to the runtime how much memory it is willing to consume.

I think the role of assists with GOMEMLIMIT is clear: the assists are integral to maintaining the memory limit. They are the only mechanism the runtime has to ensure the total heap size doesn't exceed the heap goal. Without them, random OOMs are much more likely.

My questions could have been worded more clearly. In cases where the GOMEMLIMIT is the cause of the GC cycle and dictates the heap goal, I agree that gc assist pacing is integral to not exceeding the limit and avoiding OOMs. However, in cases where the heap goal is dictated by GOGC and there's plenty of room between the GC cycle's heap goal and the soft memory limit, gc assist feels extra unfortunate. That's because in these cases where the heap goal is a consequence of GOGC and is much less than GOMEMLIMIT, it's a fairly arbitrary goal to the application. GC assist is pacing user allocations so that it doesn't exceed a heap goal when the application is fine with the heap growing up to GOMEMLIMIT.

Maybe the right answer here is that GOGC should just be set to off when GOMEMLIMIT is set, though in practice we've seen that this can lead to more severe spikes in assist during the times when GC does run. I think this would be the ideal configuration though — GC only when the app gets close to GOMEMLIMIT, but then pad estimates enough / kick off the GC early enough that user goroutines ~never have to assist.
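
For reference, that configuration can be set either via the environment (GOGC=off GOMEMLIMIT=<limit>) or programmatically; a minimal example, with an arbitrary 8 GiB limit as a placeholder:

package main

import "runtime/debug"

func init() {
    debug.SetGCPercent(-1)        // equivalent to GOGC=off: no proportional trigger
    debug.SetMemoryLimit(8 << 30) // equivalent to GOMEMLIMIT=8GiB: GC paced by the soft limit
}

func main() {}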

Yes. In fact, that's what the pacer is fundamentally already doing. It starts the GC earlier to avoid assists, and before Go 1.21 tried to find the sweet spot of "minimal GC length" and "no assists". (Shorter GC cycles are also good for throughput, latency, and memory use, since write barriers are disabled and there's less 'floating garbage' that inflates the live heap size.) Unfortunately in real-world noisy environments, pacing for that sweet spot can result in bad guesses and spikes in assist when the GC is wrong. My patch for Go 1.21 specifically changes the pacer to be more conservative in its estimate of the cons/mark ratio. It is biased against assists in favor of longer GC mark phases, though it'll still hit the sweet spot in low-noise situations.

For my own understanding, why does starting GC earlier lead to longer GC mark phases? My naive understanding is that it would lead to more frequent GC cycles, but that each cycle would take about the same time (assuming no assist in either case).

One reason I have been fixated on assist scheduling in this reply is that the pacer, for the most part, should already be doing a good job at eliminating assist from its end. As of Go 1.21, the pacer should already be giving the rest of the runtime plenty of space to avoid assists from this perspective by starting the GC conservatively early. But fundamentally the fine-grained assist scheduling (i.e. when a goroutine actually calls into an assist, and how long it stays there) is still sensitive and has a really long worst case tail latency.

This makes sense, and I appreciate you writing everything out and building up to this conclusion. It's tempting to point at pacing as a solution to prevent assist, but if GC cycles are getting kicked off sufficiently early and yet allocations are still being scheduled to perform assist, then that's a problem with assist scheduling and we should try to understand what the cause is. I'm happy to help with that.


Would you also be willing to collect and share some execution traces? (What's produced by the runtime/trace package.) I'm curious about some properties of assists in your program and those traces would have all the information about assists I could ever need.

I'll start with the reproduction outside of cockroach that I provided above, even though I'm not certain it's the same exact behavior we see inside cockroach.

I'm collecting these while running with go1.21.5 (bbab863) and with GODEBUG=gctrace=1,gcpacertrace=1: traces.zip

The runtime trace shows that during each background GC, user goroutines end up spending most of their time in a GCWaiting state:

[Screenshot (2023-12-11): runtime trace showing user goroutines spending most of each GC cycle in the GCWaiting state]

Goroutines do run, but only for short periods of time. Presumably until their next allocation, at which point they bounce off the gc assistQueue:

[Screenshot (2023-12-11): goroutine timeline showing short runs punctuated by blocking on the GC assist queue]

@nvanbenschoten
Contributor Author

it would be nice to add this CockroachDB workload (and possibly others) to our benchmark suite that we run continuously against commits to tip

We would be very interested in this!

I see that etcd is already supported. I would imagine that benchmarking cockroach end-to-end would look similar. We also have plenty of Go microbenchmarks that we could expose.

One challenge to this is that CockroachDB uses Bazel for compilation, which makes switching go versions a little tricky. go build doesn't work out of the box, though we do have a way to use Bazel to generate all source files and artifacts necessary to then permit a go build.

@mknyszek
Contributor

mknyszek commented Dec 12, 2023

We would run with no assist-provided pacing until the heap goal is reached and then slam on the brakes with a stop-the-world pause in cases where the goal is reached before the cycle completes. In such a world, we would instead initialize bgScanCredit with the full amount of remaining heap at the start of a GC cycle and then never populate it in background GC workers.

There are clear downsides to that approach as well, many of which you alluded to in #44167, but I find it to be a useful comparison point. It hints at an intermediate approach where bgScanCredit starts with some non-zero value to avoid assist at the start of a GC cycle until the ratio between the rate of GC work and the rate of allocation is more clearly established. You can formulate this as giving the background GC workers a "head start" to reduce the chance that the allocations momentarily catch up and need to be slowed down.

This is similar but not identical to the hysteresis approach you described to let the allocations/scan work rate temporarily exceed the sustainable rate but then pace more aggressively if the rates don't right themselves quickly.

I really like this framing and I find it very genuinely enlightening.

It suggests to me that seeding the pool with some background credit proportional to the estimated total heap work is the right call. Say, somewhere between 1-5%. The theory is that we start with just enough credit so that the background workers perpetually keep us in the green, and assists always take from the pool without draining it. We could possibly even use that as a feedback mechanism to decide how much to seed it (within some limit).
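
In sketch form, with hypothetical names (this is not a runtime patch), the seeding-plus-feedback idea might look like the following:

package main

import "fmt"

const (
    minSeedFraction = 0.01 // 1% of expected scan work
    maxSeedFraction = 0.05 // 5% of expected scan work
)

// initialBgScanCredit returns the credit to seed this cycle with and the
// fraction to carry forward to the next cycle, adapted based on whether last
// cycle's seed was actually needed.
func initialBgScanCredit(expectedScanWork, lastCycleUnusedSeed int64, frac float64) (int64, float64) {
    if lastCycleUnusedSeed > 0 {
        frac *= 0.9 // seed went unused: shrink it
    } else {
        frac *= 1.1 // seed was fully drained (assists still fired early): grow it
    }
    if frac < minSeedFraction {
        frac = minSeedFraction
    }
    if frac > maxSeedFraction {
        frac = maxSeedFraction
    }
    return int64(float64(expectedScanWork) * frac), frac
}

func main() {
    seed, frac := initialBgScanCredit(512<<20, 0, 0.02)
    fmt.Println(seed, frac) // ~11.8M units of scan work seeded; fraction grows to 0.022
}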

Two different related observations I had which may be useful here (but probably aren't) are:

I think those are both very useful observations! :) I think flushing credit when a goroutine blocks is a good idea to try. I also think the assist queue should probably be satisfied in first-come, first-served order. The fairness is likely better for progress, because if you're doing a lot of partial fills in a row, you end up letting nothing run for quite a while, which seems pathological but also very bad.

My questions could have been worded more clearly. In cases where the GOMEMLIMIT is the cause of the GC cycle and dictates the heap goal, I agree that gc assist pacing is integral to not exceeding the limit and avoiding OOMs. However, in cases where the heap goal is dictated by GOGC and there's plenty of room between the GC cycle's heap goal and the soft memory limit, gc assist feels extra unfortunate.

Ack. And yeah, with plenty of runway the GC really shouldn't be assisting at all in this case. I fully agree with you there.

For my own understanding, why does starting GC earlier lead to longer GC mark phases? My naive understanding is that it would lead to more frequent GC cycles, but that each cycle would take about the same time (assuming no assist in either case).

Oops, sorry, you're right. It does lead to more frequent GC cycles after you're already starting earlier than the 'sweet spot' (where the background workers end right as the application reaches the heap goal). This is because there's a floor on the GC CPU usage at 25%. If we could scale GC CPU usage down further, we could start earlier and end at the same point.

I'm not really sure what I was thinking when I wrote that. Perhaps my head was still back in the pre-1.18 days, where this was more true because of the 5% assist (30% total) target. 😅

This makes sense, and I appreciate you writing everything out and building up to this conclusion. It's tempting to point at pacing as a solution to prevent assist, but if GC cycles are getting kicked off sufficiently early and yet allocations are still being scheduled to perform assist, then that's a problem with assist scheduling and we should try to understand what the cause is. I'm happy to help with that.

It would still be good to "prove" that the pacer isn't the source of extra assists at this point, since this is mostly a hunch. Where I'm coming from is that we're already starting early enough to see a slight increase in GC frequency in some cases, which suggests to me that we really are pushing past the sweet spot. But maybe we still haven't quite gotten there. We could probably rule it out by being even more conservative about the cons/mark while experimenting with assist scheduling, to keep us honest.

The runtime trace shows that during each background GC, user goroutines end up spending most of their time in a GCWaiting state:

Oof. That's what I was concerned about. :( Well, that does support my theory a little bit. Being on the assist queue frequently and not finding work to do is definitely a point at which something is breaking down, because the goroutines allocating think they need to help, but there isn't even enough work for them to do.

Two things I'm wondering about:
(1) If the heap in this case is somehow less parallelizable than usual, or
(2) If something is wrong about the scan work estimates going into the assist ratio.

@mknyszek
Contributor

We would be very interested in this!

I see that etcd is already supported. I would imagine that benchmarking cockroach end-to-end would look similar. We also have plenty of Go microbenchmarks that we could expose.

Great! That makes sense. We have another suite for go-gettable microbenchmarks; those are much easier to add. If you can point us at the most useful of them, they should be super easy to add.

One challenge to this is that CockroachDB uses Bazel for compilation, which makes switching go versions a little tricky. go build doesn't work out of the box, though we do have a way to use Bazel to generate all source files and artifacts necessary to then permit a go build.

Interesting. Is it not sufficient to just set go in PATH properly? Though, yeah, we may run into some issues with compiler changes that Bazel would need to be updated for that would be nice to avoid. It would be totally fine for the suite to perform that generation step, though. Is there a script or some Bazel command we could invoke?

@nvanbenschoten
Contributor Author

Regardless of the specific GOGC tuning, the process had plenty of room left between the gc goal and the soft memory limit during each GC, so there was little risk of OOMing without gc assist in these tests.

This raises the question of what role gc assist plays in an application which has also configured a GOMEMLIMIT, expressing to the runtime how much memory it is willing to consume.

...

However, in cases where the heap goal is dictated by GOGC and there's plenty of room between the GC cycle's heap goal and the soft memory limit, gc assist feels extra unfortunate. That's because in these cases where the heap goal is a consequence of GOGC and is much less than GOMEMLIMIT, it's a fairly arbitrary goal to the application. GC assist is pacing user allocations so that it doesn't exceed a heap goal when the application is fine with the heap growing up to GOMEMLIMIT.

I think the role of assists with GOMEMLIMIT is clear: the assists are integral to maintaining the memory limit. They are the only mechanism the runtime has to ensure the total heap size doesn't exceed the heap goal. Without them, random OOMs are much more likely.

@mknyszek I've been thinking more about this, and am wondering if a change like the following might make sense, at least in spirit:

diff --git a/src/runtime/mgcpacer.go b/src/runtime/mgcpacer.go
index 32e19f96e1..03bfc41c6c 100644
--- a/src/runtime/mgcpacer.go
+++ b/src/runtime/mgcpacer.go
@@ -502,6 +502,14 @@ func (c *gcControllerState) revise() {
        // heapGoal assuming the heap is in steady-state.
        heapGoal := int64(c.heapGoal())

+       // If we have a soft memory limit configured, relax the goal (raise it) as
+       // it applies to GC assist to accommodate for the fact that the application
+       // is tolerant of the heap growing to that larger limit. As a result, we
+       // don't need to pace allocations using assist as aggressively.
+       if c.memoryLimit.Load() != maxInt64 {
+               heapGoal = max(heapGoal, int64(c.memoryLimitHeapGoal()))
+       }
+
        // The expected scan work is computed as the amount of bytes scanned last
        // GC cycle (both heap and stack), plus our estimate of globals work for this cycle.
        scanWorkExpected := int64(c.lastHeapScan + c.lastStackScan.Load() + c.globalsScan.Load())

Currently, GC assist is paced to prevent a heap from exceeding the current GC's heap goal, regardless of whether the GC was triggered and that goal defined due to the GC percent or due to the soft memory limit. We can split these two possible heap goals into a percentHeapGoal and a memoryLimitHeapGoal. gcControllerState.heapGoal currently takes the minimum of these two goals and applies it both when deciding when to trigger the GC and when deciding whether to GC assist while the GC is running.

For the purposes of GC assist and in cases where memory limits are configured, would it make sense to only consider the memoryLimitHeapGoal to determine assist ratios?

The effect of this kind of change would be that when running a "gc percent" GC, we would allow the heap to grow past the percentHeapGoal without backpressure, as long as the heap is not growing fast enough that we risk exceeding the memoryLimitHeapGoal. Only in cases where we are on pace to exceed the memoryLimitHeapGoal would we need to assist to slow down allocations.

The rationale behind this kind of change is that when memory limits are set, the application has already expressed a maximum size that it can tolerate the heap growing to. With this limit defined and available to the Go runtime, pacing allocations when the heap is well below that limit and not on pace to reach that limit before a given GC cycle completes feels artificial. Pacing should only be necessary to preserve the memory limit and avoid OOMs.

@mknyszek
Contributor

@nvanbenschoten Sorry for the delay! I think that's a really interesting idea and it makes a lot of sense to me. If you have the protection of GOMEMLIMIT then some overshoot under the limit is almost certainly fine: assists should still kick in if you really need them, and the page allocator will still help on the physical memory side of things.

I think there might need to be some experimentation and tests to make sure that everything still behaves appropriately, but I don't see why it wouldn't.

As for the specific implementation details, I am not really a fan of maxInt64 having special meaning as a general rule. Another way to put this is that I think maxInt64-1 should behave basically identically to maxInt64 in this scenario.

The easiest "smooth way" of doing this that comes to mind is to always pace assists by the memory limit heap goal. I'd be a little worried about pathological cases of runaway allocation causing less predictable memory usage for applications only using GOGC.

I can't help but feel that there is some simple, smooth way of expressing this relationship. That is, some function that prefers the memory limit goal the closer it gets to the GOGC goal. The idea here is that even if the memory limit goal is far off of GOGC but some realistic number, you might not set it to the memory limit goal, but maybe halfway, and that's still good enough. Like, suppose in the case where the memory limit goal was 100x the GOGC goal, this function would spit out a number 10x the GOGC goal. That's still probably plenty to get a massive reduction in assists. Let me know if that makes sense to you.
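
One candidate function with exactly that 100x-to-10x property (offered here purely as an illustration, not something proposed in the CL or settled in this thread) is the geometric mean of the two goals:

package main

import (
    "fmt"
    "math"
)

// assistHeapGoal interpolates between the GOGC-derived goal and the
// memory-limit-derived goal for the purposes of assist pacing only.
func assistHeapGoal(gogcGoal, memLimitGoal float64) float64 {
    if memLimitGoal <= gogcGoal {
        return memLimitGoal // the limit is the binding constraint; pace against it directly
    }
    // Equals gogcGoal when the two goals coincide, and 10x gogcGoal when the
    // memory limit goal is 100x gogcGoal, matching the example above.
    return math.Sqrt(gogcGoal * memLimitGoal)
}

func main() {
    fmt.Println(assistHeapGoal(1<<30, 100<<30) / (1 << 30)) // 10
}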
