
runtime: make the proportion of CPU the GC uses based on actual available CPU time and not GOMAXPROCS #59715

Open
mknyszek opened this issue Apr 19, 2023 · 10 comments

mknyszek commented Apr 19, 2023

Currently the Go GC sets its parallelism to 25% of GOMAXPROCS. In doing so, it effectively assumes the Go runtime has access to GOMAXPROCS*N seconds of CPU time in N seconds of wall time, which is true in many circumstances, but not all.

For example, if a container is configured to have 100 ms of CPU time available for each 1 second window of wall time and GOMAXPROCS=4 (because that's how much parallelism is available on the machine), then the Go runtime is effectively assuming it can use 1 second of CPU time every 1 second of wall time on GC alone (25% of GOMAXPROCS=4 is one full CPU's worth), ten times the container's entire budget.

In practice, this is not true, resulting in significant CPU throttling that can hurt latency-sensitive applications. To be more specific, throttling in general is not really a Go runtime issue. The issue is that the GC on its own is using more CPU over a given window of time than the container is actually allowed to use.
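To put numbers on that mismatch, here is a minimal arithmetic sketch using the figures from the example above (the variable names are just for illustration; nothing here is a real runtime API):

```go
package main

import "fmt"

func main() {
	// Numbers from the example above.
	gomaxprocs := 4.0
	quotaSec, periodSec := 0.1, 1.0 // 100ms of CPU per 1s window

	// The GC assumes it may use 25% of GOMAXPROCS worth of CPU time.
	gcAssumedCPU := 0.25 * gomaxprocs // 1 CPU-second per wall second

	// What the container actually allows, expressed in CPUs.
	allowedCPU := quotaSec / periodSec // 0.1 CPU-seconds per wall second

	fmt.Printf("GC assumes %.2f CPUs; container allows %.2f CPUs (%.0fx overcommit by the GC alone)\n",
		gcAssumedCPU, allowedCPU, gcAssumedCPU/allowedCPU)
}
```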

Another example where this can go awry is with GOMEMLIMIT. If a container with a small CPU reservation runs up against the memory limit and the GC starts to execute frequently enough to then subsequently hit the 50% GC CPU limit, that 50% GC CPU limit is going to be based on GOMAXPROCS. Once again, this results in significant throttling that can make an already bad latency situation worse.
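For reference, the memory limit in question is set with the GOMEMLIMIT environment variable or the equivalent runtime/debug.SetMemoryLimit call; the 50% GC CPU limit itself is internal to the runtime and not user-settable. A minimal sketch of configuring the limit from code:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to GOMEMLIMIT=256MiB. Returns the previous limit.
	prev := debug.SetMemoryLimit(256 << 20)
	fmt.Println("previous memory limit:", prev)
	// As the heap approaches this limit, the GC runs more and more
	// often, up to the runtime's internal 50% GC CPU limit -- which,
	// as described above, is computed from GOMAXPROCS, not from the
	// container's CPU quota.
}
```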

One possible fix for this is to make the Go GC container-aware. Rather than deriving the GC CPU limit and the 25% mark-phase parallelism from GOMAXPROCS, the Go GC could derive these numbers from the container's configuration.

On Linux, that means using the CPU cgroup's CFS parameters, cpu.cfs_period_us and cpu.cfs_quota_us. More specifically, the GC CPU limit would be set to 50% of cpu.cfs_quota_us in any given window of cpu.cfs_period_us, while the GC's default mark-phase parallelism would be 25% of cpu.cfs_quota_us / cpu.cfs_period_us. (Disclaimer: this second part I'm less sure about.)

The Go runtime would re-read these parameters regularly, for example every few seconds, to stay up-to-date with its current environment.
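A minimal sketch of that derivation, assuming a cgroup v1 hierarchy mounted at the conventional /sys/fs/cgroup path (cgroup v2 exposes a single cpu.max file instead); the readCFS helper and the refresh loop are illustrative, not anything the runtime actually exposes:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCFS reads one cgroup v1 CPU controller file as an integer.
func readCFS(name string) (int64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/cpu/" + name)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	for range time.Tick(2 * time.Second) { // re-read every few seconds
		quota, err1 := readCFS("cpu.cfs_quota_us")
		period, err2 := readCFS("cpu.cfs_period_us")
		if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
			continue // no quota set (quota is -1 when unlimited)
		}
		cpus := float64(quota) / float64(period) // CPUs' worth of time per period
		fmt.Printf("GC CPU limit: %.2f CPUs; default mark parallelism: %.2f CPUs\n",
			0.5*cpus, 0.25*cpus)
	}
}
```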

@mknyszek mknyszek added this to the Backlog milestone Apr 19, 2023
@mknyszek mknyszek added the NeedsInvestigation label Apr 19, 2023
@mknyszek mknyszek self-assigned this Apr 19, 2023
@gopherbot gopherbot added the compiler/runtime label Apr 19, 2023
mknyszek (Contributor, Author) commented:

I should note that on Linux we'd ignore cpu.shares because that seems to be the norm these days. The main issue is that the runtime can't really do anything useful with cpu.shares without knowing the total number of shares distributed across the same cgroup.

mknyszek (Contributor, Author) commented:

On the other hand, it might be a little bit uncharacteristic and backwards incompatible for the runtime to just adjust itself based on container configuration. That might suggest having another knob, but we don't like knobs. There might be a better way to identify how much CPU time the GC should be using.

At the very least, this issue should serve to track any improvements we make in this area.

thockin commented Sep 1, 2023

Coming in from the Kubernetes POV: Looking at CFS quota is also wrong. That indicates the max that you MAY get, but does not say anything about what you will actually get at any particular time.

mknyszek commented Sep 2, 2023

@thockin Thanks, that makes sense. I think I may have come to a similar conclusion not long after I filed this issue. 😅

Another perspective on all of this is to just say that GOMAXPROCS should be set directly to the amount of parallelism available in bursts, because fundamentally the GC is like a bursty, CPU-bound (really, memory-bandwidth-bound) process. Containers then just need to be tuned with that in mind.

Things are easy to reason about if your Go program is already CPU-bound, since the GC isn't really adding anything, just taking away some of that CPU time. It's also fine if your application is bursty; the GC is just one more burst. The difficulty comes when your program is I/O-bound, since now you have this occasional burst to account for, which might be completely unexpected if you're not already familiar with the impact a GC has. It's possible that most of the confusion and friction here comes from a misunderstanding of how the GC behaves in a Go program.

I'd love to collect more data on this, but I'm not sure how. I suppose I could dig through our bug backlog and try to figure out how often this has come up and categorize the issues more thoroughly.

thockin commented Sep 2, 2023

I came to this issue after many reports of GC languages, including but not limited to Go, having miserable issues around CPU in containers. It is still not obvious to me what the right behavior would be.

It seems more like a dynamic problem than a static one. GOMAXPROCS doesn't feel right for this.

mknyszek commented Sep 2, 2023

I don't disagree that GOMAXPROCS could be better; we've discussed the idea of something like GOMAXPROCS=auto, where the runtime just figures out available parallelism by itself (or gets it from the system). Then the runtime would decouple the concept of available parallelism from available CPU time, and the GC would base its operation on separate estimates of each.

But what's always gotten in the way of all this is that it's unclear exactly how to discover parallelism and available CPU time in a robust and prompt manner.

Also, GC languages are likely to run into issues like burning all CPU quota and getting stalled, but I suspect they're not the only case of a program behaving poorly in a CPU-constrained container. My broader point of thinking about the GC as a bursty parallel workload is that it seems like CPU quotas (at least the way they're defined today) don't interact well with bursty parallel workloads. Unless I'm missing something, a web server written in C could experience a similar kind of CPU throttling issue. Even if the CPU quota for this application is set correctly for its p50, p75, or even p90 load, the tail can get arbitrarily bad due to stalls. It's just less likely because there are fewer bursts (no background GC threads burning CPU time). IIUC, today, the only complete mitigation is giving the container full CPUs, which is wasteful.

Taking that broader perspective, the answer to this question seems like it has to come from the OS layer. Whether that's a better way to communicate container resources back to the application, better scheduling, or something else, I don't know.

thockin commented Sep 2, 2023

You're right that this is not a unique problem :)

We could feed in some clues from (for example) the Kubernetes layer. E.g., shares mean something concrete, but not exactly the same thing as parallelism. Maybe that could be the seed for a sort of dynamic mechanism: slow-start from the seed info and "feel" your way into higher parallelism, backing off when things actually take too long.
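A rough sketch of that kind of feedback loop, in the spirit of TCP slow start; the seed value, the time budget, and the back-off policy are all invented for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// tuner grows parallelism while work finishes on time and backs off
// when it doesn't.
type tuner struct {
	parallelism int // current parallelism
	max         int // upper bound, e.g. the machine's CPU count
}

// observe feeds in one measurement: how long a batch of work took
// versus how long was budgeted for it, and adjusts parallelism.
func (t *tuner) observe(elapsed, budget time.Duration) {
	if elapsed > budget {
		// Took too long: we were likely throttled, so back off hard.
		t.parallelism /= 2
		if t.parallelism < 1 {
			t.parallelism = 1
		}
	} else if t.parallelism < t.max {
		// Finished on time: probe for a little more parallelism.
		t.parallelism++
	}
}

func main() {
	// Seed from a concrete clue, e.g. CPU shares; 1 is a placeholder.
	t := &tuner{parallelism: 1, max: 8}
	t.observe(90*time.Millisecond, 100*time.Millisecond)  // on time: grow
	t.observe(250*time.Millisecond, 100*time.Millisecond) // throttled: halve
	fmt.Println("parallelism:", t.parallelism)
}
```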

justinsb commented:

I suggest there are a few (hopefully smaller) questions we could work through here:

  1. How should GC behave in a container: given limited CPU, how should we divide that CPU between GC and the user workload?
    1b. Does that answer change if/when memory is also limited?
  2. How should a Go program discover the CPU and/or memory limits?
    2b. What should the Go program do if the CPU or memory limits can change dynamically?

I think if we can tackle those questions we can get to the right answer. There is more complexity here (soft/hard limits, thread priorities, dynamically "autotuning" limits across multiple processes, etc.), but I think of those more as potentially part of our answers rather than the actual questions.

seankhliao (Member) commented:

We do have #33803 filed for making the default GOMAXPROCS cgroup-aware.
Even if the full quota may not be available, it sounds like it would be a strict improvement over using the hardware processor count?
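For illustration, a sketch along the lines of what #33803 proposes, derived from the same cgroup v1 CFS files as above; the community package go.uber.org/automaxprocs implements a similar idea today via a blank import:

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// cfsInt reads one cgroup v1 CPU controller file as an integer,
// returning -1 on any error.
func cfsInt(name string) int64 {
	b, err := os.ReadFile("/sys/fs/cgroup/cpu/" + name)
	if err != nil {
		return -1
	}
	n, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		return -1
	}
	return n
}

func main() {
	quota, period := cfsInt("cpu.cfs_quota_us"), cfsInt("cpu.cfs_period_us")
	if quota <= 0 || period <= 0 {
		return // no quota set (or unreadable): keep the hardware default
	}
	n := int(math.Ceil(float64(quota) / float64(period)))
	if n < 1 {
		n = 1
	}
	prev := runtime.GOMAXPROCS(n)
	fmt.Printf("GOMAXPROCS: %d -> %d (hardware: %d)\n", prev, n, runtime.NumCPU())
}
```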

thockin commented Feb 11, 2024

Probably anything is better than nothing. Unfortunately, a process may not have all the info it needs to make the BEST choices, but this is a place where we should be willing to let abstractions leak. If we (Kube) need to find a way to feed information that allows a self-tuning runtime to be more correct, we will.
