proposal: runtime: add per-goroutine CPU stats #41554

asubiotto · 2020-09-22T13:28:35Z

Per-process CPU stats can currently be obtained via third-party packages like https://github.com/elastic/gosigar. However, I believe certain types of applications need to understand CPU usage at a finer granularity.

Example

At a high level in CockroachDB, whenever an application sends a query to the database, we spawn one or more goroutines to handle the request. If more queries are sent to the database, they each get an independent set of goroutines. Currently, we have no way of showing the database operator how much CPU is used per query. This is useful for operators in order to understand which queries are using more CPU and measure that against their expectations in order to do things like cancel a query that's using too many resources (e.g. an accidental overly intensive analytical query). If we had per-goroutine CPU stats, we could implement this by aggregating CPU usage across all goroutines that were spawned for that query.

Fundamentally, I think this is similar to bringing up a task manager when you feel like your computer is slow, figuring out which process on your computer is using more resources than expected, and killing that process.

Proposal

Add a function to the runtime package that does something like:

type CPUStats struct {
    user time.Duration
    system time.Duration
    ...
}

// ReadGoroutineCPUStats writes the active goroutine's CPU stats into
// CPUStats.
func ReadGoroutineCPUStats(stats *CPUStats)

Alternatives

From a correctness standpoint, an alternative to offering these stats is to LockOSThread the active goroutine for exclusive thread access and then get coarser-grained thread-level CPU usage by calling Getrusage for the current thread. The performance impact is unclear.
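A minimal sketch of this alternative, assuming Linux (the value 1 for RUSAGE_THREAD is hard-coded as an assumption, and the helper names threadCPU and measure are hypothetical):

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// rusageThread is the Linux value of RUSAGE_THREAD (assumption: Linux only).
const rusageThread = 1

// threadCPU returns the user and system CPU time consumed so far by the
// calling OS thread.
func threadCPU() (user, system time.Duration, err error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(rusageThread, &ru); err != nil {
		return 0, 0, err
	}
	toDur := func(tv syscall.Timeval) time.Duration {
		return time.Duration(tv.Sec)*time.Second + time.Duration(tv.Usec)*time.Microsecond
	}
	return toDur(ru.Utime), toDur(ru.Stime), nil
}

// measure pins the calling goroutine to its thread, runs fn, and returns the
// thread CPU time (user plus system) consumed while fn ran.
func measure(fn func()) (time.Duration, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	u0, s0, err := threadCPU()
	if err != nil {
		return 0, err
	}
	fn()
	u1, s1, err := threadCPU()
	if err != nil {
		return 0, err
	}
	return (u1 - u0) + (s1 - s0), nil
}

func main() {
	cpu, err := measure(func() {
		x := 0
		for i := 0; i < 50000000; i++ {
			x += i
		}
		_ = x
	})
	fmt.Println(cpu, err)
}
```

Note that this only covers work done by the pinned goroutine itself; any goroutines it starts run on other threads and are not counted.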

Additional notes

Obtaining execution statistics during runtime at a fine-grained goroutine level is essential for an application like a database. I'd like to focus this conversation on CPU usage specifically, but the same idea applies to goroutine memory usage. We'd like to be able to tell how much live memory a single goroutine has allocated to be able to decide whether this goroutine should spill a memory-intensive computation to disk, for example. This is reminiscent of #29696 but at a finer-grained level without a feedback mechanism.

I think that offering per-goroutine stats like this is useful even if it's just from an observability standpoint. Any application that divides work into independent sets of goroutines and would like to track resource usage of a single group should benefit.

martisch · 2020-09-22T14:06:50Z

A possible solution to showing high level usage of different query paths can be achieved by setting profiling labels on the goroutine:
https://golang.org/src/runtime/pprof/runtime.go?s=1362:1432#L26

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or process stats from the Operating system directly.

ianlancetaylor · 2020-09-22T20:16:42Z

Yes, this is exactly what labels are for. A nice thing about labels is that they let you measure CPU or heap performance across a range of goroutines all cooperating on some shared task.

https://golang.org/pkg/runtime/pprof/#Labels
https://golang.org/pkg/runtime/pprof/#Do

Please let us know if you need something that is not addressed by labels.

asubiotto · 2020-09-23T11:42:00Z

Thanks for the suggestion. My main concern with profiling is that there is a non-negligible performance overhead. For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I'm hoping that this information can be gathered by the scheduler in a much cheaper way since the question to answer is not "where has this goroutine spent most of its time" but "how much CPU time has this goroutine used". Would this even be feasible?

ianlancetaylor · 2020-09-24T00:41:28Z

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

tbg · 2021-02-19T16:17:51Z

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or process stats from the Operating system directly.

I'm curious what the internal experience with this mechanism is and how it is used exactly. I suppose some goroutine will be tasked with continuously reading the background profiling stream, but what does it do with it? Is there any code I can look at for how to get from "here's a stream of timestamped label maps" to "this label used X CPU resources"? Do we "just" create a profile every X seconds from the stream? And if that is the idea, how is that different from doing "regular" profiling at a lower sample rate (you might miss a beat every time you restart the sample, but let's assume this is ok)? I feel like I'm missing the bigger picture here.

For the record, I did get the background tracing PR linked above working on the go1.16 branch: cockroachdb/cockroach#60795

tbg · 2021-02-22T15:53:29Z

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

With the RUSAGE_THREAD idea, wouldn't we be adding a syscall per scheduler tick? That seems very expensive and would likely be a non-starter for use cases such as ours where we want to track usage essentially at all times for most of the goroutines.

The most lightweight variant I have seen is based on nanotime() per scheduler ticks, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up the nanotime() instead of the ticks):

https://github.com/golang/go/compare/master...cockroachdb:taskgroup?expand=1

I understand that this basic approach has caveats (if threads get preempted, the duration of the preemption will be counted towards the goroutine's CPU time, and there's probably something about cgo calls too) but it seems workable and cheap enough to at least be able to opt into globally.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

The PR above avoids (if I understand you correctly) this by using propagation rules very similar to labels. A small initial set of goroutines (ideally just one, at the top of the "logical task") is explicitly labelled via the task group (identified by, say, an int64) to which the app holds a handle (for querying). The ID is inherited by child goroutines. Statistics are accrued at the level of the task group, not at the individual goroutine level (though goroutine level is possible, just create unique task group IDs). I recall that the Go team is averse to giving users goroutine-local storage by accident, which this approach would not (there would not need to be a way to retrieve the current task group from a goroutine) - given the task group ID, one can ask for its stats. But one can not ask a goroutine about its task group ID.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

I agree that getting labels and either of the new mechanisms proposed to play together would be nice, but it doesn't seem straightforward. You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>. If a goroutine with labels map[string]string{"l1": "foo", "l2": "bar"} ran for 10ms, we would accrue these to both of these label pairs, i.e. conceptually somewhere in the runtime m[tup{"l1", "foo"}] += 10 // ms and similarly for l2. Perhaps the way to access these metrics would be via the runtime/metrics package.

One difficulty is that we lose the simple identification of a task group with a memory address which we had before, because two maps with identical contents identify the same task group.

On the plus side, any labelled approach would avoid the clobbering of counters that could naively result if libraries did their own grouping (similar to how one should not call runtime.SetGoroutineLabels), as counters could be namespaced and/or nested.

You mention above that

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet, and I think it's because of a somewhat similar problem - a goroutine's allocations in general outlive the goroutine, so the current mechanism (where the labels hang off the g) doesn't work and the lifecycle needs to be managed explicitly. (Everything I know about this I learned from #23458 (comment)). Maybe the way forward there opens a way forward for sane per-label-pair counters?

knz · 2021-02-22T17:09:47Z

The most lightweight variant I have seen is based on nanotime() per scheduler ticks, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up the nanotime() instead of the ticks:

The implementation using nanotime() is in fact ready already here: cockroachdb@b089033

Perhaps the way to access these metrics would be via the runtime/metrics package.

Done here: cockroachdb@9020a4b

You could assign to each label pair (key -> value) assigned a counter (i.e. two goroutines with both have label[k]=v share a task group <k,v>

One elephant in the room is that looking up labels and traversing a Go map in the allocation hot path is a performance killer.

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet,

This is the other elephant in the room: partitioning the heap allocator by profiler labels would run afoul of the small heap optimization. (Plus, it would be CPU-expensive to group the profiling data by labels.)

knz · 2021-02-22T17:12:18Z

In case it wasn't clear from the latest comments from @tbg and myself: we believe that doing things at the granularity of goroutines is too fine grained, and raises painful questions about where to accumulate the metrics when goroutines terminate.

While @tbg is trying to salvage pprof labels as the entity that defines a grouping of goroutines, I am promoting a separate "task group" abstraction which yields a simpler implementation and a lower run-time overhead. I don't know which one is going to win yet—we need to run further experiments—but in any case we don't want a solution that does stats per individual goroutines.

crazycs520 · 2021-05-13T05:24:25Z

@knz I really like the feature in https://github.com/cockroachdb/go/commits/crdb-fixes. Could you open a pull request against the official Go repository?

knz · 2021-05-13T14:50:39Z

thank you for your interest!
We're still evaluating whether this approach is useful in practice. Once we have a good story to tell, we'll share it with the go team.

Fixes github.com/golang/issues/41554. This commit introduces a /sched/goroutine/running:nanoseconds metric, defined as the total time spent by a goroutine in the running state. This measurement is useful for systems that would benefit from fine-grained CPU attribution. An alternative for scheduler-backed CPU attribution would be the use of profiler labels. Given it's common to spawn multiple goroutines for the same task, goroutine-backed statistics can easily become misleading. Profiler labels instead let you measure CPU performance across a set of cooperating goroutines. That said, it comes with overhead that makes it unfeasible to always enable. For high-performance systems that care about fine-grained CPU attribution (databases for e.g. that want to measure total CPU time spent processing each request), profiler labels are too cost-prohibitive, especially given the Go runtime has a much cheaper view of the data needed. It's worth noting that we already export /sched/latencies:seconds to track scheduling latencies, i.e. time spent by a goroutine in the runnable state. This commit does effectively the same, except for the running state. Users are free to use this metric to power histograms or tracking on-CPU time across a set of goroutines. Change-Id: Id21ae4fcee0cd5f983604d61dad373098a0966bc

gopherbot · 2022-02-24T14:01:31Z

Change https://go.dev/cl/387874 mentions this issue: runtime,runtime/metrics: track on-cpu time per goroutine

Fixes golang#41554. This commit introduces a /sched/goroutine/running:nanoseconds metric, defined as the total time spent by a goroutine in the running state. This measurement is useful for systems that would benefit from fine-grained CPU attribution. An alternative for scheduler-backed CPU attribution would be the use of profiler labels. Given it's common to spawn multiple goroutines for the same task, goroutine-backed statistics can easily become misleading. Profiler labels instead let you measure CPU performance across a set of cooperating goroutines. That said, it comes with overhead that makes it unfeasible to always enable. For high-performance systems that care about fine-grained CPU attribution (databases for e.g. that want to measure total CPU time spent processing each request), profiler labels are too cost-prohibitive, especially given the Go runtime has a much cheaper view of the data needed. It's worth noting that we already export /sched/latencies:seconds to track scheduling latencies, i.e. time spent by a goroutine in the runnable state (go-review.googlesource.com/c/go/+/308933). This commit does effectively the same, except for the running state. Users are free to use this metric to power histograms or tracking on-CPU time across a set of goroutines. Change-Id: Id21ae4fcee0cd5f983604d61dad373098a0966bc

Fixes golang#41554. This commit introduces a /sched/goroutine/running:nanoseconds metric, defined as the total time spent by a goroutine in the running state. This measurement is useful for systems that would benefit from fine-grained CPU attribution. An alternative to scheduler-backed CPU attribution would be the use of profiler labels. Given it's common to spawn multiple goroutines for the same task, goroutine-backed statistics can easily become misleading. Profiler labels instead let you measure CPU performance across a set of cooperating goroutines. That said, it comes with overhead that makes it unfeasible to always enable. For high-performance systems that care about fine-grained CPU attribution (databases for e.g. that want to measure total CPU time spent processing each request), profiler labels are too cost-prohibitive, especially given the Go runtime has a much cheaper view of the data needed. It's worth noting that we already export /sched/latencies:seconds to track scheduling latencies, i.e. time spent by a goroutine in the runnable state (go-review.googlesource.com/c/go/+/308933). This commit does effectively the same except for the running state. Users are free to use this metric to power histograms or tracking on-CPU time across a set of goroutines. Change-Id: Ie78336a3ddeca0521ae29cce57bc7a5ea67da297

prattmic · 2022-02-24T16:38:01Z

@irfansharif sent https://go.dev/cl/387874 today, which adds a /sched/goroutine/running:nanoseconds metric to runtime/metrics that reports total time in _Grunning for the current goroutine. This is effectively equivalent to the ReadGoroutineCPUStats API from the first comment but hidden behind the runtime/metrics API.

FWIW, I don't think this API is a good fit for runtime/metrics, which typically returns the same results regardless of the calling goroutine [1]. Given the other metrics available, like /sched/latencies:seconds, I expected this to be a histogram of time spent in running, which I think could be useful, but not for the resource isolation requirements in this proposal.

[1] We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.
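For comparison, this is roughly how an existing process-wide metric such as /sched/latencies:seconds is read today. It is a sketch using the standard runtime/metrics API, and the helper name schedLatencies is an assumption. The result is the same no matter which goroutine calls it, which is the property a per-goroutine metric would break:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// schedLatencies reads the process-wide scheduling latency histogram.
func schedLatencies() (*metrics.Float64Histogram, bool) {
	s := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil, false // metric not supported by this Go version
	}
	return s[0].Value.Float64Histogram(), true
}

func main() {
	h, ok := schedLatencies()
	if !ok {
		fmt.Println("metric unsupported")
		return
	}
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	fmt.Println("scheduling events observed:", total)
}
```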

cc @golang/runtime @mknyszek

irfansharif · 2022-02-24T17:01:14Z

Thanks for the quick turnaround. I'm happy to go the route of ReadGoroutineCPUStats or a more direct GoroutineRunningNanos-like API; I only went with runtime/metrics given the flexibility of the API and because direct APIs have commentary favoring the runtime/metrics variants. I would also be happy with adding just the private grunningnanos() helper for external dependents like CRDB to go:linkname directly against, but I can see why that's unsatisfying to include in the stdlib. If the nanos were just tracked in type g struct, that too would be altogether sufficient.

We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.

Great point. When documenting that the metric is scoped only to the calling goroutine, I hoped that'd be sufficient for monitoring systems to know and filter out explicitly.

rhysh · 2022-05-13T17:53:15Z

The report in #36821 reflects bugs in / shortcomings of Go's CPU profiling on Linux as of early 2020. @irfansharif , have you found that it is still inaccurate on Linux as of Go 1.18, or is it possible that the runtime/pprof sampling profiler could give acceptable results for on-CPU time?

Some of the earlier comments here pointed to map-based goroutine labels adding unacceptable computation overhead. If those were more efficient, would goroutine labels plus the sampling profiler work well enough?

There's also the question from @andy-kimball of granularity, in particular for billing customers for their CPU time. I don't fully understand why sampling isn't an option, even for (repeated) operations that take only a few hundred microseconds each. It seems like the work of repeated operations would show up in over 1000 samples (at 100 Hz) well before it's consumed $0.01 of CPU time (at current cloud prices).

Overall it seems to me that the current tools we have are close, so I'm interested in how we can improve them enough to be useful here.

Fixes golang#41554. This commit introduces a /sched/goroutine/running:nanoseconds metric, defined as the total time spent by a goroutine in the running state. This measurement is useful for systems that would benefit from fine-grained CPU attribution. An alternative to scheduler-backed CPU attribution would be the use of profiler labels. Given it's common to spawn multiple goroutines for the same task, goroutine-backed statistics can easily become misleading. Profiler labels instead let you measure CPU performance across a set of cooperating goroutines. That said, it has two downsides: - performance overhead; for high-performance systems that care about fine-grained CPU attribution (databases for e.g. that want to measure total CPU time spent processing each request), profiler labels are too cost-prohibitive, especially given the Go runtime has a much cheaper and more granular view of the data needed - inaccuracy and imprecision, as evaluated in golang#36821 It's worth noting that we already export /sched/latencies:seconds to track scheduling latencies, i.e. time spent by a goroutine in the runnable state (go-review.googlesource.com/c/go/+/308933). This commit does effectively the same except for the running state on the requesting goroutine. Users are free to use this metric to power histograms or tracking on-CPU time across a set of goroutines. Change-Id: Ie78336a3ddeca0521ae29cce57bc7a5ea67da297

andy-kimball · 2022-05-14T02:32:46Z

We would like to get to the point where a user can run an EXPLAIN ANALYZE and get back a cost for that single statement. This should work on a dev/staging cluster (i.e. with little or no other load), and give results that are similar to a production cluster that is fully loaded. Sampling would not work well for that.

Similarly, we'd like to be able to show a list of all recent SQL statements that have been run and show the CPU consumed by each. While some statements may have thousands of iterations and show an accurate number, there's often a "long tail" of other SQL statements that have only a few iterations. Those would often show 0, since we didn't happen to get a "hit". While it's perhaps better than nothing, we're trying to enable a better customer experience than that.

rhysh · 2022-05-16T19:17:13Z

Collecting a timestamp every time the scheduler starts or stops a goroutine, tracking goroutines' running time to sub-microsecond levels, sounds close to key features of the execution tracer (runtime/trace). That data stream also describes when goroutines interact with each other, which allows tracking which parts of the work each goroutine does are for particular inbound requests.

Rather than build up additional ways to see data that's already available via runtime/trace, are there ways to pare down or modify the execution tracer's view to something with lower complexity and overhead? I have guesses (below), but I'm curious for your view on how the execution tracer falls short today for your needs.

Is it:

CPU overhead (from collecting timestamps, from collecting stack traces, writing out the byte stream, or some other part) is too high?
Expensive startup (does work proportional to the number of living goroutines)?
Runtime does not flush partial results in a timely manner?
Inconvenient to parse?
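For context, this is a minimal sketch of what collecting an execution trace with user-level markers looks like with the existing runtime/trace API. The task type "query" and the helper traceQuery are illustrative assumptions:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
)

// traceQuery records an execution trace around fn and tags the work with a
// user-level task and region so it can be attributed later, out of band.
func traceQuery(name string, fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // e.g. tracing is already enabled
	}
	ctx, task := trace.NewTask(context.Background(), "query")
	trace.WithRegion(ctx, name, fn)
	task.End()
	trace.Stop()
	return buf.Bytes(), nil
}

func main() {
	data, err := traceQuery("select-1", func() {
		sum := 0
		for i := 0; i < 1000000; i++ {
			sum += i
		}
		_ = sum
	})
	fmt.Println(len(data) > 0, err)
}
```

The resulting bytes still have to be parsed after the fact (for example with go tool trace), which is where the concerns listed above come in.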

mknyszek · 2022-05-16T19:34:02Z

@rhysh I am currently having similar thoughts, and I think a low-enough overhead, user-parseable trace (with some way of inserting markers at the user level) sounds like it could potentially resolve this situation, assuming it's OK to stream this data out and analyze it programmatically out-of-band.

For this to work, though, the trace format itself has to be more scalable and robust, too. On scalability, the traces come out to be pretty large, and they unfortunately basically need to be read fully into memory. On robustness, you really always need a perfect trace to get any useful information out (or at least, the tools back out if it's not perfect). For instance, partial traces don't exist as a concept because there's no enforcement that, say, a block of really old events actually appears early in the trace, forcing you to always deal with the whole trace (from StartTrace to StopTrace).

I'm currently thinking a bit about tracing going forward, and this has been on my mind. No promises of course, but Go's tracing story in general could use some improvements.

ajwerner · 2022-06-02T22:10:31Z

What would it take to get this proposal moved into the Active column of the Proposals board?

rsc · 2022-06-22T18:47:07Z

Anyone interested in tracing, I can't recommend highly enough Dick Sites's new book Understanding Software Dynamics.

For an intro see his Stanford talk.

It seems to me that really good tracing along these lines would help a lot more than scattered per-goroutine CPU stats.

irfansharif · 2022-06-22T19:23:16Z

I agree that really good tracing would help, but for the environments we're hoping to use per-goroutine CPU stats, and the contexts we're hoping to use it under (CPU controllers), we're not running with the kinds of kernel patches (https://github.com/dicksites/KUtrace) Dick's tracing techniques seem to be predicated on. I've not evaluated how the more mainstream kernel tracing techniques (ftrace, strace) or use of bpf probes fare performance wise.

I also recently learned about how Go's own GC is able to bound its CPU use to 25% (https://github.com/golang/proposal/blob/master/design/44167-gc-pacer-redesign.md#a-note-about-cpu-utilization, authored by participants in this thread), which effectively uses a similar idea: capturing the grunning time for time spent in GC assists. This kind of CPU control is similar to what we could do if per-goroutine CPU stats were exposed by the runtime: it would let us hard-cap a certain tenant on a given machine to some fixed % of CPU (with the usual caveats around how accurate this measure is).

rsc · 2022-06-22T19:26:13Z

@irfansharif I think there's an interesting question how much we could learn with a Go-runtime-run version of the tracers (GUTrace, say) and no kernel help. I think quite a lot. The point was inspiration, not direct adoption.

The setting where tracing would not help is if you want the program to observe itself and respond, like the pacer does. The tracing I am thinking of has a workflow more like the Go profiler, where you collect the profile/trace and then interact with it outside the program.

irfansharif · 2022-06-22T19:34:31Z

The setting where tracing would not help is if you want the program to observe itself and respond, like the pacer does.

Ack. This is precisely the class of problems I'm hoping to push on with per-goroutine CPU stats (or at least just the grunning time).

irfansharif · 2022-06-22T19:56:08Z

@irfansharif, have you found that it is still inaccurate on Linux as of Go 1.18, or is it possible that the runtime/pprof sampling profiler could give acceptable results for on-CPU time?

For posterity, running the reproductions from https://github.com/chabbimilind/GoPprofDemo#serial-program:

$ go version
go version devel go1.19-7e33e9e7a3 Fri May 13 14:03:15 2022 -0400 darwin/arm64

$ time go run serial.go && go tool pprof -top serial_prof |  grep 'expect'
      60ms 20.00% 20.00%       70ms 23.33%  main.H_expect_14_546 (inline)
      50ms 16.67% 36.67%       50ms 16.67%  main.G_expect_12_73 (inline)
      50ms 16.67% 53.33%       60ms 20.00%  main.J_expect_18_18 (inline)
      30ms 10.00% 63.33%       30ms 10.00%  main.F_expect_10_91 (inline)
      20ms  6.67% 70.00%       20ms  6.67%  main.C_expect_5_46 (inline)
      20ms  6.67% 76.67%       20ms  6.67%  main.E_expect_9_09 (inline)
      20ms  6.67% 83.33%       20ms  6.67%  main.I_expect_16_36 (inline)
      10ms  3.33%   100%       10ms  3.33%  main.D_expect_7_27 (inline)

________________________________________________________
Executed in  793.74 millis    fish           external
   usr time  473.95 millis    0.13 millis  473.83 millis
   sys time  178.80 millis    3.86 millis  174.94 millis

Still somewhat inaccurate + imprecise (we observe variances across multiple runs) for small total running time (controllable by -m), increasing in accuracy + precision when sampling over larger durations.

g-talbot · 2022-10-11T07:41:28Z

Goroutine stats are useful in serving infrastructure where users are charged on how much resource (in this case CPU) is used and the program must measure and accumulate that. Being able to do goroutine CPU stats in a goroutine-per-request model maps nicely to per-user accounting. Also when isolation decisions are made--i.e. throttling requests for users who are doing a lot of high-CPU work. This isn't just about measuring CPU for debugging. Different examples than Irfan's, but it's really the same thing -- the program observing itself and making decisions. P.S. I like the Sites book too.

irfansharif · 2022-12-15T20:57:15Z

CockroachDB's latest release (v22.2) builds with a patched Go runtime, using effectively the diff from #51347. In the comments above we talked about the "kind of CPU control [..] we could do if per-goroutine CPU stats were exposed by the runtime" and how "it would let us hard-cap a certain tenant on a given machine to some fixed % of CPU". We did this exact thing for bulky, scan-heavy, CPU work in the system (think backups or range feed catch up scans) where we've found them to be disruptive to foreground traffic due to elevated scheduling latencies. https://www.cockroachlabs.com/blog/rubbing-control-theory/ is a wordy write up of the simple idea: dynamically adjust what % of CPU is used for elastic work by measuring at high frequency what the Go scheduling latency p99 is, and then use per-goroutine CPU time to enforce the prescribed %. It's been effective.

@mknyszek: Could we move this further along the proposals board? I believe this issue was also referenced in #57175 (comment) ("There's an issue about measure CPU time on-line").

felixge · 2022-12-15T23:29:59Z

@irfansharif nice article. I like the idea of per-goroutine scheduler stats in general (even if it's a niche use case). A few thoughts.

So scheduling latencies increase non-linearly with CPU utilization [from the article]

Sounds like Little's Law?

and then use per-goroutine CPU time to enforce the prescribed %

Why do you need to enforce a prescribed % instead of just deciding if you want more or less elastic work to happen? Specifically, wouldn't it be enough to have a P(ID) controller that outputs a throttle (sleep) value for elastic work based on the error of (actual scheduler latency - desired scheduler latency)? Sorry if I'm missing something mentioned in the article or that's otherwise obvious.

irfansharif · 2022-12-16T00:00:22Z

Enforcing a CPU utilization % for elastic work is the same as introducing tiny sleeps everywhere. Elastic work synchronously acquires CPU tokens before/while processing, tokens that get generated at some prescribed rate. When tokens are unavailable, work is blocked.
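The token scheme described above can be sketched as a small token bucket whose tokens are nanoseconds of CPU time. This is a minimal sketch under stated assumptions, not the CockroachDB code: cpuBucket and its methods are hypothetical names, the consumed CPU time is assumed to come from some per-goroutine running-time reading, and a real version would need locking and would block callers rather than return a boolean.

```go
package main

import "fmt"

// cpuBucket is a token bucket whose tokens are nanoseconds of CPU time.
// It is not safe for concurrent use; a real implementation would add a
// mutex and a way to block waiters. (Hypothetical type.)
type cpuBucket struct {
	tokens int64 // available CPU nanoseconds; negative means debt
	burst  int64 // cap on banked tokens
}

// refill adds the tokens generated over elapsedNanos of wall time at the
// prescribed utilization (0.25 grants a quarter of one CPU).
func (b *cpuBucket) refill(elapsedNanos int64, utilization float64) {
	b.tokens += int64(float64(elapsedNanos) * utilization)
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
}

// tryAcquire reports whether elastic work may proceed right now.
func (b *cpuBucket) tryAcquire() bool {
	return b.tokens > 0
}

// charge deducts the CPU nanoseconds the work actually consumed, as
// measured after the fact (assumed to come from per-goroutine stats).
func (b *cpuBucket) charge(usedNanos int64) {
	b.tokens -= usedNanos
}

func main() {
	b := &cpuBucket{burst: 50_000_000} // bank at most 50ms of CPU time
	b.refill(100_000_000, 0.25)        // 100ms of wall time at 25% of a CPU
	fmt.Println("may run:", b.tryAcquire(), "tokens:", b.tokens)
	b.charge(40_000_000) // the work reported 40ms of running time
	fmt.Println("may run:", b.tryAcquire(), "tokens:", b.tokens)
}
```

Charging after the fact lets the bucket go into debt, which keeps later work blocked until the prescribed rate has paid the debt off.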

rhysh · 2022-12-16T01:58:15Z

semi-frequent calls to runtime.Gosched() during elastic work didn’t help [from the article]

Off topic for this issue, but related to your blog post, @irfansharif : You might be interested in #56060 (comment) . Gosched appears to yield only to goroutines in the P's local run queue.

mknyszek · 2022-12-19T21:34:59Z

RE: https://www.cockroachlabs.com/blog/rubbing-control-theory/, nice! I think this demonstrates an interesting use-case for wanting to ingest metrics produced by the Go runtime on-line.

By any chance have you measured whether there's any additional overhead incurred to tracking per-goroutine CPU times for every goroutine?

Could we move this further along the proposals board?

I don't think we should move forward with exactly what #51347 does, since runtime/metrics really wasn't built for this per-goroutine use-case. Specifically, it breaks the use-case of just slurping up all the available metrics and having them be meaningful (IMO it would be easy to misinterpret the new metric since it would only apply to the metrics-collection goroutine in most cases, which is typically spending most of its time sleeping).

I think something like this probably needs a new API, though I'm not sure what that looks like.

Why do you need to enforce a prescribed % instead of just deciding if you want more or less elastic work to happen? Specifically, wouldn't it be enough to have a P(ID) controller that outputs a throttle (sleep) value for elastic work based on the error of (actual scheduler latency - desired scheduler latency)? Sorry if I'm missing something mentioned in the article or that's otherwise obvious.

Slightly off-topic, but this is exactly how the Go runtime's background scavenger goroutine works. There's a PI controller that controls sleep time to achieve a desired CPU utilization. It seems to work reasonably well.
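For readers unfamiliar with the idea, a textbook PI controller that turns a latency error into a sleep duration might look like the sketch below. It is not the scavenger's actual code. The type, gains, and clamp values are illustrative assumptions, and there is no anti-windup handling.

```go
package main

import "fmt"

// piController is a textbook proportional-integral controller.
// All names and gain values here are illustrative assumptions.
type piController struct {
	kp, ki   float64 // proportional and integral gains
	integral float64 // accumulated error over time
	min, max float64 // clamp applied to the output
}

// next consumes the current error (actual minus desired scheduler
// latency, in seconds) observed over dt seconds and returns the sleep,
// in seconds, to inject into elastic work.
func (c *piController) next(err, dt float64) float64 {
	c.integral += err * dt
	out := c.kp*err + c.ki*c.integral
	if out < c.min {
		out = c.min
	}
	if out > c.max {
		out = c.max
	}
	return out
}

func main() {
	c := &piController{kp: 0.5, ki: 0.1, min: 0, max: 0.01}
	// Pretend the observed p99 is 2ms against a 1ms target, sampled
	// once per second.
	for i := 0; i < 3; i++ {
		fmt.Printf("step %d: sleep %.6fs\n", i, c.next(0.002-0.001, 1))
	}
}
```

The integral term is what removes steady-state error: as long as latency stays above target, the sleep keeps growing until the clamp is reached.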

felixge · 2022-12-20T14:38:50Z

I think something like this probably needs a new API, though I'm not sure what that looks like.

Since my colleagues at Datadog have at least two UCs (EXPLAIN ANALYZE and Resource Isolation) for this feature that are not well served by CPU profiling with pprof labels, I decided to give this a shot. It's similar to @asubiotto's original proposal, but with a few differences:

It lives in runtime/debug rather than runtime. I think it's a good place because the package docs say it "contains facilities for programs to debug themselves while they are running". Additionally the package contains SetPanicOnFault which establishes precedent for controlling per-G behavior.
SetGStats prevents any potential tracking overhead from impacting Go programs that don't make use of it. And even programs using this feature will only have to pay for it on the goroutines they are interested in, rather than all goroutines used by the program. An efficient implementation could try to overload the meaning of an existing g field or find some unused padding to avoid increased memory usage per goroutine.
The docs for the "Running" field clarify that this is not actually tracking CPU time, but a close proxy for it. An alternative name would be "Executing" which is the term used by the runtime/trace goroutine analysis.

package debug // runtime/debug

import "time"

// GStats collects information about a single goroutine.
type GStats struct {
  // Running is the time this goroutine spent in the running state. This should
  // usually be similar to the CPU time used by the goroutine.
  Running time.Duration
  // Could be extended in the future, e.g. with cumulative scheduler latency
  // or time spent in different wait states.
}

// SetGStats enables or disables the tracking of goroutine statistics for the
// current goroutine. It returns the previous setting.
func SetGStats(enabled bool) bool

// ReadGStats reads statistics about the calling goroutine into stats. Calling this
// from a goroutine that has not previously called SetGStats(true) may panic.
func ReadGStats(stats *GStats)

I haven't checked with @irfansharif, but I suspect this API would be compatible with CRDB's use cases.

irfansharif · 2022-12-20T14:52:34Z

I'm happy to wholly adopt any alternative; sticking this into runtime/metrics was done without much forethought on my part. In CRDB we didn't use runtime/metrics (it's a tad costlier); we instead exposed a private func grunningnanos() int64 that we go:linknamed against. See https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20220602_fine_grained_cpu_attribution.md#design and https://github.com/cockroachdb/cockroach/blob/060cdcb6661d1a16d262d37bef5787aa5061f235/pkg/util/grunning/enabled.go#L22-L26.

As for keeping this opt-{in,out}/non-optional, there too I'll happily defer to folks with better intuition about the performance overhead. Some microbenchmarks against go1.19 on darwin/arm64: https://gist.github.com/irfansharif/256a6864de0d4d2a797d5c815fcd679a. Since the Mac's vDSO-like mechanism ("commpage") is not the same as Linux's, and Linux is what we actually use on servers, I also ran them on linux/amd64; the results are below. For the overhead of reading off grunningnanos() itself (https://github.com/irfansharif/runner/blob/a0a8b6f4434408f9f7e78bf931e9170169fcc4a0/runtime_test.go#L229-L235):

goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkGRunningNanos
BenchmarkGRunningNanos-24       195321096               30.67 ns/op
BenchmarkGRunningNanos-24       195100147               30.77 ns/op
BenchmarkGRunningNanos-24       195415414               30.71 ns/op
BenchmarkGRunningNanos-24       195564742               30.70 ns/op
BenchmarkGRunningNanos-24       195472393               30.70 ns/op
PASS

runtime micro-benchmarks (PingPongHog looked the most interesting/relevant to me). Old==with patch.

name                             old time/op    new time/op    delta
PingPongHog-24                      482ns ± 9%     472ns ± 9%    ~     (p=0.065 n=25+25)
CreateGoroutines-24                 306ns ± 4%     308ns ± 4%    ~     (p=0.225 n=25+25)
CreateGoroutinesParallel-24        34.2ns ± 2%    34.2ns ± 3%    ~     (p=0.963 n=25+23)
CreateGoroutinesCapture-24         2.85µs ± 5%    2.82µs ± 4%    ~     (p=0.231 n=24+25)
CreateGoroutinesSingle-24           407ns ± 2%     407ns ± 2%    ~     (p=0.541 n=25+25)
Matmult-24                         0.97ns ± 3%    0.97ns ± 4%    ~     (p=0.613 n=25+25)
ChanNonblocking-24                 0.41ns ± 0%    0.41ns ± 1%    ~     (p=0.055 n=23+24)
ChanUncontended-24                  348ns ± 0%     348ns ± 0%    ~     (p=0.193 n=24+24)
ChanContended-24                   48.0µs ± 2%    46.9µs ± 4%  -2.34%  (p=0.000 n=23+24)
ChanSync-24                         237ns ± 2%     239ns ± 9%    ~     (p=0.645 n=20+24)
ChanSyncWork-24                    33.6µs ± 3%    32.9µs ± 3%  -2.04%  (p=0.000 n=25+25)
ChanProdCons0-24                   1.07µs ± 1%    1.06µs ± 2%  -0.78%  (p=0.006 n=21+25)
ChanProdCons10-24                   847ns ± 2%     842ns ± 2%    ~     (p=0.094 n=25+24)
ChanProdCons100-24                  577ns ± 3%     579ns ± 3%    ~     (p=0.362 n=24+22)
ChanProdConsWork0-24               1.08µs ± 1%    1.08µs ± 2%  +0.56%  (p=0.017 n=24+22)
ChanProdConsWork10-24              1.01µs ± 2%    1.01µs ± 2%    ~     (p=0.732 n=24+23)
ChanProdConsWork100-24              891ns ± 2%     888ns ± 2%    ~     (p=0.418 n=24+25)
ReceiveDataFromClosedChan-24       31.1ns ± 0%    31.1ns ± 0%    ~     (p=0.303 n=25+25)
ChanCreation-24                    61.7ns ± 3%    60.6ns ± 2%  -1.87%  (p=0.000 n=24+24)
ChanSem-24                          439ns ±10%     444ns ± 7%    ~     (p=0.603 n=25+25)
ChanPopular-24                     1.24ms ± 3%    1.24ms ± 3%    ~     (p=0.539 n=25+23)
ChanClosed-24                      0.47ns ± 0%    0.47ns ± 0%  +0.13%  (p=0.006 n=24+24)
SelectUncontended-24               8.63ns ± 0%    8.60ns ± 0%  -0.26%  (p=0.000 n=24+24)
SelectSyncContended-24             3.91µs ± 2%    3.89µs ± 2%  -0.69%  (p=0.019 n=25+25)
SelectAsyncContended-24             705ns ± 3%     702ns ± 4%    ~     (p=0.277 n=25+22)
SelectNonblock-24                  1.44ns ± 0%    1.44ns ± 0%    ~     (p=0.783 n=25+24)
SelectProdCons-24                  1.25µs ± 1%    1.24µs ± 1%  -0.92%  (p=0.000 n=24+22)
GoroutineSelect-24                 1.90ms ± 1%    1.90ms ± 2%    ~     (p=0.775 n=23+25)
WakeupParallelSpinning/0s-24       14.5µs ± 2%    14.5µs ± 3%    ~     (p=0.547 n=25+25)
WakeupParallelSpinning/1µs-24      19.5µs ±10%    19.8µs ± 8%    ~     (p=0.185 n=24+22)
WakeupParallelSpinning/2µs-24      25.5µs ± 4%    25.5µs ± 6%    ~     (p=0.785 n=24+25)
WakeupParallelSpinning/5µs-24      37.2µs ± 7%    37.4µs ± 7%    ~     (p=0.672 n=25+25)
WakeupParallelSpinning/10µs-24     54.3µs ± 1%    54.4µs ± 1%    ~     (p=0.613 n=25+25)
WakeupParallelSpinning/20µs-24     96.0µs ± 1%    96.0µs ± 1%    ~     (p=0.712 n=24+23)
WakeupParallelSpinning/50µs-24      222µs ± 0%     222µs ± 0%    ~     (p=0.510 n=24+23)
WakeupParallelSpinning/100µs-24     399µs ± 3%     396µs ± 3%    ~     (p=0.104 n=25+25)
WakeupParallelSyscall/0s-24         172µs ± 4%     173µs ± 3%    ~     (p=0.410 n=25+25)
WakeupParallelSyscall/1µs-24        172µs ± 2%     173µs ± 3%    ~     (p=0.284 n=23+22)
WakeupParallelSyscall/2µs-24        177µs ± 3%     175µs ± 3%  -1.02%  (p=0.003 n=25+24)
WakeupParallelSyscall/5µs-24        183µs ± 2%     183µs ± 3%    ~     (p=0.845 n=23+23)
WakeupParallelSyscall/10µs-24       196µs ± 9%     194µs ± 4%    ~     (p=0.295 n=24+23)
WakeupParallelSyscall/20µs-24       219µs ± 4%     217µs ± 3%    ~     (p=0.133 n=25+23)
WakeupParallelSyscall/50µs-24       279µs ± 2%     282µs ± 5%    ~     (p=0.109 n=24+23)
WakeupParallelSyscall/100µs-24      394µs ± 4%     391µs ± 3%    ~     (p=0.099 n=25+25)

name                             old alloc/op   new alloc/op   delta
CreateGoroutinesCapture-24           144B ± 0%      144B ± 0%    ~     (all equal)

name                             old allocs/op  new allocs/op  delta
CreateGoroutinesCapture-24           5.00 ± 0%      5.00 ± 0%    ~     (all equal)

I did want to learn from the Go team what would be a better way to evaluate the overhead. We didn't observe anything at the CRDB-level {micro,benchmarks}. We didn't bother with keeping it opt-{in,out}.

felixge · 2022-12-20T16:13:21Z

Thanks @irfansharif. I'll try to reproduce the bench results with a bare metal amd64/linux box tomorrow. I'm okay if we decide the SetGStats part of my proposal is not needed. I also don't know what's acceptable for this code path, and how to weigh linux/darwin/windows/etc against each other in case the overhead is platform dependent.

mknyszek · 2022-12-20T17:35:01Z

If we're reasonably certain that it doesn't make a difference on microbenchmarks, I'm not all that concerned about bigger systems.

On the API side, I think it'd be nice to have something less committal than GStats, since we'd have to live with that forever. What about something like:

package metrics

func ReadPerGoroutine([]Sample)
func AllPerGoroutine() []Description

As long as the metrics are scalars, I expect this to be only slightly less efficient than the struct-based API. One thing that might be interesting also is including a dump of this information in the goroutine profile.
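For comparison, the existing process-wide API that these two functions would mirror is metrics.All and metrics.Read. The sketch below uses only that existing API to collect every scalar metric, which is also the "slurp up everything" pattern mentioned earlier. How the proposed ReadPerGoroutine and AllPerGoroutine would behave is not specified in this thread, so nothing here should be read as their semantics. readScalars is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readScalars reads every uint64- or float64-valued metric the runtime
// currently exposes and returns the values keyed by metric name.
// (Hypothetical helper built on the existing runtime/metrics API.)
func readScalars() map[string]float64 {
	descs := metrics.All()
	samples := make([]metrics.Sample, 0, len(descs))
	for _, d := range descs {
		if d.Kind == metrics.KindUint64 || d.Kind == metrics.KindFloat64 {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	// Read fills in Value for each requested Name.
	metrics.Read(samples)

	out := make(map[string]float64, len(samples))
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			out[s.Name] = float64(s.Value.Uint64())
		case metrics.KindFloat64:
			out[s.Name] = s.Value.Float64()
		}
	}
	return out
}

func main() {
	vals := readScalars()
	fmt.Println("scalar metrics read:", len(vals))
	fmt.Println("live goroutines:", vals["/sched/goroutines:goroutines"])
}
```

Reusing the Sample and Description types would let a per-goroutine variant grow new metrics over time without committing to a fixed struct layout, which seems to be the appeal of this shape.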

felixge · 2022-12-21T11:18:06Z

@irfansharif I just tried on a big AWS linux machine with 128 vCPUs and I still see a significant performance regression on PingPongHog. I've also tried with -cpu 24 to use the same number of procs as you did above, but the results are the same. I don't understand why your results are so different. Do you have other hardware to try with?

$ curl http://169.254.169.254/latest/meta-data/instance-type
c6i.metal
$ head -n4 before.txt 
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
$ benchstat before.txt after.txt 
name              old time/op  new time/op  delta
PingPongHog-24     610ns ±16%   808ns ±32%  +32.38%  (p=0.000 n=50+50)
PingPongHog-128    603ns ±17%   803ns ±19%  +33.20%  (p=0.000 n=49+50)

@mknyszek how do you feel about making this opt-in if the results above are true? I believe these micro-benchmarks will not translate to a noticeable regression for most Go programs, but I'm still a little worried about the numbers.

The API you proposed sounds nice as well, I'd be okay with that 👍.

One thing that might be interesting also is including a dump of this information in the goroutine profile.

As a new sample type? How would one interpret the resulting profile? For most programs the stack traces in a goroutine profile will be biased towards Off-CPU time, but then the flame graph would be scaled by On-CPU (executing) time? I'd imagine this to lead to confusing results.

tuxillo · 2024-11-18T18:43:14Z

Was there any conclusion in the end? What happened to this?

gopherbot added this to the Proposal milestone Sep 22, 2020

gopherbot added the Proposal label Sep 22, 2020

ianlancetaylor changed the title ~~proposal: add per-goroutine CPU stats~~ proposal: runtime: add per-goroutine CPU stats Sep 24, 2020

This was referenced Feb 16, 2021

rfcs: Task groups in Go: per-session resource usage tracking cockroachdb/cockroach#60589

Open

[dnm]: use background profiling cockroachdb/cockroach#60795

Closed

ajwerner mentioned this issue Jan 5, 2022

Surface a Request Unit metric per statement cockroachdb/cockroach#74441

Closed

irfansharif linked a pull request Feb 24, 2022 that will close this issue

runtime,runtime/metrics: track on-cpu time per goroutine #51347

Open

prattmic added this to Go Compiler / Runtime Feb 24, 2022

prattmic moved this to Todo in Go Compiler / Runtime Feb 24, 2022

irfansharif mentioned this issue Feb 25, 2022

tenantrate: use measured on-cpu time for rate limiting cockroachdb/cockroach#77041

Open

irfansharif mentioned this issue Jul 4, 2022

*: fine-grained cpu attribution cockroachdb/cockroach#82625

Closed

5 tasks

rsc added this to Proposals Aug 10, 2022

rsc moved this to Incoming in Proposals Aug 10, 2022

nsrip-dd mentioned this issue Oct 18, 2022

runtime/trace: execution trace doesn't include pprof labels #56295

Open

mknyszek mentioned this issue Feb 16, 2023

runtime: performance and diagnostics meeting notes #57175

Open

kolesnikovae mentioned this issue Jan 10, 2024

Derive metrics from ingested profiles grafana/pyroscope#2908

Open

aktau mentioned this issue Jan 23, 2024

proposal: runtime/metrics: provide histogram of goroutines' on-CPU time #63341

Open

