
runtime/metrics: CPU metrics update at a relatively low rate #59749

Open
mknyszek opened this issue Apr 21, 2023 · 8 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mknyszek
Contributor

Currently, the CPU metrics update only once per GC cycle. It should be straightforward to make them update "continuously", since most (all?) of them are accumulated in real time.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 21, 2023
@mknyszek mknyszek added this to the Go1.21 milestone Apr 21, 2023
@mknyszek mknyszek self-assigned this Apr 21, 2023
@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label Apr 21, 2023
@gopherbot

Change https://go.dev/cl/487215 mentions this issue: runtime/metrics: make CPU stats real-time

@mknyszek mknyszek modified the milestones: Go1.21, Backlog May 3, 2023
gopherbot pushed a commit that referenced this issue May 23, 2023
Currently the CPU stats are only updated once every mark termination,
but for writing robust tests, it's often useful to force this update.
Refactor the CPU stats accumulation out of gcMarkTermination and into
its own function. This is also a step toward real-time CPU stats.

While we're here, fix some incorrect documentation about dedicated GC
CPU time.

For #59749.
For #60276.

Change-Id: I8c1a9aca45fcce6ce7999702ae4e082853a69711
Reviewed-on: https://go-review.googlesource.com/c/go/+/487215
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
@aktau
Contributor

aktau commented Mar 6, 2024

What's still needed to have on-demand CPU stats, after the refactoring?

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

The main problem I ran into is monotonicity issues with user time. This arises because user time is computed by subtracting everything else from the computed total. For example, if a mark assist is in flight (but not completed) when we take a measurement, and we then take another measurement a short time later (less than the total mark assist time), user time will appear to spike up, then down. More concretely: a mark assist takes 10µs. 5µs in, we measure. That 5µs on that one CPU is, at that moment, attributed to user time. Then, 5µs later, when the assist is done, we measure again. Now it's attributed to mark assists, so user time may appear to skip backwards.
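The arithmetic in this example can be made concrete. A tiny sketch (the `userCPU` helper is illustrative, not a runtime function; it mimics the "user = total − everything else" derivation described above):

```go
package main

import "fmt"

// userCPU mimics how user time is derived: whatever CPU time isn't
// attributed to a known class (here, just mark assists) is assumed
// to be user code.
func userCPU(totalSecs, assistSecs float64) float64 {
	return totalSecs - assistSecs
}

func main() {
	// Numbers from the example above: a 10µs mark assist, sampled halfway.
	// Sample 1: 5µs of CPU has elapsed, but the in-flight assist has not
	// yet been flushed to the assist counter, so it all looks like user time.
	u1 := userCPU(5e-6, 0)
	// Sample 2: the assist completes and attributes all 10µs to the
	// assist class at once.
	u2 := userCPU(10e-6, 10e-6)
	fmt.Printf("user@t1=%.0fµs user@t2=%.0fµs (went backwards: %v)\n",
		u1*1e6, u2*1e6, u2 < u1)
	// → user@t1=5µs user@t2=0µs (went backwards: true)
}
```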

It's complicated, but we'd need something like https://cs.opensource.google/go/go/+/master:src/runtime/mgclimit.go;l=208;drc=70fbc88288143c218fde9f905a38d55505adfb2b;bpv=1;bpt=1 that's even more complete. Not impossible, but probably a substantial amount of work.

@aktau
Contributor

aktau commented Mar 8, 2024

Thanks for the explanation. Before the ideal solution is implemented, which would be complicated as you say, what about an intermediate solution that gets us most of the way there?

  • If we can detect that a mark assist is in progress, don't update.
  • Optionally, to prevent stat starvation if mark assists can be back-to-back ~forever (don't know if they can), set a bit which forces the next mark assist to update the stat before locking.

In this way, the metric wouldn't be perfectly on-demand, but it would retain its monotonicity (which is important for metrics) and be less stale than it currently is.
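The suggestion above could be sketched roughly as follows. This is purely illustrative (the type and method names are hypothetical, not runtime internals): the reader serves the last flushed snapshot while any assist is in flight, and only refreshes when none is.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cpuStats is a hypothetical sketch of the proposed mitigation: skip
// the on-demand refresh while a mark assist is in flight and return
// the last consistent snapshot instead.
type cpuStats struct {
	inFlightAssists atomic.Int64
	lastSnapshot    atomic.Int64 // e.g. user CPU time in nanoseconds
}

func (s *cpuStats) beginAssist() { s.inFlightAssists.Add(1) }

func (s *cpuStats) endAssist(userNanos int64) {
	s.lastSnapshot.Store(userNanos)
	s.inFlightAssists.Add(-1)
}

// read returns a fresh value only when no assist is running; otherwise
// it serves the stale-but-monotonic snapshot.
func (s *cpuStats) read(freshNanos int64) int64 {
	if s.inFlightAssists.Load() > 0 {
		return s.lastSnapshot.Load()
	}
	s.lastSnapshot.Store(freshNanos)
	return freshNanos
}

func main() {
	var s cpuStats
	fmt.Println(s.read(100)) // no assist: fresh value → 100
	s.beginAssist()
	fmt.Println(s.read(150)) // assist in flight: stale snapshot → 100
	s.endAssist(200)
	fmt.Println(s.read(250)) // assist done: fresh again → 250
}
```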

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

To be clear, the mark assist example was just an example (and maybe a bad one; that one we could actually already handle today...). There are several other cases, like the GC workers actively running. It's not super useful to wait until they finish running because that's equivalent to waiting until the end of the GC mark phase.

@aktau
Contributor

aktau commented Mar 8, 2024

It's not super useful to wait until they finish running because that's equivalent to waiting until the end of the GC mark phase.

I wouldn't advocate for that. My suggestion was to provide the last snapshot if this is the case:

```
GCWorker1     ---==----------------------
GCWorker2     --------------====---------
MetricsPoller ---||---------||||---------
```

Where there is a bar (|), the metrics poller would just see the last updated value. On-demand metrics would start again when no worker is busy updating.

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

In practice, we try really hard to have the dedicated GC workers stick to their P and their thread. I think it would be unlikely that we'd be able to find any quiescence.

Also again, it's not just the GC workers. We'd need to make sure there would be no outstanding events at all to make it work properly. That's basically a STW.

@aktau
Contributor

aktau commented May 3, 2024

I bumped into this again (*) while attempting to compare Go's idea of processing time with the kernel's. Essentially:

```
(/cpu/classes/total - /cpu/classes/idle - /cpu/classes/gc/pause)
----------------------------------------------------------------
                        rusage.ru_utime
```

I know that the metrics package recommends against comparing kernel and Go metrics for various reasons, but in this case I'd like to observe significant divergence between the two. To answer the question: when does Go believe it is consuming X s/s while the kernel is giving it (much) less? While the ratio is only slightly above 1 most of the time, I've observed significant outliers (e.g. a 15x ratio) at times. I'd like to rule out the possibility that this is due to the delay inherent in waiting for a GC cycle to complete before the metrics update.
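The ratio above can be computed in-process along these lines (a sketch for Unix-like systems; the metric names are from runtime/metrics, and it uses `ru_utime` only, matching the formula, even though `ru_stime` might also be relevant):

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"syscall"
)

// goVsKernelCPU returns Go's view of CPU consumed by the process
// (total - idle - gc pause) and the kernel's user-time accounting
// from getrusage(2), both in seconds.
func goVsKernelCPU() (goSecs, kernelSecs float64, err error) {
	samples := []metrics.Sample{
		{Name: "/cpu/classes/total:cpu-seconds"},
		{Name: "/cpu/classes/idle:cpu-seconds"},
		{Name: "/cpu/classes/gc/pause:cpu-seconds"},
	}
	metrics.Read(samples)
	goSecs = samples[0].Value.Float64() -
		samples[1].Value.Float64() -
		samples[2].Value.Float64()

	var ru syscall.Rusage
	if err = syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, 0, err
	}
	kernelSecs = float64(ru.Utime.Sec) + float64(ru.Utime.Usec)/1e6
	return goSecs, kernelSecs, nil
}

func main() {
	g, k, err := goVsKernelCPU()
	if err != nil {
		panic(err)
	}
	// Note: as discussed in this issue, the Go-side numbers may be
	// up to one GC cycle stale, which could inflate or deflate the ratio.
	fmt.Printf("go=%.3fs kernel=%.3fs ratio=%.2f\n", g, k, g/k)
}
```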

I don't mean to say this is absolutely necessary, one can get similar information via other means, but it gets messier. It's just another use case.

(*) At least, I think I did.
