
runtime/metrics: CPU metrics update at a relatively low rate #59749

Open
mknyszek opened this issue Apr 21, 2023 · 8 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mknyszek
Contributor

Currently, the CPU metrics update only once per GC cycle. It should be straightforward to make them update "continuously", since most (all?) of them are accumulated in real time.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 21, 2023
@mknyszek mknyszek added this to the Go1.21 milestone Apr 21, 2023
@mknyszek mknyszek self-assigned this Apr 21, 2023
@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label Apr 21, 2023
@gopherbot

Change https://go.dev/cl/487215 mentions this issue: runtime/metrics: make CPU stats real-time

@mknyszek mknyszek modified the milestones: Go1.21, Backlog May 3, 2023
gopherbot pushed a commit that referenced this issue May 23, 2023
Currently the CPU stats are only updated once every mark termination,
but for writing robust tests, it's often useful to force this update.
Refactor the CPU stats accumulation out of gcMarkTermination and into
its own function. This is also a step toward real-time CPU stats.

While we're here, fix some incorrect documentation about dedicated GC
CPU time.

For #59749.
For #60276.

Change-Id: I8c1a9aca45fcce6ce7999702ae4e082853a69711
Reviewed-on: https://go-review.googlesource.com/c/go/+/487215
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
@aktau
Contributor

aktau commented Mar 6, 2024

What's still needed to have on-demand CPU stats, after the refactoring?

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

The main problem I ran into is monotonicity issues with user time. This arises because user time is computed by subtracting everything else from the computed total. For example, if a mark assist is in flight (but not completed) when we take a measurement, and we then take another measurement a short time later (less than the total mark assist time), user time will appear to spike up, then down. More concretely: a mark assist takes 10µs. 5µs in, we measure. That 5µs on that one CPU is, at that moment, attributed to user time. Then, 5µs later, when the assist is done, we measure again. Now it's attributed to mark assists, so user time may appear to skip backwards.
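The arithmetic in this example can be made concrete. A tiny sketch (the `userCPU` helper is illustrative, not a runtime function; it mimics the "user = total − everything else" derivation described above):

```go
package main

import "fmt"

// userCPU mimics how user time is derived: whatever CPU time isn't
// attributed to a known class (here, just mark assists) is assumed
// to be user code.
func userCPU(totalSecs, assistSecs float64) float64 {
	return totalSecs - assistSecs
}

func main() {
	// Numbers from the example above: a 10µs mark assist, sampled halfway.
	// Sample 1: 5µs of CPU has elapsed, but the in-flight assist has not
	// yet been flushed to the assist counter, so it all looks like user time.
	u1 := userCPU(5e-6, 0)
	// Sample 2: the assist completes and attributes all 10µs to the
	// assist class at once.
	u2 := userCPU(10e-6, 10e-6)
	fmt.Printf("user@t1=%.0fµs user@t2=%.0fµs (went backwards: %v)\n",
		u1*1e6, u2*1e6, u2 < u1)
	// → user@t1=5µs user@t2=0µs (went backwards: true)
}
```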

It's complicated, but we'd need something like https://cs.opensource.google/go/go/+/master:src/runtime/mgclimit.go;l=208;drc=70fbc88288143c218fde9f905a38d55505adfb2b;bpv=1;bpt=1 that's even more complete. Not impossible, but probably a substantial amount of work.

@aktau
Contributor

aktau commented Mar 8, 2024

Thanks for the explanation. Before the ideal solution is implemented, which would be complicated as you say, what about an intermediate solution that gets us most of the way there?

  • If we can detect that a mark assist is in progress, don't update.
  • Optionally, to prevent stat starvation if mark assists can be back-to-back ~forever (don't know if they can), set a bit which forces the next mark assist to update the stat before locking.

In this way, the metric wouldn't be perfectly on-demand, but it would retain its monotonicity (which is important for metrics) and be less stale than it currently is.
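The suggestion above could be sketched roughly as follows. This is purely illustrative (the type and method names are hypothetical, not runtime internals): the reader serves the last flushed snapshot while any assist is in flight, and only refreshes when none is.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cpuStats is a hypothetical sketch of the proposed mitigation: skip
// the on-demand refresh while a mark assist is in flight and return
// the last consistent snapshot instead.
type cpuStats struct {
	inFlightAssists atomic.Int64
	lastSnapshot    atomic.Int64 // e.g. user CPU time in nanoseconds
}

func (s *cpuStats) beginAssist() { s.inFlightAssists.Add(1) }

func (s *cpuStats) endAssist(userNanos int64) {
	s.lastSnapshot.Store(userNanos)
	s.inFlightAssists.Add(-1)
}

// read returns a fresh value only when no assist is running; otherwise
// it serves the stale-but-monotonic snapshot.
func (s *cpuStats) read(freshNanos int64) int64 {
	if s.inFlightAssists.Load() > 0 {
		return s.lastSnapshot.Load()
	}
	s.lastSnapshot.Store(freshNanos)
	return freshNanos
}

func main() {
	var s cpuStats
	fmt.Println(s.read(100)) // no assist: fresh value → 100
	s.beginAssist()
	fmt.Println(s.read(150)) // assist in flight: stale snapshot → 100
	s.endAssist(200)
	fmt.Println(s.read(250)) // assist done: fresh again → 250
}
```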

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

To be clear, the mark assist example was just an example (and maybe a bad one; that one we could actually already handle today...). There are several other cases, like the GC workers actively running. It's not super useful to wait until they finish running because that's equivalent to waiting until the end of the GC mark phase.

@aktau
Contributor

aktau commented Mar 8, 2024

It's not super useful to wait until they finish running because that's equivalent to waiting until the end of the GC mark phase.

I wouldn't advocate for that. My suggestion was to provide the last snapshot if this is the case:

```
GCWorker1     ---==----------------------
GCWorker2     --------------====---------
MetricsPoller ---||---------||||---------
```

Where there is a bar (|), the metrics poller would just see the last updated value. On-demand metrics would start again when no worker is busy updating.

@mknyszek
Contributor Author

mknyszek commented Mar 8, 2024

In practice, we try really hard to have the dedicated GC workers stick to their P and their thread. I think it would be unlikely that we'd be able to find any quiescence.

Also again, it's not just the GC workers. We'd need to make sure there would be no outstanding events at all to make it work properly. That's basically a STW.

@aktau
Contributor

aktau commented May 3, 2024

I bumped into this again (*) while attempting to compare Go's idea of processing time with the kernel's. Essentially:

```
(/cpu/classes/total - /cpu/classes/idle - /cpu/classes/gc/pause)
----------------------------------------------------------------
                        rusage.ru_utime
```

I know that the metrics package recommends against comparing kernel and Go metrics for various reasons, but in this case I'd like to observe significant divergence between the two. To answer the question: when does Go believe it is consuming X s/s while the kernel is giving it (much) less? While the ratio is only slightly above 1 most of the time, I've observed significant outliers (e.g. a 15x ratio) at times. I'd like to rule out the possibility that this is due to the delay inherent in waiting for a GC cycle to complete before the metrics update.
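The ratio above can be computed in-process along these lines (a sketch for Unix-like systems; the metric names are from runtime/metrics, and it uses `ru_utime` only, matching the formula, even though `ru_stime` might also be relevant):

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"syscall"
)

// goVsKernelCPU returns Go's view of CPU consumed by the process
// (total - idle - gc pause) and the kernel's user-time accounting
// from getrusage(2), both in seconds.
func goVsKernelCPU() (goSecs, kernelSecs float64, err error) {
	samples := []metrics.Sample{
		{Name: "/cpu/classes/total:cpu-seconds"},
		{Name: "/cpu/classes/idle:cpu-seconds"},
		{Name: "/cpu/classes/gc/pause:cpu-seconds"},
	}
	metrics.Read(samples)
	goSecs = samples[0].Value.Float64() -
		samples[1].Value.Float64() -
		samples[2].Value.Float64()

	var ru syscall.Rusage
	if err = syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, 0, err
	}
	kernelSecs = float64(ru.Utime.Sec) + float64(ru.Utime.Usec)/1e6
	return goSecs, kernelSecs, nil
}

func main() {
	g, k, err := goVsKernelCPU()
	if err != nil {
		panic(err)
	}
	// Note: as discussed in this issue, the Go-side numbers may be
	// up to one GC cycle stale, which could inflate or deflate the ratio.
	fmt.Printf("go=%.3fs kernel=%.3fs ratio=%.2f\n", g, k, g/k)
}
```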

I don't mean to say this is absolutely necessary, one can get similar information via other means, but it gets messier. It's just another use case.

(*) At least, I think I did.
