Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/metrics: time.Sleep contributes to "user" rather than "idle" cpu class #60276

Closed
rhysh opened this issue May 17, 2023 · 8 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented May 17, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.4 darwin/arm64
go version devel go1.21-f777726ff0 Wed May 17 21:12:04 2023 +0000 darwin/arm64

Does this issue reproduce with the latest release?

Yes. New in Go 1.20 (with the introduction of /cpu/classes/user:cpu-seconds), still present in tip.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/rhys/go/pkg/mod"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/Users/rhys/go/version/go1.20"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/rhys/go/version/go1.20/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build1529865541=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I inspected /cpu/classes/user:cpu-seconds and /cpu/classes/idle:cpu-seconds from the runtime/metrics package before and after calling time.Sleep.

Note that CPU time accounting isn't available on the playground, so the problem doesn't reproduce there.

package main

import (
	"runtime"
	"runtime/metrics"
	"testing"
	"time"
)

func TestSleepAccounting(t *testing.T) {
	names := []string{
		"/cpu/classes/gc/mark/assist:cpu-seconds",
		"/cpu/classes/gc/mark/dedicated:cpu-seconds",
		"/cpu/classes/gc/mark/idle:cpu-seconds",
		"/cpu/classes/gc/pause:cpu-seconds",
		"/cpu/classes/gc/total:cpu-seconds",
		"/cpu/classes/idle:cpu-seconds",
		"/cpu/classes/scavenge/assist:cpu-seconds",
		"/cpu/classes/scavenge/background:cpu-seconds",
		"/cpu/classes/scavenge/total:cpu-seconds",
		"/cpu/classes/total:cpu-seconds",
		"/cpu/classes/user:cpu-seconds",
	}
	prep := func() []metrics.Sample {
		mm := make([]metrics.Sample, len(names))
		for i := range names {
			mm[i].Name = names[i]
		}
		return mm
	}
	m1, m2 := prep(), prep()

	dur := time.Second

	runtime.GC() // update /cpu/classes metrics
	metrics.Read(m1)
	time.Sleep(dur)
	runtime.GC() // update /cpu/classes metrics
	metrics.Read(m2)

	for i := range names {
		if m1[i].Value.Kind() == metrics.KindBad {
			continue
		}
		t.Logf("%s %0.3f\n", names[i], m2[i].Value.Float64()-m1[i].Value.Float64())
	}

	for i := range names {
		if names[i] == "/cpu/classes/user:cpu-seconds" {
			if dt := m2[i].Value.Float64() - m1[i].Value.Float64(); dt > (dur / 2).Seconds() {
				t.Errorf(`BUG: time.Sleep has contributed to "user" rather than "idle" CPU class`)
			}
		}
	}
}

What did you expect to see?

I expected time during which all user goroutines are asleep to count as "idle" time.

What did you see instead?

The full duration of the time.Sleep call is counted towards /cpu/classes/user:cpu-seconds. Maybe the P that is responsible for tracking the timer isn't considered to be "fully asleep"?

I see there's a chance of some overestimation:

/cpu/classes/user:cpu-seconds
	Estimated total CPU time spent running user Go code. This may
	also include some small amount of time spent in the Go runtime.
	This metric is an overestimate, and not directly comparable
	to system CPU time measurements. Compare only with other
	/cpu/classes metrics.

But there's also a chance of overestimation in the "idle" class. And during the sleep, no user or runtime code should be "executing".

/cpu/classes/idle:cpu-seconds
	Estimated total available CPU time not spent executing
	any Go or Go runtime code. In other words, the part of
	/cpu/classes/total:cpu-seconds that was unused. This metric is
	an overestimate, and not directly comparable to system CPU time
	measurements. Compare only with other /cpu/classes metrics.
$ go1.20 test /tmp/sleep_test.go -cpu=1,8
--- FAIL: TestSleepAccounting (1.00s)
    sleep_test.go:45: /cpu/classes/gc/mark/assist:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/mark/dedicated:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/mark/idle:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/pause:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/total:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/idle:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/scavenge/assist:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/scavenge/background:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/scavenge/total:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/total:cpu-seconds 1.001
    sleep_test.go:45: /cpu/classes/user:cpu-seconds 1.001
    sleep_test.go:51: BUG: time.Sleep has contributed to "user" rather than "idle" CPU class
--- FAIL: TestSleepAccounting (1.00s)
    sleep_test.go:45: /cpu/classes/gc/mark/assist:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/mark/dedicated:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/mark/idle:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/gc/pause:cpu-seconds 0.001
    sleep_test.go:45: /cpu/classes/gc/total:cpu-seconds 0.002
    sleep_test.go:45: /cpu/classes/idle:cpu-seconds 7.010
    sleep_test.go:45: /cpu/classes/scavenge/assist:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/scavenge/background:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/scavenge/total:cpu-seconds 0.000
    sleep_test.go:45: /cpu/classes/total:cpu-seconds 8.013
    sleep_test.go:45: /cpu/classes/user:cpu-seconds 1.002
    sleep_test.go:51: BUG: time.Sleep has contributed to "user" rather than "idle" CPU class
FAIL
FAIL	command-line-arguments	2.121s
FAIL

CC @golang/runtime @mknyszek

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 17, 2023
@mknyszek
Copy link
Contributor

That's very odd. I would expect the M to be blocked in the netpoller, after releasing its P. Once the P is idle, all the CPU time should be going to idle time. Specifically, down here: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3105?q=findRunnable&ss=go%2Fgo. pidleput is what actually switches the runtime to tracking idle time, and that specifically happens before blocking in the poller.

I suspect whatever the problem is, it'll be easy to fix.

@mknyszek mknyszek self-assigned this May 18, 2023
@mknyszek mknyszek added this to the Go1.21 milestone May 18, 2023
@mknyszek
Copy link
Contributor

Found the issue; we're using a stale "now" value for Ms that block in netpoll.

$ git diff
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 35aeb2d1ac..987c1cd2c7 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3112,7 +3112,7 @@ top:
                        goto top
                }
                lock(&sched.lock)
-               pp, _ := pidleget(now)
+               pp, _ := pidleget(0)
                unlock(&sched.lock)
                if pp == nil {
                        injectglist(&list)

Ugh, this probably has implications for the GC CPU limiter.

@gopherbot
Copy link

Change https://go.dev/cl/496183 mentions this issue: runtime: fix usage of stale "now" value for netpolling Ms

@mknyszek
Copy link
Contributor

We may want to backport this? The fix is very small and we have a test.

@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label May 18, 2023
@gopherbot
Copy link

Change https://go.dev/cl/487215 mentions this issue: runtime/metrics: refactor CPU stats accumulation

@gopherbot
Copy link

Change https://go.dev/cl/496117 mentions this issue: runtime: flush idle time to sched.idleTime on limiter event consumption

gopherbot pushed a commit that referenced this issue May 19, 2023
This was an oversight, which might cause accounted-for idle time to be
lost. Noticed this while working on #60276.

Change-Id: Ic743785d6dc82555e660f2c9b6aaa9dedef56ed8
Reviewed-on: https://go-review.googlesource.com/c/go/+/496117
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 23, 2023
Currently the CPU stats are only updated once every mark termination,
but for writing robust tests, it's often useful to force this update.
Refactor the CPU stats accumulation out of gcMarkTermination and into
its own function. This is also a step toward real-time CPU stats.

While we're here, fix some incorrect documentation about dedicated GC
CPU time.

For #59749.
For #60276.

Change-Id: I8c1a9aca45fcce6ce7999702ae4e082853a69711
Reviewed-on: https://go-review.googlesource.com/c/go/+/487215
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/497876 mentions this issue: runtime: skip TestCPUMetricsSleep as flaky

gopherbot pushed a commit that referenced this issue May 24, 2023
This test is fundamentally flaky because of a mismatch between how
internal idle time is calculated and how the test expects it to be
calculated. It's unclear how to resolve this mismatch, given that it's
perfectly valid for a goroutine to remain asleep while background
goroutines (e.g. the scavenger) run. In practice, we might be able to
set some generous lower-bound, but until we can confirm that on the
affected platforms, skip the test as flaky unconditionally.

For #60276.
For #60376.

Change-Id: Iffd5c4be10cf8ae8a6c285b61fcc9173235fbb2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/497876
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/498274 mentions this issue: runtime: make TestCPUMetricsSleep even more lenient

gopherbot pushed a commit that referenced this issue May 31, 2023
This test was introduced as a regression test for #60276. However, it
was quite flaky on a number of different platforms because there are
myriad ways the runtime can eat into time one might expect is completely
idle.

This change re-enables the test, but makes it much more resilient.
Because the issue we're testing for is persistent, we now require 10
consecutive failures to count. Any single success counts as a test
success. This change also makes the test's idle time bound more lenient,
allowing for a little bit of time to be eaten up. The regression we're
testing for results in nearly zero idle time being accounted for.

If this is still not good enough to eliminate flakes, this test should
just be deleted.

For #60276.
Fixes #60376.

Change-Id: Icd81f0c9970821b7f386f6d27c8a566fee4d0ff7
Reviewed-on: https://go-review.googlesource.com/c/go/+/498274
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

3 participants