Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/metrics: inconsistency can be observed in /cpu/classes metrics #66212

Closed
aktau opened this issue Mar 8, 2024 · 2 comments
Closed

runtime/metrics: inconsistency can be observed in /cpu/classes metrics #66212

aktau opened this issue Mar 8, 2024 · 2 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.
Milestone

Comments

@aktau
Copy link
Contributor

aktau commented Mar 8, 2024

Go version

go version 1.23-dev-c8c46e746b778c39727c588adf79aff34ab6f151

Output of go env in your module/workspace:

Not relevant.

What did you do?

Run the following program over and over until it fails. It compares the result of fetching /cpu/classes/gc/total:cpu-seconds and calculating it from its parts, and also prints how many ULPs apart the numbers are ([good article on this])(https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/). The program makes a little effort to entice a sheared update (successfully on my workstation). This is perhaps the reason the current tests need a 2% fudge factor to pass. Once fixed, I recommend adjusting the tests to be much stricter, allowing only a couple of ULPs difference.

package main

import (
	"log"
	"math"
	"os"
	"runtime"
	"runtime/metrics"
	"strconv"
	"sync"
	"time"
)

var samples = []metrics.Sample{
	{Name: "/cpu/classes/gc/total:cpu-seconds"},       // 0
	{Name: "/cpu/classes/idle:cpu-seconds"},           // 1
	{Name: "/cpu/classes/scavenge/total:cpu-seconds"}, // 2
	{Name: "/cpu/classes/total:cpu-seconds"},          // 3
	{Name: "/cpu/classes/user:cpu-seconds"},           // 4
}

func main() {
	// Apply some GC load and wait for a bit. Gives a high chance of obtaining the
	// sheared result on my workstation.
	go applyGCLoad()
	time.Sleep(50 * time.Millisecond)

	// Read metrics.
	metrics.Read(samples)
	gc := samples[0].Value.Float64()
	idle := samples[1].Value.Float64()
	scav := samples[2].Value.Float64()
	total := samples[3].Value.Float64()
	user := samples[4].Value.Float64()

	// Calculate the total from the subtotals.
	calcTotal := (((scav + gc) + user) + idle)

	// Compare total read from metrics versus our own calculation.
	log.Printf("Δ = %s = total (%s) - scav+gc+user+idle (%s), there's %d floats between them\n",
		strconv.FormatFloat(total-calcTotal, 'f', -1, 64),
		strconv.FormatFloat(calcTotal, 'f', -1, 64),
		strconv.FormatFloat(total, 'f', -1, 64),
		int64(math.Float64bits(calcTotal)-math.Float64bits(total)),
	)
	if want, got := total, calcTotal; got != want && (got-want > 0.0001 || want-got > 0.0001) {
		os.Exit(1)
	}
}

func applyGCLoad() func() {
	// We’ll apply load to the runtime with maxProcs-1 goroutines
	// and use one more to actually benchmark. It doesn't make sense
	// to try to run this test with only 1 P (that's what
	// BenchmarkReadMemStats is for).
	maxProcs := runtime.GOMAXPROCS(-1)

	// Code to build a big tree with lots of pointers.
	type node struct {
		children [16]*node
	}
	var buildTree func(depth int) *node
	buildTree = func(depth int) *node {
		tree := new(node)
		if depth != 0 {
			for i := range tree.children {
				tree.children[i] = buildTree(depth - 1)
			}
		}
		return tree
	}

	// Keep the GC busy by continuously generating large trees.
	done := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < maxProcs-1; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var hold *node
		loop:
			for {
				hold = buildTree(5)
				select {
				case <-done:
					break loop
				default:
				}
			}
			runtime.KeepAlive(hold)
		}()
	}
	return func() {
		close(done)
		wg.Wait()
	}
}

What did you see happen?

$ GO111MODULE=off go build && ( export GOMAXPROCS=20 ; while ./metricsbug ; do : ; done )
2024/03/08 15:20:59 Δ = -0.004854960000000075 = total (0.50744394) - scav+gc+user+idle (0.50258898), there's 43729592093798 floats between them

What did you expect to see?

I expected the diff to be much smaller, on the order of 4*ULP(total). But there's a very large difference (in float space) between these values. After investigation together with @mknyszek, the issue was found: most of the value that serve as the source for these metrics are updated in accumulate:

go/src/runtime/mstats.go

Lines 953 to 976 in 61d6817

// Update cumulative GC CPU stats.
s.gcAssistTime += markAssistCpu
s.gcDedicatedTime += markDedicatedCpu + markFractionalCpu
s.gcIdleTime += markIdleCpu
s.gcTotalTime += markAssistCpu + markDedicatedCpu + markFractionalCpu + markIdleCpu
// Update cumulative scavenge CPU stats.
s.scavengeAssistTime += scavAssistCpu
s.scavengeBgTime += scavBgCpu
s.scavengeTotalTime += scavAssistCpu + scavBgCpu
// Update total CPU.
s.totalTime = sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
s.idleTime += sched.idleTime.Load()
// Compute userTime. We compute this indirectly as everything that's not the above.
//
// Since time spent in _Pgcstop is covered by gcPauseTime, and time spent in _Pidle
// is covered by idleTime, what we're left with is time spent in _Prunning and _Psyscall,
// the latter of which is fine because the P will either go idle or get used for something
// else via sysmon. Meanwhile if we subtract GC time from whatever's left, we get non-GC
// _Prunning time. Note that this still leaves time spent in sweeping and in the scheduler,
// but that's fine. The overwhelming majority of this time will be actual user time.
s.userTime = s.totalTime - (s.gcTotalTime + s.scavengeTotalTime + s.idleTime)

under the same (STW) lock. But gcPauseTime and gcTotalTime are not updated outside of the lock somewhere else:

go/src/runtime/mgc.go

Lines 751 to 752 in 61d6817

work.cpuStats.gcPauseTime += sweepTermCpu
work.cpuStats.gcTotalTime += sweepTermCpu

It is possible that a metric read happens in between these calls, and sheared results are observed. I've verified that this is what happens by manually changing sweepTermCpu to a large value and re-running the test. Exactly this change then appears in the diff.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 8, 2024
@aktau aktau changed the title runtime/metrics: runtime/metrics: inconsistency can be observed in /cpu/classes metrics Mar 8, 2024
@mknyszek
Copy link
Contributor

mknyszek commented Mar 9, 2024

Already sent a fix. ;) https://go.dev/cl/570257

I'll see if I can incorporate your test program. Thanks for filing an issue!

@mknyszek mknyszek added this to the Go1.23 milestone Mar 13, 2024
@gopherbot
Copy link

Change https://go.dev/cl/570257 mentions this issue: runtime: move GC pause time CPU metrics update into the STW

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.
Projects
Development

No branches or pull requests

3 participants