
runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws #50097

Closed
bcmills opened this issue Dec 10, 2021 · 8 comments
Assignees: prattmic
Labels: FrozenDueToAge, NeedsInvestigation, release-blocker
Milestone: Go1.18

@bcmills
Contributor

bcmills commented Dec 10, 2021

Seen on a linux-arm64-aws TryBot for unrelated CL 370666 (https://storage.googleapis.com/go-build-log/a297f68d/linux-arm64-aws_04471a78.log):

--- FAIL: TestCPUProfileMultithreadMagnitude (0.66s)
    pprof_test.go:127: Running on Linux 4.19.0
    pprof_test.go:526: total 59 CPU profile samples collected:
        5: 0x7566c (runtime.epollwait:731) 0x3f23f (runtime.netpoll:126) 0x4b013 (runtime.findrunnable:2956) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        8: 0x138d10 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x52680 (runtime.pMask.set:5749 runtime.pidleput:5815) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        2: 0x138cfc (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x4b02b (runtime.nanotime:19 runtime.findrunnable:2958) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        31: 0x138d90 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x74b30 (runtime.asyncPreempt:7) 0x138d7b (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        2: 0x74b30 (runtime.asyncPreempt:7) 0x138d8f (runtime/pprof.cpuHog0:62 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        7: 0x138d7c (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x526d8 (runtime.pidleput:5818) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
    pprof_test.go:597: runtime/pprof.cpuHog1: 10
    pprof_test.go:597: runtime/pprof.cpuHog3: 41
    --- FAIL: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof (0.00s)
        pprof_test.go:214: compare 183.697ms vs 100ms
        pprof_test.go:226: CPU usage reports are too different (limit -40.0%, got -45.6%)
    pprof_test.go:130: Failure of this test may indicate that your system suffers from a known Linux kernel bug fixed on newer kernels. See https://golang.org/issue/49065.
FAIL
FAIL	runtime/pprof	5.806s
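For reference, the -45.6% figure is the relative difference between the two measurements with rusage as the baseline (the formula is inferred from the output, not quoted from the test): (100ms - 183.697ms) / 183.697ms ≈ -0.456, so the pprof total undershoots rusage by about 45.6%, beyond the -40.0% allowance.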

Marking as release-blocker for Go 1.18 because this test is new (as of CL 324129).
(CC @prattmic @rhysh)

@bcmills bcmills added the NeedsInvestigation, release-blocker, and okay-after-beta1 labels Dec 10, 2021
@bcmills bcmills added this to the Go1.18 milestone Dec 10, 2021
@rhysh
Contributor

rhysh commented Dec 10, 2021

The number of samples in the serial portion (cpuHog1) is spot-on: 10ms per sample, 100ms wall clock, 10 samples. The number of samples in the parallel portion (cpuHog3) is very close to the goal: 10ms per sample, 100ms wall clock, GOMAXPROCS=4 (this builder runs on an m6g.xlarge, 4 cores and no hyperthreading), 41 samples vs an expected 40.
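As a sanity check, here is that arithmetic as a runnable sketch (the constants mirror the numbers quoted above; the snippet is illustrative and not part of the test itself):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Expected sample counts at the default 100 Hz profiling rate.
        const (
            samplePeriod = 10 * time.Millisecond  // one SIGPROF per 10ms of CPU
            wallClock    = 100 * time.Millisecond // duration of each test phase
            gomaxprocs   = 4                      // m6g.xlarge: 4 cores, no SMT
        )
        serial := int(wallClock / samplePeriod) // cpuHog1: 10 expected
        parallel := gomaxprocs * serial         // cpuHog3: 40 expected, 41 observed
        fmt.Println(serial, parallel)
    }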

The failure is in the comparison of the serial portion. The profile reports 100ms, but rusage(2) reports 184ms. So this looks like another source of "extra/background/non-application CPU time that the runtime expends". (Earlier flakes were from GC.)
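For context, the rusage side of that comparison boils down to getrusage(2). Below is a minimal Linux sketch modeled loosely on the diffCPUTimeRUsage frames visible in the stacks above; the helper name and structure are illustrative, not the actual test code:

    package main

    import (
        "fmt"
        "syscall"
        "time"
    )

    // processCPUTime returns the user+system CPU time consumed by the
    // whole process so far, as reported by the kernel via getrusage(2).
    func processCPUTime() (time.Duration, error) {
        var ru syscall.Rusage
        if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
            return 0, err
        }
        return time.Duration(ru.Utime.Nano() + ru.Stime.Nano()), nil
    }

    func main() {
        before, _ := processCPUTime() // errors ignored for brevity
        // ... run the workload being measured ...
        after, _ := processCPUTime()
        fmt.Println(after - before) // OS-reported CPU time for the workload
    }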

Of the 59 total samples, after taking out the 10+41 in the CPU hog functions the test expects, the remaining 8 are all in findrunnable. I'm surprised to see any there, and I'm not sure what to make of it.

Of those, five point at the PC right after the epollwait syscall:

    MOVW R0, ret+24(FP)

Two are at atomics:

    atomic.Or(&p[word], mask)
    atomic.Xadd(&sched.npidle, 1) // TODO: fast atomic

One is at a function call boundary, either before or after calling nanotime1:

    return nanotime1()

@cherrymui cherrymui removed the okay-after-beta1 label Dec 14, 2021
@prattmic prattmic changed the title runtime/pprof: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof failure with "CPU usage reports are too different" on linux-arm64-aws runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws Dec 16, 2021
@gopherbot

Change https://golang.org/cl/372800 mentions this issue: runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately

gopherbot pushed a commit that referenced this issue Jan 10, 2022
runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately

Currently TestCPUProfileMultithreadMagnitude runs two CPU consumption
functions in a single profile and then analyzes the results as separate
subtests.

This works fine, but when debugging failures it makes manual analysis of
the profile dump a bit annoying.

Refactor the test to collect separate profiles for each subtest for
easier future analysis.

For #50097.
For #50232.

Change-Id: Ia1c8bb86aaaf652e64c5e660dcc2da47d2194c2b
Reviewed-on: https://go-review.googlesource.com/c/go/+/372800
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Bryan Mills <bcmills@google.com>
@heschi
Contributor

heschi commented Jan 12, 2022

Was the change above intended to fix the test, or just improve debuggability? This is marked as a release blocker.

@prattmic
Member

Just improve debuggability. Fixing this still needs more investigation, which the CL helps with.

@bcmills
Contributor Author

bcmills commented Jan 12, 2022

Given that the rate of builder runs has slowed down for the code freeze, and given that these failures appear to be detecting a mostly-successful-but-incomplete fix, would it make sense to add a testenv.SkipFlaky for this test until the tree reopens for 1.19 and then pick it up from there?
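For reference, such a skip is a one-liner with the Go repo's internal/testenv helper. The placement and the linux/arm64 constraint in this sketch are hypothetical; as noted in the next comment, it is not clear the failure is really limited to that platform:

    package pprof // sketch of the top of pprof_test.go; not an actual change

    import (
        "internal/testenv" // only importable from within the Go repository
        "runtime"
        "testing"
    )

    func TestCPUProfileMultithreadMagnitude(t *testing.T) {
        if runtime.GOOS == "linux" && runtime.GOARCH == "arm64" {
            testenv.SkipFlaky(t, 50097) // skip until the tree reopens for 1.19
        }
        // ... existing test body ...
    }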

@prattmic
Member

I ran this test for several hours on a linux-arm64-aws gomote and was unable to reproduce any failures.

I think that adding a temporary skip would be OK given that, as @rhysh mentions, this appears to be unexpected background CPU rather than an issue with profiling itself. That said, I don't have any particular reason to believe this issue would be limited to arm64, so I'm not sure what constraints to put on the skip.

@rhysh right now the test compares time in runtime/pprof.cpuHog1 samples vs rusage. What would you think about comparing the total sample time (i.e., any samples) vs rusage? That should still cover the core of this test, which is whether we receive enough samples to cover all CPU usage, and I believe it should solve these kinds of issues with background runtime CPU.
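A minimal sketch of that alternative comparison, using simplified stand-in types rather than the real test helpers:

    package main

    import (
        "fmt"
        "time"
    )

    // sample is a simplified stand-in for a decoded pprof sample.
    type sample struct{ count int }

    // totalProfiledCPU sums every sample in the profile, not just the
    // cpuHog frames, so background work (scheduler, GC) is implicitly
    // counted on the pprof side before comparing against rusage.
    func totalProfiledCPU(samples []sample, period time.Duration) time.Duration {
        var total time.Duration
        for _, s := range samples {
            total += time.Duration(s.count) * period
        }
        return total
    }

    func main() {
        // 59 samples at 10ms each, as in the failing run above.
        fmt.Println(totalProfiledCPU([]sample{{count: 59}}, 10*time.Millisecond))
    }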

@toothrot
Contributor

Any updates on this, @prattmic? Checking in as it's a release-blocker for Go 1.18.

@gopherbot

Change https://golang.org/cl/379535 mentions this issue: runtime/pprof: compare all samples vs rusage in TestCPUProfileMultithreadMagnitude

@prattmic prattmic self-assigned this Feb 17, 2022
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
runtime/pprof: compare all samples vs rusage in TestCPUProfileMultithreadMagnitude

TestCPUProfileMultithreadMagnitude compares pprof results vs OS rusage
to verify that pprof is capturing all CPU usage. Presently it compares
the sum of cpuHog1 samples vs rusage. However, background usage from the
scheduler, GC, etc. can cause additional CPU usage, causing test failures
if rusage is too far off from the cpuHog1 samples.

That said, this test doesn't actually need to care about cpuHog1
samples. It simply cares that pprof samples match rusage, not what the
breakdown of usage was. As a result, we can compare the sum of _all_
pprof samples vs rusage, which should implicitly include any background
CPU usage.

Fixes golang#50097.

Change-Id: I649a18de5b3dcf58b62be5962fa508d14cd4dc79
Reviewed-on: https://go-review.googlesource.com/c/go/+/379535
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
@golang golang locked and limited conversation to collaborators Jun 24, 2023