
runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws #50097

Closed
bcmills opened this issue Dec 10, 2021 · 8 comments
Assignees: prattmic
Labels: FrozenDueToAge, NeedsInvestigation, release-blocker
Milestone: Go1.18

@bcmills
Contributor

bcmills commented Dec 10, 2021

Seen on a linux-arm64-aws TryBot for unrelated CL 370666 (https://storage.googleapis.com/go-build-log/a297f68d/linux-arm64-aws_04471a78.log):

--- FAIL: TestCPUProfileMultithreadMagnitude (0.66s)
    pprof_test.go:127: Running on Linux 4.19.0
    pprof_test.go:526: total 59 CPU profile samples collected:
        5: 0x7566c (runtime.epollwait:731) 0x3f23f (runtime.netpoll:126) 0x4b013 (runtime.findrunnable:2956) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        8: 0x138d10 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x52680 (runtime.pMask.set:5749 runtime.pidleput:5815) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        2: 0x138cfc (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x4b02b (runtime.nanotime:19 runtime.findrunnable:2958) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        31: 0x138d90 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x74b30 (runtime.asyncPreempt:7) 0x138d7b (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        2: 0x74b30 (runtime.asyncPreempt:7) 0x138d8f (runtime/pprof.cpuHog0:62 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        7: 0x138d7c (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x526d8 (runtime.pidleput:5818) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
    pprof_test.go:597: runtime/pprof.cpuHog1: 10
    pprof_test.go:597: runtime/pprof.cpuHog3: 41
    --- FAIL: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof (0.00s)
        pprof_test.go:214: compare 183.697ms vs 100ms
        pprof_test.go:226: CPU usage reports are too different (limit -40.0%, got -45.6%)
    pprof_test.go:130: Failure of this test may indicate that your system suffers from a known Linux kernel bug fixed on newer kernels. See https://golang.org/issue/49065.
FAIL
FAIL	runtime/pprof	5.806s
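For reference, the -45.6% figure is the relative difference between the two measurements with rusage as the baseline (the formula is inferred from the output, not quoted from the test): (100ms - 183.697ms) / 183.697ms ≈ -0.456, so the pprof total undershoots rusage by about 45.6%, beyond the -40.0% allowance.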

Marking as release-blocker for Go 1.18 because this test is new (as of CL 324129).
(CC @prattmic @rhysh)

@bcmills bcmills added the NeedsInvestigation, release-blocker, and okay-after-beta1 labels Dec 10, 2021
@bcmills bcmills added this to the Go1.18 milestone Dec 10, 2021
@rhysh
Contributor

rhysh commented Dec 10, 2021

The number of samples in the serial portion (cpuHog1) is spot-on: 10ms per sample, 100ms wall clock, 10 samples. The number of samples in the parallel portion (cpuHog3) is very close to the goal: 10ms per sample, 100ms wall clock, GOMAXPROCS=4 (this builder runs on an m6g.xlarge, 4 cores and no hyperthreading), 41 samples vs an expected 40.
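As a sanity check, here is that arithmetic as a runnable sketch (the constants mirror the numbers quoted above; the snippet is illustrative and not part of the test itself):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Expected sample counts at the default 100 Hz profiling rate.
        const (
            samplePeriod = 10 * time.Millisecond  // one SIGPROF per 10ms of CPU
            wallClock    = 100 * time.Millisecond // duration of each test phase
            gomaxprocs   = 4                      // m6g.xlarge: 4 cores, no SMT
        )
        serial := int(wallClock / samplePeriod) // cpuHog1: 10 expected
        parallel := gomaxprocs * serial         // cpuHog3: 40 expected, 41 observed
        fmt.Println(serial, parallel)
    }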

The failure is in the comparison of the serial portion. The profile reports 100ms, but rusage(2) reports 184ms. So this looks like another source of "extra/background/non-application CPU time that the runtime expends". (Earlier flakes were from GC.)
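For context, the rusage side of that comparison boils down to getrusage(2). Below is a minimal Linux sketch modeled loosely on the diffCPUTimeRUsage frames visible in the stacks above; the helper name and structure are illustrative, not the actual test code:

    package main

    import (
        "fmt"
        "syscall"
        "time"
    )

    // processCPUTime returns the user+system CPU time consumed by the
    // whole process so far, as reported by the kernel via getrusage(2).
    func processCPUTime() (time.Duration, error) {
        var ru syscall.Rusage
        if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
            return 0, err
        }
        return time.Duration(ru.Utime.Nano() + ru.Stime.Nano()), nil
    }

    func main() {
        before, _ := processCPUTime() // errors ignored for brevity
        // ... run the workload being measured ...
        after, _ := processCPUTime()
        fmt.Println(after - before) // OS-reported CPU time for the workload
    }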

Of the 59 total samples, after taking out the 10+41 in the CPU hog functions the test expects, the remaining 8 are all in findrunnable. I'm surprised to see any there, and I'm not sure what to make of it.

Of those, five point at the PC right after the epollwait syscall:

    MOVW R0, ret+24(FP)

Two are at atomics:

    atomic.Or(&p[word], mask)
    atomic.Xadd(&sched.npidle, 1) // TODO: fast atomic

One is at a function call boundary, either before or after calling nanotime1:

    return nanotime1()

@cherrymui cherrymui removed the okay-after-beta1 label Dec 14, 2021
@prattmic prattmic changed the title runtime/pprof: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof failure with "CPU usage reports are too different" on linux-arm64-aws runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws Dec 16, 2021
@gopherbot

Change https://golang.org/cl/372800 mentions this issue: runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately

gopherbot pushed a commit that referenced this issue Jan 10, 2022
runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately

Currently TestCPUProfileMultithreadMagnitude runs two CPU consumption
functions in a single profile and then analyzes the results as separate
subtests.

This works fine, but when debugging failures it makes manual analysis of
the profile dump a bit annoying.

Refactor the test to collect separate profiles for each subtest for
easier future analysis.

For #50097.
For #50232.

Change-Id: Ia1c8bb86aaaf652e64c5e660dcc2da47d2194c2b
Reviewed-on: https://go-review.googlesource.com/c/go/+/372800
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Bryan Mills <bcmills@google.com>
@heschi
Contributor

heschi commented Jan 12, 2022

Was the change above intended to fix the test, or just improve debuggability? This is marked as a release blocker.

@prattmic
Member

Just improve debuggability. Fixing this still needs more investigation, which the CL helps with.

@bcmills
Contributor Author

bcmills commented Jan 12, 2022

Given that the rate of builder runs has slowed down for the code freeze, and given that these failures appear to be detecting a mostly-successful-but-incomplete fix, would it make sense to add a testenv.SkipFlaky for this test until the tree reopens for 1.19 and then pick it up from there?
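For reference, such a skip is a one-liner with the Go repo's internal/testenv helper. The placement and the linux/arm64 constraint in this sketch are hypothetical; as noted in the next comment, it is not clear the failure is really limited to that platform:

    package pprof // sketch of the top of pprof_test.go; not an actual change

    import (
        "internal/testenv" // only importable from within the Go repository
        "runtime"
        "testing"
    )

    func TestCPUProfileMultithreadMagnitude(t *testing.T) {
        if runtime.GOOS == "linux" && runtime.GOARCH == "arm64" {
            testenv.SkipFlaky(t, 50097) // skip until the tree reopens for 1.19
        }
        // ... existing test body ...
    }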

@prattmic
Member

I ran this test for several hours on a linux-arm64-aws gomote and was unable to reproduce any failures.

I think that adding a temporary skip would be OK given that, as @rhysh mentions, this appears to be unexpected background CPU rather than an issue with profiling itself. That said, I don't have any particular reason to believe this issue would be limited to arm64, so I'm not sure what constraints to put on the skip.

@rhysh right now the test compares time in runtime/pprof.cpuHog1 samples vs rusage. What would you think about comparing the total sample time (i.e., any samples) vs rusage? That should still cover the core of this test, which is whether we receive enough samples to cover all CPU usage, and I believe it should solve these kinds of issues with background runtime CPU.
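A minimal sketch of that alternative comparison, using simplified stand-in types rather than the real test helpers:

    package main

    import (
        "fmt"
        "time"
    )

    // sample is a simplified stand-in for a decoded pprof sample.
    type sample struct{ count int }

    // totalProfiledCPU sums every sample in the profile, not just the
    // cpuHog frames, so background work (scheduler, GC) is implicitly
    // counted on the pprof side before comparing against rusage.
    func totalProfiledCPU(samples []sample, period time.Duration) time.Duration {
        var total time.Duration
        for _, s := range samples {
            total += time.Duration(s.count) * period
        }
        return total
    }

    func main() {
        // 59 samples at 10ms each, as in the failing run above.
        fmt.Println(totalProfiledCPU([]sample{{count: 59}}, 10*time.Millisecond))
    }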

@toothrot
Contributor

Any updates on this, @prattmic? Checking in as it's a release-blocker for Go 1.18.

@gopherbot

Change https://golang.org/cl/379535 mentions this issue: runtime/pprof: compare all samples vs rusage in TestCPUProfileMultithreadMagnitude

@prattmic prattmic self-assigned this Feb 17, 2022
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
runtime/pprof: compare all samples vs rusage in TestCPUProfileMultithreadMagnitude

TestCPUProfileMultithreadMagnitude compares pprof results vs OS rusage
to verify that pprof is capturing all CPU usage. Presently it compares
the sum of cpuHog1 samples vs rusage. However, background usage from the
scheduler, GC, etc. can cause additional CPU usage, causing test failures
if rusage is too far off from the cpuHog1 samples.

That said, this test doesn't actually need to care about cpuHog1
samples. It simply cares that pprof samples match rusage, not what the
breakdown of usage was. As a result, we can compare the sum of _all_
pprof samples vs rusage, which should implicitly include any background
CPU usage.

Fixes golang#50097.

Change-Id: I649a18de5b3dcf58b62be5962fa508d14cd4dc79
Reviewed-on: https://go-review.googlesource.com/c/go/+/379535
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
@golang golang locked and limited conversation to collaborators Jun 24, 2023