Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/trace: TestTraceCPUProfile fails intermittently starting 2022-05-03 #52693

Closed
rhysh opened this issue May 3, 2022 · 10 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Soon This needs to be done soon. (regressions, serious bugs, outages)
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented May 3, 2022

The failures I've seen so far are on android/386 and openbsd/amd64. Both show a sample starting in runtime.mcall (which I think is important to the bug) and leading to how a preemption is reported in the execution trace (which I think is less important). I'll take a look at what mcall means for getg(), and getg().m.p.ptr().

CC @bcmills (you'll see these when you do the next dashboard triage).

$ greplogs -e TestTraceCPUProfile -dashboard -l -since 2022-05-02 -md

2022-05-03T21:11:24-23f1325/android-386-emu
2022-05-03T21:11:24-23f1325/openbsd-amd64-70

--- FAIL: TestTraceCPUProfile (0.31s)
    trace_test.go:689: CPU profile included 1 samples at stack "runtime.step:1122 runtime.pcvalue:918 runtime.funcspdelta:1041 runtime.gentraceback:191 runtime.gcallers:920 runtime.traceStackID:791 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoPreempt:1285 runtime.gopreempt_m:3396 runtime.mcall:325" not present in execution trace
    trace_test.go:693: execution trace CPU samples:
    trace_test.go:695: 8: "runtime/trace_test.cpuHog0:732 runtime/trace_test.cpuHog1:723 runtime/trace_test.cpuHogger:710 runtime/trace_test.TestTraceCPUProfile.func1.1:613 runtime/pprof.Do:40 runtime/trace_test.TestTraceCPUProfile.func1:612 runtime/trace_test.TestTraceCPUProfile:618 testing.tRunner:1449"
    trace_test.go:695: 10: "runtime/trace_test.cpuHog0:732 runtime/trace_test.cpuHog1:723 runtime/trace_test.cpuHogger:710 runtime/trace_test.TestTraceCPUProfile.func1:617 runtime/trace_test.TestTraceCPUProfile:618 testing.tRunner:1449"
--- FAIL: TestTraceCPUProfile (0.32s)
    trace_test.go:689: CPU profile included 1 samples at stack "runtime.gentraceback:154 runtime.gcallers:920 runtime.traceStackID:791 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoPreempt:1285 runtime.gopreempt_m:3396 runtime.mcall:448" not present in execution trace
    trace_test.go:693: execution trace CPU samples:
    trace_test.go:695: 9: "runtime/trace_test.cpuHog0:732 runtime/trace_test.cpuHog1:723 runtime/trace_test.cpuHogger:710 runtime/trace_test.TestTraceCPUProfile.func1.1:613 runtime/pprof.Do:40 runtime/trace_test.TestTraceCPUProfile.func1:612 runtime/trace_test.TestTraceCPUProfile:618 testing.tRunner:1449"
    trace_test.go:695: 10: "runtime/trace_test.cpuHog0:732 runtime/trace_test.cpuHog1:723 runtime/trace_test.cpuHogger:710 runtime/trace_test.TestTraceCPUProfile.func1:617 runtime/trace_test.TestTraceCPUProfile:618 testing.tRunner:1449"
@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels May 4, 2022
@bcmills bcmills added this to the Go1.19 milestone May 4, 2022
@bcmills
Copy link
Contributor

bcmills commented May 4, 2022

Since there is relatively little overlap between android/386 and openbsd/amd64, I'm assuming that this is likely a platform-independent bug, which makes it a release-blocker for Go 1.19.

(CC @golang/runtime)

@bcmills
Copy link
Contributor

bcmills commented May 4, 2022

Indeed, it also affects linux/386, which is a first class port.

greplogs -l -e 'FAIL: TestTraceCPUProfile' --since=2022-05-02
2022-05-03T21:11:24-23f1325/android-386-emu
2022-05-03T21:11:24-23f1325/linux-386-longtest
2022-05-03T21:11:24-23f1325/openbsd-amd64-70

In addition to those explicit failures, there was also an index out of range crash while executing the test that may or may not be related (filed as #52704).

@bcmills bcmills changed the title runtime/trace: TestTraceCPUProfile flakes on some platforms runtime/trace: TestTraceCPUProfile fails intermittently starting 2022-05-03 May 4, 2022
@gopherbot
Copy link

Change https://go.dev/cl/404055 mentions this issue: runtime: prefer curg for execution trace profile

@bcmills
Copy link
Contributor

bcmills commented May 5, 2022

@rhysh, this test has a very high failure rate — please aim to have it resolved today.

(If there isn't a quick solution to the underlying problem, either roll back the changes that introduced or exposed it, or add call to testenv.SkipFlaky to the test until it can be fixed.)

@bcmills bcmills added the Soon This needs to be done soon. (regressions, serious bugs, outages) label May 5, 2022
@rhysh
Copy link
Contributor Author

rhysh commented May 7, 2022

Of the results for greplogs -dashboard -l -e 'TestTraceCPUProfile' --since=2022-05-05 -md, four are recent enough to include the attempted fix at 7b1f8b6.

All four of those claim that the CPU profile included a stack with a doubled frame, and in all four of those the doubling is runtime.pcvalue:932 runtime.pcvalue:932. In all four, the execution trace's view has that same stack but with only one layer of runtime.pcvalue:932. That makes me wonder about a failure mode similar to #50996 -- meaning the execution tracer's new profiling stacks are correct, and the view from the existing runtime/pprof profiler is what's broken.


2022-05-06T04:25:34-f87e28d/openbsd-arm-jsing

trace_test.go:715: CPU profile included 1 samples at stack "runtime.fastrand:148 runtime.fastrandn:157 runtime.pcvalue:932 runtime.pcvalue:932 runtime.funcspdelta:1041 runtime.gentraceback:191 runtime.gcallers:920 runtime.traceStackID:792 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoSched:1280 runtime.gosched_m:3376 runtime.mcall:265" not present in execution trace

2022-05-05T23:41:16-f99511d/linux-mips64le-rtrk

trace_test.go:715: CPU profile included 1 samples at stack "runtime.fastrand:135 runtime.fastrandn:157 runtime.pcvalue:932 runtime.pcvalue:932 runtime.funcspdelta:1041 runtime.gentraceback:191 runtime.gcallers:920 runtime.traceStackID:792 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoSched:1280 runtime.gosched_m:3376 runtime.mcall:142" not present in execution trace

2022-05-05T22:53:14-86536b9/linux-mips64le-rtrk

trace_test.go:715: CPU profile included 1 samples at stack "runtime.fastrand:136 runtime.fastrandn:157 runtime.pcvalue:932 runtime.pcvalue:932 runtime.pcdatavalue:1076 runtime.gentraceback:364 runtime.gcallers:920 runtime.traceStackID:792 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoSched:1280 runtime.gosched_m:3376 runtime.mcall:142" not present in execution trace

2022-05-05T18:48:17-e94fe09/linux-386-longtest

trace_test.go:715: CPU profile included 1 samples at stack "runtime.fastrand:148 runtime.fastrandn:157 runtime.pcvalue:932 runtime.pcvalue:932 runtime.funcspdelta:1041 runtime.gentraceback:191 runtime.gcallers:920 runtime.traceStackID:792 runtime.traceEventLocked:667 runtime.traceEvent:601 runtime.traceGoSched:1280 runtime.gosched_m:3376 runtime.mcall:325" not present in execution trace

The following all predate the attempted fix at 7b1f8b6, so don't indicate an existing problem.
2022-05-05T15:43:29-fdb640b/openbsd-arm64-jsing
2022-05-05T15:43:29-fdb640b/windows-amd64-race
2022-05-05T04:53:27-0e90ba9/darwin-amd64-race
2022-05-05T04:53:27-0e90ba9/linux-arm64-packet
2022-05-05T04:53:27-0e90ba9/linux-mips64-rtrk
2022-05-05T04:05:43-d544591/openbsd-386-68
2022-05-05T04:05:43-d544591/windows-amd64-longtest
2022-05-05T01:33:11-0bd7408/android-arm-corellium
2022-05-05T01:33:01-ed462a6/linux-386-clang
2022-05-05T01:33:01-ed462a6/linux-arm64-packet

@gopherbot
Copy link

Change https://go.dev/cl/404698 mentions this issue: runtime/trace: mark new CPU profile test as flaky

@gopherbot
Copy link

Change https://go.dev/cl/404995 mentions this issue: runtime/pprof: fix allFrames cache

@rhysh
Copy link
Contributor Author

rhysh commented May 9, 2022

The builder failures continue to point to doubled runtime.pcvalue:932 runtime.pcvalue:932 frames. https://go.dev/cl/404995 fixes that (confirmed by the test I wrote for #52764).

greplogs -dashboard -l -e 'TestTraceCPUProfile' --since=2022-05-06 -md
2022-05-08T17:19:07-4c565a5/linux-amd64-wsl
2022-05-06T17:37:38-2049649/linux-mips-rtrk
2022-05-06T04:25:34-f87e28d/openbsd-arm-jsing

@bcmills bcmills reopened this May 9, 2022
@bcmills bcmills closed this as completed May 9, 2022
gopherbot pushed a commit that referenced this issue May 13, 2022
The compiler may choose to inline multiple layers of function call, such
that A calling B calling C may end up with all of the instructions for B
and C written as part of A's function body.

Within that function body, some PCs will represent code from function A.
Some will represent code from function B, and for each of those the
runtime will have an instruction attributable to A that it can report as
its caller. Others will represent code from function C, and for each of
those the runtime will have an instruction attributable to B and an
instruction attributable to A that it can report as callers.

When a profiling signal arrives at an instruction in B (as inlined in A)
that the runtime also uses to describe calls to C, the profileBuilder
ends up with an incorrect cache of allFrames results. That PC should
lead to a location record in the profile that represents the frames
B<-A, but the allFrames cache's view should expand the PC only to the B
frame.

Otherwise, when a profiling signal arrives at an instruction in C (as
inlined in B in A), the PC stack C,B,A can get expanded to the frames
C,B<-A,A as follows: The inlining deck starts empty. The first tryAdd
call proposes PC C and frames C, which the deck accepts. The second
tryAdd call proposes PC B and, due to the incorrect caching, frames B,A.
(A fresh call to allFrames with PC B would return the frame list B.) The
deck accepts that PC and frames. The third tryAdd call proposes PC A and
frames A. The deck rejects those because a call from A to A cannot
possibly have been inlined. This results in a new location record in the
profile representing the frames C<-B<-A (good), as called by A (bad).

The bug is the cached expansion of PC B to frames B<-A. That mapping is
only appropriate for the resulting protobuf-format profile. The cache
needs to reflect the results of a call to allFrames, which expands the
PC B to the single frame B.

For #50996
For #52693
Fixes #52764

Change-Id: I36d080f3c8a05650cdc13ced262189c33b0083b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/404995
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@rhysh
Copy link
Contributor Author

rhysh commented May 27, 2022

No failures here since May 13, greplogs -dashboard -l -e 'TestTraceCPUProfile' --since=2022-05-13 -md

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Soon This needs to be done soon. (regressions, serious bugs, outages)
Projects
None yet
Development

No branches or pull requests

3 participants