Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/trace: TestTraceCPUProfile failures with runtime._System sample not present in execution trace #53378

Closed
bcmills opened this issue Jun 14, 2022 · 5 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Jun 14, 2022

--- FAIL: TestTraceCPUProfile (1.57s)
    trace_test.go:715: CPU profile included 1 samples at stack ":0 runtime._System:4491" not present in execution trace

greplogs -l -e 'runtime\._System:\d+" not present in execution trace'
2022-06-06T19:55:16-7271a0a/android-amd64-emu
2022-05-05T15:43:29-fdb640b/windows-amd64-race

TestTraceCPUProfile was previously believed to have been fixed in #52693 (attn @rhysh @cherrymui @prattmic; CC @golang/runtime).

@bcmills
Copy link
Contributor Author

bcmills commented Jun 14, 2022

Marking as release-blocker for Go 1.19 because this failure mode affects windows/amd64 (a first-class port).

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Jun 14, 2022
@bcmills bcmills added this to the Go1.19 milestone Jun 14, 2022
@prattmic
Copy link
Member

prattmic commented Jun 14, 2022

The missing samples are both runtime._System stacks, so my first thought was that we simply don't include samples missing a P in the trace. However, it looks like we do.

Perhaps this is more mundane: the test disables tracing before CPU profiling, so there is a race between disabling tracing and disabling profiling during which a sample could end up in the pprof profile but not the trace.

@rhysh perhaps we should only complain about mismatched cpuHogger samples, which should be strictly bound inside tracing and profiling being enabled?

@rhysh
Copy link
Contributor

rhysh commented Jun 15, 2022

In the android/amd64 failure: The top-level error message is trace_test.go:715: CPU profile included 1 samples at stack ":0 runtime._System:4491" not present in execution trace. A few lines down in the list of samples that do appear in the execution trace, the log shows trace_test.go:724: 1: "runtime._System:4491". The sample in the raw CPU profile dump is 1 10000000: 25 26 / tracing:[on]. The PC is within the mapping of /system/lib64/libc.so.

In the windows/amd64-race failure: The top-level error message is trace_test.go:689: CPU profile included 1 samples at stack "_ZN6__tsan11ThreadClock7releaseEPNS_19DenseSlabAllocCacheEPNS_9SyncClockE:0 runtime._System:4480" not present in execution trace. There's no mention of runtime._System in the list of CPU samples that the execution trace includes. The sample in the raw CPU profile dump is 1 10000000: 9 10 / tracing:[on]. The function looks like part of the race detector runtime (the CPU profile does not list the mappings).

Calls from runtime._System are inserted by runtime.sigprof when "Normal traceback is impossible or has failed", which then adds the same stk[:n] to both the CPU profile and the execution trace.

The execution tracer and the CPU profiler do their symbolization in rather different ways: a single call to runtime.CallersFrames (the allFrames function in src/runtime/trace.go), or a more involved approach that eventually leads to a separate runtime.CallersFrames call for each single address (the allFrames function in src/runtime/pprof/proto.go). Maybe that explains the difference? These two-frame call stacks are faked by runtime.sigprof, and built in a way that is known to render correctly with the CPU profiler's approach of splitting addresses and symbolizing each separately, and might (I don't know the details of CallersFrames' operation) not deal well with being symbolized as a pair as the new code does.

Should we try to change the execution tracer's profile stacks to include more info on reported callees of runtime._System? Or should it continue to report that as runtime._System calling a non-symbolized function :0 (or, as with the windows/amd64-race failure, to drop the sample), with the test changing to ignore samples that show a root of runtime._System or the other functions runtime.sigprof uses (runtime._GC, runtime._ExternalCode, and runtime._VDSO)? I'm currently in favor of adding code to the test that special-cases those generated stacks.

I don't think it's the mundane race, @prattmic : the test filters the CPU profile samples down to the ones that include the "tracing:[on]" label, which has a fairly well-controlled start and end time. That should make the order of the Start/Stop function calls not matter.

@prattmic
Copy link
Member

I'm currently in favor of adding code to the test that special-cases those generated stacks.

By "special-case" in the test, you mean ignore, right? I think that is fine. It would be nice to get the execution trace symbolization to match up with pprof a bit better in the these cases, but I certainly don't think it is high priority or something that needs to be in 1.19.

the test filters the CPU profile samples down to the ones that include the "tracing:[on]" label, which has a fairly well-controlled start and end time.

Apologies, I did not pay close enough attention and missed that.

@prattmic prattmic added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 16, 2022
@gopherbot
Copy link

Change https://go.dev/cl/413457 mentions this issue: runtime/trace: ignore fallback stacks in test

@rsc rsc unassigned rhysh Jun 22, 2022
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
When runtime.sigprof encounters a stack that gentraceback is unable to
process, it synthesizes a call stack with a sentinel function (such as
runtime._System) at the leaf.

The test to confirm that runtime/trace and runtime/pprof have similar
views of CPU profile samples has trouble with those stacks. The test
confirms that the samples match by confirming that their symbolized
forms match, and the symbolization procedure is very different for the
two packages.

Skip the samples that the CPU profiler's view symbolizes to include one
of runtime.sigprof's sentinel functions at the leaf. (The test design
expects the CPU profiler to under-report samples relative to the
execution tracer.)

Fixes golang#53378

Change-Id: I60c27de4c69b454057d28a3b6e12d70369de4c4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/413457
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
@golang golang locked and limited conversation to collaborators Jun 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
Development

No branches or pull requests

4 participants