New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/pprof: TestLabelSystemstack due to sample with no location #51550
Comments
This is indeed interesting. I suspect that a sample with no location is not a new issue, just that this test is one of few to care. Unfortunately, from this information, we can't tell if this is an issue with the runtime internals failing to collect even a single stack frame, or with proto encoding failing to emit locations from frames. For this test, the best workaround is to skip samples with no locations. But I would like a reproducer with which to investigate the underlying issue, and notably I'd like to know if this is limited to windows. So for now I'd like to leave the test enabled to gather more data. I am also running a more strict version of this test (fail on any samples with no location) on a windows-amd64-race gomote overnight to see if I get anything. |
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
No hits since then, but no reason to suspect that it is fixed either.
|
Just hit this once today in https://storage.googleapis.com/go-build-log/d8188e3c/linux-amd64-nounified_2330af42.log when testing CL https://go-review.googlesource.com/c/go/+/426156. Maybe the CL is the culprit, but not sure either, so post it here and re-run the trybot to see if the problem disappear. |
Re-run trybot successfully, so sounds like the test still flaky somehow. |
|
cc @felixge |
@prattmic thanks for the ping, and sorry for this test causing issues again :(. Your last comment indicated you were waiting for more data, but otherwise had an idea for a workaround/fix. If you'd like my help with submitting a patch for that I can take a look. Otherwise I'm not sure if I'll be able to shed further light on the root cause of the empty stack. But I could try to take a closer look at that as well. |
Found new dashboard test flakes for:
2022-09-06 15:49 linux-amd64-longtest go@e3885c4e runtime/pprof.TestLabelSystemstack (log)
2022-09-26 20:52 linux-amd64-nounified go@84f95aa8 runtime/pprof.TestLabelSystemstack (log)
|
watchflakes hasn't updated this bug since Oct 4th. I think we should fix this but since it was introduced in the last release I don't think this is a release blocker (and I don't think it's failing frequently enough to be causing significant harm). It's prioritized as P1 for triage which sounds right to me. I'll set up a dashboard so we don't miss flakes like this and get around to them soon. I'm going over the other "unassigned" issues now, so they'll take priority if they look more strongly like release blockers. |
Actually, there seems to have been 2 additional failures since then: 2022-11-17T20:52:28-6e0e492/darwin-amd64-11-aws |
Change https://go.dev/cl/460516 mentions this issue: |
Found new dashboard test flakes for:
2022-11-18 21:48 openbsd-amd64-71 go@ea2c27fe runtime/pprof.TestLabelSystemstack (log)
2022-12-05 16:41 darwin-amd64-12_0 go@ad55b878 runtime/pprof.TestLabelSystemstack (log)
2022-12-12 14:20 openbsd-386-71 go@6b895d9e runtime/pprof.TestLabelSystemstack (log)
|
I added a check to profileOk to fail if any profile collected by the runtime/pprof tests contains an empty stack. I haven't been able to reproduce a failure with labels and no stack, but I have reproduced a sample with no stack 8 times out of 1464 runs of Looking at the code, I'm still tracing the read side. |
There is a test of empty stacks (TestEmptyStack), which further suggests that it's okay to have an empty stack in a profile. I'm still digging in to the cause, though. |
I've narrowed it down a fair amount. In profileBuilder.build, where we iterate over the profile hash table, e.stk is non-empty, but the sequence of locations returned by appendLocsForStack for e.stk is empty. So somehow appendLocsForStack is turning a non-empty PC slice into an empty location slice. |
Found it. Sometimes we get a stack trace consisting solely of This is clearly intentional behavior for allFrames expanding the stack of a single PC at a time, but it's less clear if it's intentional behavior for this to lead to an empty accumulated stack trace. I would argue that, if we really get a sample just as a goroutine is exiting and has unwound its stack, then that really is a sample with an empty stack. And we probably do want to include it in the profile so that the cumulative sample count is correct. |
Given my reasoning above, that means this is a test-only failure, so I'm dropping release-blocker. We can still land the CL I have out that fixes the test failure. I'll update that to explain a little more in the commit message. |
Change https://go.dev/cl/460695 mentions this issue: |
I spent quite a while determining the cause of empty stacks in profiles and reasoning out why this is okay. There isn't a great place to record this knowledge, but a documentation comment on appendLocsForStack is better than nothing. Updates #51550. Change-Id: I2eefc6ea31f1af885885c3d96199319f45edb4ce Reviewed-on: https://go-review.googlesource.com/c/go/+/460695 Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com>
Found new dashboard test flakes for:
2023-01-23 15:51 openbsd-amd64-71 go@e22bd234 runtime/pprof.TestLabelSystemstack (log)
|
The failure is #57585 . |
Found new dashboard test flakes for:
2023-02-14 17:25 darwin-amd64-10_15 go@5c3e11bd runtime/pprof.TestLabelSystemstack (log)
|
The above failure was on |
@gopherbot Please backport to 1.19. This is a fix for a flaky test, which is also flaky on 1.19. |
Backport issue(s) opened: #58939 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/474618 mentions this issue: |
…temstack The current output of TestLabelSystemstack is a bit cryptic. This CL improves various messages and hopefully simplifies the logic in the test. Simplifying the logic leads to three changes in possible outcomes, which I verified by running the logic before and after this change through all 2^4 possibilities (https://go.dev/play/p/bnfb-OQCT4j): 1. If a sample both must be labeled and must not be labeled, the test now reports that explicitly rather than giving other confusing output. 2. If a sample must not be labeled but is, the current logic will print two identical error messages. The new logic prints only one. 3. If the test finds no frames at all that it recognizes, but the sample is labeled, it will currently print a confusing "Sample labeled got true want false" message. The new logic prints nothing. We've seen this triggered by empty stacks in profiles. Fixes #51550. This bug was caused by case 3 above, where it was triggered by a profile label on an empty stack. It's valid for empty stacks to appear in a profile if we sample a goroutine just as it's exiting (and that goroutine may have a profile label), so the test shouldn't fail in this case. For #58939. Change-Id: I1593ec4ac33eced5bb89572a3ba7623e56f2fb3d Reviewed-on: https://go-review.googlesource.com/c/go/+/460516 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com> Reviewed-by: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> (cherry picked from commit d9f23cf) Reviewed-on: https://go-review.googlesource.com/c/go/+/474618 Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
Found new dashboard test flakes for:
2024-04-03 15:35 go1.21-linux-arm release-branch.go1.21@d8392e69 runtime/pprof.TestLabelSystemstack [ABORT] (log)
|
This one is a doozy. The failure message comes from here:
https://cs.opensource.google/go/go/+/master:src/runtime/pprof/pprof_test.go;l=1535;drc=18c2033ba587ce63fc9f2d6f52b8bb2e395c561f
That seems to imply that the sample was labeled but its stack was empty(!!), which does seem to be the case in the dump of collected profiles:
(attn @prattmic, CC @golang/runtime)
greplogs --dashboard -md -l -e '(?ms)FAIL: TestLabelSystemstack.*Sample labeled got true want false:\s*$'
2022-03-05T08:36:13-55a60ca/windows-amd64-race
The text was updated successfully, but these errors were encountered: