Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: TestLabelSystemstack due to sample with no location #51550

Open
bcmills opened this issue Mar 8, 2022 · 30 comments
Open

runtime/pprof: TestLabelSystemstack due to sample with no location #51550

bcmills opened this issue Mar 8, 2022 · 30 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 8, 2022

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"

This one is a doozy. The failure message comes from here:
https://cs.opensource.google/go/go/+/master:src/runtime/pprof/pprof_test.go;l=1535;drc=18c2033ba587ce63fc9f2d6f52b8bb2e395c561f

That seems to imply that the sample was labeled but its stack was empty(!!), which does seem to be the case in the dump of collected profiles:

        1: labels: map[key:[value]]

(attn @prattmic, CC @golang/runtime)

--- FAIL: TestLabelSystemstack (0.32s)
    pprof_test.go:524: total 85 CPU profile samples collected:
        1: 0x866eef (_ZN6__tsanL14MemoryRangeSetEPNS_11ThreadStateEyyyy.isra.39.part.40:0) 0x8e1ec0 (runtime._System:4432) labels: map[]
        
        2: 0x86a6bc (_ZN6__tsan7MetaMap9FreeRangeEPNS_9ProcessorEyy:0) 0x8e1ec0 (runtime._System:4432) labels: map[key:[value]]
        
        1: 0x8d2f04 (runtime.stdcall1:1090) 0x8d23a9 (runtime.semawakeup:871) 0x8a96ad (runtime.notewakeup:161) 0x8dce48 (runtime.startm:2324) 0x8ded1e (runtime.injectglist.func1:3076 runtime.injectglist:3100) 0x8bf5b6 (runtime.wakeScavenger:222) 0x8bf6d6 (runtime.bgscavenge.func1:268) 0x8f6d21 (runtime.runOneTimer:867) 0x8f6ad2 (runtime.runtimer:775) 0x8df2cf (runtime.checkTimers:3286) 0x8de484 (runtime.stealWork:2868) 0x8dda35 (runtime.findrunnable:2599) 0x8defd8 (runtime.schedule:3187) 0x8df52c (runtime.park_m:3336) 0x905f89 (runtime.mcall:425) labels: map[]
        
        13: 0x907ee6 (runtime.procyield:733) 0x8a9475 (runtime.lock2:69) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        1: 0x866ef7 (_ZN6__tsanL14MemoryRangeSetEPNS_11ThreadStateEyyyy.isra.39.part.40:0) 0x8e1ec0 (runtime._System:4432) labels: map[]
        
        1: 0x8bf0d0 (runtime.(*gcControllerState).commit:1089) 0x8bf195 (runtime.(*gcControllerState).setGCPercent:1246) 0x8bf204 (runtime.setGCPercent.func1:1256) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        5: 0x8a9402 (runtime.lock2:61) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        27: 0x8d2f84 (runtime.stdcall2:1099) 0x8d204e (runtime.semasleep:819) 0x8a94f7 (runtime.lock2:89) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) labels: map[key:[value]]
        
        4: 0x8a9419 (runtime.lock2:63) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        11: 0x8e1f00 (runtime._ExternalCode:4433) 0x8e1ec0 (runtime._System:4432) labels: map[key:[value]]
        
        10: 0x8d2f04 (runtime.stdcall1:1090) 0x8d23a9 (runtime.semawakeup:871) 0x8a95f5 (runtime.unlock2:117) 0x8bf238 (runtime.unlockWithRank:31 runtime.unlock:97 runtime.setGCPercent.func1:1259) 0x90600d (runtime.systemstack:469) labels: map[key:[value]]
        
        1: 0x8d3004 (runtime.stdcall3:1108) 0x8b5ba4 (runtime.sysUnused:33) 0x8c098a (runtime.(*pageAlloc).scavengeRangeLocked:775) 0x8c07cd (runtime.(*pageAlloc).scavengeOneFast:726) 0x8c0324 (runtime.(*pageAlloc).scavengeOne:637) 0x8bfcfc (runtime.(*pageAlloc).scavenge.func1:454) 0x90600d (runtime.systemstack:469) labels: map[]
        
        3: 0x907ee4 (runtime.procyield:732) 0x8a9475 (runtime.lock2:69) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        1: 0x8d2037 (runtime.semasleep:819) 0x8a94f7 (runtime.lock2:89) 0x8bf1f1 (runtime.lockWithRank:22 runtime.lock:36 runtime.setGCPercent.func1:1255) 0x90600d (runtime.systemstack:469) 0x9020a5 (runtime/debug.setGCPercent:1254) 0xa8de99 (runtime/debug.SetGCPercent:92 runtime/pprof.labelHog:1552) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        1: 0xa8de9a (runtime/pprof.labelHog:1549) 0xa8e152 (runtime/pprof.parallelLabelHog.func1:1565) labels: map[key:[value]]
        
        1: 0x8d2f04 (runtime.stdcall1:1090) 0x8d23a9 (runtime.semawakeup:871) 0x8a95f5 (runtime.unlock2:117) 0x8c07e7 (runtime.unlockWithRank:31 runtime.unlock:97 runtime.(*pageAlloc).scavengeOneFast:727) 0x8c0324 (runtime.(*pageAlloc).scavengeOne:637) 0x8bfcfc (runtime.(*pageAlloc).scavenge.func1:454) 0x90600d (runtime.systemstack:469) labels: map[]
        
        1: labels: map[key:[value]]
        
        1: 0x89a3a8 (.text$_ZN6__tsan14DenseSlabAllocINS_10ClockBlockELy65536ELy1024EE6RefillEPNS_19DenseSlabAllocCacheE:0) 0x8e1ec0 (runtime._System:4432) labels: map[]
        
    pprof_test.go:595: runtime.systemstack;key=value: 64
    pprof_test.go:1535: Sample labeled got true want false: 
FAIL
FAIL	runtime/pprof	19.664s

greplogs --dashboard -md -l -e '(?ms)FAIL: TestLabelSystemstack.*Sample labeled got true want false:\s*$'

2022-03-05T08:36:13-55a60ca/windows-amd64-race

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 8, 2022
@bcmills bcmills added this to the Go1.19 milestone Mar 8, 2022
@prattmic
Copy link
Member

prattmic commented Mar 8, 2022

This is indeed interesting. I suspect that a sample with no location is not a new issue, just that this test is one of few to care.

Unfortunately, from this information, we can't tell if this is an issue with the runtime internals failing to collect even a single stack frame, or with proto encoding failing to emit locations from frames.

For this test, the best workaround is to skip samples with no locations. But I would like a reproducer with which to investigate the underlying issue, and notably I'd like to know if this is limited to windows. So for now I'd like to leave the test enabled to gather more data. I am also running a more strict version of this test (fail on any samples with no location) on a windows-amd64-race gomote overnight to see if I get anything.

@prattmic prattmic changed the title runtime/pprof: TestLabelSystemstack failure with " runtime/pprof: TestLabelSystemstack due to sample with no location Mar 8, 2022
@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. OS-Windows and removed release-blocker labels Mar 8, 2022
@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@bcmills
Copy link
Contributor Author

bcmills commented Apr 11, 2022

No hits since then, but no reason to suspect that it is fixed either.

greplogs --dashboard -md -l -e '(?ms)FAIL: TestLabelSystemstack.*Sample labeled got true want false:\s*$' --since=2022-03-08

@bcmills bcmills reopened this Apr 11, 2022
@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Apr 11, 2022
@cuonglm
Copy link
Member

cuonglm commented Sep 2, 2022

Just hit this once today in https://storage.googleapis.com/go-build-log/d8188e3c/linux-amd64-nounified_2330af42.log when testing CL https://go-review.googlesource.com/c/go/+/426156. Maybe the CL is the culprit, but not sure either, so post it here and re-run the trybot to see if the problem disappear.

@cuonglm
Copy link
Member

cuonglm commented Sep 2, 2022

Just hit this once today in https://storage.googleapis.com/go-build-log/d8188e3c/linux-amd64-nounified_2330af42.log when testing CL https://go-review.googlesource.com/c/go/+/426156. Maybe the CL is the culprit, but not sure either, so post it here and re-run the trybot to see if the problem disappear.

Re-run trybot successfully, so sounds like the test still flaky somehow.

@bcmills
Copy link
Contributor Author

bcmills commented Sep 6, 2022

greplogs -l -e '(?ms)FAIL: TestLabelSystemstack.*Sample labeled got true want false:\s*$' --since=2022-03-08
2022-09-06T15:49:33-e3885c4/linux-amd64-longtest
2022-04-29T02:01:27-e7c56fe/windows-amd64-longtest

@bcmills bcmills reopened this Sep 6, 2022
@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 6, 2022
@bcmills bcmills modified the milestones: Go1.19, Go1.20 Sep 6, 2022
@bcmills bcmills reopened this Sep 6, 2022
@prattmic
Copy link
Member

cc @felixge

@felixge
Copy link
Contributor

felixge commented Sep 28, 2022

@prattmic thanks for the ping, and sorry for this test causing issues again :(. Your last comment indicated you were waiting for more data, but otherwise had an idea for a workaround/fix. If you'd like my help with submitting a patch for that I can take a look. Otherwise I'm not sure if I'll be able to shed further light on the root cause of the empty stack. But I could try to take a closer look at that as well.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"
2022-09-06 15:49 linux-amd64-longtest go@e3885c4e runtime/pprof.TestLabelSystemstack (log)
--- FAIL: TestLabelSystemstack (5.14s)
    pprof_test.go:539: total 5783 CPU profile samples collected:
        17: 0x420403 (runtime.(*gcControllerState).commit:1283) 0x420a53 (runtime.gcControllerCommit:1542) 0x4204a4 (runtime.setGCPercent.func1:1310) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: 0x468386 (runtime.procyield:756) 0x40b232 (runtime.lock2:88) 0x43e693 (runtime.lockWithRank:22 runtime.lock:48 runtime.goschedImpl:3424) 0x43e870 (runtime.gosched_m:3436) 0x466522 (runtime.mcall:448) labels: map[]

        7: 0x40b267 (runtime.lock2:98) 0x420455 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1082: 0x46a523 (runtime.futex:560) 0x431295 (runtime.futexsleep:69) 0x40b1f8 (runtime.lock2:107) 0x420455 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

...
        1: 0x420bf3 (runtime.gcPaceScavenger:182) 0x420aa6 (runtime.gcControllerCommit:1557) 0x4204a4 (runtime.setGCPercent.func1:1310) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: 0x40b2dd (runtime.lock2:85) 0x420455 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: 0x42003d (runtime/internal/atomic.(*Uint64).Load:313 runtime.(*gcControllerState).heapGoalInternal:963) 0x420184 (runtime.(*gcControllerState).trigger:1138) 0x420a84 (runtime.gcControllerCommit:1555) 0x4204a4 (runtime.setGCPercent.func1:1310) 0x4665a8 (runtime.systemstack:492) 0x463015 (runtime/debug.setGCPercent:1307) 0x54e031 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x54e21b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: labels: map[key:[value]]

    pprof_test.go:610: runtime.systemstack;key=value: 5739
    pprof_test.go:1885: Sample labeled got true want false: 
2022-09-26 20:52 linux-amd64-nounified go@84f95aa8 runtime/pprof.TestLabelSystemstack (log)
--- FAIL: TestLabelSystemstack (0.31s)
    pprof_test.go:539: total 110 CPU profile samples collected:
        23: 0x46a923 (runtime.futex:560) 0x431e75 (runtime.futexwakeup:82) 0x40b572 (runtime.unlock2:121) 0x420b12 (runtime.unlockWithRank:32 runtime.unlock:112 runtime.setGCPercent.func1:1311) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        16: 0x46aa87 (runtime.osyield:658) 0x40b446 (runtime.lock2:98) 0x420ab5 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        23: 0x46a923 (runtime.futex:560) 0x431d95 (runtime.futexsleep:69) 0x40b3d8 (runtime.lock2:107) 0x420ab5 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        32: 0x468786 (runtime.procyield:756) 0x40b412 (runtime.lock2:88) 0x420ab5 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

...
        1: 0x468784 (runtime.procyield:755) 0x40b412 (runtime.lock2:88) 0x420ab5 (runtime.lockWithRank:22 runtime.lock:48 runtime.setGCPercent.func1:1308) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: 0x4212a8 (runtime.gcPaceScavenger:208) 0x421106 (runtime.gcControllerCommit:1557) 0x420b04 (runtime.setGCPercent.func1:1310) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: 0x421102 (runtime.gcControllerCommit:1557) 0x420b04 (runtime.setGCPercent.func1:1310) 0x4669a8 (runtime.systemstack:492) 0x463325 (runtime/debug.setGCPercent:1307) 0x5515b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1902) 0x55179b (runtime/pprof.parallelLabelHog.func1:1915) labels: map[key:[value]]

        1: labels: map[key:[value]]

    pprof_test.go:610: runtime.systemstack;key=value: 109
    pprof_test.go:1885: Sample labeled got true want false: 

watchflakes

@mknyszek
Copy link
Contributor

watchflakes hasn't updated this bug since Oct 4th. I think we should fix this but since it was introduced in the last release I don't think this is a release blocker (and I don't think it's failing frequently enough to be causing significant harm).

It's prioritized as P1 for triage which sounds right to me. I'll set up a dashboard so we don't miss flakes like this and get around to them soon. I'm going over the other "unassigned" issues now, so they'll take priority if they look more strongly like release blockers.

@mknyszek
Copy link
Contributor

Actually, there seems to have been 2 additional failures since then:

2022-11-17T20:52:28-6e0e492/darwin-amd64-11-aws
2022-09-26T23:54:36-17f83e3/netbsd-arm64-bsiegert

@gopherbot
Copy link

Change https://go.dev/cl/460516 mentions this issue: runtime/pprof: improve output of TestLabelSystemstack

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"
2022-11-18 21:48 openbsd-amd64-71 go@ea2c27fe runtime/pprof.TestLabelSystemstack (log)
panic: test timed out after 3m0s
running tests:
	TestLabelSystemstack (2m57s)

runtime.gopark(0x0?, 0x0?, 0x80?, 0x6?, 0x1?)
	/tmp/workdir/go/src/runtime/proc.go:381 +0xd6 fp=0xc000090cd8 sp=0xc000090cb8 pc=0x437456
runtime.chanrecv(0xc0003c8180, 0x0, 0x1)
	/tmp/workdir/go/src/runtime/chan.go:583 +0x49d fp=0xc000090d68 sp=0xc000090cd8 pc=0x40641d
runtime.chanrecv1(0x5e4580?, 0xc0000a6008?)
	/tmp/workdir/go/src/runtime/chan.go:442 +0x18 fp=0xc000090d90 sp=0xc000090d68 pc=0x405f58
runtime/pprof.StopCPUProfile()
	/tmp/workdir/go/src/runtime/pprof/pprof.go:839 +0xa5 fp=0xc000090dd0 sp=0xc000090d90 pc=0x540a85
runtime/pprof.testCPUProfile(0xc0002cc4e0, 0xc000096710?, 0xc000090ef0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:466 +0x1df fp=0xc000090e68 sp=0xc000090dd0 pc=0x54eedf
runtime/pprof.TestLabelSystemstack(0xc0002cc4e0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:1857 +0xc5 fp=0xc000090f70 sp=0xc000090e68 pc=0x557e05
testing.tRunner(0xc0002cc4e0, 0x5b8890)
2022-12-05 16:41 darwin-amd64-12_0 go@ad55b878 runtime/pprof.TestLabelSystemstack (log)
--- FAIL: TestLabelSystemstack (0.33s)
    pprof_test.go:431: uname -a: Darwin Gophers-iMac-Pro-2.local 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64 x86_64
    pprof_test.go:539: total 21 CPU profile samples collected:
        1: 0x1054b53 (runtime.kevent:457) 0x103136b (runtime.netpoll:130) 0x103c62e (runtime.findRunnable:2759) 0x103dc70 (runtime.schedule:3360) 0x103ec0d (runtime.goexit0:3687) 0x1066fe2 (runtime.mcall:452) labels: map[]

        1: 0x10544cb (runtime.madvise:257) 0x10216a4 (runtime.sysUnusedOS:26 runtime.sysUnused:62 runtime.(*pageAlloc).scavengeOne:754) 0x10212b4 (runtime.(*pageAlloc).scavenge.func1:658) 0x1067068 (runtime.systemstack:496) labels: map[]

        2: 0x100b2ba (runtime.unlock2:108) 0x1020212 (runtime.unlockWithRank:32 runtime.unlock:96 runtime.setGCPercent.func1:1247) 0x1067068 (runtime.systemstack:496) 0x1063515 (runtime/debug.setGCPercent:1243) 0x115a3f1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1946) 0x115a5db (runtime/pprof.parallelLabelHog.func1:1959) labels: map[key:[value]]

        1: 0x1020193 (runtime.setGCPercent.func1:1243) 0x1067068 (runtime.systemstack:496) 0x1063515 (runtime/debug.setGCPercent:1243) 0x115a3f1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1946) 0x115a5db (runtime/pprof.parallelLabelHog.func1:1959) labels: map[key:[value]]
...
        1: 0x1054c13 (runtime.pthread_mutex_lock:478) 0x1031909 (runtime.semawakeup:73) 0x100b2f5 (runtime.unlock2:117) 0x1020212 (runtime.unlockWithRank:32 runtime.unlock:96 runtime.setGCPercent.func1:1247) 0x1067068 (runtime.systemstack:496) labels: map[key:[value]]

        1: 0x1020174 (runtime.(*gcControllerState).commit:1219) 0x1020693 (runtime.gcControllerCommit:1410) 0x1020204 (runtime.setGCPercent.func1:1246) 0x1067068 (runtime.systemstack:496) 0x1063515 (runtime/debug.setGCPercent:1243) 0x115a3f1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1946) 0x115a5db (runtime/pprof.parallelLabelHog.func1:1959) labels: map[key:[value]]

        1: 0x1054f46 (runtime.libcCall:49) 0x105464b (runtime.usleep:306) 0x100b14a (runtime.osyield:352 runtime.lock2:70) 0x10201b5 (runtime.lockWithRank:22 runtime.lock:35 runtime.setGCPercent.func1:1244) 0x1067068 (runtime.systemstack:496) 0x1063515 (runtime/debug.setGCPercent:1243) 0x115a3f1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1946) 0x115a5db (runtime/pprof.parallelLabelHog.func1:1959) labels: map[key:[value]]

        1: labels: map[key:[value]]

    pprof_test.go:610: runtime.systemstack;key=value: 18
    pprof_test.go:1929: Sample labeled got true want false: 
2022-12-12 14:20 openbsd-386-71 go@6b895d9e runtime/pprof.TestLabelSystemstack (log)
panic: test timed out after 3m0s
running tests:
	TestLabelSystemstack (2m56s)

runtime.gopark(0x81e435c, 0x57b46eb0, 0xe, 0x17, 0x2)
	/tmp/workdir/go/src/runtime/proc.go:381 +0xf6 fp=0x579f9de0 sp=0x579f9dcc pc=0x8080566
runtime.chanrecv(0x57b46e80, 0x0, 0x1)
	/tmp/workdir/go/src/runtime/chan.go:583 +0x3e8 fp=0x579f9e28 sp=0x579f9de0 pc=0x804d6b8
runtime.chanrecv1(0x57b46e80, 0x0)
	/tmp/workdir/go/src/runtime/chan.go:442 +0x1c fp=0x579f9e3c sp=0x579f9e28 pc=0x804d2cc
runtime/pprof.StopCPUProfile()
	/tmp/workdir/go/src/runtime/pprof/pprof.go:839 +0x94 fp=0x579f9e58 sp=0x579f9e3c pc=0x8179e14
runtime/pprof.testCPUProfile(0x578831d0, 0x57824bc0, 0x579f9f40)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:466 +0x257 fp=0x579f9ef4 sp=0x579f9e58 pc=0x8187b57
runtime/pprof.TestLabelSystemstack(0x578831d0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:1857 +0xd3 fp=0x579f9f9c sp=0x579f9ef4 pc=0x8190423
testing.tRunner(0x578831d0, 0x81e4558)

watchflakes

@aclements
Copy link
Member

I added a check to profileOk to fail if any profile collected by the runtime/pprof tests contains an empty stack. I haven't been able to reproduce a failure with labels and no stack, but I have reproduced a sample with no stack 8 times out of 1464 runs of go test runtime/pprof under stress. That isn't necessarily the same root cause, but it probably is.

Looking at the code, sigprof always produces a stack trace because it has a special case to ensure this. There are a few other calls to cpuprof.log.write, but they all pass nil for the tagPtr, so they definitely can't produce a record with labels and no stack. profBuf.write doesn't modify stk and copies it to the profile buffer using a for loop (not copy), so it would get a out-of-bounds panic if something went wrong with that.

I'm still tracing the read side.

@aclements
Copy link
Member

There is a test of empty stacks (TestEmptyStack), which further suggests that it's okay to have an empty stack in a profile. I'm still digging in to the cause, though.

@aclements
Copy link
Member

I've narrowed it down a fair amount. In profileBuilder.build, where we iterate over the profile hash table, e.stk is non-empty, but the sequence of locations returned by appendLocsForStack for e.stk is empty. So somehow appendLocsForStack is turning a non-empty PC slice into an empty location slice.

@aclements
Copy link
Member

Found it. Sometimes we get a stack trace consisting solely of runtime.goexit (in my test binary, the stack is [469b07] and runtime.goexit.abi0 is at 469b00). appendLocsForStack calls allFrames, which symbolizes the goexit frame and immediate returns with nothing.

This is clearly intentional behavior for allFrames expanding the stack of a single PC at a time, but it's less clear if it's intentional behavior for this to lead to an empty accumulated stack trace.

I would argue that, if we really get a sample just as a goroutine is exiting and has unwound its stack, then that really is a sample with an empty stack. And we probably do want to include it in the profile so that the cumulative sample count is correct.

@aclements
Copy link
Member

Given my reasoning above, that means this is a test-only failure, so I'm dropping release-blocker. We can still land the CL I have out that fixes the test failure. I'll update that to explain a little more in the commit message.

@gopherbot
Copy link

Change https://go.dev/cl/460695 mentions this issue: runtime/pprof: document possibility of empty stacks

@bcmills bcmills added the Testing An issue that has been verified to require only test changes, not just a test failure. label Jan 5, 2023
gopherbot pushed a commit that referenced this issue Jan 9, 2023
I spent quite a while determining the cause of empty stacks in
profiles and reasoning out why this is okay. There isn't a great place
to record this knowledge, but a documentation comment on
appendLocsForStack is better than nothing.

Updates #51550.

Change-Id: I2eefc6ea31f1af885885c3d96199319f45edb4ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/460695
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"
2023-01-23 15:51 openbsd-amd64-71 go@e22bd234 runtime/pprof.TestLabelSystemstack (log)
SIGQUIT: quit
PC=0x24a24c2ba m=9 sigcode=0

rax    0x4
rbx    0x25ea93920
rcx    0x24a24c2ba
rdx    0x0
rdi    0x3
rsi    0x0
rbp    0x25ea93870
...
	/tmp/workdir/go/src/runtime/chan.go:583 +0x49d fp=0xc000070d10 sp=0xc000070c80 pc=0x4063dd
runtime.chanrecv1(0x0?, 0xc000018148?)
	/tmp/workdir/go/src/runtime/chan.go:442 +0x18 fp=0xc000070d38 sp=0xc000070d10 pc=0x405f18
runtime/pprof.StopCPUProfile()
	/tmp/workdir/go/src/runtime/pprof/pprof.go:839 +0xa5 fp=0xc000070d78 sp=0xc000070d38 pc=0x544dc5
runtime/pprof.testCPUProfile(0xc00028c4e0, 0xc0000ac3b0?, 0xc000070ea8)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:466 +0x1df fp=0xc000070e10 sp=0xc000070d78 pc=0x55325f
runtime/pprof.TestLabelSystemstack(0xc00028c4e0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:1860 +0xc5 fp=0xc000070f70 sp=0xc000070e10 pc=0x55c225
testing.tRunner(0xc00028c4e0, 0x5c0270)

watchflakes

@gopherbot gopherbot reopened this Jan 25, 2023
@cherrymui
Copy link
Member

The failure is #57585 .

@gopherbot gopherbot reopened this Feb 24, 2023
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"
2023-02-14 17:25 darwin-amd64-10_15 go@5c3e11bd runtime/pprof.TestLabelSystemstack (log)
--- FAIL: TestLabelSystemstack (0.24s)
    pprof_test.go:433: uname -a: Darwin Gophers-iMac-Pro.local 19.6.0 Darwin Kernel Version 19.6.0: Thu Oct 29 22:56:45 PDT 2020; root:xnu-6153.141.2.2~1/RELEASE_X86_64 x86_64
    pprof_test.go:541: total 18 CPU profile samples collected:
        1: 0x1054e6b (runtime.usleep:250) 0x10450c4 (runtime.runqgrab:5985) 0x104521c (runtime.runqsteal:6020) 0x103ded1 (runtime.stealWork:2961) 0x103d318 (runtime.findRunnable:2670) 0x103e83d (runtime.schedule:3214) 0x103ed6c (runtime.park_m:3363) 0x1066402 (runtime.mcall:448) labels: map[]

        1: 0x1055726 (runtime.libcCall:49) 0x1055553 (runtime.pthread_cond_wait:450) 0x10325ac (runtime.semasleep:66) 0x100b324 (runtime.notesleep:181) 0x103c36b (runtime.mPark:1457 runtime.stopm:2247) 0x103ce49 (runtime.gcstopm:2505) 0x103d06e (runtime.findRunnable:2571) 0x103e83d (runtime.schedule:3214) 0x103ed6c (runtime.park_m:3363) 0x1066402 (runtime.mcall:448) labels: map[]

        1: 0x1055373 (runtime.kevent:401) 0x103218b (runtime.netpoll:127) 0x103d28f (runtime.findRunnable:2647) 0x103e83d (runtime.schedule:3214) 0x103ed6c (runtime.park_m:3363) 0x1066402 (runtime.mcall:448) labels: map[]

        1: 0x1055726 (runtime.libcCall:49) 0x1055553 (runtime.pthread_cond_wait:450) 0x10325ac (runtime.semasleep:66) 0x100b324 (runtime.notesleep:181) 0x103c36b (runtime.mPark:1457 runtime.stopm:2247) 0x103da07 (runtime.findRunnable:2874) 0x103e83d (runtime.schedule:3214) 0x103ed6c (runtime.park_m:3363) 0x1066402 (runtime.mcall:448) labels: map[]
...
        1: 0x1054e6b (runtime.usleep:250) 0x100b00a (runtime.osyield:351 runtime.lock2:70) 0x10210d5 (runtime.lockWithRank:22 runtime.lock:35 runtime.setGCPercent.func1:1329) 0x1066488 (runtime.systemstack:492) labels: map[key:[value]]

        1: 0x100b1dc (runtime.unlock2:126) 0x1021132 (runtime.unlockWithRank:32 runtime.unlock:96 runtime.setGCPercent.func1:1332) 0x1066488 (runtime.systemstack:492) 0x1062b25 (runtime/debug.setGCPercent:1328) 0x11503b1 (runtime/debug.SetGCPercent:96 runtime/pprof.labelHog:1904) 0x115059b (runtime/pprof.parallelLabelHog.func1:1917) labels: map[key:[value]]

        1: 0x1055726 (runtime.libcCall:49) 0x1054fbe (runtime.nanotime1:285) 0x103a30f (runtime.nanotime:19 runtime.casgstatus:1017) 0x103edde (runtime.goschedImpl:3372) 0x103eff0 (runtime.gosched_m:3386) 0x1066402 (runtime.mcall:448) labels: map[key:[value]]

        1: labels: map[key:[value]]

    pprof_test.go:612: runtime.systemstack;key=value: 7
    pprof_test.go:1887: Sample labeled got true want false: 

watchflakes

@bcmills
Copy link
Contributor Author

bcmills commented Feb 24, 2023

The above failure was on release-branch.go1.19. How do we feel about backporting the fix?

@prattmic
Copy link
Member

prattmic commented Mar 8, 2023

@gopherbot Please backport to 1.19. This is a fix for a flaky test, which is also flaky on 1.19.

@gopherbot
Copy link

Backport issue(s) opened: #58939 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link

Change https://go.dev/cl/474618 mentions this issue: [release-branch.go1.19] runtime/pprof: improve output of TestLabelSystemstack

gopherbot pushed a commit that referenced this issue Mar 15, 2023
…temstack

The current output of TestLabelSystemstack is a bit cryptic. This CL
improves various messages and hopefully simplifies the logic in the
test.

Simplifying the logic leads to three changes in possible outcomes,
which I verified by running the logic before and after this change
through all 2^4 possibilities (https://go.dev/play/p/bnfb-OQCT4j):

1. If a sample both must be labeled and must not be labeled, the test
now reports that explicitly rather than giving other confusing output.

2. If a sample must not be labeled but is, the current logic will
print two identical error messages. The new logic prints only one.

3. If the test finds no frames at all that it recognizes, but the
sample is labeled, it will currently print a confusing "Sample labeled
got true want false" message. The new logic prints nothing. We've seen
this triggered by empty stacks in profiles.

Fixes #51550. This bug was caused by case 3 above, where it was
triggered by a profile label on an empty stack. It's valid for empty
stacks to appear in a profile if we sample a goroutine just as it's
exiting (and that goroutine may have a profile label), so the test
shouldn't fail in this case.

For #58939.

Change-Id: I1593ec4ac33eced5bb89572a3ba7623e56f2fb3d
Reviewed-on: https://go-review.googlesource.com/c/go/+/460516
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit d9f23cf)
Reviewed-on: https://go-review.googlesource.com/c/go/+/474618
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@gopherbot gopherbot reopened this Apr 18, 2024
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/pprof" && test == "TestLabelSystemstack"
2024-04-03 15:35 go1.21-linux-arm release-branch.go1.21@d8392e69 runtime/pprof.TestLabelSystemstack [ABORT] (log)
=== RUN   TestLabelSystemstack
SIGQUIT: quit
PC=0x8b78c m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/sys_linux_arm.s:416 +0x1c fp=0xffab15a8 sp=0xffab15a8 pc=0x8b78c
runtime.futexsleep(0x3080d0, 0x0, 0xffffffffffffffff)
	/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/os_linux.go:69 +0x60 fp=0xffab15d0 sp=0xffab15a8 pc=0x4b670
runtime.notesleep(0x3080d0)
...
r9      0x308750
r10     0x307b48
fp      0x6
ip      0x4
sp      0xffab15a8
lr      0x4b670
pc      0x8b78c
cpsr    0xa0800010
fault   0x0
*** Test killed with quit: ran too long (10m0s).

watchflakes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
Status: Done
Development

No branches or pull requests

8 participants