-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This issue was forked from #49065 (comment). In #49065 (comment), @rhysh replied:
|
Marking as release-blocker, since this appears to be a regression during the Go 1.18 release cycle. It could be a regression in the runtime's signal behavior, a new source of signals added to the test, or a test that was (still) already inherently flaky tickled by some subtle scheduling or preemption change (although it appears to predate @mknyszek's GC pacer changes 😅). |
CCing @cherrymui as our resident pprof owner. |
I'll take a look at this one as well. |
I have change the following line to an infinite loop, then use gdb conditional breakpoint to observe the received non-27 signal
(gdb) b x_cgo_callers if sig != 27 Breakpoint 1 at 0x4fe320: file gcc_traceback.c, line 23. (gdb) r Starting program: /mnt/d/gopath/src/issues/issue49401/testprogcgo CgoPprof It has been running two days , still nothing happen... |
The problem is clear to me now. CL 324129 introduces the profile with per-thread timers feature. However, this judgment is called later than x_cgo_callers. (BUG here) In such a scene: It affects almost all linux systems, but currently only amd64 and ppc64le have the CgoProf testing, so I will try to fix it on linux_amd64 first only for now. |
Change https://golang.org/cl/363443 mentions this issue: |
@hanchaoqun I agree with your analysis, that seems like the cause of this issue. However, I don't think it is worth fixing by moving validSIGPROF into assembly. There is nothing fundamentally wrong with calling cgo_callers in a signal that is ultimately dropped, the test simply makes the fragile assumption that most of the calls will be used for the pprof sample. Note that this test already has to disable async preemption because that also breaks this assumption. IMO, we should drop this assumption from the test. I'll send a CL. |
Will that add bias to the profile, since the program will spend more time than necessary in |
Yes, continuing to call In a profile, this would make threads using lots of cgo to receive slightly more samples than others, and since (I'm pretty sure) SIGPROF is masked during signal handling, those extra samples will simply appear somewhere in user code rather than This is a form of skew, but note that even with something like https://golang.org/cl/363443 we can't eliminate it entirely because valid SIGPROF signals still need to call IMO, these small sources of skew are arguments for using a PMU-based profiler if you are really worried about them, rather than trying to make "constant-time" signal handlers. On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling |
I should clarify: it is a good argument for potential future work, not for addressing now. |
Change https://golang.org/cl/363634 mentions this issue: |
Yes, i agree drop this assumption from the test. |
greplogs --dashboard -md -l -e '(?ms)FAIL: TestCgoPprof.*cpuHog traceback missing' --since=2020-06-16
2021-11-04T23:56:29-0e5f287/linux-amd64-buster
2021-11-02T21:18:44-80065cf/linux-amd64-staticlockranking
2021-11-02T16:12:23-599de4b/linux-amd64-sid
2021-10-14T21:08:35-440b63b/linux-amd64-staticlockranking
2021-10-06T19:45:35-e38ec96/linux-ppc64le-buildlet
2021-10-04T17:17:11-cc5e3de/linux-amd64-longtest
2021-09-28T20:54:26-b8a6017/linux-amd64-jessie
The symptoms of these failures match #37201, but that appeared to have been fixed in May 2020.
The text was updated successfully, but these errors were encountered: