Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

Closed
bcmills opened this issue Nov 5, 2021 · 14 comments
Closed

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

bcmills opened this issue Nov 5, 2021 · 14 comments
Assignees
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 5, 2021

step 1
step 8
step 9
step 10
step 11
step 12
step 13
step 14
--- FAIL: TestCgoPprof (12.23s)
    crash_cgo_test.go:344: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build4045495910/testprogcgo_.exe /workdir/tmp/prof1465180680]:
        File: testprogcgo_.exe
        Build ID: 9a7aae8c544e237851f2d51d3d04179d2be73832
        Type: cpu
        Time: Nov 5, 2021 at 12:05am (UTC)
        Duration: 203.77ms, Total samples = 10ms ( 4.91%)
        -----------+-------------------------------------------------------
              10ms   runtime.cgocall
                     main._Cfunc_cpuHog
                     main.CgoPprof
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
    crash_cgo_test.go:352: cpuHog traceback missing.
    crash_cgo_test.go:344: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof1465180680]:
        File: testprogcgo_.exe
        Build ID: 9a7aae8c544e237851f2d51d3d04179d2be73832
        Type: cpu
        Time: Nov 5, 2021 at 12:05am (UTC)
        Duration: 203.77ms, Total samples = 10ms ( 4.91%)
        -----------+-------------------------------------------------------
              10ms   runtime.cgocall
                     main._Cfunc_cpuHog
                     main.CgoPprof
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
    crash_cgo_test.go:352: cpuHog traceback missing.
FAIL
FAIL	runtime	51.987s

greplogs --dashboard -md -l -e '(?ms)FAIL: TestCgoPprof.*cpuHog traceback missing' --since=2020-06-16

2021-11-04T23:56:29-0e5f287/linux-amd64-buster
2021-11-02T21:18:44-80065cf/linux-amd64-staticlockranking
2021-11-02T16:12:23-599de4b/linux-amd64-sid
2021-10-14T21:08:35-440b63b/linux-amd64-staticlockranking
2021-10-06T19:45:35-e38ec96/linux-ppc64le-buildlet
2021-10-04T17:17:11-cc5e3de/linux-amd64-longtest
2021-09-28T20:54:26-b8a6017/linux-amd64-jessie

The symptoms of these failures match #37201, but that appeared to have been fixed in May 2020.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 5, 2021

This issue was forked from #49065 (comment). In #49065 (comment), @rhysh replied:

@bcmills this particular kernel bug wouldn't explain the failure on linux-amd64-longtest on October 4 (well before October 18).

Yes, those failures look like what's described in #37201. That issue closed when a particular source of signals was eliminated; there may be another (less frequent) source of signals.

The October 4 failure shows Duration: 200.88ms, Total samples = 10ms ( 4.98%). Stopping the CPU profiler seems to involve 200ms worth of sleeps, so I take that to mean that it got stopped almost immediately, and due to the "traceback function called 2 times" condition rather than "test ran for one second" (

for C.getCpuHogCount() < 2 && time.Since(t0) < time.Second {
). Samples covering 10ms, at 100Hz, means 1 sample. That's consistent with the process receiving signals in addition to SIGPROF, causing the test to exit early.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 5, 2021

Marking as release-blocker, since this appears to be a regression during the Go 1.18 release cycle.

It could be a regression in the runtime's signal behavior, a new source of signals added to the test, or a test that was (still) already inherently flaky tickled by some subtle scheduling or preemption change (although it appears to predate @mknyszek's GC pacer changes 😅).

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 5, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 5, 2021
@jeremyfaller
Copy link
Contributor

CCing @cherrymui as our resident pprof owner.

@prattmic
Copy link
Member

prattmic commented Nov 9, 2021

I'll take a look at this one as well.

@prattmic prattmic self-assigned this Nov 9, 2021
@hanchaoqun
Copy link
Contributor

hanchaoqun commented Nov 10, 2021

I have change the following line to an infinite loop, then use gdb conditional breakpoint to observe the received non-27 signal

for C.getCpuHogCount() < 2 && time.Since(t0) < time.Second {

(gdb) b x_cgo_callers if sig != 27
Breakpoint 1 at 0x4fe320: file gcc_traceback.c, line 23.
(gdb) r
Starting program: /mnt/d/gopath/src/issues/issue49401/testprogcgo CgoPprof

It has been running two days , still nothing happen...

@mknyszek
Copy link
Contributor

@hanchaoqun
Copy link
Contributor

hanchaoqun commented Nov 12, 2021

The problem is clear to me now.

CL 324129 introduces the profile with per-thread timers feature.
It is judged in the validSIGPROF function whether the source of the SIGPROF signal meets expectations, and if it is unexpected, will be ignored.

However, this judgment is called later than x_cgo_callers. (BUG here)

In such a scene:
When a "process" wide timer rise a SIGPROF signal, at the same time the M's "thread" timer was opened , that SIGPROF signal will be ignored by the sighandler, so that signal will NOT delivery to pprof handler at proc.go:sigprof.
BUT before that signal arrived to sighandler, it through the call path cgoSigtramp->x_cgo_callers>cgoTraceback->pprofCgoTraceback without any verify.
So cpuHogCount counted then lead to the condition C.getCpuHogCount() <2 exit early.

It affects almost all linux systems, but currently only amd64 and ppc64le have the CgoProf testing, so I will try to fix it on linux_amd64 first only for now.

@gopherbot
Copy link

Change https://golang.org/cl/363443 mentions this issue: runtime: verify the source of SIGPROF before _cgo_callers is called

@prattmic
Copy link
Member

@hanchaoqun I agree with your analysis, that seems like the cause of this issue. However, I don't think it is worth fixing by moving validSIGPROF into assembly.

There is nothing fundamentally wrong with calling cgo_callers in a signal that is ultimately dropped, the test simply makes the fragile assumption that most of the calls will be used for the pprof sample. Note that this test already has to disable async preemption because that also breaks this assumption.

IMO, we should drop this assumption from the test. I'll send a CL.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 12, 2021

There is nothing fundamentally wrong with calling cgo_callers in a signal that is ultimately dropped,

Will that add bias to the profile, since the program will spend more time than necessary in cgo_callers itself? (I would think it would cause whatever is running on that thread to be slightly over-represented, since the time spent in cgo_callers will have to be made up for by keeping the thread runnable for slightly longer than it otherwise might be.)

@prattmic
Copy link
Member

prattmic commented Nov 12, 2021

Yes, continuing to call cgo_callers unconditionally for all signals will cause additional CPU usage vs calling it only when we are certain it is needed.

In a profile, this would make threads using lots of cgo to receive slightly more samples than others, and since (I'm pretty sure) SIGPROF is masked during signal handling, those extra samples will simply appear somewhere in user code rather than cgo_callers itself.

This is a form of skew, but note that even with something like https://golang.org/cl/363443 we can't eliminate it entirely because valid SIGPROF signals still need to call cgo_callers, making them more expensive than Go-only signals. In a similar vane, Go code with deep stacktraces are also "skewed" because it takes the signal handler longer to generate their tracebacks.

IMO, these small sources of skew are arguments for using a PMU-based profiler if you are really worried about them, rather than trying to make "constant-time" signal handlers.

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

@prattmic
Copy link
Member

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

I should clarify: it is a good argument for potential future work, not for addressing now.

@gopherbot
Copy link

Change https://golang.org/cl/363634 mentions this issue: runtime: drop cgoTraceback call assumptions from CgoPprof tests

@hanchaoqun
Copy link
Contributor

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

I should clarify: it is a good argument for potential future work, not for addressing now.

Yes, i agree drop this assumption from the test.

@prattmic prattmic self-assigned this Jun 24, 2022
@golang golang locked and limited conversation to collaborators Jun 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants