runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

bcmills · 2021-11-05T19:51:19Z

step 1
step 8
step 9
step 10
step 11
step 12
step 13
step 14
--- FAIL: TestCgoPprof (12.23s)
    crash_cgo_test.go:344: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build4045495910/testprogcgo_.exe /workdir/tmp/prof1465180680]:
        File: testprogcgo_.exe
        Build ID: 9a7aae8c544e237851f2d51d3d04179d2be73832
        Type: cpu
        Time: Nov 5, 2021 at 12:05am (UTC)
        Duration: 203.77ms, Total samples = 10ms ( 4.91%)
        -----------+-------------------------------------------------------
              10ms   runtime.cgocall
                     main._Cfunc_cpuHog
                     main.CgoPprof
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
    crash_cgo_test.go:352: cpuHog traceback missing.
    crash_cgo_test.go:344: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof1465180680]:
        File: testprogcgo_.exe
        Build ID: 9a7aae8c544e237851f2d51d3d04179d2be73832
        Type: cpu
        Time: Nov 5, 2021 at 12:05am (UTC)
        Duration: 203.77ms, Total samples = 10ms ( 4.91%)
        -----------+-------------------------------------------------------
              10ms   runtime.cgocall
                     main._Cfunc_cpuHog
                     main.CgoPprof
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
    crash_cgo_test.go:352: cpuHog traceback missing.
FAIL
FAIL	runtime	51.987s

greplogs --dashboard -md -l -e '(?ms)FAIL: TestCgoPprof.*cpuHog traceback missing' --since=2020-06-16

2021-11-04T23:56:29-0e5f287/linux-amd64-buster
2021-11-02T21:18:44-80065cf/linux-amd64-staticlockranking
2021-11-02T16:12:23-599de4b/linux-amd64-sid
2021-10-14T21:08:35-440b63b/linux-amd64-staticlockranking
2021-10-06T19:45:35-e38ec96/linux-ppc64le-buildlet
2021-10-04T17:17:11-cc5e3de/linux-amd64-longtest
2021-09-28T20:54:26-b8a6017/linux-amd64-jessie

The symptoms of these failures match #37201, but that appeared to have been fixed in May 2020.

The text was updated successfully, but these errors were encountered:

bcmills · 2021-11-05T19:52:36Z

This issue was forked from #49065 (comment). In #49065 (comment), @rhysh replied:

@bcmills this particular kernel bug wouldn't explain the failure on linux-amd64-longtest on October 4 (well before October 18).

Yes, those failures look like what's described in #37201. That issue closed when a particular source of signals was eliminated; there may be another (less frequent) source of signals.

The October 4 failure shows Duration: 200.88ms, Total samples = 10ms ( 4.98%). Stopping the CPU profiler seems to involve 200ms worth of sleeps, so I take that to mean that it got stopped almost immediately, and due to the "traceback function called 2 times" condition rather than "test ran for one second" (

go/src/runtime/testdata/testprogcgo/pprof.go

Line 89 in cc5e3de

for C.getCpuHogCount() < 2 && time.Since(t0) < time.Second {

). Samples covering 10ms, at 100Hz, means 1 sample. That's consistent with the process receiving signals in addition to SIGPROF, causing the test to exit early.

bcmills · 2021-11-05T19:56:02Z

Marking as release-blocker, since this appears to be a regression during the Go 1.18 release cycle.

It could be a regression in the runtime's signal behavior, a new source of signals added to the test, or a test that was (still) already inherently flaky tickled by some subtle scheduling or preemption change (although it appears to predate @mknyszek's GC pacer changes 😅).

jeremyfaller · 2021-11-08T20:44:59Z

CCing @cherrymui as our resident pprof owner.

prattmic · 2021-11-09T19:24:50Z

I'll take a look at this one as well.

hanchaoqun · 2021-11-10T03:06:13Z

I have change the following line to an infinite loop, then use gdb conditional breakpoint to observe the received non-27 signal

go/src/runtime/testdata/testprogcgo/pprof.go

Line 89 in 02d7eab

for C.getCpuHogCount() < 2 && time.Since(t0) < time.Second {

(gdb) b x_cgo_callers if sig != 27
Breakpoint 1 at 0x4fe320: file gcc_traceback.c, line 23.
(gdb) r
Starting program: /mnt/d/gopath/src/issues/issue49401/testprogcgo CgoPprof

It has been running two days , still nothing happen...

mknyszek · 2021-11-10T20:34:20Z

Trybot failure: https://storage.googleapis.com/go-build-log/4b7308f5/linux-amd64-race_38da783b.log

hanchaoqun · 2021-11-12T06:45:44Z

The problem is clear to me now.

CL 324129 introduces the profile with per-thread timers feature.
It is judged in the validSIGPROF function whether the source of the SIGPROF signal meets expectations, and if it is unexpected, will be ignored.

However, this judgment is called later than x_cgo_callers. （BUG here）

In such a scene:
When a "process" wide timer rise a SIGPROF signal, at the same time the M's "thread" timer was opened , that SIGPROF signal will be ignored by the sighandler, so that signal will NOT delivery to pprof handler at proc.go:sigprof.
BUT before that signal arrived to sighandler, it through the call path cgoSigtramp->x_cgo_callers>cgoTraceback->pprofCgoTraceback without any verify.
So cpuHogCount counted then lead to the condition C.getCpuHogCount() <2 exit early.

It affects almost all linux systems, but currently only amd64 and ppc64le have the CgoProf testing, so I will try to fix it on linux_amd64 first only for now.

gopherbot · 2021-11-12T08:31:58Z

Change https://golang.org/cl/363443 mentions this issue: runtime: verify the source of SIGPROF before _cgo_callers is called

prattmic · 2021-11-12T15:52:59Z

@hanchaoqun I agree with your analysis, that seems like the cause of this issue. However, I don't think it is worth fixing by moving validSIGPROF into assembly.

There is nothing fundamentally wrong with calling cgo_callers in a signal that is ultimately dropped, the test simply makes the fragile assumption that most of the calls will be used for the pprof sample. Note that this test already has to disable async preemption because that also breaks this assumption.

IMO, we should drop this assumption from the test. I'll send a CL.

bcmills · 2021-11-12T15:58:30Z

There is nothing fundamentally wrong with calling cgo_callers in a signal that is ultimately dropped,

Will that add bias to the profile, since the program will spend more time than necessary in cgo_callers itself? (I would think it would cause whatever is running on that thread to be slightly over-represented, since the time spent in cgo_callers will have to be made up for by keeping the thread runnable for slightly longer than it otherwise might be.)

prattmic · 2021-11-12T16:14:18Z

Yes, continuing to call cgo_callers unconditionally for all signals will cause additional CPU usage vs calling it only when we are certain it is needed.

In a profile, this would make threads using lots of cgo to receive slightly more samples than others, and since (I'm pretty sure) SIGPROF is masked during signal handling, those extra samples will simply appear somewhere in user code rather than cgo_callers itself.

This is a form of skew, but note that even with something like https://golang.org/cl/363443 we can't eliminate it entirely because valid SIGPROF signals still need to call cgo_callers, making them more expensive than Go-only signals. In a similar vane, Go code with deep stacktraces are also "skewed" because it takes the signal handler longer to generate their tracebacks.

IMO, these small sources of skew are arguments for using a PMU-based profiler if you are really worried about them, rather than trying to make "constant-time" signal handlers.

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

prattmic · 2021-11-12T16:15:59Z

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

I should clarify: it is a good argument for potential future work, not for addressing now.

gopherbot · 2021-11-12T16:29:07Z

Change https://golang.org/cl/363634 mentions this issue: runtime: drop cgoTraceback call assumptions from CgoPprof tests

hanchaoqun · 2021-11-12T18:18:07Z

On the other hand, the extra useless CPU usage, if statistically significant, is a good argument for avoiding calling cgo_callers in as many cases as we can determine it is not needed, which would include both invalid SIGPROF and async-preemption signals.

I should clarify: it is a good argument for potential future work, not for addressing now.

Yes, i agree drop this assumption from the test.

bcmills added NeedsInvestigation release-blocker labels Nov 5, 2021

bcmills added this to the Go1.18 milestone Nov 5, 2021

prattmic self-assigned this Nov 9, 2021

gopherbot closed this as completed in ecd2e14 Nov 12, 2021

rsc unassigned prattmic Jun 23, 2022

prattmic self-assigned this Jun 24, 2022

golang locked and limited conversation to collaborators Jun 24, 2023

gopherbot added the FrozenDueToAge label Jun 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

bcmills commented Nov 5, 2021

bcmills commented Nov 5, 2021

bcmills commented Nov 5, 2021

jeremyfaller commented Nov 8, 2021

prattmic commented Nov 9, 2021

hanchaoqun commented Nov 10, 2021 •

edited

Loading

mknyszek commented Nov 10, 2021

hanchaoqun commented Nov 12, 2021 •

edited

Loading

gopherbot commented Nov 12, 2021

prattmic commented Nov 12, 2021

bcmills commented Nov 12, 2021

prattmic commented Nov 12, 2021 •

edited

Loading

prattmic commented Nov 12, 2021

gopherbot commented Nov 12, 2021

hanchaoqun commented Nov 12, 2021

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

runtime: intermittent TestCgoPprof failures since 2021-09-28 #49401

Comments

bcmills commented Nov 5, 2021

bcmills commented Nov 5, 2021

bcmills commented Nov 5, 2021

jeremyfaller commented Nov 8, 2021

prattmic commented Nov 9, 2021

hanchaoqun commented Nov 10, 2021 • edited Loading

mknyszek commented Nov 10, 2021

hanchaoqun commented Nov 12, 2021 • edited Loading

gopherbot commented Nov 12, 2021

prattmic commented Nov 12, 2021

bcmills commented Nov 12, 2021

prattmic commented Nov 12, 2021 • edited Loading

prattmic commented Nov 12, 2021

gopherbot commented Nov 12, 2021

hanchaoqun commented Nov 12, 2021

hanchaoqun commented Nov 10, 2021 •

edited

Loading

hanchaoqun commented Nov 12, 2021 •

edited

Loading

prattmic commented Nov 12, 2021 •

edited

Loading