Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: SIGSEGV in gentraceback during SIGPROF handling of cgo callback #50936

Closed
prattmic opened this issue Jan 31, 2022 · 3 comments
Closed
Assignees
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@prattmic
Copy link
Member

When running a goroutine, mp.curg points to the user goroutine, and mp.curg.m points back to mp.

The transition points are:

  • execute which sets mp.curg and then mp.curg.m.
  • dropg which does the opposite, clearing mp.curg.m and then mp.curg.

Thus, there are two small critical sections that may have mp.curg != nil but mp.curg.m == nil.

When handling a SIGPROF during cgo execution, we pass mp.curg to gentraceback, specifically with the condition mp.ncgo > 0 && mp.curg != nil && mp.curg.syscallpc != 0 && mp.curg.syscallsp != 0.

gentraceback then assumes gp.m != nil for _TraceJumpStack, and would crash on a curg in one of these critical sections.

This is reachable during cgocallback, which calls exitsyscall without decrementing ncgo [1]. If no Ps are available, exitsyscall calls exitsyscall0, which calls both dropg and (eventually) execute. Thus if a SIGPROF lands in the critical sections, gentraceback will crash [2].

I have managed to reproduce this locally with a program that:

  1. Uses pprof profiling.
  2. Makes tons of cgo calls with Go callbacks from more goroutines that GOMAXPROCS, so that some will be starved of of a P in exitsyscall.
  3. Adjusted sysmon to retake Ps from cgo calls (aka syscalls) more aggressively than the default 20us/10ms to make the starved case more frequent.

(3) makes this difficult for me to make a regression test, but I may be able to eliminate this and just get a lower failure rate.

[1] Standard cgocall decrements ncgo prior to exitsyscall, though reentrant calls would also make this condition reachable as ncgo will still be > 0.
[2] Interestingly, this also meets the qualifications for cgoSigtramp to call the C traceback function, which will actually end up doing a traceback of Go code. This isn't a bug, just an odd edge case.

cc @cherrymui @aclements @mknyszek

@prattmic prattmic added NeedsFix The path to resolution is known, but the work has not been done. release-blocker labels Jan 31, 2022
@prattmic prattmic added this to the Go1.18 milestone Jan 31, 2022
@prattmic prattmic self-assigned this Jan 31, 2022
@prattmic
Copy link
Member Author

Oops, forgot to mention the most important bit: https://go.dev/cl/358900 introduced this crash by extending _TraceJumpStack to cgo tracebacks. I plan to revert that CL.

@gopherbot
Copy link

Change https://golang.org/cl/382079 mentions this issue: Revert "runtime: normalize sigprof traceback flags"

@gopherbot
Copy link

Change https://golang.org/cl/382244 mentions this issue: runtime: regression test for issue 50936

gopherbot pushed a commit that referenced this issue Feb 3, 2022
Add a regression test for issue 50936 which coerces the runtime into
frequent execution of the cgocall dropg/execute curg assignment race by
making many concurrent cgo calls eligible for P retake by sysmon. This
results in no P during exitsyscall, at which point they will update curg
and will crash if SIGPROF arrives in between updating mp.curg and
mp.curg.m.

This test is conceptually similar to the basic cgo callback test in
aprof.go but with additional concurrency and a sleep in C.

On my machine this test fails ~5% of the time prior to CL 382079.

For #50936.

Change-Id: I21b6c7f2594f9a615a64580ef70a88b692505678
Reviewed-on: https://go-review.googlesource.com/c/go/+/382244
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
@golang golang locked and limited conversation to collaborators Jun 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests

2 participants