Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misc/cgo/testcarchive: TestSIGPROF failures with "signal: profiling timer expired" #43828

Closed
bcmills opened this issue Jan 21, 2021 · 17 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Jan 21, 2021

2021-01-18T17:21:53-5a8fbb0/freebsd-amd64-11_2
2021-01-14T22:01:23-e125ccd/linux-amd64-wsl

See previously #43710, #19320

CC @ianlancetaylor

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 21, 2021
@bcmills bcmills added this to the Backlog milestone Jan 21, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Feb 22, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Apr 21, 2021

@mdempsky
Copy link
Member

mdempsky commented Nov 5, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Nov 18, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Nov 29, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Nov 29, 2021

Given that this has affected at least three TryBots in the last month alone, I'm marking it as a release-blocker via #11811.

@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker labels Nov 29, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Nov 29, 2021
@toothrot
Copy link
Contributor

toothrot commented Dec 8, 2021

Checking on this as a release blocker. Are there any updates? @ianlancetaylor

@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@cherrymui
Copy link
Member

I'll take a look.

@cherrymui
Copy link
Member

cherrymui commented Dec 21, 2021

I think I have a hunch. When we turn off profiling,

		if !sigInstallGoHandler(_SIGPROF) {
			if atomic.Cas(&handlingSig[_SIGPROF], 1, 0) {
				h := atomic.Loaduintptr(&fwdSig[_SIGPROF])
				if h == _SIG_DFL {
					h = _SIG_IGN
				}
				setsig(_SIGPROF, h)
			}
		}

If a pending profiling signal lands right after the Cas but before the setsig (including in another thread running concurrently), in the signal handler, sigfwdgo will see that we're not handling that signal (handlingSig[sig] is 0) and fwdFn is _SIG_DFL, it will die.

Maybe we could do the Cas after the setsig? We have prof.signalLock held so at lease it cannot race with other profiling operation. Or, maybe we could Cas handlingSig[_SIGPROF] to some nonzero value (e.g. 2), setsig, then store 0?

@ianlancetaylor
Copy link
Contributor

Good catch. I think it would be correct to change the code to

    if atomic.Load(&handlingSig[_SIGPROF]) != 0 {
        ...
        setsig(...)
        atomic.Store(&handlingSig[_SIGPROF], 0)
    }

It can mess up if there a race between enabling and disabling profiling, but that will always have unpredictable results.

@cherrymui
Copy link
Member

SGTM. With that change I still see failures, possibly with a lower rate (same for CAS'ing to 2 then store 0). Still looking.

@cherrymui
Copy link
Member

While I haven't got any useful debugging, here is another guess: when we do the atomic load, setsig, atomic store, a signal handler may be running concurrently on another thread, and it may still observe handlingSig[_SIGPROF] as 0. (The signal is triggered before the setsig, but the atomic load in sigfwdgo occurs after the atomic store.)

Maybe we need some mechanism to synchronize with a concurrently running signal handler. Or maybe we could atomic store fwdSig to _SIG_IGN and carefully do the atomic loads in sigfwdgo in the right order?

@cherrymui
Copy link
Member

cherrymui commented Dec 21, 2021

With this patch I cannot reproduce the failure. But the code is a bit weird... Maybe there is a better way.

diff --git a/src/runtime/signal_unix.go b/src/runtime/signal_unix.go
index dbcbfc67bc..12340628ad 100644
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -298,12 +298,14 @@ func setProcessCPUProfilerTimer(hz int32) {
                // that use profiling don't want to crash on a stray SIGPROF.
                // See issue 19320.
                if !sigInstallGoHandler(_SIGPROF) {
-                       if atomic.Cas(&handlingSig[_SIGPROF], 1, 0) {
+                       if atomic.Load(&handlingSig[_SIGPROF]) != 0 {
                                h := atomic.Loaduintptr(&fwdSig[_SIGPROF])
                                if h == _SIG_DFL {
                                        h = _SIG_IGN
+                                       atomic.Storeuintptr(&fwdSig[_SIGPROF], h)
                                }
                                setsig(_SIGPROF, h)
+                               atomic.Store(&handlingSig[_SIGPROF], 0)
                        }
                }
        }
@@ -1034,6 +1036,8 @@ func sigfwdgo(sig uint32, info *siginfo, ctx unsafe.Pointer) bool {
 
        // If we aren't handling the signal, forward it.
        if atomic.Load(&handlingSig[sig]) == 0 || !signalsOK {
+               fwdFn = atomic.Loaduintptr(&fwdSig[sig]) // XXX load again
+
                // If the signal is ignored, doing nothing is the same as forwarding.
                if fwdFn == _SIG_IGN || (fwdFn == _SIG_DFL && flags&_SigIgn != 0) {
                        return true

@ianlancetaylor
Copy link
Contributor

Interesting. So the problem may be that the switch from SIG_DFL to SIG_IGN done by setProcessCPUProfilerTimer isn't observed by sigfwdgo.

What if we change the code that enables profiling instead? If getsig(_SIGPROF) returns SIG_DFL, record SIG_IGN in fwdSig[_SIGPROF]. Then we can drop the check when disabling profiling, and it seems likely that we won't crash in sigfwdgo either.

@cherrymui
Copy link
Member

Good idea! Thanks. Sent CL https://go-review.googlesource.com/c/go/+/374074 .

@gopherbot
Copy link

Change https://golang.org/cl/374074 mentions this issue: runtime: handle stray profiling signal better

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 22, 2021
@golang golang locked and limited conversation to collaborators Dec 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests

7 participants