runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD #17496
Plenty of failures, all OpenBSD:
The program in question does:
signal.Ignore sets the 1<<SIGTRAP bit in runtime's sig.ignored, which can be queried by signal_ignored:
Then syscall.Kill sends the SIGTRAP, which ends up in the signal handler, which does:
Clearly that condition is false and should be true.
I built testprognet on an openbsd-amd64-gce58 gomote and am running 'stress -p 50 testprognet SignalIgnoreSIGTRAP' in hopes of getting even 1 failure. Nothing so far.
CL https://golang.org/cl/32183 mentions this issue.
CL 32183 added a print of sigcode in the SIGTRAP crash. That will at least let us see which half of the if statement is wrong. I can't reproduce this myself but it seems to be happening around once a month, so maybe we can just wait and see. At some point we should think about using atomic loads/stores to synchronize signal_ignore with signal_ignored, but I'd prefer to gather more information before we stomp around where the bug might be. I talked to @aclements and he agrees that there doesn't seem to be any plausible way to connect the dots in the missed memory update theory.
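To make the "atomic loads/stores" idea concrete, here is a minimal sketch of an ignored-signal bitmask updated and queried atomically. It is not the runtime's code: `ignoredMask`, `setIgnored`, and `isIgnored` are hypothetical stand-ins for sig.ignored, signal_ignore, and signal_ignored, and the mask only covers signals 0..31.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ignoredMask is a hypothetical stand-in for the runtime's sig.ignored:
// one bit per signal number (signals 0..31 only, for brevity).
var ignoredMask uint32

// setIgnored sets sig's bit with a compare-and-swap loop so that
// concurrent updates to other bits are not lost.
func setIgnored(sig uint32) {
	for {
		old := atomic.LoadUint32(&ignoredMask)
		if atomic.CompareAndSwapUint32(&ignoredMask, old, old|1<<sig) {
			return
		}
	}
}

// isIgnored reports whether sig's bit is set, using an atomic load.
func isIgnored(sig uint32) bool {
	return atomic.LoadUint32(&ignoredMask)&(1<<sig) != 0
}

func main() {
	const sigtrap = 5 // SIGTRAP's number on Linux and the BSDs
	fmt.Println(isIgnored(sigtrap))
	setIgnored(sigtrap)
	fmt.Println(isIgnored(sigtrap))
}
```

With this shape, a reader that observes a later atomic write from the same writer is also guaranteed to observe the mask update, on any architecture Go supports.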
For #17496.
Change-Id: I671a59581c54d17bc272767eeb7b2742b54eca38
Reviewed-on: https://go-review.googlesource.com/32183
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Aside, there seems to be a theoretical race: we're setting sig.ignored in thread A, then sending a process-directed SIGTRAP signal, which can be handled in thread B on a different CPU, where the sig.ignored write isn't visible yet. Reviewing OpenBSD's kill system call implementation, it looks like the signal should always be handled during this test by A (see "If the current thread can process the signal immediately (it's unblocked) then have it take it." in ptsignal). So I don't think that's the issue here. Experimentally, I can see A!=B occasionally happens on Linux though:
(I.e., SIGTRAP is being sent by thread 4106, but being handled by thread 4108.) There also seems to be a theoretical race that when SIGTRAP is handled on a separate thread, there's no synchronization to ensure the main thread doesn't just immediately print "OK" and exit, before the SIGTRAP signal handler runs. But that would only cause sporadic success, not sporadic failure.
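For readers following along, a program of the shape discussed in this thread can be sketched as below: ignore SIGTRAP, send a process-directed SIGTRAP to the current process, then print "OK". This is a reconstruction from the description in the comments, not the verbatim testprognet source; the helper name `ignoreAndKill` is made up for illustration, and the code is Unix-only because it uses syscall.Kill.

```go
package main

import (
	"fmt"
	"os/signal"
	"syscall"
)

// ignoreAndKill asks the runtime to ignore SIGTRAP and then sends a
// process-directed SIGTRAP to the current process. Any thread in the
// process that has the signal unblocked may end up receiving it.
func ignoreAndKill() error {
	signal.Ignore(syscall.SIGTRAP)
	return syscall.Kill(syscall.Getpid(), syscall.SIGTRAP)
}

func main() {
	if err := ignoreAndKill(); err != nil {
		fmt.Println("kill failed:", err)
		return
	}
	// Nothing waits for the signal to be delivered before this line,
	// which is the "sporadic success" race described above.
	fmt.Println("OK")
}
```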
At least on x86, I don't believe this is possible. Thread A on CPU A writes sig.ignored and then must write something to inform thread B on CPU B that it wants to send a signal. If CPU B observes the second write, then it must also observe the first write. Reads can be reordered with writes, but writes cannot be reordered with other writes, and reads cannot be reordered with other reads.
Good point, I agree, x86's memory model doesn't seem to allow that particular race.
FWIW, there's also been an openbsd/386 failure: https://build.golang.org/log/0056d2a00dc647d543f6f56f14d1056c5828581c
Recent failures with the sigcode output:
2016-12-01T20:08:56-b42d4a8/openbsd-386-gce58 PC=0x8097237 m=0 sigcode=418098389
The sigcodes are all over the place and look like they're just corrupted. The PCs seem reasonable, though, so it's not like the whole signal context is bad (the PC isn't exactly where I see syscall.Kill when I build the binary, but it's within a few KB and maybe the cross-compile is doing something that throws it off).
Re-assigning to @ianlancetaylor now that we have more debugging info.
I ran the test over 10,000 times on an OpenBSD 386 gomote, but failed to reproduce it. I checked things like the placement of the siginfo struct (on the signal stack) and the arguments. I read through all the code I could think of checking, and didn't see anything wrong. I confirmed that the PC value in a real program is correct. I have no suggestions for how to proceed.
As another data point, this also flakes on openbsd/arm: https://build.golang.org/log/387c70cdbbfdf65f3672e3434ad0f843ffcabe6b
I think this is an OpenBSD kernel issue. In particular, I think What I'm guessing is happening:
I've emailed a couple OpenBSD kernel developers about this.
Confirmed that modifying the kernel to set |
Change https://golang.org/cl/222856 mentions this issue: |
We (the Go project) can't fix the OpenBSD kernel unilaterally, and I don't think we should go to great lengths to work around it. We can skip the test on the builders with known-bad kernels, and I think that's about all we should do.
This test is flaky, and the cause is suspected to be an OpenBSD kernel bug. Since there is no obvious workaround on the Go side, skip the test on builders whose versions are known to be affected.
Fixes #17496
Change-Id: Ifa70061eb429e1d949f0fa8a9e25d177afc5c488
Reviewed-on: https://go-review.googlesource.com/c/go/+/222856
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
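A skip of this kind can be sketched as a small predicate that a test consults before running. This is an illustration only, not the code from the CL: `shouldSkip` and its suffix list are hypothetical, the "-gce58" suffix is borrowed from builder names quoted earlier in this thread purely as an example, and reading the builder name from the GO_BUILDER_NAME environment variable is an assumption about how the builders identify themselves.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// shouldSkip reports whether a flaky test should be skipped: only on
// OpenBSD, and only when the builder name ends in one of badSuffixes.
// An empty builder name (a developer machine) never skips.
func shouldSkip(goos, builder string, badSuffixes []string) bool {
	if goos != "openbsd" || builder == "" {
		return false
	}
	for _, s := range badSuffixes {
		if strings.HasSuffix(builder, s) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative suffix list; the real set of affected builders is
	// whatever the CL lists, which is not reproduced here.
	bad := []string{"-gce58"}
	builder := os.Getenv("GO_BUILDER_NAME") // assumed builder identifier
	fmt.Println(shouldSkip(runtime.GOOS, builder, bad))
}
```

In a real test the predicate would guard a call to t.Skip with a message pointing at this issue, so the skip stays visible in test logs.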
Trybot flake from https://go-review.googlesource.com/31173
https://storage.googleapis.com/go-build-log/a2e3e68c/openbsd-amd64-gce58_b0eb773d.log