Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD #17496

Closed
bradfitz opened this issue Oct 18, 2016 · 21 comments
Closed

runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD #17496

bradfitz opened this issue Oct 18, 2016 · 21 comments
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-OpenBSD
Milestone

Comments

@bradfitz
Copy link
Contributor

Trybot flake from https://go-review.googlesource.com/31173

https://storage.googleapis.com/go-build-log/a2e3e68c/openbsd-amd64-gce58_b0eb773d.log

ok      reflect 0.266s
ok      regexp  0.362s
ok      regexp/syntax   0.307s
--- FAIL: TestSignalIgnoreSIGTRAP (0.05s)
    crash_test.go:106: testprognet SignalIgnoreSIGTRAP exit status: exit status 2
    crash_unix_test.go:176: want OK
        , got SIGTRAP: trace trap
        PC=0x45a694 m=0

        goroutine 1 [syscall]:
        syscall.Syscall(0x25, 0x6554, 0x5, 0x0, 0x0, 0x6554, 0x0)
            /tmp/workdir/go/src/syscall/asm_unix_amd64.s:19 +0x5 fp=0xc420047e08 sp=0xc420047e00
        syscall.Kill(0x6554, 0x5, 0x1, 0x1)
            /tmp/workdir/go/src/syscall/zsyscall_openbsd_amd64.go:666 +0x4b fp=0xc420047e58 sp=0xc420047e08
        main.SignalIgnoreSIGTRAP()
            /tmp/workdir/go/src/runtime/testdata/testprognet/signal.go:24 +0x8d fp=0xc420047ea0 sp=0xc420047e58
        main.main()
            /tmp/workdir/go/src/runtime/testdata/testprognet/main.go:34 +0x1f1 fp=0xc420047f48 sp=0xc420047ea0
        runtime.main()
            /tmp/workdir/go/src/runtime/proc.go:185 +0x20a fp=0xc420047fa0 sp=0xc420047f48
        runtime.goexit()
            /tmp/workdir/go/src/runtime/asm_amd64.s:2160 +0x1 fp=0xc420047fa8 sp=0xc420047fa0

        goroutine 17 [syscall, locked to thread]:
        runtime.goexit()
            /tmp/workdir/go/src/runtime/asm_amd64.s:2160 +0x1

        goroutine 5 [runnable]:
        os/signal.loop()
            /tmp/workdir/go/src/os/signal/signal_unix.go:20
        created by os/signal.init.1
            /tmp/workdir/go/src/os/signal/signal_unix.go:28 +0x41

        rax    0x0
        rbx    0x0
        rcx    0x45a694
        rdx    0x0
        rdi    0x6554
        rsi    0x5
        rbp    0xc420047e48
        rsp    0xc420047e00
        r8     0x0
        r9     0x0
        r10    0x0
        r11    0x246
        r12    0xc42006c4e1
        r13    0x1
        r14    0xf3
        r15    0x3
        rip    0x45a694
        rflags 0x246
        cs     0x2b
        fs     0x2c8feeaa0
        gs     0x29

FAIL
FAIL    runtime 34.550s
2016/10/17 23:20:53 Failed: exit status 1
@bradfitz bradfitz added this to the Go1.8 milestone Oct 18, 2016
@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 21, 2016
@rsc
Copy link
Contributor

rsc commented Oct 27, 2016

Plenty of failures, all OpenBSD:

$ grep -lR ,.got.SIGTRAP:.trace.trap rev 
rev/2016-01-12T02:00:48-352e287/openbsd-386-gce56
rev/2016-01-20T04:43:43-5395846/openbsd-386-gce56
rev/2016-02-09T22:14:02-fd458ba/openbsd-amd64-gce58
rev/2016-04-12T23:14:16-982274c/openbsd-386-gce58
rev/2016-05-01T02:36:46-9813951/openbsd-386-gce58
rev/2016-05-15T06:24:02-b4bf066/openbsd-386-gce58
rev/2016-05-31T13:02:09-4223294/openbsd-386-gce58
rev/2016-06-01T10:28:01-ad074e2/openbsd-386-gce58
rev/2016-08-16T00:24:16-e0d8064/openbsd-386-gce58
rev/2016-09-06T17:21:22-b926bf8/openbsd-386-gce58
rev/2016-09-16T03:45:18-9658de3/openbsd-386-gce58
rev/2016-09-21T01:25:29-35d22af/openbsd-386-gce58
rev/2016-10-20T01:06:09-ef8e85e/openbsd-386-gce58

The program in question does:

func SignalIgnoreSIGTRAP() {
    signal.Ignore(syscall.SIGTRAP)
    syscall.Kill(syscall.Getpid(), syscall.SIGTRAP)
    println("OK")
}

signal.Ignore sets the 1<<SIGTRAP bit in runtime's sig.ignored, which can be queried by signal_ignored:

// Must only be called from a single goroutine at a time.
//go:linkname signal_ignore os/signal.signal_ignore
func signal_ignore(s uint32) {
    if s >= uint32(len(sig.wanted)*32) {
        return
    }
    sig.wanted[s/32] &^= 1 << (s & 31)
    sig.ignored[s/32] |= 1 << (s & 31)
    sigignore(s)
}

// Checked by signal handlers.
func signal_ignored(s uint32) bool {
    return sig.ignored[s/32]&(1<<(s&31)) != 0
}

Then syscall.Kill sends the SIGTRAP, which ends up in the signal handler, which does:

if c.sigcode() == _SI_USER && signal_ignored(sig) {
    return
}

Clearly that condition is false and should be true.

  • Is c.sigcode() == _SI_USER false?
    • Maybe the kernel has a bug and sends the wrong siginfo sometimes?
    • Maybe we misinterpret the siginfo sometimes?
  • Is signal_ignored(sig) false?
    • Maybe multiple calls to signal_ignore step on each other? No, os/signal is holding a lock when calling it.
    • Maybe memory corruption due to the kernel?
    • Maybe signal.Ignore happens on one thread/CPU and the signal handler is invoked on a separate thread/CPU,
      and maybe there's no synchronization between the signaling CPU and the handling CPU, so that the handling CPU
      has not yet seen the update to sig.ignored. How? The same thread that set the sig.ignored bit triggered the signal. How could the existence of the signal propagate to the other CPU without carrying the sig.ignored memory update with it?

I built testprognet on an openbsd-amd64-gce58 gomote and am running 'stress -p 50 testprognet SignalIgnoreSIGTRAP' in hopes of getting even 1 failure. Nothing so far.

@gopherbot
Copy link

CL https://golang.org/cl/32183 mentions this issue.

@rsc
Copy link
Contributor

rsc commented Oct 27, 2016

CL 32183 added a print of sigcode in the SIGTRAP crash. That will at least let us see which half of the if statement is wrong. I can't reproduce this myself but it seems to be happening around once a month, so maybe we can just wait and see.

At some point we should think about using atomic loads/stores to synchronize signal_ignore with signal_ignored, but I'd prefer to gather more information before we stomp around where the bug might be. I talked to @aclements and he agrees that there doesn't seem to be any plausible way to connect the dots in the missed memory update theory.

gopherbot pushed a commit that referenced this issue Oct 27, 2016
For #17496.

Change-Id: I671a59581c54d17bc272767eeb7b2742b54eca38
Reviewed-on: https://go-review.googlesource.com/32183
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@mdempsky
Copy link
Member

Aside, there seems to be a theoretical race: we're setting sig.ignored in thread A, then sending a process-directed SIGTRAP signal, which can be handled in thread B on a different CPU, where the sig.ignored write isn't visible yet.

Reviewing OpenBSD's kill system call implementation, it looks like the signal should always be handled during this test by A (see "If the current thread can process the signal immediately (it's unblocked) then have it take it." in ptsignal). So I don't think that's the issue here.

Experimentally, I can see A!=B occasionally happens on Linux though:

$ strace -f ./testprognet SignalIgnoreSIGTRAP 2>&1 | grep SIGTRAP
execve("./testprognet", ["./testprognet", "SignalIgnoreSIGTRAP"], [/* 44 vars */]) = 0
rt_sigaction(SIGTRAP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTRAP, {0x45ab30, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x45abf0}, NULL, 8) = 0
[pid  4106] kill(4106, SIGTRAP <unfinished ...>
[pid  4108] --- SIGTRAP {si_signo=SIGTRAP, si_code=SI_USER, si_pid=4106, si_uid=123644} ---

(I.e., SIGTRAP is being sent by thread 4106, but being handled by thread 4108.)

There also seems to be a theoretical race that when SIGTRAP is handled on a separate thread, there's no synchronization to ensure the main thread doesn't just immediately print "OK" and exit, before the SIGTRAP signal handler runs. But that would only cause sporadic success, not sporadic failure.

@aclements
Copy link
Member

Aside, there seems to be a theoretical race: we're setting sig.ignored in thread A, then sending a process-directed SIGTRAP signal, which can be handled in thread B on a different CPU, where the sig.ignored write isn't visible yet.

At least on x86, I don't believe this is possible. Thread A on CPU A writes sig.ignored and then must write something to inform thread B on CPU B that it wants to send a signal. If CPU B observes the second write, then it must also observe the first write. Reads can be reordered with writes, but writes cannot be reordered with other writes, and reads cannot be reordered with other reads.

@mdempsky
Copy link
Member

Good point, I agree, x86's memory model doesn't seem to allow that particular race.

@rsc rsc modified the milestones: Go1.9, Go1.8 Nov 11, 2016
@josharian
Copy link
Contributor

FWIW, there's also been an openbsd/386 failure: https://build.golang.org/log/0056d2a00dc647d543f6f56f14d1056c5828581c

@aclements
Copy link
Member

Recent failures with the sigcode output:

2016-12-01T20:08:56-b42d4a8/openbsd-386-gce58 PC=0x8097237 m=0 sigcode=418098389
2016-12-01T23:11:15-e3a1d0c/openbsd-arm PC=0x6e0f4 m=0 sigcode=282571616
2017-03-03T21:02:17-fbf4dd9/openbsd-386-60 PC=0x80971a7 m=0 sigcode=415316774
2017-03-06T15:05:42-06a6b3a/openbsd-386-60 PC=0x8097537 m=0 sigcode=33
2017-03-10T01:02:14-678f35b/openbsd-386-60 PC=0x8097b77 m=0 sigcode=1713534066
2017-03-22T17:33:27-cfb3c8d/openbsd-386-60 PC=0x8098a47 m=0 sigcode=4096
2017-03-31T20:05:33-8ab7130/openbsd-386-60 PC=0x8099a17 m=0 sigcode=8

The sigcodes are all over the place and look like they're just corrupted. The PCs seem reasonable, though, so it's not like the whole signal context is bad (the PC isn't exactly where I see syscall.Kill when I build the binary, but it's within a few KB and maybe the cross-compile is doing something that throws it off).

Re-assigning to @ianlancetaylor now that we have more debugging info.

@ianlancetaylor
Copy link
Contributor

I ran the test over 10,000 times on an OpenBSD 386 gomote, but failed to reproduce it. I checked things like the placement of the siginfo struct (on the signal stack) and the arguments. I read through all the code I could think of checking, and didn't see anything at all. I confirmed that the PC value in a real program is correct.

I have no suggestions for how to proceed.

@ianlancetaylor ianlancetaylor removed their assignment Jun 28, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9 Jun 28, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018
@4a6f656c
Copy link
Contributor

As another data point, this also flakes on openbsd/arm:

https://build.golang.org/log/387c70cdbbfdf65f3672e3434ad0f843ffcabe6b

@bcmills
Copy link
Contributor

bcmills commented Jul 3, 2019

openbsd-386-64: https://build.golang.org/log/eaafe23bbe5c70aa29d7a59a05e93d6b36fdeac2

--- FAIL: TestSignalIgnoreSIGTRAP (1.44s)
    crash_test.go:95: testprognet SignalIgnoreSIGTRAP exit status: exit status 2
    crash_unix_test.go:272: want OK
        , got SIGTRAP: trace trap
        PC=0x80a4dc7 m=0 sigcode=137933296
        
        goroutine 1 [syscall]:
        syscall.Syscall(0xbecc, 0x5, 0x0, 0x0, 0x822a240, 0x810da90, 0x815b50c)
        	/tmp/workdir/go/src/syscall/asm_unix_386.s:19 +0x5 fp=0x6e4266f8 sp=0x6e4266f4 pc=0x80a4db5
        syscall.Kill(0xbecc, 0x5, 0x0, 0x0)
        	/tmp/workdir/go/src/syscall/zsyscall_openbsd_386.go:662 +0x37 fp=0x6e426718 sp=0x6e4266f8 pc=0x80a42c7
        main.SignalIgnoreSIGTRAP()
        	/tmp/workdir/go/src/runtime/testdata/testprognet/signal.go:24 +0x95 fp=0x6e426740 sp=0x6e426718 pc=0x8113405
        main.main()
        	/tmp/workdir/go/src/runtime/testdata/testprognet/main.go:34 +0x148 fp=0x6e4267b0 sp=0x6e426740 pc=0x8112f38
        runtime.main()
        	/tmp/workdir/go/src/runtime/proc.go:203 +0x1ea fp=0x6e4267f0 sp=0x6e4267b0 pc=0x8071a6a
        runtime.goexit()
        	/tmp/workdir/go/src/runtime/asm_386.s:1325 +0x1 fp=0x6e4267f4 sp=0x6e4267f0 pc=0x8096b61
        
        goroutine 6 [runnable]:
        os/signal.loop()
        	/tmp/workdir/go/src/os/signal/signal_unix.go:21
        created by os/signal.init.0
        	/tmp/workdir/go/src/os/signal/signal_unix.go:29 +0x31
        
        eax    0x0
        ebx    0x6e41c004
        ecx    0x6e41c004
        edx    0x822a060
        edi    0x6e426704
        esi    0x6e426708
        ebp    0x0
        esp    0x6e4266f4
        eip    0x80a4dc7
        eflags 0x216
        cs     0x2b
        fs     0x5b
        gs     0x63
        
FAIL
FAIL	runtime	38.584s

@bcmills bcmills changed the title runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD-amd64 runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD Jul 3, 2019
@bcmills
Copy link
Contributor

bcmills commented Jul 15, 2019

@bcmills
Copy link
Contributor

bcmills commented Oct 4, 2019

@bcmills
Copy link
Contributor

bcmills commented Dec 9, 2019

@mdempsky
Copy link
Member

mdempsky commented Dec 9, 2019

I think this is an OpenBSD kernel issue. In particular, I think thread_new is failing to initialize p_sisig to 0, so the p_si* member variables are all uninitialized garbage.

What I'm guessing is happening:

  1. Process A gets a proc at address P. No proc at P has initialized the p_si* values yet.
  2. A calls sigexit(P, SIGTRAP). This sets P->si_sisig=SIGTRAP, but other the other P->si_si* variables are still garbage.
  3. Address P is returned to the pool after A finishes exiting.
  4. Process B (Go test program) reuses address P.
  5. B calls kill(getpid(), SIGTRAP), which in turn eventually calls postsig(P, SIGTRAP).
  6. P->p_sisig is still SIGTRAP, so "if (p->p_sisig != signum)" evaluates false, and we write the garbage P->p_sicode value instead of SI_USER.

I've emailed a couple OpenBSD kernel developers about this.

@mdempsky
Copy link
Member

mdempsky commented Dec 10, 2019

Confirmed that modifying the kernel to set p->p_sisig = SIGTRAP; immediately after allocating the struct proc (i.e., before re-initializing it; similar to the consequences of steps 1--3 above) reliably reproduces the issue.

@josharian
Copy link
Contributor

@bcmills
Copy link
Contributor

bcmills commented Mar 11, 2020

@gopherbot
Copy link

Change https://golang.org/cl/222856 mentions this issue: runtime: skip TestSignalIgnoreSIGTRAP on known-flaky OpenBSD builders

@bcmills
Copy link
Contributor

bcmills commented Mar 11, 2020

We (the Go project) can't fix the OpenBSD kernel unilaterally, and I don't think we should go to great lengths to work around it. Skipping the test on the builders with known-bad kernels, and I think that's about all we should do.

@bcmills bcmills closed this as completed Mar 11, 2020
gopherbot pushed a commit that referenced this issue Mar 11, 2020
This test is flaky, and the cause is suspected to be an OpenBSD kernel bug.
Since there is no obvious workaround on the Go side, skip the test on
builders whose versions are known to be affected.

Fixes #17496

Change-Id: Ifa70061eb429e1d949f0fa8a9e25d177afc5c488
Reviewed-on: https://go-review.googlesource.com/c/go/+/222856
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
@golang golang locked and limited conversation to collaborators Mar 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-OpenBSD
Projects
None yet
Development

No branches or pull requests

10 participants