Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unix: c-shared libraries segfault handler overrides the default handler #14899

Closed
jtsylve opened this issue Mar 21, 2016 · 18 comments
Closed

Comments

@jtsylve
Copy link
Contributor

jtsylve commented Mar 21, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6 darwin/amd64 or go version go1.6 darwin/386

  1. What operating system and processor architecture are you using (go env)?
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/joe/go:/Users/joe/src/repo004/GOPATH"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.6/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.6/libexec/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"
  1. What did you do?
  1. Compile a library using buildmode=c-shared
  2. Link against a C program
  3. Cause the C program to segfault in any way (ex. dereference a NULL pointer)
  1. What did you expect to see?

The program should crash and the default segfault handler should be called (since the crash was not caused by Go).

I'd expect to see a crash like this:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   crash                           0x00000001023cdf91 main + 17
1   libdyld.dylib                   0x00007fff89f915c9 start + 1
  1. What did you see instead?

Go's signal handler takes precedence, obscuring the stack trace, and making it default to debug crashes in applications that link against Go c-shared libraries.

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libmygolib.dylib                   0x02ef6955 runtime.raiseproc + 37 (sys_darwin_386.s:75)

Thread 1:
0   libmygolib.dylib                   0x02ef6d7a runtime.mach_semaphore_wait + 10 (sys_darwin_386.s:454)

Thread 2:
0   libmygolib.dylib                   0x02ef6c3e runtime.usleep + 78 (sys_darwin_386.s:330)

Thread 3:
0   libmygolib.dylib                   0x02ef6d7a runtime.mach_semaphore_wait + 10 (sys_darwin_386.s:454)

Thread 4:
0   libmygolib.dylib                   0x02ef6d7a runtime.mach_semaphore_wait + 10 (sys_darwin_386.s:454)

Thread 5:
0   libmygolib.dylib                   0x02ef6d7a runtime.mach_semaphore_wait + 10 (sys_darwin_386.s:454)

I believe this has to do with the following code:

TEXT runtime·raise(SB),NOSPLIT,$0
    // Ideally we'd send the signal to the current thread,
    // not the whole process, but that's too hard on OS X.
    JMP runtime·raiseproc(SB)

TEXT runtime·raiseproc(SB),NOSPLIT,$24
    MOVL    $(0x2000000+20), AX // getpid
    SYSCALL
    MOVQ    AX, DI  // arg 1 - pid
    MOVL    sig+0(FP), SI   // arg 2 - signal
    MOVL    $1, DX  // arg 3 - posix
    MOVL    $(0x2000000+37), AX // kill
    SYSCALL
    RET
@ianlancetaylor ianlancetaylor changed the title darwin: c-shared libraries segfault handler overrides the default handler runtime: darwin: c-shared libraries segfault handler overrides the default handler Mar 21, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 21, 2016
@ianlancetaylor
Copy link
Contributor

Do you have any suggestions for how to fix this?

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 21, 2016

Not yet. Can you try to help me understand why it's currently implemented in the way it is and what the difficulties may be in addressing this issue? I assume it's because the Go runtime was originally not supposed to be loaded into other processes, but I'm not sure.

It's currently making it very difficult / near impossible to debug some crashes in our product, so I am eager to help fix the issue or at least come up with a work around.

@ianlancetaylor
Copy link
Contributor

The Go language specifies that certain operations cause a runtime panic. These operations include dereferencing a pointer that is nil or that is somehow invalid. That is implemented by installing a signal handler, catching the signal, and turning it into a panic if appropriate.

When using c-shared the Go runtime looks at the location where the signal occurred. If the signal occurred running non-Go code, it removes the Go signal handler, installs the previous handler if there was one, and raises the signal again. That is presumably what you are seeing.

Go c-shared libraries are always multi-threaded and always have background threads running Go code (for the garbage collector if nothing else), so it doesn't work to only install the signal handler while Go code is running.

For your purposes you could decide that you don't care about correct Go handling of invalid pointer references, and simply remove the Go signal handler after loading the shared library. That might work for you, but it would not be a good choice for most Go programs or shared libraries.

Otherwise, I don't know what to do, but I'm certainly open to suggestions. The handling of signals with mixed C/Go code is definitely a complex area; see https://golang.org/pkg/os/signal .

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 21, 2016

When using c-shared the Go runtime looks at the location where the signal occurred. If the signal occurred running non-Go code, it removes the Go signal handler, installs the previous handler if there was one, and raises the signal again. That is presumably what you are seeing.

This is what I would expect, but in the way that it's re-raising the signals (by using the kill system call to send the signal to the process) we lose all context.

The Windows handler seems to check to see if the IP that caused the fault is not in the range of Go, and if so tells Windows to try the next handler in line. (

if !isgoexception(info, r) {
)

Couldn't we do something similar?

I believe in C you can do something like the following:

static struct sigaction _sigsev_old;
static struct sigaction _sigsev;

...

// Initialize segfault handler, saving original 
_sigsev.sa_sigaction = sigsev_handler;
_sigsev.sa_flags = SA_SIGINFO;
sigaction(SIGBUS, &_sigsev, &_sigsev_old);

...

// Somewhere inside the segfault handler
if (!isgoexception(siginfo)) {
    _sigsev_old.sa_flags = SA_RESETHAND;
    sigaction(SIGSEGV, &_sigsev_old, NULL);
}

@ianlancetaylor
Copy link
Contributor

What you suggest we do is what we already do.

The crash report you show above is not coming from Go, it's coming from something else, either the non-Go signal handler or the OS. You clearly aren't happy with that crash report. Are you unhappy because it shows 6 different stacks, or because the first stack says runtime.raiseproc rather than something in your own program, or both, or something else entirely?

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 21, 2016

This is the standard OS X crash report. I get why shows multiple stacks. That makes sense and is expected. Our problem is that the identifies runtime.raiseproc as the source of the fault rather than whatever actually causes the fault.

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 21, 2016

Is it possible to fix up the context stack in these situations?

@ianlancetaylor
Copy link
Contributor

One possibility would be to simply return from the Go signal handler, after restoring the old signal handler. That should cause the instruction to be executed again, raising the signal again, this time invoking the old handler. I'm not sure but it seems worth a try.

But a serious problem with that idea, or in general with any attempt to fix up the stack in the signal context, is that there would be no way to reinstall the Go signal handler if the program did not simply crash. In this case, of course, it does crash, but in the general case the non-Go signal handler might handle the signal and continue executing.

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 22, 2016

Well I thought you could just call the original sighandler, passing the same context struct without uninstalling the go signal handler. This doesn't seem to be what's happening (at least when the original sig handler is the default).

if fwdFn == _SIG_DFL {

I tried removing that check, but then I think I get into some sort of infinite sighandling loop. It's hard to debug at that level, because you can't print anything out due to the nosplit

@jtsylve jtsylve changed the title runtime: darwin: c-shared libraries segfault handler overrides the default handler runtime: unix: c-shared libraries segfault handler overrides the default handler Mar 22, 2016
@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 22, 2016

I've got an idea that at least seems to work on my end. Let me know what you think and I'll submit a code review request.

raisebadsignal resets the signal handler to the original, calls raise to re-raise the signal, and then if it returns resets the signal handler to the go signal handler. raisebadsignal only seems to be called if the original handler couldn't be called with the context in sigfwdgo.

I propose that in the case that the signal is _SIG_DFL we return from raisebadsignal after resetting the handler instead of raising the signal with raise. This will allow the default handler to be called without messing with the stack.

We can also only do it for signals which we know the default handler does not recover from, so we know that there's no chance that the signal will be recovered and needed to be re-setup.

@ianlancetaylor
Copy link
Contributor

That sounds like it might work. We would have to check that the signal was produced because of execution--that it was not the result of a call to kill or pthread_kill. Basically we have to be sure that the signal really will happen again.

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 23, 2016

So I've got a patch, but I'm seeing something really weird that I can't figure out. I'm bootstrapping with Go 1.6 and I get a segfault in runtime.sigaction + 33 (sys_darwin_amd64.s:212) while compiling. I've narrowed it down to what causes it, but I have no idea why. I defined another sigtable flag so that we can easily lookup which default handlers are fatal on each OS. If I set the flag for SIGKILL to anything other than 0, I get the segfault in the build process (even if that's the only change).

Any idea what might cause this? A search through the codebase tells me that the field is only checked against specific bits. Setting additional flags shouldn't cause a problem...

TEXT runtime·sigaction(SB),NOSPLIT,$0-24
        MOVL    mode+0(FP), DI          // arg 1 sig
        MOVQ    new+8(FP), SI           // arg 2 act
        MOVQ    old+16(FP), DX          // arg 3 oact
        MOVQ    old+16(FP), CX          // arg 3 oact
        MOVQ    old+16(FP), R10         // arg 3 oact
        MOVL    $(0x2000000+46), AX     // syscall entry
        SYSCALL
        JCC     2(PC)
        MOVL    $0xf1, 0xf1  // crash <- crashing here because the syscall fails
        RET

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 23, 2016

The setup to that syscall looks wrong to me. Why is the 3rd parameter being added to 3 different registers? The sigaction syscall only has 3 parameters...

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 23, 2016

OK I figured out why the syscall is failing. It's returning EINVAL, which according to the xnu source of sigaction is because it's being passed SIGKILL.

if (signum <= 0 || signum >= NSIG ||
        signum == SIGKILL || signum == SIGSTOP)
        return (EINVAL);

For whatever reason changing the flags on SIGKILL must cause sigaction to be called on it. I still can't figure out why that is, but since the handler can't be changed anyway I don't need to set the flags on that one.

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 23, 2016

CL: https://golang.org/cl/21006/

@gopherbot
Copy link

CL https://golang.org/cl/21006 mentions this issue.

@jtsylve
Copy link
Contributor Author

jtsylve commented Mar 25, 2016

Apparently the CL broke the Solaris build. I'm looking into it.

@gopherbot
Copy link

CL https://golang.org/cl/21145 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 25, 2016
This fixes the problems with signal handling that were inadvertently
introduced in https://go-review.googlesource.com/21006.

Fixes #14899

Change-Id: Ia746914dcb3146a52413d32c57b089af763f0810
Reviewed-on: https://go-review.googlesource.com/21145
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@golang golang locked and limited conversation to collaborators Mar 26, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants