Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: signal 2 received on thread with no signal stack #18600

Closed
jbardin opened this issue Jan 10, 2017 · 21 comments
Closed

runtime: signal 2 received on thread with no signal stack #18600

jbardin opened this issue Jan 10, 2017 · 21 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jbardin
Copy link
Contributor

jbardin commented Jan 10, 2017

go1.8rc1 && master on darwin

Sending a SIGINT to a go build ./... caused a runtime error:

signal 2 received on thread with no signal stack
fatal error: non-Go code disabled sigaltstack

runtime stack:
runtime.throw(0x1499cf6, 0x20)
	/usr/local/go/src/runtime/panic.go:596 +0x95
runtime.noSignalStack(0x2)
	/usr/local/go/src/runtime/signal_unix.go:455 +0x94
runtime.sigtrampgo(0x2, 0xc4207fcd10, 0xc4207fcd78)
	/usr/local/go/src/runtime/signal_unix.go:238 +0x2d8
runtime.sigtramp(0xc420114386, 0x5, 0x5956, 0x200005a00000085, 0xc4202f43a0, 0x0, 0xe, 0xc4207fce30, 0x0, 0x3, ...)
	/usr/local/go/src/runtime/sys_darwin_amd64.s:240 +0x28

Managed to replicate this by building in a fairly large project, and running for a while

$ while true; do go build ./...& sleep .$RANDOM; kill -2 %; done
@bradfitz
Copy link
Contributor

/cc @rsc @ianlancetaylor

@bradfitz bradfitz added this to the Go1.8Maybe milestone Jan 10, 2017
@ianlancetaylor
Copy link
Contributor

Which version of Darwin?

Is there any more to that stack trace? Please try setting the environment variable GOTRACEBACK=system when running go build.

@jbardin
Copy link
Contributor Author

jbardin commented Jan 11, 2017

Sorry, running on macOS 10.12.2 (16C67)

Here's a full system stack trace: https://gist.github.com/jbardin/0421e4a7689ebc46732894f3f8e2af8d

I wasn't able to trigger it on Linux, but I may also just be missing the timing window on that machine.

@ianlancetaylor
Copy link
Contributor

I've looked over the Go runtime code, and it looks fine.

@quentinmit Is it still possible to ssh into that Sierra system?

@quentinmit
Copy link
Contributor

@ianlancetaylor Yeah, I don't remember what we had actually set up though. Ping me on Hangouts and I can set it up again.

@jbardin Do you have any antivirus software installed? Or haxies?

@jbardin
Copy link
Contributor Author

jbardin commented Jan 17, 2017

@quentinmit, nope. running vanilla macOS on real hardware.

@quentinmit
Copy link
Contributor

"haxies" = things that inject code into running programs; Application Enhancer and SIMBL are examples of that.

@jbardin
Copy link
Contributor Author

jbardin commented Jan 17, 2017

Not running anything like that that I know of, unless macOS is injecting something by default.

I did see that the go tool is dynamically linked, so as an experiment I rebuilt the toolchain with CGO_ENABLED=0 and haven't been able to reproduce the crash. I'm not able to reliably reproduce the crash in the first place though, so this doesn't prove anything, just adds another data point.

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

Reproduced on 10.12.2 16C67 using the while loop in $GOPATH/src/camlistore.org.

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

I confirmed that the signal is arriving on the ordinary goroutine stack, and the stack trace for that goroutine at the time of the signal is:

syscall.RawSyscall(0x2, 0x0, 0x0, 0x0, 0x3, 0x3, 0x5c)
	/Users/rsc/go/src/syscall/asm_darwin_amd64.s:109 +0x25 fp=0xc420cf4e38 sp=0xc420cf4e30
syscall.forkAndExecInChild(0xc42072db90, 0xc4208b8210, 0x15, 0x15, 0xc420a78000, 0x40, 0x40, 0x0, 0xc4204ff2c0, 0xc420cf5110, ...)
	/Users/rsc/go/src/syscall/exec_bsd.go:69 +0xee fp=0xc420cf4f00 sp=0xc420cf4e38
syscall.forkExec(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x0, 0xc420cec118, 0xc420a1a9c0)
	/Users/rsc/go/src/syscall/exec_unix.go:193 +0x368 fp=0xc420cf5028 sp=0xc420cf4f00
syscall.StartProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x2, 0x4, 0x0, 0x20)
	/Users/rsc/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc420cf5080 sp=0xc420cf5028
os.startProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf52b8, 0xc420778d90, 0x1, 0x0)
	/Users/rsc/go/src/os/exec_posix.go:45[436] 81429
 +0x1a3 fp=0xc420cf5168 sp=0xc420cf5080

That's the RawSyscall(SYS_FORK, ...) call that creates the child process.

I changed the runtime to record its own pid at init time and then at the time of the signal print both the init-time pid and the current pid. That appears to confirm that the signal has been received in the child process (at least not in the parent process!), before the child process has gotten a chance to continue executing after the fork system call.

It looks like kill -2 % is sending the signal to the background job's process group, not the parent pid (confirmed by looking at Apple's bash-103 source code: jobs.c is full of killpg calls). If the child has just been created, then it's in the process group and gets the signal but apparently has had alternate signal stack information cleared.

When we create a new thread we mask all signals, make the new thread, and then bring back the signals. We could do the same for fork, using the beforefork and afterfork runtime hooks, except there is no good time to run the afterfork hook and turn signals back on. Any window between 'after the fork' and 'before the exec' where signals are enabled will confuse the runtime if a signal is received. And we can't start the child with signals masked off, or it will likely keep them masked off.

It seems like the best we can do is detect when we receive a signal in a child process that hasn't quite gotten to exec yet and make the child die quietly?

I assume that using posix_spawn would be another way around this.

I'm a little surprised this hasn't turned up on other systems too, but I guess the race window is small and it only affects killpg. Or maybe macOS is the only one that clears alternate signal stacks during fork.

/cc @ianlancetaylor for thoughts about how to proceed

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

Confirmed that using 'kill -2 $! ' instead of 'kill -2 %' makes the problem go away: this is about killpg, not a kernel bug where parent signals get delivered to a child.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Jan 25, 2017

This problem does not happen on GNU/Linux because the child process still has an alternate signal stack in effect. At least, that is what I see from this program, which on GNU/Linux prints

0x2517010 0 8192

Program:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void
die(const char *s)
{
  perror(s);
  exit(EXIT_FAILURE);
}

int
main()
{
  stack_t st;
  pid_t pid;

  st.ss_sp = malloc(SIGSTKSZ);
  if (st.ss_sp == NULL)
    die("malloc");
  st.ss_flags = 0;
  st.ss_size = SIGSTKSZ;
  if (sigaltstack(&st, NULL) < 0)
    die("sigaltstack");

  pid = fork();
  if (pid < 0)
    die("fork");
  if (pid == 0) {
    if (sigaltstack(NULL, &st) < 0)
      die("sigaltstack in child");
    printf("%p %#x %zu\n", st.ss_sp, st.ss_flags, st.ss_size);
    exit(EXIT_SUCCESS);
  }
  if (waitpid(pid, NULL, 0) < 0)
    die("waitpid");
  exit(EXIT_SUCCESS);
}

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

On macOS 12.12.2 I get

0x0 0x4 0

OK, so we know that Linux doesn't clear sigaltstack at fork (that's nice), but there's still technically a problem. A signal has arrived and we're handling it as if we're in the parent, but we're in the child. If the signal handler did anything that depended on being in the real parent process, we'd be in trouble. We probably do need to either make this impossible or detect it and react accordingly.

@ianlancetaylor
Copy link
Contributor

The most correct procedure I can think of is

  1. block all signals before fork
  2. after fork in parent, restore original signal mask
  3. after fork in child, change all signal handlers that are not SIG_IGN to SIG_DFL, then restore signal mask

I will work on a CL.

@ianlancetaylor
Copy link
Contributor

To clarify, the reason for changing all the signal handlers to SIG_DFL is 1) that avoids the problem of invoking a signal handler in the child; 2) that is what execve is going to do anyhow, so any race condition already exists.

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

SGTM. Thanks.

@rsc
Copy link
Contributor

rsc commented Jan 25, 2017

I think this can wait for Go 1.9. We're more likely to break something common than fix this rare event.

@aclements
Copy link
Member

Ping @ianlancetaylor.

@aclements aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 13, 2017
@gopherbot
Copy link

CL https://golang.org/cl/45471 mentions this issue.

@aclements
Copy link
Member

@ianlancetaylor should this (plus the follow-up fix 28f650a) get back-ported to Go 1.8.4?

@ianlancetaylor
Copy link
Contributor

Russ decided above that this was rare enough that we wouldn't worry about for 1.8.

@golang golang locked and limited conversation to collaborators Jul 20, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants