runtime: signal 2 received on thread with no signal stack #18600

jbardin · 2017-01-10T22:27:59Z

go1.8rc1 && master on darwin

Sending a SIGINT to a go build ./... caused a runtime error:

signal 2 received on thread with no signal stack
fatal error: non-Go code disabled sigaltstack

runtime stack:
runtime.throw(0x1499cf6, 0x20)
	/usr/local/go/src/runtime/panic.go:596 +0x95
runtime.noSignalStack(0x2)
	/usr/local/go/src/runtime/signal_unix.go:455 +0x94
runtime.sigtrampgo(0x2, 0xc4207fcd10, 0xc4207fcd78)
	/usr/local/go/src/runtime/signal_unix.go:238 +0x2d8
runtime.sigtramp(0xc420114386, 0x5, 0x5956, 0x200005a00000085, 0xc4202f43a0, 0x0, 0xe, 0xc4207fce30, 0x0, 0x3, ...)
	/usr/local/go/src/runtime/sys_darwin_amd64.s:240 +0x28

Managed to replicate this by building in a fairly large project, and running for a while

$ while true; do go build ./...& sleep .$RANDOM; kill -2 %; done

The text was updated successfully, but these errors were encountered:

bradfitz · 2017-01-10T22:34:05Z

/cc @rsc @ianlancetaylor

ianlancetaylor · 2017-01-11T01:40:17Z

Which version of Darwin?

Is there any more to that stack trace? Please try setting the environment variable GOTRACEBACK=system when running go build.

jbardin · 2017-01-11T02:20:48Z

Sorry, running on macOS 10.12.2 (16C67)

Here's a full system stack trace: https://gist.github.com/jbardin/0421e4a7689ebc46732894f3f8e2af8d

I wasn't able to trigger it on Linux, but I may also just be missing the timing window on that machine.

ianlancetaylor · 2017-01-11T05:04:26Z

I've looked over the Go runtime code, and it looks fine.

@quentinmit Is it still possible to ssh into that Sierra system?

quentinmit · 2017-01-17T17:05:14Z

@ianlancetaylor Yeah, I don't remember what we had actually set up though. Ping me on Hangouts and I can set it up again.

@jbardin Do you have any antivirus software installed? Or haxies?

jbardin · 2017-01-17T19:42:12Z

@quentinmit, nope. running vanilla macOS on real hardware.

quentinmit · 2017-01-17T20:36:48Z

"haxies" = things that inject code into running programs; Application Enhancer and SIMBL are examples of that.

jbardin · 2017-01-17T22:17:02Z

Not running anything like that that I know of, unless macOS is injecting something by default.

I did see that the go tool is dynamically linked, so as an experiment I rebuilt the toolchain with CGO_ENABLED=0 and haven't been able to reproduce the crash. I'm not able to reliably reproduce the crash in the first place though, so this doesn't prove anything, just adds another data point.

rsc · 2017-01-25T16:13:16Z

Reproduced on 10.12.2 16C67 using the while loop in $GOPATH/src/camlistore.org.

rsc · 2017-01-25T17:04:05Z

I confirmed that the signal is arriving on the ordinary goroutine stack, and the stack trace for that goroutine at the time of the signal is:

syscall.RawSyscall(0x2, 0x0, 0x0, 0x0, 0x3, 0x3, 0x5c)
	/Users/rsc/go/src/syscall/asm_darwin_amd64.s:109 +0x25 fp=0xc420cf4e38 sp=0xc420cf4e30
syscall.forkAndExecInChild(0xc42072db90, 0xc4208b8210, 0x15, 0x15, 0xc420a78000, 0x40, 0x40, 0x0, 0xc4204ff2c0, 0xc420cf5110, ...)
	/Users/rsc/go/src/syscall/exec_bsd.go:69 +0xee fp=0xc420cf4f00 sp=0xc420cf4e38
syscall.forkExec(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x0, 0xc420cec118, 0xc420a1a9c0)
	/Users/rsc/go/src/syscall/exec_unix.go:193 +0x368 fp=0xc420cf5028 sp=0xc420cf4f00
syscall.StartProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf5110, 0x2, 0x4, 0x0, 0x20)
	/Users/rsc/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc420cf5080 sp=0xc420cf5028
os.startProcess(0xc42072dad0, 0x2b, 0xc420a643c0, 0x14, 0x14, 0xc420cf52b8, 0xc420778d90, 0x1, 0x0)
	/Users/rsc/go/src/os/exec_posix.go:45[436] 81429
 +0x1a3 fp=0xc420cf5168 sp=0xc420cf5080

That's the RawSyscall(SYS_FORK, ...) call that creates the child process.

I changed the runtime to record its own pid at init time and then at the time of the signal print both the init-time pid and the current pid. That appears to confirm that the signal has been received in the child process (at least not in the parent process!), before the child process has gotten a chance to continue executing after the fork system call.

It looks like kill -2 % is sending the signal to the background job's process group, not the parent pid (confirmed by looking at Apple's bash-103 source code: jobs.c is full of killpg calls). If the child has just been created, then it's in the process group and gets the signal but apparently has had alternate signal stack information cleared.

When we create a new thread we mask all signals, make the new thread, and then bring back the signals. We could do the same for fork, using the beforefork and afterfork runtime hooks, except there is no good time to run the afterfork hook and turn signals back on. Any window between 'after the fork' and 'before the exec' where signals are enabled will confuse the runtime if a signal is received. And we can't start the child with signals masked off, or it will likely keep them masked off.

It seems like the best we can do is detect when we receive a signal in a child process that hasn't quite gotten to exec yet and make the child die quietly?

I assume that using posix_spawn would be another way around this.

I'm a little surprised this hasn't turned up on other systems too, but I guess the race window is small and it only affects killpg. Or maybe macOS is the only one that clears alternate signal stacks during fork.

/cc @ianlancetaylor for thoughts about how to proceed

rsc · 2017-01-25T17:04:44Z

Confirmed that using 'kill -2 $! ' instead of 'kill -2 %' makes the problem go away: this is about killpg, not a kernel bug where parent signals get delivered to a child.

ianlancetaylor · 2017-01-25T18:39:19Z

This problem does not happen on GNU/Linux because the child process still has an alternate signal stack in effect. At least, that is what I see from this program, which on GNU/Linux prints

0x2517010 0 8192

Program:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void
die(const char *s)
{
  perror(s);
  exit(EXIT_FAILURE);
}

int
main()
{
  stack_t st;
  pid_t pid;

  st.ss_sp = malloc(SIGSTKSZ);
  if (st.ss_sp == NULL)
    die("malloc");
  st.ss_flags = 0;
  st.ss_size = SIGSTKSZ;
  if (sigaltstack(&st, NULL) < 0)
    die("sigaltstack");

  pid = fork();
  if (pid < 0)
    die("fork");
  if (pid == 0) {
    if (sigaltstack(NULL, &st) < 0)
      die("sigaltstack in child");
    printf("%p %#x %zu\n", st.ss_sp, st.ss_flags, st.ss_size);
    exit(EXIT_SUCCESS);
  }
  if (waitpid(pid, NULL, 0) < 0)
    die("waitpid");
  exit(EXIT_SUCCESS);
}

rsc · 2017-01-25T18:44:10Z

On macOS 12.12.2 I get

0x0 0x4 0

OK, so we know that Linux doesn't clear sigaltstack at fork (that's nice), but there's still technically a problem. A signal has arrived and we're handling it as if we're in the parent, but we're in the child. If the signal handler did anything that depended on being in the real parent process, we'd be in trouble. We probably do need to either make this impossible or detect it and react accordingly.

ianlancetaylor · 2017-01-25T18:57:34Z

The most correct procedure I can think of is

block all signals before fork
after fork in parent, restore original signal mask
after fork in child, change all signal handlers that are not SIG_IGN to SIG_DFL, then restore signal mask

I will work on a CL.

ianlancetaylor · 2017-01-25T18:58:55Z

To clarify, the reason for changing all the signal handlers to SIG_DFL is 1) that avoids the problem of invoking a signal handler in the child; 2) that is what execve is going to do anyhow, so any race condition already exists.

rsc · 2017-01-25T18:59:52Z

SGTM. Thanks.

rsc · 2017-01-25T19:00:44Z

I think this can wait for Go 1.9. We're more likely to break something common than fix this rare event.

aclements · 2017-06-13T01:28:59Z

Ping @ianlancetaylor.

gopherbot · 2017-06-13T05:40:21Z

CL https://golang.org/cl/45471 mentions this issue.

aclements · 2017-07-20T19:44:26Z

@ianlancetaylor should this (plus the follow-up fix 28f650a) get back-ported to Go 1.8.4?

ianlancetaylor · 2017-07-20T20:01:12Z

Russ decided above that this was rare enough that we wouldn't worry about for 1.8.

bradfitz added this to the Go1.8Maybe milestone Jan 10, 2017

rsc modified the milestones: Go1.9, Go1.8Maybe Jan 25, 2017

mark-rushakoff mentioned this issue Mar 16, 2017

Fatal error on SIGWINCH influxdata/telegraf#2539

Closed

aclements assigned ianlancetaylor Jun 13, 2017

aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 13, 2017

gopherbot closed this as completed in df0892c Jun 14, 2017

aclements mentioned this issue Jul 20, 2017

cmd/compile: go build panic #20361

Closed

ianlancetaylor mentioned this issue Jul 26, 2017

runtime: memory corruption crashes with os/exec on Linux kernel 4.4 #20427

Closed

golang locked and limited conversation to collaborators Jul 20, 2018

gopherbot added the FrozenDueToAge label Jul 20, 2018

rsc unassigned ianlancetaylor Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: signal 2 received on thread with no signal stack #18600

runtime: signal 2 received on thread with no signal stack #18600

jbardin commented Jan 10, 2017

bradfitz commented Jan 10, 2017

ianlancetaylor commented Jan 11, 2017

jbardin commented Jan 11, 2017

ianlancetaylor commented Jan 11, 2017

quentinmit commented Jan 17, 2017

jbardin commented Jan 17, 2017

quentinmit commented Jan 17, 2017

jbardin commented Jan 17, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017 •

edited

rsc commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

aclements commented Jun 13, 2017

gopherbot commented Jun 13, 2017

aclements commented Jul 20, 2017

ianlancetaylor commented Jul 20, 2017

runtime: signal 2 received on thread with no signal stack #18600

runtime: signal 2 received on thread with no signal stack #18600

Comments

jbardin commented Jan 10, 2017

bradfitz commented Jan 10, 2017

ianlancetaylor commented Jan 11, 2017

jbardin commented Jan 11, 2017

ianlancetaylor commented Jan 11, 2017

quentinmit commented Jan 17, 2017

jbardin commented Jan 17, 2017

quentinmit commented Jan 17, 2017

jbardin commented Jan 17, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017 • edited

rsc commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017

ianlancetaylor commented Jan 25, 2017

rsc commented Jan 25, 2017

rsc commented Jan 25, 2017

aclements commented Jun 13, 2017

gopherbot commented Jun 13, 2017

aclements commented Jul 20, 2017

ianlancetaylor commented Jul 20, 2017

ianlancetaylor commented Jan 25, 2017 •

edited