os/signal: flaky test on darwin/arm #10207

crawshaw · 2015-03-20T14:54:35Z

I occasionally see this failure on the darwin/arm builder:

fatal error: cgo callback before cgo call

Process 22758 exited with status = 1 (0x00000001) 
go_darwin_arm_exec: exit status 253
FAIL    os/signal   6.989s

Of the few examples I have, it's always in the os/signal package.

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2015-03-20T15:32:17Z

This happens when C code calls into Go code before any Go code calls into C code. For example, it could happen if a C++ global constructor calls a Go function. I don't think there is any other way it could happen. In a program that uses cgo, needextram is initialized to 1. It is set to 0 in the first cgo call, and is never to set to 1 again. The error occurs when a call from C code to Go code sees that needextram is still set to 1.

crawshaw · 2015-03-23T17:00:41Z

In particular the failing test is TestStress.

process continue
Process 40912 resuming
=== RUN TestSignal
--- PASS: TestSignal (0.08s)
    signal_test.go:40: sighup...
    signal_test.go:49: sigwinch...
    signal_test.go:56: sighup...
    signal_test.go:59: sighup...
=== RUN TestStress
fatal error: cgo callback before cgo call
Process 40912 exited with status = 1 (0x00000001) 
go_darwin_arm_exec: exit status 253
exit status 1
FAIL    os/signal   6.183s

Mercifully this is SIGUSR1 which rules out the EXC_BAD_ACCESS handler. Unfortunately I fear that this is some kind of bug in lldb's signal interception and replaying. Still investigating.

crawshaw · 2015-03-23T18:59:21Z

My current working theory is that the new thread is getting a signal after the signal handler is installed but before it calls crosscall_arm1(ts.fn, ...). I believe this race exists on several platforms, but is much more likely on darwin/arm because darwin_arm_init_thread_exception_port happens in between.

https://github.com/golang/go/blob/master/src/runtime/cgo/gcc_darwin_arm.c#L70

A cheap workaround just for darwin/arm would be to have the new thread set a condition that the creating thread can wait on, and only unblock signals after that. That would reduce the race to what it is on other platforms.

If I'm right about this, a real fix would require the signals staying blocked until the new thread can successfully inform it that it is back in Go. Or have the signal handler know about this intermediate state. Not sure how to go about either option.

@ianlancetaylor does this theory make any sense?

ianlancetaylor · 2015-03-23T19:34:59Z

The bug can only occur when C/C++ code, generated by cgo, calls into Go code before any Go code calls in C/C++ code. That is, something needs to call runtime·cgocallback_gofunc or runtime·cgocallback. So you are suggesting that when a signal occurs, it causes C/C++ code to call into Go code, even though nothing has called into C/C++ code. That would be a problem. But how could it happen?

You say that calling crosscall_arm has something to do with fixing this, but I don't see that. Perhaps there is a problem there, but I don't see how it could be this problem. This problem can only happen in a callback from C/C++ code. And, to be clear, it has to be a callback from code generated by cgo, or something that is behaving the way that cgo behaves. I don't see any cgo code in os/signal. So, is there cgo in the darwin/arm support? cgo code that calls exported Go functions? I don't see anything but I don't know if I am looking in the right places.

crawshaw · 2015-03-23T19:56:39Z

If the signal handler gets called somewhere around the call to darwin_arm_init_thread_exception_port, then the signal handler will be run. It will attempt to call load_g, which will fail and that will pass into badsignal.

In badsignal, there's a cgocallback. I think that's what's causing this.

crawshaw · 2015-03-23T20:37:49Z

No, I am misunderstanding pthread_create. All signals are blocked when it's called, and that means the new thread starts with all signals blocked and none pending. So that cannot be cause.

crawshaw · 2015-03-23T20:45:06Z

To confirm however, the cgocallback being triggered is the one in badsignal. (I put a print statement in it and then signal.TestStress eventually died trying to get runtime.printlock.)

Now what's not clear is how badsignal is being called.

minux · 2015-03-23T20:49:41Z

Try adding some print statement in runtime/cgo to print the current thread signal mask, esp. in threadentry.

ianlancetaylor · 2015-03-23T22:19:13Z

Huh, badsignal uses cgocallback. That can't possibly work. It will fail any time a program that uses cgo starts a new thread that receives a signal before it makes any cgo calls. This isn't a matter of the signal mask or of the new thread preparing itself to receive signals. The only question is whether the program has made any cgo calls yet. If it hasn't, it is doomed. The comment in badsignal explains the situation well enough, without bothering to point out that this can happen for any signal, not just SIGPROF.

I think the only possible fix is for mstart1 to say something like

if needextram == 1 && cas(&needextram, 1, 0) {
    newextram()
}

before calling initsig. And then the check in badsignal is not needed.

minux · 2015-03-24T14:29:56Z

@ianlancetaylor, I don't understand why it can't work. I think the only case where it will fail is when the foreign thread is started by static constructor. In this specific case, I don't understand why our current solution is inadequate. we block all signals just before pthread_create, and the new thread should inherent the parent threads's signal masks, so no signals will be sent to it until it explicitly unblocks the signals, but at which point, g is guaranteed to be set in its TLS slot. I still think something is wrong, either lldb's signal handling or the kernel fails to honor the signal masks.

crawshaw · 2015-03-24T14:38:32Z

I agree that we haven't found the cause of this bug yet, but we may as well fix badsignal on the way. After that's in, I'm going to do a bunch of instrumenting and see if I can work out what is happening.

minux · 2015-03-24T14:46:06Z

I think badsignal only breaks when threads are started by static constructor, which should be rare. it won't break if the user code uses cgo to create threads, and there is test in misc/cgo/test to verify this.

ianlancetaylor · 2015-03-24T16:53:53Z

@minux Yes, I was imprecise. It's when a thread is started by non-Go code before any cgo call, which does imply that the thread is started by a global constructor.

Previously the extra m needed for cgo callbacks was created on the first callback. This works for cgo, however the cgocallback mechanism is also borrowed by badsignal which can run before any cgo calls are made. Now we initialize the extra M at runtime startup before any signal handlers are registered, so badsignal cannot be called until the extra M is ready. Updates #10207. Change-Id: Iddda2c80db6dc52d8b60e2b269670fbaa704c7b3 Reviewed-on: https://go-review.googlesource.com/7978 Reviewed-by: Ian Lance Taylor <iant@golang.org> Run-TryBot: David Crawshaw <crawshaw@golang.org>

crawshaw · 2015-03-26T11:37:20Z

Tentatively closing this, it appears to be fixed by golang.org/cl/7978.

crawshaw added this to the Go1.5 milestone Mar 20, 2015

crawshaw closed this as completed Mar 26, 2015

crawshaw mentioned this issue Mar 27, 2015

runtime/cgo: test signal from foreign thread before cgo call #10268

Open

golang locked and limited conversation to collaborators Jun 25, 2016

gopherbot added the FrozenDueToAge label Jun 25, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

os/signal: flaky test on darwin/arm #10207

os/signal: flaky test on darwin/arm #10207

crawshaw commented Mar 20, 2015

ianlancetaylor commented Mar 20, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

ianlancetaylor commented Mar 23, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

minux commented Mar 23, 2015 via email

ianlancetaylor commented Mar 23, 2015

minux commented Mar 24, 2015 via email

crawshaw commented Mar 24, 2015

minux commented Mar 24, 2015 via email

ianlancetaylor commented Mar 24, 2015

crawshaw commented Mar 26, 2015

os/signal: flaky test on darwin/arm #10207

os/signal: flaky test on darwin/arm #10207

Comments

crawshaw commented Mar 20, 2015

ianlancetaylor commented Mar 20, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

ianlancetaylor commented Mar 23, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

crawshaw commented Mar 23, 2015

minux commented Mar 23, 2015 via email

ianlancetaylor commented Mar 23, 2015

minux commented Mar 24, 2015 via email

crawshaw commented Mar 24, 2015

minux commented Mar 24, 2015 via email

ianlancetaylor commented Mar 24, 2015

crawshaw commented Mar 26, 2015