Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: flaky test on darwin/arm #10207

Closed
crawshaw opened this issue Mar 20, 2015 · 14 comments
Closed

os/signal: flaky test on darwin/arm #10207

crawshaw opened this issue Mar 20, 2015 · 14 comments
Milestone

Comments

@crawshaw
Copy link
Member

I occasionally see this failure on the darwin/arm builder:

fatal error: cgo callback before cgo call

Process 22758 exited with status = 1 (0x00000001) 
go_darwin_arm_exec: exit status 253
FAIL    os/signal   6.989s

Of the few examples I have, it's always in the os/signal package.

@crawshaw crawshaw added this to the Go1.5 milestone Mar 20, 2015
@ianlancetaylor
Copy link
Contributor

This happens when C code calls into Go code before any Go code calls into C code. For example, it could happen if a C++ global constructor calls a Go function. I don't think there is any other way it could happen. In a program that uses cgo, needextram is initialized to 1. It is set to 0 in the first cgo call, and is never to set to 1 again. The error occurs when a call from C code to Go code sees that needextram is still set to 1.

@crawshaw
Copy link
Member Author

In particular the failing test is TestStress.

process continue
Process 40912 resuming
=== RUN TestSignal
--- PASS: TestSignal (0.08s)
    signal_test.go:40: sighup...
    signal_test.go:49: sigwinch...
    signal_test.go:56: sighup...
    signal_test.go:59: sighup...
=== RUN TestStress
fatal error: cgo callback before cgo call
Process 40912 exited with status = 1 (0x00000001) 
go_darwin_arm_exec: exit status 253
exit status 1
FAIL    os/signal   6.183s

Mercifully this is SIGUSR1 which rules out the EXC_BAD_ACCESS handler. Unfortunately I fear that this is some kind of bug in lldb's signal interception and replaying. Still investigating.

@crawshaw
Copy link
Member Author

My current working theory is that the new thread is getting a signal after the signal handler is installed but before it calls crosscall_arm1(ts.fn, ...). I believe this race exists on several platforms, but is much more likely on darwin/arm because darwin_arm_init_thread_exception_port happens in between.

https://github.com/golang/go/blob/master/src/runtime/cgo/gcc_darwin_arm.c#L70

A cheap workaround just for darwin/arm would be to have the new thread set a condition that the creating thread can wait on, and only unblock signals after that. That would reduce the race to what it is on other platforms.

If I'm right about this, a real fix would require the signals staying blocked until the new thread can successfully inform it that it is back in Go. Or have the signal handler know about this intermediate state. Not sure how to go about either option.

@ianlancetaylor does this theory make any sense?

@ianlancetaylor
Copy link
Contributor

The bug can only occur when C/C++ code, generated by cgo, calls into Go code before any Go code calls in C/C++ code. That is, something needs to call runtime·cgocallback_gofunc or runtime·cgocallback. So you are suggesting that when a signal occurs, it causes C/C++ code to call into Go code, even though nothing has called into C/C++ code. That would be a problem. But how could it happen?

You say that calling crosscall_arm has something to do with fixing this, but I don't see that. Perhaps there is a problem there, but I don't see how it could be this problem. This problem can only happen in a callback from C/C++ code. And, to be clear, it has to be a callback from code generated by cgo, or something that is behaving the way that cgo behaves. I don't see any cgo code in os/signal. So, is there cgo in the darwin/arm support? cgo code that calls exported Go functions? I don't see anything but I don't know if I am looking in the right places.

@crawshaw
Copy link
Member Author

If the signal handler gets called somewhere around the call to darwin_arm_init_thread_exception_port, then the signal handler will be run. It will attempt to call load_g, which will fail and that will pass into badsignal.

In badsignal, there's a cgocallback. I think that's what's causing this.

@crawshaw
Copy link
Member Author

No, I am misunderstanding pthread_create. All signals are blocked when it's called, and that means the new thread starts with all signals blocked and none pending. So that cannot be cause.

@crawshaw
Copy link
Member Author

To confirm however, the cgocallback being triggered is the one in badsignal. (I put a print statement in it and then signal.TestStress eventually died trying to get runtime.printlock.)

Now what's not clear is how badsignal is being called.

@minux
Copy link
Member

minux commented Mar 23, 2015 via email

@ianlancetaylor
Copy link
Contributor

Huh, badsignal uses cgocallback. That can't possibly work. It will fail any time a program that uses cgo starts a new thread that receives a signal before it makes any cgo calls. This isn't a matter of the signal mask or of the new thread preparing itself to receive signals. The only question is whether the program has made any cgo calls yet. If it hasn't, it is doomed. The comment in badsignal explains the situation well enough, without bothering to point out that this can happen for any signal, not just SIGPROF.

I think the only possible fix is for mstart1 to say something like

if needextram == 1 && cas(&needextram, 1, 0) {
    newextram()
}

before calling initsig. And then the check in badsignal is not needed.

@minux
Copy link
Member

minux commented Mar 24, 2015 via email

@crawshaw
Copy link
Member Author

I agree that we haven't found the cause of this bug yet, but we may as well fix badsignal on the way. After that's in, I'm going to do a bunch of instrumenting and see if I can work out what is happening.

@minux
Copy link
Member

minux commented Mar 24, 2015 via email

@ianlancetaylor
Copy link
Contributor

@minux Yes, I was imprecise. It's when a thread is started by non-Go code before any cgo call, which does imply that the thread is started by a global constructor.

crawshaw added a commit that referenced this issue Mar 24, 2015
Previously the extra m needed for cgo callbacks was created on the
first callback. This works for cgo, however the cgocallback mechanism
is also borrowed by badsignal which can run before any cgo calls are
made.

Now we initialize the extra M at runtime startup before any signal
handlers are registered, so badsignal cannot be called until the
extra M is ready.

Updates #10207.

Change-Id: Iddda2c80db6dc52d8b60e2b269670fbaa704c7b3
Reviewed-on: https://go-review.googlesource.com/7978
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David Crawshaw <crawshaw@golang.org>
@crawshaw
Copy link
Member Author

Tentatively closing this, it appears to be fixed by golang.org/cl/7978.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants