Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/cgo: crash when invoking Go callback from C library destructor if running under race detector #59711

Closed
gavv opened this issue Apr 19, 2023 · 9 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gavv
Copy link

gavv commented Apr 19, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.3 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/victor/.cache/go-build"
GOENV="/home/victor/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/victor/dev/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/victor/dev/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/linuxbrew/.linuxbrew/Cellar/go/1.20.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/linuxbrew/.linuxbrew/Cellar/go/1.20.3/libexec/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="c++"
CGO_ENABLED="1"
GOMOD="/home/victor/dev/gavv/cgo_racedetector_crash/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2384001255=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Hi,

I was debugging a crash in Go bindings for a C++ library (with pure C public interface). The crash was reproducing in a specific use-case:

  • Go code registers Go callback in the C++ library
  • Go code forgets to unregister callback before exiting
  • Go exits from test
  • presumably, Go starts runtime deinitialization
  • destructor of static C++ object is invoked, it tries to invoke earlier registered Go callback, and a crash happens inside cgo entry point, presumably because it's already deinitialized

The crash was reproducing only when the tests were running with race detector enabled. Without race detector, it seems that global C++ destructors are not running at all.

I've prepared a minimal example that can reproduce it: https://github.com/gavv/cgo_racedetector_crash

In TestExample, we call example() and exit. example() just registers Go callback in libfoo.so. libfoo has a destructor function, which invokes the registered callback. When running under race detector, you can see a crash:

$ make
gcc -fPIC -shared -o  libfoo.so libfoo/foo.c
go build .
LD_LIBRARY_PATH=`pwd` go test -v -race .
=== RUN   TestExample
example
foo_register_callback
--- PASS: TestExample (0.00s)
PASS
foo_destructor
go_callback_proxy
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1 pc=0x1]
...

Here is full backtrace: https://gist.github.com/gavv/586a727c54b85a151994f061ea9fbf7f

Another backtrace: https://gist.github.com/gavv/8e0e5fc20ee34b10a19223bcde64036c

The crash happens when cgocallback invokes cgocallbackg, and it invokes exitsyscall; exitsyscall meets something bad and crashes.

What did you expect to see?

What are possible solutions for this:

  1. Tell the user of the bindings that they must explicitly deinitialize everything before exiting program, otherwise crashes are possible. For now that's the only real option for us. I don't like it because it's really easy to forget closing some long-living objects (when you typically don't rely on defer), and if this will cause segfaults in user programs, it would be really annoying.

  2. Modify the C++ library to avoid invocation of user-registered callbacks in global destructors. That's what I plan to do, however it's not a complete fix for the problem. C++ lib also runs various background threads, and those threads may attempt to invoke registered callbacks too, and they don't know that Go runtime is already being deinitialized.

  3. If Go had public API for exit hooks, we could use it to properly deinitialize native objects or at least notify the C++ library to stop using registered callbacks. For me this seem like the most clean solution. Are there any plans to make exit hooks API public?

  4. If Go was invoking atexit handlers before deinitializing runtime (or whatever bad happens here), we could use them from the C++ side to set a flag that we should stop using callbacks. It would be partial solution (we still can't clearly deinitialize everything), but it would fix the crash.

    runtime/cgo: call C exit to allow global destructors/atexit to run #20713
    cmd/cgo: atexit handlers not run #5948

  5. Maybe it's possible to fix or work around this on cgo side: either fix crash if it's not really legitimate or maybe make callbacks no-op if they're called too late (arguable).

I guess this issue can be classified both as a bug report, or a feature request, or a wontfix, depending on the point of view.

From bindings maintainer point of view, I would put it like this:

  • if the crash is legitimate in this case and it not going to be fixed, please provide us a mechanism to avoid it (e.g. using exit hooks or atexit handlers)
  • if the crash in not legitimate, then I guess we just have to wait until it's fixed eventually

Thanks.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 19, 2023
@ianlancetaylor
Copy link
Contributor

Exit hooks are hard to implement in Go. See the "CAREFUL" comment on runtime.addExitHook.

Moreover, exit hooks are historically a source of complexity and problems in other languages. See the C++ movement over the years of exit to atexit to std::quick_exit to std::at_quick_exit. There are good reasons for each step, but Go takes a different approach: don't start taking those steps at all.

So, no, we aren't going to add exit hooks.

The exact failure that you are encountering is not clear to me. It looks like the call is happening on the wrong stack, but I don't see why that would be. is there a way that we can recreate the problem? It may indeed be impossible to fix, but let's try to pin down what is happening.

@gavv
Copy link
Author

gavv commented Apr 19, 2023

is there a way that we can recreate the problem?

Yes, see this part of the issue: I've prepared a minimal example that can reproduce it: https://github.com/gavv/cgo_racedetector_crash and below. You can clone that repo and just run make.

@ianlancetaylor
Copy link
Contributor

Thanks. Sorry for missing that.

Because the race detector is enabled, os.Exit calls racefini calls the C function exit which runs the global destructors. That doesn't happen for an ordinary call to os.Exit.

The global destructor then turns around and calls back into Go. But the Go runtime is not expecting that. Normally a callback into Go occurs either on a thread created by Go that has called into C which is then calling back into Go, or a thread not created by Go, which is calling a Go function. In this case, though, we have a thread created by Go that has not called into C, at least not in the usual way, so the Go callback runs into a thread not prepared to receive it. Hence the crash, which is basically a sanity check.

Working on a patch.

@gopherbot
Copy link

Change https://go.dev/cl/486615 mentions this issue: runtime: in __tsan_fini tell scheduler we are entering non-Go code

@gavv
Copy link
Author

gavv commented Apr 20, 2023

Thanks for the patch and the explanations!

A little question: why destructors are invoked under race detector, but seems they're not without it? At least in my experiments.

BTW I think you can simplify the test in the patch a bit by removing destructorFn and registerDestructor and just hardcoding GoDestructorCallback call (I added them because I used shared library).

Also, probably it would make sense to print OK from GoDestructorCallback? So that the test would pass only if the destructor is really called.

@prattmic prattmic added the NeedsFix The path to resolution is known, but the work has not been done. label Apr 21, 2023
@prattmic prattmic added this to the Backlog milestone Apr 21, 2023
@ianlancetaylor
Copy link
Contributor

@gavv

A little question: why destructors are invoked under race detector, but seems they're not without it? At least in my experiments.

With the current toolchain, if you call os.Exit when not using the race detector, the Go code simply calls the exit system call. Destructors are not run, and neither are atexit functions. Maybe that's a bug, maybe not. When you do use the race detector, then the os.Exit calls into the race detector so that it can report any final problems and exit with the correct status. That calls the C function exit, which does run destructors.

On this general topic see #20713.

BTW I think you can simplify the test in the patch a bit by removing destructorFn and registerDestructor and just hardcoding GoDestructorCallback call (I added them because I used shared library).

The single program testprogcgo includes a bunch of different tests. Which one is run depends on the command line argument. I copied your registration because I didn't want the destructor to always do a Go callback, as that might confuse an unrelated test.

Also, probably it would make sense to print OK from GoDestructorCallback? So that the test would pass only if the destructor is really called.

As noted above the destructor is not always called.

@gavv
Copy link
Author

gavv commented Apr 21, 2023

@ianlancetaylor I see, thanks!

@gopherbot
Copy link

Change https://go.dev/cl/487635 mentions this issue: runtime: call entersyscall in AIX and Solaris exit function

@gopherbot
Copy link

Change https://go.dev/cl/487955 mentions this issue: runtime: add raceFiniLock to lock ranking

gopherbot pushed a commit that referenced this issue Apr 24, 2023
This is the AIX and Solaris equivalent of CL 269378.

On AIX and Solaris, where we use libc for syscalls, when the runtime exits,
it calls the libc exit function, which may call back into user code,
such as invoking functions registered with atexit. In particular, it
may call back into Go. But at this point, the Go runtime is
already exiting, so this wouldn't work.

On non-libc platforms we use exit syscall directly, which doesn't
invoke any callbacks. Use _exit on AIX and Solaris to achieve the same
behavior.

Test is TestDestructorCallback.

For #59711

Change-Id: I666f75538d3e3d8cf3b697b4c32f3ecde8332890
Reviewed-on: https://go-review.googlesource.com/c/go/+/487635
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
gopherbot pushed a commit that referenced this issue Apr 24, 2023
Also preserve the PC/SP in reentersyscall when doing lock ranking.
The test is TestDestructorCallbackRace with the staticlockranking
experiment enabled.

For #59711

Change-Id: I87ac1d121ec0d399de369666834891ab9e7d11b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/487955
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
@golang golang locked and limited conversation to collaborators Apr 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

4 participants