New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plugin: intermittent SIGSEGV with plugins that use multiple goroutines with cgo on ppc64le #25756
Comments
CC: @ianlancetaylor and @cherrymui for plugin. More precise details would certainly be helpful. 🙂 |
I've been working on a reproducer and trying to understand what causes it to fail since it is intermittent and gdb causes different behavior. But I think I have narrowed it down to just plugins, the original problem used cgo but I have found cgo is not necessary to cause the issue. In my reproducer, the main program calls a plugin which does some work with multiple goroutines. The testcase can be run multiple times without error but sometimes it gets a SIGSEGV on an instruction that depends on the value in r2 and it is obviously wrong. I think part of the problem is that when plugins are used, the main program is not compiled as PIC or at least it does not generate r2 or save it on the stack. The plugin code has been compiled to generate r2 and save it at 24(r1). However many of the functions in asm_ppc64x.s are expecting that 24(r1) contains the TOC address for r2 and loads it before returning back to its caller. In theory I think this shoudl be OK because when main calls code in the plugin, it should generate the r2 value and save it. If it calls other functions in the plugin object, it might not regenerate r2 but should always save it on the stack. But obviously there is some scenario where it is reloading a bad r2. I have a fix that seems to prevent the segv and a testcase to reproduce the problem and will post that shortly. |
Hmm, is this possibly related to the bug @ianlancetaylor addressed in https://golang.org/cl/43051? |
Yes it is related because there are two instructions at the beginning of functions to compute the value for r2 when generating code in an object where it is needed, but if the linker knows it can skip those instructions, it will call function+8 to avoid them. This problem has to do with ensuring that the value of r2 is always saved on the stack at r1+24 if it might need to be restored at some later point. What makes this case tricky is that some of the code is initializing and saving r2 (the plugin) and some isn't (main code calling plugin). I believe r2 only needs to saved if it is going to be needed so usually doesn't cause a problem but there must be some path that is bad. |
Sorry I was wrong yesterday, it doesn't seem to reproduce unless the goroutines called by the plugin also use cgo. Some flaw in my testing process apparently. I will post the CL once the linker bug is fixed which breaks ppc64x testing. |
Change https://golang.org/cl/117515 mentions this issue: |
Please consider this for backport to go 1.10, it is a serious bug. |
@gopherbot, please backport to 1.10. |
Backport issue(s) opened: #25800 (for 1.10). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Change https://golang.org/cl/117915 mentions this issue: |
When using plugins with goroutines calling cgo, we hit a case where an intermittent SIGSEGV occurs when referencing an address that is based on r2 (TOC address). When the failure can be generated in gdb, the contents of r2 is wrong even though the value in the current stack's slot for r2 is correct. So that means it somehow switched to start running the code in this function without passing through the beginning of the function which had the correct value of r2 and stored it there. It was noted that in runtime.gogo when the state is restored from gobuf, r2 is not restored from its slot on the stack. Adding the instruction to restore r2 prevents the SIGSEGV. This adds a testcase under testplugin which reproduces the problem if the program is run multiple times. The team who reported this problem has verified it fixes the issue on their larger, more complex application. Fixes golang#25756 Change-Id: I6028b6f1f8775d5c23f4ebb57ae273330a28eb8f Reviewed-on: https://go-review.googlesource.com/117515 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
What version of Go are you using (
go version
)?tip, go 1.10 latest
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (
go env
)?ppc64le
A problem was reported to me by someone trying to build an application that uses plugins which call C code through cgo. They were building and running in a container with many other dependencies but I was able to reproduce it in a smaller program. Their application works on x86 and also if they replace some of their Go code with C code instead of using plugins on ppc64le.
Based on panic traces and objdumps of other plugin objects, I can tell the segv occurs when trying to access data based on r2 within the plugin object. I am still debugging but at this point but it appears that on some path the wrong value of r2 is being restored, possibly after switching a goroutine or thread. I've been trying to debug this using gdb but that seems to also introduce some random behavior.
I will keep debugging my reproducer testcase and will post more information later, but wanted to open this in case there is any ideas on what might be happening.
I think my testcase is correct because it runs successfully on x86.
The text was updated successfully, but these errors were encountered: