Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugin: intermittent SIGSEGV with plugins that use multiple goroutines with cgo on ppc64le #25756

Closed
laboger opened this issue Jun 6, 2018 · 10 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@laboger
Copy link
Contributor

laboger commented Jun 6, 2018

What version of Go are you using (go version)?

tip, go 1.10 latest

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

ppc64le

A problem was reported to me by someone trying to build an application that uses plugins which call C code through cgo. They were building and running in a container with many other dependencies but I was able to reproduce it in a smaller program. Their application works on x86 and also if they replace some of their Go code with C code instead of using plugins on ppc64le.

Based on panic traces and objdumps of other plugin objects, I can tell the segv occurs when trying to access data based on r2 within the plugin object. I am still debugging but at this point but it appears that on some path the wrong value of r2 is being restored, possibly after switching a goroutine or thread. I've been trying to debug this using gdb but that seems to also introduce some random behavior.

I will keep debugging my reproducer testcase and will post more information later, but wanted to open this in case there is any ideas on what might be happening.

I think my testcase is correct because it runs successfully on x86.

@bcmills
Copy link
Contributor

bcmills commented Jun 7, 2018

CC: @ianlancetaylor and @cherrymui for plugin.

More precise details would certainly be helpful. 🙂

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 7, 2018
@bcmills bcmills added this to the Go1.11 milestone Jun 7, 2018
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 7, 2018
@bcmills bcmills changed the title plugin: segv occurs intermittently when running program with plugins that use cgo on ppc64le plugin: intermittent SIGSEGV with plugins that use cgo on ppc64le Jun 7, 2018
@laboger
Copy link
Contributor Author

laboger commented Jun 7, 2018

I've been working on a reproducer and trying to understand what causes it to fail since it is intermittent and gdb causes different behavior. But I think I have narrowed it down to just plugins, the original problem used cgo but I have found cgo is not necessary to cause the issue.

In my reproducer, the main program calls a plugin which does some work with multiple goroutines. The testcase can be run multiple times without error but sometimes it gets a SIGSEGV on an instruction that depends on the value in r2 and it is obviously wrong. I think part of the problem is that when plugins are used, the main program is not compiled as PIC or at least it does not generate r2 or save it on the stack. The plugin code has been compiled to generate r2 and save it at 24(r1). However many of the functions in asm_ppc64x.s are expecting that 24(r1) contains the TOC address for r2 and loads it before returning back to its caller.

In theory I think this shoudl be OK because when main calls code in the plugin, it should generate the r2 value and save it. If it calls other functions in the plugin object, it might not regenerate r2 but should always save it on the stack. But obviously there is some scenario where it is reloading a bad r2.

I have a fix that seems to prevent the segv and a testcase to reproduce the problem and will post that shortly.

@laboger laboger changed the title plugin: intermittent SIGSEGV with plugins that use cgo on ppc64le plugin: intermittent SIGSEGV with plugins that use multiple goroutines on ppc64le Jun 7, 2018
@bcmills
Copy link
Contributor

bcmills commented Jun 7, 2018

Hmm, is this possibly related to the bug @ianlancetaylor addressed in https://golang.org/cl/43051?

@laboger
Copy link
Contributor Author

laboger commented Jun 7, 2018

Yes it is related because there are two instructions at the beginning of functions to compute the value for r2 when generating code in an object where it is needed, but if the linker knows it can skip those instructions, it will call function+8 to avoid them.

This problem has to do with ensuring that the value of r2 is always saved on the stack at r1+24 if it might need to be restored at some later point. What makes this case tricky is that some of the code is initializing and saving r2 (the plugin) and some isn't (main code calling plugin). I believe r2 only needs to saved if it is going to be needed so usually doesn't cause a problem but there must be some path that is bad.

@laboger
Copy link
Contributor Author

laboger commented Jun 8, 2018

Sorry I was wrong yesterday, it doesn't seem to reproduce unless the goroutines called by the plugin also use cgo. Some flaw in my testing process apparently. I will post the CL once the linker bug is fixed which breaks ppc64x testing.

@laboger laboger changed the title plugin: intermittent SIGSEGV with plugins that use multiple goroutines on ppc64le plugin: intermittent SIGSEGV with plugins that use multiple goroutines with cgo on ppc64le Jun 8, 2018
@gopherbot
Copy link

Change https://golang.org/cl/117515 mentions this issue: runtime: restore r2 whn restoring state from gobuf in gogo on ppc64x

@laboger
Copy link
Contributor Author

laboger commented Jun 8, 2018

Please consider this for backport to go 1.10, it is a serious bug.

@bcmills
Copy link
Contributor

bcmills commented Jun 8, 2018

@gopherbot, please backport to 1.10.

@gopherbot
Copy link

Backport issue(s) opened: #25800 (for 1.10).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/117915 mentions this issue: runtime: restore r2 when restoring state from gobuf in gogo on ppc64x

ceseo pushed a commit to powertechpreview/go that referenced this issue Aug 13, 2018
When using plugins with goroutines calling cgo, we hit a case where
an intermittent SIGSEGV occurs when referencing an address that is based
on r2 (TOC address). When the failure can be generated in gdb, the
contents of r2 is wrong even though the value in the current stack's
slot for r2 is correct. So that means it somehow switched to start
running the code in this function without passing through the beginning
of the function which had the correct value of r2 and stored it there.

It was noted that in runtime.gogo when the state is restored from
gobuf, r2 is not restored from its slot on the stack. Adding the
instruction to restore r2 prevents the SIGSEGV.

This adds a testcase under testplugin which reproduces the problem if
the program is run multiple times. The team who reported this problem
has verified it fixes the issue on their larger, more complex
application.

Fixes golang#25756

Change-Id: I6028b6f1f8775d5c23f4ebb57ae273330a28eb8f
Reviewed-on: https://go-review.googlesource.com/117515
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Jun 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

3 participants