
plugin: SIGSEGV with concurrent GC during plugin load #17455

Closed
aclements opened this issue Oct 15, 2016 · 6 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.


What version of Go are you using (go version)?

Current master (86b2f29)

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/austin/r/go"
GORACE=""
GOROOT="/home/austin/go.dev"
GOTOOLDIR="/home/austin/go.dev/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build566905877=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

cd misc/cgo/testplugin
GOGC=1 ./test.bash

What did you expect to see?

PASS

What did you see instead?

fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x479ebb]

goroutine 19 [running]:
runtime.throw(0x53483e, 0x2a)
    /home/austin/go.dev/src/runtime/panic.go:587 +0x95 fp=0xc42002cd98 sp=0xc42002cd78
runtime.sigpanic()
    /home/austin/go.dev/src/runtime/signal_unix.go:253 +0x2db fp=0xc42002cde8 sp=0xc42002cd98
runtime.scanblock(0x7fa095ab3020, 0x11b0, 0x0, 0xc420027228)
    /home/austin/go.dev/src/runtime/mgcmark.go:1108 +0x5b fp=0xc42002ce60 sp=0xc42002cde8
runtime.markrootBlock(0x7fa095ab3020, 0x11b0, 0x0, 0xc420027228, 0x0)
    /home/austin/go.dev/src/runtime/mgcmark.go:280 +0x7c fp=0xc42002ce90 sp=0xc42002ce60
runtime.markroot(0xc420027228, 0x3)
    /home/austin/go.dev/src/runtime/mgcmark.go:164 +0xc8 fp=0xc42002cf18 sp=0xc42002ce90
runtime.gcDrain(0xc420027228, 0x5)
    /home/austin/go.dev/src/runtime/mgcmark.go:977 +0x8f fp=0xc42002cf48 sp=0xc42002cf18
runtime.gcBgMarkWorker(0xc420026000)
    /home/austin/go.dev/src/runtime/mgc.go:1468 +0x1ca fp=0xc42002cfb8 sp=0xc42002cf48
runtime.goexit()
    /home/austin/go.dev/src/runtime/asm_amd64.s:2158 +0x1 fp=0xc42002cfc0 sp=0xc42002cfb8
created by runtime.gcBgMarkStartWorkers
    /home/austin/go.dev/src/runtime/mgc.go:1357 +0x98

goroutine 1 [runnable]:
plugin.lastmoduleinit(0x1587400)
    /home/austin/go.dev/src/runtime/plugin.go:39 +0x16a
plugin.open(0x52f37c, 0x7, 0x0, 0x0, 0x0)
    /home/austin/go.dev/src/plugin/plugin_dlopen.go:73 +0x367
plugin.Open(0x52f37c, 0xa, 0x3, 0x2, 0xc420040e08)
    /home/austin/go.dev/src/plugin/plugin.go:28 +0x35
main.main()
    /home/austin/go.dev/misc/cgo/testplugin/src/host/host.go:24 +0x5c

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /home/austin/go.dev/src/runtime/asm_amd64.s:2158 +0x1
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x479ebb]

goroutine 21 [running]:
runtime.throw(0x53483e, 0x2a)
    /home/austin/go.dev/src/runtime/panic.go:587 +0x95 fp=0xc42002dd98 sp=0xc42002dd78
runtime.sigpanic()
    /home/austin/go.dev/src/runtime/signal_unix.go:253 +0x2db fp=0xc42002dde8 sp=0xc42002dd98
runtime.scanblock(0x7fa095ab4920, 0x1a590, 0x0, 0xc420029828)
    /home/austin/go.dev/src/runtime/mgcmark.go:1108 +0x5b fp=0xc42002de60 sp=0xc42002dde8
runtime.markrootBlock(0x7fa095ab4920, 0x1a590, 0x0, 0xc420029828, 0x0)
    /home/austin/go.dev/src/runtime/mgcmark.go:280 +0x7c fp=0xc42002de90 sp=0xc42002de60
runtime.markroot(0xc420029828, 0x4)
    /home/austin/go.dev/src/runtime/mgcmark.go:169 +0x153 fp=0xc42002df18 sp=0xc42002de90
runtime.gcDrain(0xc420029828, 0x5)
    /home/austin/go.dev/src/runtime/mgcmark.go:977 +0x8f fp=0xc42002df48 sp=0xc42002df18
runtime.gcBgMarkWorker(0xc420028600)
    /home/austin/go.dev/src/runtime/mgc.go:1468 +0x1ca fp=0xc42002dfb8 sp=0xc42002df48
runtime.goexit()
    /home/austin/go.dev/src/runtime/asm_amd64.s:2158 +0x1 fp=0xc42002dfc0 sp=0xc42002dfb8
created by runtime.gcBgMarkStartWorkers
    /home/austin/go.dev/src/runtime/mgc.go:1357 +0x98

This is crashing in the GC's data segment scan. The problem is almost certainly that the plugin's moduledata has been linked into the module data list, but md.gcdatamask hasn't been initialized yet, so if a concurrent GC tries to scan the data segment of the module at just the wrong time, it will crash. I wasn't able to figure out where it gets linked into the list, but I can think of two potential solutions:

  1. Don't link the module data into the list until everything that could be concurrently accessed is fully initialized.
  2. Stop the world around the plugin loading process.

The first is preferable if possible, since attempting to stop the world will delay plugin loading during a concurrent GC cycle until the cycle ends.

@aclements aclements added this to the Go1.8 milestone Oct 15, 2016
crawshaw commented Oct 16, 2016

Your preferable solution is possible, but a bit involved. The module data is put on the linked list by runtime.addmoduledata, called from an ELF .init_array. It's called with the C calling convention and doesn't have the Go machinery set up to call progToPointerMask. I think the easiest thing to do would be to have addmoduledata put the module on a pending linked list, and then have plugin_lastmoduleinit initialize gcdatamask/gcbssmask, then move the module to the main linked list.

But a (somewhat depressing) question before that: if we are concurrently reading/writing moduledata.next, does that mean we should be atomically loading/storing it?

If so, then I think that while I'll keep the pending data structure used by addmoduledata as a linked list, the list of active modules should turn into a global slice, so all the sites where we iterate through modules only incur a single atomic load.

crawshaw commented:

Oh and thanks for analyzing the crash. When I wrote plugin_lastmoduleinit I saw the gcdatamask/gcbssmask uses in mbitmap.go and assumed late initialization was fine because there were no active data pointers yet. But I completely missed the data root loading in gcMarkRootPrepare.

aclements (author) commented:

> Your preferable solution is possible, but a bit involved. The module data is put on the linked list by runtime.addmoduledata, called from an ELF .init_array. It's called with the C calling convention and doesn't have the Go machinery set up to call progToPointerMask.

Ah, no wonder I couldn't find it. Guru needs to learn assembly. :)

> I think the easiest thing to do would be to have addmoduledata put the module on a pending linked list, and then have plugin_lastmoduleinit initialize gcdatamask/gcbssmask, then move the module to the main linked list.

That seems reasonable. Do you even need the list? It looks like there can only be one pending module load right now (enforced by pluginsMu and assumed by lastmoduleinit).

> But a (somewhat depressing) question before that: if we are concurrently reading/writing moduledata.next, does that mean we should be atomically loading/storing it?

Well, we need to be doing something. I suspect what you proposed above would be fine on x86 as long as the compiler isn't doing any real memory reordering. On a weak machine or if the compiler gets more aggressive with reordering, there are three hazards:

  1. The initializing writes to the moduledata could be reordered to after the write that adds it to the list, making it possible to observe an uninitialized moduledata (the read side is safe because the reads are dependent).
  2. The write to add it to the list could be reordered after some other write from the module that causes another CPU to walk the list, causing the other CPU to miss the module, even though it saw its effects.
  3. The read-side equivalent of 2: reading the list could be reordered before some read that observed some effect of the module.

These could be solved by always accessing the list with atomics, but that's pretty annoying. Hazard 1 could be solved by making just the store to next be atomic, or by using the runtime's publicationBarrier function between the last write to the moduledata and adding it to the list.

Your global slice idea may be the way to go for 2 and 3 (and may address 1 by side effect), though note that you can't read or write a slice value atomically (since it's three words), so it can't actually be a slice, per se. You could "manually" store the pointer and length, though that would require two (carefully ordered) loads at each loop. Perhaps the best thing would be to store it as a pointer to an array with the convention that there's always a nil pointer after the last *moduledata in the array. The loops would then atomically load just this one array pointer and iterate until they reached a nil. To keep the type system happy and the loops simple, you could give the array some ridiculous bound with the knowledge that you'll never go past the nil.

For completeness, a few other solutions come to mind, but I think I like the array the best:

  1. If the list were stored in reverse, so you were adding new modules to the beginning, you would only need to access the head of the list with an atomic. Since the list is usually 1 long, maybe this is still a loss, though uncontended atomic loads aren't much more expensive than regular loads (on x86 it's literally the same instruction). It looks like most loops have to go over all of the modules anyway, but some important ones don't, like bulkBarrierBitmap, so this would probably slow those down in the common case where firstmoduledata is what they're looking for.
  2. After adding the moduledata to the list, use forEachP with an empty function to make the write globally visible by creating a global happens-before relation. Unfortunately, while this isn't as big of a hammer as stopping the world, it requires acquiring worldsema, which will still block until a concurrent GC cycle is over.
  3. Make a convincing argument that 2 and 3 aren't a problem, or are only a problem if the program is racy anyway. It's possible that anything where another CPU really needs to observe the module in the list also creates a happens-before edge.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 21, 2016
crawshaw commented:

Thanks for the thoughts.

There are two uses of the moduledata linked list and addmoduledata. In -buildmode=plugin there is only ever one pending module, but in -buildmode=shared all modules are loaded by ld.so before program initialization, so in that case there can be any number of pending modules. We could separate the two paths, but I would like to minimize architecture-dependent code in cmd/link, and the code to call addmoduledata is very arch-dependent.

I've cut a CL that I believe introduces an array to take care of this, and run all.bash a few times on darwin. I don't have a better testing strategy for this, as I haven't been able to replicate the crash.

aclements (author) commented:

> so in that case there can be any number of pending modules.

Ah, okay.

> I don't have a better testing strategy for this, as I haven't been able to replicate the crash.

I can reproduce it 100% of the time on linux/amd64. I'm happy to test.

gopherbot commented:

CL https://golang.org/cl/32357 mentions this issue.

@golang golang locked and limited conversation to collaborators Nov 1, 2017