|
|
Descriptionruntime: keep g->syscallsp consistent after cgo->Go callbacks
Normally, the caller to runtime.entersyscall() must not return before
calling runtime.exitsyscall(), lest g->syscallsp become a dangling
pointer. runtime.cgocallbackg() violates this constraint. To work around
this, save g->syscallsp and g->syscallpc around cgo->Go callbacks, then
restore them after calling runtime.entersyscall(), which restores the
syscall stack frame pointer saved by cgocall. This allows the GC to
correctly trace a goroutine that is currently returning from a
Go->cgo->Go chain.
This also adds a check to proc.c that panics if g->syscallsp is clearly
invalid. It is not 100% foolproof, as it will not catch a case where the
stack was popped then pushed back beyond g->syscallsp, but it does catch
the present cgo issue and makes existing tests fail without the bugfix.
Fixes issue 7978.
Patch Set 1 #Patch Set 2 : diff -r 71db3dc120afee10f9b0e8e91bd0f0d4ba97bcd7 https://code.google.com/p/go #
Total comments: 4
Patch Set 3 : diff -r 71db3dc120afee10f9b0e8e91bd0f0d4ba97bcd7 https://code.google.com/p/go #
Total comments: 9
Patch Set 4 : diff -r 601ad8b625532388ff9dee86409bf32c1f2f624a https://code.google.com/p/go #
Total comments: 2
Patch Set 5 : diff -r 601ad8b625532388ff9dee86409bf32c1f2f624a https://code.google.com/p/go #
Total comments: 8
Patch Set 6 : diff -r 769430bdffb74d3c36b14517b2b6972938b9428c https://code.google.com/p/go #
Total comments: 6
Patch Set 7 : diff -r b18ebcb9f2367d52af8c3515dee63888ca96db70 https://code.google.com/p/go #
Total comments: 9
Patch Set 8 : diff -r 6b163ec2122a172030284060788f535ab3b9d0e3 https://code.google.com/p/go #Patch Set 9 : diff -r 6aa1064ad5644251c889157d948c0bec255d5984 https://code.google.com/p/go #Patch Set 10 : diff -r 932fe22207465e6c4bcdae29f5c519ba069f8927 https://code.google.com/p/go #
Total comments: 2
MessagesTotal messages: 46
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://code.google.com/p/go
Sign in to reply to this message.
This change is incomplete: I forgot that GC can already run after entersyscall(), and might interrupt the restore of syscallsp and syscallpc. I've just gotten a crash that can be attributed to the GC running right between those two lines, with syscallsp and syscallpc in an inconsistent state. It's also possible that the GC might pick up syscallsp and syscallcp before cgocallbackg has a chance to restore them, and then trace the stack after cgocallbackg has already returned and invalidated that stack frame. I'm still just getting to grips with the Go scheduler. What's the right way to lock the GC around the call to entersyscall and the subsequent syscallsp/pc updates? g->m->locks++? At this point I'd rather ask for advice before I go off and do something silly.
Sign in to reply to this message.
How does it currently manifest itself? Stack trace in callback is incorrect? If so, please add a test that obtains and verifies stack trace in callback.
Sign in to reply to this message.
On 2014/08/19 15:53:28, dvyukov wrote: > How does it currently manifest itself? Stack trace in callback is incorrect? If > so, please add a test that obtains and verifies stack trace in callback. The bug before or after the incomplete patch? Before the patch, it causes occasional crashes when the GC happens to run while another goroutine is running cgo code after returning from a cgo->Go callback, as described in issue 7978. This is caught by the check that I added in proc.c - after that check, all cgo to Go callbacks reliably cause a panic, and existing tests fail. I guess it could also be checked by an explicit test that requests a traceback in cgo code after a callback, but as it is the existing tests already fail due to the extra sanity check. After the incomplete patch (the change to cgocall.c), reproducing what's left of the bug is much harder because at that point it's a race condition that doesn't involve user code. I had to leave a load tester running against my server for an hour or two before it died.
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c File src/pkg/runtime/cgocall.c (right): https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c... src/pkg/runtime/cgocall.c:237: uintptr saved_sp, saved_pc; we generally don't use underscores in identifiers please name them savedpc/savedsp https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c#ne... src/pkg/runtime/proc.c:1629: runtime·throw("runtime: syscall frame is no longer valid"); throw already prepends "runtime error" to messages, we usually include function name in the message: "exitsyscall: syscall frame is invalid"
Sign in to reply to this message.
thanks for this this you need to sign CLA as described in contribution guidelines
Sign in to reply to this message.
could this be tested? ideally, i'd like test(s) to be added to misc/cgo/test before submiting this fix.
Sign in to reply to this message.
runtime.GC from a cgo callback should trigger the bug today On Thu, Aug 21, 2014 at 8:03 AM, minux <minux@golang.org> wrote: > could this be tested? > ideally, i'd like test(s) to be added to misc/cgo/test before submiting this > fix. > > -- > You received this message because you are subscribed to the Google Groups > "golang-codereviews" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-codereviews+unsubscribe@googlegroups.com. > For more options, visit https://groups.google.com/d/optout.
Sign in to reply to this message.
PTAL. I've refactored runtime.entersyscall and added runtime.reentersyscall, which takes a syscallpc/sp to use from the caller. This is a bit more intrusive but fixes the remaining race. On 2014/08/20 11:47:19, dvyukov wrote: > you need to sign CLA as described in contribution guidelines I already did prior to opening the code review; I guess it hasn't been processed yet. On 2014/08/21 06:13:09, dvyukov wrote: > runtime.GC from a cgo callback should trigger the bug today Not from a callback, since then g wouldn't be in the syscall state. You'd have to call runtime.GC (or otherwise request a traceback) from *C* code *after* it has called a Go callback (and that callback has returned). That's the point at which entersyscall() has been called from a no-longer-valid stack frame. But I don't think you can call the GC from g0, so that wouldn't work... The runtime.panic() that I added to exitsyscall catches this on all cgo callbacks though, making existing tests fail. Try patching in the CL and replacing runtime·reentersyscall(...) with runtime·entersyscall() (undoing that part of the change). You get this: # ../misc/cgo/test fatal error: exitsyscall: syscall frame is no longer valid goroutine 1 [syscall, locked to thread]: runtime: unexpected return pc for runtime.cgocallbackg called from 0x90ac7d runtime.cgocallbackg() /home/marcansoft/go-src/go/src/pkg/runtime/cgocall.c:255 +0x6f fp=0xc208019e48 sp=0xc208019e28 goroutine 4 [finalizer wait]: runtime.park(0x425c20, 0x918188, 0x909dd4) /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1407 +0x9b runtime.parkunlock(0x918188, 0x909dd4) /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1423 +0x3d runfinq() /home/marcansoft/go-src/go/src/pkg/runtime/mgc0.c:1656 +0xd2 runtime.goexit() /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1476 goroutine 5 [syscall]: os/signal.loop() /home/marcansoft/go-src/go/src/pkg/os/signal/signal_unix.go:21 +0x1e created by os/signal.init·1 /home/marcansoft/go-src/go/src/pkg/os/signal/signal_unix.go:27 +0x34 goroutine 17 [syscall, locked to thread]: runtime.goexit() /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1476 goroutine 6 [syscall, locked to thread]: _/home/marcansoft/go-src/go/misc/cgo/test._Cfunc_usleep(0x2710, 0x0) _/home/marcansoft/go-src/go/misc/cgo/test/_test/_obj_test/_cgo_defun.c:883 +0x33 created by _/home/marcansoft/go-src/go/misc/cgo/test.lockOSThreadCallback /home/marcansoft/go-src/go/misc/cgo/test/issue3775.go:35 +0x39 exit status 2 FAIL _/home/marcansoft/go-src/go/misc/cgo/test 0.003s Is that sufficient or do you want a test that explicitly requests a traceback from C code and checks it for validity? https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c File src/pkg/runtime/cgocall.c (right): https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c... src/pkg/runtime/cgocall.c:237: uintptr saved_sp, saved_pc; On 2014/08/20 11:46:40, dvyukov wrote: > we generally don't use underscores in identifiers > please name them savedpc/savedsp Done. https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c#ne... src/pkg/runtime/proc.c:1629: runtime·throw("runtime: syscall frame is no longer valid"); On 2014/08/20 11:46:40, dvyukov wrote: > throw already prepends "runtime error" to messages, we usually include function > name in the message: > "exitsyscall: syscall frame is invalid" Done.
Sign in to reply to this message.
On Thu, Aug 21, 2014 at 10:46 AM, <hector@marcansoft.com> wrote: > PTAL. I've refactored runtime.entersyscall and added > runtime.reentersyscall, which takes a syscallpc/sp to use from the > caller. This is a bit more intrusive but fixes the remaining race. What is the remaining race? Is it when traceback is requested while you restore savedpc/sp? > On 2014/08/20 11:47:19, dvyukov wrote: >> >> you need to sign CLA as described in contribution guidelines > > > I already did prior to opening the code review; I guess it hasn't been > processed yet. > > > On 2014/08/21 06:13:09, dvyukov wrote: >> >> runtime.GC from a cgo callback should trigger the bug today > > > Not from a callback, since then g wouldn't be in the syscall state. > You'd have to call runtime.GC (or otherwise request a traceback) from > *C* code *after* it has called a Go callback How is that possible? How did you hit this issue? > (and that callback has > returned). That's the point at which entersyscall() has been called > from a no-longer-valid stack frame. But I don't think you can call the > GC from g0, so that wouldn't work... > > The runtime.panic() that I added to exitsyscall catches this on all cgo > callbacks though, making existing tests fail. Try patching in the CL and > replacing runtime·reentersyscall(...) with runtime·entersyscall() > (undoing that part of the change). You get this: > > # ../misc/cgo/test > fatal error: exitsyscall: syscall frame is no longer valid > > goroutine 1 [syscall, locked to thread]: > runtime: unexpected return pc for runtime.cgocallbackg called from > 0x90ac7d > runtime.cgocallbackg() > /home/marcansoft/go-src/go/src/pkg/runtime/cgocall.c:255 +0x6f > fp=0xc208019e48 sp=0xc208019e28 > > goroutine 4 [finalizer wait]: > runtime.park(0x425c20, 0x918188, 0x909dd4) > /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1407 +0x9b > runtime.parkunlock(0x918188, 0x909dd4) > /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1423 +0x3d > runfinq() > /home/marcansoft/go-src/go/src/pkg/runtime/mgc0.c:1656 +0xd2 > runtime.goexit() > /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1476 > > goroutine 5 [syscall]: > os/signal.loop() > /home/marcansoft/go-src/go/src/pkg/os/signal/signal_unix.go:21 > +0x1e > created by os/signal.init·1 > /home/marcansoft/go-src/go/src/pkg/os/signal/signal_unix.go:27 > +0x34 > > goroutine 17 [syscall, locked to thread]: > runtime.goexit() > /home/marcansoft/go-src/go/src/pkg/runtime/proc.c:1476 > > goroutine 6 [syscall, locked to thread]: > _/home/marcansoft/go-src/go/misc/cgo/test._Cfunc_usleep(0x2710, 0x0) > > _/home/marcansoft/go-src/go/misc/cgo/test/_test/_obj_test/_cgo_defun.c:883 > +0x33 > created by > _/home/marcansoft/go-src/go/misc/cgo/test.lockOSThreadCallback > /home/marcansoft/go-src/go/misc/cgo/test/issue3775.go:35 +0x39 > exit status 2 > FAIL _/home/marcansoft/go-src/go/misc/cgo/test 0.003s > > Is that sufficient or do you want a test that explicitly requests a > traceback from C code and checks it for validity? > > > > https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c > File src/pkg/runtime/cgocall.c (right): > > https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/cgocall.c... > src/pkg/runtime/cgocall.c:237: uintptr saved_sp, saved_pc; > On 2014/08/20 11:46:40, dvyukov wrote: >> >> we generally don't use underscores in identifiers >> please name them savedpc/savedsp > > > Done. > > > https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c > File src/pkg/runtime/proc.c (right): > > https://codereview.appspot.com/131910043/diff/60001/src/pkg/runtime/proc.c#ne... > src/pkg/runtime/proc.c:1629: runtime·throw("runtime: syscall frame is no > longer valid"); > On 2014/08/20 11:46:40, dvyukov wrote: >> >> throw already prepends "runtime error" to messages, we usually include > > function >> >> name in the message: >> "exitsyscall: syscall frame is invalid" > > > Done. > > https://codereview.appspot.com/131910043/
Sign in to reply to this message.
On 2014/08/21 07:37:49, dvyukov wrote: > On Thu, Aug 21, 2014 at 10:46 AM, <mailto:hector@marcansoft.com> wrote: > > PTAL. I've refactored runtime.entersyscall and added > > runtime.reentersyscall, which takes a syscallpc/sp to use from the > > caller. This is a bit more intrusive but fixes the remaining race. > > What is the remaining race? Is it when traceback is requested while > you restore savedpc/sp? Yes, if a traceback is requested at any point between when the goroutine is no longer in syscall context and when savedsp/pc have been restored. For example, it could also break if a traceback is requested before savedpc/sp are restored, starts executing, picks up on the old syscallsp/syscallpc values, and *then* savedsp/pc are restored and cgocallbackg returns, invalidating the stack frame that the traceback is trying to trace. The two or three times I saw it crash due to this issue, the evidence pointed to the traceback having used the old syscallpc but the new syscallsp, i.e. it caught it right between those two statements. > >> runtime.GC from a cgo callback should trigger the bug today > > > > > > Not from a callback, since then g wouldn't be in the syscall state. > > You'd have to call runtime.GC (or otherwise request a traceback) from > > *C* code *after* it has called a Go callback > > > How is that possible? How did you hit this issue? Another goroutine was running GC. GC is allowed during syscalls, which includes cgo C code. After cgo C code called a Go callback and that callback returned back to the C code, but before the it returned to its original Go caller, another goroutine started a GC run and got confused by the broken stack. Specifically, the GC happened on the same machine thread but a different goroutine. exitsyscall was trying to acquire a p, fell into the slow path, called the scheduler, yielded to another g, and that g called the GC. It could've happened from another machine thread though, AFAICT. This was just the failure mode I saw. Issue 7978 contains my analysis of what the stack activity looks like as far as I could tell, and where exactly the crash happens.
Sign in to reply to this message.
On 2014/08/21 08:11:15, marcan wrote: > Yes, if a traceback is requested at any point between when the goroutine is no > longer in syscall context and when savedsp/pc have been restored. Er, I meant is *again* in syscall context.
Sign in to reply to this message.
Friendly ping? :-)
Sign in to reply to this message.
On 2014/08/27 01:50:54, marcan wrote: > Friendly ping? :-) Please CC rsc on this change. Russ, please double check this change, this code is super subtle.
Sign in to reply to this message.
update to tip (hg sync) proc.c has changed recently
Sign in to reply to this message.
On 2014/08/27 10:18:46, dvyukov wrote: > On 2014/08/27 01:50:54, marcan wrote: > > Friendly ping? :-) > > Please CC rsc on this change. > Russ, please double check this change, this code is super subtle. On 2014/08/27 10:20:23, dvyukov wrote: > update to tip (hg sync) > proc.c has changed recently Done and done.
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1535: entersyscallcommon(void *arg0) name the parameter argp, this is the name we use throughout the codebase https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1591: } say here that entersyscallcommon does g->m->locks-- https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1606: g->syscallpc = syscallpc; this restores syscallsp/pc, but not syscallstack/guard even if syscallstack/guard are always the same today (?) this is not necessary true in future https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1608: entersyscallcommon(&syscallpc); say here that entersyscallcommon does g->m->locks--
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/cgocall.c File src/pkg/runtime/cgocall.c (right): https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/cgocall.... src/pkg/runtime/cgocall.c:240: void *savedsp; Stack space is scarce here, and linker does not actually ensure that we have enough. cgocallbackg is effectively called from cgocall->asmcgocall->cgocallback_gofunc->cgocallbackg, but linker does not see that call chain, so it can overflow the nosplit area. I've hacked a change that makes linker verify that callchain: https://codereview.appspot.com/129680043/diff/20001/src/pkg/runtime/cgocall.c Fortunately now it fails only with buf size 6: checkcallbackstack: nosplit stack overflow 120 assumed on entry to checkcallbackstack 72 after checkcallbackstack uses 48 64 on entry to runtime.cgocallbackg 56 after runtime.cgocallbackg uses 8 48 on entry to runtime.exitsyscall 40 after runtime.exitsyscall uses 8 32 on entry to exitsyscallfast 8 after exitsyscallfast uses 24 0 on entry to runtime.lock -8 on entry to runtime.morestack8_noctxt So probably it is safe today, but I am not sure that it's still safe after this change. Russ, what buf size do I need to use to model precisely what happens in cgocallback_gofunc? Is it 1? Call of checkcallbackstack corresponds to call of asmcgocall; and then 'uintptr buf[1]' corresponds to saved R8 in cgocallback_gofunc. Is it correct?
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/130001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/130001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1626: save(&syscallpc); this saves the same incorrect pc/sp into g->sched why is it OK?
Sign in to reply to this message.
No test, no guarantees that it continues to work. Coming C->Go cgocall conversion can break it again, the new check in exitsyscall will not necessary prevent the breakage. A real test would be very useful.
Sign in to reply to this message.
On 2014/08/27 17:51:31, dvyukov wrote: > No test, no guarantees that it continues to work. > Coming C->Go cgocall conversion can break it again, the new check in exitsyscall > will not necessary prevent the breakage. > A real test would be very useful. I've added a test, though I'm not very happy with it. It ends up coming down to getting stack traces, parsing through the text (there seems to be no API to get it in a structured form), and checking to make sure that it's correct. Unfortunately the actual crash is very hard to write a test for, since even though syscallsp is already invalidated by the time control returns to the user's c code after a callback, the stack frames haven't been clobbered yet and the stack trace works (and the user can't clobber them since they're on the g stack, not the g0 stack where the C code runs). The crash only happens when exitsyscall after the C code happens to yield when grabbing a P, and _then_ another goroutine schedules and triggers a stack traceback, but I can't see a way of forcing that issue. Also, I ended up using some ugly atomic sync between C and Go, to keep the two goroutines in lockstep and make sure they hit all the interesting cases. At least the hacky stack scraping thing does catch the cgocallback in the (invalid but still unclobbered) stack frame and complains. https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1535: entersyscallcommon(void *arg0) On 2014/08/27 10:58:41, dvyukov wrote: > name the parameter argp, this is the name we use throughout the codebase Done. https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1591: } On 2014/08/27 10:58:41, dvyukov wrote: > say here that entersyscallcommon does g->m->locks-- Done. https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1606: g->syscallpc = syscallpc; On 2014/08/27 10:58:41, dvyukov wrote: > this restores syscallsp/pc, but not syscallstack/guard > even if syscallstack/guard are always the same today (?) this is not necessary > true in future Done, but unfortunately this overflows the nosplit stack area (even without your patch). runtime.cgocallback_gofunc: nosplit stack overflow 120 assumed on entry to runtime.cgocallback_gofunc 112 after runtime.cgocallback_gofunc uses 8 104 on entry to runtime.cgocallbackg 40 after runtime.cgocallbackg uses 64 32 on entry to runtime.exitsyscall 24 after runtime.exitsyscall uses 8 16 on entry to exitsyscallfast -8 after exitsyscallfast uses 24 runtime.cgocallbackg: nosplit stack overflow 120 assumed on entry to runtime.cgocallbackg 56 after runtime.cgocallbackg uses 64 48 on entry to runtime.exitsyscall 40 after runtime.exitsyscall uses 8 32 on entry to exitsyscallfast 8 after exitsyscallfast uses 24 0 on entry to runtime.lock -8 on entry to runtime.morestack8_noctxt I've uploaded the new patchset anyway, although it currently doesn't compile. Suggestions welcome. https://codereview.appspot.com/131910043/diff/120001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1608: entersyscallcommon(&syscallpc); On 2014/08/27 10:58:41, dvyukov wrote: > say here that entersyscallcommon does g->m->locks-- Done. https://codereview.appspot.com/131910043/diff/130001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/130001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1626: save(&syscallpc); On 2014/08/27 14:22:49, dvyukov wrote: > this saves the same incorrect pc/sp into g->sched > why is it OK? I honestly have no idea what the purpose of save() is here. The GC/traceback code that was crashing uses the shadow syscall* members, which is what I fixed. I don't know if anything cares about or attempts to use g->sched during a syscall/Cgo code. I'm completely new to this code, so even though I found the source of the crash I don't have a full understanding of the scheduler.
Sign in to reply to this message.
Sorry for the delays. We are busy with runtime C->Go conversion at the moment. Please file an issue on issue tracker for this. Add a link to this codereview and say that cgo callbacks crash due to this issue.
Sign in to reply to this message.
Issue 7978 is already open for this bug. On 29 August 2014 19:01:25 GMT+09:00, dvyukov@google.com wrote: >Sorry for the delays. We are busy with runtime C->Go conversion at the >moment. Please file an issue on issue tracker for this. Add a link to >this codereview and say that cgo callbacks crash due to this issue. > >https://codereview.appspot.com/131910043/ -- Hector Martin
Sign in to reply to this message.
ah, OK On Fri, Aug 29, 2014 at 4:25 PM, Hector Martin <hector@marcansoft.com> wrote: > Issue 7978 is already open for this bug. > > On 29 August 2014 19:01:25 GMT+09:00, dvyukov@google.com wrote: >>Sorry for the delays. We are busy with runtime C->Go conversion at the >>moment. Please file an issue on issue tracker for this. Add a link to >>this codereview and say that cgo callbacks crash due to this issue. >> >>https://codereview.appspot.com/131910043/ > > -- > Hector Martin
Sign in to reply to this message.
Ping. Could somebody review this? Internal Google users suspect this is the cause of some of their problems. On Fri, Aug 29, 2014 at 5:39 AM, 'Dmitry Vyukov' via golang-codereviews < golang-codereviews@googlegroups.com> wrote: > ah, OK > > On Fri, Aug 29, 2014 at 4:25 PM, Hector Martin <hector@marcansoft.com> > wrote: > > Issue 7978 is already open for this bug. > > > > On 29 August 2014 19:01:25 GMT+09:00, dvyukov@google.com wrote: > >>Sorry for the delays. We are busy with runtime C->Go conversion at the > >>moment. Please file an issue on issue tracker for this. Add a link to > >>this codereview and say that cgo callbacks crash due to this issue. > >> > >>https://codereview.appspot.com/131910043/ > > > > -- > > Hector Martin > > -- > You received this message because you are subscribed to the Google Groups > "golang-codereviews" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-codereviews+unsubscribe@googlegroups.com. > For more options, visit https://groups.google.com/d/optout. >
Sign in to reply to this message.
A few comments. Nice test. https://codereview.appspot.com/131910043/diff/150001/misc/cgo/test/issue7978.go File misc/cgo/test/issue7978.go (right): https://codereview.appspot.com/131910043/diff/150001/misc/cgo/test/issue7978.... misc/cgo/test/issue7978.go:18: while(__atomic_load_n(sync, __ATOMIC_SEQ_CST) != 0); Put the semicolon on the next line here and in the other loops below. https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/cgocall.c File src/pkg/runtime/cgocall.c (right): https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/cgocall.... src/pkg/runtime/cgocall.c:237: uintptr savedpc; Unfortunately this file has now changed into cgocall.go. https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1539: g->sched.pc = (uintptr)runtime·getcallerpc(argp); This isn't the same as what was there before, and I don't see this mentioned in the CL description. Why is this correct? https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1700: ·exitsyscall(int32 dummy) Why remove the runtime prefix here?
Sign in to reply to this message.
I've updated the patch against current head. Luckily, syscallstack/guard went away, so the current patch builds and passes the tests again. We're probably again OK on the nosplit stack area, I think? https://codereview.appspot.com/131910043/diff/150001/misc/cgo/test/issue7978.go File misc/cgo/test/issue7978.go (right): https://codereview.appspot.com/131910043/diff/150001/misc/cgo/test/issue7978.... misc/cgo/test/issue7978.go:18: while(__atomic_load_n(sync, __ATOMIC_SEQ_CST) != 0); On 2014/09/11 04:33:01, iant wrote: > Put the semicolon on the next line here and in the other loops below. Done. https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/cgocall.c File src/pkg/runtime/cgocall.c (right): https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/cgocall.... src/pkg/runtime/cgocall.c:237: uintptr savedpc; On 2014/09/11 04:33:01, iant wrote: > Unfortunately this file has now changed into cgocall.go. Done. https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1539: g->sched.pc = (uintptr)runtime·getcallerpc(argp); On 2014/09/11 04:33:01, iant wrote: > This isn't the same as what was there before, and I don't see this mentioned in > the CL description. Why is this correct? In what way is it different? We're short on stack space, so rather than passing pc, sp around as two arguments, I changed it to pass the stack frame pointer that getcallerpc and getcallersp can use to obtain them (this also results in less code duplication). All the calls to save() ultimately ended up using those functions to obtain the arguments. If you look at the way save() was called before and after the change, it should be equivalent in every instance, unless I missed something. https://codereview.appspot.com/131910043/diff/150001/src/pkg/runtime/proc.c#n... src/pkg/runtime/proc.c:1700: ·exitsyscall(int32 dummy) On 2014/09/11 04:33:01, iant wrote: > Why remove the runtime prefix here? This seems to be a hack because the compiler complains about the prototype mismatch otherwise (dummy isn't really an argument, it's just there to get a pointer to the stack frame). I saw that entersyscall and entersyscallblock used this trick, so I used the same approach here.
Sign in to reply to this message.
R=rsc@golang.org (assigned by iant@golang.org)
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/190001/misc/cgo/test/issue7978.go File misc/cgo/test/issue7978.go (right): https://codereview.appspot.com/131910043/diff/190001/misc/cgo/test/issue7978.... misc/cgo/test/issue7978.go:18: while(__atomic_load_n(sync, __ATOMIC_SEQ_CST) != 0) These builtins require at least gcc 4.7. I am not sure everybody has it. As far as I remember even Ubuntu 12.04 has only gcc 4.6.3. I would use __sync_fetch_and_add(sync, 0) for load and __sync_bool_compare_and_swap(sync, old, new) for store. https://codereview.appspot.com/131910043/diff/190001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/190001/src/runtime/proc.c#newco... src/runtime/proc.c:1867: save(&syscallpc); add a comment // NOTE: save saves bad pc/sp here // Reconsider whether we need save in entersyscall at all; // if yes, then g->sched.pc/sp need to be set to syscallpc/syscallsp; // otherwise, remove it. We probably don't use g->sched.pc/sp at all if g->syscallsp is set. Or maybe we are using them but were lucky to date... https://codereview.appspot.com/131910043/diff/190001/src/runtime/stubs.go File src/runtime/stubs.go (right): https://codereview.appspot.com/131910043/diff/190001/src/runtime/stubs.go#new... src/runtime/stubs.go:148: func reentersyscall(pc uintptr, sp unsafe.Pointer) s/pc uintptr, sp unsafe.Pointer/pc, sp unsafe.Pointer/
Sign in to reply to this message.
I am going to say LGTM after the fixes
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/190001/misc/cgo/test/issue7978.go File misc/cgo/test/issue7978.go (right): https://codereview.appspot.com/131910043/diff/190001/misc/cgo/test/issue7978.... misc/cgo/test/issue7978.go:18: while(__atomic_load_n(sync, __ATOMIC_SEQ_CST) != 0) On 2014/09/13 01:00:39, dvyukov wrote: > These builtins require at least gcc 4.7. I am not sure everybody has it. As far > as I remember even Ubuntu 12.04 has only gcc 4.6.3. > I would use __sync_fetch_and_add(sync, 0) for load and > __sync_bool_compare_and_swap(sync, old, new) for store. Done. I just used __sync_fetch_and_add for both. https://codereview.appspot.com/131910043/diff/190001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/190001/src/runtime/proc.c#newco... src/runtime/proc.c:1867: save(&syscallpc); On 2014/09/13 01:00:40, dvyukov wrote: > add a comment > > // NOTE: save saves bad pc/sp here > // Reconsider whether we need save in entersyscall at all; > // if yes, then g->sched.pc/sp need to be set to syscallpc/syscallsp; > // otherwise, remove it. > > We probably don't use g->sched.pc/sp at all if g->syscallsp is set. > Or maybe we are using them but were lucky to date... > Done. https://codereview.appspot.com/131910043/diff/190001/src/runtime/stubs.go File src/runtime/stubs.go (right): https://codereview.appspot.com/131910043/diff/190001/src/runtime/stubs.go#new... src/runtime/stubs.go:148: func reentersyscall(pc uintptr, sp unsafe.Pointer) On 2014/09/13 01:00:40, dvyukov wrote: > s/pc uintptr, sp unsafe.Pointer/pc, sp unsafe.Pointer/ Done.
Sign in to reply to this message.
LGTM Russ, do you want to take a look as well?
Sign in to reply to this message.
The changes to save are unnecessary and wrong. https://codereview.appspot.com/131910043/diff/250001/src/runtime/cgocall.go File src/runtime/cgocall.go (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/cgocall.go#n... src/runtime/cgocall.go:191: savedpc := unsafe.Pointer(gp.syscallpc) PCs should be uintptr always. savedsp should probably stay a pointer so that it will be updated on stack growth https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1748: g->sched.pc = (uintptr)runtime·getcallerpc(argp); This is wrong. getcallerpc and getcallersp are only valid when called from the function whose caller you want to know about. You cannot pass argp down. This will work on x86 but break on non-x86 systems. Please put this back the way it was. https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1856: // Reconsider whether we need save in entersyscall at all; I don't understand this "reconsider" comment. I do think the call to save is wrong here. There should be code here that puts g->sched.pc and g->sched.sp back to syscallsp/syscallpc instead.
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1748: g->sched.pc = (uintptr)runtime·getcallerpc(argp); On 2014/09/16 17:24:05, rsc wrote: > This is wrong. getcallerpc and getcallersp are only valid when called from the > function whose caller you want to know about. You cannot pass argp down. This > will work on x86 but break on non-x86 systems. Please put this back the way it > was. This was done because of the NOSPLIT sequence overflow. Probably it will work after the NOSPLIT limit increase. https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1856: // Reconsider whether we need save in entersyscall at all; On 2014/09/16 17:24:05, rsc wrote: > I don't understand this "reconsider" comment. I do think the call to save is > wrong here. There should be code here that puts g->sched.pc and g->sched.sp back > to syscallsp/syscallpc instead. Why? What is broken with current code?
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1856: // Reconsider whether we need save in entersyscall at all; On 2014/09/16 17:32:15, dvyukov wrote: > On 2014/09/16 17:24:05, rsc wrote: > > I don't understand this "reconsider" comment. I do think the call to save is > > wrong here. There should be code here that puts g->sched.pc and g->sched.sp > back > > to syscallsp/syscallpc instead. > > Why? What is broken with current code? As written save is writing things that are wrong (g->sched.pc/g->sched.sp). But once save is fixed, then it should be fine to do: save(g->syscallpc, g->syscallsp);
Sign in to reply to this message.
PTAL. I moved the calls to runtime·getcaller{sp,pc} back to the callee functions and made save() just pull its sp/pc from syscall{sp,pc}. https://codereview.appspot.com/131910043/diff/250001/src/runtime/cgocall.go File src/runtime/cgocall.go (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/cgocall.go#n... src/runtime/cgocall.go:191: savedpc := unsafe.Pointer(gp.syscallpc) On 2014/09/16 17:24:05, rsc wrote: > PCs should be uintptr always. > savedsp should probably stay a pointer so that it will be updated on stack > growth Done. https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c File src/runtime/proc.c (right): https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1748: g->sched.pc = (uintptr)runtime·getcallerpc(argp); On 2014/09/16 17:32:15, dvyukov wrote: > On 2014/09/16 17:24:05, rsc wrote: > > This is wrong. getcallerpc and getcallersp are only valid when called from the > > function whose caller you want to know about. You cannot pass argp down. This > > will work on x86 but break on non-x86 systems. Please put this back the way it > > was. > > > This was done because of the NOSPLIT sequence overflow. Probably it will work > after the NOSPLIT limit increase. Done; see comment below. https://codereview.appspot.com/131910043/diff/250001/src/runtime/proc.c#newco... src/runtime/proc.c:1856: // Reconsider whether we need save in entersyscall at all; On 2014/09/16 17:46:25, rsc wrote: > On 2014/09/16 17:32:15, dvyukov wrote: > > On 2014/09/16 17:24:05, rsc wrote: > > > I don't understand this "reconsider" comment. I do think the call to save is > > > wrong here. There should be code here that puts g->sched.pc and g->sched.sp > > back > > > to syscallsp/syscallpc instead. > > > > Why? What is broken with current code? > > As written save is writing things that are wrong (g->sched.pc/g->sched.sp). But > once save is fixed, then it should be fine to do: > > save(g->syscallpc, g->syscallsp); If we're always saving the same thing to g->syscallpc and g->syscallsp and to g->sched then we might as well make save() do that without taking any arguments.
Sign in to reply to this message.
Please put save back EXACTLY as it was. Refactoring distracts from understanding the bug fix.
Sign in to reply to this message.
On 2014/09/17 18:10:32, rsc wrote: > Please put save back EXACTLY as it was. Refactoring distracts from understanding > the bug fix. Done. PTAL.
Sign in to reply to this message.
I think you can simplify the proc.c changes even further. If you start with the original proc.c, you can rename entersyscall(int dummy) to reentersyscall(uintptr pc, uintptr sp) and in that body just change the arguments to save to use pc, sp each time. Then entersyscall is just #pragma textflag NOSPLIT void entersyscall(int dummy) { reentersyscall(getcallerpc(&dummy), getcallersp(&dummy)); }
Sign in to reply to this message.
On 19/09/14 02:24, rsc@golang.org wrote: > Then entersyscall is just > > #pragma textflag NOSPLIT > void > entersyscall(int dummy) > { > reentersyscall(getcallerpc(&dummy), getcallersp(&dummy)); > } Is the reordering of getcallerpc/getcallersp and the rest of the beginning of entersyscall guaranteed not to cause a problem? This was my original reason for keeping the head of entersyscall the same (and copying it to both functions); I wasn't sure if any of it, particularly g->m->locks++, might need to happen first. For example, this will break if the stack ends up being copied before the goroutine is properly in the syscall state (preemption?), though I don't know if that's possible. If there's no issue, I'll go ahead and make it simpler as you describe. -- Hector Martin (hector@marcansoft.com) Public Key: http://www.marcansoft.com/marcan.asc
Sign in to reply to this message.
On Thu, Sep 18, 2014 at 12:55 PM, Hector Martin <hector@marcansoft.com> wrote: > On 19/09/14 02:24, rsc@golang.org wrote: >> Then entersyscall is just >> >> #pragma textflag NOSPLIT >> void >> entersyscall(int dummy) >> { >> reentersyscall(getcallerpc(&dummy), getcallersp(&dummy)); >> } > > Is the reordering of getcallerpc/getcallersp and the rest of the > beginning of entersyscall guaranteed not to cause a problem? This was my > original reason for keeping the head of entersyscall the same (and > copying it to both functions); I wasn't sure if any of it, particularly > g->m->locks++, might need to happen first. For example, this will break > if the stack ends up being copied before the goroutine is properly in > the syscall state (preemption?), though I don't know if that's possible. > > If there's no issue, I'll go ahead and make it simpler as you describe. Calling not-NOSPLIT functions can cause problems, because the goroutine can be rescheduled. But if you call only NOSPLIT functions, it is fine, as it is essentially no-op wrt the rest of the world.
Sign in to reply to this message.
As Dmitriy wrote, it's fine. restartsyscall, getcallerpc, and getcallersp are all 'nosplit', so that they cannot be preempted.
Sign in to reply to this message.
PTAL. Could also save a few casts by either having the first arg to reentersyscall be a void* or by changing save() to take two uintptrs, but that's slightly more intrusive (and I'm not sure if uintptr is guaranteed to be compatible ABI-wise with void*, though my guess would be yes). CL 132520043 fixed a bug that the test depended on (it needs to inspect the traceback for runtime functions), so I added GOTRACEBACK=2 to run.{bash,bat} to restore the old behavior. Without those runtime frames the test would be less useful.
Sign in to reply to this message.
LGTM https://codereview.appspot.com/131910043/diff/310001/src/run.bat File src/run.bat (right): https://codereview.appspot.com/131910043/diff/310001/src/run.bat#newcode95 src/run.bat:95: set OLDGOTRACEBACK=%GOTRACEBACK% This needs to be one line earlier. I will fix it when I submit.
Sign in to reply to this message.
*** Submitted as https://code.google.com/p/go/source/detail?r=98e896f92a90 *** runtime: keep g->syscallsp consistent after cgo->Go callbacks Normally, the caller to runtime.entersyscall() must not return before calling runtime.exitsyscall(), lest g->syscallsp become a dangling pointer. runtime.cgocallbackg() violates this constraint. To work around this, save g->syscallsp and g->syscallpc around cgo->Go callbacks, then restore them after calling runtime.entersyscall(), which restores the syscall stack frame pointer saved by cgocall. This allows the GC to correctly trace a goroutine that is currently returning from a Go->cgo->Go chain. This also adds a check to proc.c that panics if g->syscallsp is clearly invalid. It is not 100% foolproof, as it will not catch a case where the stack was popped then pushed back beyond g->syscallsp, but it does catch the present cgo issue and makes existing tests fail without the bugfix. Fixes issue 7978. LGTM=dvyukov, rsc R=golang-codereviews, dvyukov, minux, bradfitz, iant, gobot, rsc CC=golang-codereviews, rsc https://codereview.appspot.com/131910043 Committer: Russ Cox <rsc@golang.org>
Sign in to reply to this message.
https://codereview.appspot.com/131910043/diff/310001/src/run.bat File src/run.bat (right): https://codereview.appspot.com/131910043/diff/310001/src/run.bat#newcode95 src/run.bat:95: set OLDGOTRACEBACK=%GOTRACEBACK% On 2014/09/24 17:10:39, rsc wrote: > This needs to be one line earlier. I will fix it when I submit. D'oh. Thanks.
Sign in to reply to this message.
|