-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: fpTracebackPCs
crash on linux/amd64 in various places
#59692
Comments
I can reproduce this on go test -trace /dev/null -run TestDeferHeapAndStack
The test does not always crash at the same place: Another example
My first thought is to double check the various stack-switching routines (systemstack, mcall, morestack) to see if they're writing stuff where the frame pointer should be, or not setting a new frame pointer up properly when taking over the stack. |
fpTracebackPCs
crash on linux/amd64 while building kubeletfpTracebackPCs
crash on linux/amd64 in various places
Given we have a reliable reproducer, I think we should consider disabling frame pointer unwinding at HEAD until we have a fix. (I'd say flip the default value of |
@prattmic I'm planning to look into this as soon as I'm back from vacation on Monday. That being said, I'm okay with disabling One thing I'm curious about is if this issue is also caught by @nsrip-dd 🙇 thanks for finding a reproducer! |
Change https://go.dev/cl/486382 mentions this issue: |
Just tried it with
|
Frame pointer unwinding during execution tracing sometimes crashes. Until this is diagnosed and fixed, it should be turned off by default. Updates #59692 Change-Id: I0f2ca24b6d48435b0acfd3da8e4f25b9cfa4ec19 Reviewed-on: https://go-review.googlesource.com/c/go/+/486382 Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Run-TryBot: Nick Ripley <nick.ripley@datadoghq.com> Reviewed-by: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Michael Pratt <mpratt@google.com>
I've dug into this a little bit. I don't have any conclusive answer, but I think what's happening involves an incorrect value ending up in the I built the test binary and ran it under gdb, just doing
The crash is happening during frame pointer unwinding, at this instruction (where we're looking for the PC relative to the current frame pointer):
Now, we can see right away that curgp.sched.bp (where we're going to start unwinding) does not look right:
(This makes me wonder why Here's where morestack was called from:
So morestack is being called from runtime.gcMarkDone. This happens at the beginning of gcMarkDone, before the frame is set up. We know that morestack sets up curgp.sched.sp to point the the caller's frame, just above where the return address is pushed (i.e. the stack pointer as it would have been immediately prior to the CALL instruction, which pushes the return address). Since we're at the begging of runtime.gcMarkDone, we know that the stack pointer should thus be pointing to the return address that was pushed by runtime.gcMarkDone's caller:
Now, we can look at the disassembly of runtime.gcAssistAlloc to see how big of a frame it uses:
So it uses a 0x50 byte frame. Note that this is after pushing RBP to the stack. Let's see what was saved there. It should be 0x50 bytes off of curgp.sched.sp, plus an additional 8 bytes because we're pointing to where the return address within gcAssistAlloc is saved before calling gcMarkDone.
Let's see what the return address of gcAssistAlloc is, which should be an additional 8 bytes above the saved frame pointer:
This looks right! Now, let's check the saved frame pointer:
This is promising. The address 0xc0003f3880 is within our stack bounds. Let's see if there's a return address saved right above this address, assuming it's a valid frame pointer:
This checks out too!
So, despite curgp.sched.bp being wrong, it looks like the frame pointers saved on the stack are right. So, why is curgp.sched.bp wrong? Let's consider how curgp.sched.bp is set. Here's where it's set up by morestack:
So, it's set to be the current value of RBP when morestack is called. Note that morestack is marked NOFRAME, so it doesn't create a new frame pointer. It's jumped to from morestack_noctxt:
So, this means we're going to use the value of BP as it was immediately prior to gcMarkDone being called within gcAssitAlloc. At some point between when gcAssistAlloc was called, and when we reached the gcMarkDone call, something invalid was put in BP. I think based on the analysis above, gcAssistAlloc would have set up the correct frame pointer. My hunch is that something in between overwrote it incorrectly. One idea I'm trying out: If my hunch is that RBP is invalid at some point after gcAssistAlloc starts, perhaps let's stop around where we think it might be invalid and see what its value is. I set a breakpoint in gcAssistAlloc just before gcMarkDone is called. Then I had gdb display RBP and RSP. I think we should always have RBP >= RSP during normal code (so potentially not for a few instructions during stack switching or something). They should probably also be close-ish together, since in theory RBP just points to the top of the frame and RSP points to the bottom. A typical case looks like this:
As noted above, these values differ by 0x50, the size of the frame that gcAssitAlloc sets up. So this seems to be right. Here's one case where that wasn't true:
Notice how different the two values are. Here's the stack trace at this point:
This is where I'm stumped for now. I don't yet have a machine that can run |
@nsrip-dd Nice deep dive! After reading your analysis, I had a suspicion that maybe the EDIT: To reproduce I had to shrink GOMAXPROCS on my machine to 8. If it was higher than that it wouldn't reproduce. Maybe it lowered the chance that assists took care of stack shrinking and copying, since there were more idle-priority GC workers? |
I think I've confirmed that there's consistently a stack shrink on the relevant goroutine specifically while it's in stackDebug=1 (augmented)
|
Hm. Perhaps that's a red herring? If I put The few most recent crashes I've seen seem to be in an attempt to trace a preemption at entry to |
Well, OK. This is interesting. I caught the bad BP just before the traceback happens, and I can confirm the bad BP corresponds to the old stack, before the shrink. So the shrink matters. stackDebug=1 (augmented)
|
The BP is bad upon entrance to The thing is, I think this means it has to be Indeed, if I print |
Hmmm. Tell us more. |
With Cherry's help I think we've got it. The problem is that EDIT: This is the reason by |
Yeah, looks like the fact that the
|
Cherry and I think that something like the above diff is actually the real fix. The Probably the fix looks like getting rid of this case and writing a long doc comment about why it's safe to let |
I was also hacking on this last night (mostly using lots of
I also agree with the conclusion from @mknyszek and @cherrymui that it's the interaction with AFAICT this is a regression from CL 472195. One thing that still surprised me was that this issue is only observed during stack shrinking and not stack growth. I think the answer to this is that stack growth can't happen while we're on the g0 stack, but stack shrinking (via GC) can happen to a goroutine that switched to g0. Does that sound right?
+1 - this sounds like the best approach to me. Unless somebody is already working on it (please comment), I'll try to work on a patch for this. |
That's right. You specifically need a goroutine stack to move while that goroutine has switched to g0; that's the only case where you can fail to update a frame pointer (specifically this systemstack one). And currently the only way for a stack to move while executing on the system stack is if some other goroutine scans the stack and decides to shrink it as well. |
Deleted my comment, it was slightly inaccurate. I'll provide more details when I send my patches later today or tomorrow. |
Change https://go.dev/cl/489015 mentions this issue: |
Change https://go.dev/cl/489117 mentions this issue: |
Change https://go.dev/cl/489115 mentions this issue: |
Change https://go.dev/cl/489116 mentions this issue: |
Change https://go.dev/cl/489016 mentions this issue: |
The patches above should fix the issue and add better test coverage. The last patch fixes a related issue, but it's not impacting traceEvent unwinding. Details are in the CLs. |
Add TestSystemstackFramePointerAdjust as a regression test for CL 489015. By turning stackPoisonCopy into a var instead of const and introducing the ShrinkStackAndVerifyFramePointers() helper function, we are able to trigger the exact combination of events that can crash traceEvent() if systemstack restores a frame pointer that is pointing into the old stack. Updates #59692 Change-Id: I60fc6940638077e3b60a81d923b5f5b4f6d8a44c Reviewed-on: https://go-review.googlesource.com/c/go/+/489115 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Run-TryBot: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
Remove logic for skipping some adjustframe logic for systemstack (aka FuncID_systemstack_switch). This was introduced in 2014 by 9198ed4 but doesn't seem to be needed anymore. Updates #59692 Change-Id: I2368d64f9bb28ced4e7f15c9b15dac7a29194389 Reviewed-on: https://go-review.googlesource.com/c/go/+/489116 Reviewed-by: Michael Knyszek <mknyszek@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
I was running the Sweet benchmarks and I encountered a crash in
fpTracebackPCs
:Crash in Go compiler during the go-build benchmark building kubelet.
(This was run with a modified Go tree, but the only change was replacing
cputicks
withnanotime
and removing the trace tick division in insrc/runtime/trace.go
.)I don't think this is easily reproducible yet so I'll keep running and try to reproduce.
CC @felixge @nsrip-dd @prattmic
The text was updated successfully, but these errors were encountered: