-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: stack split at a bad time on mipsle #21431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It looks like we've hit this a few times:
In all three, it's been something around syscalls, though all of the details differ. In all cases the traceback of the crashing goroutine is curiously truncated. 2017-08-13: The crash appears to happen when syscall.Syscall calls runtime.exitsyscall, but exitsyscall is marked nosplit, so it doesn't even have a morestack prologue. However, there may be other runtime frames in there since newstack manually calls traceback to print the traceback, but the default behavior is to hide runtime frames, so we may just not be seeing them. The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first. The next argument of syscall.Syscall is supposed to be the FD, which is a completely reasonable 5 in the traceback, but none of the printed arguments to syscall.readlen is 5. My guess here is that the traceback code maybe got confused somehow by syscall.Syscall and went off the rails. The location of the failure is also odd. The exact location in syscall.Syscall indicates that the read syscall failed, but the only place we use readlen is in forkExec in exec_unix.go, and I can't think of any reason why that read could fail. Unfortunately, I haven't been able to reproduce the binary to check the specific PCs because this is the dist binary, so it was compiled with the bootstrap compiler on 2017-08-13 and I don't know what that was (ping @bradfitz). 2017-04-21: It's not even at a call, though, again, maybe the traceback is messed up. I tried to track down the morebuf PC. I can't locally build a cgo-enabled runtime_test linux/mipsle binary (and there seems to be no hope of reaching the linux-mipsle gomote), so I assumed syscall.Syscall was being called by readlen (the arguments suggest it is), assumed 0x48ff68 was the return address of the call to syscall.Syscall in readlen (because syscall.Syscall declares a zero-sized frame, but exitsyscall takes a dummy argument, that dummy argument should line up with syscall.Syscall's saved LR), and used that as a relocation delta against the non-cgo binary I could build locally. Unfortunately, that put me in the middle of entersyscallblock_handoff, which is called on the system stack, so isn't the culprit. |
Change https://golang.org/cl/79518 mentions this issue: |
Change https://golang.org/cl/79517 mentions this issue: |
If exitsyscall tries to grow the stack it will panic, but throw calls print, which can grow the stack. Move the two bare throws in exitsyscall to the system stack. Updates #21431. Change-Id: I5b29da5d34ade908af648a12075ed327a864476c Reviewed-on: https://go-review.googlesource.com/79517 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
newstack manually prints the stack trace if we try to grow the stack when throwsplit is set. However, the default behavior is to omit runtime frames. Since runtime frames can be critical to understanding this crash, this change fixes this traceback to include them. Updates #21431. Change-Id: I5aa43f43aa2f10a8de7d67bcec743427be3a3b5d Reviewed-on: https://go-review.googlesource.com/79518 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
Change https://golang.org/cl/79815 mentions this issue: |
exitsyscall should be recursively nosplit, but we don't have a way to annotate that right now (see #21314). There's exactly one remaining place where this is violated right now: exitsyscall -> casgstatus -> print. The other prints in casgstatus are wrapped in systemstack calls. This fixes the remaining print. Updates #21431 (in theory could fix it, but that would just indicate that we have a different G status-related crash and we've *never* seen that failure on the dashboard.) Change-Id: I9a5e8d942adce4a5c78cfc6b306ea5bda90dbd33 Reviewed-on: https://go-review.googlesource.com/79815 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Rick Hudson <rlh@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
Hello @aclements, gently pinging here to see if there is more work that needs to be done for Go1.10 for this issue. |
I wish I knew what to do; then I would do more for Go1.10. :) It looks like it's happened a few more times recently: These failures are all essentially the same. As before, they're all around system calls. And, as before, the traceback doesn't make much sense. @cherrymui and I spent some time digging into the new failures. Some observations: According to morebuf and the stack trace, the call to the stack-splitting function is at The traceback claims We also noticed that |
Change https://golang.org/cl/84115 mentions this issue: |
(from earlier failure)
The "first arg" there, 0x48a528, looks pretty much like a PC. This suggests there is an off-by-one error somewhere, because one word below the first arg is the saved LR.
This also agrees with what we see in more recent failures:
It appears that
The traceback code thinks |
This attempts to symbolize the PC of morestack's caller when there's a stack split at a bad time. The stack trace starts at the *caller* of the function that attempted to grow the stack, so this is useful if it isn't obvious what's being called at that point, such as in #21431. Change-Id: I5dee305d87c8069611de2d14e7a3083d76264f8f Reviewed-on: https://go-review.googlesource.com/84115 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
This hasn't happened against yet since 2017-12-05, so no diagnostics, unfortunately. However, I think @cherrymui's theory is pretty sound. This got me thinking that Of course, that wouldn't help with whatever underlying issue is causing the signal here (assuming that's what's really going on), but it would help debug issues like these. |
Change https://golang.org/cl/87595 mentions this issue: |
Change https://golang.org/cl/89016 mentions this issue: |
Currently, if anything goes wrong when printing a traceback, we simply cut off the traceback without any further diagnostics. Unfortunately, right now, we have a few issues that are difficult to debug because the traceback simply cuts off (#21431, #23484). This is an attempt to improve the debuggability of traceback failure by printing a diagnostic message plus a hex dump around the failed traceback frame when something goes wrong. The failures look like: goroutine 5 [running]: runtime: unexpected return pc for main.badLR2 called from 0xbad stack: frame={sp:0xc42004dfa8, fp:0xc42004dfc8} stack=[0xc42004d800,0xc42004e000) 000000c42004dea8: 0000000000000001 0000000000000001 000000c42004deb8: 000000c42004ded8 000000c42004ded8 000000c42004dec8: 0000000000427eea <runtime.dopanic+74> 000000c42004ded8 000000c42004ded8: 000000000044df70 <runtime.dopanic.func1+0> 000000c420001080 000000c42004dee8: 0000000000427b21 <runtime.gopanic+961> 000000c42004df08 000000c42004def8: 000000c42004df98 0000000000427b21 <runtime.gopanic+961> 000000c42004df08: 0000000000000000 0000000000000000 000000c42004df18: 0000000000000000 0000000000000000 000000c42004df28: 0000000000000000 0000000000000000 000000c42004df38: 0000000000000000 000000c420001080 000000c42004df48: 0000000000000000 0000000000000000 000000c42004df58: 0000000000000000 0000000000000000 000000c42004df68: 000000c4200010a0 0000000000000000 000000c42004df78: 00000000004c6400 00000000005031d0 000000c42004df88: 0000000000000000 0000000000000000 000000c42004df98: 000000c42004dfb8 00000000004ae7d9 <main.badLR2+73> 000000c42004dfa8: <00000000004c6400 00000000005031d0 000000c42004dfb8: 000000c42004dfd0 !0000000000000bad 000000c42004dfc8: >0000000000000000 0000000000000000 000000c42004dfd8: 0000000000451821 <runtime.goexit+1> 0000000000000000 000000c42004dfe8: 0000000000000000 0000000000000000 000000c42004dff8: 0000000000000000 main.badLR2(0x0) /go/src/runtime/testdata/testprog/badtraceback.go:42 +0x49 For #21431, #23484. Change-Id: I8718fc76ced81adb0b4b0b4f2293f3219ca80786 Reviewed-on: https://go-review.googlesource.com/89016 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
Currently, if a _SigPanic signal arrives in a throwsplit context, nothing is stopping the runtime from injecting a call to sigpanic that may attempt to grow the stack. This will fail and, in turn, mask the real problem. Fix this by checking for throwsplit in the signal handler itself before injecting the sigpanic call. Updates #21431, where this problem is likely masking the real problem. Change-Id: I64b61ff08e8c4d6f6c0fb01315d7d5e66bf1d3e2 Reviewed-on: https://go-review.googlesource.com/87595 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Still no new reproductions since 2017-12-05T18:42:38-49fec9b/linux-mipsle. |
Hasn't happened in six months. I don't know what fixed this, but I'm calling it fixed. |
Recent build failure from the dashboard (linux-mipsle at 816deac
): https://build.golang.org/log/e101d6d552cc5dce1ebf5b5560c0b399dbe86f6e
The text was updated successfully, but these errors were encountered: