runtime: stack split at a bad time on mipsle #21431

josharian · 2017-08-14T04:13:21Z

Recent build failure from the dashboard (linux-mipsle at 816deac
): https://build.golang.org/log/e101d6d552cc5dce1ebf5b5560c0b399dbe86f6e

runtime: newstack sp=0x8861994 stack=[0x8861000, 0x8862000]
	morebuf={pc:0x93440 sp:0x8861994 lr:0x0}
	sched={pc:0x55e3c sp:0x8861994 lr:0x93440 ctxt:0x0}
syscall.Syscall(0x90010, 0xfa3, 0x5, 0x8861a0c, 0x4, 0x0, 0x0)
	/usr/local/go-bootstrap-mipsle/src/syscall/asm_linux_mipsx.s:31 +0x58
syscall.readlen(0x19, 0x0, 0x888e780, 0x8861ac8, 0x234960, 0x6)
	/usr/local/go-bootstrap-mipsle/src/syscall/zsyscall_linux_mipsle.go:1050 +0x5c
created by main.cmdbootstrap
	/data/mipsle/go/src/cmd/dist/build.go:1071 +0x498
fatal error: runtime: stack split at bad time

[stack traces]

The text was updated successfully, but these errors were encountered:

aclements · 2017-11-22T20:26:06Z

It looks like we've hit this a few times:

$ greplogs -dashboard -E "stack split at bad time" -l -md | grep linux-mips
2016-12-21T12:27:41-0ef4815/linux-mipsle
2017-04-21T14:34:10-bb6309c/linux-mipsle
2017-08-13T17:33:10-816deac/linux-mipsle

In all three, it's been something around syscalls, though all of the details differ. In all cases the traceback of the crashing goroutine is curiously truncated.

2017-08-13: The crash appears to happen when syscall.Syscall calls runtime.exitsyscall, but exitsyscall is marked nosplit, so it doesn't even have a morestack prologue. However, there may be other runtime frames in there since newstack manually calls traceback to print the traceback, but the default behavior is to hide runtime frames, so we may just not be seeing them.

The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first. The next argument of syscall.Syscall is supposed to be the FD, which is a completely reasonable 5 in the traceback, but none of the printed arguments to syscall.readlen is 5. My guess here is that the traceback code maybe got confused somehow by syscall.Syscall and went off the rails.

The location of the failure is also odd. The exact location in syscall.Syscall indicates that the read syscall failed, but the only place we use readlen is in forkExec in exec_unix.go, and I can't think of any reason why that read could fail.

Unfortunately, I haven't been able to reproduce the binary to check the specific PCs because this is the dist binary, so it was compiled with the bootstrap compiler on 2017-08-13 and I don't know what that was (ping @bradfitz).

2017-04-21: It's not even at a call, though, again, maybe the traceback is messed up. I tried to track down the morebuf PC. I can't locally build a cgo-enabled runtime_test linux/mipsle binary (and there seems to be no hope of reaching the linux-mipsle gomote), so I assumed syscall.Syscall was being called by readlen (the arguments suggest it is), assumed 0x48ff68 was the return address of the call to syscall.Syscall in readlen (because syscall.Syscall declares a zero-sized frame, but exitsyscall takes a dummy argument, that dummy argument should line up with syscall.Syscall's saved LR), and used that as a relocation delta against the non-cgo binary I could build locally. Unfortunately, that put me in the middle of entersyscallblock_handoff, which is called on the system stack, so isn't the culprit.

gopherbot · 2017-11-22T20:39:07Z

Change https://golang.org/cl/79518 mentions this issue: runtime: print runtime frames in throwsplit trace

gopherbot · 2017-11-22T20:39:08Z

Change https://golang.org/cl/79517 mentions this issue: runtime: call throw on systemstack in exitsyscall

If exitsyscall tries to grow the stack it will panic, but throw calls print, which can grow the stack. Move the two bare throws in exitsyscall to the system stack. Updates #21431. Change-Id: I5b29da5d34ade908af648a12075ed327a864476c Reviewed-on: https://go-review.googlesource.com/79517 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

newstack manually prints the stack trace if we try to grow the stack when throwsplit is set. However, the default behavior is to omit runtime frames. Since runtime frames can be critical to understanding this crash, this change fixes this traceback to include them. Updates #21431. Change-Id: I5aa43f43aa2f10a8de7d67bcec743427be3a3b5d Reviewed-on: https://go-review.googlesource.com/79518 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot · 2017-11-24T15:25:23Z

Change https://golang.org/cl/79815 mentions this issue: runtime: fix final stack split in exitsyscall

exitsyscall should be recursively nosplit, but we don't have a way to annotate that right now (see #21314). There's exactly one remaining place where this is violated right now: exitsyscall -> casgstatus -> print. The other prints in casgstatus are wrapped in systemstack calls. This fixes the remaining print. Updates #21431 (in theory could fix it, but that would just indicate that we have a different G status-related crash and we've *never* seen that failure on the dashboard.) Change-Id: I9a5e8d942adce4a5c78cfc6b306ea5bda90dbd33 Reviewed-on: https://go-review.googlesource.com/79815 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Rick Hudson <rlh@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

odeke-em · 2017-12-11T08:32:34Z

Hello @aclements, gently pinging here to see if there is more work that needs to be done for Go1.10 for this issue.

aclements · 2017-12-14T20:36:06Z

I wish I knew what to do; then I would do more for Go1.10. :)

It looks like it's happened a few more times recently:
2017-11-22T21:16:29-c6c0f47/linux-mipsle
2017-11-30T14:39:19-4435fcf/linux-mipsle
2017-12-05T18:22:53-eb441e6/linux-mipsle
2017-12-05T18:42:38-49fec9b/linux-mipsle

These failures are all essentially the same. As before, they're all around system calls. And, as before, the traceback doesn't make much sense. runtime.reentersyscall is nosplit, so it can't be calling morestack, and reentersyscall doesn't call reentersyscall.

@cherrymui and I spent some time digging into the new failures. Some observations:

According to morebuf and the stack trace, the call to the stack-splitting function is at reentersyscall+0xe4. This corresponds to the nil check of p.ptr() in if _g_.m.p.ptr().runSafePointFn != 0. This could in principle cause a SIGSEGV that turns into a synthesized call to sigpanic from this PC, and sigpanic can indeed split the stack. If sched.pc from the panic message is sigpanic, that would confirm this, but I still haven't had any luck building any of the failing binaries. I'm going to make a CL to add this symbolization directly to the panic so we can hopefully catch it in the wild.

The traceback claims reentersyscall is being called from reentersyscall+0x74, which is a call, but to runtime.casgstatus. We're not sure what to make of that. Given that the traceback cuts off at that point, it's probably failing to walk to stack and simply found the remnants of some earlier call.

We also noticed that syscall.Syscall on mipsle isn't technically following the MIPS C calling convention, which is supposed to always save space for registers on the stack (even those passed in registers), and always save at least 16 bytes for the argument space. syscall.Syscall doesn't save any space. Fixing this would be non-trivial because Go's calling convention places the saved LR in this argument scratch space, so we'd need to move that around. It also doesn't seem to matter both because it would be really weird for the kernel to scribble over this space, and because none of the runtime syscalls reserve this space and they seem to be fine.

gopherbot · 2017-12-14T20:44:37Z

Change https://golang.org/cl/84115 mentions this issue: runtime: symbolize morestack caller in throwsplit panic

cherrymui · 2017-12-14T22:46:06Z

(from earlier failure)

The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first.

The "first arg" there, 0x48a528, looks pretty much like a PC. This suggests there is an off-by-one error somewhere, because one word below the first arg is the saved LR.

preparePanic subtracts 4 from SP and stores the faulting address there. The traceback code handles this specially if it sees sigpanic's PC (https://go.googlesource.com/go/+/67295d6eb0b9ef8d40fcddf052d18ebaa03566e4/src/runtime/traceback.go#482). But if it is sigpanic's prolog try to grow the stack but failed, we're not really in sigpanic yet and the traceback code doesn't do this. This causes argp off by one word and prints a saved LR as first arg.

This also agrees with what we see in more recent failures:

runtime.reentersyscall(0x48d588, 0x48d588)
	/data/mipsle/go/src/runtime/proc.go:2811 +0xe4 fp=0x91152c4 sp=0x91152a8 pc=0x43dab4
runtime.reentersyscall(0x0, 0x0)
	/data/mipsle/go/src/runtime/proc.go:2790 +0x74 fp=0x91152e0 sp=0x91152c4 pc=0x43da44

It appears that reentersyscall get called with two identical args, but it cannot. In fact, reentersyscall can be called by entersyscall with getcallerpc() as its first arg, which is also the saved LR on entersyscall's frame. The off-by-one error makes them appears as two args. So the actual call stack may be

reentersyscall
entersyscall
function at 0x48d588 (probably syscall.Syscall?)

The traceback code thinks reentersyscall recurse to itself because it uses the LR saved by preparePanic as its return address, which is whatever in the LR register when the fault happened, which happens to be the return address of casgstatus in reentersyscall. From there, the backtrace goes far off. This also explains why the backtrace is always truncated.

This attempts to symbolize the PC of morestack's caller when there's a stack split at a bad time. The stack trace starts at the *caller* of the function that attempted to grow the stack, so this is useful if it isn't obvious what's being called at that point, such as in #21431. Change-Id: I5dee305d87c8069611de2d14e7a3083d76264f8f Reviewed-on: https://go-review.googlesource.com/84115 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>

aclements · 2018-01-12T16:54:55Z

This hasn't happened against yet since 2017-12-05, so no diagnostics, unfortunately. However, I think @cherrymui's theory is pretty sound.

This got me thinking that sigpanic itself really ought to be nosplit, since it can be "called" from contexts where it's unsafe to grow the stack. Then, if throwsplit is set, it should switch to the system stack and throw rather than attempt any further handling of the signal panic (which may grow the stack).

Of course, that wouldn't help with whatever underlying issue is causing the signal here (assuming that's what's really going on), but it would help debug issues like these.

gopherbot · 2018-01-12T17:14:06Z

Change https://golang.org/cl/87595 mentions this issue: runtime: don't grow the stack on sigpanic if unsafe

gopherbot · 2018-01-22T20:15:44Z

Change https://golang.org/cl/89016 mentions this issue: runtime: print hexdump on traceback failure

Currently, if anything goes wrong when printing a traceback, we simply cut off the traceback without any further diagnostics. Unfortunately, right now, we have a few issues that are difficult to debug because the traceback simply cuts off (#21431, #23484). This is an attempt to improve the debuggability of traceback failure by printing a diagnostic message plus a hex dump around the failed traceback frame when something goes wrong. The failures look like: goroutine 5 [running]: runtime: unexpected return pc for main.badLR2 called from 0xbad stack: frame={sp:0xc42004dfa8, fp:0xc42004dfc8} stack=[0xc42004d800,0xc42004e000) 000000c42004dea8: 0000000000000001 0000000000000001 000000c42004deb8: 000000c42004ded8 000000c42004ded8 000000c42004dec8: 0000000000427eea <runtime.dopanic+74> 000000c42004ded8 000000c42004ded8: 000000000044df70 <runtime.dopanic.func1+0> 000000c420001080 000000c42004dee8: 0000000000427b21 <runtime.gopanic+961> 000000c42004df08 000000c42004def8: 000000c42004df98 0000000000427b21 <runtime.gopanic+961> 000000c42004df08: 0000000000000000 0000000000000000 000000c42004df18: 0000000000000000 0000000000000000 000000c42004df28: 0000000000000000 0000000000000000 000000c42004df38: 0000000000000000 000000c420001080 000000c42004df48: 0000000000000000 0000000000000000 000000c42004df58: 0000000000000000 0000000000000000 000000c42004df68: 000000c4200010a0 0000000000000000 000000c42004df78: 00000000004c6400 00000000005031d0 000000c42004df88: 0000000000000000 0000000000000000 000000c42004df98: 000000c42004dfb8 00000000004ae7d9 <main.badLR2+73> 000000c42004dfa8: <00000000004c6400 00000000005031d0 000000c42004dfb8: 000000c42004dfd0 !0000000000000bad 000000c42004dfc8: >0000000000000000 0000000000000000 000000c42004dfd8: 0000000000451821 <runtime.goexit+1> 0000000000000000 000000c42004dfe8: 0000000000000000 0000000000000000 000000c42004dff8: 0000000000000000 main.badLR2(0x0) /go/src/runtime/testdata/testprog/badtraceback.go:42 +0x49 For #21431, #23484. Change-Id: I8718fc76ced81adb0b4b0b4f2293f3219ca80786 Reviewed-on: https://go-review.googlesource.com/89016 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

Currently, if a _SigPanic signal arrives in a throwsplit context, nothing is stopping the runtime from injecting a call to sigpanic that may attempt to grow the stack. This will fail and, in turn, mask the real problem. Fix this by checking for throwsplit in the signal handler itself before injecting the sigpanic call. Updates #21431, where this problem is likely masking the real problem. Change-Id: I64b61ff08e8c4d6f6c0fb01315d7d5e66bf1d3e2 Reviewed-on: https://go-review.googlesource.com/87595 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>

aclements · 2018-01-23T19:51:57Z

Still no new reproductions since 2017-12-05T18:42:38-49fec9b/linux-mipsle.

ianlancetaylor · 2018-07-11T20:58:06Z

Hasn't happened in six months. I don't know what fixed this, but I'm calling it fixed.

josharian added NeedsInvestigation release-blocker labels Aug 14, 2017

josharian added this to the Go1.10 milestone Aug 14, 2017

aclements mentioned this issue Jan 19, 2018

runtime: stack grow panic tracing back through sigpanic from signal handler #23484

Open

aclements modified the milestones: Go1.10, Go1.11 Jan 23, 2018

ianlancetaylor closed this as completed Jul 11, 2018

golang locked and limited conversation to collaborators Jul 11, 2019

gopherbot added the FrozenDueToAge label Jul 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: stack split at a bad time on mipsle #21431

runtime: stack split at a bad time on mipsle #21431

josharian commented Aug 14, 2017

aclements commented Nov 22, 2017

gopherbot commented Nov 22, 2017

gopherbot commented Nov 22, 2017

gopherbot commented Nov 24, 2017

odeke-em commented Dec 11, 2017

aclements commented Dec 14, 2017

gopherbot commented Dec 14, 2017

cherrymui commented Dec 14, 2017

aclements commented Jan 12, 2018

gopherbot commented Jan 12, 2018

gopherbot commented Jan 22, 2018

aclements commented Jan 23, 2018

ianlancetaylor commented Jul 11, 2018

runtime: stack split at a bad time on mipsle #21431

runtime: stack split at a bad time on mipsle #21431

Comments

josharian commented Aug 14, 2017

aclements commented Nov 22, 2017

gopherbot commented Nov 22, 2017

gopherbot commented Nov 22, 2017

gopherbot commented Nov 24, 2017

odeke-em commented Dec 11, 2017

aclements commented Dec 14, 2017

gopherbot commented Dec 14, 2017

cherrymui commented Dec 14, 2017

aclements commented Jan 12, 2018

gopherbot commented Jan 12, 2018

gopherbot commented Jan 22, 2018

aclements commented Jan 23, 2018

ianlancetaylor commented Jul 11, 2018