runtime: panics during gc #11640

Closed
zeebo opened this issue Jul 9, 2015 · 14 comments

@zeebo
Contributor

zeebo commented Jul 9, 2015

I recently upgraded one of our systems to go1.5beta1 (b6ead9f) and am observing this crash periodically:

panic: runtime error: index out of range
fatal error: panic holding locks
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty

goroutine 28 [running]:
runtime.throw(0x93e588, 0x13)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0x10e18e4c sp=0x10e18e40
runtime.gopanic(0x83a6b8, 0x10e20008)
    /home/jeff/go/src/runtime/panic.go:385 +0x2a4 fp=0x10e18e90 sp=0x10e18e4c
runtime.panicindex()
    /home/jeff/go/src/runtime/panic.go:12 +0x48 fp=0x10e18ea4 sp=0x10e18e90
runtime.(*gcWork).put(0x10e15328, 0x10f5d1c0)
    /home/jeff/go/src/runtime/mgcwork.go:87 +0x70 fp=0x10e18eb4 sp=0x10e18ea4
runtime.greyobject(0x10f5d1c0, 0x10f05200, 0x888, 0x10dea2e3, 0x0, 0xb69ae8f8, 0x10e15328)
    /home/jeff/go/src/runtime/mgcmark.go:898 +0x304 fp=0x10e18f00 sp=0x10e18eb4
runtime.scanobject(0x10f05200, 0x10e15328)
    /home/jeff/go/src/runtime/mgcmark.go:821 +0x294 fp=0x10e18f68 sp=0x10e18f00
runtime.gcDrainUntilPreempt(0x10e15328, 0x7d0, 0x0)
    /home/jeff/go/src/runtime/mgcmark.go:686 +0x1dc fp=0x10e18f94 sp=0x10e18f68
runtime.gcBgMarkWorker(0x10e14a00)
    /home/jeff/go/src/runtime/mgc.go:1261 +0x48c fp=0x10e18fd4 sp=0x10e18f94
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1016 +0x4 fp=0x10e18fd4 sp=0x10e18fd4
created by runtime.gcBgMarkStartWorkers
    /home/jeff/go/src/runtime/mgc.go:1173 +0x84
panic: runtime error: index out of range
fatal error: panic holding locks

goroutine 41 [running]:
runtime.throw(0x93e588, 0x13)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0x10e1fefc sp=0x10e1fef0
runtime.gopanic(0x83a6b8, 0x10e20008)
    /home/jeff/go/src/runtime/panic.go:385 +0x2a4 fp=0x10e1ff40 sp=0x10e1fefc
runtime.panicindex()
    /home/jeff/go/src/runtime/panic.go:12 +0x48 fp=0x10e1ff54 sp=0x10e1ff40
runtime.(*gcWork).tryGet(0x10f6c928, 0x3cf54)
    /home/jeff/go/src/runtime/mgcwork.go:115 +0x9c fp=0x10e1ff68 sp=0x10e1ff54
runtime.gcDrainUntilPreempt(0x10f6c928, 0x7d0, 0x0)
    /home/jeff/go/src/runtime/mgcmark.go:681 +0x158 fp=0x10e1ff94 sp=0x10e1ff68
runtime.gcBgMarkWorker(0x10f6c000)
    /home/jeff/go/src/runtime/mgc.go:1261 +0x48c fp=0x10e1ffd4 sp=0x10e1ff94
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1016 +0x4 fp=0x10e1ffd4 sp=0x10e1ffd4
created by runtime.gcBgMarkStartWorkers
    /home/jeff/go/src/runtime/mgc.go:1173 +0x84

It's running on a GOOS=linux GOARCH=arm GOARM=5 system. It's a large program, so I can't provide a simple reproduction at this time, but I'll do my best to provide any information that might be helpful, including running any diagnostics you ask for.

@ianlancetaylor added this to the Go1.5 milestone Jul 9, 2015
@ianlancetaylor
Member

CC @RLH @aclements

@aclements
Member

I'm not sure about the "index out of range", but "workbuf is empty" has happened a few times on the dashboard:

2015-06-19T00:44:13-2f3d103/freebsd-386-gce101
2015-06-24T17:53:25-7708248/dragonfly-amd64
2015-06-24T22:22:55-6613090/darwin-amd64
2015-06-26T17:26:09-e0e47e2/linux-arm-arm5
2015-07-02T08:29:33-be2a3e2/linux-arm-arm5
2015-07-09T20:05:33-4c33250/darwin-386
2015-07-12T20:43:01-7e4c06d/darwin-386

Given the variety of architectures and OSes, this doesn't appear to be arch- or OS-specific. In particular, since it's also happening on x86, I don't think this is a memory-ordering issue.

Where it happens is also scattered. It often occurs while testing $GOROOT/test, but several failures are in test/run.go itself or in the linker invoked by test/run.go. This makes me wonder whether this is related to #11617.

I worked through an informal backwards proof that we're satisfying all of the invariants that should ensure checknonempty never fails in trygetfull or getfull. This depended on a few assumptions, like that the compiler is correct, that our lock-free list is correct (note that the above failures cover three different implementations of the platform-dependent part, so that's unlikely to be the problem), and that there's no memory corruption.
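
For reference, here is a minimal, standalone sketch of the kind of assertion behind the "workbuf is empty" throw above. The type, field, and function names are simplified approximations of the runtime's internal mgcwork.go structures, not the actual implementation:

    // Simplified model of a GC work buffer and the nonempty check that
    // produces the "workbuf is empty" failure.
    package main

    import "fmt"

    type workbuf struct {
        nobj int          // number of valid entries in obj
        obj  [256]uintptr // grey-object pointers waiting to be scanned
    }

    // checknonempty mirrors the invariant asserted when a buffer is popped
    // from the full/partial lists: such a buffer must still contain work.
    func (b *workbuf) checknonempty() {
        if b.nobj == 0 {
            panic("workbuf is empty") // the runtime throws here instead
        }
    }

    func main() {
        var b workbuf
        b.obj[b.nobj] = 0x10f5d1c0 // queue one (fake) grey object
        b.nobj++
        b.checknonempty() // fine: the buffer has work

        // If someone else drained the buffer after it was published on the
        // full list, the next consumer to pop it would hit the assertion:
        b.nobj = 0
        defer func() { fmt.Println("recovered:", recover()) }()
        b.checknonempty()
    }

If any of those assumptions is violated, a buffer's object count can end up disagreeing with the list it was popped from, and this is the check that fires.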

@aclements
Member

To clarify: the "no memory corruption" assumption is more specifically "no writes through bad pointers". We've had some memory corruption problems resulting from premature collection of reachable objects; that alone isn't sufficient except as a springboard for aliasing a pointer and a non-pointer, where the non-pointer happens to be the address of a workbuf, which is quite unlikely.

@aclements
Member

Another assumption I made is that the per-P gcWork cache is never accessed concurrently. That's perhaps the most likely assumption to be wrong.
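
To illustrate what that assumption guards against, here is a hypothetical sketch (not runtime code) of an unsynchronized per-P-style cache being used by two goroutines at once. The race detector flags it, and the lost updates to the count can leave it inconsistent with the buffer contents, or even push it out of bounds and panic with an index out of range much like the report above:

    // Hypothetical sketch of concurrent access to an unsynchronized per-P
    // style cache. put/tryGet assume a single owner; with two concurrent
    // users the count races and can drift out of sync with the contents.
    package main

    import "sync"

    type gcWorkCache struct {
        nobj int
        obj  [64]uintptr
    }

    func (w *gcWorkCache) put(p uintptr) {
        w.obj[w.nobj] = p // no locking: safe only if exactly one M owns this P
        w.nobj++
    }

    func (w *gcWorkCache) tryGet() (uintptr, bool) {
        if w.nobj == 0 {
            return 0, false
        }
        w.nobj--
        return w.obj[w.nobj], true
    }

    func main() {
        var cache gcWorkCache
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ { // two concurrent users: the broken scenario
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    cache.put(uintptr(j))
                    cache.tryGet()
                }
            }()
        }
        wg.Wait()
    }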

@aclements
Member

@zeebo, are you able to reproduce this fairly reliably? Would you mind trying out https://go-review.googlesource.com/12124 and seeing if it fixes the problem for you? Otherwise, I think we'll have to wait a week or so to see whether it pops up on the dashboard again before declaring this fixed.

@gopherbot
Contributor

CL https://golang.org/cl/12124 mentions this issue.

@zeebo
Contributor Author

zeebo commented Jul 14, 2015

Yeah, it's pretty reliable. I get 2-3 crashes per day at least. I'll try again with that change first thing tomorrow morning and report back.

@aclements
Member

@zeebo, did you get a chance to try it?

@zeebo
Contributor Author

zeebo commented Jul 15, 2015

Sorry, no. I'm doing it right now.

aclements added a commit that referenced this issue Jul 15, 2015
Currently, we run a P's safe-point function immediately after entering
_Psyscall state. This is unsafe, since as soon as we put the P in
_Psyscall, we no longer control the P and another M may claim it.
We'll still run the safe-point function only once (because doing so
races on an atomic), but the P may no longer be at a safe-point when
we do so.

In particular, this means that the use of forEachP to dispose all P's
gcw caches is unsafe. A P may enter a syscall, run the safe-point
function, and dispose the P's gcw cache concurrently with another M
claiming the P and attempting to use its gcw cache. If this happens,
we may empty the gcw's workbuf after putting it on
work.{full,partial}, or add pointers to it after putting it in
work.empty. This will cause an assertion failure when we later pop the
workbuf from the list and its object count is inconsistent with the
list we got it from.

Fix this by running the safe-point function just before putting the P
in _Psyscall.

Related to #11640. This probably fixes this issue, but while I'm able
to show that we can enter a bad safe-point state as a result of this,
I can't reproduce that specific failure.

Change-Id: I6989c8ca7ef2a4a941ae1931e9a0748cbbb59434
Reviewed-on: https://go-review.googlesource.com/12124
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
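
For anyone following along, here is a minimal model of the ordering the CL describes; the names (p, pSyscall, disposeCache, entersyscall) are simplified stand-ins, not the real scheduler code. The point is simply that the safe-point work must run while the current M still exclusively owns the P, before the state change is published:

    // Minimal model of the ordering fixed by the CL: flush the per-P cache
    // (the safe-point work) before publishing the state that lets another
    // M claim the P. All names here are simplified stand-ins.
    package main

    import (
        "fmt"
        "sync/atomic"
    )

    const (
        pRunning int32 = iota
        pSyscall
    )

    type p struct {
        status int32
        cache  []uintptr // stand-in for the per-P gcw cache
    }

    // disposeCache plays the role of the forEachP safe-point function that
    // flushes the P's gcw cache onto the global work lists.
    func disposeCache(pp *p) {
        fmt.Println("flushing", len(pp.cache), "cached pointers")
        pp.cache = pp.cache[:0]
    }

    // entersyscall models the fixed ordering: run the safe-point work while
    // we still exclusively own pp, then publish _Psyscall. Before the fix
    // the store came first, so another M could claim pp and use pp.cache
    // concurrently with the flush.
    func entersyscall(pp *p) {
        disposeCache(pp)
        atomic.StoreInt32(&pp.status, pSyscall)
    }

    func main() {
        pp := &p{status: pRunning, cache: []uintptr{0x10f5d1c0, 0x10f05200}}
        entersyscall(pp)
        fmt.Println("status after entersyscall:", atomic.LoadInt32(&pp.status))
    }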
@zeebo
Contributor Author

zeebo commented Jul 16, 2015

It has been a day and I've seen no panics. Usually it happens by now, but you can never be sure with these things. :)

@aclements
Member

Great! Maybe give it another day and if it hasn't fallen over, we'll call this issue closed.

Thanks for testing!

@zeebo
Contributor Author

zeebo commented Jul 16, 2015

No problem. A picture is worth a thousand words, so here's a graph of uptime annotated with the Go version the binary was compiled with:

[image: uptime graph]

Green is go1.4.2, red is go1.5beta1, and blue is with the cherry-picked change. The purple line is the uptime. It looks like whatever was triggering the restarts has slowed down in the past few days for whatever reason, so yeah, it's probably best to wait another day or two.

@zeebo
Contributor Author

zeebo commented Jul 17, 2015

[image: uptime graph]

Looks like no crashes in two days. I'd be happy for you to close this bug, on the condition that I can reopen it if I observe it again over a longer period of time. :)

@aclements
Member

Awesome. Please do reopen if you observe it again.
