runtime: panics during gc #11640

Closed
zeebo opened this issue Jul 9, 2015 · 14 comments

@zeebo
Contributor

zeebo commented Jul 9, 2015

I recently upgraded one of our systems to go1.5beta1 (b6ead9f) and am observing this crash periodically:

panic: runtime error: index out of range
fatal error: panic holding locks
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
fatal error: workbuf is empty

goroutine 28 [running]:
runtime.throw(0x93e588, 0x13)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0x10e18e4c sp=0x10e18e40
runtime.gopanic(0x83a6b8, 0x10e20008)
    /home/jeff/go/src/runtime/panic.go:385 +0x2a4 fp=0x10e18e90 sp=0x10e18e4c
runtime.panicindex()
    /home/jeff/go/src/runtime/panic.go:12 +0x48 fp=0x10e18ea4 sp=0x10e18e90
runtime.(*gcWork).put(0x10e15328, 0x10f5d1c0)
    /home/jeff/go/src/runtime/mgcwork.go:87 +0x70 fp=0x10e18eb4 sp=0x10e18ea4
runtime.greyobject(0x10f5d1c0, 0x10f05200, 0x888, 0x10dea2e3, 0x0, 0xb69ae8f8, 0x10e15328)
    /home/jeff/go/src/runtime/mgcmark.go:898 +0x304 fp=0x10e18f00 sp=0x10e18eb4
runtime.scanobject(0x10f05200, 0x10e15328)
    /home/jeff/go/src/runtime/mgcmark.go:821 +0x294 fp=0x10e18f68 sp=0x10e18f00
runtime.gcDrainUntilPreempt(0x10e15328, 0x7d0, 0x0)
    /home/jeff/go/src/runtime/mgcmark.go:686 +0x1dc fp=0x10e18f94 sp=0x10e18f68
runtime.gcBgMarkWorker(0x10e14a00)
    /home/jeff/go/src/runtime/mgc.go:1261 +0x48c fp=0x10e18fd4 sp=0x10e18f94
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1016 +0x4 fp=0x10e18fd4 sp=0x10e18fd4
created by runtime.gcBgMarkStartWorkers
    /home/jeff/go/src/runtime/mgc.go:1173 +0x84
panic: runtime error: index out of range
fatal error: panic holding locks

goroutine 41 [running]:
runtime.throw(0x93e588, 0x13)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0x10e1fefc sp=0x10e1fef0
runtime.gopanic(0x83a6b8, 0x10e20008)
    /home/jeff/go/src/runtime/panic.go:385 +0x2a4 fp=0x10e1ff40 sp=0x10e1fefc
runtime.panicindex()
    /home/jeff/go/src/runtime/panic.go:12 +0x48 fp=0x10e1ff54 sp=0x10e1ff40
runtime.(*gcWork).tryGet(0x10f6c928, 0x3cf54)
    /home/jeff/go/src/runtime/mgcwork.go:115 +0x9c fp=0x10e1ff68 sp=0x10e1ff54
runtime.gcDrainUntilPreempt(0x10f6c928, 0x7d0, 0x0)
    /home/jeff/go/src/runtime/mgcmark.go:681 +0x158 fp=0x10e1ff94 sp=0x10e1ff68
runtime.gcBgMarkWorker(0x10f6c000)
    /home/jeff/go/src/runtime/mgc.go:1261 +0x48c fp=0x10e1ffd4 sp=0x10e1ff94
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1016 +0x4 fp=0x10e1ffd4 sp=0x10e1ffd4
created by runtime.gcBgMarkStartWorkers
    /home/jeff/go/src/runtime/mgc.go:1173 +0x84

It's running on a GOOS=linux GOARCH=arm GOARM=5 system. It's a large program, so I can't provide a simple reproduction at this time, but I'll do my best to provide any information that might be helpful, including running any diagnostics you ask for.

@ianlancetaylor added this to the Go1.5 milestone Jul 9, 2015
@ianlancetaylor
Member

CC @RLH @aclements

@aclements
Member

I'm not sure about the "index out of range", but "workbuf is empty" has happened a few times on the dashboard:

2015-06-19T00:44:13-2f3d103/freebsd-386-gce101
2015-06-24T17:53:25-7708248/dragonfly-amd64
2015-06-24T22:22:55-6613090/darwin-amd64
2015-06-26T17:26:09-e0e47e2/linux-arm-arm5
2015-07-02T08:29:33-be2a3e2/linux-arm-arm5
2015-07-09T20:05:33-4c33250/darwin-386
2015-07-12T20:43:01-7e4c06d/darwin-386

Given the variety of architectures and OSes, this doesn't appear to be arch- or OS-specific. In particular, since it's also happening on x86, I don't think this is a memory-ordering issue.

Where it happens is also scattered. It often occurs while testing $GOROOT/test, but several failures are in test/run.go itself or in the linker invoked by test/run.go. This makes me wonder whether this is related to #11617.

I worked through an informal backwards proof that we're satisfying all of the invariants that should ensure checknonempty never fails in trygetfull or getfull. This depended on a few assumptions, like that the compiler is correct, that our lock-free list is correct (note that the above failures cover three different implementations of the platform-dependent part, so that's unlikely to be the problem), and that there's no memory corruption.
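
For reference, here is a minimal, standalone sketch of the kind of assertion behind the "workbuf is empty" throw above. The type, field, and function names are simplified approximations of the runtime's internal mgcwork.go structures, not the actual implementation:

    // Simplified model of a GC work buffer and the nonempty check that
    // produces the "workbuf is empty" failure.
    package main

    import "fmt"

    type workbuf struct {
        nobj int          // number of valid entries in obj
        obj  [256]uintptr // grey-object pointers waiting to be scanned
    }

    // checknonempty mirrors the invariant asserted when a buffer is popped
    // from the full/partial lists: such a buffer must still contain work.
    func (b *workbuf) checknonempty() {
        if b.nobj == 0 {
            panic("workbuf is empty") // the runtime throws here instead
        }
    }

    func main() {
        var b workbuf
        b.obj[b.nobj] = 0x10f5d1c0 // queue one (fake) grey object
        b.nobj++
        b.checknonempty() // fine: the buffer has work

        // If someone else drained the buffer after it was published on the
        // full list, the next consumer to pop it would hit the assertion:
        b.nobj = 0
        defer func() { fmt.Println("recovered:", recover()) }()
        b.checknonempty()
    }

If any of those assumptions is violated, a buffer's object count can end up disagreeing with the list it was popped from, and this is the check that fires.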

@aclements
Member

To clarify: the "no memory corruption" assumption is more specifically "no writes through bad pointers". We've had some memory corruption problems resulting from premature collection of reachable objects; that alone isn't sufficient except as a springboard for aliasing a pointer and a non-pointer, where the non-pointer happens to be the address of a workbuf, which is quite unlikely.

@aclements
Member

Another assumption I made is that the per-P gcWork cache is never accessed concurrently. That's perhaps the most likely assumption to be wrong.
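
To illustrate what that assumption guards against, here is a hypothetical sketch (not runtime code) of an unsynchronized per-P-style cache being used by two goroutines at once. The race detector flags it, and the lost updates to the count can leave it inconsistent with the buffer contents, or even push it out of bounds and panic with an index out of range much like the report above:

    // Hypothetical sketch of concurrent access to an unsynchronized per-P
    // style cache. put/tryGet assume a single owner; with two concurrent
    // users the count races and can drift out of sync with the contents.
    package main

    import "sync"

    type gcWorkCache struct {
        nobj int
        obj  [64]uintptr
    }

    func (w *gcWorkCache) put(p uintptr) {
        w.obj[w.nobj] = p // no locking: safe only if exactly one M owns this P
        w.nobj++
    }

    func (w *gcWorkCache) tryGet() (uintptr, bool) {
        if w.nobj == 0 {
            return 0, false
        }
        w.nobj--
        return w.obj[w.nobj], true
    }

    func main() {
        var cache gcWorkCache
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ { // two concurrent users: the broken scenario
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    cache.put(uintptr(j))
                    cache.tryGet()
                }
            }()
        }
        wg.Wait()
    }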

@aclements
Member

@zeebo, are you able to reproduce this fairly reliably? Would you mind trying out https://go-review.googlesource.com/12124 and seeing if it fixes the problem for you? Otherwise, I think we'll have to wait a week or so to see whether it pops up on the dashboard again before declaring this fixed.

@gopherbot
Contributor

CL https://golang.org/cl/12124 mentions this issue.

@zeebo
Contributor Author

zeebo commented Jul 14, 2015

Yeah, it's pretty reliable. I get 2-3 crashes per day at least. I'll try again with that change first thing tomorrow morning and report back.

@aclements
Member

@zeebo, did you get a chance to try it?

@zeebo
Contributor Author

zeebo commented Jul 15, 2015

Sorry, no. I'm doing it right now.

aclements added a commit that referenced this issue Jul 15, 2015
Currently, we run a P's safe-point function immediately after entering
_Psyscall state. This is unsafe, since as soon as we put the P in
_Psyscall, we no longer control the P and another M may claim it.
We'll still run the safe-point function only once (because doing so
races on an atomic), but the P may no longer be at a safe-point when
we do so.

In particular, this means that the use of forEachP to dispose all P's
gcw caches is unsafe. A P may enter a syscall, run the safe-point
function, and dispose the P's gcw cache concurrently with another M
claiming the P and attempting to use its gcw cache. If this happens,
we may empty the gcw's workbuf after putting it on
work.{full,partial}, or add pointers to it after putting it in
work.empty. This will cause an assertion failure when we later pop the
workbuf from the list and its object count is inconsistent with the
list we got it from.

Fix this by running the safe-point function just before putting the P
in _Psyscall.

Related to #11640. This probably fixes this issue, but while I'm able
to show that we can enter a bad safe-point state as a result of this,
I can't reproduce that specific failure.

Change-Id: I6989c8ca7ef2a4a941ae1931e9a0748cbbb59434
Reviewed-on: https://go-review.googlesource.com/12124
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
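
For anyone following along, here is a minimal model of the ordering the CL describes; the names (p, pSyscall, disposeCache, entersyscall) are simplified stand-ins, not the real scheduler code. The point is simply that the safe-point work must run while the current M still exclusively owns the P, before the state change is published:

    // Minimal model of the ordering fixed by the CL: flush the per-P cache
    // (the safe-point work) before publishing the state that lets another
    // M claim the P. All names here are simplified stand-ins.
    package main

    import (
        "fmt"
        "sync/atomic"
    )

    const (
        pRunning int32 = iota
        pSyscall
    )

    type p struct {
        status int32
        cache  []uintptr // stand-in for the per-P gcw cache
    }

    // disposeCache plays the role of the forEachP safe-point function that
    // flushes the P's gcw cache onto the global work lists.
    func disposeCache(pp *p) {
        fmt.Println("flushing", len(pp.cache), "cached pointers")
        pp.cache = pp.cache[:0]
    }

    // entersyscall models the fixed ordering: run the safe-point work while
    // we still exclusively own pp, then publish _Psyscall. Before the fix
    // the store came first, so another M could claim pp and use pp.cache
    // concurrently with the flush.
    func entersyscall(pp *p) {
        disposeCache(pp)
        atomic.StoreInt32(&pp.status, pSyscall)
    }

    func main() {
        pp := &p{status: pRunning, cache: []uintptr{0x10f5d1c0, 0x10f05200}}
        entersyscall(pp)
        fmt.Println("status after entersyscall:", atomic.LoadInt32(&pp.status))
    }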
@zeebo
Contributor Author

zeebo commented Jul 16, 2015

It has been a day and I've seen no panics. Usually it happens by now, but you can never be sure with these things. :)

@aclements
Member

Great! Maybe give it another day and if it hasn't fallen over, we'll call this issue closed.

Thanks for testing!

@zeebo
Contributor Author

zeebo commented Jul 16, 2015

No problem. A picture is worth a thousand words, so here's a graph of uptime annotated with the Go version the binary was compiled with:

[image: uptime graph]

Green is go1.4.2, red is go1.5beta1, and blue is with the cherry-picked change. The purple line is the uptime. It looks like whatever was triggering the restarts has slowed down in the past few days for whatever reason, so yeah, it's probably best to wait another day or two.

@zeebo
Contributor Author

zeebo commented Jul 17, 2015

[image: uptime graph]

Looks like no crashes in two days. I'd be happy for you to close this bug, on the condition that I can reopen it if I observe it again over a longer period of time. :)

@aclements
Member

Awesome. Please do reopen if you observe it again.
