runtime: panics during gc #11640
CC @RLH @aclements
I'm not sure about the "index out of range", but "workbuf is empty" has happened a few times on the dashboard: 2015-06-19T00:44:13-2f3d103/freebsd-386-gce101

Given the variety of architectures and OSes, this doesn't appear to be arch- or OS-specific. In particular, since it's also happening on x86, I don't think this is a memory ordering issue. Where it happens is also scattered: it's often while testing $GOROOT/test, but several failures are in test/run.go itself or in the linker invoked by test/run.go. This makes me wonder if this is related to #11617.

I worked through an informal backwards proof that we're satisfying all of the invariants that should ensure checknonempty never fails in trygetfull or getfull. This depended on a few assumptions: that the compiler is correct, that our lock-free list is correct (note that the above failures cover three different implementations of the platform-dependent part, so that's unlikely to be the problem), and that there's no memory corruption.
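For readers following along, here is a heavily simplified, mutex-based sketch of the invariant being discussed. It is not the runtime's actual lock-free code; workbuf, workList, and tryGetFull are illustrative names. The point is only that a buffer on the full list must never be empty when popped, so an empty pop means some other invariant was violated upstream.

```go
package main

import (
	"fmt"
	"sync"
)

// workbuf is a simplified stand-in for the runtime's GC work buffer.
type workbuf struct {
	obj []uintptr // grey objects waiting to be scanned
}

// workList is a simplified stand-in for the full list; the real one is lock-free.
type workList struct {
	mu   sync.Mutex
	bufs []*workbuf
}

func (l *workList) push(b *workbuf) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.bufs = append(l.bufs, b)
}

// tryGetFull pops a buffer and asserts the invariant that buffers on the
// full list are non-empty, mirroring the checknonempty-style check.
func (l *workList) tryGetFull() *workbuf {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.bufs) == 0 {
		return nil
	}
	b := l.bufs[len(l.bufs)-1]
	l.bufs = l.bufs[:len(l.bufs)-1]
	if len(b.obj) == 0 {
		panic("workbuf is empty") // the assertion failure reported above
	}
	return b
}

func main() {
	var full workList
	full.push(&workbuf{obj: []uintptr{0x1000}})
	fmt.Println(len(full.tryGetFull().obj)) // 1
}
```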
To clarify: the "no memory corruption" assumption is more specifically "no writes through bad pointers". We've had some memory corruption problems resulting from premature collection of reachable objects; that alone isn't sufficient except as a springboard for aliasing a pointer and a non-pointer, where the non-pointer happens to be the address of a workbuf, which is quite unlikely.
Another assumption I made is that the per-P gcWork cache is never accessed concurrently. That's perhaps the most likely assumption to be wrong.
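To make that assumption concrete, a minimal sketch with invented names (not the real runtime API): each P carries a gcWork cache, and only the M that currently owns that P is supposed to touch it.

```go
package main

import "fmt"

// gcWork stands in for the per-P cache of grey objects.
type gcWork struct {
	wbuf []uintptr
}

type p struct {
	gcw gcWork
}

type m struct {
	p *p // the P this M currently owns, if any
}

// scanObject greys a child pointer into the owning P's cache. This is only
// safe under the assumption above: no other M can reach mm.p.gcw at the
// same time, because owning the P implies exclusive access to its cache.
func (mm *m) scanObject(child uintptr) {
	mm.p.gcw.wbuf = append(mm.p.gcw.wbuf, child)
}

func main() {
	mm := &m{p: &p{}}
	mm.scanObject(0x1000)
	fmt.Println(len(mm.p.gcw.wbuf)) // 1
}
```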
@zeebo, are you able to reproduce this fairly reliably? Would you mind trying out https://go-review.googlesource.com/12124 and seeing if it fixes the problem for you? Otherwise, I think we'll have to wait a week or so to see if it pops up on the dashboard again before declaring this fixed.
CL https://golang.org/cl/12124 mentions this issue.

Yeah, it's pretty reliable. I get 2-3 crashes per day at least. I'll try again with that change first thing tomorrow morning and report back.

@zeebo, did you get a chance to try it?

Sorry, no. I'm doing it right now.
Currently, we run a P's safe-point function immediately after entering _Psyscall state. This is unsafe, since as soon as we put the P in _Psyscall, we no longer control the P and another M may claim it. We'll still run the safe-point function only once (because doing so races on an atomic), but the P may no longer be at a safe point when we do so.

In particular, this means that the use of forEachP to dispose all P's gcw caches is unsafe. A P may enter a syscall, run the safe-point function, and dispose the P's gcw cache concurrently with another M claiming the P and attempting to use its gcw cache. If this happens, we may empty the gcw's workbuf after putting it on work.{full,partial}, or add pointers to it after putting it in work.empty. This will cause an assertion failure when we later pop the workbuf from the list and its object count is inconsistent with the list we got it from.

Fix this by running the safe-point function just before putting the P in _Psyscall.

Related to #11640. This probably fixes this issue, but while I'm able to show that we can enter a bad safe-point state as a result of this, I can't reproduce that specific failure.

Change-Id: I6989c8ca7ef2a4a941ae1931e9a0748cbbb59434
Reviewed-on: https://go-review.googlesource.com/12124
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
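A stripped-down sketch of the ordering change the commit describes, using invented names (p, runSafePointFn, entersyscallBuggy, entersyscallFixed) rather than the real runtime functions; the only point is the relative order of disposing the cache and publishing the _Psyscall state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	pRunning int32 = iota
	pSyscall
)

type p struct {
	status    int32
	gcwCached bool // stands in for the P's gcw cache
}

// runSafePointFn disposes the P's gcw cache; it is only safe while the
// current M still owns the P exclusively.
func runSafePointFn(pp *p) {
	pp.gcwCached = false
}

// entersyscallBuggy mirrors the old order: the P is published in _Psyscall
// first, so another M may claim it and start using its gcw cache while we
// are still disposing it.
func entersyscallBuggy(pp *p) {
	atomic.StoreInt32(&pp.status, pSyscall) // P is now up for grabs
	runSafePointFn(pp)                      // too late: we may no longer own pp
}

// entersyscallFixed mirrors the fix: run the safe-point function while we
// still own the P, then publish _Psyscall.
func entersyscallFixed(pp *p) {
	runSafePointFn(pp)
	atomic.StoreInt32(&pp.status, pSyscall)
}

func main() {
	pp := &p{status: pRunning, gcwCached: true}
	entersyscallFixed(pp)
	fmt.Println(pp.status == pSyscall, pp.gcwCached) // true false
}
```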
It has been a day and I've seen no panics. Usually it happens by now, but you can never be sure with these things. :)

Great! Maybe give it another day and if it hasn't fallen over, we'll call this issue closed. Thanks for testing!
No problem. A picture is worth a thousand words, so here's a graph of the uptime along with the Go version the binary was compiled with: green is go1.4.2, red is go1.5beta1, and blue is the cherry-picked change; the purple line is the uptime. It looks like whatever was triggering the restarts has slowed down in the past few days for whatever reason, so it's probably best to wait another day or two.
Awesome. Please do reopen if you observe it again.
I recently upgraded one of our systems to use go1.5beta1 (b6ead9f) and observe this crash periodically:
It's running on a GOOS=linux GOARCH=arm GOARM=5 system. It's a large program so I can't provide a simple reproduction at this time, but I can do my best to give any information that might be helpful, including running any diagnostics you ask for.