runtime: self-deadlock on mheap_.lock #64067
Comments
Change https://go.dev/cl/541635 mentions this issue:
Huh! Is there any chance this might also explain #56418?
Hm... I don't obviously see goroutines blocked on the same lock in #56418, so I can't be sure. But two goroutines are suspiciously reported as "running on another thread", and they could in theory be blocked on this lock, so I can't rule it out. I suppose we'll find out. :)
As an aside, I think we should have a checker mode for this invariant. In the vast majority of cases (like this one) it should be pretty obvious statically that we're acquiring this lock while not on the system stack. The runtime does not make very many indirect calls.
@gopherbot Please open a backport issue for Go 1.21. This issue can cause deadlocks in running programs. It's rare but there are no workarounds. It also only applies to code that landed in Go 1.21.
Backport issue(s) opened: #64073 (for 1.21). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/541955 mentions this issue:
… callees on the systemstack

These functions acquire the heap lock. If they're not called on the systemstack, a stack growth could cause a self-deadlock since stack growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the ppc64 port right now, it's very surprising (and probably just coincidental) that it's showing up now.

For #64050.
For #64062.
For #64067.
Fixes #64073.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
(cherry picked from commit 5f08b44)
Reviewed-on: https://go-review.googlesource.com/c/go/+/541955
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
After landing a bunch of changes today (and yesterday) the ppc64 builders started timing out:
running on other thread
goroutine created by runtime.gcenable
#64062

The root cause of the issue is a self-deadlock on mheap_.lock. An invariant of this lock is that it must be held only on the system stack, because a stack growth may cause the lock to be acquired again when allocating a new stack.

This invariant isn't followed in a very particular function that was added last release. The fix is simple: follow the invariant.
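
To make the failure mode concrete, here is a minimal user-level sketch of the invariant. It is not runtime code: toyHeap, allocStack, and withHeapLock are hypothetical names, a sync.Mutex stands in for mheap_.lock, and the mayGrowStack flag stands in for "running on an ordinary goroutine stack" (true) versus "running on the system stack" (false). TryLock is used only so the demo reports the would-be deadlock instead of hanging.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyHeap is a hypothetical stand-in for the runtime's mheap_.
// Like the real heap lock, its mutex is not reentrant.
type toyHeap struct {
	lock sync.Mutex
}

// allocStack models stack growth allocating a new stack from the page heap,
// which needs the heap lock. TryLock lets the sketch report a re-acquisition
// as an error; the real runtime would simply block forever.
func (h *toyHeap) allocStack() error {
	if !h.lock.TryLock() {
		return errors.New("self-deadlock: heap lock already held by this goroutine")
	}
	defer h.lock.Unlock()
	return nil
}

// withHeapLock holds the heap lock and then does some work.
// mayGrowStack is an assumption of this sketch: true models code on a
// growable goroutine stack, false models code on the system stack.
func (h *toyHeap) withHeapLock(mayGrowStack bool) error {
	h.lock.Lock()
	defer h.lock.Unlock()
	if mayGrowStack {
		// A function call here may need a bigger stack, which allocates.
		return h.allocStack()
	}
	return nil
}

func main() {
	h := &toyHeap{}
	fmt.Println("goroutine stack:", h.withHeapLock(true))
	fmt.Println("system stack:   ", h.withHeapLock(false))
}
```

In the sketch, the first call reports the re-acquisition and the second returns nil, mirroring why the real fix is to make the affected calls on the system stack.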
What's bizarre is why this started showing up now, a very long time later. Another question is why it started showing up in response to a set of changes I landed today, and how those changes perturbed the runtime into triggering this case more often. This is also a potential problem on all platforms, not just ppc64. I don't know what makes ppc64 more likely to trigger it.
One theory is that this call path is taken when the heap goal exceeds a certain threshold. The thing is, the heap goal is simply set arbitrarily high if GOGC=off (which is actually kind of a bug in the check, but that part is mostly harmless), so any test setting GOGC=off would be susceptible. I suspect it has something to do with this, but I am unsure.

I'm also unclear as to why this is happening in the compiler so often. Perhaps some packages we build create a heap big enough to exceed the check? Either that, or it could be related to the manipulation of GOGC that it does.