runtime/pprof: TestProfilerStackDepth/mutex failures #68562
Issue created automatically to collect these failures.

Example (log):

— watchflakes

Comments

Found new dashboard test flakes for:

2024-07-23 21:29 gotip-windows-386 go@c18ff292 runtime/pprof.TestProfilerStackDepth/mutex [ABORT] (log)
I spent some time looking into this failure, and it's clearly some kind of GC deadlock. Notably, the goroutine trying to write out the mutex profile calls into malloc, then into GC assists, and gets stuck somewhere in assists. Unfortunately it's unclear exactly where; one candidate is the assist queue lock.

The lack of any smoking guns in the stack dump suggests to me that the assist queue lock is the most likely suspect. A final note: all the GC mark workers in the stack dump are parked. This kinda makes sense -- the GC is not done, but they can't find any work to do because in 2 minutes they almost certainly found everything they could, aside from wherever that one goroutine is stuck.
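For context, the user-level operation the stuck goroutine was performing looks roughly like the minimal sketch below (illustrative only, not the test's actual code): serializing the mutex profile allocates, and during an active GC cycle those allocations can be charged a mark assist.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

func main() {
	// Writing out the "mutex" profile builds and encodes the accumulated
	// contention records, which allocates. If a GC cycle is in progress,
	// any of those allocations may be asked to do a mark assist -- the
	// path the hung goroutine in the stack dump was on.
	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		fmt.Println("write mutex profile:", err)
		return
	}
	fmt.Printf("encoded %d bytes of mutex profile\n", buf.Len())
}
```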
Actually, the assist queue lock theory is not possible. We know for a fact where we're stuck, and that rules it out. So, the likely explanation is difficulty preempting some goroutine (I can't think of any other reason we'd block so long...).
I have a new possible lead/theory. This test is somewhat special in that the mutex profile rate is 1. What if there's a place where we're acquiring and/or releasing a runtime lock somewhere deep in the assist path? Though, this doesn't really line up with the evidence, since there should be some other goroutine waiting on a lock. Maybe it's a self-deadlock? The write-mutex-profile code is interacting with the same locks that the runtime unlock path is, though I don't immediately see how we could self-deadlock in exactly the place we're stuck. Even if we were holding a lock (which it doesn't look like we are), that should prevent recording further locks.
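To make the "mutex profile rate is 1" point concrete, here is a minimal sketch (a hypothetical example, not the test itself) of what that setting means: runtime.SetMutexProfileFraction(1) reports every contention event rather than sampling roughly one in rate, so even a single blocked Lock ends up in the profile.

```go
package main

import (
	"runtime"
	"sync"
	"time"
)

func main() {
	// A rate of 1 means every mutex contention event is reported
	// (on average 1/rate events are sampled). The test runs in this
	// mode, so even rare contention is guaranteed to be recorded.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	// Force one guaranteed contention event: the goroutine blocks on
	// Lock until main releases the mutex, and that delay shows up in
	// the mutex profile.
	var mu sync.Mutex
	var wg sync.WaitGroup
	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock() // blocks here; this contention is what gets profiled
		mu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // hold the lock so the Lock above really blocks
	mu.Unlock()
	wg.Wait()
}
```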
In the last triage meeting, I said we'd just wait for more failures. So, setting the WaitingForInfo label.
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)