runtime: checkdead fires due to suspected race in the Go runtime when GOMAXPROCS=1 on AWS #59600
Not sure it is valuable information, but, as can be seen in the runtime stacktrace, the source of the …
Thanks for the report. Do you have crash output with that extra logging added to checkdead?
hey @prattmic, thank you for looking into it. I have been considering whether it is possible that the netpoll part in … I am currently working on obtaining a core dump. Here is an example of output with the patched runtime:
Normally I'd say it seems unlikely to lose a race against all of the Ps like this, but since this is GOMAXPROCS=1 that makes it more plausible.
A core dump should provide more insight. Consider building from a tip toolchain and setting GODEBUG=dontfreezetheworld=1. Setting that keeps the other threads running when the throw happens, which should make the core dump more representative. I don't suppose you can share a program that reproduces this behavior? Do you know if this reproduced on older versions of Go?
To clarify, you cannot reproduce elsewhere even when explicitly setting GOMAXPROCS=1?
I was not able to reproduce it on other instance types so far (i.e., forcing GOMAXPROCS=1 in a multithreaded environment). I should try again with things I learned in the interim.
I have no minimal reproducer that I can share yet. I have not tried older versions yet, as I cannot go back past 1.19 anyway (would that still be useful?).
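As a rough sketch of how one might try to force similar conditions on an ordinary multi-core machine (this is not a known reproducer; the listener/client setup below is invented purely for illustration), one can pin GOMAXPROCS to 1 and keep many goroutines parked in netpoll:

```go
// Not a known reproducer: just an illustration of the kind of load the
// report describes (GOMAXPROCS=1 plus many goroutines waiting on network).
package main

import (
	"fmt"
	"io"
	"net"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // emulate the monothreaded instance

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go io.Copy(io.Discard, c) // goroutines blocked on network reads
		}
	}()

	// Many clients writing slowly keeps goroutines cycling through netpoll.
	for i := 0; i < 200; i++ {
		go func() {
			c, err := net.Dial("tcp", ln.Addr().String())
			if err != nil {
				return
			}
			defer c.Close()
			for {
				c.Write([]byte("ping"))
				time.Sleep(10 * time.Millisecond)
			}
		}()
	}

	fmt.Println("generating network load; leave running for a long time")
	select {}
}
```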
Thank you for your suggestions. I tried to get a core dump with GODEBUG=dontfreezetheworld=1, but the process hung: the checkdead debug line was printed, and the process just stayed there. After ~40 minutes I concluded it would not quit, so I grabbed a core dump with gcore and killed it. That core dump did not help. I am still looking at the core dumps I have so far, and trying to figure out how to share more context.
After reviewing enough core dumps, I could see that the threads are consistently in a state that indicates a race on sched.lock. This issue stood out, although it is not the same race condition happening. According to the stack traces below, M1 is crashing in checkdead.

M1
M2 (always waking up scavengerState)
I concluded that something is blocking M2 from starting an M before M1 reaches checkdead.
Here, m=1 is our M2 (scavengerState waking up), and m=3 is our M1 (checkdead panic). It is a little clearer what's happening here: M2 is stealing the P (remember, monothreaded instance) from M1, then yielding somehow, making M1 push runnable goroutines from … It looks like M2 is preempted before being able to reacquire the scheduler lock in startm. Here is the instrumented startIdle closure from injectglist:

```go
startIdle := func(n int) {
for i := 0; i < n; i++ {
mp := acquirem() // See comment in startm.
pushEventTrace("injectglist acquiring sched lock (1)")
lock(&sched.lock)
pushEventTrace("injectglist acquired sched lock (1)")
pp, _ := pidlegetSpinning(0) // stealing the P from M1 here
if pp == nil {
pushEventTrace("injectglist releasing sched lock (1)")
unlock(&sched.lock)
pushEventTrace("injectglist released sched lock (1)")
releasem(mp)
break
}
pushEventTrace("injectglist releasing sched lock (2)")
unlock(&sched.lock)
pushEventTrace("injectglist released sched lock (2)")
startm(pp, false) // preempted?? mp.locks=5 at this stage
releasem(mp)
}
}
```

What could be happening here? I am trying to solve this by making sure …
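To make the suspected window concrete, here is a self-contained toy model (the sched, injectWork, and checkdead names below are invented and greatly simplified, not runtime code): the injector drops the lock between claiming the only idle P and starting an M, and a checker that runs in that gap sees runnable work with no running worker.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model of the suspected window: the lock is released between taking the
// only idle "P" and starting an "M", so a checker running in that gap sees
// runnable work with no running worker, much like checkdead does.
type sched struct {
	mu       sync.Mutex
	runq     int // runnable work (Gs)
	idleP    int // idle Ps
	runningM int // running Ms
}

// injectWork models injectglist + startm: it claims the idle P and queues
// work under the lock, then drops the lock before the worker is accounted
// as running.
func (s *sched) injectWork() {
	s.mu.Lock()
	s.runq++
	s.idleP-- // like pidlegetSpinning taking the only P
	s.mu.Unlock()

	// The window: lock released, no M started yet (startm not finished).
	time.Sleep(time.Millisecond)

	s.mu.Lock()
	s.runningM++ // startm finally registers a running M
	s.mu.Unlock()
}

// checkdead models the deadlock detector: runnable work but no running M
// looks like a deadlock, even though injectWork is about to fix it.
func (s *sched) checkdead() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.runq > 0 && s.runningM == 0 {
		fmt.Println("checkdead: runnable work but no running M (false deadlock)")
	}
}

func main() {
	s := &sched{idleP: 1}
	go s.injectWork()
	time.Sleep(500 * time.Microsecond) // usually lands inside the window
	s.checkdead()
	time.Sleep(2 * time.Millisecond) // let injectWork finish
}
```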
Thanks for continuing to dig. I'll look more closely at your logs in a bit. I will note that … Nice event trace log you added! For future reference, the runtime has similar functionality built in in debuglog (src/runtime/debuglog.go). The buffer will be dumped on throw.
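For reference, a debuglog-based version of the event-trace calls above might look like the fragment below. This is only a sketch: it has to live inside package runtime (for example next to the instrumented injectglist code), with the runtime built using the debuglog build tag, and the messages are illustrative.

```go
// Sketch only (runtime-internal): each dlog() call appends to a per-M ring
// buffer that is printed when the runtime throws, so the events leading up
// to the checkdead panic survive in the crash output.
dlog().s("injectglist: acquiring sched.lock").end()
lock(&sched.lock)
dlog().s("injectglist: acquired sched.lock, n=").i(n).end()
```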
I am relatively new to the runtime package and would have appreciated discovering debuglog earlier (although I did learn about Go's write barriers in the process 🤦). I agree that the state of M2 in the stack trace does not accurately represent the crash; I think it might not even be entering …

We may have identified the issue. I previously mentioned an earlier fix regarding the conflicting relationship between startm and checkdead; here is the relevant block of startm:

```go
nmp := mget()
if nmp == nil {
// No M is available, we must drop sched.lock and call newm.
// However, we already own a P to assign to the M.
//
// Once sched.lock is released, another G (e.g., in a syscall),
// could find no idle P while checkdead finds a runnable G but
// no running M's because this new M hasn't started yet, thus
// throwing in an apparent deadlock.
//
// Avoid this situation by pre-allocating the ID for the new M,
// thus marking it as 'running' before we drop sched.lock. This
// new M will eventually run the scheduler to execute any
// queued G's.
id := mReserveID()
unlock(&sched.lock)
var fn func()
if spinning {
// The caller incremented nmspinning, so set m.spinning in the new M.
fn = mspinning
}
newm(fn, pp, id)
// Ownership transfer of pp committed by start in newm.
// Preemption is now safe.
releasem(mp)
return
}
```

The crux of the issue is that the scheduler lock is released before startm re-acquires it at the beginning of the function, leaving a window during which checkdead can observe a runnable G with no running M. I propose a fix that introduces a new lockheld bool argument to startm, so that a caller such as injectglist can acquire the scheduler lock itself and keep it held across the call:

```go
// in our case, the caller (injectglist) acquires the sched lock and calls startm with lockheld=true
func startm(pp *p, spinning, lockheld bool) {
mp := acquirem()
if !lockheld {
lock(&sched.lock)
}
[...]
nmp := mget()
if nmp == nil {
id := mReserveID()
unlock(&sched.lock) // <- unlock no matter what
var fn func()
if spinning {
// The caller incremented nmspinning, so set m.spinning in the new M.
fn = mspinning
}
newm(fn, pp, id)
if lockheld {
lock(&sched.lock) // <- reacquire lock if necessary
}
releasem(mp)
return
}
```

All existing callers of startm are updated to pass lockheld=false, keeping their behavior unchanged, while the startIdle closure in injectglist now calls startm with the lock held:

```go
startIdle := func(n int) {
for i := 0; i < n; i++ {
mp := acquirem() // See comment in startm.
lock(&sched.lock)
pp, _ := pidlegetSpinning(0)
if pp == nil {
unlock(&sched.lock)
releasem(mp)
break
}
startm(pp, false, true)
unlock(&sched.lock)
releasem(mp)
}
}
```

I am still validating this fix.
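As a general illustration of the lockheld pattern proposed above (hypothetical pool and take names, with an ordinary sync.Mutex standing in for sched.lock), the point is that the callee manages the lock itself unless the caller declares it already holds it, so the caller's critical section stays unbroken across the call:

```go
package main

import "sync"

// pool and take are hypothetical names; sync.Mutex stands in for sched.lock.
// The callee manages the lock itself unless the caller says it already holds
// it, mirroring the shape of the proposed startm(pp, spinning, lockheld).
type pool struct {
	mu   sync.Mutex
	idle []int
}

// take removes an idle item. With lockHeld=true the caller already holds
// p.mu, and the lock state is the same on return.
func (p *pool) take(lockHeld bool) (int, bool) {
	if !lockHeld {
		p.mu.Lock()
		defer p.mu.Unlock()
	}
	if len(p.idle) == 0 {
		return 0, false
	}
	v := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	return v, true
}

func main() {
	p := &pool{idle: []int{1, 2, 3}}

	// Caller-held variant: checking and taking happen in one unbroken
	// critical section, with no window where another goroutine can see the
	// item gone but nothing started to consume it.
	p.mu.Lock()
	if v, ok := p.take(true); ok {
		_ = v // start the work for v, still under the caller's lock
	}
	p.mu.Unlock()
}
```

The proposed startm above additionally has to drop and re-take sched.lock around newm, which is why it re-locks before returning when lockheld is true.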
Great find! I think you are right, this sounds like an issue. Would you like to send a CL?
Sure! I am sending a CL. Regarding my previous attempt with GODEBUG=dontfreezetheworld=1: with that setting the process hung instead of exiting after the throw. Here's the stacktrace after around 40 minutes of deadlock in this specific instance:
Another thread is stuck attempting to acquire the lock, specifically the schedtrace path.
Change https://go.dev/cl/487316 mentions this issue: |
Regarding schedtrace, we have this problem in a few other places, since throw can be called in nearly any context with arbitrary locks held. The throws in checkdead should unlock sched.lock before throwing. The final one does already (https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=5258;bpv=1;bpt=1?), but not the others.
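To illustrate the point outside the runtime (made-up failWithState and stateMu names, with panic standing in for throw), releasing the lock before aborting keeps the crash path from wedging behind a lock that a state dumper also needs:

```go
package main

import (
	"fmt"
	"sync"
)

// stateMu and failWithState are invented names for illustration only.
var stateMu sync.Mutex

// failWithState releases the lock that other goroutines (e.g. a
// schedtrace-style dumper) need before aborting, so the crash path cannot
// wedge behind our own lock.
func failWithState(msg string) {
	stateMu.Unlock() // like unlocking sched.lock before throw
	panic(msg)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// A dumper running during crash handling can still take the
			// lock, because it was released before the panic.
			stateMu.Lock()
			fmt.Println("dumping state after:", r)
			stateMu.Unlock()
		}
	}()
	stateMu.Lock()
	failWithState("inconsistent scheduler state")
}
```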
submitted https://go.dev/cl/487316 for #59758
Is it possible to get this bugfix backported to 1.20? It is affecting users of Prometheus monitoring, as we default to GOMAXPROCS=1 for the node_exporter to reduce problems with data races in the Linux kernel.
@gopherbot Please open a backport issue to 1.20. Quoting @SuperQ:

> Is it possible to get this bugfix backported to 1.20? It is affecting users of Prometheus monitoring, as we default to GOMAXPROCS=1 for the node_exporter to reduce problems with data races in the Linux kernel.
Backport issue(s) opened: #60760 (for 1.20). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
@gopherbot Please open a backport to 1.19. This issue is not new in 1.20, and potentially affects users with GOMAXPROCS=1. |
Change https://go.dev/cl/504395 mentions this issue: |
Change https://go.dev/cl/504396 mentions this issue: |
…`startm` lock handling in caller context

This change addresses a `checkdead` panic caused by a race condition between `sysmon->startm` and `checkdead` callers, due to prematurely releasing the scheduler lock. The solution involves allowing a `startm` caller to acquire the scheduler lock and call `startm` in this context. A new `lockheld` bool argument is added to `startm`, which manages all lock and unlock calls within the function. The `startIdle` function variable in `injectglist` is updated to call `startm` with the lock held, ensuring proper lock handling in this specific case. This refined lock handling resolves the observed race condition issue.

For #59600.
Fixes #60760.

Change-Id: I11663a15536c10c773fc2fde291d959099aa71be
Reviewed-on: https://go-review.googlesource.com/c/go/+/487316
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
(cherry picked from commit ff059ad)
Reviewed-on: https://go-review.googlesource.com/c/go/+/504395
Reviewed-by: Lucien Coffe <lucien.coffe@botify.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
What version of Go are you using (go version)?
1.20.3
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
This happens on m3.medium AWS instances. I could not reproduce it anywhere else so far. This is a monothreaded amd64 instance.
What did you do?
What did you expect to see?
No crash
What did you see instead?
A crash with "runtime: checkdead: found g XX in status 1". This crash happens on average after a few hours, apparently correlated with network activity: the more network activity (i.e., the more goroutines waiting on the network), the more often it happens.
Additional details
We've encountered a race condition in the Go runtime that leads to a panic in the checkdead() function when running on a monothreaded instance (GOMAXPROCS=1). The issue seems related to a specific scenario in the findRunnable() function where a goroutine is readied between the findRunnable() check and the stopm() call on line 3007 acquiring the scheduler lock, causing an inconsistent scheduler state.

Recap:

1. findRunnable() cannot find any goroutines to return, reaching the end of the function and calling stopm().
2. stopm() acquires the lock on the scheduler and calls mput().
3. mput() adds the M to the idle pool and calls checkdead().
4. checkdead() finds a runnable G and panics.

The race condition appears to be related to the "delicate dance" code in findRunnable(), which tries to handle a race between the network poller readying a goroutine and the M being stopped. However, this race condition can still occur in rare cases, leading to a panic in checkdead().

To investigate this issue further, we tried to create a minimal reproducer, but we could not trigger it consistently due to the race condition's rarity. Nonetheless, runtime stack traces from the checkdead() panics in the real-world scenario clearly show that it originates from the specific stopm() call in findRunnable(), indicating a strong correlation between the race condition and the panic.

A temporary patch has been applied to the checkdead() function to detect and handle the inconsistent state caused by the race condition without panicking immediately. Instead, it collects information and prints debug logs about the state of goroutines and the runtime stack. It only panics when the inconsistent state occurs too many times in a row or for an extended period of time. All occurrences so far were already resolved (i.e., an M was active) by the next call to checkdead().

This is of course not a viable solution, as it does not fix the root cause of the issue and may introduce other problems.
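For illustration only (this is not the actual patch, and the names and thresholds below are invented), the mitigation described above amounts to logic along these lines: tolerate a transiently inconsistent state, record it, and only give up if it repeats or persists.

```go
package main

import (
	"fmt"
	"time"
)

// Illustration of the temporary mitigation described above, not the actual
// runtime patch: tolerate a transiently inconsistent state, log it, and only
// give up if it repeats too many times in a row or persists too long.
const (
	maxConsecutive = 5
	maxDuration    = 10 * time.Second
)

var (
	consecutive int
	firstSeen   time.Time
)

// reportInconsistent is called when the check finds a runnable G but no
// running M; it panics only if the state keeps recurring.
func reportInconsistent() {
	if consecutive == 0 {
		firstSeen = time.Now()
	}
	consecutive++
	fmt.Printf("inconsistent scheduler state seen %d time(s) in a row\n", consecutive)
	if consecutive >= maxConsecutive || time.Since(firstSeen) > maxDuration {
		panic("checkdead: inconsistent state persisted")
	}
}

// reportConsistent resets the counters once an M is observed running again.
func reportConsistent() {
	consecutive = 0
}

func main() {
	// In the reported crashes the state always resolved by the next check,
	// so the counter reset and no panic was raised.
	reportInconsistent()
	reportConsistent()
}
```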