Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: deadlock involving gcControllerState.enlistWorker #19112

Closed
rsc opened this issue Feb 15, 2017 · 2 comments
Closed

runtime: deadlock involving gcControllerState.enlistWorker #19112

rsc opened this issue Feb 15, 2017 · 2 comments
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Feb 15, 2017

We have seen one instance of a production job suddenly spinning to 100% CPU and becoming unresponsive. In that one instance, a SIGQUIT was sent after 328 minutes of spinning, and the stacks showed a single goroutine in "IO wait (scan)" state.

Looking for things that might get stuck if a goroutine got stuck in scanning a stack, we found that injectglist does:

lock(&sched.lock)
var n int
for n = 0; glist != nil; n++ {
	gp := glist
	glist = gp.schedlink.ptr()
	casgstatus(gp, _Gwaiting, _Grunnable)
	globrunqput(gp)
}
unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes away. Essentially, this code locks sched.lock and then while holding sched.lock, waits to lock gp.atomicstatus.

The code that is doing the scan is:

if castogscanstatus(gp, s, s|_Gscan) {
	if !gp.gcscandone {
		scanstack(gp, gcw)
		gp.gcscandone = true
	}
	restartg(gp)
	break loop
}

More analysis showed that scanstack can, in a rare case, end up calling back into code that acquires sched.lock. For example:

runtime.scanstack at proc.go:866
calls runtime.gentraceback at mgcmark.go:842
calls runtime.scanstack$1 at traceback.go:378
calls runtime.scanframeworker at mgcmark.go:819
calls runtime.scanblock at mgcmark.go:904
calls runtime.greyobject at mgcmark.go:1221
calls (*runtime.gcWork).put at mgcmark.go:1412
calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
calls runtime.wakep at mgc.go:632
calls runtime.startm at proc.go:1779
acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool by @aclements. There are many such paths but they all go through enlistWorker -> wakep.

The evidence strongly suggests that one of these paths is what caused the deadlock we observed. We're running those jobs with GOTRACEBACK=crash now to try to get more information if it happens again.

Further refinement and analysis by @aclements and me shows that if we drop the wakep call from enlistWorker, the remaining few deadlock cycles found by the tool are all false positives caused by not understanding the effect of calls to func variables.

For Go 1.8 we intend to drop the enlistWorker -> wakep call. It was intended only as a performance optimization, it rarely executes, and if it does execute at just the wrong time it can (and plausibly did) cause the deadlock we saw.

@rsc rsc added this to the Go1.8 milestone Feb 15, 2017
@rsc rsc self-assigned this Feb 15, 2017
@rsc
Copy link
Contributor Author

rsc commented Feb 15, 2017

Reopening for cherry-pick.

@rsc rsc reopened this Feb 15, 2017
@gopherbot
Copy link

CL https://golang.org/cl/37022 mentions this issue.

@rsc rsc closed this as completed Feb 16, 2017
gopherbot pushed a commit that referenced this issue Feb 16, 2017
…to avoid possible deadlock

We have seen one instance of a production job suddenly spinning to
100% CPU and becoming unresponsive. In that one instance, a SIGQUIT
was sent after 328 minutes of spinning, and the stacks showed a single
goroutine in "IO wait (scan)" state.

Looking for things that might get stuck if a goroutine got stuck in
scanning a stack, we found that injectglist does:

	lock(&sched.lock)
	var n int
	for n = 0; glist != nil; n++ {
		gp := glist
		glist = gp.schedlink.ptr()
		casgstatus(gp, _Gwaiting, _Grunnable)
		globrunqput(gp)
	}
	unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes
away. Essentially, this code locks sched.lock and then while holding
sched.lock, waits to lock gp.atomicstatus.

The code that is doing the scan is:

	if castogscanstatus(gp, s, s|_Gscan) {
		if !gp.gcscandone {
			scanstack(gp, gcw)
			gp.gcscandone = true
		}
		restartg(gp)
		break loop
	}

More analysis showed that scanstack can, in a rare case, end up
calling back into code that acquires sched.lock. For example:

	runtime.scanstack at proc.go:866
	calls runtime.gentraceback at mgcmark.go:842
	calls runtime.scanstack$1 at traceback.go:378
	calls runtime.scanframeworker at mgcmark.go:819
	calls runtime.scanblock at mgcmark.go:904
	calls runtime.greyobject at mgcmark.go:1221
	calls (*runtime.gcWork).put at mgcmark.go:1412
	calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
	calls runtime.wakep at mgc.go:632
	calls runtime.startm at proc.go:1779
	acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool.
There are many such paths but they all go through enlistWorker -> wakep.

The evidence strongly suggests that one of these paths is what caused
the deadlock we observed. We're running those jobs with
GOTRACEBACK=crash now to try to get more information if it happens
again.

Further refinement and analysis shows that if we drop the wakep call
from enlistWorker, the remaining few deadlock cycles found by the tool
are all false positives caused by not understanding the effect of calls
to func variables.

The enlistWorker -> wakep call was intended only as a performance
optimization, it rarely executes, and if it does execute at just the
wrong time it can (and plausibly did) cause the deadlock we saw.

Comment it out, to avoid the potential deadlock.

Fixes #19112.
Unfixes #14179.

Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b
Reviewed-on: https://go-review.googlesource.com/37093
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/37022
TryBot-Result: Gobot Gobot <gobot@golang.org>
albertjin pushed a commit to albertjin/golang-go that referenced this issue Feb 17, 2017
…to avoid possible deadlock

We have seen one instance of a production job suddenly spinning to
100% CPU and becoming unresponsive. In that one instance, a SIGQUIT
was sent after 328 minutes of spinning, and the stacks showed a single
goroutine in "IO wait (scan)" state.

Looking for things that might get stuck if a goroutine got stuck in
scanning a stack, we found that injectglist does:

	lock(&sched.lock)
	var n int
	for n = 0; glist != nil; n++ {
		gp := glist
		glist = gp.schedlink.ptr()
		casgstatus(gp, _Gwaiting, _Grunnable)
		globrunqput(gp)
	}
	unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes
away. Essentially, this code locks sched.lock and then while holding
sched.lock, waits to lock gp.atomicstatus.

The code that is doing the scan is:

	if castogscanstatus(gp, s, s|_Gscan) {
		if !gp.gcscandone {
			scanstack(gp, gcw)
			gp.gcscandone = true
		}
		restartg(gp)
		break loop
	}

More analysis showed that scanstack can, in a rare case, end up
calling back into code that acquires sched.lock. For example:

	runtime.scanstack at proc.go:866
	calls runtime.gentraceback at mgcmark.go:842
	calls runtime.scanstack$1 at traceback.go:378
	calls runtime.scanframeworker at mgcmark.go:819
	calls runtime.scanblock at mgcmark.go:904
	calls runtime.greyobject at mgcmark.go:1221
	calls (*runtime.gcWork).put at mgcmark.go:1412
	calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
	calls runtime.wakep at mgc.go:632
	calls runtime.startm at proc.go:1779
	acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool.
There are many such paths but they all go through enlistWorker -> wakep.

The evidence strongly suggests that one of these paths is what caused
the deadlock we observed. We're running those jobs with
GOTRACEBACK=crash now to try to get more information if it happens
again.

Further refinement and analysis shows that if we drop the wakep call
from enlistWorker, the remaining few deadlock cycles found by the tool
are all false positives caused by not understanding the effect of calls
to func variables.

The enlistWorker -> wakep call was intended only as a performance
optimization, it rarely executes, and if it does execute at just the
wrong time it can (and plausibly did) cause the deadlock we saw.

Comment it out, to avoid the potential deadlock.

Fixes golang#19112.
Unfixes golang#14179.

Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b
Reviewed-on: https://go-review.googlesource.com/37093
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/37022
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Feb 16, 2018
@rsc rsc removed their assignment Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants