cmd/api: occasional hangs on 'go list' in NewWalker without useful output #45884
Comments
Also seen on a
Marking as release-blocker for Go 1.17, because this is looking like a regression introduced sometime in late April.
(It's not obvious to me whether the regression is in the runtime or in
The earliest failure here (by filename timestamp) is 2021-04-21T04:26:11-69c94ad/openbsd-arm64-jsing. That ran at 69c94ad, which is before 7e97e4e and ecfce58, two of my top contenders as potentially problematic. So if these issues are all related, those may not be involved.
/cc @dmitshur
Change https://golang.org/cl/318569 mentions this issue:
It is hard to be certain without more complete stacks, but I think http://golang.org/cl/318569 will fix this, i.e., that #45975, #45916, #45885, and #45884 all have the same cause.
As a cleanup, golang.org/cl/307914 unintentionally caused the idle GC work recheck to drop sched.lock between acquiring a P and committing to keep it (once a worker G was found).

This is unsafe, as releasing a P requires extra checks once sched.lock is taken (such as for runSafePointFn). Since checkIdleGCNoP does not perform these extra checks, we can now race with other users.

In the case of #45975, we may hang with this sequence:

1. M1: checkIdleGCNoP takes sched.lock, gets P1, releases sched.lock.
2. M2: forEachP takes sched.lock, iterates over sched.pidle without finding P1, releases sched.lock.
3. M1: checkIdleGCNoP puts P1 back in sched.pidle.
4. M2: forEachP waits forever for P1 to run the safePointFn.

Change back to the old behavior of releasing sched.lock only after we are certain we will keep the P. Thus if we put it back its removal from sched.pidle was never visible.

Fixes #45975
For #45916
For #45885
For #45884

Change-Id: I191a1800923b206ccaf96bdcdd0bfdad17b532e9
Reviewed-on: https://go-review.googlesource.com/c/go/+/318569
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
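
The locking rule in the commit message above can be illustrated with a toy model. This is only a sketch and not the runtime's code: the `sched` type, `tryAcquire`, and `snapshotIdle` are hypothetical names, an ordinary `sync.Mutex` stands in for sched.lock, and a slice of ints stands in for sched.pidle. The point it shows is that the decision to keep or return an idle P is made while the lock is still held, so another lock holder scanning the idle list never sees a P that is only temporarily missing.

```go
package main

import (
	"fmt"
	"sync"
)

// sched is a hypothetical stand-in for the scheduler state: mu plays the
// role of sched.lock and idle plays the role of sched.pidle.
type sched struct {
	mu   sync.Mutex
	idle []int
}

// tryAcquire removes an idle P and keeps it only if haveWork reports true.
// The lock is held for the whole decision. If the P is put back, that
// happens before the unlock, so its removal was never visible to anyone
// else. (Assumption for this sketch: haveWork is cheap and safe to call
// with the lock held.)
func (s *sched) tryAcquire(haveWork func() bool) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.idle) == 0 {
		return 0, false
	}
	p := s.idle[len(s.idle)-1]
	s.idle = s.idle[:len(s.idle)-1]
	if !haveWork() {
		s.idle = append(s.idle, p) // put back while still locked
		return 0, false
	}
	return p, true
}

// snapshotIdle models a second lock holder (such as the forEachP scan in
// the sequence above) that looks at the idle list under the lock.
func (s *sched) snapshotIdle() []int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]int(nil), s.idle...)
}

func main() {
	s := &sched{idle: []int{1, 2}}
	_, ok := s.tryAcquire(func() bool { return false })
	fmt.Println(ok, s.snapshotIdle()) // false [1 2]
	p, ok := s.tryAcquire(func() bool { return true })
	fmt.Println(p, ok, s.snapshotIdle()) // 2 true [1]
}
```

In the buggy ordering, the unlock would sit between removing the P and deciding whether to keep it, which is the window that step 2 of the sequence falls into.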
I think this bug should be fixed, but can't be sure. Please note here if you see more occurrences.
The last 30 commits, roughly corresponding to the last week, show no signs of this problem. Closing.
2021-04-23T20:57:54-691e1b8/linux-mips64le-mengzhuo (12 minutes on `go list`)
2021-04-23T13:48:10-bedfeed/linux-mips-rtrk (12 minutes on `go list`)
2021-04-21T20:24:34-2550563/linux-mips64le-mengzhuo (12 minutes in `go list`)

This is probably a bug in one or more of the `go` command, the `os` package, or the Linux platform on MIPS.

However, this failure output isn't giving us much useful information. `NewWalker` should accept a `context.Context`, which we can then plumb to the test's deadline to send SIGQUIT to `go list` to get more useful output in case of failure.