Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: frequent timeouts in doTestParallelReaders since 2021-11-04 #49680

Closed
bcmills opened this issue Nov 19, 2021 · 8 comments
Closed

runtime: frequent timeouts in doTestParallelReaders since 2021-11-04 #49680

bcmills opened this issue Nov 19, 2021 · 8 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 19, 2021

greplogs --dashboard -md -l -e '(?ms)\ngoroutine \d+ \[run.*\]:\n(?:.+\n\t.*\n)*runtime_test\.doTestParallelReaders'

2021-11-18T06:05:29-f6647f2/linux-ppc64le-buildlet
2021-11-18T02:17:23-ce7e501/linux-ppc64-buildlet
2021-11-17T21:26:25-0440fb8/linux-ppc64le-power9osu
2021-11-17T17:36:36-0555ea3/linux-ppc64le-power9osu
2021-11-17T04:31:40-489f587/linux-ppc64le-power9osu
2021-11-17T04:31:22-f384c70/linux-ppc64le-power9osu
2021-11-16T18:36:59-6c36c33/linux-ppc64-buildlet
2021-11-16T07:47:08-313cae3/linux-ppc64le-power9osu
2021-11-15T21:22:14-42fa03a/linux-ppc64le-power9osu
2021-11-15T21:22:09-184ca3c/linux-ppc64le-power9osu
2021-11-13T00:26:24-787708a/linux-ppc64le-power9osu
2021-11-12T22:20:50-9150c16/linux-ppc64le-power9osu
2021-11-11T13:58:28-d76b1ac/linux-ppc64-buildlet
2021-11-11T05:16:39-99fa49e/linux-ppc64le-buildlet
2021-11-09T20:47:54-a096316/linux-ppc64le-power9osu
2021-11-09T19:01:06-81f37a7/linux-ppc64le-buildlet
2021-11-09T19:01:06-81f37a7/linux-ppc64le-power9osu
2021-11-08T21:52:47-955f9f5/linux-ppc64le-power9osu
2021-11-08T21:28:37-ccea0b2/linux-ppc64le-power9osu
2021-11-06T19:41:12-7ca772a/linux-ppc64le-buildlet
2021-11-05T20:59:43-dbd3cf8/linux-ppc64le-buildlet
2021-11-05T18:20:07-53bab19/linux-ppc64le-power9osu
2021-11-05T17:46:27-4f543b5/linux-ppc64-buildlet
2021-11-05T15:52:40-a0d661a/android-amd64-emu
2021-11-04T23:56:29-0e5f287/linux-ppc64le-buildlet
2021-11-04T21:50:21-bfd74fd/linux-ppc64le-buildlet
2021-11-04T20:01:22-6d1fffa/android-amd64-emu
2021-11-04T20:00:54-961aab2/illumos-amd64
2020-01-31T23:17:00-8390c47/android-amd64-emu
2019-11-01T05:38:51-e96fd13/netbsd-amd64-8_0

@bcmills
Copy link
Contributor Author

bcmills commented Nov 19, 2021

Marking as release-blocker for Go 1.18 because the failure rate is markedly higher as of 2021-11-04: this looks like a regression, or at the very least a flaky test exacerbated by 1.18 changes.

(@mknyszek, another one for the “GC running during test” pile, perhaps?)

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 19, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 19, 2021
@mknyszek mknyszek self-assigned this Nov 19, 2021
@toothrot toothrot added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Nov 19, 2021
@mknyszek
Copy link
Contributor

Oh, I think I see what's wrong here. The test already says that it's unsafe for a GC to run, and calls SetGCPercent(-1). However, I think what's missing here is that one could actively be in progress, so runtime.GC needs to be called to make sure.

I'll try to confirm it.

@gopherbot
Copy link

Change https://golang.org/cl/366534 mentions this issue: runtime: ensure no GC is running in TestParallelRWMutexReaders

@mknyszek
Copy link
Contributor

I've had trouble reproducing the issue, but I do think my previous theory does indicate a real potential problem. That theory does also fit into the change in minimum heap size triggering the issue.

Sent golang.org/cl/366534. We can try landing it and see if it fixes it, placing this issue in a WaitingForInfo state. I think it's something that the test should be doing anyway.

@mknyszek
Copy link
Contributor

Actually, I'm more confident this is actually the fix. If I simply start a GC on another goroutine before the test starts, it easily reproduces. I guess it's possible this is some other bug, but it's unlikely at this point. Once I land https://golang.org/cl/366534 I'll close this optimistically.

gopherbot pushed a commit that referenced this issue Nov 23, 2021
Currently this test makes it clear that it's unsafe for a GC to run,
otherwise a deadlock could occur, so it calls SetGCPercent(-1). However,
a GC may be actively in progress, and SetGCPercent is not going to end
any in-progress GC. Call runtime.GC to block until at least the current
GC is over.

Updates #49680.

Change-Id: Ibdc7d378e8cf7e05270910e92effcad8c6874e59
Reviewed-on: https://go-review.googlesource.com/c/go/+/366534
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@gopherbot
Copy link

Change https://golang.org/cl/367874 mentions this issue: misc/cgo/test: further reduce likeliness of hang in Test9400

gopherbot pushed a commit that referenced this issue Nov 30, 2021
As suggested by #49680, a GC could be in-progress when we
disable GC. Force a GC after we pause to ensure we don't
hang in this case.

For #49695

Change-Id: I4fc4c06ef2ac174217c3dcf7d58c7669226e2d24
Reviewed-on: https://go-review.googlesource.com/c/go/+/367874
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Trust: Paul Murphy <murp@ibm.com>
@mknyszek
Copy link
Contributor

mknyszek commented Dec 7, 2021

Update on the root cause here: the fixes for #49680, #49695, and #45867 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does. The cause of those failures is that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code that has non-preemptible goroutines that create the potential for a deadlock. In short, the fix was correct, but the reasoning was incorrect.

@gopherbot
Copy link

Change https://golang.org/cl/369815 mentions this issue: runtime: fix comments on the behavior of SetGCPercent

gopherbot pushed a commit that referenced this issue Dec 7, 2021
Fixes for #49680, #49695, #45867, and #49370 all assumed that
SetGCPercent(-1) doesn't block until the GC's mark phase is done, but
it actually does. The cause of 3 of those 4 failures comes from the fact
that at the beginning of the sweep phase, the GC does try to preempt
every P once, and this may run concurrently with test code. In the
fourth case, the issue was likely that only *one* of the debug_test.go
tests was missing a call to SetGCPercent(-1). Just to be safe, leave a
TODO there for now to remove the extraneous runtime.GC calls, but leave
the calls in.

Updates #49680, #49695, #45867, and #49370.

Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/369815
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker
Projects
None yet
Development

No branches or pull requests

4 participants