Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: scavenger's frequent wake-ups interfere with runnext #35271

Closed
bcmills opened this issue Oct 31, 2019 · 9 comments
Closed

runtime: scavenger's frequent wake-ups interfere with runnext #35271

bcmills opened this issue Oct 31, 2019 · 9 comments

Comments

@bcmills
Copy link
Contributor

bcmills commented Oct 31, 2019

From windows-amd64-longtest (https://build.golang.org/log/3fbe37b41ed88143f4b6d6f6e277e99891537be7):

--- FAIL: TestPingPongHog (0.11s)
    proc_test.go:460: want hogCount/lightCount in [0.2, 5]; got 51000000/6733000 = 7.574632407544928
FAIL
FAIL	runtime	288.181s

CC @aclements @mknyszek

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 31, 2019
@bcmills bcmills added this to the Go1.14 milestone Oct 31, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Oct 31, 2019

linux-amd64-longtest: https://build.golang.org/log/8f4ed420d92d7a3872a0701928f301386266e77f

I'm guessing this probably turned flaky due to the goroutine-preemption changes?

@bcmills bcmills changed the title runtime: TestPingPongHog failure on windows-amd64-longtest runtime: TestPingPongHog is flaky Oct 31, 2019
@bcmills bcmills added the Soon This needs to be done soon. (regressions, serious bugs, outages) label Nov 1, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Nov 1, 2019

This is preventing me from getting meaningful TryBot results on Windows for CL 204521.

(https://storage.googleapis.com/go-build-log/66ca91a7/windows-amd64-longtest_cfa17417.log)

@mknyszek
Copy link
Contributor

mknyszek commented Nov 7, 2019

This is now failing much more consistently on the linux longtests, as of 21445b0.

The scavenger is likely going to sleep and waking up much more often than before (though I'm not sure that it's enough to still matter. The time spent "measuring" is 100 ms, though, so I suppose I could see that making this worse.

@gopherbot
Copy link

Change https://golang.org/cl/206098 mentions this issue: runtime: skip TestPingPongHog on builders

gopherbot pushed a commit that referenced this issue Nov 8, 2019
This test is failing consistently in the longtest builders,
potentially masking regressions in other packages.

Updates #35271

Change-Id: Idc03171c0109b5c8d4913e0af2078c1115666897
Reviewed-on: https://go-review.googlesource.com/c/go/+/206098
Reviewed-by: Carlos Amedee <carlos@golang.org>
@bcmills bcmills removed the Soon This needs to be done soon. (regressions, serious bugs, outages) label Nov 8, 2019
@mknyszek
Copy link
Contributor

I'll take a look into this since Austin is currently focused on the memory corruption issues.

@mknyszek mknyszek assigned mknyszek and unassigned aclements Nov 22, 2019
@mknyszek
Copy link
Contributor

OK! I got somewhere.

TestPingPongHog is all about scheduling goroutines. I was digging through the history around these failures and found some interesting information:

  1. They started happening VERY consistently as of 21445b0 (as I mentioned in a previous comment).
  2. Reproducing this by running the test directly is nigh impossible but...
  3. If I make some big allocation (1 GiB), free it, and runtime.GC() twice at the beginning of the test (which basically ensures the scavenger will be enabled with plenty of work to do), it fails very, very consistently.
  4. At the same time it started happening consistently on Linux, it stopped happening on Windows.

This implies that the frequent wake-ups from the scavenger are having a detrimental effect, at least on Linux (and perhaps a beneficial one on Windows?). Keep in mind that GOMAXPROCS=1 for this test, so I'm more inclined to think now that it's the wake-ups themselves causing issues.

I think there are actually two issues at play here: one introduced around Oct 31st when Windows started failing, and the scavenger sleeping/waking more often. I'll tackle the latter first and come back to the former if I can get that fixed.

Note also that my fixes to #35788 make the test fail around 1 in 500 times when run directly (as opposed to not reproducible at all).

@mknyszek
Copy link
Contributor

@aclements informed me that TestPingPongHog is trying to check for fairness. Fairness is helped by the "next" mechanism in the scheduler, the 1 element LIFO in front of the per-P FIFO (which is the runqueue). Currently the scavenger, upon wake-up (which it's doing a lot more of now), will be readied onto this LIFO, interfering with the test.

One solution could be to just have the scavenger do more work. But perhaps a better solution is to just call "ready" with "next" as false. This way the scavenger never ends up in this LIFO, and I can confirm that this fixes the test flake (1000 consecutive runs and no failure, with the 1 GiB allocation in front of the test). This fix stays in line with the reasoning that the scavenger should mostly stay out of the way of the rest of the application, and if the scavenger takes longer to get scheduled (i.e. sleeps longer than it wanted to) it'll simply request to sleep for less to account for this overhead, effectively doing more work anyway.

This also gave me an opportunity to get some numbers for the new self-paced scavenger: each cycle is on the order of 300-1000µs, which means our order-of-magnitude assumption in the old scavenger was mostly on-target. :)

@mknyszek mknyszek removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 22, 2019
@mknyszek mknyszek changed the title runtime: TestPingPongHog is flaky runtime: scavenger's frequent wake-ups interfere with runnext Nov 22, 2019
@gopherbot
Copy link

Change https://golang.org/cl/208380 mentions this issue: runtime: ready scavenger without next

@mknyszek
Copy link
Contributor

FWIW I don't think this should block the beta. These frequent wake-ups definitely don't seem to show up in benchmarks.

@golang golang locked and limited conversation to collaborators Nov 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants