
runtime: Gosched causes scheduler thrashing #13527

Closed
aclements opened this issue Dec 7, 2015 · 3 comments

@aclements (Member)

If there are more Ps than runnable Gs and a G calls Gosched, the scheduler wakes up another P, the woken P finds no work to do, and it goes back to sleep. Doing this once isn't so bad, but if Gosched is called in a loop, as in runtime.bgsweep (user code can do this, too), it causes significant sleep/wakeup thrashing and consumes non-trivial CPU in futex calls. At Google, we see ~5% of cycles going to futex calls, and based on the call graphs most of that appears to be this sort of thrashing.
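
To make the pattern concrete, here is a minimal sketch of a program that triggers it (my own illustration, not the benchmark from this report): with GOMAXPROCS higher than the number of runnable goroutines, one goroutine yielding in a tight loop drives the wakeup path on every iteration.

    package main

    import (
        "runtime"
        "time"
    )

    func main() {
        runtime.GOMAXPROCS(4) // more Ps than runnable Gs

        go func() {
            for {
                // A small unit of work followed by a yield, as bgsweep does.
                runtime.Gosched()
            }
        }()

        // Let the loop run; watch futex activity with perf meanwhile.
        time.Sleep(4 * time.Second)
    }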

Here's a detailed sequence of what happens (a condensed pseudocode sketch follows the list). Suppose there are two Ps, two Ms, and one G. G0 is running on M0 on P0. M1 and P1 are idle; M1 is stopped in stopm at the end of findrunnable.

  1. G0 invokes Gosched, which calls schedule, which calls findrunnable. findrunnable will always find a G (such as G0) and return it to schedule. schedule will then call resetspinning, which will see that sched.nmspinning is 0 and call wakep. wakep CASes sched.nmspinning from 0 to 1 and calls startm. In the steady state, startm will find P1 and M1 and wake up M1 with m.spinning set to true.
  2. M1 wakes up in stopm and returns to findrunnable, which jumps back to its top label. Since there's nothing on any of the run queues and g.m.spinning is true, it will fall through to "return P and block", which will drop the P, decrement sched.nmspinning back to 0, and call stopm.
  3. Meanwhile, P0 reschedules G0 because that's the only thing on its run queue. G0 does a little work (e.g., bgsweep sweeps a page) and calls Gosched again, which takes us through the whole process of waking up another P just for the woken M to find it has no work to do.
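
For reference, here is a heavily condensed pseudocode sketch of that path, using the function names above. This is an illustration of the sequence as described, not the actual code in runtime/proc.go:

    // Condensed pseudocode; the real functions handle many more cases.
    func schedule() {
        gp := findrunnable() // always succeeds here: the yielding G is runnable again
        resetspinning()      // sees sched.nmspinning == 0, so it calls wakep
        execute(gp)          // P0 keeps running the same G
    }

    func wakep() {
        if !cas(&sched.nmspinning, 0, 1) {
            return
        }
        startm(nil, true) // wake M1 on P1 with m.spinning set (futex wake)
    }

    // On M1: findrunnable sees every run queue empty, releases P1, decrements
    // sched.nmspinning back to 0, and blocks in stopm (futex sleep), ready
    // to be woken again by the very next Gosched.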

It may be that this is just a bad way for bgsweep to work, but given that user code is just as capable of calling Gosched in a loop, I think we should consider fixing this in the scheduler. Unfortunately, full call graphs aren't working on the Google profile data right now, so I can't check whether all of the futex time comes from bgsweep specifically or from this pattern in general.

@dvyukov @RLH @rsc

@aclements aclements added this to the Go1.6 milestone Dec 7, 2015

@aclements (Member, Author)

report.txt is the perf report of futex syscalls that shows the full call graphs leading to these sleeps and wakeups, collected with GOEXPERIMENT=framepointer and:

    perf record -g -c 1 -m 1024 -e syscalls:sys_enter_futex ./bench.after.fp -test.bench BinaryTree17

Note that the test runs for about 4 seconds and makes 145,938 futex calls.
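
For a quick aggregate count without recording call graphs, counting the tracepoint directly also works (a guess at an equivalent measurement, not necessarily how the number above was produced):

    perf stat -e syscalls:sys_enter_futex ./bench.after.fp -test.bench BinaryTree17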

@dvyukov dvyukov self-assigned this Dec 8, 2015

@aclements (Member, Author)

Interestingly, this turns out to be a great example of traditional sampling profiling failing to correlate with end-to-end performance. As a workaround, I disabled the background sweeper so I could debug my original issue, but even though this reduces the number of futex sleeps/wakeups by ~250X in BinaryTree17, it has almost no effect on the benchmark's wall-clock performance! This makes sense, of course, because the thrashing only consumes time on otherwise idle Ms.

This workaround did, however, demonstrate other reasons it's important to fix this. It reduced the total CPU time of the benchmark by 10% (and presumably had a similar, if not more pronounced, effect on power). It also significantly reduced the noise in profiles. Without the workaround, this noise masked the issue I was actually trying to debug, to the point where every one of a dozen different hardware counters indicated the benchmark should have been faster when it was in fact slower. With the workaround in place, the counters paint a very clear picture.

@gopherbot

CL https://golang.org/cl/17540 mentions this issue.

@golang golang locked and limited conversation to collaborators Dec 14, 2016
@rsc rsc unassigned dvyukov Jun 23, 2022