You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If there are more Ps than runnable Gs and a G calls Gosched, it causes the scheduler to wake up another P, the woken P finds no work to do, and it goes back to sleep. Doing this once isn't so bad, but if Gosched is called in a loop, such as in runtime.bgsweep (though user code can do this, too), this causes significant sleep/wakeup thrashing and consumes non-trivial CPU in futex calls. In Google, we see ~5% of cycles going to futex calls, most of which appear to be involved in this sort of thrashing based on the call graphs.
Here's a detailed sequence of what happens. Suppose there are two Ps and Ms and one G. G0 is running on M0 on P0. M1 and P1 are idle; stopped in stopm at the end of findrunnable.
G0 invokes Gosched, which calls schedule, which calls findrunnable. findrunnable will always find a G (such as G0) and return it to schedule. schedule will then call resetspinning, which will see that sched.nmspinning is 0 and call wakep. wakep cas's sched.nmspinning to 1 and calls startm. In the steady state, startm will find P1 and M1 and wake up M1 with m.spinning set to true.
M1 wakes up in stopm and returns to findrunnable, which gotos top. Since there's nothing on any of the run queues and g.m.spinning is true, it will fall through to "return P and block", which will drop the P, decrement sched.nmspinning back to 0 and stopm.
Meanwhile, P0 reschedules G0 because that's the only thing on its run queue. It does a little work (e.g., bgsweep sweeps a page) and calls Gosched again, which takes us through the whole process of waking up another P just for it to find it has no work to do.
It may be that this is just a bad way for bgsweep to work, but given that user code is just as capable of calling Gosched in a loop, I think we should consider fixing this in the scheduler. Unfortunately, full call graphs aren't working on the Google profile data right now, so I can't check if all of the time in futex is from bgsweep specifically, or just this problem in general.
report.txt is the perf report of futex syscalls that shows the full call graphs leading to these sleeps and wakeups, collected with GOEXPERIMENT=framepointer and
Interestingly, this turns out to be a great example of when traditional sampling profiling does not correlate with end-to-end performance. As a workaround, I disabled the background sweeper so I can debug my original issue, but even though this reduces the number of futex sleeps/wakeups by ~250X in BinaryTree17, it has almost no effect on the benchmark's performance! Of course, this makes sense, because this only consumes time on otherwise idle Ms.
This workaround did, however, demonstrate other reasons it's important to fix this. It reduced the total CPU time of the benchmark by 10% (and presumably had a similar, if not more pronounced effect on power). It also significantly reduced the noise in profiles. Without the workaround, this noise masked the issue I was actually trying to debug to the point where every one of a dozen different hardware counters indicated the benchmark should have been faster when it was in fact slower. With the workaround in place, the counters paint a very clear picture.
If there are more Ps than runnable Gs and a G calls Gosched, it causes the scheduler to wake up another P, the woken P finds no work to do, and it goes back to sleep. Doing this once isn't so bad, but if Gosched is called in a loop, such as in runtime.bgsweep (though user code can do this, too), this causes significant sleep/wakeup thrashing and consumes non-trivial CPU in futex calls. In Google, we see ~5% of cycles going to futex calls, most of which appear to be involved in this sort of thrashing based on the call graphs.
Here's a detailed sequence of what happens. Suppose there are two Ps and Ms and one G. G0 is running on M0 on P0. M1 and P1 are idle; stopped in stopm at the end of findrunnable.
It may be that this is just a bad way for bgsweep to work, but given that user code is just as capable of calling Gosched in a loop, I think we should consider fixing this in the scheduler. Unfortunately, full call graphs aren't working on the Google profile data right now, so I can't check if all of the time in futex is from bgsweep specifically, or just this problem in general.
@dvyukov @RLH @rsc
The text was updated successfully, but these errors were encountered: