runtime: reduce scheduling overhead on large ppc64 systems #16476

Closed
bcbrock opened this issue Jul 23, 2016 · 17 comments
Labels
early-in-cycle (A change that should be done early in the 3 month dev cycle), FrozenDueToAge, NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made), Proposal-Hold
Milestone

Comments

@bcbrock commented Jul 23, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    Go 1.7beta2
  2. What operating system and processor architecture are you using (go env)?
    X86 (Ivy Bridge) - Ubuntu 16.04
    POWER8 - Ubuntu 15.10

This issue/proposal is a bit long, and likely only of interest to those interested in goroutine scheduling.

I work on the Hyperledger fabric (https://github.com/hyperledger/fabric), a large Go application implementing a permissioned blockchain. As part of this work I have observed what I would call "excessive" amounts of cumulative time spent in runtime.findrunnable when running on large systems with GOMAXPROCS defaulting to the number of available CPUs. In the following I assume the reader is familiar with the findrunnable routine in proc.go.

Drilling down into a findrunnable profile, the obvious culprit is seen to be the work-stealing loop. This loop is inefficient on large systems for several reasons:

  1. "Spinners" poll the system 4 times while holding a P, and all threads poll once again after releasing their P.

  2. The stealing loop checks for stealable work from all Ps, including Ps that have no possibility of having any work to steal. The atomic operations used to load the queue pointers in runqgrab require synchronization primitives on some architectures, and a subroutine call overhead on all architectures. This global polling is disruptive in an SMP-coherence sense, since the poller must pull cache lines from around the system in order to examine only a few fields of each line. The randomized polling order also defeats the hardware's prefetching heuristics.
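
For concreteness, here is a minimal standalone model of the access pattern described in point 2. This is a paraphrase with invented names (runq, findStealTarget), not the actual proc.go code:

    package main

    import (
        "fmt"
        "math/rand"
        "sync/atomic"
    )

    // runq stands in for the head/tail words of a P's local run queue that
    // runqgrab loads atomically; the padding rounds each entry up to a
    // 64-byte cache line.
    type runq struct {
        head, tail uint32
        _          [56]byte
    }

    // findStealTarget models the polling passes over a randomized P order:
    // every pass loads two words of every P, so each spinner pulls in one
    // cache line per P from around the machine.
    func findStealTarget(allp []runq, passes int) int {
        for pass := 0; pass < passes; pass++ {
            for _, i := range rand.Perm(len(allp)) {
                h := atomic.LoadUint32(&allp[i].head)
                t := atomic.LoadUint32(&allp[i].tail)
                if t != h {
                    return i
                }
            }
        }
        return -1
    }

    func main() {
        allp := make([]runq, 120)
        allp[7].tail = 3 // pretend P 7 has three runnable goroutines
        fmt.Println("steal from P", findStealTarget(allp, 4))
    }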

Regarding 1): I understand why it is good to poll at least twice - first for ez-pickin's from the local run queues, and a second pass for the longer-latency runnext stealing. It occurred to me that perhaps 4 loops were made in Go 1.6 because the randomization used there was not guaranteed to visit every P, so polling 4X increased the odds of looking at every local queue. Now that this has been fixed in Go 1.7, polling more than twice is arguably not necessary. The polling with runnext grabs included is so thorough that once this loop is finished there is no a priori reason to expect that another pass will bear fruit.

Regarding 2): Note that the answer to the question: "Could this P possibly have any work to steal?" can be efficiently centralized since the answer is relatively rarely modified but relatively often observed. I've created a modified scheduler that includes a global array called mayhavework that is indexed by the id of a P. Currently, mayhavework[i] is false whenever a P is queued in the list of idle Ps, and true otherwise. More aggressive update protocols are also possible, but this simple protocol is sufficient to illustrate the benefit.

Setting/clearing mayhavework[i] adds a small overhead to queue management of idle Ps, as well as a test during polling loops. Note that the polling loop in the "delicate dance" already includes what appears to be a redundant guard of allp[i] != nil which is not made by the work-stealing loop.
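
A rough standalone sketch of the proposed structure follows; the names and the idle-list hook points (pidleput/pidleget) are stand-ins for illustration, not the actual patch:

    package main

    import "fmt"

    const nProcs = 120

    // mayhavework[i] mirrors whether allp[i] is on the idle-P list: cleared
    // when the P is pushed onto the list, set when it is taken back off. A
    // byte array keeps the flags for all Ps packed into a cache line or two.
    var mayhavework [nProcs]bool

    func pidleput(id int) { mayhavework[id] = false } // P parked on the idle list
    func pidleget(id int) { mayhavework[id] = true }  // P taken off the idle list

    // stealCandidates is what a spinner would consult before touching any
    // remote run queue: only Ps whose flag is set are worth polling.
    func stealCandidates() []int {
        var ids []int
        for id, busy := range mayhavework {
            if busy {
                ids = append(ids, id)
            }
        }
        return ids
    }

    func main() {
        for id := 0; id < nProcs; id++ {
            pidleput(id) // start with every P idle
        }
        pidleget(7)
        pidleget(42)
        fmt.Println("Ps worth polling:", stealCandidates())
    }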

Here are some results for an example Hyperledger fabric benchmark running on a 4-socket X86 Ivy Bridge server with 120 hardware threads. These examples are for illustration only and are not claimed to be exhaustive; the arguments for the proposal should be valid based on first principles. Performance (throughput) of the server is measured in transactions per second (TPS). Cumulative profile percentages were reported by the Golang net/http/pprof profiling service running in the application. Results for GOMAXPROCS equal to 12 and 120 (the default) are presented.

GOMAXPROCS = 12
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               996 TPS          987 TPS              997 TPS
runtime.findrunnable      14.0%            13.5%                14.1%
-------------------------------------------------------------------------

GOMAXPROCS = 120
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               991 TPS          963 TPS              997 TPS
runtime.findrunnable      28.2%            21.9%                16.5%
-------------------------------------------------------------------------

This full proposal has no effect on findrunnable overhead or performance on this system with GOMAXPROCS=12. However I have also run the experiment on a POWER8 server and observed a reduction from 14.5% to 9.4% of findrunnable overhead on that system with GOMAXPROCS=12. This may be due to the fact that atomic.Load includes a synchronization instruction on POWER.

For the full system there is a significant reduction in scheduling overhead. It is not clear whether the slight performance drop with 2 stealing loops only is real, or due to experimental variation. In a number of experiments (on POWER8) I have seen what I believe are small, real performance increases and decreases from these modified heuristics, which vary based on the particular benchmark.

To summarize the proposal:

  1. Only poll twice in the work stealing loop;

  2. Implement an efficient centralized data structure that records which Ps might possibly have any work to steal.

Bishop Brock

@ianlancetaylor ianlancetaylor changed the title A proposal to reduce scheduling overhead on large systems runtime: a proposal to reduce scheduling overhead on large systems Jul 23, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone Jul 23, 2016
@ianlancetaylor (Contributor)

CC @dvyukov @aclements

@dvyukov (Member) commented Jul 23, 2016

Hi Bishop,

It would also be useful to see request latency and CPU load.
Thread parking/unparking strategy is the most tricky part of distributed task schedulers. It affects throughput, latency, and CPU load. A change is uncontroversial if it improves at least one characteristic and makes none worse; otherwise it is a matter of tradeoffs.
Time spent in findrunnable is not very interesting per se. For example, if we reduce time in findrunnable but spend equally more time in thread parking/unparking, that's not a profitable change.

It may also make sense to benchmark 4 steal iterations + mayhavework.

Changing this condition can also have a profound effect:

    if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
        goto stop
    }

I chose the threshold with blind intuition, and my desktop had only 16 hw threads at the time.
You may try reducing it all the way down to at most one spinner.
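
For concreteness, here is a standalone restatement of the two gating policies; the function names are mine, while the fields in the quoted check belong to the runtime:

    package main

    import "fmt"

    // currentPolicy mirrors the quoted check: refuse to start another
    // spinner once the spinning Ms reach half the number of busy Ps.
    func currentPolicy(spinning bool, nmspinning, procs, npidle uint32) bool {
        return !spinning && 2*nmspinning >= procs-npidle
    }

    // singleSpinner is the aggressive variant suggested above: allow at
    // most one spinning M at a time.
    func singleSpinner(spinning bool, nmspinning uint32) bool {
        return !spinning && nmspinning >= 1
    }

    func main() {
        // 120 Ps, 100 of them idle, 10 Ms already spinning.
        fmt.Println(currentPolicy(false, 10, 120, 100)) // true: skip stealing, go to stop
        fmt.Println(singleSpinner(false, 10))           // true: skip stealing, go to stop
    }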

@bcbrock (Author) commented Jul 25, 2016

Hi Dmitry,

You are right that I had simply assumed that reducing cumulative runtime in findrunnable (which includes stopm parking/unparking) would reduce overall CPU consumption, but that does turn out to be the case. I don't have instrumentation for transaction latency, but the clients here are gated in part by the server response, so throughput should be a proxy for latency. My X86 system is having issues, so I'll show some results from POWER8 for a slightly different benchmark (partial results from X86 have a similar character).

GOMAXPROCS = 12
------------------------------------------------------------
                        Baseline  1) Only   2) Only    Both
------------------------------------------------------------
Throughput, TPS          1028       1031      1021     1036
findrunnable, cum. %     14.4       11.1       9.2     10.2
CPU Time, sec             161        159       158      158
CPU Time, % of Baseline              99%        99%     98%
------------------------------------------------------------

GOMAXPROCS = 80
------------------------------------------------------------
                        Baseline  1) Only   2) Only    Both
------------------------------------------------------------
Throughput, TPS          1045       1015      1026     1026
findrunnable, cum. %     31.2       26.0      14.3     16.5
CPU Time, sec             200        195       171      174
CPU Time, % of Baseline              98%       86%      87%
------------------------------------------------------------

Of the two proposals, proposal 2), introducing mayhavework, has the larger effect for this benchmark, especially for large GOMAXPROCS.

I'm glad you mentioned the "don't spin" heuristic above. I'm puzzled at how this reduces CPU time, since threads branching to stop here enter a kind of limbo in which they are both forbidden to search for work to steal, yet also forbidden to park until there is no more work to steal.

I have in fact created a highly modified scheduler that keeps one thread spinning permanently and sends all others immediately to park. This scheduler delivers a solid 5+% throughput boost for our application at the expense of burning a hardware thread. I wouldn't suggest this as a general solution, but I could see us offering something like this to our clients under certain conditions.

For today though, I would ask you to consider these proposals based on first principles, and even with the knowledge that decreasing the amount of time spent in work-stealing may slightly drop performance by missing out on chance encounters with work which was not present when the stealing loop was first entered. My own outlook is probably colored by having spent many years working on server energy efficiency, and I understand that not everyone is willing to make this kind of trade.

@dvyukov (Member) commented Jul 25, 2016

GOMAXPROCS = 80
------------------------------------------------------------
                        Baseline  1) Only   2) Only    Both
------------------------------------------------------------
Throughput, TPS          1045       1015      1026     1026

Isn't that a 2.9%/1.8% TPS decrease?

I'm glad you mentioned the "don't spin" heuristic above. I'm puzzled at how this reduces CPU time, since threads branching to stop here enter a kind of limbo in which they are both forbidden to search for work to steal, yet also forbidden to park until there is no more work to steal.

Good question.
The intention was that if there are so many threads spinning already, then there should be no work available. The second check of the run queues (after the stop label) is there only to prevent CPU underutilization (the situation where we have both runnable goroutines and idle Ps).

As you noted, it can be suboptimal when we have lots of Ps and lots of churn in findrunnable. Maybe we could try to pop a G and obtain a P in the second check; if both succeed, we can proceed directly to running the G (return from findrunnable).
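
A toy standalone sketch of that first alternative, with invented names and no locking (the real version would presumably have to do both pops under sched.lock):

    package main

    import "fmt"

    type g struct{ id int }
    type p struct{ id int }

    // tryRunDirectly models the suggested second check: pop a G from the
    // global run queue and take an idle P; only if both succeed do we hand
    // them back to be run immediately instead of parking.
    func tryRunDirectly(globrunq *[]*g, pidle *[]*p) (*g, *p, bool) {
        if len(*globrunq) == 0 || len(*pidle) == 0 {
            return nil, nil, false
        }
        gp := (*globrunq)[0]
        *globrunq = (*globrunq)[1:]
        pp := (*pidle)[0]
        *pidle = (*pidle)[1:]
        return gp, pp, true
    }

    func main() {
        runq := []*g{{id: 1}}
        idle := []*p{{id: 3}}
        if gp, pp, ok := tryRunDirectly(&runq, &idle); ok {
            fmt.Printf("run G%d on P%d without parking\n", gp.id, pp.id)
        }
    }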

Or maybe we should let each P check all queues once, then all except one P proceed to block and one is left spinning for longer.

This requires careful thought, experimentation and extensive benchmarking.

I would suggest also testing this on the http benchmark from golang.org/x/benchmarks. It is not super realistic, but it is at least easy to set up and run, and it reports latency.

@bcbrock (Author) commented Aug 2, 2016

Regarding the slight throughput loss: Yes, I am arguing for changes that will likely increase latency / reduce performance slightly in some cases, the argument being that a small part of the current performance is "accidental" due to spending longer than necessary spinning.

I have been looking at the HTTP benchmark for a couple of days. This benchmark is quite different from our application, for example threads entering the stealing loop have about a 90% chance of stealing work in the HTTP benchmark, but only about 20% in our application. The results are noisy - run-to-run variation is on the same order (low single-digit percentages) as the differences we're trying to observe. However if the data is real then in general the things I've proposed seem to reduce CPU time slightly for large GOMAXPROCS and seem to increase latency slightly in the HTTP benchmark. The work-stealing "hit rate" goes from 86.3% for the current scheduler to 84.8% with 2 loops and the mayhavework check here. A scheduler with exactly 1 permanently spinning thread looks like it would also work well for this benchmark.

I agree that it would be useful to see a comprehensive study of scheduling heuristics on a large number of Go applications. Unfortunately I don't have a mandate to undertake that kind of study myself.

@randall77 (Contributor)

> The stealing loop checks for stealable work from all Ps, including Ps that have no possibility of having any work to steal. [...] The randomized polling order also defeats the hardware's prefetching heuristics.

Randomized polling order gives you provable guarantees about the time required to find work, though.

> Regarding 1): [...] Now that this has been fixed in Go 1.7, polling more than twice is arguably not necessary.

That sounds reasonable to me. If you can demonstrate that only 2 passes instead of 4 improves performance, we could do that. I don't remember why those extra iterations are in there; maybe Dmitry knows. There's always a tradeoff in looking for more work vs. sleeping, and I just want to make sure we don't improve the cases you're worried about while making lots of other cases worse.

> Regarding 2): [...] I've created a modified scheduler that includes a global array called mayhavework that is indexed by the id of a P. [...] Setting/clearing mayhavework[i] adds a small overhead to queue management of idle Ps, as well as a test during polling loops.

That sounds reasonable also. The only thing I worry about is false sharing between mayhavework[i] and mayhavework[i+1]; we'd need padding to ensure that doesn't happen. (Ps with work to do would otherwise never have to write-share a cache line with any other P.) Q: Could this be a field of P instead?


@dvyukov (Member) commented Aug 9, 2016

4 iterations were introduced in the first revision of the distributed scheduler. At that time we did not have the good iteration algorithm that covers all Ps in one pass, and we had lots of much more important problems to solve. There is no particular theory behind exactly 4 iterations.

@bcbrock (Author) commented Aug 10, 2016

@randall77 : The sharing of mayhavework is intentional. Assuming mayhavework is a Boolean (byte) array, the typical spinner will have access to information for all configured Ps in a single cache line. This status is rarely updated but often read, so the shared line is not a problem. Storing mayhavework with each P would largely defeat the purpose, which is to remove the need for every spinner to randomly pull in one or more remote cache lines from P's that are idle.

@randall77 (Contributor)

The typical case as I see it, in a heavily loaded system:

  • P runs out of work, sets mayhavework[p.ID] = false
  • picks a P to steal from, checks its mayhavework entry, it is true
  • steals some work G from that P
  • sets mayhavework[p.ID] = true
  • start working on the G we just stole.

So we write mayhavework twice, read it once. That would argue for making writes fast.

The other argument for making writes fast is that writes are on the critical path to getting work done. Reads are done by otherwise idle Ps.

Or does mayhavework indicate having work that is currently not being executed (that is, the P has 2+ goroutines assigned to it)? I'm not sure if that changes my math.

@bcbrock (Author) commented Aug 10, 2016

All of the numbers I quoted above are from an implementation where mayhavework[i] is false if allp[i] is queued in the global queue of idle Ps, otherwise true. So the variables really are very rarely updated.

I started with a more aggressive implementation where mayhavework[i] is updated as you suggested above. In that case mayhavework[i] was implemented as an integer. There is a tradeoff in the size of integer to use, trading write collisions vs. read efficiency; I used 32-bit, which worked well in my case. I also avoided redundant writes by only writing the new mayhavework value when it changed.

I think the unknown in the more aggressive context is how often the question "does this P have work?" is answered yes vs. no. In the system that led to this issue, the answer was almost always "no": most of the Ps were always idle and no longer updating their entries, but their entries were being read multiple times by each thread looking for work.
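
A small standalone sketch of that conditional-write pattern (the 32-bit flag array and the helper name are my assumptions about the experiment described above):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // setMayHaveWork only touches the shared word when the value actually
    // changes, so a busy P that stays busy never invalidates the readers'
    // copy of the cache line.
    func setMayHaveWork(flags []uint32, id int, v uint32) {
        if atomic.LoadUint32(&flags[id]) != v {
            atomic.StoreUint32(&flags[id], v)
        }
    }

    func main() {
        flags := make([]uint32, 120)
        setMayHaveWork(flags, 7, 1) // P 7 gains work: one store
        setMayHaveWork(flags, 7, 1) // still has work: no store, no invalidation
        fmt.Println(flags[7])
    }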

@dvyukov (Member) commented Aug 11, 2016

@randall77 if we distribute mayhavework per P, then we may as well just use runq instead of mayhavework to understand whether a P has work or not, and then we get exactly what we have now.
I am not saying that what we have now is bad. I am saying that we don't need a distributed mayhavework.

@adg (Contributor) commented Sep 26, 2016

Any progress here? @aclements, do you have input?

@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 11, 2016
@rsc (Contributor) commented Oct 27, 2016

I believe runtime scheduler overhead is something Bishop and others at IBM will be looking at more for Go 1.9.

@rsc rsc modified the milestones: Go1.9Early, Go1.8 Oct 27, 2016
@rsc rsc changed the title runtime: a proposal to reduce scheduling overhead on large systems runtime: reduce scheduling overhead on large ppc64 systems Nov 14, 2016
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.9Early May 3, 2017
@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Jun 1, 2018
@rsc (Contributor) commented Jun 11, 2018

Can this be closed?

My memory is that we agreed that maybe runtime/internal/atomic can have a custom atomic used in the one line identified above, if it makes such a big difference, but that we wouldn't expose that same semantics under any name in sync/atomic. Maybe this has already been done, or maybe other optimizations have removed the hot spot?

@rsc (Contributor) commented Jun 11, 2018

/cc @laboger

@laboger (Contributor) commented Jul 19, 2018

Sorry I don't seem to get these notifications.

@ceseo is working on a change for Go 1.12 to add atomics for load-acquire and store-release to runtime/internal/atomic. There are several functions in the runtime package with comments noting where these can be used; maybe there are others.

@bcbrock (Author) commented Jul 26, 2018

Thanks to everyone for your attention on this issue. Although I haven't been working with Golang for a couple of years now, my recollection is that I was never able to find anything that was generally better (worthwhile to implement) than the current scheduler algorithms. If IBM is able to get more efficient atomics for POWER implemented that will help a little. People concerned about this issue can use GOMAXPROCS to limit resource utilization.
