runtime: make chan-based generators faster #8903

dvyukov · 2014-10-07T15:07:30Z

In some cases parallelism is not needed/desirable -- fine-grained chan-based
generators/iterators, or "coroutine-like" approach when you have several
independent stacks and manually switch between them.
Sync channels substantially limit parallelism, to the point where pipeline-like
parallelization does not work. But they do not eliminate it entirely, as after an
operation both producer and consumer are runnable. Current implementation handles this
"contact bounce" under-parallelization very inefficiently -- goroutines are
constantly scheduled and de-scheduled, threads are constantly parked and unparked, work
migrate between threads, etc.

The idea is to limit parallelism on sync channels even more aggressively. So that
fine-grained communication over sync channels can be very efficient. But w/o eliminating
concurrency nor affecting semantics.

Implementation:
1. Each P has an additional local work queue, let's call it Domain. Work form Domain
can't be stolen by other P's. Domain is has maximum size of 4 goroutines.
2. When a goroutine unblocks another goroutine on a sync chan, it adds the other
goroutine to own Domain (Domain of the current P) instead of putting it to the normal
work queue. This effectively limits parallelism. We can use Domains in other situations,
but this is outside of context of this bug.
3. When a goroutine blocks, P schedules one of the goroutines from Domain (if it's not
empty). In some circumstances (e.g. a goroutine is preempted due to time slice
exhaustion, or blocks in syscall) P can decide to just move all goroutines in the Domain
to normal work queue.
Then, *before* a goroutine blocks on a sync chan it saves Hchan and elem pointers in G
and switches to another goroutine in Domain *once*. Or if it sees that a goroutine in
the Domain want to accomplish a pairing operation on the same chan, it executes the
operation with that other goroutine.

This allows to implement tight communication over sync channels with the cost of just
several function calls and w/o several degradation with GOMAXPROCS>1. Communication
domains form and break up dynamically. Program semantics are not affected.

Here is a prototype implementation:
https://golang.org/cl/74180043/

I've benchmarked it on a real application where chan-based generators were abounded due
to forementioned reasons:
https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/cGHMdNoHGQEJ
The change provides 17.5x speedup (and it's not the limit as runtime.switchto can be
significantly more efficient).

qiulaidongfeng · 2024-01-17T07:11:47Z

See #61405 and #61897 and https://research.swtch.com/coro , the use cases here seem to have a better choice.
cc @golang/runtime
Does this issue still need to be open?

dvyukov added accepted Performance labels Oct 7, 2014

rsc added this to the Unplanned milestone Apr 10, 2015

rsc removed release-none labels Apr 10, 2015

dvyukov mentioned this issue May 18, 2019

runtime: optimization to reduce P churn #32113

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: make chan-based generators faster #8903

runtime: make chan-based generators faster #8903

dvyukov commented Oct 7, 2014

qiulaidongfeng commented Jan 17, 2024

runtime: make chan-based generators faster #8903

runtime: make chan-based generators faster #8903

Comments

dvyukov commented Oct 7, 2014

qiulaidongfeng commented Jan 17, 2024