
runtime: proposal/rfc for chansendn, fastpath optimization for broadcasting same value to buffered channels. #22197

Closed
cstockton opened this issue Oct 10, 2017 · 4 comments

Comments

@cstockton

cstockton commented Oct 10, 2017

Currently the runtime has a single generic channel send function that I am aware of, which attempts to send a single value to an unbuffered or buffered channel. I propose defining a chansendn(c *hchan, ep unsafe.Pointer, n int, block bool, callerpc uintptr) bool function (this may need to return an int: the number of sends completed) that takes an additional parameter n indicating the number of sends to attempt before falling back to the original chansend behavior, where an attempt is defined as these two additional operations:

  • If the request is blocking, drain the recvq until n is 0 or recvq.dequeue() returns nil.
  • Whether blocking or non-blocking, place the item into the buffer until n is 0 or the buffer is full.

Benefits:
This would allow an SSA pass to optimize code such as:

// The loop has known bounds; perhaps any send on a channel with a capacity
// could go through this pass, as long as the element being sent is the same
// value each iteration.
for n := 0; n < cap(ch); n++ {
	ch <- struct{}{} // or any pointer or value
}

// generated similar to:
if n := chansendn(ch, loop.Iterations, value); n < loop.Iterations {
	for ; n > 0; n-- {
		chansend(ch, value)
	}
}

I created an example in user space as an experiment to illustrate this, and I would be willing to do the work I'm proposing; I believe I could create a reasonable patch for the runtime. I have never done anything in the SSA backend, but thanks to the documentation for existing passes and the current loop optimizations, I believe I could put something together for an initial review. Since there will be a learning curve for the latter, I wanted to see if the change was reasonable before I went much further. Thanks.
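A rough user-space sketch of the intended semantics (the name sendN and its shape are illustrative only; a real chansendn would live in the runtime and perform the whole batch under a single acquisition of the channel's mutex rather than one select per value):

```go
package main

import "fmt"

// sendN is a hypothetical user-space sketch of the proposed chansendn:
// it attempts up to n non-blocking sends of the same value and returns
// how many sends remain unsent. The runtime version would hold the
// channel lock once for the entire batch.
func sendN(ch chan struct{}, v struct{}, n int) (remaining int) {
	for ; n > 0; n-- {
		select {
		case ch <- v:
		default:
			// Buffer full and no waiting receiver: report what is left.
			return n
		}
	}
	return 0
}

func main() {
	ch := make(chan struct{}, 3)
	left := sendN(ch, struct{}{}, 5)
	fmt.Println(left, len(ch)) // the buffer absorbs 3 of the 5, so 2 remain
}
```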

@ianlancetaylor
Contributor

I'm somewhat skeptical that this is worth the cost in compile time. Can you more precisely characterize the kinds of loops that the compiler would recognize? How often do such loops actually occur in real code?

@cstockton
Author

cstockton commented Oct 11, 2017

It came up when attempting to broadcast to a large number of goroutines. An example of real-world code that uses the pattern is any event emitter API that delivers notifications when an event occurs. A standard library example that comes to mind is the signal.Notify API in os/signal; though under the hood it currently uses a map and does some filtering before sending, it is akin to an event emitter. Each OS signal is a topic, and each topic can have an arbitrary number of listeners. If we pretended the Notify API were for arbitrary events of all kinds and had many listeners, then instead of being written like:

m map[chan<- os.Signal]*handler

for c, h := range handlers.m {
	if h.want(n) {
		select {
		case c <- sig:
		default:
		}
	}
}

It could be:

// Has a counter of listeners; new listeners can't start until a prior
// broadcast finishes.
m map[os.Signal]*handlerState // holds a listener count N

h := handlers.m[sig]
for i := h.N; i > 0; i-- {
	select {
	case h.ch <- sig:
	default:
	}
}

In the second example the compiler could perform a single non-blocking call to chansendn1. Extrapolate this to pretty much any API that notifies listeners when changes occur for a given "topic", "Watch" APIs specifically. I could find some examples in popular code bases this weekend if you are swayed into thinking there could be a possible benefit. If not, fair enough, maybe it's not worth it, but the benchmarks I posted, even from my naive user-space implementation, were fairly promising: a 3-5x speedup for buffered channel sends, as well as for sends on unbuffered channels with waiting receivers.
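As a hedged illustration, the second example's broadcast loop could be packaged like this (handlerState and its fields follow the hypothetical names above; today each iteration is a separate select, which is exactly the shape the proposed chansendn1 would collapse into one runtime call):

```go
package main

import "fmt"

// handlerState mirrors the hypothetical per-topic state from the example:
// one shared buffered channel and a listener count N.
type handlerState struct {
	ch chan string
	N  int
}

// broadcast performs N non-blocking sends of the same value, one select
// per iteration, and reports how many were delivered.
func broadcast(h *handlerState, sig string) (sent int) {
	for i := h.N; i > 0; i-- {
		select {
		case h.ch <- sig:
			sent++
		default:
			// No buffer space and no waiting receiver: drop this one.
		}
	}
	return sent
}

func main() {
	h := &handlerState{ch: make(chan string, 4), N: 6}
	fmt.Println(broadcast(h, "SIGHUP")) // the buffer caps deliveries at 4
}
```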

@ianlancetaylor
Contributor

Thanks, but that wasn't quite what I was asking. You are suggesting adding code to the compiler to recognize certain kinds of loops. These loops are complex, as your examples show. What precisely is the compiler going to look for? How is the compiler going to recognize the kinds of loops you describe? Once the loop is recognized, how is the compiler going to safely transform it? Remember that the compiler is looking at a processed representation of the source code, not the source code itself.

@cstockton
Author

Hello Ian- I wasn't really able to answer your question without getting a better understanding of the SSA backend. I went through that process partially and came to the conclusion that it is possible to recognize loops of these types, and that the cost during compilation (at least in the current naive form) is very small, on the order of microseconds, and much less if the chansendn pass were to piggyback off the induction-variable finding inside loopbce. The doc comment for the chansendn pass should answer some of your questions:

// chansendn takes calls to runtime.chansend1 and converts them to
// runtime.chansendn1, which performs multiple non-blocking channel sends
// in a single call under the same mutex. If not all values can be sent,
// it returns the number remaining, allowing it to transform a loop body
// such as:
//
// 	loop:
// 	  ind = (Phi min nxt)
// 	  sendVal = v
// 	  if ind < max
// 	    then goto enter_loop
// 	    else goto exit_loop
//
// 	enter_loop:
// 	  CallStatic chansend1 [sendVal]
// 	  nxt = inc + ind
// 	  goto loop
//
// 	exit_loop:
//
// Into:
//
// 	loop:
// 	  ind = (Phi min nxt)
// 	  sendVal = v
// 	  if ind < max
// 	    then goto enter_loop
// 	    else goto exit_loop
//
// 	enter_loop:
// 	  n = CallStatic chansendn1 [sendVal]
// 	  ind = n
// 	  goto loop
//
// 	exit_loop:
//
// which effectively reduces the loop to:
//
// 	loop:
// 	  n = chansendn1(...)
// 	  if n > 0:
// 	    goto loop
// Conditions:
//
//   1. phi name must have enabledChansendn prefix [POC constraint]
//   2. phi must have a loop body
//   3. loop body must have a single [POC constraint] integer induction var
//      with a maximum value > 1
//   4. loop condition increment must be +1 [POC constraint]
//   5. loop must contain a single blocking channel send [POC constraint] on a
//      channel with a capacity > 1.
//   6. channel send must be the same value each call
//        (prove it does not change within the loop body)
//
// i.e.:
//
// 	Eligible
//
// 		for i := 0; i < count; i++ {
// 			ch <- struct{}{}
// 		}
//
// 	Ineligible: I don't think it _has_ to be, but currently I'm avoiding it
// 	because the generated code would be more complex.
//
// 		for i := 0; i < count; i++ {
// 			select {
// 			case ch <- struct{}{}:
// 			default:
// 			}
// 		}
//
// 	Ineligible (could be allowed, but I imagine this rarely occurs):
//
// 		ch <- struct{}{}
// 		ch <- struct{}{}
// 		ch <- struct{}{}
//
// Note: this is just a poc and is far from correct, served as a learning
// experience and introduction to this code. I make a lot of assumptions based
// on observation of a small sampling, not on correct API usage.

Having discovered first-hand how much effort it would take to create a correct SSA pass for this, I am no longer convinced it's worth the cycles it would take from the Go team to review the patches. It would open up some nice capabilities for broadcasting to a group of goroutines with channels, but I can do that now relatively cheaply, at the cost of an allocation per broadcast, by closing a channel each time. Getting all the responses back with an empty receive might have more of a benefit, but again I'm not convinced it's worth the effort.
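For reference, the close-a-channel broadcast mentioned above can be sketched like this (a minimal example; the per-broadcast allocation is the fresh done channel):

```go
package main

import (
	"fmt"
	"sync"
)

// broadcastClose demonstrates broadcasting by closing a channel: one
// close wakes every receiver at once, at the cost of allocating a fresh
// channel for each broadcast. It returns how many listeners observed
// the broadcast.
func broadcastClose(listeners int) int {
	done := make(chan struct{}) // fresh allocation per broadcast
	var wg sync.WaitGroup
	var mu sync.Mutex
	woke := 0
	for i := 0; i < listeners; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done // a receive on a closed channel never blocks
			mu.Lock()
			woke++
			mu.Unlock()
		}()
	}
	close(done) // one close reaches every listener
	wg.Wait()
	return woke
}

func main() {
	fmt.Println(broadcastClose(5))
}
```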

I'll close this out and chalk it up as a good learning experience, thanks.

@golang golang locked and limited conversation to collaborators Oct 15, 2018