cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

josharian · 2019-03-03T18:39:33Z

package p

func dispatch(x int) int {
	if x < 10 {
		return small(x)
	}
	return big(x)
}

//go:noinline
func small(x int) int { return 0 }

//go:noinline
func big(x int) int { return 1 }

dispatch should be inlined: It is very simple, and on either path through the function we call exactly one other simple function.

It is not:

x.go:3:6: cannot inline dispatch: function too complex: cost 126 exceeds budget 80

This is because we attach a significant cost to each function call.

Note that we are happy to inline this dispatch function:

func dispatch(x int) int {
	fn := big
	if x < 10 {
		fn = small
	}
	return fn(x)
}

Fixing this requires either some special casing to recognize common patterns (like the original dispatch) or a bit of control flow analysis.

cc @randall77 @dr2chase

The text was updated successfully, but these errors were encountered:

gopherbot · 2019-03-03T23:18:46Z

Change https://golang.org/cl/164968 mentions this issue: math/big: add fast path for pure Go addVW for large z

In the normal case, only a few words have to be updated when adding a word to a vector. When that happens, we can simply copy the rest of the words, which is much faster. However, the overhead of that makes it prohibitive for small vectors, so we check the size at the beginning. The implementation is a bit weird to allow addVW to continued to be inlined; see #30548. The AddVW benchmarks are surprising, but fully repeatable. The SubVW benchmarks are more or less as expected. I expect that removing the indirect function call will help both and make them a bit more normal. name old time/op new time/op delta AddVW/1-8 4.27ns ± 2% 3.81ns ± 3% -10.83% (p=0.000 n=89+90) AddVW/2-8 4.91ns ± 2% 4.34ns ± 1% -11.60% (p=0.000 n=83+90) AddVW/3-8 5.77ns ± 4% 5.76ns ± 2% ~ (p=0.365 n=91+87) AddVW/4-8 6.03ns ± 1% 6.03ns ± 1% ~ (p=0.392 n=80+76) AddVW/5-8 6.48ns ± 2% 6.63ns ± 1% +2.27% (p=0.000 n=76+74) AddVW/10-8 9.56ns ± 2% 9.56ns ± 1% -0.02% (p=0.002 n=69+76) AddVW/100-8 90.6ns ± 0% 18.1ns ± 4% -79.99% (p=0.000 n=72+94) AddVW/1000-8 865ns ± 0% 85ns ± 6% -90.14% (p=0.000 n=66+96) AddVW/10000-8 8.57µs ± 2% 1.82µs ± 3% -78.73% (p=0.000 n=99+94) AddVW/100000-8 84.4µs ± 2% 31.8µs ± 4% -62.29% (p=0.000 n=93+98) name old time/op new time/op delta SubVW/1-8 3.90ns ± 2% 4.13ns ± 4% +6.02% (p=0.000 n=92+95) SubVW/2-8 4.15ns ± 1% 5.20ns ± 1% +25.22% (p=0.000 n=83+85) SubVW/3-8 5.50ns ± 2% 6.22ns ± 6% +13.21% (p=0.000 n=91+97) SubVW/4-8 5.99ns ± 1% 6.63ns ± 1% +10.63% (p=0.000 n=79+61) SubVW/5-8 6.75ns ± 4% 6.88ns ± 2% +1.82% (p=0.000 n=98+73) SubVW/10-8 9.57ns ± 1% 9.56ns ± 1% -0.13% (p=0.000 n=77+64) SubVW/100-8 90.3ns ± 1% 18.1ns ± 2% -80.00% (p=0.000 n=75+94) SubVW/1000-8 860ns ± 4% 85ns ± 7% -90.14% (p=0.000 n=97+99) SubVW/10000-8 8.51µs ± 3% 1.77µs ± 6% -79.21% (p=0.000 n=100+97) SubVW/100000-8 84.4µs ± 3% 31.5µs ± 3% -62.66% (p=0.000 n=92+92) Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937 Reviewed-on: https://go-review.googlesource.com/c/go/+/164968 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>

In the normal case, only a few words have to be updated when adding a word to a vector. When that happens, we can simply copy the rest of the words, which is much faster. However, the overhead of that makes it prohibitive for small vectors, so we check the size at the beginning. The implementation is a bit weird to allow addVW to continued to be inlined; see golang#30548. The AddVW benchmarks are surprising, but fully repeatable. The SubVW benchmarks are more or less as expected. I expect that removing the indirect function call will help both and make them a bit more normal. name old time/op new time/op delta AddVW/1-8 4.27ns ± 2% 3.81ns ± 3% -10.83% (p=0.000 n=89+90) AddVW/2-8 4.91ns ± 2% 4.34ns ± 1% -11.60% (p=0.000 n=83+90) AddVW/3-8 5.77ns ± 4% 5.76ns ± 2% ~ (p=0.365 n=91+87) AddVW/4-8 6.03ns ± 1% 6.03ns ± 1% ~ (p=0.392 n=80+76) AddVW/5-8 6.48ns ± 2% 6.63ns ± 1% +2.27% (p=0.000 n=76+74) AddVW/10-8 9.56ns ± 2% 9.56ns ± 1% -0.02% (p=0.002 n=69+76) AddVW/100-8 90.6ns ± 0% 18.1ns ± 4% -79.99% (p=0.000 n=72+94) AddVW/1000-8 865ns ± 0% 85ns ± 6% -90.14% (p=0.000 n=66+96) AddVW/10000-8 8.57µs ± 2% 1.82µs ± 3% -78.73% (p=0.000 n=99+94) AddVW/100000-8 84.4µs ± 2% 31.8µs ± 4% -62.29% (p=0.000 n=93+98) name old time/op new time/op delta SubVW/1-8 3.90ns ± 2% 4.13ns ± 4% +6.02% (p=0.000 n=92+95) SubVW/2-8 4.15ns ± 1% 5.20ns ± 1% +25.22% (p=0.000 n=83+85) SubVW/3-8 5.50ns ± 2% 6.22ns ± 6% +13.21% (p=0.000 n=91+97) SubVW/4-8 5.99ns ± 1% 6.63ns ± 1% +10.63% (p=0.000 n=79+61) SubVW/5-8 6.75ns ± 4% 6.88ns ± 2% +1.82% (p=0.000 n=98+73) SubVW/10-8 9.57ns ± 1% 9.56ns ± 1% -0.13% (p=0.000 n=77+64) SubVW/100-8 90.3ns ± 1% 18.1ns ± 2% -80.00% (p=0.000 n=75+94) SubVW/1000-8 860ns ± 4% 85ns ± 7% -90.14% (p=0.000 n=97+99) SubVW/10000-8 8.51µs ± 3% 1.77µs ± 6% -79.21% (p=0.000 n=100+97) SubVW/100000-8 84.4µs ± 3% 31.5µs ± 3% -62.66% (p=0.000 n=92+92) Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937

CAFxX · 2020-05-20T02:29:59Z

A similar-but-not-identical case (I can open a separate issue if needed) is common in logging libraries like zap:

// Debug logs a message at DebugLevel. The message includes any fields passed
// at the log site, as well as any fields accumulated on the logger.
func (log *Logger) Debug(msg string, fields ...Field) {
	if ce := log.check(DebugLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}

So much so that library authors resort to suggesting to users to -basically- perform the inlining manually as it allows to turn something like this:

logger.Debug("something happened",
  zap.String("field", obj.Field()),
  zap.Int64("state", obj.State()),
)

into this, potentially bypassing one vararg call:

if e := logger.Check(zap.DebugLevel, "something happened"); e != nil {
  e.Write(
    zap.String("field", obj.Field()),
    zap.Int64("state", obj.State()),
  )
}

If the compiler inlined Debug, it would be possible to avoid the vararg call in the (likely) case debug-level logging is disabled at runtime, and potentially even avoid calling the two obj methods if we knew they were side-effect free.

josharian · 2020-05-20T03:08:37Z

Following our ~~proud~~ ugly tradition of inlining hacks, one relatively obvious option here is to reduce the inlining cost of calls inside conditionals. That would treat these two cases similarly. Any interest in prototyping this?

CAFxX · 2020-05-21T07:45:48Z

I have a simplistic prototype that halves extraCallCost of calls inside the if.

It works for

if f1() {
  f2()
}

but it obviously falls for

if condition {
  return f1()
}
return f2()

even though it inlines

if condition {
  return f1()
} else {
  return f2()
}

This prototype doesn't seem to affect the time it takes for make.bash, and bin/go increases by 0.5% (+73KB). Haven't run yet more benchmarks (which corpus do we run nowadays for changes to the inliner?).

That made me think that maybe a better way to handle this would be to have the extraCallCost of the second call in each function to be always 1 (even the first one would work, but that would also affect functions with a single call) possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b())

josharian · 2020-05-21T15:06:34Z

maybe a better way to handle this would be to have the extraCallCost of the second call in each function to be always 1

Seems worth a shot.

possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b())

I think we want that added condition. I assume that we don't want to inline a(), b().

@dr2chase put together the initial heuristic, but that piece at least seems pretty clearly intentional.

josharian · 2020-05-21T15:07:16Z

Btw, for testing the impact, you may find github.com/josharian/compilecmp helpful

ugorji · 2020-09-14T09:50:25Z

The ideas here will resolves the different use-cases I was targeting in #19348 (comment).

My initial use-cases were around helper methods that do a switch/if-else-if-else or that call 2 functions (see f and g below).

f() {
    if x {
        a() // cost 57
    } else if y {
        b() // cost 1
    } else {
        c() // cost 1
    }
    d() // cost 1 (second function)
    e() // cost 57 // without this call, f should be inlineable
}

g() {
    a() // cost 57
    b() // cost 1
}

dr2chase · 2020-09-14T15:30:06Z

Seems like we could rewrite the visitor slightly, make it compute the max of visits down two arms of an if, rather than the sum.

dr2chase · 2020-09-14T15:58:02Z

CL coming....

gopherbot · 2020-09-14T16:19:03Z

Change https://golang.org/cl/254796 mentions this issue: cmd/compile: modify inlining heuristic for if; max arms, not sum

dr2chase · 2020-09-15T00:41:15Z

If y'all feel like playing with the CL to see how it does with your examples, that might be good.
When I benchmark, I don't get ginormous binary growth, but also things don't get much faster, but perhaps people have coded around this already.

amd64: https://perf.golang.org/search?q=upload:20200914.8
arm64: https://perf.golang.org/search?q=upload:20200915.1

Both did poorly on the same benchmark, which is "interesting".

PS there's at least one failure when you run tests, because this changed some of the output for logopt (the JSON optimization logger for LSP use).

dr2chase · 2020-09-16T00:31:01Z

Mystery regression debugged; if a function f looks like "return new(Thing).initThing()", and

initThing does not escape its receiver,
initThing is large enough that inlining it into f causes f to become too large to inline,
a hot caller of f does not escape f's result

then inlining initThing into f is a Bad Idea, because it results in avoidable heap allocation. The "solution" to this is a hack, whereby small functions that contain allocations have a lower limit on the size of things that they will inline (similar mechanism to not-inlining into giant methods) which will tend to let them inline more often. This bug could be biting us already, I just happened to make it appear with this bit of tinkering.

Better interleaving of escape analysis and inlining would help with this. Like I said, hack. I am benchmarking just to see what the this-time-for-sure heuristic does for binary size and benchmark time.

Interested parties are again invited to tinker with this, to see if it helps their code enough to matter.

dr2chase · 2020-09-16T03:10:34Z

There, regressions gone. Slight performance increase, slight binary size increase.
https://perf.golang.org/search?q=upload:20200916.1

josharian · 2020-09-16T03:25:28Z

My Go time is severely restricted at the moment, but I really look forward to trying this out. And reverting my math/big hacks.

dr2chase · 2020-09-16T14:45:08Z

@josharian We're also discussing better hacks. I'm not thrilled at the existence of additional knobs that are not measuring precisely what we care about, yet definitely have value.

ugorji · 2020-10-16T15:12:27Z

@dr2chase Gentle ping. Any update? It will be great to get this into go1.16.

For more context, I have many functions that does something to the tune of:

func read1Byte() byte {
    if decodingFromBytes {
          read1ByteFromByteSlice()
    } else if decodingFromBufferedIO {
          read1ByteFromBufIO()
    } else {
          read1ByteFromIO()
    }
}

Currently, code like this can never be inlined, and using an interface prevented inlining from happening.

The common path of read1ByteFromByteSlice() is very cheap, and I wanted that inlined. With your code change, it will be.

dr2chase · 2020-10-16T15:47:38Z

So, the CL is up for review, and there's an open "bug", and this is low-risk -- so it might go in later, rather than sooner -- i.e., it can go in after the deadline, unless I misunderstand that policy. Right now I'm working on stuff that I'd call much riskier that we need to be sure works. There's also some higher-priority inlining work going on to deal with a performance regression in 1.15, and that needs to happen first. See #41474. It's mostly done.

There's default/automatic concern about the effect this will have on build times and binary size, since it will by default cause more things to become inlineable, hence larger binaries, and more time needed to produce them, so we may want to tweak those knobs a little bit.

gopherbot · 2020-11-09T02:41:14Z

Change https://golang.org/cl/267419 mentions this issue: cmd/compile: allocation-aware inlining tweaks

mdempsky · 2020-11-10T20:29:27Z

What's special about

if x { return f() } else { return g() }

that it should be inlined, but that

return f() + g()

should not be?

The inlining heuristics are currently oriented around the resulting code size. This justifies inlining a := g; if x { a = f }; return a() because it needs less code size than if x { return f() } else { return g() }: in particular, it only needs to generate one function call rather than two.

If the inlining heuristic cost assigned to function calls is too high, I think it's reasonable to just lower it. I don't see how control flow details weigh in here, since we're not (at least yet) doing any substantive control flow analysis at time of inlining.

dr2chase · 2020-11-11T14:46:14Z

Cutting the cost of a call in half increases text size by 3.5%.
Cutting it to 39 (half of inline budget, minus 1) increases text size by 2%.
I could benchmark for a corresponding performance increase, but my recollection is we won't get it.

By-the-way, the best-looking "floats-all-boats" version of this CL doesn't work for a 3-call if tree. It includes a budget for how much the "max accounting" can be lower than the "sum accounting", and the current works-best number there is only 68, which is 11 more than the cost of a call; 100 was also tested, and it was not as good -- the binaries were a bit larger (of course) and the performance improvements were not as good, and there were more losers.

All the benchmarking results used to obtain these parameters appear here.
Some heuristics were dropped, and two were added, in the course of this experiment. Most recent is the right-most "sheet" (of many). Escape-analysis-informed heuristics seemed like they ought to be better, but (1) other well-motivated CLs made it harder to use them and (2) given the parameters being tested at the time, they didn't work that well and (3) the hacky allocation/call/size heuristics worked surprisingly well. Further research (that I don't have time for -- the freeze is getting harder and harder every day, and I need to return to the register ABI work) might revive them, though that would require a discussion of updating the call-graph on the fly (there are algorithms for this, but they're not the sort of thing you'd introduce this late in the release cycle).

ugorji · 2020-11-12T19:46:48Z

What's special about
if x { return f() } else { return g() }
that it should be inlined, but that
return f() + g()
should not be?

The inlining heuristics are currently oriented around the resulting code size. This justifies inlining a := g; if x { a = f }; return a() because it needs less code size than if x { return f() } else { return g() }: in particular, it only needs to generate one function call rather than two.

My understanding of the inlining heuristics, is that a function takes more budget (~57 out of 80) than a simple instruction, because we decided the inclusion of a call that cannot be inlined suggests that inlining the wrapping function is of less value, so we reduced the leftover budget to about ~20.

However, a call like if x { f1() } else if y { f2() } else { f3() } is really a single function being called. Based on that, the cost should be the max cost of each block.

It should be just as possible to inline:

func a() {
  i++
  f()
}

as to inline

func a() {
  i++
  if i > 0 {
    f()
  } else {
    g()
  }
}

When we started talking about mid-stack inlining, the goal was to reduce function call cost to 1 (similar to panic, append, and other intrinsic functions). Since that is no longer on the table and a function call costs a large portion of the inlining budget, then we should allow a switch where only one branch is taken to equal roughly the same cost.

This is what the CL achieves.

If the inlining heuristic cost assigned to function calls is too high, I think it's reasonable to just lower it. I don't see how control flow details weigh in here, since we're not (at least yet) doing any substantive control flow analysis at time of inlining.

It would suck, both in performance (creating unnecessary temp closures) and readability, if we unintentionally encourage folks to resort to this.

randall77 · 2020-11-12T20:27:45Z

However, a call like if x { f1() } else if y { f2() } else { f3() } is really a single function being called. Based on that, the cost should be the max cost of each block.

One of the costs we're trying to model here is code size. Inlining that 3-branch if statement replaces 1 call (the callsite) with 3 calls. The fact that only one of those 3 calls can execute on a particular invocation doesn't help with code size (unless the conditions can be statically evaluated).

dr2chase · 2020-11-13T04:16:41Z

Putting this off till 1.17, need to do other work first. I do think the max-of-if-arms is reasonable, but we need to do a better job at not inlining in some cases. The companion CL that was supposed to help with this was too flaky, Keith and I were discussing ideas in comments there. Crude sketch is to put off the inlining decision till we can tell if it will over-fill an otherwise inlineable caller; alternately, look ahead more careful in the max-cost estimator.

One thing I tried that seems like a win is to also impose a quota/budget on the amount that max-if-cost may diverge from sum-if-cost; this reduced binary size increase without doing much at all to hurt performance. Note that the interaction of max-if-cost with look-before-leap inline cost estimation will be a bit of pain.

dr2chase · 2020-11-24T04:20:37Z

Before I forget, here's a writeup of what I tried and what I learned, and what I might try in the future. https://docs.google.com/document/d/17LAdIdHbcdhvxxQ2nszx4SgM_vbDGz6rDoj7BjL7VqM/edit?usp=sharing

josharian added Performance NeedsInvestigation labels Mar 3, 2019

josharian added this to the Go1.13 milestone Mar 3, 2019

josharian mentioned this issue Apr 1, 2019

cmd/compile: improve inlining cost model #17566

Open

andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

gopherbot added the compiler/runtime label Jul 13, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

josharian commented Mar 3, 2019

gopherbot commented Mar 3, 2019

CAFxX commented May 20, 2020 •

edited

Loading

josharian commented May 20, 2020

CAFxX commented May 21, 2020 •

edited

Loading

josharian commented May 21, 2020

josharian commented May 21, 2020

ugorji commented Sep 14, 2020 •

edited

Loading

dr2chase commented Sep 14, 2020

dr2chase commented Sep 14, 2020

gopherbot commented Sep 14, 2020

dr2chase commented Sep 15, 2020 •

edited

Loading

dr2chase commented Sep 16, 2020

dr2chase commented Sep 16, 2020

josharian commented Sep 16, 2020

dr2chase commented Sep 16, 2020

ugorji commented Oct 16, 2020

dr2chase commented Oct 16, 2020

gopherbot commented Nov 9, 2020

mdempsky commented Nov 10, 2020 •

edited

Loading

dr2chase commented Nov 11, 2020

ugorji commented Nov 12, 2020

randall77 commented Nov 12, 2020

dr2chase commented Nov 13, 2020

dr2chase commented Nov 24, 2020

cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

Comments

josharian commented Mar 3, 2019

gopherbot commented Mar 3, 2019

CAFxX commented May 20, 2020 • edited Loading

josharian commented May 20, 2020

CAFxX commented May 21, 2020 • edited Loading

josharian commented May 21, 2020

josharian commented May 21, 2020

ugorji commented Sep 14, 2020 • edited Loading

dr2chase commented Sep 14, 2020

dr2chase commented Sep 14, 2020

gopherbot commented Sep 14, 2020

dr2chase commented Sep 15, 2020 • edited Loading

dr2chase commented Sep 16, 2020

dr2chase commented Sep 16, 2020

josharian commented Sep 16, 2020

dr2chase commented Sep 16, 2020

ugorji commented Oct 16, 2020

dr2chase commented Oct 16, 2020

gopherbot commented Nov 9, 2020

mdempsky commented Nov 10, 2020 • edited Loading

dr2chase commented Nov 11, 2020

ugorji commented Nov 12, 2020

randall77 commented Nov 12, 2020

dr2chase commented Nov 13, 2020

dr2chase commented Nov 24, 2020

CAFxX commented May 20, 2020 •

edited

Loading

CAFxX commented May 21, 2020 •

edited

Loading

ugorji commented Sep 14, 2020 •

edited

Loading

dr2chase commented Sep 15, 2020 •

edited

Loading

mdempsky commented Nov 10, 2020 •

edited

Loading