cmd/compile: no closure inlining even in the simple case? #35196

ivoras · 2019-10-27T14:00:02Z

I hesitate to call this an issue, since I know it's a complex topic, and I've read about half a dozen issue reports touching this topic, which kind-of sort-of fix some inlining cases, but not all of them.

I'm using a simple pattern for avoiding lock leakage:

// WithRWMutex extends the RWMutex type with convenient .With(func) functions
type WithRWMutex struct {
	sync.RWMutex
}

// WithRLock executes the given function with the mutex rlocked
func (m *WithRWMutex) WithRLock(f func()) {
	m.RWMutex.RLock()
	f()
	m.RWMutex.RUnlock()
}

// WithWLock executes the given function with the mutex wlocked
func (m *WithRWMutex) WithWLock(f func()) {
	m.RWMutex.Lock()
	f()
	m.RWMutex.Unlock()
}

in practice it's used like this:

beforeCode()
o.WithWLock(func() {
 doSomething(o)
})
afterCode()

Full example here: https://go.godbolt.org/z/mkvMbZ

I'd expect that when the closure passed to WithWlock() is simple enough (i.e. no need for a new stack), inlining would collapse all the scaffolding and compile it to something like:

beforeCode()
m.RWMutex.Lock()
doSomething(o)
m.RWMutex.Unlock()
afterCode()

But looking at the assembly code (https://go.godbolt.org/z/mkvMbZ), the inner function is compiled once, its address taken, passed to the WithWLock() function, i.e. a full closure call is being done.

This is on go 1.13.

It seems that a full closure call is an expensive choice for this particular pattern?

The text was updated successfully, but these errors were encountered:

randall77 · 2019-10-27T15:45:09Z

Our current inline heuristic thinks that (*WithRWMutex).WithRLock is too expensive to inline. Its body contains 3 calls, which is above the cost threshold.

Even if I hack the compiler to inline in this case, it still doesn't inline the anonymous func. But it does do so in simpler cases. Not sure what's going on there, probably a phase ordering issue.

ivoras · 2019-10-27T16:08:06Z

Hmmm... which criteria is used which results in 3 function calls being considered expensive? Wouldn't the assembly for that case be something like 3 CALL instructions with some low-weight scaffolding for the arguments and the stack?

Or is it just the middle call, the one which calls the closure function, the one which drives the cost up, since it's a call via a function pointer?

Re: hacking the compiler - could the "inlining threshold" be a compiler argument?

randall77 · 2019-10-27T16:25:18Z

Hmmm... which criteria is used which results in 3 function calls being considered expensive? Wouldn't the assembly for that case be something like 3 CALL instructions with some low-weight scaffolding for the arguments and the stack?

We have a cost budget of 80, and each call is cost 57. That normally allows only one call in an inlineable body (with a few exceptions, like intrinsics). See #19348 (comment)

Or is it just the middle call, the one which calls the closure function, the one which drives the cost up, since it's a call via a function pointer?

The closureness of the call isn't used as part of the cost, other than it must be able to resolve the target function.

Re: hacking the compiler - could the "inlining threshold" be a compiler argument?

We do have a -l argument to the compiler which adjusts inlining decisions. But that flag is mostly for testing/debugging, it's not intended for performance tuning. In general we try to avoid providing knobs like this, as it leads to all kinds of thorny questions, like "what if two packages want two different inlining thresholds?" "is is the threshold of the calling package, or the called package?".

ivoras · 2019-10-27T17:05:58Z

Thank you, I appreciate your answers and your expertise.

On my system, the -l flag is documented as "disable inlining" so not really usable for experimenting.

But even if inlining worked, is the compiler smart enough to inline the closure as a whole inside the one and only caller (since the closure is an anonymous function used exactly once)?

FWIW, 2 more cents: I'm guessing the issue is because "go build" also builds required packages, so there could be mixed results from multiple invocations. Doesn't that point to flags like inline thresholds being global settings on a build machine, for all packages. Maybe in a config file, an environment variable, etc.?

randall77 · 2019-10-27T17:20:07Z

But even if inlining worked, is the compiler smart enough to inline the closure as a whole inside the one and only caller (since the closure is an anonymous function used exactly once)?

This code optimizes to just a call to doSomething.

	func() {
		doSomething(o)
	}()

FWIW, 2 more cents: I'm guessing the issue is because "go build" also builds required packages, so there could be mixed results from multiple invocations. Doesn't that point to flags like inline thresholds being global settings on a build machine, for all packages. Maybe in a config file, an environment variable, etc.?

Sure, but that doesn't really solve the problem, it just punts it to the user. I import 2 packages, A and B. A says "I am fastest when the inlining threshold is 63". B says "I am fastest when the inlining threshold is 95". What should I set my inlining threshold to?

We've basically decided that tuning to this level is not worth the complication it introduces to the ecosystem.

cespare · 2020-02-12T19:23:21Z

I'm wondering about what I think is this issue, but in reading through this thread a few times I'm not exactly sure what the precise problem under discussion is. @randall77 said

Even if I hack the compiler to inline in this case, it still doesn't inline the anonymous func. But it does do so in simpler cases. Not sure what's going on there, probably a phase ordering issue.

Is that the core problem here to be fixed?

In my case, I'm looking at code like this:

// +build ignore

package main

import (
	"fmt"
)

type M struct {
	bs []float64
}

func (m *M) E(ff func(b float64) float64) float64 {
	var sum float64
	for _, b := range m.bs {
		sum += ff(b)
	}
	return sum
}

func f(b float64) float64 {
	return b / 12345
}

func main() {
	m := &M{bs: []float64{1, 2, 3, 4, 5}}
	fmt.Println(m.E(f))
}

The call to ff won't get inlined:

$ go1.14rc1 run -gcflags '-m -m' test.go
# command-line-arguments
./test.go:13:6: cannot inline (*M).E: unhandled op RANGE
./test.go:21:6: can inline f as: func(float64) float64 { return b / 12345 }
./test.go:25:6: cannot inline main: function too complex: cost 150 exceeds budget 80
./test.go:27:13: inlining call to fmt.Println func(...interface {}) (int, error) { var fmt..autotmp_3 int; fmt..autotmp_3 = <N>; var fmt..autotmp_4 error; fmt..autotmp_4 = <N>; fmt..autotmp_3, fmt..autotmp_4 = fmt.Fprintln(io.Writer(os.Stdout), fmt.a...); return fmt..autotmp_3, fmt..autotmp_4 }
./test.go:13:7: m does not escape
./test.go:13:15: ff does not escape
./test.go:27:17: m.E(f) escapes to heap:
./test.go:27:17:   flow: ~arg0 = &{storage for m.E(f)}:
./test.go:27:17:     from m.E(f) (spill) at ./test.go:27:17
./test.go:27:17:     from ~arg0 = <N> (assign-pair) at ./test.go:27:13
./test.go:27:17:   flow: {storage for []interface {} literal} = ~arg0:
./test.go:27:17:     from []interface {} literal (slice-literal-element) at ./test.go:27:13
./test.go:27:17:   flow: fmt.a = &{storage for []interface {} literal}:
./test.go:27:17:     from []interface {} literal (spill) at ./test.go:27:13
./test.go:27:17:     from fmt.a = []interface {} literal (assign) at ./test.go:27:13
./test.go:27:17:   flow: {heap} = *fmt.a:
./test.go:27:17:     from fmt.Fprintln(io.Writer(os.Stdout), fmt.a...) (call parameter) at ./test.go:27:13
./test.go:26:7: &M literal does not escape
./test.go:26:23: []float64 literal does not escape
./test.go:27:17: m.E(f) escapes to heap
./test.go:27:13: []interface {} literal does not escape
<autogenerated>:1: .this does not escape
0.0012150668286755773

However, if I call f directly rather than passing it as a function argument, it is inlined:

// +build ignore

package main

import (
	"fmt"
)

type M struct {
	bs []float64
}

func (m *M) E() float64 {
	var sum float64
	for _, b := range m.bs {
		sum += f(b)
	}
	return sum
}

func f(b float64) float64 {
	return b / 12345
}

func main() {
	m := &M{bs: []float64{1, 2, 3, 4, 5}}
	fmt.Println(m.E())
}

$ go1.14rc1 run -gcflags '-m -m' test.go
# command-line-arguments
./test.go:21:6: can inline f as: func(float64) float64 { return b / 12345 }
./test.go:13:6: cannot inline (*M).E: unhandled op RANGE
./test.go:16:11: inlining call to f func(float64) float64 { return b / 12345 }
./test.go:25:6: cannot inline main: function too complex: cost 149 exceeds budget 80
./test.go:27:13: inlining call to fmt.Println func(...interface {}) (int, error) { var fmt..autotmp_3 int; fmt..autotmp_3 = <N>; var fmt..autotmp_4 error; fmt..autotmp_4 = <N>; fmt..autotmp_3, fmt..autotmp_4 = fmt.Fprintln(io.Writer(os.Stdout), fmt.a...); return fmt..autotmp_3, fmt..autotmp_4 }
./test.go:13:7: m does not escape
./test.go:27:17: m.E() escapes to heap:
./test.go:27:17:   flow: ~arg0 = &{storage for m.E()}:
./test.go:27:17:     from m.E() (spill) at ./test.go:27:17
./test.go:27:17:     from ~arg0 = <N> (assign-pair) at ./test.go:27:13
./test.go:27:17:   flow: {storage for []interface {} literal} = ~arg0:
./test.go:27:17:     from []interface {} literal (slice-literal-element) at ./test.go:27:13
./test.go:27:17:   flow: fmt.a = &{storage for []interface {} literal}:
./test.go:27:17:     from []interface {} literal (spill) at ./test.go:27:13
./test.go:27:17:     from fmt.a = []interface {} literal (assign) at ./test.go:27:13
./test.go:27:17:   flow: {heap} = *fmt.a:
./test.go:27:17:     from fmt.Fprintln(io.Writer(os.Stdout), fmt.a...) (call parameter) at ./test.go:27:13
./test.go:26:7: &M literal does not escape
./test.go:26:23: []float64 literal does not escape
./test.go:27:17: m.E() escapes to heap
./test.go:27:13: []interface {} literal does not escape
<autogenerated>:1: .this does not escape
0.0012150668286755773

Is that the same as this issue? Is it another issue? Should I open a new issue?

ivoras · 2020-02-12T19:59:35Z

Heavy edit: I thought the issue was something else.

How do you know that ff in the first case won't be inlined? Neither outputs mention ff in the context of inlining?

Regarding whether it's similar or not - the original issue is about anonymous function, which I think warrant aggressive inlining since the majority of them will only be called from a single context (judging how code translates to inlining budget, I'd say push it to 500, that should cover many logging / formatting handlers, various wrappers and error handlers).

thanm · 2020-02-12T20:48:27Z

Setting aside the original example and looking at this one (passing function f to method E):

the inliner currently won't inline any function that contains a for+range loop (you can see this in the -m output, "unhandled op RANGE")
given that 'E' is not inlined into main, then at that point the body of E has an indirect call. The Go compiler currently only inlines direct calls; there is no special magic to analyze the program and determine the possible set of indirect call targets. In particular, in the general case this requires interprocedural analysis (build a call graph, etc).

In your second example you've replaced the indirect call with a direct call; the Go compiler has no problems with inlining those as long as the callee is small enough (and various other criteria are met).

Hope this helps.

cespare · 2020-02-12T21:49:06Z

@ivoras

How do you know that ff in the first case won't be inlined? Neither outputs mention ff in the context of inlining?

If it were inlined, it would be mentioned in the -gcflags '-m -m' output. In the second case where the direct call is inlined, it prints:

./test.go:16:11: inlining call to f func(float64) float64 { return b / 12345 }

(But I also confirmed by looking at the assembly.)

cespare · 2020-02-12T22:34:19Z

@thanm thanks, that is very helpful. To summarize, it sounds like the key optimization for me would be inlining of indirect calls (in some cases where the target is static), which is difficult because it requires heavyweight analysis.

FiloSottile changed the title ~~No closure inlining even in the simple case?~~ cmd/compile: no closure inlining even in the simple case? Oct 28, 2019

FiloSottile added NeedsInvestigation Performance labels Oct 28, 2019

FiloSottile added this to the Unplanned milestone Oct 28, 2019

gopherbot added the compiler/runtime label Jul 13, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: no closure inlining even in the simple case? #35196

cmd/compile: no closure inlining even in the simple case? #35196

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

cespare commented Feb 12, 2020

ivoras commented Feb 12, 2020 •

edited

Loading

thanm commented Feb 12, 2020

cespare commented Feb 12, 2020

cespare commented Feb 12, 2020

cmd/compile: no closure inlining even in the simple case? #35196

cmd/compile: no closure inlining even in the simple case? #35196

Comments

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

ivoras commented Oct 27, 2019

randall77 commented Oct 27, 2019

cespare commented Feb 12, 2020

ivoras commented Feb 12, 2020 • edited Loading

thanm commented Feb 12, 2020

cespare commented Feb 12, 2020

cespare commented Feb 12, 2020

ivoras commented Feb 12, 2020 •

edited

Loading