Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: missed obvious inlining #58905

Closed
aclements opened this issue Mar 6, 2023 · 13 comments
Closed

cmd/compile: missed obvious inlining #58905

aclements opened this issue Mar 6, 2023 · 13 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Milestone

Comments

@aclements
Copy link
Member

I'm debugging a performance regression in CL 466099. One cause is that the compiler does not inline the call from newInlineUnwinder to resolveInternal, even though resolveInternal is inlinable according to the -m output and the call from next to resolveInternal a few lines down does get inlined. This appears to be because this call to typecheck.HaveInlineBody returns false for resolveInternal when we're visiting newInlineUnwinder and true when we're visiting next. Beyond that I get past my depth. The -m output seems to indicate that the inliner visits newInlineUnwinder, then resolveInternal, then next, which is surprising because that's not in bottom-up order (and this is regardless of source order). If visiting a function is what creates the inline body, then this visit order is clearly a problem.

Relevant -m output:

./symtabinl.go:49:6: cannot inline newInlineUnwinder: function too complex: cost 173 exceeds budget 80
./symtabinl.go:50:21: inlining call to funcdata
./symtabinl.go:60:6: can inline (*inlineUnwinder).resolveInternal with cost 73 as: method(*inlineUnwinder) func(uintptr) { u.pc = pc; u.index = pcdatavalue1(u.f, _PCDATA_InlTreeIndex, pc, u.cache, false) }
./symtab.go:1109:6: cannot inline pcdatavalue1: function too complex: cost 100 exceeds budget 80
./symtab.go:1113:32: inlining call to pcdatastart
./symtab.go:1113:32: inlining call to add
./symtabinl.go:72:6: cannot inline (*inlineUnwinder).next: function too complex: cost 170 exceeds budget 80
./symtabinl.go:78:29: inlining call to funcInfo.entry
./symtabinl.go:78:19: inlining call to (*inlineUnwinder).resolveInternal

cc @mdempsky @thanm

@aclements aclements added Performance compiler/runtime Issues related to the Go compiler and/or runtime. labels Mar 6, 2023
@aclements aclements added this to the Go1.21 milestone Mar 6, 2023
@cuonglm
Copy link
Member

cuonglm commented Mar 7, 2023

It seems to me that in next, funcInfo.entry only has cost 65, so after it's inlined, the max inline budget is still not reached, so the inliner continue to process and inlining call to resolveInternal.

In newInlineUnwinder, funcInfo.funcdata has cost 73, so after it's inlined, the max inline budget is reached, so the inliner stop processing, causing resolveInternal not inlined.

@aclements
Copy link
Member Author

The inliner doesn't track an ongoing cost within a function and stop inlining once some threshold is reached. With one exception, the threshold it applies is solely on the complexity of the target of a given call edge. The one exception is that if we're inlining into a "big" function, we reduce the threshold for all call edges coming from the big function, but none of the functions involved here are close to the "big" threshold. In this case, resolveInternal has cost 73, which is under the default threshold of 80, so it should be inlined into all non-big functions.

(We're looking into changing up all of this, but that's how it currently works.)

@cuonglm
Copy link
Member

cuonglm commented Mar 7, 2023

Sounds like #54632

@aclements
Copy link
Member Author

Sounds like #54632

I don't see how that's related. All of the function references in my code in static calls, and this is a case where we simply don't inline a call, versus producing bad code from inlining.

@aclements
Copy link
Member Author

Just in case it matters:

go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/austin/.cache/go-build"
GOENV="/home/austin/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/austin/r/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/austin/r/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/austin/go.tmp"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/austin/go.tmp/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.21-eaa00d878e Thu Mar 2 17:29:22 2023 -0500"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/austin/go.tmp/src/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2641378236=/tmp/go-build -gno-record-gcc-switches"

Bootstrap is go1.17.13.

@thanm
Copy link
Contributor

thanm commented Mar 7, 2023

I pulled the patch, e.g.

git fetch https://go.googlesource.com/go refs/changes/99/466099/6 && git checkout FETCH_HEAD

and I can see the weird behavior. To get more info on the ordering, I added some code to the inliner's InlineDecls function to trace the order in which functions are processed. Specifically just before this line I inserted this code:

			fmt.Fprintf(os.Stderr, "=-= bottom-up visit to: [")
			for _, n := range list {
				fmt.Fprintf(os.Stderr, " %s", ir.PkgFuncName(n))
			}
			fmt.Fprintf(os.Stderr, "]\n")

From the resulting trace output, it looks like the odd ordering is due to cycles in the call graph. Specifically it looks like there is a giant cycle that includes runtime.newInlineUnwinder, (*inlineUnwinder).resolveInternal, and a large number of other functions (on the order of 740). Here is the blob:

=-= bottom-up visit to: [ runtime.throw runtime.throw.func1 runtime.fatalthrow runtime.fatalthrow.func1 runtime.startpanic_m runtime.lock runtime.lockWithRank runtime.lock2 runtime.schedtrace runtime.unlock runtime.unlockWithRank runtime.unlock2 runtime.forEachG runtime.schedtrace.func1 runtime.dopanic_m runtime.traceback runtime.traceback1 runtime.printCgoTraceback runtime.printOneCgoTraceback runtime.callCgoSymbolizer runtime.cgocall runtime.entersyscall runtime.reentersyscall runtime.save runtime.badctxt runtime.casgstatus runtime.casgstatus.func1 runtime.reentersyscall.func1 runtime.traceGoSysCall runtime.traceEvent runtime.traceAcquireBuffer runtime.traceReleaseBuffer runtime.traceEventLocked runtime.traceEventLocked.func1 runtime.traceFlush runtime.sysAlloc runtime.(*sysMemStat).add runtime.traceStackID runtime.callers runtime.callers.func1 runtime.gentraceback runtime.tracebackHexdump runtime.hexdumpWords runtime.printlock runtime.gwrite runtime.recordForPanic runtime.printunlock runtime.funcInfo.entry runtime.(*moduledata).textAddr runtime.tracebackHexdump.func1 runtime.funcspdelta runtime.pcvalue runtime.newInlineUnwinder runtime.(*inlineUnwinder).resolveInternal runtime.pcdatavalue1 runtime.(*inlineUnwinder).next runtime.(*inlineUnwinder).fileLine runtime.funcline1 runtime.printArgs runtime.pcdatavalue runtime.printArgs.func1 runtime.printArgs.func2 runtime.printArgs.func3 runtime.tracebackCgoContext runtime.cgoContextPCs runtime.gcallers runtime.(*traceStackTable).put runtime.(*traceStackTable).put.func1 runtime.(*traceStackTable).newStack runtime.(*traceAlloc).alloc runtime.atomicstorep runtime.atomicwb runtime.(*wbBuf).get2 runtime.wbBufFlush runtime.wbBufFlush.func1 runtime.wbBufFlush1 runtime.shade runtime.findObject runtime.badPointer runtime.greyobject runtime.setCheckmark runtime.(*gcWork).put runtime.(*gcWork).init runtime.getempty runtime.(*workbuf).checkempty runtime.(*mSpanList).remove runtime.(*mSpanList).insert runtime.getempty.func1 runtime.(*mheap).allocManual runtime.(*mheap).allocSpan runtime.(*pageAlloc).allocToCache runtime.(*pageAlloc).find runtime.(*pageAlloc).find.func1 runtime.(*mheap).grow runtime.(*mheap).sysAlloc runtime.(*linearAlloc).alloc runtime.sysMap runtime.sysMapOS runtime.sysUsed runtime.sysUsedOS runtime.sysReserveAligned runtime.(*fixalloc).alloc runtime.persistentalloc runtime.persistentalloc.func1 runtime.persistentalloc1 runtime.(*consistentHeapStats).acquire runtime.(*consistentHeapStats).release runtime.(*pageAlloc).grow runtime.(*pageAlloc).sysGrow runtime.(*pageAlloc).sysGrow.func1 runtime.(*pageAlloc).sysGrow.func2 runtime.(*pageAlloc).sysGrow.func3 runtime.makeAddrRange runtime.addrRange.subtract runtime.(*scavengeIndex).grow runtime.(*addrRanges).add runtime.(*pageAlloc).alloc runtime.(*mheap).allocMSpanLocked runtime.(*pageAlloc).scavenge runtime.(*pageAlloc).scavenge.func1 runtime.(*pageAlloc).scavengeOne runtime.(*pallocData).findScavengeCandidate runtime.fillAligned runtime.fillAligned.func1 runtime.sysUnused runtime.sysUnusedOS runtime.(*mheap).allocSpan.func1 runtime.(*limiterEvent).stop runtime.(*mheap).initSpan runtime.(*mheap).allocNeedsZero runtime.newMarkBits runtime.newArenaMayUnlock runtime.newAllocBits runtime.lfnodeValidate runtime.putempty runtime.(*lfstack).push runtime.trygetfull runtime.(*workbuf).checknonempty runtime.putfull runtime.(*wbBuf).reset runtime.(*gcWork).putBatch runtime.entersyscall_sysmon runtime.notewakeup runtime.runSafePointFn runtime.entersyscall_gcwait runtime.traceGoSysBlock runtime.traceProcStop runtime.exitsyscall runtime.exitsyscallfast runtime.wirep runtime.exitsyscallfast_reacquired runtime.exitsyscallfast_reacquired.func1 runtime.traceGoSysExit runtime.exitsyscallfast.func1 runtime.exitsyscallfast_pidle runtime.pidleget runtime.acquirep runtime.(*mcache).prepareForSweep runtime.(*mcache).releaseAll runtime.(*mcentral).uncacheSpan runtime.(*sweepLocked).sweep runtime.traceGCSweepSpan runtime.freeSpecial runtime.queuefinalizer runtime.mProf_Free runtime.(*bucket).mp runtime.tracefree runtime.tracefree.func1 runtime.(*mspan).reportZombies runtime.(*spanSet).push runtime.(*atomicHeadTailIndex).incTail runtime.(*spanSetBlockAlloc).alloc runtime.(*sweepLocked).sweep.func1 runtime.(*mheap).freeSpan runtime.(*mheap).freeSpan.func1 runtime.(*mheap).freeSpanLocked runtime.(*gcControllerState).update runtime.traceHeapAlloc runtime.stackcache_clear runtime.stackpoolfree runtime.(*mheap).freeManual runtime.traceProcStart runtime.exitsyscall.func1 runtime.tryRecordGoroutineProfileWB runtime.tryRecordGoroutineProfile runtime.doRecordGoroutineProfile runtime.doRecordGoroutineProfile.func1 runtime.saveg runtime.traceGoStart runtime.Gosched runtime.gosched_m runtime.traceGoSched runtime.goschedImpl runtime.schedule runtime.stoplockedm runtime.releasep runtime.handoffp runtime.startm runtime.mReserveID runtime.checkmcount runtime.newm runtime.allocm runtime.(*rwmutex).rlock runtime.(*rwmutex).rlock.func1 runtime.notesleep runtime.allocm.func1 runtime.stackfree runtime.sysFree runtime.stackcacherelease runtime.mcommoninit runtime.mpreinit runtime.malg runtime.malg.func1 runtime.stackalloc runtime.stackpoolalloc runtime.stackcacherefill runtime.(*rwmutex).runlock runtime.newm1 runtime.newosproc runtime.newosproc.func1 runtime.pidleput runtime.updateTimerPMask runtime.wakeNetPoller runtime.netpollBreak runtime.wakep runtime.pidlegetSpinning runtime.incidlelocked runtime.checkdead runtime.checkdead.func1 runtime.fatal runtime.fatal.func1 runtime.timeSleepUntil runtime.mPark runtime.execute runtime.setThreadCPUProfiler runtime.findRunnable runtime.gcstopm runtime.stopm runtime.mput runtime.checkTimers runtime.adjusttimers runtime.dodeltimer runtime.siftupTimer runtime.badTimer runtime.siftdownTimer runtime.addAdjustedTimers runtime.doaddtimer runtime.netpollGenericInit runtime.netpollinit runtime.runtimer runtime.runOneTimer runtime.dodeltimer0 runtime.clearDeletedTimers runtime.traceReader runtime.traceGoUnpark runtime.(*gcControllerState).findRunnableGCWorker runtime.(*gcCPULimiterState).update runtime.(*gcCPULimiterState).updateLocked runtime.(*gcCPULimiterState).unlock runtime.(*gcControllerState).findRunnableGCWorker.func1 runtime.globrunqget runtime.runqput runtime.runqputslow runtime.ready runtime.netpoll runtime.injectglist runtime.injectglist.func1 runtime.runqputbatch runtime.stealWork runtime.runqsteal runtime.(*gcControllerState).addIdleMarkWorker runtime.(*gcControllerState).removeIdleMarkWorker runtime.checkRunqsNoP runtime.checkIdleGCNoP runtime.resetspinning runtime.startlockedm runtime.exitsyscall0 runtime.printcreatedby runtime.printcreatedby1 runtime.funcline runtime.printAncestorTraceback runtime.printAncestorTracebackFuncInfo runtime.tracebackothers runtime.tracebackothers.func1 runtime.crash runtime.dieFromSignal runtime.setsig runtime.sighandler runtime.sigprof runtime.(*cpuProfile).add runtime.(*cpuProfile).addExtra runtime.(*profBuf).write runtime.(*profBuf).wakeupExtra runtime.traceCPUSample runtime.runPerThreadSyscall runtime.doSigPreempt runtime.isAsyncSafePoint runtime.pcdatavalue2 runtime.sigsend runtime.tracebacktrap runtime.sigaction runtime.sysSigaction runtime.sysSigaction.func1 runtime.sigaction.func1]

This would probably explain some of the weirdness, since runtime.(*inlineUnwinder).resolveInternal appears later in this list than runtime.newInlineUnwinder.

We could try to do some work to process nodes within the SCC in a better order, although I am not sure exactly how much work that would be.

@thanm
Copy link
Contributor

thanm commented Mar 9, 2023

Here is a smaller example program that illustrates the problem (easier
to see here than with the giant runtime package).

package p

var p func(int) int

func root(x int) int {
	return mid(x + 1)
}

func mid(x int) int {
	root(2)
	return leaf(x + 1)
}

func leaf(x int) int {
	if x < 10 {
		return x
	}
	return 0
}

func apart(q, r int) int {
	return q + p(r)
}

When the inliner processes this package, it calls ir.VisitFuncsBottomUp to collect up strongly connected components in the call graph; then handles all functions in a given SCC (either a trivial SCC or a non-trivial SCC where recursion is actually happening) as a clump.

For each function "f" in an SCC clump, the inliner first calls CanInline (to see if "f" itself can be inlined), then calls InlineCalls on the body of "f" (to inline things into "f").

For non-trivial SCCs, this implies that when processing a function earlier in the clump, we may miss inlining opportunities since functions later in the clump haven't been tested for inlinability yet.

In the example above, we'll have a clump { "root", "mid" }, then all other functions will be in their own trivial SCCs. Thus the "-m=2" output for this package looks like:

./p.go:14:6: can inline leaf with cost 8 as: func(int) int { if x < 10 { return x }; return 0 }
./p.go:5:6: can inline root with cost 63 as: func(int) int { return mid(x + 1) }
./p.go:9:6: can inline mid with cost 80 as: func(int) int { root(2); return leaf(x + 1) }
./p.go:10:6: inlining call to root
./p.go:11:13: inlining call to leaf
./p.go:21:6: can inline apart with cost 63 as: func(int, int) int { return q + p(r) }

Note that even though "mid" is inlinable, we don't inline the call to "mid" from "root", because at the point we are doing inlining in "root", "mid" hasn't been checked yet.

I am going to send a CL that changes this behavior for non-trivial SCCs (e.g. does all "CanInline" calls in an SCC before any "InlineCalls" operations).

@gopherbot
Copy link

Change https://go.dev/cl/474955 mentions this issue: cmd/compile: reorder operations in SCCs to enable more inlining

@aclements
Copy link
Member Author

Thanks for figuring this out, Than.

This lends more support to the idea I've been pushing that inlining shouldn't be bottom-up at all, but should be done starting with the cheapest edge. That would fix SCC ordering issues by eliminating the SCC entirely.

@thanm
Copy link
Contributor

thanm commented Mar 9, 2023

For grins, I wrote a small CL that gathers statistics on how many functions are presented to the inliner as part of a non-trivial SCC. Here's what I got for cmd/compile and k8s kubelet:

Program Total nonTrivialSCC Closure
cmd/compile 17969 1531 942
kubelet 172575 2261 7626

So it seems that SCC functions are relatively uncommon but not unheard of. Note that this is total functions presented to the inliner (which is not the same as total functions in the final executable).

gopherbot pushed a commit that referenced this issue Mar 9, 2023
This patch changes the relative order of "CanInline" and "InlineCalls"
operations within the inliner for clumps of functions corresponding to
strongly connected components in the call graph. This helps increase
the amount of inlining within SCCs, particularly in Go's runtime
package, which has a couple of very large SCCs.

For a given SCC of the form { fn1, fn2, ... fnk }, the inliner would
(prior to this point) walk through the list of functions and for each
function first compute inlinability ("CanInline") and then perform
inlining ("InlineCalls"). This meant that if there was an inlinable
call from fn3 to fn4 (for example), this call would never be inlined,
since at the point fn3 was visited, we would not have computed
inlinability for fn4.

We now do inlinability analysis for all functions in an SCC first,
then do actual inlining for everything. This results in 47 additional
inlines in the Go runtime package (a fairly modest increase
percentage-wise of 0.6%).

Updates #58905.

Change-Id: I48dbb1ca16f0b12f256d9eeba8cf7f3e6dd853cd
Reviewed-on: https://go-review.googlesource.com/c/go/+/474955
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@thanm
Copy link
Contributor

thanm commented Mar 10, 2023

Let me know if this issue can be closed out... thanks

@mknyszek mknyszek assigned thanm and unassigned mdempsky Mar 15, 2023
@aclements
Copy link
Member Author

Thanks. My particular issue was fixed by your change, so I'll close this. There's still the open question of how to deal better with large SCCs, but that's part of a bigger project. :)

@gopherbot
Copy link

Change https://go.dev/cl/492015 mentions this issue: cmd/compile: remove debugging option InlineSCCOnePass from inliner

gopherbot pushed a commit that referenced this issue May 11, 2023
Delete the "InlineSCCOnePass" debugging flag and the inliner fallback
code that kicks in if it is used. The change it was intended to guard
has been working on tip for some time, no need for the fallback any
more.

Updates #58905.

Change-Id: I2e1dbc7640902d9402213db5ad338be03deb96c5
Reviewed-on: https://go-review.googlesource.com/c/go/+/492015
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Projects
None yet
Development

No branches or pull requests

5 participants