-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: use context-sensitivity to prevent inlining of cold callsites in PGO #72107
Comments
Change https://go.dev/cl/654935 mentions this issue: |
Hi @yukatan1701, thanks for filing this. If this moves forward, I think it would be viewed as an internal implementation improvement and as such would not need to go through the proposal process (https://go.dev/s/proposal), so I'll take it out of the proposal process. |
Hi @yukatan1701, ah, from a quick look, I see your draft CL includes adding a new In general, proposals take more time to work through (both for you, as well as for the core Go team), so my question to you is whether a user-facing flag like that or other proposal-level change is critical, vs. maybe more of a "nice to have". My advice would be that if your change is always or almost always an improvement, then a new cmd/go build flag probably would not be needed, and it would probably be better to do something like add a compiler debug flag, or use one that exists. That said, I don't know if this change is something that fits that category. You can look at some of the existing PGO debug flags here: go/src/cmd/compile/internal/base/debug.go Lines 68 to 73 in 583d586
Also, if you haven't seen it yet, the "Tips" section of the cmd/compile README has an overview of how to use those types of debug flags (including which package a debug flag affects is something someone might not just guess on their own): https://github.com/golang/go/tree/master/src/cmd/compile#8-tips |
Thanks for working on this @yukatan1701, the results look nice (thanks for running sweet!). Just to clarify, am I understanding correctly that I agree that ultimately this doesn't need to be a proposal or add a top-level flag to It looks like this is currently using a I didn't quite understand the new serialization format from a quick look (an overview comment would be helpful), but if feasible, I think the right approach would be to have [1] For future reference, your life would likely have been easier by adding a new GOEXPERIMENT (example). GOEXPERIMENTS are already handled by the |
Hi @thepudds! Initially, I had no plans to add a new build flag. I used a PGO debug flag but encountered the following issue. Context-sensitive inlining requires a special CS-graph that allows the compiler to quickly check whether an inlining path is present in any sample. Building this graph can take some time, depending on the volume of profile data. To speed up this process, I needed a pre-profile. The only way to leverage this was to introduce a new build flag. Since CS-inlining may slow down compilation, I believe this optimization should be optional. It is particularly useful for those looking to balance code size and performance, especially in projects with large codebases. Given the potentially long compilation times, the need for pre-profiling is justified. I have no objections to using the debug flag if there is a way to incorporate the pre-profile. @prattmic, thanks for the recommendations. I'll try to come up with workarounds to load the pre-profile for CS-inlining.
Yes, it is. |
CC @golang/compiler |
Thanks for looking into this. I actually tried something simple along this line (very simple heuristics for not inlining cold calls) in the early days working on PGO. But result was not good. There were actually quite some regression. So we decided to not penalize cold functions. I think a reason could be due to the sampling nature of CPU profiles. Some functions may still be important to performance, but just didn't get enough samples. An extreme case would be that a function is inlined and optimized out to very few instructions (even a constant), which would get very few CPU samples, but this optimization is actually important for performance. It is possible that the new algorithm in the paper is better. I haven't read it. Does it have mechanisms to handle such cases? Thanks. |
@cherrymui, thanks for your comment, this is a very interesting note. During this research we observed some degradations after not inlining cold functions. These degradations related to the cases of actually cold (or even unreachable) functions. Such functions were small and had size about 5 instructions or lesser. The reasons of such degradation were additional spill/fill instructions around call and general stack size increase. We solved this problem with adding heuristics, that prevent not inlining of small functions. However, testing with bent shows performance degradation in several tests, and we improving heuristics. Some of heuristics will show better result with AST node profile (especially if-branches profile). Speaking about functions, that will be inlined and their size will be reduced to few instructions and make it invisible to profile sampling, we assume, that if function is big enough, than the probability of such case is quite low. Currently our lower size bound heuristic is inline cost 40. We tested this optimization with sweet and our production code with ~11 million lines of code and did not see degradations. We are ready to test with any other benchmark. |
Proposal Details
proposal: cmd/compile: use context-sensitivity to prevent inlining of cold callsites in PGO
Abstract
This proposal aims to support context-sensitive inline in the Go Compiler. Context-sensitive inline extends the current inlining approach. It considers both an inlining path and profile information when deciding whether a function is worth being inlined in a particular callsite. It helps reduce code size without any significant performance changes. This proposal is based on Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh, "Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation" paper. Section III.B of the paper ("Context-sensitive Sampling-based PGO") contains details.
Problem
This example demonstrates in which cases this optimization can be useful:
As can be seen,
scalarSub
is never called when invokingscalarOp
fromAddVectorHead
. The Go Compiler's inliner can't handle such cases and inlines bothscalarAdd
andscalarSub
inscalarOp
by default. In this simple example, unreachable calls will be eliminated later, after constant propagation and dead code elimination. However, this approach does not work in more complex cases where specific variable values are unknown. In such situations, we can use profile information to prevent the inlining of cold call sites, which helps reduce code size.A callsite may be cold on one execution path and hot on another. In the example above, a call to
scalarAdd
is hot whenscalarOp
is invoked fromAddVectorHead
while a call toscalarSub
is cold. After inliningscalarOp
inAddVectorHead
, the nodes of the inlined function body are traversed. Since we know the inlining pathAddVectorHead -> scalarOp
, we can check whether the pathAddVectorHead -> scalarOp -> scalarSub
is cold before inliningscalarSub
.Proposed changes
One possible solution is to use profile information. A collected profile contains a list of samples representing call stacks along with additional data. If there are no samples that match the inlining path (i.e. if this path didn't appear in the samples), the inlining of
scalarSub
can be prevented in this context. In this particular example, it may be best to inlinescalarSub
everywhere as it would need only a singlesub
instruction instead of a function call. However, preventing inlining can be beneficial when applied to relatively large functions. Context-sensitivity involves considering the path through a call graph that corresponds to the call stack leading to the current call site.Collect a profile for the example:
Compile with PGO enabled and look at inlining decisions:
As we can see on the last four lines, both
scalarAdd
andscalarSub
were inlined intoAddVectorHead
andSubVectorHead
.Context-Sensitive Inline prevents cold callsites from being inlined:
Implementation details
For this example, there are only 7 samples in the profile:
Each sample represents a call stack and its occurrence count in the profile. Stack frames are sorted from leaf to root.
AddVectorHead:15 -> scalarOp:25 -> scalarSub
doesn't appear in the profile, so inlining ofscalarSub
may be prevented in this case. However,scalarAdd
is worth being inlined after finding theAddVectorHead:15 -> scalarOp:25 -> scalarAdd
path in the graph.Results
Benchmarking with Sweet (threshold = 40) shows that this approach can reduce code size by up to 5.11% without any significant performance changes. Some internal tests show a code size reduction of 10% (threshold = 30).
ARMv8 Kunpeng920
Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
The text was updated successfully, but these errors were encountered: