
cmd/compile: PGO opportunities umbrella issue #62463

Open
2 of 18 tasks
aclements opened this issue Sep 5, 2023 · 9 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. umbrella
Comments

@aclements
Member

aclements commented Sep 5, 2023

This issue is to track the list of PGO optimization opportunities we're considering. As we begin work on any of these, it should be broken into its own issue. We'll edit and add to this list over time.

  • Inlining
  • Indirect call devirtualization: If the profile shows that a particular closure or method is by far the most common target of an indirect call, specialize the caller to check whether that is the target and make a direct call if so (possibly even inlining the target). Done, modulo some follow-up work.
  • Dynamic escapes on cold paths: If the profile shows that the only reason for an escape is on a cold path and the hot path does not escape an allocation, we could optimistically stack-allocate and dynamically move the allocation to the heap on the cold path.
  • Stenciling: Stencil hot generic functions, possibly based on type information from the profile.
  • Local basic block ordering: ordering blocks within a function to cluster hot blocks and potentially improve branch prediction (though the win of the latter on modern CPUs is small).
  • Register allocation: register allocation currently uses heuristics to determine a hot path and move spills off that hot path. PGO can tell it the true hot path.
  • Loop unrolling: Loop unrolling, like inlining, is a balance with binary size/i-cache footprint (also like inlining, its primary benefit on modern CPUs is enabling follow-on optimizations). We can use PGO to focus loop unrolling on hot loops, which will both have minimal impact on binary size and on compile time spent analyzing whether unrolling is worthwhile. (See cmd/compile: add unrolling stage for automatic loop unrolling #51302)
  • Loop unswitching: In hot functions, loop unswitching is more likely to be worth the code size increase.
  • Architecture feature check unswitching and/or function multiversioning: For architecture feature checks in hot functions, lift those feature checks, perhaps to the function level. (This is closely related to loop unswitching, but isn’t necessarily in a loop, and we know a priori that it’s safe to lift the condition.)
  • Select between branches and conditional MOVs: When a condition is predictable, branches are more efficient; when it is unpredictable, conditional MOVs are more efficient. Use a profile to select between these. (Predictability is hard to get from a CPU profile, so this may require an LBR profile; however, an LBR profile doesn't capture conditional MOVs, so this may just be impractical.)
  • Function ordering: Clustering functions at a whole-binary level for better locality.
  • Global block ordering: A step beyond function ordering. The basic form of this is hot/cold splitting, but it can be more aggressive than that.
  • Map/slice pre-sizing: Pre-size maps and slices based on allocation site. (This requires more than a CPU profile.)
  • Stack pre-sizing by site: Allocate initial stacks based on profiled stack sizes. (This requires more than a CPU profile. We already have simple runtime heuristics for this and it may be possible to do better at runtime, without a profile.)
  • Lifetime allocation: Co-locate allocations with similar lifetimes by allocation site. (This also requires more than a CPU profile. This may be possible to do at runtime, but sophisticated techniques are also expensive.)
  • Loop alignment: Aligning small loops can have a large performance impact, but also increases binary size. With PGO, we could do this only for hot loops. (E.g., section 3.4.1.4 of the Intel Optimization Reference Manual)

(This list was originally based on an old comment of mine, and this issue is partly to surface and track this list better.)

Any basic block-level optimizations likely depend on profile discriminators (#59612).
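As a rough source-level illustration of the devirtualization transform (this is not the compiler's actual output, and Shape/Square are made-up names): a hot indirect call is replaced by a type test guarding a direct, inlinable call, with the indirect call kept as a correct fallback.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// totalArea makes an indirect call through the interface on every iteration.
func totalArea(shapes []Shape) float64 {
	var sum float64
	for _, s := range shapes {
		sum += s.Area() // indirect call
	}
	return sum
}

// totalAreaPGO sketches what the compiler could emit when the profile says
// Square.Area is by far the most common target: a type test guarding a
// direct (and here inlined) call, with the indirect call as a cold fallback.
func totalAreaPGO(shapes []Shape) float64 {
	var sum float64
	for _, s := range shapes {
		if sq, ok := s.(Square); ok {
			sum += sq.side * sq.side // direct call, inlined
		} else {
			sum += s.Area() // cold fallback: still correct for any Shape
		}
	}
	return sum
}

func main() {
	shapes := []Shape{Square{2}, Square{3}}
	fmt.Println(totalArea(shapes), totalAreaPGO(shapes)) // 13 13
}
```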

@aclements aclements added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 5, 2023
@aclements aclements added this to the Backlog milestone Sep 5, 2023
@cherrymui cherrymui added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. umbrella labels Sep 5, 2023
@erifan

erifan commented Sep 6, 2023

For this one "Select between branches and conditional MOVs", some architecture features such as Arm's Branch Record Buffer Extension (BRBE) may help.
By the way, are the above items something that Google is considering doing or is doing? Or is it just a list of possible PGO optimization opportunities that anyone who is interested can work on?

@myaaaaaaaaa

Another possibility could be to automatically apply Function Multi-Versioning to hot functions with large loops, which would greatly improve the performance of math-heavy workloads.

Intel's Clear Linux uses this approach to automatically take advantage of modern CPU instructions while still being backwards-compatible with x86-64-v1.

From my understanding, Go doesn't have a particularly mature auto-vectorizer, but having PGO-guided FMV should allow for new opportunities in this area.
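To make the function multi-versioning idea concrete, here's a minimal hand-rolled sketch in Go. Everything in it (hasFMA, dotGeneric, dotFMA) is a hypothetical stand-in; real FMV would probe CPU features (e.g. via CPUID) and compile genuinely different code for each version, with the dispatch done once rather than per call.

```go
package main

import "fmt"

// hasFMA stands in for a real CPU feature probe; hardcoded false here so
// the sketch is self-contained and deterministic.
var hasFMA = false

// dotProduct is bound once at init time to the best available version, so
// hot callers pay only an indirect call, not a per-call feature check.
var dotProduct = dotGeneric

func init() {
	if hasFMA {
		dotProduct = dotFMA
	}
}

func dotGeneric(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// dotFMA would use fused multiply-add instructions on capable CPUs; it is
// the same scalar loop here, purely as a placeholder.
func dotFMA(a, b []float64) float64 { return dotGeneric(a, b) }

func main() {
	fmt.Println(dotProduct([]float64{1, 2, 3}, []float64{4, 5, 6})) // 32
}
```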

@cherrymui
Member

@erifan LBR support is on our plan, and we've been thinking about it. I'm not sure it belongs in this issue, though (perhaps it could).

@aclements
Member Author

> For this one "Select between branches and conditional MOVs", some architecture features such as Arm's Branch Record Buffer Extension (BRBE) may help.

@erifan Does BRBE record conditional moves? That would certainly be nice. x86's LBR does not, which would make this pretty tricky on x86.

> By the way, are the above items something that Google is considering doing or is doing? Or is it just a list of possible optimization opportunities for PGO that anyone who is interested can do?

These are things we're considering doing, but we're not staking a claim or anything. :) Currently, we're definitely working on "Indirect call devirtualization" (1.21 had a version of this, but with significant limitations we're hoping to lift in 1.22). We're also thinking seriously about "Dynamic escapes on cold paths", but I don't think we've started implementing it. We haven't made inroads on the others ourselves. I believe Uber has done some work on "Function ordering", but I haven't heard any updates on that in quite a while.

@aclements
Member Author

> Another possibility could be to automatically apply Function Multi-Versioning to hot functions with large loops, which would greatly improve the performance of math-heavy workloads.

@myaaaaaaaaa , thanks. That's what I meant by "Architecture feature check unswitching", but I've added the term "function multi-versioning" to that in my list. Function multi-versioning is a specific way to do this, and does have a nice advantage that if you have a call from A -> B and both A and B are multi-versioned, A can make a direct call to the right version of B.

@erifan

erifan commented Sep 7, 2023

Thanks @cherrymui
@aclements I just took a look at the BRBE documentation and it doesn't record conditional moves either.

@myaaaaaaaaa

myaaaaaaaaa commented Oct 2, 2023

There may also be an opportunity to scan for code regions that can be safely parallelized (such as pure functions/loops), and automatically rewrite them to launch as separate goroutines that send their results back through a channel, effectively implementing a function-scale version of instruction-level parallelism.

This would normally risk introducing synchronization/context switching overhead, but PGO would allow the compiler to apply this optimization only to functions/loops that typically run for a long time every invocation (say, 1-10ms).

I'm unsure how broadly applicable this would be in practice. On the other hand, I would imagine that successfully detecting just a few functions can easily create enough parallelism to saturate even 32-thread machines, since goroutine counts would increase exponentially with every parallelized function in the stack.
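A hedged sketch of that proposed rewrite (expensive is a hypothetical pure function the profile would show running long enough per invocation to be worth a goroutine): independent calls are launched concurrently and their results sent back through channels.

```go
package main

import "fmt"

// expensive stands in for a pure function a profile shows running ~1-10ms
// per invocation (here just a cheap loop so the sketch runs instantly).
func expensive(n int) int {
	sum := 0
	for i := 1; i <= n; i++ {
		sum += i
	}
	return sum
}

// sequential is the original shape of the code: two independent pure calls.
func sequential() int {
	a := expensive(1000)
	b := expensive(2000)
	return a + b
}

// parallelized sketches the proposed compiler rewrite: each independent
// pure call is launched as a goroutine sending its result on a channel.
func parallelized() int {
	ca, cb := make(chan int, 1), make(chan int, 1)
	go func() { ca <- expensive(1000) }()
	go func() { cb <- expensive(2000) }()
	return <-ca + <-cb // same result, computed concurrently
}

func main() {
	fmt.Println(sequential(), parallelized()) // 2501500 2501500
}
```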

@felixge
Contributor

felixge commented Oct 3, 2023

> but PGO would allow the compiler to apply this optimization only to functions/loops that typically run for a long time every invocation (say, 1-10ms).

I don’t think PGO (CPU profiles) knows anything about the number of times a function is invoked or how long those invocations last?

@shoham-b

shoham-b commented Dec 28, 2023

All of the pre-sizing optimizations could also be driven by dynamic PGO, which would give a more dedicated, exact profile. .NET added dynamic profiling and people were happy about it.

The dynamic-escape optimization might also benefit from that, but I'm not sure. It would be something like: escape if the cold branch is taken, but also escape at the beginning of the call if dynamic PGO figures out this should not be optimized.

We could even use OSR (on-stack replacement) for all the other optimizations, but that is probably a bit too much for now.
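For reference, the map pre-sizing transform discussed in the issue list amounts to inserting a capacity hint at the allocation site. A minimal sketch (buildIndex is a made-up example; a real implementation would derive the hint from a recorded size profile for this site, not from a visible length):

```go
package main

import "fmt"

// buildIndex is the original code: the map starts at its default size and
// grows (rehashing existing entries) as insertions accumulate.
func buildIndex(words []string) map[string]int {
	m := make(map[string]int)
	for i, w := range words {
		m[w] = i
	}
	return m
}

// buildIndexPGO sketches the pre-sizing rewrite: a size profile for this
// allocation site shows the map typically ends up with about len(words)
// entries, so a capacity hint is inserted and growth is avoided.
func buildIndexPGO(words []string) map[string]int {
	m := make(map[string]int, len(words)) // hint from profiled final size
	for i, w := range words {
		m[w] = i
	}
	return m
}

func main() {
	words := []string{"a", "b", "c"}
	fmt.Println(len(buildIndex(words)), len(buildIndexPGO(words))) // 3 3
}
```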


6 participants