Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: reorganize associative computation to allow superscalar execution #49331

Open
josharian opened this issue Nov 4, 2021 · 8 comments · May be fixed by #60278
Open

cmd/compile: reorganize associative computation to allow superscalar execution #49331

josharian opened this issue Nov 4, 2021 · 8 comments · May be fixed by #60278
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@josharian
Copy link
Contributor

The compiler currently compiles a+b+c+d as a+(b+(c+d)). It should use (a+b)+(c+d) instead, because the latter can be executed out of order.

More broadly, we should balance trees of associative computation.

Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)

It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.

I don't plan to work on this further, but I really hope someone else picks it up.

cc @randall77 @martisch @FiloSottile @mmcloughlin @mdempsky


To reproduce the md5 results, disable the optimized assembly routines, and add this rewrite rule:

(Add32 <t> c:(Const32) (Add32 d:(Load _ _) (Add32 x:(Xor32 _ _) a:(Add32 _ _)))) => (Add32 (Add32 <t> c d) (Add32 <t> x a))

and disable this one to avoid an infinite loop:

(Add32 (Add32 i:(Const32 <t>) z) x) && (z.Op != OpConst32 && x.Op != OpConst32) => (Add32 i (Add32 <t> z x))
@seebs
Copy link
Contributor

seebs commented Nov 4, 2021

	a := float64(1<<53)
	b := float64(1)
	c := float64(1)
	d := float64(0)
	e := (a+(b+(c+d)))
	f := (a+b)+(c+d)

It's probably good for integers, because I think that, with the guarantees Go gives, addition is associative in Go, but for float values, it isn't always.

@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 4, 2021
@josharian
Copy link
Contributor Author

@seebs the SSA backend has different ops for different data types exactly to avoid this kind of issue. (We have @dr2chase to thank for that.) Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

@mdempsky
Copy link
Member

mdempsky commented Nov 5, 2021

Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

Note that it is safe (AFAICT) to rewrite (ptr + x1) + x2 into ptr + (x1 + x2), just not the other way around. I'm skeptical this is useful in practice though.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
ryan-berger added a commit to ryan-berger/go that referenced this issue May 5, 2023
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
ryan-berger added a commit to ryan-berger/go that referenced this issue May 5, 2023
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
@gopherbot
Copy link

Change https://go.dev/cl/493115 mentions this issue: cmd/compile: add reassociate pass to rebalance commutative operations

@y1yang0
Copy link
Contributor

y1yang0 commented May 6, 2023

Interesting. As far as I know, reassociate in other compilers is mainly to assist other optimizations (such as sccp, gcse, loopopt). It seems that reassociate is used to "accelerate CPU superscalar execution" is a new expressive statement, is there any related extension information/paper? Maybe this is also helpful for other compilers. Thanks.

Java applies reassociate only when invariant participates computation in loop. GCC trunk does not reassociate a+b+c+d to (a+b)+(c+d). Clang trunk does reassociates it.

@CAFxX
Copy link
Contributor

CAFxX commented May 6, 2023

is there any related extension information/paper?

@y1yang0 see section 9.5 (and in general the whole of chapter 9) of https://www.agner.org/optimize/optimizing_assembly.pdf

@ryan-berger
Copy link

@y1yang0 Yes, Clang outputs the optimization because of the Reassociate pass in LLVM.

This is an extremely simple analysis pass compared to LLVM's version, and some of the things the LLVM pass are already taken care of by the rewrite rules which I didn't know existed until after I submitted the PR so I've simplified this even further to get rid of constant sorting.

If this pass runs again further after opt it might have constant folding opportunities but it is more important to get done before gcse since it can help group expressions together nicely. For now the goal is to just get out of order execution accelerated and I'll work on some more of the opportunities that this optimization opens up soon.

It looks like the reason Java does a reassociation optimization only inside of loops is auto-vectorization. This sort of pass helps sort out all the dependencies and lets you pretty easily recognize 4 consecutive additions that could be turned into some SIMD if a model decides it is cost effective.

This pass seems to help make analysis much easier in a lot of places though so I'll be combing through some other compilers and LLVM to see if we can leverage it more in Go

ryan-berger added a commit to ryan-berger/go that referenced this issue May 18, 2023
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
@gopherbot
Copy link

Change https://go.dev/cl/496095 mentions this issue: cmd/compile: add ilp pass to help balance commutative expressions aiding in instruction level parallelism

ryan-berger added a commit to ryan-berger/go that referenced this issue Sep 9, 2023
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
ryan-berger added a commit to ryan-berger/go that referenced this issue Sep 9, 2023
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
ryan-berger added a commit to ryan-berger/go that referenced this issue Jan 3, 2024
…ions

Currently the compiler groups expressions with commutative operations
such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for
CPU instruction pipelining.

This pass balances commutative expressions as shown above to (a + b) + (c + d)
to optimally pipeline them. It also attempts to reassociate constants to as far
right of the commutative expression as possible for better constant folding
opportunities.

Below is a benchmark from crypto/md5 on an MacBook Pro M2:

          trunk          reassociate
Hash1K-8  433.7Mi ± 0%   499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8  454.3Mi ± 1%   524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean   284.4Mi        327.5Mi       +15.15%

Other CPU architectures tried showed very little change (+/-1%) on
this particular benchmark but tight mathematical code stands to gain
greatly from this optimization

Fixes golang#49331
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: Triage Backlog
9 participants