Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: amd64 addressing modes are unwieldy #36468

Closed
josharian opened this issue Jan 9, 2020 · 7 comments
Closed

cmd/compile: amd64 addressing modes are unwieldy #36468

josharian opened this issue Jan 9, 2020 · 7 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@josharian
Copy link
Contributor

There are many amd64 instructions that accept a full range of memory addressing modes. It would be nice to make use of them all.

The current approach is to give each its own op, such as BTRQmodifyidx8 or ANDQloadidx1. The problem is that this generates a large number of ops and rules. To add ANDQloadidx1, we need to find all rules starting with ANDQ and add variants for ANDQloadidx1 as appropriate. And we need to add rules (say) absorbing SHLQ of the index into the scale, and ADDQ into the offset. For an example of how many rules are involved just for adding addressing modes for CMP, see https://golang.org/cl/167089. And that's a fairly simple case, because there aren't many existing rules involving CMPQ that need indexed load analogues.

Such additions have steady incremental improvements in generated code, at the cost of developer time, reviewer time, compiler binary size, and time and memory required to build the compiler (which has proved to be an ongoing problem).

I wonder whether we can find a way to factor out some of this commonality, so we can hoover up all the incremental improvements at much less cost.

One idea is to encode the scale into AuxInt, so we'd have just idx instead of idx1, idx2, idx4, idx8. We'd need helper routines to tell us whether it's ok to adjust the scale in a particular way, since not all scales are available for all bit widths, but that is doable. The great shortcoming of this idea is that idx1 ops are commutative, but the others are not. Maybe just encode the scale for 2, 4, and 8?

Another (vague) idea is to have a separate handwritten pass after lower that does nothing but addressing modes. We'd mark ops with what kind of address modes they support in which argument slots and then do all combining then. We'd still have a combinatorial number of ops, but many fewer rules.

This is actually an instance of a broader amd64 problem around having a combinatorial number of ops (base X const? X load/store/modify X addressingmode). Maybe if we structured this all appropriately, we could generate (rather than hand code) all the ops and their encodings in amd64/ssa.go and their generated rules (like we did for commutativity).

For discussion and ideas.

cc @randall77 @mvdan @martisch @cherrymui

@toothrot toothrot added this to the Backlog milestone Jan 9, 2020
@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 9, 2020
@gopherbot
Copy link

Change https://golang.org/cl/214042 mentions this issue: cmd/compile: move amd64 addressing modes to a new pass

@josharian
Copy link
Contributor Author

Here's another example of the combinatorial rule problem with amd64.

LEAx values with non-zero offsets get split at the end of compilation into multiple LEAs to avoid slow LEAs.

Some other values can absorb the LEA offsets, which would lead to a simpler encoding. Here is a sample rule:

(MOVQstoreconst [x] {sym1} (LEA(Q1|Q2|Q4|Q8|L1|L2|L4|L8|W1|W2|W4|W8) <t> [c] {sym2} ptr idx) mem) && c != 0 && ValAndOff(x).canAdd(c) ->
  (MOVQstoreconst [ValAndOff(x).add(c)] {sym1} (LEA(Q1|Q2|Q4|Q8|L1|L2|L4|L8|W1|W2|W4|W8) <t> [0] {sym2} ptr idx) mem)

But if you look at the number of ops that have ValAndOff-style auxints that can absorb an index, there are a bunch...the combinatorial number mentioned above. As such, even though these rules do shorten generated code I'm not inclined to add them.


Another possible approach here is to construct another way to write rewrite rules. We have | to do parallel op selection. We could have named |s to let you do combinatorial op selection. (Maybe the number of consecutive |s?) Something like this:

(X(a|b)(c||d) [z]) -> (Y(a|b)(c||d) [c])

This would generate combinations Xac, Xad, Xbc, Xbd, correlated to Yac, Yad, Ybc, Ybd. If you needed eight, you could have (X(a|b)(c||d)(e|||f) [z]).

Problems with this idea. (1) The syntax is lousy and I don't see a better solution. (2) It generates a lot of rewrite rules. I guess we could do this with loops instead of generation, like we did with commutativity, but that's gonna start hurting my head pretty soon.

@gopherbot
Copy link

Change https://golang.org/cl/217097 mentions this issue: cmd/compile: insert complicated x86 addressing modes as a separate pass

@randall77
Copy link
Contributor

I hacked together another possible strategy to fix this. Take a look and let me know what you think.

gopherbot pushed a commit that referenced this issue Mar 10, 2020
Use a separate compiler pass to introduce complicated x86 addressing
modes.  Loads in the normal architecture rules (for x86 and all other
platforms) can have constant offsets (AuxInt values) and symbols (Aux
values), but no more.

The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
separate pass that combines loads with LEAQx ops.

Organizing rewrites this way simplifies the number of rewrites
required, as there are lots of different rule orderings that have to
be specified to ensure these complex addressing modes are always found
if they are possible.

Update #36468

Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
@gopherbot
Copy link

Change https://golang.org/cl/222782 mentions this issue: cmd/compile: convert 386 port to use addressing modes pass

gopherbot pushed a commit that referenced this issue Mar 13, 2020
Update #36468

Change-Id: Idfdb845d097994689be450d6e8a57fa9adb57166
Reviewed-on: https://go-review.googlesource.com/c/go/+/222782
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
@randall77
Copy link
Contributor

I'm going to close this, I think we've done enough, at least for x86. I'll open a separate issue for s390x.

@gopherbot
Copy link

Change https://golang.org/cl/227960 mentions this issue: cmd/compile: add indexed load+op operations to amd64

gopherbot pushed a commit that referenced this issue Apr 30, 2020
name        old time/op  new time/op  delta
LoadAdd-16   545ns ± 0%   456ns ± 0%  -16.31%  (p=0.000 n=10+10)

Update #36468

Change-Id: I84f390d55490648fa1f58cdbc24fd74c4f1bc8c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/227960
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
xujianhai666 pushed a commit to xujianhai666/go-1 that referenced this issue May 21, 2020
name        old time/op  new time/op  delta
LoadAdd-16   545ns ± 0%   456ns ± 0%  -16.31%  (p=0.000 n=10+10)

Update golang#36468

Change-Id: I84f390d55490648fa1f58cdbc24fd74c4f1bc8c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/227960
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
@golang golang locked and limited conversation to collaborators Jun 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants