
cmd/compile: find a good way to eliminate zero/sign extensions on arm64 #42162

Closed
zhangfannie opened this issue Oct 23, 2020 · 12 comments
Labels: arch-arm64, FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done)

zhangfannie commented Oct 23, 2020

Recently, I have been working on the bit field optimizations for arm64 and found that the existing bitfield optimizations can cause the compiler to generate bad code in some cases.

For example, in the following case, the bitfield rewrite rule (SLLconst [sc] (MOVWUreg x)) && isARM64BFMask(sc, 1<<32-1, 0) => (UBFIZ [armBFAuxInt(sc, 32)] x) optimizes uint64(hi) << 18 into a UBFIZ, but we also have the rewrite rule (OR x0 x1:(SLLconst [c] y)) && clobberIfDead(x1) => (ORshiftLL x0 y [c]) that would optimize it into an ORR (shifted register). The latter is clearly better, but the UBFIZ rule fires first and prevents it.

e.g.

func or(hi, lo uint32) uint64 {
        return uint64(hi)<<18 | uint64(lo)
}

// the master assembly code:
or STEXT size=32 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000  TEXT    or(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000  MOVWU   hi(FP), R0
        0x0004 00004  UBFIZ   $18, R0, $32, R0
        0x0008 00008  MOVWU   lo+4(FP), R1
        0x000c 00012  ORR     R1, R0, R0
        0x0010 00016  MOVD    R0, r2+8(FP)
        0x0014 00020  RET     (R30)

// without the UBFIZ rewrite rule, the assembly code:
or STEXT size=32 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000 TEXT    or(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000 MOVWU   hi(FP), R0
        0x0004 00004 MOVWU   lo+4(FP), R1
        0x0008 00008 ORR     R0<<18, R1, R0
        0x000c 00012 MOVD    R0, r2+8(FP)
        0x0010 00016 RET     (R30)

The following rewrite rules merge zero/sign extensions into bitfield ops. For cases like the ones below, they eliminate a zero/sign-extension instruction.

  1. (SRAconst [rc] (MOVWreg x)) && rc < 32 => (SBFX [armBFAuxInt(rc, 32-rc)] x) ...
  2. (SLLconst [sc] (MOVWUreg x)) && isARM64BFMask(sc, 1<<32-1, 0) => (UBFIZ [armBFAuxInt(sc, 32)] x) ...
  3. (SRLconst [sc] (MOVWUreg x)) && isARM64BFMask(sc, 1<<32-1, sc) => (UBFX [armBFAuxInt(sc, arm64BFWidth(1<<32-1, sc))] x)

e.g.

func sbfx(a, b int32) int64 {
        hi := a + b
        return int64(hi) >> 18
}

func ubfiz(a, b uint32) uint64 {
        hi := a + b
        return uint64(hi) << 18
}

But in the following case, comparing the assembly with and without these rewrite rules, both versions contain only one instruction, so there is no benefit from these rules. The reason no zero/sign extension is generated even without them is that codegen already performs this optimization: if the value is a proper-typed load, it is already zero/sign-extended and is not extended again (see the sketch after the assembly below).
e.g.

func shiftR(hi int32) int64 {
        return int64(hi) >> 18
}

func shiftL(hi uint32) uint64 {
        return uint64(hi) << 18
}

// the master assembly code:
shiftR STEXT size=16 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000  TEXT    shiftR(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000  MOVW    hi(FP), R0
        0x0004 00004  SBFX    $18, R0, $14, R0
        0x0008 00008  MOVD    R0, r1+8(FP)
        0x000c 00012  RET     (R30)
shiftL STEXT size=16 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000  TEXT    shiftL(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000  MOVWU   hi(FP), R0
        0x0004 00004  UBFIZ   $18, R0, $32, R0
        0x0008 00008  MOVD    R0, r1+8(FP)
        0x000c 00012  RET     (R30)

// without the above rules, the assembly code:
shiftR STEXT size=16 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000  TEXT    shiftR(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000  MOVW    hi(FP), R0
        0x0004 00004  ASR     $18, R0, R0
        0x0008 00008  MOVD    R0, r1+8(FP)
        0x000c 00012  RET     (R30)
shiftL STEXT size=16 args=0x10 locals=0x0 funcid=0x0 leaf
        0x0000 00000  TEXT    shiftL(SB), LEAF|NOFRAME|ABIInternal, $0-16
        0x0000 00000  MOVWU   hi(FP), R0
        0x0004 00004  LSL     $18, R0, R0
        0x0008 00008  MOVD    R0, r1+8(FP)
        0x000c 00012  RET     (R30)
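
As a minimal standalone sketch of that codegen check (a hand-written model; the types and op names are illustrative stand-ins, not the compiler's real *ssa.Value, and the actual check lives in the arm64 ssagen backend):

package main

import "fmt"

// value is an illustrative stand-in for an SSA value.
type value struct {
	op     string // e.g. "MOVWreg" (sign-extend) or "LoadReg" (register load)
	size   int64  // size in bytes of the value's type
	signed bool   // whether the value's type is signed
	arg    *value // the single argument, if any
}

// canOmitExtension models "if the value is a proper-typed load, already
// zero/sign-extended, do not extend again": arm64 loads are themselves
// extending (MOVW sign-extends, MOVWU zero-extends), so an explicit
// extension of a matching load is redundant.
func canOmitExtension(v *value) bool {
	a := v.arg
	if a == nil || a.op != "LoadReg" {
		return false
	}
	switch v.op {
	case "MOVWreg": // sign-extend from 32 bits
		return a.size == 4 && a.signed
	case "MOVWUreg": // zero-extend from 32 bits
		return a.size == 4 && !a.signed
	}
	return false
}

func main() {
	load := &value{op: "LoadReg", size: 4, signed: true}
	ext := &value{op: "MOVWreg", arg: load}
	fmt.Println(canOmitExtension(ext)) // true: the MOVW load already sign-extended
}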
gopherbot added this to the Proposal milestone Oct 23, 2020
zhangfannie commented Oct 23, 2020

The root cause of the cases above is that some unnecessary zero/sign extensions are not eliminated in the lower pass. As a result, some rewrite rules merge these unnecessary zero/sign extensions into other ops, such as bitfield ops, and those rules can then break other optimizations such as ORshiftLL. To resolve this problem, we need to eliminate zero/sign extensions in the lower pass, but I have not found a good way to do this optimization. Do you have any good suggestions? Thank you.

zhangfannie commented:
@josharian

josharian commented:

Order dependencies in rewrite rules are a bother. And my experience last year suggests that it has a disproportionate impact on arm64. (I hacked a b.Values shuffle into the rewrite rule inner loop and looked for differences in generated code.)

The best fix available now that I know of is to change order-dependent rewrite rules to accept the later form, so that they trigger regardless of order. Sometimes that requires accepting both forms.
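
For example (a sketch only, not a rule that exists in the tree), the ORshiftLL rule could be given a companion that also matches the already-rewritten UBFIZ form, so the better optimization fires regardless of rule order. This assumes the getARM64BFlsb/getARM64BFwidth auxInt helpers from ssa/rewrite.go; the zeroUpper32Bits guard is needed because UBFIZ masks its operand to 32 bits while ORshiftLL does not:

// Sketch only: also accept the post-UBFIZ form produced by rule 2 above.
(OR x0 x1:(UBFIZ [bfc] y)) && getARM64BFwidth(bfc) == 32
    && zeroUpper32Bits(y, 3) && clobberIfDead(x1)
  => (ORshiftLL x0 y [getARM64BFlsb(bfc)])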

cherrymui changed the title from "proposal: find a good way to eliminate zero/sign extensions on arm64." to "cmd/compile: find a good way to eliminate zero/sign extensions on arm64" Oct 23, 2020
cherrymui modified the milestones: Proposal, Backlog Oct 23, 2020
cherrymui commented:

I don't see why this needs to be a proposal (and I couldn't find what is proposed). Changed to a regular issue.

cherrymui commented:

> Order dependencies in rewrite rules are a bother.

Agreed. It has caused issues/complexities, e.g. adding a new optimization caused another rule not to fire, so we had to add new rules to match more forms. The load-shift combining rules on ARM64 are an example.

I think there were discussions about giving rules order/priority: some rules would fire only in an early stage, some only in a later stage, etc. I'm not sure whether it would make things better; just bringing it up.

gopherbot commented:

Change https://golang.org/cl/265038 mentions this issue: cmd/compile: simiplify arm64 bitfield optimizations

zhangfannie commented:

For the case mentioned above, we can indeed prevent the optimization regression by reordering the rewrite rules. I will submit a patch to do this. Thank you for the comments.

In addition, I only know of the "toolstash-check -all" command to check whether the new change produces assembly that is inconsistent with master, but I do not know how to check whether performance is better or worse than master's. Any suggestions? Thank you.

cagedmantis added the NeedsInvestigation label Oct 26, 2020
gopherbot commented:

Change https://golang.org/cl/267598 mentions this issue: cmd/compile: reorder the CSEL0 optimization rules on arm64

RuinanSun commented Aug 23, 2022

This is a gentle ping.

Recently, I noticed that there are some redundant zero/sign-extension instructions following 32-bit ops. Consider the following code:

func unsignEXT(x, y uint32) uint64 {
	ret := uint64(x & y)
	return ret
}

codegen:

AND R1, R0, R1
MOVWU R1, R0
RET

we want:

ANDW R1, R0, R0
RET

The root cause is that we don't support all 32-bit opcodes on arm64. For example, we lower And(64|32|16|8) to AND, which is a 64-bit opcode.

(And(64|32|16|8) ...) => (AND ...)

The good thing is that it's really convenient to write the rules based on AND, since we only need to care about the 64-bit forms. The bad thing is that we then have to do a redundant zero/sign extension after a 32-bit operation.

I have dug into this issue for a while, and two approaches come to mind:

(Either way we have to add ANDW to the backend, so let's assume we already support it.)

  1. Lower And32 to ANDW, then remove the zero-extension instruction:
(And32 ...) => (ANDW ...)
(MOVWUreg x) && zeroUpper32Bits(x, 3) => x

Since we break the original rule (And(64|32|16|8) ...) => (AND ...), we have to duplicate the AND rules for ANDW. This is a big change, but the rules stay simple and easy to maintain.

  2. Lower a 64-bit opcode plus zero-extension to the corresponding 32-bit opcode:
(MOVWUreg <t> (AND <ts> x y)) && ts.Size() == 4 => (ANDW <t> x y)

This does not change the original rules, but we have to add such a rule for each newly added 32-bit opcode one by one, so the rules get complicated.

I'm not sure which one is more acceptable, or whether there are more elegant solutions. Any suggestions?
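
To make the trade-off of approach 1 concrete, here is a sketch of the duplication it implies (illustrative only; ANDW, MVNW, and BICW are hypothetical opcodes here, and only (AND x (MVN y)) => (BIC x y) is an existing rule):

// Sketch only: each 32-bit op gets its own lowering...
(And32 ...) => (ANDW ...)
// ...and every existing AND rule needs an ANDW twin, e.g.:
(AND  x (MVN  y)) => (BIC  x y)
(ANDW x (MVNW y)) => (BICW x y)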

gopherbot commented:

Change https://go.dev/cl/427454 mentions this issue: cmd/compile: omit sign/unsign extended under some circumstances on arm64

randall77 commented:

I'm hesitant to add lots of opcodes to support 32->64 upconversion. It just doesn't happen that often. It's no more expensive to operate on full-width registers, and as long as we support upconversion on load (sign/zero extending loads), there's no reason for someone to code in 32-bit values on 64-bit hardware. At least in places where people have at least a small interest in the performance of the code. I guess if you wanted to be performance-portable to 32-bit archs, but even then we have int and uint for that.
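
For illustration, a case where the widening already comes for free from the extending load (expected codegen described in a comment, based on the MOVWU examples earlier in this thread):

// A uint32 load compiles to MOVWU, which zero-extends into the full
// 64-bit register, so the uint64 conversion costs no extra instruction.
func widen(p *uint32) uint64 {
	return uint64(*p)
}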

RuinanSun commented:

> I'm hesitant to add lots of opcodes to support 32->64 upconversion. […]

Got it, I will keep the original rules. Thanks~!

dmitshur added the NeedsFix label and removed the NeedsInvestigation label Feb 28, 2023
dmitshur modified the milestones: Backlog, Go1.21 Feb 28, 2023
golang locked and limited conversation to collaborators Feb 28, 2024