cmd/compile: miscellaneous optimizations #24958

Closed
benshi001 opened this issue Apr 20, 2018 · 8 comments
@benshi001
Member

benshi001 commented Apr 20, 2018

My optimization plan for arm64 in Go 1.12:

  1. further optimization of (shifted) register-indexed load/store, especially the interference with combined load/store.
    1.1 combining uint16/uint32 loads/stores into a wider type
    1.2 both BE and LE byte orders should be supported in the same situations
    1.3 loads/stores issued in order from high memory down to low memory (this is not the same as BE; an LE load/store sequence can also run from the high-memory/upper byte down to the low-memory/lower byte)

  2. (shifted) register indexed load/store for FP. (assembler and compiler)

  3. optimization with MADD/MSUB/MADDW/MSUBW
    "MUL R1, R2, R3"
    "ADD R3, R4" can be optimized to
    "MADD R1, R2, R3, R4"
    I expect both a performance improvement and a code size reduction once MADD/MSUB are emitted (see the Go sketch after this list).

  4. optimize comparison with ANDS/ADDS/SUBS

  5. optimize atomic operations with SWPALD/SWPALW/SWPALH/SWPALB

  6. STADD/STSUB/STEOR/STOR
    These instructions operate directly on a memory operand atomically; I expect both a performance improvement and a code size reduction once they are emitted.

  7. BFC - bit field clear

  8. constant pool
    "ADD $0x00aaaaaa, Rx" is assembled to "LDR off(PC), Rtmp" + "ADD Rtmp, Rx", and a 4-byte item is stored in the constant pool, total 12 bytes are cost.
    It can be optimized to "ADD $0xaaa, Rx" + "ADD $0xaaa000, Rx", and only 8 bytes are cost.
    This optimization does not improve performance, but does reduce code size.

  9. FMULA/FMULS
    (FADD z:(FMUL x y) a) -> (FMULA x a y)
    a restriction z.Uses==1 is needed to improve performance.

  10. constant pool
    use "MOV $0xaaaa, Rx" + "MOVK $0xbbbb0000, Rx" instead of loading a 32-bit 0xbbbbaaaa from constant pool;
    use "MOV $0xaaaa, Rx" + "MOVK $0xbbbb0000, Rx" + "MOVK $0xcccc00000000, Rx" instead of loading a 64-bit 0xccccbbbbaaaa from constant pool;

  11. LDP/STP for FP (assembler and compiler)
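
A minimal Go sketch for item 3 (my own illustration, not from the issue; the function names are hypothetical): an integer multiply whose result has a single use feeding an add, which the arm64 backend could lower to one MADD instead of a separate MUL and ADD.

package main

// mulAdd returns a + b*c. The MUL result has a single use, so the MUL+ADD
// pair is a candidate for a single MADD instruction.
func mulAdd(a, b, c int64) int64 {
	return a + b*c
}

func main() {
	println(mulAdd(1, 2, 3)) // 7
}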

@agnivade agnivade changed the title optimize ARM64 code cmd/compile/internal/arm64: miscellaneous optimizations Apr 20, 2018
@agnivade agnivade added this to the Go1.11 milestone Apr 20, 2018
@benshi001 benshi001 modified the milestones: Go1.11, Go1.12 Apr 20, 2018
@benshi001 benshi001 self-assigned this Apr 20, 2018
@benshi001
Member Author

benshi001 commented May 17, 2018

Some optimizations can also be done for ARM:

  1. convert between BICconst <-> ANDconst and SUBconst <-> ADDconst when the equivalent form has a smaller constant

  2. MULA/MULS
    2.1 (ADD a h:(MUL x y)) -> (MULA a x y) needs h.Uses==1
    2.2 (CMPconst [0] (MULA a x y)) -> (CMN a (MUL x y))

  3. optimize comparison (see the Go sketch after this list)
    3.1 (CMPconst [0] L:(ADD x y)) && L.Uses==1 -> (CMN x y)
    3.2 (CMPconst [0] L:(ADD x y)) && L.Uses>1 -> (ADDS x y)

  4. combining multiple MOVBs into MOVH/MOVW, and MOVHs into MOVW

  5. use BFI / BFC to simplify bit operations

  6. optimization with MOVBUloadshiftLL, MOVBUloadshiftRA, MOVBUloadshiftRL

  7. MULAF/MULAD/MULSF/MULSD
    (ADDF a (MULF x y)) && a.Uses == 1 && objabi.GOARM >= 6 -> (MULAF a x y)
    also needs z:(MULF x y), z.Uses == 1

  8. use a MOVT/MOVW pair instead of the constant pool on ARMv7 (for large constants and symbolic addresses)
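
A small Go sketch for item 3 (my own illustration, not from the issue; the function name is hypothetical): a sum compared against zero, where the compare can be folded away.

package main

// sumIsZero reports whether x+y == 0. Comparing the sum against zero is a
// candidate for CMN x, y instead of a separate ADD + CMP, or for ADDS when
// the sum value is also needed elsewhere.
func sumIsZero(x, y int32) bool {
	return x+y == 0
}

func main() {
	println(sumIsZero(3, -3)) // true
}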

@benshi001
Member Author

The reason the following two MOVBs are not combined into a single MOVH is that the first store (v23) is used twice.

func ssg(s []byte, idx int) {
	s[(idx<<1)+0], s[(idx<<1)+1] = 0, 0
}

This optimization does not seem possible with SSA rules alone.

  pass lower begin
  pass lower end [16576 ns]
ssg func([]byte, int)
  b1:
    v1 = InitMem <mem>
    v7 = Arg <int> {idx}
    v8 = MOVDconst <uint> [1] DEAD
    v13 = MOVDconst <int> [1] DEAD
    v15 = MOVDconst <byte> [0] DEAD
    v30 = Arg <*byte> {s} (s[*byte])
    v25 = Arg <int> {s} [8] (s+8[int])
    v28 = SLLconst <int> [1] v7
    v26 = MOVDconst <int> [0] DEAD
    v3 = FlagLT_ULT <flags> DEAD
    v6 = CMPshiftLL <flags> [1] v25 v7
    v14 = ADDconst <int> [1] v28
    v17 = GreaterThanU <bool> v6 DEAD
    v24 = InvertFlags <flags> v6 DEAD
    UGT v6 -> b2 b3 (likely)
  b2: <- b1
    v21 = ADDshiftLL <*byte> [1] v30 v7 DEAD
    v23 = MOVBstorezeroidx <mem> v30 v28 v1
    v12 = CMP <flags> v14 v25
    v27 = LessThanU <bool> v12 DEAD
    ULT v12 -> b4 b3 (likely)
  b3: <- b1 b2
    v18 = Phi <mem> v1 v23
    v19 = CALLstatic <mem> {runtime.panicindex} v18
    Exit v19
  b4: <- b2
    v29 = ADD <*byte> v30 v14 DEAD
    v31 = MOVBstorezeroidx <mem> v30 v14 v23
    Ret v31

@randall77
Contributor

This is because the bounds check might panic after assigning s[(idx<<1)+0] but before assigning s[(idx<<1)+1]. If you ensure that the bounds check, when it fails, fails before either assignment, then it should work.

func ssg(s []byte, idx int) {
	_ = s[(idx<<1)+1]
	s[(idx<<1)+0], s[(idx<<1)+1] = 0, 0
}

@benshi001
Member Author

Thank you, Keith.

But that kind of code looks odd. Is it possible to clobber the MOVBstore if it is only referenced by a bounds check? Or to make the bounds check ignore a store/load that has disappeared?

@randall77
Contributor

Combining the stores is not legal in the original example. If the first array index succeeds but the second fails, the language semantics require that the first write happens and the second doesn't. That means that the first write must be just a single byte; it can't be combined with the second write. The "in-between" memory state is observable if someone recovers the panic.

That requirement is realized in SSA by passing the results of the first store to the bounds check.
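
A hedged Go sketch (mine, not from the thread) of how the in-between state is observable: ssg is the function from the earlier comment, and the main function is my addition. If the second index panics, a deferred recover still sees that the first byte was written.

package main

import "fmt"

func ssg(s []byte, idx int) {
	s[(idx<<1)+0], s[(idx<<1)+1] = 0, 0
}

func main() {
	s := []byte{1, 2, 3}
	defer func() {
		recover()
		fmt.Println(s) // prints [1 2 0]: the first byte store already happened
	}()
	ssg(s, 1) // index 2 is in range, index 3 is out of range and panics
}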

@benshi001
Member Author

Thank you, Keith. I see the key point.

@benshi001 benshi001 changed the title cmd/compile/internal/arm64: miscellaneous optimizations cmd/compile: miscellaneous optimizations May 29, 2018
@benshi001
Member Author

benshi001 commented May 29, 2018

386/amd64:

  1. "MOVFconst $0.0, F0" -> "LDZ F0", "MOVFconst $1.0, F0" -> "LD1 F0"

  2. implement DIVSSload/DIVSDload/MULLload

  3. support more instructions that use a destination memory operand, such as (ADD/SUB/OR/EOR/AND)Lconstmodify, (ADD/SUB/OR/EOR/AND)Lmodify

  4. support more instructions that use a source memory operand, such as CMP

  5. optimize 386 with BT/BTC/BTR/BTS (see the Go sketch after this list)

  6. improve memory operands with register indexed form

  7. can more combined load/store be optimized?

  8. can optimization be done with CMOV/FMOVcc?

  9. can some optimization be done with FIADD/FISUB/FIMUL/FIDIV?
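
A small Go sketch for item 5 (my own illustration, not from the issue; the function name is hypothetical): single-bit test/set/clear operations that the 386/amd64 backends could lower to BT/BTS/BTR instead of shift-and-mask sequences.

package main

// bitOps tests, sets, and clears bit i of x. Each operation is a candidate
// for BT, BTS, and BTR respectively, instead of a shift-and-mask sequence.
func bitOps(x uint32, i uint) (test bool, set, cleared uint32) {
	test = x&(1<<i) != 0    // candidate for BT
	set = x | 1<<i          // candidate for BTS
	cleared = x &^ (1 << i) // candidate for BTR
	return
}

func main() {
	t, s, c := bitOps(10, 1) // 10 is 1010 in binary
	println(t, s, c)         // true 10 8
}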

@benshi001
Member Author

About 40% of these have been implemented via my recent commits; the others did not show an improvement.

@golang golang locked and limited conversation to collaborators Sep 10, 2019