cmd/compile: combine extension with register loads and stores on amd64 #15300

josharian · 2016-04-14T14:32:15Z

Extending a register load or store currently generates suboptimal code with the SSA backend. In some cases, this is a regression from the old backend. For example:

func load8(i uint8) uint64 { return uint64(i) }

Generates:

"".load8 t=1 size=16 args=0x10 locals=0x0
    0x0000 00000 (extend.go:3)  TEXT    "".load8(SB), $0-16
    0x0000 00000 (extend.go:3)  FUNCDATA    $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
    0x0000 00000 (extend.go:3)  FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (extend.go:3)  MOVBLZX "".i+8(FP), AX
    0x0005 00005 (extend.go:3)  MOVBQZX AL, AX
    0x0008 00008 (extend.go:3)  MOVQ    AX, "".~r1+16(FP)
    0x000d 00013 (extend.go:3)  RET

The old back end generates:

"".load8 t=1 size=16 args=0x10 locals=0x0
    0x0000 00000 (extend.go:3)  TEXT    "".load8(SB), $0-16
    0x0000 00000 (extend.go:3)  FUNCDATA    $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
    0x0000 00000 (extend.go:3)  FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (extend.go:3)  MOVBQZX "".i+8(FP), BX
    0x0005 00005 (extend.go:3)  MOVQ    BX, "".~r1+16(FP)
    0x000a 00010 (extend.go:3)  RET

I tried fixing this in CL 21838, but the fix was partial and probably not in the right place.

It's hard to do this as part of the arch-specific rewrite rules, because (a) you lose the type extension work that the ssa conversion did for you and have to recreate it later and (b) you don't know where all the register loads and stores will be, because regalloc hasn't run.

However, it's hard to do this as part of converting final SSA to instructions (genvalue), since that's really geared to handle one value at a time, in isolation.

Teaching regalloc to combine these MOVs seems arch-specific and would further complicate already complicated machinery. Maybe the thing to do is to add an arch-specific rewrite pass after regalloc ("peep"?), using hand-written rewrite rules. Input requested.

Related, for those extension MOVs that remain, we should test whether CWB and friends are desirable--they are shorter, but are register-restricted and the internet disagrees about whether they are as fast.

Here are some test cases:

package x

func load8(i uint8) uint64   { return uint64(i) }
func load32(i uint32) uint64 { return uint64(i) }

func store8(i uint64) uint64  { return uint64(uint8(i)) }
func store32(i uint64) uint64 { return uint64(uint32(i)) }

var p *int

func load8spill(i uint8) uint64 {
    i++            // use i
    print(i)       // spill i
    j := uint64(i) // use and extend i
    return j
}

func load32spill(i uint32) uint64 {
    i++            // use i
    print(i)       // spill i
    j := uint64(i) // use and extend i
    return j
}

func store8spill(i uint64) uint64 {
    j := uint8(i)    // convert
    print(j)         // spill
    return uint64(j) // use
}

func store32spill(i uint64) uint64 {
    j := uint32(i)   // convert
    print(j)         // spill
    return uint64(j) // use
}

cc @randall77

The text was updated successfully, but these errors were encountered:

randall77 · 2016-04-14T17:39:26Z

Yes, I don't have any good ideas here. Some modified Arg value might work, say with a memory layout type and a post-load type, but it's ugly.
I'd like to avoid adding a peep pass. It might be a good idea for this but it always ends up as an easy dumping ground for things that are better implemented elsewhere.

josharian · 2016-05-02T23:22:34Z

Note that it's not just Args, it's also loading spilled registers.

DO NOT REVIEW [code freeze] This CL teaches SSA to recognize code of the form // b is a boolean value, i is an int of some flavor if b { i = 1 } else { i = 0 } and use b's underlying 0/1 representation for i instead of generating jumps. Unfortunately, it does not work on the obvious code: func bool2int(b bool) int { if b { return 1 } return 0 } This is left for future work. Note that the existing phiopt optimizations also don't work for: func neg(b bool) bool { if b { return false } return true } In the meantime, runtime authors and the like can use: func bool2int(b bool) int { var i int if b { i = 1 } else { i = 0 } return i } This compiles to: "".bool2int t=1 size=16 args=0x10 locals=0x0 0x0000 00000 (x.go:25) TEXT "".bool2int(SB), $0-16 0x0000 00000 (x.go:25) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB) 0x0000 00000 (x.go:25) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB) 0x0000 00000 (x.go:32) MOVBLZX "".b+8(FP), AX 0x0005 00005 (x.go:32) MOVBQZX AL, AX 0x0008 00008 (x.go:32) MOVQ AX, "".~r1+16(FP) 0x000d 00013 (x.go:32) RET The extraneous MOVBQZX is golang#15300. This optimization also helps range and slice. The compiler must protect against pointers pointing to the end of a slice/string. It does this by increasing a pointer by either 0 or 1 * elemsize, based on a condition. This CL optimizes away a jump in that code. This CL triggers 382 while compiling the standard library. Updates golang#6011 Change-Id: Ia7c1185f8aa223c543f91a3cd6d4a2a09c691c70

gopherbot · 2016-08-10T18:00:24Z

CL https://golang.org/cl/22711 mentions this issue.

This CL teaches SSA to recognize code of the form // b is a boolean value, i is an int of some flavor if b { i = 1 } else { i = 0 } and use b's underlying 0/1 representation for i instead of generating jumps. Unfortunately, it does not work on the obvious code: func bool2int(b bool) int { if b { return 1 } return 0 } This is left for future work. Note that the existing phiopt optimizations also don't work for: func neg(b bool) bool { if b { return false } return true } In the meantime, runtime authors and the like can use: func bool2int(b bool) int { var i int if b { i = 1 } else { i = 0 } return i } This compiles to: "".bool2int t=1 size=16 args=0x10 locals=0x0 0x0000 00000 (x.go:25) TEXT "".bool2int(SB), $0-16 0x0000 00000 (x.go:25) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB) 0x0000 00000 (x.go:25) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB) 0x0000 00000 (x.go:32) MOVBLZX "".b+8(FP), AX 0x0005 00005 (x.go:32) MOVBQZX AL, AX 0x0008 00008 (x.go:32) MOVQ AX, "".~r1+16(FP) 0x000d 00013 (x.go:32) RET The extraneous MOVBQZX is #15300. This optimization also helps range and slice. The compiler must protect against pointers pointing to the end of a slice/string. It does this by increasing a pointer by either 0 or 1 * elemsize, based on a condition. This CL optimizes away a jump in that code. This CL triggers 382 times while compiling the standard library. Updating code to utilize this optimization is left for future CLs. Updates #6011 Change-Id: Ia7c1185f8aa223c543f91a3cd6d4a2a09c691c70 Reviewed-on: https://go-review.googlesource.com/22711 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

gopherbot · 2018-05-31T21:39:04Z

Change https://golang.org/cl/115617 mentions this issue: cmd/compile/internal/ssa: remove useless zero extension

We generate MOVBLZX for byte-sized LoadReg, so (MOVBQZX (LoadReg (Arg))) is the same as (LoadReg (Arg)). Remove those zero extension where possible. Triggers several times during all.bash. Fixes #25378 Updates #15300 Change-Id: If50656e66f217832a13ee8f49c47997f4fcc093a Reviewed-on: https://go-review.googlesource.com/115617 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

josharian added the Performance label Apr 14, 2016

josharian added this to the Unplanned milestone Apr 14, 2016

josharian mentioned this issue Apr 14, 2016

cmd/compile: silly amounts of code generated for simple struct equality tests #15303

Closed

josharian mentioned this issue Aug 30, 2016

cmd/compile: combine AND 0xff and MOVQload into MOVBQZXload #15105

Open

josharian mentioned this issue Jan 9, 2017

cmd/compile: byte type conversion optimizations #18575

Closed

josharian mentioned this issue Feb 6, 2017

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

Closed

josharian mentioned this issue Mar 9, 2017

cmd/compile: use TEST/CMP with memory operands on amd64 #19485

Closed

randall77 mentioned this issue Mar 25, 2017

cmd/compile: Unnecessary register high-bytes rezeroing after unsigned right shift #19712

Open

josharian mentioned this issue Mar 25, 2017

cmd/compile: inefficient increment of global array #10432

Open

martisch mentioned this issue May 14, 2018

cmd/compile: unnecessary zeroing of AX after MOVBLZX #25378

Closed

josharian mentioned this issue Oct 1, 2019

cmd/compile: unnecessary sign-extensions when converting int's #34625

Closed

josharian mentioned this issue Feb 27, 2021

cmd/compile: unnecessary copy in regalloc #44670

Open

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: combine extension with register loads and stores on amd64 #15300

cmd/compile: combine extension with register loads and stores on amd64 #15300

josharian commented Apr 14, 2016

randall77 commented Apr 14, 2016

josharian commented May 2, 2016

gopherbot commented Aug 10, 2016

gopherbot commented May 31, 2018

cmd/compile: combine extension with register loads and stores on amd64 #15300

cmd/compile: combine extension with register loads and stores on amd64 #15300

Comments

josharian commented Apr 14, 2016

randall77 commented Apr 14, 2016

josharian commented May 2, 2016

gopherbot commented Aug 10, 2016

gopherbot commented May 31, 2018