Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: combine extension with register loads and stores on amd64 #15300

Open
josharian opened this issue Apr 14, 2016 · 4 comments
Open
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Milestone

Comments

@josharian
Copy link
Contributor

Extending a register load or store currently generates suboptimal code with the SSA backend. In some cases, this is a regression from the old backend. For example:

func load8(i uint8) uint64 { return uint64(i) }

Generates:

"".load8 t=1 size=16 args=0x10 locals=0x0
    0x0000 00000 (extend.go:3)  TEXT    "".load8(SB), $0-16
    0x0000 00000 (extend.go:3)  FUNCDATA    $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
    0x0000 00000 (extend.go:3)  FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (extend.go:3)  MOVBLZX "".i+8(FP), AX
    0x0005 00005 (extend.go:3)  MOVBQZX AL, AX
    0x0008 00008 (extend.go:3)  MOVQ    AX, "".~r1+16(FP)
    0x000d 00013 (extend.go:3)  RET

The old back end generates:

"".load8 t=1 size=16 args=0x10 locals=0x0
    0x0000 00000 (extend.go:3)  TEXT    "".load8(SB), $0-16
    0x0000 00000 (extend.go:3)  FUNCDATA    $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
    0x0000 00000 (extend.go:3)  FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (extend.go:3)  MOVBQZX "".i+8(FP), BX
    0x0005 00005 (extend.go:3)  MOVQ    BX, "".~r1+16(FP)
    0x000a 00010 (extend.go:3)  RET

I tried fixing this in CL 21838, but the fix was partial and probably not in the right place.

It's hard to do this as part of the arch-specific rewrite rules, because (a) you lose the type extension work that the ssa conversion did for you and have to recreate it later and (b) you don't know where all the register loads and stores will be, because regalloc hasn't run.

However, it's hard to do this as part of converting final SSA to instructions (genvalue), since that's really geared to handle one value at a time, in isolation.

Teaching regalloc to combine these MOVs seems arch-specific and would further complicate already complicated machinery. Maybe the thing to do is to add an arch-specific rewrite pass after regalloc ("peep"?), using hand-written rewrite rules. Input requested.

Related, for those extension MOVs that remain, we should test whether CWB and friends are desirable--they are shorter, but are register-restricted and the internet disagrees about whether they are as fast.

Here are some test cases:

package x

func load8(i uint8) uint64   { return uint64(i) }
func load32(i uint32) uint64 { return uint64(i) }

func store8(i uint64) uint64  { return uint64(uint8(i)) }
func store32(i uint64) uint64 { return uint64(uint32(i)) }

var p *int

func load8spill(i uint8) uint64 {
    i++            // use i
    print(i)       // spill i
    j := uint64(i) // use and extend i
    return j
}

func load32spill(i uint32) uint64 {
    i++            // use i
    print(i)       // spill i
    j := uint64(i) // use and extend i
    return j
}

func store8spill(i uint64) uint64 {
    j := uint8(i)    // convert
    print(j)         // spill
    return uint64(j) // use
}

func store32spill(i uint64) uint64 {
    j := uint32(i)   // convert
    print(j)         // spill
    return uint64(j) // use
}

cc @randall77

@randall77
Copy link
Contributor

Yes, I don't have any good ideas here. Some modified Arg value might work, say with a memory layout type and a post-load type, but it's ugly.
I'd like to avoid adding a peep pass. It might be a good idea for this but it always ends up as an easy dumping ground for things that are better implemented elsewhere.

@josharian
Copy link
Contributor Author

Note that it's not just Args, it's also loading spilled registers.

josharian added a commit to josharian/go that referenced this issue May 9, 2016
DO NOT REVIEW

[code freeze]

This CL teaches SSA to recognize code of the form

// b is a boolean value, i is an int of some flavor
if b {
	i = 1
} else {
	i = 0
}

and use b's underlying 0/1 representation for i
instead of generating jumps.

Unfortunately, it does not work on the obvious code:

func bool2int(b bool) int {
	if b {
		return 1
	}
	return 0
}

This is left for future work.
Note that the existing phiopt optimizations also don't work for:

func neg(b bool) bool {
	if b {
		return false
	}
	return true
}

In the meantime, runtime authors and the like can use:

func bool2int(b bool) int {
	var i int
	if b {
		i = 1
	} else {
		i = 0
	}
	return i
}

This compiles to:

"".bool2int t=1 size=16 args=0x10 locals=0x0
	0x0000 00000 (x.go:25)	TEXT	"".bool2int(SB), $0-16
	0x0000 00000 (x.go:25)	FUNCDATA	$0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
	0x0000 00000 (x.go:25)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:32)	MOVBLZX	"".b+8(FP), AX
	0x0005 00005 (x.go:32)	MOVBQZX	AL, AX
	0x0008 00008 (x.go:32)	MOVQ	AX, "".~r1+16(FP)
	0x000d 00013 (x.go:32)	RET

The extraneous MOVBQZX is golang#15300.

This optimization also helps range and slice.
The compiler must protect against pointers pointing
to the end of a slice/string. It does this by increasing
a pointer by either 0 or 1 * elemsize, based on a condition.
This CL optimizes away a jump in that code.

This CL triggers 382 while compiling the standard library.

Updates golang#6011

Change-Id: Ia7c1185f8aa223c543f91a3cd6d4a2a09c691c70
josharian added a commit to josharian/go that referenced this issue May 11, 2016
DO NOT REVIEW

[code freeze]

This CL teaches SSA to recognize code of the form

// b is a boolean value, i is an int of some flavor
if b {
	i = 1
} else {
	i = 0
}

and use b's underlying 0/1 representation for i
instead of generating jumps.

Unfortunately, it does not work on the obvious code:

func bool2int(b bool) int {
	if b {
		return 1
	}
	return 0
}

This is left for future work.
Note that the existing phiopt optimizations also don't work for:

func neg(b bool) bool {
	if b {
		return false
	}
	return true
}

In the meantime, runtime authors and the like can use:

func bool2int(b bool) int {
	var i int
	if b {
		i = 1
	} else {
		i = 0
	}
	return i
}

This compiles to:

"".bool2int t=1 size=16 args=0x10 locals=0x0
	0x0000 00000 (x.go:25)	TEXT	"".bool2int(SB), $0-16
	0x0000 00000 (x.go:25)	FUNCDATA	$0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
	0x0000 00000 (x.go:25)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:32)	MOVBLZX	"".b+8(FP), AX
	0x0005 00005 (x.go:32)	MOVBQZX	AL, AX
	0x0008 00008 (x.go:32)	MOVQ	AX, "".~r1+16(FP)
	0x000d 00013 (x.go:32)	RET

The extraneous MOVBQZX is golang#15300.

This optimization also helps range and slice.
The compiler must protect against pointers pointing
to the end of a slice/string. It does this by increasing
a pointer by either 0 or 1 * elemsize, based on a condition.
This CL optimizes away a jump in that code.

This CL triggers 382 while compiling the standard library.

Updates golang#6011

Change-Id: Ia7c1185f8aa223c543f91a3cd6d4a2a09c691c70
@gopherbot
Copy link

CL https://golang.org/cl/22711 mentions this issue.

gopherbot pushed a commit that referenced this issue Aug 16, 2016
This CL teaches SSA to recognize code of the form

// b is a boolean value, i is an int of some flavor
if b {
	i = 1
} else {
	i = 0
}

and use b's underlying 0/1 representation for i
instead of generating jumps.

Unfortunately, it does not work on the obvious code:

func bool2int(b bool) int {
	if b {
		return 1
	}
	return 0
}

This is left for future work.
Note that the existing phiopt optimizations also don't work for:

func neg(b bool) bool {
	if b {
		return false
	}
	return true
}

In the meantime, runtime authors and the like can use:

func bool2int(b bool) int {
	var i int
	if b {
		i = 1
	} else {
		i = 0
	}
	return i
}

This compiles to:

"".bool2int t=1 size=16 args=0x10 locals=0x0
	0x0000 00000 (x.go:25)	TEXT	"".bool2int(SB), $0-16
	0x0000 00000 (x.go:25)	FUNCDATA	$0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
	0x0000 00000 (x.go:25)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (x.go:32)	MOVBLZX	"".b+8(FP), AX
	0x0005 00005 (x.go:32)	MOVBQZX	AL, AX
	0x0008 00008 (x.go:32)	MOVQ	AX, "".~r1+16(FP)
	0x000d 00013 (x.go:32)	RET

The extraneous MOVBQZX is #15300.

This optimization also helps range and slice.
The compiler must protect against pointers pointing
to the end of a slice/string. It does this by increasing
a pointer by either 0 or 1 * elemsize, based on a condition.
This CL optimizes away a jump in that code.

This CL triggers 382 times while compiling the standard library.

Updating code to utilize this optimization is left for future CLs.

Updates #6011

Change-Id: Ia7c1185f8aa223c543f91a3cd6d4a2a09c691c70
Reviewed-on: https://go-review.googlesource.com/22711
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/115617 mentions this issue: cmd/compile/internal/ssa: remove useless zero extension

gopherbot pushed a commit that referenced this issue Aug 20, 2018
We generate MOVBLZX for byte-sized LoadReg, so
(MOVBQZX (LoadReg (Arg))) is the same as
(LoadReg (Arg)). Remove those zero extension where possible.
Triggers several times during all.bash.

Fixes #25378
Updates #15300

Change-Id: If50656e66f217832a13ee8f49c47997f4fcc093a
Reviewed-on: https://go-review.googlesource.com/115617
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Projects
None yet
Development

No branches or pull requests

3 participants