cmd/compile: handle 64-bit masks better on 32-bit systems #32781

rsc · 2019-06-26T00:02:26Z

On a 32-bit system, an expression like uint64value & constantUint64Mask could easily be decomposed two 32-bit mask operations. That does not seem to happen (see below, especially the OR instructions). It seems like probably it should.

$ cat x.go
package p

func f(x uint64) uint64 {
	return x & (1<<60 - 1)
}
$ GOARCH=arm go build -gcflags=-S x.go
# command-line-arguments
"".f STEXT size=40 args=0x10 locals=0x0 leaf
	0x0000 00000 (/Users/rsc/x.go:3)	TEXT	"".f(SB), LEAF|NOFRAME|ABIInternal, $-4-16
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$2, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:4)	PCDATA	$0, $0
	0x0000 00000 (/Users/rsc/x.go:4)	PCDATA	$1, $0
	0x0000 00000 (/Users/rsc/x.go:4)	MOVW	"".x(FP), R0
	0x0004 00004 (/Users/rsc/x.go:4)	SRL	$28, R0, R1
	0x0008 00008 (/Users/rsc/x.go:4)	MOVW	"".x+4(FP), R2
	0x000c 00012 (/Users/rsc/x.go:4)	ORR	R2<<4, R1, R1
	0x0010 00016 (/Users/rsc/x.go:4)	BFXU	$28, R0, $0, R0
	0x0014 00020 (/Users/rsc/x.go:4)	ORR	R1<<28, R0, R0
	0x0018 00024 (/Users/rsc/x.go:4)	MOVW	R0, "".~r1+8(FP)
	0x001c 00028 (/Users/rsc/x.go:4)	SRL	$4, R1, R0
	0x0020 00032 (/Users/rsc/x.go:4)	MOVW	R0, "".~r1+12(FP)
	0x0024 00036 (/Users/rsc/x.go:4)	JMP	(R14)
...
$ GOARCH=386 go build -gcflags=-S x.go
# command-line-arguments
"".f STEXT size=62 args=0x10 locals=0x0
	0x0000 00000 (/Users/rsc/x.go:3)	TEXT	"".f(SB), ABIInternal, $0-16
	0x0000 00000 (/Users/rsc/x.go:3)	MOVL	(TLS), CX
	0x0007 00007 (/Users/rsc/x.go:3)	CMPL	SP, 8(CX)
	0x000a 00010 (/Users/rsc/x.go:3)	JLS	55
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$2, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:4)	PCDATA	$0, $0
	0x000c 00012 (/Users/rsc/x.go:4)	PCDATA	$1, $0
	0x000c 00012 (/Users/rsc/x.go:4)	MOVL	"".x+8(SP), AX
	0x0010 00016 (/Users/rsc/x.go:4)	SHLL	$4, AX
	0x0013 00019 (/Users/rsc/x.go:4)	MOVL	"".x+4(SP), CX
	0x0017 00023 (/Users/rsc/x.go:4)	MOVL	CX, DX
	0x0019 00025 (/Users/rsc/x.go:4)	SHRL	$28, CX
	0x001c 00028 (/Users/rsc/x.go:4)	ORL	AX, CX
	0x001e 00030 (/Users/rsc/x.go:4)	SHLL	$4, DX
	0x0021 00033 (/Users/rsc/x.go:4)	SHRL	$4, DX
	0x0024 00036 (/Users/rsc/x.go:4)	MOVL	CX, AX
	0x0026 00038 (/Users/rsc/x.go:4)	SHLL	$28, CX
	0x0029 00041 (/Users/rsc/x.go:4)	ORL	DX, CX
	0x002b 00043 (/Users/rsc/x.go:4)	MOVL	CX, "".~r1+12(SP)
	0x002f 00047 (/Users/rsc/x.go:4)	SHRL	$4, AX
	0x0032 00050 (/Users/rsc/x.go:4)	MOVL	AX, "".~r1+16(SP)
	0x0036 00054 (/Users/rsc/x.go:4)	RET
	0x0037 00055 (/Users/rsc/x.go:4)	NOP
	0x0037 00055 (/Users/rsc/x.go:3)	PCDATA	$1, $-1
	0x0037 00055 (/Users/rsc/x.go:3)	PCDATA	$0, $-1
	0x0037 00055 (/Users/rsc/x.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x003c 00060 (/Users/rsc/x.go:3)	JMP	0

The text was updated successfully, but these errors were encountered:

cherrymui · 2019-06-26T01:30:14Z

Emmm. I'm pretty sure it decomposes 64-bit mask to two 32-bit masks when I wrote the code a few years ago. https://go.googlesource.com/go/+/6d1aaf143c879ae7acbd20a1e60bf74681bf6055/src/cmd/compile/internal/ssa/gen/dec64.rules#88

This particular case is interesting, as the generic opt pass (runs before decompose pass) changes the AND to a pair of shifts (x<<4>>4), and the decomposing shifts is more complex.

If it is a different mask, which cannot be rewritten to shifts, it works as expected.

func f(x uint64) uint64 {
	return x & 0x1234567890abcdef
}

	0x0000 00000 (x64.go:4)	MOVW	"".x(FP), R0
	0x0004 00004 (x64.go:4)	AND	$-1867788817, R0, R0
	0x000c 00012 (x64.go:4)	MOVW	R0, "".~r1+8(FP)
	0x0010 00016 (x64.go:4)	MOVW	"".x+4(FP), R0
	0x0014 00020 (x64.go:4)	AND	$305419896, R0, R0
	0x001c 00028 (x64.go:4)	MOVW	R0, "".~r1+12(FP)
	0x0020 00032 (x64.go:4)	JMP	(R14)

Maybe we should either not do the mask-to-shift rewriting for 64-bit ops on 32-bit architectures, or somehow recognizes this pattern and do better in decomposing.

gopherbot · 2019-06-26T17:53:35Z

Change https://golang.org/cl/183957 mentions this issue: cmd/compile: not rewrite 64-bit AND to shifts on 32-bit machines

gopherbot · 2019-08-25T11:16:15Z

Change https://golang.org/cl/191780 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): Updates #33826 Updates #32781 Change-Id: I862d3587446410c447b9a7265196b57f85358633 Reviewed-on: https://go-review.googlesource.com/c/go/+/191780 Run-TryBot: Martin Möhrmann <moehrmann@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

gopherbot · 2019-09-09T17:38:22Z

Change https://golang.org/cl/194297 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): 'go run run.go -all_codegen -v codegen' passes with following adjustments: ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac). S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs. Updates #33826 Updates #32781 Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb Reviewed-on: https://go-review.googlesource.com/c/go/+/194297 Run-TryBot: Martin Möhrmann <martisch@uos.de> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

gopherbot · 2019-09-23T13:26:23Z

Change https://golang.org/cl/196957 mentions this issue: Revert "compile: prefer an AND instead of SHR+SHL instructions"

This reverts CL 194297. Reason for revert: introduced register allocation failures on PPC64LE builders. Updates #33826 Updates #32781 Updates #34468 Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48 Reviewed-on: https://go-review.googlesource.com/c/go/+/196957 Run-TryBot: Bryan C. Mills <bcmills@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Martin Möhrmann <moehrmann@google.com>

gopherbot · 2019-09-24T20:18:02Z

Change https://golang.org/cl/196810 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): 'go run run.go -all_codegen -v codegen' passes with following adjustments: ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac). S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs. Updates #33826 Updates #32781 Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c Reviewed-on: https://go-review.googlesource.com/c/go/+/196810 Run-TryBot: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

cherrymui · 2022-03-31T19:33:05Z

The AND-to-shift rules are removed by the CL above. It now generates on ARM

"".f STEXT size=24 args=0x10 locals=0x0 funcid=0x0 align=0x0 leaf
	0x0000 00000 (/tmp/x.go:3)	TEXT	"".f(SB), LEAF|NOFRAME|ABIInternal, $-4-16
	0x0000 00000 (/tmp/x.go:3)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/tmp/x.go:3)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/tmp/x.go:3)	FUNCDATA	$5, "".f.arginfo1(SB)
	0x0000 00000 (/tmp/x.go:4)	MOVW	"".x+4(FP), R0
	0x0004 00004 (/tmp/x.go:4)	BIC	$-268435456, R0, R0
	0x0008 00008 (/tmp/x.go:4)	MOVW	R0, "".~r0+12(FP)
	0x000c 00012 (/tmp/x.go:4)	MOVW	"".x(FP), R0
	0x0010 00016 (/tmp/x.go:4)	MOVW	R0, "".~r0+8(FP)
	0x0014 00020 (/tmp/x.go:4)	JMP	(R14)

which looks expected.

rsc added this to the Go1.14 milestone Jun 26, 2019

rsc added the NeedsFix label Jun 26, 2019

martisch mentioned this issue Aug 25, 2019

cmd/compile: prefer AND instead of SHR+SHL #33826

Closed

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

cherrymui closed this as completed Mar 31, 2022

golang locked and limited conversation to collaborators Mar 31, 2023

gopherbot added the FrozenDueToAge label Mar 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: handle 64-bit masks better on 32-bit systems #32781

cmd/compile: handle 64-bit masks better on 32-bit systems #32781

rsc commented Jun 26, 2019

cherrymui commented Jun 26, 2019

gopherbot commented Jun 26, 2019

gopherbot commented Aug 25, 2019

gopherbot commented Sep 9, 2019

gopherbot commented Sep 23, 2019

gopherbot commented Sep 24, 2019

cherrymui commented Mar 31, 2022

cmd/compile: handle 64-bit masks better on 32-bit systems #32781

cmd/compile: handle 64-bit masks better on 32-bit systems #32781

Comments

rsc commented Jun 26, 2019

cherrymui commented Jun 26, 2019

gopherbot commented Jun 26, 2019

gopherbot commented Aug 25, 2019

gopherbot commented Sep 9, 2019

gopherbot commented Sep 23, 2019

gopherbot commented Sep 24, 2019

cherrymui commented Mar 31, 2022