
cmd/compile: prefer AND instead of SHR+SHL #33826

Closed

martisch opened this issue Aug 25, 2019 · 11 comments

Comments

@martisch
Contributor

martisch commented Aug 25, 2019

In CL 19485 we added a generic SSA rule that replaces an AND with certain constants by two SHIFT instructions.

While this optimization avoids loading the constant to be ANDed into an extra register and has a shorter encoding on amd64, it uses two data-dependent instructions. There was already some discussion on the CL, after its accidental early submission, that microbenchmarks do not show two shifts to be faster; some even suggest they can be slower, e.g. for x &= 1 << 63. I benchmarked math.Copysign and it showed the same throughput with either variant. LLVM and GCC do not seem to replace AND with SHIFTs on amd64 either, and on some platforms, such as arm64, the Go gc compiler even reverses this rewrite:

https://go.googlesource.com/go/+/06f709a04acc6d5ba0ba181129e9ee93ed20f311/src/cmd/compile/internal/ssa/gen/ARM64.rules#1866
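For reference, a minimal Go sketch of the equivalence the rule exploits (the function names are illustrative, not from the compiler): keeping only the top bit of a register can be done with an AND against a large constant mask or with a pair of shifts.

func maskWithAnd(x uint64) uint64 {
	return x & (1 << 63) // AND with a constant that has only bit 63 set
}

func maskWithShifts(x uint64) uint64 {
	return (x >> 63) << 63 // SHR then SHL clear the low 63 bits
}

Both return identical results for every input; the question here is only which form is faster on a given CPU.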

There has also been some unwanted interaction with other optimization rules, e.g. #32781.

The rules initially added in CL 19485 operated on And64 and And32. The And32 case was already removed in CL 20410.

Removing the And64 case too can make binaries slightly larger, but in the common case where there is no register pressure the AND should be as fast as or faster than two SHIFTs on modern amd64 CPUs, and it should interfere less with other SSA rules, which then need not consider the additional case of an AND mask having been rewritten to SHIFTs.

I intend to send CLs for evaluation and submission that remove the generic rule rewriting AND into a pair of SHIFTs for Go 1.14, and to follow up with CLs that avoid regressions in other rules that optimize SHIFTs instead of ANDs. For example, the AND in (u64>>32)&0xFFFFFFFF should still be optimized away by SSA, as illustrated below.
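As a concrete sketch (hypothetical function name): after a logical right shift by 32 the upper 32 bits are already zero, so the AND below is redundant and SSA should eliminate it.

func lowHalf(u64 uint64) uint64 {
	return (u64 >> 32) & 0xFFFFFFFF // mask is a no-op: the shift already zeroed the high bits
}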

This issue documents the related CLs and is for discussing this (de)optimization and whether any 64-bit platform supported by the Go gc compiler should keep rewriting some ANDs to two SHIFTs.

/cc @klauspost @cherrymui @khr @dr2chase @laboger @mundaym

@martisch martisch added this to the Go1.14 milestone Aug 25, 2019
@martisch martisch self-assigned this Aug 25, 2019
@gopherbot

Change https://golang.org/cl/191780 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

@klauspost
Contributor

Looking at instruction timing, it should be pretty much the same.

I am not saying that this is a bad change, just that I don't think this benchmark is accurate if it is emitting what you write.

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

Is MOVQ $0x8000000000000000, RAX actually inside the loop? And wouldn't it emit ANDQ $0x8000000000000000, REG?

On Ivy Bridge (your CPU) and pretty much all other CPUs, the instructions involved all have a 1-cycle latency. Since your benchmark is serial, it should be pretty much the same assuming it is emitting MOVQ+ANDQ. If it is only ANDQ, it should be 2x faster (excluding the loop overhead).

It could also be that Ivy Bridge is able to do some magic and can ignore the MOVQ.

Either way, with the AND the compiler will be able to do some optimizations. A clever compiler could even move the constant load outside the loop, as mentioned above (see the sketch below).
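A minimal Go sketch of that hoisting opportunity (the loop here is hypothetical, not from the CL): the mask is loop-invariant, so a compiler may materialize it in a register once, before the loop, instead of on every iteration.

func sumMasked(xs []uint64) uint64 {
	const mask = uint64(1) << 63 // loop-invariant; the MOVQ for it can be hoisted
	var s uint64
	for _, x := range xs {
		s += x & mask // only the AND remains in the loop body
	}
	return s
}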

@martisch
Contributor Author

martisch commented Aug 25, 2019

Looking at instruction timing, it should be pretty much the same.

I would agree if viewed in isolation. With surrounding code, as in the benchmark loop, I would expect the MOV to be issued in parallel with some earlier instruction (unless the CPU is already saturated at 4 retired instructions per cycle). The two shifts together always need at least 2 cycles after their input is ready from the previous loop iteration.

Is MOVQ $0x8000000000000000, RAX actually inside the loop?

Checked again, and the MOV is inside the loop. That a MOV is needed only becomes clear very late, in the arch-specific SSA (see why below). I do not think there is a loop-hoisting pass run after that stage.

And wouldn't it emit ANDQ $0x8000000000000000, REG?

AND on amd64 does not support 64-bit immediates: https://www.felixcloutier.com/x86/and
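To make the encoding constraint concrete, a sketch (the exact generated code depends on the compiler version): AND on amd64 accepts at most a 32-bit immediate that is sign-extended to 64 bits, so small masks encode inline while a mask like 1<<63 needs a separate MOVQ.

// smallMask fits in a sign-extended 32-bit immediate, so the compiler can
// emit something like ANDQ $0x7FFFFFFF, REG directly.
func smallMask(x uint64) uint64 {
	return x & 0x7FFFFFFF
}

// largeMask does not fit in a 32-bit immediate; the constant must first be
// loaded into a register with MOVQ.
func largeMask(x uint64) uint64 {
	return x & (1 << 63)
}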

@klauspost
Contributor

Yeah, the CPU is doing magic.

A simple test, pure assembler:

TEXT ·testAnd(SB), NOSPLIT, $0
	MOVQ $65535, CX              // loop counter
	XORQ AX, AX                  // value to mask

loop_and:
	MOVQ $0x8000000000000000, BX // mask does not fit a 32-bit immediate
	ANDQ BX, AX                  // single AND keeps only bit 63
	DECQ CX
	JNZ  loop_and
	RET

TEXT ·testShift(SB), NOSPLIT, $0
	MOVQ $65535, CX              // loop counter
	XORQ AX, AX                  // value to mask

loop_shift:
	SHRQ $63, AX                 // move bit 63 down to bit 0, clearing the rest
	SHLQ $63, AX                 // move it back up: two dependent instructions
	DECQ CX
	JNZ  loop_shift
	RET

Using AND is ~2x as fast on my Ryzen 3600. Moving the MOVQ outside the loop makes absolutely no difference, which indicates the magic :)

@laboger
Contributor

laboger commented Aug 28, 2019

Tried your CL on ppc64le and it seems to be better by about 30% on the benchmark you provided.

gopherbot pushed a commit that referenced this issue Sep 9, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

var GlobalU uint

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@martisch
Contributor Author

martisch commented Sep 9, 2019

The first attempt broke codegen tests on arm64 (BFXIL) and s390x (abs and copysign).
https://build.golang.org/log/a47a1c98bac2aa509d570d6bf583609857b1be5a
https://build.golang.org/log/6456da8a84cc82e7407981287f211caaab41c7eb

I will make sure to run the codegen tests for all platforms for the new version of the CL.

For the new version of the CL I added abs and copysign rules that detect the new AND variant; I am still working on the arm64 bitfield adjustment.
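For context, a minimal sketch of why abs and copysign are affected (this mirrors how math.Abs and math.Copysign work, but the code below is illustrative, not the s390x rules themselves): both reduce to masking the sign bit, bit 63, of the float's bit pattern, exactly the AND-with-a-large-constant shape this change restores.

package sketch

import "math"

// abs clears the sign bit with a mask (&^ is Go's AND NOT).
func abs(x float64) float64 {
	return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

// copysign combines the magnitude of x with the sign bit of y.
func copysign(x, y float64) float64 {
	const sign = 1 << 63
	return math.Float64frombits(math.Float64bits(x)&^sign | math.Float64bits(y)&sign)
}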

@gopherbot

Change https://golang.org/cl/194297 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 21, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb
Reviewed-on: https://go-review.googlesource.com/c/go/+/194297
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot

Change https://golang.org/cl/196957 mentions this issue: Revert "compile: prefer an AND instead of SHR+SHL instructions"

gopherbot pushed a commit that referenced this issue Sep 23, 2019
This reverts CL 194297.

Reason for revert: introduced register allocation failures on PPC64LE builders.

Updates #33826
Updates #32781
Updates #34468

Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48
Reviewed-on: https://go-review.googlesource.com/c/go/+/196957
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
@gopherbot

Change https://golang.org/cl/196810 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 24, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196810
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@randall77
Contributor

Anything else need to be done here?

@martisch
Contributor Author

martisch commented Oct 8, 2019

The feature is in on the third try and AFAIK has not caused any new breakages since then.

There are some cleanups that can be done, e.g. adding codegen tests that exercise the old shift-based rules (see the sketch below) or just removing the old rules. However, I likely won't get to that and test it across the different architectures due to other feature CLs before Go 1.14. Therefore I am closing this issue, as the desired code generation itself is in for Go 1.14.
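For illustration, such a codegen test could look roughly like the following, written in the style of the test/codegen directory (the function name and exact check strings are assumptions, not from an actual CL):

// asmcheck

package codegen

// AndHighBits should compile to a single AND of a register-loaded mask
// rather than a SHR+SHL pair on amd64.
func AndHighBits(x uint64) uint64 {
	// amd64:"ANDQ",-"SHRQ",-"SHLQ"
	return x & (1 << 63)
}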

@martisch martisch closed this as completed Oct 8, 2019
@golang golang locked and limited conversation to collaborators Oct 7, 2020