
cmd/compile: prefer AND instead of SHR+SHL #33826

Closed

martisch opened this issue Aug 25, 2019 · 11 comments

Comments

@martisch
Contributor

martisch commented Aug 25, 2019

In CL 19485 we added a generic SSA rule that replaces an AND with certain constants by two SHIFT instructions.

While this optimization avoids loading the constant to be ANDed into an extra register and has a shorter encoding on amd64, it uses two data-dependent instructions. There was already some discussion on the CL, after its accidental early submission, that microbenchmarks do not show two shifts to be faster; some even suggest they can be slower, e.g. for x &= 1 << 63. I benchmarked math.Copysign and it showed the same throughput with either variant. LLVM and GCC do not seem to replace AND with SHIFTs on amd64 either, and on some platforms, such as arm64, the Go gc compiler even reverses this rewrite:

https://go.googlesource.com/go/+/06f709a04acc6d5ba0ba181129e9ee93ed20f311/src/cmd/compile/internal/ssa/gen/ARM64.rules#1866
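For reference, a minimal Go sketch of the equivalence the rule exploits (the function names are illustrative, not from the compiler): keeping only the top bit of a register can be done with an AND against a large constant mask or with a pair of shifts.

func maskWithAnd(x uint64) uint64 {
	return x & (1 << 63) // AND with a constant that has only bit 63 set
}

func maskWithShifts(x uint64) uint64 {
	return (x >> 63) << 63 // SHR then SHL clear the low 63 bits
}

Both return identical results for every input; the question here is only which form is faster on a given CPU.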

There has also been some unwanted interaction with other optimization rules, e.g. #32781.

The rules initially added in CL 19485 operated on And64 and And32. The And32 case was already removed in CL 20410.

Removing the And64 case too can make binaries slightly larger, but in the common case where there is no register pressure the AND should be as fast as or faster than two SHIFTs on modern amd64 CPUs, and it should interfere less with other SSA rules, which then need not consider the additional case of an AND mask having been rewritten to SHIFTs.

I intend to send CLs for evaluation and submission that remove the generic rule rewriting AND into a pair of SHIFTs for Go 1.14, and to follow up with CLs that avoid regressions in other rules that optimize SHIFTs instead of ANDs. For example, the AND in (u64>>32)&0xFFFFFFFF should still be optimized away by SSA, as illustrated below.
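As a concrete sketch (hypothetical function name): after a logical right shift by 32 the upper 32 bits are already zero, so the AND below is redundant and SSA should eliminate it.

func lowHalf(u64 uint64) uint64 {
	return (u64 >> 32) & 0xFFFFFFFF // mask is a no-op: the shift already zeroed the high bits
}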

This issue documents the related CLs and is for discussing this (de)optimization and whether any 64-bit platform supported by the Go gc compiler should keep rewriting some ANDs to two SHIFTs.

/cc @klauspost @cherrymui @khr @dr2chase @laboger @mundaym

@martisch martisch added this to the Go1.14 milestone Aug 25, 2019
@martisch martisch self-assigned this Aug 25, 2019
@gopherbot

Change https://golang.org/cl/191780 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

@klauspost
Contributor

Looking at instruction timing, it should be pretty much the same.

I am not saying that this is a bad change, just that I don't think this benchmark is accurate if it is emitting what you write.

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

Is MOVQ $0x8000000000000000, RAX actually inside the loop? And wouldn't it emit ANDQ $0x8000000000000000, REG?

On Ivy Bridge (your CPU) and pretty much all other CPUs, the instructions involved all have a 1-cycle latency. Since your benchmark is serial, it should be pretty much the same assuming it is emitting MOVQ+ANDQ. If it is only ANDQ, it should be 2x faster (excluding the loop overhead).

It could also be that Ivy Bridge is able to do some magic and can ignore the MOVQ.

Either way, with the AND the compiler will be able to do some optimizations. A clever compiler could even move the constant load outside the loop, as mentioned above (see the sketch below).
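A minimal Go sketch of that hoisting opportunity (the loop here is hypothetical, not from the CL): the mask is loop-invariant, so a compiler may materialize it in a register once, before the loop, instead of on every iteration.

func sumMasked(xs []uint64) uint64 {
	const mask = uint64(1) << 63 // loop-invariant; the MOVQ for it can be hoisted
	var s uint64
	for _, x := range xs {
		s += x & mask // only the AND remains in the loop body
	}
	return s
}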

@martisch
Contributor Author

martisch commented Aug 25, 2019

Looking at instruction timing, it should be pretty much the same.

I would agree if viewed in isolation. With surrounding code, as in the benchmark loop, I would expect the MOV to be issued in parallel with some earlier instruction (unless the CPU is already saturated at 4 retired instructions per cycle). The two shifts together always need at least 2 cycles after their input is ready from the previous loop iteration.

Is MOVQ $0x8000000000000000, RAX actually inside the loop?

Checked again, and the MOV is inside the loop. That a MOV is needed only becomes clear very late, in the arch-specific SSA (see why below). I do not think there is a loop-hoisting pass run after that stage.

And wouldn't it emit ANDQ $0x8000000000000000, REG?

AND on amd64 does not support 64-bit immediates: https://www.felixcloutier.com/x86/and
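To make the encoding constraint concrete, a sketch (the exact generated code depends on the compiler version): AND on amd64 accepts at most a 32-bit immediate that is sign-extended to 64 bits, so small masks encode inline while a mask like 1<<63 needs a separate MOVQ.

// smallMask fits in a sign-extended 32-bit immediate, so the compiler can
// emit something like ANDQ $0x7FFFFFFF, REG directly.
func smallMask(x uint64) uint64 {
	return x & 0x7FFFFFFF
}

// largeMask does not fit in a 32-bit immediate; the constant must first be
// loaded into a register with MOVQ.
func largeMask(x uint64) uint64 {
	return x & (1 << 63)
}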

@klauspost
Contributor

Yeah, the CPU is doing magic.

A simple test, pure assembler:

TEXT ·testAnd(SB), NOSPLIT, $0
	MOVQ $65535, CX              // loop counter
	XORQ AX, AX                  // value to mask

loop_and:
	MOVQ $0x8000000000000000, BX // mask does not fit a 32-bit immediate
	ANDQ BX, AX                  // single AND keeps only bit 63
	DECQ CX
	JNZ  loop_and
	RET

TEXT ·testShift(SB), NOSPLIT, $0
	MOVQ $65535, CX              // loop counter
	XORQ AX, AX                  // value to mask

loop_shift:
	SHRQ $63, AX                 // move bit 63 down to bit 0, clearing the rest
	SHLQ $63, AX                 // move it back up: two dependent instructions
	DECQ CX
	JNZ  loop_shift
	RET

Using AND is ~2x as fast on my Ryzen 3600. Moving the MOVQ outside the loop makes absolutely no difference, which indicates the magic :)

@laboger
Contributor

laboger commented Aug 28, 2019

Tried your CL on ppc64le and it seems to be better by about 30% on the benchmark you provided.

gopherbot pushed a commit that referenced this issue Sep 9, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

var GlobalU uint

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@martisch
Contributor Author

martisch commented Sep 9, 2019

The first attempt broke codegen tests on arm64 (BFXIL) and s390x (abs and copysign).
https://build.golang.org/log/a47a1c98bac2aa509d570d6bf583609857b1be5a
https://build.golang.org/log/6456da8a84cc82e7407981287f211caaab41c7eb

I will make sure to run the codegen tests for all platforms for the new version of the CL.

For the new version of the CL I added abs and copysign rules that detect the new AND variant; I am still working on the arm64 bitfield adjustment.
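For context, a minimal sketch of why abs and copysign are affected (this mirrors how math.Abs and math.Copysign work, but the code below is illustrative, not the s390x rules themselves): both reduce to masking the sign bit, bit 63, of the float's bit pattern, exactly the AND-with-a-large-constant shape this change restores.

package sketch

import "math"

// abs clears the sign bit with a mask (&^ is Go's AND NOT).
func abs(x float64) float64 {
	return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

// copysign combines the magnitude of x with the sign bit of y.
func copysign(x, y float64) float64 {
	const sign = 1 << 63
	return math.Float64frombits(math.Float64bits(x)&^sign | math.Float64bits(y)&sign)
}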

@gopherbot

Change https://golang.org/cl/194297 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 21, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb
Reviewed-on: https://go-review.googlesource.com/c/go/+/194297
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot

Change https://golang.org/cl/196957 mentions this issue: Revert "compile: prefer an AND instead of SHR+SHL instructions"

gopherbot pushed a commit that referenced this issue Sep 23, 2019
This reverts CL 194297.

Reason for revert: introduced register allocation failures on PPC64LE builders.

Updates #33826
Updates #32781
Updates #34468

Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48
Reviewed-on: https://go-review.googlesource.com/c/go/+/196957
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
@gopherbot

Change https://golang.org/cl/196810 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 24, 2019
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND has a shorter
encoding and uses one less general purpose register, but it is slower
because it takes one clock cycle longer, provided there is no register
pressure that would force the AND variant to generate a spill.

For example, the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

After this CL, the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196810
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@randall77
Contributor

Anything else need to be done here?

@martisch
Contributor Author

martisch commented Oct 8, 2019

The feature is in on the third try and AFAIK has not caused any new breakages since then.

There are some cleanups that can be done, e.g. adding codegen tests that exercise the old shift-based rules (see the sketch below) or just removing the old rules. However, I likely won't get to that and test it across the different architectures due to other feature CLs before Go 1.14. Therefore I am closing this issue, as the desired code generation itself is in for Go 1.14.
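For illustration, such a codegen test could look roughly like the following, written in the style of the test/codegen directory (the function name and exact check strings are assumptions, not from an actual CL):

// asmcheck

package codegen

// AndHighBits should compile to a single AND of a register-loaded mask
// rather than a SHR+SHL pair on amd64.
func AndHighBits(x uint64) uint64 {
	// amd64:"ANDQ",-"SHRQ",-"SHLQ"
	return x & (1 << 63)
}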

@martisch martisch closed this as completed Oct 8, 2019
@golang golang locked and limited conversation to collaborators Oct 7, 2020