cmd/compile: prefer AND instead of SHR+SHL #33826
Change https://golang.org/cl/191780 mentions this issue.
Looking at instruction timing, it should be pretty much the same. I am not saying that this is a bad change, just that I don't think this benchmark is accurate if it is emitting what you write.
On Ivy Bridge (your CPU) and pretty much all other CPUs, the instructions involved all have a 1-cycle latency. Since your benchmark is serial, it should be pretty much the same assuming it is emitting MOVQ+ANDQ. If it is only ANDQ, it should be 2x faster (excluding the loop overhead). It could also be that Ivy Bridge is able to do some magic and can ignore the MOVQ. Either way, with the …
I would agree if viewed standalone. If there is surrounding code, like in the benchmark loop, I would expect the MOV to be issued in parallel with some earlier instruction (unless the CPU is already saturated at 4 retires per cycle). The two shifts together always need at least 2 cycles after their input is ready from the previous loop iteration.
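To make the latency argument concrete, here is an annotated sketch of the two sequences; the cycle counts are illustrative, assuming the 1-cycle ALU latencies discussed above:

    // shift pair: both instructions sit on the loop-carried dependency chain
    SHRQ $0x3f, DX                // waits on DX from the previous iteration -> cycle 1
    SHLQ $0x3f, DX                // waits on the SHRQ result                -> cycle 2

    // mov+and: the constant load has no input dependency and can issue early
    MOVQ $0x8000000000000000, AX  // off the critical path
    ANDQ DX, AX                   // waits only on DX                        -> cycle 1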
Checked again, and the MOV is inside the loop. That a MOV is needed only becomes clear very late, in the arch-specific SSA (see why below). I do not think there is a loop-hoisting pass run after that stage.
AND on amd64 does not support 64-bit immediates: https://www.felixcloutier.com/x86/and
Yeah, the CPU is doing magic. A simple test, pure assembler:
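The assembly listing itself is not preserved in this thread; a minimal reconstruction of such a test in Go's amd64 assembler (hypothetical file and function names, with Go declarations func shiftLoop(n uint64) uint64 and func andLoop(n uint64) uint64) could look like:

    // masks_amd64.s (sketch)
    #include "textflag.h"

    // Loop-carried SHR+SHL: the two shifts form a 2-cycle dependency chain.
    TEXT ·shiftLoop(SB), NOSPLIT, $0-16
        MOVQ n+0(FP), CX
        MOVQ $-1, DX
    loop:
        SHRQ $63, DX
        SHLQ $63, DX
        DECQ CX
        JNZ  loop
        MOVQ DX, ret+8(FP)
        RET

    // MOV+AND: the constant reload has no input dependency, so only the
    // AND sits on the loop-carried chain.
    TEXT ·andLoop(SB), NOSPLIT, $0-16
        MOVQ n+0(FP), CX
        MOVQ $-1, DX
    loop:
        MOVQ $0x8000000000000000, AX
        ANDQ AX, DX
        DECQ CX
        JNZ  loop
        MOVQ DX, ret+8(FP)
        RET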
Using …
Tried your CL on ppc64le and it seems to be better by about 30% on the benchmark you provided.
On modern 64bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND instruction cannot be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However, that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports.

Using a pair of shifts to mask the high bits of a register instead of an AND has a shorter encoding and uses one less general purpose register, but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:

    48c1ea3f                SHRQ $0x3f, DX
    48c1e23f                SHLQ $0x3f, DX

After this CL the instructions are the same as GCC and LLVM use:

    48b80000000000000080    MOVQ $0x8000000000000000, AX
    4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
            x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:

    name            old time/op  new time/op  delta
    AndHighBits-4   0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The first attempt broke codegen tests on arm64 (bfxil) and s390x (abs and copysign). I will make sure to run codegen tests for all platforms for the new version of the CL. For the new version I added abs and copysign rules that detect the new AND variant; still working on the arm64 bitfield adjustment.
Change https://golang.org/cl/194297 mentions this issue.
On modern 64bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND instruction cannot be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However, that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports.

Using a pair of shifts to mask the high bits of a register instead of an AND has a shorter encoding and uses one less general purpose register, but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:

    48c1ea3f                SHRQ $0x3f, DX
    48c1e23f                SHLQ $0x3f, DX

After this CL the instructions are the same as GCC and LLVM use:

    48b80000000000000080    MOVQ $0x8000000000000000, AX
    4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
            x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:

    name            old time/op  new time/op  delta
    AndHighBits-4   0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb
Reviewed-on: https://go-review.googlesource.com/c/go/+/194297
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Change https://golang.org/cl/196957 mentions this issue.
This reverts CL 194297.

Reason for revert: introduced register allocation failures on PPC64LE builders.

Updates #33826
Updates #32781
Updates #34468

Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48
Reviewed-on: https://go-review.googlesource.com/c/go/+/196957
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Change https://golang.org/cl/196810 mentions this issue.
On modern 64bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND instruction cannot be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However, that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports.

Using a pair of shifts to mask the high bits of a register instead of an AND has a shorter encoding and uses one less general purpose register, but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:

    48c1ea3f                SHRQ $0x3f, DX
    48c1e23f                SHLQ $0x3f, DX

After this CL the instructions are the same as GCC and LLVM use:

    48b80000000000000080    MOVQ $0x8000000000000000, AX
    4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
            x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:

    name            old time/op  new time/op  delta
    AndHighBits-4   0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196810
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Anything else need to be done here?
The feature is in at the 3rd try and AFAIK has not caused any new breakages since then. There are some cleanups that could be done, e.g. adding codegen tests that exercise the old shift-based rules, or just removing those rules. However, I likely won't get to that and test it across the different architectures due to other feature CLs before go1.14. Therefore I am closing this issue, as the desired code generation itself is in for go1.14.
In cl/19485 we added a generic SSA rule to replace an AND with specific constants by two SHIFT instructions.
While this optimization avoids loading the constant to be ANDed into an extra register and has a shorter encoding on amd64, it uses two data-dependent instructions. There was already some discussion on the CL, after its accidental early submission, that micro benchmarks do not show two shifts being faster; some seem to show they can be slower, e.g. for x &= 1 << 63. I benchmarked math.Copysign (a sketch of such a benchmark follows below) and it showed the same throughput with either variant. LLVM and GCC do not seem to replace AND with SHIFTs on amd64 either, and some platforms, like arm64 in the go gc compiler, even reverse this rewrite: https://go.googlesource.com/go/+/06f709a04acc6d5ba0ba181129e9ee93ed20f311/src/cmd/compile/internal/ssa/gen/ARM64.rules#1866
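A benchmark in the spirit of that comparison (a sketch only; the benchmark actually used is not shown in the thread) could be:

    package masks_test

    import (
        "math"
        "testing"
    )

    var GlobalF float64

    func BenchmarkCopysign(b *testing.B) {
        x := -1.5
        for i := 0; i < b.N; i++ {
            // math.Copysign manipulates the sign bit, so its codegen
            // exercises the AND-vs-SHIFT pattern discussed here.
            x = math.Copysign(x, -2.0)
        }
        GlobalF = x // keep the result live so the loop is not eliminated
    }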
There has also been some unwanted interaction with other optimization rules, e.g. #32781.
The initially added rules from cl/19485 operated on And64 and And32. The And32 case was already removed in cl/20410.
Removing the And64 case too can make binaries slightly larger, but in the common case where there is no register pressure the AND variant should be as fast as or faster than two SHIFTs on modern amd64 CPUs, and it creates less interference with other SSA rules, which then need not consider the additional case of an AND mask having been rewritten to SHIFTs.
I intend to send CLs for evaluation and submission that remove the generic rule rewriting AND into a pair of SHIFTs for go1.14, and to follow up with CLs that avoid regressions in the interaction with other rules that are written to optimize SHIFTs instead of ANDs. For example, the AND instruction in (u64>>32)&0xFFFFFFFF should still be optimized away by SSA (see the sketch below). This issue is to document the related CLs and to discuss this (de)optimization and whether any 64bit platforms supported by the go gc compiler should keep rewriting some ANDs to two SHIFTs.
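As a minimal illustration of that case (hypothetical function name): the logical right shift by 32 already zeroes the high bits, so SSA should drop the AND entirely rather than turn it into further shifts.

    // high32 returns the upper 32 bits of u64 in the low half of the result.
    // The mask is redundant after the shift and should compile to a single SHRQ.
    func high32(u64 uint64) uint64 {
        return (u64 >> 32) & 0xFFFFFFFF
    }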
/cc @klauspost @cherrymui @khr @dr2chase @laboger @mundaym