cmd/compile: various low level x86 instruction generation improvements #28671
Labels
compiler/runtime
Issues related to the Go compiler and/or runtime.
help wanted
NeedsInvestigation
Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Performance
Suggested
Issues that may be good for new contributors looking for work to do.
Milestone
While reading (to much) go generated assembly code I picked up a few x86 code sequences that seemed sub optimal. I do not remember where I had spotted each of them and some might just come from my imagination, compiler optimization guides or from outside the std library.
Instead of creating an issue per possibility here is a list of some possible low level performance improvements. Note that this does not mean they are common and therefore worth introducing. That can be evaluated. However these can serve to spark ideas for other improvements and for new compiler contributors to try out adding ssa optimization rules or codegen improvements and benchmarking their effects and frequency. UPDATE: CLs should make sure to include statistics/examples of use in std lib and/or generally when introducing optimizations.
Current assembly gc and gccgo create can be quickly checked with https://godbolt.org/.
Given the low level nature there might be oversights in what is possible and whether they are size or performance improvements. As always needs benchmarks and tests.
Many of them should be considered as examples for more general optimizations.
The list:
no baseless lea:
There is no lea without a base and only
index*operandsize
resulting inx+x+x+x
being compiled toleaq+addq
instead of a singleleaq[0+x*4]
with some more combination rules.to many imul
x*x*x*x
is compiled as 3imuls
when 2 are sufficient.set all bits in a register (UPDATE: not generally worth it due to false dependency)
Instead of
MOVQ $-0x1, Reg
useORQ $-0x1, Reg
which is shorter by ~4 bytes for 64bit ints but may create a false dependency as noted by @randall77.comparing modulo
x % 2 == 0
can beandl $1, %eax, testq %rax, %rax
(orbtl $0, AX
...) instead ofshift+shift+add+shift+shift+cmp
instead of add use subtract for some powers of 2
sub -128
is shorter thanadd 128
. (watch out that flags are not used)for some powers of 2 less equal is better than less of a bit more
x < 128
encodes toCMPQ $0x80, AX; JG
which is larger thanCMPQ $0x7f, AX; JGE
. Should work similar for other comparisons encoding of constants.unsigned division with int
If it is known the int divisor is positive instead of
CQO+IDIV
aXOR+DIV
could be used.optimize modulo with shifts that produce power of 2
var x,y uint; x % (1<<y)
can be replaced withx & ((1<<y)-1)
.@josharian @randall77 @TocarIP @quasilyte
The text was updated successfully, but these errors were encountered: