cmd/compile: various low level x86 instruction generation improvements #28671

martisch · 2018-11-08T17:16:15Z

While reading (too much) Go-generated assembly code, I picked up a few x86 code sequences that seemed suboptimal. I do not remember where I spotted each of them, and some might just come from my imagination, from compiler optimization guides, or from outside the std library.

Instead of creating an issue per possibility, here is a list of some possible low-level performance improvements. Note that this does not mean they are common and therefore worth introducing; that can be evaluated. However, they can serve to spark ideas for other improvements, and give new compiler contributors something to try out: adding SSA optimization rules or codegen improvements and benchmarking their effects and frequency. UPDATE: CLs should make sure to include statistics/examples of use in the std lib and/or in general when introducing optimizations.

The assembly that gc and gccgo currently generate can be quickly checked with https://godbolt.org/.
Given the low-level nature, there might be oversights in what is possible and in whether these are size or performance improvements. As always, this needs benchmarks and tests.

Many of them should be considered as examples for more general optimizations.

The list:

no baseless lea:
There is no lea without a base and with only index*scale, so x+x+x+x is compiled to leaq+addq instead of a single leaq [0+x*4]. Some more combination rules could cover this.
too many imul
x*x*x*x is compiled as 3 imuls when 2 are sufficient.
set all bits in a register (UPDATE: not generally worth it due to false dependency)
Instead of MOVQ $-0x1, Reg use ORQ $-0x1, Reg, which is shorter by ~4 bytes for 64-bit ints but may create a false dependency, as noted by @randall77.
comparing modulo
x % 2 == 0 can be andl $1, %eax, testq %rax, %rax (or btl $0, AX ...) instead of shift+shift+add+shift+shift+cmp
instead of add use subtract for some powers of 2
sub -128 is shorter than add 128, because -128 fits in a sign-extended 8-bit immediate and 128 does not. (Watch out that the flags are not used.)
for some powers of 2, less-or-equal is better than less-than of one more
x < 128 encodes to CMPQ $0x80, AX; JG, which is larger than CMPQ $0x7f, AX; JGE. This should work similarly for other comparisons, depending on how the constants encode.
unsigned division with int
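If both the int dividend and the int divisor are known to be non-negative, XOR+DIV could be used instead of CQO+IDIV. (A positive divisor alone is not enough, since CQO sign-extends the dividend.)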
optimize modulo with shifts that produce power of 2
var x,y uint; x % (1<<y) can be replaced with x & ((1<<y)-1).
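
Several of the rewrites in the list are arithmetic identities that can be checked in plain Go before looking at the generated assembly. The following is a minimal sketch, not part of the original proposal: all function names are made up for illustration, and the sample values are arbitrary. It also shows two caveats: the modulo/mask identity only holds while the shift count is below the width of uint, and the signed/unsigned division identity needs a non-negative dividend as well as a non-negative divisor.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isEvenMod and isEvenAnd agree for every int, including negatives:
// in Go, x%2 is 0, 1 or -1, and it is 0 exactly when the low bit is clear.
func isEvenMod(x int) bool { return x%2 == 0 }
func isEvenAnd(x int) bool { return x&1 == 0 }

// less128 and lessEq127 express the same predicate on integers.
func less128(x int) bool   { return x < 128 }
func lessEq127(x int) bool { return x <= 127 }

// modPow2 and maskPow2 agree only while y is smaller than the width of uint.
// For larger y, uint(1)<<y is 0 in Go, so modPow2 panics (division by zero).
func modPow2(x, y uint) uint  { return x % (uint(1) << y) }
func maskPow2(x, y uint) uint { return x & ((uint(1) << y) - 1) }

// divSigned and divUnsigned agree only when both operands are non-negative.
func divSigned(x, y int) int   { return x / y }
func divUnsigned(x, y int) int { return int(uint(x) / uint(y)) }

func main() {
	samples := []int{-129, -128, -3, -2, -1, 0, 1, 2, 3, 127, 128, 129, 1 << 20}
	ok := true
	for _, x := range samples {
		ok = ok && isEvenMod(x) == isEvenAnd(x)
		ok = ok && less128(x) == lessEq127(x)
		for y := uint(0); y < bits.UintSize; y++ {
			ok = ok && modPow2(uint(x), y) == maskPow2(uint(x), y)
		}
		if x >= 0 {
			for _, d := range []int{1, 2, 7, 128, 1 << 20} {
				ok = ok && divSigned(x, d) == divUnsigned(x, d)
			}
		}
	}
	fmt.Println("all equivalences hold on samples:", ok)

	// With a negative dividend the signed and unsigned quotients differ.
	fmt.Println(divSigned(-8, 2) == divUnsigned(-8, 2))
}
```

Pasting the individual functions into https://godbolt.org/ is one way to see which instruction sequence the compiler currently picks for each form.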

@josharian @randall77 @TocarIP @quasilyte

randall77 · 2018-11-08T17:23:27Z

> Instead of MOVQ $-0x1, Reg use ORQ $-0x1, Reg which is shorter.

Does this have a false dependency on the previous value of Reg?
XORL AX, AX is a special case which breaks the dependency, not sure whether ORQ $-1, AX also does.

martisch · 2018-11-08T17:25:42Z

Thanks. You are right that ORQ could introduce an unwarranted dependency (or at least it's not known whether all x86 CPUs optimize the dependency away). Likely not always a win unless optimizing for size only, and it needs too much complex analysis to be sure it's a win in the concrete instruction flow.
Added as feedback to the list.

josharian · 2018-11-08T17:29:44Z

Somewhat related: #21439.

seebs · 2018-11-08T22:45:39Z

does "shorter" relate at all to runtime performance? i suppose smaller code improves cache performance, in general, all else being equal.

martisch · 2018-11-09T08:07:39Z

> does "shorter" relate at all to runtime performance?

Granted, these are hard to measure, especially in micro benchmarks. A smaller binary footprint means less cache use. Depending on the instruction and microarchitecture, the instruction decoders can also decode more short/simple instructions per cycle, and some archs seem to "only" fetch 16/32 bytes of instructions per cycle. More instructions will fit into some loop buffers if they are smaller in bytes, so more loops will be able to make use of loop buffers and could thereby execute faster.

gopherbot · 2018-11-14T06:34:47Z

Change https://golang.org/cl/149537 mentions this issue: cmd/compile/internal/ssa: optimized x*x*x*x to only 2 imuls
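
For reference, the idea behind the x*x*x*x item can be sketched at the source level. This is only an illustration with made-up function names, not the rewrite rule from the CL: squaring once and multiplying the square by itself gives the same result with two multiplications. Because Go integer multiplication wraps on overflow and wrapping multiplication is associative, the two forms agree even when the result overflows.

```go
package main

import "fmt"

// pow4Naive multiplies left to right, as the source expression is written.
func pow4Naive(x int64) int64 { return x * x * x * x }

// pow4Squared reuses the square, so only two multiplications are needed.
func pow4Squared(x int64) int64 {
	sq := x * x
	return sq * sq
}

func main() {
	// The last two samples overflow int64; both forms wrap identically.
	for _, x := range []int64{-7, -1, 0, 1, 3, 1 << 20, 1<<62 + 12345} {
		fmt.Println(x, pow4Naive(x) == pow4Squared(x))
	}
}
```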

martisch added this to the Unplanned milestone Nov 8, 2018

martisch added the Performance, help wanted, and NeedsInvestigation labels Nov 8, 2018

josharian added the Suggested label Nov 8, 2018

gopherbot added the compiler/runtime label Jul 13, 2022
