compile: convert standalone LEA to ADD/SUB to reduce port contention on x86/amd64 #49087

Open
martisch opened this issue Oct 20, 2021 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Comments

@martisch
Contributor

Except on very new cores (e.g. Sunny Cove), LEA can usually execute on only around 2 CPU ports, while ADD can execute on as many or more (4). In production profiles of C++ compiler-optimized binaries, using ADD/SUB instead of LEA has proven advantageous because it reduces port contention.

We should test adding a pass after the current SSA optimizations that transforms standalone simple LEA instructions (those that were not fused with compares or moves) into ADD and SUB where they are equivalent (modulo flags). We could also use that pass to split slow 3-operand LEAs into multiple ADDs or 2-operand LEAs.
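
As a rough illustration of the kind of rewrite such a pass could perform, here is a minimal sketch in Go. The `LEA` struct and `rewriteLEA` function are hypothetical names made up for this example; they are not the compiler's SSA representation or any existing API. The sketch assumes the caller already knows whether the flags are live at that point, since ADD/SUB clobber the flags and LEA does not.

```go
package main

import (
	"fmt"
	"math"
)

// LEA is a hypothetical, simplified model of "LEAQ Disp(Base)(Index*Scale), Dst".
// An empty Base or Index means that component is absent.
type LEA struct {
	Dst, Base, Index string
	Scale            uint8
	Disp             int32
}

// rewriteLEA returns an equivalent ADDQ/SUBQ (in Go assembler syntax) and true
// when the LEA is a simple in-place update of its destination register.
// It refuses to rewrite when flags are live, because ADD/SUB modify the flags
// while LEA leaves them untouched.
func rewriteLEA(l LEA, flagsLive bool) (string, bool) {
	if flagsLive {
		return "", false
	}
	switch {
	case l.Dst == l.Base && l.Index == "" && l.Disp != 0:
		// LEAQ disp(r), r: add or subtract a constant.
		// MinInt32 cannot be negated into a 32-bit immediate, so it stays an ADD.
		if l.Disp < 0 && l.Disp != math.MinInt32 {
			return fmt.Sprintf("SUBQ $%d, %s", -int64(l.Disp), l.Dst), true
		}
		return fmt.Sprintf("ADDQ $%d, %s", l.Disp, l.Dst), true
	case l.Dst == l.Base && l.Index != "" && l.Scale == 1 && l.Disp == 0:
		// LEAQ (r)(idx*1), r: add the index register.
		return fmt.Sprintf("ADDQ %s, %s", l.Index, l.Dst), true
	case l.Dst == l.Index && l.Base != "" && l.Scale == 1 && l.Disp == 0:
		// LEAQ (base)(r*1), r: add the base register.
		return fmt.Sprintf("ADDQ %s, %s", l.Base, l.Dst), true
	}
	// Anything else (scaled index, different destination, ...) is left alone.
	return "", false
}

func main() {
	examples := []LEA{
		{Dst: "AX", Base: "AX", Disp: 8},
		{Dst: "AX", Base: "AX", Disp: -8},
		{Dst: "AX", Base: "AX", Index: "BX", Scale: 1},
		{Dst: "AX", Base: "BX", Index: "CX", Scale: 4},
	}
	for _, l := range examples {
		if s, ok := rewriteLEA(l, false); ok {
			fmt.Println(s)
		} else {
			fmt.Printf("keep LEA %+v\n", l)
		}
	}
}
```

A real pass would work on SSA values or on the generated instruction stream and would need its own flag-liveness information; the sketch only shows which LEA shapes have a single-instruction ADD/SUB equivalent.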

martisch added this to the Unplanned milestone Oct 20, 2021
martisch added the NeedsInvestigation label Oct 20, 2021
@klauspost
Contributor

I can confirm this is also the case for AMD Zen 2.

#43690 shows an improvement even when replacing LEAL const(a)(R8*1), a with ADDL $const, a; ADDL R8, a in pipeline-intensive work. I am not claiming this is always the case.

Also, LEAW (16-bit destination) appears significantly slower (3x on Zen 2, 2x on Intel) than its 32- and 64-bit equivalents, so it seems best avoided.

For ADD, the only surprise seems to be that ADDW imm16, r16 has a 3x penalty on Intel, while both ADDW imm8, r16 and ADDW imm32, r16 are fine.

@martisch
Contributor Author

martisch commented Oct 22, 2021

LEAL const(a)(R8*1), a

That one is even more special: it is a 3-operand LEA (3 cycles). Except on the very newest CPUs, such LEAs are already faster when split into two 2-operand LEAs (2x 1 cycle); see #21735. The Go compiler does this splitting for Go code but not for assembly.
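
To make the equivalence concrete, here is a small sketch that models the 3-operand LEAL and the two split forms mentioned in this thread using Go's wrapping uint32 arithmetic. The function names (`lea3`, `splitLEA`, `splitADD`) are made up for this example, and the model only covers the computed value; it says nothing about flags or timing.

```go
package main

import "fmt"

// lea3 models "LEAL disp(a)(idx*1), a": base + index + displacement
// computed in a single instruction, with 32-bit wraparound.
func lea3(a, idx uint32, disp int32) uint32 {
	return a + idx + uint32(disp)
}

// splitLEA models the split into two 2-operand LEAs:
//   LEAL (a)(idx*1), a
//   LEAL disp(a), a
func splitLEA(a, idx uint32, disp int32) uint32 {
	a = a + idx
	a = a + uint32(disp)
	return a
}

// splitADD models the replacement with two ADDs:
//   ADDL $disp, a
//   ADDL idx, a
// Unlike the LEA forms, the real ADDL instructions also overwrite the flags.
func splitADD(a, idx uint32, disp int32) uint32 {
	a += uint32(disp)
	a += idx
	return a
}

func main() {
	cases := []struct {
		a, idx uint32
		disp   int32
	}{
		{10, 5, 7},
		{0xFFFFFFFF, 1, 1}, // exercises 32-bit wraparound
		{10, 5, -3},        // negative displacement
	}
	for _, c := range cases {
		want := lea3(c.a, c.idx, c.disp)
		fmt.Println(want,
			splitLEA(c.a, c.idx, c.disp) == want,
			splitADD(c.a, c.idx, c.disp) == want)
	}
}
```

All three forms produce the same register value, so which one to emit is purely a question of latency, port usage, and whether the flags may be clobbered.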
