Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: avoid slow versions of LEA instructions on x86 #21735

Open
martisch opened this issue Sep 1, 2017 · 21 comments
Open

cmd/compile: avoid slow versions of LEA instructions on x86 #21735

martisch opened this issue Sep 1, 2017 · 21 comments
Assignees
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@martisch
Copy link
Contributor

martisch commented Sep 1, 2017

On newer x86 cpus (amd and intel) 3 operand LEA instructions with base, index and offset have a higher latency and less throughput than 2 operand LEA instructions.

The compiler when emitting the instructions could rewrite slow leas into e.g. LEA + ADD instructions where possible (flag clobbering ok) similar how MOV $0 R is rewritten to XOR R R.

Intel® 64 and IA-32 Architectures Optimization Reference Manual
3.5.1.3 Using LEA

For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
...

relevant llvm optimization ticket: https://reviews.llvm.org/D32277

/cc @TocarIP @randall77 @josharian

@martisch martisch added this to the Go1.10 milestone Sep 1, 2017
@martisch martisch changed the title cmd/compile: avoid slow versions of LEA instructions on amd64 cmd/compile: avoid slow versions of LEA instructions on x86 Sep 1, 2017
@TocarIP
Copy link
Contributor

TocarIP commented Sep 1, 2017

If we are going to do this, we can split lea into lea+lea, to avoid clobbering flags.
For reference, there are 952 3 operands leas in go tool binary.

grep "lea.*0x.*%.*%.*,%" qaz | grep -v "0x0(" | wc -l

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
@TocarIP
Copy link
Contributor

TocarIP commented May 10, 2018

Looks like I found a specific example of this.
After b1df8d6 we convert 32-bit multiplication by 0x101 into slow lea, which caused a performance regression on master vs 1.10 for image/draw benchmarks:

CMYK-8   461µs ± 2%   513µs ± 1%  +11.31%  (p=0.000 n=10+9)

@martisch
Copy link
Contributor Author

martisch commented May 11, 2018

Thanks for noticing the regression and finding a specific test case Ilya. I will work on a slow LEA fix within the next week unless somebody else wants to have a go at it earlier.

@martisch martisch self-assigned this May 11, 2018
@martisch martisch added release-blocker NeedsFix The path to resolution is known, but the work has not been done. labels May 11, 2018
@benshi001
Copy link
Member

can it be fixed in the assembler? I found that there are inefficient LEA used in assembly code.

@martisch
Copy link
Contributor Author

Fixing it in asm will need a manual fix (or script detecting it + cl to fix) as i dont think we should rewrite handwritten code when translating to binary. Maybe its needed to achieve some alignment of instructions.

Also for asm depending on the context a LEA+ADD or MUL could be better or equally valid instead of LEA+LEA so i would say those should be fixed on a case by case basis with benchmarks if possible.

Maybe some vet/lint check for asm might be nice.

I was working on a fix in the amd64 opcode code that emits the assembler instructions. Similar how MOV of 0 is rewritten as XOR.

@randall77
Copy link
Contributor

I think we should leave assembly as is. Magic in the assembler is always confusing.

So is the fix just to disable the rules that can generate 3-part LEA instructions? Basically, enforce that LEAQ[1248] can't have either an Aux or AuxInt?
Are all LEA affected? LEA[LW]? Also on 386?

This is a (pretty weak) argument for introducing GOAMD64 architecture variants. If we could figure out reliably which chips have this problem and which don't.

@martisch
Copy link
Contributor Author

While looking into this I found and started on 3 different ways to reduce slow LEAs:

  1. rewrite some of the rules that output LEA to prefer LEA2 [c] x over LEA1 [c] x x
  2. add simplification rules LEA1 [c] x x -> LEA2 [c] x ...
    3 rewrite LEA with 3 operands to two LEA when emitting asm for the LEA opcodes.
    3(bonus) possibly for go 1.12: rewrite LEAs to use ADD if Flags can be clobbered since ADD has four possible execution ports (e.g. Ryzen, Skylake) and LEA only 2 on some modern x64 chips.

Still working on testing these as well as gathering statistics about occurrences and rerunning benchmarks so we have slow downs that we know covered.

AFAIK Slow LEAs with 3 operands exist on modern Intel and AMD chips independent of 386 or x64 mode and word width. But 16bit LEA are slower even with 2 operands.

@TocarIP
Copy link
Contributor

TocarIP commented May 22, 2018

We have a bunch of rules for merging lea into mov, so I feel like option 3 will have lowest amount of unintended consequences. It also makes it easier to switch between versions, if we want to e. g. optimize some function for size, instead of speed or update it for future CPUs.

@gopherbot
Copy link

Change https://golang.org/cl/114655 mentions this issue: cmd/compile: split slow 3 operand LEA instructions into two LEAs

@martisch
Copy link
Contributor Author

martisch commented May 26, 2018

@TocarIP
did you find a file+line number where a slow lea is created that effects the draw benchmarks?
For 0x101 multiplication it seems a 2 (but not 3) operand lea is created since there is no offset/displacement e.g.:

  draw.go:243		0x112f368		89ca			MOVL CX, DX			
  draw.go:243		0x112f36a		c1e108			SHLL $0x8, CX			
  draw.go:243		0x112f36d		8d0c11			LEAL 0(CX)(DX*1), CX

@quasilyte
Copy link
Contributor

Minimal example for that multiplication is:

package foo

func mul(x uint32) uint32 {
	return x * 0x101
}
	0x0000 (../foo.go:4)	MOVL	"".x+8(SP), AX
	0x0004 (../foo.go:4)	MOVL	AX, CX
	0x0006 (../foo.go:4)	SHLL	$8, AX
	0x0009 (../foo.go:4)	LEAL	(AX)(CX*1), AX
	0x000c (../foo.go:4)	MOVL	AX, "".~r1+16(SP)
	0x0010 (../foo.go:4)	RET

It used to emit IMUL3L:

	0x0000 (../foo.go:4)	MOVL	"".x+8(SP), AX
	0x0004 (../foo.go:4)	IMUL3L	$257, AX, AX
	0x000a (../foo.go:4)	MOVL	AX, "".~r1+16(SP)
	0x000e (../foo.go:4)	RET

And prior to that, it was IMUL:

	0x0000 (../foo.go:3)	MOVL	"".x+8(SP), AX
	0x0004 (../foo.go:4)	IMULL	$257, AX
	0x000a (../foo.go:4)	MOVL	AX, "".~r1+16(SP)
	0x000e (../foo.go:4)	RET

@martisch
Copy link
Contributor Author

martisch commented May 29, 2018

@quasilyte Thanks but i do not think this is a slow LEA version affecting the benchmark. (However the particular rewrite to MOV+SHL+(fast)LEA still looks like to be slower)
LEAL (AX)(CX*1), AX is AFAIK not a slow version (3 operand) of LEA so the posted fix for slow LEAs itself wont fix that particular issue (mult by 0x101).

@TocarIP
Copy link
Contributor

TocarIP commented May 29, 2018

@martisch
In image/draw.drawCMYK I see (output of perf) following hotspot:

mov    0x4(%rsp),%r8d                                                                                                                                                        
  2.20 │       lea    -0xffff(%r8,%r13,1),%r8d                                                                                                                                              
  0.20 │       neg    %r8d                                                                                                                                                                  
  3.77 │       imul   %eax,%r8d                                                                                                                                                             
       │             g := (0xffff - uint32(m)*0x101) * w / 0xffff                                                                                                                           
  4.77 │       mov    %r15d,%r13d                                                                                                                                                           
  1.37 │       shl    $0x8,%r15d                                                                                                                                                            
  0.02 │       lea    -0xffff(%r13,%r15,1),%r13d                                                                                                                                            
  2.37 │       neg    %r13d                                                                                                                                                                 
  1.96 │       imul   %eax,%r13d                                                                                                                                                            
       │             b := (0xffff - uint32(y)*0x101) * w / 0xffff                                                                                                                           
  1.13 │       mov    %esi,%r15d                                                                                                                                                            
  0.04 │       shl    $0x8,%esi                                                                                                                                                             
  2.33 │       lea    -0xffff(%r15,%rsi,1),%esi                                                                                                                                             
  1.66 │       neg    %esi                                                                                                                                                                  
  1.46 │       imul   %eax,%esi                                                                                                                                                             

So it looks like g := (0xffff - uint32(m)*0x101) * w / 0xffff produces
lea -0xffff(%r13,%r15,1),%r13d , which is a 3-operand lea

@dr2chase
Copy link
Contributor

Are there objections to waiting till 1.12? The performance improvements are overall very small (0.16% is what I measure pretty consistently, including across a much larger selection of benchmarks), and it's getting very late in the 1.11 cycle and we seem to be behind schedule.

To summarize the performance changes, across the low-noise 256 benchmarks (out of 488 total, low noise is less than 2% max standard deviation reported across 30 trials) 15 showed >= 1% improvement, 16 showed >= 1% slowdown.

For comparison, the experiment to pretend we are only targeting more-recent versions of intel hardware (CL 117925) had 83 improved, 6 slowed down, and the geomean improvement was 1.28%

@dr2chase dr2chase removed this from the Go1.11 milestone Jun 26, 2018
@dr2chase dr2chase added this to the Go1.12 milestone Jun 26, 2018
@dr2chase dr2chase added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Jun 26, 2018
@dr2chase
Copy link
Contributor

Abusing "early-in-cycle" so we don't forget it, this one is low risk.

gopherbot pushed a commit that referenced this issue Aug 20, 2018
go tool objdump ../bin/go | grep "\.go\:" | grep -c "LEA.*0x.*[(].*[(].*"
Before: 1012
After: 20

Updates #21735

Benchmarks thanks to drchase@google.com
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
benchstat -geomean *.stdout | grep -v pkg:
name                                       old time/op    new time/op    delta
FastTest2KB-12                              131.0ns ± 0%   131.0ns ± 0%    ~     (all equal)
BaseTest2KB-12                              601.3ns ± 0%   601.3ns ± 0%    ~     (p=0.942 n=45+41)
Encoding4KBVerySparse-12                    15.38µs ± 1%   15.41µs ± 0%  +0.24%  (p=0.000 n=44+43)
Join_8-12                                    2.117s ± 2%    2.128s ± 2%  +0.51%  (p=0.007 n=48+48)
HashimotoLight-12                           1.663ms ± 1%   1.668ms ± 1%  +0.35%  (p=0.006 n=49+49)
Sha3_224_MTU-12                             4.843µs ± 0%   4.836µs ± 0%  -0.14%  (p=0.000 n=45+42)
GenSharedKeyP256-12                         73.74µs ± 0%   70.92µs ± 0%  -3.82%  (p=0.000 n=48+47)
Run/10k/1-12                                 24.81s ± 0%    24.88s ± 0%  +0.30%  (p=0.000 n=48+47)
Run/10k/16-12                                4.621s ± 2%    4.625s ± 3%    ~     (p=0.776 n=50+50)
Dnrm2MediumPosInc-12                        4.018µs ± 0%   4.019µs ± 0%    ~     (p=0.060 n=50+48)
DasumMediumUnitaryInc-12                    855.3ns ± 0%   855.0ns ± 0%  -0.03%  (p=0.000 n=47+42)
Dgeev/Circulant10-12                        40.45µs ± 1%   41.11µs ± 1%  +1.62%  (p=0.000 n=45+49)
Dgeev/Circulant100-12                       10.42ms ± 2%   10.61ms ± 1%  +1.83%  (p=0.000 n=49+48)
MulWorkspaceDense1000Hundredth-12           64.69ms ± 1%   64.63ms ± 1%    ~     (p=0.718 n=48+48)
ScaleVec10000Inc20-12                       22.31µs ± 1%   22.29µs ± 1%    ~     (p=0.424 n=50+50)
ValidateVersionTildeFail-12                 737.6ns ± 1%   736.0ns ± 1%  -0.22%  (p=0.000 n=49+49)
StripHTML-12                                2.846µs ± 0%   2.806µs ± 1%  -1.40%  (p=0.000 n=43+50)
ReaderContains-12                           6.073µs ± 0%   5.999µs ± 0%  -1.22%  (p=0.000 n=48+48)
EncodeCodecFromInternalProtobuf-12          5.817µs ± 2%   5.555µs ± 2%  -4.51%  (p=0.000 n=47+47)
TarjanSCCGnp_10_tenth-12                    7.091µs ± 5%   7.132µs ± 7%    ~     (p=0.361 n=50+50)
TarjanSCCGnp_1000_half-12                   82.25ms ± 3%   81.29ms ± 2%  -1.16%  (p=0.000 n=50+43)
AStarUndirectedmallWorld_10_2_2_2_Heur-12   15.18µs ± 8%   15.11µs ± 7%    ~     (p=0.511 n=50+49)
LouvainDirectedMultiplex-12                 20.92ms ± 1%   21.00ms ± 1%  +0.36%  (p=0.000 n=48+49)
WalkAllBreadthFirstGnp_10_tenth-12          2.974µs ± 4%   2.964µs ± 5%    ~     (p=0.504 n=50+50)
WalkAllBreadthFirstGnp_1000_tenth-12        9.733ms ± 4%   9.741ms ± 4%    ~     (p=0.774 n=48+50)
TextMovementBetweenSegments-12              432.8µs ± 0%   433.2µs ± 1%    ~     (p=0.128 n=50+50)
Growth_MultiSegment-12                      13.11ms ± 0%   13.19ms ± 1%  +0.58%  (p=0.000 n=44+46)
AddingFields/Zap.Sugar-12                   1.296µs ± 1%   1.310µs ± 2%  +1.09%  (p=0.000 n=43+43)
AddingFields/apex/log-12                    34.19µs ± 1%   34.31µs ± 1%  +0.35%  (p=0.000 n=45+45)
AddingFields/inconshreveable/log15-12       30.08µs ± 2%   30.07µs ± 2%    ~     (p=0.803 n=48+47)
AddingFields/sirupsen/logrus-12             6.683µs ± 3%   6.735µs ± 3%  +0.78%  (p=0.000 n=43+42)

[Geo mean]                                  143.5µs        143.3µs       -0.16%

Change-Id: I637203c75c837737f1febced75d5985703e51044
Reviewed-on: https://go-review.googlesource.com/114655
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
@bradfitz bradfitz modified the milestones: Go1.12, Go1.13 Nov 3, 2018
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@josharian
Copy link
Contributor

See also #36468 (comment), for an approach that can reduce the number of slow LEAs.

@andig
Copy link
Contributor

andig commented Oct 22, 2021

Wondering what's left to do here since #21735 (comment) has been merged. Fix the 16bit LEA?

@martisch
Copy link
Contributor Author

martisch commented Oct 22, 2021

Even for LEAQ and LEAL my CL doesnt manage to split all 3 operand LEA. There are some that had aux set and arent transformed.

@gopherbot
Copy link

Change https://go.dev/cl/440035 mentions this issue: cmd/compile: split 3 operand LEA in late lower pass

gopherbot pushed a commit that referenced this issue Oct 17, 2022
On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split
them to 2 LEA instructions in genssa.

This CL make late lower pass run after addressing modes, and split 3
operand LEA in late lower pass so that we can do common-subexpression
elimination for splited LEAs.

Updates #21735

Change-Id: Ied49139c7abab655e1a14a6fd793bdf9f987d1f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/440035
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split
them to 2 LEA instructions in genssa.

This CL make late lower pass run after addressing modes, and split 3
operand LEA in late lower pass so that we can do common-subexpression
elimination for splited LEAs.

Updates golang#21735

Change-Id: Ied49139c7abab655e1a14a6fd793bdf9f987d1f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/440035
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/482820 mentions this issue: cmd/compile: remove broken LEA "optimization"

gopherbot pushed a commit that referenced this issue Apr 7, 2023
CL 440035 added rewrite rules to simplify "costly" LEA
instructions, but the types in the rewrites were wrong and
the code would go bad if the wrong-typed register was spilled.

CL 482536 attempted to fix this by correcting the type in the
rewrite, but that "fix" broke something on windows-amd64-race.

Instead / for-now, remove the offending rewrite rules.

Updates #21735.
Fixes #59432.

Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/482820
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/482164 mentions this issue: cmd/compile: remove broken LEA "optimization"

gopherbot pushed a commit that referenced this issue Apr 24, 2023
CL 440035 added rewrite rules to simplify "costly" LEA
instructions, but the types in the rewrites were wrong and
the code would go bad if the wrong-typed register was spilled.

CL 482536 attempted to fix this by correcting the type in the
rewrite, but that "fix" broke something on windows-amd64-race.

Instead / for-now, remove the offending rewrite rules.

Updates #21735.
Updates #59432.
Fixes #59468.

Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/482820
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit 6a97a60)
Reviewed-on: https://go-review.googlesource.com/c/go/+/482164
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
bradfitz pushed a commit to tailscale/go that referenced this issue May 25, 2023
CL 440035 added rewrite rules to simplify "costly" LEA
instructions, but the types in the rewrites were wrong and
the code would go bad if the wrong-typed register was spilled.

CL 482536 attempted to fix this by correcting the type in the
rewrite, but that "fix" broke something on windows-amd64-race.

Instead / for-now, remove the offending rewrite rules.

Updates golang#21735.
Updates golang#59432.
Fixes golang#59468.

Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/482820
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit 6a97a60)
Reviewed-on: https://go-review.googlesource.com/c/go/+/482164
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
bradfitz pushed a commit to tailscale/go that referenced this issue May 25, 2023
CL 440035 added rewrite rules to simplify "costly" LEA
instructions, but the types in the rewrites were wrong and
the code would go bad if the wrong-typed register was spilled.

CL 482536 attempted to fix this by correcting the type in the
rewrite, but that "fix" broke something on windows-amd64-race.

Instead / for-now, remove the offending rewrite rules.

Updates golang#21735.
Updates golang#59432.
Fixes golang#59468.

Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/482820
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit 6a97a60)
Reviewed-on: https://go-review.googlesource.com/c/go/+/482164
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests