cmd/compile: performance regression in Bessel functions on AMD64 #16889

TocarIP · 2016-08-26T12:13:40Z

Comparing 1.6 vs 1.7 performance I see:
...
J0-4 57.0ns ± 0% 71.9ns ± 1% +26.17% (p=0.000 n=19+20)
J1-4 57.7ns ± 0% 71.6ns ± 0% +24.04% (p=0.000 n=20+19)
Jn-4 126ns ± 0% 153ns ± 0% +21.43% (p=0.000 n=20+20)

...
Y0-4 56.5ns ± 0% 70.8ns ± 0% +25.31% (p=0.000 n=19+19)
Y1-4 56.3ns ± 0% 70.8ns ± 0% +25.68% (p=0.000 n=20+20)
Yn-4 122ns ± 0% 149ns ± 0% +22.13% (p=0.000 n=20+19)

This is mainly due to time spent in the pzero/qzero/... functions.
A quick-and-dirty benchmark of pzero shows:

J0_3-4 8.49ns ± 0% 13.90ns ± 0% +63.72% (p=0.001 n=6+7)

Analysis shows that the main problem is due to:
var p [6]float64
...
p = p0R3
...
r := p[0] +...

Previously this resulted in loads directly from the global p0R3 array:

movsd 0x1f9d68(%rip),%xmm13 # 67bce0 <math.p0R3>
movsd 0x1f9d67(%rip),%xmm12 # 67bce8 <math.p0R3+0x8>
movsd 0x1f9d66(%rip),%xmm11 # 67bcf0 <math.p0R3+0x10>
movsd 0x1f9d65(%rip),%xmm10 # 67bcf8 <math.p0R3+0x18>
movsd 0x1f9d64(%rip),%xmm9 # 67bd00 <math.p0R3+0x20>
movsd 0x1f9d64(%rip),%xmm2 # 67bd08 <math.p0R3+0x28>

But with SSA we generate a DUFFCOPY to the stack and then load from the stack:
LEAQ "".p0R3(SB), SI
DUFFCOPY $854
...
MOVSD "".p(SP), X0
MOVSD "".p+8(SP), X2
MOVSD "".p+16(SP), X3
MOVSD "".p+24(SP), X4
MOVSD "".p+32(SP), X5
MOVSD "".p+40(SP), X6

Other code looks ~similar.
I've verified that replacing the local p with direct use of the global p0R3 in the Go code produces fast code for both 1.6 and 1.7.
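
For readers who want to see the two source shapes side by side, here is a minimal sketch. It is not the code from the math package: `coef`, `evalCopy` and `evalGlobal` are hypothetical names, and the coefficient values are made up. Only the shape (copy the global into a local array vs. index the global directly) matches the description above.

```go
package main

import "fmt"

// coef stands in for a package-level coefficient table such as math.p0R3.
// The values are invented for illustration.
var coef = [6]float64{1, 2, 3, 4, 5, 6}

// evalCopy mirrors the slow shape: the whole global array is first
// copied into a local, and the polynomial is evaluated from the copy.
func evalCopy(z float64) float64 {
	var p [6]float64
	p = coef
	return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}

// evalGlobal mirrors the fast shape: the polynomial is evaluated by
// indexing the global directly, with no local copy.
func evalGlobal(z float64) float64 {
	return coef[0] + z*(coef[1]+z*(coef[2]+z*(coef[3]+z*(coef[4]+z*coef[5]))))
}

func main() {
	fmt.Println(evalCopy(0.5), evalGlobal(0.5))
}
```

Both functions compute the same value; the difference reported in this issue is only in the generated code.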

ianlancetaylor · 2016-08-26T14:10:33Z

Redirecting to cmd/compile to see why it is generating slower code.

CC @randall77 @josharian

randall77 · 2016-08-26T16:45:07Z

Large object (> 4 word) copies don't have much in the way of optimization at the moment. This is one of those cases.

randall77 · 2016-08-29T20:51:06Z

This will be tricky to fix in the compiler. For instance, if I do

var x [6]int
func f() int { return x[3] * x[3] }
func g() int { y := x; return y[3] * y[3] }

f can read from x[3] twice and get different answers for each read. g is not allowed to do that; there is only one load from the global and thus g is guaranteed to return a square.
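
A small sketch of that difference, under one loud assumption: the concurrent writer is simulated by a hypothetical `mutate` callback invoked between the two reads, so the result is deterministic. `fReads` and `gReads` are illustrative names, not part of the original example.

```go
package main

import "fmt"

var x [6]int

// fReads reads the global twice, like f. A write to x between the two
// reads is visible to the second read.
func fReads(mutate func()) int {
	a := x[3]
	mutate() // stands in for a write from another goroutine
	b := x[3]
	return a * b
}

// gReads copies the global once, like g. Later writes to x cannot
// change the local copy, so both reads see the same value.
func gReads(mutate func()) int {
	y := x
	a := y[3]
	mutate() // the copy y is unaffected by this write
	b := y[3]
	return a * b
}

func main() {
	x[3] = 3
	fmt.Println(fReads(func() { x[3] = 5 })) // product of two different values

	x[3] = 3
	fmt.Println(gReads(func() { x[3] = 5 })) // always a square
}
```

This is why the compiler cannot simply turn the reads of y into reads of x: doing so would change g into f.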

Perhaps the Bessel functions should be modified to define p to be *[6]float64. I believe that would resolve the performance problem.
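
A rough sketch of what that pointer-based shape might look like. The table names `tabHigh` and `tabLow`, their values, the threshold, and the function `evalPtr` are all hypothetical; this is not the change that landed, only an illustration of selecting a table by pointer instead of copying it.

```go
package main

import "fmt"

// Two made-up coefficient tables standing in for globals like p0R3.
var tabHigh = [6]float64{1, 0, 0, 0, 0, 0}
var tabLow = [6]float64{0, 1, 0, 0, 0, 0}

// evalPtr selects a table by pointer. Assigning &tabHigh copies one
// word, not the 48-byte array, and p[i] indexes the global through
// the pointer.
func evalPtr(x float64) float64 {
	var p *[6]float64
	if x >= 8 { // threshold chosen for illustration only
		p = &tabHigh
	} else {
		p = &tabLow
	}
	z := 1 / (x * x)
	return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}

func main() {
	fmt.Println(evalPtr(10), evalPtr(2))
}
```

Note the semantic trade-off: reads through p observe the current contents of the global, so this only matches the copying version when the tables are never modified.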

TocarIP · 2016-08-30T13:09:11Z

Using *[6]float64 indeed helps:
1.6 vs 1.7 + suggestion
Y0-4 56.5ns ± 0% 53.9ns ± 0% -4.60%

gopherbot · 2016-08-30T16:00:34Z

CL https://golang.org/cl/28086 mentions this issue.

TocarIP changed the title ~~math: performance regression in Bessel functions~~ math: performance regression in Bessel functions on AMD64 Aug 26, 2016

ianlancetaylor changed the title ~~math: performance regression in Bessel functions on AMD64~~ cmd/compile: performance regression in Bessel functions on AMD64 Aug 26, 2016

ianlancetaylor added this to the Go1.8 milestone Aug 26, 2016

gopherbot closed this as completed in 2a2cab2 Aug 31, 2016

golang locked and limited conversation to collaborators Aug 31, 2017

gopherbot added the FrozenDueToAge label Aug 31, 2017
