TocarIP changed the title from "math: performance regression in Bessel functions" to "math: performance regression in Bessel functions on AMD64" (Aug 26, 2016)
ianlancetaylor changed the title from "math: performance regression in Bessel functions on AMD64" to "cmd/compile: performance regression in Bessel functions on AMD64" (Aug 26, 2016)
This will be tricky to fix in the compiler. For instance, if I do
var x [6]int
func f() int { return x[3] * x[3] }
func g() int { y := x; return y[3] * y[3] }
f can read from x[3] twice and get different answers for each read. g is not allowed to do that; there is only one load from the global and thus g is guaranteed to return a square.
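The difference between f and g can be made visible with a small sketch. The `mutate` callback below is a hypothetical hook that stands in for a concurrent writer changing x between the two reads; it is not part of the original example.

```go
package main

import "fmt"

// x is a package-level array, as in the example above.
var x = [6]int{1, 2, 3, 4, 5, 6}

// squareOfGlobal reads x[3] twice, like f. If x changes between the
// reads, the two factors differ.
func squareOfGlobal(mutate func()) int {
	a := x[3]
	mutate() // hypothetical hook simulating a writer between the reads
	b := x[3]
	return a * b
}

// squareOfCopy snapshots x first, like g. Array assignment copies all
// six elements, so later writes to x cannot affect y.
func squareOfCopy(mutate func()) int {
	y := x
	a := y[3]
	mutate()
	b := y[3]
	return a * b
}

func main() {
	bump := func() { x[3]++ }
	fmt.Println(squareOfCopy(bump)) // 16: both reads see the snapshot value 4
	x[3] = 4
	fmt.Println(squareOfGlobal(bump)) // 20: the reads see 4 and then 5
}
```

This is why the compiler cannot simply turn the loads from the local copy into loads from the global: it would have to prove that nothing writes to the global in between.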
Perhaps the Bessel functions should be modified to define p to be *[6]float64. I believe that would resolve the performance problem.
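A minimal sketch of that suggestion, assuming a helper that selects one of several coefficient tables and evaluates a polynomial. The tables `tableLarge` and `tableSmall`, the threshold, and both function names are hypothetical; the real tables in the math package hold different values.

```go
package main

import "fmt"

// Hypothetical coefficient tables standing in for globals such as p0R3.
var tableLarge = [6]float64{1, 2, 3, 4, 5, 6}
var tableSmall = [6]float64{6, 5, 4, 3, 2, 1}

// evalCopy mirrors the reported pattern: the selected table is copied
// into a local array before its elements are read.
func evalCopy(z float64) float64 {
	var p [6]float64
	if z >= 8 {
		p = tableLarge
	} else {
		p = tableSmall
	}
	return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}

// evalPtr selects the table through a *[6]float64 instead, so no array
// copy is requested; p[i] indexes through the pointer automatically.
func evalPtr(z float64) float64 {
	var p *[6]float64
	if z >= 8 {
		p = &tableLarge
	} else {
		p = &tableSmall
	}
	return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}

func main() {
	fmt.Println(evalCopy(1), evalPtr(1)) // both print 21
}
```

The pointer version reads the global directly, so it would observe any later write to the table. That is acceptable only when the tables are never modified after initialization, which is the assumption here.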
Comparing 1.6 vs 1.7 performance I see:
...
J0-4    57.0ns ± 0%    71.9ns ± 1%    +26.17%    (p=0.000 n=19+20)
J1-4    57.7ns ± 0%    71.6ns ± 0%    +24.04%    (p=0.000 n=20+19)
Jn-4     126ns ± 0%     153ns ± 0%    +21.43%    (p=0.000 n=20+20)
...
Y0-4    56.5ns ± 0%    70.8ns ± 0%    +25.31%    (p=0.000 n=19+19)
Y1-4    56.3ns ± 0%    70.8ns ± 0%    +25.68%    (p=0.000 n=20+20)
Yn-4     122ns ± 0%     149ns ± 0%    +22.13%    (p=0.000 n=20+19)
Most of the regression comes from time spent in the pzero/qzero/... helper functions.
A quick-and-dirty benchmark of pzero shows:
J0_3-4 8.49ns ± 0% 13.90ns ± 0% +63.72% (p=0.001 n=6+7)
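A micro-benchmark of this kind could be sketched as follows. Since math.pzero is unexported, `poly` is a hypothetical stand-in with made-up coefficients, and `testing.Benchmark` is used so the measurement runs from an ordinary program; this is not the benchmark that produced the numbers above.

```go
package main

import (
	"fmt"
	"testing"
)

// coeffs is a hypothetical table; the real math tables hold other values.
var coeffs = [6]float64{1, 2, 3, 4, 5, 6}

// sink keeps the compiler from discarding the benchmarked call.
var sink float64

// poly is a hypothetical stand-in for pzero: it copies the table into a
// local array and evaluates a degree-5 polynomial with Horner's rule.
func poly(z float64) float64 {
	p := coeffs
	return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}

// measure runs poly in a standard benchmark loop and returns the result.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = poly(0.5)
		}
	})
}

func main() {
	r := measure()
	fmt.Printf("poly: %d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```

Building the same file with two toolchain versions and comparing the reported ns/op gives a rough before/after figure; absolute numbers will vary by machine.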
Analysis shows that the main problem is this pattern:
var p [6]float64
...
p = p0R3
...
r := p[0] +...
Previously this compiled to loads directly from the global p0R3 array:
movsd 0x1f9d68(%rip),%xmm13 # 67bce0 <math.p0R3>
movsd 0x1f9d67(%rip),%xmm12 # 67bce8 <math.p0R3+0x8>
movsd 0x1f9d66(%rip),%xmm11 # 67bcf0 <math.p0R3+0x10>
movsd 0x1f9d65(%rip),%xmm10 # 67bcf8 <math.p0R3+0x18>
movsd 0x1f9d64(%rip),%xmm9 # 67bd00 <math.p0R3+0x20>
movsd 0x1f9d64(%rip),%xmm2 # 67bd08 <math.p0R3+0x28>
But with SSA we generate a DUFFCOPY to the stack and then load from the stack:
LEAQ "".p0R3(SB), SI
DUFFCOPY $854
...
MOVSD "".p(SP), X0
MOVSD "".p+8(SP), X2
MOVSD "".p+16(SP), X3
MOVSD "".p+24(SP), X4
MOVSD "".p+32(SP), X5
MOVSD "".p+40(SP), X6
The other code looks roughly similar.
I've verified that replacing the local p with direct use of the global p0R3 in the Go code produces fast code on both 1.6 and 1.7.