cmd/compile: performance regression in Bessel functions on AMD64 #16889
Comments
Redirecting to cmd/compile to see why it is generating slower code.
Large object (> 4 word) copies don't have much in the way of optimization at the moment. This is one of those cases.
This will be tricky to fix in the compiler. For instance, if I do
Perhaps the Bessel functions should be modified to define
Using *[6]float64 indeed helps:
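To make the pointer suggestion concrete, here is a minimal sketch rather than the actual change to the math package. The tables `tabA` and `tabB`, their placeholder values, the threshold `x >= 8`, and the function names `evalCopy` and `evalPtr` are all assumptions made for illustration. The first function assigns a table by value, which copies six words into a local array; the second assigns only the address of the table.

```go
package main

import "fmt"

// Hypothetical coefficient tables. The values are placeholders chosen for
// easy arithmetic, not the real constants from the math package.
var tabA = [6]float64{1, 2, 3, 4, 5, 6}
var tabB = [6]float64{6, 5, 4, 3, 2, 1}

// evalCopy selects a table by value: the assignment copies all six
// float64 words into the local array p before the polynomial is evaluated.
func evalCopy(x float64) float64 {
	var p [6]float64
	if x >= 8 { // hypothetical threshold
		p = tabA
	} else {
		p = tabB
	}
	return p[0] + x*(p[1]+x*(p[2]+x*(p[3]+x*(p[4]+x*p[5]))))
}

// evalPtr selects a table by pointer: only an address is assigned, and the
// coefficients are read through it.
func evalPtr(x float64) float64 {
	var p *[6]float64
	if x >= 8 { // hypothetical threshold
		p = &tabA
	} else {
		p = &tabB
	}
	return p[0] + x*(p[1]+x*(p[2]+x*(p[3]+x*(p[4]+x*p[5]))))
}

func main() {
	// Both variants compute the same polynomial; only the way the
	// coefficient table is reached differs.
	for _, x := range []float64{0.5, 2, 10} {
		fmt.Println(x, evalCopy(x), evalPtr(x))
	}
}
```

Indexing through a `*[6]float64` uses the same `p[i]` syntax as indexing the array itself, so the polynomial expression does not need to change.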
CL https://golang.org/cl/28086 mentions this issue.
Comparing 1.6 vs 1.7 performance I see:
...
J0-4 57.0ns ± 0% 71.9ns ± 1% +26.17% (p=0.000 n=19+20)
J1-4 57.7ns ± 0% 71.6ns ± 0% +24.04% (p=0.000 n=20+19)
Jn-4 126ns ± 0% 153ns ± 0% +21.43% (p=0.000 n=20+20)
...
Y0-4 56.5ns ± 0% 70.8ns ± 0% +25.31% (p=0.000 n=19+19)
Y1-4 56.3ns ± 0% 70.8ns ± 0% +25.68% (p=0.000 n=20+20)
Yn-4 122ns ± 0% 149ns ± 0% +22.13% (p=0.000 n=20+19)
This is mainly due to time spent in pzero/qzero/... functions.
Quick and dirty benchmark of pzero shows:
J0_3-4 8.49ns ± 0% 13.90ns ± 0% +63.72% (p=0.001 n=6+7)
Analysis shows that the main problem is this pattern:
var p [6]float64
...
p = p0R3
...
r := p[0] +...
Previously this resulted in loads directly from the global p0R3 array:
movsd 0x1f9d68(%rip),%xmm13 # 67bce0 <math.p0R3>
movsd 0x1f9d67(%rip),%xmm12 # 67bce8 <math.p0R3+0x8>
movsd 0x1f9d66(%rip),%xmm11 # 67bcf0 <math.p0R3+0x10>
movsd 0x1f9d65(%rip),%xmm10 # 67bcf8 <math.p0R3+0x18>
movsd 0x1f9d64(%rip),%xmm9 # 67bd00 <math.p0R3+0x20>
movsd 0x1f9d64(%rip),%xmm2 # 67bd08 <math.p0R3+0x28>
But with SSA we generate a DUFFCOPY to the stack and then load from the stack:
LEAQ "".p0R3(SB), SI
DUFFCOPY $854
...
MOVSD "".p(SP), X0
MOVSD "".p+8(SP), X2
MOVSD "".p+16(SP), X3
MOVSD "".p+24(SP), X4
MOVSD "".p+32(SP), X5
MOVSD "".p+40(SP), X6
Other code looks ~similar.
I've verified that replacing the local p with direct use of the global p0R3 in the Go code produces fast code for both 1.6 and 1.7.
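One way to repeat this comparison outside the math package is a small standalone benchmark. The sketch below is an assumption-laden illustration, not the benchmark used above: the tables `c0` and `c1`, their placeholder values, the threshold, and the names `viaLocal` and `viaGlobal` are all hypothetical. It relies on `testing.Benchmark`, which can be called from an ordinary program. Results depend on the Go version and the machine, so no expected timings are shown.

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical coefficient tables with placeholder values.
var c0 = [6]float64{1, 2, 3, 4, 5, 6}
var c1 = [6]float64{6, 5, 4, 3, 2, 1}

// sink keeps the compiler from discarding the benchmarked calls.
var sink float64

// viaLocal copies the selected table into a local array first.
//
//go:noinline
func viaLocal(x float64) float64 {
	var p [6]float64
	if x >= 8 { // hypothetical threshold
		p = c0
	} else {
		p = c1
	}
	return p[0] + x*(p[1]+x*(p[2]+x*(p[3]+x*(p[4]+x*p[5]))))
}

// viaGlobal reads the coefficients straight from the package-level arrays.
//
//go:noinline
func viaGlobal(x float64) float64 {
	if x >= 8 { // hypothetical threshold
		return c0[0] + x*(c0[1]+x*(c0[2]+x*(c0[3]+x*(c0[4]+x*c0[5]))))
	}
	return c1[0] + x*(c1[1]+x*(c1[2]+x*(c1[3]+x*(c1[4]+x*c1[5]))))
}

func main() {
	local := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += viaLocal(0.5)
		}
	})
	global := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += viaGlobal(0.5)
		}
	})
	fmt.Println("local copy:   ", local)
	fmt.Println("direct global:", global)
}
```

Building the same file with two Go releases and comparing the two lines gives a rough picture of whether the copy to the stack is still being emitted; inspecting the output of `go build -gcflags=-S` shows the generated instructions directly.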