
cmd/compile: performance regression in Bessel functions on AMD64 #16889

Closed · TocarIP opened this issue Aug 26, 2016 · 5 comments · Milestone: Go1.8

TocarIP commented Aug 26, 2016

Comparing 1.6 vs 1.7 performance I see:
...
J0-4 57.0ns ± 0% 71.9ns ± 1% +26.17% (p=0.000 n=19+20)
J1-4 57.7ns ± 0% 71.6ns ± 0% +24.04% (p=0.000 n=20+19)
Jn-4 126ns ± 0% 153ns ± 0% +21.43% (p=0.000 n=20+20)

...
Y0-4 56.5ns ± 0% 70.8ns ± 0% +25.31% (p=0.000 n=19+19)
Y1-4 56.3ns ± 0% 70.8ns ± 0% +25.68% (p=0.000 n=20+20)
Yn-4 122ns ± 0% 149ns ± 0% +22.13% (p=0.000 n=20+19)

This is mainly due to time spent in the pzero/qzero/... functions.
A quick-and-dirty benchmark of pzero shows:

J0_3-4 8.49ns ± 0% 13.90ns ± 0% +63.72% (p=0.001 n=6+7)

Analysis shows that the main problem is due to:
var p [6]float64
...
p = p0R3
...
r := p[0] +...
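
For context, a minimal self-contained sketch of the pattern (not the actual math package source: the table values and the pzeroSketch name are placeholders) is:

// Hypothetical coefficient table standing in for math's p0R3; the values are placeholders.
var p0R3 = [6]float64{0.1, 0.2, 0.3, 0.4, 0.5, 0.6}

// pzeroSketch mirrors the slow pattern: copy the table into a local array,
// then evaluate the polynomial from the copy.
func pzeroSketch(x float64) float64 {
    var p [6]float64
    p = p0R3 // 48-byte array copy; with the SSA backend this becomes the duffcopy shown below
    z := 1 / (x * x)
    return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}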

Previously this resulted in loads directly from the global p0R3 array:

movsd 0x1f9d68(%rip),%xmm13 # 67bce0 <math.p0R3>
movsd 0x1f9d67(%rip),%xmm12 # 67bce8 <math.p0R3+0x8>
movsd 0x1f9d66(%rip),%xmm11 # 67bcf0 <math.p0R3+0x10>
movsd 0x1f9d65(%rip),%xmm10 # 67bcf8 <math.p0R3+0x18>
movsd 0x1f9d64(%rip),%xmm9 # 67bd00 <math.p0R3+0x20>
movsd 0x1f9d64(%rip),%xmm2 # 67bd08 <math.p0R3+0x28>

But with SSA we generate a duffcopy to the stack and then load from the stack:
LEAQ "".p0R3(SB), SI
DUFFCOPY $854
...
MOVSD "".p(SP), X0
MOVSD "".p+8(SP), X2
MOVSD "".p+16(SP), X3
MOVSD "".p+24(SP), X4
MOVSD "".p+32(SP), X5
MOVSD "".p+40(SP), X6

The other affected code looks similar.
I've verified that replacing the local p with direct use of the global p0R3 in the Go code produces fast code on both 1.6 and 1.7.
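
A sketch of that workaround, using the same placeholder table and names as above:

// pzeroDirect indexes the global table directly instead of copying it,
// so each coefficient is loaded straight from the package-level data.
func pzeroDirect(x float64) float64 {
    z := 1 / (x * x)
    return p0R3[0] + z*(p0R3[1]+z*(p0R3[2]+z*(p0R3[3]+z*(p0R3[4]+z*p0R3[5]))))
}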

@TocarIP TocarIP changed the title math: performance regression in Bessel functions math: performance regression in Bessel functions on AMD64 Aug 26, 2016
@ianlancetaylor ianlancetaylor changed the title math: performance regression in Bessel functions on AMD64 cmd/compile: performance regression in Bessel functions on AMD64 Aug 26, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone Aug 26, 2016
@ianlancetaylor (Contributor)

Redirecting to cmd/compile to see why it is generating slower code.

CC @randall77 @josharian

@randall77 (Contributor)

Large object (> 4 word) copies don't have much in the way of optimization at the moment. This is one of those cases.

@randall77 (Contributor)

This will be tricky to fix in the compiler. For instance, if I do

var x [6]int
func f() int { return x[3] * x[3] }
func g() int { y := x; return y[3] * y[3] }

f can read x[3] twice and get a different answer for each read (for example, if another goroutine writes to x concurrently). g is not allowed to do that: there is only one load from the global, so g is guaranteed to return a square.

Perhaps the Bessel functions should be modified to define p to be *[6]float64. I believe that would resolve the performance problem.
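
A sketch of that suggestion, again with placeholder tables and names: p becomes a pointer to one of the global arrays, so assigning it copies 8 bytes instead of 48, and the loads go through the pointer.

// Second hypothetical table, so the selection between ranges is visible.
var p0R8 = [6]float64{0.6, 0.5, 0.4, 0.3, 0.2, 0.1}

// pzeroPtr follows the *[6]float64 suggestion: p points at one of the global
// tables instead of holding a copy of one.
func pzeroPtr(x float64) float64 {
    var p *[6]float64
    if x >= 8 {
        p = &p0R8
    } else {
        p = &p0R3
    }
    z := 1 / (x * x)
    return p[0] + z*(p[1]+z*(p[2]+z*(p[3]+z*(p[4]+z*p[5]))))
}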

@TocarIP (Contributor, issue author) commented Aug 30, 2016

Using *[6]float64 indeed helps:
1.6 vs 1.7 + suggestion
Y0-4 56.5ns ± 0% 53.9ns ± 0% -4.60%

@gopherbot

CL https://golang.org/cl/28086 mentions this issue.

golang locked and limited conversation to collaborators Aug 31, 2017