cmd/compile: SSA performance inconsistency/regression difference across amd64 CPUs. #16982
Comments
Additional findings
If I rewrite the body of term as:
return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))
Then computer A performance becomes ~2x slower with SSA enabled, and computer B becomes ~4x slower.
However, if I simply move that sign factor to the front:
return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
Then suddenly ssa=1 becomes as performant as ssa=0 on both computer A and B (or rather, ssa=0 becomes much slower). Are these kinds of fluctuations normal and expected, or is this unexpected and caused by an unintended bug?
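To rule out the possibility that the three formulations compute different values, here is a minimal sketch that compares them bit for bit. The names termBranch, termTrailing, termLeading and firstMismatch are made up for this example and are not part of the posted program. It assumes k >= 0, as in the benchmark loop; for negative k the sign factor 1-2*(k%2) would no longer be ±1.

```go
package main

import "fmt"

// termBranch is the if/else form from the posted program.
func termBranch(k int32) float64 {
	if k%2 == 0 {
		return 4 / (2*float64(k) + 1)
	} else {
		return -4 / (2*float64(k) + 1)
	}
}

// termTrailing applies the sign factor after the division.
func termTrailing(k int32) float64 {
	return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))
}

// termLeading applies the sign factor before the division.
func termLeading(k int32) float64 {
	return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// firstMismatch returns the first k in [0, n] for which the three forms
// disagree, or -1 if they agree everywhere in that range.
func firstMismatch(n int32) int32 {
	for k := int32(0); k <= n; k++ {
		b := termBranch(k)
		if b != termTrailing(k) || b != termLeading(k) {
			return k
		}
	}
	return -1
}

func main() {
	fmt.Println("first mismatch:", firstMismatch(1000000))
}
```

Multiplying by +1 or -1 is exact in floating point, so any timing difference between the forms should come from the generated code rather than from the arithmetic.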
Strange. Please try building both binaries beforehand, and alternate running+timing them. Might also be worth trying the same binaries on both machines, just to make sure it's really the same code.
That's not unexpected, I can only reproduce this on the 2016 12" Retina MacBook. Notably, it has an ultra low voltage Core m3-6Y30 CPU without a fan. It does not happen on this 2011 MBP, nor on a 2015 iMac that I tried it on (but didn't mention above).
I've tried building both binaries on the 2011 MBP just now, copied and ran them interchangeably on the 12" MacBook, and got the same results.
2011 MBP (where the binaries were built beforehand):
2016 12" Retina MacBook (copied the binaries that were built on the 2011 MBP):
If I run a CPU + GPU intensive load on the 2016 12" Retina MacBook in the background, just to make sure the CPU is both clocked high, and not throttling due to load bursts, etc., I can still see the same effect. Both times get around 30% slower because of the background load.
I suspect it'd be a good idea to look at the generated assembly difference for
@shurcooL can you upload code somewhere? I get Forbidden from playground.
@TocarIP, would pasting the code here work for you? Here it is, taken from https://play.golang.org/p/aAM1SuV6U4:
// Play with benchmarking a tight loop with many iterations and a func call,
// compare gc vs GopherJS performance.
//
// An alternative more close-to-metal implementation that doesn't use math.Pow.
//
// Disclaimer: This is a microbenchmark and is very poorly representative of
// overall general real world performance of larger applications.
//
package main

import (
	"fmt"
	"time"
)

func term(k int32) float64 {
	//       | Computer A   | Computer B
	// ------|--------------|-------------
	// SSA=0 | 6.431564409s | 2.564973583s
	// SSA=1 | 6.420316364s | 5.771555271s
	if k%2 == 0 {
		return 4 / (2*float64(k) + 1)
	} else {
		return -4 / (2*float64(k) + 1)
	}

	//       | Computer A    | Computer B
	// ------|---------------|--------------
	// SSA=0 | 6.703508163s  | 3.067721176s
	// SSA=1 | 11.302787342s | 14.389823165s
	//return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

	//       | Computer A    | Computer B
	// ------|---------------|--------------
	// SSA=0 | 10.206198046s | 12.165551807s
	// SSA=1 | 9.072183098s  | 10.452724107s
	//return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// pi performs n iterations to compute an approximation of pi.
func pi(n int32) float64 {
	f := 0.0
	for k := int32(0); k <= n; k++ {
		f += term(k)
	}
	return f
}

func main() {
	// Start measuring time from now.
	started := time.Now()

	const n = 1000 * 1000 * 1000
	fmt.Printf("approximating pi with %v iterations.\n", n)
	fmt.Println(pi(n))

	fmt.Printf("total time taken is: %v\n", time.Since(started))
}
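For anyone who wants to compare the three forms of term without editing comments in and out, the following is a rough sketch that times them side by side using testing.Benchmark from the standard library. The function names (termBranch, piBranch, and so on), the sink variable, and the iteration count of 1,000,000 per call are choices made for this sketch, not part of the program above. Because the loops are restructured, the compiler may generate different code than it does for the original program, so treat the numbers as indicative only.

```go
package main

import (
	"fmt"
	"testing"
)

func termBranch(k int32) float64 {
	if k%2 == 0 {
		return 4 / (2*float64(k) + 1)
	} else {
		return -4 / (2*float64(k) + 1)
	}
}

func termTrailing(k int32) float64 {
	return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))
}

func termLeading(k int32) float64 {
	return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// One summing loop per variant, so each calls its term function directly.
func piBranch(n int32) float64 {
	f := 0.0
	for k := int32(0); k <= n; k++ {
		f += termBranch(k)
	}
	return f
}

func piTrailing(n int32) float64 {
	f := 0.0
	for k := int32(0); k <= n; k++ {
		f += termTrailing(k)
	}
	return f
}

func piLeading(n int32) float64 {
	f := 0.0
	for k := int32(0); k <= n; k++ {
		f += termLeading(k)
	}
	return f
}

// sink keeps the compiler from discarding the benchmarked results.
var sink float64

func main() {
	variants := []struct {
		name string
		pi   func(int32) float64
	}{
		{"branch", piBranch},
		{"trailing", piTrailing},
		{"leading", piLeading},
	}
	for _, v := range variants {
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = v.pi(1000000)
			}
		})
		fmt.Printf("%-9s %s\n", v.name, res)
	}
}
```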
Thanks. If I change xmm3 into xmm5, performance improves by 2x (gets back to the Go 1.6 level).
By introducing a register allocator heuristic, it is possible to avoid writing to the same register:
CL https://golang.org/cl/28874 mentions this issue.
CL https://golang.org/cl/31490 mentions this issue.
Disclaimer: This is not necessarily an issue, I'm opening this thread to provide information that I hope may be helpful. It contains a microbenchmark which is not representative of real world performance, just a tiny subset. But there's something unusual/strange about it, which is why I think there's a chance this might be helpful and I'm reporting this. Please close if it's not helpful and nothing needs to be done.
I had a little microbenchmark snippet I used previously to compare gc and GopherJS performance, and I decided to try it on the SSA backend of Go 1.7. I found a surprise where one amd64 computer behaves very differently to all others I've tried it on, and I'm wondering if it's caused by an unintended bug somewhere or not.
What version of Go are you using (go version)?
What operating system and processor architecture are you using (go env)?
What did you do?
I ran the following program with and without the SSA backend on two different computers (both have amd64 CPU architecture).
https://play.golang.org/p/aAM1SuV6U4
Computer A is a MacBook Pro (15-inch, Late 2011), running OS X 10.11.6, with 2.4 GHz Intel Core i7-2760QM CPU @ 2.40GHz x 8.
Computer B is a MacBook (Retina, 12-inch, Early 2016), running OS X 10.11.6, with 1.1 GHz Intel Core m3-6Y30 CPU @ 0.90GHz x 4.
(There is a variance of about ±5% between individual runs.)
What did you expect to see?
Given that the SSA backend generated code that performed roughly equally well on computer A, I expected it to have a similar result on computer B.
What did you see instead?
Instead, I saw that on computer B (but not on computer A) enabling SSA reduces the performance by a factor of more than two.