
cmd/compile: SSA performance inconsistency/regression difference across amd64 CPUs. #16982

Closed
dmitshur opened this issue Sep 3, 2016 · 9 comments
Labels
FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)
Milestone

Comments

@dmitshur
Contributor

dmitshur commented Sep 3, 2016

Disclaimer: This is not necessarily an issue; I'm opening this thread to provide information that I hope may be helpful. It contains a microbenchmark, which is not representative of real-world performance, only a tiny subset of it. But there's something unusual about the results, which is why I think there's a chance this might be helpful and I'm reporting it. Please close if it's not helpful and nothing needs to be done.

I had a little microbenchmark snippet I previously used to compare gc and GopherJS performance, and I decided to try it with the SSA backend of Go 1.7. To my surprise, one amd64 computer behaves very differently from all the others I've tried it on, and I'm wondering whether this is caused by an unintended bug somewhere.

What version of Go are you using (go version)?

go version go1.7 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Dmitri/Dropbox/Work/2013/GoLanding:/Users/Dmitri/Dropbox/Work/2013/GoLand"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g3/yrvdj9f55ll7jy7l5ygz82yc0000gn/T/go-build084371455=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

I ran the following program with and without the SSA backend on two different computers (both have amd64 CPUs).

https://play.golang.org/p/aAM1SuV6U4

Computer A is a MacBook Pro (15-inch, Late 2011), running OS X 10.11.6, with 2.4 GHz Intel Core i7-2760QM CPU @ 2.40GHz x 8.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.431564409s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.420316364s

Computer B is a MacBook (Retina, 12-inch, Early 2016), running OS X 10.11.6, with 1.1 GHz Intel Core m3-6Y30 CPU @ 0.90GHz x 4.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.564973583s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.771555271s

(There is a variance of about ±5% between individual runs.)

      | Computer A   | Computer B
------|--------------|-------------
SSA=0 | 6.431564409s | 2.564973583s
SSA=1 | 6.420316364s | 5.771555271s

What did you expect to see?

Given that the SSA backend generated code that performed roughly equally well on computer A, I expected it to produce a similar result on computer B.

What did you see instead?

Instead, I saw that on computer B (but not on computer A) enabling SSA reduces performance by a factor of more than two.

@dmitshur
Contributor Author

dmitshur commented Sep 3, 2016

Additional findings

If I rewrite term to avoid an if branch, into this form:

return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

Then computer A performance becomes ~2x slower with SSA enabled, and computer B becomes ~4x slower.

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 6.703508163s  | 3.067721176s
SSA=1 | 11.302787342s | 14.389823165s

However, if I simply move that float64(1-2*(k%2)) multiplier to the front of the expression like this:

return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 10.206198046s | 12.165551807s
SSA=1 | 9.072183098s  | 10.452724107s

Then suddenly ssa=1 becomes just as performant as ssa=0 on both computers A and B (or rather, ssa=0 becomes much slower).

Are these kinds of fluctuations normal and expected, or is this unexpected and caused by an unintended bug?

@randall77
Contributor

Strange.
I cannot reproduce on my desktop.
Timing experiments can be tricky on laptops, with frequency scaling and all that other garbage. Maybe the SSA compiler heats up the chip so the chip scales back the clock frequency during the benchmark.

Please try building both binaries beforehand, and alternate running+timing them. Might also be worth trying the same binaries on both machines, just to make sure it's really the same code.
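
For example (a minimal sketch using the flags already shown above; the ssa0/ssa1 names match the binaries used in the runs below):

$ go build -gcflags="-ssa=0" -o ssa0
$ go build -gcflags="-ssa=1" -o ssa1
$ for i in 1 2 3 4 5; do ./ssa0; ./ssa1; done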

@dmitshur
Contributor Author

dmitshur commented Sep 3, 2016

I cannot reproduce on my desktop.

That's not unexpected; I can only reproduce this on the 2016 12" Retina MacBook. Notably, it has an ultra-low-voltage Core m3-6Y30 CPU and no fan. It does not happen on this 2011 MBP, nor on a 2015 iMac that I tried it on (but didn't mention above).

I built both binaries on the 2011 MBP just now, then copied them to the 12" MacBook and ran them alternately on both machines; I got the same results.

2011 MBP (where the binaries were built beforehand):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.491640868s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.483156137s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.42614708s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.449560988s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.463344901s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.455693056s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.441013487s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 7.03565282s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.977611407s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.725358761s

2016 12" Retina MacBook (copied the binaries that were built on the 2011 MBP):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.596706378s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.863147823s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.641448078s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.809569822s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.575924752s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.761018202s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.592914719s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.98196503s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.665577762s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.813804654s

If I run a CPU- and GPU-intensive load in the background on the 2016 12" Retina MacBook, just to make sure the CPU is clocked high and not throttling due to load bursts, I can still see the same effect. Both times get around 30% slower because of the background load.

[screenshot: screen shot 2016-09-03 at 2 46 31 pm]

I suspect it'd be a good idea to look at the generated assembly difference for term with ssa=0 and ssa=1.
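
One way to get that (a sketch; go tool objdump and the compiler's -S flag are standard tools, and main.pi is where term is likely inlined):

$ go build -gcflags="-ssa=0" -o ssa0 && go tool objdump -s 'main\.pi' ssa0 > pi-ssa0.txt
$ go build -gcflags="-ssa=1" -o ssa1 && go tool objdump -s 'main\.pi' ssa1 > pi-ssa1.txt

and then compare the two dumps; the offsets will differ, but the instruction sequences in the loop are easy to compare by eye. Alternatively, go build -gcflags="-ssa=1 -S" makes the compiler print the assembly it generates.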

@quentinmit quentinmit added this to the Go1.8 milestone Sep 6, 2016
@TocarIP
Contributor

TocarIP commented Sep 7, 2016

@shurcooL, can you upload the code somewhere? I get Forbidden from the playground.
The 2016 12" Retina MacBook has a Skylake CPU, so I'd like to try to reproduce this on my Skylake desktop.

@dmitshur
Contributor Author

dmitshur commented Sep 7, 2016

@TocarIP, would pasting the code here work for you? Here it is, taken from https://play.golang.org/p/aAM1SuV6U4:

// Play with benchmarking a tight loop with many iterations and a func call,
// compare gc vs GopherJS performance.
//
// An alternative more close-to-metal implementation that doesn't use math.Pow.
//
// Disclaimer: This is a microbenchmark and is very poorly representative of
//             overall general real world performance of larger applications.
//
package main

import (
    "fmt"
    "time"
)

func term(k int32) float64 {
    //       | Computer A   | Computer B
    // ------|--------------|-------------
    // SSA=0 | 6.431564409s | 2.564973583s
    // SSA=1 | 6.420316364s | 5.771555271s
    if k%2 == 0 {
        return 4 / (2*float64(k) + 1)
    } else {
        return -4 / (2*float64(k) + 1)
    }

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 6.703508163s  | 3.067721176s
    // SSA=1 | 11.302787342s | 14.389823165s
    //return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 10.206198046s | 12.165551807s
    // SSA=1 | 9.072183098s  | 10.452724107s
    //return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// pi performs n iterations to compute an approximation of pi.
func pi(n int32) float64 {
    f := 0.0
    for k := int32(0); k <= n; k++ {
        f += term(k)
    }
    return f
}

func main() {
    // Start measuring time from now.
    started := time.Now()

    const n = 1000 * 1000 * 1000
    fmt.Printf("approximating pi with %v iterations.\n", n)
    fmt.Println(pi(n))

    fmt.Printf("total time taken is: %v\n", time.Since(started))
}
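
For what it's worth, the same loop can also be wrapped in a standard Go benchmark, which makes repeated runs and comparisons (e.g. with benchstat) easier. A minimal sketch, in a hypothetical pi_test.go next to the program above:

package main

import "testing"

// sink keeps the result live so the compiler can't optimize the pi call away.
var sink float64

func BenchmarkPi(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Use fewer iterations per call than main does; b.N scales the total work.
        sink = pi(1000 * 1000)
    }
}

Run it with go test -bench=Pi -gcflags="-ssa=0" and again with -gcflags="-ssa=1" to compare the two backends.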

@TocarIP
Contributor

TocarIP commented Sep 8, 2016

Thanks.
I've tested 1.6 vs. tip.
On Haswell both versions have the same speed, but on Broadwell (sic!) and Skylake I was able to verify a 2x slowdown.
Looking at the asm, the main difference is in the handling of floating-point constants:
1.6 always reloads them, while 1.7 loads them once at the beginning of the pi function.
The slowdown comes from both paths (k%2 == 0 and k%2 == 1) writing to the same register.
In asm the loop looks like:
jne q2
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb817d(%rip),%xmm4 # 4b91c0 <$f64.4010000000000000>
divsd %xmm5,%xmm4
q4:
lea 0x1(%rbx),%ecx
addsd %xmm4,%xmm2
cmp %eax,%ecx
jle q3
q1:
movsd %xmm2,0x10(%rsp)
retq
q2:
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb8193(%rip),%xmm4 # 4b9200 <$f64.c010000000000000>
divsd %xmm3,%xmm4

If I change xmm3 to xmm5 in the first branch:
jne q2
cvtsi2sd %ebx,%xmm5
mulsd %xmm0,%xmm5
addsd %xmm1,%xmm5
movsd 0xb817d(%rip),%xmm4 # 4b91c0 <$f64.4010000000000000>
divsd %xmm5,%xmm4
q4:
lea 0x1(%rbx),%ecx
addsd %xmm4,%xmm2
cmp %eax,%ecx
jle q3
q1:
movsd %xmm2,0x10(%rsp)
retq
q2:
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb8193(%rip),%xmm4 # 4b9200 <$f64.c010000000000000>
divsd %xmm3,%xmm4

Performance improves by 2x (it gets back to the 1.6 level).

@TocarIP
Contributor

TocarIP commented Sep 9, 2016

By introducing a register allocator heuristic, it is possible to avoid writing to the same register.
See https://go-review.googlesource.com/#/c/28874/ for a prototype.
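
For anyone who wants to try the prototype against a Go checkout, the usual Gerrit fetch recipe is roughly the following (the patchset number 1 is a guess; the CL page shows the current one, and make.bash needs an existing Go installation for bootstrap via GOROOT_BOOTSTRAP):

$ git clone https://go.googlesource.com/go && cd go
$ git fetch https://go.googlesource.com/go refs/changes/74/28874/1 && git checkout FETCH_HEAD
$ cd src && ./make.bash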

@gopherbot

CL https://golang.org/cl/28874 mentions this issue.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@gopherbot

CL https://golang.org/cl/31490 mentions this issue.

@golang golang locked and limited conversation to collaborators Dec 19, 2017