
cmd/compile: SSA performance inconsistency/regression difference across amd64 CPUs. #16982

Closed
dmitshur opened this issue Sep 3, 2016 · 9 comments
Labels
FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)
Milestone

Comments

@dmitshur
Contributor

dmitshur commented Sep 3, 2016

Disclaimer: This is not necessarily an issue; I'm opening this thread to provide information that I hope may be helpful. It contains a microbenchmark, which is not representative of real-world performance, only a tiny subset of it. But there's something unusual about the results, which is why I think there's a chance this might be helpful and I'm reporting it. Please close if it's not helpful and nothing needs to be done.

I had a little microbenchmark snippet I previously used to compare gc and GopherJS performance, and I decided to try it with the SSA backend of Go 1.7. To my surprise, one amd64 computer behaves very differently from all the others I've tried it on, and I'm wondering whether this is caused by an unintended bug somewhere.

What version of Go are you using (go version)?

go version go1.7 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Dmitri/Dropbox/Work/2013/GoLanding:/Users/Dmitri/Dropbox/Work/2013/GoLand"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g3/yrvdj9f55ll7jy7l5ygz82yc0000gn/T/go-build084371455=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

I ran the following program with and without the SSA backend on two different computers (both have amd64 CPUs).

https://play.golang.org/p/aAM1SuV6U4

Computer A is a MacBook Pro (15-inch, Late 2011), running OS X 10.11.6, with 2.4 GHz Intel Core i7-2760QM CPU @ 2.40GHz x 8.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.431564409s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.420316364s

Computer B is a MacBook (Retina, 12-inch, Early 2016), running OS X 10.11.6, with 1.1 GHz Intel Core m3-6Y30 CPU @ 0.90GHz x 4.

$ go build -gcflags="-ssa=0" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.564973583s

$ go build -gcflags="-ssa=1" -o /tmp/o && /tmp/o
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.771555271s

(There is a variance of about ±5% between individual runs.)

      | Computer A   | Computer B
------|--------------|-------------
SSA=0 | 6.431564409s | 2.564973583s
SSA=1 | 6.420316364s | 5.771555271s

What did you expect to see?

Given that the SSA backend generated code that performed roughly equally well on computer A, I expected it to produce a similar result on computer B.

What did you see instead?

Instead, I saw that on computer B (but not on computer A) enabling SSA reduces performance by a factor of more than two.

@dmitshur
Contributor Author

dmitshur commented Sep 3, 2016

Additional findings

If I rewrite term to avoid an if branch, into this form:

return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

Then computer A performance becomes ~2x slower with SSA enabled, and computer B becomes ~4x slower.

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 6.703508163s  | 3.067721176s
SSA=1 | 11.302787342s | 14.389823165s

However, if I simply move that float64(1-2*(k%2)) multiplier to the front of the expression like this:

return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)

      | Computer A    | Computer B
------|---------------|--------------
SSA=0 | 10.206198046s | 12.165551807s
SSA=1 | 9.072183098s  | 10.452724107s

Then suddenly ssa=1 becomes just as performant as ssa=0 on both computers A and B (or rather, ssa=0 becomes much slower).

Are these kinds of fluctuations normal and expected, or is this unexpected and caused by an unintended bug?

@randall77
Contributor

Strange.
I cannot reproduce on my desktop.
Timing experiments can be tricky on laptops, with frequency scaling and all that other garbage. Maybe the SSA compiler heats up the chip so the chip scales back the clock frequency during the benchmark.

Please try building both binaries beforehand, and alternate running+timing them. Might also be worth trying the same binaries on both machines, just to make sure it's really the same code.
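
For example (a minimal sketch using the flags already shown above; the ssa0/ssa1 names match the binaries used in the runs below):

$ go build -gcflags="-ssa=0" -o ssa0
$ go build -gcflags="-ssa=1" -o ssa1
$ for i in 1 2 3 4 5; do ./ssa0; ./ssa1; done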

@dmitshur
Contributor Author

dmitshur commented Sep 3, 2016

I cannot reproduce on my desktop.

That's not unexpected; I can only reproduce this on the 2016 12" Retina MacBook. Notably, it has an ultra-low-voltage Core m3-6Y30 CPU and no fan. It does not happen on this 2011 MBP, nor on a 2015 iMac that I tried it on (but didn't mention above).

I built both binaries on the 2011 MBP just now, then copied them to the 12" MacBook and ran them alternately on both machines; I got the same results.

2011 MBP (where the binaries were built beforehand):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.491640868s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.483156137s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.42614708s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.449560988s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.463344901s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.455693056s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.441013487s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 7.03565282s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.977611407s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 6.725358761s

2016 12" Retina MacBook (copied the binaries that were built on the 2011 MBP):

$ ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1 && ./ssa0 && ./ssa1
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.596706378s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.863147823s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.641448078s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.809569822s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.575924752s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.761018202s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.592914719s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.98196503s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 2.665577762s
approximating pi with 1000000000 iterations.
3.1415926545880506
total time taken is: 5.813804654s

If I run a CPU- and GPU-intensive load in the background on the 2016 12" Retina MacBook, just to make sure the CPU is clocked high and not throttling due to load bursts, I can still see the same effect. Both times get around 30% slower because of the background load.

[screenshot: screen shot 2016-09-03 at 2 46 31 pm]

I suspect it'd be a good idea to look at the generated assembly difference for term with ssa=0 and ssa=1.
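
One way to get that (a sketch; go tool objdump and the compiler's -S flag are standard tools, and main.pi is where term is likely inlined):

$ go build -gcflags="-ssa=0" -o ssa0 && go tool objdump -s 'main\.pi' ssa0 > pi-ssa0.txt
$ go build -gcflags="-ssa=1" -o ssa1 && go tool objdump -s 'main\.pi' ssa1 > pi-ssa1.txt

and then compare the two dumps; the offsets will differ, but the instruction sequences in the loop are easy to compare by eye. Alternatively, go build -gcflags="-ssa=1 -S" makes the compiler print the assembly it generates.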

@quentinmit quentinmit added this to the Go1.8 milestone Sep 6, 2016
@TocarIP
Contributor

TocarIP commented Sep 7, 2016

@shurcooL, can you upload the code somewhere? I get Forbidden from the playground.
The 2016 12" Retina MacBook has a Skylake CPU, so I'd like to try to reproduce this on my Skylake desktop.

@dmitshur
Contributor Author

dmitshur commented Sep 7, 2016

@TocarIP, would pasting the code here work for you? Here it is, taken from https://play.golang.org/p/aAM1SuV6U4:

// Play with benchmarking a tight loop with many iterations and a func call,
// compare gc vs GopherJS performance.
//
// An alternative more close-to-metal implementation that doesn't use math.Pow.
//
// Disclaimer: This is a microbenchmark and is very poorly representative of
//             overall general real world performance of larger applications.
//
package main

import (
    "fmt"
    "time"
)

func term(k int32) float64 {
    //       | Computer A   | Computer B
    // ------|--------------|-------------
    // SSA=0 | 6.431564409s | 2.564973583s
    // SSA=1 | 6.420316364s | 5.771555271s
    if k%2 == 0 {
        return 4 / (2*float64(k) + 1)
    } else {
        return -4 / (2*float64(k) + 1)
    }

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 6.703508163s  | 3.067721176s
    // SSA=1 | 11.302787342s | 14.389823165s
    //return 4 / (2*float64(k) + 1) * float64(1-2*(k%2))

    //       | Computer A    | Computer B
    // ------|---------------|--------------
    // SSA=0 | 10.206198046s | 12.165551807s
    // SSA=1 | 9.072183098s  | 10.452724107s
    //return float64(1-2*(k%2)) * 4 / (2*float64(k) + 1)
}

// pi performs n iterations to compute an approximation of pi.
func pi(n int32) float64 {
    f := 0.0
    for k := int32(0); k <= n; k++ {
        f += term(k)
    }
    return f
}

func main() {
    // Start measuring time from now.
    started := time.Now()

    const n = 1000 * 1000 * 1000
    fmt.Printf("approximating pi with %v iterations.\n", n)
    fmt.Println(pi(n))

    fmt.Printf("total time taken is: %v\n", time.Since(started))
}
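
For what it's worth, the same loop can also be wrapped in a standard Go benchmark, which makes repeated runs and comparisons (e.g. with benchstat) easier. A minimal sketch, in a hypothetical pi_test.go next to the program above:

package main

import "testing"

// sink keeps the result live so the compiler can't optimize the pi call away.
var sink float64

func BenchmarkPi(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Use fewer iterations per call than main does; b.N scales the total work.
        sink = pi(1000 * 1000)
    }
}

Run it with go test -bench=Pi -gcflags="-ssa=0" and again with -gcflags="-ssa=1" to compare the two backends.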

@TocarIP
Contributor

TocarIP commented Sep 8, 2016

Thanks.
I've tested 1.6 vs. tip.
On Haswell both versions have the same speed, but on Broadwell (sic!) and Skylake I was able to verify a 2x slowdown.
Looking at the asm, the main difference is in the handling of floating-point constants:
1.6 always reloads them, while 1.7 loads them once at the beginning of the pi function.
The slowdown comes from both paths (k%2 == 0 and k%2 == 1) writing to the same register.
In asm the loop looks like:
jne q2
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb817d(%rip),%xmm4 # 4b91c0 <$f64.4010000000000000>
divsd %xmm5,%xmm4
q4:
lea 0x1(%rbx),%ecx
addsd %xmm4,%xmm2
cmp %eax,%ecx
jle q3
q1:
movsd %xmm2,0x10(%rsp)
retq
q2:
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb8193(%rip),%xmm4 # 4b9200 <$f64.c010000000000000>
divsd %xmm3,%xmm4

If I change xmm3 to xmm5 in the first branch:
jne q2
cvtsi2sd %ebx,%xmm5
mulsd %xmm0,%xmm5
addsd %xmm1,%xmm5
movsd 0xb817d(%rip),%xmm4 # 4b91c0 <$f64.4010000000000000>
divsd %xmm5,%xmm4
q4:
lea 0x1(%rbx),%ecx
addsd %xmm4,%xmm2
cmp %eax,%ecx
jle q3
q1:
movsd %xmm2,0x10(%rsp)
retq
q2:
cvtsi2sd %ebx,%xmm3
mulsd %xmm0,%xmm3
addsd %xmm1,%xmm3
movsd 0xb8193(%rip),%xmm4 # 4b9200 <$f64.c010000000000000>
divsd %xmm3,%xmm4

Performance improves by 2x (it gets back to the 1.6 level).

@TocarIP
Contributor

TocarIP commented Sep 9, 2016

By introducing a register allocator heuristic, it is possible to avoid writing to the same register.
See https://go-review.googlesource.com/#/c/28874/ for a prototype.
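
For anyone who wants to try the prototype against a Go checkout, the usual Gerrit fetch recipe is roughly the following (the patchset number 1 is a guess; the CL page shows the current one, and make.bash needs an existing Go installation for bootstrap via GOROOT_BOOTSTRAP):

$ git clone https://go.googlesource.com/go && cd go
$ git fetch https://go.googlesource.com/go refs/changes/74/28874/1 && git checkout FETCH_HEAD
$ cd src && ./make.bash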

@gopherbot

CL https://golang.org/cl/28874 mentions this issue.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@gopherbot

CL https://golang.org/cl/31490 mentions this issue.

@golang golang locked and limited conversation to collaborators Dec 19, 2017