cmd/compile: go1.7beta2 performance regression #16141

samuel · 2016-06-21T21:38:47Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.6.2 darwin/amd64
go version go1.7beta2 darwin/amd64

What operating system and processor architecture are you using (go env)?

Intel i7-3540M (also tried on i7-2677M but didn't see the same regression)

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/30/6hyj4x_x783f12hmbcmn5tt00000gn/T/go-build183604977=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

https://play.golang.org/p/3WDEr-_QZR

Disassembly (-gcflags -S): https://gist.github.com/samuel/3053bafe149a0459322f6eeaf8bd5ae5

What did you expect to see?

Go 1.7beta2 the same performance or better than Go 1.6

What did you see instead?

benchmark old ns/op new ns/op delta
BenchmarkVScaleF32-4 2906 3801 +30.80%
BenchmarkVMaxF32-4 2682 3951 +47.32%

benchmark old MB/s new MB/s speedup
BenchmarkVScaleF32-4 1409.19 1077.37 0.76x
BenchmarkVMaxF32-4 1527.02 1036.64 0.68x

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2016-06-21T22:23:24Z

CC @randall77 @josharian

Marking as 1.8 because it is late in the release cycle, but I'm not opposed to a fix for 1.7 if it is safe.

josharian · 2016-06-22T14:31:21Z

Thanks for providing the disassembly.

I'm away from my computer for the week, so I can't confirm directly, but it looks like this is a register allocation issue. Looking at max, there's an unnecessary reg-reg mov in the loop in the SSA version. I'm surprised the impact is so big, but I'd guess it is because it creates a data dependency and prevents pipelining.

Also, I don't see why we are doing FP adjustments, since we don't call any functions (we only reference math.b; math.Inf is inlined). That's probably a bug, but it's not responsible for the slowdown. I'll file a separate issue next week.

This should wait for 1.8.

randall77 · 2016-08-31T21:27:28Z

I confess I have no idea what is going on here.
@TocarIP
1.6 disassembly:

    0x0045 00069 (tmp1.go:10)   MOVSS   (CX), X0
    0x0049 00073 (tmp1.go:11)   CMPQ    AX, DI
    0x004c 00076 (tmp1.go:11)   JCC $1, 103
    0x004e 00078 (tmp1.go:11)   LEAQ    (R9)(AX*4), BX
    0x0052 00082 (tmp1.go:11)   MULSS   X2, X0
    0x0056 00086 (tmp1.go:11)   MOVSS   X0, (BX)
    0x005a 00090 (tmp1.go:10)   ADDQ    $4, CX
    0x005e 00094 (tmp1.go:10)   INCQ    AX
    0x0061 00097 (tmp1.go:10)   CMPQ    AX, SI
    0x0064 00100 (tmp1.go:10)   JLT $0, 69

tip disassembly:

    0x0044 00068 (tmp1.go:10)   MOVSS   (BX), X1
    0x0048 00072 (tmp1.go:11)   MULSS   X0, X1
    0x004c 00076 (tmp1.go:11)   CMPQ    SI, DX
    0x004f 00079 (tmp1.go:11)   JCC $0, 99
    0x0051 00081 (tmp1.go:11)   MOVSS   X1, (CX)(SI*4)
    0x0056 00086 (tmp1.go:10)   ADDQ    $4, BX
    0x005a 00090 (tmp1.go:10)   INCQ    SI
    0x005d 00093 (tmp1.go:10)   CMPQ    SI, AX
    0x0060 00096 (tmp1.go:10)   JLT $0, 68

Yet tip is 28% slower. For no discernible reason I can see.

Max is even worse.
go1.6:

    0x004a 00074 (tmp1.go:17)   MOVSS   (AX), X2
    0x004e 00078 (tmp1.go:18)   UCOMISS X3, X2
    0x0051 00081 (tmp1.go:18)   JHI 106
    0x0053 00083 (tmp1.go:17)   ADDQ    $4, AX
    0x0057 00087 (tmp1.go:17)   INCQ    CX
    0x005a 00090 (tmp1.go:17)   CMPQ    CX, DX
    0x005d 00093 (tmp1.go:17)   JLT $0, 74

    0x006a 00106 (tmp1.go:19)   MOVSS   X2, X3
    0x006e 00110 (tmp1.go:17)   JMP 83

tip:

    0x0036 00054 (tmp1.go:17)   MOVSS   (CX), X1
    0x003a 00058 (tmp1.go:18)   UCOMISS X0, X1
    0x003d 00061 (tmp1.go:18)   JLS 94
    0x003f 00063 (tmp1.go:17)   ADDQ    $4, CX
    0x0043 00067 (tmp1.go:17)   INCQ    DX
    0x0046 00070 (tmp1.go:22)   MOVUPS  X1, X0
    0x0049 00073 (tmp1.go:17)   CMPQ    DX, AX
    0x004c 00076 (tmp1.go:17)   JLT $0, 54

    0x005e 00094 (tmp1.go:22)   MOVUPS  X0, X1
    0x0061 00097 (tmp1.go:17)   JMP 63

Tip has 1 extra reg->reg move, yet it is 2.3x slower! 2.3x people. You could compute Max twice using 1.6 and still have some time left over. What the heck is going on?
(I tried MOVUPS->MOVSS, didn't help.)

gopherbot · 2016-08-31T22:00:26Z

CL https://golang.org/cl/28273 mentions this issue.

TocarIP · 2016-09-01T14:27:57Z

I can't access playground link (Forbidden), so I couldn't build it myself and play with code/check perf counters.
At a glance it looks like tip version make ucomiss always depend on previous iteration (X0 is always modified)

rjammala · 2016-09-01T14:47:40Z

@TocarIP : See if you can access this:
https://gist.github.com/rjammala/e47e6458413c5e920f2fb712f6943213

Update #16141 Change-Id: I7d32c5cdc197d86491a67ea579fa16cb3d675b51 Reviewed-on: https://go-review.googlesource.com/28273 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>

TocarIP · 2016-09-05T16:40:45Z

Looks like VScale is fixed by c538795, so I focused on VMaxF32.

I've tried implementing maxss recognition as part of phiopt, but while it produces smaller code:

// 46bc06:
movss (%rcx),%xmm1
add $0x4,%rcx
inc %rdx
maxss %xmm1,%xmm0
cmp %rax,%rdx
jl 46bc06 <_/localdisk/itocar/go_tmp/golang/src.VMaxF32+0x36>
... // return

It executes with almost the same speed as tip:
name old time/op new time/op delta
VMaxF32-4 3.62µs ± 7% 3.40µs ± 0% -6.04% (p=0.000 n=20+20)

name old speed new speed delta
VMaxF32-4 1.13GB/s ± 7% 1.20GB/s ± 0% +6.32% (p=0.000 n=20+18)

But has lower IPC
2.01 insns per cycle vs 3.16 insns per cycle (according to perf stat)

Looks like in 1.7 case, as well as in maxss case, there is a true dependency on previous value of X1,
which needs to be updated on each iteration.
But in 1.6 due to branch being highly predictable ( < 0.1% mispredictions)
several iterations can execute (speculatively) in parallel,
Because only in unlikely case (new random float is greater) we need to update X3 (current max).

I've tried reducing size of array to 1 << 6.
This makes maxss ~= 1.6 > 1.7, but with only 60% gap between 1.6 and 1.7
This supports the hypothesis that 1.6 is faster on iterations where current max is not updated.

I'm not sure what should be done:
Implementing max recognition will speed up max for small array or just 2 elements,
and this what Gcc does. But this won't help this case much and I'm not sure that all architectures has max instruction. Unroll and vectorization may help hide latency in the future.

Btw b.SetBytes(benchSize) should be b.SetBytes(benchSize*4), if the intention is to measure MB/s instead of megafloat32/s

rasky · 2016-09-06T00:31:26Z

@TocarIP wouldn't be sufficient to make the compiler understand that the branch is unlikely rather than likely? It would generate a jump-ahead in the unlikely case (like Go 1.6 does), and that would in turn remove the extra MOV that causes the dependency.

TocarIP · 2016-09-06T15:42:56Z

@rasky do you mean something like __buitin_expect in GCC or pattern detection in compiler?
It could help against 1.7. But compared to maxss behavior is data (e. g sorted/sorted in reverse) and array size (maxss is faster for small arrays) dependant and I'm not convinced that this is the right choice.

rasky · 2016-09-06T17:42:55Z

I was thinking pattern detection in the compiler. If you can detect a min/max-loop and tune the likeliness of the branch, it would generate better code for the likely case, removing the dependency and improving the results. It would not be the BEST possible result, but it would be uniformly better than what Go 1.7 does and on par with Go 1.6. Exploring usage of maxss is then a separate possible optimization

rsc · 2016-10-21T03:36:24Z

@randall77, @TocarIP, is there anything left to do here?

randall77 · 2016-10-21T16:39:09Z

I don't see any obvious fix. We're getting bitten by partial register dependencies that are the result of unlucky choices by regalloc.
Punting to 1.9.

josharian · 2017-05-18T19:30:15Z

Reproduces at tip (638ebb0), although there are improvements from 1.7 to 1.8 to tip.

name \ time/op  17             18             tip
VScaleF32-8       2.97µs ± 2%    2.55µs ± 1%    2.24µs ± 1%
VMaxF32-8         3.90µs ± 1%    3.88µs ± 1%    3.85µs ± 1%

Single file reproduction code:

package dsp

import (
	"math"
	"math/rand"
	"testing"
)

func VScaleF32(input, output []float32, scale float32) {
	n := len(input)
	if len(output) < n {
		n = len(output)
	}
	for i, v := range input[:n] {
		output[i] = v * scale
	}
}

func VMaxF32(input []float32) float32 {
	max := float32(math.Inf(-1))
	for _, v := range input {
		if v > max {
			max = v
		}
	}
	return max
}

const benchSize = 1 << 12

func BenchmarkVScaleF32(b *testing.B) {
	input := make([]float32, benchSize)
	output := make([]float32, len(input))
	b.SetBytes(benchSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		VScaleF32(input, output, 1.0/benchSize)
	}
}

func BenchmarkVMaxF32(b *testing.B) {
	input := make([]float32, benchSize)
	rand.Seed(0)
	for i := 0; i < len(input); i++ {
		input[i] = rand.Float32()
	}
	b.SetBytes(benchSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		VMaxF32(input)
	}
}

mdempsky · 2017-11-30T22:05:24Z

At tip, it seems VScaleF32-4 is ~23% faster, but VVMaxF32 is 20% slower at 1.10 than 1.6.

agnivade · 2019-07-04T04:13:43Z

Still not much difference between 1.11.4 and tip(go version devel +c11f6c4929 Fri Jun 21 21:24:47 2019 +0000 linux/amd64)

$benchstat old.txt new.txt 
name         old time/op    new time/op    delta
VScaleF32-4    2.35µs ± 2%    2.34µs ± 1%   ~     (p=0.268 n=10+9)
VMaxF32-4      3.19µs ± 2%    3.20µs ± 1%   ~     (p=0.447 n=10+10)

name         old speed      new speed      delta
VScaleF32-4  1.74GB/s ± 2%  1.75GB/s ± 1%   ~     (p=0.219 n=10+9)
VMaxF32-4    1.28GB/s ± 1%  1.28GB/s ± 1%   ~     (p=0.436 n=10+10)

ianlancetaylor added this to the Go1.8 milestone Jun 21, 2016

quentinmit added the NeedsFix label Oct 11, 2016

randall77 modified the milestones: Go1.9, Go1.8 Oct 21, 2016

odeke-em mentioned this issue Nov 23, 2016

cmd/compile: go1.7beta2 performance regression (floats and arrays) #16187

Open

randall77 modified the milestones: Go1.10, Go1.9 May 31, 2017

randall77 added the Performance label May 31, 2017

TocarIP mentioned this issue Aug 14, 2017

cmd/compile/ssa: conditional select instructions #21391

Closed

mdempsky modified the milestones: Go1.10, Go1.11 Nov 30, 2017

bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

gopherbot added the compiler/runtime label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: go1.7beta2 performance regression #16141

cmd/compile: go1.7beta2 performance regression #16141

samuel commented Jun 21, 2016

ianlancetaylor commented Jun 21, 2016

josharian commented Jun 22, 2016

randall77 commented Aug 31, 2016

gopherbot commented Aug 31, 2016

TocarIP commented Sep 1, 2016

rjammala commented Sep 1, 2016

TocarIP commented Sep 5, 2016

rasky commented Sep 6, 2016

TocarIP commented Sep 6, 2016

rasky commented Sep 6, 2016

rsc commented Oct 21, 2016

randall77 commented Oct 21, 2016

josharian commented May 18, 2017

mdempsky commented Nov 30, 2017

agnivade commented Jul 4, 2019

cmd/compile: go1.7beta2 performance regression #16141

cmd/compile: go1.7beta2 performance regression #16141

Comments

samuel commented Jun 21, 2016

ianlancetaylor commented Jun 21, 2016

josharian commented Jun 22, 2016

randall77 commented Aug 31, 2016

gopherbot commented Aug 31, 2016

TocarIP commented Sep 1, 2016

rjammala commented Sep 1, 2016

TocarIP commented Sep 5, 2016

rasky commented Sep 6, 2016

TocarIP commented Sep 6, 2016

rasky commented Sep 6, 2016

rsc commented Oct 21, 2016

randall77 commented Oct 21, 2016

josharian commented May 18, 2017

mdempsky commented Nov 30, 2017

agnivade commented Jul 4, 2019