-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: go1.7beta2 performance regression #16141
Comments
Marking as 1.8 because it is late in the release cycle, but I'm not opposed to a fix for 1.7 if it is safe. |
Thanks for providing the disassembly. I'm away from my computer for the week, so I can't confirm directly, but it looks like this is a register allocation issue. Looking at max, there's an unnecessary reg-reg mov in the loop in the SSA version. I'm surprised the impact is so big, but I'd guess it is because it creates a data dependency and prevents pipelining. Also, I don't see why we are doing FP adjustments, since we don't call any functions (we only reference math.b; math.Inf is inlined). That's probably a bug, but it's not responsible for the slowdown. I'll file a separate issue next week. This should wait for 1.8. |
I confess I have no idea what is going on here.
tip disassembly:
Yet tip is 28% slower. For no discernible reason I can see. Max is even worse.
tip:
Tip has 1 extra reg->reg move, yet it is 2.3x slower! 2.3x people. You could compute Max twice using 1.6 and still have some time left over. What the heck is going on? |
CL https://golang.org/cl/28273 mentions this issue. |
I can't access playground link (Forbidden), so I couldn't build it myself and play with code/check perf counters. |
@TocarIP : See if you can access this: |
Update #16141 Change-Id: I7d32c5cdc197d86491a67ea579fa16cb3d675b51 Reviewed-on: https://go-review.googlesource.com/28273 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
Looks like VScale is fixed by c538795, so I focused on VMaxF32. I've tried implementing maxss recognition as part of phiopt, but while it produces smaller code: // 46bc06: It executes with almost the same speed as tip: name old speed new speed delta But has lower IPC Looks like in 1.7 case, as well as in maxss case, there is a true dependency on previous value of X1, I've tried reducing size of array to 1 << 6. I'm not sure what should be done: Btw b.SetBytes(benchSize) should be b.SetBytes(benchSize*4), if the intention is to measure MB/s instead of megafloat32/s |
@TocarIP wouldn't be sufficient to make the compiler understand that the branch is unlikely rather than likely? It would generate a jump-ahead in the unlikely case (like Go 1.6 does), and that would in turn remove the extra MOV that causes the dependency. |
@rasky do you mean something like __buitin_expect in GCC or pattern detection in compiler? |
I was thinking pattern detection in the compiler. If you can detect a min/max-loop and tune the likeliness of the branch, it would generate better code for the likely case, removing the dependency and improving the results. It would not be the BEST possible result, but it would be uniformly better than what Go 1.7 does and on par with Go 1.6. Exploring usage of maxss is then a separate possible optimization |
@randall77, @TocarIP, is there anything left to do here? |
I don't see any obvious fix. We're getting bitten by partial register dependencies that are the result of unlucky choices by regalloc. |
Reproduces at tip (638ebb0), although there are improvements from 1.7 to 1.8 to tip.
Single file reproduction code: package dsp
import (
"math"
"math/rand"
"testing"
)
func VScaleF32(input, output []float32, scale float32) {
n := len(input)
if len(output) < n {
n = len(output)
}
for i, v := range input[:n] {
output[i] = v * scale
}
}
func VMaxF32(input []float32) float32 {
max := float32(math.Inf(-1))
for _, v := range input {
if v > max {
max = v
}
}
return max
}
const benchSize = 1 << 12
func BenchmarkVScaleF32(b *testing.B) {
input := make([]float32, benchSize)
output := make([]float32, len(input))
b.SetBytes(benchSize)
b.ResetTimer()
for i := 0; i < b.N; i++ {
VScaleF32(input, output, 1.0/benchSize)
}
}
func BenchmarkVMaxF32(b *testing.B) {
input := make([]float32, benchSize)
rand.Seed(0)
for i := 0; i < len(input); i++ {
input[i] = rand.Float32()
}
b.SetBytes(benchSize)
b.ResetTimer()
for i := 0; i < b.N; i++ {
VMaxF32(input)
}
} |
At tip, it seems VScaleF32-4 is ~23% faster, but VVMaxF32 is 20% slower at 1.10 than 1.6. |
Still not much difference between 1.11.4 and tip(go version devel +c11f6c4929 Fri Jun 21 21:24:47 2019 +0000 linux/amd64)
|
Please answer these questions before submitting your issue. Thanks!
go version
)?go version go1.6.2 darwin/amd64
go version go1.7beta2 darwin/amd64
go env
)?Intel i7-3540M (also tried on i7-2677M but didn't see the same regression)
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/30/6hyj4x_x783f12hmbcmn5tt00000gn/T/go-build183604977=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
https://play.golang.org/p/3WDEr-_QZR
Disassembly (-gcflags -S): https://gist.github.com/samuel/3053bafe149a0459322f6eeaf8bd5ae5
Go 1.7beta2 the same performance or better than Go 1.6
benchmark old ns/op new ns/op delta
BenchmarkVScaleF32-4 2906 3801 +30.80%
BenchmarkVMaxF32-4 2682 3951 +47.32%
benchmark old MB/s new MB/s speedup
BenchmarkVScaleF32-4 1409.19 1077.37 0.76x
BenchmarkVMaxF32-4 1527.02 1036.64 0.68x
The text was updated successfully, but these errors were encountered: