cmd/compile: tweak branchelim heuristic on amd64 #25298
Comparing package strconv benchmarks, the main regression happened in a35ec9a (1675ns/op -> 3201ns/op).
From a cursory look, it seems that the regressing benchmarks always execute code with fixed inputs. For instance, look at:

```go
func BenchmarkAtof64FloatExp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ParseFloat("-5.09e75", 64)
	}
}
```

It always parses the same number, so the CPU is able to branch-predict very well what is going to happen. The same applies to:

```go
func BenchmarkSingleMaxSkipping(b *testing.B) {
	benchmarkSingleString(b, Repeat("b", 25), Repeat("a", 10000))
}
```

So, yes, if the branches are well predicted, CMOV can be slower. @TocarIP what is the exact CPU model you used for benchmarks? On the other hand, on random inputs it is indeed faster:
I'm not sure what to do here.

PS: @TocarIP FWIW,

```go
func (f *extFloat) Normalize() uint {
	shift := bits.LeadingZeros64(f.mant)
	f.mant <<= uint(shift)
	f.exp -= shift
	return uint(shift)
}
```
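For reference, the snippet above can be run standalone with a reduced `extFloat` type. The struct here is a sketch of strconv's internal extended-precision float (mant * 2^exp), not its full definition.

```go
package main

import (
	"fmt"
	"math/bits"
)

// extFloat is a reduced sketch of strconv's extended-precision float:
// the value represented is mant * 2^exp.
type extFloat struct {
	mant uint64
	exp  int
}

// Normalize shifts mant left until its top bit is set, adjusting exp so
// the value is unchanged. bits.LeadingZeros64 compiles to a single
// instruction on amd64, which keeps the function small and inlinable.
func (f *extFloat) Normalize() uint {
	shift := bits.LeadingZeros64(f.mant)
	f.mant <<= uint(shift)
	f.exp -= shift
	return uint(shift)
}

func main() {
	f := extFloat{mant: 1, exp: 0}
	s := f.Normalize()
	fmt.Println(s, f.mant == 1<<63, f.exp) // 63 true -63
}
```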
Change https://golang.org/cl/113256 mentions this issue:
@rasky I've used an i7-6700. Agreed about (*extFloat).Normalize: it is a case where we want to generate CMOV, and where gcc/clang also generate CMOV.
Use math/bits.LeadingZeros64 instead of the local implementation. This simplifies the code, makes Normalize inlinable, and fixes the performance regression. The idea was suggested by Giovanni Bajo in #25298.

Performance results below:

```
Atof64Decimal-6                 46.7ns ± 0%  46.7ns ± 0%      ~     (all equal)
Atof64Float-6                   57.9ns ± 1%  56.9ns ± 0%   -1.72%  (p=0.000 n=10+9)
Atof64FloatExp-6                 163ns ± 0%   123ns ± 0%  -24.54%  (p=0.002 n=8+10)
Atof64Big-6                      222ns ± 1%   185ns ± 1%  -16.65%  (p=0.000 n=9+10)
Atof64RandomBits-6               155ns ± 2%   154ns ± 3%      ~     (p=0.225 n=10+10)
Atof64RandomFloats-6             156ns ± 2%   154ns ± 2%      ~     (p=0.124 n=10+9)
Atof32Decimal-6                 47.3ns ± 0%  46.7ns ± 0%   -1.26%  (p=0.000 n=7+9)
Atof32Float-6                   51.5ns ± 1%  51.6ns ± 1%      ~     (p=0.455 n=10+9)
Atof32FloatExp-6                 163ns ± 1%   124ns ± 1%  -24.36%  (p=0.000 n=10+10)
Atof32Random-6                   199ns ± 1%   163ns ± 0%  -17.93%  (p=0.000 n=10+10)
FormatFloat/Decimal-6            209ns ± 2%   211ns ± 2%      ~     (p=0.402 n=10+10)
FormatFloat/Float-6              393ns ± 2%   379ns ± 1%   -3.57%  (p=0.000 n=10+10)
FormatFloat/Exp-6                333ns ± 2%   321ns ± 1%   -3.56%  (p=0.000 n=10+9)
FormatFloat/NegExp-6             338ns ± 3%   317ns ± 1%   -6.27%  (p=0.000 n=10+9)
FormatFloat/Big-6                457ns ± 1%   443ns ± 2%   -2.99%  (p=0.000 n=9+10)
FormatFloat/BinaryExp-6          230ns ± 2%   232ns ± 2%      ~     (p=0.070 n=10+10)
FormatFloat/32Integer-6          209ns ± 2%   211ns ± 1%      ~     (p=0.203 n=10+8)
FormatFloat/32ExactFraction-6    330ns ± 2%   319ns ± 1%   -3.42%  (p=0.000 n=10+10)
FormatFloat/32Point-6            393ns ± 2%   377ns ± 1%   -4.15%  (p=0.000 n=10+10)
FormatFloat/32Exp-6              331ns ± 2%   318ns ± 2%   -4.02%  (p=0.000 n=10+10)
FormatFloat/32NegExp-6           327ns ± 2%   315ns ± 2%   -3.70%  (p=0.000 n=10+10)
FormatFloat/64Fixed1-6           265ns ± 2%   253ns ± 2%   -4.38%  (p=0.000 n=10+10)
FormatFloat/64Fixed2-6           278ns ± 2%   262ns ± 3%   -5.71%  (p=0.000 n=10+10)
FormatFloat/64Fixed3-6           271ns ± 2%   260ns ± 2%   -4.03%  (p=0.000 n=10+10)
FormatFloat/64Fixed4-6           277ns ± 3%   267ns ± 1%   -3.55%  (p=0.000 n=10+9)
FormatFloat/Slowpath64-6        71.0µs ± 0%  71.0µs ± 0%      ~     (p=0.744 n=10+8)
AppendFloat/Decimal-6            100ns ± 1%   100ns ± 0%      ~     (p=0.294 n=10+8)
AppendFloat/Float-6              273ns ± 0%   260ns ± 1%   -4.87%  (p=0.000 n=7+10)
AppendFloat/Exp-6                213ns ± 0%   200ns ± 0%   -6.29%  (p=0.000 n=8+10)
AppendFloat/NegExp-6             211ns ± 0%   198ns ± 0%   -6.16%  (p=0.000 n=8+8)
AppendFloat/Big-6                319ns ± 0%   305ns ± 0%   -4.31%  (p=0.000 n=8+7)
AppendFloat/BinaryExp-6         98.4ns ± 0%  92.9ns ± 0%   -5.63%  (p=0.000 n=9+8)
AppendFloat/32Integer-6          101ns ± 1%   102ns ± 1%   +0.89%  (p=0.004 n=10+10)
AppendFloat/32ExactFraction-6    222ns ± 1%   210ns ± 0%   -5.28%  (p=0.000 n=10+9)
AppendFloat/32Point-6            273ns ± 1%   261ns ± 1%   -4.62%  (p=0.000 n=10+9)
AppendFloat/32Exp-6              209ns ± 1%   197ns ± 0%   -5.56%  (p=0.000 n=10+9)
AppendFloat/32NegExp-6           207ns ± 1%   194ns ± 1%   -6.18%  (p=0.000 n=10+10)
AppendFloat/64Fixed1-6           145ns ± 0%   131ns ± 1%   -9.93%  (p=0.000 n=9+10)
AppendFloat/64Fixed2-6           160ns ± 0%   146ns ± 0%   -8.58%  (p=0.000 n=10+8)
AppendFloat/64Fixed3-6           147ns ± 1%   132ns ± 1%  -10.25%  (p=0.000 n=10+10)
AppendFloat/64Fixed4-6           161ns ± 1%   149ns ± 0%   -7.93%  (p=0.000 n=10+10)
AppendFloat/Slowpath64-6        70.6µs ± 1%  70.9µs ± 0%   +0.37%  (p=0.000 n=10+8)
```

Change-Id: I63bbc40905abd795fbd24743604c790023d11a43
Reviewed-on: https://go-review.googlesource.com/113256
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
**What version of Go are you using (`go version`)?**

master:

```
go version devel +cd1976d Tue May 8 19:57:49 2018 +0000 linux/amd64
```

**Does this issue reproduce with the latest release?**

Yes

**What operating system and processor architecture are you using (`go env`)?**

**What did you do?**

Run strconv/Atof benchmarks.

**What did you expect to see?**

Same or better performance as 1.10.

**What did you see instead?**
Bisect points to a35ec9a.

Looking at the code, I see that (*extFloat).Normalize got slower, probably because the branches were well predicted and most instructions depend on the result of a branch.

It looks like the heuristic for generating CMOV should be tweaked, but I'm not sure how. The current threshold is already low, and reducing it further would hurt performance in other cases. Maybe we should avoid generating long chains of dependent CMOVs?
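To make the concern about dependent CMOV chains concrete, here is a hypothetical loop (not from the compiler or strconv) where each conditional forms a loop-carried dependency. If the `if` is lowered to CMOV, every CMOV must wait for the previous iteration's result, serializing the loop; a well-predicted branch instead lets the CPU speculate ahead.

```go
package main

import "fmt"

// runningMax carries `best` across iterations: each comparison depends
// on the previous result. This shape is a candidate for CMOV on amd64,
// and is exactly the kind of dependent chain the heuristic might avoid.
func runningMax(xs []int) int {
	best := xs[0]
	for _, x := range xs[1:] {
		if x > best { // may become CMOVQGT, chained on prior best
			best = x
		}
	}
	return best
}

func main() {
	fmt.Println(runningMax([]int{3, 9, 1, 7})) // 9
}
```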