cmd/compile: tweak branchelim heuristic on amd64 #25298
Comparing package strconv benchmarks, the main regression happened in a35ec9a (1675ns/op -> 3201ns/op).
From a cursory look, it seems that the regressing benchmarks always execute code with fixed inputs. For instance, look at:

```go
func BenchmarkAtof64FloatExp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ParseFloat("-5.09e75", 64)
	}
}
```

It always parses the same number, so the CPU is able to branch-predict very well what is going to happen. The same applies to:

```go
func BenchmarkSingleMaxSkipping(b *testing.B) {
	benchmarkSingleString(b, Repeat("b", 25), Repeat("a", 10000))
}
```

So, yes, if the branches are well predicted, CMOV can be slower. @TocarIP what is the exact CPU model you used for benchmarks? On the other hand, on random inputs it is indeed faster:
I'm not sure what to do here.

PS: @TocarIP FWIW,

```go
func (f *extFloat) Normalize() uint {
	shift := bits.LeadingZeros64(f.mant)
	f.mant <<= uint(shift)
	f.exp -= shift
	return uint(shift)
}
```
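For reference, the snippet above can be run standalone with a reduced `extFloat` type. The struct here is a sketch of strconv's internal extended-precision float (mant * 2^exp), not its full definition.

```go
package main

import (
	"fmt"
	"math/bits"
)

// extFloat is a reduced sketch of strconv's extended-precision float:
// the value represented is mant * 2^exp.
type extFloat struct {
	mant uint64
	exp  int
}

// Normalize shifts mant left until its top bit is set, adjusting exp so
// the value is unchanged. bits.LeadingZeros64 compiles to a single
// instruction on amd64, which keeps the function small and inlinable.
func (f *extFloat) Normalize() uint {
	shift := bits.LeadingZeros64(f.mant)
	f.mant <<= uint(shift)
	f.exp -= shift
	return uint(shift)
}

func main() {
	f := extFloat{mant: 1, exp: 0}
	s := f.Normalize()
	fmt.Println(s, f.mant == 1<<63, f.exp) // 63 true -63
}
```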
Change https://golang.org/cl/113256 mentions this issue:
@rasky I've used an i7-6700. Agreed about (*extFloat).Normalize: it is a case where we want to generate CMOV, and where gcc/clang also generate CMOV.
Use math/bits.LeadingZeros64 instead of the local implementation. This simplifies the code, makes Normalize inlinable, and fixes the performance regression. The idea was suggested by Giovanni Bajo in #25298.

Performance results below:

```
Atof64Decimal-6                 46.7ns ± 0%  46.7ns ± 0%      ~     (all equal)
Atof64Float-6                   57.9ns ± 1%  56.9ns ± 0%   -1.72%  (p=0.000 n=10+9)
Atof64FloatExp-6                 163ns ± 0%   123ns ± 0%  -24.54%  (p=0.002 n=8+10)
Atof64Big-6                      222ns ± 1%   185ns ± 1%  -16.65%  (p=0.000 n=9+10)
Atof64RandomBits-6               155ns ± 2%   154ns ± 3%      ~     (p=0.225 n=10+10)
Atof64RandomFloats-6             156ns ± 2%   154ns ± 2%      ~     (p=0.124 n=10+9)
Atof32Decimal-6                 47.3ns ± 0%  46.7ns ± 0%   -1.26%  (p=0.000 n=7+9)
Atof32Float-6                   51.5ns ± 1%  51.6ns ± 1%      ~     (p=0.455 n=10+9)
Atof32FloatExp-6                 163ns ± 1%   124ns ± 1%  -24.36%  (p=0.000 n=10+10)
Atof32Random-6                   199ns ± 1%   163ns ± 0%  -17.93%  (p=0.000 n=10+10)
FormatFloat/Decimal-6            209ns ± 2%   211ns ± 2%      ~     (p=0.402 n=10+10)
FormatFloat/Float-6              393ns ± 2%   379ns ± 1%   -3.57%  (p=0.000 n=10+10)
FormatFloat/Exp-6                333ns ± 2%   321ns ± 1%   -3.56%  (p=0.000 n=10+9)
FormatFloat/NegExp-6             338ns ± 3%   317ns ± 1%   -6.27%  (p=0.000 n=10+9)
FormatFloat/Big-6                457ns ± 1%   443ns ± 2%   -2.99%  (p=0.000 n=9+10)
FormatFloat/BinaryExp-6          230ns ± 2%   232ns ± 2%      ~     (p=0.070 n=10+10)
FormatFloat/32Integer-6          209ns ± 2%   211ns ± 1%      ~     (p=0.203 n=10+8)
FormatFloat/32ExactFraction-6    330ns ± 2%   319ns ± 1%   -3.42%  (p=0.000 n=10+10)
FormatFloat/32Point-6            393ns ± 2%   377ns ± 1%   -4.15%  (p=0.000 n=10+10)
FormatFloat/32Exp-6              331ns ± 2%   318ns ± 2%   -4.02%  (p=0.000 n=10+10)
FormatFloat/32NegExp-6           327ns ± 2%   315ns ± 2%   -3.70%  (p=0.000 n=10+10)
FormatFloat/64Fixed1-6           265ns ± 2%   253ns ± 2%   -4.38%  (p=0.000 n=10+10)
FormatFloat/64Fixed2-6           278ns ± 2%   262ns ± 3%   -5.71%  (p=0.000 n=10+10)
FormatFloat/64Fixed3-6           271ns ± 2%   260ns ± 2%   -4.03%  (p=0.000 n=10+10)
FormatFloat/64Fixed4-6           277ns ± 3%   267ns ± 1%   -3.55%  (p=0.000 n=10+9)
FormatFloat/Slowpath64-6        71.0µs ± 0%  71.0µs ± 0%      ~     (p=0.744 n=10+8)
AppendFloat/Decimal-6            100ns ± 1%   100ns ± 0%      ~     (p=0.294 n=10+8)
AppendFloat/Float-6              273ns ± 0%   260ns ± 1%   -4.87%  (p=0.000 n=7+10)
AppendFloat/Exp-6                213ns ± 0%   200ns ± 0%   -6.29%  (p=0.000 n=8+10)
AppendFloat/NegExp-6             211ns ± 0%   198ns ± 0%   -6.16%  (p=0.000 n=8+8)
AppendFloat/Big-6                319ns ± 0%   305ns ± 0%   -4.31%  (p=0.000 n=8+7)
AppendFloat/BinaryExp-6         98.4ns ± 0%  92.9ns ± 0%   -5.63%  (p=0.000 n=9+8)
AppendFloat/32Integer-6          101ns ± 1%   102ns ± 1%   +0.89%  (p=0.004 n=10+10)
AppendFloat/32ExactFraction-6    222ns ± 1%   210ns ± 0%   -5.28%  (p=0.000 n=10+9)
AppendFloat/32Point-6            273ns ± 1%   261ns ± 1%   -4.62%  (p=0.000 n=10+9)
AppendFloat/32Exp-6              209ns ± 1%   197ns ± 0%   -5.56%  (p=0.000 n=10+9)
AppendFloat/32NegExp-6           207ns ± 1%   194ns ± 1%   -6.18%  (p=0.000 n=10+10)
AppendFloat/64Fixed1-6           145ns ± 0%   131ns ± 1%   -9.93%  (p=0.000 n=9+10)
AppendFloat/64Fixed2-6           160ns ± 0%   146ns ± 0%   -8.58%  (p=0.000 n=10+8)
AppendFloat/64Fixed3-6           147ns ± 1%   132ns ± 1%  -10.25%  (p=0.000 n=10+10)
AppendFloat/64Fixed4-6           161ns ± 1%   149ns ± 0%   -7.93%  (p=0.000 n=10+10)
AppendFloat/Slowpath64-6        70.6µs ± 1%  70.9µs ± 0%   +0.37%  (p=0.000 n=10+8)
```

Change-Id: I63bbc40905abd795fbd24743604c790023d11a43
Reviewed-on: https://go-review.googlesource.com/113256
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
**What version of Go are you using (`go version`)?**

master:

```
go version devel +cd1976d Tue May 8 19:57:49 2018 +0000 linux/amd64
```

**Does this issue reproduce with the latest release?**

Yes

**What operating system and processor architecture are you using (`go env`)?**

**What did you do?**

Run strconv/Atof benchmarks.

**What did you expect to see?**

Same or better performance as 1.10.

**What did you see instead?**
Bisect points to a35ec9a.

Looking at the code, I see that (*extFloat).Normalize got slower, probably because the branches were well predicted and most instructions depend on the result of a branch.

It looks like the heuristic for generating CMOV should be tweaked, but I'm not sure how. The current threshold is already low, and reducing it further would hurt performance in other cases. Maybe we should avoid generating long chains of dependent CMOVs?
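To make the concern about dependent CMOV chains concrete, here is a hypothetical loop (not from the compiler or strconv) where each conditional forms a loop-carried dependency. If the `if` is lowered to CMOV, every CMOV must wait for the previous iteration's result, serializing the loop; a well-predicted branch instead lets the CPU speculate ahead.

```go
package main

import "fmt"

// runningMax carries `best` across iterations: each comparison depends
// on the previous result. This shape is a candidate for CMOV on amd64,
// and is exactly the kind of dependent chain the heuristic might avoid.
func runningMax(xs []int) int {
	best := xs[0]
	for _, x := range xs[1:] {
		if x > best { // may become CMOVQGT, chained on prior best
			best = x
		}
	}
	return best
}

func main() {
	fmt.Println(runningMax([]int{3, 9, 1, 7})) // 9
}
```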