runtime: Improve performance of round2 #42865
Comments
Change https://golang.org/cl/273671 mentions this issue:

Unless you have a 10GHz processor that is executing all the instructions in this function in parallel, you need to watch out for compiler optimisations. Inspect the compiled output of your benchmarks; I suspect the code has been optimised away, because your benchmark does not use the result of round2.

Also, there is no need to use pb.Parallel. This code does not use concurrency; its performance should not change depending on the value of GOMAXPROCS.
Hey @davecheney If I don't use pb.Parallel:

BenchmarkRound2Old-16 133692382 8.90 ns/op
BenchmarkRound2Old-16 135668180 8.94 ns/op
BenchmarkRound2Old-16 134220864 8.99 ns/op
BenchmarkRound2Old-16 135382104 8.93 ns/op
BenchmarkRound2Old-16 134764070 8.86 ns/op
BenchmarkRound2Old-16 135216817 8.97 ns/op
BenchmarkRound2Old-16 131684757 8.92 ns/op
BenchmarkRound2Old-16 135538406 8.92 ns/op
BenchmarkRound2Old-16 135814132 9.00 ns/op
BenchmarkRound2Old-16 134046284 8.83 ns/op
BenchmarkRound2New-16 1000000000 0.240 ns/op
BenchmarkRound2New-16 1000000000 0.236 ns/op
BenchmarkRound2New-16 1000000000 0.240 ns/op
BenchmarkRound2New-16 1000000000 0.235 ns/op
BenchmarkRound2New-16 1000000000 0.238 ns/op
BenchmarkRound2New-16 1000000000 0.238 ns/op
BenchmarkRound2New-16 1000000000 0.237 ns/op
BenchmarkRound2New-16 1000000000 0.236 ns/op
BenchmarkRound2New-16 1000000000 0.241 ns/op
BenchmarkRound2New-16 1000000000 0.243 ns/op
Your benchmark is being optimised away because you are not using the result. Because your change removed the for loop, round2 can be inlined, allowing the compiler to see that you're not using the result, so it removes the call completely. Don't get distracted by how fast the benchmark runs; focus on the delta between the original version and your replacement.
@davecheney How about this:

```go
func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(2059)
	}
	_ = r
}
```

OLD
BenchmarkRound2-16 135571280 8.95 ns/op
BenchmarkRound2-16 133842273 8.97 ns/op
BenchmarkRound2-16 133245867 8.95 ns/op
BenchmarkRound2-16 134098809 8.95 ns/op
BenchmarkRound2-16 129663037 8.94 ns/op
BenchmarkRound2-16 131995621 9.03 ns/op
BenchmarkRound2-16 134425384 8.91 ns/op
BenchmarkRound2-16 132728696 9.00 ns/op
BenchmarkRound2-16 132275148 9.01 ns/op
BenchmarkRound2-16 132570732 8.96 ns/op
NEW
BenchmarkRound2-16 1000000000 0.242 ns/op
BenchmarkRound2-16 1000000000 0.238 ns/op
BenchmarkRound2-16 1000000000 0.243 ns/op
BenchmarkRound2-16 1000000000 0.248 ns/op
BenchmarkRound2-16 1000000000 0.239 ns/op
BenchmarkRound2-16 1000000000 0.235 ns/op
BenchmarkRound2-16 1000000000 0.240 ns/op
BenchmarkRound2-16 1000000000 0.242 ns/op
BenchmarkRound2-16 1000000000 0.240 ns/op
BenchmarkRound2-16 1000000000 0.236 ns/op

benchstat:

name       old time/op  new time/op  delta
Round2-16  8.97ns ± 1%  0.24ns ± 3%  -97.32%  (p=0.000 n=10+10)
Respectfully, you need to double check your math; if you have a 3GHz processor, one clock cycle takes roughly 0.33 ns, so 0.24 ns/op would mean the whole function completes in less than a single cycle. The call is still being optimised away: assigning to a local and then blank-discarding it with `_ = r` is not enough, because the compiler can still see the value is never used.

I suggest the following replacement benchmark:

```go
var Result, Input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(Input)
	}
	Result = r
}
```
@davecheney Thanks very much!
I do think so!

```go
var result, input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round2(input)
	}
	result = r
}
```

OLD
BenchmarkRound2-16 128067048 8.89 ns/op
BenchmarkRound2-16 134646064 8.85 ns/op
BenchmarkRound2-16 133322764 8.86 ns/op
BenchmarkRound2-16 135313116 8.95 ns/op
BenchmarkRound2-16 133720844 8.98 ns/op
BenchmarkRound2-16 134429626 8.90 ns/op
BenchmarkRound2-16 134209008 9.04 ns/op
BenchmarkRound2-16 135072374 8.86 ns/op
BenchmarkRound2-16 135947492 8.86 ns/op
BenchmarkRound2-16 134961540 8.93 ns/op
NEW
BenchmarkRound2-16 817970109 1.38 ns/op
BenchmarkRound2-16 885790197 1.38 ns/op
BenchmarkRound2-16 853032567 1.37 ns/op
BenchmarkRound2-16 882751386 1.39 ns/op
BenchmarkRound2-16 869532646 1.37 ns/op
BenchmarkRound2-16 857339872 1.39 ns/op
BenchmarkRound2-16 822184684 1.37 ns/op
BenchmarkRound2-16 856609690 1.36 ns/op
BenchmarkRound2-16 892192333 1.38 ns/op
BenchmarkRound2-16 889352274 1.40 ns/op

benchstat:

name       old time/op  new time/op  delta
Round2-16  8.91ns ± 1%  1.38ns ± 2%  -84.53%  (p=0.000 n=10+10)
Some observations:
Hey @martisch based on your observations:
- Same here
- Yeah, it's overflow because of type
- Sorry I don't follow this
- The new one is always faster on my machine
It seems that round2 is called twice in the codebase, and only one of those calls is in a non-constant context, inside malg. I think some higher-level benchmarks are needed to demonstrate the benefits of this change.
@davecheney Any examples about higher-level benchmarks?
Benchmark the things that call round2, and possibly their callers. The goal is not to rewrite a low-level function if it has no observable benefit. Change has a cost: each code review takes time and could introduce bugs. If there is no meaningful improvement to the runtime of real Go programs, there is little justification to continue.
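A sketch of what such a higher-level benchmark might look like (BenchmarkGoroutineSpawn is a hypothetical name; malg runs during goroutine creation, though the runtime's reuse of g structs means round2 may execute only rarely on this path, which is itself part of why a win is hard to observe):

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkGoroutineSpawn measures the end-to-end cost of creating a
// goroutine and waiting for it to finish; malg (round2's only
// non-constant caller) sits on this path when a fresh g is allocated.
func BenchmarkGoroutineSpawn(b *testing.B) {
	done := make(chan struct{})
	for i := 0; i < b.N; i++ {
		go func() { done <- struct{}{} }()
		<-done
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	fmt.Println(testing.Benchmark(BenchmarkGoroutineSpawn))
}
```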
Sometimes benchmarks should take into account that there is a dependency chain, since the result will be used. I don't think it's that important here, just something to keep in mind when benchmarking.

I do not think it is very important here, but to make clear what I benchmarked: with each result fed back in as a dependency, the new version looks less favourable for small numbers. Which to me makes sense, since the new version has a longer dependency chain for small numbers. All in all, this function is not at all performance-critical, so it will be hard to impossible to prove any effect if there is a 4 to 8 ns difference between the non-inlined versions.
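The dependency-chain point can be made concrete with a benchmark where each call consumes the previous result, so the CPU cannot overlap independent iterations (a sketch: round2 here mirrors what I believe the old loop-based implementation looked like, and the input mixing is arbitrary):

```go
package main

import (
	"fmt"
	"testing"
)

// round2 rounds x up to the next power of two, along the lines of the
// old loop-based runtime implementation.
func round2(x int32) int32 {
	s := uint(0)
	for 1<<s < x {
		s++
	}
	return 1 << s
}

var sink int32

// BenchmarkRound2Chained serialises the iterations: each input depends on
// the previous output, so this measures latency of the dependency chain
// rather than throughput of independent calls.
func BenchmarkRound2Chained(b *testing.B) {
	var r int32 = 2059
	for i := 0; i < b.N; i++ {
		// Mix the previous result back into the input, keeping it in
		// a small range so round2 terminates quickly.
		r = round2(r%2048 + 12)
	}
	sink = r
}

func main() {
	fmt.Println(testing.Benchmark(BenchmarkRound2Chained))
}
```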
@davecheney I think
Good. Then benchmark that in support of your change.
@martisch Ok, thanks for the explanation!
@davecheney I don't think I can provide a benchmark about
From my vague understanding, the system stack is used when the goroutine needs to make a call using the underlying thread's stack, not the goroutine's. I don't know the specifics, sorry.

Look, I appreciate your enthusiasm, and don't want you to be discouraged, but it's important to understand that the cost of any change is large, especially when it's inside the runtime. It has to be supported with clear evidence that the change is beneficial in real life, not just on paper.

Lastly, the reason round2 got faster with your change is that it became inlinable. I encourage you to research the series of changes landing in 1.16 (or maybe 1.17, I'm not following closely) that allow functions with a for loop to be inlinable. This may reduce the benefit of your change.
From the profiling data I know of, I can confirm that round2 is not worth optimising for performance. If we had been paid for this change, we would already have spent far more on the time discussing it than it could ever earn back in CPU resource costs.
I totally understand.
@davecheney and @martisch Thanks for the information.
@kiyonlin thanks for the suggestion. If this had indeed been in a hot path it could have been worthwhile optimising, but it is not in a hot path that is called often enough. There are better gains elsewhere in the runtime; let's focus on finding and optimising those.
Thank you @davecheney and @martisch for your time and the excellent discussions.
@martisch Last question, could I know how to profile this? I'm curious.
This may give you a starting point: https://blog.golang.org/pprof
@martisch Okay, and may I know which data I should pay attention to?
Profile the program you care about (not a benchmark, a real program and use case) and see how much time pprof reports as spent in round2.
@martisch I'll try to investigate that. Thanks again for your patience!
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What did you do?

Current implementation:

We can use pure bit operations to improve it.

Here is the final benchmark result
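The bit-operations version itself did not survive in this copy of the thread. As an illustration only (the actual CL may be implemented differently, for example via math/bits), a classic branch-free round-up-to-power-of-two looks like:

```go
package main

import "fmt"

// round2 rounds a positive int32 up to the next power of two without a
// loop over shift counts: subtract 1, smear the highest set bit into all
// lower bits, then add 1 back.
func round2(x int32) int32 {
	x--
	x |= x >> 1
	x |= x >> 2
	x |= x >> 4
	x |= x >> 8
	x |= x >> 16
	return x + 1
}

func main() {
	fmt.Println(round2(2059)) // 4096, the input used in the benchmarks above
}
```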