Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Improve performance of round2 #42865

Closed
kiyonlin opened this issue Nov 28, 2020 · 28 comments
Closed

runtime: Improve performance of round2 #42865

kiyonlin opened this issue Nov 28, 2020 · 28 comments

Comments

@kiyonlin
Copy link

kiyonlin commented Nov 28, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.4 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

Current implementation:

// round x up to a power of 2.
func round2(x int32) int32 {
	s := uint(0)
	for 1<<s < x {
		s++
	}
	return 1 << s
}

We can use pure bit operations to improve it.

// round x up to a power of 2.
func round2(x int32) int32 {
	if x <= 1 {
		return 1
	}

	n := x - 1
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16

	return n + 1
}

Here is the final benchmark result

@gopherbot
Copy link

Change https://golang.org/cl/273671 mentions this issue: runtime: improve round2 by using pure bit operations

@davecheney
Copy link
Contributor

davecheney commented Nov 28, 2020

Unless you have a 10ghz processor that is executing all the instructions in this function in parallel, you need to watch out for compiler optimisations. Inspect the compiled output of your benchmarks, I suspect the code has been optimised away because in your benchmark you are not using the result of round2

@davecheney
Copy link
Contributor

Also, there is no need to use pb.Parallel. This code does not use concurrency, its performance should not change depending on the value of GOMAXPROCS.

@kiyonlin
Copy link
Author

kiyonlin commented Nov 28, 2020

Also, there is no need to use pb.Parallel. This code does not use concurrency, its performance should not change depending on the value of GOMAXPROCS.

Hey @davecheney If I don't use pb.Parllel, I get

BenchmarkRound2Old-16           133692382                8.90 ns/op
BenchmarkRound2Old-16           135668180                8.94 ns/op
BenchmarkRound2Old-16           134220864                8.99 ns/op
BenchmarkRound2Old-16           135382104                8.93 ns/op
BenchmarkRound2Old-16           134764070                8.86 ns/op
BenchmarkRound2Old-16           135216817                8.97 ns/op
BenchmarkRound2Old-16           131684757                8.92 ns/op
BenchmarkRound2Old-16           135538406                8.92 ns/op
BenchmarkRound2Old-16           135814132                9.00 ns/op
BenchmarkRound2Old-16           134046284                8.83 ns/op
BenchmarkRound2New-16           1000000000               0.240 ns/op
BenchmarkRound2New-16           1000000000               0.236 ns/op
BenchmarkRound2New-16           1000000000               0.240 ns/op
BenchmarkRound2New-16           1000000000               0.235 ns/op
BenchmarkRound2New-16           1000000000               0.238 ns/op
BenchmarkRound2New-16           1000000000               0.238 ns/op
BenchmarkRound2New-16           1000000000               0.237 ns/op
BenchmarkRound2New-16           1000000000               0.236 ns/op
BenchmarkRound2New-16           1000000000               0.241 ns/op
BenchmarkRound2New-16           1000000000               0.243 ns/op

@davecheney
Copy link
Contributor

Your benchmark is being optimised away because you are not using the result. Because your chance removed the for loop round2 can be inlined allowing the compiler to see that you’re not using the result so it removes the call completely.

Don’t get distracted by how fast the benchmark runs, focus on the delta between the original version and your replacement.

@kiyonlin
Copy link
Author

kiyonlin commented Nov 28, 2020

@davecheney How about this

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(2059)
	}
	_ = r
}
OLD
BenchmarkRound2-16      135571280                8.95 ns/op
BenchmarkRound2-16      133842273                8.97 ns/op
BenchmarkRound2-16      133245867                8.95 ns/op
BenchmarkRound2-16      134098809                8.95 ns/op
BenchmarkRound2-16      129663037                8.94 ns/op
BenchmarkRound2-16      131995621                9.03 ns/op
BenchmarkRound2-16      134425384                8.91 ns/op
BenchmarkRound2-16      132728696                9.00 ns/op
BenchmarkRound2-16      132275148                9.01 ns/op
BenchmarkRound2-16      132570732                8.96 ns/op

NEW
BenchmarkRound2-16      1000000000               0.242 ns/op
BenchmarkRound2-16      1000000000               0.238 ns/op
BenchmarkRound2-16      1000000000               0.243 ns/op
BenchmarkRound2-16      1000000000               0.248 ns/op
BenchmarkRound2-16      1000000000               0.239 ns/op
BenchmarkRound2-16      1000000000               0.235 ns/op
BenchmarkRound2-16      1000000000               0.240 ns/op
BenchmarkRound2-16      1000000000               0.242 ns/op
BenchmarkRound2-16      1000000000               0.240 ns/op
BenchmarkRound2-16      1000000000               0.236 ns/op

benchstat

name       old time/op  new time/op  delta
Round2-16  8.97ns ± 1%  0.24ns ± 3%  -97.32%  (p=0.000 n=10+10)

@davecheney
Copy link
Contributor

davecheney commented Nov 28, 2020

Respectfully you need to double check your math; if you have a 3ghz processor, then round2 completed in less than one clock cycle. This seems unlikely.

There are several possible reasons for this:

  • inlining, your revision makes round2 inlinable, the compiler can now see what happens to the value 2059 when it is passed into round2 and it can see that round2 is a pure function which has no side effects.
  • because round2 is inlined, the compiler can use constant propagation to compute the result at compile time, not run time.
  • the result is probably still being thrown away because the compile can see you never read from r, so the compiler never needs to assign any value to it.

I suggest the following replacement benchmark

var Result, Input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(Input)
	}
	R = r
}

@kiyonlin
Copy link
Author

kiyonlin commented Nov 28, 2020

@davecheney Thanks very much!

Respectfully you need to double check your math

I do think so!

var result, input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round2(input)
	}
	result = r
}
OLD
BenchmarkRound2-16      128067048                8.89 ns/op
BenchmarkRound2-16      134646064                8.85 ns/op
BenchmarkRound2-16      133322764                8.86 ns/op
BenchmarkRound2-16      135313116                8.95 ns/op
BenchmarkRound2-16      133720844                8.98 ns/op
BenchmarkRound2-16      134429626                8.90 ns/op
BenchmarkRound2-16      134209008                9.04 ns/op
BenchmarkRound2-16      135072374                8.86 ns/op
BenchmarkRound2-16      135947492                8.86 ns/op
BenchmarkRound2-16      134961540                8.93 ns/op

NEW
BenchmarkRound2-16      817970109                1.38 ns/op
BenchmarkRound2-16      885790197                1.38 ns/op
BenchmarkRound2-16      853032567                1.37 ns/op
BenchmarkRound2-16      882751386                1.39 ns/op
BenchmarkRound2-16      869532646                1.37 ns/op
BenchmarkRound2-16      857339872                1.39 ns/op
BenchmarkRound2-16      822184684                1.37 ns/op
BenchmarkRound2-16      856609690                1.36 ns/op
BenchmarkRound2-16      892192333                1.38 ns/op
BenchmarkRound2-16      889352274                1.40 ns/op

benchstat

name       old time/op  new time/op  delta
Round2-16  8.91ns ± 1%  1.38ns ± 2%  -84.53%  (p=0.000 n=10+10)

@martisch
Copy link
Contributor

martisch commented Nov 29, 2020

Some observations:

  1. In the benchmark the old round2 does not get inlined in the benchmark but the new one does. With //go:noinline the new round2 gets a little bit slower to around 2ns.

  2. for input 1<<31-1 the old round2 produces an infinite loop and the new round2 produces: -2147483648

  3. adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
    r = round2(r)

  4. for small inputs like 2 the old version can be faster (assuming no inlining).

@kiyonlin
Copy link
Author

kiyonlin commented Nov 29, 2020

Hey @martisch based on your observations:

  1. In the benchmark the old round2 does not get inlined in the benchmark but the new one does. With //go:noinline the new round2 gets a little bit slower to around 2ns.

Same here

  1. for input 1<<31-1 the old round2 produces an infinite loop and the new round2 produces: -2147483648

Yeah, it's overflow because of type int32. But I think this situation won't happen in the real world. Because that means allocating 4GB memory 😆

  1. adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
    r = round2(r)

Sorry I don't follow this

  1. for small inputs like 2 the old version can be faster (assuming no inlining).

The new one is always faster on my machine

@davecheney
Copy link
Contributor

It seems that round2 is called twice in the codebase and only one of those is in a non constant context, inside malg. I think some higher level benchmarks are needed to demonstrate the benefits of this change.

@kiyonlin
Copy link
Author

@davecheney Any examples about higher level benchmarks?

@davecheney
Copy link
Contributor

Benchmark the things that call round2, and possibly their callers.

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

@martisch
Copy link
Contributor

martisch commented Nov 29, 2020

  1. adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
    r = round2(r)

Sorry I don't follow this

Sometimes benchmarks should take into account that there is a dependency chain since the result will be used. I dont think its that important here just something to keep in mind when benchmarking:

func BenchmarkRound2(b *testing.B) { r := input for i := 0; i < b.N; i++ { r = round2(r) } result = r }

The new one is always faster on my machine

I do not think it is much important here but to make clear what I benchmarked: Adding //go:noinline and the loop dependency from above for an input of 2 produces 4ns for the new round2 on my Intel(R) Core(TM) i5-1038NG7 CPU instead of 2ns for the old.

Which to me makes sense since the new version has a longer dependency chain for small numbers.

All in all this function is not at all critical for performance so it will be hard to impossible to prove any effect if there is a 4 to 8 ns difference between the non inlined versions.

@kiyonlin
Copy link
Author

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

@davecheney I think malg is in a hot path, so the improvement is meaningful even it may be just small.

@davecheney
Copy link
Contributor

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

@davecheney I think malg is in a hot path, so the improvement is meaningful even it may be just small.

Good. Then benchmark that in support of your change.

@kiyonlin
Copy link
Author

kiyonlin commented Nov 29, 2020

Sometimes benchmarks should take into account that there is a dependency chain since the result will be used. I dont think its that important here just something to keep in mind when benchmarking:

@martisch Ok thanks for the explanation!

@kiyonlin
Copy link
Author

@davecheney I don't think I can provide a benchmark about malg but just round2. The systemstack is related to the compiler, right?

@davecheney
Copy link
Contributor

@davecheney I don't think I can provide a benchmark about malg but just round2. The systemstack is related to the compiler, right?

From my vague understanding system stack is used when the goroutine needs to make a call using the underlying threads stack, not the goroutines. I don’t know the specifics, sorry.

Look. I appreciate your enthusiasm, and don’t want you to be discouraged, but it’s important to understand the cost of any change is large, especially when it’s inside the runtime. It has to be supported with clear evidence that the change is beneficial in real life, not just on paper.

Lastly, the reason round2 got faster with your change is it became inlinable. I encourage you to research the series of changes landing in 1.16 (or maybe 1.17, I’m not following closely) that allow functions with a for loop to be inlinable. This may reduce the benefit of your change

@martisch
Copy link
Contributor

martisch commented Nov 29, 2020

From profiling data that I know I can confirm that round2 is not worth optimising for performance.

If we would have gotten payed for the change then we already would spend way more for the time discussing the change then it can get back in CPU resource costs.

@kiyonlin
Copy link
Author

Look. I appreciate your enthusiasm, and don’t want you to be discouraged, but it’s important to understand the cost of any change is large, especially when it’s inside the runtime. It has to be supported with clear evidence that the change is beneficial in real life, not just on paper.

I totally understand.

Lastly, the reason round2 got faster with your change is it became inlinable. I encourage you to research the series of changes landing in 1.16 (or maybe 1.17, I’m not following closely) that allow functions with a for loop to be inlinable. This may reduce the benefit of your change

@davecheney and @martisch Thanks for the information

@martisch
Copy link
Contributor

@kiyonlin thanks for the suggestion. If this would have been indeed in a hot path it could have been worth while optimising but it is not in a hot path called often enough. There are better gains elsewhere in the runtime lets focus on finding and optimising those.

@kiyonlin
Copy link
Author

Thank you @davecheney and @martisch for your time and the excellent discussions.

@kiyonlin
Copy link
Author

kiyonlin commented Nov 29, 2020

From profiling data that I know I can confirm that round2 is not worth optimising for performance.

@martisch Last question, could I know how to profile this? I'm curious.

@martisch
Copy link
Contributor

This may give you a starting point: https://blog.golang.org/pprof

@kiyonlin
Copy link
Author

@martisch Okey and may I know which data should I care/notice?

@martisch
Copy link
Contributor

martisch commented Nov 29, 2020

Profile the program you care about (not a benchmark, a real program and use case) and see how much pprof reports is spend in round2.

@kiyonlin
Copy link
Author

Profile the program you care about (not a benchmark, a real program and use case) and see how much pprof reports is spend in round2.

@martisch I'll try to investigate that. Thanks again for your patience!

@golang golang locked and limited conversation to collaborators Nov 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants