runtime: Improve performance of round2 #42865

kiyonlin · 2020-11-28T07:06:26Z

What version of Go are you using (`go version`)?

$ go version
go version go1.15.4 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

Current implementation:

// round x up to a power of 2.
func round2(x int32) int32 {
	s := uint(0)
	for 1<<s < x {
		s++
	}
	return 1 << s
}

We can use pure bit operations to improve it.

// round x up to a power of 2.
func round2(x int32) int32 {
	if x <= 1 {
		return 1
	}

	n := x - 1
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16

	return n + 1
}

Here is the final benchmark result

The text was updated successfully, but these errors were encountered:

gopherbot · 2020-11-28T07:38:17Z

Change https://golang.org/cl/273671 mentions this issue: runtime: improve round2 by using pure bit operations

davecheney · 2020-11-28T07:38:23Z

Unless you have a 10ghz processor that is executing all the instructions in this function in parallel, you need to watch out for compiler optimisations. Inspect the compiled output of your benchmarks, I suspect the code has been optimised away because in your benchmark you are not using the result of round2

davecheney · 2020-11-28T07:39:46Z

Also, there is no need to use pb.Parallel. This code does not use concurrency, its performance should not change depending on the value of GOMAXPROCS.

kiyonlin · 2020-11-28T07:51:14Z

Also, there is no need to use pb.Parallel. This code does not use concurrency, its performance should not change depending on the value of GOMAXPROCS.

Hey @davecheney If I don't use pb.Parllel, I get

BenchmarkRound2Old-16           133692382                8.90 ns/op
BenchmarkRound2Old-16           135668180                8.94 ns/op
BenchmarkRound2Old-16           134220864                8.99 ns/op
BenchmarkRound2Old-16           135382104                8.93 ns/op
BenchmarkRound2Old-16           134764070                8.86 ns/op
BenchmarkRound2Old-16           135216817                8.97 ns/op
BenchmarkRound2Old-16           131684757                8.92 ns/op
BenchmarkRound2Old-16           135538406                8.92 ns/op
BenchmarkRound2Old-16           135814132                9.00 ns/op
BenchmarkRound2Old-16           134046284                8.83 ns/op
BenchmarkRound2New-16           1000000000               0.240 ns/op
BenchmarkRound2New-16           1000000000               0.236 ns/op
BenchmarkRound2New-16           1000000000               0.240 ns/op
BenchmarkRound2New-16           1000000000               0.235 ns/op
BenchmarkRound2New-16           1000000000               0.238 ns/op
BenchmarkRound2New-16           1000000000               0.238 ns/op
BenchmarkRound2New-16           1000000000               0.237 ns/op
BenchmarkRound2New-16           1000000000               0.236 ns/op
BenchmarkRound2New-16           1000000000               0.241 ns/op
BenchmarkRound2New-16           1000000000               0.243 ns/op

davecheney · 2020-11-28T07:57:52Z

Your benchmark is being optimised away because you are not using the result. Because your chance removed the for loop round2 can be inlined allowing the compiler to see that you’re not using the result so it removes the call completely.

Don’t get distracted by how fast the benchmark runs, focus on the delta between the original version and your replacement.

kiyonlin · 2020-11-28T08:12:55Z

@davecheney How about this

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(2059)
	}
	_ = r
}

OLD
BenchmarkRound2-16      135571280                8.95 ns/op
BenchmarkRound2-16      133842273                8.97 ns/op
BenchmarkRound2-16      133245867                8.95 ns/op
BenchmarkRound2-16      134098809                8.95 ns/op
BenchmarkRound2-16      129663037                8.94 ns/op
BenchmarkRound2-16      131995621                9.03 ns/op
BenchmarkRound2-16      134425384                8.91 ns/op
BenchmarkRound2-16      132728696                9.00 ns/op
BenchmarkRound2-16      132275148                9.01 ns/op
BenchmarkRound2-16      132570732                8.96 ns/op

NEW
BenchmarkRound2-16      1000000000               0.242 ns/op
BenchmarkRound2-16      1000000000               0.238 ns/op
BenchmarkRound2-16      1000000000               0.243 ns/op
BenchmarkRound2-16      1000000000               0.248 ns/op
BenchmarkRound2-16      1000000000               0.239 ns/op
BenchmarkRound2-16      1000000000               0.235 ns/op
BenchmarkRound2-16      1000000000               0.240 ns/op
BenchmarkRound2-16      1000000000               0.242 ns/op
BenchmarkRound2-16      1000000000               0.240 ns/op
BenchmarkRound2-16      1000000000               0.236 ns/op

benchstat

name       old time/op  new time/op  delta
Round2-16  8.97ns ± 1%  0.24ns ± 3%  -97.32%  (p=0.000 n=10+10)

davecheney · 2020-11-28T08:18:17Z

Respectfully you need to double check your math; if you have a 3ghz processor, then round2 completed in less than one clock cycle. This seems unlikely.

There are several possible reasons for this:

inlining, your revision makes round2 inlinable, the compiler can now see what happens to the value 2059 when it is passed into round2 and it can see that round2 is a pure function which has no side effects.
because round2 is inlined, the compiler can use constant propagation to compute the result at compile time, not run time.
the result is probably still being thrown away because the compile can see you never read from r, so the compiler never needs to assign any value to it.

I suggest the following replacement benchmark

var Result, Input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round22(Input)
	}
	R = r
}

kiyonlin · 2020-11-28T08:25:09Z

@davecheney Thanks very much!

Respectfully you need to double check your math

I do think so!

var result, input int32 = 0, 2059

func BenchmarkRound2(b *testing.B) {
	var r int32
	for i := 0; i < b.N; i++ {
		r = round2(input)
	}
	result = r
}

OLD
BenchmarkRound2-16      128067048                8.89 ns/op
BenchmarkRound2-16      134646064                8.85 ns/op
BenchmarkRound2-16      133322764                8.86 ns/op
BenchmarkRound2-16      135313116                8.95 ns/op
BenchmarkRound2-16      133720844                8.98 ns/op
BenchmarkRound2-16      134429626                8.90 ns/op
BenchmarkRound2-16      134209008                9.04 ns/op
BenchmarkRound2-16      135072374                8.86 ns/op
BenchmarkRound2-16      135947492                8.86 ns/op
BenchmarkRound2-16      134961540                8.93 ns/op

NEW
BenchmarkRound2-16      817970109                1.38 ns/op
BenchmarkRound2-16      885790197                1.38 ns/op
BenchmarkRound2-16      853032567                1.37 ns/op
BenchmarkRound2-16      882751386                1.39 ns/op
BenchmarkRound2-16      869532646                1.37 ns/op
BenchmarkRound2-16      857339872                1.39 ns/op
BenchmarkRound2-16      822184684                1.37 ns/op
BenchmarkRound2-16      856609690                1.36 ns/op
BenchmarkRound2-16      892192333                1.38 ns/op
BenchmarkRound2-16      889352274                1.40 ns/op

benchstat

name       old time/op  new time/op  delta
Round2-16  8.91ns ± 1%  1.38ns ± 2%  -84.53%  (p=0.000 n=10+10)

martisch · 2020-11-29T04:54:31Z

Some observations:

In the benchmark the old round2 does not get inlined in the benchmark but the new one does. With //go:noinline the new round2 gets a little bit slower to around 2ns.
for input 1<<31-1 the old round2 produces an infinite loop and the new round2 produces: -2147483648
adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
r = round2(r)
for small inputs like 2 the old version can be faster (assuming no inlining).

kiyonlin · 2020-11-29T05:13:25Z

Hey @martisch based on your observations:

In the benchmark the old round2 does not get inlined in the benchmark but the new one does. With //go:noinline the new round2 gets a little bit slower to around 2ns.

Same here

for input 1<<31-1 the old round2 produces an infinite loop and the new round2 produces: -2147483648

Yeah, it's overflow because of type int32. But I think this situation won't happen in the real world. Because that means allocating 4GB memory 😆

adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
r = round2(r)

Sorry I don't follow this

for small inputs like 2 the old version can be faster (assuming no inlining).

The new one is always faster on my machine

davecheney · 2020-11-29T05:15:41Z

It seems that round2 is called twice in the codebase and only one of those is in a non constant context, inside malg. I think some higher level benchmarks are needed to demonstrate the benefits of this change.

kiyonlin · 2020-11-29T05:21:47Z

@davecheney Any examples about higher level benchmarks?

davecheney · 2020-11-29T05:28:35Z

Benchmark the things that call round2, and possibly their callers.

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

martisch · 2020-11-29T05:35:55Z

adding a dependency in the benchmark loop drops the performance of the new round2 from 2ns to 4ns on my computer.
r = round2(r)

Sorry I don't follow this

Sometimes benchmarks should take into account that there is a dependency chain since the result will be used. I dont think its that important here just something to keep in mind when benchmarking:

func BenchmarkRound2(b *testing.B) { r := input for i := 0; i < b.N; i++ { r = round2(r) } result = r }

The new one is always faster on my machine

I do not think it is much important here but to make clear what I benchmarked: Adding //go:noinline and the loop dependency from above for an input of 2 produces 4ns for the new round2 on my Intel(R) Core(TM) i5-1038NG7 CPU instead of 2ns for the old.

Which to me makes sense since the new version has a longer dependency chain for small numbers.

All in all this function is not at all critical for performance so it will be hard to impossible to prove any effect if there is a 4 to 8 ns difference between the non inlined versions.

kiyonlin · 2020-11-29T05:41:24Z

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

@davecheney I think malg is in a hot path, so the improvement is meaningful even it may be just small.

davecheney · 2020-11-29T05:42:59Z

The goal is not to rewrite a low level function if it has no observable benefit. Change has a cost, each code review takes time, and could possibly introduce bugs. If there is no meaningful improvement to the runtime of real go programs, there is little justification to continue.

@davecheney I think malg is in a hot path, so the improvement is meaningful even it may be just small.

Good. Then benchmark that in support of your change.

kiyonlin · 2020-11-29T05:44:46Z

Sometimes benchmarks should take into account that there is a dependency chain since the result will be used. I dont think its that important here just something to keep in mind when benchmarking:

@martisch Ok thanks for the explanation!

kiyonlin · 2020-11-29T05:51:21Z

@davecheney I don't think I can provide a benchmark about malg but just round2. The systemstack is related to the compiler, right?

davecheney · 2020-11-29T06:00:26Z

@davecheney I don't think I can provide a benchmark about malg but just round2. The systemstack is related to the compiler, right?

From my vague understanding system stack is used when the goroutine needs to make a call using the underlying threads stack, not the goroutines. I don’t know the specifics, sorry.

Look. I appreciate your enthusiasm, and don’t want you to be discouraged, but it’s important to understand the cost of any change is large, especially when it’s inside the runtime. It has to be supported with clear evidence that the change is beneficial in real life, not just on paper.

Lastly, the reason round2 got faster with your change is it became inlinable. I encourage you to research the series of changes landing in 1.16 (or maybe 1.17, I’m not following closely) that allow functions with a for loop to be inlinable. This may reduce the benefit of your change

martisch · 2020-11-29T06:07:33Z

From profiling data that I know I can confirm that round2 is not worth optimising for performance.

If we would have gotten payed for the change then we already would spend way more for the time discussing the change then it can get back in CPU resource costs.

kiyonlin · 2020-11-29T06:14:45Z

Look. I appreciate your enthusiasm, and don’t want you to be discouraged, but it’s important to understand the cost of any change is large, especially when it’s inside the runtime. It has to be supported with clear evidence that the change is beneficial in real life, not just on paper.

I totally understand.

Lastly, the reason round2 got faster with your change is it became inlinable. I encourage you to research the series of changes landing in 1.16 (or maybe 1.17, I’m not following closely) that allow functions with a for loop to be inlinable. This may reduce the benefit of your change

@davecheney and @martisch Thanks for the information

martisch · 2020-11-29T06:17:46Z

@kiyonlin thanks for the suggestion. If this would have been indeed in a hot path it could have been worth while optimising but it is not in a hot path called often enough. There are better gains elsewhere in the runtime lets focus on finding and optimising those.

kiyonlin · 2020-11-29T06:18:28Z

Thank you @davecheney and @martisch for your time and the excellent discussions.

kiyonlin · 2020-11-29T06:28:59Z

From profiling data that I know I can confirm that round2 is not worth optimising for performance.

@martisch Last question, could I know how to profile this? I'm curious.

martisch · 2020-11-29T06:50:37Z

This may give you a starting point: https://blog.golang.org/pprof

kiyonlin · 2020-11-29T12:33:16Z

@martisch Okey and may I know which data should I care/notice?

martisch · 2020-11-29T17:48:52Z

Profile the program you care about (not a benchmark, a real program and use case) and see how much pprof reports is spend in round2.

kiyonlin · 2020-11-30T00:24:33Z

Profile the program you care about (not a benchmark, a real program and use case) and see how much pprof reports is spend in round2.

@martisch I'll try to investigate that. Thanks again for your patience!

kiyonlin mentioned this issue Nov 28, 2020

Improve round2 performance valyala/fasthttp#914

Merged

kiyonlin closed this as completed Nov 29, 2020

golang locked and limited conversation to collaborators Nov 30, 2021

gopherbot added the FrozenDueToAge label Nov 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: Improve performance of round2 #42865

runtime: Improve performance of round2 #42865

kiyonlin commented Nov 28, 2020 •

edited

gopherbot commented Nov 28, 2020

davecheney commented Nov 28, 2020 •

edited

davecheney commented Nov 28, 2020

kiyonlin commented Nov 28, 2020 •

edited

davecheney commented Nov 28, 2020

kiyonlin commented Nov 28, 2020 •

edited

davecheney commented Nov 28, 2020 •

edited

kiyonlin commented Nov 28, 2020 •

edited

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020 •

edited

davecheney commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

kiyonlin commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020

martisch commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

kiyonlin commented Nov 29, 2020 •

edited

martisch commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 30, 2020

runtime: Improve performance of round2 #42865

runtime: Improve performance of round2 #42865

Comments

kiyonlin commented Nov 28, 2020 • edited

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What did you do?

gopherbot commented Nov 28, 2020

davecheney commented Nov 28, 2020 • edited

davecheney commented Nov 28, 2020

kiyonlin commented Nov 28, 2020 • edited

davecheney commented Nov 28, 2020

kiyonlin commented Nov 28, 2020 • edited

davecheney commented Nov 28, 2020 • edited

kiyonlin commented Nov 28, 2020 • edited

martisch commented Nov 29, 2020 • edited

kiyonlin commented Nov 29, 2020 • edited

davecheney commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

martisch commented Nov 29, 2020 • edited

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

kiyonlin commented Nov 29, 2020 • edited

kiyonlin commented Nov 29, 2020

davecheney commented Nov 29, 2020

martisch commented Nov 29, 2020 • edited

kiyonlin commented Nov 29, 2020

martisch commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

kiyonlin commented Nov 29, 2020 • edited

martisch commented Nov 29, 2020

kiyonlin commented Nov 29, 2020

martisch commented Nov 29, 2020 • edited

kiyonlin commented Nov 30, 2020

kiyonlin commented Nov 28, 2020 •

edited

What version of Go are you using (`go version`)?

davecheney commented Nov 28, 2020 •

edited

kiyonlin commented Nov 28, 2020 •

edited

kiyonlin commented Nov 28, 2020 •

edited

davecheney commented Nov 28, 2020 •

edited

kiyonlin commented Nov 28, 2020 •

edited

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020 •

edited

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020 •

edited

martisch commented Nov 29, 2020 •

edited

kiyonlin commented Nov 29, 2020 •

edited

martisch commented Nov 29, 2020 •

edited