
math: make benchmark iterations more data-dependent #33441

Closed
smasher164 opened this issue Aug 2, 2019 · 3 comments

Comments

@smasher164
Member

As discussed in CL 127458, benchmarks should be updated to make each iteration depend on the result of the previous one. This reduces the chance of the work done in the loop being optimized away, and also feeds the benchmarks a larger input space. For instance, the FMA benchmark was changed from

func BenchmarkFma(b *testing.B) {
	x := 0.0
	for i := 0; i < b.N; i++ {
		x = Fma(E, Pi, Phi)
	}
	GlobalF = x
}

to

func BenchmarkFma(b *testing.B) {
	x := 0.0
	for i := 0; i < b.N; i++ {
		x = Fma(E, Pi, x)
	}
	GlobalF = x
}
@mundaym
Member

mundaym commented Aug 3, 2019

I think we should be careful doing this, at least when measuring the speed of operations that are only a couple of instructions:

  1. This change makes the benchmark measure the latency of the operation rather than its throughput, which may not be the right choice for every benchmark. Floating point operations often have high latency but also high throughput thanks to pipelining. Which of the two we care more about depends on the application.

  2. Making the input value dependent on the number of iterations the benchmark runs for (i.e. b.N) isn't ideal if the speed of the operation depends on said input value since it may lead to the benchmark speeding up or slowing down the longer it is run for. For example, the benchmark may end up in a steady state that happens to be an edge case where the operation is a no-op or unusually expensive.

One possible solution is to add a small array of input values (with a length of a power of 2 to avoid division instructions) and iterate over those for throughput benchmarks. The cost of the indexing and load should be low. Then add a separate latency benchmark for operations where we are particularly interested in latency (probably only the single instruction operations such as FMA, Sqrt, Abs, Copysign etc.).
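A minimal sketch of that split (the benchmark names, the Sqrt operation, and the input array are illustrative choices here, not taken from the issue):

```go
package main

import (
	"fmt"
	"math"
	"testing"
)

// GlobalF is a sink that keeps the compiler from eliminating the loop body.
var GlobalF float64

// Hypothetical input set; the length is a power of two so the wrap-around
// index is a cheap bitwise AND rather than a division.
var inputs = [4]float64{0.5, 1.0, math.Pi, 1e10}

// Throughput: iterations are independent of each other, so pipelined
// sqrt operations can overlap in the CPU.
func BenchmarkSqrtThroughput(b *testing.B) {
	x := 0.0
	for i := 0; i < b.N; i++ {
		x = math.Sqrt(inputs[i&3]) // i % 4, via bitmask
	}
	GlobalF = x
}

// Latency: each iteration consumes the previous result, which serializes
// the operations and measures the dependency-chain latency instead.
func BenchmarkSqrtLatency(b *testing.B) {
	x := 2.0
	for i := 0; i < b.N; i++ {
		x = math.Sqrt(x)
	}
	GlobalF = x
}

func main() {
	fmt.Println("throughput:", testing.Benchmark(BenchmarkSqrtThroughput))
	fmt.Println("latency:   ", testing.Benchmark(BenchmarkSqrtLatency))
}
```

Note that the latency variant still settles into a steady state (repeated Sqrt converges toward 1), which is exactly the caveat raised in point 2 above.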

@nsajko
Contributor

nsajko commented Aug 3, 2019

This CL is probably the better, more robust solution, since runtime.KeepAlive is essentially guaranteed not to be optimized out: https://go-review.googlesource.com/c/go/+/188437
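A sketch of the runtime.KeepAlive pattern (the exact benchmarks in CL 188437 may differ; the Sqrt example here is illustrative):

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"testing"
)

// runtime.KeepAlive marks its argument as live, so the compiler cannot
// treat the computation that produced it as dead code. Its parameter
// type is interface{}, so passing a float64 converts it to an interface
// value at the call site; whether that conversion boxes the float is
// the concern raised in the next comment.
func BenchmarkSqrtKeepAlive(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runtime.KeepAlive(math.Sqrt(math.Pi))
	}
}

func main() {
	fmt.Println(testing.Benchmark(BenchmarkSqrtKeepAlive))
}
```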

@smasher164
Member Author

If there's no boxing that happens with runtime.KeepAlive, then @nsajko's CL is probably a reasonable option. Especially because as @mundaym points out, we would have to first differentiate and duplicate the benchmarks based on the metric we care about (latency vs throughput), and then define additional input values for each of those benchmarks to avoid reaching a steady state. That seems like a lot of work for little return.

@golang locked and limited conversation to collaborators Aug 4, 2020