proposal: testing: a less error-prone API for benchmark iteration #48768

Closed
bcmills opened this issue Oct 4, 2021 · 37 comments

@bcmills
Contributor

bcmills commented Oct 4, 2021

Background

It is currently difficult for Go users to write benchmarks that accurately measure what they appear to measure, for two reasons:

  1. Certain compiler optimizations are relatively easy to trigger accidentally, even when those optimizations are not intended for the code being benchmarked (testing: document best practices for avoiding compiler optimizations in benchmarks #27400).

  2. It is relatively easy for a benchmark to accidentally run the code to be benchmarked the wrong number of times, either by ignoring the passed-in b.N entirely (cmd/vet: flag benchmarks that don’t use b #38677), or by doing either one or N iterations on a data-structure or input of size N instead of N iterations on an input of constant cost (examples; a sketch of this pitfall follows the list).

    • We may be able to automatically flag tests that fail to use b.N using an approach like the one in CL 230978, but I don't see how we could do the same to detect benchmarks that misuse b.N: some real APIs that use caches or global data-structures really do have weird timing dependencies on the number of prior calls.
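
For concreteness, here is a hedged sketch of the second pitfall; sum is an illustrative helper, not taken from any cited benchmark. Because the slice keeps growing, iteration i does O(i) work, so the measured cost per operation depends on b.N instead of being constant:

func BenchmarkAppendAndSumBroken(b *testing.B) {
	var s []int
	for i := 0; i < b.N; i++ {
		s = append(s, i)
		sum(s) // cost grows with the iteration count, so ns/op depends on b.N
	}
}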

Proposal

I propose that we add the following methods to *testing.B, inspired by F.Fuzz (#44551) and B.RunParallel, and deprecate explicit use of the N field.

package testing

// Iterate accepts an argument of any function type and zero or more arguments to pass to that function.
// Iterate then invokes the function, sequentially, exactly b.N times with those arguments.
//
// The compiler will not optimize through the function's arguments or return values,
// so arguments passed to the function will not be subject to constant-propagation
// optimizations, and values returned by it will not be subject to dead-store elimination.
func (b *B) Iterate(f interface{}, args ...interface{})

// IterateParallel is like Iterate, but invokes the function on multiple goroutines in parallel.
//
// The number of goroutines defaults to GOMAXPROCS, and can be increased by calling
// SetParallelism before IterateParallel.
func (b *B) IterateParallel(f interface{}, args ...interface{})

Specifically, the Iterate method replaces a for-loop on b.N, and the IterateParallel method replaces a call to b.RunParallel with a much simpler API.
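
For illustration, here is a hedged sketch of that replacement; render and input are placeholder names, and IterateParallel is of course the proposed API rather than anything that exists today:

// Today: an explicit pb.Next loop that is easy to get subtly wrong.
func BenchmarkRenderParallel(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			render(input)
		}
	})
}

// Under this proposal: no loop, and no pb.Next to forget.
func BenchmarkRenderIterateParallel(b *testing.B) {
	b.IterateParallel(render, input)
}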

Examples

The example cited in #27400 comes from @davecheney's 2013 blog post, How to write benchmarks in Go. The naive benchmark looks like this:

func BenchmarkFib10(b *testing.B) {
	// run the Fib function b.N times
	for n := 0; n < b.N; n++ {
		Fib(10)
	}
}

At the end of the post, the author suggests defeating optimizations by writing to a global variable:

var result int

func BenchmarkFibComplete(b *testing.B) {
	var r int
	for n := 0; n < b.N; n++ {
		// always record the result of Fib to prevent
		// the compiler eliminating the function call.
		r = Fib(10)
	}
	// always store the result to a package level variable
	// so the compiler cannot eliminate the Benchmark itself.
	result = r
}

However, there are at least two problems with that approach:

  1. Because the input to the Fib function is a constant and the function is pure, the compiler may still inline the function call and hoist the entire computation out of the loop.
  2. A correct Go compiler may determine that all stores to result are dead, and apply dead-code elimination to the b.N loop.

I regard the author of that blog post as a Go expert. If even he can't get it right, I don't expect that most Go novices can either. I am an active Go maintainer, and I also run into this problem on a semi-regular basis (#27400 (comment), #48684 (comment)).

In contrast, under this proposal, the simple way to write the benchmark would also be the correct way:

func BenchmarkFib10(b *testing.B) {
	b.Iterate(Fib, 10)
}

Another example by the same author, cited by @gbbr in #27400 (comment), explicitly uses the current iteration number as an input to the function under test:

var Result uint64

func BenchmarkPopcnt(b *testing.B) {
	var r uint64
	for i := 0; i < b.N; i++ {
		r = popcnt(uint64(i))
	}
	Result = r
}

This example mostly works around the compiler-optimization problem — assuming that the compiler and linker aren't clever enough to eliminate the dead store to the write-only Result variable — but may again mask nonlinear behavior that depends on the specific value of N. That is: it mostly avoids the first pitfall, but only by falling headlong into the second one.

This function would be both simpler and more linear using Iterate:

func BenchmarkPopcnt0(b *testing.B) {
	b.Iterate(popcnt, 0)
}

If the user actually intends for the test to benchmark popcnt with a variety of interleaved inputs, they can make that more explicit:

func BenchmarkPopcntInterleaved(b *testing.B) {
	inputs := make([]uint64, 512)
	for i := range inputs {
		inputs[i] = rand.Uint64()
	}
	b.Iterate(func(inputs []uint64) uint64 {
		var result uint64
		for _, x := range inputs {
			result += popcnt(x)
		}
		return result
	}, inputs)
}

Or, to restore the explicit dependency on i, for the specific cases where the author believes it not to be harmful:

func BenchmarkPopcntInterleaved(b *testing.B) {
	var i uint64
	b.Iterate(func() uint64 {
		result := popcnt(i)
		i++
		return result
	})
}

Compatibility

This proposal is a strict addition to the testing.B API. As a strict addition, it would not change or invalidate the behavior of any existing benchmarks. (On the other hand, it would also do nothing to detect problematic optimizations or nonlinearities in existing benchmarks.)

Implementation

The proposed API could theoretically be implemented using the reflect package today. However, care would need to be taken to ensure that such an implementation does not add noise or bias to the benchmarks, given the currently high cost and garbage-generation rate of reflect.Call (#7818, #43732).

If it is not practical to modify reflect.Call to provide stable, fast performance, the compiler itself could special-case these functions and compile them down to more efficient equivalents, ensuring that the documented optimization properties continue to be respected (CC @randall77, @josharian).
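
For concreteness, a minimal sketch of what a purely reflect-based fallback might look like; this is not a proposed implementation, and it omits argument validation details and the anti-optimization guarantees described above:

func iterate(b *testing.B, f interface{}, args ...interface{}) {
	fv := reflect.ValueOf(f)
	if fv.Kind() != reflect.Func {
		b.Fatalf("Iterate: want a function, got %T", f)
	}
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		in[i] = reflect.ValueOf(a)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		fv.Call(in) // each Call allocates, which is the noise concern above
	}
}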

@randall77
Contributor

I think the API would be simpler with just

func (b *B) Iterate(f interface{})

And you need to make a closure if you need to capture anything, including arguments.
It seems you will often have to make a closure anyway.

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

@randall77, the problem with the closure API is that it doesn't address the compiler-optimization issues.

That would make it too easy to write something like

func BenchmarkFib10(b *testing.B) {
	b.Iterate(func() { Fib(10) })
}

which wouldn't really address #27400 at all.

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

I guess another way to put it is: one advantage of the explicit-argument API is that it encourages users to write their benchmark in terms of an exported function or method (perhaps expressed as a method value) instead of a wrapper around that function or method.

Benchmarking a wrapper instead of the actual function is exactly what leads to the complex interactions with compiler optimizations, so avoiding the need for wrappers — at least in the typical case — is an important aspect of this proposal.
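
As a hypothetical example of that style (strings.NewReplacer stands in for whatever exported API is under test), the benchmark passes a method value and its argument directly, with no wrapper for the compiler to see through:

func BenchmarkReplace(b *testing.B) {
	r := strings.NewReplacer("a", "b")
	b.Iterate(r.Replace, "a reasonably short input string")
}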

@rogpeppe
Contributor

rogpeppe commented Oct 4, 2021

Won't the overhead of calling a closure (or, even worse, invoking reflect.Call with its associated allocations) add hard-to-deal-with overhead to the body of low cost benchmarks?

@rogpeppe
Contributor

rogpeppe commented Oct 4, 2021

To me, B.RunParallel seems like a better model than F.Fuzz in that respect.

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

Won't the overhead of calling a closure (or, even worse, invoking reflect.Call with its associated allocations) add hard-to-deal-with overhead to the body of low cost benchmarks?

Yes. The proposed API may require specialization in order to provide a good enough noise floor. (See the “Implementation” section of the original post.)

To me, B.RunParallel seems like a better model than F.Fuzz in that respect.

B.RunParallel does indeed avoid the reflect.Call overhead problem — but unfortunately does not help with the wrapper–optimization interaction or the possibility of forgetting to call PB.Next. As far as I can tell, a PB-style API would only help with the problem of doing other-than-O(1) work per iteration.
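
A hedged sketch of that last pitfall, for concreteness: if pb.Next is never called, the body runs only once per goroutine rather than b.N times in total, so the reported ns/op is meaningless.

func BenchmarkFib10Broken(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		Fib(10) // BUG: no "for pb.Next()" loop, so this is not repeated b.N times
	})
}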

@renthraysk

renthraysk commented Oct 4, 2021

How about providing two sets of arguments for the function to be benched? Iterate() could then randomly pick one for each call. Guessing that most of the time you would want each set of arguments to be used equally, but in an unpredictable order.

func BenchmarkFib(b *testing.B) {
	args := [2]int{10, 20}

	r := rand.Uint32()
	i := uint64(r)<<32 | uint64(^r)
	for n := b.N % 64; n < b.N; n++ {
		i = bits.RotateLeft64(i, 1)
		Fib(args[i&1])
	}
}

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

@renthraysk, I considered also including something like

// Generate accepts an argument of any function type and a 'next' function that produces values
// assignable to f's arguments.
// Generate then invokes f(next()), sequentially, exactly b.N times,
// with each invocation using a set of values from a distinct call to next.
func (*B) Generate(f interface{}, next interface{})

But, I'm not sure that the benefits of that style of method outweigh its complexity, given that the equivalent Iterate call is itself not too bad:

func BenchmarkFib(b *testing.B) {
	args := [2]int{10, 20}

	r := rand.Uint32()
	i := uint64(r)<<32 | uint64(^r)
	b.Iterate(func() int {
		i = bits.RotateLeft64(i, 1)
		return Fib(args[i&1])
	})
}

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

I guess that Generate could have some advantages in terms of preventing compiler optimizations based on the set of possible inputs. (I'm somewhat ambivalent about it, but I'd like to see more arguments in either direction.)

@renthraysk

@bcmills The problem in my example, and in the latter approach, is that for each argument set to be used equally, b.N has to be a multiple of 64. Otherwise the computed stats will be off.

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

@renthraysk, benchmarks are noisy anyway. If being off by one iteration makes a difference, your value of N is too low.

@josharian
Contributor

If being off by one iteration makes a difference, your value of N is too low.

Some useful, reliable benchmarks naturally take multiple seconds to run, and thus end up with N=1.

@josharian
Contributor

I think we would still need to expose b.N. It is not uncommon for a benchmark to do expensive O(b.N) setup work to enable accurate benchmarking.
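
A hedged sketch of that pattern, with makeInput and decode as illustrative names: the setup is proportional to b.N but excluded from the timing, so the loop body itself stays constant-cost.

func BenchmarkDecode(b *testing.B) {
	inputs := make([][]byte, b.N) // O(b.N) setup
	for i := range inputs {
		inputs[i] = makeInput(i)
	}
	b.ResetTimer() // exclude the setup from the measurement
	for i := 0; i < b.N; i++ {
		decode(inputs[i])
	}
}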

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

Some useful, reliable benchmarks naturally take multiple seconds to run, and thus end up with N=1.

That's true, but those benchmarks do not depend on b.N%k == 0 for any particular k.

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

I think we would still need to expose b.N. It is not uncommon for a benchmark to do expensive O(b.N) setup work to enable accurate benchmarking.

I am not proposing to remove b.N — only to deprecate it. 😁

That said, we shouldn't deprecate it if it is actually still needed in some cases. Can you give some examples of the sorts of benchmarks that would still need it?

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Oct 4, 2021
@renthraysk

renthraysk commented Oct 4, 2021

Some useful, reliable benchmarks naturally take multiple seconds to run, and thus end up with N=1.

That's true, but those benchmarks do not depend on b.N%k == 0 for any particular k.

A better version, in which the choice of args set alternates, while the random initial rotation defeats the compiler's optimizations:

func BenchmarkFib(b *testing.B) {
	args := [2]int{10, 20}

	i := bits.RotateLeft32(0x55555555, rand.Int())
	for n := 0; n < b.N; n++ {
		i = bits.RotateLeft32(i, 1)
		Fib(args[i&1])
	}
}

@josharian
Contributor

A quick grep for unusual uses of b.N in std turns up: runtime.BenchmarkAllocation, encoding/binary.BenchmarkReadStruct, expvar.BenchmarkRealworldExpvarUsage, runtime.BenchmarkMatmult, net.benchmarkTCP, and several others. (I sampled.)

Also, it's not possible to use testing.B.ReportMetric correctly in many cases without b.N.

@ianlancetaylor
Contributor

reflect.Call is expensive and I don't see that changing, so this approach will be difficult to use for micro-benchmarks.

As far as I know, the problems with micro-benchmarks all boil down to needing a way to force the computation of some value. If we add Benchmark.Sink(interface{}), then we can have Benchmark.Iterate(func()). Your earlier example would become b.Iterate(func() { b.Sink(Fib(10)) }).

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

@ianlancetaylor, as far as I can tell Sink would not help with the compiler constant-propagating inputs into inlined functions, even if their output is not eliminated. (See, for example, the attempted benchmarks in first post at #48684.)

That is:

	b.Iterate(func() { b.Sink(Fib(10)) })

does not prevent the compiler from turning Fib(10) into a compile-time constant and passing that constant to Sink.

Or, if it does, that would imply that the compiler applies special AST-rewriting magic to the expression passed to b.Sink. But if the compiler is rewriting ASTs for b.Sink, then why couldn't it also rewrite ASTs for calls to b.Iterate to avoid the overhead of reflect.Call — and provide a more ergonomic API — for approximately the same implementation cost?

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

@josharian, thanks for the references! Some analysis:

  • runtime.BenchmarkAllocation is probably just plain broken. It uses a memory footprint proportional to b.N, so the longer the -benchtime you run it for, the more memory it consumes, and at some value of N it just crashes (or, worse, swaps your machine into oblivion) instead of actually benchmarking anything useful. That's an extremely hostile sort of benchmark and not one that we should facilitate. It should probably instead use b.Run to run sub-benchmarks with varying hard-coded sizes (a sketch of that pattern follows below).

  • encoding/binary.BenchmarkReadStruct only depends on b.N to the extent that it checks whether b.N > 0. But as far as I understand b.N == 0 should never happen (testing: decide whether benchmark probe should use b.N==0 or b.N==1 #14863), so that part of the condition could simply be dropped.

  • expvar.BenchmarkRealworldExpvarUsage is splitting b.N into runtime.GOMAXPROCS(0) chunks. It should probably instead be replaced with a call to b.RunParallel.

  • runtime.BenchmarkMatmult is similar to runtime.BenchmarkAllocation, in that it is benchmarking one iteration on an input of size N rather than N iterations on a constant-sized input. It should be replaced with a test using b.Run, or perhaps someone should file a separate proposal for proper support for benchmarks that vary input-size rather than (just) running time.

  • net.benchmarkTCP is using b.N as a synchronization mechanism, spawning O(N) goroutines and leaving them in flight until the end of the benchmark. It seems to be benchmarking the Go scheduler's ability to keep up with go statements rather than any realistic property of the net package — and has the same problem of increasing memory usage arbitrarily as -benchtime increases.

TL;DR: as far as I can tell, all of the tests in that sample would be improved by eliminating the dependency on b.N.
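
For reference, a hedged sketch of the b.Run pattern suggested for the size-dependent cases above; makeMatrix and matmult are illustrative names, and the input sizes are hard-coded per sub-benchmark instead of being tied to b.N:

func BenchmarkMatmult(b *testing.B) {
	for _, n := range []int{16, 64, 256} {
		b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
			x, y := makeMatrix(n), makeMatrix(n)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				matmult(x, y)
			}
		})
	}
}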

@bcmills
Contributor Author

bcmills commented Oct 4, 2021

Also, it's not possible to use testing.B.ReportMetric correctly in many cases without b.N.

Neat! Per the documentation for ReportMetric:

If the metric is per-iteration, the caller should divide by b.N, and by convention units should end in "/op".

I can't even parse what that means — I would argue that it mostly points to the need for a more ergonomic ReportMetric replacement too. 😅

@josharian
Contributor

josharian commented Oct 4, 2021

I can't even parse what that means — I would argue that it mostly points to the need for a more ergonomic ReportMetric replacement too. 😅

Hmm. I find that quite clear, although that's unsurprising since I was involved in its design. How can I help you help me find a clearer way to express that? In case it is useful, examples of non-per-iteration metrics are cache hit rate, percent packet loss, and STW latency percentiles.

@ianlancetaylor
Contributor

as far as I can tell Sink would not help with the compiler constant-propagating inputs into inlined functions

That is true. I suppose we could add Benchmark.Source....

@bcmills
Contributor Author

bcmills commented Oct 5, 2021

We could, but B.Source and B.Sink also seem much more complicated to use than B.Iterate, and it's tricky to connect the value passed to Source back to the function's arguments.

I suppose it would have to be a standalone generic function, so that it could return a value? That might look like:

	b.Iterate(func() {
		b.Sink(Fib(testing.Source(10)))
	})

But then you would end up needing to call Source for each individual input, whereas Iterate requires only one call per benchmark — the signal-to-noise ratio for readers of the code seems much better for Iterate.

@bcmills
Contributor Author

bcmills commented Oct 5, 2021

@josharian, the terms I've seen from systems like Prometheus are “counter” (or “cumulative”) and “gauge”. Counter metrics are often meaningful per-op, whereas gauge metrics generally are not. So I suppose that the advice in the comment is really to report “counter” metrics averaged over the number of operations.

But this is a bit of a digression, because I don't think deprecating b.N would preclude users from reporting per-op metrics anyway. For example, cmd/go.BenchmarkExecGoEnv — which uses b.RunParallel rather than an explicit loop over b.N — already uses b.ReportMetric to report per-op metrics without explicitly referring to b.N:

	var n, userTime, systemTime int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			…
			atomic.AddInt64(&n, 1)
			atomic.AddInt64(&userTime, int64(cmd.ProcessState.UserTime()))
			atomic.AddInt64(&systemTime, int64(cmd.ProcessState.SystemTime()))
		}
	})
	b.ReportMetric(float64(userTime)/float64(n), "user-ns/op")
	b.ReportMetric(float64(systemTime)/float64(n), "sys-ns/op")

So I think the only change strictly needed to ReportMetric would be to adjust the doc comment to remove the reference to b.N. (We could consider other ergonomic improvements as a separate proposal, but it's not obvious to me that they'd be worth the extra API surface.)

@ianlancetaylor
Contributor

We could, but B.Source and B.Sink also seem much more complicated to use than B.Iterate, and it's tricky to connect the value passed to Source back to the function's arguments.

I agree. The difference is that I think I see how to implement Source and Sink, but I don't see how to implement Iterate, at least not in a way that makes it possible to write micro-benchmarks.

@bcmills
Contributor Author

bcmills commented Oct 6, 2021

No matter what we do, a benchmark that runs for N iterations will always be upward-biased by the O(N) overhead of running a loop for N iterations (i.e. one increment, one compare, and one predicted branch per iteration). As the benchmark function becomes smaller the constant factors start to matter more, but they're always there.

The different implementation possibilities for Iterate may change the constant factors.

  • Teaching the compiler to intrinsify b.Iterate — or implementing b.Iterate in terms of a simpler internal intrinsic added for that purpose — gives the lowest constant factors, but perhaps the most complex implementation.

  • reflect.Call gives the easiest implementation, but the highest constant factors. (It's probably at least fine as a fallback for platforms where we don't care to do the extra work to reduce those factors.)

  • If intrinsification isn't feasible, we could probably do a hybrid implementation that sets up the function arguments using reflect and then calls out to an assembly routine to invoke the function's closure N times with minimal additional function-call boilerplate. (FWIW, I think that assembly routine is essentially the “simpler internal intrinsic” from the first option, just located in the testing package instead of the compiler proper.)

@rsc
Contributor

rsc commented Oct 6, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Oct 6, 2021
@rsc
Contributor

rsc commented Oct 20, 2021

It sounds like the most important case where something new is needed is for microbenchmarks that might be optimized away entirely, but those are also the cases where the overhead of the function call is likely to be problematic. (That overhead is why we chose the loop to begin with.)

It doesn't seem like this discussion is converging to an answer.

@rsc
Contributor

rsc commented Nov 3, 2021

It seems like we are not converging, and I know that @bcmills wants to take more time to work through this, but we are up against the freeze and other work. Putting on hold until @bcmills is ready to engage again.

@rsc rsc moved this from Active to Hold in Proposals (old) Nov 3, 2021
@rsc
Contributor

rsc commented Nov 3, 2021

Placed on hold.
— rsc for the proposal review group

@aclements
Member

The different implementation possibilities for Iterate may change the constant factors.

I'm not sure this is a good idea, but another approach is to establish a baseline for that overhead and subtract it out. For example, the testing package could run b.Iterate on a no-op function to measure its overhead, and subtract this from benchmark results. There are of course hazards to doing this. Most likely, it would need to take minimum overhead over several samples. It may even need to run the overhead test before each benchmark to mitigate non-independence of benchmarks.
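
A minimal sketch of that calibration idea, assuming a hypothetical helper inside the testing package; the names are illustrative, not proposed API:

func iterateOverhead(b *B, samples int) time.Duration {
	noop := func() {}
	best := time.Duration(math.MaxInt64)
	for s := 0; s < samples; s++ {
		start := time.Now()
		for i := 0; i < b.N; i++ {
			noop() // stand-in for Iterate's per-call dispatch cost
		}
		if d := time.Since(start); d < best {
			best = d // take the minimum over several samples
		}
	}
	return best // subtracted from the benchmark's measured duration
}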

@ianlancetaylor, as far as I can tell Sink would not help with the compiler constant-propagating inputs into inlined functions,

Another option is to wrap the called function itself:

//go:noinline
func Keep[T any](v T) T { return v }

for i := 0; i < b.N; i++ {
  Keep(Fib)(10)
}

I'm not sure what a good name for this would be (it's similar to Rust's black_box, but I find that name, shall we say, opaque).

This doesn't help with the b.N issue like Iterate does, but I think it's much better than wrapping all of the inputs in Source and the result in Sink. Most likely this function would need to be understood by the compiler. It can be implemented today at an overhead of two instructions, as I did above, but I believe that forces any pointer arguments to escape.

@rsc
Contributor

rsc commented Jul 5, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc removed the Proposal-Hold label Jul 5, 2023
@rsc
Contributor

rsc commented Jul 5, 2023

The Keep proposal is #61179. It probably makes sense to have the discussion in one place, so I nominate the other proposal (#61179).

@rsc
Contributor

rsc commented Jul 12, 2023

Will mark this likely decline in favor of whatever we decide in #61179.

@rsc
Contributor

rsc commented Jul 12, 2023

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jul 19, 2023

No change in consensus, so declined.
— rsc for the proposal review group
