
testing: add -benchtime=100x (x suffix for exact count) #24735

Closed
wmorrow opened this issue Apr 6, 2018 · 33 comments
Labels
FrozenDueToAge, NeedsFix, Proposal, Proposal-Accepted

@wmorrow

wmorrow commented Apr 6, 2018

Currently, benchmarks built using testing's Benchmark interface run a variable number of iterations, auto-adjusting so that each benchmark runs for at least -benchtime seconds. This complicates HW PMU counter collection and A/B comparisons because the amount of work is (potentially) variable and the time-adjustment code can muddle the benchmark results. It can also easily overshoot the amount of time you expect it to take:

$ go test archive/zip -benchtime 30s -run=Benchmark -bench=BenchmarkZip64Test$
goos: linux
goarch: amd64
pkg: archive/zip
BenchmarkZip64Test-8        1000          59382178 ns/op
PASS
ok      archive/zip     65.380s

The proposal

Add a new flag to go test that bypasses the adjustment process and runs a benchmark for an exact, user-defined number of iterations:

$ go test archive/zip -benchiterations 500 -run=Benchmark -bench=BenchmarkZip64Test$
goos: linux
goarch: amd64
pkg: archive/zip
BenchmarkZip64Test-8         500          59419401 ns/op
PASS
ok      archive/zip     29.775s
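
For context, the flag would change only how the testing framework picks b.N; the benchmark body itself stays the same. A minimal sketch of such a benchmark (hypothetical workload, not the actual archive/zip code):

package example

import "testing"

// doWork stands in for the operation being measured (hypothetical).
func doWork() {}

// The framework chooses b.N: today it grows it until the run takes at least
// -benchtime, whereas the proposed flag would pin it to an exact count.
func BenchmarkExample(b *testing.B) {
	for i := 0; i < b.N; i++ {
		doWork()
	}
}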

References

A draft of this change is already on Gerrit (+92617)

@gopherbot added this to the Proposal milestone Apr 6, 2018
@bcmills added the NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made) label Apr 6, 2018
@bcmills
Contributor

bcmills commented Apr 6, 2018

(CC @rsc @josharian; see also #19128 and #10930.)

@rsc
Contributor

rsc commented Apr 9, 2018

I'm reluctant to add more complexity to the benchmark flags. We just added benchsplit last round.
/cc @aclements

@josharian
Contributor

The main time I've wanted this recently is when working around benchmarks that are non-linear (i.e. broken) and hard to fix. But I share the reluctance to add more complexity here.

@aclements
Member

@dr2chase had some recent situation where this would have been useful, though I'm blanking on the details.

I'm also reluctant to add more flags. Sometimes, however, the non-linearity isn't the benchmark's fault, but GC's. For example, 1,000 iterations vs 2,000 iterations could be the difference between 1 GC cycle and 3 GC cycles.

@rsc
Contributor

rsc commented Apr 16, 2018

It bothers me that this is only meaningful for a single benchmark at a time. Is there something different we can do that addresses the underlying problem?

@aclements
Member

If we were less aggressive about rounding to "nice" iteration counts, the effect of this would be damped.

For smoothing out GCs specifically, we could track the allocation rate of the benchmark to predict the number of GC cycles. If the number of cycles would be small, bias the next selected iteration count toward a nice value that would stay away from the edge between n and n+1 GC cycles.
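
A rough sketch of that second idea (purely illustrative, not testing-package code), assuming we can estimate both the bytes one iteration allocates and the allocation headroom the pacer allows between collections:

package benchgc

// estimatedGCCycles guesses how many GC cycles n iterations would trigger,
// given bytes allocated per iteration and the bytes of allocation that
// separate one collection from the next (both assumed to be measurable).
func estimatedGCCycles(n, allocPerIter, bytesPerCycle int64) float64 {
	return float64(n*allocPerIter) / float64(bytesPerCycle)
}

// nudge backs n off slightly when the estimate sits just below the edge
// between k and k+1 cycles, so a small change in n cannot flip the cycle count.
func nudge(n, allocPerIter, bytesPerCycle int64) int64 {
	c := estimatedGCCycles(n, allocPerIter, bytesPerCycle)
	if frac := c - float64(int64(c)); frac > 0.85 {
		return int64(float64(n) * 0.8)
	}
	return n
}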

@cespare
Contributor

cespare commented Apr 16, 2018

Possibly related to #23423.

If we were less aggressive about rounding to "nice" iteration counts, the effect of this would be damped.

This behavior has annoyed me in the past. On #23423 I wrote

[...] it would've been nice if go test had picked a smaller value than 1000 here. It took 10s when I asked for 5s.

@wmorrow
Author

wmorrow commented Apr 16, 2018

From my perspective the goal is to have some way to perform a set amount of work in a benchmark, rather than a (potentially) variable amount based on execution time or some other heuristic. Anything that can fix that amount of work to a reproducible value would be fine.

Specifically I'm thinking of cases where I want to compare hardware counters on two different hosts or between two versions of Go (with and without some patch). Assuming the two instances (hosts/toolchains/etc) have some different performance profile, the faster instance will likely perform a different amount of work than the slower instance if only because it uses different iteration counts when scaling. If the two instances also settle to different iteration counts the problem is exacerbated.

Another strategy I can think of off the top of my head is to read some datafile (perhaps a previous log?) and use the number of iterations set there. It's certainly not elegant but I'm also dubious that a heuristic based on measured performance of the instance can accurately reproduce a set amount of work. I think at some point you'll have to have an iteration count recorded or passed somewhere.

@meirf
Contributor

meirf commented Apr 21, 2018

I'm reluctant to add more complexity to the benchmark flags. We just added benchsplit last round.

As a quick note, the benchsplit proposal was accepted but the feature was not added; the review hasn't been merged. I'm not mentioning this as a nit; rather, there is some debate on that review which relates to the complexity/reluctance (and needs a decision to move forward).

https://go-review.googlesource.com/c/go/+/47411/2/src/testing/benchmark.go#317
https://go-review.googlesource.com/c/go/+/47411/5/src/testing/benchmark.go#296

@rsc
Contributor

rsc commented May 7, 2018

It seems like if you want to do right by performance counters generally, you need to have the testing package collect them around the runs and report them, including dividing by the iteration count. For example the 1000 iterations are preceded by 1, 10, and 100 iterations, trying to guess how many will be needed for benchtime. Really you don't want to see the performance counters for those trial runs at all. I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

Alternatively, the big jumps you see, where sometimes a benchmark runs for 1000 iterations and sometimes 2000 to reach benchtime, would go away if we stopped rounding so aggressively, as @aclements suggested in #24735 (comment). Would that be enough to address the issue here, @wmorrow?

Originally I put the "nice" numbers in for eye-balling times more easily, but it really makes little sense given that you need to run 10 or so and run them through benchstat to have reliable statistics anyway.
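
For reference, the comparison workflow that makes the "nice" rounding unnecessary looks roughly like this (file names are placeholders):

$ go test archive/zip -run=NONE -bench=BenchmarkZip64Test$ -count=10 > old.txt
  (apply the change being measured)
$ go test archive/zip -run=NONE -bench=BenchmarkZip64Test$ -count=10 > new.txt
$ benchstat old.txt new.txt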

@josharian
Contributor

a12cc71 took a small step towards removing aggressive rounding. I'm all for ditching it entirely. We should consider eliminating or reducing the multiplier as well. (See that commit message for details.)

@aclements
Member

I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

#21295

@wmorrow, it seems like any performance counter comparison needs to account for warm-up and divide by the number of iterations anyway. If you're using perf for this, it seems like perf record --switch-output (maybe with -s if you just want counts) would make it easy to capture just the last run of the benchmark, and then the counts should be divided by the iteration count reported in the benchmark output.

@gopherbot

Change https://golang.org/cl/112155 mentions this issue: testing: stop rounding b.N

@josharian
Contributor

I wanted to see the impact, so I hacked together CL 112155.

Thinking about that CL raises another concern.

Suppose we have a benchmark with high variance. We use our estimates to try to get near the benchtime. We are more likely to exceed the benchtime if we get a particularly slow run. A particularly fast run is more likely to trigger another benchmark run.

The current approach thus introduces bias.

One simple way to fix this would be to decide when our estimate is going to be "close enough", that is, when we are one iteration away from being done, and then stick with that final iteration even if it falls short of the benchtime.

There are probably other fixes, too...including setting a fixed number of iterations up front. :)

@josharian
Contributor

Oh, and note that making our iteration estimates ever more accurate (e.g. by eliminating rounding) actually exacerbates the bias-for-slow-runs problem, rather than ameliorating it.

@wmorrow
Author

wmorrow commented May 8, 2018

I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

I agree; I think you'd end up having to rewrite perf etc. or call out to it, and probably lose some expressiveness in the process.

If you're using perf for this, it seems like perf record --switch-output (maybe with -s if you just want counts) would make it easy to capture just the last run of the benchmark, and then the counts should be divided by the iteration count reported in the benchmark output.

If I'm reading this correctly you're suggesting adding a SIGUSR2 (or some other signal) when restarting the benchmark with some new N so that the outer perf can segment the perf.data files accordingly?

That could work to some extent (at least to ensure there's some known amount of work represented in the perf record), but it seems like a complicated solution to the problem. I suppose you save on having to define iteration counts for each benchmark but you have to do more work after the run to ensure you're matching the right perf data to the right benchmark. It's also not portable for collecting data in situations where we're not using perf (as an example we use MAMBO to track visited basic blocks).

josharian added a commit to josharian/go that referenced this issue May 8, 2018
The original goal of rounding to readable b.N
was to make it easier to eyeball times.
However, proper analysis requires tooling
(such as benchstat) anyway.

Instead, take b.N as it comes.
This will reduce the impact of external noise
such as GC on benchmarks.

This requires doing our iteration estimation in floats instead of ints.
When using ints, extremely fast (sub-nanosecond) benchmarks
always get an estimated ns/op of 0, and with the reduced
rounding up, they converged much too slowly.

It also reduces the wall time required to run benchmarks.

Here's the impact of this CL on the wall time to run
all benchmarks once with benchtime=1s on some std packages:

name           old time/op       new time/op       delta
bytes                 306s ± 1%         238s ± 1%  -22.24%  (p=0.000 n=10+10)
encoding/json         112s ± 8%          99s ± 7%  -11.64%  (p=0.000 n=10+10)
net/http             54.7s ± 7%        44.9s ± 4%  -17.94%  (p=0.000 n=10+9)
runtime               957s ± 1%         714s ± 0%  -25.38%  (p=0.000 n=10+9)
strings               262s ± 1%         201s ± 1%  -23.27%  (p=0.000 n=10+10)
[Geo mean]            216s              172s       -20.23%

Updates golang#24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
@warpfork

I have a very similar desire to what @wmorrow describes: I want to compare profiling info across code changes, or across packages which implement similar interfaces, or of the same code on different hardware, and so forth -- all situations where the iteration count detected by the bench tool is (quite sensibly) going to vary -- and have all the numbers be reasonably comparable with a bare minimum of further normalization.

In order to get stable, human-handy info like this, it seems I want either to be able to declare a fixed number for b.N, or to have some way to get that number out afterwards so I can divide e.g. prof stats into average-per-op numbers.

A potential hack to work around this would be to also write Test1000Times_Foo functions to match every Benchmark_Foo function I have in the codebase, but this seems like a lot of boilerplate. In practice, the subjects of my benchmark functions are also what I want to evaluate any pprof'ing, etc., against, so it would be nice if the test and benchmark run tooling would DWIM for this.

@rsc
Contributor

rsc commented Jul 23, 2018

We're discussing a lot of complexity, but even if we add it, it's hard to see whether it solves the real problems with warm-ups and the like. The SIGUSR2 idea is kind of weird but interesting. I bet if we sent SIGUSR2 to test binaries in general, some packages would get mad, like when we tried to have package testing install a SIGINT handler.

@theckman
Contributor

theckman commented Aug 5, 2018

To add my color from the other issue I opened (since GitHub search didn't lead me here):

I'm in the process of helping do a technical review of a Go book for newbies, and the author wants to show the performance benefits of using concurrency for web requests over serializing them. Originally, the author wanted to teach readers how to write benchmarks first, and to then use these benchmarking skills to compare a concurrent vs serial implementation. The issue here is that because b.N can't be set to a constant value of 1, the author doesn't want to cause a DDoS against golang.org (and another site) from people running this benchmark. Because we're looking to measure the performance difference on the magnitude of milliseconds, running the benchmark once should give us an accurate-enough result to show the performance improvement.

I have been in a similar spot too, where I wanted to benchmark something but only wanted the high-level/one run result because of the overall complexity of the thing I'm benchmarking. It would have at least reduced the time I was waiting for benchmarks to finish. As I recall, I ended up having to implement a package main that invoked my function with a time.Since. I would have much rather written a benchmark and invoked go test with the right inputs to do what I wanted.

Edit: For posterity, here's the other reason I'd seen this asked for (from my comment in the other issue):

While I'm less convinced of this use case, another ask I've seen in the Go community for this functionality is around using a constant value for your benchmarks so that you can compare the total wallclock time of the benchmarks. The reason I'm not convinced is that most people do (and should) care about the ns/op result on a per-benchmark basis, not the total wallclock time it took the benchmark suite to run. I think they wanted to use this as a way to automatically detect large changes in performance, without needing to parse the results and keep track of each benchmark between runs.

@josharian
Contributor

FWIW, my reaction to the "I really just want it to run once" style of benchmarking is that package main and time.Since is the right answer. Otherwise you'll always have to remember to set b.N=1 on every benchmark invocation; the opportunity for accidental DDOS is high.

I'd argue that doing so might also make for good pedagogy: Start with time.Since, discuss the shortcomings of that approach, graduate to benchmarks, and discuss their benefits.
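
A minimal sketch of that package main + time.Since approach (fetchAll is a hypothetical stand-in for the serial or concurrent requests being compared):

package main

import (
	"fmt"
	"time"
)

// fetchAll is a placeholder for the code under measurement,
// e.g. fetching several URLs serially or concurrently.
func fetchAll() {
	time.Sleep(100 * time.Millisecond)
}

func main() {
	start := time.Now()
	fetchAll()
	fmt.Println("took", time.Since(start))
}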

@bcmills
Contributor

bcmills commented Aug 7, 2018

because b.N can't be set to a constant value of 1, the author doesn't want to cause a DDoS against golang.org (and another site) from people running this benchmark.

If the author is concerned about the amount of traffic, then:
a. Congratulations are in order for a successful book! 🙂
b. They should provide a server that readers can start locally to produce (or simulate) the desired latency characteristics.
c. Or, they should target the benchmark against the book's own website. Since traffic is proportional to readers, and readers are proportional to sales, the more traffic the book produces, the more machines the author can afford in order to handle the traffic!

More seriously, though: benchmarks are experiments, and a low N is bad for benchmarking for exactly the same reason that it's bad for a scientific study. Ideally we should compute it based on the measured variability rather than a fixed duration, but deliberately setting it much too low is not a good practice to teach either way.

Because we're looking to measure the performance difference on the magnitude of milliseconds, running the benchmark once should give us an accurate-enough result to show the performance improvement.

-benchtime already fills a similar niche: if you know that the operation only takes a few hundred milliseconds, just run it for a few hundred milliseconds.
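
For example, something along these lines (with a hypothetical benchmark name):

$ go test -bench=BenchmarkFetch$ -benchtime=300ms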

@theckman
Contributor

theckman commented Aug 7, 2018

The point was to keep the example simple and digestible for readers who are newer to programming. Adding more complexity to the example makes it easier for readers to become overwhelmed.

Thank you for the feedback.

@magical
Contributor

magical commented Aug 14, 2018

Like @warpfork, I'm interested in an option like this to help with profiling.

go test -cpuprofile makes it very easy to identify hotspots in the code, but once a hotspot is identified, I want to be able to reliably compare profiles both before and after a change. go tool pprof reports time in two units: percentage of runtime, and absolute runtime. It can't report time per iteration like go test -bench does because it has no concept of iterations. Percentage is out because, due to the potato paradox, comparing percentages can be extremely counterintuitive (if a function accounts for 99% of runtime and you make it twice as fast, it now takes 98%). That leaves absolute runtime, but if benchmarks always run for a constant amount of time, then this becomes as useless as percentages.

Things mostly work out today because, due to rounding, it is likely that go test -bench will select the same number of iterations each time--as long as your optimizations aren't extreme enough to kick you into a higher bucket. If CL 112155 goes in and rounding goes away, it seems like this problem will get worse.

My desired workflow goes something like this:

  1. pick a benchmark to work on
  2. run go test -bench=foo to find a suitable number of iterations N to run it for
  3. run go test -bench=foo -benchiter=N -run=NONE -cpuprofile=cpu.prof; go tool pprof -top cpu.prof to identify hot functions
  4. make changes to the code
  5. run go test -bench=foo -benchiter=N -run=NONE -cpuprofile=cpu.prof; go tool pprof -top cpu.prof again to see if changes had the intended effect by comparing absolute run times from step 3
  6. repeat steps 4-5

I worked around this issue at first by fudging with -benchtime to get the number of iterations I wanted, but later took the more direct route of patching a -benchiter option directly into go test.

I could do what @josharian suggested and write a main program which starts a CPU profile and runs a copy of the benchmark code for a fixed number of iterations, but it seems a shame to give up all the tooling around go test.

@rsc
Contributor

rsc commented Aug 20, 2018

@bradfitz suggested changing -benchtime to accept a little bit more than a time.Duration, so that you can say -benchtime=100x. Maybe that's good enough?

@josharian
Contributor

SGTM. I'd still like to finish up CL 112155 and think more about the bias concerns raised in #24735 (comment).

@magical
Contributor

magical commented Aug 21, 2018

SGTM too. I'd be happy to take a stab at implementing this.

Another thought: for the examples @theckman brought up about wanting to run a benchmark only once, it seems like a workaround would be to use -benchtime=0s, since benchmarks are always run at least once.

@gopherbot

Change https://golang.org/cl/130675 mentions this issue: testing: allow passing an iteration count to -benchtime

@magical
Contributor

magical commented Aug 22, 2018

Oh, I didn't notice @mrosier-qdt had already uploaded a CL. Oh well, here's my take on it.

@josharian
Contributor

SGTM. I'd still like to finish up CL 112155 and think more about the bias concerns raised in #24735 (comment).

I've broken out the bias concern as #27168. (And I'll eventually get back to the CL.)

@rsc changed the title from "proposal: testing: add option for running benchmarks a fixed number of iterations" to "testing: add -benchtime=100x (x suffix for exact count)" Sep 19, 2018
@rsc modified the milestones from Proposal to Go1.12 Sep 19, 2018
@rsc
Contributor

rsc commented Sep 19, 2018

Looks like people generally agree with -benchtime=100x and there is a pending CL. Proposal accepted.
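
With the accepted syntax, the count is passed to -benchtime with an x suffix, so the example from the original report becomes something like:

$ go test archive/zip -run=Benchmark -bench=BenchmarkZip64Test$ -benchtime=500x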

@rsc added the NeedsFix (The path to resolution is known, but the work has not been done) label Sep 26, 2018
@gopherbot removed the NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made) label Sep 26, 2018
@gopherbot

Change https://golang.org/cl/139258 mentions this issue: testing: implement -benchtime=100x

@magical
Contributor

magical commented Oct 9, 2018

@rsc I see you've uploaded a CL of your own. Should I abandon https://golang.org/cl/130675?

@bradfitz
Contributor

bradfitz commented Oct 9, 2018

@magical, sorry, I don't think we realized a CL was already open for it. I'll let @rsc decide how he wants to go forward.

gopherbot pushed a commit that referenced this issue Mar 20, 2019
The original goal of rounding to readable b.N
was to make it easier to eyeball times.
However, proper analysis requires tooling
(such as benchstat) anyway.

Instead, take b.N as it comes.
This will reduce the impact of external noise
such as GC on benchmarks.

This requires reworking our iteration estimates.
We used to calculate the estimated ns/op
and then divide our target ns by that estimate.
However, this order of operations was destructive
when the ns/op was very small; rounding could
hide almost an order of magnitude of variation.
Instead, multiply first, then divide.
Also, make n an int64 to avoid overflow.

Prior to this change, we attempted to cap b.N at 1e9.
Due to rounding up, it was possible to get b.N as high as 2e9.
This change consistently enforces the 1e9 cap.

This change also reduces the wall time required to run benchmarks.

Here's the impact of this change on the wall time to run
all benchmarks once with benchtime=1s on some std packages:

name           old time/op       new time/op       delta
bytes                 306s ± 1%         238s ± 1%  -22.24%  (p=0.000 n=10+10)
encoding/json         112s ± 8%          99s ± 7%  -11.64%  (p=0.000 n=10+10)
net/http             54.7s ± 7%        44.9s ± 4%  -17.94%  (p=0.000 n=10+9)
runtime               957s ± 1%         714s ± 0%  -25.38%  (p=0.000 n=10+9)
strings               262s ± 1%         201s ± 1%  -23.27%  (p=0.000 n=10+10)
[Geo mean]            216s              172s       -20.23%

Updates #24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
Reviewed-on: https://go-review.googlesource.com/c/go/+/112155
Reviewed-by: Austin Clements <austin@google.com>
@golang locked and limited conversation to collaborators Oct 12, 2019