x/perf/cmd/benchstat: tips or quickstart for newcomers #23471

Open
mvdan opened this issue Jan 18, 2018 · 18 comments

@mvdan (Member) commented Jan 18, 2018

benchstat is a very useful tool, but if you're not familiar with what it does, it can be very confusing to use for the first time.

One such example is "how many times should my benchmarks run?". Someone coming from benchcmp, which only needs each benchmark to be run once before and once after a change, will find that benchstat produces something confusing like:

name                old time/op    new time/op    delta
Decode-4               2.20s ± 0%     1.54s ± 0%   ~     (p=1.000 n=1+1)

The answer here is that the user should run the benchmark more times - at least 3 or 4 - to get p-values low enough for a meaningful result.

However, neither benchstat -h nor the godoc page is very clear on this, and neither offers a "quickstart" guide. The godoc page does show an example input with many runs, and does talk about "a number of runs" and p-values, but it's not very clear if you're not familiar with statistics and benchmarking.

I believe that a quick guide would greatly improve the usability of the tool - for example:

$ go test -bench=. -count 5 >old.txt
$ <apply source changes>
$ go test -bench=. -count 5 >new.txt
$ benchstat old.txt new.txt
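
(For anyone who has never written one, here is a minimal benchmark that these commands would pick up. It goes in a _test.go file; the sum function is just a placeholder for the code under test.)

package sum

import "testing"

// sum is a placeholder for the code being measured.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// BenchmarkSum is run by go test -bench=. and reports ns/op
// (plus allocation stats when -benchmem is given).
func BenchmarkSum(b *testing.B) {
	xs := make([]int, 1024)
	for i := 0; i < b.N; i++ {
		sum(xs)
	}
}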

I think it should also introduce other best practices, such as the ones below (a combined command example follows the list):

  • Using higher -count values if the benchmark numbers aren't stable
  • Using -benchmem to also get stats on allocated objects and space
  • Running the benchmarks on an idle machine that is not running on battery (and with power management off?)
  • Adding -run='$^' or -run=- to each go test command to avoid also running the tests
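
As a rough sketch of how these tips combine with the quickstart above (a count of 10 is just an illustrative choice):

$ go test -run='^$' -bench=. -benchmem -count=10 >old.txt
$ <apply source changes>
$ go test -run='^$' -bench=. -benchmem -count=10 >new.txt
$ benchstat old.txt new.txt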

I realise that some of these tips are more about benchmarking than benchstat itself. But I think it's fine to have it all there, as in general you're going to be using that tool anyway.

/cc @rsc @ALTree @aclements @AlekSi

@aclements (Member)

These all sound like good things to include in the docs to me. If I don't get around to this myself, I'd be happy to review a CL.

I'd recommend doing at least 10 runs, if not 20 (where practical). Otherwise you might be tempted to compare them, see that the p-values are too high, and then get more samples, falling prey to the multiple testing problem. (I'd love for benchstat to support some form of sequential hypothesis testing, but alas.)

@AlekSi (Contributor) commented Jan 18, 2018

Daniel, thank you for creating this issue!

May I also request a short warning on stderr if the number of runs is not enough? Similar to the "no test files" message from go test.

@ALTree (Member) commented Jan 18, 2018

Where would we put this? I'd say the README.md file in the repo may be a good place: it's already quite wordy, but at the moment it's not terribly useful as an introduction to benchstat. It could be reorganized into a small tutorial-style document.

@mvdan (Member, Author) commented Jan 18, 2018

Well, I was thinking just the package godoc. That's where the current information on the tool is.

@mvdan (Member, Author) commented Jan 18, 2018

@AlekSi that might work, but it's trickier than it may sound. What should the minimum be? If the numbers are very stable (as the number of allocations for a deterministic program tends to be), 5 runs may be enough. Or you might not get any low p-values even if you do 20 runs.

However, such a warning could be added for 1 run, as that seems to be a common mistake.

@aclements 10 runs is probably better general advice. My default is 6 simply because I tend to split the benchmarks into multiple sub-benchmarks, meaning that even just 6 runs stalls me for a good minute. And I tend to go after substantial performance wins, in which case fewer than 10 runs are often enough.

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

@aclements (Member)

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

Benchmarking twice with -count 5 is fine, as long as you don't look at the results after the first run. :) (In fact, just in terms of running benchmarks, it's better to alternate individual runs back and forth in case there are systemic effects.)

The multiple testing problem arises when you look at the results and then decide whether to run more iterations based on the p-value. Since you'll be happy and stop if the p-value is small (or sad, if it got worse, but the same argument applies :), you bias toward accepting a "significant" result. But with a p-value of 0.05, you have a 5% chance that any random result will be "significant". In the extreme, if you imagine that you just keep running iterations until p < 0.05, you'll eventually get what you're looking for, even if it's not real.

There are ways to deal with this. For example, you can make the p-value threshold stricter (lower) each additional time you look at the results (the "Bonferroni correction"). But this sort of thing gets really complicated really fast, so the easiest rule is just to pick an n and stick with it. :)

In fact! The exact same problem arises if you run a whole bunch of separate benchmarks and look at whether any of them changed. You should expect 5% of them to show a statistically significant change, even if it's just noise. I don't think all of the sequential testing stuff I said above is worth explaining in the benchstat docs, but it might be worth mentioning this.
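
Concretely: with 20 independent benchmarks and a 0.05 threshold, the chance that at least one of them shows a spurious "significant" change is 1 - 0.95^20, or roughly 64%.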

@mvdan (Member, Author) commented Jan 18, 2018

You should expect 5% of them to show a statistically significant change, even if it's just noise.

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input. That would also make it easier to use higher -count values.

This should definitely be an item in the best practices list :)
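
A minimal sketch of what such a combined benchmark could look like, using go/parser as a stand-in for the parser being measured and a made-up set of inputs:

package parser_test

import (
	"go/parser"
	"go/token"
	"testing"
)

// A small, made-up set of representative inputs; a real benchmark would use a broader mix.
var inputs = []string{
	"package p",
	"package p\n\nfunc f() int { return 42 }",
	"package p\n\nvar xs = []int{1, 2, 3}",
}

// BenchmarkParseMixed parses every input once per iteration, so a single
// benchmark covers the whole mix and higher -count values stay affordable.
func BenchmarkParseMixed(b *testing.B) {
	fset := token.NewFileSet()
	for i := 0; i < b.N; i++ {
		for _, src := range inputs {
			if _, err := parser.ParseFile(fset, "input.go", src, 0); err != nil {
				b.Fatal(err)
			}
		}
	}
}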

@dgryski (Contributor) commented Jan 18, 2018

I find git stash helps the workflow while benchmarking old vs new code.

@mvdan (Member, Author) commented Jan 18, 2018

I use git stash for this too, but I have found that it can be confusing to someone who isn't used to it. Perhaps it could be another one of the tips, rather than part of the quickstart.
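
For reference, a sketch of that workflow (the count is illustrative):

$ go test -run='^$' -bench=. -count=10 >new.txt    # with the changes in the working tree
$ git stash                                        # put the changes aside, back to the old code
$ go test -run='^$' -bench=. -count=10 >old.txt
$ git stash pop                                    # restore the changes
$ benchstat old.txt new.txt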

@aclements (Member)

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input.

That's not necessarily wrong; you just have to understand that there's always a false positive rate. Even if it's just a single benchmark, there's a 5% chance of benchstat reporting noise as a significant change. It's just that if you have a whole benchmark set and each individual benchmark has a 5% chance of a false positive, then there's a higher collective chance that some benchmark in the set is going to have a false positive.

@mvdan (Member, Author) commented Jan 20, 2018

Another piece of advice could be whether to do go test -bench=... -count=N or for i in {1..N}; do go test -bench=...; done.

I assume calling go test multiple times is slower, though. I wonder if an option to alternate benchmark runs would be a good idea, as -count currently just runs the same benchmark multiple times in a row.

@dgryski (Contributor) commented Jan 20, 2018

Rather than calling go test multiple times, run go test -c and run the binary multiple times. If you have an original test binary you don't need to worry about swapping back and forth between source code changes with git stash.
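
A sketch of that approach, which also makes it easy to alternate old and new runs as suggested above (flags and count are illustrative):

$ go test -c -o old.test                 # test binary built from the old code
$ <apply source changes>
$ go test -c -o new.test
$ for i in $(seq 10); do ./old.test -test.run='^$' -test.bench=. >>old.txt; ./new.test -test.run='^$' -test.bench=. >>new.txt; done
$ benchstat old.txt new.txt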

@josharian (Contributor)

I wonder if an option to alternate benchmark runs would be a good idea, as -count just runs the same benchmark multiple times in a row at the moment.

I believe that e.g. -cpu=8,8,8,8 should cycle through the benchmarks four times. Compare with -count=4, which runs each benchmark four times before moving on to the next.

Or write a driver script, a la compilebench. See also my abandoned attempts at writing tooling for a more generic driver script at #10930.

run go test -c and run the binary multiple times

And don't forget -run=NONE.

@mvdan (Member, Author) commented Jan 29, 2018

Thanks for the tips regarding cycling through benchmarks. Whatever the best solution is at the moment, we should document it. And if a better tool or method surfaces later, we can just change the docs.

And don't forget -run=NONE.

Already mentioned in the list in my initial post, though I prefer -run=^$ or -run=- since those leave no possibility of actually matching a test.

quasilyte added a commit to quasilyte/go-contributing-ru that referenced this issue Apr 1, 2018
New tasks include:
golang/go#19675 cmd/vet: report uses of -0 in float32/64 context
golang/go#19683 cmd/compile: eliminate usages of global lineno
golang/go#19670 x/tools/go/ssa: make opaqueType less annoying to use
golang/go#19636 encoding/base64: decoding is slow
golang/go#23471 x/perf/cmd/benchstat: tips or quickstart for newcomers
golang/go#19577 test: errorcheck support for intraline errors
golang/go#19490 cmd/vet: reduce the amount of false positives for -shadow mode
golang/go#19042 cmd/internal/obj: optimize wrapper method prologue for branch prediction
golang/go#19013 cmd/compile: add tool for understanding/debugging SSA rules
@gopherbot

Change https://golang.org/cl/188438 mentions this issue: cmd/benchstat: add example benchmarking workflow to package comment

@nsajko (Contributor) commented Aug 3, 2019

An idea for a benchmarking suggestion that could perhaps be added to benchstat's documentation in https://go-review.googlesource.com/c/perf/+/188438:

Some computers, notebooks for example, have insufficient cooling, which forces the kernel to lower CPU frequencies at unpredictable times during benchmarking. This obviously affects benchmarking results in a hard-to-predict way. But I think there is a way to counteract that effect: disable CPU frequency scaling! That is, set the hardware minimum frequency as the maximum frequency, so that all cores always run at the minimum possible frequency.

Is this a good practice for benchmarking? One possible drawback is prolonged total benchmarking time.

EDIT: indeed, the LLVM documentation largely agrees with me: https://llvm.org/docs/Benchmarking.html. There are a bunch of other good suggestions there.
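
For example, on Linux with the cpupower utility, something along these lines should work (800MHz is only an example value; use whatever the hardware minimum is on your machine):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq   # hardware minimum, in kHz
$ sudo cpupower frequency-set --min 800MHz --max 800MHz       # pin the frequency (use -c to target specific CPUs)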

NewbMiao added a commit to NewbMiao/opa-koans that referenced this issue Jul 10, 2020
howardjohn added a commit to howardjohn/istio that referenced this issue Jul 13, 2020
Remove the million endpoint test (we are so far away from scaling that
high it doesn't really matter) because it constantly leads to CPU
throttling both in CI and locally, which leads to inaccurate results.

Increase count to 5 per recommendation of
golang/go#23471
istio-testing pushed a commit to istio/istio that referenced this issue Jul 13, 2020
@missinglink

Thank you. Reading this issue description once had a ± 0% impact on my understanding of the output; reading it three times resulted in a ± 100% gain in my understanding.

@gopherbot

Change https://golang.org/cl/309969 mentions this issue: cmd/benchstat: new version of benchstat

zchee pushed a commit to zchee/golang-perf that referenced this issue Nov 28, 2021
gopherbot pushed a commit to golang/perf that referenced this issue Jan 13, 2023
This is a complete rewrite of benchstat. Basic usage remains the same,
as does the core idea of showing statistical benchmark summaries and
A/B comparisons in a table, but there are several major improvements.

The statistics are now more robust. Previously, benchstat used
IQR-based outlier rejection, showed the mean of the reduced sample,
its range, and did a non-parametric difference-of-distribution test on
reduced samples. Any form of outlier rejection must start with
distributional assumptions, in this case assuming normality, which is
generally not sound for benchmark data. Hence, now benchstat does not
do any outlier rejection. As a result, it must use robust summary
statistics as well, so benchstat now uses the median and confidence
interval of the median as summary statistics. Benchstat continues to
use the same Mann-Whitney U-test for the delta, but now does it on the
full samples since the U-test is already non-parametric, hence
increasing the power of this test.

As part of these statistical improvements, benchstat now detects and
warns about several common mistakes, such as having too few samples
for meaningful statistical results, or having incomparable geomeans.

The output format is more consistent. Previously, benchstat
transformed units like "ns/op" into a metric like "time/op", which it
used as a column header; and a numerator like "sec", which it used to
label each measurement. This was easy enough for the standard units
used by the testing framework, but was basically impossible to
generalize to custom units. Now, benchstat does unit scaling, but
otherwise leaves units alone. The full (scaled) unit is used as a
column header and each measurement is simply a scaled value shown with
an SI prefix. This also means that the text and CSV formats can be
much more similar while still allowing the CSV format to be usefully
machine-readable.

Benchstat will also now do A/B comparisons even if there are more than
two inputs. It shows a comparison to the base in the second and all
subsequent columns. This approach is consistent for any number of
inputs.

Benchstat now supports the full Go benchmark format, including
sophisticated control over exactly how it structures the results into
rows, columns, and tables. This makes it easy to do meaningful
comparisons across benchmark data that isn't simply structured into
two input files, and gives significantly more control over how results
are sorted. The default behavior is still to turn each input file into
a column and each benchmark into a row.

Fixes golang/go#19565 by showing all results, even if the benchmark
sets don't match across columns, and warning when geomean sets are
incompatible.

Fixes golang/go#19634 by no longer doing outlier rejection and clearly
reporting when there are not enough samples to do a meaningful
difference test.

Updates golang/go#23471 by providing more thorough command
documentation. I'm not sure it quite fixes this issue, but it's much
better than it was.

Fixes golang/go#30368 because benchstat now supports filter
expressions, which can also filter down units.

Fixes golang/go#33169 because benchstat now always shows file
configuration labels.

Updates golang/go#43744 by integrating unit metadata to control
statistical assumptions into the main tool that implements those
assumptions.

Fixes golang/go#48380 by introducing a way to override labels from the
command line rather than always using file names.

Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f
Reviewed-on: https://go-review.googlesource.com/c/perf/+/309969
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>