
x/perf/benchstat: Return status code 1 when benchmarks change significantly #20728

Open
nicholasjackson opened this issue Jun 19, 2017 · 5 comments


nicholasjackson commented Jun 19, 2017

Running benchstat on a CI server to detect anomalies currently requires the user to parse the command's output in order to pick up any deltas. To simplify this, I propose that benchstat return exit status 1 when any of the benchmarks show a statistically significant change.

If backwards compatibility is a concern, a new flag could be added to opt in to this behaviour.

Example:

// ...
var (
	flagDeltaTest    = flag.String("delta-test", "utest", "significance `test` to apply to delta: utest, ttest, or none")
	flagAlpha        = flag.Float64("alpha", 0.05, "consider change significant if p < `α`")
	flagGeomean      = flag.Bool("geomean", false, "print the geometric mean of each file")
	flagHTML         = flag.Bool("html", false, "print results as an HTML table")
	flagSplit        = flag.String("split", "pkg,goos,goarch", "split benchmarks by `labels`")
	flagFailOnChange = flag.Bool("failonchange", false, "return exit code 1 when benchmarks have significant change")
)

func main() {
	// ...

	// After the result rows are built, fail if the flag is set and any
	// row shows a statistically significant change.
	for _, row := range c.Rows {
		if *flagFailOnChange && row.Changed != 0 {
			fmt.Fprintln(os.Stderr, "One or more benchmarks have changed significantly")
			os.Exit(1)
		}
	}
}
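With this in place, a CI job could run something like benchstat -failonchange old.txt new.txt and gate directly on the exit status, rather than grepping the text output.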
@gopherbot gopherbot added this to the Unreleased milestone Jun 19, 2017
@ALTree ALTree added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jun 19, 2017

mvdan commented Jun 19, 2017

If you run the benchmarks enough times, you could get a change as small as 0.50%, which could well be caused by factors one can't control, like code alignment. For example, see this false positive I encountered a while ago, which was as high as 4%: #17250

You could change the flag to be a threshold, but I'm not sure that would be a good solution.

Do you have an idea of how we could deal with these false positives? I feel like the flag would be fairly useless given how likely they are.


mvdan commented Jun 19, 2017

Oh, forget my point about the threshold - I didn't know about -alpha.

@rochaporto

+1 on this. We currently have a non-voting CI job for exactly this, parsing the output as described in the initial description, which is not ideal.

@aclements

I'm not sure this makes statistical sense. With the default alpha threshold, you expect a benchmark with no changes to show a "significant" change 5% of the time by random chance. If you're running multiple benchmarks, the chance that at least one of them appears significant grows quickly (unless you apply a correction for multiple hypothesis testing, which benchstat currently won't do for you automatically).
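To make that amplification concrete, here is a quick back-of-the-envelope sketch. It assumes independent tests at the default alpha of 0.05; this is plain binomial arithmetic for illustration, not anything benchstat computes:

package main

import (
	"fmt"
	"math"
)

func main() {
	const alpha = 0.05 // benchstat's default significance level
	// Probability that at least one of n genuinely unchanged benchmarks
	// is flagged as "significant" purely by chance: 1 - (1-alpha)^n.
	for _, n := range []int{1, 5, 10, 20, 50} {
		p := 1 - math.Pow(1-alpha, float64(n))
		fmt.Printf("n=%2d benchmarks: P(at least one false positive) = %.0f%%\n", n, 100*p)
	}
}

At 20 benchmarks, a CI job gating on this would fail by chance roughly 64% of the time, which is why a bare exit-code check is risky without a correction.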

So is this actually useful for CI?

Note that there is also a CSV output, so it wouldn't be hard to write a tool to parse that output. I'm also pulling all of the benchmark stats out into their own package that could be reused by another tool directly.
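As a rough illustration of such a wrapper, the sketch below reads benchstat's CSV output from stdin and exits non-zero when it finds a concrete delta. The column layout is an assumption here: it treats any field ending in "%" as a significant delta, mirroring the text output where insignificant rows show "~" instead of a percentage. Check the actual CSV format of your benchstat version before relying on this:

package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// Hypothetical usage: benchstat -csv old.txt new.txt | failonchange
func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = -1 // benchstat emits sections of varying width
	records, err := r.ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failonchange:", err)
		os.Exit(2)
	}
	for _, rec := range records {
		for _, field := range rec {
			// Assumption: a concrete delta like "+3.47%" marks a
			// statistically significant change; "~" marks none.
			if strings.HasSuffix(strings.TrimSpace(field), "%") {
				fmt.Fprintln(os.Stderr, "one or more benchmarks changed significantly")
				os.Exit(1)
			}
		}
	}
}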

On the topic of the threshold, note that statistical significance does not mean that a change is "big", just that it's unlikely to be from random chance. It could be a very small change, but with enough data and low enough noise to determine that there probably was a change; for example, a 0.3% regression can be highly significant if the measurements are very stable. -alpha controls the threshold for statistical significance, but not for what is considered a "big" change.

@gopherbot

Change https://golang.org/cl/283616 mentions this issue: benchmath: new package of opinionated benchmark statistics

zchee pushed a commit to zchee/golang-perf that referenced this issue Nov 28, 2021
Updates golang/go#20728.

Change-Id: I4c33e64d5959cadfbb97ca6a2274e0c060e87d29
gopherbot pushed a commit to golang/perf that referenced this issue Mar 17, 2022
Updates golang/go#20728.

Change-Id: I4c33e64d5959cadfbb97ca6a2274e0c060e87d29
Reviewed-on: https://go-review.googlesource.com/c/perf/+/283616
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>