
x/perf/benchstat: Return status code 1 when benchmarks change significantly #20728

Open
nicholasjackson opened this issue Jun 19, 2017 · 5 comments


nicholasjackson commented Jun 19, 2017

Running benchstat on a CI server to detect anomalies currently requires the user to parse the command's output in order to pick up any deltas. To simplify this, I propose that benchstat return exit status 1 when any of the benchmarks show a statistically significant change.

If backwards compatibility is a concern, a new flag could be added to opt in to this behaviour.

Example:

// ...
var (
	flagDeltaTest    = flag.String("delta-test", "utest", "significance `test` to apply to delta: utest, ttest, or none")
	flagAlpha        = flag.Float64("alpha", 0.05, "consider change significant if p < `α`")
	flagGeomean      = flag.Bool("geomean", false, "print the geometric mean of each file")
	flagHTML         = flag.Bool("html", false, "print results as an HTML table")
	flagSplit        = flag.String("split", "pkg,goos,goarch", "split benchmarks by `labels`")
	flagFailOnChange = flag.Bool("failonchange", false, "return exit code 1 when benchmarks have significant change")
)

func main() {
	// ...

	// After the result rows are built, fail if the flag is set and any
	// row shows a statistically significant change.
	for _, row := range c.Rows {
		if *flagFailOnChange && row.Changed != 0 {
			fmt.Fprintln(os.Stderr, "One or more benchmarks have changed significantly")
			os.Exit(1)
		}
	}
}
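With this in place, a CI job could run something like benchstat -failonchange old.txt new.txt and gate directly on the exit status, rather than grepping the text output.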
@gopherbot gopherbot added this to the Unreleased milestone Jun 19, 2017
@ALTree ALTree added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jun 19, 2017

mvdan commented Jun 19, 2017

If you run the benchmarks enough times, you could get a change as small as 0.50%, which could well be caused by factors one can't control, like code alignment. For example, see this false positive I encountered a while ago, which was as high as 4%: #17250

You could change the flag to be a threshold, but I'm not sure that would be a good solution.

Do you have an idea of how we could deal with these false positives? I feel like the flag would be fairly useless given how likely they are.


mvdan commented Jun 19, 2017

Oh, forget my point about the threshold - I didn't know about -alpha.

@rochaporto

+1 on this. We currently have a non-voting CI job for exactly this, parsing the output as described in the initial description, which is not ideal.

@aclements

I'm not sure this makes statistical sense. With the default alpha threshold, you expect a benchmark with no changes to show a "significant" change 5% of the time by random chance. If you're running multiple benchmarks, the chance that at least one of them appears significant grows quickly (unless you apply a correction for multiple hypothesis testing, which benchstat currently won't do for you automatically).
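To make that amplification concrete, here is a quick back-of-the-envelope sketch. It assumes independent tests at the default alpha of 0.05; this is plain binomial arithmetic for illustration, not anything benchstat computes:

package main

import (
	"fmt"
	"math"
)

func main() {
	const alpha = 0.05 // benchstat's default significance level
	// Probability that at least one of n genuinely unchanged benchmarks
	// is flagged as "significant" purely by chance: 1 - (1-alpha)^n.
	for _, n := range []int{1, 5, 10, 20, 50} {
		p := 1 - math.Pow(1-alpha, float64(n))
		fmt.Printf("n=%2d benchmarks: P(at least one false positive) = %.0f%%\n", n, 100*p)
	}
}

At 20 benchmarks, a CI job gating on this would fail by chance roughly 64% of the time, which is why a bare exit-code check is risky without a correction.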

So is this actually useful for CI?

Note that there is also a CSV output, so it wouldn't be hard to write a tool to parse that output. I'm also pulling all of the benchmark stats out into their own package that could be reused by another tool directly.
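As a rough illustration of such a wrapper, the sketch below reads benchstat's CSV output from stdin and exits non-zero when it finds a concrete delta. The column layout is an assumption here: it treats any field ending in "%" as a significant delta, mirroring the text output where insignificant rows show "~" instead of a percentage. Check the actual CSV format of your benchstat version before relying on this:

package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// Hypothetical usage: benchstat -csv old.txt new.txt | failonchange
func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = -1 // benchstat emits sections of varying width
	records, err := r.ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failonchange:", err)
		os.Exit(2)
	}
	for _, rec := range records {
		for _, field := range rec {
			// Assumption: a concrete delta like "+3.47%" marks a
			// statistically significant change; "~" marks none.
			if strings.HasSuffix(strings.TrimSpace(field), "%") {
				fmt.Fprintln(os.Stderr, "one or more benchmarks changed significantly")
				os.Exit(1)
			}
		}
	}
}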

On the topic of the threshold, note that statistical significance does not mean that a change is "big", just that it's unlikely to be from random chance. It could be a very small change, but with enough data and low enough noise to determine that there probably was a change; for example, a 0.3% regression can be highly significant if the measurements are very stable. -alpha controls the threshold for statistical significance, but not for what is considered a "big" change.

@gopherbot

Change https://golang.org/cl/283616 mentions this issue: benchmath: new package of opinionated benchmark statistics

zchee pushed a commit to zchee/golang-perf that referenced this issue Nov 28, 2021
Updates golang/go#20728.

Change-Id: I4c33e64d5959cadfbb97ca6a2274e0c060e87d29
gopherbot pushed a commit to golang/perf that referenced this issue Mar 17, 2022
Updates golang/go#20728.

Change-Id: I4c33e64d5959cadfbb97ca6a2274e0c060e87d29
Reviewed-on: https://go-review.googlesource.com/c/perf/+/283616
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>