proposal: cmd/go: run benchmarks for one iteration when testing #41824

Closed
zeebo opened this issue Oct 6, 2020 · 46 comments

Comments

@zeebo (Contributor) commented Oct 6, 2020

Currently, benchmarks are only run when passing the -bench flag to go test. Usually, CI checks don't run benchmarks because of the extra time required and because noise reduces the usefulness of the results. This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset. Instead, go test without the -bench flag could also run all of the benchmarks for a single iteration (b.N = 1) and just report PASS or FAIL like a normal test. Then benchmark code would be exercised at least as often as test code, preventing rot.

Some details to work out:

  • What about -run? I would suggest it additionally filters the benchmarks to run when -bench is not passed.
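For illustration, here is a minimal sketch of an ordinary benchmark (imports elided, as with the other snippets in this thread); under this proposal a plain go test would run the loop body exactly once (b.N = 1) and report PASS or FAIL, which is enough to catch a benchmark that panics or otherwise fails at runtime:

func BenchmarkAtoi(b *testing.B) {
  // With -bench=. the testing package grows b.N until timing stabilizes;
  // with b.N == 1 the body still runs once, so rot is detected.
  for i := 0; i < b.N; i++ {
    if _, err := strconv.Atoi("12345"); err != nil {
      b.Fatal(err)
    }
  }
}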
@gopherbot added this to the Proposal milestone Oct 6, 2020
@davecheney (Contributor)

go test -run=. -bench=. -benchtime=1x

@zeebo (Contributor, Author) commented Oct 6, 2020

That's a good way of saying the default behavior this proposes.

Are you trying to make an argument one way or the other with that statement? If so, what argument?

@bcmills (Contributor) commented Oct 7, 2020

To give a recent example: cmd/api.TestBenchmark was broken from CL 164623 to CL 224619, and nobody had noticed because the benchmark wasn't covered in the Go project's own CI checks until CL 224038.

@mvdan (Member) commented Oct 7, 2020

I generally agree with this; I've been bitten by benchmarks going stale as well. I think benchmarks should be run as part of go test for the same reason that most of vet is now also run when testing: otherwise, users will forget to check regularly. go test -run=. -bench=. -benchtime=1x exists, but it's neither easy to type nor easy to remember.

The only trouble I can imagine with this change is that it would cause pain if there are any low-quality benchmarks which would now suddenly run as part of go test. For example:

  1. Benchmarks which do a lot of work upfront, before looping over N. Imagine one that takes a full second before the loop starts, so even with N=1, the added cost would be high.
  2. Benchmarks where each iteration is expensive. Imagine a benchmark where each iteration takes a full second, so N=1 is still heavy.

You could say "you should never do either of those things", but sometimes what you need to benchmark is just very costly, especially when benchmarking a feature end-to-end. And I've seen a number of such heavy benchmarks in the past, where even with -benchtime=5s you'd get results with N=10.

I still think it's good to incentivize developers to write cheaper benchmarks, so I'm still in favor of this change. But for those cases where heavy benchmarks are useful, we should have a recommendation for keeping them around without slowing down go test. Perhaps skip them with testing.Short()? Put them in separate packages, so that go test ./... runs them in parallel? Use build tags like // +build slowbench?

The concern above reminds me of #36586 too, because runnable examples and benchmarks are similar in how they can't run in parallel with each other, or with tests.

@mvdan (Member) commented Oct 7, 2020

What about -run? I would suggest it additionally filters the benchmarks to run when -bench is not passed.

I would further suggest that go test -short should ignore benchmarks entirely, too.

@bcmills (Contributor) commented Oct 7, 2020

But for those cases where heavy benchmarks are useful, we should have a recommendation for keeping them around without slowing down go test.

Perhaps skip them with testing.Short()?

Yes.

Put them in separate packages, so that go test ./... runs them in parallel?

That seems worse than skipping with testing.Short(): go test ./... would run them in parallel, but go test -bench=. foo wouldn't give any hint that some benchmarks of foo were skipped. (And go test -bench=. all wouldn't find those benchmarks when the package is imported from some other module.)

Use build tags like // +build slowbench?

That seems excessive, given the option of using testing.Short()..?
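For concreteness, a heavy benchmark guarded that way might look like the sketch below (imports elided; the workload is only illustrative): go test -short skips it, while go test -bench=. still runs it in full.

func BenchmarkSortLargeSlice(b *testing.B) {
  if testing.Short() {
    b.Skip("skipping expensive benchmark in -short mode")
  }
  data := rand.New(rand.NewSource(1)).Perm(1 << 20) // deliberately expensive setup
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    tmp := append([]int(nil), data...)
    sort.Ints(tmp)
  }
}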

@seebs (Contributor) commented Oct 7, 2020

I endorse a default of -bench=. -benchtime=1x for go test; it would make my life better.

@rsc (Contributor) commented Oct 7, 2020

There are a bunch of benchmarks that do a lot of setup work anticipating that they will be run for the full 2s or more.
I don't believe this change is well-advised at all. If your benchmark is testing something, make a test.

@rsc (Contributor) commented Oct 7, 2020

Usually, CI checks don't run benchmarks because of the extra time required and noise reducing the usefulness of the results. This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset. Instead, go test without the -bench flag could also run all of the benchmarks for a single iteration (b.N = 1) and just report PASS or FAIL like a normal test.

CI systems today can use -benchtime=1x. Do any CI systems do that?

@josharian (Contributor)

One way to make this opt-in would be a method on testing.T. Maybe something like:

// RunBenchmark runs the benchmark b once with b.N == 1.
// Any TB method called on b, such as Errorf or Skip, is treated as though it were called on t.
func (t *T) RunBenchmark(bench func(b *B))

Basically, sub-tests but crossing the test/benchmark line. This would also provide an opportunity for the calling testing.T to do other work, like calling Parallel.
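As a usage sketch only (RunBenchmark does not exist today, and encode and sample below are hypothetical stand-ins), it might look like:

func BenchmarkEncode(b *testing.B) {
  for i := 0; i < b.N; i++ {
    encode(sample) // hypothetical function and fixture
  }
}

func TestEncodeDoesNotRot(t *testing.T) {
  t.Parallel()                    // the calling test can still do its own setup
  t.RunBenchmark(BenchmarkEncode) // proposed API: runs the benchmark once with b.N == 1
}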

(You could also imagine doing this for examples, although then the examples would be executed twice during a normal run.)

@cespare (Contributor) commented Oct 7, 2020

@mvdan I disagree with your implication that any benchmarks which do expensive setup work are "low-quality".

One system I work on, for instance, is a database which includes benchmarks that initialize themselves with a large synthetic table; the benchmark loops run queries against the table. It takes a lot of CPU and RAM to do the initial setup and I wouldn't want to run these every time I do the usual go test ./.... In CI we use -benchtime 1x to ensure that the code is exercised.

From my perspective, things work well as-is and I would prefer we not change the existing behavior.

@zeebo (Contributor, Author) commented Oct 7, 2020

I don't believe this change is well-advised at all. If your benchmark is testing something, make a test.

Of course tests should be tests, but this proposal is to make it so that benchmarks don't go stale and become broken over time because no one runs them. Phrased differently, it's about testing the benchmark, not having the benchmark test the code.

CI systems today can use -benchtime=1x. Do any CI systems do that?

Picking a sample of systems I'm aware of, plus whatever was at the top of the GitHub Marketplace for CI:

I would be surprised if running the benchmarks as part of go test was included in any documentation for any CI system. Furthermore, many examples do include more advanced features like generating code coverage reports, so it can't be explained by keeping examples simple.

As for users overriding the default configuration to run benchmarks, I could not find a single .travis.yaml that executes benchmarks as part of go test using GitHub search:

In my personal experience, 3 out of 3 different workplaces did not run benchmarks as part of the tests. Indeed, as @bcmills said, the Go project itself had a broken benchmark for a period of over a year.

I believe this is strong evidence that running benchmarks as part of testing is rare.

@kortschak (Contributor)

I think this should be opt in, as it currently is with -benchtime=1x. I just ran the Gonum test suite with -benchtime=1x and it goes from completing in 2 minutes to taking more than half an hour. It may well be that we are an outlier (we have a lot of benchmarks and due to the nature of the things we do, we need to test a variety of problem sizes with algorithms that have — intrinsically — fairly unpleasant time complexity).

@rsc (Contributor) commented Oct 12, 2020

@zeebo

I believe this is strong evidence that running benchmarks as part of testing is rare.

Indeed. The fact that CI systems can today opt in to running 1 iteration of benchmarks (using -benchtime=1x) but do not seems to me strong evidence that the Go command should not start doing it either.

Benchmarks and tests are separate by design, for good reasons (benchmarks can be slow, even 1x, and that hurts go test latency, which is very important to keep low).

It seems to me that we should keep them separate. As has been pointed out, CI systems and users have the ability to opt in even today, using -benchtime=1x. You could even set GOFLAGS=-benchtime=1x to force it into all your go test commands.
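For example, a CI configuration can get the single-iteration run today with either of:

go test -run=. -bench=. -benchtime=1x ./...

GOFLAGS=-benchtime=1x go test -bench=. ./...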

@seebs (Contributor) commented Oct 12, 2020

Many CI systems predate the existence of "1x" as a benchtime option. For a long time, benchmarks could only be run in a loop until they took at least some amount of time, defaulting to a second, which in practice often required quite a few iterations per benchmark. Now that it exists, it's much more likely that people would start adopting it, but I think the lack of people doing it may reflect in part that it wasn't available when they specced out their setup.

@rsc (Contributor) commented Oct 12, 2020

The CI systems are not the primary reason not to do this. The primary reason is the overall slowdown.
The fact that the CI systems already can do this but do not is just icing atop the cake.

@zeebo (Contributor, Author) commented Oct 12, 2020

None of the CI systems run go vet either, yet it was included as part of go test even though it too slows down execution. Is it a matter of the degree of slowdown? If so, would it be useful to collect data about what the expected slowdown is?

@zeebo (Contributor, Author) commented Oct 12, 2020

Benchmarks and tests are separate by design, for good reasons

Examples and tests are also separate for good reasons, yet examples are run during go test not as a form of testing the library, but to ensure that the documented examples do not become stale or broken. Why should that not also apply to benchmarks?

@kortschak (Contributor)

If so, would it be useful to collect data about what the expected slow down is?

That's essentially open-ended; there is one data point above. Given the wide range of possible impacts on test times, the question then becomes what the work impact would be on those projects where the time impact is too high. For us, if there were no additional knobs in the form of command flags or environment variables, preventing a blow-out of testing time by keeping the benchmarks from running would be onerous work, and given the Go project's dislike of knobs, that additional work would be all but guaranteed.

@davecheney (Contributor)

@zeebo you mentioned

This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset.

What does this rotting look like? Is it mis-compilation, or the benchmarks returning incorrect results?

@zikaeroh (Contributor) commented Oct 12, 2020

The searches in #41824 (comment) aren't complete, as they are searching through ".travis.yaml", not ".travis.yml", which is the massively more common filename. There are many examples of go test being run directly. Some run benchmarks, often with short benchtimes.

E.g., this is an old repo that only tests up to 1.6, but they were using a very short benchtime (100ms) to get something close to 1x before it was even supported: https://github.com/encryptio/go-meetup/blob/755fb270c209065a5242e60431973ff2ce2f4307/.travis.yml#L8

I'd be wary of running benchmarks by default, even at 1x; on my main project my benchmarks often spin up a temporary database. I run with 1x in my CI, but I've gone to great lengths to ensure that this can happen quickly (using a special image that caches the initialization, plus pooling and DB templating). I'm not certain everyone has gone through that effort.

@kortschak (Contributor) commented Oct 12, 2020

@davecheney It can't be mis-compilation because irrespective of the flags, the benchmarks are still built.

In looking into the time impact on Gonum's test suite, I did find one set of cases where a change in the API had left a rotted benchmark that panics (so I am torn on this: I don't want to see rotted benchmarks, but a greater than one order of magnitude increase in test suite running time is too much).

@davecheney (Contributor)

@kortschak that's what I figured, thanks for confirming.

@josharian (Contributor)

It seems like there's consensus here (1) that there's a problem with benchmarks getting stale and (2) that running with -benchtime=1x by default isn't the solution.

I offered one suggestion above for making it easy to explicitly opt in on a benchmark by benchmark basis, and a related approach in #41878.

Anyone else care to throw out other possible solutions? This seems like as good a location as any to brainstorm.

@zeebo (Contributor, Author) commented Oct 12, 2020

There are many examples of go test being run directly. Some run benchmarks, often with short benchtimes.

Thank you @zikaeroh for spotting the errors in the analysis I made.

I offered one suggestion above for making it easy to explicitly opt in on a benchmark by benchmark basis, and a related approach in #41878.

Anything that is opt-in may as well be during the invocation of the go tool, I think. In other words, I don't see the benefit of encoding the opt-in in the source code versus the arguments passed to go test.

@randall77 (Contributor)

What about just using names? Currently Test* are tests, and Benchmark* are benchmarks. How about anything named BenchmarkTest* is both a benchmark and a test?
A BenchmarkTest would have to take a *testing.B like benchmarks do. But it would also be run with b.N=1 when testing. Conceivably we could add a method like TestMode() bool to *testing.B so if BenchmarkTests cared they could distinguish test mode (e.g. Skip, or avoid setting up a database, etc.).

The only downside is that anything currently named BenchmarkTest* would now run when testing, when it didn't before.

Happy to entertain better names.
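A hypothetical sketch of the convention (TestMode and the helpers below do not exist; they are only meant to illustrate the idea):

func BenchmarkTestQuery(b *testing.B) {
  rows := 1000000
  if b.TestMode() { // proposed API; does not exist today
    rows = 10 // keep plain `go test` fast
  }
  db := newSyntheticDB(rows) // hypothetical, potentially expensive setup
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    db.Query("SELECT 1") // hypothetical query helper
  }
}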

@josharian (Contributor)

I don't see the benefit of encoding the opt-in in the source code versus the arguments passed to go test.

You can specify a subset of benchmarks to include/exclude. You can enable b.Parallel. You can set it once in one place and not have to manage external configuration (CI) or training/documentation (teammates).

@josharian (Contributor)

And if we wanted, with *testing.BT, instead of exposing b.N directly, we could expose b.Next(). But that might be a bridge too far, since it'd make opt-in more than a one character change. :)

@randall77 (Contributor)

There's already the interface testing.TB, so introducing testing.BT struct to distinguish by argument type seems ripe for confusion.
But I like the idea if we can think of a better name.

@rsc (Contributor) commented Oct 14, 2020

If you want to run a benchmark during a test:

func BenchmarkB(b *testing.B) { ... }
func TestBenchmarkB(t *testing.T) { testing.Benchmark(BenchmarkB) }

I don't see why new API is needed here.

@ianlancetaylor added this to Incoming in Proposals (old) Oct 14, 2020
@josharian (Contributor)

@rsc that will do the full benchmark run, increasing b.N incrementally, which is much slower than only b.N=1.

@rsc (Contributor) commented Oct 16, 2020

@josharian, indeed, I was clearly tired. :-)

The point remains, though: if you want to opt-in to testing certain benchmark code during tests, factor it out and call it from a Test function.

@josharian (Contributor) commented Oct 16, 2020

:) Here's a worked example of that approach.

Here's the simplest possible benchmark:

func BenchmarkFoo(b *testing.B) {
  for i := 0; i < b.N; i++ {
     foo()
  }
}

Here's what it looks like refactored to be a test and a benchmark:

func benchmarkFoo(tb testing.TB) {
  bN := 1
  if b, ok := tb.(*testing.B); ok {
    bN = b.N
  }
  for i := 0; i < bN; i++ {
    foo()
  }
}

func BenchmarkFoo(b *testing.B) {
  benchmarkFoo(b)
}

func TestFoo(t *testing.T) {
  benchmarkFoo(t)
}

Maybe I'm missing a simpler way, but that's a lot of boilerplate, particularly repeated over many benchmarks.

@seebs (Contributor) commented Oct 16, 2020

This, plus my rejected proposal for testing middleware, are sort of running into similar things -- there's no good way to programmatically express "for each test" or "for each benchmark", or to request that specific things be run, or to influence settings. For instance, "if no benchmark runs were requested, run each benchmark once with N=1" seems like a thing that would be lovely to be able to express concisely in code.

There are a lot of things like this where it's certainly possible to do the thing by hand-refactoring, but in a large test suite, it's hundreds of instances of boilerplate, and that means hundreds of things that could get out of sync.

Maybe this is or should be a code generation problem.

That said, adding RunTests, RunBenchmarks, and a way to interact with flags to testing.M would also solve a lot of this:

func TestMain(m *testing.M) {
    var mFlags flag.FlagSet = m.Flags()
    bench := mFlags.Lookup("bench")
    if bench.String() == "" {
        bench.Set(".")
        mFlags.Lookup("benchtime").Set("1x")
    }
    m.RunTests()
    m.RunBenchmarks()
}

(Of course, I'd prefer wrapper funcs that take (t *testing.T, name string, fn func(*testing.T)) that you could pass to RunTests and RunBenchmarks, so you could intercept each individual test-or-benchmark, but that's sort of orthogonal.)

@josharian (Contributor)

There's already the interface testing.TB

We could use testing.TB, like in my refactored code above. Code could then type switch to *testing.T or *testing.B for test-specific code (t.Parallel) or benchmark-specific code (b.ReportAllocs). Calculating b.N would be wordy, though, as above.
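A sketch of that type switch, reusing the foo placeholder from the example above:

func benchmarkFoo(tb testing.TB) {
  bN := 1
  switch v := tb.(type) {
  case *testing.B:
    v.ReportAllocs() // benchmark-specific
    bN = v.N
  case *testing.T:
    v.Parallel() // test-specific
  }
  for i := 0; i < bN; i++ {
    foo()
  }
}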

@bitfield

If you do have benchmarks, and you need to run them, you'll run them. If you don't need to run them, you don't need to run them.

@rsc (Contributor) commented Oct 20, 2020

@josharian, isn't it:

func BenchmarkFoo(b *testing.B) {
  for i := 0; i < b.N; i++ {
     foo()
  }
}

func TestFoo(t *testing.T) {
  foo()
}

?

@josharian (Contributor)

@rsc in my trivial example, yes. And that's a common case, and everything is obviously fine now in those cases.

But there's often a fair amount more work (or at least LOC) in the benchmark.

My example was deliberately trivial, to highlight the structure of the refactoring, under the assumption that copy/pasting the entire benchmark is not the right solution.

Note that if you copy/paste the benchmark into a test, and the test fails, you also need to remember to make the corresponding benchmark fix.

You could also take similar approaches, like factoring out setup and loop code. But that refactor will be awkward because of the testing.T/testing.B mismatch, and refactoring the loop into a function call may alter performance characteristics.

@rsc moved this from Incoming to Active in Proposals (old) Nov 4, 2020
@rsc (Contributor) commented Nov 4, 2020

I guess there are two important questions here.

  1. If we were starting from scratch, would it be the right thing to do to run every benchmark for one iteration to check whether it has any failures?
  2. If the answer to (1) is “yes”, is it the right thing to do to change the default at this late stage?

As I've noted above, I don't see the argument for (1) being "yes", and I'm even more skeptical that (2) is "yes".

We also don't run examples with no output, for good reason. I don't think we should start running those either.
Not every benchmark is written expecting to be able to run during tests. They're just different.
And it is easy to refactor code as needed to test what a benchmark needs to test, in a separate test.

@nightlyone (Contributor)

I believe documenting the existing solution as an example go test invocation would be useful. Maybe repeat that in the testing package documentation and in the blog article that introduces benchmarks.

@rsc (Contributor) commented Nov 11, 2020

Given how potentially disruptive this change would be and that people who do want to run benchmarks as tests can already use

go test -bench=. -benchtime=1x

it seems like we should probably otherwise leave things as they are. Thoughts?

@rsc (Contributor) commented Nov 18, 2020

Based on the discussion, this seems like a likely decline.

@mvdan (Member) commented Nov 19, 2020

people who do want to run benchmarks as tests can already use

I'm not sure I understand this argument, because it applied to go vet when it was included as part of go test, too.

I assume the only difference here is potentially making go test much slower by default. I can buy that argument, but not the "this is already available if you want it" one.

@rsc moved this from Active to Likely Decline in Proposals (old) Nov 19, 2020
@zeebo (Contributor, Author) commented Nov 20, 2020

Given how potentially disruptive this change would be and that people who do want to run benchmarks as tests can already use

(emphasis mine)

The conversation keeps going in the direction of "running benchmarks as tests" which is not at all what is being proposed. It's testing the benchmark, not using the benchmark to test. In other words, this proposal would still make sense to me even if the *testing.B type had no way to indicate failures.

That said, it does seem like it would have a high probability of being too disruptive. I do think it's worthwhile to document that the -benchtime=1x flag can be used to quickly check benchmarks, given all the evidence that benchmarks do rot presented in this thread and elsewhere.

@rsc (Contributor) commented Dec 2, 2020

No change in consensus, so declined.

@rsc closed this as completed Dec 2, 2020
@rsc moved this from Likely Decline to Declined in Proposals (old) Dec 2, 2020
@golang locked and limited conversation to collaborators Dec 2, 2021