testing: report GC/op when b.ReportAllocs is called #52466

Open · mknyszek opened this issue Apr 21, 2022 · 15 comments

Comments

@mknyszek
Contributor

mknyszek commented Apr 21, 2022

Currently, benchmarks defined with the testing package show only ns/op by default, optionally reporting the average allocation count and size per iteration (allocs/op, bytes/op). Users may also export a custom metric via testing.(*B).ReportMetric.
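(For context, a minimal sketch of those existing hooks; the benchmark name and the custom unit below are made up for illustration:)

```go
package example

import (
	"bytes"
	"testing"
)

func BenchmarkBuffer(b *testing.B) {
	b.ReportAllocs() // adds allocs/op and B/op alongside the default ns/op
	var written int
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		n, _ := buf.WriteString("hello, world")
		written += n
	}
	// Custom metric: the unit string is free-form; by convention a "/op"
	// suffix marks a per-iteration value, so we divide by b.N.
	b.ReportMetric(float64(written)/float64(b.N), "written-bytes/op")
}
```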

A common issue with benchmarks defined by the testing package, especially microbenchmarks, is that they might execute only a handful of GC cycles. If an extra GC cycle runs due to internal or external noise, such as scheduling, it can cause a significant shift in the results (typically ns/op). As a result, it's more difficult to trust the results from such benchmarks; they need either more iterations to be useful, or a change in how they're defined (say, with a bigger GOGC value).

I propose exposing the ~~total number of GC cycles~~ average number of GC cycles that passed during a given benchmark run to help expose such issues. One issue with making the total a benchmark metric is that it depends on the benchmark time, or the number of benchmark iterations, so this number is not comparable if those are different between runs. I think this is OK, because this metric is just supposed to flag an issue, not to be used in actual performance comparisons. Tooling such as benchstat could simply understand this, or we could export some benchmark unit metadata that makes this explicit.

I don't have strong feelings about the details, such as the metric name (maybe ~~total-GC-cycles~~ GCs/op?). While this metric potentially adds noise to benchmark output, I believe that cost is outweighed by the value of being able to identify this kind of noise early.
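(As a rough sketch of what that metric would mean, here is how one could approximate it today with runtime.ReadMemStats and ReportMetric; the benchmark body is just a placeholder:)

```go
package example

import (
	"runtime"
	"testing"
)

func BenchmarkAllocHeavy(b *testing.B) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = make([]byte, 1<<20) // allocation-heavy placeholder work
	}
	b.StopTimer()
	runtime.ReadMemStats(&after)
	// Completed GC cycles over the whole run, normalized per iteration.
	b.ReportMetric(float64(after.NumGC-before.NumGC)/float64(b.N), "GCs/op")
}
```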

Alternatively, we could make it the job of the testing package to flag issues around such performance cliffs, but I suspect it's cleaner to just export the raw data and have it be interpreted by downstream tooling.

One might think that the testing package could just add iterations to adjust for these cliffs, but this is very difficult to do in the general case. It would involve a lot of heuristics that would generally be fragile and depend on the GC implementation. On the other hand, it's straightforward for tools to find issues after the fact, by identifying outliers and correlating those outliers with relatively large (by percentage) differences in GC cycle count.

Note that the average number of GC cycles per op would not work, because in many scenarios where this matters, it wouldn't properly flag an issue (the average would just be zero, or very, very close to zero).

EDIT: Switched to average number of GC cycles.

CC @aclements @prattmic @dr2chase

@gopherbot gopherbot added this to the Proposal milestone Apr 21, 2022
@prattmic
Member

> Note that the average number of GC cycles per op would not work, because in many scenarios where this matters, it wouldn't properly flag an issue (the average would just be zero, or very, very close to zero).

Is this actually the case? It sounds mostly like an issue of ensuring we include enough significant figures. As long as there are enough significant figures, we should be able to tell that a doubling of 0.00001 GC/op to 0.00002 GC/op is problematic.

@mknyszek
Contributor Author

That's true. It looks a little weird but I think you're right that as long as there are enough significant figures it's fine. I think even with a float64 we're fine on that front.

@mknyszek mknyszek changed the title proposal: testing: add GC cycle count as a standard benchmark metric proposal: testing: add GC cycles per op as a standard benchmark metric Apr 21, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Apr 21, 2022
@rhysh
Contributor

rhysh commented Apr 21, 2022

How would this behave when a mark phase is active at the start, or at the end, of the benchmark loop? (It would be possible to wait for the cycle to finish before entering the benchmark, but maybe that's not possible at the end. Maybe that's something more to report, so a post-processor could choose to discard those results?)

How does it interact with B.{Stop,Start,Reset}Timer, especially around an in-progress mark phase?

@mknyszek
Contributor Author

Thanks @rhysh. I think I'd lean toward not trying to do anything about in-progress mark phases and just measuring completed GC cycles between {Start,Reset}Timer and StopTimer. It's noisy, but ultimately what we're trying to flag is that noise. If we blocked until the end of a mark phase, it would create a much more stable GC cycle count, which might mask the problem. If the mark phases are short (true most of the time) then masking isn't too much of a problem, so I don't feel particularly strongly about this if there are other good reasons to block.
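(A minimal user-level sketch of those semantics, with a hypothetical completedGCs helper; the per-iteration ReadMemStats calls make this far too heavyweight for real use, it only illustrates which cycles would count:)

```go
package example

import (
	"runtime"
	"testing"
)

// completedGCs returns the number of GC cycles completed so far.
func completedGCs() uint32 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.NumGC
}

func BenchmarkWithUntimedSetup(b *testing.B) {
	var gcs uint32
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		src := make([]byte, 1<<20) // setup; GC cycles triggered here are excluded
		b.StartTimer()

		start := completedGCs()          // snapshot as the timer restarts
		_ = append([]byte(nil), src...)  // timed work
		gcs += completedGCs() - start    // count only cycles that completed while timed
	}
	// An in-progress mark phase at a snapshot is neither waited for nor
	// attributed; it only counts once it completes inside a timed region.
	b.ReportMetric(float64(gcs)/float64(b.N), "GCs/op")
}
```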

@rsc
Contributor

rsc commented Jun 22, 2022

Turning this on by default seems like a lot of noise, unless we believe that the majority of benchmarks are invalidated by GC issues. I don't think that is true.

Is there some way we can tell when to turn it on versus not?

Or failing that, is it enough to turn it on with b.ReportAllocs?

@mknyszek
Contributor Author

I think the problem with deciding whether to turn it on or not is that you don't really know until you've done at least one iteration. The first iteration would be missing the signal, but maybe that's fine? As opposed to a benchmark metric, we could also just emit a warning, or a benchmark key ("gc-noise: yes" or something), in which case that doesn't really matter.

I think grouping it with ReportAllocs would also make sense.

@rsc
Contributor

rsc commented Jun 22, 2022

> I think the problem with deciding whether to turn it on or not is that you don't really know until you've done at least one iteration. The first iteration would be missing the signal, but maybe that's fine? As opposed to a benchmark metric, we could also just emit a warning, or a benchmark key ("gc-noise: yes" or something), in which case that doesn't really matter.

I don't think there's a 'turn it on' problem. Surely asking for the GC count is just an atomic load. The decision about reporting can be done at the end of the benchmark. To me, a bigger problem is that you really care about whether the GC/op is different between two different benchmark runs, so there's no global view available to make the decision.

It sounds like you are saying maybe we can do something with the individual iterations we get during the determination of the right b.N. That would be great if so, but then it seems like we'd want to report some kind of GC/op variance. And we expect variance when b.N is too small, so it's not clear whether deciding anything from it would be meaningful.

For all those reasons I'm not sure we can do better than "report GC/op as part of b.ReportAllocs".
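(For illustration, with that option a benchmark calling b.ReportAllocs, or run with -benchmem, might print an extra column along these lines; the exact unit name and values here are hypothetical:)

```
BenchmarkParse-8    500000    2312 ns/op    512 B/op    3 allocs/op    0.00021 GCs/op
```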

@rsc rsc moved this from Incoming to Active in Proposals (old) Jun 22, 2022
@rsc
Contributor

rsc commented Jun 22, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@mknyszek
Contributor Author

> To me, a bigger problem is that you really care about whether the GC/op is different between two different benchmark runs, so there's no global view available to make the decision.

Sorry, this is what I thought you meant by the "turn it on" problem. In other words, retroactively reporting it. I think you're right though that this probably can't be better than just reporting it as part of ReportAllocs due to variance.

@rsc rsc changed the title proposal: testing: add GC cycles per op as a standard benchmark metric proposal: testing: report GC/op when b.ReportAllocs is called Jun 29, 2022
@rsc
Contributor

rsc commented Jun 29, 2022

Retitled. Does anyone object to this?

@prattmic
Member

prattmic commented Jun 29, 2022

As a bit of a strawman, GOGC (and GOMEMLIMIT) will likely have a large impact on GC/op. Would it make sense to include those as benchfmt keys so that they are captured as part of the benchmark, just as we capture CPU type and GOMAXPROCS (in the name rather than a key), which impact ns/op?

I think we'd need to add those keys universally since they come before we know if b.ReportAllocs will be used. That feels like overkill to me, so I'd lean against including them, but figured I'd mention it.
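(For reference, benchfmt configuration lines are plain "key: value" pairs printed before the results; capturing these knobs might look like the following, where the gogc and gomemlimit keys are hypothetical and not part of this proposal:)

```
goos: linux
goarch: amd64
pkg: example.com/parser
gogc: 200
gomemlimit: 4GiB
BenchmarkParse-8    500000    2312 ns/op    0.00021 GCs/op
```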

@rsc
Contributor

rsc commented Jul 13, 2022

Maybe only include GOGC if it's not the default 100?

@rsc rsc moved this from Active to Likely Accept in Proposals (old) Jul 13, 2022
@rsc
Contributor

rsc commented Jul 13, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) Jul 20, 2022
@rsc
Contributor

rsc commented Jul 20, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: testing: report GC/op when b.ReportAllocs is called testing: report GC/op when b.ReportAllocs is called Jul 20, 2022
@rsc rsc modified the milestones: Proposal, Backlog Jul 20, 2022
@bboreham
Contributor

As someone who looks at benchmark results a lot, I didn't immediately feel this was useful: in general it will be a linear multiple of bytes/op, and the multiple can vary a lot depending on what has run before this benchmark to push the heap larger.

However, cases where it is a very small number do indeed flag potential noise in results.
Perhaps benchstat could flag these.
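As a rough back-of-envelope (assuming GOGC=100 and a steady-state live heap): a GC cycle is triggered roughly every time the program allocates about as much new memory as the live heap, so GCs/op ≈ (bytes/op) / (live heap × GOGC/100). For example, 1 KiB/op against a 64 MiB live heap works out to about 1.5e-5 GCs/op, while the same benchmark run after something has pushed the live heap to 512 MiB drops to about 1.9e-6, which is the dependence on prior heap growth described above.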
