[continued from #20412]

Looking at https://perf.golang.org/search?q=cl:43491+try:Tfbd4a8a6e, which is the perf dashboard result for CL 43491, I see some unlikely things:
Observe the very high variance in the "after" column; it seems like one of the 5 runs went amok somehow due to external factors. I'm not sure how to reproduce this locally, but the x/benchmarks/http benchmarks all came back neutral on my machine before/after that CL, including for memory:
```
name    old time/op    new time/op    delta
HTTP-8  14.1µs ± 1%    14.1µs ± 0%    ~    (p=0.690 n=5+5)

name    old GC-bytes-from-system  new GC-bytes-from-system  delta
HTTP-8  1.47M ± 0%                1.47M ± 0%                ~    (all equal)

name    old P50-time/op  new P50-time/op  delta
HTTP-8  439µs ± 1%       439µs ± 0%       ~    (p=0.690 n=5+5)

name    old P95-time/op  new P95-time/op  delta
HTTP-8  553µs ± 4%       557µs ± 6%       ~    (p=1.000 n=5+5)

name    old P99-time/op  new P99-time/op  delta
HTTP-8  980µs ± 1%       990µs ± 2%       ~    (p=0.421 n=5+5)

name    old STW-ns/GC  new STW-ns/GC  delta
HTTP-8  79.7k ± 1%     81.2k ± 6%     ~    (p=0.151 n=5+5)

name    old STW-time/op  new STW-time/op  delta
HTTP-8  40.4ns ± 1%      41.0ns ± 5%      ~    (p=0.238 n=5+5)

name    old allocated-bytes/op  new allocated-bytes/op  delta
HTTP-8  4.80k ± 0%              4.80k ± 0%              ~    (all equal)

name    old allocs/op  new allocs/op  delta
HTTP-8  61.0 ± 0%      61.0 ± 0%      ~    (all equal)

name    old bytes-from-system  new bytes-from-system  delta
HTTP-8  44.9M ± 0%             44.9M ± 0%             ~    (all equal)

name    old heap-bytes-from-system  new heap-bytes-from-system  delta
HTTP-8  38.0M ± 0%                  38.0M ± 0%                  ~    (p=0.103 n=5+5)

name    old other-bytes-from-system  new other-bytes-from-system  delta
HTTP-8  3.87M ± 0%                   3.87M ± 0%                   ~    (all equal)

name    old peak-RSS-bytes  new peak-RSS-bytes  delta
HTTP-8  47.5M ± 0%          47.5M ± 0%          ~    (p=0.619 n=5+4)

name    old stack-bytes-from-system  new stack-bytes-from-system  delta
HTTP-8  1.55M ± 3%                   1.61M ± 4%                   ~    (p=0.103 n=5+5)

name    old user+sys-time/op  new user+sys-time/op  delta
HTTP-8  87.2µs ± 0%           86.5µs ± 2%           ~    (p=0.690 n=5+5)
```
(Btw, I'm excited to see percentiles in there!)
I also see some crazy geomeans in there: stack-bytes-from-system at -9.8%, and STW-ns/op and STW-ns/GC at +6%. It is very unlikely that this CL, which only changes code layout, impacted either of those. Note that the results that contributed to them all have very high variances. Unless/until we can get those variances down, we probably should just suppress geomeans there (a sketch of such a guard follows below). Or run more iterations? Not sure.
I point all these out to illustrate the key meta-problem of this dashboard: how to interpret this data meaningfully, sorting the relevant from the irrelevant, while still not missing important but unexpected results.
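To make the suppression idea concrete, here is a minimal sketch of the kind of guard I have in mind. This is not benchstat's actual code; the `result` type, the field names, and the 5% cutoff are all made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// result holds one benchmark's summarized old/new means and their
// relative variances (the ± percentages, expressed as fractions).
type result struct {
	name             string
	oldMean, newMean float64
	oldVar, newVar   float64
}

// geomeanDelta returns the geometric mean of the new/old ratios as a
// fractional change, or ok=false if any contributing result is too
// noisy to trust, in which case the geomean line would be suppressed.
func geomeanDelta(results []result, maxVar float64) (delta float64, ok bool) {
	logSum := 0.0
	for _, r := range results {
		if r.oldVar > maxVar || r.newVar > maxVar {
			return 0, false // variance too high; suppress
		}
		logSum += math.Log(r.newMean / r.oldMean)
	}
	return math.Exp(logSum/float64(len(results))) - 1, true
}

func main() {
	rs := []result{
		// STW-ns/GC from the table above: ±1% old, ±6% new.
		{"HTTP-8/STW-ns-per-GC", 79.7e3, 81.2e3, 0.01, 0.06},
	}
	if d, ok := geomeanDelta(rs, 0.05); ok {
		fmt.Printf("geomean %+.1f%%\n", d*100)
	} else {
		fmt.Println("geomean suppressed: variance too high")
	}
}
```

With these inputs the ±6% run trips the 5% cutoff and the geomean is suppressed; running more iterations, as suggested above, should shrink the variances until the guard passes again.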
See the "n=4+5"? The benchstat code threw out one of the points from the old set as an outlier, but did not throw out any points from the new set. If you grab the raw data, you will see that both old and new exhibit similar variability. I think this is probably just because the metric is very noisy; it depends on exactly when each GC happens to run. Also keep in mind that the benchmark runs in parallel, which further exacerbates the GC's nondeterminism.
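For reference, I believe benchstat's outlier rejection is interquartile-range based; the sketch below illustrates that style of filter (drop points outside 1.5×IQR of the quartiles), but treat the quantile interpolation and the example data as illustrative rather than benchstat's exact implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// quantile returns the q-th quantile (q in [0,1]) of sorted vals
// using simple linear interpolation between adjacent points.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

// removeOutliers keeps only values within 1.5*IQR of the quartiles.
func removeOutliers(vals []float64) []float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	q1 := quantile(sorted, 0.25)
	q3 := quantile(sorted, 0.75)
	iqr := q3 - q1
	lo, hi := q1-1.5*iqr, q3+1.5*iqr
	var kept []float64
	for _, v := range vals {
		if v >= lo && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	// Five runs where one went amok; the filter drops it, giving n=4.
	old := []float64{47.5, 47.6, 47.4, 47.5, 52.9}
	fmt.Println(removeOutliers(old)) // [47.5 47.6 47.4 47.5]
}
```

Note that a filter like this can fire on one side and not the other even when both sides are similarly variable, which is exactly the asymmetry behind "n=4+5".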
Your repro doesn't actually include peak-VM-bytes. Whether it's included depends on the OS; it looks like it is currently only implemented for GOOS=linux. Were you trying to reproduce on a different OS?
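For what it's worth, on Linux the natural source for this number is the VmPeak field of /proc/self/status, which has no portable equivalent; presumably that's why the metric is Linux-only. A minimal sketch of reading it (illustrative, not necessarily how x/benchmarks does it):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// peakVMBytes reads the process's peak virtual memory size from the
// Linux-only /proc/self/status pseudo-file ("VmPeak:  12345 kB").
func peakVMBytes() (uint64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err // not on Linux, or /proc unavailable
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "VmPeak:") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			break
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, err
		}
		return kb * 1024, nil // reported in kB
	}
	return 0, fmt.Errorf("VmPeak not found")
}

func main() {
	n, err := peakVMBytes()
	if err != nil {
		fmt.Println("peak-VM-bytes unavailable:", err)
		return
	}
	fmt.Println("peak-VM-bytes:", n)
}
```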
Testing on my own Linux machine, I definitely get similar variance in peak-VM-bytes. It seems to be anticorrelated with GOMAXPROCS: larger GOMAXPROCS produces less variance. You ran with 8; the trybots ran with 2. Results here.
Your overall point is still well-taken, though. We need to think hard about how to meaningfully analyze the data for significance. So far, I have entirely punted that decision to the existing code in benchstat. The geomean lines currently do no significance testing at all; they always report the geomean regardless of variance. Note also that the geomean line shows the percentage difference of the raw values, instead of the geomean of the percentages; the sketch below shows how the two can diverge.
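To make that distinction concrete: the two summaries agree when every benchmark contributes the same number of runs, but they can diverge once outlier rejection leaves uneven run counts. A self-contained sketch with made-up numbers (none of this is benchstat's code):

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs.
func geomean(xs []float64) float64 {
	s := 0.0
	for _, x := range xs {
		s += math.Log(x)
	}
	return math.Exp(s / float64(len(xs)))
}

func main() {
	// Raw measurements for two hypothetical benchmarks: the first
	// lost a run to outlier rejection (4 runs), the second kept 5.
	oldRuns := [][]float64{{10, 10, 10, 10}, {5, 5, 5, 5, 5}}
	newRuns := [][]float64{{9, 9, 9, 9}, {6, 6, 6, 6, 6}}

	// (1) Percentage difference of the raw geomeans: pool every run.
	var oldAll, newAll []float64
	for i := range oldRuns {
		oldAll = append(oldAll, oldRuns[i]...)
		newAll = append(newAll, newRuns[i]...)
	}
	raw := geomean(newAll)/geomean(oldAll) - 1

	// (2) Geomean of the per-benchmark percentage changes.
	var ratios []float64
	for i := range oldRuns {
		ratios = append(ratios, geomean(newRuns[i])/geomean(oldRuns[i]))
	}
	per := geomean(ratios) - 1

	fmt.Printf("raw: %+.1f%%  per-benchmark: %+.1f%%\n", raw*100, per*100)
}
```

Here the pooled computation weights the 5-run benchmark more heavily than the 4-run one, so the two summaries disagree (+5.6% vs +3.9% for these inputs).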
Another data point here, from a trybot run for CL 43637, which shouldn't impact any go1 benchmarks: GobDecode is -25%, GobEncode is -20%, TimeFormat is +23%. And the variances for all the go1 benchmarks are super high; typical variances on my machine are 1-3%.
@quentinmit @bradfitz