[continued from #20412]

Looking at https://perf.golang.org/search?q=cl:43491+try:Tfbd4a8a6e, which is the perf dashboard result for CL 43491, I see some unlikely things:
Observe the very high variance in the "after" column; it seems like one of the 5 runs went amok somehow due to external factors. I'm not sure how to reproduce this locally, but the x/benchmarks/http benchmarks all came back neutral on my machine before/after that CL, including for memory:
```
name    old time/op    new time/op    delta
HTTP-8  14.1µs ± 1%    14.1µs ± 0%    ~    (p=0.690 n=5+5)

name    old GC-bytes-from-system  new GC-bytes-from-system  delta
HTTP-8  1.47M ± 0%                1.47M ± 0%                ~    (all equal)

name    old P50-time/op  new P50-time/op  delta
HTTP-8  439µs ± 1%       439µs ± 0%       ~    (p=0.690 n=5+5)

name    old P95-time/op  new P95-time/op  delta
HTTP-8  553µs ± 4%       557µs ± 6%       ~    (p=1.000 n=5+5)

name    old P99-time/op  new P99-time/op  delta
HTTP-8  980µs ± 1%       990µs ± 2%       ~    (p=0.421 n=5+5)

name    old STW-ns/GC  new STW-ns/GC  delta
HTTP-8  79.7k ± 1%     81.2k ± 6%     ~    (p=0.151 n=5+5)

name    old STW-time/op  new STW-time/op  delta
HTTP-8  40.4ns ± 1%      41.0ns ± 5%      ~    (p=0.238 n=5+5)

name    old allocated-bytes/op  new allocated-bytes/op  delta
HTTP-8  4.80k ± 0%              4.80k ± 0%              ~    (all equal)

name    old allocs/op  new allocs/op  delta
HTTP-8  61.0 ± 0%      61.0 ± 0%      ~    (all equal)

name    old bytes-from-system  new bytes-from-system  delta
HTTP-8  44.9M ± 0%             44.9M ± 0%             ~    (all equal)

name    old heap-bytes-from-system  new heap-bytes-from-system  delta
HTTP-8  38.0M ± 0%                  38.0M ± 0%                  ~    (p=0.103 n=5+5)

name    old other-bytes-from-system  new other-bytes-from-system  delta
HTTP-8  3.87M ± 0%                   3.87M ± 0%                   ~    (all equal)

name    old peak-RSS-bytes  new peak-RSS-bytes  delta
HTTP-8  47.5M ± 0%          47.5M ± 0%          ~    (p=0.619 n=5+4)

name    old stack-bytes-from-system  new stack-bytes-from-system  delta
HTTP-8  1.55M ± 3%                   1.61M ± 4%                   ~    (p=0.103 n=5+5)

name    old user+sys-time/op  new user+sys-time/op  delta
HTTP-8  87.2µs ± 0%           86.5µs ± 2%           ~    (p=0.690 n=5+5)
```
(Btw, I'm excited to see percentiles in there!)
I also see some crazy geomeans in there: stack-bytes-from-system at -9.8%, and STW-ns/op and STW-ns/GC at +6%. It is very unlikely that this CL, which only changes code layout, impacted either of those. Note that the results that contributed to them all have very high variances. Unless/until we can get those variances down, we probably should just suppress geomeans there (a sketch of such a guard follows below). Or run more iterations? Not sure.
I point all these out to illustrate the key meta-problem of this dashboard: how to interpret this data meaningfully, sorting the relevant from the irrelevant, while still not missing important but unexpected results.
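To make the suppression idea concrete, here is a minimal sketch of the kind of guard I have in mind. This is not benchstat's actual code; the `result` type, the field names, and the 5% cutoff are all made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// result holds one benchmark's summarized old/new means and their
// relative variances (the ± percentages, expressed as fractions).
type result struct {
	name             string
	oldMean, newMean float64
	oldVar, newVar   float64
}

// geomeanDelta returns the geometric mean of the new/old ratios as a
// fractional change, or ok=false if any contributing result is too
// noisy to trust, in which case the geomean line would be suppressed.
func geomeanDelta(results []result, maxVar float64) (delta float64, ok bool) {
	logSum := 0.0
	for _, r := range results {
		if r.oldVar > maxVar || r.newVar > maxVar {
			return 0, false // variance too high; suppress
		}
		logSum += math.Log(r.newMean / r.oldMean)
	}
	return math.Exp(logSum/float64(len(results))) - 1, true
}

func main() {
	rs := []result{
		// STW-ns/GC from the table above: ±1% old, ±6% new.
		{"HTTP-8/STW-ns-per-GC", 79.7e3, 81.2e3, 0.01, 0.06},
	}
	if d, ok := geomeanDelta(rs, 0.05); ok {
		fmt.Printf("geomean %+.1f%%\n", d*100)
	} else {
		fmt.Println("geomean suppressed: variance too high")
	}
}
```

With these inputs the ±6% run trips the 5% cutoff and the geomean is suppressed; running more iterations, as suggested above, should shrink the variances until the guard passes again.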
See the "n=4+5"? The benchstat code threw out one of the points from the old set as an outlier, but did not throw out any points from the new set. If you grab the raw data, you will see that both old and new exhibit similar variability. I think this is probably just because the metric is very noisy; it depends on exactly when each GC happens to run. Also keep in mind that the benchmark runs in parallel, which further exacerbates the GC's nondeterminism.
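For reference, I believe benchstat's outlier rejection is interquartile-range based; the sketch below illustrates that style of filter (drop points outside 1.5×IQR of the quartiles), but treat the quantile interpolation and the example data as illustrative rather than benchstat's exact implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// quantile returns the q-th quantile (q in [0,1]) of sorted vals
// using simple linear interpolation between adjacent points.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

// removeOutliers keeps only values within 1.5*IQR of the quartiles.
func removeOutliers(vals []float64) []float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	q1 := quantile(sorted, 0.25)
	q3 := quantile(sorted, 0.75)
	iqr := q3 - q1
	lo, hi := q1-1.5*iqr, q3+1.5*iqr
	var kept []float64
	for _, v := range vals {
		if v >= lo && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	// Five runs where one went amok; the filter drops it, giving n=4.
	old := []float64{47.5, 47.6, 47.4, 47.5, 52.9}
	fmt.Println(removeOutliers(old)) // [47.5 47.6 47.4 47.5]
}
```

Note that a filter like this can fire on one side and not the other even when both sides are similarly variable, which is exactly the asymmetry behind "n=4+5".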
Your repro doesn't actually include peak-VM-bytes. Whether it's included depends on the OS; it looks like it is currently only implemented for GOOS=linux. Were you trying to reproduce on a different OS?
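For what it's worth, on Linux the natural source for this number is the VmPeak field of /proc/self/status, which has no portable equivalent; presumably that's why the metric is Linux-only. A minimal sketch of reading it (illustrative, not necessarily how x/benchmarks does it):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// peakVMBytes reads the process's peak virtual memory size from the
// Linux-only /proc/self/status pseudo-file ("VmPeak:  12345 kB").
func peakVMBytes() (uint64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err // not on Linux, or /proc unavailable
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "VmPeak:") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			break
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, err
		}
		return kb * 1024, nil // reported in kB
	}
	return 0, fmt.Errorf("VmPeak not found")
}

func main() {
	n, err := peakVMBytes()
	if err != nil {
		fmt.Println("peak-VM-bytes unavailable:", err)
		return
	}
	fmt.Println("peak-VM-bytes:", n)
}
```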
Testing on my own Linux machine, I definitely get similar variance in peak-VM-bytes. It seems to be anticorrelated with GOMAXPROCS: larger GOMAXPROCS produces less variance. You ran with 8; the trybots ran with 2. Results here.
Your overall point is still well-taken, though. We need to think hard about how to meaningfully analyze the data for significance. So far, I have entirely punted that decision to the existing code in benchstat. The geomean lines currently do no significance testing at all; they always report the geomean regardless of variance. Note also that the geomean line shows the percentage difference of the raw values, instead of the geomean of the percentages; the sketch below shows how the two can diverge.
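To make that distinction concrete: the two summaries agree when every benchmark contributes the same number of runs, but they can diverge once outlier rejection leaves uneven run counts. A self-contained sketch with made-up numbers (none of this is benchstat's code):

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs.
func geomean(xs []float64) float64 {
	s := 0.0
	for _, x := range xs {
		s += math.Log(x)
	}
	return math.Exp(s / float64(len(xs)))
}

func main() {
	// Raw measurements for two hypothetical benchmarks: the first
	// lost a run to outlier rejection (4 runs), the second kept 5.
	oldRuns := [][]float64{{10, 10, 10, 10}, {5, 5, 5, 5, 5}}
	newRuns := [][]float64{{9, 9, 9, 9}, {6, 6, 6, 6, 6}}

	// (1) Percentage difference of the raw geomeans: pool every run.
	var oldAll, newAll []float64
	for i := range oldRuns {
		oldAll = append(oldAll, oldRuns[i]...)
		newAll = append(newAll, newRuns[i]...)
	}
	raw := geomean(newAll)/geomean(oldAll) - 1

	// (2) Geomean of the per-benchmark percentage changes.
	var ratios []float64
	for i := range oldRuns {
		ratios = append(ratios, geomean(newRuns[i])/geomean(oldRuns[i]))
	}
	per := geomean(ratios) - 1

	fmt.Printf("raw: %+.1f%%  per-benchmark: %+.1f%%\n", raw*100, per*100)
}
```

Here the pooled computation weights the 5-run benchmark more heavily than the 4-run one, so the two summaries disagree (+5.6% vs +3.9% for these inputs).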
Another data point here, from a trybot run for CL 43637, which shouldn't impact any go1 benchmarks: GobDecode is -25%, GobEncode is -20%, TimeFormat is +23%. And the variances for all the go1 benchmarks are super high; typical variances on my machine are 1-3%.
@quentinmit @bradfitz