
testing: consider calling ReadMemStats less during benchmarking #20875

Open · josharian opened this issue Jun 30, 2017 · 20 comments
@josharian
Contributor

josharian commented Jun 30, 2017

https://golang.org/cl/36791 reduced the number of times benchmarks call ReadMemStats. However, the implementation was incorrect (#20590, #20863), and it was rolled back (https://golang.org/cl/46612 and https://golang.org/cl/47350).

The rationale for the rollback is that ReadMemStats is now fast (https://golang.org/cl/34937). However, using tip as of June 30, 2017 (445652f), CPU profiling of package sort's benchmarks still shows almost half of all execution time in ReadMemStats. See sort-cpu.pdf.

So for 1.10, either ReadMemStats should be made cheaper still, or we should re-roll https://golang.org/cl/36791 with a better implementation; see the initial patchsets of https://golang.org/cl/46612 for improvements.

cc @bradfitz @aclements @meirf @ALTree

@josharian josharian added this to the Go1.10 milestone Jun 30, 2017
@bradfitz
Contributor

Sorry about that. I suspected it might be the case that ReadMemStats would still be noticeably expensive after the rollback, but I also suspected you would file this very bug. I would've been happy rolling forward a few weeks ago, but at this point we're trying to lock down for a release.

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 30, 2017
@josharian
Contributor Author

I also suspected you would file this very bug.

How did you guess? :)

at this point we're trying to lock down for a release.

But of course; not a problem. Happy to fix in 1.10 instead.

@aclements
Member

Interesting. Do you happen to know how many times ReadMemStats is called versus how much time each call takes? After I optimized it I wasn't able to get it to take more than a few microseconds even under stress testing.

@josharian
Contributor Author

Do you happen to know how many times ReadMemStats is called versus how much time each call takes?

ReadMemStats gets called a lot--every iteration, via StartTimer/StopTimer, e.g. at https://github.com/golang/go/blob/master/src/sort/sort_test.go#L640.
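
For reference, a minimal sketch of that pattern (modeled loosely on sort's benchmarks, simplified): the timer is stopped and restarted on every iteration, and each of those calls snapshots allocation stats via runtime.ReadMemStats.

```go
package sort_test

import (
	"math/rand"
	"sort"
	"testing"
)

// Sketch of the per-iteration StopTimer/StartTimer pattern in sort's
// benchmarks. Each StopTimer and StartTimer call invokes
// runtime.ReadMemStats to snapshot allocation stats, so the benchmark
// pays that cost twice per iteration.
func BenchmarkSortIntSketch(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer() // exclude setup from timing; triggers ReadMemStats
		data := make([]int, 1<<10)
		for j := range data {
			data[j] = rand.Int()
		}
		b.StartTimer() // resume timing; triggers ReadMemStats again
		sort.Ints(data)
	}
}
```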

@aclements
Member

ReadMemStats gets called a lot--every iteration, via StartTimer/StopTimer, e.g. at https://github.com/golang/go/blob/master/src/sort/sort_test.go#L640.

Oh, geez, okay. :)

If there's a way to safely avoid ReadMemStats, that's fine, but I also don't think we should over-optimize for this one weirdness in this one benchmark.

@josharian
Contributor Author

If there's a way to safely avoid ReadMemStats, that's fine, but I also don't think we should over-optimize for this one weirdness in this one benchmark.

I think there is, more or less described above. And for better or for worse, calling StartTimer/StopTimer every iteration is not all that rare. :) Anyway, this is on my plate to try again for 1.10.

@josharian josharian self-assigned this Jul 21, 2017
@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. Performance labels Nov 28, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 28, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@iand
Contributor

iand commented Sep 26, 2018

How about an opt-in approach: a new SkipMemStats method on B. When called, it would disable the collection of memory usage statistics. Running the benchmark with -test.benchmem would override SkipMemStats, as would a later call to ReportAllocs in the same benchmark, if people cared to do that.
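
To illustrate, a hedged sketch of how the proposed opt-in might look in use. SkipMemStats is hypothetical and was never added to the testing package; setup and parse are placeholder helpers.

```go
// Hypothetical usage of the proposed opt-in. B.SkipMemStats does not exist
// in the testing package; setup and parse are placeholder helpers.
func BenchmarkParseSketch(b *testing.B) {
	b.SkipMemStats() // hypothetical: skip ReadMemStats in Start/StopTimer
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		input := setup() // placeholder per-iteration setup, excluded from timing
		b.StartTimer()
		parse(input) // placeholder function under benchmark
	}
}
```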

@iand
Contributor

iand commented Sep 27, 2018

I implemented the opt-in I suggested above and tried it out on the sort package tests. I found it made very little difference in the actual benchmark. This is on linux/amd64.

The cpu profile of master on my system shows that ReadMemStats has negligible impact. This may just be an issue with the speed of ReadMemStats on some architectures. cc @josharian @aclements

Benchmarks of sort when calling SkipMemStats compared to master:

benchmark                         old ns/op      new ns/op      delta
BenchmarkSearchWrappers-8         88.1           89.3           +1.36%
BenchmarkSortString1K-8           137554         139288         +1.26%
BenchmarkSortString1K_Slice-8     105527         104311         -1.15%
BenchmarkStableString1K-8         193180         195908         +1.41%
BenchmarkSortInt1K-8              82989          84371          +1.67%
BenchmarkStableInt1K-8            98501          99706          +1.22%
BenchmarkStableInt1K_Slice-8      53439          53342          -0.18%
BenchmarkSortInt64K-8             8134449        8232955        +1.21%
BenchmarkSortInt64K_Slice-8       4556906        4604299        +1.04%
BenchmarkStableInt64K-8           8856651        8728228        -1.45%
BenchmarkSort1e2-8                46021          43748          -4.94%
BenchmarkStable1e2-8              90498          89224          -1.41%
BenchmarkSort1e4-8                9528589        9573428        +0.47%
BenchmarkStable1e4-8              28448203       28591226       +0.50%
BenchmarkSort1e6-8                1484072102     1470491590     -0.92%
BenchmarkStable1e6-8              5864716595     5833299379     -0.54%

@ALTree
Member

ALTree commented Sep 27, 2018

@iand you should run each benchmark multiple times (see the -count flag of go test), and then compare the output using benchstat (instead of benchcmp). It'll give you a report that is less likely to be influenced by random external noise during the test.

@iand
Contributor

iand commented Sep 27, 2018

@ALTree I did that too, but didn't post the results here, since the profile showed that ReadMemStats was not affecting the run times on my system.

@josharian
Contributor Author

How about an opt-in approach

The approach I sketched (in the linked CLs) should not require any new API. I still think it is worth exploring, and it remains in my queue.


I originally found that:

CPU profiling of package sort's benchmarks still shows almost half of all execution time in ReadMemStats

@iand, you report:

The cpu profile of master on my system shows that ReadMemStats has negligible impact.

That's an interesting mismatch. I will have to re-measure to see whether I can still reproduce.

@mvdan
Member

mvdan commented Jan 13, 2019

I agree with Josh that calling StartTimer and StopTimer at every iteration isn't rare. For example, I tend to do that often when the setup/cleanup cost per benchmark iteration is comparable to the benchmark work itself.

And this is a big problem when both of these costs are small. For example, right now I was benchmarking a function that took just under a microsecond, and the pprof cpu profile showed that ReadMemStats took over 60% of the CPU, while my function took less than 3%.

Removing Start/StopTimer from my benchmark is a possibility, but then I'm not really benchmarking my function anymore. The numbers could get better or worse by only changing the setup/cleanup code.

The cost of starting and stopping the timer can also throw off -benchtime=duration estimates, if it is much larger than the cost of the benchmarked function itself. For example, see #27217 (comment), where a go test -benchtime=0.1s ran for over five seconds. I can provide a small standalone example if that would help illustrate the point.
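
For concreteness, a hedged sketch of what such a standalone example could look like (an assumed reproducer, not mvdan's actual code): the measured work is well under a microsecond, so the two untimed ReadMemStats calls per iteration dominate the CPU profile and stretch wall-clock time well past -benchtime.

```go
package repro_test

import (
	"strconv"
	"testing"
)

var sink int // keep the result live so the measured work isn't optimized away

// The function under test is a stand-in taking well under a microsecond.
// StopTimer and StartTimer each run once per iteration outside the measured
// time, and each call does a ReadMemStats, so that cost inflates wall-clock
// time without showing up in ns/op, which also skews -benchtime estimates.
func BenchmarkTinyWithTimerToggle(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer() // untimed per-iteration setup/cleanup; calls ReadMemStats
		s := strconv.Itoa(i)
		b.StartTimer() // calls ReadMemStats again before timing resumes
		sink = len(s) // the sub-microsecond work actually being measured
	}
}
```

Profiling a run like go test -bench=TinyWithTimerToggle -cpuprofile=cpu.out and inspecting it with go tool pprof -top cpu.out would be expected to show ReadMemStats near the top, much like the numbers above.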

@iand
Contributor

iand commented Jan 13, 2019

@mvdan what OS are you testing on? Are you able to repeat the results across Linux/Win/Mac?

@mvdan
Member

mvdan commented Jan 13, 2019

I am on go version devel +5f699e400a Thu Jan 10 21:21:34 2019 +0000 linux/amd64. I'd like to have Windows/Mac machines around to test things with, but I'd rather not pay for those :)

mvdan added a commit to mvdan/benchinit that referenced this issue Jan 14, 2019
It can completely mess up the benchmark numbers for init functions that
were too small. Moreover, it could make the -benchtime estimates be way
off. For example, 'benchinit cmd/go' was taking over a minute to run the
benchmark, instead of the expected ~1s.

The benchtime estimate being off is likely the upstream issue
golang/go#27217.

The fact that StartTimer and StopTimer are expensive is being tracked in
golang/go#20875.
@andybons andybons removed this from the Go1.13 milestone Jul 8, 2019
@seebs
Contributor

seebs commented Apr 15, 2020

Probably irrelevant to most benchmarks, but: I suspect the impact may vary widely because ReadMemStats needs to stop the world. So single-threaded benchmarks might not be affected, but in parallel runs you might see very weird impact, and a lot of that impact won't be directly attributable to ReadMemStats itself.

@aclements
Member

Stopping the world is typically < 50us even in parallel runs, since it's also critical to GC latency. Of course, if a benchmark is usually measured in nanoseconds, this is still really slow by comparison. :)

I wonder if testing's ReadMemStats is less of a problem on tip now that it no longer blocks during a GC.

Also, it may be that @mknyszek's unstable runtime metrics API could help here. With that API, the testing package could ask for just the couple metrics it actually wants. That would be cheaper to collect and maybe we could even do just those without stopping the world.

@mknyszek
Contributor

Stopping the world is typically < 50us even in parallel runs, since it's also critical to GC latency. Of course, if a benchmark is usually measured in nanoseconds, this is still really slow by comparison. :)

I wonder if testing's ReadMemStats is less of a problem on tip now that it no longer blocks during a GC.

It should be a lot better, especially if the benchmark is allocating a lot, but as you say it's still going to be problematic for very small benchmarks.

(See #19812 for details.)

Also, it may be that @mknyszek's unstable runtime metrics API could help here. With that API, the testing package could ask for just the couple metrics it actually wants. That would be cheaper to collect and maybe we could even do just those without stopping the world.

Unfortunately the testing package measures the number of mallocs, which will require stopping the world for the foreseeable future. This is because we need to flush those stats out of each mcache to get an exact number (we need to count on the fast path). We could expose an "approximate" mallocs statistic which may or may not be monotonic (I'm not sure how difficult it'll be to do that given that we do that whole "assume the span gets filled" trick). That doesn't really help the testing package, though, which probably wants an exact number.

It's still probably worth reducing the number of calls if there's a way.

(See #37112 for details about that API, though.)

@gopherbot

Change https://golang.org/cl/257647 mentions this issue: testing: grow benchmark iteration incrementally

@mknyszek
Contributor

mknyszek commented Apr 5, 2022

I recently came across this just by happenstance, and the runtime/metrics package doesn't require a STW in any way. We should replace the call to ReadMemStats in the testing package.

Though, there's currently a monotonicity bug. The fix is up for 1.19.
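
For reference, a minimal sketch (assuming the current runtime/metrics metric names) of reading just the two cumulative allocation counters that benchmarking needs, without a full ReadMemStats:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

var sink []byte

// readAllocCounters reads only the cumulative allocation counters via
// runtime/metrics (no stop-the-world), rather than a full runtime.ReadMemStats.
// Note: because per-P stats aren't force-flushed, a single small delta can be
// missed, as discussed in the next comment.
func readAllocCounters() (objects, bytes uint64) {
	samples := []metrics.Sample{
		{Name: "/gc/heap/allocs:objects"}, // analogous to MemStats.Mallocs
		{Name: "/gc/heap/allocs:bytes"},   // analogous to MemStats.TotalAlloc
	}
	metrics.Read(samples)
	return samples[0].Value.Uint64(), samples[1].Value.Uint64()
}

func main() {
	objsBefore, bytesBefore := readAllocCounters()
	sink = make([]byte, 1<<20) // some allocation to observe
	objsAfter, bytesAfter := readAllocCounters()
	fmt.Println("objects:", objsAfter-objsBefore, "bytes:", bytesAfter-bytesBefore)
}
```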

@mknyszek
Contributor

I just tried replacing ReadMemStats the easy way and there is a problem with using runtime/metrics: because it doesn't stop the world, it can't easily force flush all the allocation stats like ReadMemStats can. In sum, it's not going to be able to pick up on, say, 1 allocation from 1 run, but if you do many runs, then it will pick up an accurate average.

I'm not sure if that's a deal-breaker. It might be for AllocsPerRun, but that might be OK. I don't know.
