
testing: add -benchsplit to get more data points #19128

Open
bcmills opened this issue Feb 16, 2017 · 21 comments
Labels
FeatureRequest, help wanted, NeedsFix (the path to resolution is known, but the work has not been done), Proposal, Proposal-Accepted
Comments

@bcmills (Contributor) commented Feb 16, 2017

For #18177, we need to make tradeoffs between worst-case latency and average throughput for the functions in question.

It's relatively easy to measure the average throughput today using benchmarks with the testing package.

It would be nice if I could use those same benchmarks to measure the distribution of timings across iterations. That isn't really feasible using (*B).N with caller-side iteration, but it might at least be possible to tap into (*PB).Next to get some idea of the latency between calls for parallel benchmarks.
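As a concrete (if crude) sketch of that idea in user code today, where the map under test, the key, and the reporting step are all placeholders rather than a proposed API:

package bench_test

import (
    "sync"
    "testing"
    "time"
)

var m sync.Map // placeholder structure under test

// BenchmarkLoadLatency records the gap between successive pb.Next calls
// in each worker, yielding one latency sample per iteration. The extra
// time.Now calls add overhead, so treat the samples as indicative only.
func BenchmarkLoadLatency(b *testing.B) {
    m.Store("key", 1)
    var mu sync.Mutex
    var samples []time.Duration
    b.RunParallel(func(pb *testing.PB) {
        var local []time.Duration
        prev := time.Now()
        for pb.Next() {
            m.Load("key") // the operation under test
            now := time.Now()
            local = append(local, now.Sub(prev))
            prev = now
        }
        mu.Lock()
        samples = append(samples, local...)
        mu.Unlock()
    })
    _ = samples // e.g. sort and report p50/p99 here
}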

@gopherbot

CL https://golang.org/cl/37153 mentions this issue.

@gopherbot

CL https://golang.org/cl/37342 mentions this issue.

@rsc (Contributor) commented Mar 6, 2017

/cc @aclements

rsc added this to the Proposal milestone Mar 6, 2017
@josharian (Contributor)

Related: #7465

@aclements (Member)

> That isn't really feasible using (*B).N with caller-side iteration

One possibility is to crank up N only far enough to eliminate clock measurement overhead and then use caller-driven iteration to drive up the number of samples/total iterations to the desired measurement interval. However, this may not work well with benchmarks that perform non-trivial setup and use B.ResetTimer.
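A sketch of that scheme as it might look inside a hypothetical harness (collectSamples, nMin, and budget are all invented names, not existing testing API):

package harness

import (
    "testing"
    "time"
)

// collectSamples repeats a small fixed-N run until the time budget is
// spent, yielding one duration sample per run instead of a single total.
// nMin is the smallest b.N at which clock overhead is negligible. As noted
// above, this breaks down for benchmarks that do setup and b.ResetTimer,
// since the setup cost would be re-paid on every sample.
func collectSamples(fn func(b *testing.B), nMin int, budget time.Duration) []time.Duration {
    var samples []time.Duration
    for total := time.Duration(0); total < budget; {
        b := &testing.B{N: nMin}
        start := time.Now()
        fn(b)
        d := time.Since(start)
        samples = append(samples, d)
        total += d
    }
    return samples
}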

@josharian (Contributor)

> benchmarks that perform non-trivial setup and use B.ResetTimer

Also related: #10930. I bring it up because I learned there that you can in fact detect this case--both the case in which b.ResetTimer is called and the case in which b.ResetTimer is (erroneously) not called. Just look at how the total wall time to run the benchmark compares to the time as recorded by the timer.
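Inside the harness, the check might look something like this sketch (runN and the threshold are hypothetical):

package harness

import "time"

// usesResetTimer runs one fixed-N invocation of the benchmark and compares
// the wall-clock duration of the call against the duration the benchmark
// timer recorded. A timer reading well below the wall clock means the
// function spent significant time in setup excluded via b.ResetTimer.
func usesResetTimer(runN func(n int) time.Duration, n int) bool {
    start := time.Now()
    timed := runN(n) // duration as recorded by the benchmark timer
    wall := time.Since(start)
    return timed < wall/2 // noise threshold is arbitrary
}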

@rsc (Contributor) commented Mar 7, 2017

We can also tell whether b.ResetTimer is in use because we control the implementation of b.ResetTimer.

One of the reasons we've moved from benchcmp to benchstat is that the latter allows multiple data points for a given benchmark. Today those data points are all for roughly the same large size N, but the eventual plan was to emit more lines about smaller N, so that benchstat can do more work with distributions. If anyone wants to experiment with that and let us know how it goes, please do.
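To illustrate, benchstat already treats repeated lines for the same benchmark as samples of one distribution, so the plan above amounts to having package testing emit output like this itself (the numbers are invented):

BenchmarkLoad-8   2000000    98.1 ns/op
BenchmarkLoad-8   2000000   101.3 ns/op
BenchmarkLoad-8   2000000    99.7 ns/op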

I'd also like to have a check somewhere (in package testing or benchstat, probably the former) that b.N really is scaling linearly. If the time for b.N is 4X that of b.N/2 and 16X that of b.N/4, the benchmark function is using b.N incorrectly and shouldn't be reported at all. Doing this in package testing, without printing the smaller runs at all, would let benchstat continue to assume that all the reported runs are about the same size.
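A sketch of such a linearity check (timeN and the tolerance are hypothetical):

package harness

import "time"

// scalesLinearly times the benchmark at N/4, N/2, and N and checks that
// each doubling of N roughly doubles the measured time. Ratios near 4x
// per doubling, as described above, indicate the function's cost is
// quadratic in b.N, i.e. b.N is being used incorrectly.
func scalesLinearly(timeN func(n int) time.Duration, n int) bool {
    t1 := timeN(n / 4)
    t2 := timeN(n / 2)
    t4 := timeN(n)
    const tol = 3.0 // generous margin for noise; hypothetical
    return float64(t2)/float64(t1) < tol && float64(t4)/float64(t2) < tol
}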

@josharian (Contributor)

Adding a correlation coefficient test to benchmarks seems like a good idea. We could also use the estimated slope of a (weighted?) linear regression as the ns/op instead of straight division, which in theory might yield better results.

I tried hacking in a quick-and-dirty linear regression and got plausible slope numbers. (I didn't mail it because I used a naive rather than a numerically stable calculation.)
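For reference, a numerically stable form of that slope estimate uses an online (Welford-style) covariance update; everything below is illustrative:

package harness

import "time"

// slopeNsPerOp fits measured duration against iteration count and returns
// the slope in ns/op. Unlike straight division, the slope discounts any
// fixed per-run overhead (the N=0 intercept).
func slopeNsPerOp(ns []int, ds []time.Duration) float64 {
    var meanX, meanY, cov, varX float64
    for i, n := range ns {
        x := float64(n)
        y := float64(ds[i].Nanoseconds())
        k := float64(i + 1)
        dx := x - meanX
        meanX += dx / k
        meanY += (y - meanY) / k
        cov += dx * (y - meanY)  // co-moment, using the updated meanY
        varX += dx * (x - meanX) // variance moment, using the updated meanX
    }
    return cov / varX
}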

I don't plan to work on this further at the moment.

@rsc (Contributor) commented Mar 8, 2017

A few years ago I spent a while looking at what Haskell's benchmark tool does. I convinced myself that trying to be clever about slope vs N=0 cost was probably not worthwhile - better to brute force enough iterations that the slope is dominant.

The initial version of Haskell's tool (I forget the name) apparently did fancy things like bootstrapping too, but that seems like overkill as well: super subtle and easy to quietly get wrong. Later versions seem to have backed off some of the complexity.

@josharian (Contributor)

> I convinced myself that trying to be clever about slope vs N=0 cost was probably not worthwhile - better to brute force enough iterations that the slope is dominant.

Fair enough. Looking at the N=0 cost may still be worth doing, if only to catch bugs like dbf533a. (Of course, one could brute force enough iterations for that, but it seems inadvisable.)

gopherbot pushed a commit to golang/sync that referenced this issue Mar 15, 2017
In the previous version, Loads of keys with unchanged values would
block on the Mutex if the map was dirty, leading to potentially long
stalls while the read-only map is being copied to the dirty map.

This change adds a double pointer indirection to each entry and an
extra atomic load on the fast path. In exchange, it removes the need
to lock the map for nearly all operations on existing keys.

Each entry is represented by an atomically-updated pointer to an
interface{} value. The same entries are stored in both the read-only
and dirty maps. Keys deleted before the dirty map was last copied are
marked as "expunged" and omitted from the dirty map until the next
Store to that key.

Newly-stored values exist only in the dirty map. The existence of new
keys in the dirty map is indicated by an "amended" bit attached to the
read-only map, allowing Load or Delete calls for nonexistent keys to
sometimes return immediately (without acquiring the Mutex).

This trades off a bit of steady-state throughput in the "mostly hits"
case in exchange for less contention (and lower Load tail latencies)
for mixed Load/Store/Delete usage.

Unfortunately, the latency impact is currently difficult to measure
(#19128).

updates golang/go#19128
updates golang/go#18177

https://perf.golang.org/search?q=upload:20170315.5

name                                               old time/op    new time/op    delta
LoadMostlyHits/*syncmap_test.DeepCopyMap             70.3ns ± 6%    70.2ns ± 4%      ~     (p=0.886 n=4+4)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48          10.9ns ± 4%    13.6ns ±24%      ~     (p=0.314 n=4+4)
LoadMostlyHits/*syncmap_test.RWMutexMap              86.0ns ± 9%    86.0ns ± 4%      ~     (p=1.000 n=4+4)
LoadMostlyHits/*syncmap_test.RWMutexMap-48            140ns ± 3%     139ns ± 2%      ~     (p=0.686 n=4+4)
LoadMostlyHits/*syncmap.Map                          70.6ns ± 8%    70.3ns ± 1%      ~     (p=0.571 n=4+4)
LoadMostlyHits/*syncmap.Map-48                       10.3ns ± 7%    11.2ns ± 5%      ~     (p=0.114 n=4+4)
LoadMostlyMisses/*syncmap_test.DeepCopyMap           59.4ns ± 4%    59.4ns ± 4%      ~     (p=1.000 n=4+4)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48        10.4ns ±30%    12.6ns ±29%      ~     (p=0.486 n=4+4)
LoadMostlyMisses/*syncmap_test.RWMutexMap            64.8ns ± 6%    63.8ns ± 4%      ~     (p=0.686 n=4+4)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48          138ns ± 3%     139ns ± 2%      ~     (p=0.800 n=4+4)
LoadMostlyMisses/*syncmap.Map                        51.5ns ± 7%    50.4ns ± 2%      ~     (p=0.371 n=4+4)
LoadMostlyMisses/*syncmap.Map-48                     9.37ns ± 3%    9.40ns ± 3%      ~     (p=0.886 n=4+4)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap          820ns ±17%     834ns ±14%      ~     (p=1.000 n=4+4)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48      1.38µs ± 4%    1.38µs ± 3%      ~     (p=0.886 n=4+4)
LoadOrStoreBalanced/*syncmap.Map                      709ns ±13%    1085ns ± 9%   +53.09%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                  1.30µs ±13%    1.08µs ± 8%   -17.40%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap           1.26µs ±14%    1.16µs ±29%      ~     (p=0.343 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48        1.82µs ±15%    1.98µs ± 6%      ~     (p=0.143 n=4+4)
LoadOrStoreUnique/*syncmap.Map                       1.06µs ± 7%    1.86µs ±11%   +76.09%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                    2.00µs ± 4%    1.62µs ± 4%   -19.20%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap       32.9ns ± 8%    32.6ns ± 9%      ~     (p=0.686 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48   2.26ns ±136%    3.41ns ±62%      ~     (p=0.114 n=4+4)
LoadOrStoreCollision/*syncmap_test.RWMutexMap        58.0ns ±20%    54.6ns ±17%      ~     (p=0.343 n=4+4)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48      458ns ± 2%     445ns ± 6%      ~     (p=0.229 n=4+4)
LoadOrStoreCollision/*syncmap.Map                    35.8ns ± 5%    39.6ns ± 9%      ~     (p=0.114 n=4+4)
LoadOrStoreCollision/*syncmap.Map-48                 1.24ns ± 2%    1.58ns ± 4%   +27.16%  (p=0.029 n=4+4)
Range/*syncmap_test.DeepCopyMap                      19.7µs ± 4%    19.8µs ± 3%      ~     (p=0.686 n=4+4)
Range/*syncmap_test.DeepCopyMap-48                    763ns ± 1%     864ns ± 3%   +13.24%  (p=0.029 n=4+4)
Range/*syncmap_test.RWMutexMap                       20.9µs ± 3%    20.4µs ± 4%      ~     (p=0.343 n=4+4)
Range/*syncmap_test.RWMutexMap-48                     764ns ± 1%     870ns ± 1%   +13.77%  (p=0.029 n=4+4)
Range/*syncmap.Map                                   20.4µs ± 5%    22.7µs ± 5%   +10.98%  (p=0.029 n=4+4)
Range/*syncmap.Map-48                                 776ns ± 3%     954ns ± 4%   +22.95%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap_test.DeepCopyMap            206ns ± 5%     199ns ± 2%      ~     (p=0.200 n=4+4)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48        8.94µs ± 5%    9.21µs ± 4%      ~     (p=0.200 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap            63.4ns ± 4%    63.8ns ± 3%      ~     (p=0.686 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap-48          184ns ±10%     198ns ±11%      ~     (p=0.343 n=4+4)
AdversarialAlloc/*syncmap.Map                         213ns ± 3%     264ns ± 3%   +23.97%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap.Map-48                      556ns ± 5%    1389ns ± 8%  +149.93%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap           300ns ± 6%     304ns ± 7%      ~     (p=0.886 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap-48        647ns ± 3%     646ns ± 3%      ~     (p=0.943 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap           69.1ns ± 1%    69.2ns ± 6%      ~     (p=0.686 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap-48         289ns ±15%     300ns ±17%      ~     (p=0.829 n=4+4)
AdversarialDelete/*syncmap.Map                        198ns ± 5%     264ns ± 2%   +33.17%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap.Map-48                     291ns ± 9%     173ns ± 8%   -40.50%  (p=0.029 n=4+4)

name                                               old alloc/op   new alloc/op   delta
LoadMostlyHits/*syncmap_test.DeepCopyMap              7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48           7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap               7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap-48            7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap.Map                           7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap.Map-48                        7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap            7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48         7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap             7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48          7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap.Map                         7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap.Map-48                      7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap          95.0B ± 0%     95.0B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48       95.0B ± 0%     95.0B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap.Map                      95.0B ± 0%     88.0B ± 0%    -7.37%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                   95.0B ± 0%     88.0B ± 0%    -7.37%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap             175B ± 0%      175B ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48          175B ± 0%      175B ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap.Map                         175B ± 0%      161B ± 0%    -8.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                      175B ± 0%      161B ± 0%    -8.00%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap        0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48     0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap         0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48      0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map                     0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map-48                  0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.DeepCopyMap                       0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.DeepCopyMap-48                    0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.RWMutexMap                        0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.RWMutexMap-48                     0.00B          0.00B           ~     (all equal)
Range/*syncmap.Map                                    0.00B          0.00B           ~     (all equal)
Range/*syncmap.Map-48                                 0.00B          0.00B           ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap            74.0B ± 0%     74.0B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48        3.15kB ± 0%    3.15kB ± 0%      ~     (p=1.000 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap             8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap-48          8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map                         74.0B ± 0%     55.0B ± 0%   -25.68%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap.Map-48                      8.00B ± 0%    56.25B ± 1%  +603.12%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap            155B ± 0%      155B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap-48         156B ± 0%      156B ± 0%      ~     (p=1.000 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap            8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap-48         8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map                        81.0B ± 0%     65.0B ± 0%   -19.75%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap.Map-48                     23.0B ± 9%     15.2B ± 5%   -33.70%  (p=0.029 n=4+4)

name                                               old allocs/op  new allocs/op  delta
LoadMostlyHits/*syncmap_test.DeepCopyMap               0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48            0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap                0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap-48             0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap.Map                            0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap.Map-48                         0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap             0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48          0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap              0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48           0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap.Map                          0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap.Map-48                       0.00           0.00           ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48        2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap.Map                       2.00 ± 0%      3.00 ± 0%   +50.00%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                    2.00 ± 0%      3.00 ± 0%   +50.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48          2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap.Map                         2.00 ± 0%      4.00 ± 0%  +100.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                      2.00 ± 0%      4.00 ± 0%  +100.00%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap         0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48      0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap          0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48       0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map                      0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map-48                   0.00           0.00           ~     (all equal)
Range/*syncmap_test.DeepCopyMap                        0.00           0.00           ~     (all equal)
Range/*syncmap_test.DeepCopyMap-48                     0.00           0.00           ~     (all equal)
Range/*syncmap_test.RWMutexMap                         0.00           0.00           ~     (all equal)
Range/*syncmap_test.RWMutexMap-48                      0.00           0.00           ~     (all equal)
Range/*syncmap.Map                                     0.00           0.00           ~     (all equal)
Range/*syncmap.Map-48                                  0.00           0.00           ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap              1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap-48           1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map                          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map-48                       1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap            1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap-48         1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap-48          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map                         1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map-48                      1.00 ± 0%      1.00 ± 0%      ~     (all equal)

Change-Id: I93c9458505d84238cc8e9016a7dfd285868af236
Reviewed-on: https://go-review.googlesource.com/37342
Reviewed-by: Russ Cox <rsc@golang.org>
@rsc (Contributor) commented Apr 10, 2017

I'd like to move this along, assuming that we're not going to introduce a new API for writing benchmarks (so no equivalent of pb.Next) and that we can't emit extra output lines by default (preserving the current one benchmark = one output line). And when we do emit extra output lines, all the lines need to be about the same thing (no warmups to filter out from averaging calculations in benchstat or other tools).

That implies that we need to add some flag that causes a benchmark to be run multiple times for smaller counts, with one line of output for each of those runs.

Assuming the default -benchtime=1s, if instead of printing output for one run of 1s we want to print output for 10 runs of 0.1s, this can be specified in two possible ways: by setting the target run duration with the count implicit (-benchtime2=0.1s), or by setting the target number of runs with the duration implicit (-benchsplit=10). We only want to add one of these to package testing.

I don't have a great name for the target run duration (obviously), but I think benchsplit could work if we decide to specify the number of runs instead.

The split factor is subject to change to respect the overall time limit. If, say, one iteration of a benchmark takes 0.6 seconds, then with -benchtime=1s -benchsplit=10, we'll still only run ~2 of them, not 10. This makes it sensible to run go test -benchsplit=10 in the go1 directory, as compared to go test -benchtime=0.1s -count=10. The latter runs even very expensive benchmarks 10 times, while the former only runs them once.

So my proposal for this proposal is to add -benchsplit=N as described above.

I don't know if this really helps @bcmills's original request. If not I apologize for hijacking. We've been talking about getting this data out for years, and it does provide more fine-grained detail about the distribution of runs, just not individual runs. If there are rare high-latency individual runs then hopefully you'd be able to split down to a small enough interval that some are lucky and some are not, and then that would come out in the data as a bimodal distribution.

@bcmills (Contributor, Author) commented Apr 10, 2017

I don't think that really addresses my original request, unfortunately: for the kinds of latency distributions I'm interested in, one output line per run would be far more data than I want to sift through.

But it does at least seem like a step in the right direction.

@josharian (Contributor)

Agreed: this seems like a step in the right direction, and I would definitely use it. The interaction between slow benchmarks and -count has always been painful.

To be clear on the (non-)interaction between -benchsplit and -count, if you did -benchtime=1s -benchsplit=10 -count=5, you'd get anywhere between 5 and 50 lines of output?

@rsc (Contributor) commented Apr 10, 2017

@bcmills, sifting through the output is for computers.

@josharian, yes.

// Sketch, with runAndReport hypothetical: for each -count repetition,
// run and report chunks of benchtime/benchsplit until the budget is spent.
for i := 0; i < count; i++ {
    total := time.Duration(0)
    for total < benchtime {
        t := runAndReport(benchtime / time.Duration(benchsplit))
        total += t
    }
}

@rsc (Contributor) commented Apr 17, 2017

Objections to adding -benchsplit?

/cc @aclements @mpvl

@rsc (Contributor) commented Jun 12, 2017

Sounds like there are no objections to -benchsplit. Anyone want to implement it?

rsc changed the title from "proposal: testing: measure benchmark latency distributions" to "testing: add -benchsplit to get more data points" Jun 12, 2017
rsc modified the milestones: Go1.10, Proposal Jun 12, 2017
@meirf (Contributor) commented Jun 29, 2017

(I've started working on this and plan on having a CL by early next week. Hopefully my recent testing contributions grant me such a runway.)

@gopherbot

CL https://golang.org/cl/47411 mentions this issue.

@gopherbot

Change https://golang.org/cl/80841 mentions this issue: testing: remove claim that b.Run is safe for concurrent use

@cespare (Contributor) commented May 14, 2018

Has progress on this stalled?

ianlancetaylor modified the milestones: Go1.11, Unplanned Jun 29, 2018
ianlancetaylor added the help wanted, NeedsFix, and FeatureRequest labels Jun 29, 2018
@cespare (Contributor) commented Apr 25, 2019

Answering my own question from a year ago: yes, it seems like CL 47411 has stalled out, but the proposal is approved and the details seem to have been mostly ironed out in the extensive discussion on that CL. I think anyone could pick up these changes and take them across the finish line.
