Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: some possible SwissTable map benchmark improvements #70700

Open
thepudds opened this issue Dec 5, 2024 · 5 comments
Open

runtime: some possible SwissTable map benchmark improvements #70700

thepudds opened this issue Dec 5, 2024 · 5 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@thepudds
Copy link
Contributor

thepudds commented Dec 5, 2024

The current runtime map benchmarks (imported from the CockroachDB SwissTable repo and used for recent runtime map work) have clearly been useful, especially for comparing the optimizations that have happened so far, but it might be worth tweaking those benchmarks.

For example, for the current benchmarks:

  1. Most might be overly friendly to the branch predictor with repetitive key patterns, especially at lower elem counts.
  2. Most of the elem sizes are power-of-two, which translates to a relatively low load factor and might not make the tables "sweat" as much as they would in practice.
  3. Most of the benchmarks use a mod operation in the benchmark code, which might be around 1/3 of the CPU time for the benchmarks that fit in cache.
    • This is fine if someone has this in mind when comparing results, but might be cleaner without.
  4. Could probably benefit from some cold map cases (e.g., so many small or medium maps that they become cold) to better assess cache miss impact.
    • The original SwissTable folks and Folly F14 folks had collaborated on C++ benchmarks to help them compare approaches, with a sample of one of their cold map C++ benchmarks here. (A couple of years ago, I had a WIP set of cold map Go benchmarks here, partially based on that).

Part of my immediate interest is the current benchmarks might not capture some of the benefits / costs of some adjustments to the current implementation, like:

  • Trying 16-element group sizes (instead of the current 8), or
  • Alternative memory layouts (like putting all the control bytes for a single table together, rather than current interspersing of control bytes with the slots)

In particular, I have a draft change to the runtime SwissTable for 16-element group sizes (currently SWAR-only, not SIMD), but I suspect some of the benefit would be masked by the current benchmarks (maybe -- I don't know without tweaking the benchmarks 😅).

Separately, the benchmarks as they are might not be showing the advantages of the new implementation as clearly as they could compared to the old runtime map (including because "easier" benchmarks are probably making the prior map implementation look artificially better, vs. "easy" benchmarks not helping the SwissTable implementation as much).


All that said, maybe some of those concerns don't matter in practice or maybe I'm mistaken.

Filing this issue to help with any discussion (including perhaps any feedback of "not worth it").

I plan on sending some benchmark CLs to attempt to improve or mitigate above issues.

CC @prattmic

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 5, 2024
@gabyhelp
Copy link

gabyhelp commented Dec 5, 2024

Related Issues

Related Code Changes

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 6, 2024
@mknyszek mknyszek added this to the Backlog milestone Dec 6, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/634396 mentions this issue: internal/runtime/maps: speed up small map lookups ~1.7x for unpredictable keys

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/634698 mentions this issue: runtime: remove somewhat expensive mod from map benchmarks

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/634700 mentions this issue: runtime: add hot key lookup benchmark for maps

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/634699 mentions this issue: runtime: shuffle keys to reduce impact of branch predictor in map benchmarks

@prattmic prattmic mentioned this issue Dec 13, 2024
5 tasks
gopherbot pushed a commit that referenced this issue Mar 31, 2025
…able keys

On master, lookups on small Swiss Table maps (<= 8 elements) for
non-specialized key types are seemingly a performance regression
compared to the Go 1.23 map implementation (reported in #70849).
Currently, a linear scan is used for gets in these cases.

This CL changes (*Map).getWithKeySmall to instead use the SIMD or SWAR
match on the control bytes to then jump to candidate matching slots,
with sample results below for a 16-byte key. This especially helps the
hit case when the key is unpredictable, which previously had to scan an
unpredictable number of control bytes to find a candidate slot when the
key is unpredictable.

Separately, other CLs in this stack modify the main Swiss Table
benchmarks to randomize lookup key order (vs. previously most of the
benchmarks had a repeating lookup key ordering, which likely is
predictable until the map is too big). We have sample results for the
randomized key order benchmarks followed by results from the older
benchmarks.

The first table below is with randomized key order. For hits, the older
results get slower as there are more elements. With this CL, we see hits
for unpredictable key ordering (sizes 2-8) get a ~1.7x speedup from
~25ns to ~14ns, with a now consistent lookup time for the different
sizes. (The 1 element size map has a predictable key ordering because
there is only one key, and that reports a modest ~0.5ns or ~3%
performance penalty). Misses for unpredictable key order get a ~1.3x
speedup, from ~13ns to ~10ns, with similar results for the 1 element
size.

                                                   │ no-fix-new-bmarks  │ fix-with-new-bmarks   │
                                                   │     sec/op         │  sec/op       vs base │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4        13.26n ±  0%   13.64n ±  0%   +2.90% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4        19.47n ±  0%   13.62n ±  0%  -30.05% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4        22.23n ±  0%   13.64n ±  0%  -38.68% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4        23.98n ±  0%   13.64n ±  0%  -43.11% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4        25.02n ±  0%   13.67n ±  0%  -45.35% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4        25.77n ±  1%   13.68n ±  2%  -46.89% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4        26.38n ±  0%   13.64n ±  0%  -48.28% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4        26.31n ±  0%   13.71n ± 21%  -47.90% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4      13.055n ±  0%   9.815n ±  0%  -24.82% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4      13.070n ±  0%   9.813n ±  0%  -24.92% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4      13.060n ±  0%   9.819n ±  0%  -24.82% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4      13.075n ±  0%   9.816n ±  0%  -24.92% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4      13.060n ±  0%   9.826n ±  0%  -24.76% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4      13.095n ± 19%   9.834n ± 31%  -24.90% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4      13.075n ± 19%   9.822n ± 27%  -24.88% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4       13.11n ± 16%   12.14n ± 19%   -7.43% (p=0.000 n=20)

The next table uses the original benchmarks from just before this CL
stack (i.e., without shuffling lookup keys).

With this CL, we see improvement that is directionally similar to the
above results but not as large, presumably because the branches in the
linear scan are fairly predictable with predictable keys. (The numbers
here also include the time from a mod in the benchmark code, which
seemed to take around ~1/3 of CPU time based on spot checking a couple
of examples, vs. the modified benchmarks shown above have removed that
mod).

                                                  │ master-8c3e391573 │   just-fix-with-old-bmarks       │
                                                  │      sec/op       │    sec/op     vs base            │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4      20.85n ±  0%   21.69n ±  0%   +4.03% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4      21.22n ±  0%   21.70n ±  0%   +2.24% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4      21.73n ±  0%   21.71n ±  0%        ~ (p=0.158 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4      22.06n ±  0%   21.71n ±  0%   -1.56% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4      22.41n ±  0%   21.73n ±  0%   -3.01% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4      22.71n ±  0%   21.72n ±  0%   -4.38% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4      22.98n ±  0%   21.71n ±  0%   -5.53% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4      23.20n ±  0%   21.72n ±  0%   -6.36% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4     19.95n ±  0%   17.30n ±  0%  -13.28% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4     19.96n ±  0%   17.31n ±  0%  -13.28% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4     19.95n ±  0%   17.29n ±  0%  -13.33% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4     19.95n ±  0%   17.30n ±  0%  -13.29% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4     19.96n ± 25%   17.32n ±  0%  -13.22% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4     19.99n ± 24%   17.29n ±  0%  -13.51% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4     19.97n ± 20%   17.34n ± 16%  -13.14% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4     20.02n ± 11%   17.33n ± 14%  -13.44% (p=0.000 n=20)
geomean                                                 21.02n         19.39n         -7.78%

See #70849 for additional benchmark results, including results for arm64
(which also means without SIMD support).

Updates #54766
Updates #70700
Fixes #70849

Change-Id: Ic2361bb6fc15b4436d1d1d5be7e4712e547f611b
Reviewed-on: https://go-review.googlesource.com/c/go/+/634396
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

4 participants