
runtime: enhance map cacheline efficiency via prefetch #52902

Open
simonhf opened this issue May 13, 2022 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Comments

@simonhf

simonhf commented May 13, 2022

I was watching this video [1] on the design of Go's maps. It says a load factor of 6.5 is used, and that this can cause overflow buckets, and I wondered: how many overflow buckets are there in a typical map?

So I wrote this one-liner to simulate the number of overflow buckets for maps with increasing numbers of fixed, non-overflow buckets. For each map size it adds "keys" until the 6.5 load factor is reached (usually resulting in the map being 81% full, at which point a real map would expand and double in size). On reaching the 6.5 load factor it reports the number of 1st-level overflow buckets, as well as 2nd and 3rd level:

$ perl -e '$| ++; use Digest::SHA; foreach $bits(4..26){ $buckets = 1 << $bits; $slots = $buckets * 7; printf qq[- %2d=bits %10d=buckets %10d=slots ], $bits, $buckets, $slots; foreach $seed(1..1){ $a = chr(0) x $buckets; $i = 1; while(1){ $h = hex(substr(Digest::SHA::sha256_hex($i . $seed . q[foobar]), 0, 8)); $b = ($h & ($buckets - 1)); last if($i >= $slots * 81.25 / 100); substr($a, $b, 1) = chr(1 + ord(substr($a, $b, 1))); $i ++; } printf qq[.]; push @p, int($i / $slots * 100); } foreach(sort {$a <=> $b} @p){ printf qq[ %2d%%], $_; } undef @p; $over1 = 0; $over2 = 0; $over3 = 0; foreach $b(0..($buckets - 1)){ if(ord(substr($a, $b, 1)) > 8){ $over1 ++; } if(ord(substr($a, $b, 1)) > 16){ $over2 ++; } if(ord(substr($a, $b, 1)) > 24){ $over3 ++; } } printf qq[ e.g. %10d=i %10d=over1 or %4.1f%% %4d=over2 %d=over3], $i, $over1, $over1 / $buckets * 100, $over2, $over3; printf qq[\n]; }'
-  4=bits         16=buckets        112=slots . 81% e.g.         91=i          2=over1 or 12.5%    0=over2 0=over3
-  5=bits         32=buckets        224=slots . 81% e.g.        182=i          4=over1 or 12.5%    0=over2 0=over3
-  6=bits         64=buckets        448=slots . 81% e.g.        364=i          7=over1 or 10.9%    0=over2 0=over3
-  7=bits        128=buckets        896=slots . 81% e.g.        728=i         12=over1 or  9.4%    0=over2 0=over3
-  8=bits        256=buckets       1792=slots . 81% e.g.       1456=i         33=over1 or 12.9%    0=over2 0=over3
-  9=bits        512=buckets       3584=slots . 81% e.g.       2912=i         62=over1 or 12.1%    0=over2 0=over3
- 10=bits       1024=buckets       7168=slots . 81% e.g.       5824=i        130=over1 or 12.7%    0=over2 0=over3
- 11=bits       2048=buckets      14336=slots . 81% e.g.      11648=i        259=over1 or 12.6%    1=over2 0=over3
- 12=bits       4096=buckets      28672=slots . 81% e.g.      23296=i        514=over1 or 12.5%    2=over2 0=over3
- 13=bits       8192=buckets      57344=slots . 81% e.g.      46592=i        982=over1 or 12.0%    0=over2 0=over3
- 14=bits      16384=buckets     114688=slots . 81% e.g.      93184=i       2003=over1 or 12.2%    2=over2 0=over3
- 15=bits      32768=buckets     229376=slots . 81% e.g.     186368=i       4021=over1 or 12.3%    5=over2 0=over3
- 16=bits      65536=buckets     458752=slots . 81% e.g.     372736=i       7948=over1 or 12.1%    5=over2 0=over3
- 17=bits     131072=buckets     917504=slots . 81% e.g.     745472=i      16071=over1 or 12.3%   12=over2 0=over3
- 18=bits     262144=buckets    1835008=slots . 81% e.g.    1490944=i      32169=over1 or 12.3%   19=over2 0=over3
- 19=bits     524288=buckets    3670016=slots . 81% e.g.    2981888=i      63855=over1 or 12.2%   50=over2 0=over3
- 20=bits    1048576=buckets    7340032=slots . 81% e.g.    5963776=i     127938=over1 or 12.2%  103=over2 0=over3
- 21=bits    2097152=buckets   14680064=slots . 81% e.g.   11927552=i     256563=over1 or 12.2%  162=over2 0=over3
- 22=bits    4194304=buckets   29360128=slots . 81% e.g.   23855104=i     511457=over1 or 12.2%  414=over2 0=over3
- 23=bits    8388608=buckets   58720256=slots . 81% e.g.   47710208=i    1025173=over1 or 12.2%  795=over2 0=over3
- 24=bits   16777216=buckets  117440512=slots . 81% e.g.   95420416=i    2051391=over1 or 12.2% 1578=over2 0=over3
- 25=bits   33554432=buckets  234881024=slots . 81% e.g.  190840832=i    4103432=over1 or 12.2% 3172=over2 0=over3
- 26=bits   67108864=buckets  469762048=slots . 81% e.g.  381681664=i    8201492=over1 or 12.2% 6258=over2 0=over3

At the point of reaching the 6.5 load factor, the simulation generally shows about 12% 1st-level overflow buckets. Presumably that means about 1 in 8 bucket accesses might have to read an overflow bucket to find the key?

Worse, as the number of buckets increases, the number of 2nd-level overflow buckets also increases, although it stays small relative to the number of 1st-level overflow buckets. So if the programmer is interested in absolute worst-case key lookup times, there will be a very small percentage of keys whose lookup takes longer than the rest, because not only a 1st-level overflow bucket but a 2nd-level overflow bucket has to be scanned too?

The key lookup code [2] is still very much as presented in [1] and has these two loops:

hash := t.hasher(key, uintptr(h.hash0))
m := bucketMask(h.B)
b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize))) // regular bucket
...
for ; b != nil; b = b.overflow(t) {
	for i := uintptr(0); i < bucketCnt; i++ {
		if b.tophash[i] != top { // delay point without prefetch
		...
		}
	}
	// come here if key not found in bucket; loop to nth overflow bucket
}

I was thinking that this would be the perfect opportunity to use a CPU prefetch instruction -- if it exists in Golang... does it? -- and the pseudo code would look something like:

for ; b != nil; b = b.overflow(t) {
	b_next := b.overflow(t)
	if b_next != nil {
		__builtin_prefetch(b_next, 0, 0) // prefetch the bucket b_next points to, not the local pointer
	}
	for i := uintptr(0); i < bucketCnt; i++ {
		if b.tophash[i] != top { // delay point without prefetch
		...
		}
	}
	// come here if key not found in bucket; loop to nth overflow bucket
}

Why would it be the perfect opportunity? Because __builtin_prefetch() (from GCC, in this case) is generally difficult to use: depending on how fast the RAM is, there is a sizeable and hard-to-predict asynchronous delay between executing the instruction and the desired cacheline becoming cached and ready to use. Most code does not know which memory address it will need to access that far in the future, which is why the prefetch instruction is notoriously difficult to use in real code...

In this case -- for the regular code in use today, without prefetching -- this async delay becomes a sync delay the first time the cacheline containing address b is accessed, e.g. b.tophash[i]. However, this code is always going to loop over 8 bucket items before accessing the next overflow b.

So if the time to fetch the overflow b cacheline is x, the time to execute the prefetch instruction is y, and the time to loop over the 8 bucket items is z, then using the prefetch instruction we stand to save min(x, z) - y time per overflow bucket: without prefetching a step costs z + x, while with an early prefetch it costs y + z + max(0, x - z) :-)
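To make that arithmetic concrete, here is a toy model (the labels x, y, z follow the text; the function and values are mine, for illustration only):

```go
package main

import (
	"fmt"
	"math"
)

// saving models the per-overflow-bucket time saved by prefetching:
// x = time to fetch the overflow cacheline, y = time to execute the
// prefetch instruction, z = time to scan the 8 bucket items.
// Without prefetch a step costs z + x; with an early prefetch it costs
// y + z + max(0, x-z), so the saving works out to min(x, z) - y.
func saving(x, y, z float64) float64 {
	without := z + x
	with := y + z + math.Max(0, x-z)
	return without - with
}

func main() {
	fmt.Println(saving(100, 1, 20)) // slow RAM: the whole scan overlaps the fetch; saves 19
	fmt.Println(saving(10, 1, 20))  // fast RAM: the fetch completes mid-scan; saves 9
}
```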

I would love to try this out, but does Go already implement a prefetch instruction? I have searched and not found one. If not, how would one go about implementing it? Presumably, if it existed, this could speed up roughly 1 in 8 bucket reads at the current load factor for a fuller map?

Thoughts?

[1] https://www.youtube.com/watch?v=Tl7mi9QmLns
[2] https://github.com/golang/go/blob/master/src/runtime/map.go#L432

@simonhf simonhf changed the title untime: enhance map cacheline efficiency via prefetch runtime: enhance map cacheline efficiency via prefetch May 13, 2022
@ianlancetaylor
Contributor

CC @randall77

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 13, 2022
@ianlancetaylor ianlancetaylor added this to the Backlog milestone May 13, 2022
@randall77
Contributor

We do have a prefetch you can try: it is runtime/internal/sys.Prefetch.

I'm not entirely convinced this would be worth it.

  • A given map is probably operating below the max load factor, making overflows less likely. Running at max load factor is kind of a worst case.
  • Even if a bucket has an overflow, it probably holds only a few elements. Say it is 2. Then there is an 8/10 chance on an existing-key lookup that we won't have to look in the overflow, despite its existence. (On a missing-key lookup, or an insert, we would, though.)
  • Every access would need to execute the instructions needed to compute and conditionally issue the prefetch, even if the overflow doesn't exist or is not needed.
  • Successful prefetches which aren't needed just pollute the cache. In cases where prefetching might help, it is likely that cache space is a scarce resource.

That said, when the prefetch is useful it can save lots of cycles. Definitely worth an experiment.

@simonhf
Author

simonhf commented May 13, 2022

@randall77 good points and thanks for the tip about runtime/internal/sys.Prefetch :-)

Running at max load factor is kind of a worst case

Presumably it wouldn't have to run at max load factor, since the overflow buckets increase in number long before the max load factor is reached? Isn't that also the normal case for any map that grows bigger? I.e. a map will reach the max load factor perhaps many times while growing?

Every access would need execute the instructions needed to compute and conditionally issue the prefetch, even if the overflow doesn't exist, or is not needed.

This gave me the thought that maybe there could be an extra conditional so that this logic only kicks in once the map reaches a certain load factor? Presumably the current load factor is relatively easy to compute? That way, the overhead for general access to a not-heavily-loaded map would be a single, simpler if statement, with none of the "instructions needed to compute and conditionally issue the prefetch"? :-)
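For what it's worth, the runtime already makes a check like this when deciding to grow: overLoadFactor in runtime/map.go. A standalone sketch of it (my re-implementation, not the runtime's actual code; the constants are assumed to match the runtime's loadFactorNum = 13, loadFactorDen = 2, bucketCnt = 8):

```go
package main

import "fmt"

const (
	bucketCnt     = 8  // slots per bucket
	loadFactorNum = 13 // 13/2 = the 6.5 load factor
	loadFactorDen = 2
)

// overLoadFactor reports whether count items spread over 1<<B buckets
// exceeds the 6.5 load factor (a sketch of the runtime's grow check).
func overLoadFactor(count int, B uint8) bool {
	return count > bucketCnt && uint64(count) > loadFactorNum*((uint64(1)<<B)/loadFactorDen)
}

func main() {
	fmt.Println(overLoadFactor(52, 3)) // 52 keys in 8 buckets is exactly 6.5: not over
	fmt.Println(overLoadFactor(53, 3)) // one more key tips it over
}
```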

Definitely worth an experiment

Maybe the experiment could be as follows:

Step 1: Time adding n more keys to a map, and then time reading all map keys using key lookup.
Step 2: Loop again until a map key cap is reached.

Run this experiment with and without the prefetch mechanism, and compare the overall times, and the times for each step 1 operation.

Presumably this experiment would let us compare performance for different load levels? What do you think?

@randall77
Contributor

Presumably it wouldn't have to run at max load factor, and the overflow buckets just increase in number long before the max load factor is reached? Isn't it also normal case for any map that grows bigger? I.e. a map will reach the max load factor perhaps many times during its time growing bigger?

Yes, a map might be at the max load factor many times during its life. But it will also be at lower load factors many times during its life. The chance that you have to look at an overflow bucket is dependent on the current load factor. That load factor could be as high as 6.5, but it just as easily could be as low as 3.25. I like to think that the "average" or "random" map has a load factor of 4.875, which implies a much lower "average" overflow bucket probability. Of course that's kind of imprecise, but I think my point is clear.

This gave me the thought that maybe there could be an extra conditional so that this business logic only kicks in when the map reaches a certain load factor? Presumably the current load factor is relatively easy to compute?

Sure, that's one possible way to do it. Everything has a cost, though. You'd need the cost of the "test if we're in the high load factor regime" to be cheaper than just issuing the prefetch.

Maybe the experiment could be as follows:

I was thinking an initial experiment could be simpler. Allocate and populate a 1M entry map. Time a loop that looks up all the keys in a randomish order. Add prefetching and see if it gets faster. Vary the map size a bit to get to both the min and max load factor.
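A sketch of that baseline experiment (names and sizes are my own; timing the prefetch variant would require a modified runtime, so this only measures the unmodified map):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// timeLookups populates a map with n entries, then times looking up
// every key in a random-ish order (so hardware prefetchers can't help).
func timeLookups(n int) time.Duration {
	m := make(map[int]int)
	for i := 0; i < n; i++ {
		m[i] = i
	}
	order := rand.New(rand.NewSource(1)).Perm(n)
	start := time.Now()
	sum := 0
	for _, k := range order {
		sum += m[k]
	}
	elapsed := time.Since(start)
	if sum != n*(n-1)/2 { // also keeps the loop from being optimized away
		panic("bad sum")
	}
	return elapsed
}

func main() {
	// Vary n around a power of two to land at both low and high load factors.
	for _, n := range []int{1 << 20, 1<<20 + 1<<18, 1<<20 + 1<<19} {
		fmt.Printf("n=%8d: %v for %d lookups\n", n, timeLookups(n), n)
	}
}
```

Running it with and without a runtime patched to prefetch would then show whether the lookup loop gets faster at the higher load factors.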

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
Projects
Status: Triage Backlog