cmd/compile: jump tables perform much worse on AMD zen3? #53331

Open · dsnet opened this issue Jun 11, 2022 · 15 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance
Milestone: Unplanned

@dsnet (Member) commented Jun 11, 2022

On my AMD Ryzen 9 5900X, I noticed a drop in performance after jump tables were added (commit 1ba96d8).


The CL warns that microbenchmarks will favor the older binary-search approach, while jump tables will be better utilized in larger benchmarks since they use fewer hardware resources (branch predictors, etc.). To measure this, I tried to craft a micro-benchmark that would exercise the edge cases of either approach. This is what I came up with:

func BenchmarkSame(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesSame {
			sink = KindString(k)
		}
	}
}
func BenchmarkOrdered(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesOrdered {
			sink = KindString(k)
		}
	}
}
func BenchmarkRandom(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesRandom {
			sink = KindString(k)
		}
	}
}

Essentially, KindString is just like reflect.Kind.String, but uses a giant switch statement.
The Same, Ordered, and Random benchmarks keep calling KindString with either the same argument, sequentially ordered arguments (i.e., easily predictable), or completely random arguments. I would expect jump tables to perform better than binary search on Random, since binary search would keep incurring branch mispredictions across its ~log2(27) levels of comparisons.
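
For reference, here is a minimal sketch of the supporting declarations, assuming KindString takes a reflect.Kind and the source slices hold 1024 values; the real benchmark code may differ, and only a few of the 27 switch cases are spelled out:

package jumptables

import (
	"math/rand"
	"reflect"
)

var sink string

// KindString mirrors reflect.Kind.String as one large switch statement,
// which the compiler can lower either to a binary search or to a jump table.
func KindString(k reflect.Kind) string {
	switch k {
	case reflect.Invalid:
		return "invalid"
	case reflect.Bool:
		return "bool"
	case reflect.Int:
		return "int"
	case reflect.String:
		return "string"
	// ... the remaining kinds, one case each ...
	case reflect.UnsafePointer:
		return "unsafe.Pointer"
	default:
		return "unknown"
	}
}

var (
	sourcesSame    = make([]reflect.Kind, 1024) // the same kind repeated
	sourcesOrdered = make([]reflect.Kind, 1024) // kinds in increasing, predictable order
	sourcesRandom  = make([]reflect.Kind, 1024) // uniformly random kinds
)

func init() {
	for i := range sourcesSame {
		sourcesSame[i] = reflect.String
		sourcesOrdered[i] = reflect.Kind(i % 27)
		sourcesRandom[i] = reflect.Kind(rand.Intn(27))
	}
}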

I ran the benchmarks using a Go toolchain built at 1ba96d8 and also the commit right before it.

On at least 3 CPUs that I have access to, I see promising results:

cpu: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
benchmark              old ns/op     new ns/op     delta
BenchmarkSame-8        38859         31988         -17.68%
BenchmarkOrdered-8     41096         40540         -1.35%
BenchmarkRandom-8      154696        108772        -29.69%

Here, jump tables do better than binary search in all cases. Great!

cpu: Intel(R) Celeron(R) CPU J3355 @ 2.00GHz
benchmark              old ns/op     new ns/op     delta
BenchmarkSame-2        47493         44109         -7.13%
BenchmarkOrdered-2     81576         140271        +71.95%
BenchmarkRandom-2      240434        138600        -42.35%

Here, jump tables do worse on Ordered, but perform much better on Random. Somewhat expected.

cpu: DO-Regular
benchmark            old ns/op     new ns/op     delta
BenchmarkSame        32203         23410         -27.30%
BenchmarkOrdered     35731         27531         -22.95%
BenchmarkRandom      179656        125754        -30.00%

This is a Digital Ocean droplet, which is probably an Intel Xeon processor of some kind.
Again, jump tables do better than binary search in all cases. Great!

However, things take a turn for the worse on an AMD Ryzen 9 5900X:

cpu: AMD Ryzen 9 5900X 12-Core Processor            
benchmark               old ns/op     new ns/op     delta
BenchmarkSame-24        19514         61886         +217.14%
BenchmarkOrdered-24     22926         75008         +227.17%
BenchmarkRandom-24      38740         72986         +88.40%

Here, jump tables are dramatically worse in all cases. I'm surprised that they are roughly 2x worse on the Random benchmark, which is exactly where I expected jump tables to outshine binary search.
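
To make that expectation concrete, here is a rough, hand-written analogy of the two lowering strategies; this is not the compiler's actual output, and nameBinary, nameTable, and names are made up purely for illustration:

// Binary-search style: a tree of conditional branches for a small four-case
// switch (the real KindString has 27 cases). Predictable input lets the
// branch predictor learn the path; random input mispredicts roughly once
// per level.
func nameBinary(k int) string {
	if k < 2 {
		if k < 1 {
			return "invalid"
		}
		return "bool"
	}
	if k < 3 {
		return "int"
	}
	return "string"
}

// Jump-table style: a bounds check plus a single data-dependent indexed
// load/indirect branch. There is much less branch-predictor pressure, but
// the dispatch target depends directly on k.
var names = [4]string{"invalid", "bool", "int", "string"}

func nameTable(k int) string {
	if uint(k) < uint(len(names)) {
		return names[k]
	}
	return "unknown"
}

Benchmarking KindString against a plain table lookup like nameTable on a Zen 3 machine might help tell whether the data-dependent dispatch itself is what's slow there.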


I understand that performance improvements are really hard to quantify, since what works well for one CPU design (e.g., Intel) may not be optimal for another (e.g., AMD). However, these results are much more drastic than I was expecting. I don't know whether this issue is just with the Ryzen 5900X or all AMD chips. I also don't know whether this will result in any changes to the implementation of jump tables, but I think this is a surprising result worth reporting.

\cc @randall77 @cherrymui

@Jorropo (Member) commented Jun 11, 2022

I don't know whether this issue is just with the Ryzen 5900X or all AMD chips.

I cannot reproduce this on a Ryzen 5 3600 CPU (Zen 2) with 3200 MHz CL16 RAM:

benchmark               old ns/op     new ns/op     delta
BenchmarkSame-12        19689         20289         +3.05%
BenchmarkOrdered-12     32540         27448         -15.65%
BenchmarkRandom-12      139527        93838         -32.75%

It would be nice to confirm with another Zen 3 CPU to know whether this is a problem for the whole generation.

@dominikh (Member) commented Jun 11, 2022

More Zen 2 results (which do not reproduce the problem):

goos: linux
goarch: amd64
cpu: AMD Ryzen 9 3950X 16-Core Processor
name        old time/op  new time/op  delta
Same-32     19.9µs ± 4%  19.9µs ± 2%     ~     (p=1.000 n=10+10)
Ordered-32  34.1µs ± 4%  27.1µs ± 3%  -20.68%  (p=0.000 n=10+10)
Random-32    127µs ± 1%    92µs ± 1%  -27.19%  (p=0.000 n=10+10)

@florianl (Contributor) commented Jun 11, 2022

More results for an AMD Ryzen 7 PRO 4750U:

goos: linux
goarch: amd64
pkg: jumptables
cpu: AMD Ryzen 7 PRO 4750U with Radeon Graphics
benchmark               old ns/op     new ns/op     delta
BenchmarkSame-16        20669         20676         +0.03%
BenchmarkOrdered-16     36472         28692         -21.33%
BenchmarkRandom-16      131109        97300         -25.79%

@hhoughgg commented Jun 11, 2022

AMD Ryzen 7 5800X

goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800X 8-Core Processor
name        old time/op  new time/op  delta
Same-16     17.2µs ± 0%  15.1µs ± 1%   -12.53%  
Ordered-16  25.9µs ± 3%  73.2µs ± 1%  +182.54% 
Random-16   35.9µs ± 1%  71.0µs ± 2%   +97.93%

@dsnet (Member, Author) commented Jun 11, 2022

Thanks for the results so far. Here's a summary for AMD chips:

CPU                MicroArch  Same Delta  Ordered Delta  Random Delta  Source
Ryzen 5 3400G      Zen 2      +4%         ~              -24%          from @soypat
Ryzen 5 3600       Zen 2      +3%         -16%           -33%          from @Jorropo
Ryzen 7 PRO 4750U  Zen 2      ~           -21%           -26%          from @florianl
Ryzen 9 3900X      Zen 2      -11%        -15%           -32%          from @inkeliz
Ryzen 9 3950X      Zen 2      ~           -21%           -27%          from @dominikh
EPYC 7502P         Zen 2      -22%        -3%            -37%          from @inkeliz
Ryzen 7 5800X      Zen 3      -13%        +183%          +98%          from @hhoughgg
Ryzen 7 PRO 5850U  Zen 3      ~           +182%          +109%         from @mvdan
Ryzen 9 5900X      Zen 3      +217%       +227%          +88%          from @dsnet

@inkeliz commented Jun 11, 2022

AMD Ryzen 9 3900X (Windows, 64 GB 2666 MHz ECC RAM):

name        old time/op  new time/op  delta
Same-24     20.9µs ± 1%  18.6µs ± 1%  -11.02%  (p=0.000 n=10+10)
Ordered-24  30.0µs ± 2%  25.6µs ± 0%  -14.75%  (p=0.000 n=10+10)
Random-24    130µs ± 2%    88µs ± 0%  -32.26%  (p=0.000 n=10+10)

(old = go1.18.2; new = go1.19-55590f3a2b)


AMD EPYC 7502P (Debian, 512 GB 2933 MHz ECC RAM):

name        old time/op  new time/op  delta
Same-64     27.2µs ± 0%  21.2µs ± 0%  -22.12%  (p=0.000 n=10+10)
Ordered-64  37.8µs ± 2%  36.6µs ± 0%   -3.20%  (p=0.000 n=10+9)
Random-64    170µs ± 1%   108µs ± 1%  -36.66%  (p=0.000 n=9+10)

@dsnet changed the title from "cmd/compile: jump tables perform much worse on AMD?" to "cmd/compile: jump tables perform much worse on AMD zen3?" on Jun 11, 2022
@soypat commented Jun 12, 2022

Zen 2 arch, AMD Ryzen 5 3400G with Radeon Vega Graphics:

$ go version
go version go1.18.2 linux/amd64
$ go test -bench=. -count=100 . > old.txt
$ go1.19beta1 test -bench=. -count=35 . > new.txt
$ benchstat old.txt new.txt
name       old time/op  new time/op  delta
Same-8     23.3µs ± 5%  24.2µs ± 3%   +4.16%  (p=0.000 n=91+35)
Ordered-8  29.8µs ± 4%  29.8µs ± 3%     ~     (p=0.517 n=97+34)
Random-8    132µs ± 5%   100µs ± 3%  -24.34%  (p=0.000 n=95+33)

@mvdan (Member) commented Jun 12, 2022

I can confirm that mobile Zen 3 is also affected. This is an AMD Ryzen 7 PRO 5850U with Radeon Graphics on linux/amd64, idle and plugged in so it benchmarks at ~25W, showing the parent commit versus the change:

name        old time/op  new time/op  delta
Same-16     18.3µs ± 2%  18.2µs ± 1%      ~     (p=0.394 n=6+6)
Ordered-16  27.6µs ± 2%  77.8µs ± 1%  +181.89%  (p=0.002 n=6+6)
Random-16   37.4µs ± 0%  78.1µs ± 1%  +108.62%  (p=0.010 n=4+6)

@randall77 (Contributor) commented:

Possibly this effect could be related to similar problems we saw with CMOV generation. Branches, if you occasionally guess right, can let loads happen in parallel that would otherwise be serialized. See the discussion starting at #26306 (comment)
I'm not convinced it is the same effect, but maybe. It would be worth examining the benchmark to see if that effect is possible, and if so maybe trying prefetches to see if that reduces the effect.

@randall77 (Contributor) commented:

Also possibly the loss in performance has something to do with Spectre mitigation?

@thanm added the NeedsInvestigation label on Jun 14, 2022
@gopherbot added the compiler/runtime label on Jul 13, 2022
@seankhliao added this to the Unplanned milestone on Aug 20, 2022
@mvdan (Member) commented Sep 29, 2022

Also possibly the loss in performance has something to do with Spectre mitigation?

Worth noting that I got my slow-down numbers above on Linux with mitigations=off, but I do use up-to-date BIOS firmware and AMD microcode. I'm not sure which of those you mean.

@mvdan (Member) commented Sep 29, 2022

It would be worth examining the benchmark to see if that effect is possible, and if so maybe trying prefetches to see if that reduces the effect.

More than happy to help benchmark attempts at solving the regression, as I imagine few on the compiler team have Zen 3 machines.

@dsnet (Member, Author) commented Oct 16, 2022

It's possible that this report on reddit stems from this issue.

@CAFxX (Contributor) commented Oct 17, 2022

It's possible that this report on reddit stems from this issue.

Possibly unrelated: the OP reported it's more likely a NUMA issue, as disabling the second CCD makes it faster: https://www.reddit.com/r/golang/comments/y4w74o/comment/isjxqby/

@qiulaidongfeng (Contributor) commented:

Zen 4 is also affected.

I used Go 1.18.10 (old) and Go 1.21.3 (new).

cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
            │ old.txt     │ new.txt                              │
            │ sec/op      │ sec/op        vs base                │
Same-16       15.01µ ± 0%   12.90µ ± 0%    -14.08% (p=0.000 n=10)
Ordered-16    18.58µ ± 2%   69.72µ ± 0%   +275.29% (p=0.000 n=10)
Random-16     33.24µ ± 1%   71.37µ ± 1%   +114.71% (p=0.000 n=10)
geomean       21.01µ        40.04µ         +90.59%
