runtime: performance scaling degradation in some Chan benchmarks when futexes became private #25625

Closed
laboger opened this issue May 29, 2018 · 15 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.11

Comments

@laboger
Contributor

laboger commented May 29, 2018

What version of Go are you using (go version)?

tip

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

ppc64le, not sure about others

What did you do?

While comparing performance of some package benchmarks at tip against Go 1.10, I noticed that some of the runtime Chan benchmarks degraded by 2-4X. The degradation was tracked down to the change that made futexes private.

07751f4 runtime: use private futexes on Linux

What did you expect to see?

No significant change.

What did you see instead?

Significant degradation in the following (columns are benchmark name, Go 1.10 ns/op, tip ns/op, delta):

BenchmarkChanSem-16                                         397             1925            +384.89%
BenchmarkChanProdCons100-16                                 603             2571            +326.37%
BenchmarkChanContended-16                                   51225           215813          +321.30%
BenchmarkChanProdCons10-16                                  797             3256            +308.53%
BenchmarkChanProdCons0-16                                   1288            4279            +232.22%
BenchmarkChanProdConsWork10-16                              1133            3692            +225.86%
BenchmarkChanProdConsWork100-16                             1049            3293            +213.92%
BenchmarkChanProdConsWork0-16                               1573            4261            +170.88%
BenchmarkChanPopular-16                                     2335795         4591414         +96.57%

The degradation is worse with higher values of GOMAXPROCS when comparing go 1.10 against latest.

According to the profiles, after the change to make futexes private, more time is spent in the atomic.Xchg statements in the lock function in runtime/lock_futex.go and less time is spent in the futex syscall itself.
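
For reference, the contended path being profiled has roughly the following shape. This is a self-contained paraphrase of lock in runtime/lock_futex.go, not the actual source: the constant values are guesses, the passive osyield phase is omitted, and procYield/futexSleep are stand-ins for runtime internals.

```go
package futexlock

import "sync/atomic"

const (
	mutexUnlocked = 0
	mutexLocked   = 1
	mutexSleeping = 2
	activeSpin    = 4  // assumed analogue of the runtime's active_spin
	activeSpinCnt = 30 // assumed analogue of active_spin_cnt
)

type mutex struct{ key uint32 }

// procYield stands in for the runtime's busy-wait pause (the routine that
// later in this thread gets a proper ppc64x implementation).
func procYield(cycles int) {
	for i := 0; i < cycles; i++ {
		_ = i
	}
}

// futexSleep stands in for blocking in the kernel via futex(FUTEX_WAIT...).
func futexSleep(addr *uint32, val uint32) {}

func lock(l *mutex) {
	// Speculative grab. On ppc64le this Swap (and the CAS below) is an
	// LL/SC loop, so every contended attempt bounces the cache line.
	v := atomic.SwapUint32(&l.key, mutexLocked)
	if v == mutexUnlocked {
		return
	}
	wait := v
	for {
		// Active spin: retry while the lock looks free, pausing in between.
		for i := 0; i < activeSpin; i++ {
			for atomic.LoadUint32(&l.key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(&l.key, mutexUnlocked, wait) {
					return
				}
			}
			procYield(activeSpinCnt)
		}
		// Give up spinning: record that a sleeper exists, then block until
		// the unlocker issues a futex wake.
		v = atomic.SwapUint32(&l.key, mutexSleeping)
		if v == mutexUnlocked {
			return
		}
		wait = mutexSleeping
		futexSleep(&l.key, mutexSleeping)
	}
}
```

If the hypothesis below is right, making the futexes private only makes the sleep/wake side cheaper, so more of the contention shifts onto the Xchg/Cas retries, which is what the profiles show.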

@ianlancetaylor

@ianlancetaylor
Contributor

Could the problem be that the futex code is faster, causing more contention on the memory cache lines?

I benchmarked the change on amd64 and didn't see any significant differences. The hope was that the system load would be a bit less overall, and in any case using the PRIVATE flag seemed clearly correct. Not sure what to do here.

ianlancetaylor added the NeedsInvestigation label May 29, 2018
ianlancetaylor added this to the Go1.11 milestone May 29, 2018
@ianlancetaylor
Contributor

Can you benchmark on ppc64le with tip and with tip with _FUTEX_PRIVATE_FLAG set to 0 in runtime/lock_futex.go?
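
For concreteness, the experiment amounts to zeroing the private flag so the futex sleep/wake calls fall back to shared FUTEX_WAIT/FUTEX_WAKE. A sketch of the change, assuming a constant laid out roughly as referenced above (illustrative, not verbatim source):

```go
// Sketch only: forcing the flag to 0 removes FUTEX_PRIVATE_FLAG (128 in the
// Linux uapi headers) from the futex operations, so they behave as shared
// futexes again -- the "futex not private" configuration measured below.
const _FUTEX_PRIVATE_FLAG = 0 // normally 128
```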

@laboger
Contributor Author

laboger commented May 29, 2018

Could the problem be that the futex code is faster, causing more contention on the memory cache lines?

Yes, it looks like the futex code itself is faster and there is more contention on the Xchg and Cas. But I'm surprised it causes such a huge degradation.

Can you benchmark on ppc64le with tip and with tip with _FUTEX_PRIVATE_FLAG set to 0 in runtime/lock_futex.go?

My first experiment was to build at the commit where the futex was made private and compare that against the commit right before it; that's where the degradation shows up. I will also try your experiment with latest.

@bcmills
Contributor

bcmills commented May 29, 2018

there is more contention with the Xchg and Cas.

If there is a CAS involved, that could be the problem: it could be that the futex used to serialize the CAS operations so that they always succeeded, but now the code is fast enough that they can fail multiple times (presumably with a cache miss on each iteration).
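
To see that failure mode in isolation, a toy program along these lines (everything here is invented for the illustration; it does not touch the runtime's lock) counts how often a CompareAndSwap has to retry when several goroutines contend on a single word:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const goroutines = 16
	const iters = 100000

	var lock uint32     // 0 = free, 1 = held
	var failures uint64 // total failed CAS attempts
	var wg sync.WaitGroup

	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local uint64
			for i := 0; i < iters; i++ {
				// Acquire: retry the CAS until it succeeds. Each failure
				// typically means another core owned the cache line.
				for !atomic.CompareAndSwapUint32(&lock, 0, 1) {
					local++
				}
				// Release immediately; the critical section is empty.
				atomic.StoreUint32(&lock, 0)
			}
			atomic.AddUint64(&failures, local)
		}()
	}
	wg.Wait()
	fmt.Printf("failed CAS attempts: %d (%.2f per acquisition)\n",
		failures, float64(failures)/float64(goroutines*iters))
}
```

Raising GOMAXPROCS should make the per-acquisition failure count grow, which would line up with the observation above that the degradation is worse at higher GOMAXPROCS values.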

@laboger
Contributor Author

laboger commented May 30, 2018

Here are the results comparing latest with and without setting the futex to private. I also reverted to the way benchmark timings were done when count > 1 (#25622) to avoid flaky results.

old = latest
new = futex not private
In both cases, I changed testing/benchmark.go to avoid using hintN on the launch function.

name                    old time/op  new time/op   delta

ChanNonblocking-16      1.46ns ±29%   1.33ns ±26%     ~     (p=0.257 n=4+4)
ChanUncontended-16       927ns ± 8%    918ns ± 2%     ~     (p=0.771 n=4+4)
ChanContended-16         209µs ± 2%     38µs ± 3%  -82.01%  (p=0.029 n=4+4)
ChanSync-16              378ns ± 0%    393ns ± 1%   +3.83%  (p=0.029 n=4+4)
ChanSyncWork-16         9.18µs ± 7%  10.00µs ± 5%     ~     (p=0.057 n=4+4)
ChanProdCons0-16        4.23µs ± 2%   1.26µs ± 3%  -70.10%  (p=0.029 n=4+4)
ChanProdCons10-16       3.27µs ± 0%   0.64µs ± 3%  -80.40%  (p=0.029 n=4+4)
ChanProdCons100-16      2.45µs ± 1%   0.44µs ± 2%  -81.87%  (p=0.029 n=4+4)
ChanProdConsWork0-16    4.27µs ± 1%   1.58µs ± 2%  -63.05%  (p=0.029 n=4+4)
ChanProdConsWork10-16   3.74µs ± 1%   1.07µs ± 1%  -71.38%  (p=0.029 n=4+4)
ChanProdConsWork100-16  3.32µs ± 0%   0.96µs ± 1%  -71.03%  (p=0.029 n=4+4)
ChanCreation-16         90.6ns ± 5%   85.5ns ± 8%     ~     (p=0.200 n=4+4)
ChanSem-16              1.76µs ± 1%   0.39µs ± 1%  -77.78%  (p=0.029 n=4+4)
ChanPopular-16          4.63ms ± 2%   2.37ms ± 0%  -48.87%  (p=0.029 n=4+4)

@ianlancetaylor
Contributor

Thanks for the measurements.

It seems unfortunate to throw sand in the gears to reduce memory contention during microbenchmarks.

My reading of the futex code is that it more or less assumes that atomic.Xchg is fast. On amd64 that is probably true. On ppc64le, which unlike amd64 uses a loop for atomic.Xchg, it sounds like it is slow when contended. I wonder if it might make sense to modify the loop to slow down when it fails.

Similarly we should probably implement procyield in runtime/asm_ppc64x.s to execute a loop of yield instructions, much like the implementation in runtime/asm_amd64.s.

At least according to https://sourceware.org/ml/libc-alpha/2012-11/msg00761.html, the yield instruction would be or r27,r27,r27.
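
Two of the ideas above, slowing down after a failed attempt and pausing between retries, can be sketched at user level like this. It is only an illustration of the technique, not the runtime's code; the busy pause is where procyield sits in the real lock loop:

```go
package backoff

import "sync/atomic"

const maxPause = 1 << 10

// lockWithBackoff spins on plain loads (which keep the cache line in a
// shared state) and attempts the CAS only when the lock looks free,
// doubling the pause after every failed attempt up to a cap.
func lockWithBackoff(state *uint32) {
	pause := 1
	for {
		// Wait for the lock to look free without writing to it.
		for atomic.LoadUint32(state) != 0 {
		}
		if atomic.CompareAndSwapUint32(state, 0, 1) {
			return
		}
		// Lost the race: back off before trying again. In the runtime the
		// equivalent pause is procyield(n); the ppc64x change referenced
		// later in this thread lowers the SMT thread priority for the
		// duration of that pause.
		for i := 0; i < pause; i++ {
			_ = i
		}
		if pause < maxPause {
			pause <<= 1
		}
	}
}

func unlockWithBackoff(state *uint32) {
	atomic.StoreUint32(state, 0)
}
```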

@laboger
Contributor Author

laboger commented May 30, 2018

Carlos is looking into the procyield.

We have found that these results may be related to the (old) kernel on the machine where I did the initial testing. I am trying newer kernels, and those don't seem to show the same degradation. Let us dig into this more before you spend any more time on it.

@ceseo
Contributor

ceseo commented May 30, 2018

@ianlancetaylor I created a procyield() implementation for ppc64x. I did not use the 'yield' hint, though, as it is no longer recommended in ISA 3.0; I used the Program Priority Register instead. The results are pretty good compared with the current master:

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           87.7          85.1          -2.96%
BenchmarkMakeChan/Int-8            107           105           -1.87%
BenchmarkMakeChan/Ptr-8            201           202           +0.50%
BenchmarkMakeChan/Struct/0-8       78.2          75.9          -2.94%
BenchmarkMakeChan/Struct/32-8      196           200           +2.04%
BenchmarkMakeChan/Struct/40-8      236           237           +0.42%
BenchmarkChanNonblocking-8         8.64          8.64          +0.00%
BenchmarkChanUncontended-8         5577          5645          +1.22%
BenchmarkChanContended-8           66106         39830         -39.75%
BenchmarkChanSync-8                451           452           +0.22%
BenchmarkChanSyncWork-8            9155          9633          +5.22%
BenchmarkChanProdCons0-8           1585          802           -49.40%
BenchmarkChanProdCons10-8          1094          593           -45.80%
BenchmarkChanProdCons100-8         831           455           -45.25%
BenchmarkChanProdConsWork0-8       1471          738           -49.83%
BenchmarkChanProdConsWork10-8      1033          553           -46.47%
BenchmarkChanProdConsWork100-8     730           416           -43.01%
BenchmarkChanCreation-8            135           149           +10.37%
BenchmarkChanSem-8                 602           349           -42.03%
BenchmarkChanPopular-8             3017466       1584372       -47.49%

If you think it's reasonable, I can submit it now during the freeze.

@laboger
Contributor Author

laboger commented May 30, 2018

@ceseo You should note what kernel those results were collected on and that the futex private change did not show a degradation on that kernel.

@ceseo
Contributor

ceseo commented May 30, 2018

It doesn't degrade on 4.12. Here's 1.10 vs master:

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           82.6          86.5          +4.72%
BenchmarkMakeChan/Int-8            103           103           +0.00%
BenchmarkMakeChan/Ptr-8            188           235           +25.00%
BenchmarkMakeChan/Struct/0-8       75.0          76.8          +2.40%
BenchmarkMakeChan/Struct/32-8      197           204           +3.55%
BenchmarkMakeChan/Struct/40-8      238           239           +0.42%
BenchmarkChanNonblocking-8         8.09          8.78          +8.53%
BenchmarkChanUncontended-8         5940          5540          -6.73%
BenchmarkChanContended-8           76837         66017         -14.08%
BenchmarkChanSync-8                451           436           -3.33%
BenchmarkChanSyncWork-8            8814          10350         +17.43%
BenchmarkChanProdCons0-8           1799          1613          -10.34%
BenchmarkChanProdCons10-8          1253          1093          -12.77%
BenchmarkChanProdCons100-8         940           833           -11.38%
BenchmarkChanProdConsWork0-8       1602          1466          -8.49%
BenchmarkChanProdConsWork10-8      1199          1054          -12.09%
BenchmarkChanProdConsWork100-8     825           737           -10.67%
BenchmarkChanCreation-8            132           129           -2.27%
BenchmarkChanSem-8                 696           610           -12.36%
BenchmarkChanPopular-8             3358808       2999794       -10.69%

@ceseo
Contributor

ceseo commented May 30, 2018

Btw, there are two kernel patches that may explain why this degradation is gone in recent kernels. See kernel commits fd851a3cdc196 and ede8e2bbb0eb3.

@ianlancetaylor
Contributor

@ceseo Sure, let's put it in for 1.11. Thanks.

@gopherbot

Change https://golang.org/cl/115376 mentions this issue: runtime: implement procyield properly for ppc64x

@laboger
Contributor Author

laboger commented May 30, 2018

The degradation with private futexes happens on rhel7 power8 systems using kernel 3.10. We've run on other Ubuntu and SUSE systems that are 4.10 or higher, power8 and 9, and the degradation does not occur on those.

@laboger
Contributor Author

laboger commented Jun 19, 2018

This might be something to mention in the release notes for Go 1.11. Something like:

With the change to private futexes in Go 1.11, a performance degradation may be seen in some channel operations on ppc64le when running on Linux kernels earlier than 4.x. Moving to kernel 4.0 or later avoids the problem.

ceseo added a commit to powertechpreview/go that referenced this issue Aug 13, 2018
The procyield() function should yield the processor as in other
architectures. On ppc64x, this is achieved by setting the Program
Priority Register to 'low priority' prior to the spin loop, and
setting it back to 'medium-low priority' afterwards.

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           87.7          86.6          -1.25%
BenchmarkMakeChan/Int-8            107           106           -0.93%
BenchmarkMakeChan/Ptr-8            201           204           +1.49%
BenchmarkMakeChan/Struct/0-8       78.2          79.7          +1.92%
BenchmarkMakeChan/Struct/32-8      196           200           +2.04%
BenchmarkMakeChan/Struct/40-8      236           230           -2.54%
BenchmarkChanNonblocking-8         8.64          8.85          +2.43%
BenchmarkChanUncontended-8         5577          5598          +0.38%
BenchmarkChanContended-8           66106         51529         -22.05%
BenchmarkChanSync-8                451           441           -2.22%
BenchmarkChanSyncWork-8            9155          9170          +0.16%
BenchmarkChanProdCons0-8           1585          1083          -31.67%
BenchmarkChanProdCons10-8          1094          838           -23.40%
BenchmarkChanProdCons100-8         831           657           -20.94%
BenchmarkChanProdConsWork0-8       1471          941           -36.03%
BenchmarkChanProdConsWork10-8      1033          721           -30.20%
BenchmarkChanProdConsWork100-8     730           511           -30.00%
BenchmarkChanCreation-8            135           128           -5.19%
BenchmarkChanSem-8                 602           463           -23.09%
BenchmarkChanPopular-8             3017466       2188441       -27.47%

Fixes golang#25625

Change-Id: Iacb1c888d3c066902152b8367500348fb631c5f9
Reviewed-on: https://go-review.googlesource.com/115376
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
golang locked and limited conversation to collaborators Jun 19, 2019