runtime: performance scaling degradation in some Chan benchmarks when futexes became private #25625

Closed
laboger opened this issue May 29, 2018 · 15 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.11

Comments

@laboger
Contributor

laboger commented May 29, 2018

What version of Go are you using (go version)?

tip

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

ppc64le, not sure about others

What did you do?

While comparing performance of some package benchmarks at tip against Go 1.10, I noticed that some of the runtime Chan benchmarks degraded by 2-4X. The degradation was tracked down to the change that made futexes private.

07751f4 runtime: use private futexes on Linux

What did you expect to see?

No significant change.

What did you see instead?

Significant degradation in the following (columns are benchmark name, Go 1.10 ns/op, tip ns/op, delta):

BenchmarkChanSem-16                                         397             1925            +384.89%
BenchmarkChanProdCons100-16                                 603             2571            +326.37%
BenchmarkChanContended-16                                   51225           215813          +321.30%
BenchmarkChanProdCons10-16                                  797             3256            +308.53%
BenchmarkChanProdCons0-16                                   1288            4279            +232.22%
BenchmarkChanProdConsWork10-16                              1133            3692            +225.86%
BenchmarkChanProdConsWork100-16                             1049            3293            +213.92%
BenchmarkChanProdConsWork0-16                               1573            4261            +170.88%
BenchmarkChanPopular-16                                     2335795         4591414         +96.57%

The degradation is worse with higher values of GOMAXPROCS when comparing go 1.10 against latest.

According to the profiles, after the change to make futexes private, more time is spent in the atomic.Xchg statements in the lock function in runtime/lock_futex.go and less time is spent in the futex syscall itself.
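
For reference, the contended path being profiled has roughly the following shape. This is a self-contained paraphrase of lock in runtime/lock_futex.go, not the actual source: the constant values are guesses, the passive osyield phase is omitted, and procYield/futexSleep are stand-ins for runtime internals.

```go
package futexlock

import "sync/atomic"

const (
	mutexUnlocked = 0
	mutexLocked   = 1
	mutexSleeping = 2
	activeSpin    = 4  // assumed analogue of the runtime's active_spin
	activeSpinCnt = 30 // assumed analogue of active_spin_cnt
)

type mutex struct{ key uint32 }

// procYield stands in for the runtime's busy-wait pause (the routine that
// later in this thread gets a proper ppc64x implementation).
func procYield(cycles int) {
	for i := 0; i < cycles; i++ {
		_ = i
	}
}

// futexSleep stands in for blocking in the kernel via futex(FUTEX_WAIT...).
func futexSleep(addr *uint32, val uint32) {}

func lock(l *mutex) {
	// Speculative grab. On ppc64le this Swap (and the CAS below) is an
	// LL/SC loop, so every contended attempt bounces the cache line.
	v := atomic.SwapUint32(&l.key, mutexLocked)
	if v == mutexUnlocked {
		return
	}
	wait := v
	for {
		// Active spin: retry while the lock looks free, pausing in between.
		for i := 0; i < activeSpin; i++ {
			for atomic.LoadUint32(&l.key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(&l.key, mutexUnlocked, wait) {
					return
				}
			}
			procYield(activeSpinCnt)
		}
		// Give up spinning: record that a sleeper exists, then block until
		// the unlocker issues a futex wake.
		v = atomic.SwapUint32(&l.key, mutexSleeping)
		if v == mutexUnlocked {
			return
		}
		wait = mutexSleeping
		futexSleep(&l.key, mutexSleeping)
	}
}
```

If the hypothesis below is right, making the futexes private only makes the sleep/wake side cheaper, so more of the contention shifts onto the Xchg/Cas retries, which is what the profiles show.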

@ianlancetaylor

@ianlancetaylor
Contributor

Could the problem be that the futex code is faster, causing more contention on the memory cache lines?

I benchmarked the change on amd64 and didn't see any significant differences. The hope was that the system load would be a bit less overall, and in any case using the PRIVATE flag seemed clearly correct. Not sure what to do here.

ianlancetaylor added the NeedsInvestigation label May 29, 2018
ianlancetaylor added this to the Go1.11 milestone May 29, 2018
@ianlancetaylor
Contributor

Can you benchmark on ppc64le with tip and with tip with _FUTEX_PRIVATE_FLAG set to 0 in runtime/lock_futex.go?
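
For concreteness, the experiment amounts to zeroing the private flag so the futex sleep/wake calls fall back to shared FUTEX_WAIT/FUTEX_WAKE. A sketch of the change, assuming a constant laid out roughly as referenced above (illustrative, not verbatim source):

```go
// Sketch only: forcing the flag to 0 removes FUTEX_PRIVATE_FLAG (128 in the
// Linux uapi headers) from the futex operations, so they behave as shared
// futexes again -- the "futex not private" configuration measured below.
const _FUTEX_PRIVATE_FLAG = 0 // normally 128
```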

@laboger
Contributor Author

laboger commented May 29, 2018

Could the problem be that the futex code is faster, causing more contention on the memory cache lines?

Yes, it looks like the futex code itself is faster and there is more contention on the Xchg and Cas. But I'm surprised it causes such a huge degradation.

Can you benchmark on ppc64le with tip and with tip with _FUTEX_PRIVATE_FLAG set to 0 in runtime/lock_futex.go?

My first experiment was to build at the commit where the futex was made private and compare that against the commit right before it; that's where the degradation shows up. I will also try your experiment with latest.

@bcmills
Contributor

bcmills commented May 29, 2018

there is more contention with the Xchg and Cas.

If there is a CAS involved, that could be the problem: it could be that the futex used to serialize the CAS operations so that they always succeeded, but now the code is fast enough that they can fail multiple times (presumably with a cache miss on each iteration).
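
To see that failure mode in isolation, a toy program along these lines (everything here is invented for the illustration; it does not touch the runtime's lock) counts how often a CompareAndSwap has to retry when several goroutines contend on a single word:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const goroutines = 16
	const iters = 100000

	var lock uint32     // 0 = free, 1 = held
	var failures uint64 // total failed CAS attempts
	var wg sync.WaitGroup

	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local uint64
			for i := 0; i < iters; i++ {
				// Acquire: retry the CAS until it succeeds. Each failure
				// typically means another core owned the cache line.
				for !atomic.CompareAndSwapUint32(&lock, 0, 1) {
					local++
				}
				// Release immediately; the critical section is empty.
				atomic.StoreUint32(&lock, 0)
			}
			atomic.AddUint64(&failures, local)
		}()
	}
	wg.Wait()
	fmt.Printf("failed CAS attempts: %d (%.2f per acquisition)\n",
		failures, float64(failures)/float64(goroutines*iters))
}
```

Raising GOMAXPROCS should make the per-acquisition failure count grow, which would line up with the observation above that the degradation is worse at higher GOMAXPROCS values.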

@laboger
Contributor Author

laboger commented May 30, 2018

Here are the results comparing latest with and without setting the futex to private. I also reverted to the way benchmark timings were done when count > 1 (#25622) to avoid flaky results.

old = latest
new = futex not private
In both cases, I changed testing/benchmark.go to avoid using hintN on the launch function.

name                    old time/op  new time/op   delta

ChanNonblocking-16      1.46ns ±29%   1.33ns ±26%     ~     (p=0.257 n=4+4)
ChanUncontended-16       927ns ± 8%    918ns ± 2%     ~     (p=0.771 n=4+4)
ChanContended-16         209µs ± 2%     38µs ± 3%  -82.01%  (p=0.029 n=4+4)
ChanSync-16              378ns ± 0%    393ns ± 1%   +3.83%  (p=0.029 n=4+4)
ChanSyncWork-16         9.18µs ± 7%  10.00µs ± 5%     ~     (p=0.057 n=4+4)
ChanProdCons0-16        4.23µs ± 2%   1.26µs ± 3%  -70.10%  (p=0.029 n=4+4)
ChanProdCons10-16       3.27µs ± 0%   0.64µs ± 3%  -80.40%  (p=0.029 n=4+4)
ChanProdCons100-16      2.45µs ± 1%   0.44µs ± 2%  -81.87%  (p=0.029 n=4+4)
ChanProdConsWork0-16    4.27µs ± 1%   1.58µs ± 2%  -63.05%  (p=0.029 n=4+4)
ChanProdConsWork10-16   3.74µs ± 1%   1.07µs ± 1%  -71.38%  (p=0.029 n=4+4)
ChanProdConsWork100-16  3.32µs ± 0%   0.96µs ± 1%  -71.03%  (p=0.029 n=4+4)
ChanCreation-16         90.6ns ± 5%   85.5ns ± 8%     ~     (p=0.200 n=4+4)
ChanSem-16              1.76µs ± 1%   0.39µs ± 1%  -77.78%  (p=0.029 n=4+4)
ChanPopular-16          4.63ms ± 2%   2.37ms ± 0%  -48.87%  (p=0.029 n=4+4)

@ianlancetaylor
Contributor

Thanks for the measurements.

It seems unfortunate to throw sand in the gears to reduce memory contention during microbenchmarks.

My reading of the futex code is that it more or less assumes that atomic.Xchg is fast. On amd64 that is probably true. On ppc64le, which unlike amd64 uses a loop for atomic.Xchg, it sounds like it is slow when contended. I wonder if it might make sense to modify the loop to slow down when it fails.

Similarly we should probably implement procyield in runtime/asm_ppc64x.s to execute a loop of yield instructions, much like the implementation in runtime/asm_amd64.s.

At least according to https://sourceware.org/ml/libc-alpha/2012-11/msg00761.html, the yield instruction would be or r27,r27,r27.
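
Two of the ideas above, slowing down after a failed attempt and pausing between retries, can be sketched at user level like this. It is only an illustration of the technique, not the runtime's code; the busy pause is where procyield sits in the real lock loop:

```go
package backoff

import "sync/atomic"

const maxPause = 1 << 10

// lockWithBackoff spins on plain loads (which keep the cache line in a
// shared state) and attempts the CAS only when the lock looks free,
// doubling the pause after every failed attempt up to a cap.
func lockWithBackoff(state *uint32) {
	pause := 1
	for {
		// Wait for the lock to look free without writing to it.
		for atomic.LoadUint32(state) != 0 {
		}
		if atomic.CompareAndSwapUint32(state, 0, 1) {
			return
		}
		// Lost the race: back off before trying again. In the runtime the
		// equivalent pause is procyield(n); the ppc64x change referenced
		// later in this thread lowers the SMT thread priority for the
		// duration of that pause.
		for i := 0; i < pause; i++ {
			_ = i
		}
		if pause < maxPause {
			pause <<= 1
		}
	}
}

func unlockWithBackoff(state *uint32) {
	atomic.StoreUint32(state, 0)
}
```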

@laboger
Contributor Author

laboger commented May 30, 2018

Carlos is looking into the procyield.

We have found that these results may be related to the (old) kernel on the machine where I did the initial testing. I am trying newer kernels, and those don't seem to show the same degradation. Let us dig into this more before you spend any more time on it.

@ceseo
Contributor

ceseo commented May 30, 2018

@ianlancetaylor I created a procyield() implementation for ppc64x. I did not use the 'yield' hint, though, as it is no longer recommended in ISA 3.0; I used the Program Priority Register instead. The results are pretty good compared with the current master:

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           87.7          85.1          -2.96%
BenchmarkMakeChan/Int-8            107           105           -1.87%
BenchmarkMakeChan/Ptr-8            201           202           +0.50%
BenchmarkMakeChan/Struct/0-8       78.2          75.9          -2.94%
BenchmarkMakeChan/Struct/32-8      196           200           +2.04%
BenchmarkMakeChan/Struct/40-8      236           237           +0.42%
BenchmarkChanNonblocking-8         8.64          8.64          +0.00%
BenchmarkChanUncontended-8         5577          5645          +1.22%
BenchmarkChanContended-8           66106         39830         -39.75%
BenchmarkChanSync-8                451           452           +0.22%
BenchmarkChanSyncWork-8            9155          9633          +5.22%
BenchmarkChanProdCons0-8           1585          802           -49.40%
BenchmarkChanProdCons10-8          1094          593           -45.80%
BenchmarkChanProdCons100-8         831           455           -45.25%
BenchmarkChanProdConsWork0-8       1471          738           -49.83%
BenchmarkChanProdConsWork10-8      1033          553           -46.47%
BenchmarkChanProdConsWork100-8     730           416           -43.01%
BenchmarkChanCreation-8            135           149           +10.37%
BenchmarkChanSem-8                 602           349           -42.03%
BenchmarkChanPopular-8             3017466       1584372       -47.49%

If you think it's reasonable, I can submit it now during the freeze.

@laboger
Contributor Author

laboger commented May 30, 2018

@ceseo You should note what kernel those results were collected on and that the futex private change did not show a degradation on that kernel.

@ceseo
Contributor

ceseo commented May 30, 2018

It doesn't degrade on 4.12. Here's 1.10 vs master:

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           82.6          86.5          +4.72%
BenchmarkMakeChan/Int-8            103           103           +0.00%
BenchmarkMakeChan/Ptr-8            188           235           +25.00%
BenchmarkMakeChan/Struct/0-8       75.0          76.8          +2.40%
BenchmarkMakeChan/Struct/32-8      197           204           +3.55%
BenchmarkMakeChan/Struct/40-8      238           239           +0.42%
BenchmarkChanNonblocking-8         8.09          8.78          +8.53%
BenchmarkChanUncontended-8         5940          5540          -6.73%
BenchmarkChanContended-8           76837         66017         -14.08%
BenchmarkChanSync-8                451           436           -3.33%
BenchmarkChanSyncWork-8            8814          10350         +17.43%
BenchmarkChanProdCons0-8           1799          1613          -10.34%
BenchmarkChanProdCons10-8          1253          1093          -12.77%
BenchmarkChanProdCons100-8         940           833           -11.38%
BenchmarkChanProdConsWork0-8       1602          1466          -8.49%
BenchmarkChanProdConsWork10-8      1199          1054          -12.09%
BenchmarkChanProdConsWork100-8     825           737           -10.67%
BenchmarkChanCreation-8            132           129           -2.27%
BenchmarkChanSem-8                 696           610           -12.36%
BenchmarkChanPopular-8             3358808       2999794       -10.69%

@ceseo
Contributor

ceseo commented May 30, 2018

Btw, there are two kernel patches that may explain why this degradation is gone in recent kernels. See kernel commits fd851a3cdc196 and ede8e2bbb0eb3.

@ianlancetaylor
Contributor

@ceseo Sure, let's put it in for 1.11. Thanks.

@gopherbot

Change https://golang.org/cl/115376 mentions this issue: runtime: implement procyield properly for ppc64x

@laboger
Contributor Author

laboger commented May 30, 2018

The degradation with private futexes happens on rhel7 power8 systems using kernel 3.10. We've run on other Ubuntu and SUSE systems that are 4.10 or higher, power8 and 9, and the degradation does not occur on those.

@laboger
Contributor Author

laboger commented Jun 19, 2018

This might be something to mention in the release notes for Go 1.11. Something like:

With the change to private futexes in Go 1.11, a performance degradation may be seen in some channel operations on ppc64le when running on Linux kernels earlier than 4.x. Moving to kernel 4.0 or later avoids the problem.

ceseo added a commit to powertechpreview/go that referenced this issue Aug 13, 2018
The procyield() function should yield the processor as in other
architectures. On ppc64x, this is achieved by setting the Program
Priority Register to 'low priority' prior to the spin loop, and
setting it back to 'medium-low priority' afterwards.

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           87.7          86.6          -1.25%
BenchmarkMakeChan/Int-8            107           106           -0.93%
BenchmarkMakeChan/Ptr-8            201           204           +1.49%
BenchmarkMakeChan/Struct/0-8       78.2          79.7          +1.92%
BenchmarkMakeChan/Struct/32-8      196           200           +2.04%
BenchmarkMakeChan/Struct/40-8      236           230           -2.54%
BenchmarkChanNonblocking-8         8.64          8.85          +2.43%
BenchmarkChanUncontended-8         5577          5598          +0.38%
BenchmarkChanContended-8           66106         51529         -22.05%
BenchmarkChanSync-8                451           441           -2.22%
BenchmarkChanSyncWork-8            9155          9170          +0.16%
BenchmarkChanProdCons0-8           1585          1083          -31.67%
BenchmarkChanProdCons10-8          1094          838           -23.40%
BenchmarkChanProdCons100-8         831           657           -20.94%
BenchmarkChanProdConsWork0-8       1471          941           -36.03%
BenchmarkChanProdConsWork10-8      1033          721           -30.20%
BenchmarkChanProdConsWork100-8     730           511           -30.00%
BenchmarkChanCreation-8            135           128           -5.19%
BenchmarkChanSem-8                 602           463           -23.09%
BenchmarkChanPopular-8             3017466       2188441       -27.47%

Fixes golang#25625

Change-Id: Iacb1c888d3c066902152b8367500348fb631c5f9
Reviewed-on: https://go-review.googlesource.com/115376
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
golang locked and limited conversation to collaborators Jun 19, 2019