
crypto: performance loss with 1.14 beta #36218

Closed

pascaldekloe opened this issue Dec 19, 2019 · 13 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.14

Comments

@pascaldekloe
Contributor

What version of Go are you using (go version)?

go version go1.13.5 darwin/amd64 vs. go version go1.14beta1 darwin/amd64

What operating system and processor architecture are you using (go env)?

kern.version: Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64
machdep.cpu.brand_string: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz

What did you do?

Run the benchmarks at https://github.com/pascaldekloe/jwt.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    26.9µs ± 0%  28.3µs ± 1%  +5.17%  (p=0.000 n=9+10)
ECDSA/sign-ES384-4    4.21ms ± 0%  4.48ms ± 0%  +6.33%  (p=0.000 n=10+10)
ECDSA/sign-ES512-4    7.42ms ± 1%  7.80ms ± 0%  +5.15%  (p=0.000 n=10+10)
ECDSA/check-ES256-4   81.0µs ± 0%  82.7µs ± 1%  +2.07%  (p=0.000 n=10+10)
ECDSA/check-ES384-4   8.38ms ± 0%  8.63ms ± 0%  +3.02%  (p=0.000 n=10+10)
ECDSA/check-ES512-4   14.8ms ± 0%  15.0ms ± 0%  +1.93%  (p=0.000 n=10+9)
EdDSA/sign-EdDSA-4    51.0µs ± 0%  51.2µs ± 1%  +0.45%  (p=0.006 n=10+9)
EdDSA/check-EdDSA-4    137µs ± 0%   137µs ± 1%  +0.54%  (p=0.035 n=9+10)
HMAC/sign-HS256-4     2.06µs ± 0%  2.10µs ± 1%  +2.09%  (p=0.000 n=10+10)
HMAC/sign-HS384-4     2.35µs ± 1%  2.40µs ± 1%  +2.00%  (p=0.000 n=10+10)
HMAC/sign-HS512-4     2.40µs ± 1%  2.44µs ± 1%  +1.46%  (p=0.000 n=10+10)
HMAC/check-HS256-4    4.18µs ± 0%  4.36µs ± 0%  +4.38%  (p=0.000 n=9+9)
HMAC/check-HS384-4    4.43µs ± 0%  4.60µs ± 0%  +3.98%  (p=0.000 n=10+8)
HMAC/check-HS512-4    4.69µs ± 3%  4.76µs ± 1%    ~     (p=0.218 n=10+10)
RSA/sign-1024-bit-4    321µs ± 0%   339µs ± 1%  +5.68%  (p=0.000 n=10+10)
RSA/sign-2048-bit-4   1.48ms ± 0%  1.51ms ± 0%  +1.73%  (p=0.000 n=10+9)
RSA/sign-4096-bit-4   8.31ms ± 0%  8.34ms ± 1%  +0.40%  (p=0.013 n=9+10)
RSA/check-1024-bit-4  27.7µs ± 1%  29.8µs ± 1%  +7.58%  (p=0.000 n=10+10)
RSA/check-2048-bit-4  63.4µs ± 0%  66.3µs ± 2%  +4.63%  (p=0.000 n=9+10)
RSA/check-4096-bit-4   169µs ± 0%   176µs ± 1%  +4.62%  (p=0.000 n=10+10)
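(The delta table above is benchstat-style output; a comparison like this can presumably be reproduced with something like go test -bench . -count 10 > old.txt under each toolchain, and then benchstat old.txt new.txt.)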

What did you expect to see?

Similar or better performance.

What did you see instead?

Overall slowdown.

@randall77
Contributor

Looking at the profile of ECDSA/sign-ES256, there appears to be more madvise in the profile at tip.

(run with go test --bench=ECDSA/sign-ES256 --count=10 --cpuprofile=cpu.prof)
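(The flat/cum listings below are pprof top output, viewable with e.g. go tool pprof -top cpu.prof.)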

1.13.5:

      flat  flat%   sum%        cum   cum%
    2460ms 15.78% 15.78%     2460ms 15.78%  crypto/elliptic.p256OrdSqr
    2150ms 13.79% 29.57%     2150ms 13.79%  p256MulInternal
    1610ms 10.33% 39.90%     1610ms 10.33%  crypto/elliptic.p256Sqr
    1480ms  9.49% 49.39%     1480ms  9.49%  syscall.syscall
     750ms  4.81% 54.20%      750ms  4.81%  crypto/elliptic.p256SelectBase
     650ms  4.17% 58.37%      650ms  4.17%  p256SqrInternal
     610ms  3.91% 62.28%      610ms  3.91%  runtime.usleep
     570ms  3.66% 65.94%      570ms  3.66%  runtime.madvise
     560ms  3.59% 69.53%     3490ms 22.39%  crypto/elliptic.p256PointAddAffineAsm
     500ms  3.21% 72.74%      500ms  3.21%  crypto/elliptic.p256OrdMul

tip:

      flat  flat%   sum%        cum   cum%
    2430ms 14.05% 14.05%     2430ms 14.05%  crypto/elliptic.p256OrdSqr
    1880ms 10.87% 24.93%     1880ms 10.87%  runtime.madvise
    1840ms 10.64% 35.57%     1840ms 10.64%  p256MulInternal
    1330ms  7.69% 43.26%     1330ms  7.69%  crypto/elliptic.p256Sqr
    1110ms  6.42% 49.68%     1110ms  6.42%  syscall.syscall
     760ms  4.40% 54.08%      760ms  4.40%  crypto/elliptic.p256SelectBase
     700ms  4.05% 58.13%      700ms  4.05%  runtime.kevent
     680ms  3.93% 62.06%     3180ms 18.39%  crypto/elliptic.p256PointAddAffineAsm
     510ms  2.95% 65.01%      510ms  2.95%  p256SqrInternal
     450ms  2.60% 67.61%      450ms  2.60%  runtime.pthread_kill

Maybe the new page allocator? @mknyszek

@mknyszek
Contributor

That's pretty surprising, since I figured scavenging only got less aggressive this release. The increase is probably related to the changes that were made there. I'll check it out.

@mknyszek mknyszek self-assigned this Dec 19, 2019
@mknyszek
Contributor

So here's an interesting bit: I tried to reproduce this on linux/amd64 and I couldn't.

Running under perflock, 1.14beta1 is faster by a few percent. Also, the profiles look completely different: for 1.13.4 it's dominated by runtime.cgocall, but it's just crypto functions for go1.14beta1. I don't really understand what's going on here, but that's not the platform this bug is about, so I'll put that to rest for now.

I'll try to run this on a darwin machine and see what's going on. One big difference with scavenging this release is that we're calling madvise more often (even though, I think, we're doing roughly the same amount of scavenging work or less). It's possible that each call to madvise has a larger overhead on darwin, which could make this a problem. On the other hand, the pacing is now time-based to avoid using more than 1% of the CPU time of a single core, so even if madvise were more expensive, it should work itself out.
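To make that pacing concrete, here is a minimal sketch of the idea (not the actual runtime code; sleepFor and cpuFraction are names invented for this example): after each chunk of scavenging work, sleep long enough that the work stays under the 1% budget.

package main

import (
	"fmt"
	"time"
)

const cpuFraction = 0.01 // target: background scavenging uses at most 1% of one core

// sleepFor returns how long to sleep after spending `critical` on scavenging
// work (e.g. one madvise call), so that critical/(critical+sleep) <= cpuFraction.
func sleepFor(critical time.Duration) time.Duration {
	return time.Duration(float64(critical) * (1/cpuFraction - 1))
}

func main() {
	// If a single madvise takes, say, 2.2µs (a figure that comes up later in
	// this thread), the sleep between calls comes out to roughly 218µs.
	fmt.Println(sleepFor(2200 * time.Nanosecond))
}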

@FiloSottile
Contributor

Running under perflock, 1.14beta1 is faster by a few percent. Also, the profiles look completely different: for 1.13.4 it's dominated by runtime.cgocall, but it's just crypto functions for go1.14beta1. I don't really understand what's going on here, but that's not the platform this bug is about, so I'll put that to rest for now.

There's a good chance you're running the Go+BoringCrypto corp build for your Go 1.13.4 benchmark, which uses BoringSSL via cgo on linux/amd64.

@mknyszek
Contributor

@FiloSottile ah hah, thank you, that's probably it. I'll keep that in mind for the future.

@mknyszek
Contributor

@pascaldekloe @randall77 OK so I still can't reproduce on linux/amd64 after building 1.13.5 myself (go1.14beta1 is consistently a hair faster). The profiles contain runtime.madvise as the 20th entry for 1.13.5 and it doesn't appear in the top 30 for 1.14beta1.

Now I'll go and try on darwin for real.

@pascaldekloe
Contributor Author

Thanks for the dedication @mknyszek. I tried with Docker [Linux] on the same machine, and the new beta is slightly faster indeed. 🙂

@mknyszek
Contributor

Alrighty, I think I found the problem.

It isn't that madvise is too slow, and in fact, the system handles that case quite well.

It's that madvise is too fast on these systems: about four times as fast as the old assumption in Go 1.13 (we assumed 4 KiB takes 10 µs, but here 4 KiB takes about 2.2 µs). Because we no longer make any assumptions, the rate isn't limited, and scavenging too often leads to outsized effects on performance because of goroutine wake-ups and lock contention.

I think a reasonable course of action here is to limit how fast the background scavenger can go. I tried one (runtime) page per millisecond, which, under the 1% CPU budget, amounts to assuming 8 KiB takes 10 µs. This is roughly twice our old assumption, and I can confirm that just doing this fixes most of the performance regression.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.3µs ±24%  26.0µs ± 1%     ~     (p=0.095 n=10+9)
ECDSA/sign-ES384-4    4.04ms ± 2%  4.09ms ± 1%   +1.05%  (p=0.035 n=10+10)
ECDSA/sign-ES512-4    7.09ms ± 2%  7.16ms ± 2%   +0.96%  (p=0.043 n=8+10)
ECDSA/check-ES256-4   80.1µs ± 2%  77.3µs ± 1%   -3.51%  (p=0.000 n=10+8)
ECDSA/check-ES384-4   8.10ms ± 3%  8.00ms ± 2%     ~     (p=0.182 n=10+9)
ECDSA/check-ES512-4   15.0ms ±18%  14.4ms ± 3%     ~     (p=1.000 n=10+10)
EdDSA/sign-EdDSA-4    52.7µs ±14%  50.7µs ± 2%     ~     (p=0.481 n=10+10)
EdDSA/check-EdDSA-4    146µs ±11%   132µs ± 3%  -10.00%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.92µs ± 1%  1.90µs ± 1%     ~     (p=0.145 n=9+8)
HMAC/sign-HS384-4     2.15µs ± 1%  2.19µs ± 4%   +1.92%  (p=0.013 n=9+10)
HMAC/sign-HS512-4     2.25µs ± 3%  2.20µs ± 3%     ~     (p=0.052 n=10+10)
HMAC/check-HS256-4    3.98µs ± 3%  3.93µs ± 1%     ~     (p=0.110 n=10+10)
HMAC/check-HS384-4    4.29µs ± 4%  4.31µs ± 5%     ~     (p=0.842 n=9+10)
HMAC/check-HS512-4    4.30µs ± 3%  4.47µs ± 3%   +3.83%  (p=0.001 n=10+9)
RSA/sign-1024-bit-4    315µs ± 8%   317µs ± 3%     ~     (p=0.447 n=10+9)
RSA/sign-2048-bit-4   1.45ms ± 2%  1.46ms ± 3%     ~     (p=0.315 n=8+10)
RSA/sign-4096-bit-4   8.03ms ± 2%  8.13ms ± 2%   +1.28%  (p=0.015 n=10+10)
RSA/check-1024-bit-4  26.2µs ± 1%  27.8µs ± 2%   +6.10%  (p=0.000 n=10+10)
RSA/check-2048-bit-4  59.7µs ± 1%  63.1µs ± 2%   +5.68%  (p=0.000 n=8+9)
RSA/check-4096-bit-4   161µs ± 2%   163µs ± 1%   +1.32%  (p=0.006 n=10+9)
[Geo mean]             107µs        106µs        -0.65%

There are a few benchmarks still showing a regression, which may be for a different reason, or may just be noise. I'm going to try going all the way back to our old order-of-magnitude assumption as well and see if that helps at all.
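As a rough illustration of the rate limit described above (again a sketch, not the real runtime change; minCritical and pacedSleep are invented names): clamp the measured per-page cost to an assumed floor of 10µs, so that under the 1% budget at most one 8 KiB page gets released per millisecond, no matter how fast madvise actually is.

package main

import (
	"fmt"
	"time"
)

const (
	minCritical = 10 * time.Microsecond // assumed cost floor for scavenging one 8 KiB page
	cpuFraction = 0.01                  // scavenger CPU budget (1% of one core)
)

// pacedSleep clamps the measured madvise time to minCritical before applying
// the 1% budget, so even a 2.2µs madvise yields a ~990µs sleep: roughly one
// page per millisecond.
func pacedSleep(measured time.Duration) time.Duration {
	if measured < minCritical {
		measured = minCritical
	}
	return time.Duration(float64(measured) * (1/cpuFraction - 1))
}

func main() {
	fmt.Println(pacedSleep(2200 * time.Nanosecond)) // 990µs
}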

@mknyszek
Contributor

Setting the limit back to the pacing of last release brings the regression down further.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.9µs ±13%  26.0µs ± 3%  -10.07%  (p=0.002 n=10+10)
ECDSA/sign-ES384-4    4.17ms ±10%  4.11ms ± 3%     ~     (p=0.965 n=10+8)
ECDSA/sign-ES512-4    7.09ms ± 3%  7.15ms ± 1%     ~     (p=0.315 n=10+10)
ECDSA/check-ES256-4   78.4µs ± 2%  76.8µs ± 0%   -2.02%  (p=0.000 n=9+8)
ECDSA/check-ES384-4   8.24ms ± 5%  7.97ms ± 2%   -3.30%  (p=0.006 n=10+9)
ECDSA/check-ES512-4   13.4ms ± 1%  14.3ms ± 2%   +6.46%  (p=0.000 n=10+10)
EdDSA/sign-EdDSA-4    49.6µs ± 2%  49.4µs ± 1%     ~     (p=0.436 n=10+10)
EdDSA/check-EdDSA-4    134µs ± 1%   129µs ± 1%   -4.03%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.95µs ± 4%  1.88µs ± 0%   -3.53%  (p=0.000 n=9+9)
HMAC/sign-HS384-4     2.26µs ± 4%  2.12µs ± 1%   -6.19%  (p=0.000 n=9+8)
HMAC/sign-HS512-4     2.28µs ± 2%  2.16µs ± 1%   -5.26%  (p=0.000 n=9+9)
HMAC/check-HS256-4    3.92µs ± 5%  3.96µs ± 1%     ~     (p=0.143 n=10+10)
HMAC/check-HS384-4    4.09µs ± 1%  4.21µs ± 1%   +2.98%  (p=0.000 n=10+10)
HMAC/check-HS512-4    4.52µs ±12%  4.30µs ± 1%     ~     (p=0.676 n=10+9)
RSA/sign-1024-bit-4    347µs ±26%   321µs ± 3%     ~     (p=0.400 n=10+9)
RSA/sign-2048-bit-4   1.46ms ± 7%  1.47ms ± 2%     ~     (p=0.165 n=10+10)
RSA/sign-4096-bit-4   7.95ms ± 1%  8.26ms ± 5%   +3.84%  (p=0.000 n=9+9)
RSA/check-1024-bit-4  26.1µs ± 2%  26.9µs ± 3%   +3.13%  (p=0.000 n=10+9)
RSA/check-2048-bit-4  59.1µs ± 1%  60.7µs ± 2%   +2.57%  (p=0.000 n=10+10)
RSA/check-4096-bit-4   159µs ± 1%   161µs ± 0%   +1.37%  (p=0.000 n=10+10)
[Geo mean]             107µs        105µs        -1.37%

The results aren't super reproducible (which is why that -10% is now -4%), since I'm running on a laptop with the full UI running and perflock doesn't work on darwin/amd64. I'd say that ECDSA/check-ES512 is a bit noisy on my machine (though it would be worth trying more runs, spinning up a new process every time, etc., to capture the noise), so I'm not going to dig any deeper there. Same for HMAC/check-HS384. But I think there may still be a legitimate regression in the RSA/check-*-bit benchmarks after my rate-limiting fix.

Looking at profiles for RSA/check-1024-bit more closely, there are clearly differences and costs have shifted around but nothing jumps out at me.

go1.13.5

Showing nodes accounting for 13.52s, 84.66% of 15.97s total
Dropped 212 nodes (cum <= 0.08s)
Showing top 30 nodes out of 130
      flat  flat%   sum%        cum   cum%
     4.67s 29.24% 29.24%      9.18s 57.48%  math/big.nat.divLarge
     1.73s 10.83% 40.08%      1.73s 10.83%  math/big.mulAddVWW
     1.37s  8.58% 48.65%      1.37s  8.58%  math/big.addMulVVW
     1.17s  7.33% 55.98%      1.17s  7.33%  math/big.subVV
     0.51s  3.19% 59.17%      0.51s  3.19%  runtime.madvise
     0.48s  3.01% 62.18%      0.48s  3.01%  runtime.pthread_cond_wait
     0.42s  2.63% 64.81%      1.81s 11.33%  math/big.basicMul
     0.35s  2.19% 67.00%      0.35s  2.19%  runtime.procyield
     0.29s  1.82% 68.82%      0.29s  1.82%  math/big.shlVU
     0.25s  1.57% 70.38%      0.25s  1.57%  runtime.pthread_cond_signal
     0.24s  1.50% 71.88%      0.24s  1.50%  math/big.shrVU
     0.22s  1.38% 73.26%      0.22s  1.38%  math/big.greaterThan
     0.19s  1.19% 74.45%      0.19s  1.19%  runtime.kevent
     0.17s  1.06% 75.52%      0.17s  1.06%  runtime.pthread_cond_timedwait_relative_np
     0.15s  0.94% 76.46%      1.12s  7.01%  runtime.mallocgc
     0.14s  0.88% 77.33%      0.14s  0.88%  runtime.memclrNoHeapPointers
     0.14s  0.88% 78.21%      0.15s  0.94%  runtime.stkbucket
     0.13s  0.81% 79.02%      0.13s  0.81%  runtime.usleep
     0.12s  0.75% 79.77%      0.12s  0.75%  crypto/sha512.blockAVX2
     0.12s  0.75% 80.53%      0.23s  1.44%  encoding/json.checkValid
     0.10s  0.63% 81.15%      0.10s  0.63%  runtime.procPin
     0.10s  0.63% 81.78%      0.31s  1.94%  sync.(*Pool).Get
     0.09s  0.56% 82.34%      0.09s  0.56%  math/big.nat.norm
     0.07s  0.44% 82.78%      0.11s  0.69%  encoding/base64.(*Encoding).Decode
     0.06s  0.38% 83.16%      0.12s  0.75%  runtime.heapBitsSetType
     0.05s  0.31% 83.47%      9.25s 57.92%  math/big.nat.div
     0.05s  0.31% 83.78%      1.92s 12.02%  math/big.nat.sqr
     0.05s  0.31% 84.10%      0.22s  1.38%  sync.(*Pool).Put
     0.05s  0.31% 84.41%      0.19s  1.19%  sync.(*Pool).pin
     0.04s  0.25% 84.66%      0.12s  0.75%  encoding/json.indirect

go1.14beta1

Showing nodes accounting for 13.07s, 85.37% of 15.31s total
Dropped 246 nodes (cum <= 0.08s)
Showing top 30 nodes out of 116
      flat  flat%   sum%        cum   cum%
     4.21s 27.50% 27.50%      7.45s 48.66%  math/big.nat.divBasic
     1.66s 10.84% 38.34%      1.66s 10.84%  math/big.mulAddVWW
     1.49s  9.73% 48.07%      1.49s  9.73%  math/big.addMulVVW
     1.14s  7.45% 55.52%      1.14s  7.45%  math/big.subVV
     0.96s  6.27% 61.79%      0.96s  6.27%  runtime.kevent
     0.56s  3.66% 65.45%      0.56s  3.66%  runtime.madvise
     0.47s  3.07% 68.52%      2.06s 13.46%  math/big.basicMul
     0.33s  2.16% 70.67%      0.33s  2.16%  math/big.shlVU
     0.19s  1.24% 71.91%      0.19s  1.24%  math/big.shrVU
     0.19s  1.24% 73.15%      0.19s  1.24%  runtime.pthread_cond_wait
     0.17s  1.11% 74.27%      0.65s  4.25%  runtime.mallocgc
     0.16s  1.05% 75.31%      0.16s  1.05%  runtime.memclrNoHeapPointers
     0.14s  0.91% 76.22%      0.14s  0.91%  math/big.greaterThan (inline)
     0.13s  0.85% 77.07%      0.14s  0.91%  runtime.stkbucket
     0.11s  0.72% 77.79%     10.52s 68.71%  math/big.nat.expNN
     0.10s  0.65% 78.45%      0.10s  0.65%  runtime.procyield
     0.10s  0.65% 79.10%      0.28s  1.83%  sync.(*Pool).Get
     0.10s  0.65% 79.75%      0.23s  1.50%  sync.(*Pool).Put
     0.09s  0.59% 80.34%      8.50s 55.52%  math/big.nat.divLarge
     0.09s  0.59% 80.93%      0.09s  0.59%  runtime.nextFreeFast (inline)
     0.09s  0.59% 81.52%      0.09s  0.59%  runtime.pthread_cond_signal
     0.08s  0.52% 82.04%      0.09s  0.59%  math/big.nat.bytes
     0.08s  0.52% 82.56%      0.08s  0.52%  runtime.memmove
     0.08s  0.52% 83.08%      0.30s  1.96%  runtime.newobject
     0.07s  0.46% 83.54%      0.11s  0.72%  encoding/base64.(*Encoding).Decode
     0.07s  0.46% 84.00%      0.10s  0.65%  sync.runtime_procPin
     0.06s  0.39% 84.39%      0.87s  5.68%  encoding/json.(*decodeState).object
     0.05s  0.33% 84.72%      0.12s  0.78%  encoding/json.checkValid
     0.05s  0.33% 85.04%      0.33s  2.16%  math/big.nat.make (inline)
     0.05s  0.33% 85.37%      0.28s  1.83%  math/big.putNat (inline)

@randall77 @FiloSottile any ideas?

@FiloSottile
Contributor

Maybe the Int.Div changes from #21960?

@mknyszek mknyszek added this to the Go1.14 milestone Jan 7, 2020
@mknyszek
Contributor

mknyszek commented Jan 10, 2020

I discovered in the last few weeks that the problem is more fundamental: we don't account for the cost of sysUsed in scavenger pacing the way we do the cost of sysUnused, and it turns out sysUsed is as expensive as (or more expensive than) sysUnused on Darwin, which is really what's going on here. On most other platforms sysUsed is a no-op, and the only performance hit in that case is a page fault (which is non-trivial, but definitely much less expensive than the madvise).

In other words, my assertion that scavenging is too fast on Darwin still holds, but my reading of the associated costs does not. The extra goroutine wake-ups and lock contention are negligible compared to the large additional number of sysUsed calls that need to be made for newly allocated memory.

As a result, the fix I used above turns out to be kind of a hack. I think the right thing to do is just make some empirical measurements of the relative costs of sysUsed and sysUnused on Darwin and multiply the measured critical time by this ratio. This is simple and likely to be the solution we'd settle on in 1.15 anyway. We can always change our minds, too.
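A minimal sketch of that direction, with the caveat that the ratio below is a placeholder rather than a real measurement, and the names are invented for this example:

package main

import (
	"fmt"
	"time"
)

// sysUsedCostRatio stands in for an empirically measured constant: how
// expensive making scavenged memory usable again (sysUsed) is relative to
// scavenging it (sysUnused) on Darwin. The 2.0 here is a placeholder.
const sysUsedCostRatio = 2.0

// adjustedCritical multiplies the measured scavenging time by the ratio, so
// the pacing also charges for the sysUsed work the allocator will do later.
func adjustedCritical(measured time.Duration) time.Duration {
	return time.Duration(float64(measured) * sysUsedCostRatio)
}

func main() {
	fmt.Println(adjustedCritical(2200 * time.Nanosecond)) // 4.4µs
}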

@mknyszek
Contributor

Oh, and also I'll probably open a separate bug for that so this bug can continue to be about a crypto performance regression.

@mknyszek mknyszek added the NeedsInvestigation label Jan 10, 2020
@pascaldekloe
Contributor Author

I can confirm that 71154e0 resolved the performance loss. The attention is much appreciated, guys. 👍
