
crypto: performance loss with 1.14 beta #36218

Closed

pascaldekloe opened this issue Dec 19, 2019 · 13 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance
Milestone: Go1.14

Comments

@pascaldekloe
Contributor

What version of Go are you using (go version)?

go version go1.13.5 darwin/amd64 vs. go version go1.14beta1 darwin/amd64

What operating system and processor architecture are you using (go env)?

kern.version: Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64
machdep.cpu.brand_string: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz

What did you do?

Run the benchmarks at https://github.com/pascaldekloe/jwt.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    26.9µs ± 0%  28.3µs ± 1%  +5.17%  (p=0.000 n=9+10)
ECDSA/sign-ES384-4    4.21ms ± 0%  4.48ms ± 0%  +6.33%  (p=0.000 n=10+10)
ECDSA/sign-ES512-4    7.42ms ± 1%  7.80ms ± 0%  +5.15%  (p=0.000 n=10+10)
ECDSA/check-ES256-4   81.0µs ± 0%  82.7µs ± 1%  +2.07%  (p=0.000 n=10+10)
ECDSA/check-ES384-4   8.38ms ± 0%  8.63ms ± 0%  +3.02%  (p=0.000 n=10+10)
ECDSA/check-ES512-4   14.8ms ± 0%  15.0ms ± 0%  +1.93%  (p=0.000 n=10+9)
EdDSA/sign-EdDSA-4    51.0µs ± 0%  51.2µs ± 1%  +0.45%  (p=0.006 n=10+9)
EdDSA/check-EdDSA-4    137µs ± 0%   137µs ± 1%  +0.54%  (p=0.035 n=9+10)
HMAC/sign-HS256-4     2.06µs ± 0%  2.10µs ± 1%  +2.09%  (p=0.000 n=10+10)
HMAC/sign-HS384-4     2.35µs ± 1%  2.40µs ± 1%  +2.00%  (p=0.000 n=10+10)
HMAC/sign-HS512-4     2.40µs ± 1%  2.44µs ± 1%  +1.46%  (p=0.000 n=10+10)
HMAC/check-HS256-4    4.18µs ± 0%  4.36µs ± 0%  +4.38%  (p=0.000 n=9+9)
HMAC/check-HS384-4    4.43µs ± 0%  4.60µs ± 0%  +3.98%  (p=0.000 n=10+8)
HMAC/check-HS512-4    4.69µs ± 3%  4.76µs ± 1%    ~     (p=0.218 n=10+10)
RSA/sign-1024-bit-4    321µs ± 0%   339µs ± 1%  +5.68%  (p=0.000 n=10+10)
RSA/sign-2048-bit-4   1.48ms ± 0%  1.51ms ± 0%  +1.73%  (p=0.000 n=10+9)
RSA/sign-4096-bit-4   8.31ms ± 0%  8.34ms ± 1%  +0.40%  (p=0.013 n=9+10)
RSA/check-1024-bit-4  27.7µs ± 1%  29.8µs ± 1%  +7.58%  (p=0.000 n=10+10)
RSA/check-2048-bit-4  63.4µs ± 0%  66.3µs ± 2%  +4.63%  (p=0.000 n=9+10)
RSA/check-4096-bit-4   169µs ± 0%   176µs ± 1%  +4.62%  (p=0.000 n=10+10)
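(The delta table above is benchstat-style output; a comparison like this can presumably be reproduced with something like go test -bench . -count 10 > old.txt under each toolchain, and then benchstat old.txt new.txt.)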

What did you expect to see?

Similar or better performance.

What did you see instead?

Overall slowdown.

@randall77
Contributor

Looking at the profile of ECDSA/sign-ES256, there appears to be more madvise in the profile at tip.

(run with go test --bench=ECDSA/sign-ES256 --count=10 --cpuprofile=cpu.prof)
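(The flat/cum listings below are pprof top output, viewable with e.g. go tool pprof -top cpu.prof.)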

1.13.5:

      flat  flat%   sum%        cum   cum%
    2460ms 15.78% 15.78%     2460ms 15.78%  crypto/elliptic.p256OrdSqr
    2150ms 13.79% 29.57%     2150ms 13.79%  p256MulInternal
    1610ms 10.33% 39.90%     1610ms 10.33%  crypto/elliptic.p256Sqr
    1480ms  9.49% 49.39%     1480ms  9.49%  syscall.syscall
     750ms  4.81% 54.20%      750ms  4.81%  crypto/elliptic.p256SelectBase
     650ms  4.17% 58.37%      650ms  4.17%  p256SqrInternal
     610ms  3.91% 62.28%      610ms  3.91%  runtime.usleep
     570ms  3.66% 65.94%      570ms  3.66%  runtime.madvise
     560ms  3.59% 69.53%     3490ms 22.39%  crypto/elliptic.p256PointAddAffineAsm
     500ms  3.21% 72.74%      500ms  3.21%  crypto/elliptic.p256OrdMul

tip:

      flat  flat%   sum%        cum   cum%
    2430ms 14.05% 14.05%     2430ms 14.05%  crypto/elliptic.p256OrdSqr
    1880ms 10.87% 24.93%     1880ms 10.87%  runtime.madvise
    1840ms 10.64% 35.57%     1840ms 10.64%  p256MulInternal
    1330ms  7.69% 43.26%     1330ms  7.69%  crypto/elliptic.p256Sqr
    1110ms  6.42% 49.68%     1110ms  6.42%  syscall.syscall
     760ms  4.40% 54.08%      760ms  4.40%  crypto/elliptic.p256SelectBase
     700ms  4.05% 58.13%      700ms  4.05%  runtime.kevent
     680ms  3.93% 62.06%     3180ms 18.39%  crypto/elliptic.p256PointAddAffineAsm
     510ms  2.95% 65.01%      510ms  2.95%  p256SqrInternal
     450ms  2.60% 67.61%      450ms  2.60%  runtime.pthread_kill

Maybe the new page allocator? @mknyszek

@mknyszek
Contributor

That's pretty surprising, since I figured scavenging only got less aggressive this release. The increase is probably related to the changes that were made there. I'll check it out.

@mknyszek mknyszek self-assigned this Dec 19, 2019
@mknyszek
Contributor

So here's an interesting bit: I tried to reproduce this on linux/amd64 and I couldn't.

Running under perflock, 1.14beta1 is faster by a few percent. Also, the profiles look completely different: for 1.13.4 it's dominated by runtime.cgocall, but it's just crypto functions for go1.14beta1. I don't really understand what's going on here, but that's not the platform this bug is about, so I'll put that to rest for now.

I'll try to run this on a darwin machine and see what's going on. One big difference with scavenging this release is that we're calling madvise more often (even though, I think, we're doing roughly the same amount of scavenging work or less). It's possible that each call to madvise has a larger overhead on darwin, which could make this a problem. On the other hand, the pacing is now time-based to avoid using more than 1% of the CPU time of a single core, so even if madvise were more expensive, it should work itself out.
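To make that pacing concrete, here is a minimal sketch of the idea (not the actual runtime code; sleepFor and cpuFraction are names invented for this example): after each chunk of scavenging work, sleep long enough that the work stays under the 1% budget.

package main

import (
	"fmt"
	"time"
)

const cpuFraction = 0.01 // target: background scavenging uses at most 1% of one core

// sleepFor returns how long to sleep after spending `critical` on scavenging
// work (e.g. one madvise call), so that critical/(critical+sleep) <= cpuFraction.
func sleepFor(critical time.Duration) time.Duration {
	return time.Duration(float64(critical) * (1/cpuFraction - 1))
}

func main() {
	// If a single madvise takes, say, 2.2µs (a figure that comes up later in
	// this thread), the sleep between calls comes out to roughly 218µs.
	fmt.Println(sleepFor(2200 * time.Nanosecond))
}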

@FiloSottile
Contributor

Running under perflock, 1.14beta1 is faster by a few percent. Also, the profiles look completely different: for 1.13.4 it's dominated by runtime.cgocall, but it's just crypto functions for go1.14beta1. I don't really understand what's going on here, but that's not the platform this bug is about, so I'll put that to rest for now.

There's a good chance you're running the Go+BoringCrypto corp build for your Go 1.13.4 benchmark, which uses BoringSSL via cgo on linux/amd64.

@mknyszek
Contributor

@FiloSottile ah hah, thank you, that's probably it. I'll keep that in mind for the future.

@mknyszek
Contributor

@pascaldekloe @randall77 OK so I still can't reproduce on linux/amd64 after building 1.13.5 myself (go1.14beta1 is consistently a hair faster). The profiles contain runtime.madvise as the 20th entry for 1.13.5 and it doesn't appear in the top 30 for 1.14beta1.

Now I'll go and try on darwin for real.

@pascaldekloe
Contributor Author

Thanks for the dedication @mknyszek. I tried with Docker [Linux] on the same machine, and the new beta is slightly faster indeed. 🙂

@mknyszek
Contributor

Alrighty, I think I found the problem.

It isn't that madvise is too slow, and in fact, the system handles that case quite well.

It's that madvise is too fast on these systems: about four times as fast as the old assumption in Go 1.13 (we assumed 4 KiB takes 10 µs, but here 4 KiB takes about 2.2 µs). Because we no longer make any assumptions, the rate isn't limited, and scavenging too often leads to outsized effects on performance because of goroutine wake-ups and lock contention.

I think a reasonable course of action here is to limit how fast the background scavenger can go. I tried one (runtime) page per millisecond, which, under the 1% CPU budget, amounts to assuming 8 KiB takes 10 µs. This is roughly twice our old assumption, and I can confirm that just doing this fixes most of the performance regression.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.3µs ±24%  26.0µs ± 1%     ~     (p=0.095 n=10+9)
ECDSA/sign-ES384-4    4.04ms ± 2%  4.09ms ± 1%   +1.05%  (p=0.035 n=10+10)
ECDSA/sign-ES512-4    7.09ms ± 2%  7.16ms ± 2%   +0.96%  (p=0.043 n=8+10)
ECDSA/check-ES256-4   80.1µs ± 2%  77.3µs ± 1%   -3.51%  (p=0.000 n=10+8)
ECDSA/check-ES384-4   8.10ms ± 3%  8.00ms ± 2%     ~     (p=0.182 n=10+9)
ECDSA/check-ES512-4   15.0ms ±18%  14.4ms ± 3%     ~     (p=1.000 n=10+10)
EdDSA/sign-EdDSA-4    52.7µs ±14%  50.7µs ± 2%     ~     (p=0.481 n=10+10)
EdDSA/check-EdDSA-4    146µs ±11%   132µs ± 3%  -10.00%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.92µs ± 1%  1.90µs ± 1%     ~     (p=0.145 n=9+8)
HMAC/sign-HS384-4     2.15µs ± 1%  2.19µs ± 4%   +1.92%  (p=0.013 n=9+10)
HMAC/sign-HS512-4     2.25µs ± 3%  2.20µs ± 3%     ~     (p=0.052 n=10+10)
HMAC/check-HS256-4    3.98µs ± 3%  3.93µs ± 1%     ~     (p=0.110 n=10+10)
HMAC/check-HS384-4    4.29µs ± 4%  4.31µs ± 5%     ~     (p=0.842 n=9+10)
HMAC/check-HS512-4    4.30µs ± 3%  4.47µs ± 3%   +3.83%  (p=0.001 n=10+9)
RSA/sign-1024-bit-4    315µs ± 8%   317µs ± 3%     ~     (p=0.447 n=10+9)
RSA/sign-2048-bit-4   1.45ms ± 2%  1.46ms ± 3%     ~     (p=0.315 n=8+10)
RSA/sign-4096-bit-4   8.03ms ± 2%  8.13ms ± 2%   +1.28%  (p=0.015 n=10+10)
RSA/check-1024-bit-4  26.2µs ± 1%  27.8µs ± 2%   +6.10%  (p=0.000 n=10+10)
RSA/check-2048-bit-4  59.7µs ± 1%  63.1µs ± 2%   +5.68%  (p=0.000 n=8+9)
RSA/check-4096-bit-4   161µs ± 2%   163µs ± 1%   +1.32%  (p=0.006 n=10+9)
[Geo mean]             107µs        106µs        -0.65%

There are a few benchmarks still showing a regression, which may be for a different reason, or may just be noise. I'm going to try going all the way back to our old order-of-magnitude assumption as well and see if that helps at all.
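As a rough illustration of the rate limit described above (again a sketch, not the real runtime change; minCritical and pacedSleep are invented names): clamp the measured per-page cost to an assumed floor of 10µs, so that under the 1% budget at most one 8 KiB page gets released per millisecond, no matter how fast madvise actually is.

package main

import (
	"fmt"
	"time"
)

const (
	minCritical = 10 * time.Microsecond // assumed cost floor for scavenging one 8 KiB page
	cpuFraction = 0.01                  // scavenger CPU budget (1% of one core)
)

// pacedSleep clamps the measured madvise time to minCritical before applying
// the 1% budget, so even a 2.2µs madvise yields a ~990µs sleep: roughly one
// page per millisecond.
func pacedSleep(measured time.Duration) time.Duration {
	if measured < minCritical {
		measured = minCritical
	}
	return time.Duration(float64(measured) * (1/cpuFraction - 1))
}

func main() {
	fmt.Println(pacedSleep(2200 * time.Nanosecond)) // 990µs
}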

@mknyszek
Contributor

Setting the limit back to the pacing of last release brings the regression down further.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.9µs ±13%  26.0µs ± 3%  -10.07%  (p=0.002 n=10+10)
ECDSA/sign-ES384-4    4.17ms ±10%  4.11ms ± 3%     ~     (p=0.965 n=10+8)
ECDSA/sign-ES512-4    7.09ms ± 3%  7.15ms ± 1%     ~     (p=0.315 n=10+10)
ECDSA/check-ES256-4   78.4µs ± 2%  76.8µs ± 0%   -2.02%  (p=0.000 n=9+8)
ECDSA/check-ES384-4   8.24ms ± 5%  7.97ms ± 2%   -3.30%  (p=0.006 n=10+9)
ECDSA/check-ES512-4   13.4ms ± 1%  14.3ms ± 2%   +6.46%  (p=0.000 n=10+10)
EdDSA/sign-EdDSA-4    49.6µs ± 2%  49.4µs ± 1%     ~     (p=0.436 n=10+10)
EdDSA/check-EdDSA-4    134µs ± 1%   129µs ± 1%   -4.03%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.95µs ± 4%  1.88µs ± 0%   -3.53%  (p=0.000 n=9+9)
HMAC/sign-HS384-4     2.26µs ± 4%  2.12µs ± 1%   -6.19%  (p=0.000 n=9+8)
HMAC/sign-HS512-4     2.28µs ± 2%  2.16µs ± 1%   -5.26%  (p=0.000 n=9+9)
HMAC/check-HS256-4    3.92µs ± 5%  3.96µs ± 1%     ~     (p=0.143 n=10+10)
HMAC/check-HS384-4    4.09µs ± 1%  4.21µs ± 1%   +2.98%  (p=0.000 n=10+10)
HMAC/check-HS512-4    4.52µs ±12%  4.30µs ± 1%     ~     (p=0.676 n=10+9)
RSA/sign-1024-bit-4    347µs ±26%   321µs ± 3%     ~     (p=0.400 n=10+9)
RSA/sign-2048-bit-4   1.46ms ± 7%  1.47ms ± 2%     ~     (p=0.165 n=10+10)
RSA/sign-4096-bit-4   7.95ms ± 1%  8.26ms ± 5%   +3.84%  (p=0.000 n=9+9)
RSA/check-1024-bit-4  26.1µs ± 2%  26.9µs ± 3%   +3.13%  (p=0.000 n=10+9)
RSA/check-2048-bit-4  59.1µs ± 1%  60.7µs ± 2%   +2.57%  (p=0.000 n=10+10)
RSA/check-4096-bit-4   159µs ± 1%   161µs ± 0%   +1.37%  (p=0.000 n=10+10)
[Geo mean]             107µs        105µs        -1.37%

The results aren't super reproducible (which is why that -10% is now -4%), since I'm running on a laptop with the full UI running and perflock doesn't work on darwin/amd64. I'd say that ECDSA/check-ES512 is a bit noisy on my machine (though it would be worth trying more runs, spinning up a new process every time, etc., to capture the noise), so I'm not going to dig any deeper there. Same for HMAC/check-HS384. But I think there may still be a legitimate regression in the RSA/check-*-bit benchmarks after my rate-limiting fix.

Looking at profiles for RSA/check-1024-bit more closely, there are clearly differences and costs have shifted around but nothing jumps out at me.

go1.13.5

Showing nodes accounting for 13.52s, 84.66% of 15.97s total
Dropped 212 nodes (cum <= 0.08s)
Showing top 30 nodes out of 130
      flat  flat%   sum%        cum   cum%
     4.67s 29.24% 29.24%      9.18s 57.48%  math/big.nat.divLarge
     1.73s 10.83% 40.08%      1.73s 10.83%  math/big.mulAddVWW
     1.37s  8.58% 48.65%      1.37s  8.58%  math/big.addMulVVW
     1.17s  7.33% 55.98%      1.17s  7.33%  math/big.subVV
     0.51s  3.19% 59.17%      0.51s  3.19%  runtime.madvise
     0.48s  3.01% 62.18%      0.48s  3.01%  runtime.pthread_cond_wait
     0.42s  2.63% 64.81%      1.81s 11.33%  math/big.basicMul
     0.35s  2.19% 67.00%      0.35s  2.19%  runtime.procyield
     0.29s  1.82% 68.82%      0.29s  1.82%  math/big.shlVU
     0.25s  1.57% 70.38%      0.25s  1.57%  runtime.pthread_cond_signal
     0.24s  1.50% 71.88%      0.24s  1.50%  math/big.shrVU
     0.22s  1.38% 73.26%      0.22s  1.38%  math/big.greaterThan
     0.19s  1.19% 74.45%      0.19s  1.19%  runtime.kevent
     0.17s  1.06% 75.52%      0.17s  1.06%  runtime.pthread_cond_timedwait_relative_np
     0.15s  0.94% 76.46%      1.12s  7.01%  runtime.mallocgc
     0.14s  0.88% 77.33%      0.14s  0.88%  runtime.memclrNoHeapPointers
     0.14s  0.88% 78.21%      0.15s  0.94%  runtime.stkbucket
     0.13s  0.81% 79.02%      0.13s  0.81%  runtime.usleep
     0.12s  0.75% 79.77%      0.12s  0.75%  crypto/sha512.blockAVX2
     0.12s  0.75% 80.53%      0.23s  1.44%  encoding/json.checkValid
     0.10s  0.63% 81.15%      0.10s  0.63%  runtime.procPin
     0.10s  0.63% 81.78%      0.31s  1.94%  sync.(*Pool).Get
     0.09s  0.56% 82.34%      0.09s  0.56%  math/big.nat.norm
     0.07s  0.44% 82.78%      0.11s  0.69%  encoding/base64.(*Encoding).Decode
     0.06s  0.38% 83.16%      0.12s  0.75%  runtime.heapBitsSetType
     0.05s  0.31% 83.47%      9.25s 57.92%  math/big.nat.div
     0.05s  0.31% 83.78%      1.92s 12.02%  math/big.nat.sqr
     0.05s  0.31% 84.10%      0.22s  1.38%  sync.(*Pool).Put
     0.05s  0.31% 84.41%      0.19s  1.19%  sync.(*Pool).pin
     0.04s  0.25% 84.66%      0.12s  0.75%  encoding/json.indirect

go1.14beta1

Showing nodes accounting for 13.07s, 85.37% of 15.31s total
Dropped 246 nodes (cum <= 0.08s)
Showing top 30 nodes out of 116
      flat  flat%   sum%        cum   cum%
     4.21s 27.50% 27.50%      7.45s 48.66%  math/big.nat.divBasic
     1.66s 10.84% 38.34%      1.66s 10.84%  math/big.mulAddVWW
     1.49s  9.73% 48.07%      1.49s  9.73%  math/big.addMulVVW
     1.14s  7.45% 55.52%      1.14s  7.45%  math/big.subVV
     0.96s  6.27% 61.79%      0.96s  6.27%  runtime.kevent
     0.56s  3.66% 65.45%      0.56s  3.66%  runtime.madvise
     0.47s  3.07% 68.52%      2.06s 13.46%  math/big.basicMul
     0.33s  2.16% 70.67%      0.33s  2.16%  math/big.shlVU
     0.19s  1.24% 71.91%      0.19s  1.24%  math/big.shrVU
     0.19s  1.24% 73.15%      0.19s  1.24%  runtime.pthread_cond_wait
     0.17s  1.11% 74.27%      0.65s  4.25%  runtime.mallocgc
     0.16s  1.05% 75.31%      0.16s  1.05%  runtime.memclrNoHeapPointers
     0.14s  0.91% 76.22%      0.14s  0.91%  math/big.greaterThan (inline)
     0.13s  0.85% 77.07%      0.14s  0.91%  runtime.stkbucket
     0.11s  0.72% 77.79%     10.52s 68.71%  math/big.nat.expNN
     0.10s  0.65% 78.45%      0.10s  0.65%  runtime.procyield
     0.10s  0.65% 79.10%      0.28s  1.83%  sync.(*Pool).Get
     0.10s  0.65% 79.75%      0.23s  1.50%  sync.(*Pool).Put
     0.09s  0.59% 80.34%      8.50s 55.52%  math/big.nat.divLarge
     0.09s  0.59% 80.93%      0.09s  0.59%  runtime.nextFreeFast (inline)
     0.09s  0.59% 81.52%      0.09s  0.59%  runtime.pthread_cond_signal
     0.08s  0.52% 82.04%      0.09s  0.59%  math/big.nat.bytes
     0.08s  0.52% 82.56%      0.08s  0.52%  runtime.memmove
     0.08s  0.52% 83.08%      0.30s  1.96%  runtime.newobject
     0.07s  0.46% 83.54%      0.11s  0.72%  encoding/base64.(*Encoding).Decode
     0.07s  0.46% 84.00%      0.10s  0.65%  sync.runtime_procPin
     0.06s  0.39% 84.39%      0.87s  5.68%  encoding/json.(*decodeState).object
     0.05s  0.33% 84.72%      0.12s  0.78%  encoding/json.checkValid
     0.05s  0.33% 85.04%      0.33s  2.16%  math/big.nat.make (inline)
     0.05s  0.33% 85.37%      0.28s  1.83%  math/big.putNat (inline)

@randall77 @FiloSottile any ideas?

@FiloSottile
Contributor

Maybe the Int.Div changes from #21960?

@mknyszek mknyszek added this to the Go1.14 milestone Jan 7, 2020
@mknyszek
Contributor

mknyszek commented Jan 10, 2020

I discovered in the last few weeks that the problem is more fundamental: we don't account for the cost of sysUsed in scavenger pacing the way we do the cost of sysUnused, and it turns out sysUsed is as expensive as (or more expensive than) sysUnused on Darwin, which is really what's going on here. On most other platforms sysUsed is a no-op, and the only performance hit in that case is a page fault (which is non-trivial, but definitely much less expensive than the madvise).

In other words, my assertion that scavenging is too fast on Darwin still holds, but my reading of the associated costs does not. The extra goroutine wake-ups and lock contention are negligible compared to the large additional number of sysUsed calls that need to be made for newly allocated memory.

As a result, the fix I used above turns out to be kind of a hack. I think the right thing to do is just make some empirical measurements of the relative costs of sysUsed and sysUnused on Darwin and multiply the measured critical time by this ratio. This is simple and likely to be the solution we'd settle on in 1.15 anyway. We can always change our minds, too.
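A minimal sketch of that direction, with the caveat that the ratio below is a placeholder rather than a real measurement, and the names are invented for this example:

package main

import (
	"fmt"
	"time"
)

// sysUsedCostRatio stands in for an empirically measured constant: how
// expensive making scavenged memory usable again (sysUsed) is relative to
// scavenging it (sysUnused) on Darwin. The 2.0 here is a placeholder.
const sysUsedCostRatio = 2.0

// adjustedCritical multiplies the measured scavenging time by the ratio, so
// the pacing also charges for the sysUsed work the allocator will do later.
func adjustedCritical(measured time.Duration) time.Duration {
	return time.Duration(float64(measured) * sysUsedCostRatio)
}

func main() {
	fmt.Println(adjustedCritical(2200 * time.Nanosecond)) // 4.4µs
}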

@mknyszek
Contributor

Oh, and also I'll probably open a separate bug for that so this bug can continue to be about a crypto performance regression.

@mknyszek mknyszek added the NeedsInvestigation label Jan 10, 2020
@pascaldekloe
Contributor Author

I can confirm that 71154e0 resolved the performance loss. The attention is much appreciated, guys. 👍
