Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/tls: 500% increase in allocations from (*tls.Conn).Read in go 1.17 #50657

Closed
dt opened this issue Jan 17, 2022 · 15 comments
Closed

crypto/tls: 500% increase in allocations from (*tls.Conn).Read in go 1.17 #50657

dt opened this issue Jan 17, 2022 · 15 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@dt
Copy link
Contributor

dt commented Jan 17, 2022

What version of Go are you using (go version)?

$ go version
1.17.3

Does this issue reproduce with the latest release?

I believe so, but have not yet confirmed.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOOS="linux"

What did you do?

I'm running a service that has many http2 clients.

What did you expect to see?

Fewer allocations when reading data from established connections.

What did you see instead?

Around 20-25% of our service's total allocations at times are coming from context.WithCancel calls made by (*tls.Conn).handshakeContext.

It looks like every (*tls.Context).Read call made by http2's ReadFrame is calling c.Handshake which calls through to c.handshakeContext, which begins with a call to context.WithCancel, which allocates.

@dt dt changed the title crypto/tls: large increase in allocations from (*tls.Context).Read in go 1.17 crypto/tls: large increase in allocations from (*tls.Conn).Read in go 1.17 Jan 17, 2022
@johanbrandhorst
Copy link
Member

johanbrandhorst commented Jan 17, 2022

As mentioned in the Gophers Slack discussion, it seems to me we could simply move the early exit checks above the context manipulation functions to remove these allocations in the case where the handshake has already completed.

@dt
Copy link
Contributor Author

dt commented Jan 17, 2022

it seems to me we could simply move the early exit checks above the context manipulation functions

One thing I was trying to figure out when glancing over the code: can we just check c.handshakeComplete(), only reading the atomic status field, right away and return if that's true, or do we want to move the whole block that acquires handshakeMu also checks the err up as well? Right now we're only checking status after acquiring the mutex but I'm wondering if we could check it first to avoid that?

@ALTree ALTree added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Jan 17, 2022
@johanbrandhorst
Copy link
Member

johanbrandhorst commented Jan 17, 2022

I would expect that we'd need to do both, so that if there was an error during handshaking that is returned first.

@dt
Copy link
Contributor Author

dt commented Jan 17, 2022

if there was an error during handshaking

Huh, but in that case, isn't handshakeStatus still 0? I'm just reading this for the first time, so I might be missing something, but it looked like the intent of the atomic status was to avoid needing to acquire the mutex to check if handshake was completed or not?

@johanbrandhorst
Copy link
Member

I'm not too intimately familiar with that specific logic, but judging by the existing logic flow we have to check the error before returning?

@gopherbot
Copy link

Change https://golang.org/cl/379034 mentions this issue: crypto/tls: avoid extra allocations in noop handshake calls

@johanbrandhorst
Copy link
Member

CC @golang/security (sorry if this is not the right way to get the attention from someone more experienced with the code).

@dt
Copy link
Contributor Author

dt commented Jan 21, 2022

I did a quick run comparing release-branch.go1.16 and release-branch.go1.17 using crypto/tls.BenchmarkThroughput. The raw timing numbers across those two are not directly comparable, as an unrelated change in 9d0819b switched the cipher config to avoid hardware aes, but just looking at the allocations / op:

name                                     old allocs/op  new allocs/op  delta
Throughput/MaxPacket/1MB/TLSv12-32            773 ± 0%      1030 ± 0%   +33.25%  (p=0.016 n=5+4)
Throughput/MaxPacket/1MB/TLSv13-32          1.44k ± 0%     1.67k ± 0%   +16.47%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv12-32            843 ± 0%      1546 ± 0%   +83.35%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv13-32          1.51k ± 0%     2.19k ± 0%   +45.50%  (p=0.008 n=5+5)
Throughput/MaxPacket/4MB/TLSv12-32            973 ± 1%      2576 ± 0%  +164.79%  (p=0.008 n=5+5)
Throughput/MaxPacket/4MB/TLSv13-32          1.64k ± 0%     3.22k ± 0%   +96.12%  (p=0.008 n=5+5)
Throughput/MaxPacket/8MB/TLSv12-32          1.24k ± 1%     4.65k ± 0%  +274.66%  (p=0.008 n=5+5)
Throughput/MaxPacket/8MB/TLSv13-32          1.91k ± 1%     5.30k ± 0%  +177.47%  (p=0.008 n=5+5)
Throughput/MaxPacket/16MB/TLSv12-32         1.77k ± 1%     8.78k ± 0%  +394.98%  (p=0.008 n=5+5)
Throughput/MaxPacket/16MB/TLSv13-32         2.43k ± 0%     9.41k ± 0%  +287.08%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv12-32         2.83k ± 1%    17.01k ± 0%  +501.69%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv13-32         3.50k ± 1%    17.65k ± 0%  +404.74%  (p=0.008 n=5+5)
Throughput/MaxPacket/64MB/TLSv12-32         4.93k ± 1%    33.60k ± 0%  +581.96%  (p=0.008 n=5+5)
Throughput/MaxPacket/64MB/TLSv13-32         5.63k ± 1%    34.14k ± 0%  +506.90%  (p=0.008 n=5+5)
Throughput/DynamicPacket/1MB/TLSv12-32        786 ± 0%      1093 ± 0%   +39.00%  (p=0.008 n=5+5)
Throughput/DynamicPacket/1MB/TLSv13-32      1.45k ± 0%     1.73k ± 0%   +19.32%  (p=0.016 n=4+5)
Throughput/DynamicPacket/2MB/TLSv12-32        856 ± 0%      1607 ± 0%   +87.69%  (p=0.008 n=5+5)
Throughput/DynamicPacket/2MB/TLSv13-32      1.52k ± 0%     2.25k ± 0%   +47.69%  (p=0.008 n=5+5)
Throughput/DynamicPacket/4MB/TLSv12-32        987 ± 1%      2640 ± 0%  +167.46%  (p=0.008 n=5+5)
Throughput/DynamicPacket/4MB/TLSv13-32      1.66k ± 0%     3.28k ± 0%   +97.82%  (p=0.008 n=5+5)
Throughput/DynamicPacket/8MB/TLSv12-32      1.26k ± 0%     4.71k ± 0%  +272.60%  (p=0.008 n=5+5)
Throughput/DynamicPacket/8MB/TLSv13-32      1.92k ± 1%     5.35k ± 0%  +178.67%  (p=0.008 n=5+5)
Throughput/DynamicPacket/16MB/TLSv12-32     1.79k ± 1%     8.84k ± 0%  +394.33%  (p=0.008 n=5+5)
Throughput/DynamicPacket/16MB/TLSv13-32     2.46k ± 1%     9.46k ± 0%  +285.13%  (p=0.008 n=5+5)
Throughput/DynamicPacket/32MB/TLSv12-32     2.86k ± 1%    17.10k ± 0%  +498.46%  (p=0.008 n=5+5)

By looking at 1.16 vs 90d6bbb -- the last commit to touch crypto/tls commit prior to the cipher change above -- I think we can see what effect these extra allocations have on the actual performance:

name                                 old time/op    new time/op    delta
Throughput/MaxPacket/2MB/TLSv12-32     6.20ms ± 1%    6.37ms ± 1%    +2.63%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv13-32     6.46ms ± 1%    6.61ms ± 1%    +2.38%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv12-32    77.9ms ± 0%    80.9ms ± 0%    +3.88%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv13-32    80.3ms ± 1%    82.7ms ± 0%    +2.93%  (p=0.008 n=5+5)

name                                 old speed      new speed      delta
Throughput/MaxPacket/2MB/TLSv12-32    338MB/s ± 1%   329MB/s ± 1%    -2.56%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv13-32    325MB/s ± 1%   317MB/s ± 1%    -2.32%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv12-32   431MB/s ± 0%   415MB/s ± 0%    -3.74%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv13-32   418MB/s ± 1%   406MB/s ± 0%    -2.85%  (p=0.008 n=5+5)

name                                 old alloc/op   new alloc/op   delta
Throughput/MaxPacket/2MB/TLSv12-32      216kB ± 0%     249kB ± 0%   +15.22%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv13-32      257kB ± 0%     290kB ± 0%   +12.85%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv12-32     267kB ± 2%     848kB ± 1%  +217.68%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv13-32     309kB ± 1%     889kB ± 0%  +187.97%  (p=0.008 n=5+5)

name                                 old allocs/op  new allocs/op  delta
Throughput/MaxPacket/2MB/TLSv12-32        843 ± 0%      1558 ± 0%   +84.79%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB/TLSv13-32      1.51k ± 0%     2.22k ± 0%   +47.52%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv12-32     2.83k ± 1%    17.02k ± 0%  +501.81%  (p=0.008 n=5+5)
Throughput/MaxPacket/32MB/TLSv13-32     3.50k ± 1%    17.69k ± 0%  +405.72%  (p=0.008 n=5+5)

Is this enough of a regression to justify a backport?

@alkmc
Copy link

alkmc commented Feb 24, 2022

Is there any update on this issue?

@moredure
Copy link
Contributor

moredure commented Apr 13, 2022

#52274

@dt dt changed the title crypto/tls: large increase in allocations from (*tls.Conn).Read in go 1.17 crypto/tls: 500% increase in allocations from (*tls.Conn).Read in go 1.17 May 4, 2022
@johanbrandhorst
Copy link
Member

Thanks @dt and @FiloSottile for getting this in ❤️

@alkmc
Copy link

alkmc commented May 9, 2022

@dt @FiloSottile what about 1.17 and 1.18 backport?

@dt
Copy link
Contributor Author

dt commented May 9, 2022

I'm unclear as to whether or not the performance regression seen here qualifies for a backport under the backport policy's definition of a serious issue with no workaround:

A “serious” problem is one that prevents a program from working at all.

I'll go ahead and open backport issues to continue the discussion there I guess?

@gopherbot please consider this for backport to 1.17, as it is a regression compared to 1.16.

@gopherbot please consider this for backport to 1.18, as it is a regression compared to 1.16.

@gopherbot
Copy link

Backport issue(s) opened: #52790 (for 1.17), #52791 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link

Change https://go.dev/cl/405544 mentions this issue: crypto/tls: avoid extra allocations in steady-state Handshake calls

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

7 participants