
crypto/tls: significant drop in throughput from Go 1.6 #15713

Closed
rsc opened this issue May 17, 2016 · 11 comments
rsc commented May 17, 2016

Attached is a stab at a TLS benchmark (crypto/tls has none!).
It runs b.N TLS connections in sequence, each transferring 16 MB in each direction in 64 kB writes.
tlsbench_test.go.txt

Through Go 1.5 it was hovering around 30 MB/s on my workstation.
Go 1.6 made it jump by almost 10X, to around 295 MB/s.
At tip, however, it is down a bit, around 235 MB/s.

It would be good to understand this. This might be something to do soon after the beta is released.

/cc @agl
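
In case the attachment goes missing, a self-contained benchmark along the lines described above might look roughly like the sketch below. It is not the attached file: the loopback-TCP setup, the throwaway self-signed certificate, the echo-server shape, and the use of io.Discard (Go 1.16+) are assumptions made so the sketch compiles on its own.

package tlsbench

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"io"
	"math/big"
	"net"
	"testing"
	"time"
)

func check(b *testing.B, err error) {
	if err != nil {
		b.Fatal(err)
	}
}

// benchConfigs builds server and client tls.Configs around a throwaway
// self-signed certificate so the sketch is self-contained.
func benchConfigs(b *testing.B) (server, client *tls.Config) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	check(b, err)
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		DNSNames:              []string{"bench.local"},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	check(b, err)
	leaf, err := x509.ParseCertificate(der)
	check(b, err)
	pool := x509.NewCertPool()
	pool.AddCert(leaf)
	server = &tls.Config{Certificates: []tls.Certificate{{Certificate: [][]byte{der}, PrivateKey: key}}}
	client = &tls.Config{RootCAs: pool, ServerName: "bench.local"}
	return server, client
}

// BenchmarkIO runs one TLS connection per iteration, transferring 16 MB in
// each direction in 64 kB writes; b.SetBytes makes the tool report MB/s.
func BenchmarkIO(b *testing.B) {
	const total, chunk = 16 << 20, 64 << 10
	srvConf, cliConf := benchConfigs(b)
	buf := make([]byte, chunk)
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	check(b, err)
	defer ln.Close()
	b.SetBytes(2 * total)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		errc := make(chan error, 1)
		go func() {
			c, err := ln.Accept()
			if err != nil {
				errc <- err
				return
			}
			s := tls.Server(c, srvConf)
			defer s.Close()
			// Read everything the client sends, then echo the same amount back.
			if _, err := io.CopyN(io.Discard, s, total); err != nil {
				errc <- err
				return
			}
			for n := 0; n < total; n += chunk {
				if _, err := s.Write(buf); err != nil {
					errc <- err
					return
				}
			}
			errc <- nil
		}()
		c, err := tls.Dial("tcp", ln.Addr().String(), cliConf)
		check(b, err)
		for n := 0; n < total; n += chunk {
			_, err := c.Write(buf)
			check(b, err)
		}
		_, err = io.CopyN(io.Discard, c, total)
		check(b, err)
		check(b, <-errc)
		c.Close()
	}
}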

@rsc rsc added this to the Go1.7 milestone May 17, 2016
mikioh commented May 17, 2016

It's probably the dynamic record sizing introduced in https://go-review.googlesource.com/19591. What happens when you run the benchmark with &tls.Config{DynamicRecordSizingDisabled: true}?
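
With the sketch earlier in this thread, that amounts to flipping the new Config field on both sides before connecting (DynamicRecordSizingDisabled is the crypto/tls knob added for Go 1.7):

	// Disable dynamic record sizing on the server and client configs from
	// the sketch above, then run the benchmark as before.
	srvConf.DynamicRecordSizingDisabled = true
	cliConf.DynamicRecordSizingDisabled = true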

@josharian

Looks like dynamic record sizing, indeed. Going from 1.6.2 to tip, I can reproduce a slowdown:

name  old time/op   new time/op   delta
IO-8    115ms ± 7%    130ms ± 3%  +12.34%  (p=0.000 n=15+15)

name  old speed     new speed     delta
IO-8  146MB/s ± 7%  129MB/s ± 3%  -11.05%  (p=0.000 n=15+15)

Going from 1.6 to tip with DynamicRecordSizingDisabled: true, I see a slight speed-up:

name  old time/op   new time/op   delta
IO-8    115ms ± 7%    109ms ± 2%  -5.26%  (p=0.000 n=15+15)

name  old speed     new speed     delta
IO-8  146MB/s ± 7%  154MB/s ± 2%  +5.47%  (p=0.000 n=15+15)

So maybe the only things to do here are to call out dynamic record sizing in the 1.7 release notes and add this benchmark to the repo. Back to you, @rsc.

cespare commented May 17, 2016

/cc @tombergan

tombergan commented May 17, 2016

DynamicRecordSizing writes the first 1MB in ~1KB chunks instead of ~16KB chunks. Some loss of throughput is expected as this trades throughput for latency early in the connection.

The 1MB constant may benefit from tuning. I borrowed that constant from the Google frontend servers. Unfortunately, it doesn't seem like much thought went into that constant (or at least, the thought process was not documented anywhere) and I could not find benchmark data. I'm not even sure what the right benchmarks would be -- there are competing interests and the right value will likely depend on the underlying network speed and specifics of the application layer.
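
Schematically, the policy described here looks like the following. This is a simplified model for illustration, not the actual crypto/tls record-splitting code; the constants just mirror the description above.

	const (
		smallRecord    = 1 << 10  // ~1 kB early records, roughly one TCP packet each
		maxRecord      = 16 << 10 // 16 kB, the TLS maximum plaintext per record
		boostThreshold = 1 << 20  // switch to full-size records after ~1 MB sent
	)

	// nextRecordSize returns the payload size for the next record, given how
	// many plaintext bytes have already been sent on the connection.
	func nextRecordSize(bytesSent int64) int {
		if bytesSent < boostThreshold {
			return smallRecord
		}
		return maxRecord
	}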

@tombergan

Just to verify that the loss in throughput is limited to the start of the connection, I ran Russ's benchmark with various transfer sizes. There is noise, but the overhead generally decreases as the transfer size increases.

BenchmarkIO_64MB_NoDynamicSizing-40           30     375858719 ns/op     178.55 MB/s
BenchmarkIO_64MB_DynamicSizing-40             30     396826655 ns/op     169.11 MB/s
BenchmarkIO_32MB_NoDynamicSizing-40          100     192553850 ns/op     174.26 MB/s
BenchmarkIO_32MB_DynamicSizing-40            100     201446453 ns/op     166.57 MB/s
BenchmarkIO_24MB_NoDynamicSizing-40          100     140218146 ns/op     179.48 MB/s
BenchmarkIO_24MB_DynamicSizing-40            100     155411500 ns/op     161.93 MB/s
BenchmarkIO_16MB_NoDynamicSizing-40          100     100766193 ns/op     166.50 MB/s
BenchmarkIO_16MB_DynamicSizing-40            100     110000313 ns/op     152.52 MB/s
BenchmarkIO_1MB_NoDynamicSizing-40          2000      10185166 ns/op     102.95 MB/s
BenchmarkIO_1MB_DynamicSizing-40            1000      23113165 ns/op      45.37 MB/s

rsc commented May 18, 2016

@tombergan, given that we don't even know what the right values are and the ones we do use cause a significant performance drop, maybe dynamic record sizing should default to off for now or be reverted?

Maybe it should be "DynamicRecordSize int" and "DynamicRecordLimit int64", where 0 and 0 mean off, and the current behavior would be DynamicRecordSize=1024 and DynamicRecordLimit=1048576.
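
For concreteness, the proposed knobs would look something like the sketch below. These fields are hypothetical and were never adopted; crypto/tls shipped only the DynamicRecordSizingDisabled bool.

	// Hypothetical fields sketching the proposal above (not real crypto/tls API).
	type Config struct {
		// DynamicRecordSize is the record size to use early in a connection.
		// 0 means dynamic record sizing is off.
		DynamicRecordSize int
		// DynamicRecordLimit is how many bytes are written before switching
		// to full-size (16 kB) records. 0 means off.
		DynamicRecordLimit int64
		// ... existing crypto/tls Config fields ...
	}

	// The current behavior, expressed with these knobs:
	//   cfg := Config{DynamicRecordSize: 1024, DynamicRecordLimit: 1048576}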

@tombergan

It causes a significant performance drop when measured by throughput, but a performance increase when measured by latency, particularly for slow connections. For a simple web page served via HTTPS over a simulated 2G connection, I measured a ~500ms improvement in page load time with dynamic record sizing. There's more background in the first comment of issue #14376 and in the discussion with agl in the linked CL.

I can try to cook up a more reusable latency-centric benchmark if you like. I tend to believe that most clients will use TLS via HTTPS, and for those clients, latency is more important than throughput early in the connection. I am somewhat worried about over-tuning for a throughput benchmark when many uses are not actually throughput-limited.

That said, I don't feel too strongly about whether this should default to "on" or "off". If you feel strongly one way or the other, let's go with that. We tried to avoid knobs other than a single on/off bool, but I'm happy to change the bool to an int if you think that will help (I actually had an int in a prior version of the CL). I do feel strongly about reverting this -- given that it demonstrably improves latency, I think the feature should stay in some form.

@tombergan

How about:

  • Disable by default in tls.Config
  • Change the 1MB constant to 128KB
  • Enable by default in net/http.Server (via cloneTLSConfig)

Rationale:

  • The most important use case is serving HTML. Pageload latency improves when the browser's preload scanner parses the HTML as quickly as possible. Smaller TLS records improve time-to-first-decrypted-byte.
  • 128KB should be plenty to cover the HTML with response headers.
  • Disabling by default elsewhere means there are no unexpected throughput surprises.

@rsc, @agl, WDYT?

I added a simple latency benchmark to Russ's tlsbench_test.go. Results below from my workstation:

BenchmarkLatency_2500kbps_NoDynamicSizing-40             200      10392396 ns/op
BenchmarkLatency_1500kbps_NoDynamicSizing-40             100      14912234 ns/op
BenchmarkLatency_500kbps_NoDynamicSizing-40              100     281757497 ns/op
BenchmarkLatency_200kbps_NoDynamicSizing-40               30     677013274 ns/op
BenchmarkLatency_2500kbps_DynamicSizing-40               200       6145195 ns/op
BenchmarkLatency_1500kbps_DynamicSizing-40               200       6641656 ns/op
BenchmarkLatency_500kbps_DynamicSizing-40                100      78163048 ns/op
BenchmarkLatency_200kbps_DynamicSizing-40                 30     184665580 ns/op

tlsbench_test.go.txt

@rsc rsc modified the milestones: Go1.7Beta, Go1.7 May 23, 2016
@funny-falcon

Maybe it would be sensible to make the size of the "small records" grow?
With linear growth, the sum 1KB+2KB+3KB+...+15KB = 120KB, which is quite close to the 128KB bound.
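
A quick model of that suggestion (sizes in kB; the ramp-up sums to 1+2+...+15 = 120 kB):

	// recordSizeKB models the suggested linear growth: the i-th record
	// (counting from 1) is i kB, capped at the 16 kB maximum.
	func recordSizeKB(i int) int {
		if i < 16 {
			return i
		}
		return 16
	}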

@gopherbot

CL https://golang.org/cl/23487 mentions this issue.

@tombergan

To follow up on the above CL, I tried loading a very simple web page (a 16KB HTML file that imports one 16KB script) served over HTTP/2 on a simulated 2G connection (1200ms RTT, 200kbps throughput). Page load time results (median of 5 runs):

Max record sizing: 6.2s
Dynamic record sizing before cl/23487: 4.6s
Dynamic record sizing after cl/23487: 5.2s

Web benchmarking is hard. The benchmark I chose is a "best case" for dynamic record sizing in some ways, with a very slow client and a very simple page. The reason dynamic record sizing is slightly slower after cl/23487 is that the HTML actually starts on the third data packet -- the first packet has h2 settings and the second packet has h2 response headers.

If I were to propose any changes, I might propose changing the sequence to 1,1,1,2,3,4,5, etc. But I think this is a good enough compromise between throughput and latency. We can revisit this later if it becomes a problem.

/cc @rsc

@golang golang locked and limited conversation to collaborators May 27, 2017
FiloSottile pushed a commit to FiloSottile/go that referenced this issue Oct 12, 2018
The current code, introduced after Go 1.6 to improve latency on
low-bandwidth connections, sends 1 kB packets until 1 MB has been sent,
and then sends 16 kB packets (the maximum record size).

Unfortunately this decreases throughput for 1-16 MB responses by 20% or so.

Following discussion on golang#15713, change cutoff to 128 kB sent
and also grow the size allowed for successive packets:
1 kB, 2 kB, 3 kB, ..., 15 kB, 16 kB.
This fixes the throughput problems: the overhead is now closer to 2%.

I hope this still helps with latency but I don't have a great way to test it.
At the least, it's not worse than Go 1.6.

Comparing MaxPacket vs DynamicPacket benchmarks:

name              maxpkt time/op  dyn. time/op delta
Throughput/1MB-8    5.07ms ± 7%   5.21ms ± 7%  +2.73%  (p=0.023 n=16+16)
Throughput/2MB-8   15.7ms ±201%    8.4ms ± 5%    ~     (p=0.604 n=20+16)
Throughput/4MB-8    14.3ms ± 1%   14.5ms ± 1%  +1.53%  (p=0.000 n=16+16)
Throughput/8MB-8    26.6ms ± 1%   26.8ms ± 1%  +0.47%  (p=0.003 n=19+18)
Throughput/16MB-8   51.0ms ± 1%   51.3ms ± 1%  +0.47%  (p=0.000 n=20+20)
Throughput/32MB-8    100ms ± 1%    100ms ± 1%  +0.24%  (p=0.033 n=20+20)
Throughput/64MB-8    197ms ± 0%    198ms ± 0%  +0.56%   (p=0.000 n=18+7)

The small MB runs are bimodal in both cases, probably GC pauses.
But there's clearly no general slowdown anymore.

Fixes golang#15713.

Change-Id: I5fc44680ba71812d24baac142bceee0e23f2e382
Reviewed-on: https://go-review.googlesource.com/23487
Reviewed-by: Ian Lance Taylor <iant@golang.org>
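
For reference, the schedule this commit describes can be modeled roughly as follows. This is a sketch of the described behavior, not the actual crypto/tls code from cl/23487.

	// dynamicRecordSize models the post-cl/23487 schedule: record payloads ramp
	// up 1 kB, 2 kB, ..., 16 kB, and once 128 kB of plaintext has been sent the
	// connection sticks with full 16 kB records.
	func dynamicRecordSize(bytesSent int64, recordNum int) int {
		const maxRecord = 16 << 10
		if bytesSent >= 128<<10 {
			return maxRecord
		}
		if n := recordNum << 10; n < maxRecord { // recordNum counts from 1
			return n
		}
		return maxRecord
	}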