crypto/tls: figure out why/how buffer size affects crypto/tls performance on windows #15899
Looks like I misread the log output.
FWIW this takes 44 seconds on my MacBook.
And it's not that the builder is slow. So we need to investigate why this is so slow on the windows builder. Is it just the test, or the crypto/tls code?
I tried running the benchmarks on a linux-amd64 VM, and they run about as fast as on the real hardware. So it's not a GCE thing in general. I wonder if it's a windows+GCE thing? If someone with a physical windows box wants to help out, please run the throughput benchmarks and paste your output here. Thanks.
System info: Windows 7 Home Premium SP1. I built tip with CGO_ENABLED=0.
I noticed something that could help. Once the DynamicPacket test started, CPU usage by tls.test.exe dropped to zero. When I run just the MaxPacket benchmarks, they take 18.3s total and consume ~38% CPU (~1.5 cores) during that time.
@ChrisHines thanks for that. Those numbers seem normal. So it does seem to be a combination of Windows and GCE. I have another interesting finding: lowering the read/write buffer size in the throughput test dramatically affects run time. Reducing it from 1<<16 noticeably improves the numbers.
Still some slowness in the DynamicPacket benchmarks, but the others seem to have risen back to the normal range. Lowering it further to 1<<14 helps even more.
So the network stack for Windows on GCE prefers smaller buffers? I'll send a change to lower the buffer size so that the tests run in a reasonable time, but we should find out why this is happening.
I changed the read/write buffer size to 1<<14.
This isn't just a GCE issue. My Win7 box has terrible performance in the test code with 1<<16, but is on par with my linux box (similar-generation hardware) with 1<<14. Terrible performance as in MaxPacket 0.05Mb/s (not just DynamicPacket).
@kardianos thanks for the additional information. That's interesting. I wonder why, with the smaller buffer, the performance goes down on @ChrisHines' machine while it goes up on ours?
CL https://golang.org/cl/23574 mentions this issue.
The Windows builders run the throughput benchmarks really slowly with a 64kb buffer. Lowering it to 16kb brings the performance back into line with the other builders. This is a work-around to get the build green until we can figure out why the Windows builders are slow with the larger buffer size.

Update #15899

Change-Id: I215ebf115e8295295c87f3b3e22a4ef1f9e77f81
Reviewed-on: https://go-review.googlesource.com/23574
Reviewed-by: Ian Lance Taylor <iant@golang.org>
1<<14 is the maximum TLS record size. The only difference between 1<<14 and 1<<16 is that the former produces a single netFD.Write per Write call in the benchmark, while the latter produces 4 netFD.Writes per Write call. (At least, that should be the only difference.) Is it possible to get a tcpdump for a single slow iteration? That may help.
Is it possible to verify that Windows is respecting TCP_NODELAY? A tcpdump may help here as well.
In general, a loopback software interface is designed for protocol stack maintenance purposes and is probably not suitable for performance measurement (or for verifying fancy features such as IPv6 multicasting). FWIW, I just asked google: https://blogs.technet.microsoft.com/wincat/2012/12/05/fast-tcp-loopback-performance-and-low-latency-with-windows-server-2012-tcp-loopback-fast-path/
I can reproduce this on one of my computers too. I copied (with small adjustments) BenchmarkThroughput into the net package and ran it there. And I can see about 20% more throughput with 1<<16 than with 1<<14. That seems reasonable to me. I think this rules out some magic slowness in the tcp/ip stack. I am happy to debug this, but I don't even know where to start. Suggestions are welcome. Alex
If you have time to debug:
Note that the benchmark code alternates conn.Write with conn.Read. When bufsize=1<<14, each conn.Write call translates to a single TLS record (the max record size is 1<<14). Each TLS record is written with a single netFD.Write call, which translates to one WSASend syscall. When bufsize=1<<16, there are four TLS records (and four netFD.Writes / WSASends) per conn.Write. In the 1<<16 case, I'd guess that the 2nd, 3rd, or 4th WSASend call is being delayed for some unknown reason. Perhaps this is a weird interaction with TCP_NODELAY and/or TCP flow control.
I cannot explain the slowdown. But I can explain why windows race builder was hanging. The benchmark assumes that you can write as much data as you like to the network (without reading data on the other end of network connection), but that is not true. For example:
Same happens on windows. But the numbers are obviously different:
(I did press ctrl+C to break the program). The BenchmarkThroughput has the same problem. I can make it hang on Linux just like it did on Windows:
Alex
You're onto something. The problem definitely seems related to the buffer size relative to TCP flow control parameters. The client and server are using different buffer sizes -- the client uses len(buf) chunks while the server uses a fixed 32KB buffer (in io.Copy). This results in the following sequence on each loop iteration:
The behavior of the above sequence depends on the size of the TCP receive window (rwin). If rwin >= len(buf), each step can act as an atomic unit and everything works fine. If rwin <= len(buf)-32KB, the loop will eventually hang at step 3. In other cases the benchmark will work; however, steps 1 and 4 cannot act as atomic units because rwin is not large enough. This implies more synchronization between the client and server, which slows down the benchmark. The original version of this benchmark used len(buf)=1<<16=64KB. The question now is: is rwin between 32KB and 64KB? Here is a link that suggests Windows uses rwin=64512, which is just shy of 64KB. At least that's my hypothesis. I mailed the following change to close this issue. Can someone verify it works on Windows? Thanks!
CL https://golang.org/cl/24581 mentions this issue.
I don't know if these numbers are good or not. Alex
…TLS throughput benchmark

I believe it's necessary to use a buffer size smaller than 64KB because (at least some versions of) Windows use a TCP receive window less than 64KB. Currently the client and server use buffer sizes of 16KB and 32KB, respectively (the server uses io.Copy, which defaults to 32KB internally). Since the server has been using 32KB, it should be safe for the client to do so as well.

Fixes golang#15899

Change-Id: I36d44b29f2a5022c03fc086213d3c1adf153e983
Reviewed-on: https://go-review.googlesource.com/24581
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
It looks like a recent change to crypto/tls increased the duration of the throughput benchmarks beyond 10 minutes on the windows race builder: https://build.golang.org/log/1217fb6191fad2bf98dc54b61cfbe51e33af21e6

The benchmarks are run with -benchtime .1s, but many of the benchmarks take longer than 100 milliseconds. One simple fix would be to increase the timeout, currently 10 minutes for all standard library benchmarks combined. But I'm hesitant to make builds take longer.

I will investigate why -benchtime doesn't appear to have the desired effect.