Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/cipher: HTTPS throughput is poor #11929

Closed
jacobsa opened this issue Jul 30, 2015 · 16 comments
Closed

crypto/cipher: HTTPS throughput is poor #11929

jacobsa opened this issue Jul 30, 2015 · 16 comments

Comments

@jacobsa
Copy link
Contributor

jacobsa commented Jul 30, 2015

Here is a program that downloads the first 1 GiB from a large object full of zeroes stored in Google Cloud Storage, and reports its throughput. It also dumps a CPU profile to /tmp/cpu.pprof.

Running on an n1-standard-4 Google Cloud Engine VM in us-central1-a on Sandy Bridge CPUs looks like this:

% go version
go version devel +d3ffc97 Wed Jul 29 23:50:20 2015 +0000 linux/amd64

% go build -o read tmp/read.go && time ./read
Copied 1.00 GiB in 13.160669712s (77.81 MiB/s)
./read  12.67s user 0.88s system 102% cpu 13.164 total

In other words, it takes 100% CPU to do 78 MiB/s; CPU appears to be the bottleneck here. Compare that to curl:

% time curl -H "Range: bytes=0-1073741823" "https://storage.googleapis.com/jacobsa-standard-us/10g-big-chunks" > /dev/null   
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1024M  100 1024M    0     0   240M      0  0:00:04  0:00:04 --:--:--  240M
curl -H "Range: bytes=0-1073741823"  > /dev/null  1.72s user 1.01s system 64% cpu 4.262 total

Here it appears we've maxed out the network bandwidth available, using 64% CPU. The throughput in terms of walltime is 3x as high. Throughput in terms of bytes per CPU second is about 5x as high.

The CPU profile for the Go program looks like this:

11530ms of 13190ms total (87.41%)
Dropped 61 nodes (cum <= 65.95ms)
Showing top 5 nodes out of 45 (cum >= 330ms)
      flat  flat%   sum%        cum   cum%
    7780ms 58.98% 58.98%     7780ms 58.98%  crypto/cipher.(*gcm).mul
    1930ms 14.63% 73.62%     1930ms 14.63%  crypto/aes.encryptBlockAsm
     900ms  6.82% 80.44%     8670ms 65.73%  crypto/cipher.(*gcm).updateBlocks
     590ms  4.47% 84.91%     3550ms 26.91%  crypto/cipher.(*gcm).counterCrypt
     330ms  2.50% 87.41%      330ms  2.50%  crypto/cipher.fastXORWords

I know that it's possible to do some AES cipher work in hardware on modern Intel CPUs, and presumably that's what curl is doing via OpenSSL. Has this been considered for crypto/cipher in Go?

@bradfitz
Copy link
Contributor

/cc @agl

@cespare
Copy link
Contributor

cespare commented Jul 30, 2015

@jacobsa crypto/aes already uses AES-NI when possible for the AES block calculations (I believe that's the crypto/aes.encryptBlockAsm in your profile there). The internet tells me that the Galois field multiplication that's at the top of your profile (implemented in Go right now) can be done efficiently using the PCLMULQDQ instruction. Maybe that makes up a large part of the performance gap to OpenSSL.

I can't locate any previous discussion of GCM-specific hardware acceleration in issues or on the mailing lists.

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Jul 30, 2015
@jacobsa
Copy link
Contributor Author

jacobsa commented Jul 31, 2015

Thanks to the inspiration from @cespare I wrote an implementation based on this Intel whitepaper that's about twice as fast. We could probably do better by incorporating the techniques for parallelizing blocks from that paper, but it's a start.

I'll clean it up so that it works on other architectures and older CPUs, then send a CL.

@gopherbot
Copy link

CL https://golang.org/cl/13020 mentions this issue.

@dgryski
Copy link
Contributor

dgryski commented Jul 31, 2015

@agl
Copy link
Contributor

agl commented Jul 31, 2015

Note that we do have https://go-review.googlesource.com/#/c/10484/ pending for 1.6.

@jacobsa
Copy link
Contributor Author

jacobsa commented Aug 2, 2015

@agl: Awesome, this is where I was hoping to go eventually anyway. I only wish I had known about this before I spent a day on my CL. :-(

jacobsa added a commit to GoogleCloudPlatform/gcsfuse that referenced this issue Aug 3, 2015
This essentially exercises GCS throughput. The current results are pretty
crappy, probably in large part due to Go's poor AES-GCM performange:

    golang/go#11929

For #83.
jacobsa added a commit to jacobsa/go that referenced this issue Aug 5, 2015
This doubles the throughput of AES-GCM encryption and decryption, mostly
due to being much faster to perform the GF(2^128) multiplication. The
implementation of the multiplication step is based on the gfmul function
from this Intel whitepaper:

    https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf

I suspect we could do better by using the techniques in that paper for
the overall AES-GCM process, but this gives a nice win already.

For golang#11929.

Before:
    BenchmarkAESGCMSeal1K-12          200000              8449 ns/op         121.19 MB/s
    BenchmarkAESGCMOpen1K-12          200000              8461 ns/op         121.01 MB/s

After:
    BenchmarkAESGCMSeal1K-12          300000              4028 ns/op         254.18 MB/s
    BenchmarkAESGCMOpen1K-12          300000              4047 ns/op         252.97 MB/s

Change-Id: I819339be142e67d3482b832ff012afe036e96222
jacobsa added a commit to GoogleCloudPlatform/gcsfuse that referenced this issue Aug 11, 2015
The flush performance is a bit shitty, probably because of
golang/go#11929.

% go install -v && gcsfuse --temp-dir /mnt/ssd0 jacobsa-standard-asia ~/mp
% go build ./benchmarks/write_to_gcs && ./write_to_gcs --dir ~/mp

Wrote 1.00 GiB in 3.58963678s (285.27 MiB/s)
Flushed 1.00 GiB in 28.135228987s (36.40 MiB/s)
jacobsa added a commit to GoogleCloudPlatform/gcsfuse that referenced this issue Aug 11, 2015
The flush performance is a bit shitty, probably because of
golang/go#11929.

% go install -v && gcsfuse --temp-dir /mnt/ssd0 jacobsa-standard-asia ~/mp
% go build ./benchmarks/write_to_gcs && ./write_to_gcs --dir ~/mp

Wrote 1.00 GiB in 3.58963678s (285.27 MiB/s)
Flushed 1.00 GiB in 28.135228987s (36.40 MiB/s)
@jacobsa
Copy link
Contributor Author

jacobsa commented Oct 30, 2015

For the record, that CL was merged as eedaf9a. That appears to help a lot; not sure if it's good enough to call the issue fixed or not.

@bradfitz
Copy link
Contributor

@jacobsa, can you update this bug with a status update of where we're at, numbers-wise?

@cespare
Copy link
Contributor

cespare commented Oct 30, 2015

@jacobsa that commit eedaf9a is not on master. (And CL 13020 is abandoned?)

Also, is your change superseded by CL 10484, which went in a couple of months ago? Or would it still be helpful for GCM-with-a-non-AES-block-cipher? (Is that a thing?)

@jacobsa
Copy link
Contributor Author

jacobsa commented Oct 30, 2015

Sorry, I copied from the wrong tab. I meant efeeee3, which is CL 10484.

@bradfitz: Will do.

@jacobsa
Copy link
Contributor Author

jacobsa commented Oct 30, 2015

For the record, here's what I did to set up a VM and the object:

gcloud compute instances create ubuntu-us --scopes storage-full --image ubuntu-15-04 --machine-type n1-standard-4 --zone us-central1-a --boot-disk-size 500GB 
gcloud compute ssh --ssh-flag=-A --zone us-central1-a ubuntu-us
head -c $(numfmt --from iec-i 10Gi) /dev/zero > 10g
gsutil cp 10g gs://jacobsa-standard-us/10g
gsutil acl set public-read gs://jacobsa-standard-us/10g

Curl's throughput varies a decent amount, but this run showing 245 MiB/CPU-second seems representative:

% time curl -H "Range: bytes=0-1073741823" "https://storage.googleapis.com/jacobsa-standard-us/10g" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1024M  100 1024M    0     0   245M      0  0:00:04  0:00:04 --:--:--  245M
curl -H "Range: bytes=0-1073741823"  > /dev/null  1.48s user 0.79s system 54% cpu 4.184 total

Go 1.5.1 looks like this:

% go version
go version go1.5.1 linux/amd64

% go build foo.go && time ./foo
Copied 1.00 GiB in 12.830507037s (79.81 MiB/s)
./foo  12.19s user 0.84s system 101% cpu 12.833 total

And at tip:

% go version
go version devel +a92543e Thu Oct 29 23:16:27 2015 +0000 linux/amd64

% go build foo.go && time ./foo
Copied 1.00 GiB in 4.116806503s (248.74 MiB/s)
./foo  2.44s user 1.29s system 90% cpu 4.121 total

So looking pretty comparable with curl. Nice work.

@jacobsa
Copy link
Contributor Author

jacobsa commented Oct 30, 2015

I should have translated the output to be comparable with curl. Go 1.5.1 does 80 MiB/CPU-second, and tip does 248 MiB/CPU-second. So about as fast as curl now.

@minux minux closed this as completed Oct 30, 2015
@yonderblue
Copy link

Can this be done for ARM and ARM64?

@bradfitz
Copy link
Contributor

@Gaillard, search for (or create) an ARM-specific bug. This one is closed and we don't reuse issues.

@yonderblue
Copy link

will do thanks

@golang golang locked and limited conversation to collaborators Apr 19, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants