x/crypto/internal/chacha20: improve performance for ppc64le #25051

Closed
pfsmorigo opened this issue Apr 24, 2018 · 11 comments

Comments

@pfsmorigo

What version of Go are you using (go version)?

Using upstream.

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="ppc64le"
GOHOSTARCH="ppc64le"
GOHOSTOS="linux"

What did you do?

I opened this issue to submit a patch that adds an asm implementation of chacha20 using vector instructions on ppc64le.

@laboger
Contributor

laboger commented Apr 24, 2018

The heading should start with x/crypto/chacha20poly1305:

@gopherbot

Change https://golang.org/cl/108999 mentions this issue: internal/chacha20: improve performance for ppc64le

@pfsmorigo changed the title from "chacha20: improve performance for ppc64le" to "x/crypto/internal/chacha20: improve performance for ppc64le" on Apr 24, 2018
@gopherbot added this to the Unreleased milestone on Apr 24, 2018
@pfsmorigo
Author

pfsmorigo commented Apr 24, 2018

I renamed it to "x/crypto/internal/chacha20" since chacha20 is no longer inside chacha20poly1305 (see commit 4937306).

@pfsmorigo
Author

@dot-asm Hello Andy, can you add chacha/asm/chacha-ppc.pl to the cryptogams repository, the same as you did for #22637? Thanks!

@pfsmorigo
Author

Any update @dot-asm?

@gopherbot

Change https://golang.org/cl/172177 mentions this issue: internal/chacha20: improve performance for ppc64le

@dot-asm

dot-asm commented Apr 16, 2019

For reference, the original implementation has had a face-lift that dramatically improves performance on POWER9, by ~80%. The trouble is that POWER9 is kind of "allergic" to mixtures of scalar and vector instructions, which the original fits-all-ppc code relies on heavily. Since 'go' doesn't seem to be concerned about anything pre-POWER8, it would be sensible to deploy the new code path...

@laboger
Contributor

laboger commented Apr 16, 2019

@dot-asm Do you mean you have a new power9-only implementation, or that the new implementation works on both power8 and power9?

@dot-asm

dot-asm commented Apr 16, 2019

POWER9 was merely the trigger for the change, because it was observed to perform unexpectedly poorly. But the change itself is PowerISA 2.07-based, so it works on POWER8 and later. Is it faster on POWER8? Of course, just not as much as on POWER9, only by ~40%.

@laboger
Contributor

laboger commented May 28, 2019

Due to the deadline for Go 1.13 we will submit it as is for now to get better performance, then convert to the vsx implementation in Go 1.14 to improve further.

@gopherbot

Change https://golang.org/cl/195959 mentions this issue: internal/chacha20: improve chacha20 performance on ppc64le

gopherbot pushed a commit to golang/crypto that referenced this issue Oct 10, 2019
This improves the performance of the asm implementation for
chacha20 on ppc64le by updating to the vsx implementation
provided in cryptogams. The previous implementation was found to
not perform as well as possible on power9. This implementation
improves performance on both power8 and power9.

Power9 improvement with this change as compared to current:

name               old time/op    new time/op     delta
ChaCha20/32-64        361ns ± 0%      225ns ± 0%  -37.67%  (p=1.000 n=1+1)
ChaCha20/63-64        364ns ± 0%      229ns ± 0%  -37.09%  (p=1.000 n=1+1)
ChaCha20/64-64        364ns ± 0%      231ns ± 0%  -36.54%  (p=1.000 n=1+1)
ChaCha20/256-64       332ns ± 0%      199ns ± 0%  -40.06%  (p=1.000 n=1+1)
ChaCha20/1024-64     1.24µs ± 0%     0.70µs ± 0%  -43.23%  (p=1.000 n=1+1)
ChaCha20/1350-64     1.89µs ± 0%     1.03µs ± 0%  -45.35%  (p=1.000 n=1+1)
ChaCha20/65536-64    77.0µs ± 0%     42.5µs ± 0%  -44.83%  (p=1.000 n=1+1)

This is discussed in issue golang/go#25051.

The asm instructions vmrgew and vmrgow were only added to the assembler in Go 1.14,
so they are encoded using WORD for now.

Change-Id: I2b192a63cf46b0b20195e60e4412c43c5dd14ad8
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/195959
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
bored-engineer pushed a commit to bored-engineer/ssh that referenced this issue Oct 13, 2019
Add asm implementation for chacha20 using vector instructions on ppc64le.
Below, the difference using the new code:

name               old speed     new speed      delta
ChaCha20/32-16     167MB/s ± 0%   129MB/s ± 0%   -22.60%  (p=0.008 n=5+5)
ChaCha20/63-16     308MB/s ± 0%   249MB/s ± 0%   -19.00%  (p=0.008 n=5+5)
ChaCha20/64-16     357MB/s ± 0%   251MB/s ± 0%   -29.57%  (p=0.008 n=5+5)
ChaCha20/256-16    398MB/s ± 0%  1199MB/s ± 0%  +201.20%  (p=0.008 n=5+5)
ChaCha20/1024-16   413MB/s ± 0%  1398MB/s ± 0%  +238.67%  (p=0.008 n=5+5)
ChaCha20/1350-16   395MB/s ± 0%  1189MB/s ± 0%  +200.71%  (p=0.008 n=5+5)
ChaCha20/65536-16  420MB/s ± 0%  1489MB/s ± 0%  +254.10%  (p=0.008 n=5+5)

Small sizes are slower because the asm code always computes
256 bytes of key stream.

This change was originally from Paulo Flabiano Smorigo <pfsmorigo@linux.vnet.ibm.com>
and started as CL 108999 (https://go-review.googlesource.com/c/crypto/+/108999).

Fixes golang/go#25051

Change-Id: Ie510494249b227379e23d993467256b3d4088035
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/172177
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
bored-engineer pushed a commit to bored-engineer/ssh that referenced this issue Oct 13, 2019
@golang golang locked and limited conversation to collaborators Sep 16, 2020
c-expert-zigbee pushed a commit to c-expert-zigbee/crypto_go that referenced this issue Mar 28, 2022
c-expert-zigbee pushed a commit to c-expert-zigbee/crypto_go that referenced this issue Mar 29, 2022
c-expert-zigbee pushed a commit to c-expert-zigbee/crypto_go that referenced this issue Mar 29, 2022
LewiGoddard pushed a commit to LewiGoddard/crypto that referenced this issue Feb 16, 2023
BiiChris pushed a commit to BiiChris/crypto that referenced this issue Sep 15, 2023