x/crypto/internal/chacha20: improve performance for ppc64le #25051
Comments
The heading should start with x/crypto/chacha20poly1305:
Change https://golang.org/cl/108999 mentions this issue:
I renamed to "x/crypto/internal/chacha20" since chacha20 is not inside chacha20poly1305 anymore (see commit 4937306)
Any update @dot-asm?
Change https://golang.org/cl/172177 mentions this issue:
For reference, the original implementation had a face-lift, dramatically improving performance on POWER9, by ~80%. The trouble is that POWER9 is kind of "allergic" to mixtures of scalar and vector instructions, which the original fits-all-ppc code relies on heavily. Since Go doesn't seem to be concerned about anything pre-POWER8, it would be sensible to deploy the new code path...
@dot-asm Do you mean you have a new power9-only implementation, or that the new implementation works on both power8 and power9?
POWER9 was merely a trigger for a change, because it was observed to perform unexpectedly poorly. But the change itself is PowerISA 2.07-based, so that it works on POWER8 and forward. Is it faster on POWER8? Of course, just not as much as on POWER9, only by ~40%.
Due to the deadline for Go 1.13 we will submit it as is for now to get better performance, then convert to the vsx implementation in Go 1.14 to improve further.
Change https://golang.org/cl/195959 mentions this issue:
This improves the performance of the asm implementation for chacha20 on ppc64le by updating to the vsx implementation provided in cryptogams. The previous implementation was found to not perform as well as possible on power9. This implementation improves performance on both power8 and power9.

Power9 improvement with this change as compared to current:

name               old time/op   new time/op   delta
ChaCha20/32-64     361ns ± 0%    225ns ± 0%    -37.67%  (p=1.000 n=1+1)
ChaCha20/63-64     364ns ± 0%    229ns ± 0%    -37.09%  (p=1.000 n=1+1)
ChaCha20/64-64     364ns ± 0%    231ns ± 0%    -36.54%  (p=1.000 n=1+1)
ChaCha20/256-64    332ns ± 0%    199ns ± 0%    -40.06%  (p=1.000 n=1+1)
ChaCha20/1024-64   1.24µs ± 0%   0.70µs ± 0%   -43.23%  (p=1.000 n=1+1)
ChaCha20/1350-64   1.89µs ± 0%   1.03µs ± 0%   -45.35%  (p=1.000 n=1+1)
ChaCha20/65536-64  77.0µs ± 0%   42.5µs ± 0%   -44.83%  (p=1.000 n=1+1)

This is discussed in issue golang/go#25051.

A few asm instructions vmrgew and vmrgow were just added in Go 1.14 so have been encoded using WORD at this point.

Change-Id: I2b192a63cf46b0b20195e60e4412c43c5dd14ad8
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/195959
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
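To illustrate what "encoded using WORD" means in the commit message for CL 195959: when the assembler does not yet know a mnemonic, the 32-bit instruction word can be computed by hand and emitted as raw data. The sketch below assumes the Power ISA VX-form layout (primary opcode 4, three 5-bit register fields, an 11-bit extended opcode) and the extended opcodes 1932 for vmrgew and 1676 for vmrgow. The helper name encodeVX and the register choices are hypothetical and are not taken from the CL.

```go
package main

import "fmt"

const (
	opVX     = 4    // primary opcode shared by VX-form vector instructions
	xoVMRGEW = 1932 // extended opcode assumed for vmrgew (Power ISA 2.07)
	xoVMRGOW = 1676 // extended opcode assumed for vmrgow (Power ISA 2.07)
)

// encodeVX packs a VX-form instruction word:
// opcode (6 bits) | VRT (5) | VRA (5) | VRB (5) | extended opcode (11).
// Operands follow the Power ISA order "VRT, VRA, VRB".
func encodeVX(xo, vrt, vra, vrb uint32) uint32 {
	return opVX<<26 | (vrt&31)<<21 | (vra&31)<<16 | (vrb&31)<<11 | (xo & 0x7ff)
}

func main() {
	// Print lines that could be pasted into a Go assembly file.
	fmt.Printf("WORD $0x%08X // vmrgew v27, v0, v1\n", encodeVX(xoVMRGEW, 27, 0, 1))
	fmt.Printf("WORD $0x%08X // vmrgow v0, v0, v1\n", encodeVX(xoVMRGOW, 0, 0, 1))
}
```

Once the toolchain supports the mnemonics directly, such WORD lines can be replaced by the named instructions without changing the generated machine code.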
Add asm implementation for chacha20 using vector instructions on ppc64le.

Below, the difference using the new code:

name               old speed      new speed       delta
ChaCha20/32-16     167MB/s ± 0%   129MB/s ± 0%    -22.60%  (p=0.008 n=5+5)
ChaCha20/63-16     308MB/s ± 0%   249MB/s ± 0%    -19.00%  (p=0.008 n=5+5)
ChaCha20/64-16     357MB/s ± 0%   251MB/s ± 0%    -29.57%  (p=0.008 n=5+5)
ChaCha20/256-16    398MB/s ± 0%   1199MB/s ± 0%   +201.20%  (p=0.008 n=5+5)
ChaCha20/1024-16   413MB/s ± 0%   1398MB/s ± 0%   +238.67%  (p=0.008 n=5+5)
ChaCha20/1350-16   395MB/s ± 0%   1189MB/s ± 0%   +200.71%  (p=0.008 n=5+5)
ChaCha20/65536-16  420MB/s ± 0%   1489MB/s ± 0%   +254.10%  (p=0.008 n=5+5)

Small sizes are slower due to the fact that it always calculates using 256 bytes of key stream.

This change was originally from Paulo Flabiano Smorigo <pfsmorigo@linux.vnet.ibm.com> and started as CL 108999 (https://go-review.googlesource.com/c/crypto/+/108999).

Fixes golang/go#25051

Change-Id: Ie510494249b227379e23d993467256b3d4088035
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/172177
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
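The small-size regression in CL 172177 can be made concrete with a rough model. The sketch below assumes the assembly routine generates key stream in whole 256-byte chunks, so the amount produced is the input length rounded up to a multiple of 256. The rounding rule for inputs larger than 256 bytes and the helper name keystreamBytes are assumptions for illustration; the commit message only states that 256 bytes of key stream are always calculated.

```go
package main

import "fmt"

// chunkSize is the key stream granularity named in the commit message.
const chunkSize = 256

// keystreamBytes returns how many key stream bytes would be generated for an
// n-byte input, assuming output is produced in whole chunkSize-byte chunks.
func keystreamBytes(n int) int {
	if n <= 0 {
		return 0
	}
	return (n + chunkSize - 1) / chunkSize * chunkSize
}

func main() {
	// Sizes taken from the benchmark names in the commit message.
	for _, n := range []int{32, 63, 64, 256, 1024, 1350, 65536} {
		ks := keystreamBytes(n)
		used := 100 * float64(n) / float64(ks)
		fmt.Printf("input %5d B -> key stream %5d B (%.1f%% used)\n", n, ks, used)
	}
}
```

Under this model a 32-byte input pays for eight times more key stream than it consumes, which is consistent with the slowdown reported for the 32-, 63-, and 64-byte cases and the large gains from 256 bytes upward.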
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
Using upstream.
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
GOARCH="ppc64le"
GOHOSTARCH="ppc64le"
GOHOSTOS="linux"
What did you do?
I opened this issue in order to submit a patch that adds an asm implementation for chacha20 using vector instructions on ppc64le.
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
What did you expect to see?
What did you see instead?