Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/cipher: easy optimisations to xorBytes #53023

Open
jech opened this issue May 21, 2022 · 5 comments · May be fixed by #53154
Open

crypto/cipher: easy optimisations to xorBytes #53023

jech opened this issue May 21, 2022 · 5 comments · May be fixed by #53154
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@jech
Copy link

jech commented May 21, 2022

The function crypto/cipher.xorBytes implements the XOR operations between slices of bytes, and is heavily used by encryption in CTR mode. The version in the stdlib has seen a significant amount of optimisation, but some easy optimisations are missing:

  1. The variable supportUnaligned in xor_generic.go should be set for GOARCH equal to arm64.
  2. Said variable sohould be set for GOARCH equal to ARM and GOARM equal to 6 or 7 (but not 5).
  3. When supportsUnaligned is not set, the fast version should still be called when both src and dst are multiples of wordSize (this actually happens fairly often, e.g. when encrypting a freshly allocated slice);
  4. When supportsUnaligned is not set, and src and dst are equal modulo wordSize, then the initial bytes should be xored and then the fast version called on the rest of the array;
  5. Ideally, there should be a NEON implementation for ARMv7.

Points 1 through 3 are fairly trivial, and would bring a significant part of the benefit. This is related to #53021.

@adriancable
Copy link

@jech - point 1 is unnecessary because arm64 is excluded via the go:build line in xor_generic.go (due to the presence of the arm64 neon assembly implementation).

But an emphatic yes to the other points. Especially point 2 where we are currently leaving significant gains on the table on a performance critical function on ARMv6/7 where unaligned access is OK.

There are frequent posts regarding golang’s slow crypto performance on arm32 and this would be a low hanging fruit change with big gains.

@josharian
Copy link
Contributor

cc @bradfitz

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 23, 2022
@toothrot toothrot added this to the Backlog milestone May 23, 2022
@bcmills
Copy link
Contributor

bcmills commented May 23, 2022

See also #35381.

@adriancable
Copy link

@bcmills / @bradfitz / @jech - I have written both NEON and non-NEON versions of xorBytes for 32-bit ARM. The performance increase over the 'slow' version in xor_generic.go is significant, between 4X and 20X.

Given that 32-bit ARM remains one of the most popular IoT device platforms, it's a shame that 32-bit ARM is the sole unoptimised platform for 32-bit ARM in the crypto library for xorBytes, which can be a bottleneck function for crypto.

So, I would be very happy to create a PR here to contribute optimised 32-bit NEON and non-NEON ARM implementations of xorBytes. This could be a significant deal for crypto performance in general on 32-bit ARM. But I would appreciate a signal first that there is the appetite to accept such a PR, so it does not end up in /dev/null.

@gopherbot
Copy link

Change https://go.dev/cl/409394 mentions this issue: crypto/cipher: add optimized assembly xorBytes for ARM (NEON + non-NEON)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants