Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: bytes.Xor #30553

Closed
kixelated opened this issue Mar 4, 2019 · 30 comments
Closed

proposal: bytes.Xor #30553

kixelated opened this issue Mar 4, 2019 · 30 comments

Comments

@kixelated
Copy link
Contributor

kixelated commented Mar 4, 2019

Variation of #28113 (rejected proposal)

I would like to export the optimized xorBytes implementation in crypto/cipher. Unlike the above ticket, I think it should be a member of the bytes package as bytes.Xor.

Applying a XOR over multiple bytes is commonly used in crypto as well as parity generation. It is was relatively trivial to implement yourself but thanks to a recent contribution, the Go version is now very optimized.

package bytes

// Xor will xor each byte in a and b, writing the result to dst. 
// The destination should have enough space, otherwise Xor will panic.
// Returns the number of bytes written, which will be the minimum of len(a), and len(b).
func Xor(dst, a, b []byte) int

Currently, xorBytes will panic when dst is not large enough. It could follow copy semantics and write up to min(len(dst), len(a), len(b)) bytes. But it seems relatively common to panic in the crypto package when the dst buffer is not large enough, such as XORKeyStream.

@gopherbot gopherbot added this to the Proposal milestone Mar 4, 2019
@kixelated
Copy link
Contributor Author

kixelated commented Mar 4, 2019

My use-case is for a webrtc library.

I'm currently profiling the application and a sizable amount of time is spent in NewCTR. The SRTP spec states that each packet (<1400 bytes) uses a new CTR cipher based on the header information (it's careful to not reuse the same IV).

This line is one of my current bottlenecks, as it generates a slice of size 512 on the heap each time. NewCTR accounts for 2.16% of the program's execution time according to pprof.

My plan was to write my own CTR class, potentially abusing the fact that AES has a fixed block size to allocate on the stack. However, I'm not able to go down this route without lifting the fast xorBytes implementation from crypto/cipher. It's probably not a blocker; I just don't think it should be necessary to copy the assembly optimized functions that are used extensively in the stdlib but are not exposed.

For now, I'm evaluating a few alternative approaches but all of them involve Go changes.

Later in the project, I will have to implement FEC. Forward error correction is performed by XORing multiple packets together (in a specific way) to create parity packets. This is similar to RAID5, which is another example of when a fast bytes.Xor function would be desirable.

@kixelated
Copy link
Contributor Author

kixelated commented Mar 4, 2019

Also worth mentioning, this is similar to the purpose of the math/bits package. Go developers should not have to resort to assembly implementations for efficiency if the function is common and can be well defined.

@josharian
Copy link
Contributor

This might be a better fit for golang.org/x/crypto(?).

cc @FiloSottile

Btw, Filippo, I see that the amd64 version uses SSE2. Any interest in me adding an AVX2 version?

@kixelated
Copy link
Contributor Author

@josharian Possibly, but I was thinking crypto/cipher would import bytes instead of having two disjoint implementations.

@josharian
Copy link
Contributor

I think the chance of adding this to bytes is vanishingly small.

To ask some obvious questions: what about And? Or? Andnot? All three of those (but not Xor) are used in sometimes hot code in the compiler (bv.go). And package bytes is not really a catchall home for things related to byte slices; it’s more a sibling package to strings, which makes it more focused on textual tasks.

@kixelated
Copy link
Contributor Author

Yeah, that's fair. Adding it somewhere like golang.org/x/crypto would help my situation so long as you guys are okay with the code duplication.

@mvdan
Copy link
Member

mvdan commented Mar 4, 2019

What about adding it to crypto/cypher directly?

@mundaym
Copy link
Member

mundaym commented Mar 4, 2019

For now, I'm evaluating a few alternative approaches but all of them involve Go changes.

For the AES-CTR use case I think changing the Go standard library implementation of the CTR cipher mode would be preferable to adding any new APIs. It shouldn't be too complicated to make the buffer lazily allocated, i.e. allocated only if extra state is needed across XORKeyStream calls. Note also that you probably want to stick with the Go standard library implementation to pick up any future assembly optimizations (see #20967).

@kixelated
Copy link
Contributor Author

kixelated commented Mar 4, 2019

@mundaym that's what I thought of doing too, but crypto/aes would need access to xorBytes in crypto/cipher. It would have to be moved to an internal package or made public somehow.

@mundaym
Copy link
Member

mundaym commented Mar 4, 2019

@kixelated I think it would be reasonable to put it in an internal package, probably under crypto/internal. That is definitely preferable to a new public API.

The implementation of xorBytes should be constant time for crypto uses (i.e. the execution time should be not depend on the data) so it might perhaps make sense for it to live in crypto/internal/subtle rather than adding another internal package. We could then consider moving it to the crypto/subtle package if we wanted to make it public later.

@rsc
Copy link
Contributor

rsc commented Mar 6, 2019

Like Josh said, there's basically no chance of this being bytes.Xor.

One option would be for the compiler to recognize

for i := range d {
  d[i] = a[i] ^ b[i]
}

but even that seems too specialized. Probably better would be to start with a package x/crypto/bits or something like that.

/cc @FiloSottile to see if we want to retitle as an x/crypto proposal or else close.

@FiloSottile
Copy link
Contributor

If it looks good to the compiler folks, I'd rather have it via pattern-matching than make a name-conflicting bits package for a single function.

@josharian
Copy link
Contributor

@FiloSottile does it need to be constant time (per #30553 (comment))? If so it seems like it belongs in subtle and definitely not in the compiler. (There’s nothing about the Go code that is necessarily constant time—if a compiler could prove that b is all zero, then it a no-op.)

Assuming that that is ok, I don’t personally object to the pattern matching. It’s easy enough to implement and maintain, and though it does contribute a little bit to binary bloat, it is nice to have obvious, clear Go code be fast.

P.S. Did you want an AVX2 version?

@FiloSottile
Copy link
Contributor

@FiloSottile does it need to be constant time (per #30553 (comment))? If so it seems like it belongs in subtle and definitely not in the compiler. (There’s nothing about the Go code that is necessarily constant time—if a compiler could prove that b is all zero, then it a no-op.)

Hmm, that's a good point. I don't actually see how the compiler would know anything about cipher streams, but this is looking for trouble.

Let's start it in golang.org/x/crypto/internal/subtle, and we can move it to crypto/subtle if it works out.

P.S. Did you want an AVX2 version?

I don't know, does it move the needle on benchmarks of higher level primitives, or lets me remove some assembly elsewhere? I don't really ever need XOR by itself, so if it's not taking significant time already, I'm ok with it as is.

@FiloSottile
Copy link
Contributor

But also, there’s no reason we can’t remove that CTR bottleneck for the common case, so let’s do that.

@kixelated
Copy link
Contributor Author

Yeah, I've been messing around with some aes.NewCTR benchmarks. It requires moving the xorBytes code to an internal package but there are some improvements you can make with the fixed block size. Nothing major but it's a pretty popular cipher suite, so it's probably worth it.

@kixelated
Copy link
Contributor Author

kixelated commented Mar 7, 2019

I'll run some quick benchmarks with a specialized AES-CTR class. I also want to compare it against a hypothetical ctr.Reset(iv []byte) function that would let me reuse objects, although afaik there shouldn't be too much difference compared to the specialized AES-CTR class if it is allocated on the stack.

@kixelated
Copy link
Contributor Author

Okay, I wrote some benchmarks for more information:

  • NewAESCTR just creates a CTR cipher.
  • AESCTRRTP will create a NewCTR cipher and encrypt 1400 bytes (my use case).

The old code uses cipher.NewCTR which allocates ctr and out on the heap. The new code uses aesCipherAsm.NewCTR which reduces allocations and a lot of logic because of the fixed BlockSize.

name           old time/op    new time/op    delta
NewAESCTR-8       157ns ± 2%     113ns ± 1%  -27.93%  (p=0.000 n=10+10)
AESCTR1K-8       1.62µs ± 4%    1.56µs ± 1%   -3.47%  (p=0.000 n=10+10)
AESCTR8K-8       12.9µs ± 2%    12.5µs ± 2%   -2.88%  (p=0.000 n=8+10)
AESCTRRTP-8      2.56µs ± 2%    2.46µs ± 1%   -3.97%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
NewAESCTR-8        608B ± 0%      576B ± 0%   -5.26%  (p=0.000 n=10+10)
AESCTR1K-8        0.00B          0.00B          ~     (all equal)
AESCTR8K-8        0.00B          0.00B          ~     (all equal)
AESCTRRTP-8        608B ± 0%      576B ± 0%   -5.26%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
NewAESCTR-8        3.00 ± 0%      1.00 ± 0%  -66.67%  (p=0.000 n=10+10)
AESCTR1K-8         0.00           0.00          ~     (all equal)
AESCTR8K-8         0.00           0.00          ~     (all equal)
AESCTRRTP-8        3.00 ± 0%      1.00 ± 0%  -66.67%  (p=0.000 n=10+10)

name           old speed      new speed      delta
AESCTR1K-8      629MB/s ± 4%   651MB/s ± 1%   +3.56%  (p=0.000 n=10+10)
AESCTR8K-8      636MB/s ± 2%   655MB/s ± 2%   +2.96%  (p=0.000 n=8+10)
AESCTRRTP-8     546MB/s ± 2%   569MB/s ± 1%   +4.13%  (p=0.000 n=9+10)

+3-4% speedup is nothing to scoff at. I'll mess around with a few more things, like seeing if we really need to encrypt the counter 512 bytes at a time.

@kixelated
Copy link
Contributor Author

kixelated commented Mar 7, 2019

Oh and here's comparing NewCTR versus Reset after encrypting 1400 bytes.

BenchmarkAESCTRRTP-8        	  500000	      2454 ns/op	 570.43 MB/s	     576 B/op	       1 allocs/op
BenchmarkAESCTRRTPReset-8   	 1000000	      2334 ns/op	 599.82 MB/s	       0 B/op	       0 allocs/op

Reset just sets the counter to the given iv.

@josharian
Copy link
Contributor

P.S. Did you want an AVX2 version?
I don't know, does it move the needle on benchmarks of higher level primitives, or lets me remove some assembly elsewhere?

It wouldn't let us remove assembly, because AVX2 is not available on all amd64 chips that Go supports. It would speed things up, but perhaps not enough to care. I ran this:

$ go test -bench=AES -cpuprofile=c.p -count=3 crypto/cipher

and then pprof and got:

(pprof) top10 
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
    11.68s 18.99% 18.99%     11.68s 18.99%  crypto/aes.encryptBlockAsm
    10.24s 16.65% 35.64%     10.24s 16.65%  crypto/aes.gcmAesDec
     9.20s 14.96% 50.59%      9.20s 14.96%  crypto/aes.gcmAesEnc
     7.03s 11.43% 62.02%      7.03s 11.43%  crypto/aes.gcmAesData
     4.27s  6.94% 68.96%     12.09s 19.66%  crypto/cipher.(*cfb).XORKeyStream
     2.13s  3.46% 72.43%      2.13s  3.46%  runtime.nanotime
     2.10s  3.41% 75.84%     14.61s 23.75%  crypto/aes.(*aesCipherAsm).Encrypt
     1.82s  2.96% 78.80%      3.57s  5.80%  crypto/cipher.xorBytes
     1.75s  2.85% 81.65%      8.08s 13.14%  crypto/cipher.(*ctr).refill
     1.74s  2.83% 84.47%      1.74s  2.83%  crypto/cipher.xorBytesSSE2

Assuming an ~2x speedup from SSE2 to AVX2, it looks like this would be a ~1.4% speed-up.

So I'm guessing the answer is no. But now I can stop wondering, and the rest of this conversation can resume. :)

@beoran
Copy link

beoran commented Mar 15, 2019

How about adding a new package "math/bits/slices" that has the functions Xor, And, Or, Andnot, and maybe a few more that operate over byte slices, or why not, also over all sorts of uintX slices, and that may have platform dependent assembly language implementation for them? As @josharian indicated, the go compiler uses such functionality itself, and I can see how these functions would be useful for cryptography but also for image processing. So I'd say that these functions would be of general use.

@ianlancetaylor
Copy link
Contributor

This seems like a useful third-party package, but I don't yet see a clear reason to put it into the standard library. Perhaps it could go into the x repo at some point. That package can start with a copy of the existing code in crypto/cipher, which isn't all that much.

@kixelated
Copy link
Contributor Author

kixelated commented Mar 22, 2019

I would totally be fine with copying the fast xor code to golang.org/x/crypto or similar. It's really up to the team to decide if it would be a burden to maintain the same code in two repositories. Could crypto/cipher import golang.org/x/crypto/bytes (or whatever the name should be)?

I've written a CTR package for my application with a Reset(iv []byte) method. It's a 5% speedup when encrypting 1k size packets because I can reuse the cipher, avoiding allocating ~600 bytes on the heap for each NewCTR call.

My package requires the fast XorBytes function otherwise it's 40% slower. I can't ship with a copy of Go's assembly implementation because of licensing; plus it would be excessive.

@josharian
Copy link
Contributor

I can't ship with a copy of Go's assembly implementation because of licensing; plus it would be excessive.

Go’s license is very permissive. I’m surprised to hear that that is an issue. And fwiw, I’d consider copy/pasting such code a pretty reasonable interim fix under the circumstances—huge win, not all that much code, low refactoring risk (just copy/paste), pretty stable source with lots of miles on it. Just my 2c.

@kixelated
Copy link
Contributor Author

kixelated commented Mar 23, 2019

I'm contributing to a project using the MIT license, so I assumed it wouldn't be compatible but you're right. I'll make a pull request and see what they think of a package containing the Go license plus files with the Go copyright.

I'll still need xorBytes (or another copy of the assembly code) when I get around to implementing FEC (parity packets) but yeah this solution seems okay.

kixelated pushed a commit to pion/srtp that referenced this issue Mar 23, 2019
Much of the time spent is performing `cipher.NewCTR` for each packet.
This function allocates ~600 bytes on the heap each time which is a
little excessive to encrypt 1000 bytes.

I added a `Reset` function that allows cipher reuse. In addition, I sped
up the CTR cipher taking advantage of the fixed aes.BlockSize.

Unfortunately, forking the Go CTR implementation requires `xorBytes`,
which is not exported. There's a Go issue open that will hopefully
export the function because it's pretty useful. Until then, there needs
to be a lot of copy-pasted assembly code related to fast XOR bytes.

golang/go#30553

```
name                 old time/op   new time/op   delta
EncryptRTP-8          3.66µs ± 6%   3.36µs ± 6%   -8.04%  (p=0.001 n=10+10)
EncryptRTPInPlace-8   3.38µs ± 8%   3.13µs ± 5%   -7.33%  (p=0.000 n=10+10)
DecryptRTP-8          3.69µs ± 7%   3.37µs ± 8%   -8.80%  (p=0.000 n=10+10)
Write-8               3.80µs ± 9%   3.45µs ± 5%   -9.33%  (p=0.000 n=10+10)
WriteRTP-8            3.72µs ± 7%   3.46µs ± 8%   -6.96%  (p=0.005 n=10+10)

name                 old speed     new speed     delta
EncryptRTP-8         277MB/s ± 6%  301MB/s ± 6%   +8.76%  (p=0.001 n=10+10)
EncryptRTPInPlace-8  300MB/s ± 7%  323MB/s ± 5%   +7.85%  (p=0.000 n=10+10)
DecryptRTP-8         277MB/s ± 7%  304MB/s ± 7%   +9.68%  (p=0.000 n=10+10)
Write-8              266MB/s ± 8%  294MB/s ± 5%  +10.24%  (p=0.000 n=10+10)
WriteRTP-8           272MB/s ± 7%  293MB/s ± 8%   +7.61%  (p=0.005 n=10+10)
```
@andybons
Copy link
Member

Seems like this is happening outside the bytes package, so closing per discussion with @golang/proposal-review.

@0xmichalis
Copy link
Contributor

Where was this merged into after all?

@ianlancetaylor
Copy link
Contributor

@Kargakis I'm not sure I understand your question. This proposal was declined.

@0xmichalis
Copy link
Contributor

0xmichalis commented Nov 5, 2019 via email

@ianlancetaylor
Copy link
Contributor

OK, thanks, I don't know what the status of that is.

@golang golang locked and limited conversation to collaborators Nov 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests