cmd/compile: intrinsify bits.RotateLeft32 on mipsle #39139

Open
assadobaid opened this issue May 19, 2020 · 7 comments · May be fixed by #45028
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@assadobaid

assadobaid commented May 19, 2020

What version of Go are you using (go version)?

Build env:
go1.14.3 linux/amd64

Runtime:
GOOS=linux
GOARCH=mipsle 
GOMIPS=softfloat

Does this issue reproduce with the latest release?

Yes

TLS 1.3 performance has decreased significantly with Go 1.14.x and the latest x/crypto master branch.

What did you do?

Our application uses TLS 1.3 to stream real-time video data. After we upgraded Go from 1.13 to 1.14.3, CPU performance dropped and latency increased.
Running the same test on Go 1.13 and Go 1.14.3 shows that ChaCha20-Poly1305 takes almost twice as long on 1.14 as on 1.13.x.
The problem appears on 1.14 with both the released version of x/crypto and the latest x/crypto master.
We also tried TLS 1.2 and still see the issue.

What did you expect to see?

Same performance across versions.

What did you see instead?

In our 4-minute test, the total time spent in crypto increased from 54 seconds to 96 seconds.

Go1.13

Link to pprof svg graph

  flat  flat%   sum%        cum   cum%
23.20s 19.59% 19.59%     26.67s 22.52%  golang.org/x/crypto/poly1305.updateGeneric
17.25s 14.57% 34.16%     19.12s 16.15%  syscall.Syscall
16.23s 13.71% 47.87%     16.23s 13.71%  golang.org/x/crypto/internal/chacha20.quarterRound
14.32s 12.09% 59.96%     14.32s 12.09%  runtime.usleep
 5.96s  5.03% 64.99%      5.96s  5.03%  runtime.futex
 5.14s  4.34% 69.34%      5.14s  4.34%  runtime.memmove
 4.09s  3.45% 72.79%      4.09s  3.45%  runtime._LostSIGPROFDuringAtomic64
 3.54s  2.99% 75.78%      3.54s  2.99%  encoding/binary.littleEndian.Uint32
 3.28s  2.77% 78.55%      3.28s  2.77%  golang.org/x/crypto/internal/chacha20.xor
 2.90s  2.45% 81.00%     22.61s 19.09%  golang.org/x/crypto/internal/chacha20.(*Cipher).XORKeyStream
 2.32s  1.96% 82.96%      2.32s  1.96%  runtime.nanotime
 1.59s  1.34% 84.30%      1.59s  1.34%  runtime.epollwait
 1.12s  0.95% 85.25%     20.89s 17.64%  runtime.sysmon
 0.95s   0.8% 86.05%      1.88s  1.59%  runtime.retake
 0.71s   0.6% 86.65%      0.71s   0.6%  runtime.memclrNoHeapPointers
 0.62s  0.52% 87.17%      0.65s  0.55%  runtime.lock
 0.52s  0.44% 87.61%      0.78s  0.66%  runtime.unlock
 0.37s  0.31% 87.92%      1.10s  0.93%  runtime.mallocgc
 0.34s  0.29% 88.21%      6.26s  5.29%  runtime.schedule
 0.33s  0.28% 88.49%      5.54s  4.68%  runtime.findrunnable
 0.27s  0.23% 88.72%     13.44s 11.35%  crypto/tls.(*Conn).write
 0.26s  0.22% 88.94%     57.90s 48.90%  crypto/tls.(*halfConn).encrypt
 0.23s  0.19% 89.13%      1.06s   0.9%  runtime.reentersyscall
 0.20s  0.17% 89.30%      0.90s  0.76%  crypto/tls.(*Conn).SetWriteDeadline
 0.20s  0.17% 89.47%     71.67s 60.53%  crypto/tls.(*Conn).writeRecordLocked
 0.18s  0.15% 89.62%     54.14s 45.72%  golang.org/x/crypto/chacha20poly1305.(*chacha20poly1305).sealGeneric
 0.18s  0.15% 89.77%         1s  0.84%  runtime.exitsyscall
 0.16s  0.14% 89.91%      1.76s  1.49%  runtime.netpoll
 0.13s  0.11% 90.02%     26.80s 22.63%  golang.org/x/crypto/poly1305.(*macGeneric).Write
 0.13s  0.11% 90.13%     13.02s 11.00%  internal/poll.(*FD).Write

Go1.14.3 and x/crypto master

Link to pprof svg graph

  flat  flat%   sum%        cum   cum%
23.89s 12.98% 12.98%     23.89s 12.98%  runtime.usleep
19.10s 10.38% 23.35%     49.38s 26.83%  golang.org/x/crypto/chacha20.(*Cipher).xorKeyStreamBlocksGeneric
16.10s  8.75% 32.10%     17.50s  9.51%  syscall.Syscall
16.06s  8.72% 40.82%     25.44s 13.82%  golang.org/x/crypto/chacha20.quarterRound
12.19s  6.62% 47.45%     12.19s  6.62%  runtime.futex
11.50s  6.25% 53.69%     11.83s  6.43%  math/bits.Mul64
10.79s  5.86% 59.56%     45.49s 24.71%  golang.org/x/crypto/poly1305.updateGeneric
 8.39s  4.56% 64.11%      8.70s  4.73%  math/bits.RotateLeft32
 7.13s  3.87% 67.99%      7.30s  3.97%  math/bits.Add64
 4.30s  2.34% 70.32%      4.30s  2.34%  runtime._LostSIGPROFDuringAtomic64
 4.29s  2.33% 72.65%      4.37s  2.37%  encoding/binary.littleEndian.Uint64
 4.23s  2.30% 74.95%     20.18s 10.96%  golang.org/x/crypto/poly1305.mul64
 3.97s  2.16% 77.11%      4.12s  2.24%  golang.org/x/crypto/chacha20.addXor
 3.86s  2.10% 79.20%     15.76s  8.56%  golang.org/x/crypto/poly1305.bitsMul64
 3.81s  2.07% 81.27%      3.81s  2.07%  runtime.nanotime1
 3.34s  1.81% 83.09%      3.34s  1.81%  runtime.epollwait
 3.18s  1.73% 84.82%      3.18s  1.73%  runtime.memmove
 3.01s  1.64% 86.45%      3.02s  1.64%  runtime.asyncPreempt
 2.19s  1.19% 87.64%      3.12s  1.69%  runtime.timeSleepUntil
 1.81s  0.98% 88.62%     36.83s 20.01%  runtime.sysmon
 1.76s  0.96% 89.58%      4.62s  2.51%  golang.org/x/crypto/poly1305.add128
 1.47s   0.8% 90.38%      3.10s  1.68%  runtime.retake
 0.94s  0.51% 90.89%      1.04s  0.56%  runtime.lock
 0.63s  0.34% 91.23%     13.08s  7.11%  runtime.findrunnable
 0.46s  0.25% 91.48%      7.77s  4.22%  golang.org/x/crypto/poly1305.bitsAdd64 (partial-inline)
 0.42s  0.23% 91.71%    100.46s 54.57%  crypto/tls.(*halfConn).encrypt
 0.41s  0.22% 91.93%      3.94s  2.14%  runtime.netpoll
 0.31s  0.17% 92.10%     12.78s  6.94%  crypto/tls.(*Conn).write
 0.31s  0.17% 92.27%     49.89s 27.10%  golang.org/x/crypto/chacha20.(*Cipher).XORKeyStream
 0.26s  0.14% 92.41%     45.86s 24.91%  golang.org/x/crypto/poly1305.(*macGeneric).Write
@ALTree ALTree changed the title Significant performance drop in TLS (Chacha20 Poly1305) x/crypto: significant performance drop in TLS (Chacha20 Poly1305) May 19, 2020
@gopherbot gopherbot added this to the Unreleased milestone May 19, 2020
@ALTree ALTree added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels May 19, 2020
@ALTree
Member

ALTree commented May 19, 2020

cc @FiloSottile

@FiloSottile FiloSottile changed the title x/crypto: significant performance drop in TLS (Chacha20 Poly1305) x/crypto/chacha20poly1305: significant performance drop on mipsle May 19, 2020
@stffabi

stffabi commented Mar 15, 2021

We have also seen a similar performance drop for chacha20poly1305 on our MT7688 platform. I investigated further and narrowed the issue down: it does not seem to be tied directly to the Go release itself, but to the x/crypto version bundled with that Go release.

I traced it to the x/crypto commit golang/crypto@85e5e33, which rewrote the bit rotations in quarterRound to use bits.RotateLeft32. On most platforms the Go compiler replaces this call with a bit-rotation instruction. That is currently not the case for MIPS/MIPS64, so the change introduced additional function calls and reduced the throughput of chacha20poly1305.
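To make the difference concrete, here is a minimal sketch of a ChaCha20 quarter round written both ways. It is not the x/crypto source: the function names `quarterRoundShifts` and `quarterRoundRotate` are made up for this example, and the first variant only imitates the shift-and-or style that predates the switch to math/bits. The input and expected values are the quarter-round test vector from RFC 7539, section 2.1.1.

```go
package main

import (
	"fmt"
	"math/bits"
)

// quarterRoundShifts spells every rotation out as a pair of shifts.
// Hypothetical name; the shift/or style needs no call into math/bits.
func quarterRoundShifts(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d ^= a
	d = d<<16 | d>>16
	c += d
	b ^= c
	b = b<<12 | b>>20
	a += b
	d ^= a
	d = d<<8 | d>>24
	c += d
	b ^= c
	b = b<<7 | b>>25
	return a, b, c, d
}

// quarterRoundRotate performs the same steps with bits.RotateLeft32.
// Hypothetical name; whether each call becomes a single instruction
// depends on the compiler treating RotateLeft32 as an intrinsic on the
// target architecture.
func quarterRoundRotate(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 16)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 12)
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 8)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 7)
	return a, b, c, d
}

func main() {
	// Input taken from the RFC 7539 quarter-round test vector.
	a1, b1, c1, d1 := quarterRoundShifts(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	a2, b2, c2, d2 := quarterRoundRotate(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	fmt.Printf("shifts: %08x %08x %08x %08x\n", a1, b1, c1, d1)
	fmt.Printf("rotate: %08x %08x %08x %08x\n", a2, b2, c2, d2)
}
```

Both variants compute the same values, so any speed difference comes purely from how the compiler lowers the rotation on a given target.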

After looking at the "MIPS32 Instruction Set" manual, it seems that MIPS32r2 (AFAIK the minimum requirement for Go) provides the bit-rotation instructions ROTR/ROTRV. I added support for rewriting bits.RotateLeft32 to ROTR/ROTRV in a fork of the Go compiler to see what kind of performance improvement we would get.

Throughput increased by about 65%-80% on an MT7688. These are the results of the x/crypto chacha20poly1305 benchmarks on our MT7688 platform (old = golang/crypto@5ea612d compiled with Go 1.16, new = golang/crypto@5ea612d compiled with the patched Go compiler):

goos: linux
goarch: mipsle
pkg: golang.org/x/crypto/chacha20poly1305
name                         old time/op    new time/op    delta
Chacha20Poly1305/Open-16       56.2µs ±20%    38.5µs ±40%   -31.45%  (p=0.001 n=8+10)
Chacha20Poly1305/Seal-16       68.3µs ±49%    30.6µs ±13%   -55.14%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-64       67.5µs ±22%    37.8µs ±19%   -43.98%  (p=0.000 n=9+9)
Chacha20Poly1305/Seal-64       64.7µs ±10%    37.6µs ± 8%   -41.96%  (p=0.000 n=9+8)
Chacha20Poly1305/Open-256       151µs ±13%      89µs ±20%   -41.03%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-256       148µs ±19%      93µs ±35%   -37.15%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-1024      456µs ±16%     260µs ±23%   -42.95%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-1024      469µs ±14%     254µs ±15%   -45.88%  (p=0.000 n=10+9)
Chacha20Poly1305/Open-8192     3.59ms ±23%    1.94ms ±15%   -45.86%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-8192     3.47ms ±20%    2.03ms ±22%   -41.60%  (p=0.000 n=9+10)
Chacha20Poly1305/Open-16384    7.01ms ± 9%    4.22ms ±22%   -39.89%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-16384    7.43ms ±19%    4.23ms ±11%   -43.04%  (p=0.000 n=10+9)

name                         old speed      new speed      delta
Chacha20Poly1305/Open-16      258kB/s ±46%   431kB/s ±32%   +67.05%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-16      246kB/s ±35%   527kB/s ±13%  +114.23%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-64      927kB/s ±31%  1664kB/s ±22%   +79.50%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-64      993kB/s ±10%  1709kB/s ± 8%   +72.02%  (p=0.000 n=9+8)
Chacha20Poly1305/Open-256    1.70MB/s ±13%  2.90MB/s ±18%   +70.88%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-256    1.74MB/s ±17%  2.81MB/s ±28%   +61.16%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-1024   2.26MB/s ±15%  3.99MB/s ±20%   +76.38%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-1024   2.20MB/s ±13%  3.92MB/s ±32%   +78.82%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-8192   2.31MB/s ±19%  4.24MB/s ±14%   +83.72%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-8192   2.30MB/s ±29%  4.09MB/s ±19%   +77.66%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-16384  2.34MB/s ±10%  3.93MB/s ±19%   +68.04%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-16384  2.23MB/s ±17%  3.79MB/s ±23%   +70.00%  (p=0.000 n=10+10)
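The two tables above express the same improvement in different units, which is why a time/op delta of roughly -40% appears next to a speed delta of roughly +70%. A small sketch of the conversion follows; `speedupFromTimeDelta` is a hypothetical helper, and the result is only an approximation of the reported speed deltas, since each table appears to be summarised from its own noisy samples.

```go
package main

import "fmt"

// speedupFromTimeDelta converts a relative change in time/op
// (for example -0.4196 for -41.96%) into the implied relative change
// in throughput. Hypothetical helper for illustration only.
func speedupFromTimeDelta(timeDelta float64) float64 {
	return 1/(1+timeDelta) - 1
}

func main() {
	// Time/op deltas taken from the Seal-64 and Seal-1024 rows above.
	for _, d := range []float64{-0.4196, -0.4588} {
		fmt.Printf("time/op %+.2f%%  ->  throughput %+.2f%%\n",
			d*100, speedupFromTimeDelta(d)*100)
	}
}
```

For the Seal-64 row, a -41.96% change in time/op implies about +72.3% throughput, close to the +72.02% in the speed table. Rows with larger variance, such as Open-16, deviate much more from this simple conversion.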

Other x/crypto algorithms such as blake2s, scrypt and ripemd160 would also benefit from this compiler change, since all of them use bits.RotateLeft32.

@FiloSottile what do you think about this change? Do you see any chance of getting it into the Go compiler? I would be very happy to invest some time to contribute the code upstream.

@FiloSottile
Contributor

@stffabi thank you for looking into it and prototyping a fix! With such a clear benchstat, I think there is a good chance the change would get accepted into the compiler for Go 1.17.

Retitling the issue, and cc @randall77 for cmd/compile/mips.

@FiloSottile FiloSottile changed the title x/crypto/chacha20poly1305: significant performance drop on mipsle cmd/compile: intrinsify bits.RotateLeft32 on mipsle Mar 15, 2021
@randall77
Contributor

Yes, if you have a patch for making bits.RotateLeft32 an intrinsic please send it our way.

stffabi added a commit to stffabi/go that referenced this issue Mar 15, 2021
This CL implements the ROTR & ROTRV instructions for
MIPS and MIPS64, which are mips32r2 instructions.

Additionally bits.RotateLeft32 is now intrinsic and will be
rewritten to ROTR during the SSA phase.

This brings roughly a 65-70% improvement on mipsle
code running Chacha20Poly1305 on a MT7688:

goos: linux
goarch: mipsle
pkg: golang.org/x/crypto/chacha20poly1305
name                         old time/op    new time/op    delta
Chacha20Poly1305/Open-16       56.2µs ±20%    38.5µs ±40%   -31.45%  (p=0.001 n=8+10)
Chacha20Poly1305/Seal-16       68.3µs ±49%    30.6µs ±13%   -55.14%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-64       67.5µs ±22%    37.8µs ±19%   -43.98%  (p=0.000 n=9+9)
Chacha20Poly1305/Seal-64       64.7µs ±10%    37.6µs ± 8%   -41.96%  (p=0.000 n=9+8)
Chacha20Poly1305/Open-256       151µs ±13%      89µs ±20%   -41.03%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-256       148µs ±19%      93µs ±35%   -37.15%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-1024      456µs ±16%     260µs ±23%   -42.95%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-1024      469µs ±14%     254µs ±15%   -45.88%  (p=0.000 n=10+9)
Chacha20Poly1305/Open-8192     3.59ms ±23%    1.94ms ±15%   -45.86%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-8192     3.47ms ±20%    2.03ms ±22%   -41.60%  (p=0.000 n=9+10)
Chacha20Poly1305/Open-16384    7.01ms ± 9%    4.22ms ±22%   -39.89%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-16384    7.43ms ±19%    4.23ms ±11%   -43.04%  (p=0.000 n=10+9)

name                         old speed      new speed      delta
Chacha20Poly1305/Open-16      258kB/s ±46%   431kB/s ±32%   +67.05%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-16      246kB/s ±35%   527kB/s ±13%  +114.23%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-64      927kB/s ±31%  1664kB/s ±22%   +79.50%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-64      993kB/s ±10%  1709kB/s ± 8%   +72.02%  (p=0.000 n=9+8)
Chacha20Poly1305/Open-256    1.70MB/s ±13%  2.90MB/s ±18%   +70.88%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-256    1.74MB/s ±17%  2.81MB/s ±28%   +61.16%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-1024   2.26MB/s ±15%  3.99MB/s ±20%   +76.38%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-1024   2.20MB/s ±13%  3.92MB/s ±32%   +78.82%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-8192   2.31MB/s ±19%  4.24MB/s ±14%   +83.72%  (p=0.000 n=10+10)
Chacha20Poly1305/Seal-8192   2.30MB/s ±29%  4.09MB/s ±19%   +77.66%  (p=0.000 n=10+10)
Chacha20Poly1305/Open-16384  2.34MB/s ±10%  3.93MB/s ±19%   +68.04%  (p=0.000 n=9+10)
Chacha20Poly1305/Seal-16384  2.23MB/s ±17%  3.79MB/s ±23%   +70.00%  (p=0.000 n=10+10)

Fixes golang#39139

Signed-off-by: stffabi <stffabi@users.noreply.github.com>
@gopherbot

Change https://golang.org/cl/301711 mentions this issue: cmd/compile/mips: intrinsify bits.RotateLeft32 on MIPS

@stffabi

stffabi commented Mar 15, 2021

Thanks @FiloSottile for your very fast reply and for routing this to the appropriate person on the Go team.

@randall77 I've created PR #45028 with the patch. It's my first contribution to Go, so there might be some things in the code that need to be sorted out before it can be merged 😄

The PR also contains the changes for MIPS64, but unfortunately I don't have access to a MIPS64 machine to test them.

Thanks in advance for taking the time to look into the PR.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@stffabi

stffabi commented Aug 21, 2023

I've started porting assembly versions of ChaCha20 and Poly1305 to Go assembler for the MIPSLE platform. The results are quite good for a small MT7688 SoC.

                │     old      │                 new                     │
geomean         │     B/s      │      B/s       vs base                  │
poly1305:           5.293Mi         51.07Mi        +864.84%
ChaCha20:           4.684Mi         18.54Mi        +295.71%
ChaCha20Poly1305:   2.243Mi         11.79Mi        +425.71%

@FiloSottile do you see any chance to get those Go Assembler implementations for MIPSLE into x/crypto? I would be very happy to invest some time to contribute the code upstream.

Projects
Status: Triage Backlog