
cmd/asm: add neon, vector instructions for arm #7300

Open
gopherbot opened this issue Feb 10, 2014 · 13 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gopherbot

by byron.rakitzis:

go1.2

In contrast to the amd64 port, the arm port of the Go assembler does not recognize SIMD
instructions ("V…") or vector registers (D or Q).

It would be useful for us (we are writing custom speedups for a project using Intel SSE,
and would like to do the same for ARM), but it would also be useful for the Go library
itself if the library functions that have SIMD speedups in xxx_amd64.s had analogous
speedups in xxx_arm.s.

Thank you,

Byron Rakitzis.
@davecheney
Contributor

Comment 1:

Byron,
Could you please list the complete set of instructions you need.

Status changed to WaitingForReply.

@gopherbot
Author

Comment 2 by byron.rakitzis:

Certainly,
These include:
load, store, move, table lookup, shift, xor:
VLD1
VST1
VMOV
VSHR
VTBL
VEOR
This is highly selective, of course. We also use the D and Q register set
(I hope that is self-evident).
Byron.
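To make the semantics of a few of the listed instructions concrete, here is a portable Go model of what VEOR, VSHR.U8, and single-register VTBL.8 compute on a 64-bit D register viewed as eight byte lanes. It is only a reference model (for example, usable as the pure Go fallback or as a test oracle for a hand-written NEON routine), not assembler support. The type D and the function names are hypothetical, and only the 8-bit-lane, one-register-table forms are modeled.

```go
package main

import "fmt"

// D is a hypothetical model of a 64-bit NEON D register as eight 8-bit lanes.
type D [8]byte

// veor models VEOR: bitwise exclusive OR, lane by lane.
func veor(a, b D) D {
	var r D
	for i := range r {
		r[i] = a[i] ^ b[i]
	}
	return r
}

// vshrU8 models VSHR.U8 #n (n in 1..8): logical right shift of every lane.
func vshrU8(a D, n uint) D {
	var r D
	for i := range r {
		r[i] = a[i] >> n
	}
	return r
}

// vtbl1 models VTBL.8 with a one-register table: each byte of idx selects a
// byte of table, and any index outside 0..7 yields 0 in that lane.
func vtbl1(table, idx D) D {
	var r D
	for i, j := range idx {
		if int(j) < len(table) {
			r[i] = table[j]
		}
	}
	return r
}

func main() {
	table := D{10, 11, 12, 13, 14, 15, 16, 17}
	idx := D{7, 6, 5, 4, 3, 2, 1, 200} // 200 is out of range
	fmt.Println(vtbl1(table, idx))        // [17 16 15 14 13 12 11 0]
	fmt.Println(veor(table, table))       // [0 0 0 0 0 0 0 0]
	fmt.Println(vshrU8(D{0xF0, 0x0F}, 4)) // [15 0 0 0 0 0 0 0]
}
```

The out-of-range behavior is the main difference between VTBL and its sibling VTBX, which leaves the destination lane unchanged instead of writing 0.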

@robpike
Contributor

robpike commented Feb 13, 2014

Comment 3:

Marking 1.3Maybe. This is a minor, non-breaking change that could open up some
significant performance improvements.

Labels changed: added repo-main, release-go1.3maybe.

Status changed to Accepted.

@rsc
Contributor

rsc commented Apr 3, 2014

Comment 4:

Labels changed: added release-go1.4, removed release-go1.3maybe.

@rsc
Contributor

rsc commented Sep 15, 2014

Comment 5:

I'd like to see this happen but it's going to have to wait for the next release.

Labels changed: added release-go1.5, removed release-go1.4.

@davecheney
Contributor

This will not happen for the 1.5 release.

@rsc rsc changed the title cmd/5a: no support for NEON instructions or vector register set in ARM assembler cmd/asm: add neon, vector instructions for arm Jun 8, 2015
@rsc
Contributor

rsc commented Nov 4, 2015

I agree this would be useful, and I apologize that we haven't had a chance to do it yet. Note that if you really need the instructions you can figure out what the encodings are (for example using the GNU assembler) and then use WORD directives to insert them in your assembly. I know that's less than ideal, but it's a workaround.

Right now there's more we'd like to do than we have bandwidth for, so the reality is that this one is unplanned.
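A minimal sketch of the WORD workaround: take the 32-bit instruction word that a disassembler such as GNU objdump prints for an instruction assembled with the GNU assembler, and emit it as a Go assembler `WORD $0x...` directive with the original mnemonic kept as a comment. The helper name wordDirective is hypothetical, and the encoding itself must still come from the GNU toolchain; the example uses 0xe1a00000, the classic A32 encoding of `mov r0, r0`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// wordDirective converts a 32-bit instruction word, written in hex as a
// disassembler such as GNU objdump prints it, into a Go assembler WORD
// directive, keeping the mnemonic as a comment for readers.
func wordDirective(hexWord, mnemonic string) (string, error) {
	enc, err := strconv.ParseUint(strings.TrimPrefix(hexWord, "0x"), 16, 32)
	if err != nil {
		return "", fmt.Errorf("bad instruction word %q: %w", hexWord, err)
	}
	return fmt.Sprintf("WORD $0x%08x // %s", enc, mnemonic), nil
}

func main() {
	// 0xe1a00000 is the A32 encoding of "mov r0, r0".
	line, err := wordDirective("e1a00000", "mov r0, r0")
	if err != nil {
		panic(err)
	}
	fmt.Println(line) // WORD $0xe1a00000 // mov r0, r0
}
```

A WORD only inserts raw bits: the Go assembler does not know which registers such an instruction reads or writes, so the surrounding hand-written code has to account for that itself.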

@rsc rsc modified the milestones: Unplanned, Go1.6 Nov 4, 2015
rkusa added a commit to rkusa/gm that referenced this issue Jan 10, 2016
... however, 'implementation' in this case just means falling back to the
pure Go implementation, because Go assembly does not yet support ARM
NEON instructions (which are the equivalent of SSE).
See golang/go#7300
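The fallback described in this commit is commonly structured so that callers go through a single entry point that defaults to pure Go. A minimal sketch, assuming a hypothetical xorBytes operation: architecture-specific files (selected by build constraints or by a runtime CPU check) could reassign the variable in an init function once an assembly version exists.

```go
package main

import "fmt"

// xorBytesGeneric is the portable pure Go version: it sets dst[i] = a[i] ^ b[i]
// over the shortest of the three slices and returns the number of bytes written.
func xorBytesGeneric(dst, a, b []byte) int {
	n := len(dst)
	if len(a) < n {
		n = len(a)
	}
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		dst[i] = a[i] ^ b[i]
	}
	return n
}

// xorBytes is the entry point callers use. It defaults to the pure Go version;
// a hypothetical arch-specific file (e.g. one backed by NEON code emitted with
// WORD directives) could reassign it in an init function.
var xorBytes = xorBytesGeneric

func main() {
	dst := make([]byte, 4)
	n := xorBytes(dst, []byte{1, 2, 3, 4}, []byte{1, 1, 1, 1})
	fmt.Println(n, dst) // 4 [0 3 2 5]
}
```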
agl referenced this issue in golang/crypto Oct 14, 2016
This change adds a package, chacha20poly1305, which implements the
ChaCha20-Poly1305 AEAD from RFC 7539. This AEAD has several attractive
features:
   1. It's naturally constant time. AES-GCM needs either dedicated
      hardware or extreme effort to be fast and constant-time, while
      this design is easy to make constant-time.
   2. It's fast on modern processors: it runs at 1GB/s on my Ivy Bridge
      system.
   3. It's seeing significant use in TLS. (A change for crypto/tls is
      forthcoming.)

This change merges two CLs:
  https://go-review.googlesource.com/#/c/24717
  https://go-review.googlesource.com/#/c/26691

I took the amd64-optimised AEAD implementation from the former because
it was significantly faster. But the structure of the change is taken
from the latter.

This version will be checked into x/crypto. This package will then be
vendored into the stdlib so that it can be used from crypto/tls.

Change-Id: I5a60587958b7afeec81ca1091e603a7e8517000b
Reviewed-on: https://go-review.googlesource.com/30728
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
stapelberg added a commit to stapelberg/scan2drive that referenced this issue May 5, 2017
This conversion used to take about 3s (for 4960x7016 RGB pixels).
With the new code it takes about 300ms.

More potential for improvement: we could run this code while reading
pixels via USB.

I haven’t looked at the USB timing in detail, but my guess is that we
could squeeze this post-processing into the time between requesting
data from the device and receiving the data from the kernel.

If that doesn’t work out, we could parallelize and post-process the
previous buffer while reading the current buffer.

Note that we need to use the WORD instruction because the Go assembler
is lacking support for the NEON instructions, see
golang/go#7300

related to issue #7
@kriskwiatkowski

kriskwiatkowski commented Mar 7, 2018

Any idea how much effort is needed to implement NEON support? The "quick guide to Go's assembler" says that updating Go's assembler is "straightforward", but I'm looking for more details. Could someone point me to a PR or diff for a similar change that has already been done (for example, SIMD for Intel)?
Many thanks

@andybons andybons added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 7, 2018
@ALTree
Member

ALTree commented Mar 7, 2018

@henrydcase a change adding SSE4: https://golang.org/cl/57470

@smasher164
Member

Specialized code also has to feature-detect NEON, so a HasNEON flag needs to be added to internal/cpu (and correspondingly to x/sys/cpu). On Linux, the flag is hwcap_NEON = 1 << 12.

@shibukawa

https://translate.google.com/translate?sl=ja&tl=en&u=https://future-architect.github.io/articles/20201203/

I ran https://github.com/SimonWaldherr/golang-benchmarks on an M1, a 10th-gen Core i5, and a Ryzen 9, and got interesting results:

  • The M1 is much faster than the Core i5/Ryzen (it generally took 33% to 50% less time to complete).
  • The CRC32, SHA1, and SHA256 tests took much longer on the M1 than on the other CPUs, and longer than under Rosetta 2 translation.

I think the native M1 Go implementation doesn't use NEON, whereas Rosetta 2 translates SSE instructions into NEON. Reading the hash/crc32 code, only amd64.s uses SIMD instructions. So I suppose this issue is important for improving ARM benchmark results.

@gilbahat

gilbahat commented Sep 8, 2022

Any news here?
It's 2022; ARM instances are available across the three major clouds...

@kokroo

kokroo commented Feb 16, 2023

[Quoting @shibukawa's M1 vs. Core i5/Ryzen 9 benchmark comment above.]

I second your statement. It's 2023: ARM is getting more popular by the day, and ARM servers are now available from all three major cloud service providers. On top of that, Apple Silicon's performance is absolutely phenomenal, and golang with NEON support on Apple Silicon would be just amazing.

Rust already supports it, so I think it's high time golang supported it too.
