proposal: cmd/compile: intrinsicify user defined assembly functions #17373
I think this would be a significant source of complexity for very little benefit. I do think we need a story for access to functionality like this, but a decent package API for bit twiddling, known to the compiler the same way math.Sqrt is known, should suffice. Also note that today the Go compiler runs before the assembler, because it generates go_asm.h for use by assembly code. In addition to the complexity of somehow interpreting these magic "inline assembly in another file" stanzas, we'd have to run the compiler, run the assembler, and run the compiler again. /cc @dr2chase |
The SIMD instructions don't map well to Go's strict typing.
For example, suppose we have SIMD types for [4]float32 and [8]uint16,
but some operations, such as the bitwise ones, apply to both. Then
either we define separate intrinsic functions that take the different
types, or we somehow make the function "generic" using compiler
magic. Neither solution is ideal.
Alternatively, we could define a generic 128-bit SIMD type that is
accepted by all SIMD intrinsics, and provide conversion functions that
extract the different kinds of components (float32, float64, uint8, uint16)
from the generic 128-bit type. This would reduce the number of
SIMD intrinsic functions, but then it's like programming with
interface{} only.
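A minimal sketch of that second option, assuming a hypothetical Vec128 type and conversion helpers (none of these names exist in any real package; the bodies are portable pure-Go fallbacks so the sketch runs, whereas a real implementation would keep the value in a vector register):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Vec128 is a hypothetical generic 128-bit SIMD value of the kind
// discussed above; the name and helpers are illustrative only.
type Vec128 [16]byte

// FromFloat32x4 packs four float32 lanes into the generic value.
func FromFloat32x4(f [4]float32) (v Vec128) {
	for i, x := range f {
		binary.LittleEndian.PutUint32(v[4*i:], math.Float32bits(x))
	}
	return
}

// Float32x4 reinterprets the 128 bits as four float32 lanes.
func (v Vec128) Float32x4() (f [4]float32) {
	for i := range f {
		f[i] = math.Float32frombits(binary.LittleEndian.Uint32(v[4*i:]))
	}
	return
}

// And is a lane-agnostic bitwise operation: it is meaningful no matter
// which element type the 128 bits currently hold, which is what makes
// the single generic type attractive (and also interface{}-like).
func (v Vec128) And(w Vec128) (r Vec128) {
	for i := range v {
		r[i] = v[i] & w[i]
	}
	return
}

func main() {
	a := FromFloat32x4([4]float32{1, 2, 3, 4})
	fmt.Println(a.And(a).Float32x4()) // [1 2 3 4]
}
```

The drawback the comment points out is visible here: nothing stops a caller from packing uint16 lanes and then extracting float32 lanes, just as nothing stops a type assertion on the wrong dynamic type of an interface{}.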
|
In Go 1.8 we intrinsify some operations in runtime/internal/sys which includes CLZ mentioned in this issue. The intrinsification is easy to extend to a user-exposed package (we already do so for sync/atomic). The hard part is coming up with APIs to access these ops. The instructions across architectures can vary on what they do in corner cases (e.g. CLZ with a zero argument). Do you expose that in the API, or introduce fixup code to make the API clean but slower? We can afford to be sloppy in runtime/internal but for something external and under the Go1 guarantee it is harder. Doesn't help with generic assembly inlining, but it may help a bunch of the common cases. |
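To illustrate the corner-case question, a pure-Go reference for CLZ (the function name is illustrative; at the time of this thread the operation was only exposed inside runtime/internal/sys). On ARM, CLZ defines the zero-input result as the word size, but the natural x86 lowering via BSR leaves a zero input undefined, so a clean cross-architecture API needs either a documented corner case or a small fixup branch:

```go
package main

import "fmt"

// leadingZeros32 returns the number of leading zero bits in x.
// The x == 0 branch is exactly the "fixup code" trade-off discussed
// above: it makes the API well defined everywhere, at the cost of one
// highly predictable branch on architectures whose count-leading-zeros
// instruction leaves the zero case undefined.
func leadingZeros32(x uint32) int {
	if x == 0 {
		return 32 // the corner case an x86 BSR lowering must patch up
	}
	n := 0
	for x&(1<<31) == 0 {
		x <<= 1
		n++
	}
	return n
}

func main() {
	fmt.Println(leadingZeros32(0), leadingZeros32(1), leadingZeros32(1<<31)) // 32 31 0
}
```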
Yeah, designing an API to access the special instructions is the hard part.
I think the essence of the issue is to provide some kind of user-defined
SSA rewrite rules (ones that can expand to instructions unknown to
cmd/internal/obj).
We can be creative; the end-user interface doesn't have to be an assembly
file (in fact, using an assembly file as input means the compiler must
somehow infer the allowed registers, or be forced to use the ones the
assembly function used).
|
I would really love a fast bit-twiddling package in std. I'd suggest introducing fixup code to make the API clean but slower. I suspect most of these cleanups would be a single highly predictable branch anyway. For many bit-twiddling operations we already have two existing users, the runtime and math/big. That should help some with API design.
One option is to have package-specific SIMD types, which correspond to SIMD register widths. Then all functions in the package operate on those types, e.g.
Interesting. Something a la cgo? But then we have a third DSL to parse, maintain, etc. Seems like we should start with bit-twiddling and maybe SIMD. |
I think this would be a serious mistake. It exposes compiler internals that are internals. We do not want them to become the public API of the toolchain. The other problem with designing a general low-level mechanism is that then you're asking everyone to be low-level users. Instead, find the higher-level things that are not easily accessed (vector math, bit twiddling) and write a good API for them. Yes, good API design is hard. All the more reason to do it once. |
Related to bit-twiddling, I made a survey of the available libraries before #10757. On that bug Rob opposed including a bit-twiddling library in the standard library. Would it be fine for the compiler to specially support an external library (e.g. golang.org/bittwiddling)? If one were to work on a proposal for a bit-twiddling library, what would the timeline be for inclusion in Go 1.9? |
I don't think it makes sense for the compiler to reference anything outside the stdlib. With semantic inlining, you really want the code being inlined and the compiler to be released in lock step. A proposal for a bit-twiddling library would be good. We'd probably want that out by Feb 1 so we can discuss and implement it in time for 1.9. |
On Thu, Jan 5, 2017 at 9:40 AM, Russ Cox ***@***.***> wrote:
I think the essence of the issue is to provide some kind of user defined
SSA rewrite rules (that can expand to instructions unknown to
cmd/internal/obj).
I think this would be a serious mistake. It exposes compiler internals
that are *internals*. We do not want them to become the public API of the
toolchain. The other problem with designing a general low-level mechanism
is that then you're asking everyone to be low-level users.
Instead, find the higher-level things that are not easily accessed (vector
math, bit twiddling) and write a good API for them. Yes, good API design is
hard. All the more reason to do it once.
I'm not proposing to expose the current SSA rules to the user. I'm proposing
to design and expose a suitable way to make arbitrary instruction sequences
available to Go programs as inlinable functions, and in fact I think most,
if not all, use cases just require making one instruction available to Go.
gcc's extended inline assembly is one such interface, but I think we can do
better (no, I'm not proposing inline assembly for Go; the instruction
templates should be in a separate file).
Designing a high-level API, by definition, won't remove all the need for
custom assembly functions, because the API can't be future-proof: processor
vendors will keep adding more and more specialized instructions.
And one more reason why we should do it at the tool level, not as a std
package: the Go 1 compatibility promise makes designing a good abstraction
pretty hard, and because of the nature of this problem, we can't design the
API in a package in a subrepo (e.g. x/simd) like context or http2 and later
move it into std when it's mature.
Instead, we can start with the tool-level support and see what the community
comes up with. When a really nice API emerges, we can then absorb it as a
standard package. Exposing the low-level mechanism makes experiments with
the API possible.
Of course, there is an alternative way to do the API experiments: we can
have a package in a subrepo, with functions backed by assembly. But I'd
argue this approach won't work, because the function call overhead is
prohibitive for small intrinsics, and if the package doesn't bring
significant speedups, users simply won't use it. And without many users,
how can we evaluate the API decisions?
|
I'd really like to focus our development efforts on whatever the high-level needs are (bit twiddling, etc) and not on the low-level out-of-line inline assembly mechanism. |
To be clear, we can easily imagine a beautiful system along these lines that "just works" and makes things transparently run faster with no effort by users. But that's a huge effort that must be implemented, debugged, and maintained (by the core team). We don't believe that this effort is a high enough priority right now compared to other efforts. -rsc for @golang/proposal-review |
On Mon, Jan 9, 2017 at 4:36 PM, Russ Cox ***@***.***> wrote:
To be clear, we can easily imagine a beautiful system along these lines
that "just works" and makes things transparently run faster with no effort
by users. But that's a huge effort that must be implemented, debugged, and
maintained (by the core team). We don't believe that this effort is a high
enough priority right now compared to other efforts.
I don't understand what the proposal has to do with "transparently making
things run faster with no effort by users", and I don't even think that's
possible.
Additionally, as Keith said, any such inlining mechanism must be
implemented, debugged, and maintained by the core team, because the
compiler can't inline intrinsics defined by outside parties due to
possible version skew problems.
How do you propose to design a good intrinsics library in std in six months
and immediately lock its API in the following Go release? I'm not talking
about simple bit-twiddling operations like POPCNT, CLZ, or the like; I'm
talking about the more processor-specific SIMD and crypto instructions.
I think no one has ever designed such a library that is portable across
all our supported architectures.
|
For the specific case of crypto, we seem to be doing OK by (1) defining good, portable Go APIs, and (2) taking advantage of arch-specific hardware acceleration in a way that's transparent to the end user. I don't see why we can't do that in other settings too. They don't all have to happen in the same release. We can build things up incrementally. |
I have some application code where I want to use SIMD intrinsics. It would be good to be able to write that in a way that lets me take advantage of the compiler's other general smarts, so I definitely see Minux's motivation. I understand that for now at least this is a niche feature and probably entails a lot of additional complexity in the compiler. I'd still be curious to see this sketched or prototyped by any interested parties, even if it doesn't end up being implemented. Doing 32 multiplies per instruction in AVX-512 sounds very appealing!
var a, b, dst [32]int16
asm.VPMULLW(a, b, dst)
I wonder whether you could achieve a similar effect with a source code transformation, but I guess that would amount to writing (or forking) a compiler. I'd love to give a brief shoutout to PeachPy, which uses Python as a DSL to generate Go asm, and which I have found lowers the cognitive burden of writing assembly. It would be interesting to see what it would look like if you used Go as the DSL. Probably not much like Go, but maybe more manageable for engineers than assembly? |
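A portable sketch of what the hypothetical asm.VPMULLW call above would compute (asm is not a real package; the real VPMULLW instruction does an elementwise 16-bit multiply, keeping the low 16 bits of each product, across all 32 lanes of a 512-bit register at once):

```go
package main

import "fmt"

// vpmullw is a pure-Go model of the AVX-512 VPMULLW instruction:
// elementwise 16-bit multiply, low half of each product. Go's int16
// multiply already truncates to 16 bits, matching the instruction.
func vpmullw(a, b *[32]int16) (dst [32]int16) {
	for i := range dst {
		dst[i] = a[i] * b[i]
	}
	return
}

func main() {
	var a, b [32]int16
	for i := range a {
		a[i] = int16(i)
		b[i] = 3
	}
	dst := vpmullw(&a, &b)
	fmt.Println(dst[0], dst[1], dst[31]) // 0 3 93
}
```

The appeal of an intrinsic form is that this entire loop collapses to one instruction while the compiler still handles register allocation and scheduling around it.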
@rsc: as input to "whatever the high-level needs are": the following primitives have a large effect on the speed of gonum programs. The most important are Scale ( We have coded our own assembly implementations of them based on real speedups. It would be great to have cooperation with the compiler, for specific processors, to make these operations as fast as possible. |
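For context, the scalar forms of kernels like these (the names follow the BLAS level-1 convention that gonum uses; the exact list of primitives in the comment above did not survive extraction) are tight loops that vectorize well:

```go
package main

import "fmt"

// axpy computes dst[i] += alpha*x[i], a classic BLAS level-1 kernel of
// the kind gonum hand-codes in assembly; a SIMD-aware compiler could
// vectorize this loop directly.
func axpy(alpha float64, x, dst []float64) {
	for i, v := range x {
		dst[i] += alpha * v
	}
}

// scale computes x[i] *= alpha, another such kernel.
func scale(alpha float64, x []float64) {
	for i := range x {
		x[i] *= alpha
	}
}

func main() {
	x := []float64{1, 2, 3}
	dst := []float64{10, 10, 10}
	axpy(2, x, dst)
	scale(10, x)
	fmt.Println(dst, x) // [12 14 16] [10 20 30]
}
```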
@btracey sounds interesting, do you have benchmark numbers? :) |
Yea. The code is https://godoc.org/github.com/gonum/blas/native Comparing
This translates to about a 10% performance improvement on a real program I have.
On Jan 9, 2017 9:12 PM, "Russ Cox" <notifications@github.com> wrote:
For the specific case of crypto, we seem to be doing OK by (1) defining
good, portable Go APIs, and (2) taking advantage of arch-specific hardware
acceleration in a way that's transparent to the end user. I don't see why
we can't do that in other settings too. They don't all have to happen in
the same release. We can build things up incrementally.
But note that crypto instructions aren't necessarily used for their
intended application; one prime example is our aeshash algorithm.
I agree that high-level APIs are better, but this issue is about making
the intrinsics available. If I understand you correctly, you rejected
user-defined intrinsics and instead want pre-defined intrinsics in a std
package. I'm wondering how you propose to do that (again, not simple
bit twiddling, but SIMD).
|
I think predefined intrinsics provided by the language will go a long way. Something like
I also agree with @btracey - having written enough assembly (though most of it is generated from C) for Gorgonia's vector-related code (and also leveraging Gonum), I'd really love it if we could get help from the compiler on this front, or some integrated dev tools from the Go team. |
I've been thinking about a potential implementation. The first obstacle is
making the compiler aware of SIMD register classes (and their aliasing
with the floating-point registers).
We need to overcome this whether we choose user-defined intrinsics or not.
In fact, solving this means we need to define a suitable Go type to
represent them. As I mentioned earlier, this is hard because we don't have
unions, and this problem is precisely the reason I don't think a
pre-defined intrinsics package could succeed.
I think we all see the benefit of having intrinsics: choosing the correct
instructions is the hard part that must be (at least for now) solved by
humans, but choosing the registers is best left to compilers.
|
Minux, would you mind spelling out the SIMD design impossibility you see in a bit more detail? I'm not following it. Relatedly, would you mind spelling out a bit why (say) the simd.Reg128 type I suggested above is unsuitable? Note that reinterpreting the bits in a vector is a common thing to want to do for performance (which is the only point of SIMD), so its flexibility about the type of things it contains is a good thing. Lastly, an anecdote. The last time I made heavy use of SIMD was with NEON, five or six years back, and the compiler made such abominable decisions with intrinsics (mostly around spilling and restoring registers) that I ended up dropping down to assembly anyway, just to be able to fully utilize all the registers and schedule operations to make good use of the pipeline. I don't know how much better clang is now, but the experience makes me think that a good SIMD package should be fairly close to the metal and give users a relatively large amount of control. I say this not as an argument for generic inline assembly support but as a consideration for SIMD package design. But honestly, I'd rather see bit twiddling get tackled first. There's probably plenty to learn there, and I suspect it has a broader audience, including known stdlib uses in the runtime and math/big. |
Isn't programming with Reg128 like programming with interface{}? Even the
C intrinsics separate __m128, __m128d, __m256, __m256i, and __m256d.
Ideally, if we were to design a beautiful interface, I'd want to use
[8]float32, [16]uint16, etc.
I should clarify my last comment: it's certainly possible to design one
package given enough time, but my point is that it's impossible to
design one in one release cycle, and we still need a way for user-defined
intrinsics to allow experiments on the API. There is no existing interface
to adapt (and since we can start from a clean slate, why copy the existing
cumbersome intrinsics?).
I think this follows the general Go philosophy of providing a general
mechanism and gaining experience through it. I just don't understand the
resistance. Russ's comment seems to indicate it would waste too much
effort, but I think designing a proper std SIMD package requires even more
effort than adding user-defined intrinsics, because even making a single
instruction available could relieve a lot of the need for custom assembly.
A bit-twiddling package would be nice, but I'm afraid it won't solve much
of the problem for this issue. For example, it doesn't even need the new
register class support that is crucial for SIMD intrinsics.
|
One way to tackle the new register class support problem is to teach the compiler to use SIMD instructions itself when their presence is guaranteed, for the obvious/easy cases (simple for loops, some zeroing).
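The "obvious/easy cases" meant here are loops like the following (a sketch; whether and when a compiler actually rewrites them into wide vector stores is an implementation detail, typically guarded by CPU-feature checks):

```go
package main

import "fmt"

// zero clears a byte slice one element at a time; with guaranteed SIMD
// support the compiler could emit 128- or 256-bit stores instead of
// byte-at-a-time stores, exercising the vector register class support
// discussed above without any user-visible API.
func zero(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

func main() {
	b := []byte{1, 2, 3, 4}
	zero(b)
	fmt.Println(b) // [0 0 0 0]
}
```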
In my case, the resistance comes from concern that the scope of the generic proposal (including initial implementation, maintenance, and compatibility impact) will end up being much bigger than it initially appears--and it initially appears pretty big, at least to me. |
I think most of the effort lies in robust vector register class support;
the actual user-defined intrinsics part is tiny compared to that. For this
to work, the compiler must know how to save and restore SIMD registers and
support them in the rewrite rules.
I agree we can first make the compiler use simple SIMD to clear small
chunks of memory, and then experiment with SIMD rewrite rules. Once both
are available, the user-defined intrinsics should be easy enough to add.
We just need a language (e.g. JSON) to encode the actual rule (again, we
don't need the entire SSA rewrite rule machinery, as SIMD instructions
are generally very regular).
But could we reopen the issue and put it in Thinking status? (@rsc) We now
have a clear way forward. The first steps are required whether we implement
this proposal or not; when they are ready, we can re-evaluate the cost of
implementing this (assuming the only concern is the perceived complexity
of this proposal; if not, please elaborate).
|
@josharian GCC at least does quite well with intrinsics these days. I wrote C code to do the same thing as Keith's aeshash code in runtime/asm_amd64.s, and the compiled result was essentially the same as the asm code. https://github.com/golang/gofrontend/blob/master/libgo/runtime/aeshash.c |
Another use of intrinsics in C, a previous effort at Oracle to speed up CRC calculations. Start at line 425: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/src/share/native/java/util/zip/CRC32.c.html The compiler's main role in this was to be a stenographer for XMM code. (Pay no attention to the assembly encoded with .byte -- I used a modern compiler to generate the code, but had to feed it to an archaic assembler that didn't speak "modern" AMD64.) |
As a simple addition to my benchmarks above, here is the profile for Cholesky decomposition:
|
I'd like the compiler to intrinsicify user-defined assembly functions.
Here is a simple sketch of the user interface:
we introduce a CLOBBER pseudo-instruction to describe that, after a
certain instruction, certain registers are clobbered.
The user then defines such an assembly function in a package:
TEXT ·Clz(SB), NOSPLIT, $0-8
MOVW 0(FP), R0 // compiler recognizes this as the input read
CLZ R0, R1 // the new instruction template
CLOBBER [R2-R3, R11] // tells the compiler to spill these registers before using this instruction template
MOVW R1, 4(FP) // compiler recognizes this as the output write
RET
The user effectively teaches the compiler a new rule for expanding calls
to the Clz function into the custom instruction CLZ.
If we ignore the possibility of doing register allocation for the new
instruction template, I think intrinsicifying such assembly functions
with the current SSA backend is within reach.
Of course, this has implications for safety, but if the user is already
writing assembly, that shouldn't be a big concern.