proposal: cmd/compile: intrinsicify user defined assembly functions #17373

minux · 2016-10-07T00:28:39Z

I'd like the compiler to intrinsicify user defined assembly functions.

here is a simple sketch of the user interface:
we introduce CLOBBER pseudo instruction to describe that after
certain instruction, certain registers are clobbered.

When the user defines such an assembly function in a package:
TEXT ·Clz(SB), NOSPLIT, $0-4
MOVW 0(FP), R0 // compiler recognizes this as input read
CLZ R0, R1 // the new instruction template
CLOBBER [R2-R3, R11] // as an example of CLOBBER, so that compiler know to spill those registers before using this instruction template
MOVW R1, 4(FP) // compiler recognizes this as output write
RET

The user effectively tells the compiler a new rule to expand call
to Clz function with a custom instruction CLZ.

If we ignore the possibility of doing register allocation for the new
instruction template, I think instrinsicify such assembly functions
with the current SSA backend is within reach.

Of course, this has implications on safety, but if the user is already
using assembly, then that shouldn't be a big concern.

rsc · 2017-01-04T15:40:40Z

I think this would be a significant source of complexity for very little benefit. I do think we need a story for access to functionality like this, but a decent package API for bit twiddling, known to the compiler the same way math.Sqrt is known, should suffice.

Also note that today the Go compiler runs before the assembler, because it generates go_asm.h for use by assembly code. In addition to the complexity of somehow interpreting these magic "inline assembly in another file" stanzas, we'd have to run the compiler, run the assembler, and run the compiler again.

/cc @dr2chase

minux · 2017-01-04T19:31:54Z

The SIMD instructions don't map well to Go's strict typing. For example, suppose we have SIMD types for [4]float32, [8]uint16, but some functions like bitwise operations could apply to both, either we define separate intrinsic functions that take different types, or we somehow make the function "generic" using compiler magic. Neither solution is ideal. Or we can define a generic 128-bit SIMD type that is accepted by all SIMD intrinsics, but provide conversion functions that extract different kinds (float32, float64, uint8, uint16) of component from the generic 128-bit type. This will reduce the number of SIMD intrinsics functions but then it's like programming with interface{} only.

randall77 · 2017-01-04T21:16:00Z

In Go 1.8 we intrinsify some operations in runtime/internal/sys which includes CLZ mentioned in this issue.

The intrinsification is easy to extend to a user-exposed package (we already do so for sync/atomic). The hard part is coming up with APIs to access these ops. The instructions across architectures can vary on what they do in corner cases (e.g. CLZ with a zero argument). Do you expose that in the API, or introduce fixup code to make the API clean but slower? We can afford to be sloppy in runtime/internal but for something external and under the Go1 guarantee it is harder.

Doesn't help with generic assembly inlining, but it may help a bunch of the common cases.

minux · 2017-01-04T21:26:20Z

Yeah, designing an API to access the special instructions is the hard part. I think the essence of the issue is to provide some kind of user defined SSA rewrite rules (that can expand to instructions unknown to cmd/internal/obj). We can be creative, the end user interface doesn't have to be an assembly file (in fact, using the assembly file as input means the compiler must somehow infer the allowed register, or forced to use the one used by the assembly function.)

josharian · 2017-01-04T21:59:50Z

The hard part is coming up with APIs to access these ops. The instructions across architectures can vary on what they do in corner cases (e.g. CLZ with a zero argument). Do you expose that in the API, or introduce fixup code to make the API clean but slower?

I would really love a fast bit-twiddling package in std. I'd suggest introducing fixup code to make the API clean but slower. I suspect most of these cleanups would be a single highly predictable branch anyway. For many bit-twiddling operations we already have two existing users, the runtime and math/big. That should help some with API design.

The SIMD instructions don't map well to Go's strict typing.

One option is to have package-specific SIMD types, which correspond to SIMD register widths. Then all functions in the package operate on those types, e.g. AddUint32x4(dst, src1, src2 Reg128). We'd provide a bunch of Load/Store functions, e.g. LoadUint64x2(p *[2]uint64) Reg128 and StoreUint64x2(p *[2]uint64, r Reg128). Maybe also pointer-free variants. This corresponds reasonably well to what happens under the hood--data gets loaded into a register, manipulated into that register, and then stored back into memory--which is probably a good thing for such low-level operations. It looks a fair amount like the C intrinsics (be that good or bad). And it'd make the job of the register allocator way easier.

We can be creative, the end user interface doesn't have to be an assembly
file

Interesting. Something a la cgo? But then we have a third DSL to parse, maintain, etc. Seems like we should start with bit-twiddling and maybe SIMD.

rsc · 2017-01-05T14:40:03Z

I think the essence of the issue is to provide some kind of user defined SSA rewrite rules (that can expand to instructions unknown to cmd/internal/obj).

I think this would be a serious mistake. It exposes compiler internals that are internals. We do not want them to become the public API of the toolchain. The other problem with designing a general low-level mechanism is that then you're asking everyone to be low-level users.

Instead, find the higher-level things that are not easily accessed (vector math, bit twiddling) and write a good API for them. Yes, good API design is hard. All the more reason to do it once.

brtzsnr · 2017-01-05T17:52:25Z

Related to bit-twiddling I made a survey of the available libraries before #10757. On the bug Rob opposed including bit-twidling library as a standard library. Is supporting an external library (e.g. golang.org/bittwiddling) by the compiler fine?

If one were to work on a proposal for a bit-twiddling, what would the timeline be for inclusion in Go1.9?

randall77 · 2017-01-05T18:08:07Z

I don't think it makes sense for the compiler to reference anything outside the stdlib. When semantic inlining, you really want the code you are semantic inlining and the compiler to be released in lock step.

A proposal for a bit-twiddling library would be good. Probably want that out by Feb 1 so we can discuss + implement in time for 1.9.
Check out runtime/internal/sys/intrinsics.go, we'd want at least these.

minux · 2017-01-05T22:53:25Z

On Thu, Jan 5, 2017 at 9:40 AM, Russ Cox ***@***.***> wrote: I think the essence of the issue is to provide some kind of user defined SSA rewrite rules (that can expand to instructions unknown to cmd/internal/obj). I think this would be a serious mistake. It exposes compiler internals that are *internals*. We do not want them to become the public API of the toolchain. The other problem with designing a general low-level mechanism is that then you're asking everyone to be low-level users. Instead, find the higher-level things that are not easily accessed (vector math, bit twiddling) and write a good API for them. Yes, good API design is hard. All the more reason to do it once.

I'm not proposing to expose the current SSA rule to the user. I'm proposing to design and expose a suitable way to make arbitrary instruction sequence available to Go programs as an inlinable function, and in fact, I think most, if not all, use cases just require one instruction to be available to Go. gcc's extended inline assembly is one such interface, but I think we can do better (no, I'm not proposing inline assembly for Go. The instruction templates should be in a separate file.) Designing a high-level API, by definition, won't remove all the needs for custom assembly functions, because the API can't be future-proof and processor vendors will keep adding more and more specialized instructions. And one more reason why we should do it at the tool level, not as a std package: the Go 1 compatibility makes designing good abstraction pretty hard, and because the nature of this problem, we can't design the API in a package in a subrepo (e.g. x/simd) like context or http2 and later move it into std when it's mature. Instead, we can start with the tool level support and see what the community come up with. When we have a really nice API come up, we can then absorb it as a standard package. Exposing the low-level mechanism makes experiments with the API possible. Of course, there is an alternative way to do the API experiments. We can have a package in a sub-repo, with functions backed by assembly functions. But I'd argue this approach won't work because the function call overhead is prohibitive for small intrinsics and if the package doesn't bring significant speedups, users simply won't use it. And without many users, how can we evaluate the API decisions?

rsc · 2017-01-09T21:32:48Z

I'd really like to focus our development efforts on whatever the high-level needs are (bit twiddling, etc) and not on the low-level out-of-line inline assembly mechanism.

rsc · 2017-01-09T21:36:57Z

To be clear, we can easily imagine a beautiful system along these lines that "just works" and makes things transparently run faster with no effort by users. But that's a huge effort that must be implemented, debugged, and maintained (by the core team). We don't believe that this effort is a high enough priority right now compared to other efforts.

-rsc for @golang/proposal-review

minux · 2017-01-10T00:34:08Z

On Mon, Jan 9, 2017 at 4:36 PM, Russ Cox ***@***.***> wrote: To be clear, we can easily imagine a beautiful system along these lines that "just works" and makes things transparently run faster with no effort by users. But that's a huge effort that must be implemented, debugged, and maintained (by the core team). We don't believe that this effort is a high enough priority right now compared to other efforts.

I don't understand how does the proposal have anything to do with "transparently make thing run faster with no effort by users". And I don't even think that's possible. Additionally, as Keith said, any such inlining mechanism must be implemented, debugged, and maintained by the core team because the compiler can't inline intrinsics defined by outside parties due to possible version skew problems. How do you propose to design a good intrinsics library in std in six months and immediately lock its API in the following Go release? I'm not talking about simple bit twiddling operations like POPCNT, CLZ or the like. I'm talking about more processor specific SIMD and crypto instructions. I think no one has ever designed such a library that is portable across all our support architectures.

rsc · 2017-01-10T02:11:59Z

For the specific case of crypto, we seem to be doing OK by (1) defining good, portable Go APIs, and (2) taking advantage of arch-specific hardware acceleration in a way that's transparent to the end user. I don't see why we can't do that in other settings too. They don't all have to happen in the same release. We can build things up incrementally.

pwaller · 2017-01-10T08:47:23Z

I have some application code where I want to use SIMD intrinsics. It would be good to be able to write that in a way that I could take advantage of the compiler's other general smarts, so I definitely see Minux's motivation. I understand that for now at least this is a niche feature and probably entails a lot of additional complexity in the compiler.

I'd still be curious to see this sketched or prototyped by any interested parties, even if it doesn't end up being implemented. Doing 32 multiplies per instruction in AVX512 sounds very appealing!

var a, b, dst [32]int16
asm.VPMULLW(a, b, dst)

I ponder if you could achieve a similar effect a source code transformation, but I guess that would amount to writing (or forking) a compiler. I'd love to give a brief shoutout to PeachPy which uses Python as a DSL to generate Go ASM, which I have found lowers the cognitive burden of writing assembly. It would be interesting to see what it would look like if you used Go as the DSL. Probably not much like Go, but maybe more managable for engineers than assembly?

btracey · 2017-01-11T16:48:40Z

@rsc: as input to "whatever the high-level needs are". The following primitives have a large effect on the speed of gonum programs. The most important are Scale (x = a*x), "Axpy (x = a*x+y)", "AxpyTo (z = a*x +y)" and Dot product (sum_i x_i * y_i), where x is []float64, []float32, or the complex equivalents. We use be both non-stided (standard Go slice) and strided versions. All implementations may assume that there is no non-trivial overlap between the inputs.

We have coded our own assembly implementations of them based on real speedups. It would be great to have cooperation with the compiler and for specific processors to make these operations as fast as possible.

pwaller · 2017-01-11T16:52:45Z

@btracey sounds interesting, do you have benchmark numbers? :)

btracey · 2017-01-11T17:22:29Z

Yea. The code is https://godoc.org/github.com/gonum/blas/native

Comparing go test -bench Dgemm -tags=noasm with go test -bench Dgemm

DgemmSmSmSm-8        1.88µs ± 6%  1.17µs ± 1%  -37.74%  (p=0.008 n=5+5)
DgemmMedMedMed-8      628µs ±17%   274µs ± 4%  -56.37%  (p=0.008 n=5+5)
DgemmMedLgMed-8      4.37ms ± 3%  1.80ms ± 7%  -58.74%  (p=0.008 n=5+5)
DgemmLgLgLg-8         411ms ± 4%   169ms ± 3%  -58.90%  (p=0.008 n=5+5)
DgemmLgSmLg-8        6.79ms ± 3%  4.03ms ± 5%  -40.69%  (p=0.008 n=5+5)
DgemmLgLgSm-8        4.83ms ± 3%  2.39ms ± 6%  -50.52%  (p=0.008 n=5+5)
DgemmHgHgSm-8         345ms ± 3%   172ms ± 3%  -50.10%  (p=0.008 n=5+5)
DgemmMedMedMedTNT-8   689µs ± 2%   461µs ± 2%  -33.10%  (p=0.008 n=5+5)
DgemmMedMedMedNTT-8   596µs ± 4%   252µs ± 4%  -57.70%  (p=0.008 n=5+5)
DgemmMedMedMedTT-8    762µs ± 5%   667µs ± 2%  -12.39%  (p=0.008 n=5+5)

This translates to about 10% performance on a real program I have.

minux · 2017-01-11T20:58:10Z

On Jan 9, 2017 9:12 PM, "Russ Cox" <notifications@github.com> wrote: For the specific case of crypto, we seem to be doing OK by (1) defining good, portable Go APIs, and (2) taking advantage of arch-specific hardware acceleration in a way that's transparent to the end user. I don't see why we can't do that in other settings too. They don't all have to happen in the same release. We can build things up incrementally. But note that crypto instructions aren't necessarily used for their intended application. One primary example is our aeshash algorithm. I agree that high-level API are better, but this issue is about making the intrisics available. If I understand you correctly, you rejected user-defined intrisics and instead want pre-defined intrisics in a std package. I'm wondering how do you propose to do that (again, not simple but twiddling, but SIMD.)

chewxy · 2017-01-11T21:03:35Z

I think predefined intrinsics provided by the language will go a long way. Something like __mm256 and _mm_INSERTMACRONAME_ would make life a lot easier for those working with SIMD related assembly.

I also agree with @btracey - having written enough assembly (tho most are generated from C) for Gorgonia's vector related code (and also leveraging Gonum), I'd really love if we could get help from the compiler on this front. Or some integrated dev tools from the Go team

minux · 2017-01-11T21:06:26Z

I've been thinking about a potential implementation. The first obstacle is making the compiler aware of SIMD register classes (and their aliases with floating point registers). We need to overcome this no matter we choose user-defined intrisics or not. In fact, solving this means we need to define a suitable Go type to represent them. As I mentioned earlier, this is hard because we don't have unions. And this problem is precisely the reason I don't think pre-defined intrisics package could succeed. I think we all see the benefit of having intrisics: using the correct instructions is the hard part that must be (at least for now) solved by humans, but choosing the registers are best left to compilers.

josharian · 2017-01-11T21:49:44Z

Minux, would you mind spelling out the SIMD design impossibility you see in a bit more detail? I'm not following it.

Relatedly, would you mind spelling out a bit why (say) the simd.Reg128 type I suggested above is unsuitable? Note that reinterpreting bits in a vector is a common thing to want to do for performance (which is the only point of SIMD), so its flexibility about the type of things it contains is a good thing.

Lastly, an anecdote. When I last made heavy use of SIMD was NEON, five or six years back, and the compiler made such abominable decisions with intrinsics (mostly around spilling and restoring registers) that I ended up dropping down to assembly anyway, just to be able to fully utilize all registers and schedule operations to make good use of the pipeline. I don't know how much better clang is now, but the experience makes me think that a good SIMD package should be fairly close to the metal and provide users relatively large amounts of control. I say this not as an argument for generic inline assembly support but as a consider for SIMD package design.

But honestly, I'd rather see bit twiddling get tackled first. There's probably plenty to learn there, and I suspect it has a broader audience, including known stdlib uses in the runtime and math/big.

minux · 2017-01-11T22:13:07Z

Isn't programming with Reg128 like programming with interface{}? Even the C intrisics separate mm128 and mm128d, mm256, mm256i, mm256d. Ideally, if we were to design a beautiful interface, I want to use [8]float32, [16]uint16, etc. I should clarify my last comment: it's certainly possible to design one package if given enough time, but my point is that it's impossible to design one in one release cycle and we still need a way for user-defined intrisics to allow experiments on the API. There is no existing interface to adapt (and we can start from a clean slate, why copy existing cumbersome intrisics?) I think this follows the general Go philosophy of providing general mechanism and gain experience through it. I just don't understand the resistance. Russ' comment seems to indicate it will waste too much effort, but I think designing a proper std SIMD package requires even more effort than adding user-defined intrisics because even supporting single instruction could relieve a lot of needs for custom assembly. Bit twiddling package would be nice, but I'm afraid it won't solve much of the problem for this issue. For example, it doesn't even need new register class support, which is crucial for SIMD intrisics.

josharian · 2017-01-11T22:23:18Z

One way to tackle the new register class support problem is to teach the compiler to use SIMD instructions itself when their presence is guaranteed, for the obvious/easy cases (simple for loops, some zeroing).

I just don't understand the resistance.

In my case, the resistance comes from concern that the scope of the generic proposal (including initial implementation, maintenance, and compatibility impact) will end up being much bigger than it initially appears--and it initially appears pretty big, at least to me.

minux · 2017-01-11T23:04:41Z

I think most of the effort lies in robust vector register class support. The actual user-defined intrisics part is tiny compared to that. For this to work, the compiler must know how to save and restore SIMD registers and support them in the rewrite rules. I agree we can first making the compiler use simple SIMD to clean small chunks of memory and then we can experiment with simd rewrite rules. Once both are available, the user-defined intrisics should be easy enough to add. Just need a language (e.g. JSON) to encode the actual rule (again, we don't need the entire SSA rewrite rule support as SIMD instructions are generally very regular.) But could we reopen the issue and put it in Thinking status? (@rsc) We now have a clear way forward now. The first steps are required whether we implement this proposal or not. When those are ready, we can re-evaluate the cost of implementing this. (Assuming the only concern is the perceived complexity of this proposal. If not, please elaborate.)

ianlancetaylor · 2017-01-12T00:14:05Z

@josharian GCC at least does quite well with intrinsics these days. I wrote C code to do the same thing as Keith's aeshash code in runtime/asm_amd64.s, and the compiled result was essentially the same as the asm code.

https://github.com/golang/gofrontend/blob/master/libgo/runtime/aeshash.c

dr2chase · 2017-01-12T14:34:23Z

Another use of intrinsics in C, a previous effort at Oracle to speed up CRC calculations. Start at line 425: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/src/share/native/java/util/zip/CRC32.c.html

The compiler's main role in this was to be a stenographer for XMM code. (Pay no attention to the assembly encoded with .byte -- I used a modern compiler to generate the code, but had to feed it to an archaic assembler that didn't speak "modern" AMD64.)

btracey · 2017-01-17T01:19:18Z

As a simple addition to my benchmarks above, here is the profile for Cholesky decomposition:

brendan:~/Documents/mygo/src/github.com/gonum/matrix/mat64$ go test -cpuprofile cpu.prof -run=none -bench=CholeskyLarge
BenchmarkCholeskyLarge-8   	       1	3476738111 ns/op
PASS
ok  	github.com/gonum/matrix/mat64	14.642s
brendan:~/Documents/mygo/src/github.com/gonum/matrix/mat64$ go tool pprof mat64.test cpu.prof 
Entering interactive mode (type "help" for commands)
(pprof) top5
67.51s of 69.60s total (97.00%)
Dropped 91 nodes (cum <= 0.35s)
Showing top 5 nodes out of 16 (cum >= 0.83s)
      flat  flat%   sum%        cum   cum%
    57.79s 83.03% 83.03%     57.79s 83.03%  github.com/gonum/internal/asm.DaxpyUnitaryTo
     9.66s 13.88% 96.91%     67.35s 96.77%  github.com/gonum/blas/native.dgemmSerialTransNot
     0.04s 0.057% 96.97%      1.63s  2.34%  github.com/gonum/matrix/mat64.benchmarkCholesky
     0.01s 0.014% 96.98%     67.49s 96.97%  github.com/gonum/blas/native.dgemmParallel.func1
     0.01s 0.014% 97.00%      0.83s  1.19%  github.com/gonum/lapack/native.Implementation.Dlatrs

minux added the Proposal label Oct 7, 2016

rsc added this to the Proposal milestone Jan 4, 2017

rsc changed the title ~~cmd/compile: intrinsicify user defined assembly functions~~ proposal: cmd/compile: intrinsicify user defined assembly functions Jan 5, 2017

rsc added the Proposal-Declined label Jan 9, 2017

rsc closed this as completed Jan 9, 2017

brtzsnr mentioned this issue Jan 11, 2017

math/bits: an integer bit twiddling library #18616

Closed

golang locked and limited conversation to collaborators Jan 17, 2018

gopherbot added the FrozenDueToAge label Jan 17, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: cmd/compile: intrinsicify user defined assembly functions #17373

proposal: cmd/compile: intrinsicify user defined assembly functions #17373

minux commented Oct 7, 2016

rsc commented Jan 4, 2017

minux commented Jan 4, 2017 via email

randall77 commented Jan 4, 2017

minux commented Jan 4, 2017 via email

josharian commented Jan 4, 2017 •

edited

Loading

rsc commented Jan 5, 2017

brtzsnr commented Jan 5, 2017

randall77 commented Jan 5, 2017

minux commented Jan 5, 2017 via email

rsc commented Jan 9, 2017

rsc commented Jan 9, 2017

minux commented Jan 10, 2017 via email

rsc commented Jan 10, 2017

pwaller commented Jan 10, 2017

btracey commented Jan 11, 2017 •

edited

Loading

pwaller commented Jan 11, 2017 •

edited

Loading

btracey commented Jan 11, 2017 •

edited

Loading

minux commented Jan 11, 2017 via email

chewxy commented Jan 11, 2017 •

edited

Loading

minux commented Jan 11, 2017 via email

josharian commented Jan 11, 2017

minux commented Jan 11, 2017 via email

josharian commented Jan 11, 2017

minux commented Jan 11, 2017 via email •

edited

Loading

ianlancetaylor commented Jan 12, 2017

dr2chase commented Jan 12, 2017

btracey commented Jan 17, 2017

proposal: cmd/compile: intrinsicify user defined assembly functions #17373

proposal: cmd/compile: intrinsicify user defined assembly functions #17373

Comments

minux commented Oct 7, 2016

rsc commented Jan 4, 2017

minux commented Jan 4, 2017 via email

randall77 commented Jan 4, 2017

minux commented Jan 4, 2017 via email

josharian commented Jan 4, 2017 • edited Loading

rsc commented Jan 5, 2017

brtzsnr commented Jan 5, 2017

randall77 commented Jan 5, 2017

minux commented Jan 5, 2017 via email

rsc commented Jan 9, 2017

rsc commented Jan 9, 2017

minux commented Jan 10, 2017 via email

rsc commented Jan 10, 2017

pwaller commented Jan 10, 2017

btracey commented Jan 11, 2017 • edited Loading

pwaller commented Jan 11, 2017 • edited Loading

btracey commented Jan 11, 2017 • edited Loading

minux commented Jan 11, 2017 via email

chewxy commented Jan 11, 2017 • edited Loading

minux commented Jan 11, 2017 via email

josharian commented Jan 11, 2017

minux commented Jan 11, 2017 via email

josharian commented Jan 11, 2017

minux commented Jan 11, 2017 via email • edited Loading

ianlancetaylor commented Jan 12, 2017

dr2chase commented Jan 12, 2017

btracey commented Jan 17, 2017

josharian commented Jan 4, 2017 •

edited

Loading

btracey commented Jan 11, 2017 •

edited

Loading

pwaller commented Jan 11, 2017 •

edited

Loading

btracey commented Jan 11, 2017 •

edited

Loading

chewxy commented Jan 11, 2017 •

edited

Loading

minux commented Jan 11, 2017 via email •

edited

Loading