cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported #40725

clausecker · 2020-08-12T14:04:15Z

What version of Go are you using (`go version`)?

$ go version
go version go1.15 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build668753232=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Assemble the following ARM64 assembly file:

TEXT foo(SB),0,$0-0
        FLDPQ (R0), (F0, F1)
        FSTPQ (F0, F1), (R0)

What did you expect to see?

An object file with instructions that disassemble to the GNU-style instructions

   0:	ad400400 	ldp	q0, q1, [x0]
   4:	ad000400 	stp	q0, q1, [x0]

What did you see instead?

$ go tool asm test.s
test.s:2: unrecognized instruction "FLDPQ"
test.s:3: unrecognized instruction "FSTPQ"
asm: assembly of test.s failed

It appears that only FLDPS and FLDPD are supported. Going through the instruction lists, it appears that no FP instructions with 128 bit instruction size seem to be supported at all. It would be great to have them added as I need them for a project.

The text was updated successfully, but these errors were encountered:

clausecker · 2020-08-12T15:05:29Z

I found a bunch more issues with the assembler. Is it okay if I make a large issue just gathering all of them?

clausecker · 2020-08-12T16:35:03Z

Some more valid (I hope) instructions that don't encode:

VLD1R.P 1(R1), [V9.B8] // displacement 8 is incorrectly accepted
FMOVD $0x8040201008040201, F10 // won't generate a literal pool load, won't encode
VCMTST V8.B8, V9.B8, V9.B8 // unknown instruction
VSUB V10.B8, V9.B8, V10.B8 // invalid rrr, VADD works

There's probably a bunch more. Right now, almost every single SIMD instruction I want to use is broken in the assembler. This is not a sustainable state of affairs.

zhangfannie · 2020-08-13T06:08:53Z

@fuzxxl Thank you for the observation.
Yes, the current go assembler on ARM64 is not complemented. We added assembler support for most instructions because these instructions need to be used. Can you tell us which instructions you want to use now? We are happy to enable them, or you can try to do it if you are interested. 😊

clausecker · 2020-08-13T11:35:26Z

@zhangfannie Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start. I have not written the code yet, so I can't tell for sure what exactly it is going to involve. If you like, I can try to implement the algorithm using the GNU assembler and then tell you what instructions I used. How does that sound?

davecheney · 2020-08-14T02:21:12Z

Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start.

Please don't take this as a critisism, but as an observer of a number of this class of request, the best results are obtained when the OP, you in this case, can enumerate exactly which instructions to add. I have no explanation why requests for all XXX instructions are unsuccessful, but encourage you to list precisely the instructions you would like to see added as there is anecdotal evidence that requests formed in this way are resolved faster.

clausecker · 2020-08-14T08:25:38Z

Thanks, will do so. I'll proceed and prototype the algorithm in the GNU assembler so I can tell you which instructions exactly I'm going to need.

clausecker · 2020-08-19T00:03:21Z

So, I've now sketched out a prototype using the UNIX assembler. The following new instructions are required:

VEOR (possibly already implemented)
VBIT
VBSL
VLD1 (possibly already implemented)
VLD1.P (might already be implemented)
FMOVD $imm64, Fr (translates to a load from a literal pool; having FMOVQ too would be nice)
VMOVI (seems like that one is already implemented)
VLD1R.P (implemented, but incorrectly; cf. example above)
VCMTST
VSUB (already implemented but defective)
VAND (possibly already implemented)
VUSHR (possibly already implemented)
VSHL (possibly already implemented)
UADDLV (possibly already implemented)
VMOV Vx.y[z], Vx.y[z] (possibly already implemented)
VUXTL
VUXTL2
VADD (possibly already implemented)
VST1 (possibly already implemented)
PRFM (possibly already implemented; nice to have)

Especially important is the literal-pool FMOVD $imm64 as I can't replace that one with a WORD directive or similar due to the impossibility of manipulating literal pools directly.

zhangfannie · 2020-08-19T01:54:25Z

@fuzxxl We will enable the above instructions ASAP. 🙂

clausecker · 2020-08-19T08:27:24Z

Thanks! Very much appreciated.

zhangfannie · 2020-08-19T11:49:59Z

@fuzxxl The current go assembler have enabled FMOV (scalar, immediate) instruction, the go syntax is FMOVS and FMOVD. And according to ARM reference, there are no instructions like FMOVQ .

clausecker · 2020-08-19T11:55:52Z

The instruction I want is listed in the section “LDR (literal, SIMD&FP)” of the ARM reference:

LDR <St>, label
LDR <Dt>, label
LDR <Qt>, label

here, label has a PC-relative addressing mode referring to a literal in a nearby literal pool. These instructions should therefore be translated to Go syntax as

FMOVS $imm32, Fr
FMOVD $imm64, Fr
FMOVQ $imm128, Fr

The assembler needs to place the immediate in a nearby literal pool and substitute its address.

Of course, the assembler is free to use the “FMOV (scalar, immediate)” variant if the immediate fits. Similarly, it should translate these to movi or mvni if the immediate is of the appropriate form.

The desired usage is to load a bit mask into a SIMD register:

FMOVD $0x8040201008040201, F0

Perhaps the mnemonics MOVS, MOVD, and MOVQ might indeed be better here as the instructions do not directly deal with floating point numbers.

This prototype performs at 2.25 GB/s on a Raspberry Pi 3B and is written for the GNU assembler. I was unable to write it directly for the Go assembler as critical defects prevent me from using the desired instructions. Relevant issue in the Go bug tracker: golang/go#40725

gopherbot · 2020-08-21T01:41:06Z

Change https://golang.org/cl/249757 mentions this issue: cmd/asm: fix the error of checking the post-index offset of VLD[1-4]R instructions of arm64

gopherbot · 2020-08-21T01:46:50Z

Change https://golang.org/cl/249758 mentions this issue: cmd/internal/obj/arm64: enable some SIMD instructions

clausecker · 2020-08-21T07:57:07Z

Thanks for the changes. You made a typo in changeset 249758, file asm7.go:5111:

	case 101: // /FMOVS/FMOVD/FMOVQ(liternal, SIMD&FP)

zhangfannie · 2020-08-21T10:04:08Z

I will modify it. 😅 Thank you.

clausecker · 2020-08-25T19:20:26Z

Due to some further optimisations in the code, I might also need the following instructions:

VUZP1 and VUZP2
VUXTL and VUXTL2 (or their generic variants, USHLL and USHLL2)
VADDP (might already be implemented)
VBIF

It would be great if you could add them, too. I hope this covers all of them.

Thanks for all your great help! I'm feeling like we're getting somewhere with this.

zhangfannie · 2020-08-26T02:27:49Z

@fuzxxl You are welcome. We will enable the assembler's support for these instructions ASAP. 🙂

clausecker · 2020-08-26T07:16:28Z

Thanks a lot for your hard work!

gopherbot · 2020-08-31T06:26:36Z

Change https://golang.org/cl/251519 mentions this issue: cmd/internal/obj/arm64: add support for arm64 instruction FLDPQ and FSTPQ

gopherbot · 2020-09-09T05:52:35Z

Change https://golang.org/cl/253659 mentions this issue: cmd/asm: add more SIMD instructions on arm64

Enable VBSL, VBIT, VCMTST, VUXTL VUXTL2 and FMOVQ SIMD instructions required by the issue #40725. And FMOVQ instrucion is used to move a large constant to a Vn register. Add test cases. Fixes #40725 Change-Id: I1cac1922a0a0165d698a4b73a41f7a5f0a0ad549 Reviewed-on: https://go-review.googlesource.com/c/go/+/249758 Reviewed-by: Cherry Zhang <cherryyz@google.com>

This CL adds USHLL, USHLL2, UZP1, UZP2, and BIF instructions requested by #40725. And since UXTL* are aliases of USHLL*, this CL also merges them into one case. Updates #40725 Change-Id: I404a4fdaf953319f72eea548175bec1097a2a816 Reviewed-on: https://go-review.googlesource.com/c/go/+/253659 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

clausecker · 2020-09-11T21:49:54Z

@zhangfannie I've tried your new instructions, thanks a lot! One thing I noticed is that FMOVD and FMOVQ take a vector register for an argument (e.g. V0) despite having an F prefix in the mnemonic. Taking a FP register (e.g. F0) might have been more idiomatic. I also noticed that it is unfortunately not possible to specify 128 bit immediates. For example,

FMOVQ $0x123456789abcdef0123456789abcdef, V0

fails to assemble with an “immediate out of range” message. And even if I choose a suitably sized immediate like

FMOVQ $0x123456789abcdef0, V0

the assembler only generates a 64 bit literal in the literal pool despite this instruction reading 128 bit from the literal pool. I would be great to have this improved.

zhangfannie · 2020-09-14T03:49:28Z

@fuzxxl For the issue of "immediate out of range". This issue is reported by strconv.ParseUint(str, 0, 64) called by p.atoi() in the process of parsing immediate. The current go only supports bit sizes 0, 8, 16, 32 and 64, if bit size is below 0 or above 64, an error is returned (Please see https://golang.org/pkg/strconv/#ParseUint). If we want to parse a 128 bit immediate, we need to support 128 bit sizes. @cherrymui Can we do it? Thank you.

cherrymui · 2020-09-14T14:33:08Z

Go doesn't have 128-bit integers (for now), so I don't think we want to do that. Instead, can we let it take two 64-bit operands, for the high and low parts separately? E.g. FMOVQ $1, $2, V3 is V3 = 1<<64 + 2 ? This doesn't look as nice but maybe fine.

Or, we only support loading from memory, and let user write large constants to memory, something like FMOVQ x(SB), V3 with

DATA x+0(SB)/8, $lo64
DATA x+8(SB)/8, $hi64
GLOBL x(SB),RODATA,$16

GNU assembler doesn't seem to support directly loading large constant either. It needs a label. We could use a global data symbol.

clausecker · 2020-09-14T14:42:32Z

@cherrymui The problem with a global symbol (as opposed to a literal pool) is that instructions are wasted generating the address of the global symbol. This incurs needless cost in high-performance code. Having two separate operands would be fine with me and I suppose it's a good alternative. Another idea would be to support a form that loads from a local label such that the programmer can manually create a literal pool.

Any resolution on the inconsistency of using V0 instead of F0 for the register we load into? After all, the instruction mnemonic indicates an FP-type instruction, not a SIMD-type instruction.

cherrymui · 2020-09-14T15:23:00Z

How many instructions are there like this, where it takes a constant that doesn't fit into 64-bit? Is this the only case or there are still many? I'm less sure about supporting manual constant pool management, mainly for consistency (there have been several instructions that use the constant pool managed by the assembler).

I agree that the line between FP instruction and vector instruction is blurry. Does this instruction have anything specific to FP? If not, maybe we should just use VMOVQ (or VMOV) and take a V register?

clausecker · 2020-09-14T15:34:53Z

@cherrymui I am unaware of any other instructions operating on 128 bit constants in a literal pool. I think having to write this one with two immediates is a workable choice.

The instruction is coded as taking a FP-style register in the ARM reference manual, so FMOVQ is consistent with that choice. I am inclined to say that both VMOVQ $imm, V0 and FMOVQ $imm, F0 should be supported as aliases to establish consistency in both SIMD and FP code and to make writing macros easier. In any case, the same choice should be made for literal-pool-loading FMOVD, too (this one appears to be somewhat more common in FP code for loading float64 constants, but it's also useful in vector code). It's not my call to make though.

zhangfannie · 2020-09-15T06:15:50Z

@cherrymui @clausecker Thank you for the solution. I will post a patch to write 128 bit with two immediates.

zhangfannie · 2020-09-15T13:17:01Z

@cherrymui @clausecker I spent some time trying to find a SIMD-FP instruction to do the "Dn << 64 + Dm = Qd" operation we need. Unfortunately, I did not find such an instruction 😥. Do you have any ideas? Thank you.

clausecker · 2020-09-15T13:23:15Z

@zhangfannie I'm not sure what you mean. The idea was to have the literal-loading form of FMOVQ be

    FMOVQ $foo, $bar, F0

where foo and bar will be placed in a nearby literal pool such that they are adjacent. This then compiles to something like

    FMOVQ pool(PC), F0
    ...
pool:
    DWORD $foo
    DWORD $bar

Or perhaps I misunderstood something here?

zhangfannie · 2020-09-16T06:45:22Z

@clausecker Sorry, my understanding is wrong. The patch will be submitted soon. Thank you for the explanation. 😊

clausecker · 2020-09-17T16:14:54Z

@cherrymui @zhangfannie While we're add it, I have difficulty understanding what the Go syntax for a load or store instruction like these is:

ldr q0, [r5, #32]
str q0, [r5, #32]

I'm sure they are implemented (at least I hope), but neither of these variants seem to work:

FMOVQ 32(R5), F0
VMOVQ 32(R5), V0
MOVQ 32(R5), F0
MOVQ 32(R5), V0

Have I missed anything?

cherrymui · 2020-09-17T16:19:55Z

FMOVQ/VMOVQ seem the right syntax (MOVQ is not). They don't seem to be implemented yet, though. For now, FMOVQ only supports constant source operand, and VMOVQ does not exist. (We have VMOV, but that is currently used only for moves between V registers.)

clausecker · 2020-09-17T16:25:01Z

@cherrymui Being able to load/store SIMD and FP registers would be a useful addition I suppose. I can use VLDR1 in many cases, but the addressing mode is pretty restrictive.

cherrymui · 2020-09-17T16:27:17Z

I agree it is useful. It just needs to be implemented.

clausecker · 2020-09-17T17:26:48Z

@cherrymui

Thanks! If you're adding this one, could you also add VBIC and VUSRA? It turns out these ones can be used to shave off some instructions in my code.

zhangfannie · 2020-09-18T02:36:10Z

@clausecker We will do it. Thank you.

clausecker · 2020-09-18T10:01:00Z

@zhangfannie Super cool! Reading the manual further, I believe VSRI, VSLI, VUADDW, and VUADDW2 might be very helpful, too.

gopherbot · 2020-09-21T05:40:43Z

Change https://golang.org/cl/255900 mentions this issue: cmd/asm: fix the issue of moving 128-bit integers to vector registers on arm64

… on arm64 The CL 249758 added `FMOVQ $vcon, Vd` instruction and assembler used 128-bit simd literal-loading to load `$vcon` from pool into 128-bit vector register `Vd`. Because Go does not have 128-bit integers for now, the assembler will report an error of `immediate out of range` when assembleing `FMOVQ $0x123456789abcdef0123456789abcdef, V0` instruction. This patch lets 128-bit integers take two 64-bit operands, for the high and low parts separately and adds `VMOVQ $hi, $lo, Vd` instruction to move `$hi<<64+$lo' into 128-bit register `Vd`. In addition, this patch renames `FMOVQ/FMOVD/FMOVS` ops to 'VMOVQ/VMOVD/VMOVS' and uses them to move 128-bit, 64-bit and 32-bit constants into vector registers, respectively Update the go doc. Fixes #40725 Change-Id: Ia3c83bb6463f104d2bee960905053a97299e0a3a Reviewed-on: https://go-review.googlesource.com/c/go/+/255900 Trust: fannie zhang <Fannie.Zhang@arm.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>

zhangfannie · 2020-09-29T06:49:39Z

@clausecker The patch is ready and is under internal review. I will submit it as soon as possible. Thank you.

gopherbot · 2020-09-30T09:53:47Z

Change https://golang.org/cl/258397 mentions this issue: cmd/asm: add several arm64 SIMD instructions

This patch enables VSLI, VUADDW(2), VUSRA and FMOVQ SIMD instructions required by the issue #40725. And the GNU syntax of 'FMOVQ' is 128-bit ldr/str(immediate, simd&fp). Add test cases. Fixes #40725 Change-Id: Ide968ef4a9385ce4cd8f69bce854289014d30456 Reviewed-on: https://go-review.googlesource.com/c/go/+/258397 Trust: fannie zhang <Fannie.Zhang@arm.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>

With 3089ef6, the Go project finally supports VUSRA, FMOVQ, VUADDW, and VUADDW2. Make use of them. Unfortunately, not all addressing modes seem to be supported for FMOVQ and support for VUADDL and VUADDL2 has not yet landed. This will be added in a future commit. See also golang/go#40725

clausecker changed the title ~~ARM64 128 bit LDP/STP not supported~~ ARM64 instructions FLDPQ and FSTPQ not supported Aug 12, 2020

dmitshur added the NeedsInvestigation label Aug 18, 2020

dmitshur added this to the Backlog milestone Aug 18, 2020

dmitshur changed the title ~~ARM64 instructions FLDPQ and FSTPQ not supported~~ cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported Aug 18, 2020

dmitshur added arch-arm64 FeatureRequest labels Aug 18, 2020

davecheney mentioned this issue Aug 28, 2020

cmd/asm: ARM64 NEON floating point instructions (VFABD VFMAX, VFMAXNM, VFMINNM VFMIN VFADD, VFSUB VFMUL, VFDIV VFMLA, VFMLS VCVT*) #41092

Open

gopherbot closed this as completed in 0e19aaa Sep 7, 2020

erifan mentioned this issue Mar 2, 2021

cmd/asm: refactor the framework of the arm64 assembler #44734

Open

cagedmantis mentioned this issue Sep 13, 2021

cmd/compile: add VBIC, VMVN instructions for ARMv8 #48199

Open

golang locked and limited conversation to collaborators Sep 30, 2021

gopherbot added the FrozenDueToAge label Sep 30, 2021

cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported #40725

cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported #40725

Comments

clausecker commented Aug 12, 2020

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

clausecker commented Aug 12, 2020

clausecker commented Aug 12, 2020

zhangfannie commented Aug 13, 2020 • edited Loading

clausecker commented Aug 13, 2020 • edited Loading

davecheney commented Aug 14, 2020

clausecker commented Aug 14, 2020

clausecker commented Aug 19, 2020 • edited Loading

zhangfannie commented Aug 19, 2020

clausecker commented Aug 19, 2020

zhangfannie commented Aug 19, 2020

clausecker commented Aug 19, 2020 • edited Loading

gopherbot commented Aug 21, 2020

gopherbot commented Aug 21, 2020

clausecker commented Aug 21, 2020

zhangfannie commented Aug 21, 2020

clausecker commented Aug 25, 2020

zhangfannie commented Aug 26, 2020

clausecker commented Aug 26, 2020

gopherbot commented Aug 31, 2020

gopherbot commented Sep 9, 2020

clausecker commented Sep 11, 2020

zhangfannie commented Sep 14, 2020

cherrymui commented Sep 14, 2020

clausecker commented Sep 14, 2020

cherrymui commented Sep 14, 2020

clausecker commented Sep 14, 2020

zhangfannie commented Sep 15, 2020

zhangfannie commented Sep 15, 2020

clausecker commented Sep 15, 2020 • edited Loading

zhangfannie commented Sep 16, 2020

clausecker commented Sep 17, 2020

cherrymui commented Sep 17, 2020

clausecker commented Sep 17, 2020

cherrymui commented Sep 17, 2020

clausecker commented Sep 17, 2020

zhangfannie commented Sep 18, 2020

clausecker commented Sep 18, 2020 • edited Loading

gopherbot commented Sep 21, 2020

zhangfannie commented Sep 29, 2020

gopherbot commented Sep 30, 2020

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

zhangfannie commented Aug 13, 2020 •

edited

Loading

clausecker commented Aug 13, 2020 •

edited

Loading

clausecker commented Aug 19, 2020 •

edited

Loading

clausecker commented Aug 19, 2020 •

edited

Loading

clausecker commented Sep 15, 2020 •

edited

Loading

clausecker commented Sep 18, 2020 •

edited

Loading