Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported #40725

Closed
clausecker opened this issue Aug 12, 2020 · 40 comments
Closed

cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported #40725

clausecker opened this issue Aug 12, 2020 · 40 comments
Labels
arch-arm64 FeatureRequest FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@clausecker
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.15 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build668753232=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Assemble the following ARM64 assembly file:

TEXT foo(SB),0,$0-0
        FLDPQ (R0), (F0, F1)
        FSTPQ (F0, F1), (R0)

What did you expect to see?

An object file with instructions that disassemble to the GNU-style instructions

   0:	ad400400 	ldp	q0, q1, [x0]
   4:	ad000400 	stp	q0, q1, [x0]

What did you see instead?

$ go tool asm test.s
test.s:2: unrecognized instruction "FLDPQ"
test.s:3: unrecognized instruction "FSTPQ"
asm: assembly of test.s failed

It appears that only FLDPS and FLDPD are supported. Going through the instruction lists, it appears that no FP instructions with 128 bit instruction size seem to be supported at all. It would be great to have them added as I need them for a project.

@clausecker clausecker changed the title ARM64 128 bit LDP/STP not supported ARM64 instructions FLDPQ and FSTPQ not supported Aug 12, 2020
@clausecker
Copy link
Author

I found a bunch more issues with the assembler. Is it okay if I make a large issue just gathering all of them?

@clausecker
Copy link
Author

Some more valid (I hope) instructions that don't encode:

VLD1R.P 1(R1), [V9.B8] // displacement 8 is incorrectly accepted
FMOVD $0x8040201008040201, F10 // won't generate a literal pool load, won't encode
VCMTST V8.B8, V9.B8, V9.B8 // unknown instruction
VSUB V10.B8, V9.B8, V10.B8 // invalid rrr, VADD works

There's probably a bunch more. Right now, almost every single SIMD instruction I want to use is broken in the assembler. This is not a sustainable state of affairs.

@zhangfannie
Copy link
Contributor

zhangfannie commented Aug 13, 2020

@fuzxxl Thank you for the observation.
Yes, the current go assembler on ARM64 is not complemented. We added assembler support for most instructions because these instructions need to be used. Can you tell us which instructions you want to use now? We are happy to enable them, or you can try to do it if you are interested. 😊

@clausecker
Copy link
Author

clausecker commented Aug 13, 2020

@zhangfannie Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start. I have not written the code yet, so I can't tell for sure what exactly it is going to involve. If you like, I can try to implement the algorithm using the GNU assembler and then tell you what instructions I used. How does that sound?

@davecheney
Copy link
Contributor

Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start.

Please don't take this as a critisism, but as an observer of a number of this class of request, the best results are obtained when the OP, you in this case, can enumerate exactly which instructions to add. I have no explanation why requests for all XXX instructions are unsuccessful, but encourage you to list precisely the instructions you would like to see added as there is anecdotal evidence that requests formed in this way are resolved faster.

@clausecker
Copy link
Author

Thanks, will do so. I'll proceed and prototype the algorithm in the GNU assembler so I can tell you which instructions exactly I'm going to need.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 18, 2020
@dmitshur dmitshur added this to the Backlog milestone Aug 18, 2020
@dmitshur dmitshur changed the title ARM64 instructions FLDPQ and FSTPQ not supported cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported Aug 18, 2020
@clausecker
Copy link
Author

clausecker commented Aug 19, 2020

So, I've now sketched out a prototype using the UNIX assembler. The following new instructions are required:

  • VEOR (possibly already implemented)
  • VBIT
  • VBSL
  • VLD1 (possibly already implemented)
  • VLD1.P (might already be implemented)
  • FMOVD $imm64, Fr (translates to a load from a literal pool; having FMOVQ too would be nice)
  • VMOVI (seems like that one is already implemented)
  • VLD1R.P (implemented, but incorrectly; cf. example above)
  • VCMTST
  • VSUB (already implemented but defective)
  • VAND (possibly already implemented)
  • VUSHR (possibly already implemented)
  • VSHL (possibly already implemented)
  • UADDLV (possibly already implemented)
  • VMOV Vx.y[z], Vx.y[z] (possibly already implemented)
  • VUXTL
  • VUXTL2
  • VADD (possibly already implemented)
  • VST1 (possibly already implemented)
  • PRFM (possibly already implemented; nice to have)

Especially important is the literal-pool FMOVD $imm64 as I can't replace that one with a WORD directive or similar due to the impossibility of manipulating literal pools directly.

@zhangfannie
Copy link
Contributor

@fuzxxl We will enable the above instructions ASAP. 🙂

@clausecker
Copy link
Author

Thanks! Very much appreciated.

@zhangfannie
Copy link
Contributor

@fuzxxl The current go assembler have enabled FMOV (scalar, immediate) instruction, the go syntax is FMOVS and FMOVD. And according to ARM reference, there are no instructions like FMOVQ .
image

@clausecker
Copy link
Author

clausecker commented Aug 19, 2020

The instruction I want is listed in the section “LDR (literal, SIMD&FP)” of the ARM reference:

LDR <St>, label
LDR <Dt>, label
LDR <Qt>, label

here, label has a PC-relative addressing mode referring to a literal in a nearby literal pool. These instructions should therefore be translated to Go syntax as

FMOVS $imm32, Fr
FMOVD $imm64, Fr
FMOVQ $imm128, Fr

The assembler needs to place the immediate in a nearby literal pool and substitute its address.

Of course, the assembler is free to use the “FMOV (scalar, immediate)” variant if the immediate fits. Similarly, it should translate these to movi or mvni if the immediate is of the appropriate form.

The desired usage is to load a bit mask into a SIMD register:

FMOVD $0x8040201008040201, F0

Perhaps the mnemonics MOVS, MOVD, and MOVQ might indeed be better here as the instructions do not directly deal with floating point numbers.

clausecker added a commit to clausecker/pospop that referenced this issue Aug 19, 2020
This prototype performs at 2.25 GB/s on a Raspberry Pi 3B and is
written for the GNU assembler.  I was unable to write it directly
for the Go assembler as critical defects prevent me from using the
desired instructions.  Relevant issue in the Go bug tracker:

golang/go#40725
@gopherbot
Copy link

Change https://golang.org/cl/249757 mentions this issue: cmd/asm: fix the error of checking the post-index offset of VLD[1-4]R instructions of arm64

@gopherbot
Copy link

Change https://golang.org/cl/249758 mentions this issue: cmd/internal/obj/arm64: enable some SIMD instructions

@clausecker
Copy link
Author

Thanks for the changes. You made a typo in changeset 249758, file asm7.go:5111:

	case 101: // /FMOVS/FMOVD/FMOVQ(liternal, SIMD&FP)

@zhangfannie
Copy link
Contributor

I will modify it. 😅 Thank you.

@clausecker
Copy link
Author

Due to some further optimisations in the code, I might also need the following instructions:

  • VUZP1 and VUZP2
  • VUXTL and VUXTL2 (or their generic variants, USHLL and USHLL2)
  • VADDP (might already be implemented)
  • VBIF

It would be great if you could add them, too. I hope this covers all of them.

Thanks for all your great help! I'm feeling like we're getting somewhere with this.

@zhangfannie
Copy link
Contributor

@fuzxxl You are welcome. We will enable the assembler's support for these instructions ASAP. 🙂

@clausecker
Copy link
Author

Thanks a lot for your hard work!

@gopherbot
Copy link

Change https://golang.org/cl/251519 mentions this issue: cmd/internal/obj/arm64: add support for arm64 instruction FLDPQ and FSTPQ

@gopherbot
Copy link

Change https://golang.org/cl/253659 mentions this issue: cmd/asm: add more SIMD instructions on arm64

gopherbot pushed a commit that referenced this issue Sep 10, 2020
Enable VBSL, VBIT, VCMTST, VUXTL VUXTL2 and FMOVQ SIMD
instructions required by the issue #40725. And FMOVQ
instrucion is used to move a large constant to a Vn
register.

Add test cases.

Fixes #40725

Change-Id: I1cac1922a0a0165d698a4b73a41f7a5f0a0ad549
Reviewed-on: https://go-review.googlesource.com/c/go/+/249758
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Sep 10, 2020
This CL adds USHLL, USHLL2, UZP1, UZP2, and BIF instructions requested
by #40725. And since UXTL* are aliases of USHLL*, this CL also merges
them into one case.

Updates #40725

Change-Id: I404a4fdaf953319f72eea548175bec1097a2a816
Reviewed-on: https://go-review.googlesource.com/c/go/+/253659
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@clausecker
Copy link
Author

@zhangfannie I've tried your new instructions, thanks a lot! One thing I noticed is that FMOVD and FMOVQ take a vector register for an argument (e.g. V0) despite having an F prefix in the mnemonic. Taking a FP register (e.g. F0) might have been more idiomatic. I also noticed that it is unfortunately not possible to specify 128 bit immediates. For example,

FMOVQ $0x123456789abcdef0123456789abcdef, V0

fails to assemble with an “immediate out of range” message. And even if I choose a suitably sized immediate like

FMOVQ $0x123456789abcdef0, V0

the assembler only generates a 64 bit literal in the literal pool despite this instruction reading 128 bit from the literal pool. I would be great to have this improved.

@zhangfannie
Copy link
Contributor

@fuzxxl For the issue of "immediate out of range". This issue is reported by strconv.ParseUint(str, 0, 64) called by p.atoi() in the process of parsing immediate. The current go only supports bit sizes 0, 8, 16, 32 and 64, if bit size is below 0 or above 64, an error is returned (Please see https://golang.org/pkg/strconv/#ParseUint). If we want to parse a 128 bit immediate, we need to support 128 bit sizes. @cherrymui Can we do it? Thank you.

@cherrymui
Copy link
Member

Go doesn't have 128-bit integers (for now), so I don't think we want to do that. Instead, can we let it take two 64-bit operands, for the high and low parts separately? E.g. FMOVQ $1, $2, V3 is V3 = 1<<64 + 2 ? This doesn't look as nice but maybe fine.

Or, we only support loading from memory, and let user write large constants to memory, something like FMOVQ x(SB), V3 with

DATA x+0(SB)/8, $lo64
DATA x+8(SB)/8, $hi64
GLOBL x(SB),RODATA,$16

GNU assembler doesn't seem to support directly loading large constant either. It needs a label. We could use a global data symbol.

@clausecker
Copy link
Author

@cherrymui The problem with a global symbol (as opposed to a literal pool) is that instructions are wasted generating the address of the global symbol. This incurs needless cost in high-performance code. Having two separate operands would be fine with me and I suppose it's a good alternative. Another idea would be to support a form that loads from a local label such that the programmer can manually create a literal pool.

Any resolution on the inconsistency of using V0 instead of F0 for the register we load into? After all, the instruction mnemonic indicates an FP-type instruction, not a SIMD-type instruction.

@cherrymui
Copy link
Member

How many instructions are there like this, where it takes a constant that doesn't fit into 64-bit? Is this the only case or there are still many? I'm less sure about supporting manual constant pool management, mainly for consistency (there have been several instructions that use the constant pool managed by the assembler).

I agree that the line between FP instruction and vector instruction is blurry. Does this instruction have anything specific to FP? If not, maybe we should just use VMOVQ (or VMOV) and take a V register?

@clausecker
Copy link
Author

@cherrymui I am unaware of any other instructions operating on 128 bit constants in a literal pool. I think having to write this one with two immediates is a workable choice.

The instruction is coded as taking a FP-style register in the ARM reference manual, so FMOVQ is consistent with that choice. I am inclined to say that both VMOVQ $imm, V0 and FMOVQ $imm, F0 should be supported as aliases to establish consistency in both SIMD and FP code and to make writing macros easier. In any case, the same choice should be made for literal-pool-loading FMOVD, too (this one appears to be somewhat more common in FP code for loading float64 constants, but it's also useful in vector code). It's not my call to make though.

@zhangfannie
Copy link
Contributor

@cherrymui @clausecker Thank you for the solution. I will post a patch to write 128 bit with two immediates.

@zhangfannie
Copy link
Contributor

@cherrymui @clausecker I spent some time trying to find a SIMD-FP instruction to do the "Dn << 64 + Dm = Qd" operation we need. Unfortunately, I did not find such an instruction 😥. Do you have any ideas? Thank you.

@clausecker
Copy link
Author

clausecker commented Sep 15, 2020

@zhangfannie I'm not sure what you mean. The idea was to have the literal-loading form of FMOVQ be

    FMOVQ $foo, $bar, F0

where foo and bar will be placed in a nearby literal pool such that they are adjacent. This then compiles to something like

    FMOVQ pool(PC), F0
    ...
pool:
    DWORD $foo
    DWORD $bar

Or perhaps I misunderstood something here?

@zhangfannie
Copy link
Contributor

@clausecker Sorry, my understanding is wrong. The patch will be submitted soon. Thank you for the explanation. 😊

@clausecker
Copy link
Author

@cherrymui @zhangfannie While we're add it, I have difficulty understanding what the Go syntax for a load or store instruction like these is:

ldr q0, [r5, #32]
str q0, [r5, #32]

I'm sure they are implemented (at least I hope), but neither of these variants seem to work:

FMOVQ 32(R5), F0
VMOVQ 32(R5), V0
MOVQ 32(R5), F0
MOVQ 32(R5), V0

Have I missed anything?

@cherrymui
Copy link
Member

FMOVQ/VMOVQ seem the right syntax (MOVQ is not). They don't seem to be implemented yet, though. For now, FMOVQ only supports constant source operand, and VMOVQ does not exist. (We have VMOV, but that is currently used only for moves between V registers.)

@clausecker
Copy link
Author

@cherrymui Being able to load/store SIMD and FP registers would be a useful addition I suppose. I can use VLDR1 in many cases, but the addressing mode is pretty restrictive.

@cherrymui
Copy link
Member

I agree it is useful. It just needs to be implemented.

@clausecker
Copy link
Author

@cherrymui

Thanks! If you're adding this one, could you also add VBIC and VUSRA? It turns out these ones can be used to shave off some instructions in my code.

@zhangfannie
Copy link
Contributor

@clausecker We will do it. Thank you.

@clausecker
Copy link
Author

clausecker commented Sep 18, 2020

@zhangfannie Super cool! Reading the manual further, I believe VSRI, VSLI, VUADDW, and VUADDW2 might be very helpful, too.

@gopherbot
Copy link

Change https://golang.org/cl/255900 mentions this issue: cmd/asm: fix the issue of moving 128-bit integers to vector registers on arm64

gopherbot pushed a commit that referenced this issue Sep 25, 2020
… on arm64

The CL 249758 added `FMOVQ $vcon, Vd` instruction and assembler used
128-bit simd literal-loading to load `$vcon` from pool into 128-bit vector
register `Vd`. Because Go does not have 128-bit integers for now, the
assembler will report an error of `immediate out of range` when
assembleing `FMOVQ $0x123456789abcdef0123456789abcdef, V0` instruction.

This patch lets 128-bit integers take two 64-bit operands, for the high
and low parts separately and adds `VMOVQ $hi, $lo, Vd` instruction to
move `$hi<<64+$lo' into 128-bit register `Vd`.

In addition, this patch renames `FMOVQ/FMOVD/FMOVS` ops to 'VMOVQ/VMOVD/VMOVS'
and uses them to move 128-bit, 64-bit and 32-bit constants into vector
registers, respectively

Update the go doc.

Fixes #40725

Change-Id: Ia3c83bb6463f104d2bee960905053a97299e0a3a
Reviewed-on: https://go-review.googlesource.com/c/go/+/255900
Trust: fannie zhang <Fannie.Zhang@arm.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@zhangfannie
Copy link
Contributor

@clausecker The patch is ready and is under internal review. I will submit it as soon as possible. Thank you.

@gopherbot
Copy link

Change https://golang.org/cl/258397 mentions this issue: cmd/asm: add several arm64 SIMD instructions

gopherbot pushed a commit that referenced this issue Oct 29, 2020
This patch enables VSLI, VUADDW(2), VUSRA and FMOVQ SIMD instructions
required by the issue #40725. And the GNU syntax of 'FMOVQ' is 128-bit
ldr/str(immediate, simd&fp).

Add test cases.

Fixes #40725

Change-Id: Ide968ef4a9385ce4cd8f69bce854289014d30456
Reviewed-on: https://go-review.googlesource.com/c/go/+/258397
Trust: fannie zhang <Fannie.Zhang@arm.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
clausecker added a commit to clausecker/pospop that referenced this issue Oct 29, 2020
With 3089ef6, the Go project finally supports VUSRA, FMOVQ, VUADDW, and
VUADDW2.  Make use of them.  Unfortunately, not all addressing modes
seem to be supported for FMOVQ and support for VUADDL and VUADDL2
has not yet landed.  This will be added in a future commit.

See also golang/go#40725
@golang golang locked and limited conversation to collaborators Sep 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 FeatureRequest FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants