
cmd/compile: slower performance of array copies on arm64 #55920

Closed
randall77 opened this issue Sep 28, 2022 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone
Go1.20

Comments

@randall77
Contributor

Originally came up in issue #54467.

If you run this program with go test -bench=.* issue54467_test.go on an arm64 machine, it is slower on tip than on 1.19.

package datamove

import (
	"testing"
)

var resultArray []Array

func BenchmarkSliceOfArray(b *testing.B) {
	var r []Array
	s := make([]Array, 1000)
	for n := 0; n < b.N; n++ {
		r = moveSliceOfArrayData(s)
	}
	resultArray = r
}

type Array [2]int

func moveSliceOfArrayData(s []Array) []Array {
	for i := 1; i < len(s); i++ {
		s[i-1], s[i] = s[i], s[i-1]
	}
	return s
}

With 1.19:

BenchmarkSliceOfArray-4   	  331569	      3594 ns/op

With tip:

BenchmarkSliceOfArray-4   	  264745	      4529 ns/op

(Note: don't compare with 1.19.1, it has a separate issue that makes this code slow. Use 1.19.0.)

The inner loop with 1.19 looks like this:

	0x0018 00024 (issue54467_test.go:22)	SUB	$1, R3, R4
	0x001c 00028 (issue54467_test.go:22)	LSL	$4, R4, R5
	0x0020 00032 (issue54467_test.go:22)	MOVD	(R0)(R5), R6
	0x0024 00036 (issue54467_test.go:22)	ADD	R4<<4, R0, R4
	0x0028 00040 (issue54467_test.go:22)	MOVD	8(R4), R7
	0x002c 00044 (issue54467_test.go:22)	MOVD	R6, p..autotmp_5-16(SP)
	0x0030 00048 (issue54467_test.go:22)	MOVD	R7, p..autotmp_5-8(SP)
	0x0034 00052 (issue54467_test.go:22)	LSL	$4, R3, R6
	0x0038 00056 (issue54467_test.go:22)	MOVD	(R0)(R6), R7
	0x003c 00060 (issue54467_test.go:22)	ADD	R3<<4, R0, R8
	0x0040 00064 (issue54467_test.go:22)	MOVD	8(R8), R9
	0x0044 00068 (issue54467_test.go:22)	MOVD	R7, (R0)(R5)
	0x0048 00072 (issue54467_test.go:22)	MOVD	R9, 8(R4)
	0x004c 00076 (issue54467_test.go:22)	MOVD	p..autotmp_5-16(SP), R4
	0x0050 00080 (issue54467_test.go:22)	MOVD	p..autotmp_5-8(SP), R5
	0x0054 00084 (issue54467_test.go:22)	MOVD	R4, (R0)(R6)
	0x0058 00088 (issue54467_test.go:22)	MOVD	R5, 8(R8)
	0x005c 00092 (issue54467_test.go:21)	ADD	$1, R3, R3
	0x0060 00096 (issue54467_test.go:21)	CMP	R3, R1
	0x0064 00100 (issue54467_test.go:21)	BGT	24

The inner loop with tip looks like this:

	0x0018 00024 (/workdir/issue54467/issue54467_test.go:22)	SUB	$1, R3, R4
	0x001c 00028 (/workdir/issue54467/issue54467_test.go:22)	ADD	R4<<4, R0, R4
	0x0020 00032 (/workdir/issue54467/issue54467_test.go:22)	LDP	(R4), (R5, R6)
	0x0024 00036 (/workdir/issue54467/issue54467_test.go:22)	STP	(R5, R6), p..autotmp_5-16(SP)
	0x0028 00040 (/workdir/issue54467/issue54467_test.go:22)	ADD	R3<<4, R0, R5
	0x002c 00044 (/workdir/issue54467/issue54467_test.go:22)	LDP	(R5), (R6, R7)
	0x0030 00048 (/workdir/issue54467/issue54467_test.go:22)	STP	(R6, R7), (R4)
	0x0034 00052 (/workdir/issue54467/issue54467_test.go:22)	LDP	p..autotmp_5-16(SP), (R4, R6)
	0x0038 00056 (/workdir/issue54467/issue54467_test.go:22)	STP	(R4, R6), (R5)
	0x003c 00060 (/workdir/issue54467/issue54467_test.go:21)	ADD	$1, R3, R3
	0x0040 00064 (/workdir/issue54467/issue54467_test.go:21)	CMP	R3, R1
	0x0044 00068 (/workdir/issue54467/issue54467_test.go:21)	BGT	24

Why is the tip inner loop slower? It looks like better code to me. Maybe the multi-load/multi-store ops are slower?

@golang/arm

@randall77 randall77 added this to the Go1.20 milestone Sep 28, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 28, 2022
@randall77
Contributor Author

@W8CYE

@randall77
Contributor Author

@erifan I suspect this is your CL https://go-review.googlesource.com/c/go/+/421654

@erifan

erifan commented Sep 29, 2022

Ok, I'll take a look.

@erifan

erifan commented Sep 29, 2022

I tested it on several different arm64 machines, and it turns out that the LDP/STP version is slower than the LDR/STR version on most of them, but a bit faster on one machine. The main problem here is probably that the addresses for LDP/STP are not 16-byte aligned. Maybe we should check that the address is 16-byte aligned before generating the LDP/STP instruction, but I haven't thought of a good way to do this check.

@W8CYE

W8CYE commented Sep 29, 2022

@erifan, for reference, on the Apple M1 Max based Mac, tip is 106% slower whereas @randall77's benchmark on a different chip/system was 26% slower. This supports your point that it is CPU-dependent, but I wanted to emphasize how significant the difference is on the M1 Max.

With 1.19:

BenchmarkSliceOfArray-10 710150 1681 ns/op

With tip:

BenchmarkSliceOfArray-10 292400 3469 ns/op

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 29, 2022
@erifan

erifan commented Sep 30, 2022

@W8CYE yes, I saw it; I also tested on the M1. But on the M1 the improvement in cases that the LDP optimization helps is equally significant. In most cases we should expect LDP to be a better choice than LDR.

In the above case, in addition to the alignment problem of LDP/STP, there are two other points that affect the benchmark.
First, there is a dependency between overlapping reads and writes. If we change the iteration increment from 1 to 2, the LDP version becomes faster than the LDR version, even though the alignment issues remain. This also shows that LDP instructions are more efficient than LDR.
Second, STP (R5, R6), p..autotmp_5-16(SP) temporarily stores the values of R5 and R6 on the stack. This is actually unnecessary, because arm64 has enough registers. Eliminating this spill can also significantly improve the performance of this benchmark.

So I think the benchmark results above are multifactorial, and both the alignment issue and the unnecessary spill are worth fixing.

@randall77
Contributor Author

So I'm not sure I entirely understand what the problem is here.

Is the issue that we're writing 16 bytes and then reading 16 bytes, offset 8 from the original write, so that the source and destination partially overlap? A performance loss in that case seems acceptable to me, as that's a pretty rare thing to do.

But if this is a case where if the address is not aligned by 16 bytes then it is slow, then that's not good. Part of Go's philosophy is that performance stability is more important than absolute performance. I want to avoid cases where random changes in alignment, possibly triggered by unrelated code changes, cause wild swings in performance.

So I think my default stance here is that we should revert the multi-word load/store CL, unless we can more accurately characterize the performance differences we're seeing and convince ourselves that the loss in performance only happens in cases that wouldn't occur in real code, or would be easy to find and fix when they do occur in real code.

@erifan

erifan commented Oct 12, 2022

But if this is a case where if the address is not aligned by 16 bytes then it is slow, then that's not good.

I'm sure that's not the case. From the Arm64 optimization guide (chapter 4.5) we can see that arm64 CPUs handle most unaligned accesses without performance penalties, except for a few special cases. We can also prove this point with the above case by changing the iteration increment from 1 to 2: the addresses of the LDP instructions are still not 16-byte aligned, but the LDP version performs better than the LDR version.
There are only a few special cases mentioned in the documentation where we need to pay attention to alignment, so for the vast majority of cases we don't need to consider alignment issues.

If the issue is that we're writing to 16 bytes and then reading 16 bytes, offset 8 from the original write so that the source and destination are partially overlapping?

I believe this is the cause of the problem, and it may be related to instruction alignment (I'm not sure). With an Arm-internal tool (EBM) we can see that some LDR/LDP instructions in the loop are restarted in the pipeline (analogous to a branch miss or cache miss). I don't know whether the restart of an LDP is heavier than the restart of an LDR, which could explain the performance difference; these are very low-level hardware details that I don't understand. We can port the code to C to demonstrate that this performance difference is independent of Go: the LDP version implemented in C is also slower than the LDR version. This is the C code:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <stdint.h>
void ldpstp() {
        __asm(
        "mov    x27, #0x3ea0\n\t"
        "sub    x20, sp, x27\n\t"
        "stp    x29, x30, [x20,#-8]\n\t"
        "mov    sp, x20\n\t"
        "sub    x29, sp, #0x8\n\t"
        "mov x1, xzr\n\t"
        "mov x0, #0x2710\n\t"
        "3:\n\t"
        "orr x2, xzr, #0x1\n\t"
        "b 2f\n\t"
        "1:\n\t"
        "sub x3, x2, #0x1\n\t"
        "add x4, sp, #0x18\n\t"
        "add x3, x4, x3, lsl #4\n\t"
        "ldp x5, x6, [x3]\n\t"
        "stp x5, x6, [sp,#8]\n\t"
        "add x5, x4, x2, lsl #4\n\t"
        "ldp x6, x7, [x5]\n\t"
        "stp x6, x7, [x3]\n\t"
        "ldp x3, x6, [sp,#8]\n\t"
        "stp x3, x6, [x5]\n\t"
        "add x2, x2, #0x1\n\t"
        "2:\n\t"
        "cmp x2, #0x3e8\n\t"
        "b.lt 1b\n\t"
        "add x1, x1, #0x1\n\t"
        "cmp x1, x0\n\t"
        "b.lt 3b\n\t"
        "mov x27, #0x3ea0\n\t"
        "add sp, sp, x27\n\t"
        "sub x29, sp, #0x8\n\t"
        "ret\n\t"
        :
        :
        :
        );
}
void ldrstr() {
        __asm(
        "mov    x27, #0x3ea0\n\t"
        "sub    x20, sp, x27\n\t"
        "stp    x29, x30, [x20,#-8]\n\t"
        "mov    sp, x20\n\t"
        "sub    x29, sp, #0x8\n\t"
        "mov x1, xzr\n\t"
        "mov x0, #0x2710\n\t"
        "3:\n\t"
        "orr x2, xzr, #0x1\n\t"
        "b 2f\n\t"
        "1:\n\t"
        "sub x3, x2, #0x1\n\t"
        "lsl x4, x3, #4\n\t"
        "add x5, sp, #0x18\n\t"
        "ldr x6, [x5,x4]\n\t"
        "add x3, x5, x3, lsl #4\n\t"
        "ldr x7, [x3,#8]\n\t"
        "str x6, [sp,#8]\n\t"
        "str x7, [sp,#16]\n\t"
        "lsl x6, x2, #4\n\t"
        "ldr x7, [x5,x6]\n\t"
        "add x8, x5, x2, lsl #4\n\t"
        "ldr x9, [x8,#8]\n\t"
        "str x7, [x5,x4]\n\t"
        "str x9, [x3,#8]\n\t"
        "ldr x3, [sp,#8]\n\t"
        "ldr x4, [sp,#16]\n\t"
        "str x3, [x5,x6]\n\t"
        "str x4, [x8,#8]\n\t"
        "add x2, x2, #0x1\n\t"
        "2:\n\t"
        "cmp x2, #0x3e8\n\t"
        "b.lt 1b\n\t"
        "add x1, x1, #0x1\n\t"
        "cmp x1, x0\n\t"
        "b.lt 3b\n\t"
        "mov x27, #0x3ea0\n\t"
        "add sp, sp, x27\n\t"
        "sub x29, sp, #0x8\n\t"
        "ret\n\t"
        :
        :
        :
        );
}
int main() {
        struct timeval tv_begin, tv_end;
        gettimeofday(&tv_begin, NULL);
        //ldpstp();
        ldrstr();
        gettimeofday(&tv_end, NULL);
        /* include the seconds component so the measurement stays correct
           when the run crosses a second boundary */
        double total_time = (double) (tv_end.tv_sec - tv_begin.tv_sec) * 1000000.0
                          + (double) (tv_end.tv_usec - tv_begin.tv_usec);
        printf("test 1 consumes %.3f us\n", total_time);
}

I think the above case may be exactly the kind of case that touches a hardware pain point, and I believe such cases are a minority. I don't think we need to do anything about this in Go. If we do want to avoid the performance difference, I mentioned two ways earlier:
1. Align LDP/STP addresses to 16 bytes.
2. Eliminate the spill STP (R5, R6), p..autotmp_5-16(SP).

If we don't think we need to do something in Go, we can close this issue directly.

@randall77
Contributor Author

We can also prove this point with the above case by changing the iteration increment from 1 to 2, where the addresses of LDP instructions are not 16-byte aligned, but the LDP version performs better than LDR version.

Could you post the source code for this?

Can we reproduce this using 8-byte operations? That is, issue an 8-byte store and then an 8-byte load that is offset by 4 bytes from the store. If we can, that would give some confidence that this is a generic something-about-the-store-to-load-bypass-when-loads-dont-match-stores effect and not something particular to STP/LDP.

1. Align LDP/STP addresses to 16 bytes.

How would we do that? For autotmps, we could conceivably increase the alignment of SP to 16 bytes for the internal ABI, but I'm not sure how we'd do that anywhere else (e.g. any heap data structure).

2. Eliminate the spill STP (R5, R6), p..autotmp_5-16(SP).

Yes, this is a general problem with the current compiler design (not registerizing arrays with length > 1). It would be nice to fix. I don't see any evidence that this is the cause of the current issue, though. We could try hand-written assembly that avoids this spill/restore and see if that helps?
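One way to approximate that experiment in pure Go (a hypothetical sketch, not code from the thread; moveSliceOfArrayDataScalar is an invented name) is to swap the arrays element by element through scalar temporaries, which the compiler can keep in registers, instead of using whole-array tuple assignment:

```go
type Array [2]int

// moveSliceOfArrayDataScalar performs the same rotate-left-by-one as
// moveSliceOfArrayData, but swaps adjacent arrays element by element
// through scalar temporaries, rather than assigning whole arrays
// (which currently goes through a stack autotmp).
func moveSliceOfArrayDataScalar(s []Array) {
	for i := 1; i < len(s); i++ {
		t0, t1 := s[i-1][0], s[i-1][1]
		s[i-1][0], s[i-1][1] = s[i][0], s[i][1]
		s[i][0], s[i][1] = t0, t1
	}
}
```

Whether this actually avoids the autotmp stack traffic depends on the compiler version; comparing the -S output of the two variants would confirm it.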

@erifan

erifan commented Oct 12, 2022

Could you post the source code for this?

Yeah. I trimmed the above case a bit; this is the Go code:

package datamove

import (
        "testing"
)

func BenchmarkSliceOfArray(b *testing.B) {
        s := make([]Array, 1000)
        for n := 0; n < b.N; n++ {
                moveSliceOfArrayData(s)
        }
}

type Array [2]int

func moveSliceOfArrayData(s []Array) {
        for i := 1; i < len(s)-1; i++ {
                s[i-1], s[i] = s[i], s[i-1]
        }
}

Compiled with go version devel go1.20-9c2fd81ee1 Tue Oct 4 23:29:48 2022 +0000 darwin/arm64.
The assembly code of BenchmarkSliceOfArray:

TEXT command-line-arguments.BenchmarkSliceOfArray(SB) /Users/erifan01/Desktop/datamove_test.go
func BenchmarkSliceOfArray(b *testing.B) {
  0x1000e9390           f9400b90                MOVD 16(R28), R16
  0x1000e9394           d287c41b                MOVD $15904, R27
  0x1000e9398           eb3b63f1                SUBS R27, RSP, R17
  0x1000e939c           54000503                BCC 40(PC)
  0x1000e93a0           eb10023f                CMP R16, R17
  0x1000e93a4           540004c9                BLS 38(PC)
  0x1000e93a8           d287d41b                MOVD $16032, R27
  0x1000e93ac           cb3b63f4                SUB R27, RSP, R20
  0x1000e93b0           a93ffa9d                STP (R29, R30), -8(R20)
  0x1000e93b4           9100029f                MOVD R20, RSP
  0x1000e93b8           d10023fd                SUB $8, RSP, R29
        s := make([]Array, 1000)
  0x1000e93bc           910063f0                ADD $24, RSP, R16
  0x1000e93c0           91400fe1                ADD $(3<<12), RSP, R1
  0x1000e93c4           913a2021                ADD $3720, R1, R1
  0x1000e93c8           a8817e1f                STP.P (ZR, ZR), 16(R16)
  0x1000e93cc           eb01021f                CMP R1, R16
  0x1000e93d0           54ffffcd                BLE -2(PC)
  0x1000e93d4           aa1f03e1                MOVD ZR, R1
        for n := 0; n < b.N; n++ {
  0x1000e93d8           14000002                JMP 2(PC)
  0x1000e93dc           91000421                ADD $1, R1, R1
  0x1000e93e0           f940cc02                MOVD 408(R0), R2
  0x1000e93e4           eb02003f                CMP R2, R1
  0x1000e93e8           5400006a                BGE 3(PC)
                moveSliceOfArrayData(s)
  0x1000e93ec           b24003e2                ORR $1, ZR, R2
        for i := 1; i < len(s)-1; i += 1 {
  0x1000e93f0           14000010                JMP 16(PC)
}
  0x1000e93f4           d287d41b                MOVD $16032, R27
  0x1000e93f8           8b3b63ff                ADD R27, RSP, RSP
  0x1000e93fc           d10023fd                SUB $8, RSP, R29
  0x1000e9400           d65f03c0                RET
                s[i-1], s[i] = s[i], s[i-1]
  0x1000e9404           d1000443                SUB $1, R2, R3
  0x1000e9408           910063e4                ADD $24, RSP, R4
  0x1000e940c           8b031083                ADD R3<<4, R4, R3
  0x1000e9410           a9401865                LDP (R3), (R5, R6)
  0x1000e9414           a9009be5                STP (R5, R6), 8(RSP)
  0x1000e9418           8b021085                ADD R2<<4, R4, R5
  0x1000e941c           a9401ca6                LDP (R5), (R6, R7)
  0x1000e9420           a9001c66                STP (R6, R7), (R3)
  0x1000e9424           a9409be3                LDP 8(RSP), (R3, R6)
  0x1000e9428           a90018a3                STP (R3, R6), (R5)
        for i := 1; i < len(s)-1; i += 1 {
  0x1000e942c           91000442                ADD $1, R2, R2
  0x1000e9430           f10f9c5f                CMP $999, R2
  0x1000e9434           54fffe8b                BLT -12(PC)
  0x1000e9438           17ffffe9                JMP -23(PC)
func BenchmarkSliceOfArray(b *testing.B) {
  0x1000e943c           f90007e0                MOVD R0, 8(RSP)
  0x1000e9440           aa1e03e3                MOVD R30, R3
  0x1000e9444           97fde44b                CALL runtime.morestack_noctxt.abi0(SB)
  0x1000e9448           f94007e0                MOVD 8(RSP), R0
  0x1000e944c           17ffffd1                JMP command-line-arguments.BenchmarkSliceOfArray(SB)

Then we change the iteration increment from 1 to 2; the Go code is as follows:

package datamove

import (
        "testing"
)

func BenchmarkSliceOfArray(b *testing.B) {
        s := make([]Array, 1000)
        for n := 0; n < b.N; n++ {
                moveSliceOfArrayData(s)
        }
}

type Array [2]int

func moveSliceOfArrayData(s []Array) {
        for i := 1; i < len(s)-1; i += 2 {
                s[i-1], s[i] = s[i], s[i-1]
        }
}

And this is the corresponding assembly code:

TEXT command-line-arguments.BenchmarkSliceOfArray(SB) /Users/erifan01/Desktop/datamove_test.go
func BenchmarkSliceOfArray(b *testing.B) {
  0x1000e9390           f9400b90                MOVD 16(R28), R16
  0x1000e9394           d287c41b                MOVD $15904, R27
  0x1000e9398           eb3b63f1                SUBS R27, RSP, R17
  0x1000e939c           54000503                BCC 40(PC)
  0x1000e93a0           eb10023f                CMP R16, R17
  0x1000e93a4           540004c9                BLS 38(PC)
  0x1000e93a8           d287d41b                MOVD $16032, R27
  0x1000e93ac           cb3b63f4                SUB R27, RSP, R20
  0x1000e93b0           a93ffa9d                STP (R29, R30), -8(R20)
  0x1000e93b4           9100029f                MOVD R20, RSP
  0x1000e93b8           d10023fd                SUB $8, RSP, R29
        s := make([]Array, 1000)
  0x1000e93bc           910063f0                ADD $24, RSP, R16
  0x1000e93c0           91400fe1                ADD $(3<<12), RSP, R1
  0x1000e93c4           913a2021                ADD $3720, R1, R1
  0x1000e93c8           a8817e1f                STP.P (ZR, ZR), 16(R16)
  0x1000e93cc           eb01021f                CMP R1, R16
  0x1000e93d0           54ffffcd                BLE -2(PC)
  0x1000e93d4           aa1f03e1                MOVD ZR, R1
        for n := 0; n < b.N; n++ {
  0x1000e93d8           14000002                JMP 2(PC)
  0x1000e93dc           91000421                ADD $1, R1, R1
  0x1000e93e0           f940cc02                MOVD 408(R0), R2
  0x1000e93e4           eb02003f                CMP R2, R1
  0x1000e93e8           5400006a                BGE 3(PC)
                moveSliceOfArrayData(s)
  0x1000e93ec           b24003e2                ORR $1, ZR, R2
        for i := 1; i < len(s)-1; i += 2 {
  0x1000e93f0           14000010                JMP 16(PC)
}
  0x1000e93f4           d287d41b                MOVD $16032, R27
  0x1000e93f8           8b3b63ff                ADD R27, RSP, RSP
  0x1000e93fc           d10023fd                SUB $8, RSP, R29
  0x1000e9400           d65f03c0                RET
                s[i-1], s[i] = s[i], s[i-1]
  0x1000e9404           d1000443                SUB $1, R2, R3
  0x1000e9408           910063e4                ADD $24, RSP, R4
  0x1000e940c           8b031083                ADD R3<<4, R4, R3
  0x1000e9410           a9401865                LDP (R3), (R5, R6)
  0x1000e9414           a9009be5                STP (R5, R6), 8(RSP)
  0x1000e9418           8b021085                ADD R2<<4, R4, R5
  0x1000e941c           a9401ca6                LDP (R5), (R6, R7)
  0x1000e9420           a9001c66                STP (R6, R7), (R3)
  0x1000e9424           a9409be3                LDP 8(RSP), (R3, R6)
  0x1000e9428           a90018a3                STP (R3, R6), (R5)
        for i := 1; i < len(s)-1; i += 2 {
  0x1000e942c           91000842                ADD $2, R2, R2
  0x1000e9430           f10f9c5f                CMP $999, R2
  0x1000e9434           54fffe8b                BLT -12(PC)
  0x1000e9438           17ffffe9                JMP -23(PC)
func BenchmarkSliceOfArray(b *testing.B) {
  0x1000e943c           f90007e0                MOVD R0, 8(RSP)
  0x1000e9440           aa1e03e3                MOVD R30, R3
  0x1000e9444           97fde44b                CALL runtime.morestack_noctxt.abi0(SB)
  0x1000e9448           f94007e0                MOVD 8(RSP), R0
  0x1000e944c           17ffffd1                JMP command-line-arguments.BenchmarkSliceOfArray(SB)

And we can see there's almost no difference except for the following two lines:

0x1000e942c           91000442                ADD $1, R2, R2  // i++
0x1000e942c           91000842                ADD $2, R2, R2  // i += 2

And the performance test results on an M1 Mac:

$ benchstat go1.19.plus1 master.plus1
name            old time/op  new time/op  delta
SliceOfArray-8  1.69µs ± 0%  3.50µs ± 0%  +106.23%  (p=0.008 n=5+5)

$ benchstat go1.19.plus2 master.plus2
name            old time/op  new time/op  delta
SliceOfArray-8   547ns ± 0%   499ns ± 0%  -8.73%  (p=0.008 n=5+5)

The core code is the following lines:

  0x1000e9404           d1000443                SUB $1, R2, R3
  0x1000e9408           910063e4                ADD $24, RSP, R4  // 24 is not 16-byte aligned
  0x1000e940c           8b031083                ADD R3<<4, R4, R3
  0x1000e9410           a9401865                LDP (R3), (R5, R6)
  0x1000e9414           a9009be5                STP (R5, R6), 8(RSP)
  0x1000e9418           8b021085                ADD R2<<4, R4, R5
  0x1000e941c           a9401ca6                LDP (R5), (R6, R7)
  0x1000e9420           a9001c66                STP (R6, R7), (R3)
  0x1000e9424           a9409be3                LDP 8(RSP), (R3, R6) // 8 is not 16-byte aligned
  0x1000e9428           a90018a3                STP (R3, R6), (R5)

Since RSP is 16-byte aligned, the addresses operated on by LDP/STP here are not 16-byte aligned, yet the performance changed a lot, because i += 2 eliminates the overlap between the loads and stores of adjacent iterations.

Can we reproduce this using 8-byte operations? That is, issue an 8-byte store and then an 8-byte load that is offset by 4 bytes from the store.

Sorry, I don't quite get your point; do you want to see whether eliminating the overlap also has an effect on LDR? If yes, then the answer is yes. We can see this from the above test data: go1.19.plus2 runs half the number of iterations compared to go1.19.plus1, but the performance improves by about 3x. If we set the number of iterations to be the same, the performance still improves by 35% after eliminating the overlap. Therefore, eliminating the overlap also benefits LDR; it's not something particular to STP/LDP.

1. Align LDP/STP addresses to 16 bytes.

How would we do that?

Yes, this is hard to do, especially for heap addresses. It was just an option; I don't think we should do it, because it's not necessary in most cases.

I don't see any evidence that this is the cause of the current issue, though. We could try hand-written assembly that avoids this spill/restore and see if that helps?

Yes, this is not the cause of the issue; it is just a workaround that eliminates the performance gap. I have verified this with the above C code, and it works.

@randall77
Contributor Author

Sorry I don't quite get your point, do you want to see if eliminating the overlap also has an effect on LDR

Yes, something like this:

type Array [2]uint32

func moveSliceOfArrayData(s []Array) {
        for i := 1; i < len(s)-1; i++ {
                s[i-1], s[i] = s[i], s[i-1]
        }
}

That leads to 8-byte load/store instructions that overlap in the same way.

go1.19.plus2 has half the number of iterations compared to go1.19.plus1, but the performance is improved by 3 times.

So for LDP/STP, it is similar but the effect is larger? Its performance is improved by ~7x by halving the iteration count.

Thanks for doing all these experiments, it is really helping me understand this issue.

So it does seem to me that there's something going on where doing an LDP of a recently STP'd address is slow. The equivalent 8-byte version is also slow, but less severely so. I suspect it has something to do with reading data that is still in the write buffer.
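A pure-Go kernel along these lines (a hypothetical sketch, not from the thread; overlapKernel is an invented name, and whether the compiler actually emits STP/LDP here should be confirmed with -S) stores a 16-byte pair and then immediately loads a pair offset by 8 bytes from the store:

```go
// overlapKernel stores two adjacent ints (16 bytes, which may compile to
// a single STP) and then loads two ints offset by 8 bytes from that store
// (which may compile to an LDP), stressing store-to-load forwarding.
// buf must have at least 3 elements.
func overlapKernel(buf []int, iters int) int {
	acc := 0
	for n := 0; n < iters; n++ {
		buf[0], buf[1] = n, n+1 // 16-byte store
		acc += buf[1] + buf[2]  // load overlapping the store by 8 bytes
	}
	return acc
}
```

Timing this against a variant that reloads from buf[2] and buf[3] (no overlap with the store) would help separate the forwarding penalty from the cost of the pair instructions themselves.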

@randall77
Contributor Author

I've done some playing around with this. As near as I can determine, LDP/STP is slower when loading something that has just been stored, and faster otherwise. I think that performance quirk is probably OK, as it can likely be worked around by keeping the values in registers instead.

So I'm going to close this issue as unfortunate: @erifan's CL is slower in the OP's case, but faster overall. If someone sees a better way to help the OP's case without just turning off LDP/STP altogether, please reopen.

Thanks everyone for all the help.

@erifan

erifan commented Oct 13, 2022

So for LDP/STP, it is similar but the effect is larger? Its performance is improved by ~7x by halving the iteration count.

Yeah, it seems that both the performance gains and the performance drops of LDP/STP are larger than those of LDR/STR.

@golang golang locked and limited conversation to collaborators Oct 13, 2023