cmd/compile: slower performance of array copies on arm64 #55920
Comments
@erifan I suspect this is your CL https://go-review.googlesource.com/c/go/+/421654
Ok, I'll take a look.
I tested it on several different arm64 machines, and it turns out that the LDP/STP version is slower than the LDR/STR version on most machines, but a bit faster on one machine. The main problem here seems to be that the addresses for LDP/STP are not 16-byte aligned. Maybe we should check that the address is 16-byte aligned before generating the LDP/STP instruction, but I haven't thought of a good way to do this check.
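For illustration, the run-time form of such a check is trivial (hypothetical helper, not compiler code); the hard part is that the compiler would have to prove alignment statically before emitting LDP/STP:

```go
package main

import "fmt"

// aligned16 reports whether addr is 16-byte aligned. Checking this at
// run time is a one-liner; the difficulty described above is proving
// it at compile time for an arbitrary address.
func aligned16(addr uintptr) bool {
	return addr&15 == 0
}

func main() {
	fmt.Println(aligned16(0x1000), aligned16(0x1008)) // prints: true false
}
```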
@erifan, for reference, on the Apple M1 Max based Mac, tip is 106% slower whereas @randall77's benchmark on a different chip/system was 26% slower. This supports your point that it is CPU-dependent, but I wanted to emphasize how significant the difference is on the M1 Max. With 1.19:
With tip:
@W8CYE yes, I saw it; I also tested it on M1. But the improvement in the cases optimized with LDP on M1 is equally significant. For most cases we should expect LDP to be a better choice than LDR. In the above case, in addition to the LDP/STP alignment problem, there are two other factors that affect the benchmark. So I think the benchmarks above are multifactorial, and the alignment issues and unnecessary spills are worth fixing.
So I'm not sure I entirely understand what the problem is here. Is the issue that we're writing 16 bytes and then reading 16 bytes offset 8 from the original write, so that the source and destination partially overlap? Performance loss in that case seems acceptable to me, as that's a pretty rare thing to do. But if this is a case where any address not aligned to 16 bytes is slow, then that's not good. Part of Go's philosophy is that performance stability is more important than absolute performance. I want to avoid cases where random changes in alignment, possibly triggered by unrelated code changes, cause wild swings in performance. So I think my default stance here is that we should revert the multi-word load/store CL, unless we can more accurately characterize the performance differences we're seeing and convince ourselves that the loss in performance only happens in cases that wouldn't happen in real code, or would be easy to find and fix when they do happen in real code.
I'm sure that's not the case. From the Arm64 optimization guide (chapter 4.5) we can see that Arm64 CPUs handle most unaligned accesses without performance penalties, except for a few special cases. We can also prove this point with the above case by changing the iteration increment from 1 to 2, so that the addresses of the LDP instructions are not 16-byte aligned, yet the LDP version still performs better than the LDR version.
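The benchmark itself isn't reproduced in this thread; purely as a sketch of the increment change being described (invented names, not the original code), the two variants differ only in the loop step:

```go
package main

import "fmt"

// shiftBy1 copies each element down from its neighbor; consecutive
// iterations touch adjacent 8-byte slots, the kind of pattern the
// compiler can merge into 16-byte LDP/STP pairs.
func shiftBy1(a []uint64) {
	for i := len(a) - 1; i > 0; i-- {
		a[i] = a[i-1]
	}
}

// shiftBy2 is the same body with the increment changed from 1 to 2,
// which changes which addresses the loads and stores land on.
func shiftBy2(a []uint64) {
	for i := len(a) - 1; i > 1; i -= 2 {
		a[i] = a[i-1]
	}
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{1, 2, 3, 4}
	shiftBy1(a)
	shiftBy2(b)
	fmt.Println(a, b) // prints: [1 1 2 3] [1 2 3 3]
}
```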
I believe this is the cause of the problem, and the cause may be related to instruction alignment (I'm not sure). With an Arm internal tool, EBM, we see that some LDR/LDP instructions in the loop are restarted in the pipeline (analogous to a branch miss or cache miss). I don't know whether the restart of an LDP is heavier than the restart of an LDR, which would explain the performance difference. These are some very low-level hardware details that I don't understand. We can convert the code to C to demonstrate that this performance difference is independent of Go: the LDP version implemented in C is also slower than the LDR version. This is the C code:
I think the above cases may be exactly the kind that touch hardware pain points, and I believe they are a minority. I don't think we need to do anything in Go about this. If we do want to avoid the performance difference, I mentioned two ways earlier. If we don't think we need to do something in Go, we can close this issue directly.
Could you post the source code for this? Can we reproduce this using 8-byte operations? That is, issue an 8-byte store and then an 8-byte load that is offset by 4 bytes from the store. If we can, that would give some confidence that this is a generic something-about-the-store-to-load-bypass-when-loads-dont-match-stores effect and not something particular to STP/LDP.
How would we do that? For autotmps, we could conceivably increase the alignment of SP to 16 bytes for the internal ABI, but I'm not sure how we'd do that anywhere else (e.g. any heap data structure).
Yes, this is a general problem with the current compiler design (not registerizing arrays with length > 1). It would be nice to fix. I don't see any evidence that this is the cause of the current issue, though. We could try hand-written assembly that avoids this spill/restore and see if that helps?
Yeah. I trimmed the above case a bit, this is the Go code.
Compiled with
Then we change the iteration increment from 1 to 2; the Go code is as follows:
And this is the corresponding assembly code:
And we can see there's almost no difference except for the following two lines:
And the performance test result on m1 mac:
The core code is the following lines:
RSP is 16-byte aligned, so the addresses operated on by LDP/STP here are not 16-byte aligned, yet the performance has changed a lot.
Sorry, I don't quite get your point. Do you want to see whether eliminating the overlap also has an effect on LDR? If so, the answer is yes; we can see this from the above test data.
Yes, this is hard to do, especially for heap addresses. This is just an option, and I think we're not going to do it, because it's not necessary in most cases.
Yes, this is not the cause of the issue. This is just a workaround to eliminate the performance gap. I have verified this with the above C code, and it works.
Yes, something like this:
That leads to 8-byte load/store instructions that overlap in the same way.
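A portable sketch of that 8-byte experiment (hypothetical code; the real test was presumably a tight benchmark loop): store an 8-byte word, then immediately load 8 bytes offset 4 from it, so the load straddles the freshly written data.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// storeThenOverlappingLoad writes an 8-byte word at offset 0, then
// immediately reads 8 bytes starting at offset 4. The load overlaps
// the last four bytes of the store, the pattern suspected of defeating
// store-to-load forwarding; to measure it, this body would run inside
// a testing.B loop.
func storeThenOverlappingLoad(buf []byte, v uint64) uint64 {
	binary.LittleEndian.PutUint64(buf[0:8], v)
	return binary.LittleEndian.Uint64(buf[4:12])
}

func main() {
	buf := make([]byte, 16)
	fmt.Printf("%#x\n", storeThenOverlappingLoad(buf, 0x1122334455667788)) // prints: 0x11223344
}
```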
So for LDP/STP it is similar, but the effect is larger? Its performance is improved by ~7x by halving the iteration count. Thanks for doing all these experiments, it is really helping me understand this issue. So it does seem to me that there's something going on where doing an LDP of a recently STP'd address is slow. The equivalent 8-byte version is also slow, but less severely so. I suspect it involves reading data that is still in the write buffer.
I've done some playing around with this. As near as I can determine, LDP/STP is slower when loading something that has just been stored, and faster otherwise. I think that performance quirk is probably ok, as it can likely be worked around by keeping the values in registers instead. So I'm going to close this issue as unfortunate - @erifan's CL is slower in the OP's case, but faster overall. If someone sees a better way to help the OP's case without just turning off LDP/STP altogether, please reopen. Thanks everyone for all the help.
Yeah, it seems that both the performance gains and the performance drops of LDP/STP are larger than those of LDR/STR.
Originally came up in issue #54467 .
If you run this program with
go test -bench=.* issue54467_test.go
on an arm64 machine, it is slower on tip than on 1.19. With 1.19:
With tip:
(Note: don't compare with 1.19.1, it has a separate issue that makes this code slow. Use 1.19.0.)
The inner loop with 1.19 looks like this:
The inner loop with tip looks like this:
Why is the tip inner loop slower? It looks like better code to me. Maybe the multi-load/multi-store ops are slower?
@golang/arm