runtime: memmove performance on arm64 for unaligned copies is poor on some CPUs #40324

AWSjswinney · 2020-07-21T02:35:04Z

The following is intended to start a discussion about the performance of memmove on arm64 and the pros and cons of implementing micro-architecture-aware flags to achieve better performance across different CPUs. A pull request making the changes tested in the data below can be found here: https://go-review.googlesource.com/c/go/+/243357

What version of Go are you using (`go version`)?

tip of master branch (at the time of writing)

$ go version
go version devel +4469f5446a Wed Jul 15 21:52:49 2020 +0000 linux/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

arch: arm64
os: Amazon Linux 2

go env

$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/ec2-user/.cache/go-build"
GOENV="/home/ec2-user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ec2-user/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ec2-user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ec2-user/go-compiler-test"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ec2-user/go-compiler-test/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build862296855=/tmp/go-build -gno-record-gcc-switches"

What did you do?

cd src/
go test runtime -test.run "TestMemmove" -test.bench="BenchmarkMemmove" -test.count=10 -test.timeout=90m

What did you expect to see?

I expected to see good performance across a range of systems on all of the memmove benchmarks.

What did you see instead?

On some CPUs, unaligned copies showed poor performance.
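To reproduce an unaligned copy outside the runtime benchmarks, one option is to carve a slice out of a larger buffer so that it starts at a chosen offset from an alignment boundary. The sketch below is only an illustration: the helper names `unalignedSlice` and `addrMod16` are hypothetical, and the 16-byte boundary is an assumption rather than something stated in this issue.

```go
package main

import "unsafe"

// align is the boundary assumed by this sketch (not taken from the issue).
const align = 16

// addrMod16 reports the address of b's first byte modulo align.
// b must not be empty.
func addrMod16(b []byte) int {
	return int(uintptr(unsafe.Pointer(&b[0])) & (align - 1))
}

// unalignedSlice (hypothetical helper) returns an n-byte slice whose start
// address is exactly off bytes past an align-byte boundary.
// off must be in the range [0, align).
func unalignedSlice(n, off int) []byte {
	// Over-allocate so there is room to skip to a boundary and then add off.
	buf := make([]byte, n+2*align)
	// Distance from the start of buf to the next boundary (0 if already aligned).
	toBoundary := (align - addrMod16(buf)) & (align - 1)
	start := toBoundary + off
	return buf[start : start+n : start+n]
}
```

A benchmark could then call `copy(dst, src)` on pairs of such slices with different offsets for the source and the destination.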

Data

The CPUs I used for testing are:

AWS Graviton 2, c6g.xlarge, Neoverse-N1
AWS Graviton 1, a1.xlarge, Cortex-A72
Raspberry Pi 3, Cortex-A53

The new implementation has several new optimizations, as noted in the commit message:

This implementation makes use of new optimizations:
 - software pipelined loop for large (>128 byte) moves
 - medium size moves (17..128 bytes) have a new implementation
 - address realignment when src or dst is unaligned
 - preference for aligned src (loads) or dst (stores) depending on
   micro-architecture
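The realignment idea in the list above can be sketched in Go, although the actual change is written in arm64 assembly. This is a minimal illustration under stated assumptions: the names `headBytes` and `copyRealigned` are hypothetical, the 16-byte boundary is assumed rather than taken from the commit, and the two buffers are assumed not to overlap (a real memmove must handle overlap).

```go
package main

import "unsafe"

// align is an assumed boundary; the issue does not state which boundary
// the assembly realigns to.
const align = 16

// headBytes reports how many bytes separate addr from the next multiple of
// align. It returns 0 when addr is already aligned.
func headBytes(addr uintptr) int {
	return int((align - addr&(align-1)) & (align - 1))
}

// copyRealigned (hypothetical) copies src into dst in two steps: a short
// head that brings the chosen pointer up to an aligned address, then the
// remainder. alignDst selects whether dst (stores) or src (loads) is the
// pointer being aligned. dst and src must not overlap.
// It returns the number of bytes copied.
func copyRealigned(dst, src []byte, alignDst bool) int {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return 0
	}
	p := uintptr(unsafe.Pointer(&src[0]))
	if alignDst {
		p = uintptr(unsafe.Pointer(&dst[0]))
	}
	head := headBytes(p)
	if head > n {
		head = n
	}
	copy(dst[:head], src[:head])   // unaligned head
	copy(dst[head:n], src[head:n]) // body: the chosen pointer is now aligned
	return n
}
```

Only one of the two pointers can be aligned in general, which is why the choice between loads and stores matters when source and destination have different offsets.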

Aligned copies

For aligned copies, performance improved by about 5% for the largest copies. The implementation without realignment performs slightly better, since it omits the overhead of checking the flag and performing the realignment.

name                               old time/op    new time/op     delta
Memmove/2048-4                       62.1ns ± 0%     59.0ns ± 0%   -4.99%  (p=0.000 n=10+10)
Memmove/4096-4                        117ns ± 1%      110ns ± 0%   -5.66%  (p=0.000 n=10+10)
name                               old speed      new speed       delta
Memmove/2048-4                     33.0GB/s ± 0%   34.7GB/s ± 0%   +5.36%  (p=0.000 n=10+10)
Memmove/4096-4                     35.2GB/s ± 0%   37.1GB/s ± 0%   +5.56%  (p=0.000 n=10+9)

On Cortex-A53, the improvements were larger. This could be because the A53 executes in order, so it benefits more from the manually reordered loads and stores in this implementation.

name                               old time/op    new time/op    delta
Memmove/512-4                         127ns ± 0%     110ns ± 0%  -13.39%  (p=0.000 n=8+8)
Memmove/1024-4                        222ns ± 0%     205ns ± 1%   -7.66%  (p=0.000 n=7+10)
Memmove/2048-4                        411ns ± 0%     366ns ± 0%  -10.98%  (p=0.000 n=8+9)
Memmove/4096-4                        795ns ± 1%     695ns ± 1%  -12.63%  (p=0.000 n=10+10)
name                               old speed      new speed      delta
Memmove/512-4                      4.03GB/s ± 0%  4.66GB/s ± 0%  +15.47%  (p=0.000 n=8+8)
Memmove/1024-4                     4.62GB/s ± 0%  5.00GB/s ± 1%   +8.17%  (p=0.000 n=8+10)
Memmove/2048-4                     4.98GB/s ± 0%  5.59GB/s ± 0%  +12.22%  (p=0.000 n=8+9)
Memmove/4096-4                     5.15GB/s ± 1%  5.90GB/s ± 1%  +14.51%  (p=0.000 n=10+10)

Unaligned Copies

For unaligned copies, the difference is more apparent. First, we look at a Neoverse N1 CPU.
The following compares:

 - the proposed implementation (with the CPUID flag)
 - realignment with the destination pointer (the wrong choice for this CPU)
 - the proposed implementation without the CPUID flag and realignment
 - the existing implementation

Next, on a Cortex-A72, the opposite alignment choice produces the best results.
The following compares:

 - the proposed implementation (with the CPUID flag)
 - realignment with the source pointer (the wrong choice for this CPU)
 - the proposed implementation without the CPUID flag and realignment
 - the existing implementation

Conclusion

For the proposed implementation, I chose to use CPUID to determine the micro-architecture and select the best-performing move for the target CPU. Since two of the test CPUs have opposite behavior regarding alignment, this flag allows both CPUs (and hopefully others as well) to achieve good performance.
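As a rough illustration of this kind of selection, the sketch below maps a CPU identification value to an alignment preference. It assumes the identification comes from the arm64 MIDR_EL1 register, whose implementer field is bits [31:24] and whose part number is bits [15:4]. The type `alignPref` and the function `prefFromMIDR` are hypothetical and do not reproduce the code in the CL; the Neoverse N1 and Cortex-A72 choices follow the results described above, while the fallback for other CPUs (including Cortex-A53) is an assumption.

```go
package main

// alignPref is a hypothetical flag describing which pointer to realign.
type alignPref int

const (
	noPref   alignPref = iota // no realignment (assumed fallback)
	alignSrc                  // realign the source pointer (aligned loads)
	alignDst                  // realign the destination pointer (aligned stores)
)

// MIDR_EL1 implementer code and part numbers for Arm-designed cores.
const (
	implArm        = 0x41
	partCortexA72  = 0xd08
	partNeoverseN1 = 0xd0c
)

// prefFromMIDR decodes the implementer and part number fields of a
// MIDR_EL1 value and returns the preference suggested by the data in this
// issue: source alignment on Neoverse N1, destination alignment on
// Cortex-A72. Everything else falls back to noPref in this sketch.
func prefFromMIDR(midr uint64) alignPref {
	impl := (midr >> 24) & 0xff
	part := (midr >> 4) & 0xfff
	if impl != implArm {
		return noPref
	}
	switch part {
	case partNeoverseN1:
		return alignSrc
	case partCortexA72:
		return alignDst
	}
	return noPref
}
```

Doing the lookup once at startup and storing the result in a flag keeps the per-call cost down to a single flag check, which matches the overhead mentioned in the aligned-copy results.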

agnivade · 2020-07-21T04:19:27Z

/cc @cherrymui @zhangfannie

cagedmantis · 2021-01-07T22:31:10Z

https://golang.org/cl/243357 has been submitted. Is there any additional work that should be done which is preventing this issue from being closed?

AWSjswinney · 2021-01-08T15:35:07Z

Yes this issue can be closed.

agnivade added the NeedsInvestigation label Jul 21, 2020

ALTree added the Performance label Jul 21, 2020

AWSjswinney closed this as completed Jan 8, 2021

golang locked and limited conversation to collaborators Jan 8, 2022

gopherbot added the FrozenDueToAge label Jan 8, 2022
