x/mobile: Go running much slower than CPP on ARM32 / ARM64 #42798
Comments
Your Go code uses 64-bit sqrt while the C code uses 32-bit sqrt. |
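To illustrate the point about sqrt width, here is a minimal sketch of the same norm computed in float32 and in float64. The function names `l2Norm32` and `l2Norm64` are hypothetical, since the issue's original `L2Norm` code is not shown here. `math.Sqrt` only accepts float64, so the float32 version converts on the way in and out; whether the compiler lowers that round trip to a single-precision instruction on a given target is an assumption worth checking in the generated assembly.

```go
package main

import (
	"fmt"
	"math"
)

// l2Norm32 returns the Euclidean norm of xs, accumulating in float32.
// Hypothetical helper: the issue's own L2Norm is not reproduced here.
func l2Norm32(xs []float32) float32 {
	var sum float32
	for _, x := range xs {
		sum += x * x
	}
	// math.Sqrt takes and returns float64, so convert around the call.
	return float32(math.Sqrt(float64(sum)))
}

// l2Norm64 is the same computation carried out entirely in float64.
func l2Norm64(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x * x
	}
	return math.Sqrt(sum)
}

func main() {
	fmt.Println(l2Norm32([]float32{3, 4}))
	fmt.Println(l2Norm64([]float64{3, 4}))
}
```

Comparing a float32 Go function against a float32 C function (or float64 against float64) keeps the benchmark measuring the same work on both sides.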
@AlexRouSg Yes, I edited the code above. Since it's a large part, I edited the question directly. It's my fault for not testing further...
|
hmmm for ARM32 try setting the
@gopherbot please add labels Performance, NeedsInvestigation |
Both of the funcs have an extra bounds check that might be part of the reason. Compare https://go.godbolt.org/z/eGne9o.
Rest of the difference probably comes from different optimizations https://c.godbolt.org/z/dTdrfo. |
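The bounds-check remark can be sketched as follows. This is an illustration under assumptions, not the code from the godbolt links: `scaleIndexed` and `scaleResliced` are hypothetical names, and the first one assumes the Go function took the length as a separate `dim` argument the way the C version does.

```go
package main

import "fmt"

// scaleIndexed mirrors the C loop: the length arrives separately, so the
// compiler cannot prove i < len(src) and may check bounds on each access.
func scaleIndexed(src []float32, dim int, k float32) {
	for i := 0; i < dim; i++ {
		src[i] *= k
	}
}

// scaleResliced does one bounds check up front (src[:dim] panics if dim is
// out of range); inside the range loop the index is known to be valid.
func scaleResliced(src []float32, dim int, k float32) {
	src = src[:dim]
	for i := range src {
		src[i] *= k
	}
}

func main() {
	a := []float32{1, 2, 3}
	b := []float32{1, 2, 3}
	scaleIndexed(a, len(a), 2)
	scaleResliced(b, len(b), 2)
	fmt.Println(a, b)
}
```

Building with `go build -gcflags=-d=ssa/check_bce/debug=1` should report the bounds checks that remain after optimization, which is one way to confirm whether a rewrite like this helped.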
I edited the code above according to your advice.
The compiler output does show something; I'm still trying to understand it. |
Looking at gcc & clang output https://c.godbolt.org/z/vo7nhz, it seems they are unrolling the loop. (Notice the multiple calls to |
Here are manually unrolled Go versions https://play.golang.org/p/O3gUjWq4aMi. Results from RaspberryPi 4 (using arm32), Go 1.15.5, gcc 8.3.0:
|
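For readers without access to the playground link, manual unrolling of the scale loop could look roughly like the sketch below. It is not one of the linked variants; `scaleUnroll4` is a hypothetical name and the factor of four is an arbitrary choice for illustration.

```go
package main

import "fmt"

// scaleUnroll4 multiplies every element of xs by k, four elements per
// iteration, then finishes the remaining tail one element at a time.
func scaleUnroll4(xs []float32, k float32) {
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		// A fixed-length subslice lets the compiler check bounds once
		// for the whole group instead of once per element.
		v := xs[i : i+4 : i+4]
		v[0] *= k
		v[1] *= k
		v[2] *= k
		v[3] *= k
	}
	for ; i < len(xs); i++ {
		xs[i] *= k
	}
}

func main() {
	xs := []float32{1, 2, 3, 4, 5, 6}
	scaleUnroll4(xs, 2)
	fmt.Println(xs) // [2 4 6 8 10 12]
}
```

Unrolling trades code size for fewer loop-control instructions per element; whether it pays off depends on the target CPU, so it is worth measuring each variant.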
I tested your code on
[Go scale] Repeat: [200], Total: 6.441751ms, Mean: 32.208µs
[Go scale_range] Repeat: [200], Total: 5.249708ms, Mean: 26.248µs
[C scale] Repeat: [200], Total: 3.955ms, Mean: 19.775µs
[Go scale_unroll1] Repeat: [200], Total: 10.919709ms, Mean: 54.598µs
[Go scale_unroll2] Repeat: [200], Total: 3.766ms, Mean: 18.83µs
[Go scale_unroll3] Repeat: [200], Total: 3.391792ms, Mean: 16.958µs
[Go scale_unroll4] Repeat: [200], Total: 2.937959ms, Mean: 14.689µs
I also did some assembly tests on
|
I guess the issue can be narrowed down to: Go on ARM doesn't use indexed loads. https://go.godbolt.org/z/aovTe1 |
Maybe fmovs needs separate rules at https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/ARM64.rules#L772 Indexed loads work properly for uint32.
|
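One way to reproduce the float32 versus uint32 comparison locally is to compile two otherwise identical loops and read the assembly. The sketch below is an assumption about how to set that up, not the code behind the godbolt link; `scaleF32` and `scaleU32` are hypothetical names.

```go
package main

import "fmt"

// scaleF32 is the floating-point loop discussed in the thread.
func scaleF32(xs []float32, k float32) {
	for i := range xs {
		xs[i] *= k
	}
}

// scaleU32 is the same loop over uint32, for comparing the generated
// load and store instructions against the float32 version.
func scaleU32(xs []uint32, k uint32) {
	for i := range xs {
		xs[i] *= k
	}
}

func main() {
	f := []float32{1, 2, 3}
	u := []uint32{1, 2, 3}
	scaleF32(f, 2)
	scaleU32(u, 2)
	fmt.Println(f, u)
}
```

Building with `GOARCH=arm64 go build -gcflags=-S` prints the compiler's assembly; comparing the loop bodies of the two functions shows whether each uses an indexed addressing form or computes the address in a separate instruction first.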
Change https://golang.org/cl/273706 mentions this issue: |
@egonelbre I found something else on
$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC=$ANDROID_CC go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3590.1 MB/s (2374192 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.604791ms, Mean: 53.023µs
[Go scale64] Repeat: [200], Total: 10.56849ms, Mean: 52.842µs
[C scale] Repeat: [200], Total: 1.54927ms, Mean: 7.746µs
[C scale64] Repeat: [200], Total: 2.954636ms, Mean: 14.773µs
package main

/*
float scale(float *src, int dim, float scale) {
	for(int i = 0; i < dim; i++) {
		src[i] *= scale;
	}
	return src[dim-1];
}
double scale64(double *src, int dim, double scale) {
	for(int i = 0; i < dim; i++) {
		src[i] *= scale;
	}
	return src[dim-1];
}
*/
import "C"

import (
	"fmt"
	"time"
	"unsafe"
)

type TestFunc func()

func main() {
	dim := 10240
	src := make([]float32, dim)
	for jj := range src {
		src[jj] = float32(jj % 10)
	}
	src64 := make([]float64, dim)
	for jj := range src64 {
		src64[jj] = float64(jj % 10)
	}
	repeats := 200
	basicTest(func() { scale(src, 1) }, "Go scale", repeats)
	basicTest(func() { scale64(src64, 1) }, "Go scale64", repeats)
	basicTest(func() { C.scale((*C.float)(unsafe.Pointer(&src[0])), C.int(dim), 1) }, "C scale", repeats)
	basicTest(func() { C.scale64((*C.double)(unsafe.Pointer(&src64[0])), C.int(dim), 1) }, "C scale64", repeats)
}

func basicTest(testFunc TestFunc, name string, repeats int) {
	ss := time.Now()
	for ii := 0; ii < repeats; ii++ {
		testFunc()
	}
	tt := time.Since(ss)
	fmt.Printf("[%s] Repeat: [%d], Total: %s, Mean: %s\n", name, repeats, tt, tt/time.Duration(repeats))
}

func scale(xs []float32, scale float32) {
	for i := range xs {
		xs[i] *= scale
	}
}

func scale64(xs []float64, scale float64) {
	for i := range xs {
		xs[i] *= scale
	}
}
|
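As an alternative to the hand-rolled `time.Now` loop, the Go side could be timed with the standard `testing.Benchmark` helper, which picks the iteration count itself. This is only a sketch of that option: it covers the pure Go `scale` function and leaves out the cgo calls, and the parameter name `k` is a rename for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// scale multiplies every element of xs by k (same loop as in the thread).
func scale(xs []float32, k float32) {
	for i := range xs {
		xs[i] *= k
	}
}

func main() {
	xs := make([]float32, 10240)
	for i := range xs {
		xs[i] = float32(i % 10)
	}
	// testing.Benchmark runs the function with increasing b.N until the
	// measurement is long enough to be stable (about one second by default).
	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(xs)) * 4) // 4 bytes per float32, for MB/s output
		for i := 0; i < b.N; i++ {
			scale(xs, 1)
		}
	})
	fmt.Println("Go scale:", res)
}
```

Scaling by 1 keeps the data unchanged across iterations, as in the original harness, so repeated runs do not overflow to infinity.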
I tested the same code on Raspberry Pi 4 (with the fix and using gcc), results:
Using clang:
Looking at the C assembly, it seems clang is automatically vectorizing and interleaving. See https://llvm.org/docs/Vectorizers.html for more information. This explains the 2x difference. After disabling vectorization and interleaving, I got:
|
Yes, I think that explains it. Thanks for the clarification.
$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC="$ANDROID_CC" CGO_CFLAGS="-O3 -fno-vectorize" go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3243.5 MB/s (2364312 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.593854ms, Mean: 52.969µs
[Go scale64] Repeat: [200], Total: 10.506198ms, Mean: 52.53µs
[C scale] Repeat: [200], Total: 7.748177ms, Mean: 38.74µs
[C scale64] Repeat: [200], Total: 7.8475ms, Mean: 39.237µs |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I ran into this issue in multiple functions, such as image resize / image warpaffine and many others: Go runs much slower than CPP code on ARM32 / ARM64. Here I wrote two test functions in cpp and go to reproduce.
L2Norm into two functions squareSum and multiplyWithNum.
_range functions, and use GOARM=7 building ARM32.
x86_64 by go run. The mean running times are similar.
x86_64 by go build and run. The mean running times vary.
ARM64. squareSum is similar, but multiplyWithNum is ~7 times slower.
ARM32. squareSum is similar, but multiplyWithNum is ~1.5 times slower after adding GOARM=7.
So what did I do wrong here? How can I improve this?
What did you expect to see?
I expected Go to have performance similar to cpp on the ARM32 / ARM64 platforms.
What did you see instead?
Go on the X86_64 platform has performance similar to cpp.
The Go multiplyWithNum function on the ARM64 platform is ~7 times slower than cpp.
The Go multiplyWithNum function on the ARM32 platform is ~1.5 times slower than cpp.