Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

Closed
petemoore opened this issue Mar 10, 2020 · 2 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

Comments

@petemoore
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64  # but (cross compiling to linux/arm64)

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/pmoore/Library/Caches/go-build"
GOENV="/Users/pmoore/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/pmoore/.gvm/pkgsets/go1.14/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/pmoore/.gvm/gos/go1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/pmoore/.gvm/gos/go1.14/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/pmoore/git/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/v9/mll6p_rj5h94dt_m5m8j0f9c0000gn/T/go-build535314509=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

pmoore@Petes-iMac:~/udiv $ cat udiv.go 
package main

import (
	"fmt"
)

func main() {
	for x := uint64(0); x < 123456; x++ {
		fmt.Printf("%v\n", x/216)
	}
}
pmoore@Petes-iMac:~/udiv $ GOARCH=arm64 GOOS=linux go build -gcflags -S udiv.go 2>&1 | grep -F udiv.go:9
	0x0028 00040 (/Users/pmoore/udiv/udiv.go:9)	MOVD	$-7515340178177965473, R1
	0x0038 00056 (/Users/pmoore/udiv/udiv.go:9)	UMULH	R0, R1, R2
	0x003c 00060 (/Users/pmoore/udiv/udiv.go:9)	LSR	$7, R2, R2
	0x0040 00064 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R2, 8(RSP)
	0x0044 00068 (/Users/pmoore/udiv/udiv.go:9)	CALL	runtime.convT64(SB)
	0x0048 00072 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $1
	0x0048 00072 (/Users/pmoore/udiv/udiv.go:9)	MOVD	16(RSP), R0
	0x004c 00076 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	$1, $1
	0x004c 00076 (/Users/pmoore/udiv/udiv.go:9)	STP	(ZR, ZR), ""..autotmp_15-16(SP)
	0x0050 00080 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $2
	0x0050 00080 (/Users/pmoore/udiv/udiv.go:9)	MOVD	$type.uint64(SB), R1
	0x0058 00088 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $1
	0x0058 00088 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R1, ""..autotmp_15-16(SP)
	0x005c 00092 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, ZR
	0x005c 00092 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R0, ""..autotmp_15-8(SP)
pmoore@Petes-iMac:~/udiv $ 

What did you expect to see?

I expected the compiler to generate the UDIV instruction rather than UMULH + LSR, as empirical testing (below) shows it to be twice as fast on an arm64 Cortex A53 (BCM2837) SoC.

The final assembly instructions are:

pmoore@Petes-iMac:~/udiv $ aarch64-unknown-linux-gnu-objdump -d udiv | sed -n '/\<main\.main\>/,$p'
000000000009bf50 <main.main>:
   9bf50:	f9400b81 	ldr	x1, [x28, #16]
   9bf54:	910003e2 	mov	x2, sp
   9bf58:	eb01005f 	cmp	x2, x1
   9bf5c:	54000609 	b.ls	9c01c <main.main+0xcc>  // b.plast
   9bf60:	f8180ffe 	str	x30, [sp, #-128]!
   9bf64:	f81f83fd 	stur	x29, [sp, #-8]
   9bf68:	d10023fd 	sub	x29, sp, #0x8
   9bf6c:	d2800000 	mov	x0, #0x0                   	// #0
   9bf70:	14000024 	b	9c000 <main.main+0xb0>
   9bf74:	f90033e0 	str	x0, [sp, #96]
   9bf78:	d2884be1 	mov	x1, #0x425f                	// #16991
   9bf7c:	f2a12f61 	movk	x1, #0x97b, lsl #16
   9bf80:	f2c4bda1 	movk	x1, #0x25ed, lsl #32
   9bf84:	f2f2f681 	movk	x1, #0x97b4, lsl #48
   9bf88:	9bc07c22 	umulh	x2, x1, x0
   9bf8c:	d347fc42 	lsr	x2, x2, #7
   9bf90:	f90007e2 	str	x2, [sp, #8]
   9bf94:	97fdf237 	bl	18870 <runtime.convT64>
   9bf98:	f9400be0 	ldr	x0, [sp, #16]
   9bf9c:	a906ffff 	stp	xzr, xzr, [sp, #104]
   9bfa0:	d0000081 	adrp	x1, ad000 <type.*+0xd000>
   9bfa4:	91348021 	add	x1, x1, #0xd20
   9bfa8:	f90037e1 	str	x1, [sp, #104]
   9bfac:	f9003be0 	str	x0, [sp, #112]
   9bfb0:	d00006db 	adrp	x27, 175000 <runtime.itabTableInit+0xee0>
   9bfb4:	9106237b 	add	x27, x27, #0x188
   9bfb8:	f9400360 	ldr	x0, [x27]
   9bfbc:	f0000262 	adrp	x2, ea000 <runtime.vdsoauxv.stkobj>
   9bfc0:	91060042 	add	x2, x2, #0x180
   9bfc4:	f90007e2 	str	x2, [sp, #8]
   9bfc8:	f9000be0 	str	x0, [sp, #16]
   9bfcc:	d0000180 	adrp	x0, cd000 <type.*+0x2d000>
   9bfd0:	91300800 	add	x0, x0, #0xc02
   9bfd4:	f9000fe0 	str	x0, [sp, #24]
   9bfd8:	b24007e3 	orr	x3, xzr, #0x3
   9bfdc:	f90013e3 	str	x3, [sp, #32]
   9bfe0:	9101a3e4 	add	x4, sp, #0x68
   9bfe4:	f90017e4 	str	x4, [sp, #40]
   9bfe8:	b24003e4 	orr	x4, xzr, #0x1
   9bfec:	f9001be4 	str	x4, [sp, #48]
   9bff0:	f9001fe4 	str	x4, [sp, #56]
   9bff4:	97ffe363 	bl	94d80 <fmt.Fprintf>
   9bff8:	f94033e0 	ldr	x0, [sp, #96]
   9bffc:	91000400 	add	x0, x0, #0x1
   9c000:	d29c481b 	mov	x27, #0xe240                	// #57920
   9c004:	f2a0003b 	movk	x27, #0x1, lsl #16
   9c008:	eb1b001f 	cmp	x0, x27
   9c00c:	54fffb43 	b.cc	9bf74 <main.main+0x24>  // b.lo, b.ul, b.last
   9c010:	f85f83fd 	ldur	x29, [sp, #-8]
   9c014:	f84807fe 	ldr	x30, [sp], #128
   9c018:	d65f03c0 	ret
   9c01c:	aa1e03e3 	mov	x3, x30
   9c020:	97ff2a48 	bl	66940 <runtime.morestack_noctxt>
   9c024:	17ffffcb 	b	9bf50 <main.main>
	...

The division by 216 is comprised of the six instructions:

  9bf78:	d2884be1 	mov	x1, #0x425f                	// #16991
   9bf7c:	f2a12f61 	movk	x1, #0x97b, lsl #16
   9bf80:	f2c4bda1 	movk	x1, #0x25ed, lsl #32
   9bf84:	f2f2f681 	movk	x1, #0x97b4, lsl #48
   9bf88:	9bc07c22 	umulh	x2, x1, x0
   9bf8c:	d347fc42 	lsr	x2, x2, #7

Benchmark results

I created two simple executables to compare the generated mov/movk/movk/movk/umulh/lsr instructions with the equivalent mov/udiv instructions that I had expected to see, and compared results. Here we see that mov/udiv is consistently twice as fast:

ubuntu@ubuntu:~/git/udiv$ cat umulh+lsr.s 
/*
Benchmark with: as -o umulh+lsr.o umulh+lsr.s && gcc -o umulh+lsr umulh+lsr.o && for ((i=0; i<5; i++)); do time ./umulh+lsr; done
*/
.global main

.data
fmt:
        .asciz  "%d / %d = %d\n"

.text
main:
        mov     x1, #0xec73
        movk    x1, #0x0018, lsl #16
        mov     x2, #216
        mov     x11, #0x10000000
1:
        mov     x4, #0x425f
        movk    x4, #0x97b, lsl #16
        movk    x4, #0x25ed, lsl #32
        movk    x4, #0x97b4, lsl #48
        umulh   x3, x4, x1
        lsr     x3, x3, #7
        subs    x11, x11, #1
        b.ne    1b
        ldr     x0,=fmt
        bl      printf
        mov     x8, #93
        svc     0
ubuntu@ubuntu:~/git/udiv$ as -o umulh+lsr.o umulh+lsr.s && gcc -o umulh+lsr umulh+lsr.o && for ((i=0; i<5; i++)); do time ./umulh+lsr; done
1633395 / 216 = 7562

real	0m2.720s
user	0m2.713s
sys	0m0.004s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.706s
sys	0m0.004s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.706s
sys	0m0.005s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.710s
sys	0m0.001s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.709s
sys	0m0.001s
ubuntu@ubuntu:~/git/udiv$ cat udiv.s
/*
Benchmark with: as -o udiv.o udiv.s && gcc -o udiv udiv.o && for ((i=0; i<5; i++)); do time ./udiv; done
*/
.global main

.data
fmt:
        .asciz  "%d / %d = %d\n"

.text
main:
        mov     x1, #0xec73
        movk    x1, #0x0018, lsl #16
        mov     x2, #216
        mov     x11, #0x10000000
1:
        mov     x4, #216
        udiv    x3, x1, x4
        subs    x11, x11, #1
        b.ne    1b
        ldr     x0,=fmt
        bl      printf
        mov     x8, #93
        svc     0
ubuntu@ubuntu:~/git/udiv$ as -o udiv.o udiv.s && gcc -o udiv udiv.o && for ((i=0; i<5; i++)); do time ./udiv; done
1633395 / 216 = 7562

real	0m1.358s
user	0m1.357s
sys	0m0.000s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.353s
sys	0m0.004s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.357s
sys	0m0.000s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.352s
sys	0m0.005s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.352s
sys	0m0.005s
ubuntu@ubuntu:~/git/udiv$ 
@ALTree ALTree added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 10, 2020
@ALTree ALTree changed the title cmd/compile: Favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant Mar 10, 2020
@randall77
Copy link
Contributor

I worry that these benchmarks do not represent real code. The result of the umulh or udiv is never used. udiv might have a really long latency, but we can issue a new one every cycle.

Try using the result, say by feeding it back around to the input x1:

lsr x3, x3, 10 // will always be 0
add x1, x1, x3

Benchmarking is tricky.

@petemoore
Copy link
Author

Ah many thanks, indeed that is the case! With the feedback loop, the udiv version takes ~4.0s and the umulh/lsr version takes ~3.4s. So indeed this is a non issue (sorry!) and thanks for the prompt feedback and benchmarking advice.

@golang golang locked and limited conversation to collaborators Mar 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants