cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

petemoore · 2020-03-10T12:14:48Z

What version of Go are you using (`go version`)?

$ go version
go version go1.14 darwin/amd64  # but (cross compiling to linux/arm64)

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/pmoore/Library/Caches/go-build"
GOENV="/Users/pmoore/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/pmoore/.gvm/pkgsets/go1.14/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/pmoore/.gvm/gos/go1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/pmoore/.gvm/gos/go1.14/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/pmoore/git/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/v9/mll6p_rj5h94dt_m5m8j0f9c0000gn/T/go-build535314509=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

pmoore@Petes-iMac:~/udiv $ cat udiv.go 
package main

import (
	"fmt"
)

func main() {
	for x := uint64(0); x < 123456; x++ {
		fmt.Printf("%v\n", x/216)
	}
}
pmoore@Petes-iMac:~/udiv $ GOARCH=arm64 GOOS=linux go build -gcflags -S udiv.go 2>&1 | grep -F udiv.go:9
	0x0028 00040 (/Users/pmoore/udiv/udiv.go:9)	MOVD	$-7515340178177965473, R1
	0x0038 00056 (/Users/pmoore/udiv/udiv.go:9)	UMULH	R0, R1, R2
	0x003c 00060 (/Users/pmoore/udiv/udiv.go:9)	LSR	$7, R2, R2
	0x0040 00064 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R2, 8(RSP)
	0x0044 00068 (/Users/pmoore/udiv/udiv.go:9)	CALL	runtime.convT64(SB)
	0x0048 00072 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $1
	0x0048 00072 (/Users/pmoore/udiv/udiv.go:9)	MOVD	16(RSP), R0
	0x004c 00076 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	$1, $1
	0x004c 00076 (/Users/pmoore/udiv/udiv.go:9)	STP	(ZR, ZR), ""..autotmp_15-16(SP)
	0x0050 00080 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $2
	0x0050 00080 (/Users/pmoore/udiv/udiv.go:9)	MOVD	$type.uint64(SB), R1
	0x0058 00088 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, $1
	0x0058 00088 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R1, ""..autotmp_15-16(SP)
	0x005c 00092 (/Users/pmoore/udiv/udiv.go:9)	PCDATA	ZR, ZR
	0x005c 00092 (/Users/pmoore/udiv/udiv.go:9)	MOVD	R0, ""..autotmp_15-8(SP)
pmoore@Petes-iMac:~/udiv $

What did you expect to see?

I expected the compiler to generate the UDIV instruction rather than UMULH + LSR, as empirical testing (below) shows it to be twice as fast on an arm64 Cortex A53 (BCM2837) SoC.

The final assembly instructions are:

pmoore@Petes-iMac:~/udiv $ aarch64-unknown-linux-gnu-objdump -d udiv | sed -n '/\<main\.main\>/,$p'
000000000009bf50 <main.main>:
   9bf50:	f9400b81 	ldr	x1, [x28, #16]
   9bf54:	910003e2 	mov	x2, sp
   9bf58:	eb01005f 	cmp	x2, x1
   9bf5c:	54000609 	b.ls	9c01c <main.main+0xcc>  // b.plast
   9bf60:	f8180ffe 	str	x30, [sp, #-128]!
   9bf64:	f81f83fd 	stur	x29, [sp, #-8]
   9bf68:	d10023fd 	sub	x29, sp, #0x8
   9bf6c:	d2800000 	mov	x0, #0x0                   	// #0
   9bf70:	14000024 	b	9c000 <main.main+0xb0>
   9bf74:	f90033e0 	str	x0, [sp, #96]
   9bf78:	d2884be1 	mov	x1, #0x425f                	// #16991
   9bf7c:	f2a12f61 	movk	x1, #0x97b, lsl #16
   9bf80:	f2c4bda1 	movk	x1, #0x25ed, lsl #32
   9bf84:	f2f2f681 	movk	x1, #0x97b4, lsl #48
   9bf88:	9bc07c22 	umulh	x2, x1, x0
   9bf8c:	d347fc42 	lsr	x2, x2, #7
   9bf90:	f90007e2 	str	x2, [sp, #8]
   9bf94:	97fdf237 	bl	18870 <runtime.convT64>
   9bf98:	f9400be0 	ldr	x0, [sp, #16]
   9bf9c:	a906ffff 	stp	xzr, xzr, [sp, #104]
   9bfa0:	d0000081 	adrp	x1, ad000 <type.*+0xd000>
   9bfa4:	91348021 	add	x1, x1, #0xd20
   9bfa8:	f90037e1 	str	x1, [sp, #104]
   9bfac:	f9003be0 	str	x0, [sp, #112]
   9bfb0:	d00006db 	adrp	x27, 175000 <runtime.itabTableInit+0xee0>
   9bfb4:	9106237b 	add	x27, x27, #0x188
   9bfb8:	f9400360 	ldr	x0, [x27]
   9bfbc:	f0000262 	adrp	x2, ea000 <runtime.vdsoauxv.stkobj>
   9bfc0:	91060042 	add	x2, x2, #0x180
   9bfc4:	f90007e2 	str	x2, [sp, #8]
   9bfc8:	f9000be0 	str	x0, [sp, #16]
   9bfcc:	d0000180 	adrp	x0, cd000 <type.*+0x2d000>
   9bfd0:	91300800 	add	x0, x0, #0xc02
   9bfd4:	f9000fe0 	str	x0, [sp, #24]
   9bfd8:	b24007e3 	orr	x3, xzr, #0x3
   9bfdc:	f90013e3 	str	x3, [sp, #32]
   9bfe0:	9101a3e4 	add	x4, sp, #0x68
   9bfe4:	f90017e4 	str	x4, [sp, #40]
   9bfe8:	b24003e4 	orr	x4, xzr, #0x1
   9bfec:	f9001be4 	str	x4, [sp, #48]
   9bff0:	f9001fe4 	str	x4, [sp, #56]
   9bff4:	97ffe363 	bl	94d80 <fmt.Fprintf>
   9bff8:	f94033e0 	ldr	x0, [sp, #96]
   9bffc:	91000400 	add	x0, x0, #0x1
   9c000:	d29c481b 	mov	x27, #0xe240                	// #57920
   9c004:	f2a0003b 	movk	x27, #0x1, lsl #16
   9c008:	eb1b001f 	cmp	x0, x27
   9c00c:	54fffb43 	b.cc	9bf74 <main.main+0x24>  // b.lo, b.ul, b.last
   9c010:	f85f83fd 	ldur	x29, [sp, #-8]
   9c014:	f84807fe 	ldr	x30, [sp], #128
   9c018:	d65f03c0 	ret
   9c01c:	aa1e03e3 	mov	x3, x30
   9c020:	97ff2a48 	bl	66940 <runtime.morestack_noctxt>
   9c024:	17ffffcb 	b	9bf50 <main.main>
	...

The division by 216 is comprised of the six instructions:

  9bf78:	d2884be1 	mov	x1, #0x425f                	// #16991
   9bf7c:	f2a12f61 	movk	x1, #0x97b, lsl #16
   9bf80:	f2c4bda1 	movk	x1, #0x25ed, lsl #32
   9bf84:	f2f2f681 	movk	x1, #0x97b4, lsl #48
   9bf88:	9bc07c22 	umulh	x2, x1, x0
   9bf8c:	d347fc42 	lsr	x2, x2, #7

Benchmark results

I created two simple executables to compare the generated mov/movk/movk/movk/umulh/lsr instructions with the equivalent mov/udiv instructions that I had expected to see, and compared results. Here we see that mov/udiv is consistently twice as fast:

ubuntu@ubuntu:~/git/udiv$ cat umulh+lsr.s 
/*
Benchmark with: as -o umulh+lsr.o umulh+lsr.s && gcc -o umulh+lsr umulh+lsr.o && for ((i=0; i<5; i++)); do time ./umulh+lsr; done
*/
.global main

.data
fmt:
        .asciz  "%d / %d = %d\n"

.text
main:
        mov     x1, #0xec73
        movk    x1, #0x0018, lsl #16
        mov     x2, #216
        mov     x11, #0x10000000
1:
        mov     x4, #0x425f
        movk    x4, #0x97b, lsl #16
        movk    x4, #0x25ed, lsl #32
        movk    x4, #0x97b4, lsl #48
        umulh   x3, x4, x1
        lsr     x3, x3, #7
        subs    x11, x11, #1
        b.ne    1b
        ldr     x0,=fmt
        bl      printf
        mov     x8, #93
        svc     0
ubuntu@ubuntu:~/git/udiv$ as -o umulh+lsr.o umulh+lsr.s && gcc -o umulh+lsr umulh+lsr.o && for ((i=0; i<5; i++)); do time ./umulh+lsr; done
1633395 / 216 = 7562

real	0m2.720s
user	0m2.713s
sys	0m0.004s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.706s
sys	0m0.004s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.706s
sys	0m0.005s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.710s
sys	0m0.001s
1633395 / 216 = 7562

real	0m2.713s
user	0m2.709s
sys	0m0.001s
ubuntu@ubuntu:~/git/udiv$ cat udiv.s
/*
Benchmark with: as -o udiv.o udiv.s && gcc -o udiv udiv.o && for ((i=0; i<5; i++)); do time ./udiv; done
*/
.global main

.data
fmt:
        .asciz  "%d / %d = %d\n"

.text
main:
        mov     x1, #0xec73
        movk    x1, #0x0018, lsl #16
        mov     x2, #216
        mov     x11, #0x10000000
1:
        mov     x4, #216
        udiv    x3, x1, x4
        subs    x11, x11, #1
        b.ne    1b
        ldr     x0,=fmt
        bl      printf
        mov     x8, #93
        svc     0
ubuntu@ubuntu:~/git/udiv$ as -o udiv.o udiv.s && gcc -o udiv udiv.o && for ((i=0; i<5; i++)); do time ./udiv; done
1633395 / 216 = 7562

real	0m1.358s
user	0m1.357s
sys	0m0.000s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.353s
sys	0m0.004s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.357s
sys	0m0.000s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.352s
sys	0m0.005s
1633395 / 216 = 7562

real	0m1.358s
user	0m1.352s
sys	0m0.005s
ubuntu@ubuntu:~/git/udiv$

The text was updated successfully, but these errors were encountered:

randall77 · 2020-03-10T14:22:40Z

I worry that these benchmarks do not represent real code. The result of the umulh or udiv is never used. udiv might have a really long latency, but we can issue a new one every cycle.

Try using the result, say by feeding it back around to the input x1:

lsr x3, x3, 10 // will always be 0
add x1, x1, x3

Benchmarking is tricky.

petemoore · 2020-03-10T14:48:58Z

Ah many thanks, indeed that is the case! With the feedback loop, the udiv version takes ~4.0s and the umulh/lsr version takes ~3.4s. So indeed this is a non issue (sorry!) and thanks for the prompt feedback and benchmarking advice.

ALTree added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 10, 2020

ALTree changed the title ~~cmd/compile: Favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant~~ cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant Mar 10, 2020

petemoore closed this as completed Mar 10, 2020

golang locked and limited conversation to collaborators Mar 10, 2021

gopherbot added the FrozenDueToAge label Mar 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

petemoore commented Mar 10, 2020

randall77 commented Mar 10, 2020

petemoore commented Mar 10, 2020

cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

cmd/compile: favour UDIV over UMULH + LSR on arm64 for 64 bit integer division by a constant #37773

Comments

petemoore commented Mar 10, 2020

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

Benchmark results

randall77 commented Mar 10, 2020

petemoore commented Mar 10, 2020

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?