cmd/compile: CMOV instruction slower than branch #26306

PeterRK · 2018-07-10T01:25:56Z

What version of Go are you using (`go version`)?

go1.11beta1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rk/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/rk/GoSpace/Projects"
GOPROXY=""
GORACE=""
GOROOT="/home/rk/GoSpace/GO"
GOTMPDIR=""
GOTOOLDIR="/home/rk/GoSpace/GO/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473408654=/tmp/go-build -gno-record-gcc-switches"
VGOMODROOT=""

What did you do?

I built the code below with go 1.10 and go 1.11.
https://play.golang.org/p/22MEbiXFpzo

What did you expect to see?

The binary built by go 1.11 is as fast as that built by go 1.10.

What did you see instead?

The binary built by go 1.11 is incredibly slower than that built by go 1.10.

Go 1.11 compiles the function "down" to assembly like this:

    MOVQ pos+32(SP), AX
    MOVQ list+8(SP), CX
    MOVL (CX)(AX*4), DX
    LEAQ 1(AX)(AX*1), BX
    MOVQ list+16(SP), SI
    DECQ SI
    JMP L42
L28:
    MOVL DI, (CX)(AX*4)
    LEAQ 1(BX)(BX*1), DI
    MOVQ BX, AX
    MOVQ DI, BX
L42:
    CMPQ BX, SI
    JGE L73
    MOVL 4(CX)(BX*4), DI
    LEAQ 1(BX), R8
    MOVL (CX)(BX*4), R9
    CMPL DI, R9
    CMOVQHI R8, BX
//  JHI L100
L66:
    MOVL (CX)(BX*4), DI
    CMPL DX, DI
    JCS L28
L73:    
    CMPQ BX, SI
    JNE L97
    MOVL (CX)(BX*4), SI
    CMPL DX, SI
    JCC L92
    MOVL SI, (CX)(AX*4)
L88:    
    MOVL DX, (CX)(BX*4)
    RET
L92:
    MOVQ AX, BX
    JMP L88
L97:    
    MOVQ AX, BX
    JMP L88
L100:
//  MOVQ R8, BX
//  JMP L66

If replacing the CMOV instruction with a branch, it can be 80% faster.

The text was updated successfully, but these errors were encountered:

josharian · 2018-07-10T01:53:43Z

cc @TocarIP @rasky

ianlancetaylor · 2018-07-10T01:58:35Z

@PeterRK What processor are you running?

ulikunitz · 2018-07-10T06:09:28Z

I've microbenchmarked the code here on my Core i7-4510U between go1.10.3 and go1.11beta1 and benchstat returns the following:

name        old time/op  new time/op  delta
HeapSort-4  52.3µs ± 3%  39.8µs ± 2%  -23.89%  (p=0.000 n=10+10)

This would indicate there is actually a 24% speed improvement between go1.10.3 and go1.11beta1. I can confirm that the difference is the use of the CMOV instruction in the down function.

Here is the Benchmark function:

const size = 1000

func makeSlice(size int) []uint32 {
	a := make([]uint32, size)
	for i := range a {
		a[i] = rand.Uint32()
	}
	return a
}

func BenchmarkHeapSort(b *testing.B) {
	a := makeSlice(size)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		HeapSort(a)
	}
}

ulikunitz · 2018-07-10T06:24:01Z

For larger arrrays (size = 1000000) go1.11beta1 is getting slower.

name        old time/op  new time/op  delta
HeapSort-4  75.9ms ± 2%  86.0ms ± 1%  +13.29%  (p=0.000 n=10+9)

PeterRK · 2018-07-10T06:25:27Z

I have tried i7-8700 & i5-7500.
Binary built by go 1.11 have less cache miss and less branch miss，but 80% more execution time.
I think it can not be said 5x slower. @josharian

test with 20M list:

test code:
https://play.golang.org/p/h8mmeue7v1R

ALTree · 2018-07-10T07:55:32Z

Here's an auto-contained benchmark that can be directly copy-pasted in a _test.go file for whoever wants to try it: https://play.golang.org/p/4VwOmmmX069

Here's the result with count=5 on my machine for sizes from 100 to 1000000 (old is 1.10, new is tip):

name                old time/op  new time/op  delta
HeapSort/100-4      1.79µs ± 3%  2.45µs ± 2%  +36.43%  (p=0.008 n=5+5)
HeapSort/1000-4     52.1µs ± 3%  39.2µs ± 0%  -24.73%  (p=0.016 n=5+4)
HeapSort/10000-4     607µs ± 7%   540µs ± 0%  -11.08%  (p=0.016 n=5+4)
HeapSort/100000-4   6.55ms ± 1%  7.09ms ± 2%   +8.20%  (p=0.008 n=5+5)
HeapSort/1000000-4  75.0ms ± 0%  85.7ms ± 2%  +14.27%  (p=0.008 n=5+5)

On a i7-4510U cpu.

agnivade · 2018-07-10T08:57:37Z

Thanks @ALTree for the self-contained benchmark. I was very interested in the CMOV instruction so I gave it a shot too. I see different results in my CPU (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz).

name                old time/op  new time/op  delta
HeapSort/100-4      2.03µs ± 1%  2.43µs ± 1%  +19.92%  (p=0.008 n=5+5)
HeapSort/1000-4     60.0µs ± 0%  38.8µs ± 1%  -35.29%  (p=0.008 n=5+5)
HeapSort/10000-4     681µs ± 1%   529µs ± 0%  -22.26%  (p=0.008 n=5+5)
HeapSort/100000-4   7.61ms ± 1%  6.78ms ± 1%  -10.93%  (p=0.008 n=5+5)
HeapSort/1000000-4  87.8ms ± 2%  85.9ms ± 1%   -2.19%  (p=0.008 n=5+5)

Things look more or less same, except the last row, where I see perf improvement rather than a slowdown. For the 100 case, I see a slowdown too.

Sorry if this just adds noise. I only posted because I saw different results in the last case.

ulikunitz · 2018-07-10T11:10:52Z

This might be related to #25298 as pointed out by @randall77 on golang-dev. See the discussion branch predictor versus CMOV.

ulikunitz · 2018-07-10T11:58:21Z

The benchmark code has the problem that after the first iteration slice a is already sorted.
Here is a fix: https://play.golang.org/p/b_-IC1OSEWq

The new benchstat looks like this:

name                 old time/op  new time/op  delta
HeapSort/100-4       5.59µs ± 3%  3.74µs ± 4%  -33.14%  (p=0.000 n=10+10)
HeapSort/1000-4      75.0µs ± 0%  50.1µs ± 1%  -33.26%   (p=0.000 n=7+10)
HeapSort/10000-4      879µs ± 2%   610µs ± 0%  -30.61%   (p=0.000 n=10+8)
HeapSort/100000-4    10.8ms ± 0%   8.8ms ± 1%  -18.80%   (p=0.000 n=8+10)
HeapSort/1000000-4    141ms ± 1%   146ms ± 1%   +3.62%  (p=0.000 n=10+10)
HeapSort/10000000-4   2.54s ± 3%   4.23s ± 6%  +66.86%  (p=0.000 n=10+10)

Apparently branch prediction and speculative execution (loads) are faster than CMOV if the slice size exceeds the level 2 cache.

rasky · 2018-07-10T15:00:06Z

Honestly, I'm not sure what to do. @TocarIP?

PeterRK · 2018-07-11T03:08:18Z

go 1.11 vs go 1.10 on different scales:

Go 1.11 loses its advantage when data size grows. CMOV causes that.

randall77 · 2018-07-14T23:03:31Z

I see similar things. I suspect there are two different issues - one with N=100 and one with large N.
I'm going to focus on the large N for now.

With a few more data points (old=tip with cmov generation turned off, new=tip):

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          1752           2022           +15.41%
BenchmarkHeapSort/1000-8         53765          35192          -34.54%
BenchmarkHeapSort/10000-8        760440         502731         -33.89%
BenchmarkHeapSort/100000-8       9790058        7496200        -23.43%
BenchmarkHeapSort/1000000-8      131420794      130939404      -0.37%
BenchmarkHeapSort/2000000-8      294149147      333858182      +13.50%
BenchmarkHeapSort/10000000-8     2271577039     3583052169     +57.73%
BenchmarkHeapSort/20000000-8     5213875449     9264420912     +77.69%

As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which is N=1.25e6, just about at the inflection point.
The cmov is only being used to select the correct child, like this:

if list[kid+1] > list[kid] {
	kid++
}

The array size dependence seems an awful lot like cache effects. If the array we're sorting fits in L3 cache, it is fast. If the array does not fit, it's slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there's a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.

It seems a pretty niche case where this effect comes into play. I wouldn't want to remove cmov generation because of it.

randall77 · 2018-07-15T00:17:42Z

I can somewhat verify my claim. I can add a prefetch to the benchmark and that mostly fixes the large N behavior. At the head of the loop in down, do:

if kid*2 < last {
	_ = atomic.LoadUint32(&list[kid*2])
}

Now old=nocmov, new=tip:

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          2029           2262           +11.48%
BenchmarkHeapSort/1000-8         56825          37950          -33.22%
BenchmarkHeapSort/10000-8        807093         526252         -34.80%
BenchmarkHeapSort/100000-8       10113466       6808430        -32.68%
BenchmarkHeapSort/1000000-8      130448267      101057677      -22.53%
BenchmarkHeapSort/2000000-8      295489785      243496350      -17.60%
BenchmarkHeapSort/4000000-8      721814370      646561035      -10.43%
BenchmarkHeapSort/6000000-8      1202974592     1161344992     -3.46%
BenchmarkHeapSort/8000000-8      1729255015     1735633353     +0.37%
BenchmarkHeapSort/10000000-8     2274368639     2370006748     +4.21%
BenchmarkHeapSort/20000000-8     5265656303     5989478005     +13.75%

There is still some effect, but it is much reduced.

randall77 · 2018-07-16T17:06:44Z

Perhaps we could detect the fact that the result of a CMOV is used as an argument to a load, and suppress the use of a CMOV in those cases. That would allow the chip to continue to speculate the load. Seems like a pretty vague condition, though. And definitely not 1.11 material.
I'm going to punt this issue to 1.12. I think we're at a reasonable state for the release.

bcmills · 2018-07-16T17:48:12Z

Are there µ-arches where the CMOV doesn't block the speculative load? (That seems like the sort of thing that register renaming is supposed to take care of.)

randall77 · 2018-07-16T19:26:21Z

@bcmills I don't think so. The speculative load needs the address from which to load, and I don't know of any archs which speculate the address.

addr := ...
if cond { addr++ }
_ = *addr

With regular branches, cond does not need to be evaluated for the load to be issued. The processor speculates on the branch and either issues the load from addr or addr+1, depending on the branch direction it predicted.

addr := ...
addr = CMOV(cond, addr+1, addr)
_ = *addr

With the CMOV, the address from which you load is data-dependent on cond, so cond must be evaluated before the load is issued.

gopherbot · 2018-10-29T22:23:55Z

Change https://golang.org/cl/145717 mentions this issue: cmd/compile: don't use CMOV ops to compute load addresses

gopherbot · 2019-03-18T15:51:47Z

Change https://golang.org/cl/168117 mentions this issue: cmd/compile: replace OpSlicemask with branch

PeterRK changed the title ~~Go 1.11 genertate very slow CMOV instruction~~ Go 1.11 generates very slow CMOV instruction Jul 10, 2018

josharian added Performance release-blocker labels Jul 10, 2018

josharian added this to the Go1.11 milestone Jul 10, 2018

josharian changed the title ~~Go 1.11 generates very slow CMOV instruction~~ cmd/compile: CMOV instruction 5x slower than branch Jul 10, 2018

PeterRK changed the title ~~cmd/compile: CMOV instruction 5x slower than branch~~ cmd/compile: CMOV instruction slower than branch Jul 10, 2018

randall77 modified the milestones: Go1.11, Go1.12 Jul 16, 2018

gopherbot closed this as completed in 0b79dde Nov 27, 2018

mundaym mentioned this issue Mar 15, 2019

cmd/compile: unexpected performance difference accessing slices with different caps #27857

Open

golang locked and limited conversation to collaborators Mar 17, 2020

gopherbot added the FrozenDueToAge label Mar 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: CMOV instruction slower than branch #26306

cmd/compile: CMOV instruction slower than branch #26306

PeterRK commented Jul 10, 2018 •

edited by randall77

josharian commented Jul 10, 2018

ianlancetaylor commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

PeterRK commented Jul 10, 2018 •

edited

ALTree commented Jul 10, 2018

agnivade commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

ulikunitz commented Jul 10, 2018 •

edited

rasky commented Jul 10, 2018

PeterRK commented Jul 11, 2018

randall77 commented Jul 14, 2018 •

edited

randall77 commented Jul 15, 2018

randall77 commented Jul 16, 2018

bcmills commented Jul 16, 2018

randall77 commented Jul 16, 2018

gopherbot commented Oct 29, 2018

gopherbot commented Mar 18, 2019

cmd/compile: CMOV instruction slower than branch #26306

cmd/compile: CMOV instruction slower than branch #26306

Comments

PeterRK commented Jul 10, 2018 • edited by randall77

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

josharian commented Jul 10, 2018

ianlancetaylor commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

PeterRK commented Jul 10, 2018 • edited

ALTree commented Jul 10, 2018

agnivade commented Jul 10, 2018

ulikunitz commented Jul 10, 2018

ulikunitz commented Jul 10, 2018 • edited

rasky commented Jul 10, 2018

PeterRK commented Jul 11, 2018

randall77 commented Jul 14, 2018 • edited

randall77 commented Jul 15, 2018

randall77 commented Jul 16, 2018

bcmills commented Jul 16, 2018

randall77 commented Jul 16, 2018

gopherbot commented Oct 29, 2018

gopherbot commented Mar 18, 2019

PeterRK commented Jul 10, 2018 •

edited by randall77

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

PeterRK commented Jul 10, 2018 •

edited

ulikunitz commented Jul 10, 2018 •

edited

randall77 commented Jul 14, 2018 •

edited