cmd/compile: performance regression on append-like loop #19633

ALTree · 2017-03-21T11:05:58Z

Consider the following benchmark:

func BenchmarkFoo(b *testing.B) {
	a := make([]byte, 1000)
	b.SetBytes(int64(len(a)))

	for i := 0; i < b.N; i++ {
		var n int
		for j := 0; j < 1000; j++ {
			a[n] = byte(j)
			n++
		}
		a = a[:n]
	}
}

Comparing tip (go version devel +d972dc2de9 Tue Mar 21 06:36:56 2017 +0000 linux/amd64) with go1.8 I get, on my machine:

name   old time/op    new time/op    delta
Foo-4     684ns ± 1%    1022ns ± 0%  +49.34%  (p=0.000 n=18+15)

name   old speed      new speed      delta
Foo-4  1.46GB/s ± 1%  0.98GB/s ± 0%  -32.99%  (p=0.000 n=18+18)

A pretty bad regression.

The text was updated successfully, but these errors were encountered:

ALTree · 2017-03-21T11:56:13Z

The regression was introduced in CL 36205 (cmd/internal/obj: remove Follow pass). Also see: #18977.

cc @cherrymui

cherrymui · 2017-03-21T20:22:48Z

Tried on two machines, one Linux desktop and one Mac laptop, the regression I saw is much smaller

name    old time/op    new time/op    delta
Foo-12     488ns ± 0%     524ns ± 0%  +7.44%  (p=0.000 n=19+18)

name    old speed      new speed      delta
Foo-12  2.05GB/s ± 0%  1.91GB/s ± 0%  -6.93%  (p=0.000 n=19+18)

So this is probably very processor-dependent?

Assembly for tip:

        0x000c 00012 (x_test.go:17)     MOVQ    "".a+24(SP), AX
        0x0011 00017 (x_test.go:17)     MOVQ    "".a+16(SP), CX
        0x0016 00022 (x_test.go:16)     MOVQ    $0, DX
        0x0018 00024 (x_test.go:17)     CMPQ    DX, $1000
        0x001f 00031 (x_test.go:17)     JGE     $0, 53
        0x0021 00033 (x_test.go:18)     CMPQ    DX, AX
        0x0024 00036 (x_test.go:18)     JCC     $0, 46
        0x0026 00038 (x_test.go:18)     MOVB    DL, (CX)(DX*1)
        0x0029 00041 (x_test.go:19)     INCQ    DX
        0x002c 00044 (x_test.go:17)     JMP     24
        0x002e 00046 (x_test.go:18)     PCDATA  $0, $1
        0x002e 00046 (x_test.go:18)     CALL    runtime.panicindex(SB)
        0x0033 00051 (x_test.go:18)     UNDEF
        0x0035 00053 (x_test.go:21)     MOVQ    "".a+32(SP), AX
        0x003a 00058 (x_test.go:21)     CMPQ    DX, AX
        0x003d 00061 (x_test.go:21)     JHI     $0, 72
        0x003f 00063 (x_test.go:22)     MOVQ    (SP), BP
        0x0043 00067 (x_test.go:22)     ADDQ    $8, SP
        0x0047 00071 (x_test.go:22)     RET
        0x0048 00072 (x_test.go:21)     PCDATA  $0, $1
        0x0048 00072 (x_test.go:21)     CALL    runtime.panicslice(SB)
        0x004d 00077 (x_test.go:21)     UNDEF

Assembly for Go 1.8:

        0x000c 00012 (x_test.go:17)     MOVQ    "".a+24(FP), AX
        0x0011 00017 (x_test.go:17)     MOVQ    "".a+16(FP), CX
        0x0016 00022 (x_test.go:16)     MOVQ    $0, DX
        0x0018 00024 (x_test.go:17)     CMPQ    DX, $1000
        0x001f 00031 (x_test.go:17)     JGE     $0, 53
        0x0021 00033 (x_test.go:18)     CMPQ    DX, AX
        0x0024 00036 (x_test.go:18)     JCC     $0, 79
        0x0026 00038 (x_test.go:18)     MOVB    DL, (CX)(DX*1)
        0x0029 00041 (x_test.go:19)     INCQ    DX
        0x002c 00044 (x_test.go:17)     CMPQ    DX, $1000
        0x0033 00051 (x_test.go:17)     JLT     $0, 33
        0x0035 00053 (x_test.go:21)     MOVQ    "".a+32(FP), AX
        0x003a 00058 (x_test.go:21)     CMPQ    DX, AX
        0x003d 00061 (x_test.go:21)     JHI     $0, 72
        0x003f 00063 (x_test.go:22)     MOVQ    (SP), BP
        0x0043 00067 (x_test.go:22)     ADDQ    $8, SP
        0x0047 00071 (x_test.go:22)     RET
        0x0048 00072 (x_test.go:21)     PCDATA  $0, $1
        0x0048 00072 (x_test.go:21)     CALL    runtime.panicslice(SB)
        0x004d 00077 (x_test.go:21)     UNDEF
        0x004f 00079 (x_test.go:18)     PCDATA  $0, $1
        0x004f 00079 (x_test.go:18)     CALL    runtime.panicindex(SB)
        0x0054 00084 (x_test.go:18)     UNDEF

Seems the only difference is that on tip we make an unconditional jump for loop backedge (JMP 24), whereas on Go 1.8 it copies two instructions (and reverse the condition) (CMPQ DX, $1000; JLT 33).

@dr2chase made range loop generate for-until like code, which effectively does this. But this example is not range loop, so not affected by that change.

@dr2chase is also working on loop unrolling, which may help from a different direction.

I guess we still don't want to copy instructions in the assembler?

ALTree · 2017-03-21T20:25:51Z

What I'm seeing is +50% (the benchmark above) on a Sandy bridge i5; but only +8% on a Haswell i7.

Looks like it hits hard on Sandy Bridge (2nd gen), while from 4th gen and above is not that bad? I don't have any 3rd gen (Ivy bridge) to check there.

gopherbot · 2017-03-21T22:45:28Z

CL https://golang.org/cl/38431 mentions this issue.

navytux · 2017-03-29T14:56:08Z

Benchmark result on pretty recent 6th-gen i7 (i7-6600U) on linux/amd64 for go18 and gotip (go version devel +846f925464 Wed Mar 29 06:50:07 2017 +0000 linux/amd64):

name   old time/op    new time/op    delta
Foo-4     341ns ± 0%     505ns ± 0%  +48.09%  (p=0.000 n=6+10)

name   old speed      new speed      delta
Foo-4  2.93GB/s ± 0%  1.98GB/s ± 0%  -32.44%  (p=0.000 n=10+10)

So here it is also ~ 50% slowdown.

ALTree added the Performance label Mar 21, 2017

ALTree added this to the Go1.9 milestone Mar 21, 2017

gopherbot closed this as completed in 39ce590 Apr 24, 2017

golang locked and limited conversation to collaborators Apr 24, 2018

gopherbot added the FrozenDueToAge label Apr 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: performance regression on append-like loop #19633

cmd/compile: performance regression on append-like loop #19633

ALTree commented Mar 21, 2017

ALTree commented Mar 21, 2017

cherrymui commented Mar 21, 2017 •

edited

Loading

ALTree commented Mar 21, 2017 •

edited

Loading

gopherbot commented Mar 21, 2017

navytux commented Mar 29, 2017

cmd/compile: performance regression on append-like loop #19633

cmd/compile: performance regression on append-like loop #19633

Comments

ALTree commented Mar 21, 2017

ALTree commented Mar 21, 2017

cherrymui commented Mar 21, 2017 • edited Loading

ALTree commented Mar 21, 2017 • edited Loading

gopherbot commented Mar 21, 2017

navytux commented Mar 29, 2017

cherrymui commented Mar 21, 2017 •

edited

Loading

ALTree commented Mar 21, 2017 •

edited

Loading