Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: performance regression on append-like loop #19633

Closed
ALTree opened this issue Mar 21, 2017 · 5 comments
Closed

cmd/compile: performance regression on append-like loop #19633

ALTree opened this issue Mar 21, 2017 · 5 comments

Comments

@ALTree
Copy link
Member

ALTree commented Mar 21, 2017

Consider the following benchmark:

func BenchmarkFoo(b *testing.B) {
	a := make([]byte, 1000)
	b.SetBytes(int64(len(a)))

	for i := 0; i < b.N; i++ {
		var n int
		for j := 0; j < 1000; j++ {
			a[n] = byte(j)
			n++
		}
		a = a[:n]
	}
}

Comparing tip (go version devel +d972dc2de9 Tue Mar 21 06:36:56 2017 +0000 linux/amd64) with go1.8 I get, on my machine:

name   old time/op    new time/op    delta
Foo-4     684ns ± 1%    1022ns ± 0%  +49.34%  (p=0.000 n=18+15)

name   old speed      new speed      delta
Foo-4  1.46GB/s ± 1%  0.98GB/s ± 0%  -32.99%  (p=0.000 n=18+18)

A pretty bad regression.

@ALTree ALTree added this to the Go1.9 milestone Mar 21, 2017
@ALTree
Copy link
Member Author

ALTree commented Mar 21, 2017

The regression was introduced in CL 36205 (cmd/internal/obj: remove Follow pass). Also see: #18977.

cc @cherrymui

@cherrymui
Copy link
Member

cherrymui commented Mar 21, 2017

Tried on two machines, one Linux desktop and one Mac laptop, the regression I saw is much smaller

name    old time/op    new time/op    delta
Foo-12     488ns ± 0%     524ns ± 0%  +7.44%  (p=0.000 n=19+18)

name    old speed      new speed      delta
Foo-12  2.05GB/s ± 0%  1.91GB/s ± 0%  -6.93%  (p=0.000 n=19+18)

So this is probably very processor-dependent?

Assembly for tip:

        0x000c 00012 (x_test.go:17)     MOVQ    "".a+24(SP), AX
        0x0011 00017 (x_test.go:17)     MOVQ    "".a+16(SP), CX
        0x0016 00022 (x_test.go:16)     MOVQ    $0, DX
        0x0018 00024 (x_test.go:17)     CMPQ    DX, $1000
        0x001f 00031 (x_test.go:17)     JGE     $0, 53
        0x0021 00033 (x_test.go:18)     CMPQ    DX, AX
        0x0024 00036 (x_test.go:18)     JCC     $0, 46
        0x0026 00038 (x_test.go:18)     MOVB    DL, (CX)(DX*1)
        0x0029 00041 (x_test.go:19)     INCQ    DX
        0x002c 00044 (x_test.go:17)     JMP     24
        0x002e 00046 (x_test.go:18)     PCDATA  $0, $1
        0x002e 00046 (x_test.go:18)     CALL    runtime.panicindex(SB)
        0x0033 00051 (x_test.go:18)     UNDEF
        0x0035 00053 (x_test.go:21)     MOVQ    "".a+32(SP), AX
        0x003a 00058 (x_test.go:21)     CMPQ    DX, AX
        0x003d 00061 (x_test.go:21)     JHI     $0, 72
        0x003f 00063 (x_test.go:22)     MOVQ    (SP), BP
        0x0043 00067 (x_test.go:22)     ADDQ    $8, SP
        0x0047 00071 (x_test.go:22)     RET
        0x0048 00072 (x_test.go:21)     PCDATA  $0, $1
        0x0048 00072 (x_test.go:21)     CALL    runtime.panicslice(SB)
        0x004d 00077 (x_test.go:21)     UNDEF

Assembly for Go 1.8:

        0x000c 00012 (x_test.go:17)     MOVQ    "".a+24(FP), AX
        0x0011 00017 (x_test.go:17)     MOVQ    "".a+16(FP), CX
        0x0016 00022 (x_test.go:16)     MOVQ    $0, DX
        0x0018 00024 (x_test.go:17)     CMPQ    DX, $1000
        0x001f 00031 (x_test.go:17)     JGE     $0, 53
        0x0021 00033 (x_test.go:18)     CMPQ    DX, AX
        0x0024 00036 (x_test.go:18)     JCC     $0, 79
        0x0026 00038 (x_test.go:18)     MOVB    DL, (CX)(DX*1)
        0x0029 00041 (x_test.go:19)     INCQ    DX
        0x002c 00044 (x_test.go:17)     CMPQ    DX, $1000
        0x0033 00051 (x_test.go:17)     JLT     $0, 33
        0x0035 00053 (x_test.go:21)     MOVQ    "".a+32(FP), AX
        0x003a 00058 (x_test.go:21)     CMPQ    DX, AX
        0x003d 00061 (x_test.go:21)     JHI     $0, 72
        0x003f 00063 (x_test.go:22)     MOVQ    (SP), BP
        0x0043 00067 (x_test.go:22)     ADDQ    $8, SP
        0x0047 00071 (x_test.go:22)     RET
        0x0048 00072 (x_test.go:21)     PCDATA  $0, $1
        0x0048 00072 (x_test.go:21)     CALL    runtime.panicslice(SB)
        0x004d 00077 (x_test.go:21)     UNDEF
        0x004f 00079 (x_test.go:18)     PCDATA  $0, $1
        0x004f 00079 (x_test.go:18)     CALL    runtime.panicindex(SB)
        0x0054 00084 (x_test.go:18)     UNDEF

Seems the only difference is that on tip we make an unconditional jump for loop backedge (JMP 24), whereas on Go 1.8 it copies two instructions (and reverse the condition) (CMPQ DX, $1000; JLT 33).

@dr2chase made range loop generate for-until like code, which effectively does this. But this example is not range loop, so not affected by that change.

@dr2chase is also working on loop unrolling, which may help from a different direction.

I guess we still don't want to copy instructions in the assembler?

@ALTree
Copy link
Member Author

ALTree commented Mar 21, 2017

What I'm seeing is +50% (the benchmark above) on a Sandy bridge i5; but only +8% on a Haswell i7.

Looks like it hits hard on Sandy Bridge (2nd gen), while from 4th gen and above is not that bad? I don't have any 3rd gen (Ivy bridge) to check there.

@gopherbot
Copy link

CL https://golang.org/cl/38431 mentions this issue.

@navytux
Copy link
Contributor

navytux commented Mar 29, 2017

Benchmark result on pretty recent 6th-gen i7 (i7-6600U) on linux/amd64 for go18 and gotip (go version devel +846f925464 Wed Mar 29 06:50:07 2017 +0000 linux/amd64):

name   old time/op    new time/op    delta
Foo-4     341ns ± 0%     505ns ± 0%  +48.09%  (p=0.000 n=6+10)

name   old speed      new speed      delta
Foo-4  2.93GB/s ± 0%  1.98GB/s ± 0%  -32.44%  (p=0.000 n=10+10)

So here it is also ~ 50% slowdown.

@golang golang locked and limited conversation to collaborators Apr 24, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants