Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: benchmark performance regression in 1.9beta1 #20711

Closed
cespare opened this issue Jun 17, 2017 · 14 comments
Closed

cmd/compile: benchmark performance regression in 1.9beta1 #20711

cespare opened this issue Jun 17, 2017 · 14 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@cespare
Copy link
Contributor

cespare commented Jun 17, 2017

Sorry for the poor title -- I haven't done enough digging to know how to be more specific about the problem.

I'm comparing the following go versions:

go version go1.8 linux/amd64
go version go1.9beta1 linux/amd64

The following benchmark exhibits a significant slowdown:

package main

import (
	"testing"
)

type MPG []P

type P struct {
	p   uint16
	idx uint32
}

type Dim interface {
	g(mpg MPG, scale uint32)
}

type dim struct {
	cells []uint32
}

func (d *dim) g(mpg MPG, scale uint32) {
	for i, p := range mpg {
		mpg[i].idx += scale * uint32(d.cells[int(p.p)])
	}
}

func BenchmarkGM(b *testing.B) {
	const numRows = 50000
	cells := make([]uint32, numRows)
	var dim Dim = &dim{cells: cells}
	for i := range cells {
		cells[i] = uint32(i % 2)
	}
	mpg := make(MPG, numRows)
	for i := range mpg {
		mpg[i].p = uint16(i)
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		dim.g(mpg, 5)
	}
}
name  old time/op  new time/op  delta
GM-4  59.6µs ± 0%  71.4µs ± 1%  +19.80%  (p=0.008 n=5+5)

This is extracted from a big suite of dozens of similar benchmarks in an internal codebase. Most of these benchmarks stayed the same or got faster in Go 1.9beta1, but a few got slower. I know that low-level codegen optimizations often make many benchmarks speed up while causing a few to slow down, but I thought it would be good to bring an example slowdown here in case it's not one of those. (Overall we're very happy with performance changes in 1.8->1.9beta1.)

@cespare cespare added this to the Go1.9 milestone Jun 17, 2017
@bradfitz bradfitz added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 17, 2017
@bradfitz
Copy link
Contributor

/cc @randall77 @josharian @mdempsky

@josharian
Copy link
Contributor

Any chance of a bisection?

@cespare
Copy link
Contributor Author

cespare commented Jun 18, 2017

Yeah, I'll do one in the next few days.

@gopherbot
Copy link

CL https://golang.org/cl/46134 mentions this issue.

@randall77
Copy link
Contributor

Looks like there is an additional bounds check at tip. Here's a simple repro:

func f(a []int) {
	for i, x := range a {
		a[i] += x
	}
}

For 1.8.3, it generates an inner loop of

	0x0017 00023 (/home/khr/go/tmp2.go:5)	MOVQ	(DX)(BX*8), DI
	0x001b 00027 (/home/khr/go/tmp2.go:5)	ADDQ	DI, SI
	0x001e 00030 (/home/khr/go/tmp2.go:5)	MOVQ	SI, (DX)(BX*8)
	0x0022 00034 (/home/khr/go/tmp2.go:4)	ADDQ	$8, AX
	0x0026 00038 (/home/khr/go/tmp2.go:4)	INCQ	BX
	0x0029 00041 (/home/khr/go/tmp2.go:4)	CMPQ	BX, CX
	0x002c 00044 (/home/khr/go/tmp2.go:4)	JLT	$0, 20

For tip, it generates

	0x0023 00035 (/home/khr/go/tmp2.go:5)	CMPQ	BX, AX           <- extra bounds
	0x0026 00038 (/home/khr/go/tmp2.go:5)	JCC	72                           <- check here
	0x0028 00040 (/home/khr/go/tmp2.go:5)	MOVQ	(DX)(BX*8), DI
	0x002c 00044 (/home/khr/go/tmp2.go:5)	ADDQ	DI, SI
	0x002f 00047 (/home/khr/go/tmp2.go:5)	MOVQ	SI, (DX)(BX*8)
	0x0033 00051 (/home/khr/go/tmp2.go:4)	ADDQ	$8, CX
	0x0037 00055 (/home/khr/go/tmp2.go:4)	INCQ	BX
	0x003a 00058 (/home/khr/go/tmp2.go:4)	CMPQ	BX, AX
	0x003d 00061 (/home/khr/go/tmp2.go:4)	JLT	32

@dr2chase , I think this happens because of your change to do the additional len>0 comparison wrapping the loop. Instead of the loop exit check dominating the bounds check (as it does in 1.8), there are two predecessors of the bounds check block, each which does a comparison (one is the len>0 block, the other is the loop exit check). Only by combining the information from the two predecessors can we deduce that the bounds check is unnecessary.

Merging such information would be hard. Maybe there's another way to achieve the same result? And what would be the downside of backing out your change?

@cespare
Copy link
Contributor Author

cespare commented Jun 21, 2017

Would my doing a bisect still be helpful? Meant to do it earlier but it has been an extremely busy week for me.

@dr2chase
Copy link
Contributor

Backing out the change fully creates a problem for GOEXPERIMENT=preemptibleloops; one possibility is to instead make the loop change conditional on whether the experiment is turned on (but we expect to turn it on for 1.10). Is including both copies of the loop-translating code and switching based on goexperiment considered adequately low-risk?

@randall77
Copy link
Contributor

@dr2chase Is preemptable loops a problem because the preempt check comes after the pointer increment?

@cespare: I don't think we need a bisect. Thanks though. I just mailed a demonstration CL that I think shows we're on the right track.

@gopherbot
Copy link

CL https://golang.org/cl/46331 mentions this issue.

@randall77
Copy link
Contributor

@dr2chase : I have no problem including two versions of the loop-translating code and switching on goexperiment. It all depends on how big the CL is and what confidence we have in it.
I would hope we have high confidence in the old code, if we could just splice that in when not under experiment, that would be ideal.

@dr2chase
Copy link
Contributor

@randall77 Exactly that. And given how tricky it was to update the SSA graph to include the preemption check, I wasn't willing to put it anywhere other than the back edge, though perhaps for 1.10 we should get braver to reduce the cost of actually turning preempable loops on all the time.

@dr2chase
Copy link
Contributor

If you haven't started on it, I just did, should have a CL up as soon as it passes tests locally.

@cespare
Copy link
Contributor Author

cespare commented Jun 21, 2017

Thanks very much @dr2chase and @randall77.

@gopherbot
Copy link

CL https://golang.org/cl/46410 mentions this issue.

gopherbot pushed a commit that referenced this issue Sep 1, 2017
Currently we only use 1 and 4 as a scale for indexed 4-byte load.
In code generated in #20711 we can use indexed load with scale=8,
to improve performance:

name  old time/op  new time/op  delta
GM-6   108µs ± 0%    95µs ± 0%  -12.06%  (p=0.000 n=10+10)

So add new ops and combine loadidx1(shift 3..).. into loadidx8,
same for stores.

Change-Id: I5ed1c250ac40960e20606580cf9de221e75b72f1
Reviewed-on: https://go-review.googlesource.com/46134
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@golang golang locked and limited conversation to collaborators Jun 21, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

6 participants