Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: add unrolling stage for automatic loop unrolling #51302

Open
vpachkov opened this issue Feb 21, 2022 · 15 comments
Open

cmd/compile: add unrolling stage for automatic loop unrolling #51302

vpachkov opened this issue Feb 21, 2022 · 15 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@vpachkov
Copy link
Contributor

Loop unrolling is a tecnique intended to speed up loops. It's supported by other mature compilers such as Clang.

This proposal consists of possible implementation ideas: general loop unrolling rules and how they can apply to Golang compiler, and some simple benchmarks reflecting this optimization performance on simple constant range loops.

It's easy to begin with a simple constant range for loops such as:

for i := 0 ; i < 1000 ; i++ {
	a[i] += 2;
}

And then add more features inside unroll package which will represent a loop unrolling optimization pass.

Unroll package implementation ideas

The following approach could already be easily integrated inside Golang optimization pipeline right after the inlining stage.

// Inlining
base.Timer.Start("fe", "inlining")
if base.Flag.LowerL != 0 {
	inline.InlinePackage()
}
noder.MakeWrappers(typecheck.Target) // must happen after inlining

// Unrolling
unroll.UnrollPackage()

UnrollPackage() will traverse each function to find for loops and check if it's appropriate for unrolling, then perform unrolling by calling Unroll() function if so.

// Unroll function takes 2 parameters:
// forstmt - an appropriate for loop,
// unroll - an unrolling factor (the amount of times the body will be repeated).
// It unrolls it in-place and returns a tail which should be placed right after the loop.
// The tail is generated if for range isn't divisible by the unrolling factor.
func Unroll(forstmt *ir.ForStmt, unroll uint32) (tail ir.Nodes) {
	....
}

It's important to calculate the unrolling factor correctly. If it's too big we can run into a problem when a for loop body exceeds the instruction cache. A possible idea of picking the factor is by reusing the part of the inlining stage, that is, hairyVisitor since there was already a lot of work done for choosing the weights of the nodes.

if A = maximum weight of the for loop which is short enough to be kept in cache, and B = the weight of a for loop body calculated by the extended version of hairyVisitor, then the unrolling factor = A / B. If it's greater than 1 then loop unrolling is beneficial for that loop.

Unroll function implementation ideas

When unroll variable is picked a for loop can be unrolled in 4 steps. Once again, we're dealing with constant values here:

  1. Align condition so the loop does not go out of the boundaries (i < 1000 -> i < 1000 / unroll * unroll)
if forstmt.Cond != nil {
	cmp := forstmt.Cond.(*ir.BinaryExpr)
	val := ir.ConstValue(cmp.Y).(int64)
	alignedval = val / int64(unroll) * int64(unroll)
	newval := ir.NewConstExpr(constant.MakeInt64(alignedval), cmp.Y)
	cmp.Y = newval
}
  1. Modify post expression so the induction variable goes unroll steps at a time (i++ -> i += unroll)
// Unroll post
if forstmt.Post != nil {
	post := forstmt.Post.(*ir.AssignOpStmt)
	inc := ir.ConstValue(post.Y).(int64)
	unrolledinc := inc * int64(unroll)
	newinc := ir.NewConstExpr(constant.MakeInt64(unrolledinc), post.Y)
	post.Y = newinc
}
  1. Modify body.

This step is a little bit more complex since to only copy the body isn't enough. Suppose unroll is 4 and the body of the loop is:

for i := 0 ; i < 100 ; i++ {
	sum += i
}

Just coping the body 4 times isn't enough since it gives us:

for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i
	sum += i
	sum += i
}

The correct version is:

for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i + 1
	sum += i + 2
	sum += i + 3
}

Firstly, we must find all induction variables and then after coping the body, apply shifting operation each time.
Keeping that in mind, body unrolling can be implemented in the following way:

// Firstly, copy original version of a body
bodycopy := ir.DeepCopyList(base.Pos, forstmt.Body)

// Copy the body unroll - 1 times, apply shifting and insert it in the body
for unr := uint64(1); unr < unroll; unr++ {
	appendbody := ir.DeepCopyList(base.Pos, bodycopy)

	// i is a loop induction variable
	// Note: there could be multiple induction variables for a loop.
	shiftNodes(appendbody, i, uint64(inc)*unr)

	forstmt.Body.Append(appendbody...)
}

// shiftNodes function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be shifted,
// shift - a shift constant.
// It generates a new node which represents an expression orig + shift and
// changes every orig reference to the new expression.
func shiftNodes(nodes ir.Nodes, orig ir.Node, shift uint64) {
	idx := ir.NewConstExpr(constant.MakeUint64(shift), orig)
	idx.SetType(types.Types[types.TUINTPTR])
	idx.SetTypecheck(1)

	shifted := ir.NewBinaryExpr(base.Pos, ir.OADD, orig, idx)
	shifted.SetType(orig.Type())
	shifted.SetTypecheck(orig.Typecheck())

	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return shifted
		}
		return x
	}

	for _, node := range nodes {
		edit(node)
	}
}

Suppose we have a loop:

for i := 0 ; i < 101 ; i++ {
	sum += i
}

And the unroll is 3. Than it should be unrolled to this:

for i := 0 ; i < 101 / 3 * 3 ; i+=3 {
	sum += i
	sum += i + 1
	sum += i + 2
}

Since 101 isn't divisible by 3 there are 101 % 3 operations that hasn't been performed yet:

sum += 99
sum += 100

This should also be generated and placed after the for loop.

for idx := val / unroll * unroll; idx < val; idx++ {
	appendtail := ir.DeepCopyList(base.Pos, bodycopy)
	placeConst(appendtail, i, idx)
	tail.Append(appendtail...)
}

// placeConst function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be replaced by a constant,
// con - a constant itself.
// It generates a new constant and changes every orig reference to the new contant node.
func placeConst(nodes ir.Nodes, orig ir.Node, con int64) {
	i := ir.NewConstExpr(constant.MakeInt64(con), orig)
	i.SetType(types.Types[types.TINT64])
	i.SetTypecheck(1)

	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return i
		}
		return x
	}

	for _, node := range nodes {
		edit(node)
	}
}

Results

The following lines of code:

var a [100]int

for i := 0 ; i < 100 - 1; i++ {
	a[i] += 2
}

currently are compiled to:

 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 09                	jmp    1053ec0 <_main.main+0x60>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 ff c0             	inc    rax
 1053ec0:	48 83 f8 63          	cmp    rax,0x63
 1053ec4:	7c f1                	jl     1053eb7 <_main.main+0x57>

After applying a very basic version of loop unrolling with the above approach those are compiled to:

 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 1c                	jmp    1053ed3 <_main.main+0x73>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 83 44 c4 20 02    	add    QWORD PTR [rsp+rax*8+0x20],0x2
 1053ec3:	48 83 44 c4 28 02    	add    QWORD PTR [rsp+rax*8+0x28],0x2
 1053ec9:	48 83 44 c4 30 02    	add    QWORD PTR [rsp+rax*8+0x30],0x2
 1053ecf:	48 83 c0 04          	add    rax,0x4
 1053ed3:	48 83 f8 60          	cmp    rax,0x60
 1053ed7:	7c de                	jl     1053eb7 <_main.main+0x57>
 1053ed9:	48 83 84 24 18 03 00 	add    QWORD PTR [rsp+0x318],0x2
 1053ee0:	00 02 
 1053ee2:	48 83 84 24 20 03 00 	add    QWORD PTR [rsp+0x320],0x2
 1053ee9:	00 02 
 1053eeb:	48 83 84 24 28 03 00 	add    QWORD PTR [rsp+0x328],0x2

Body is repeated 4 times with the shifted indices. Tail is placed after the loop that handles 96th, 97th and 98th indices.

Benchmarks

func BenchmarkUnrolling(b *testing.B) {
	var a [100]int

	for j := 0 ; j < b.N ; j++ {
		for i := 0 ; i < 100 - 1; i++ {
			a[i] += 2
		}
	}
}
name          old time/op  new time/op  delta
Unrolling-12  51.7ns ± 0%  23.9ns ± 3%  -53.71%  (p=0.016 n=4+5)
@gopherbot gopherbot added this to the Proposal milestone Feb 21, 2022
@vpachkov
Copy link
Contributor Author

Related: #49997

@ianlancetaylor
Copy link
Contributor

This is not an API change, so taking this out of the proposal process.

In my opinion the most important consideration is compile time. The Go compiler has a much greater emphasis on compile time than a compiler like clang.

CC @randall77 @golang/runtime

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Feb 21, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Unplanned Feb 21, 2022
@ianlancetaylor ianlancetaylor changed the title proposal: cmd/compile: add unrolling stage for automatic loop unrolling cmd/compile: add unrolling stage for automatic loop unrolling Feb 21, 2022
@randall77
Copy link
Contributor

Another consideration is binary size. We don't want to just unroll everything, as that can lead to binary bloat.

I think there's definitely something we can proceed with here. The way I see it there are 2 parts:

  1. The mechanism of loop unrolling
  2. The heuristics for loop unrolling

It would be good to start with figuring out how to implement 1, and have a very conservative heuristic for 2. Once we have that working, we can investigate improving the heuristic to encompass more cases. Whether that is using profile-directed feedback, or analysis of what kind of loop bodies would benefit the most, or whatever.

@erifan
Copy link

erifan commented Feb 22, 2022

Maybe we should do the loop invariant optimization first?

@randall77
Copy link
Contributor

@erifan is mentioning #15808, I think. I agree that fixing that issue would also help, but I don't think there is any fundamental need to order the two. Other than, perhaps, #15808 is easier and the heuristics are probably a lot easier to figure out.

@erifan
Copy link

erifan commented Feb 22, 2022

@erifan is mentioning #15808, I think. I agree that fixing that issue would also help, but I don't think there is any fundamental need to order the two. Other than, perhaps, #15808 is easier and the heuristics are probably a lot easier to figure out.

Yeah, I agree that there is no necessary connection between the two, I mean if loop unrolling is based on loop invariant hoisting , will this optimization be a little easier to implement?

In addition, perhaps not very relevant to this topic, do we need to consider loop-related optimizations from a higher dimension. Because maybe in the future we will also consider auto-vectorization, I hope we have a general consideration.

@cherrymui
Copy link
Member

cc @dr2chase who did some experiment with loop unrolling in SSA.

@dr2chase
Copy link
Contributor

I've played with both loop unrolling and loop invariant hoisting, and when the Go world was very very Intel-oriented neither of these reliably made sense (and their code was unpleasant). I recently attempted to revive the invariant-hoisting CL and could not easily get it to work (and I am not likely to look at this again before March 7, i.e., the likely go1.18 release date).

But in general, the wins were less than awesome, binary sizes go up a bit with loop unrolling, and for a given binary-size/compile-time budget it was not at all clear that this was the best way to increase performance. It also creates problems for debugging.

@laboger
Copy link
Contributor

laboger commented May 23, 2022

I've played with both loop unrolling and loop invariant hoisting

I'm interested in invariant loop hoisting. Where is the CL you revived? I suspect it would help more on Power or maybe even ARM64.

@dr2chase
Copy link
Contributor

Here's the hoisting CL, newly revived: https://go-review.googlesource.com/c/go/+/37338.
It lacks a test, but I did hand-verify it.

Unrolling, I didn't do as good a job on that CL (it merely boilerplated the body, didn't handle iterator arithmetic cleverly). Unrolling is tricky because updating SSA in place is a pain (I can try to find that one, nonetheless).

@dr2chase
Copy link
Contributor

Note that this CL changes arm64 code generation for one of the codegen tests (so it fails), which is something I'll need to look at. I'm also doing a quick round of benchmarking on amd64 and a low-end arm64.

@jamestiotio
Copy link

But in general, the wins were less than awesome, binary sizes go up a bit with loop unrolling, and for a given binary-size/compile-time budget it was not at all clear that this was the best way to increase performance. It also creates problems for debugging.

I think there are still valid use cases for this feature, especially for more performance-sensitive Go applications. At the very least, we can provide an option for Go users to enable better performance at the cost of larger binary size, longer compile time, and harder debugging as mentioned. By providing this option to Go developers, they can select the appropriate optimization levels depending on the specific tradeoffs of their applications and whether they have performance-critical code or not.

For example, we could add this feature behind an opt-in experimental flag. This approach is adopted by compilers of other programming languages such as:

@shoham-b
Copy link

Now that we have PGO, maybe we can add this just to the most "hot paths"?
The trade-off of binary size\ performance speed up heuristic is already decided there, so we can just use that.

@shoham-b
Copy link

shoham-b commented Dec 26, 2023

  1. Although not strictly bound to loop invariant, as @erifan and @randall77 said, the performance would have better gains, as more loops would be candidates to the unroll since their body size would decrease. In the current state of discussion on whether to implement this at all, it might help enlighten the situation.
  2. It might unroll loops that were never intended to fully run, where there are "break" or "return" in it. (this is also a limitation of GCC https://godbolt.org/z/qE9s6s3cs )
  3. One of the main benefits of unrolling is instruction cache-hits so in places where there is a function call, the benefit would be less significant.
  4. With big loops, the reminder might be huge, causing an almost x2 binary size than what is needed.
  5. Note that GCC for the least does the reminder calculation before the loop not after, https://godbolt.org/z/GnrjdMKT7 , not sure if that has consequences.

All in all, a more sophisticated manner of unrolling would yield a better byte\time ratio IMHO
The most conservative approach is a good start - unroll only when it's guaranteed to speed up.

@aclements
Copy link
Member

I agree that PGO would make loop unrolling far more tenable, so I've added it to the PGO umbrella issue (#62463).

Note however that loop unrolling is not just a balance with binary size/i-cache pressure. On modern CPUs, simply unrolling a loop is often a pessimization, for example if it contains conditional branches then unrolling it will increase pressure on the branch predictor (Intel Optimization Manual, "Loop Unrolling"). It's really a lot like inlining: the main benefit is generally not the unrolling itself, but the follow-on optimizations it enables. That said, that just makes PGO an even better fit for this, since it would allow us to focus where we spend compile time analyzing whether unrolling a loop is even worthwhile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests