cmd/compile: ranging exclusively values over slices and arrays that are not cheaply indexable keeps alive 3 registers instead of 2 #61629
Labels
compiler/runtime
Issues related to the Go compiler and/or runtime.
NeedsInvestigation
Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Performance
Milestone
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Compile:
What did you expect to see?
I expect the
CALL
to require saving 2 registers into the stack.What did you see instead?
3 registers are saved
Potential Fixes
C++ like iteration
This happen due to how
range
rewrites loop, my function from before is rewritten like this:3 registers are needed
i
,end
andstart
.I wrote a patch that do C++ style iteration (comparing pointers)
However this does not work because the loop body (here calling
f
) may cause a stack resize, and thenend
(which is auintptr
) is not fixed if it points to stack, whileptrInsideTheLoop
is so their relationship gets all messed up.We can't create an
endInsideTheLoop
which is anunsafe.Pointer
like forptr
because end is one past the last valid index of the slice, which cause the GC to panic if it is scanned, we would need a pointer that is scanned by the stack relocator but not by the GC which AFAIK is not compatible with the runtime right now.Downward count
I have an other patch (https://go-review.googlesource.com/c/go/+/512935) which later down the line (in the SSA) rewrites the 3 register loop into a 2 register one like this:
It removes
end
by inverting the loop, instead of starting at a constant and counting up towards a variable (which require keeping alive the index and the bound), it start at the bound and count down until zero is reached.It is slightly less efficient than the C++ like iteration because it does 2 math operation per iteration, adding to the running pointer and decrementing the index, however theses two instructions are not dependent on each other so they execute in parallel unless the loop body only leaves one free port when the loop's tail is reached, the huge amount of loops (such as the one in my example) will do memory anyway and wont be bottleneck by the increments, it also make the code slightly bigger such that maybe it crosses a cacheline boundry, however theses cases are rare and have minor impacts (it is always better than the 3 register form).
On the other side if you have a loop that you do not enter at a comparable rate at which you do iterations it is better, because it does not have an addition in the loop preheader (just load and compare).
Given the negligeable impacts of the 2 register downward count vs the C++ like range, I think this is a better solution than changing the runtime to have support for pointers that are handled by the stack rewriter but not the GC.
Note: it fixes all loops when an index is used to count from 0 to some bound but the index is not used, not just range loops where only the value is used, but this is a very popular place for this optimization to happen.
The text was updated successfully, but these errors were encountered: