Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: missed opportunity to inline runtime.memmove #41662

Closed
josharian opened this issue Sep 27, 2020 · 9 comments
Closed

cmd/compile: missed opportunity to inline runtime.memmove #41662

josharian opened this issue Sep 27, 2020 · 9 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@josharian
Copy link
Contributor

josharian commented Sep 27, 2020

package p

func f(b []byte, x *[8]byte) {
	_ = b[8]
	copy(b, x[:])
}

This should compile down to a pair of MOVQs, one to load from x and one to write to b. It doesn't; it still contains a call to runtime.memmove. From a quick glance at ssa.html, the problem is that the lowering of runtime.memmove to Move happens during generic.rules, but we haven't detected that the size of the memmove is a constant until after lowering.

To fix this, we could either improve the analysis in the generic stages or add arch-specific runtime.memmove-to-Move lowering.

cc @randall77 @dr2chase @martisch @mundaym

@josharian josharian added this to the Unplanned milestone Sep 27, 2020
@andybons andybons added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 29, 2020
@kels-ng
Copy link
Contributor

kels-ng commented Jan 15, 2021

Is the issue still actual? Let me take it.

@dr2chase
Copy link
Contributor

Here's a sample of a similar optimization where small memequals calls are optimized.

@josharian
Copy link
Contributor Author

@Kels9009 great! Ask here if you have questions. Please be sure to add a codegen test (see test/codegen).

@kels-ng
Copy link
Contributor

kels-ng commented Jan 18, 2021

@dr2chase @josharian OK, thanks for useful information.

@kels-ng
Copy link
Contributor

kels-ng commented Jan 20, 2021

Hello,

I investigated the issue and I would like to get an advice to select the right direction in order to fix it.

First, I tried to make generic solution. Here the SSA dump:
image

As you can see on the late stage of optimization (late opt) the problem is memmove are using Phi operation as a size argument (v50). Phi operation can't be eliminated due to rules require both arguments to be constant (we have Arg + Const):

// basic phi simplifications
...
(Phi (Const64  [c]) (Const64  [c])) => (Const64  [c])
...

But 'memmove' size argument detected as a constant on the next stage (before lowering) - dead auto elim + generic deadcode:
image
In this case memmove can be inlined using current rules without modification.

So, I added another one opt stage after dead auto elim + generic deadcode. As a result memmove was inlined successfully and almost all all.bash tests were passed except few codegen tests (for example writebarrier).
Another solution was to move late opt to position after dead auto elim + generic deadcode. Unfortunately, the result is broken code generation.

So, as a result a have few branches for next steps:

  1. Add arch-specific runtime.memmove-to-Move lowering. I already have the first implementation in my local branch for AMD64 and it works fine. I don't like this solution because it requires to implement this rule for each archs and I think generic rules are more proper place for such optimization. But still, we can use it and ignore my doubts :).
  2. Add another generic opt stage after dead auto elim + generic deadcode and fix all failed tests. I think this is the worst solution, because we can't predict how many hidden problems we will create.
  3. Extract memmove recognition rules from generic to the new file and insert new opt stage after dead auto elim + generic deadcode. This solution move memmove inlining to the later position. In this case we reduce chance to overlook mistake and restructure rules. The problem is current source-code structure doesn't have dedicated files for each types of optimizations (they all are stored in generic and arch-specific rules), and I'm not sure I can decide by myself such change without experts.

What do you think about it? Which one of these directions is right solution?

@ianlancetaylor
Copy link
Contributor

@kels-ng A note for the future: when pasting text, please just paste plain text, not an image. At least for me your images are nearly illegible, whereas everyone using this issue tracker can read plain text. Thanks.

@josharian
Copy link
Contributor Author

Thanks, @kels-ng. We should do option (1).

New rules are pretty cheap. There are a fair number of rules duplicated in generic and lowered opt, for similar reasons. We don't want to introduce new opt passes without a very compelling need, and moving around existing passes is risky.

@gopherbot
Copy link

Change https://golang.org/cl/289151 mentions this issue: cmd/compile: add arch-specific inlining for runtime.memmove

@gopherbot
Copy link

Change https://golang.org/cl/352054 mentions this issue: cmd/compile: add PP64-specific inlining for runtime.memmove

gopherbot pushed a commit that referenced this issue Sep 29, 2021
Add rule to PPC64.rules to inline runtime.memmove in more cases, as is
done for other target architectures
Updated tests in codegen/copy.go to verify changes are done on
ppc64/ppc64le

Updates #41662

Change-Id: Id937ce21f9b4f4047b3e66dfa3c960128ee16a2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/352054
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
@golang golang locked and limited conversation to collaborators Sep 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

6 participants