
cmd/compile: unexpected difference in compiled code for returned array #31198

mariecurried opened this issue Apr 1, 2019 · 8 comments

@mariecurried
What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What did you do?

I compiled this code:

package test

func test() [16]byte {
    x := [16]byte{0}

    return x
}

What did you expect to see?

I expected the generated assembly code to be similar to this program's code:

package test

func test() [16]byte {
    x := [16]byte{1}

    return x
}

What did you see instead?

Instead, the generated code is much more bloated than I anticipated, even setting up a stack frame and copying the value through a stack temporary.

        subq    $24, SP
        movq    BP, 16(SP)
        leaq    16(SP), BP

        xorps   X0, X0
        movups  X0, "".~r0+32(SP)
        movups  X0, "".x(SP)
        movups  "".x(SP), X0
        movups  X0, "".~r0+32(SP)
        movq    16(SP), BP
        addq    $24, SP
        ret
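
For reference, assembly listings like the ones in this issue can be reproduced with the standard toolchain, for example (assuming the code is saved as test.go; a build with -gcflags=-S works as well):

    go tool compile -S test.go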

As pointed out by @vcabbage, this issue is probably related to the assignment to "x", given that the compiler produces better code for the following program:

package test

func test() [16]byte {
    return [16]byte{0}
}

Another interesting case is the one below, where the "good" case above becomes as bad as the "bad" one:

package test

func test() [16]byte {
    x := [16]byte{1}
    y := x

    return y
}
@randall77
Contributor

Probably a dup of #24416.
Generated code is not optimal for large structs and arrays.

@mundaym
Member

mundaym commented Apr 2, 2019

I think this is caused by LocalAddr + VarDef blocking store forwarding rewrite rules. This is what the SSA graph looks like just before lowering:

late opt [4974 ns]

b1:
    v1 (?) = InitMem <mem>
    v2 (?) = SP <uintptr>
    v5 (17) = VarDef <mem> {~r0} v1
    v6 (17) = LocalAddr <*[16]byte> {~r0} v2 v5
    v7 (+17) = Zero <mem> {[16]byte} [16] v6 v5
    v8 (18) = VarDef <mem> {x} v7
    v9 (18) = LocalAddr <*[16]byte> {x} v2 v8
    v10 (+18) = Zero <mem> {[16]byte} [16] v9 v8
    v11 (20) = LocalAddr <*[16]byte> {x} v2 v10
    v12 (20) = VarDef <mem> {~r0} v10
    v13 (+20) = LocalAddr <*[16]byte> {~r0} v2 v12
    v14 (20) = Move <mem> {[16]byte} [16] v13 v11 v12
    v4 (?) = Const64 <int64> [0]
    Ret v14 (+20)

Most of the store forwarding rules have special cases to allow them to work despite the presence of VarDefs. However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.

In general this is all quite fragile, sorry. These rules were added before the LocalAddr op was introduced.
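
To make the shape of the problem concrete, here is a schematic store forwarding rule in the style of generic.rules (a sketch only, not an actual rule from the compiler). The single-use condition on the VarDef memory state is the part that the extra LocalAddr use defeats: in the dump above, v12 is used by both v13 (a LocalAddr) and v14, so a check like this fails.

// Schematic only: a Move whose source was just zeroed can instead zero its
// destination directly, but only if nothing else uses the intermediate
// VarDef memory state.
(Move {t} [n] dst src mem:(VarDef (Zero {t} [n] src2 _)))
  && mem.Uses == 1 && isSamePtr(src, src2)
  => (Zero {t} [n] dst mem)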

@mariecurried
Author

The last example (the "interesting case") is actually a regression from commit 8607b2e:

Before:

xorps   X0, X0
movups  X0, "".~r0+8(SP)
movups  X0, "".~r0+8(SP)
movb    $1, "".~r0+8(SP)
ret

After:

subq    $24, SP
movq    BP, 16(SP)
leaq    16(SP), BP

xorps   X0, X0
movups  X0, "".~r0+32(SP)
movups  X0, "".x(SP)
movb    $1, "".x(SP)
movups  "".x(SP), X0
movups  X0, "".~r0+32(SP)
movq    16(SP), BP
addq    $24, SP
ret

Should I open a separate issue for this?

@mundaym
Member

mundaym commented Apr 2, 2019

The last example (the "interesting case") is actually a regression from commit 8607b2e:

Thanks, that makes sense. Any thoughts, @josharian? I guess the rule added in that commit introduces a 'dead end' for the optimization somehow.

Should I open a separate issue for this?

No, this issue is fine.

@josharian
Contributor

I guess the rule added in that commit adds a 'dead end' for the optimization somehow.

I think what is happening is that that rule is interfering with dead auto elimination. Previously, we had 'x -> y -> ret', and both x and y were recognized as dead autos. The rule rewrites 'x -> y -> ret' into 'x -> ret', ironically in the hope that y can be dead-auto-eliminated, but this interferes with eliminating x. I'm not sure exactly why yet; maybe, to be effective, dead auto elimination relies on your store forwarding rules, which aren't triggering here?

From spending a few minutes on this, I suspect that the fix won't be straightforward. So part of the question is whether this is a rare case, and thus whether it is worth spending the time to restore this optimization.
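
For reference, the rewrite described above has roughly the following shape (a schematic based on the description, not the literal rule from commit 8607b2e; the real rule carries additional safety conditions):

// Schematic only: copy straight from the original source instead of the
// temporary, in the hope that the temporary then becomes a dead auto.
(Move {t} [n] dst tmp1 move:(Move {t} [n] tmp2 src _))
  && isSamePtr(tmp1, tmp2)
  => (Move {t} [n] dst src move)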

@josharian
Contributor

The challenge here is that, at the moment that rule is applied, it is hard to tell whether 'x' can be eliminated or not. I don't see a good approach here, although I'd be open to suggestions. On balance, the optimization does appear to help more than it hurts.

One question:

However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.

Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state? The only important thing for LocalAddr's memory arg is that it be correctly ordered in the memory chain for liveness purposes; we should be able to preserve that property.
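
A minimal sketch of the kind of check being proposed, written against simplified stand-in types rather than the real cmd/compile/internal/ssa API (the type and helper names below are hypothetical):

package sketch

// Op and Value are simplified, hypothetical stand-ins for the compiler's
// SSA types; the real definitions live in cmd/compile/internal/ssa.
type Op int

const (
    OpLocalAddr Op = iota
    OpVarDef
    OpZero
    OpMove
)

type Value struct {
    Op   Op
    Args []*Value
}

// nonLocalAddrUses counts how many values in a block take mem as an
// argument, ignoring LocalAddr: its memory argument exists only to order
// it in the memory chain for liveness, so it arguably should not block
// store forwarding.
func nonLocalAddrUses(mem *Value, block []*Value) int {
    n := 0
    for _, v := range block {
        if v.Op == OpLocalAddr {
            continue
        }
        for _, arg := range v.Args {
            if arg == mem {
                n++
            }
        }
    }
    return n
}

A rule condition could then require this count to be 1 instead of checking the raw use count of the VarDef memory state.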

All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

@mundaym
Member

mundaym commented Apr 2, 2019

Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state?

Not sure, sounds unpleasant in SSA. I'd rather we found a way to make these sorts of annotations less intrusive when optimizing. Not that I have any ideas how to do that...

All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

Yes, that sounds interesting. Avoiding touching memory entirely, at least early on in the compilation pipeline, seems like the nicest approach to take for these sorts of issues.

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 3, 2019
@andybons andybons added this to the Unplanned milestone Apr 3, 2019
@josharian
Contributor

It's possible that we could fix this by adding arch-specific variants of these generic rules:

// Don't zero the same bits twice.
(Zero {t} [s] dst1 zero:(Zero {t} [s] dst2 _)) && isSamePtr(dst1, dst2) => zero
(Zero {t} [s] dst1 vardef:(VarDef (Zero {t} [s] dst2 _))) && isSamePtr(dst1, dst2) => vardef

We'd be able to detect writing any set of duplicate bits twice (not just zeros), although we'd have to write out all the size-specific variants. During lowering, the LocalAddrs become LEAQs (on amd64) and drop out of the memory chain.
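
For illustration, one size-specific variant might look roughly like this (schematic only; the op choice and exact form are illustrative, not a tested rule):

// Schematic only: drop the outer of two identical constant stores to the
// same address. After lowering, address computations are LEAQs rather than
// LocalAddrs, so they no longer sit in the memory chain and cannot block
// the match.
(MOVQstoreconst [c] {sym} ptr1 store:(MOVQstoreconst [c] {sym} ptr2 _))
  && isSamePtr(ptr1, ptr2)
  => store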

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
Labels: binary-size, compiler/runtime, NeedsInvestigation