
cmd/compile: unexpected difference in compiled code for returned array #31198

mariecurried opened this issue Apr 1, 2019 · 8 comments

@mariecurried
What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What did you do?

I compiled this code:

package test

func test() [16]byte {
    x := [16]byte{0}

    return x
}

What did you expect to see?

I expected the generated assembly code to be similar to this program's code:

package test

func test() [16]byte {
    x := [16]byte{1}

    return x
}

What did you see instead?

Instead, the generated code is much more bloated than I anticipated, even setting up a stack frame and copying the value through a stack temporary.

        subq    $24, SP
        movq    BP, 16(SP)
        leaq    16(SP), BP

        xorps   X0, X0
        movups  X0, "".~r0+32(SP)
        movups  X0, "".x(SP)
        movups  "".x(SP), X0
        movups  X0, "".~r0+32(SP)
        movq    16(SP), BP
        addq    $24, SP
        ret
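
For reference, assembly listings like the ones in this issue can be reproduced with the standard toolchain, for example (assuming the code is saved as test.go; a build with -gcflags=-S works as well):

    go tool compile -S test.go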

As pointed out by @vcabbage, this issue is probably related to the assignment to "x", given that the compiler produces better code for the following program:

package test

func test() [16]byte {
    return [16]byte{0}
}

Another interesting case is the one below, where the "good" case above becomes as bad as the "bad" one:

package test

func test() [16]byte {
    x := [16]byte{1}
    y := x

    return y
}
@randall77
Contributor

Probably a dup of #24416.
Generated code is not optimal for large structs and arrays.

@mundaym
Member

mundaym commented Apr 2, 2019

I think this is caused by LocalAddr + VarDef blocking store forwarding rewrite rules. This is what the SSA graph looks like just before lowering:

late opt [4974 ns]

b1:
    v1 (?) = InitMem <mem>
    v2 (?) = SP <uintptr>
    v5 (17) = VarDef <mem> {~r0} v1
    v6 (17) = LocalAddr <*[16]byte> {~r0} v2 v5
    v7 (+17) = Zero <mem> {[16]byte} [16] v6 v5
    v8 (18) = VarDef <mem> {x} v7
    v9 (18) = LocalAddr <*[16]byte> {x} v2 v8
    v10 (+18) = Zero <mem> {[16]byte} [16] v9 v8
    v11 (20) = LocalAddr <*[16]byte> {x} v2 v10
    v12 (20) = VarDef <mem> {~r0} v10
    v13 (+20) = LocalAddr <*[16]byte> {~r0} v2 v12
    v14 (20) = Move <mem> {[16]byte} [16] v13 v11 v12
    v4 (?) = Const64 <int64> [0]
    Ret v14 (+20)

Most of the store forwarding rules have special cases to allow them to work despite the presence of VarDefs. However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.

In general this is all quite fragile, sorry. These rules were added before the LocalAddr op was introduced.
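
To make the shape of the problem concrete, here is a schematic store forwarding rule in the style of generic.rules (a sketch only, not an actual rule from the compiler). The single-use condition on the VarDef memory state is the part that the extra LocalAddr use defeats: in the dump above, v12 is used by both v13 (a LocalAddr) and v14, so a check like this fails.

// Schematic only: a Move whose source was just zeroed can instead zero its
// destination directly, but only if nothing else uses the intermediate
// VarDef memory state.
(Move {t} [n] dst src mem:(VarDef (Zero {t} [n] src2 _)))
  && mem.Uses == 1 && isSamePtr(src, src2)
  => (Zero {t} [n] dst mem)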

@mariecurried
Author

The last example (the "interesting case") is actually a regression from commit 8607b2e:

Before:

xorps   X0, X0
movups  X0, "".~r0+8(SP)
movups  X0, "".~r0+8(SP)
movb    $1, "".~r0+8(SP)
ret

After:

subq    $24, SP
movq    BP, 16(SP)
leaq    16(SP), BP

xorps   X0, X0
movups  X0, "".~r0+32(SP)
movups  X0, "".x(SP)
movb    $1, "".x(SP)
movups  "".x(SP), X0
movups  X0, "".~r0+32(SP)
movq    16(SP), BP
addq    $24, SP
ret

Should I open a separate issue for this?

@mundaym
Member

mundaym commented Apr 2, 2019

The last example (the "interesting case") is actually a regression from commit 8607b2e:

Thanks, that makes sense. Any thoughts, @josharian? I guess the rule added in that commit introduces a 'dead end' for the optimization somehow.

Should I open a separate issue for this?

No, this issue is fine.

@josharian
Contributor

I guess the rule added in that commit adds a 'dead end' for the optimization somehow.

I think what is happening is that that rule is interfering with dead auto elimination. Previously, we had 'x -> y -> ret', and both x and y were recognized as dead autos. The rule rewrites 'x -> y -> ret' into 'x -> ret', ironically in the hope that y can be dead-auto-eliminated, but this interferes with eliminating x. I'm not sure exactly why yet; maybe, to be effective, dead auto elimination relies on your store forwarding rules, which aren't triggering here?

From spending a few minutes on this, I suspect that the fix won't be straightforward. So part of the question is whether this is a rare case, and thus whether it is worth spending the time to restore this optimization.
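
For reference, the rewrite described above has roughly the following shape (a schematic based on the description, not the literal rule from commit 8607b2e; the real rule carries additional safety conditions):

// Schematic only: copy straight from the original source instead of the
// temporary, in the hope that the temporary then becomes a dead auto.
(Move {t} [n] dst tmp1 move:(Move {t} [n] tmp2 src _))
  && isSamePtr(tmp1, tmp2)
  => (Move {t} [n] dst src move)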

@josharian
Contributor

The challenge here is that, at the moment that rule is applied, it is hard to tell whether 'x' can be eliminated or not. I don't see a good approach here, although I'd be open to suggestions. On balance, the optimization does appear to help more than it hurts.

One question:

However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.

Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state? The only important thing for LocalAddr's memory arg is that it be correctly ordered in the memory chain for liveness purposes; we should be able to preserve that property.
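
A minimal sketch of the kind of check being proposed, written against simplified stand-in types rather than the real cmd/compile/internal/ssa API (the type and helper names below are hypothetical):

package sketch

// Op and Value are simplified, hypothetical stand-ins for the compiler's
// SSA types; the real definitions live in cmd/compile/internal/ssa.
type Op int

const (
    OpLocalAddr Op = iota
    OpVarDef
    OpZero
    OpMove
)

type Value struct {
    Op   Op
    Args []*Value
}

// nonLocalAddrUses counts how many values in a block take mem as an
// argument, ignoring LocalAddr: its memory argument exists only to order
// it in the memory chain for liveness, so it arguably should not block
// store forwarding.
func nonLocalAddrUses(mem *Value, block []*Value) int {
    n := 0
    for _, v := range block {
        if v.Op == OpLocalAddr {
            continue
        }
        for _, arg := range v.Args {
            if arg == mem {
                n++
            }
        }
    }
    return n
}

A rule condition could then require this count to be 1 instead of checking the raw use count of the VarDef memory state.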

All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

@mundaym
Member

mundaym commented Apr 2, 2019

Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state?

Not sure, sounds unpleasant in SSA. I'd rather we found a way to make these sorts of annotations less intrusive when optimizing. Not that I have any ideas how to do that...

All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

Yes, that sounds interesting. Avoiding touching memory entirely, at least early on in the compilation pipeline, seems like the nicest approach to take for these sorts of issues.

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 3, 2019
@andybons andybons added this to the Unplanned milestone Apr 3, 2019
@josharian
Contributor

It's possible that we could fix this by adding arch-specific variants of these generic rules:

// Don't zero the same bits twice.
(Zero {t} [s] dst1 zero:(Zero {t} [s] dst2 _)) && isSamePtr(dst1, dst2) => zero
(Zero {t} [s] dst1 vardef:(VarDef (Zero {t} [s] dst2 _))) && isSamePtr(dst1, dst2) => vardef

We'd be able to detect writing any set of duplicate bits twice (not just zeros), although we'd have to write out all the size-specific variants. During lowering, the LocalAddrs become LEAQs (on amd64) and drop out of the memory chain.
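
For illustration, one size-specific variant might look roughly like this (schematic only; the op choice and exact form are illustrative, not a tested rule):

// Schematic only: drop the outer of two identical constant stores to the
// same address. After lowering, address computations are LEAQs rather than
// LocalAddrs, so they no longer sit in the memory chain and cannot block
// the match.
(MOVQstoreconst [c] {sym} ptr1 store:(MOVQstoreconst [c] {sym} ptr2 _))
  && isSamePtr(ptr1, ptr2)
  => store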

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
Labels: binary-size, compiler/runtime, NeedsInvestigation