cmd/compile: unnecessary temporary variable on stack #65495

Open
katsusan opened this issue Feb 3, 2024 · 19 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

katsusan commented Feb 3, 2024

Go version

go version go1.21.6 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://goproxy.cn'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='/usr/local/bin/gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build734905149=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Run GOSSAFUNC=enc go build qmac.go or go tool compile -S qmac.go to see the assembly output, or see https://godbolt.org/z/9GxzbY4eP.

//qmac.go
package main

func main() {}

type desc struct {
        t int
        x int
        y int
        z int
        p int
}

func enc(d []byte) []byte {
        var buf = d
        t, a, b, c, pr := 0, -1, 3, 1, 113
        for i := 0; i < len(buf); i++ {
                res := process(t, a, b, c, pr)
                buf[i] = byte(res.z)
        }

        return buf
}

//go:noinline
func process(t, a, b, c, pr int) desc {
        d := desc{
                t: t,
                x: a,
                y: b,
                z: c,
                p: pr,
        }
        //...
        return d
}

What did you see happen?

On every loop iteration the Go compiler stores the call result into the on-stack variable main.res, which is unnecessary: main.res is only an intermediate value used to set buf[i].

I also haven't figured out what autotmp_11 is used for here. Why not do a direct store MOVQ AX, main.res-xx(SP) instead of the sequence MOVQ AX, main..autotmp_11-40(SP); MOVQ main..autotmp_11-40(SP), DX; MOVQ DX, main.res-80(SP)?


What did you expect to see?

The compiler can optimize away the redundant store operations, as in the following case
(process's parameter count reduced from 5 to 4):

//qmac2.go
package main

func main() {}

type desc struct {
        t int
        x int
        y int
        z int
}

func enc(d []byte) []byte {
        var buf = d
        t, a, b, c := 0, -1, 3, 1
        for i := 0; i < len(buf); i++ {
                res := process(t, a, b, c)
                buf[i] = byte(res.z)
        }

        return buf
}

//go:noinline
func process(t, a, b, c int) desc {
        d := desc{
                t: t,
                x: a,
                y: b,
                z: c,
        }
        //...
        return d
}
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 3, 2024
@Jorropo Jorropo self-assigned this Feb 5, 2024
@Jorropo Jorropo added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 5, 2024
@Jorropo Jorropo added this to the Backlog milestone Feb 5, 2024

Jorropo commented Feb 5, 2024

This happens because the compiler has a hardcoded limit: it places on the stack any struct with more than 4 fields or wider than 4 pointers.

if t.Size() > int64(4*types.PtrSize) {

if t.NumFields() > MaxStruct {

I'm testing patches that allow treating any struct as SSA values; I'll report back later.

@Jorropo Jorropo added the FixPending Issues that have a fix which has not yet been reviewed or submitted. label Feb 5, 2024
@gopherbot

Change https://go.dev/cl/561695 mentions this issue: cmd/compile: allow arbitrary sized structs to be SSAified

@Jorropo Jorropo removed their assignment Feb 5, 2024

Jorropo commented Feb 5, 2024

Here is the results with the CL I sent:

  TEXT command-line-arguments.enc(SB), ABIInternal
  CALL command-line-arguments.process(SB)
  MOVQ DI, AX
  RET

vs:

command-line-arguments_enc_pc0:
  TEXT command-line-arguments.enc(SB), ABIInternal, $88-0
  CMPQ SP, 16(R14)
  PCDATA $0, $-2
  JLS command-line-arguments_enc_pc83
  PCDATA $0, $-1
  PUSHQ BP
  MOVQ SP, BP
  SUBQ $80, SP
  FUNCDATA $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
  FUNCDATA $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
  PCDATA $1, $0
  CALL command-line-arguments.process(SB)
  MOVQ AX, command-line-arguments..autotmp_2(SP)
  MOVQ BX, command-line-arguments..autotmp_2+8(SP)
  MOVQ CX, command-line-arguments..autotmp_2+16(SP)
  MOVQ DI, command-line-arguments..autotmp_2+24(SP)
  MOVQ SI, command-line-arguments..autotmp_2+32(SP)
  MOVQ command-line-arguments..autotmp_2(SP), AX
  MOVQ AX, command-line-arguments..autotmp_1+40(SP)
  MOVUPS command-line-arguments..autotmp_2+8(SP), X0
  MOVUPS X0, command-line-arguments..autotmp_1+48(SP)
  MOVUPS command-line-arguments..autotmp_2+24(SP), X0
  MOVUPS X0, command-line-arguments..autotmp_1+64(SP)
  MOVQ command-line-arguments..autotmp_1+64(SP), AX
  ADDQ $80, SP
  POPQ BP
  RET
command-line-arguments_enc_pc83:
  NOP
  PCDATA $1, $-1
  PCDATA $0, $-2
  CALL runtime.morestack_noctxt(SB)
  PCDATA $0, $-1
  JMP command-line-arguments_enc_pc0

For enc:

package a

type desc struct{ t, x, y, z, p int }

func enc() byte {
	// amd64:-"MOV.+autotmp"
	return byte(process().z)
}

//go:noinline
func process() desc {
	return desc{}
}

@randall77
Contributor

The problem with just SSAing everything is that the resulting code can be exponential in size relative to the program. For instance,

type S1 struct{ a, b S2 }
type S2 struct{ a, b S3 }
...
type SN struct{ a, b int }

S1 is potentially very big, requiring 2^N SSA values to represent it. Currently it would be represented by just 1 autotmp of size 2^N. So doing things like copying var p1 *S1 = ...; var p2 *S2 = ...; *p1 = *p2 uses 2^N individual loads/stores instead of just one call to memmove.


Jorropo commented Feb 6, 2024

How would *p1 = *p2 compile when one of them is *S1 and the other one is *S2?

I don't exactly see how either would be SSAified: given they both have their address taken, it should fail to SSA them in genssa.
Maybe Load and Store are destructured too aggressively and we should only do that based on finality.


I think I was able to repro the issue you are talking about this way:

package a

type S0 struct{ a, b S1 }
type S1 struct{ a, b S2 }
type S2 struct{ a, b S3 }
type S3 struct{ a, b S4 }
type S4 struct{ a, b S5 }
type S5 struct{ a, b S6 }
type S6 struct{ a, b S7 }
type S7 struct{ a, b S8 }
type S8 struct{ a, b S9 }
type S9 struct{ a, b S10 }
type S10 struct{ a, b S11 }
type S11 struct{ a, b S12 }
type S12 struct{ a, b S13 }
type S13 struct{ a, b S14 }
type S14 struct{ a, b S15 }
type S15 struct{ a, b S16 }
type S16 struct{ a, b S17 }
type S17 struct{ a, b S18 }
type S18 struct{ a, b S19 }
type S19 struct{ a, b S20 }
type S20 struct{ a, b S21 }
type S21 struct{ a, b S22 }
type S22 struct{ a, b S23 }
type S23 struct{ a, b S24 }
type S24 struct{ a, b S25 }
type S25 struct{ a, b S26 }
type S26 struct{ a, b S27 }
type S27 struct{ a, b S28 }
type S28 struct{ a, b S29 }
type S29 struct{ a, b S30 }
type S30 struct{ a, b S31 }
type S31 struct{ a, b int }

func Copy(a, b *S0) {
	*a = *b
}

However, the compiler also has quadratic behavior before my patch. That's already a problem.


Jorropo commented Feb 6, 2024

Alternatively, I tried this patch as a shot in the dark and it has been doing OK to great on the real-world examples of code I threw at it.
If we add support for a variable per-target SSA-ification limit for structs, I think it would make sense to base it on the number of available registers.

We should allow ourselves to be more aggressive on riscv64, with its ~32 general-purpose registers, than on 386 with ~8.


thepudds commented Feb 6, 2024

@Jorropo, probably worth a look at #24416.

@randall77
Contributor

How would *p1 = *p2 compile when one of them is *S1 and the other one is *S2?

Right, sorry, I meant both of type *S1 (as you figured out).

However, the compiler also has quadratic behavior before my patch. That's already a problem.

Nice bug find. Looks like exponential behavior from cmd/compile/internal/types/alg.go:AlgType. Opened an issue at #65540.


Jorropo commented Feb 6, 2024

I tested a version with 8 nested structs (512 elements) and it was able to compile.
You are correct: my CL generates the two copies, which are significantly slower and HUGE.

I'll modify my CL to only SSA-ify structs that would be passed entirely in registers under ABIInternal. It will need testing, but I think that is a good arch-adaptive threshold (with a minimum of 4, for the same reasons as currently).


mknyszek commented Feb 7, 2024

In triage, we think this is related to (or at least has the same solution as) #65532.

@aclements
Member

This is also closely related to #62077 and #62737.

@gopherbot

Change https://go.dev/cl/566177 mentions this issue: cmd/compile/internal/liveness: introduce "live intervals" utility

@gopherbot

Change https://go.dev/cl/553055 mentions this issue: cmd/compile/internal: merge stack slots for selected local auto vars

gopherbot pushed a commit that referenced this issue Mar 29, 2024
Introduce a helper type "Intervals" that contains sets of sorted
disjoint ranges corresponding to live ranges within a function.
Example: the Intervals set "{ [0,1), [4,10) }" would indicate that
something is live starting at instruction 0, then up to but not
including instruction 1, then dead from 1-3, then live again at
instruction 4 up to (but not including) instruction 10.

This patch provides APIs for constructing interval sets, testing to
see whether two sets overlap, and unioning/merging together two
intervals sets.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I7140a5989eba93bf3b8762d9224261f5eba0646d
Reviewed-on: https://go-review.googlesource.com/c/go/+/566177
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
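The interval semantics described in the commit message above can be sketched with a two-pointer overlap test over sorted, disjoint, half-open ranges. This is a sketch of the semantics only, not the actual cmd/compile/internal/liveness code; the Interval/Intervals names and method shapes are assumptions.

```go
package main

import "fmt"

// Interval is a half-open live range [st, en): live from instruction
// st up to but not including instruction en.
type Interval struct{ st, en int }

// Intervals is a sorted list of disjoint Interval values, as in the
// example "{ [0,1), [4,10) }" from the commit message.
type Intervals []Interval

// Overlaps reports whether any range in a intersects any range in b,
// using a linear two-pointer sweep over the sorted lists.
func (a Intervals) Overlaps(b Intervals) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].en <= b[j].st: // a[i] ends before b[j] starts
			i++
		case b[j].en <= a[i].st: // b[j] ends before a[i] starts
			j++
		default: // the ranges share at least one instruction
			return true
		}
	}
	return false
}

func main() {
	x := Intervals{{0, 1}, {4, 10}} // live at 0, dead 1-3, live 4-9
	fmt.Println(x.Overlaps(Intervals{{1, 4}}))  // false: fits in the gap
	fmt.Println(x.Overlaps(Intervals{{9, 12}})) // true: both live at 9
}
```

Two variables whose Intervals never overlap are never live at the same time, which is exactly the condition that lets their stack slots be merged.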
gopherbot pushed a commit that referenced this issue Mar 29, 2024
Preliminary compiler support for merging/overlapping stack
slots of local variables whose access patterns are disjoint.

This patch includes changes in AllocFrame to do the actual
merging/overlapping based on information returned from a new
liveness.MergeLocals helper. The MergeLocals helper identifies
candidates by looking for sets of AUTO variables that either A) have
the same size and GC shape (if types contain pointers), or B) have the
same size (but potentially different types as long as those types have
no pointers). Variables must be greater than (3*types.PtrSize) in size
to be considered for merging.

After forming candidates, MergeLocals collects variables into "can be
overlapped" equivalence classes or partitions; this process is driven
by an additional liveness analysis pass. Ideally it would be nice to
move the existing stackmap liveness pass up before AllocFrame
and "widen" it to include merge candidates so that we can do just a
single liveness pass as opposed to two; however, this may be difficult
given that the merge-locals liveness has to take into account
writes corresponding to dead stores.

This patch also required a change to the way ssa.OpVarDef pseudo-ops
are generated; prior to this point they would only be created for
variables whose type included pointers; if stack slot merging is
enabled then the ssagen code creates OpVarDef ops for all auto vars
that are merge candidates.

Note that some temporaries created late in the compilation process
(e.g. during ssa backend) are difficult to reason about, especially in
cases where we take the address of a temp and pass it to the runtime.
For the time being we mark most of the vars created post-ssagen as
"not a merge candidate".

Stack slot merging for locals/autos is enabled by default if "-N" is
not in effect, and can be disabled via "-gcflags=-d=mergelocals=0".

Fixmes/todos/restrictions:
- try lowering size restrictions
- re-evaluate the various skips that happen in SSA-created autotmps

Fixes #62737.
Updates #65532.
Updates #65495.

Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest
Change-Id: Ibc22e8a76c87e47bc9fafe4959804d9ea923623d
Reviewed-on: https://go-review.googlesource.com/c/go/+/553055
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
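The candidate criteria from the commit message above (same size and GC shape when pointers are involved, same size alone for pointer-free variables, minimum size 3*types.PtrSize) can be sketched as a predicate. The slot type, the gcShape string standing in for the pointer bitmap, and the fixed 8-byte pointer size are all assumptions of this sketch.

```go
package main

import "fmt"

const ptrSize = 8 // assuming a 64-bit target

// slot is a toy stand-in for a local variable's stack slot.
type slot struct {
	size    int64
	hasPtrs bool
	gcShape string // hypothetical stand-in for the GC pointer bitmap
}

// mergeCandidates reports whether two slots could be considered for
// merging under the criteria described in the commit message.
func mergeCandidates(a, b slot) bool {
	if a.size <= 3*ptrSize || b.size <= 3*ptrSize {
		return false // too small to be worth merging
	}
	if a.size != b.size {
		return false
	}
	if a.hasPtrs || b.hasPtrs {
		// pointerful case: both must carry pointers with the same shape
		return a.hasPtrs && b.hasPtrs && a.gcShape == b.gcShape
	}
	return true // same size, no pointers: types may differ
}

func main() {
	big := slot{size: 40}
	fmt.Println(mergeCandidates(big, slot{size: 40})) // true
	fmt.Println(mergeCandidates(big, slot{size: 16})) // false: too small
	fmt.Println(mergeCandidates(big,
		slot{size: 40, hasPtrs: true, gcShape: "10000"})) // false: one side has pointers
}
```

(Note the later CL 577615 quoted further down relaxes these rules, allowing different sizes and dropping the GC-shape check.)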

ALTree commented Mar 30, 2024

This is also closely related to #62077 and #62737.

And #23377 I guess.

@gopherbot

Change https://go.dev/cl/575415 mentions this issue: cmd/compile/internal: merge stack slots for selected local auto vars

@gopherbot

Change https://go.dev/cl/576680 mentions this issue: cmd/compile/internal: small tweak to merge locals trace output

@gopherbot

Change https://go.dev/cl/576735 mentions this issue: cmd/compile/internal/liveness: enhance mergelocals for addr-taken candidates

@gopherbot

Change https://go.dev/cl/577615 mentions this issue: cmd/compile/internal: stack slot merging region formation enhancements

@gopherbot

Change https://go.dev/cl/576681 mentions this issue: cmd/compile/internal/base: enable stack slot merging by default

gopherbot pushed a commit that referenced this issue Apr 9, 2024
[This is a partial roll-forward of CL 553055, the main change here
is that the stack slot overlap operation is flagged off by default
(can be enabled by hand with -gcflags=-d=mergelocals=1) ]

Preliminary compiler support for merging/overlapping stack slots of
local variables whose access patterns are disjoint.

This patch includes changes in AllocFrame to do the actual
merging/overlapping based on information returned from a new
liveness.MergeLocals helper. The MergeLocals helper identifies
candidates by looking for sets of AUTO variables that either A) have
the same size and GC shape (if types contain pointers), or B) have the
same size (but potentially different types as long as those types have
no pointers). Variables must be greater than (3*types.PtrSize) in size
to be considered for merging.

After forming candidates, MergeLocals collects variables into "can be
overlapped" equivalence classes or partitions; this process is driven
by an additional liveness analysis pass. Ideally it would be nice to
move the existing stackmap liveness pass up before AllocFrame
and "widen" it to include merge candidates so that we can do just a
single liveness pass as opposed to two; however, this may be difficult
given that the merge-locals liveness has to take into account
writes corresponding to dead stores.

This patch also required a change to the way ssa.OpVarDef pseudo-ops
are generated; prior to this point they would only be created for
variables whose type included pointers; if stack slot merging is
enabled then the ssagen code creates OpVarDef ops for all auto vars
that are merge candidates.

Note that some temporaries created late in the compilation process
(e.g. during ssa backend) are difficult to reason about, especially in
cases where we take the address of a temp and pass it to the runtime.
For the time being we mark most of the vars created post-ssagen as
"not a merge candidate".

Stack slot merging for locals/autos is enabled by default if "-N" is
not in effect, and can be disabled via "-gcflags=-d=mergelocals=0".

Fixmes/todos/restrictions:
- try lowering size restrictions
- re-evaluate the various skips that happen in SSA-created autotmps

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: Ifda26bc48cde5667de245c8a9671b3f0a30bb45d
Reviewed-on: https://go-review.googlesource.com/c/go/+/575415
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Apr 9, 2024
For -gcflags=-d=mergelocalstrace=1 (which reports estimated savings
from stack slot merging), emit separate values for pointerful vs
non-pointerful variables, for a bit more detail.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I9dd27d2a254036448c85c13d189d1ed36157c9d7
Reviewed-on: https://go-review.googlesource.com/c/go/+/576680
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Apr 9, 2024
cmd/compile/internal/liveness: enhance mergelocals for addr-taken candidates

It is possible to have situations where a given ir.Name is
non-address-taken at the source level, but whose address is
materialized in order to accommodate the needs of arch-dependent
memory ops. The issue here is that the SymAddr op will show up as
touching a variable of interest, but the subsequent memory op will
not. This is generally not an issue for computing whether something is
live across a call, but it is problematic for collecting the more
fine-grained live interval info that drives stack slot merging.

As an example, consider this Go code:

    package p
    type T struct {
	    x [10]int
	    f float64
    }
    func ABC(i, j int) int {
	    var t T
	    t.x[i&3] = j
	    return t.x[j&3]
    }

On amd64 the code sequences we'll see for accesses to "t" might look like

    v10 = VarDef <mem> {t} v1
    v5 = MOVOstoreconst <mem> {t} [val=0,off=0] v2 v10
    v23 = LEAQ <*T> {t} [8] v2 : DI
    v12 = DUFFZERO <mem> [80] v23 v5
    v14 = ANDQconst <int> [3] v7 : AX
    v19 = MOVQstoreidx8 <mem> {t} v2 v14 v8 v12
    v22 = ANDQconst <int> [3] v8 : BX
    v24 = MOVQloadidx8 <int> {t} v2 v22 v19 : AX
    v25 = MakeResult <int,mem> v24 v19 : <>

Note that the loads and stores (ex: v19, v24) all refer directly
to "t", which means that regular live analysis will work fine for
identifying variable lifetimes. The DUFFZERO is (in effect) an
indirect write, but since there are accesses immediately after it we
wind up with the same live intervals.

Now the same code with GOARCH=ppc64:

    v10 = VarDef <mem> {t} v1
    v20 = MOVDaddr <*T> {t} v2 : R20
    v12 = LoweredZero <mem> [88] v20 v10
     v3 = CLRLSLDI <int> [212543] v7 : R5
    v15 = MOVDaddr <*T> {t} v2 : R6
    v19 = MOVDstoreidx <mem> v15 v3 v8 v12
    v29 = CLRLSLDI <int> [212543] v8 : R4
    v24 = MOVDloadidx <int> v15 v29 v19 : R3
    v25 = MakeResult <int,mem> v24 v19 : <>

Here instead of memory ops that refer directly to the symbol, we take
the address of "t" (ex: v15) and then pass the address to memory ops
(where the ops themselves no longer refer to the symbol).

This patch enhances the stack slot merging liveness analysis to handle
cases like the PPC64 one above. We add a new phase in candidate
selection that collects more precise use information for merge
candidates, and screens out candidates that are too difficult to
analyze. The phase makes a forward pass over each basic block looking
for instructions of the form vK := SymAddr(N) where N is a raw
candidate. It then creates an entry in a map with key vK and value
holding name and the vK use count. As the walk continues, we check for
uses of vK: when we see one, we record it in a side table as an
upwards exposed use of N. At each vK use we also decrement the use
count in the map entry, and if we hit zero, remove the map entry. If
we hit the end of the basic block and we still have map entries, this
implies that the address in question "escapes" the block -- at that
point to be conservative we just evict the name in question from the
candidate set.

Although this CL fixes the issues that forced a revert of the original
merging CL, this CL doesn't enable stack slot merging by default; a
subsequent CL will do that.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: Id41d359a677767a8e7ac1e962ae23f7becb4031f
Reviewed-on: https://go-review.googlesource.com/c/go/+/576735
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
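The per-block scan described in the commit message above (track SymAddr values in a map, decrement a use count at each use, and evict any candidate whose address is still unconsumed at the end of the block) can be sketched roughly like this. The instr type and its fields are toy stand-ins, not the compiler's SSA representation.

```go
package main

import "fmt"

// instr is a toy stand-in for an SSA value within one basic block.
type instr struct {
	id     int
	op     string // "symaddr" for address-of-candidate, anything else otherwise
	name   string // candidate variable name, for symaddr ops
	args   []int  // ids of value arguments
	useCnt int    // total number of uses of this value in the block
}

// scanBlock evicts from candidates any variable whose materialized
// address "escapes" the block, i.e. still has unconsumed uses at the
// end of the forward walk.
func scanBlock(block []instr, candidates map[string]bool) {
	type addrInfo struct {
		name string
		left int // remaining uses of the SymAddr value
	}
	pending := map[int]*addrInfo{}
	for _, ins := range block {
		for _, arg := range ins.args {
			if ai, ok := pending[arg]; ok {
				// here the real pass records an upwards-exposed use of ai.name
				ai.left--
				if ai.left == 0 {
					delete(pending, arg) // address fully consumed
				}
			}
		}
		if ins.op == "symaddr" && candidates[ins.name] {
			pending[ins.id] = &addrInfo{ins.name, ins.useCnt}
		}
	}
	for _, ai := range pending {
		// address escapes the block: too hard to analyze, drop the candidate
		delete(candidates, ai.name)
	}
}

func main() {
	cands := map[string]bool{"t": true, "u": true}
	block := []instr{
		{id: 1, op: "symaddr", name: "t", useCnt: 2},
		{id: 2, op: "store", args: []int{1}},
		{id: 3, op: "load", args: []int{1}},
		{id: 4, op: "symaddr", name: "u", useCnt: 1}, // never used: escapes
	}
	scanBlock(block, cands)
	fmt.Println(cands["t"], cands["u"])
}
```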
gopherbot pushed a commit that referenced this issue Apr 9, 2024
Flag flip to enable stack slot merging by default when optimizing.
Please see the earlier CL for details on what this is doing.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I8e30d553e74ace43d418f883199721f05320d3d7
Reviewed-on: https://go-review.googlesource.com/c/go/+/576681
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Apr 18, 2024
This patch revises the algorithm/strategy used for overlapping the
stack slots of disjointly accessed local variables. The main change
here is to allow merging the stack slot of B into the slot for A if
B's size is less then A (prior to this they had to be identical), and
to also allow merging a non-pointer variables into pointer-variable
slots.

The new algorithm sorts the candidate list first by pointerness
(pointer variables first), then by alignment, then by size, and
finally by name. We no longer check that two variables have the same
GC shape before merging: since it should never be the case that we
have two vars X and Y both live across a given callsite where X and Y
share a stack slot, their gc shape doesn't matter.

Doing things this new way increases the total number of bytes saved
(across all functions) from 91256 to 124336 for the sweet benchmarks.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I1daaac1b1240aa47a6975e98ccd24e03304ab602
Reviewed-on: https://go-review.googlesource.com/c/go/+/577615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
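The candidate ordering described in the commit above (pointerful first, then alignment, then size, then name as a tie-breaker) can be sketched as a sort comparator. The slot fields here are assumptions for illustration, not the compiler's actual data structure.

```go
package main

import (
	"fmt"
	"sort"
)

// slot is a toy stand-in for a merge-candidate stack slot.
type slot struct {
	name    string
	size    int64
	align   int64
	hasPtrs bool
}

// sortCandidates orders slots so that a later, smaller slot can be
// folded into an earlier, larger one: pointer variables first, then
// descending alignment, then descending size, then name.
func sortCandidates(s []slot) {
	sort.Slice(s, func(i, j int) bool {
		a, b := s[i], s[j]
		if a.hasPtrs != b.hasPtrs {
			return a.hasPtrs // pointerful slots first
		}
		if a.align != b.align {
			return a.align > b.align
		}
		if a.size != b.size {
			return a.size > b.size
		}
		return a.name < b.name
	})
}

func main() {
	s := []slot{
		{"c", 16, 8, false},
		{"a", 24, 8, true},
		{"b", 32, 8, false},
	}
	sortCandidates(s)
	for _, v := range s {
		fmt.Println(v.name) // a (pointerful), then b (larger), then c
	}
}
```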