
cmd/compile: pgo can dramatically increase goroutine stack size #65532

Open
felixge opened this issue Feb 5, 2024 · 23 comments
Labels
compiler/runtime: Issues related to the Go compiler and/or runtime.
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
WaitingForInfo: Issue is not actionable because of missing required information, which needs to be provided.
Milestone: Backlog

Comments

@felixge
Contributor

felixge commented Feb 5, 2024

Go version

go1.21.5

Output of go env in your module/workspace:

GOARCH=amd64
GOOS=linux
GOAMD64=v1

What did you do?

Build my application using a default.pgo CPU profile from production.

What did you see happen?

Go memory usage (/memory/classes/total:bytes − /memory/classes/heap/released:bytes) increased from 720 MB to 850 MB (18%) until rollback, see below.

[screenshot: Datadog graph of Go memory usage rising from ~720 MB to ~850 MB until rollback]

This increase in memory usage seems to have been caused by an increase in goroutine stack size (/memory/classes/heap/stacks:bytes) from 207 MB to 280MB (35%).

[screenshot: Datadog graph of goroutine stack memory (/memory/classes/heap/stacks:bytes) rising from ~207 MB to ~280 MB]

This increase was not due to an increase in the number of active goroutines, but to an increase in the average stack size (/memory/classes/heap/stacks:bytes / /sched/goroutines:goroutines).

[screenshot: Datadog graph of average goroutine stack size increasing]
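For reference, both ratios mentioned in this report can be read in-process via the runtime/metrics package; a minimal sketch computing the average stack size:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/stacks:bytes"},
		{Name: "/sched/goroutines:goroutines"},
	}
	metrics.Read(samples)
	stackBytes := samples[0].Value.Uint64()
	goroutines := samples[1].Value.Uint64()
	// Average stack size = stack memory / number of goroutines.
	fmt.Printf("%d goroutines, avg stack %d bytes\n",
		goroutines, stackBytes/goroutines)
}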

To debug this further, I built a hacky goroutine stack frame profiler. This pointed me to google.golang.org/grpc/internal/transport.(*loopyWriter).run.

For the binary compiled without pgo, my tool estimated 2MB of stack usage for ~1000 goroutines:

[screenshot: goroutine stack frame profile without PGO, ~2 MB across ~1000 goroutines]

And for the binary compiled with pgo, my tool estimated 71MB of stack usage for ~1000 goroutines:

[screenshot: goroutine stack frame profile with PGO, ~71 MB across ~1000 goroutines]

Looking at the assembly, it becomes clear that this is due to the frame size increasing from 0x50 (80) bytes to 0xc1f8 (49656) bytes.


before pgo:

TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
  0x8726e0              493b6610                CMPQ SP, 0x10(R14)                   // cmp 0x10(%r14),%rsp
  0x8726e4              0f86ab020000            JBE 0x872995                         // jbe 0x872995
  0x8726ea              55                      PUSHQ BP                             // push %rbp
  0x8726eb              4889e5                  MOVQ SP, BP                          // mov %rsp,%rbp
  0x8726ee              4883ec50                SUBQ $0x50, SP                       // sub $0x50,%rsp

after pgo:

TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
  0x8889a0              4989e4                          MOVQ SP, R12                         // mov %rsp,%r12
  0x8889a3              4981ec80c10000                  SUBQ $0xc180, R12                    // sub $0xc180,%r12
  0x8889aa              0f82c0300000                    JB 0x88ba70                          // jb 0x88ba70
  0x8889b0              4d3b6610                        CMPQ R12, 0x10(R14)                  // cmp 0x10(%r14),%r12
  0x8889b4              0f86b6300000                    JBE 0x88ba70                         // jbe 0x88ba70
  0x8889ba              55                              PUSHQ BP                             // push %rbp
  0x8889bb              4889e5                          MOVQ SP, BP                          // mov %rsp,%rbp
  0x8889be              4881ecf8c10000                  SUBQ $0xc1f8, SP                     // sub $0xc1f8,%rsp

And the root cause for this appears to be the inlining of 3 calls to processData, each of which allocates a 16KiB byte array on its stack.
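For readers skimming, here is a minimal sketch of the code shape involved (a hypothetical simplification, not the actual gRPC source):

package p

type loopyWriter struct{}

// processData keeps a 16 KiB scratch array on its own frame, like
// localBuf in gRPC's controlbuf.go.
func (l *loopyWriter) processData() (bool, error) {
	var localBuf [16 * 1024]byte
	_ = localBuf[:] // scratch work elided
	return false, nil
}

// run has three static call sites for processData. If PGO causes all
// three to be inlined, run's frame gains three separate 16 KiB copies
// of localBuf: 3 * 16384 = 49152 bytes, which together with other
// locals matches the 0xc1f8 (49656-byte) frame in the assembly above.
func (l *loopyWriter) run() error {
	for {
		if done, err := l.processData(); done || err != nil {
			return err
		}
		if done, err := l.processData(); done || err != nil {
			return err
		}
		if done, err := l.processData(); done || err != nil {
			return err
		}
	}
}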

What did you expect to see?

No significant increase in memory usage.

Maybe PGO could take frame sizes into account for inlining, especially if multiple calls are being made to a function that has a large frame size.

Meanwhile, maybe we should send a PR that adds a //go:noinline pragma to the processData func in gRPC. Given the current code structure, it seems highly undesirable to inline this function up to 3 times in the run method.
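Applied to the sketch above, the suggested workaround is a single pragma on processData (hypothetical patch, shown for illustration):

//go:noinline
func (l *loopyWriter) processData() (bool, error) {
	var localBuf [16 * 1024]byte
	_ = localBuf[:]
	return false, nil
}

With the pragma the compiler keeps processData as an out-of-line call, so the 16 KiB frame exists only for the duration of each call and run's own frame stays small.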

cc @prattmic

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 5, 2024
@prattmic
Member

prattmic commented Feb 6, 2024

Thanks for the detailed report! I know you are aware, but just to be clear to others skimming the issue: this stack size estimate tool will likely underestimate the actual stack allocation size. We are seeing the goroutines sitting parked in loopyWriter.run, below a call to processData. But presumably these goroutines do call processData, at which point they will need 16KB of stack.

Inlining three copies of this function is a particularly bad case because without inlining you can't call processData three times at once, so the peak stack use is 16KB, but with inlining, the frame size of the caller is 16KB*3.

I agree with you that frame size does seem reasonable to consider when making inlining decisions. We probably want to avoid making frames too large. This is potentially in scope for #61502, or a follow-up to that (cc @mdempsky @thanm). I don't think this would really be PGO-specific, though PGO inlining may make us more likely to hit general frame size thresholds. I'm also not sure how good of a sense of final frame size we have during inlining; this is pretty early in compilation and even before we do escape analysis.

At the other end of the spectrum, I wonder if we could more aggressively reuse stack slots for large objects. Since localBuf doesn't escape the function, it seems likely that even after inlining, the three uses of localBuf are mutually exclusive and could technically use the same stack slot. I have no idea how complicated this would be.
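As a hypothetical illustration of that idea (conceptual only; the real candidates come from inlined bodies rather than explicit blocks):

package p

func use(b []byte) {}

// The three buffers below have pairwise-disjoint lifetimes, so a
// slot-merging pass could in principle place them at the same frame
// offset, shrinking ~48 KiB of frame to ~16 KiB.
func run() {
	{
		var buf1 [16 * 1024]byte
		use(buf1[:])
	} // buf1 dead from here on
	{
		var buf2 [16 * 1024]byte // candidate to reuse buf1's slot
		use(buf2[:])
	}
	{
		var buf3 [16 * 1024]byte // likewise
		use(buf3[:])
	}
}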

cc @golang/compiler @cherrymui @aclements

@thanm
Contributor

thanm commented Feb 7, 2024

As part of my work on inlining heuristics I considered adding a heuristic that would bias against inlining functions with large stack frames, but didn't get as far as implementing it. It is a tricky thing to get right, since the inliner runs at an early stage in the compiler, which makes it hard to compute an accurate stack size estimate.

I think a better long term solution here is something like #62737, e.g. do the inlines but then arrange for the three inlined blobs to share the same instance of the array in question. This can be done in the back end (e.g. late during stack frame layout) or we can change the inliner itself to reuse temps (this is something that the LLVM inliner does IIRC).

@felixge
Contributor Author

felixge commented Feb 7, 2024

I know you are aware, but just to be clear to others skimming the issue: this stack size estimate tool will likely underestimate the actual stack allocation size.

Yup, it's a rough estimate. As discussed during the last diagnostic sync, I'll try to file my proposal for a runtime implementation of this profile type which would allow us to overcome some of the estimation issues.

I think a better long term solution here is something like #62737, e.g. do the inlines but then arrange for the three inlined blobs to share the same instance of the array in question.

That's probably a good solution in most cases, including this one.

There will still be edge cases, e.g. A inlining B (which has a big frame) and then calling C in a for loop forever. But maybe those edge cases are rare enough not to be a problem in practice.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 7, 2024
@mknyszek mknyszek added this to the Backlog milestone Feb 7, 2024
@aclements
Member

This is closely related to #62077, #62737, and #65495.

@thepudds
Contributor

Some related progress might be happening in https://go.dev/cl/553055.

@gopherbot

Change https://go.dev/cl/566177 mentions this issue: cmd/compile/internal/liveness: introduce "live intervals" utility

@gopherbot

Change https://go.dev/cl/553055 mentions this issue: cmd/compile/internal: merge stack slots for selected local auto vars

gopherbot pushed a commit that referenced this issue Mar 29, 2024
Introduce a helper type "Intervals" that contains sets of sorted
disjoint ranges corresponding to live ranges within a function.
Example: the Intervals set "{ [0,1), [4,10) }" would indicate that
something is live starting at instruction 0, then up to but not
including instruction 1, then dead from 1-3, then live again at
instruction 4 up to (but not including) instruction 10.

This patch provides APIs for constructing interval sets, testing to
see whether two sets overlap, and unioning/merging together two
intervals sets.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I7140a5989eba93bf3b8762d9224261f5eba0646d
Reviewed-on: https://go-review.googlesource.com/c/go/+/566177
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
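A minimal sketch of the disjoint-interval overlap test such a utility provides (assumed shape and names, not the actual cmd/compile code):

package p

type interval struct{ st, en int } // half-open: [st, en)

type intervals []interval // sorted by st, pairwise disjoint

// overlaps reports whether any range in a intersects any range in b,
// i.e. whether the two variables are ever live at the same time.
func (a intervals) overlaps(b intervals) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		x, y := a[i], b[j]
		if x.st < y.en && y.st < x.en {
			return true
		}
		if x.en <= y.st {
			i++
		} else {
			j++
		}
	}
	return false
}

Given the example from the commit message above, intervals{{0, 1}, {4, 10}}.overlaps(intervals{{1, 4}}) is false, so the two variables could share a stack slot.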
gopherbot pushed a commit that referenced this issue Mar 29, 2024
Preliminary compiler support for merging/overlapping stack
slots of local variables whose access patterns are disjoint.

This patch includes changes in AllocFrame to do the actual
merging/overlapping based on information returned from a new
liveness.MergeLocals helper. The MergeLocals helper identifies
candidates by looking for sets of AUTO variables that either A) have
the same size and GC shape (if types contain pointers), or B) have the
same size (but potentially different types as long as those types have
no pointers). Variables must be greater than (3*types.PtrSize) in size
to be considered for merging.

After forming candidates, MergeLocals collects variables into "can be
overlapped" equivalence classes or partitions; this process is driven
by an additional liveness analysis pass. Ideally it would be nice to
move the existing stackmap liveness pass up before AllocFrame
and "widen" it to include merge candidates so that we can do just a
single liveness as opposed to two passes, however this may be difficult
given that the merge-locals liveness has to take into account
writes corresponding to dead stores.

This patch also required a change to the way ssa.OpVarDef pseudo-ops
are generated; prior to this point they would only be created for
variables whose type included pointers; if stack slot merging is
enabled then the ssagen code creates OpVarDef ops for all auto vars
that are merge candidates.

Note that some temporaries created late in the compilation process
(e.g. during ssa backend) are difficult to reason about, especially in
cases where we take the address of a temp and pass it to the runtime.
For the time being we mark most of the vars created post-ssagen as
"not a merge candidate".

Stack slot merging for locals/autos is enabled by default if "-N" is
not in effect, and can be disabled via "-gcflags=-d=mergelocals=0".

Fixmes/todos/restrictions:
- try lowering size restrictions
- re-evaluate the various skips that happen in SSA-created autotmps

Fixes #62737.
Updates #65532.
Updates #65495.

Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest
Change-Id: Ibc22e8a76c87e47bc9fafe4959804d9ea923623d
Reviewed-on: https://go-review.googlesource.com/c/go/+/553055
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@gopherbot

Change https://go.dev/cl/575415 mentions this issue: cmd/compile/internal: merge stack slots for selected local auto vars

@gopherbot

Change https://go.dev/cl/576680 mentions this issue: cmd/compile/internal: small tweak to merge locals trace output

@gopherbot

Change https://go.dev/cl/576735 mentions this issue: cmd/compile/internal/liveness: enhance mergelocals for addr-taken candidates

@gopherbot

Change https://go.dev/cl/577615 mentions this issue: cmd/compile/internal: stack slot merging region formation enhancements

@gopherbot

Change https://go.dev/cl/576681 mentions this issue: cmd/compile/internal/base: enable stack slot merging by default

gopherbot pushed a commit that referenced this issue Apr 9, 2024
[This is a partial roll-forward of CL 553055, the main change here
is that the stack slot overlap operation is flagged off by default
(can be enabled by hand with -gcflags=-d=mergelocals=1) ]

Preliminary compiler support for merging/overlapping stack slots of
local variables whose access patterns are disjoint.

This patch includes changes in AllocFrame to do the actual
merging/overlapping based on information returned from a new
liveness.MergeLocals helper. The MergeLocals helper identifies
candidates by looking for sets of AUTO variables that either A) have
the same size and GC shape (if types contain pointers), or B) have the
same size (but potentially different types as long as those types have
no pointers). Variables must be greater than (3*types.PtrSize) in size
to be considered for merging.

After forming candidates, MergeLocals collects variables into "can be
overlapped" equivalence classes or partitions; this process is driven
by an additional liveness analysis pass. Ideally it would be nice to
move the existing stackmap liveness pass up before AllocFrame
and "widen" it to include merge candidates so that we can do just a
single liveness as opposed to two passes, however this may be difficult
given that the merge-locals liveness has to take into account
writes corresponding to dead stores.

This patch also required a change to the way ssa.OpVarDef pseudo-ops
are generated; prior to this point they would only be created for
variables whose type included pointers; if stack slot merging is
enabled then the ssagen code creates OpVarDef ops for all auto vars
that are merge candidates.

Note that some temporaries created late in the compilation process
(e.g. during ssa backend) are difficult to reason about, especially in
cases where we take the address of a temp and pass it to the runtime.
For the time being we mark most of the vars created post-ssagen as
"not a merge candidate".

Stack slot merging for locals/autos is enabled by default if "-N" is
not in effect, and can be disabled via "-gcflags=-d=mergelocals=0".

Fixmes/todos/restrictions:
- try lowering size restrictions
- re-evaluate the various skips that happen in SSA-created autotmps

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: Ifda26bc48cde5667de245c8a9671b3f0a30bb45d
Reviewed-on: https://go-review.googlesource.com/c/go/+/575415
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Apr 9, 2024
For -gcflags=-d=mergelocalstrace=1 (which reports estimated savings
from stack slot merging), emit separate values for pointerful vs
non-pointerful variables, for a bit more detail.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I9dd27d2a254036448c85c13d189d1ed36157c9d7
Reviewed-on: https://go-review.googlesource.com/c/go/+/576680
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Apr 9, 2024
cmd/compile/internal/liveness: enhance mergelocals for addr-taken candidates

It is possible to have situations where a given ir.Name is
non-address-taken at the source level, but whose address is
materialized in order to accommodate the needs of arch-dependent
memory ops. The issue here is that the SymAddr op will show up as
touching a variable of interest, but the subsequent memory op will
not. This is generally not an issue for computing whether something is
live across a call, but it is problematic for collecting the more
fine-grained live interval info that drives stack slot merging.

As an example, consider this Go code:

    package p
    type T struct {
	    x [10]int
	    f float64
    }
    func ABC(i, j int) int {
	    var t T
	    t.x[i&3] = j
	    return t.x[j&3]
    }

On amd64 the code sequences we'll see for accesses to "t" might look like

    v10 = VarDef <mem> {t} v1
    v5 = MOVOstoreconst <mem> {t} [val=0,off=0] v2 v10
    v23 = LEAQ <*T> {t} [8] v2 : DI
    v12 = DUFFZERO <mem> [80] v23 v5
    v14 = ANDQconst <int> [3] v7 : AX
    v19 = MOVQstoreidx8 <mem> {t} v2 v14 v8 v12
    v22 = ANDQconst <int> [3] v8 : BX
    v24 = MOVQloadidx8 <int> {t} v2 v22 v19 : AX
    v25 = MakeResult <int,mem> v24 v19 : <>

Note that the loads and stores (ex: v19, v24) all refer directly
to "t", which means that regular live analysis will work fine for
identifying variable lifetimes. The DUFFZERO is (in effect) an
indirect write, but since there are accesses immediately after it we
wind up with the same live intervals.

Now the same code with GOARCH=ppc64:

    v10 = VarDef <mem> {t} v1
    v20 = MOVDaddr <*T> {t} v2 : R20
    v12 = LoweredZero <mem> [88] v20 v10
     v3 = CLRLSLDI <int> [212543] v7 : R5
    v15 = MOVDaddr <*T> {t} v2 : R6
    v19 = MOVDstoreidx <mem> v15 v3 v8 v12
    v29 = CLRLSLDI <int> [212543] v8 : R4
    v24 = MOVDloadidx <int> v15 v29 v19 : R3
    v25 = MakeResult <int,mem> v24 v19 : <>

Here instead of memory ops that refer directly to the symbol, we take
the address of "t" (ex: v15) and then pass the address to memory ops
(where the ops themselves no longer refer to the symbol).

This patch enhances the stack slot merging liveness analysis to handle
cases like the PPC64 one above. We add a new phase in candidate
selection that collects more precise use information for merge
candidates, and screens out candidates that are too difficult to
analyze. The phase makes a forward pass over each basic block looking
for instructions of the form vK := SymAddr(N) where N is a raw
candidate. It then creates an entry in a map with key vK and value
holding name and the vK use count. As the walk continues, we check for
uses of vK: when we see one, record it in a side table as an
upwards exposed use of N. At each vK use we also decrement the use
count in the map entry, and if we hit zero, remove the map entry. If
we hit the end of the basic block and we still have map entries, this
implies that the address in question "escapes" the block -- at that
point to be conservative we just evict the name in question from the
candidate set.

Although this CL fixes the issues that forced a revert of the original
merging CL, this CL doesn't enable stack slot merging by default; a
subsequent CL will do that.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: Id41d359a677767a8e7ac1e962ae23f7becb4031f
Reviewed-on: https://go-review.googlesource.com/c/go/+/576735
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Apr 9, 2024
Flag flip to enable stack slot merging by default when optimizing.
Please see the earlier CL for details on what this is doing.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I8e30d553e74ace43d418f883199721f05320d3d7
Reviewed-on: https://go-review.googlesource.com/c/go/+/576681
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@thanm
Contributor

thanm commented Apr 12, 2024

Hello @felixge -- I have a CL stack that should fix this problem (mostly submitted, there is one more enhancement still pending). I would like to test your particular case to make sure it gets handled. Would you be willing to share your PGO file so that I can rerun your PGO build? Thanks.

@thanm thanm added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 12, 2024
@felixge
Contributor Author

felixge commented Apr 15, 2024

@thanm yeah, what's the best way to get you a copy? It's from our prod environment, so I'd prefer not to post it in public. I'm on gopher slack or could e-mail you the file.

Also: How are you planning to test this? By building the gRPC library with the PGO profile and checking the generated machine code?

@thanm
Contributor

thanm commented Apr 15, 2024

I'm on gopher slack or could e-mail you the file.

Thanks, email is probably easiest (thanm @ google).

By building the gRPC library with the PGO profile and checking the generated machine code?

Yes.

@thanm
Contributor

thanm commented Apr 18, 2024

An update on this issue:

@felixge provided me with the PGO profile off-line and I was able to easily reproduce the problematic set of inlines.

Unfortunately it looks as though the CLs I've been working on to do stack slot merging for disjointly accessed locals (e.g. 577615 and related predecessors) are not going to help here.

The reason is that the variable in question, "localBuf" (defined here), is considered address-taken by the compiler. IR excerpt:

   AS tc(1) # controlbuf.go:573:30 controlbuf.go:956:8
   .   NAME-transport.buf esc(no) Class:PAUTO Offset:0 InlLocal OnStack Used SLICE-[]byte tc(1) # controlbuf.go:573:30 controlbuf.go:931:3
   .   SLICEARR SLICE-[]byte tc(1) # controlbuf.go:573:30 controlbuf.go:956:18
   .   .   ADDR Implicit PTR-*[16384]byte tc(1) # controlbuf.go:573:30 controlbuf.go:956:18
   .   .   .   NAME-transport.localBuf esc(no) Class:PAUTO Offset:0 Addrtaken InlLocal OnStack Used ARRAY-[16384]byte tc(1) # controlbuf.go:573:30 controlbuf.go:953:8
   .   SLICEARR-High
   .   .   NAME-transport..autotmp_266 esc(N) Class:PAUTO Offset:0 AutoTemp OnStack Used int tc(1) # controlbuf.go:559:28 controlbuf.go:910:32

This corresponds to the code that assigns a slice of localBuf to buf at line 956. This results in an indirect read (via buf) later in the function. The liveness analysis I'm using in the Go compiler's SSA backend isn't set up to track these sorts of indirect reads and writes, so this variable winds up being rejected from consideration.
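The mechanism is easy to reproduce in isolation: slicing a local array implicitly takes its address, which marks it Addrtaken (hypothetical minimal example):

package p

func f(n int) byte {
	var localBuf [16 * 1024]byte
	buf := localBuf[:n] // implicitly (&localBuf)[:n]; localBuf becomes Addrtaken
	buf[0] = 1          // subsequent accesses go through buf, i.e. indirectly
	return buf[0]
}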

I can think of a couple of ways to potentially attack this problem.

The first would be to introduce additional lifetime markers, similar to what's used in the LLVM back end (code). At the moment the Go compiler uses just a single lifetime-start marker (e.g. ssa.OpVarDef), which works great if you're only computing stack maps at calls; if we were to add a lifetime end marker as well, we could change the inliner to insert them and then exploit them during stack slot merging liveness analysis. This carries some risk in that we would need to make the other back end optimizations aware of the new marker (so as to avoid having reads/writes moved across it).

A second possibility would be to teach the inliner to reuse variables when it actually carries out the inlining, e.g. keep track of additional locals introduced by inlining within a given functions, and then at the point where we inline the body of B into A, check to see if there is an existing var (from a previous inline) that we can reuse. We could apply this sort of optimization even for address-taken variables, which would (in theory) make it applicable in this case.

I'll look into this some more and see whether I can come up with something viable.

gopherbot pushed a commit that referenced this issue Apr 18, 2024
This patch revises the algorithm/strategy used for overlapping the
stack slots of disjointly accessed local variables. The main change
here is to allow merging the stack slot of B into the slot for A if
B's size is less then A (prior to this they had to be identical), and
to also allow merging a non-pointer variables into pointer-variable
slots.

The new algorithm sorts the candidate list first by pointerness
(pointer variables first), then by alignment, then by size, and
finally by name. We no longer check that two variables have the same
GC shape before merging: since it should never be the case that we
have two vars X and Y both live across a given callsite where X and Y
share a stack slot, their gc shape doesn't matter.

Doing things this new way increases the total number of bytes saved
(across all functions) from 91256 to 124336 for the sweet benchmarks.

Updates #62737.
Updates #65532.
Updates #65495.

Change-Id: I1daaac1b1240aa47a6975e98ccd24e03304ab602
Reviewed-on: https://go-review.googlesource.com/c/go/+/577615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
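A sketch of the candidate ordering the commit message describes (assumed field and function names, not the actual cmd/compile code):

package p

import "sort"

type candidate struct {
	name   string
	size   int64
	align  int64
	hasPtr bool
}

// sortCandidates orders merge candidates as described above: pointer
// variables first, then by alignment, then by size, then by name.
func sortCandidates(cands []candidate) {
	sort.Slice(cands, func(i, j int) bool {
		ci, cj := cands[i], cands[j]
		if ci.hasPtr != cj.hasPtr {
			return ci.hasPtr
		}
		if ci.align != cj.align {
			return ci.align > cj.align
		}
		if ci.size != cj.size {
			return ci.size > cj.size
		}
		return ci.name < cj.name
	})
}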
@randall77
Contributor

We used to have a lifetime-ending marker, VarKill. I got rid of it because we weren't using it any more. https://go-review.googlesource.com/c/go/+/419234

@cherrymui
Member

Address-taken local variables are tricky. If a variable is address-taken, and the address escapes the inlined function scope, we can't mark its lifetime end. Currently we do the escape analysis after inlining, so at the inlining stage we don't have the escape information. Perhaps (long term?) we could do a pre-pass of escape analysis before inlining, then at inlining add the lifetime markers (or reuse locals directly) if the variable does not escape the inline function scope.

@thanm
Contributor

thanm commented Apr 19, 2024

If a variable is address-taken, and the address escapes the inlined function scope, we can't mark its lifetime end.

Maybe I am missing something, but if an address value can flow out from a function, wouldn't the var in question be marked as AUTOHEAP or escaping? E.g.

package e

type Bar struct {
	x [240]byte
	s string
}

func Foo(b []byte) *Bar {
	var localBuf [16 * 1024]byte
	copy(localBuf[:], b)
	buf := localBuf[0:10]
	if len(b) != 101 {
		buf[1] = 0
	}
	var bar Bar
	bar.x[1] = buf[0]
	return &bar
}

Here localBuf is marked as address-taken, but also marked as non-escaping (as opposed to bar, which escapes).

If we only target address-taken autos that are also non-escaping, seems as though this would be a viable option?

@cherrymui
Member

Maybe I am missing something, but if an address value can flow out from a function, wouldn't the var in question be marked as AUTOHEAP or escaping?

Yes, if it actually escapes the physical (outermost) function. What I meant was, say, F calls G, G is inlined into F, and G has a local variable x where &x escapes G but doesn't escape F. In this case, x can still be stack allocated in F (with G inlined into it), but its lifetime is not bounded by G.
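A small example of that scenario (hypothetical names):

package p

// &x escapes g's own scope via the return value...
func g() *int {
	x := 0
	return &x
}

// ...but when g is inlined into f, escape analysis (which runs after
// inlining) sees that x never leaves f, so x stays on f's stack; its
// lifetime just isn't bounded by the inlined body of g.
func f() int {
	p := g()
	*p = 42
	return *p
}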

@thanm
Contributor

thanm commented Apr 22, 2024

A second possibility would be to teach the inliner to reuse variables when
it actually carries out the inlining, e.g. keep track of additional locals
introduced by inlining within a given functions, and then at the point
where we inline the body of B into A, check to see if there is an existing
var (from a previous inline) that we can reuse. We could apply this sort of
optimization even for address-taken variables, which would (in theory) make
it applicable in this case.

I spent a little while prototyping this, but there are still significant hurdles that would have to be overcome. I cooked up some machinery to detect when two given autos within function F are derived from disjoint inlines as in this case; that is straightforward code to write. The difficulty comes in with regard to escape analysis.

Let's say you have two packages, P and Q, where P imports Q:

package Q

func Good(x, y int) int {
	var ok [4096]int
	sl := ok[:]
	sl[x] = y
	return ok[y]
}

func Bad(x, y int) []int {
	var nope [4096]int
	sl := nope[:]
	return sl[x:y]
}

package P

import "Q"

func A(x, y int) {
	r := 0
	if x < y {
		r += Q.Good(x, y)
	} else if y > x {
		r += Q.Good(y, x)
	} else {
		sl1 := Bad(x, y)
		sl2 := Q.Bad(x, y)
		sl3 := Q.Bad(y, x)
		r += sl1[x] + sl2[x] + sl3[x]
	}
}

func Bad(x, y int) []int {
	var nada [4096]int
	sl := nada[:]
	return sl[x:y]
}

When inlining is done for P's function A, we'll have two copies of the array ok (from inlined bodies of Q.Good), two copies of the array nope (from inlined bodies of Q.Bad) and a single copy of the array nada (from the inlined body of P.Bad).

If you compile Q and dump out the AST you can see that escape analysis has determined that ok doesn't escape but nope does escape:

.   DCL # Q.go:4:6
.   .   NAME-Q.ok esc(no) Class:PAUTO Offset:0 OnStack Used ARRAY-[4096]int tc(1) # Q.go:4:6
...
.   DCL # Q.go:10:6
.   .   NAME-Q.nope esc(h) Class:PAUTO Offset:0 Addrtaken Used ARRAY-[4096]int tc(1) # Q.go:10:6

Note the 'esc(no)' (doesn't escape) for ok and esc(h) (escapes to heap) for nope. This is the information we need, but the problem is that at the moment we don't write this granularity of escape info for locals when writing the export data. It's only around when we're compiling Q, and isn't around when we're compiling P, which is when we need it.

We can't use the escape analysis info computed for A, since it only tells us the state of the world with respect to that function. The various copies of nope and nada don't escape A, but they are not safe to overlap. In the specific case of nada from P.Bad, at the point where the inliner is running we haven't done escape analysis, but it might be possible to go back later on and inspect the results then. This doesn't help with the cross-package case, however.

@aclements
Member

Summary: In Q.Good, escape analysis tells us that, while ok is addrtaken, it doesn't escape, so that address must not leave the extent of Q.Good. In theory, inlining should be able to use this to realize that ok isn't live past the extent of Q.Good inlined into P.A. But escape analysis information about locals doesn't get exported (or kept at all), so we lose that information about ok.

Some options:

  • Could we re-analyze the body of the imported function Q.Good and keep the escape information for the locals?
  • Could escape analysis of P.A keep more information about the extents of locals? (It seems like this would work well if escape analysis were done in SSA; it seems much more awkward in IR.)

(Side thought: addrtaken may be a mistake in our compiler. It’s a big hammer and very coarse-grained.)

@aclements
Member

Could escape analysis of P.A keep more information about the extents of locals?

@cherrymui, @thanm, and I were just discussing this in more depth.

My basic observation is that there are deep parallels between liveness analysis and escape analysis. Liveness analysis determines the extents within a function of non-addrtaken variables, while escape analysis determines whether the extent of addrtaken variables is bounded by a function or not.

The idea here is to produce finer-grain extent information in escape analysis, bringing liveness analysis and escape analysis closer to each other.

Since escape analysis is done at the IR level, this raises the question of where to attach extent information. In principle this could get very complex, but for these purposes we really only need it at the OINLCALL level. When we translate an OINLCALL to SSA, we would produce VarKill operations for all addrtaken variables that don't escape the OINLCALL, threaded onto the mem at the "end" of the OINLCALL.
