
cmd/compile: consumes huge amount of memory when compiling table driven test code #11612

Closed

rin01 opened this issue Jul 6, 2015 · 20 comments

@rin01

rin01 commented Jul 6, 2015

uname -a
Linux nico-Qiana-Xfce 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

go version go1.4.2 linux/amd64

When I click the "Share" button in the Go Playground to share the problematic code, it fails with a "server error: try again" message.
I imagine this is because the source code is too big (~8,000 lines).

But you can find a small version here:

http://play.golang.org/p/qpxcVGkzuk

It compiles quickly without any problem because it has only 500 lines in the "samples" table.

But if you copy-paste these sample lines until there are 8,000 of them in the "samples" table, 6g consumes more than 2.5 GB of memory and is then killed, or crashes VirtualBox.

This problem is related to issue "cmd/go: go build takes too long to compile table driven test code":
#8259

But this time the problem is not the compile time but the memory consumption, which is really high.
Even with 2.5 GB of RAM allocated to VirtualBox, 6g consumes all the available RAM before being killed.

With Go 1.3.2 I could compile, although I had to allocate at least 2.4 GB of RAM to VirtualBox.
But with Go 1.4.2 the memory consumption has increased so much that I can no longer compile the same code.

@bradfitz
Contributor

bradfitz commented Jul 6, 2015

Please try with Go tip (which will shortly become Go 1.5) from the master branch.

@rin01
Author

rin01 commented Jul 6, 2015

On tip, the same problem occurs.
I can see that it consumes more than 1.2 GB, and then the "compile" program stops.

nico@nico-Qiana-Xfce ~/z_go/src/inut/inut_long_compile $ go version
go version devel +ef37184 Mon Jul 6 00:07:10 2015 +0000 linux/amd64

nico@nico-Qiana-Xfce ~/z_go/src/inut/inut_long_compile $ go build
# inut/inut_long_compile
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x841090, 0x16)
    /home/nico/go/src/runtime/panic.go:527 +0x96
runtime.sysMap(0xc85a460000, 0x649c0000, 0xc93c00, 0xda7938)
    /home/nico/go/src/runtime/mem_linux.go:143 +0x9b
runtime.mHeap_SysAlloc(0xc90400, 0x649c0000, 0xc90400)
    /home/nico/go/src/runtime/malloc.go:423 +0x160
runtime.mHeap_Grow(0xc90400, 0x324e0, 0x0)
    /home/nico/go/src/runtime/mheap.go:625 +0x63
runtime.mHeap_AllocSpanLocked(0xc90400, 0x324d9, 0xc81e4c9fff)
    /home/nico/go/src/runtime/mheap.go:529 +0x5f1
runtime.mHeap_Alloc_m(0xc90400, 0x324d9, 0x100000000, 0xc98100)
    /home/nico/go/src/runtime/mheap.go:422 +0x161
runtime.mHeap_Alloc.func1()
    /home/nico/go/src/runtime/mheap.go:481 +0x41
runtime.systemstack(0xc82003dee8)
    /home/nico/go/src/runtime/asm_amd64.s:278 +0xb1
runtime.mHeap_Alloc(0xc90400, 0x324d9, 0x10100000000, 0xc98100)
    /home/nico/go/src/runtime/mheap.go:482 +0x63
runtime.largeAlloc(0x649b1c78, 0x7fa600000001, 0xc800000008)
    /home/nico/go/src/runtime/malloc.go:727 +0x94
runtime.mallocgc.func3()
    /home/nico/go/src/runtime/malloc.go:634 +0x33
runtime.systemstack(0xc820024000)
    /home/nico/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
    /home/nico/go/src/runtime/proc1.go:668

goroutine 1 [running]:
runtime.systemstack_switch()
    /home/nico/go/src/runtime/asm_amd64.s:216 fp=0xc85a45f290 sp=0xc85a45f288
runtime.mallocgc(0x649b1c78, 0x78eb00, 0x7fa600000001, 0xc85a45f3a8)
    /home/nico/go/src/runtime/malloc.go:635 +0x972 fp=0xc85a45f360 sp=0xc85a45f290
runtime.newarray(0x78eb00, 0x1926c71e, 0x587579)
    /home/nico/go/src/runtime/malloc.go:759 +0xc9 fp=0xc85a45f3a0 sp=0xc85a45f360
runtime.makeslice(0x783ae0, 0x1926c71e, 0x1926c71e, 0x0, 0x0, 0x0)
    /home/nico/go/src/runtime/slice.go:32 +0x165 fp=0xc85a45f3f0 sp=0xc85a45f3a0
cmd/compile/internal/gc.newliveness(0xc82021c750, 0xc8240b8240, 0xc856026000, 0x27a1e, 0x2e400, 0xc85656c000, 0x206d2, 0x25000, 0x468f7dd9f72f9f3e)
    /home/nico/go/src/cmd/compile/internal/gc/plive.go:702 +0x161 fp=0xc85a45f6a8 sp=0xc85a45f3f0
cmd/compile/internal/gc.liveness(0xc82021c750, 0xc8240b8240, 0xc82eec3980, 0xc82eec3a00)
    /home/nico/go/src/cmd/compile/internal/gc/plive.go:1792 +0x2d3 fp=0xc85a45f788 sp=0xc85a45f6a8
cmd/compile/internal/gc.compile(0xc82021c750)
    /home/nico/go/src/cmd/compile/internal/gc/pgen.go:528 +0xe07 fp=0xc85a45f9f8 sp=0xc85a45f788
cmd/compile/internal/gc.funccompile(0xc82021c750)
    /home/nico/go/src/cmd/compile/internal/gc/dcl.go:1480 +0x1c9 fp=0xc85a45fa70 sp=0xc85a45f9f8
cmd/compile/internal/gc.Main()
    /home/nico/go/src/cmd/compile/internal/gc/lex.go:473 +0x1f08 fp=0xc85a45fd70 sp=0xc85a45fa70
cmd/compile/internal/amd64.Main()
    /home/nico/go/src/cmd/compile/internal/amd64/galign.go:127 +0x58d fp=0xc85a45fdd8 sp=0xc85a45fd70
main.main()
    /home/nico/go/src/cmd/compile/main.go:26 +0x189 fp=0xc85a45feb0 sp=0xc85a45fdd8
runtime.main()
    /home/nico/go/src/runtime/proc.go:111 +0x2bf fp=0xc85a45ff00 sp=0xc85a45feb0
runtime.goexit()
    /home/nico/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc85a45ff08 sp=0xc85a45ff00

@bradfitz bradfitz added this to the Go1.6 milestone Jul 6, 2015
@bradfitz bradfitz changed the title cmd/go: go build consumes huge amount of memory when compiling table driven test code cmd/compile: consumes huge amount of memory when compiling table driven test code Jul 6, 2015
@rsc
Contributor

rsc commented Dec 14, 2015

This seems to be better now. On my Linux system:

$ /usr/bin/time -a $(go tool -n compile) x.go
2.62user 0.12system 0:01.99elapsed 137%CPU (0avgtext+0avgdata 235516maxresident)k
0inputs+15416outputs (0major+53338minor)pagefaults 0swaps

That seems to be saying the maximum compiler footprint was 235 MB for the input file from the playground.

@rsc rsc closed this as completed Dec 14, 2015
@minux
Member

minux commented Dec 16, 2015 via email

@ALTree
Member

ALTree commented Mar 25, 2016

This got much, much worse with the new SSA backend. Even the 500-line example requires gigabytes of memory to compile.

go1.6:

 /usr/bin/time -a $(go tool -n compile) test.go
2.66user 0.17system 0:02.77elapsed 102%CPU (0avgtext+0avgdata 233136maxresident)k

tip:

/usr/bin/time -a $(gotip tool -n compile) test.go
43.64user 1.38system 0:27.60elapsed 163%CPU (0avgtext+0avgdata 3444832maxresident)k

@davecheney
Contributor

The example code seems to cause lookupVarOutgoing to generate a huge amount of garbage.

out.pdf

@randall77
Contributor

This looks like a consequence of our phi building algorithm. For variables which are long-lived but unmodified across lots of basic blocks, we end up placing lots of FwdRef markers that are unnecessary. We use http://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf , and this seems a fairly fundamental limitation of that algorithm.
We could switch to another algorithm which might be better in these cases (but maybe slower otherwise?). Or maybe there are ways to optimize this pattern using the current algorithm without generating so much garbage.

@zx9597446

I couldn't build https://github.com/sourcegraph/webloop on my VPS, which has 512 MB of memory and 256 MB of swap space.
go version go1.6 linux/amd64
I could build the same project with go1.4.2 (maybe) on a machine with 256 MB of memory.

@rsc rsc modified the milestones: Go1.8, Go1.7 May 17, 2016
@UniQP

UniQP commented Jul 26, 2016

@randall77 The algorithm presented in http://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf removes unnecessary phi functions during construction (as soon as the block is sealed). Since the Go implementation delays the phi removal until the whole CFG is built, you end up with a lot of unnecessary phi functions/FwdRef markers. Thus, a possible solution would be to create and optimize phi functions during construction.

Even with FwdRefs, the memory usage looks pretty high.
@rin01 Can you provide some statistics for your input program:

  • number of basic blocks
  • number of variables

It would also be nice to see what the CFG for one line looks like.

@randall77
Contributor

We do optimize phi functions during construction. I think we do exactly what Braun et al. describe, as if all blocks were sealed to start. If there are situations where we generate more phis than necessary, I'm happy to accept patches to fix that.

@UniQP

UniQP commented Jul 28, 2016

https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/ssa.go#L3793 says "Phi optimization is a separate pass". So it is performed after construction, right?

@randall77
Contributor

The optimizations described in Braun are done during phi construction. There is in addition a phi optimization pass which is run after building.

@UniQP

UniQP commented Jul 31, 2016

The optimizations described in Braun are done during phi construction.

Yes, but the phi construction is done after the whole CFG is constructed. In the paper, phi functions are constructed and optimized as soon as the block is sealed. This makes a significant difference in terms of memory usage.

@randall77
Contributor

I think the comment in linkForwardReferences is stale; that may be the confusing part. All the things described as "phi optimizations" in the Braun paper are done during phi construction. Another way of looking at it: we do what Braun does, but we don't start making phi values until all the blocks are sealed.

The one exception is that we don't do the recursive removal they do in tryRemoveTrivialPhi, as we don't have pointers to users in our IR. I'm open to suggestions as to how to make that work. Part of the point of phielim is to catch those cases.

The "phi optimization" pass in our compiler is a separate set of optimizations.

@UniQP

UniQP commented Jul 31, 2016

All the things described as "phi optimizations" in the Braun paper are done during phi construction.

Can you point me to the corresponding code?

Again, the crucial question (regarding memory usage) is: When do we optimize the phi functions (or FwdRefs)? In my understanding of the code, this happens after the CFG construction rather than during the CFG construction.

The one exception is that we don't do the recursive removal they do in tryRemoveTrivialPhi, as we don't have pointers to users in our IR. I'm open to suggestions as to how to make that work.

You can catch some (but not all) cases when optimizing the unique operand instead (if it is a phi function).
Example:
x = 42
do {
    if (...) {
    } else {
    }
} while (...);
use(x)

@randall77
Contributor

cmd/compile/internal/gc/ssa.go:resolveFwdRef. We find the value of a variable on each incoming edge and only make the phi if there are two distinct witnesses.

@UniQP

UniQP commented Jul 31, 2016

I think I finally figured out my wrong assumption. I thought you applied the algorithm during CFG construction, but you applied it afterwards.

I found at least two differences from the original algorithm in lookupVarOutgoing:

  1. If the block has only one predecessor, the implementation skips the block but does not register the resulting value in s.defvars (as the paper would). This may lead to runtime issues for long chains of such blocks. However, it is certainly not the reason for the out-of-memory failure.
  2. When the function finds a block with multiple predecessors, it creates a new FwdRef, does some bookkeeping, and returns it immediately. In contrast, the paper would recursively handle the new FwdRef. This leads to a lot of unnecessary operations.

Example:
x = 42
if (...) {
    if (...) {
    }
}
use(x)

Let's assume we resolve the FwdRef after the outer if. Then, we find the definition x = 42 as the first phi operand and create a new FwdRef after the inner if as the second phi operand. In the next step, we resolve the new FwdRef after the inner if, finding x = 42 both times, which results in a Copy operation. Thus, we end up with:

v0: 42
if (...) {
    if (...) {
    }
    v1: Copy(v0)
}
v2: Phi(v0, v1)
use(v2)

In contrast, the paper would construct:
v0: 42
if (...) {
    if (...) {
    }
}
use(v0)

@chmallon

chmallon commented Aug 2, 2016

This got much much worse with the new SSA backend. Even the 500 lines example requires gigabytes of memory to be compiled

AFAICT there are only two variables ('r_INT' and 'samples') for SSA construction in the code snippet.
Assuming it has 10k basic blocks (which is hopefully an exaggeration for the 500-line version) and requires 2 GB (a lower bound for the word "gigabytes"), that would mean 100 kB of memory per variable per basic block.
To me this suggests that something is going very wrong.
It should only take a few bytes per variable per block to remember the current binding of the variable.
Several thousand basic blocks should be no issue at all for the algorithm.

Some statistics for compiling the code snippet would be very interesting:

  • How many variables are there (considered for SSA construction)?
  • How many basic blocks?
  • How is getting values from other basic blocks handled: as described in the paper, or with some multi-pass handling?
  • When are useless phis eliminated: on the fly while looking up bindings in other blocks, or afterwards?

Disclaimer:
UniQP and I are two of the authors of the paper.
Our compiler framework (libfirm) has used this construction algorithm for a long time.

@dr2chase
Contributor

dr2chase commented Aug 2, 2016

For the referenced "test.go", 6784 blocks, 55190 vars.
I tried compiling it with 1.7.latest (whatever I got by syncing and building) and tried turning the sparse-phi-assist cutoff up and down (0 = always on, -1 = always off, default = 2,500,000; the cutoff is compared to the product of #blocks and #vars):

for i in a b c d e f g h i j k l ; do
> GO_SSA_PHI_LOC_CUTOFF=-1 /usr/bin/time $(go tool -n compile) test.go
> done
        0.93 real         1.03 user         0.04 sys
        0.91 real         1.02 user         0.03 sys
        0.92 real         1.01 user         0.04 sys
        0.96 real         1.06 user         0.03 sys
        1.01 real         1.08 user         0.03 sys
        1.04 real         1.12 user         0.04 sys
        0.93 real         1.02 user         0.04 sys
        1.02 real         1.12 user         0.04 sys
        1.04 real         1.10 user         0.05 sys
        0.94 real         1.04 user         0.03 sys
        0.95 real         1.04 user         0.04 sys
        1.08 real         1.14 user         0.04 sys
~/GoogleDrive/work/tmp/b11612$ for i in a b c d e f g h i j k l ; do GO_SSA_PHI_LOC_CUTOFF=0 /usr/bin/time $(go tool -n compile) test.go; done
        0.78 real         0.91 user         0.03 sys
        0.88 real         0.96 user         0.04 sys
        0.84 real         0.96 user         0.04 sys
        0.83 real         0.93 user         0.03 sys
        0.84 real         0.95 user         0.03 sys
        0.84 real         0.94 user         0.04 sys
        0.86 real         0.96 user         0.04 sys
        0.82 real         0.91 user         0.04 sys
        0.85 real         0.96 user         0.04 sys
        0.84 real         0.96 user         0.03 sys
        0.86 real         0.94 user         0.04 sys
        0.84 real         0.95 user         0.04 sys
~/GoogleDrive/work/tmp/b11612$ for i in a b c d e f g h i j k l ; do /usr/bin/time $(go tool -n compile) test.go; done
        0.77 real         0.87 user         0.03 sys
        0.89 real         0.97 user         0.04 sys
        0.88 real         0.99 user         0.04 sys
        0.85 real         0.94 user         0.04 sys
        0.82 real         0.92 user         0.03 sys
        0.89 real         0.93 user         0.05 sys
        0.81 real         0.93 user         0.03 sys
        0.81 real         0.93 user         0.04 sys
        0.82 real         0.95 user         0.04 sys
        0.86 real         0.96 user         0.04 sys
        0.85 real         0.94 user         0.03 sys
        0.82 real         0.92 user         0.04 sys

Space use (with sparse assist):

(pprof) top 20
57.21MB of 72.92MB total (78.46%)
Showing top 20 nodes out of 175 (cum >= 1.38MB)
      flat  flat%   sum%        cum   cum%
       6MB  8.23%  8.23%        6MB  8.23%  cmd/compile/internal/gc.Prog
       5MB  6.86% 15.09%        5MB  6.86%  cmd/compile/internal/gc.Nod
    4.50MB  6.17% 21.26%     4.50MB  6.17%  cmd/compile/internal/ssa.(*Func).newValue
    4.11MB  5.64% 26.90%     4.11MB  5.64%  cmd/compile/internal/ssa.(*Func).setHome
    4.02MB  5.51% 32.40%     5.69MB  7.80%  cmd/compile/internal/ssa.(*regAllocState).computeLive
       4MB  5.49% 37.89%        4MB  5.49%  cmd/compile/internal/gc.typ
    3.18MB  4.37% 42.26%     3.18MB  4.37%  cmd/compile/internal/ssa.cse
    2.91MB  3.99% 46.25%     2.91MB  3.99%  cmd/internal/obj.(*LSym).Grow
    2.88MB  3.95% 50.20%     2.88MB  3.95%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc
    2.84MB  3.90% 54.10%     8.53MB 11.70%  cmd/compile/internal/ssa.(*regAllocState).init
    2.76MB  3.79% 57.88%     2.76MB  3.79%  cmd/internal/obj.Linknew
    2.12MB  2.90% 60.78%     2.12MB  2.90%  cmd/compile/internal/ssa.schedule
       2MB  2.74% 63.53%        3MB  4.11%  cmd/compile/internal/ssa.(*SparseTreeMap).Insert
    1.75MB  2.40% 65.92%     3.12MB  4.28%  cmd/compile/internal/ssa.(*stackAllocState).init
    1.71MB  2.35% 68.27%     1.71MB  2.35%  cmd/compile/internal/gc.Naddr
    1.52MB  2.09% 70.36%     1.52MB  2.09%  cmd/compile/internal/ssa.prove
    1.52MB  2.09% 72.45%     1.52MB  2.09%  cmd/compile/internal/gc.livenessepilogue
    1.51MB  2.07% 74.52%     1.51MB  2.07%  cmd/compile/internal/gc.escflows
    1.50MB  2.06% 76.57%     2.50MB  3.43%  cmd/compile/internal/gc.(*state).variable
    1.38MB  1.89% 78.46%     1.38MB  1.89%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph

Without sparse assist:

~/GoogleDrive/work/tmp/b11612$ GO_SSA_PHI_LOC_CUTOFF=-1 $(go tool -n compile) -memprofile=n.prof test.go 
~/GoogleDrive/work/tmp/b11612$ go tool pprof -alloc_space $(go tool -n compile) n.prof
Entering interactive mode (type "help" for commands)
(pprof) top 20
64.60MB of 83.13MB total (77.71%)
Showing top 20 nodes out of 200 (cum >= 1.06MB)
      flat  flat%   sum%        cum   cum%
    9.51MB 11.44% 11.44%     9.51MB 11.44%  cmd/compile/internal/ssa.(*Func).newValue
    6.50MB  7.82% 19.26%     6.50MB  7.82%  cmd/compile/internal/gc.Prog
    6.50MB  7.82% 27.08%     6.50MB  7.82%  cmd/compile/internal/gc.Nod
    5.39MB  6.48% 33.56%     5.39MB  6.48%  cmd/compile/internal/ssa.(*Func).setHome
    4.42MB  5.32% 38.88%     4.42MB  5.32%  cmd/compile/internal/ssa.cse
    3.63MB  4.36% 43.24%     3.63MB  4.36%  cmd/compile/internal/ssa.(*regAllocState).computeLive
    3.36MB  4.04% 47.29%     3.36MB  4.04%  cmd/internal/obj.(*LSym).Grow
    3.07MB  3.69% 50.98%     6.69MB  8.05%  cmd/compile/internal/ssa.(*regAllocState).init
    2.76MB  3.32% 54.30%     2.76MB  3.32%  cmd/internal/obj.Linknew
    2.56MB  3.07% 57.37%     2.56MB  3.07%  cmd/compile/internal/gc.Naddr
    2.14MB  2.58% 59.95%     2.14MB  2.58%  cmd/compile/internal/gc.(*Pkg).LookupBytes
    2.14MB  2.58% 62.53%     2.14MB  2.58%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc
       2MB  2.41% 64.94%        2MB  2.41%  cmd/compile/internal/gc.typ
    1.90MB  2.29% 67.22%     1.90MB  2.29%  cmd/compile/internal/ssa.schedule
    1.75MB  2.10% 69.33%     3.12MB  3.76%  cmd/compile/internal/ssa.(*stackAllocState).init
    1.65MB  1.98% 71.31%    12.30MB 14.79%  cmd/compile/internal/ssa.(*regAllocState).regalloc
    1.54MB  1.85% 73.15%     2.04MB  2.45%  cmd/compile/internal/gc.(*state).variable
    1.38MB  1.65% 74.81%     1.38MB  1.65%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph
    1.35MB  1.63% 76.44%     1.35MB  1.63%  cmd/compile/internal/ssa.tighten
    1.06MB  1.27% 77.71%     1.06MB  1.27%  cmd/compile/internal/ssa.layout

There are still some inputs we don't compile as fast as we'd like, or as fast as we once did, and I'd love to have any mistakes in our implementation of the paper's algorithm sorted out, but this particular slow compilation is solved.

@dr2chase dr2chase closed this as completed Aug 2, 2016
@chmallon

chmallon commented Aug 5, 2016

For the referenced "test.go", 6784 blocks, 55190 vars.

Sorry, I do not understand the following aspect:
Where do 55k variables come from in the source code?
Or are these SSA values?
Even then, this seems like awfully many values for such a snippet.
Where are they coming from?
Also, where do the 6.8k basic blocks come from?

@golang golang locked and limited conversation to collaborators Mar 3, 2018