cmd/compile: killed while compiling gogoprotobuf #15537
Sometimes strace shows failure with ENOMEM; other times I just see SIGKILLs as the oom-killer steps in. dmesg shows the oom-killer firing.
Specifically, strace sometimes shows system calls failing with ENOMEM.
ps ax shows a pile of compile processes; watching these with htop shows them growing rapidly to the point where they use upwards of 1GB each and finally trigger the oom-killer.
Possible reason: the generated casttype.pb.go files contain lots (1000+) of anonymous function invocations like this:
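Roughly this pattern (a hypothetical reconstruction based on the benchmark Josh posts later in this thread; the real generated code uses gogoproto cast types rather than *int):

package p

// Each element of the generated composite literal is built by an
// immediately invoked anonymous function returning a pointer to its
// argument.
var casts = []*int{
	func(v int) *int { return &v }(1),
	func(v int) *int { return &v }(2),
	func(v int) *int { return &v }(3),
	// ...and so on, 1000+ times per generated file
}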
A memory profile shows:
Thanks for profiling. If you have the binary and profile still around, will you re-run pprof with the -lines argument added as well? Thanks!
Does your machine have a swap partition or swap file allocated? Even though you may have huge amounts of physical memory available, Linux behaves very conservatively if no swap is available.
I just reproduced here. (Not the killing, but high memory usage.) I'm investigating, although no promises. An initial observation is that we're generating way too many OpCopies during SSA construction.
@dr2chase: if this is phi insertion, your CL might help.
I think I have a simpler fix. Hang tight.
Sparse-phi code looks like a big help. On my laptop (8GB, OS X) it appears to max out at 4 simultaneous compiles, each over 2GB, when sparse-phi is activated, but it runs to completion with only a brief instant of red in the system memory monitor. With the sparse-phi code deactivated, at an earlier stage one of the compiles grows to 6.7GB (at least -- I killed it; it did this twice in a row, so it is repeatable enough for me) alongside three others, one of those over 2GB and two over 1GB, and the system memory monitor goes red and sticks there.
CL https://golang.org/cl/22790 mentions this issue.
@dr2chase if I want to benchmark the sparse phi change locally, do I just patch in the CL (which again?) or do I also need to fiddle with any settings? I'm curious how my little optimization compares.
Not sure if this is still useful, but here are profile results with -lines. CL 22790 helps a lot. Before the CL:
After the CL:
Your optimization helps a lot on Dave Cheney's types.go compilation benchmark, and helps a good bit on this one, but the sparse phi change does notably better. I need to check one other slowed-way-down build. If you patch in the CL (https://go-review.googlesource.com/#/c/22342/), there is a cutover point based on program size (the product of #vars and #blocks) that is normally 2,500,000, but you can change it on the command line.
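Expressed as a sketch (invented names; the real check lives in the CL's phi-insertion code), the cutover is just a size test:

package main

import "fmt"

// Hypothetical sketch of the cutover described above: use the sparse
// phi-placement algorithm only when program size, measured as the
// product of variable count and block count, exceeds a threshold.
const defaultSparsePhiCutoff = 2500000

func useSparsePhi(nVars, nBlocks, cutoff int) bool {
	if cutoff <= 0 {
		cutoff = defaultSparsePhiCutoff // overridable, e.g. from the command line
	}
	return nVars*nBlocks > cutoff
}

func main() {
	fmt.Println(useSparsePhi(200, 500, 0))   // false: 100,000 is well under the cutoff
	fmt.Println(useSparsePhi(4000, 1000, 0)) // true: 4,000,000 exceeds it
}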
This is the other one: #14934
Checking against the 38x slowdown -- your change gets most of it (down to 15s user on my laptop); sparse manages to get one second faster, not enough to matter for that one. This is actually the only one I've seen where it seems to make a real difference. Keith was wondering if you ought to decorate s.defvar for some of the blocks on your recursive walk -- not nearly all, but enough to prevent quadratic badness if you had a bunch of uses far from the first def. Say, you could iterate instead of recurse, count as you go, and then stick in a def halfway to the end of the iteration if the count was more than 3-ish.
Thanks, @dr2chase. It's probable that both the sparse phi change and this optimization should go in; this one is good for small and general cases, and sparse is good for large ones. I'm going to run some more benchmarks (last time used a low iteration count and still had a browser etc. running) and take one more look. I'm not 100% sure I follow the decoration suggestion, but I'll read it again tomorrow. (Off to take my kid to the playground while some benchmarks run.) I will observe that we were actually generating O(n^2) values to avoid an O(n^2) recursion, so we are going from expensive quadratic to cheap quadratic. If you want things in sooner for freeze reasons, feel free to just submit my change and build atop it. Oh, and reminder to self: there's also still some stackAlloc to look into here. :)
Ok, I think I understand the decoration suggestion. I don't think "halfway if > 3" is the right answer -- I suspect something more like every 100 iterations is probably going to yield better results. I will do some experimenting; a sketch of the idea follows the benchmark below. I have a nice, simple benchmark case to use for this:

package p

func f() []*int {
	x := [...]*int{
		func(v int) *int { return &v }(1),
		func(v int) *int { return &v }(2),
		func(v int) *int { return &v }(3),
		func(v int) *int { return &v }(4),
		func(v int) *int { return &v }(5),
		func(v int) *int { return &v }(6),
		func(v int) *int { return &v }(7),
		func(v int) *int { return &v }(8),
		// and so on, as desired, to create quadratic behavior
	}
	return x[:]
}

(Note to self: Why aren't these getting inlined, anyway?)
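For reference, a minimal, self-contained sketch of the iterate-and-decorate idea (toy types and invented names, not the compiler's actual code): walk back through single-predecessor blocks in a loop, and on the way out record the found value in every 100th block visited, so that later lookups starting anywhere along the chain terminate quickly.

package main

import "fmt"

// block is a toy stand-in for an SSA basic block with at most one
// predecessor; defs records the variables defined in the block.
type block struct {
	pred *block
	defs map[string]int
}

// lookup walks predecessors iteratively until it finds a definition of v,
// then "decorates" every 100th block along the path with that definition.
func lookup(b *block, v string) int {
	const decorateEvery = 100
	var path []*block
	for {
		if val, ok := b.defs[v]; ok {
			for i := decorateEvery - 1; i < len(path); i += decorateEvery {
				path[i].defs[v] = val
			}
			return val
		}
		path = append(path, b)
		b = b.pred // assumes v is defined somewhere up the chain
	}
}

func main() {
	def := &block{defs: map[string]int{"x": 42}}
	b := def
	for i := 0; i < 250; i++ {
		b = &block{pred: b, defs: map[string]int{}}
	}
	fmt.Println(lookup(b, "x")) // 42; the walk also cached the value along the way
}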
If b has exactly one predecessor, as happens frequently with static calls, we can make lookupVarOutgoing generate less garbage. Instead of generating a value that is just going to be an OpCopy and then get eliminated, loop. This can lead to lots of looping. However, this loop is way cheaper than generating lots of ssa.Values and then eliminating them.

For a subset of the code in #15537:

Before: 28.31 real, 36.17 user, 1.68 sys, 2282450944 maximum resident set size
After: 9.63 real, 11.66 user, 0.51 sys, 638144512 maximum resident set size

Updates #15537.

Excitingly, it appears that this also helps regular code:

name      old time/op  new time/op  delta
Template  288ms ± 6%   276ms ± 7%   -4.13%  (p=0.000 n=21+24)
Unicode   143ms ± 8%   141ms ±10%   ~       (p=0.287 n=24+25)
GoTypes   932ms ± 4%   874ms ± 4%   -6.20%  (p=0.000 n=23+22)
Compiler  4.89s ± 4%   4.58s ± 4%   -6.46%  (p=0.000 n=22+23)
MakeBash  40.2s ±13%   39.8s ± 9%   ~       (p=0.648 n=23+23)

name      old user-ns/op   new user-ns/op   delta
Template  388user-ms ±10%  373user-ms ± 5%  -3.80%  (p=0.000 n=24+25)
Unicode   203user-ms ± 6%  202user-ms ± 7%  ~       (p=0.492 n=22+24)
GoTypes   1.29user-s ± 4%  1.17user-s ± 4%  -9.67%  (p=0.000 n=25+23)
Compiler  6.86user-s ± 5%  6.28user-s ± 4%  -8.49%  (p=0.000 n=25+25)

name      old alloc/op  new alloc/op  delta
Template  51.5MB ± 0%   47.6MB ± 0%   -7.47%   (p=0.000 n=22+25)
Unicode   37.2MB ± 0%   37.1MB ± 0%   -0.21%   (p=0.000 n=25+25)
GoTypes   166MB ± 0%    138MB ± 0%    -16.83%  (p=0.000 n=25+25)
Compiler  756MB ± 0%    628MB ± 0%    -16.96%  (p=0.000 n=25+23)

name      old allocs/op  new allocs/op  delta
Template  450k ± 0%      445k ± 0%      -1.02%  (p=0.000 n=25+25)
Unicode   356k ± 0%      356k ± 0%      ~       (p=0.374 n=24+25)
GoTypes   1.31M ± 0%     1.25M ± 0%     -4.18%  (p=0.000 n=25+25)
Compiler  5.29M ± 0%     5.02M ± 0%     -5.15%  (p=0.000 n=25+23)

It also seems to help in other cases in which phi insertion is a pain point (#14774, #14934).

Change-Id: Ibd05ed7b99d262117ece7bb250dfa8c3d1cc5dd2
Reviewed-on: https://go-review.googlesource.com/22790
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
With Josh's fix, and sparse layered on top of that, the big memory pigs seem to be in register and stack allocation (regalloc and stackalloc); this is a typical memory profile from one of the fattest compilations:
Thank you guys - the next hurdle seems to be the compilation of:
@dr2chase thinking out loud, I wonder whether instead of storing liveness as block+value pairs (which can grow quadratically), we could store it as value+[]range, where each range is an sdom entry/exit pair representing a subset of the CFG. In theory this could also be quadratic, if a value pops in and out of liveness, but that seems less likely.
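Something like the following, perhaps (a hypothetical sketch of the proposed representation; all names invented). Blocks are identified by the entry/exit numbers a depth-first walk of the dominator tree assigns, so one range covers a whole subtree of the CFG:

package main

import "fmt"

// sdomRange names a dominator-tree subtree by DFS entry/exit numbers:
// a block with entry number e lies inside the range iff entry <= e <= exit.
type sdomRange struct{ entry, exit int }

// liveness stores, for one value, the CFG subsets in which it is live --
// a handful of ranges rather than one (block, value) pair per live block.
type liveness struct{ ranges []sdomRange }

func (l liveness) liveAt(blockEntry int) bool {
	for _, r := range l.ranges {
		if r.entry <= blockEntry && blockEntry <= r.exit {
			return true
		}
	}
	return false
}

func main() {
	v := liveness{ranges: []sdomRange{{entry: 2, exit: 9}, {entry: 14, exit: 14}}}
	fmt.Println(v.liveAt(5), v.liveAt(12)) // true false
}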
@josharian Maybe. I tried plain first+last postorder numbers for stack allocation (this is essentially linear scan) and didn't get happy results, but couldn't tell if I had made some mistake with self-conflict (there are issues with loop phi functions and uses transmitted through backedges). Can you give the sparse CL a look for style/explanation, etc.? @gpaul, what's the exact recipe for that example? I'm here:
and I see (using sparse CL still under review):
There's a huge difference in memory allocation between the two, though still plenty of room for additional improvement in register/stack allocation.
@dr2chase will take a look at the sparse phi CL on Monday. (Mother's Day today.) Sorry that I've been a delinquent reviewer in general. Juggling too many things. A few other random, somewhat related thoughts:

The self-conflict bit is a fly in the ointment indeed. On my simple benchmark case in the comment above, most of the allocation is the two append statements building s.interfere, so I looked into representing interference as sets of equivalence classes (see the sketch after this comment). I know that the interferes relationship is not transitive in general, but it frequently is, so I was going to model it with multiple equivalence classes. The interferes relationship is also not reflexive, but I was going to assume it was anti-reflexive and compensate...boom. :(

I noticed in the list of SSA stack size regressions that most of the really big increases are init functions, which generally look a lot like this code -- building a large composite literal. But the components of a composite literal can always be computed independently of each other. I wonder whether a better short-term fix might be teaching the compiler that fact. We still need to fix the quadratic behavior, but reducing spills from composite literals would (a) generate better code anyway and (b) whack a whole class of problem cases.

It scares me that Type is used as a map key in stackalloc, given that everywhere else we use .Compare == CMPeq for equality. Given that, I wonder whether we should use == for the related cases in stackalloc. For 1.8, I plan to switch us to using the GC pattern instead (onebitwalktype). That'll make the interference sets bigger, though, exacerbating the performance problems here.
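For concreteness, a tiny illustration (hypothetical, not from the thread) of the non-transitivity that sinks a pure equivalence-class representation: modeling values as live ranges, a can interfere with b and b with c while a and c are disjoint.

package main

import "fmt"

// liveRange models a value's lifetime as a half-open interval of
// instruction indices.
type liveRange struct{ start, end int }

// interferes reports whether two live ranges overlap.
func interferes(a, b liveRange) bool {
	return a.start < b.end && b.start < a.end
}

func main() {
	a := liveRange{0, 4}
	b := liveRange{3, 8}
	c := liveRange{7, 10}
	// a-b and b-c overlap, but a-c do not, so {a, b, c} cannot be
	// captured by equivalence classes alone.
	fmt.Println(interferes(a, b), interferes(b, c), interferes(a, c)) // true true false
}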
Scratch the comment about composite literal independence. The evaluation of the components may have side-effects. Duh. There's still some independence, in that we can safely write each component into the final destination as calculated, but I'm not sure that that is enough to help.
@dr2chase: The sparse phi CL lets the combos/both package compile in isolation. Building its siblings concurrently still leads to OOM issues with 8 cores and 16GB of memory:

go install -v github.com/gogo/protobuf/test/theproto3/combos/...
I hate to say "works for me," but it actually does: 4 cores and 8GB, OS X, and I did shut down Chrome first. OS X is playing games with compressed memory, not sure what that means, but this is a screenshot taken seconds before the memory-consumption peak:

I'll see what I can do to get the sparse CL in -- it may still need polish -- but I'm not sure what else can be done in 1.7.
@dr2chase again thinking out loud, if we can predict/detect when liveness info has stabilized for a block, we could interleave liveness and interference for stackalloc, since interference works an arbitrary block at a time, which would (probably) allow us to store less liveness info. Reactions?
@dr2chase 'works for me' is totally valid. It's open source, after all. Unfortunately I'm not so lucky. For this specific issue we may just remove the gogoprotobuf ./test packages from our repo. It seems like a pretty good test for the compiler, and since it's quite a popular library, it would be great if you and the rest of the compiler team would continue working towards pre-SSA memory usage and compilation times for it. We have several large swaths of generated code internally which I haven't tried to compile yet. If they trip up the compiler, we'll extract some test cases early in the 1.8 cycle. The performance improvements we're seeing with at least one high-performance component compiled with tip are pretty amazing (~30% improvement over 1.6), so I hope we can figure something out.
CL https://golang.org/cl/22342 mentions this issue.
For golang#15537

Compiling github.com/gogo/protobuf/test/combos/both

Before: 69.83 real, 82.72 user, 2.01 sys, 4206456832 maximum resident set size
After: 62.94 real, 74.74 user, 1.92 sys, 3832225792 maximum resident set size

name      old time/op  new time/op  delta
Template  279ms ± 4%   277ms ± 4%   ~       (p=0.529 n=20+20)
Unicode   142ms ±10%   143ms ± 6%   ~       (p=0.461 n=20+20)
GoTypes   850ms ± 1%   846ms ± 2%   ~       (p=0.175 n=19+20)
Compiler  4.43s ± 2%   4.38s ± 1%   -1.12%  (p=0.000 n=19+20)

name      old user-ns/op   new user-ns/op   delta
Template  376user-ms ± 4%  378user-ms ± 3%  ~  (p=0.755 n=19+20)
Unicode   203user-ms ± 3%  205user-ms ± 8%  ~  (p=0.314 n=15+20)
GoTypes   1.15user-s ± 2%  1.15user-s ± 3%  ~  (p=0.799 n=20+20)
Compiler  6.14user-s ± 3%  6.12user-s ± 2%  ~  (p=0.547 n=20+20)

name      old alloc/op  new alloc/op  delta
Template  47.6MB ± 0%   46.9MB ± 0%   -1.41%  (p=0.000 n=19+20)
Unicode   37.1MB ± 0%   37.1MB ± 0%   ~       (p=0.134 n=20+20)
GoTypes   138MB ± 0%    136MB ± 0%    -1.65%  (p=0.000 n=20+20)
Compiler  632MB ± 0%    614MB ± 0%    -2.82%  (p=0.000 n=20+20)

name      old allocs/op  new allocs/op  delta
Template  445k ± 0%      444k ± 0%      -0.17%  (p=0.000 n=20+20)
Unicode   356k ± 0%      356k ± 0%      ~       (p=0.204 n=20+20)
GoTypes   1.25M ± 0%     1.25M ± 0%     -0.41%  (p=0.000 n=20+19)
Compiler  5.03M ± 0%     4.99M ± 0%     -0.87%  (p=0.000 n=19+20)

Change-Id: Iee7b56ede4c319e4194160502f0796f478a9b73f
CL https://golang.org/cl/23211 mentions this issue.
CL https://golang.org/cl/24904 mentions this issue.
What version of Go are you using (go version)?

go version devel +ba6765c Wed May 4 05:55:07 2016 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

linux x86_64

What did you expect to see?

The build should succeed.

What did you see instead?

A couple of signal: killed errors:

This is on a pretty decent laptop with 16GB of RAM that is doing little else.