Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: killed while compiling gogoprotobuf #15537

Closed
gpaul opened this issue May 4, 2016 · 32 comments
Closed

cmd/compile: killed while compiling gogoprotobuf #15537

gpaul opened this issue May 4, 2016 · 32 comments

Comments

@gpaul
Copy link
Contributor

gpaul commented May 4, 2016

  1. What version of Go are you using (go version)?
    go version devel +ba6765c Wed May 4 05:55:07 2016 +0000 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    linux x86_64
  3. What did you do?
mkdir /tmp/gopath
cd $!
export GOPATH=`pwd`
go get -d -v github.com/gogo/protobuf
go install -v ...
  1. What did you expect to see?
    should succeed
  2. What did you see instead?
    a couple of signal: killed errors:
github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler
go build github.com/gogo/protobuf/test: /home/gustav/go/pkg/tool/linux_amd64/compile: signal: killed
github.com/gogo/protobuf/test/castvalue
go build github.com/gogo/protobuf/test/casttype/combos/neither: /home/gustav/go/pkg/tool/linux_amd64/compile: signal: killed

This is on a pretty decent laptop with 16GB of RAM and is doing little else:

$ free -g
              total        used        free      shared  buff/cache   available
Mem:             15           1          13           0           0          13
Swap:             0           0           0
@bradfitz
Copy link
Contributor

bradfitz commented May 4, 2016

What's your ulimit -a?

What does time go install -v .. say?

@bradfitz bradfitz changed the title compile: killed while compiling gogoprotobuf cmd/compile: killed while compiling gogoprotobuf May 4, 2016
@bradfitz bradfitz added this to the Go1.7Maybe milestone May 4, 2016
@gpaul
Copy link
Contributor Author

gpaul commented May 4, 2016

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63873
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63873
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Sometimes strace shows failure with ENOMEM, other times I just see SIGKILLs as the oom-killer steps in.

dmesg shows the following (oomkiller)

[Wed May  4 17:04:11 2016] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
...
[Wed May  4 17:04:11 2016] [ 9770]     0  9770    93361    26797      90       3        0             0 perf
[Wed May  4 17:04:11 2016] [13022]  1000 13022     1828       92       9       3        0             0 strace
[Wed May  4 17:04:11 2016] [13024]  1000 13024    75524     2265      36       6        0             0 go
[Wed May  4 17:04:11 2016] [13123]  1000 13123   661874   518769    1140       7        0             0 compile
[Wed May  4 17:04:11 2016] [13131]  1000 13131   662488   521653    1146       7        0             0 compile
[Wed May  4 17:04:11 2016] [13132]  1000 13132   597042   437581    1005       7        0             0 compile
[Wed May  4 17:04:11 2016] [13143]  1000 13143   685774   497712    1099       7        0             0 compile
[Wed May  4 17:04:11 2016] [13144]  1000 13144   588702   389245     897       7        0             0 compile
[Wed May  4 17:04:11 2016] [13154]  1000 13154   661903   522499    1170       7        0             0 compile
[Wed May  4 17:04:11 2016] [13157]  1000 13157   661655   518735    1141       8        0             0 compile
[Wed May  4 17:04:11 2016] [14667]  1000 14667    30977    27699      64       5        0             0 compile
[Wed May  4 17:04:11 2016] [14713]  1000 14713    91364     1211      30       5        0             0 go
[Wed May  4 17:04:11 2016] Out of memory: Kill process 13154 (compile) score 127 or sacrifice child
[Wed May  4 17:04:11 2016] Killed process 13154 (compile) total-vm:2647612kB, anon-rss:2089996kB, file-rss:0kB

@gpaul
Copy link
Contributor Author

gpaul commented May 4, 2016

Specifically, strace sometimes shows the following kind of failure:

17263 mmap(0xc8a5dc0000, 256507904, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
17263 mincore(0xc8a5dc0000, 4096, 0xb104a9) = -1 ENOMEM (Cannot allocate memory)
17263 mincore(0xc8a5dc1000, 4096, 0xb104a9) = -1 ENOMEM (Cannot allocate memory)
17263 mincore(0xc8a5dc2000, 4096, 0xb104a9) = -1 ENOMEM (Cannot allocate memory)
17263 mincore(0xc8a5dc3000, 4096, 0xb104a9) = -1 ENOMEM (Cannot allocate memory)
...

@gpaul
Copy link
Contributor Author

gpaul commented May 4, 2016

The output of ps ax is given below. Watching these with htop shows them growing rapidly to the point where they use upwards of 1GB each and finally trigger the oom-killer.

    17225 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test -complete -buildid 1bb3edbfa185d998f7210687c649cc52b6422a76 -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/thetest.pb.go /home/gustav/gopath2/src/github.com/gogo/protobuf/test/uuid.go
    17226 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/both.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/both -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/both -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/both/casttype.pb.go
    17236 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/marshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/marshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/marshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/marshaler/casttype.pb.go
    17238 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/neither.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/neither -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/neither -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/neither/casttype.pb.go
    17242 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unmarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unmarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unmarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unmarshaler/casttype.pb.go
    17248 pts/1    Sl+    0:09 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafeboth.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafeboth -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeboth -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeboth/casttype.pb.go
    17253 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler/casttype.pb.go
    17254 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler/casttype.pb.go

@kostya-sh
Copy link
Contributor

Possible reason: the generated casttype.pb.go files have lots (1000+) anonymous function invocations like this:

   {Name: func(v string) *string { return &v }("google/protobuf/descriptor.proto"),
    Package: func(v string) *string { return &v }("google.protobuf"),
    MessageType: []*descriptor.DescriptorProto{{Name: func(v string) *string { return &v }("FileDescriptorSet"),
        Field: []*descriptor.FieldDescriptorProto{{Name: func(v string) *string { return &v }("file"),
            Number:   func(v int32) *int32 { return &v }(1),
            Label:    func(v descriptor.FieldDescriptorProto_Label) *descriptor.FieldDescriptorProto_Label { return &v }(3),
            Type:     func(v descriptor.FieldDescriptorProto_Type) *descriptor.FieldDescriptorProto_Type { return &v }(11),
            TypeName: func(v string) *string { return &v }(".google.protobuf.FileDescriptorProto"),
            JsonName: func(v string) *string { return &v }("file"),
        }},
    }

Memprofile shows:

  299.64MB 36.26% 36.26%   300.37MB 36.35%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc
  195.68MB 23.68% 59.94%   342.44MB 41.44%  cmd/compile/internal/ssa.(*stackAllocState).init
  146.77MB 17.76% 77.70%   146.77MB 17.76%  cmd/compile/internal/ssa.(*Func).newSparseSet
  146.76MB 17.76% 95.46%   146.76MB 17.76%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph

@josharian
Copy link
Contributor

Thanks for profiling. If you have the binary and profile still around, will you re-run pprof with the -lines argument added as well? Thanks!

@davecheney
Copy link
Contributor

Does your machine had any swap partition or file allocated? Even though you may have huge amounts of physical memory available, Linux behaves very conservatively if no swap is available.

On 4 May 2016, at 23:34, Gustav Paul notifications@github.com wrote:

The output of ps ax is given below. Watching these with htop shows them growing rapidly to the point where they use upwards of 1GB each and finally trigger the oom-killer.

17225 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test -complete -buildid 1bb3edbfa185d998f7210687c649cc52b6422a76 -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/thetest.pb.go /home/gustav/gopath2/src/github.com/gogo/protobuf/test/uuid.go
17226 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/both.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/both -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/both -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/both/casttype.pb.go
17236 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/marshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/marshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/marshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/marshaler/casttype.pb.go
17238 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/neither.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/neither -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/neither -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/neither/casttype.pb.go
17242 pts/1    Rl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unmarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unmarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unmarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unmarshaler/casttype.pb.go
17248 pts/1    Sl+    0:09 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafeboth.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafeboth -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeboth -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeboth/casttype.pb.go
17253 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafemarshaler/casttype.pb.go
17254 pts/1    Sl+    0:10 /home/gustav/go/pkg/tool/linux_amd64/compile -o /dev/shm/go-build625944181/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler.a -trimpath /dev/shm/go-build625944181 -p github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler -complete -buildid 0dec27d7c21ec48f91578f15d0f446edc5ee76aa -D _/home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler -I /dev/shm/go-build625944181 -I /home/gustav/gopath2/pkg/linux_amd64 -pack /home/gustav/gopath2/src/github.com/gogo/protobuf/test/casttype/combos/unsafeunmarshaler/casttype.pb.go


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub

@josharian
Copy link
Contributor

I just reproduced here. (Not the killing, but high memory usage.) I'm investigating, although no promises. An initial observation is that we're generating way too many OpCopies during SSA construction.

@randall77
Copy link
Contributor

@dr2chase : if this is phi insertion your CL might help.

@josharian
Copy link
Contributor

I think I have a simpler fix. Hang tight.

@dr2chase
Copy link
Contributor

dr2chase commented May 4, 2016

Sparse-phi code looks like a big help. On my laptop (8GB, OS X) it appears to max out at 4 simultaneous compiles each over 2G if sparse-phi is activated, but it runs to completion with only a brief instant of red in the system memory monitor.

With the sparse-phi code deactivated, at an earlier stage one of the compiles grows to 6.7G (at least -- I killed it, it did this twice in a row so it is repeatable enough for me) alongside three others, one of those over 2G and two over 1G, and the system memory monitor goes red and sticks there.

@gopherbot
Copy link

CL https://golang.org/cl/22790 mentions this issue.

@josharian
Copy link
Contributor

@dr2chase if I want to benchmark the sparse phi change locally, do I just patch in the CL (which again?) or do I also need to fiddle with any settings? I'm curious how my little optimization compares.

@kostya-sh
Copy link
Contributor

Not sure if this is still useful but here are profile results with -line. CL 22790 helps a lot:

Before the CL:

(pprof) top
796.84MB of 826.39MB total (96.42%)
Dropped 567 nodes (cum <= 4.13MB)
Showing top 10 nodes out of 30 (cum >= 790.81MB)
      flat  flat%   sum%        cum   cum%
  244.60MB 29.60% 29.60%   244.60MB 29.60%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc src/cmd/compile/internal/ssa/stackalloc.go:134
  195.68MB 23.68% 53.28%   195.68MB 23.68%  cmd/compile/internal/ssa.(*stackAllocState).init src/cmd/compile/internal/ssa/stackalloc.go:103
  146.77MB 17.76% 71.04%   146.77MB 17.76%  cmd/compile/internal/ssa.(*Func).newSparseSet src/cmd/compile/internal/ssa/func.go:64
  146.76MB 17.76% 88.80%   146.76MB 17.76%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph src/cmd/compile/internal/ssa/stackalloc.go:361
   48.92MB  5.92% 94.72%    48.92MB  5.92%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc src/cmd/compile/internal/ssa/stackalloc.go:170

After the CL:

(pprof) top
164.39MB of 181.64MB total (90.50%)
Dropped 417 nodes (cum <= 0.91MB)
Showing top 10 nodes out of 178 (cum >= 2.31MB)
      flat  flat%   sum%        cum   cum%
   44.47MB 24.48% 24.48%    44.47MB 24.48%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc src/cmd/compile/internal/ssa/stackalloc.go:134
   35.57MB 19.58% 44.06%    35.57MB 19.58%  cmd/compile/internal/ssa.(*stackAllocState).init src/cmd/compile/internal/ssa/stackalloc.go:103
   26.70MB 14.70% 58.76%    26.70MB 14.70%  cmd/compile/internal/ssa.(*Func).newSparseSet src/cmd/compile/internal/ssa/func.go:64
   26.68MB 14.69% 73.45%    26.68MB 14.69%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph src/cmd/compile/internal/ssa/stackalloc.go:361

@dr2chase
Copy link
Contributor

dr2chase commented May 4, 2016

Your optimization helps a lot on Dave Cheney's types.go compilation benchmark, and helps a good bit on this one but the sparse phi change does notably better. I need to check one other slowed-way-down build.

If you patch in the CL ( https://go-review.googlesource.com/#/c/22342/ ) there is a cutover point based on program size (product of #vars and #blocks) that is normally 2,500,000 but you can change it on the command line.

export GO_SSA_PHI_LOC_CUTOFF=0  # enables sparse for all >= 0
export GO_SSA_PHI_LOC_CUTOFF=-1 # disables sparse completely

@dr2chase
Copy link
Contributor

dr2chase commented May 4, 2016

This is the other one: #14934

@dr2chase
Copy link
Contributor

dr2chase commented May 5, 2016

Checking against the 38x slowdown -- your change gets most of it, (down to 15s user on my laptop) sparse manages to get one second faster, not enough to matter for that one. This is actually the only one I've seen where it seems to make a real difference.

Keith was wondering if you ought to decorate s.defvar for some of the blocks on your recursive walk. Not nearly all, but enough to prevent quadratic badness if you had a bunch of uses far from the first def. Say, you could iterate instead of recurse, count as you go, and then stick a def halfway to the end of the iteration if it was more than 3ish.

@josharian
Copy link
Contributor

Thanks, @dr2chase. It's probable that both the sparse phi and this optimization should go in; this one is good for small and general cases, and sparse is good for large ones. I'm going to run some more benchmarks (last time was low iterations and still had browser etc running) and take one more look.

I'm not 100% sure I follow the decoration suggestion, but I'll read it again tomorrow. (Off to take my kid to the playground while some benchmarks run.) I will observe that we were actually generating O(n^2) values to avoid an O(n^2) recursion, so we are going from expensive quadratric to cheap quadratic. If you want things in sooner for freeze reasons, feel free to just submit my change and build atop it.

Oh, and reminder to self: There's also still some stackAlloc to look into here. :)

@josharian
Copy link
Contributor

Ok, I think I understand the decoration suggestion. I don't think halfway if > 3 is the right answer--I suspect something more like every 100 iterations is probably going to yield better results. I will do some experimenting. I have a nice, simple benchmark case to use for this:

package p

func f() []*int {
    x := [...]*int{
        func(v int) *int { return &v }(1),
        func(v int) *int { return &v }(2),
        func(v int) *int { return &v }(3),
        func(v int) *int { return &v }(4),
        func(v int) *int { return &v }(5),
        func(v int) *int { return &v }(6),
        func(v int) *int { return &v }(7),
        func(v int) *int { return &v }(8),
                // and so on, as desired to create quadratic behavior
    }
    return x[:]
}

(Note to self: Why aren't these getting inlined, anyway?)

gopherbot pushed a commit that referenced this issue May 5, 2016
If b has exactly one predecessor, as happens
frequently with static calls, we can make
lookupVarOutgoing generate less garbage.

Instead of generating a value that is just
going to be an OpCopy and then get eliminated,
loop. This can lead to lots of looping.
However, this loop is way cheaper than generating
lots of ssa.Values and then eliminating them.

For a subset of the code in #15537:

Before:

       28.31 real        36.17 user         1.68 sys
2282450944  maximum resident set size

After:

        9.63 real        11.66 user         0.51 sys
 638144512  maximum resident set size

Updates #15537.

Excitingly, it appears that this also helps
regular code:

name       old time/op      new time/op      delta
Template        288ms ± 6%       276ms ± 7%   -4.13%        (p=0.000 n=21+24)
Unicode         143ms ± 8%       141ms ±10%     ~           (p=0.287 n=24+25)
GoTypes         932ms ± 4%       874ms ± 4%   -6.20%        (p=0.000 n=23+22)
Compiler        4.89s ± 4%       4.58s ± 4%   -6.46%        (p=0.000 n=22+23)
MakeBash        40.2s ±13%       39.8s ± 9%     ~           (p=0.648 n=23+23)

name       old user-ns/op   new user-ns/op   delta
Template   388user-ms ±10%  373user-ms ± 5%   -3.80%        (p=0.000 n=24+25)
Unicode    203user-ms ± 6%  202user-ms ± 7%     ~           (p=0.492 n=22+24)
GoTypes    1.29user-s ± 4%  1.17user-s ± 4%   -9.67%        (p=0.000 n=25+23)
Compiler   6.86user-s ± 5%  6.28user-s ± 4%   -8.49%        (p=0.000 n=25+25)

name       old alloc/op     new alloc/op     delta
Template       51.5MB ± 0%      47.6MB ± 0%   -7.47%        (p=0.000 n=22+25)
Unicode        37.2MB ± 0%      37.1MB ± 0%   -0.21%        (p=0.000 n=25+25)
GoTypes         166MB ± 0%       138MB ± 0%  -16.83%        (p=0.000 n=25+25)
Compiler        756MB ± 0%       628MB ± 0%  -16.96%        (p=0.000 n=25+23)

name       old allocs/op    new allocs/op    delta
Template         450k ± 0%        445k ± 0%   -1.02%        (p=0.000 n=25+25)
Unicode          356k ± 0%        356k ± 0%     ~           (p=0.374 n=24+25)
GoTypes         1.31M ± 0%       1.25M ± 0%   -4.18%        (p=0.000 n=25+25)
Compiler        5.29M ± 0%       5.02M ± 0%   -5.15%        (p=0.000 n=25+23)

It also seems to help in other cases in which
phi insertion is a pain point (#14774, #14934).

Change-Id: Ibd05ed7b99d262117ece7bb250dfa8c3d1cc5dd2
Reviewed-on: https://go-review.googlesource.com/22790
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
@dr2chase
Copy link
Contributor

dr2chase commented May 6, 2016

With Josh's fix, and sparse layered on top of that, the big memory pigs seem to be in register and stack allocation:

regalloc:

   1280     448.16MB   448.16MB             s.spillLive[b.ID] = append(s.spillLive[b.ID], spill.ID) 
   2170     345.39MB   345.39MB                     l = make([]liveInfo, 0, t.size()) 

This is a typical memory profile from one of the fattest compilations.

2530.22MB of 2766.97MB total (91.44%)
Dropped 356 nodes (cum <= 13.83MB)
Showing top 30 nodes out of 123 (cum >= 17.13MB)
      flat  flat%   sum%        cum   cum%
  497.29MB 17.97% 17.97%   951.42MB 34.38%  cmd/compile/internal/ssa.(*regAllocState).regalloc
  389.79MB 14.09% 32.06%   401.64MB 14.52%  cmd/compile/internal/ssa.(*regAllocState).computeLive
  216.63MB  7.83% 39.89%   216.63MB  7.83%  cmd/compile/internal/ssa.(*Func).newValue
  124.35MB  4.49% 44.38%   130.05MB  4.70%  cmd/compile/internal/ssa.(*stackAllocState).computeLive
  114.73MB  4.15% 48.53%   311.30MB 11.25%  cmd/compile/internal/gc.(*state).lookupVarOutgoing
  104.51MB  3.78% 52.31%   104.51MB  3.78%  cmd/compile/internal/gc.Nod
   97.95MB  3.54% 55.85%   499.59MB 18.06%  cmd/compile/internal/ssa.(*regAllocState).init
   90.50MB  3.27% 59.12%    90.50MB  3.27%  cmd/compile/internal/ssa.(*edgeState).set
   86.55MB  3.13% 62.24%    87.05MB  3.15%  cmd/compile/internal/ssa.cse
   86.08MB  3.11% 65.36%    86.08MB  3.11%  cmd/compile/internal/ssa.(*node32).insert
   81.10MB  2.93% 68.29%    81.10MB  2.93%  cmd/compile/internal/ssa.schedule
   78.81MB  2.85% 71.14%    82.37MB  2.98%  cmd/compile/internal/ssa.(*stackAllocState).stackalloc
   70.90MB  2.56% 73.70%    70.90MB  2.56%  cmd/compile/internal/ssa.(*stackAllocState).buildInterferenceGraph
   51.51MB  1.86% 75.56%    51.51MB  1.86%  cmd/compile/internal/gc.newliveness
   49.70MB  1.80% 77.36%   250.65MB  9.06%  cmd/compile/internal/ssa.(*stackAllocState).init
   48.81MB  1.76% 79.12%    48.81MB  1.76%  cmd/internal/obj.(*LSym).Grow
   37.29MB  1.35% 80.47%    37.29MB  1.35%  cmd/compile/internal/ssa.(*Func).newSparseSet
   37.01MB  1.34% 81.80%   169.70MB  6.13%  cmd/compile/internal/gc.(*state).locatePotentialPhiFunctions
   30.10MB  1.09% 82.89%    30.10MB  1.09%  cmd/compile/internal/ssa.(*Func).setHome
   29.32MB  1.06% 83.95%    29.32MB  1.06%  cmd/compile/internal/ssa.tighten
   26.01MB  0.94% 84.89%    26.01MB  0.94%  cmd/compile/internal/gc.Prog
   25.44MB  0.92% 85.81%    25.44MB  0.92%  cmd/compile/internal/ssa.liveValues
      24MB  0.87% 86.68%       24MB  0.87%  cmd/compile/internal/gc.typ
   20.51MB  0.74% 87.42%   115.10MB  4.16%  cmd/compile/internal/ssa.(*SparseTreeMapper).Insert
   20.50MB  0.74% 88.16%    20.50MB  0.74%  cmd/compile/internal/ssa.critical
   19.92MB  0.72% 88.88%    19.92MB  0.72%  cmd/compile/internal/gc.livenessepilogue
      18MB  0.65% 89.53%       18MB  0.65%  cmd/compile/internal/gc.newblock
   17.95MB  0.65% 90.18%    17.95MB  0.65%  cmd/compile/internal/gc.Naddr

The (*Func).newValue consumption is still driven by lookupVarOutgoing:

   4057            .   196.58MB     v := b.NewValue0A(line, ssa.OpFwdRef, t, name) 
   4058            .          .     s.fwdRefs = append(s.fwdRefs, v) 
   4059     114.73MB   114.73MB     s.defvars[b.ID][name] = v 

@gpaul
Copy link
Contributor Author

gpaul commented May 7, 2016

Thank you guys - the next hurdle seems to be the compilation of
https://github.com/gogo/protobuf/blob/master/test/combos/both/thetest.pb.go

@josharian
Copy link
Contributor

@dr2chase thinking out loud, I wonder whether instead of storing liveness as block+value pairs (which can grow quadratically), we could store it as value+[]range where range is an sdom entry/exit pair representing a subset of the CFG. In theory this could also be quadratic, if a value pops in and out of liveness, but that seems less likely.

@dr2chase
Copy link
Contributor

dr2chase commented May 8, 2016

@josharian Maybe. I tried plain first+last postorder numbers for stack allocation (this is essentially linear scan) and didn't get happy results, but couldn't tell if I had made some mistake with self-conflict (there are issues with loop phi functions and uses transmitted through backedges).

Can you give the sparse CL a look for style/explanation etc?
https://go-review.googlesource.com/#/c/22342/

@gpaul, what's the exact recipe for that example? I'm here:

/tmp/gopath/src/github.com/gogo/protobuf/test/combos/both$ 

and I see (using sparse CL still under review):

/tmp/gopath/src/github.com/gogo/protobuf/test/combos/both$ GO_SSA_PHI_LOC_CUTOFF=0
/tmp/gopath/src/github.com/gogo/protobuf/test/combos/both$ time go test .
ok      github.com/gogo/protobuf/test/combos/both   0.168s

real    0m46.232s
user    0m49.512s
sys 0m1.726s
/tmp/gopath/src/github.com/gogo/protobuf/test/combos/both$ GO_SSA_PHI_LOC_CUTOFF=-1
/tmp/gopath/src/github.com/gogo/protobuf/test/combos/both$ time go test .
ok      github.com/gogo/protobuf/test/combos/both   0.183s

real    1m20.361s
user    1m26.718s
sys 0m3.962s

There's a huge difference in memory allocation between the two, though still plenty of room for additional improvement in register/stack allocation.

@josharian
Copy link
Contributor

@dr2chase will take a look at sparse phi CL on Monday. (Mother's Day today.) Sorry that I've been a delinquent reviewer in general. Juggling too many things.

A few other random, somewhat related thoughts:

The self-conflict bit is a fly in the ointment indeed. On my simple benchmark case in the comment above, most of the allocation is the two append statements building s.interfere, so I looked into representing interference as sets of equivalence classes. I know that the interferes relationship is not transitive in general, but it is frequently, so I was going to model it with multiple equivalence classes. The interferes relationship is also not reflexive, but I was going to assume it was anti-reflexive and compensate...boom. :(

I noticed in the list of SSA stack size regressions that most of the really big increases are init functions, which generally look a lot like this code--building a large composite literal. But the components of a composite literal can always be computed independently of each other. I wonder whether a better short term fix might be teaching the compiler that fact. We still need to fix the quadratic behavior, but reducing spills from composite literals would (a) generate better code anyway and (b) whack a whole class of problem cases.

It scares me that Type is used as a map key in stackalloc, given that everywhere else we use .Compare == CMPeq for equality. Given that, I wonder whether we should use == for the related cases in stackalloc. For 1.8, I'll plan switch us to using the GC pattern instead (onebitwalktype). That'll make the interference sets bigger, though, exacerbating the performance problems here.

@josharian
Copy link
Contributor

Scratch the comment about composite literal independence. The evaluation of the components may have side-effects. Duh. There's still some independence, in that we can safely write each component into the final destination as calculated, but I'm not sure that that is enough to help.

@gpaul
Copy link
Contributor Author

gpaul commented May 9, 2016

@dr2chase: The sparse phi CL let's the combos/both package compile in isolation. Building its siblings concurrently still leads to OOM issues with 8 cores and 16GB of memory:

go install -v github.com/gogo/protobuf/test/theproto3/combos/...

@dr2chase
Copy link
Contributor

dr2chase commented May 9, 2016

I hate to say "works for me" but it actually does, 4 cores and 8GB, OSX, I did shut down Chrome first:
time go get -v -u github.com/gogo/protobuf/test/theproto3/combos/...
takes about 13 user minutes, 5,15s real.

OSX is playing games with compressed memory, not sure what that means, but this is a screen shot taken seconds before the memory consumption peak:

screen shot 2016-05-09 at 11 15 03 am

I'll see what I can do to get the sparse CL in -- it may still need polish -- but I'm not sure what else can be done in 1.7.

@josharian
Copy link
Contributor

@dr2chase again thinking out loud, if we can predict/detect when liveness info has stabilized for a block, we could interleave liveness and interference for stackalloc, since interference works arbitrary block-at-a-time, which would (probably) allow us to store less liveness info. Reactions?

@gpaul
Copy link
Contributor Author

gpaul commented May 9, 2016

@dr2chase 'works for me' is totally valid. It's open source after all. Unfortunately I'm not so lucky. For this specific issue we may just remove the gogoprotobuf ./test packages from our repo. It seems like a pretty good test for the compiler and since it's quite a popular library it would be great if you and the rest of the compiler team would continue working towards pre-SSA memory usage and compilation times for it.

We have several large swaths of generated code internally which I haven't tried to compile yet. If they trip the compiler we'll extract some test cases early in the 1.8 cycle.

The performance improvements we're seeing with at least one high-performance component compiled with tip are pretty amazing (~30% improvements over 1.6) so I hope we can figure something out.

@gopherbot
Copy link

CL https://golang.org/cl/22342 mentions this issue.

josharian added a commit to josharian/go that referenced this issue May 11, 2016
For golang#15537

Compiling github.com/gogo/protobuf/test/combos/both

Before:

       69.83 real        82.72 user         2.01 sys
4206456832  maximum resident set size

After:

       62.94 real        74.74 user         1.92 sys
3832225792  maximum resident set size


name       old time/op      new time/op      delta
Template        279ms ± 4%       277ms ± 4%    ~           (p=0.529 n=20+20)
Unicode         142ms ±10%       143ms ± 6%    ~           (p=0.461 n=20+20)
GoTypes         850ms ± 1%       846ms ± 2%    ~           (p=0.175 n=19+20)
Compiler        4.43s ± 2%       4.38s ± 1%  -1.12%        (p=0.000 n=19+20)

name       old user-ns/op   new user-ns/op   delta
Template   376user-ms ± 4%  378user-ms ± 3%    ~           (p=0.755 n=19+20)
Unicode    203user-ms ± 3%  205user-ms ± 8%    ~           (p=0.314 n=15+20)
GoTypes    1.15user-s ± 2%  1.15user-s ± 3%    ~           (p=0.799 n=20+20)
Compiler   6.14user-s ± 3%  6.12user-s ± 2%    ~           (p=0.547 n=20+20)

name       old alloc/op     new alloc/op     delta
Template       47.6MB ± 0%      46.9MB ± 0%  -1.41%        (p=0.000 n=19+20)
Unicode        37.1MB ± 0%      37.1MB ± 0%    ~           (p=0.134 n=20+20)
GoTypes         138MB ± 0%       136MB ± 0%  -1.65%        (p=0.000 n=20+20)
Compiler        632MB ± 0%       614MB ± 0%  -2.82%        (p=0.000 n=20+20)

name       old allocs/op    new allocs/op    delta
Template         445k ± 0%        444k ± 0%  -0.17%        (p=0.000 n=20+20)
Unicode          356k ± 0%        356k ± 0%    ~           (p=0.204 n=20+20)
GoTypes         1.25M ± 0%       1.25M ± 0%  -0.41%        (p=0.000 n=20+19)
Compiler        5.03M ± 0%       4.99M ± 0%  -0.87%        (p=0.000 n=19+20)


Change-Id: Iee7b56ede4c319e4194160502f0796f478a9b73f
@gopherbot
Copy link

CL https://golang.org/cl/23211 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/24904 mentions this issue.

josharian added a commit to josharian/go that referenced this issue Aug 31, 2016
For golang#15537

Compiling github.com/gogo/protobuf/test/combos/both

Before:

       69.83 real        82.72 user         2.01 sys
4206456832  maximum resident set size

After:

       62.94 real        74.74 user         1.92 sys
3832225792  maximum resident set size


name       old time/op      new time/op      delta
Template        279ms ± 4%       277ms ± 4%    ~           (p=0.529 n=20+20)
Unicode         142ms ±10%       143ms ± 6%    ~           (p=0.461 n=20+20)
GoTypes         850ms ± 1%       846ms ± 2%    ~           (p=0.175 n=19+20)
Compiler        4.43s ± 2%       4.38s ± 1%  -1.12%        (p=0.000 n=19+20)

name       old user-ns/op   new user-ns/op   delta
Template   376user-ms ± 4%  378user-ms ± 3%    ~           (p=0.755 n=19+20)
Unicode    203user-ms ± 3%  205user-ms ± 8%    ~           (p=0.314 n=15+20)
GoTypes    1.15user-s ± 2%  1.15user-s ± 3%    ~           (p=0.799 n=20+20)
Compiler   6.14user-s ± 3%  6.12user-s ± 2%    ~           (p=0.547 n=20+20)

name       old alloc/op     new alloc/op     delta
Template       47.6MB ± 0%      46.9MB ± 0%  -1.41%        (p=0.000 n=19+20)
Unicode        37.1MB ± 0%      37.1MB ± 0%    ~           (p=0.134 n=20+20)
GoTypes         138MB ± 0%       136MB ± 0%  -1.65%        (p=0.000 n=20+20)
Compiler        632MB ± 0%       614MB ± 0%  -2.82%        (p=0.000 n=20+20)

name       old allocs/op    new allocs/op    delta
Template         445k ± 0%        444k ± 0%  -0.17%        (p=0.000 n=20+20)
Unicode          356k ± 0%        356k ± 0%    ~           (p=0.204 n=20+20)
GoTypes         1.25M ± 0%       1.25M ± 0%  -0.41%        (p=0.000 n=20+19)
Compiler        5.03M ± 0%       4.99M ± 0%  -0.87%        (p=0.000 n=19+20)


Change-Id: Iee7b56ede4c319e4194160502f0796f478a9b73f
josharian added a commit to josharian/go that referenced this issue Sep 13, 2016
For golang#15537

Compiling github.com/gogo/protobuf/test/combos/both

Before:

       69.83 real        82.72 user         2.01 sys
4206456832  maximum resident set size

After:

       62.94 real        74.74 user         1.92 sys
3832225792  maximum resident set size


name       old time/op      new time/op      delta
Template        279ms ± 4%       277ms ± 4%    ~           (p=0.529 n=20+20)
Unicode         142ms ±10%       143ms ± 6%    ~           (p=0.461 n=20+20)
GoTypes         850ms ± 1%       846ms ± 2%    ~           (p=0.175 n=19+20)
Compiler        4.43s ± 2%       4.38s ± 1%  -1.12%        (p=0.000 n=19+20)

name       old user-ns/op   new user-ns/op   delta
Template   376user-ms ± 4%  378user-ms ± 3%    ~           (p=0.755 n=19+20)
Unicode    203user-ms ± 3%  205user-ms ± 8%    ~           (p=0.314 n=15+20)
GoTypes    1.15user-s ± 2%  1.15user-s ± 3%    ~           (p=0.799 n=20+20)
Compiler   6.14user-s ± 3%  6.12user-s ± 2%    ~           (p=0.547 n=20+20)

name       old alloc/op     new alloc/op     delta
Template       47.6MB ± 0%      46.9MB ± 0%  -1.41%        (p=0.000 n=19+20)
Unicode        37.1MB ± 0%      37.1MB ± 0%    ~           (p=0.134 n=20+20)
GoTypes         138MB ± 0%       136MB ± 0%  -1.65%        (p=0.000 n=20+20)
Compiler        632MB ± 0%       614MB ± 0%  -2.82%        (p=0.000 n=20+20)

name       old allocs/op    new allocs/op    delta
Template         445k ± 0%        444k ± 0%  -0.17%        (p=0.000 n=20+20)
Unicode          356k ± 0%        356k ± 0%    ~           (p=0.204 n=20+20)
GoTypes         1.25M ± 0%       1.25M ± 0%  -0.41%        (p=0.000 n=20+19)
Compiler        5.03M ± 0%       4.99M ± 0%  -0.87%        (p=0.000 n=19+20)


Change-Id: Iee7b56ede4c319e4194160502f0796f478a9b73f
@golang golang locked and limited conversation to collaborators Jul 13, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants