Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/cmd/buildlet: remaining processes not killed #29319

Open
Helflym opened this issue Dec 18, 2018 · 3 comments
Open

x/build/cmd/buildlet: remaining processes not killed #29319

Helflym opened this issue Dec 18, 2018 · 3 comments
Labels
Builders x/build issues (builders, bots, dashboards)
Milestone

Comments

@Helflym
Copy link
Contributor

Helflym commented Dec 18, 2018

When buildlet times out, some processes remain in background and they need to be killed manually.
It seems that buildlet only kills its first children and not all its offsprings.

It occurs on aix/ppc64 builder. But I think it's a more general issue.

Here are some processes remaining after builds had failed on aix/ppc64.
cmd/buildlet seems to only kill "go tool dist test ... "

    root 15008262        1   0 14:43:03      -  0:00 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/dist test --no-rebuild --banner=XXXBANNERXXX: test:1_10
    root  7930506 15008262   0 14:43:06      -  0:01 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/runtest.exe --shard=1 --shards=10
    root 15860332  7930506   0 14:43:15      -  0:00 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/bin/go run -gcflags= fixedbugs/issue4667.go
    root 16449854 15860332 117 14:43:16      - 244:33 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/link -o /ramdisk8GB/workdir-host-aix-ppc64-osuosl/tmp/go-build817241358/b001/exe/issue4667 -importcfg /ramdisk8GB/workdir-host-aix-ppc64-osuosl/tmp/go-build817241358/b001/importcfg.link -s -w -buildmode=exe -buildid=0SI_pU16NxBjYDL8rkvo/L9t_iML1T1ahnSRDhIh8/1p0X_Ta401ybIH1jFEpb/0SI_pU16NxBjYDL8rkvo -extld=gcc /ramdisk8GB/workdir-host-aix-ppc64-osuosl/tmp/go-build817241358/b001/_pkg_.a

    root  4391522        1   0 03:47:21      -  0:00 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/dist test --no-rebuild --banner=XXXBANNERXXX: test:3_10
    root 10158466  4391522   0 03:47:24      -  0:01 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/runtest.exe --shard=3 --shards=10
    root 10748370 10158466 356 03:47:34      - 1877:52 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/bin/go build -gcflags -S=2 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/codegen/memcombine.go


    root 12583384        1   0 14:07:27      -  0:00 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/dist test --no-rebuild --banner=XXXBANNERXXX: test:2_10
    root 20709728 12583384   0 14:07:29      -  0:01 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/runtest.exe --shard=2 --shards=10
    root 14418236 20709728   0 14:07:35      -  0:00 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/bin/go tool compile -e /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/fixedbugs/bug151.go
    root 17760658 14418236 469 14:07:35      - 1051:28 /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/compile -e /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/test/fixedbugs/bug151.go

Looking at the code, killProcessTree is called when the builder is killed by the coordinator (right?).
However, in my understanding, it only kills one process and not the whole process tree (cf https://github.com/golang/build/blob/master/cmd/buildlet/buildlet.go#L1590).
I think it should call syscall.Kill with -p.Pid in order to kill the whole process group.

Maybe it's a duplicate of #15778.

@gopherbot gopherbot added this to the Unreleased milestone Dec 18, 2018
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Dec 18, 2018
@Helflym Helflym changed the title x/build/cmd/buildlet x/build/cmd/buildlet: remaining processes Dec 18, 2018
@Helflym Helflym changed the title x/build/cmd/buildlet: remaining processes x/build/cmd/buildlet: remaining processes not killed Dec 18, 2018
@bradfitz
Copy link
Contributor

I always forget how process groups & stuff work. So syscall.Kill is POSIX kill and POSIX kill with negative pid is:

If pid is negative, but not -1, sig shall be sent to all processes (excluding an unspecified set of system processes) whose process group ID is equal to the absolute value of pid, and for which the process has permission to send a signal.

Is this the behavior of everywhere where Go runs? Or do we need to do per-GOOS behavior?

Does the buildlet currently run child processes in their own process group? It probably should if not?

/cc @dmitshur @ianlancetaylor

@ianlancetaylor
Copy link
Contributor

That is how syscall.Kill works on all Unix systems. No idea about Windows.

I don't know what the buildlet does today but it can start processes in their own process group by setting cmd.SysProcAttr.Setpgid. That should work on all Unix systems but not Windows. The pgid will then be cmd.Process.Pid, and you will want to pass the negative of that value to syscall.Kill.

(There are a few tests that themselves fiddled with the pgid of their own child processes, but it seems reasonable to not worry about those. We could handle those on GNU/Linux by setting up a PID namespace, but it hardly seems worth doing.)

@alexbrainman
Copy link
Member

That should work on all Unix systems but not Windows.

I looked at Windows code

https://github.com/golang/build/blob/57258c564e6f790b19d257469cced6b6ab47f38c/cmd/buildlet/buildlet_windows.go#L88

and it looks like it manually finds all main process children and kill them all.

It is not perfect solution (there is a race there), but I think it is good enough. It would be nice to use Windows Job Objects https://godoc.org/github.com/alexbrainman/ps here, but that will affect the tests that buildlet runs. We do not want that.

Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards)
Projects
None yet
Development

No branches or pull requests

5 participants