Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: exec.Cmd fails to cancel with non-*os.File outputs on linux #18874

Closed
jbardin opened this issue Jan 31, 2017 · 31 comments
Closed

os/exec: exec.Cmd fails to cancel with non-*os.File outputs on linux #18874

jbardin opened this issue Jan 31, 2017 · 31 comments
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jbardin
Copy link
Contributor

jbardin commented Jan 31, 2017

go1.7, go1.8rc3

When using a Context to cancel an exec.Cmd, if either of the outputs are an io.Writer which isn't an *os.File the command will fail to terminate.

The following panics on amd64 linux

	ctx, cancel := context.WithCancel(context.Background())
	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "sleep 60")

	var output bytes.Buffer
	cmd.Stdout = &output

	cmd.Start()
	log.Println("started command")

	done := make(chan error)
	go func() {
		done <- cmd.Wait()
	}()

	time.Sleep(50 * time.Millisecond)
	log.Println("canceling command")
	cancel()

	select {
	case err := <-done:
		log.Println("done:", err)
	case <-time.After(time.Second):
		panic("failed to cancel")

The full stack trace is:

main.main()
	/home/user/command_context.go:35 +0x499

goroutine 5 [syscall]:
syscall.Syscall(0x0, 0x4, 0xc420092000, 0x200, 0x7f15a35f4000, 0x19, 0xc420092000)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x4, 0xc420092000, 0x200, 0x200, 0xc420026400, 0x7f15a35f4000, 0x0)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:783 +0x55
syscall.Read(0x4, 0xc420092000, 0x200, 0x200, 0x425b5f, 0x4dd550, 0xc420023dc0)
	/usr/local/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c038, 0xc420092000, 0x200, 0x200, 0x4602ab, 0x4b38e0, 0x200)
	/usr/local/go/src/os/file_unix.go:165 +0x4f
os.(*File).Read(0xc42000c038, 0xc420092000, 0x200, 0x200, 0x0, 0x7f15a3598028, 0xc420023e60)
	/usr/local/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420048070, 0x5381e0, 0xc42000c038, 0x7f15a3598028, 0xc420048070, 0x1)
	/usr/local/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x538120, 0xc420048070, 0x5381e0, 0xc42000c038, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:384 +0x2cb
io.Copy(0x538120, 0xc420048070, 0x5381e0, 0xc42000c038, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc42007e000, 0xc42000a2a0)
	/usr/local/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
	/usr/local/go/src/os/exec/exec.go:372 +0x4e4

goroutine 7 [chan receive]:
os/exec.(*Cmd).Wait(0xc42007e000, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:443 +0x118
main.main.func1(0xc4200681e0, 0xc42007e000)
	/home/user/command_context.go:24 +0x2b
created by main.main
	/home/user/command_context.go:25 +0x1eb
exit status 2
@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 31, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Jan 31, 2017
@jbardin
Copy link
Contributor Author

jbardin commented Jan 31, 2017

I initially thought that this was caused by Context cancelation, but it turns out to be the same when directly killing the process.

This also hangs/panics

	cmd := exec.Command("/bin/sh", "-c", "sleep 60")
	var output bytes.Buffer
	cmd.Stdout = &output
	cmd.Start()
	log.Println("started command")

	done := make(chan error)
	go func() {
		done <- cmd.Wait()
	}()

	time.Sleep(50 * time.Millisecond)
	log.Println("killing command")
	cmd.Process.Kill()

	select {
	case err := <-done:
		log.Println("done:", err)
	case <-time.After(5 * time.Second):
		panic("failed to cancel")
	}

A second call to wait rightly returns Wait was already called and fails the race detector.

@ianlancetaylor
Copy link
Contributor

It seems to work fine for me. Both programs end immediately with done: signal: killed. I'm on GNU/Linux, Ubuntu Trusty. I checked both Go 1.7 and Go 1.8rc3.

Can you post a complete program, and post the output of the program? Thanks.

@dsnet
Copy link
Member

dsnet commented Feb 1, 2017

I reproduce the problem on:

$ cat /etc/issue
Linux Mint 17.1 Rebecca \n \l
$ uname -a
Linux carbonite 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ go version
go version go1.7.4 linux/amd64

@ianlancetaylor
Copy link
Contributor

@dsnet Can you attach your complete program?

@dsnet
Copy link
Member

dsnet commented Feb 1, 2017

It is quite literally the program from the original post: https://play.golang.org/p/ovO_hVDW49

Interestingly, I'm not able to repro the issue on my work computer.

@ianlancetaylor
Copy link
Contributor

Interesting. I've now tried that program on three different GNU/Linux distros, kernel versions 2.6.32, 2.6.35, 4.2.0. For me it always completes immediately.

@jbardin Tell us more about the system you are running.

@ianlancetaylor
Copy link
Contributor

When the context is cancelled we kill the child process, in this case /bin/sh, with SIGKILL. We don't kill the whole process group, just the immediate child. Depending on which shell you are using, that won't necessarily kill the child process running sleep 60. If the child process does not die, and if sleep does not close its standard output, then the Go program will hang waiting to copy the child process's standard output into the buffer.

I'm not even sure this is wrong.

@silviucm
Copy link

silviucm commented Feb 1, 2017

If it helps in any way:

Does Panic:
Intel NUC Intel(R) Core(TM) i5-6260U (panics)
Linux medianuc 4.9.3-040903-generic #201701120631 SMP Thu Jan 12 11:33:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue -> Linux Mint 18 Sarah

Does Panic:
Dell Desktop Intel(R) Core(TM) i7-4790 (panics)
Linux silviu-desktop 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue -> Linux Mint 18 Sarah

Does NOT Panic:
Odroid XU4 - ARMv7 Processor rev 3 (v7l)
Linux odroid 3.10.104-4-ARCH #1 SMP PREEMPT Sun Jan 1 18:12:52 MST 2017 armv7l GNU/Linux
$ cat /etc/issue -> Arch Linux

@jbardin
Copy link
Contributor Author

jbardin commented Feb 1, 2017

I think I see the difference, I bet all the systems where this fails have /bin/sh linked to /bin/dash.

I'm also note sure what the correct behavior is now, but dash is the only common shell I can find that does this.

@silviucm
Copy link

silviucm commented Feb 1, 2017

confirming the only system where it does not panic is not dash:

ll -a /bin/sh
lrwxrwxrwx 1 root root 4 Jan 25 14:58 /bin/sh -> bash

@ianlancetaylor
Copy link
Contributor

I'm inclined to close this as working as expected. You are asking for the processes's standard output to be collected. That means that the Wait method will block until the process closes its standard output. In the case that fails, the child process still has standard output open. If you want something else to happen, you need to ensure that you kill not just the child, but any other process it starts that may keep standard output open.

@evverx
Copy link

evverx commented Feb 1, 2017

@ianlancetaylor , I think this is a documentation bug: Wait and context don't work well together.

The problem is

If c.Stdin is not an *os.File, Wait also waits for the I/O loop copying from
c.Stdin into the process's standard input to complete.

Consider this example:

package main

import (
        "context"
        "fmt"
        "io"
        "os/exec"
        "time"
)

type Hanger time.Duration

func (h Hanger) Read(b []byte) (int, error) {
        time.Sleep(time.Duration(h))
        return 0, io.EOF
}

func main() {
        ctx, _ := context.WithTimeout(context.Background(), 5*time.Second)
        cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "sleep 7")
        cmd.Stdin = Hanger(10 * time.Second)
        cmd.Start()
        now := time.Now()
        go func() {
                <-ctx.Done()
                fmt.Printf("context: %v\n", time.Since(now))
        }()
        cmd.Wait()
        fmt.Printf("wait: %v\n", time.Since(now))
}

This prints

context: 4.999701339s
wait: 10.001522024s

@ianlancetaylor
Copy link
Contributor

I don't see why this has anything to do with Wait and context. As the second comment above shows, the same issue happens if you use Kill directly.

@evverx
Copy link

evverx commented Feb 1, 2017

As the second comment above shows, the same issue happens if you use Kill directly

Yes. And that works as expected.

But the code shouldn't use done <- cmd.Wait() instead of the <- context.Done().

@evverx
Copy link

evverx commented Feb 1, 2017

But, yes, you're right. The Wait/context-documentation is another issue. Sorry for the noise

@jbardin
Copy link
Contributor Author

jbardin commented Feb 1, 2017

The difference in shell behavior here is that bash, zsh, etc will exec a single command in place, where as dash always executes in a sub process. Giving bash a series of commands causes it to exhibit this behavior too, and I assume it will be the same on all unix platforms.

cmd := exec.CommandContext(ctx, "/bin/zsh", "-c", `sleep 60; sleep 60`)

This explains a number of issues people have asked me about in os/exec, and I didn't make the connection that the io loop for non *os.File stdin/stdout/stderr was the difference. I wonder if this could be cleaned up without having interruptible IO on regular file descriptors. We could poll the output with select, and check if we've killed the process in the meantime. We are killing the first child process, and we can wait on that immediately which is all I would expect. I wouldn't expect to be waiting on an FD passed to another process (which will die as soon as we close it).

Or maybe because this is a particular behavior of shells, we just document it as such.

@jbardin
Copy link
Contributor Author

jbardin commented Feb 1, 2017

@evverx:

But the code shouldn't use done <- cmd.Wait() instead of the <- context.Done().

You always need to call Wait on a child process. Waiting on the context doesn't do that for you.

@evverx
Copy link

evverx commented Feb 1, 2017

@jbardin , sure.

I mean something like that:

go func() {
    <-ctx.Done()
    cancel_input_producers
    cancel_output_consumers
}()
....
cmd.Wait()

@jbardin
Copy link
Contributor Author

jbardin commented Feb 1, 2017

@evverx: That still hangs. The Wait still may never return -- I've only put it in the goroutine to be able to show the timeout and panic.

@evverx
Copy link

evverx commented Feb 1, 2017

@jbardin

That still hangs.

It depends. Your example doesn't control output_consumers. So, it can't cancel_output_consumers correctly.

Anyway, that's unrelated to this issue.

@jbardin
Copy link
Contributor Author

jbardin commented Feb 3, 2017

I think the underlying issue actually is directly related to that example.

What's causing the hang here is the hidden goroutine (referenced via Cmd.goroutine, communicating over Cmd.errch) that's launched to copy from the process file descriptor to the io.Writer, which equates to the output_consumers. Since we have no visibility to that goroutine, we can't choose to cancel_output_consumers which is why Wait can block indefinitely.

Using an os.Pipe for the output means the process can write directly to the pipe fd, and the blocked Read is lifted into my code where I can choose to stop waiting on it (though I still can't cancel that Read without polling the file descriptor).


My feeling here is that Wait is only expected to call the wait syscall, and has done so at this point. The child process has been terminated and reaped, and now we're blocking on file descriptors which are being used by a new process that we don't intend kill or wait on.

@ianlancetaylor, with the current file io implementation, would polling the FDs or selecting along with a self-pipe be an acceptable solution? (also related to #18507)

Without an immediate solution, I think it would be good to document that setting the process stdin/stdout/stderr to non-*os.Files could cause Wait to block indefinitely.

@ianlancetaylor
Copy link
Contributor

I don't understand your suggestion about polling. Right now if you point Stdout at a buffer, exec.Cmd promises that that buffer will hold the program's standard output after you call Wait. I don't see how we can do anything except wait for the descriptor to be closed.

I agree that the documentation needs to be better.

@jbardin
Copy link
Contributor Author

jbardin commented Feb 3, 2017

I guess it depends on the interpretation of "the program's standard output". I was expecting Wait to wait on the process and nothing more, but it's blocked waiting on the output from another process which isn't going to be killed.

By the time Wait is blocked, the process referenced by Cmd.Process is gone and reaped. Short of a delay in the kernel buffering the output, there won't ever be any more data coming from the process that is being waited on. If we're continuing to block there, it must be because another process has taken over the fd. Some might consider that output part of the "program" launched by exec, which is why I'm not certain that aborting the Read is 100% correct, but it also seems strange that we sent a SIGKILL to a process yet Wait is blocked for some reason.

By polling, I mean syscall.Select with a timeout to check for cancellation, but the self-pipe trick works just as well, in order to not have a blocked Read call.

@bcmills
Copy link
Contributor

bcmills commented Feb 3, 2017

Would it make sense to have exec.CommandContext send SIGTERM instead of SIGKILL? Context cancellation in a Go program is normally cooperative and asynchronous, not immediate, so SIGTERM seems like a more precise expression of that intent.

I'm not sure whether that would actually change the behavior of bash w.r.t. subprocesses, though.

@ianlancetaylor
Copy link
Contributor

Sending SIGTERM instead of SIGKILL would not affect the problem in this issue.

@ronin13
Copy link

ronin13 commented Feb 19, 2017

I can confirm this with go tip ( to consistently reproduce with any shell use 'sleep 60; sleep 60' as suggested above: complete code https://play.golang.org/p/hORl4wxoQk). The backtrace is slightly different (due to c05b06a I think): https://gist.github.com/ronin13/16a328dd56846a28d25feb8a2140f858

@gopherbot
Copy link

CL https://golang.org/cl/42270 mentions this issue.

@ianlancetaylor
Copy link
Contributor

Quoting @eZanmoto from #20730 :

The problem, in my opinion, is that I expected exec.Cmd.Wait to function like os.Process.Wait and return when the process exited. However, based on #18874, exec.Cmd is additionally "bound" to the FDs it passed to the process, not just its PID, and so cannot exit until they are closed. My proposed solution is to update the exec.Cmd.Wait documentation to reflect this, in a similar way to the current documentation concerning c.Stdin. I would also update the initial line to indicate that Wait doesn't just wait for the command to exit, but may also wait for other events (to be detailed in the body of the documentation) to transpire, which could cause Wait to return later.

@gopherbot
Copy link

CL https://golang.org/cl/47650 mentions this issue.

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jul 6, 2017
gopherbot pushed a commit that referenced this issue Jul 6, 2017
It already implied that Cmd.Wait is more than os.Process.Wait, but say
so explicitly.

See #18874 (comment)

Updates #18874

Change-Id: Iaa46defd776ae0be817d9f4466a99ac78cfd672b
Reviewed-on: https://go-review.googlesource.com/47650
Reviewed-by: Russ Cox <rsc@golang.org>
@ianlancetaylor ianlancetaylor added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 30, 2017
mguegan pushed a commit to virtua-network/smartos_exporter that referenced this issue Nov 23, 2017
I suspect Go is calling `/bin/sh` when using os.exec Command, but this
will spawn `dash` on LX zones. Incorrect behavior expected as shown on
golang/go#18874
fenollp added a commit to FuzzyMonkeyCo/monkey that referenced this issue Dec 6, 2017
fenollp added a commit to FuzzyMonkeyCo/monkey that referenced this issue Dec 6, 2017
* fails: work around golang/go/issues/18874

* fails: have curl output server errors
@ianlancetaylor
Copy link
Contributor

The docs have been improved as part of #22610. It's not clear that there is anything else to do here.

sjezewski added a commit to pachyderm/pachyderm that referenced this issue Dec 7, 2017
It was copied from my local go version at: go version go1.9.2 linux/amd64

This is so that we can export a helper to close IO explicitly when we
need it. For more info, see this issue:
golang/go#18874
@ianlancetaylor
Copy link
Contributor

I'm going to close this. The documentation is clearer in the upcoming 1.10 release. I don't see how we can change anything in the behavior.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants