x/crypto/ssh: semantics around running session.Wait() after calling session.Run(), when EOF messages are sent #32453

Open
ianvernon opened this issue Jun 5, 2019 · 8 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@ianvernon

ianvernon commented Jun 5, 2019

What version of Go are you using (go version)?

$ go version
1.12.5

Does this issue reproduce with the latest release?

I am using a vendored version of the library with commit e84da0312774c21d64ee2317962ef669b27ffb41

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/vagrant/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/vagrant/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build896755704=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I will work on providing an exact reproducer via a small Go program, but I do not have one on hand immediately; I can put one together if you like. My question is more about how to use the SSH library properly, as I feel I am misunderstanding how it works.

I am wondering about the semantics behind running session.Wait() after calling session.Run().

It appears that session.Run() calls session.Start() followed by session.Wait().

I've found that if I call session.Wait() again after calling session.Run(), my code hangs indefinitely. Is this because there is nothing left to wait on after session.Run() returns (the remote command has already exited), so session.Wait() blocks waiting for a message that will never arrive?
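
The shape of the calls is roughly the following (a simplified sketch, not the actual reproducer; the host and credentials are placeholders):

```
package main

import (
	"log"

	"golang.org/x/crypto/ssh"
)

func main() {
	// Placeholder dial details; any working ClientConfig will do.
	client, err := ssh.Dial("tcp", "127.0.0.1:22", &ssh.ClientConfig{
		User:            "vagrant",
		Auth:            []ssh.AuthMethod{ssh.Password("vagrant")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	})
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		log.Fatalf("new session: %v", err)
	}

	// Run starts the remote command and waits for it to finish
	// (internally it is Start followed by Wait).
	if err := session.Run("/usr/bin/true"); err != nil {
		log.Fatalf("run: %v", err)
	}

	// Calling Wait again after Run has already waited is where I see the hang:
	// session.Wait() // never returns in my tests
}
```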

I am also confused about how sessions are closed. When running with debugMux set to true locally, I observe that my session is sent a channelEOFMsg after I have called session.Run(), but before I call session.Close(). When I then try to close the session, I get the error failed to close session: EOF. I would expect that after calling Run I would be able to close the session gracefully. Who is sending these EOFs? Is this the expected behavior for running a command on a session, i.e. the session gets an EOF when the command is done running?
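
Concretely, continuing from the snippet above, the close path looks like this (sketch only; the logged message mirrors what I see):

```
	// The channelEOFMsg shown below has already been sent by the time Run returns.
	if err := session.Close(); err != nil {
		log.Printf("failed to close session: %v", err) // prints: failed to close session: EOF
	}
```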

Output from running with debugMux=true:

--------------> session.Run() <----------------
2019/06/05 08:17:03 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:17:03 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:17:03 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x56, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x67, 0x65, 0x74, 0x20, 0x70, 0x6f, 0x64, 0x73, 0x20, 0x2d, 0x2d, 0x61, 0x6c, 0x6c, 0x2d, 0x6e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x69, 0x74, 0x65, 0x6d, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x6d, 0x65, 0x74, 0x61, 0x64, 0x61, 0x74, 0x61, 0x2e, 0x64, 0x65, 0x6c, 0x65, 0x74, 0x69, 0x6f, 0x6e, 0x54, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x7d, 0x27}}
2019/06/05 08:17:03 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:17:03 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes

-------------> where is this EOF coming from? <---------------
2019/06/05 08:17:03 send(2): ssh.channelEOFMsg{PeersID:0x0}

-----------> close session here <-----------------
2019/06/05 08:17:03 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:17:03 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:17:03 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:17:03 send(2): ssh.channelCloseMsg{PeersID:0x0}
2019/06/05 08:17:03 send(2): ssh.channelCloseMsg{PeersID:0x0}

-------------> log of error from closing session <----------------
failed to close session: EOF

What did you expect to see?

I would expect Run() not to send an EOF on the channel, Wait() to be idempotent (or at least to return immediately when there is nothing left to wait on), and closing the session not to return an error when nothing else has tried to close it.

What did you see instead?

See above.

I will work on a small reproducer locally, but I hope the above is enough context / information to get some answers, as I suspect this is more a case of me not understanding how the library is supposed to be used.

@gopherbot gopherbot added this to the Unreleased milestone Jun 5, 2019
@ianvernon ianvernon changed the title x/crypto/ssh: semantics around running session.Wait() after calling session.Run() x/crypto/ssh: semantics around running session.Wait() after calling session.Run(), when EOF messages are sent Jun 5, 2019
ianvernon pushed a commit to cilium/cilium that referenced this issue Jun 5, 2019
…mandContext`

It was observed while running `gops stack` for the Ginkgo test suite locally,
that we were leaking goroutines that were getting stuck while waiting for SSH
sessions to finish. We accrued over 1000 of these per K8s CI run:

```
16 unique stack traces
960 occurences. Sample stack trace:
github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a)
        /Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57
github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220)
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc
created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c
```

This example shows that there were over 960 goroutines stuck on `session.Wait()`.

Whenever we run a command via SSH, we call `runCommand`, which calls
`session.Run`; `session.Run` in turn calls `session.Start()` followed by
`session.Wait()`. I observed that calling `Wait()` on a session on which
`Run` has already been invoked never returns, even if we call
`session.Close()` before invoking `session.Wait()`. This indicates that our
logic for killing the session when the context provided to
`RunCommandContext` is canceled is flawed, as waiting for the session to
finish before closing it blocks forever.

I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`), and
I see that the session receives an EOF message *before* we even try to close the
session:

```
>------- session started, and session.Run() invoked

2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}}
2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes

>-------  EOF sent on channel (not by us; we have not closed the session yet!)

2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0}
2019/06/05 08:16:59 decoding(2): data packet - 181 bytes
2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac}
2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0}

>------- we try to close the session, and receive the following error

failed to close session: EOF
```

It appears that we cannot close the session, since an EOF has already been sent
for it. I am not exactly sure where this comes from; I've posted an issue /
question in the GitHub repository for golang:
golang/go#32453 . Our attempts to send signals (SIGHUP
and SIGINT) are met by this same EOF error as well. There is no point in
waiting for the session to finish in this case, so we just try to close it
and move on rather than leaking goroutines that will be stuck forever (a
simplified sketch of the resulting pattern follows the `gops` output below).
When running `gops` now against the Ginkgo test suite, we no longer accrue a
ton of these goroutines blocked on `session.Wait()` - the biggest number of
occurrences for "similar" goroutines is at most 2 in a stack trace captured
below, for example:

```
$ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out  | head -n 15
14 unique stack traces
2 occurences. Sample stack trace:
internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
...
```
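
The resulting pattern, in simplified form (the helper name and signature here are illustrative, not the exact test-helper code):

```
package helpers

import (
	"context"

	"golang.org/x/crypto/ssh"
)

// runWithContext runs cmd on session and, if ctx is canceled first, closes
// the session instead of waiting on it, so the goroutine below can unblock
// rather than leak.
func runWithContext(ctx context.Context, session *ssh.Session, cmd string) error {
	done := make(chan error, 1)
	go func() {
		// Run is Start + Wait; there is no separate Wait afterwards.
		done <- session.Run(cmd)
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		session.Close()
		return ctx.Err()
	}
}
```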

Signed-off by: Ian Vernon <ian@cilium.io>
ianvernon pushed a commit to cilium/cilium that referenced this issue Jun 5, 2019
borkmann pushed a commit to cilium/cilium that referenced this issue Jun 5, 2019
tgraf pushed a commit to cilium/cilium that referenced this issue Jun 6, 2019
@dmitshur
Contributor

dmitshur commented Jun 10, 2019

/cc @hanwen @FiloSottile

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 10, 2019
@hanwen
Contributor

hanwen commented Jun 17, 2019

session.Run is syntactic sugar for Start + Wait. Don't call Wait if you use Run.

Re: EOF: it signifies the end of writing to a channel (stdout/stderr), so you can send it only once.

I am open to review changes to be more intelligent here, but the behavior per se is not a bug.
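
In other words, a minimal sketch of the two valid usages (each assumes a fresh *ssh.Session, since a session runs at most one command):

```
// Variant 1: let Run do the waiting.
if err := session.Run("uptime"); err != nil {
	log.Printf("run: %v", err)
}

// Variant 2: manage the lifecycle yourself, calling Wait exactly once.
if err := session.Start("uptime"); err != nil {
	log.Printf("start: %v", err)
}
if err := session.Wait(); err != nil {
	log.Printf("wait: %v", err)
}
```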

@ianvernon
Author

ianvernon commented Jun 17, 2019

@hanwen thanks for the reply!

Regarding your comments about EOF: it looks like calling Run sends EOF, so once Run has returned we do not need to close the session ourselves, because Run effectively closes it as well?

I don't think there is a bug here either; it was just a bit hard to understand which session operations can safely be repeated. Perhaps I can open a PR with some godoc changes describing this?

@wiml

wiml commented Jul 11, 2019

I ran into a similar problem, and I agree it might be best fixed just by adding some information to the documentation. In particular, I had to look at the sources to see which functions are primitive (Start, Shell, and Wait) and which are conveniences that could be implemented in terms of the other public API (Run, Output, CombinedOutput).

On the other hand, even though calling Wait() twice is a programmer error, the behavior of hanging seems suboptimal. What about a change like this:

  • In the Session struct, make exitStatus a receive-only channel
  • In newSession(), only assign the receive half of the channel to exitStatus, and give the send half to the waiting goroutine.
  • After sending the exit status to the channel, close the channel.
  • In Wait(), check whether the receive came from a closed channel, and either panic or return an error describing the situation (I prefer panic here, since it's a programmer error; but in any case, it had better not just return the default nil value!); see the sketch below.
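
A standalone sketch of the closed-channel idiom that last point relies on (plain Go, not the actual x/crypto/ssh session internals):

```
package main

func main() {
	exitStatus := make(chan error, 1)

	// The goroutine that owns the send half delivers the status exactly once,
	// then closes the channel.
	go func() {
		exitStatus <- nil // the remote command's exit status would go here
		close(exitStatus)
	}()

	// First Wait: a normal receive.
	if _, ok := <-exitStatus; ok {
		// normal path: hand the received error back to the caller
	}

	// Second Wait: the receive completes immediately with ok == false, which
	// is the cue to panic (or return an explicit error) rather than silently
	// returning nil.
	if _, ok := <-exitStatus; !ok {
		panic("ssh: Wait called after the session already finished")
	}
}
```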

@hanwen
Contributor

hanwen commented Jul 11, 2019

Yes, please send a change

Your suggested change sounds promising. Can you see whether you can make it work?

The doc change should be separate from the functional change.

@wiml

wiml commented Jul 18, 2019

Hmm, I'd love to send a PR or two, but it looks like that would require my employer to sign the Google CLA, and while that's possible in principle, it's a lot of work for four lines of code. Unless/until we work through that red tape for other reasons, I'm going to have to pass on sending any actual changes. :(

@ianvernon
Author

I’ll try to take a shot at this!

@jherman4

Hey, FYI: the docs could still be better around this. There is some essential information in this thread that really should be in the docs.
