x/crypto/ssh: semantics around running session.Wait() after calling session.Run(), when EOF messages are sent #32453
…mandContext`

It was observed while running `gops stack` against the Ginkgo test suite locally that we were leaking goroutines stuck waiting for SSH sessions to finish. We accrued over 1000 of these per K8s CI run:

```
16 unique stack traces
960 occurrences. Sample stack trace:
github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a)
	/Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57
github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220)
	/Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc
created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext
	/Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c
```

This example shows over 960 goroutines stuck on `session.Wait()`. Whenever we run a command via SSH, we call `runCommand`, which calls `session.Run`, which in turn calls `session.Start()` and `session.Wait()`. I observed that calling `Wait()` on a session that has already had `Run` invoked will never return, even if we call `session.Close()` before invoking `session.Wait()`. This indicates that our logic for killing the session when the context provided to `RunCommandContext` is canceled is flawed: waiting for the session to finish before closing it blocks forever.

I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`), and I see that the session receives an EOF message *before* we even try to close the session:

```
>------- session started, and session.Run() invoked
2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}}
2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes
>------- EOF sent on channel (not by us; we have not closed the session yet!)
2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0}
2019/06/05 08:16:59 decoding(2): data packet - 181 bytes
2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac}
2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0}
>------- we try to close the session, and receive the following error
failed to close session: EOF
```

It appears that we cannot close the session, since an EOF has already been sent for it. I am not exactly sure where this comes from; I've posted an issue / question in the GitHub repository for golang: golang/go#32453. Our attempts to send signals (SIGHUP and SIGINT) are met with this same EOF error as well. There is no point in waiting for the session to finish in this case, so just try to close it and move on, and do not leak goroutines that will be stuck forever.

When running `gops` now against the Ginkgo test suite, we no longer accrue a ton of goroutines blocked on `session.Wait()`; the biggest number of occurrences of "similar" goroutines is at most 2, as in this captured stack trace, for example:

```
$ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out | head -n 15
14 unique stack traces
2 occurrences. Sample stack trace:
internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
...
```

Signed-off-by: Ian Vernon <ian@cilium.io>
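The pattern the commit describes, as a hedged sketch (an illustrative reconstruction, not the exact cilium helper): run the command in a goroutine and, on context cancellation, close the session without waiting, since `Run` already performs the `Wait` internally.

```go
package helpers

import (
	"context"

	"golang.org/x/crypto/ssh"
)

// runCommandContext is an illustrative reconstruction of the fix described
// above, not the actual cilium code. Run already calls Wait internally, so
// on cancellation we close the session and return instead of waiting,
// which is what previously leaked goroutines.
func runCommandContext(ctx context.Context, session *ssh.Session, cmd string) error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		done <- session.Run(cmd) // Run == Start + Wait; never call Wait again
	}()
	select {
	case <-ctx.Done():
		// Closing the session unblocks Run in the goroutine above; we do
		// not wait for it, so nothing is left stuck on session.Wait().
		_ = session.Close()
		return ctx.Err()
	case err := <-done:
		return err
	}
}
```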
/cc @hanwen @FiloSottile
session.Run is syntactic sugar for Start + Wait. Don't call Wait if you use Run. Re: EOF, EOF signifies the end of writing to a channel (stdout/stderr), so it can be sent only once. I am open to reviewing changes to be more intelligent here, but the behavior per se is not a bug.
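To make that contract concrete, here is a minimal sketch of the two valid usage patterns (the command and error handling are placeholders):

```go
package main

import "golang.org/x/crypto/ssh"

// viaRun: Run performs Start + Wait internally, so it is the whole
// lifecycle; never call Wait on the same session afterwards.
func viaRun(client *ssh.Client) error {
	session, err := client.NewSession()
	if err != nil {
		return err
	}
	defer session.Close()
	return session.Run("uptime") // one-shot: Start + Wait in one call
}

// viaStartWait: use the primitives when you need the session (pipes,
// signals) while the command is running.
func viaStartWait(client *ssh.Client) error {
	session, err := client.NewSession()
	if err != nil {
		return err
	}
	defer session.Close()
	if err := session.Start("uptime"); err != nil {
		return err
	}
	return session.Wait() // exactly one Wait, paired with one Start
}
```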
@hanwen thanks for the reply! Regarding your comments about EOF: it looks like calling `Run()` is what sends the EOF on the channel, which explains the error we saw when closing the session afterwards. I don't think there is a bug here either; it was just a bit hard to understand the idempotency of session operations. Perhaps I can open a PR for some godoc changes describing this?
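Such a godoc change might read along these lines (suggested wording only, not the actual upstream doc comment):

```go
// Wait waits for the remote command to exit. It must be preceded by a
// call to Start and must be called at most once per session. Run calls
// Start and Wait itself, so calling Wait after Run returns will block
// forever.
```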
I ran into a similar problem, and I agree it might be best fixed just by adding some information to the documentation. In particular, I had to look at the sources to see which functions were primitive (`Start`, `Wait`) and which were conveniences built on top of them (`Run`, `Output`, `CombinedOutput`). On the other hand, even though calling Wait() twice is a programmer error, the behavior of hanging seems suboptimal. What about a change like this:
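The snippet itself was not preserved in this thread; a minimal reconstruction of the idea, assuming a hypothetical `waited` field on `Session` (not a real `x/crypto/ssh` field), might be:

```go
func (s *Session) Wait() error {
	if !s.started {
		return errors.New("ssh: session not started")
	}
	if s.waited { // hypothetical guard field, set below
		return errors.New("ssh: Wait called after Run or an earlier Wait")
	}
	s.waited = true
	// ... existing wait logic unchanged ...
}
```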
Yes, please send a change; your suggestions sound promising. Can you see if you can make this work? The doc change should be separate from the functional change.
Hmm, I'd love to send a PR or two, but it looks like that'd require my employer to sign the Google CLA, and while that's in principle possible, it's a lot of work for four lines of code. Unless/until we work through that red tape for other reasons, I'm going to have to pass on sending any actual changes. :(
I'll try to take a shot at this!
Hey, FYI: the docs could still be better around this. There is some essential information in this thread that really should be in the docs.
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

I am using a vendored version of the library with commit e84da0312774c21d64ee2317962ef669b27ffb41.

What operating system and processor architecture are you using (`go env`)?

What did you do?
I will work on providing an exact reproducer of this via a small Go program, but do not have one on hand immediately. I can do this if you like. My question is more about how to properly use the SSH library, as I feel like I am misunderstanding how it works.
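A minimal program of the shape being described might look like this (host address, credentials, and command are placeholders; a sketch, not the author's actual reproducer):

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/crypto/ssh"
)

func main() {
	// Placeholder address and credentials: adjust for your environment.
	config := &ssh.ClientConfig{
		User:            "vagrant",
		Auth:            []ssh.AuthMethod{ssh.Password("vagrant")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	client, err := ssh.Dial("tcp", "127.0.0.1:22", config)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Run is Start + Wait in one call.
	if err := session.Run("true"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Run returned; calling Wait again...")

	// This second Wait hangs forever: the exit status was already
	// consumed by the Wait inside Run.
	_ = session.Wait()
	fmt.Println("never reached")
}
```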
I am wondering about the semantics of running `session.Wait()` after calling `session.Run()`.

It appears that `session.Run()` calls `session.Start()` followed by `session.Wait()`. I've found that if I call `session.Wait()` again after calling `session.Run()`, my code hangs infinitely. Is this because there is nothing left to wait upon after `session.Run()` terminates, since the remote command has exited, so `session.Wait()` is waiting for something that will never be sent?

I am also confused about how sessions are closed. I observe, when running locally with `debugMux` set to `true`, that my session gets sent a `channelEOFMsg` after I have called `session.Run()` but before I call `session.Close()`. When I then try to close the session, I get the error `failed to close session: EOF`. I would expect that after calling `Run` I'd be able to close the session gracefully. Who is sending these EOFs? Is it expected behavior, when running a command on a session, that the session gets an EOF once the command is done running?

Output from running with `debugMux=true`: (the excerpt is quoted in the commit message above)

What did you expect to see?
I would expect not to get an EOF after calling `Run()` on a session, and for `Wait()` to be idempotent, returning immediately if there is nothing to wait upon. I would also expect that EOF would not get sent by `Run()`, and that I would not get an error closing a session when nothing else has tried to close it.

What did you see instead?
See above.
I will work on a small reproducer locally, but I hope that the above is enough context / information for me to get some answers, as I feel like this is more me not understanding how the library is supposed to be used.