
x/crypto/ssh: Race Conditions on Channel EOF, Closing and Request Processing #29733

Closed
paulquerna-okta opened this issue Jan 14, 2019 · 6 comments
Labels
FrozenDueToAge · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Comments

@paulquerna-okta

What version of Go are you using (go version)?

go version go1.11.4 linux/amd64

Does this issue reproduce with the latest release?

Yes, x/crypto/ssh on master

What did you do?

We have a large codebase that acts as a ProxyCommand. Our code is a bridge, acting as both an SSH client and an SSH server. When running under Ansible, which spawns OpenSSH with ControlMaster enabled and heavily leverages “pipelining” of Ansible commands, we encountered a race condition in the processing of a Channel.

In mux.go, there is a tight loop for processing messages:
https://github.com/golang/crypto/blob/ff983b9c42bc9fbf91556e191cc8efb585c16908/ssh/mux.go#L189-L191

Eventually, this calls ch.handlePacket:
https://github.com/golang/crypto/blob/ff983b9c42bc9fbf91556e191cc8efb585c16908/ssh/channel.go#L398

Three messages are received in quick succession on a session channel:

  • exit-status
  • eof
  • close

Each message is handled differently:

  • exit-status is made into a Request and sent to ch.incomingRequests. (goroutine A)
  • eof sets .eof() on the buffer, which triggers any Readers of the buffer (stdout, stderr) to stop. (goroutine B)
  • close is handled internally. (goroutine C)

When writing these “back” to our outbound channel, we must make the corresponding calls in a specific order, so that the messages go out on the wire in the same order we received them. Ideally, our proxy code would make these calls in order:

  • SendRequest() for exit-status
  • .CloseWrite() on the channel (generating eof messages)
  • .Close() on the channel (generating close)
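
For illustration, a minimal sketch of that ideal shutdown sequence (names are ours and error handling is simplified; only the ssh.Channel methods are real):

```go
package proxy

import "golang.org/x/crypto/ssh"

// forwardShutdown is illustrative only: out is the outbound ssh.Channel
// and status is the exit-status payload received from the inbound side.
func forwardShutdown(out ssh.Channel, status []byte) error {
	// 1. Forward exit-status before any eof/close reaches the wire.
	if _, err := out.SendRequest("exit-status", false, status); err != nil {
		return err
	}
	// 2. Announce that no more data follows (emits channelEOFMsg).
	if err := out.CloseWrite(); err != nil {
		return err
	}
	// 3. Tear the channel down (emits channelCloseMsg).
	return out.Close()
}
```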

However, guaranteeing this ordering is extremely difficult, and our code had race conditions because we process the Channel I/O and Requests in their own goroutines. In the “bad” path of the race condition, we would:

  • Inbound channel Read() would return EOF, so we would call .CloseWrite() on the outbound channel, generating an EOF message to the Outbound Channel.
  • We would then RECEIVE a Close from the Outbound Channel -- this would cause the internal handling to completely close the channel
  • The goroutine reading from the inbound channel's Requests Go-channel would then process the exit-status Request and fail to write it to the outbound channel, because the outbound channel had already been hard-closed.
  • Ansible / OpenSSH would then break the ControlMaster muxing because it expected to receive an exit-status message but never did, causing an error message like this:
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Control master terminated unexpectedly

Using Go-channels and a Read() interface at the same time requires processing across at least two goroutines. Additionally, because close messages are handled internally, there is effectively a third goroutine involved, which makes the races harder to control. The order in which messages were received is lost and hard to guarantee.
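
To make the goroutine structure concrete, here is a minimal sketch of the consumption pattern the API forces on a proxy (names are ours; ch and reqs are assumed to come from an Accept call). The two goroutines race, so the caller cannot tell whether the eof or the exit-status request arrived first:

```go
package proxy

import (
	"io"

	"golang.org/x/crypto/ssh"
)

// consume drains one direction of the proxy: ch and reqs are the inbound
// ssh.Channel and its request channel; out is the outbound channel.
func consume(ch ssh.Channel, reqs <-chan *ssh.Request, out ssh.Channel) {
	go func() {
		io.Copy(out, ch) // returns once the inbound buffer sees eof
		out.CloseWrite() // may fire before exit-status is forwarded
	}()
	go func() {
		for req := range reqs { // exit-status lands here, racing above
			out.SendRequest(req.Type, req.WantReply, req.Payload)
		}
	}()
}
```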

What did you expect to see?

I would prefer that all operations were delivered on a single Go-channel (data and Requests, including the Close message). This would allow in-order processing of an SSH Channel. It could perhaps be written as an intermediate layer, with the current public API changed to run on top of it.
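
Purely as a hypothetical shape (nothing like this exists in x/crypto/ssh today), such a layer might deliver a single event type in wire order:

```go
// Hypothetical: one event type carrying everything a channel receives.
type ChannelEvent struct {
	Data    []byte       // non-nil for a data packet
	Request *ssh.Request // non-nil for a channel request
	EOF     bool         // set for channelEOFMsg
	Closed  bool         // set for channelCloseMsg
}

// Hypothetical: all events for the channel, in the order received.
// func (ch Channel) Events() <-chan ChannelEvent
```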

Relevant Logs

2019/01/11 11:39:35 decoding(1): data packet - 16393 bytes
2019/01/11 11:39:35 send(1): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0x4000}
2019/01/11 11:39:35 decoding(1): data packet - 8201 bytes
2019/01/11 11:39:35 decoding(1): data packet - 1010 bytes
2019/01/11 11:39:35 send(1): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0x23e9}
2019/01/11 11:39:35 decoding(1): 98 &ssh.channelRequestMsg{PeersID:0x1, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/01/11 11:39:35 decoding(1): 96 &ssh.channelEOFMsg{PeersID:0x1} - 5 bytes
2019/01/11 11:39:35 decoding(1): 97 &ssh.channelCloseMsg{PeersID:0x1} - 5 bytes
2019/01/11 11:39:35 send(1): ssh.channelCloseMsg{PeersID:0x0}
2019/01/11 11:39:35 send(2): ssh.channelEOFMsg{PeersID:0x2}
2019/01/11 11:39:35 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/01/11 11:39:35 send(2): ssh.channelCloseMsg{PeersID:0x2}
2019/01/11 11:39:35 send(2): ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}}
2019/01/11 11:39:35 send(2): ssh.channelCloseMsg{PeersID:0x2}
2019/01/11 11:39:35 send(1): ssh.channelCloseMsg{PeersID:0x0}
@gopherbot added this to the Unreleased milestone Jan 14, 2019
@paulquerna-okta
Author

/cc @hanwen

@hanwen
Contributor

hanwen commented Jan 15, 2019

Your analysis of the problem is spot on, but I am hesitant to rework the package in the way you suggest.

Mixing data and requests in a single queue is problematic because the former is subject to flow control while the latter is not.

Maybe we could move all of the request processing to a single goroutine, but that would be brain surgery on the package, and I doubt it could be done in a backward-compatible way. To make things worse, it is easy to hang or deadlock the SSH mux if the user code (which processes the requests) somehow sends data or requests back to the other side.

I think your proxy always receives the exit-status, is that correct? If so, my suggestion is the following:

  • When you receive EOF, start a timer.
  • Wait up to (say) 1 second for exit-status to be transmitted before you propagate EOF.
  • Only once you have propagated EOF, propagate the channel close.
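
A minimal sketch of that workaround (names are ours; it assumes the request-forwarding goroutine closes exitSent once it has forwarded exit-status to the outbound channel):

```go
package proxy

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// onInboundEOF runs when Read on the inbound channel returns io.EOF.
// exitSent is closed by the request-forwarding goroutine after it has
// forwarded exit-status to the outbound channel.
func onInboundEOF(out ssh.Channel, exitSent <-chan struct{}) {
	select {
	case <-exitSent: // exit-status already went out
	case <-time.After(1 * time.Second): // grace period, then give up
	}
	out.CloseWrite() // only now emit eof on the wire
	out.Close()      // then the close
}
```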

@paulquerna-okta
Author

The flow you describe, using a timer started on EOF, is what we tested internally and shipped to our testing release channel yesterday; gathering wider feedback from users is ongoing. I'm happy we at least have the same short-term approach :-)

So far this approach seems to be working, but we are generally concerned that under high load / parallelism we could still see "random" Ansible failures, which are very painful to debug and reproduce.

exit-status is not always sent -- at the very least, it's possible to get an exit-signal instead:
https://github.com/openssh/openssh-portable/blob/8a22ffaa13391cfe5b40316d938fe0fb931e9296/session.c#L2383-L2401
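
Given that, the wait condition in the sketch above would need to trigger on any exit-* request, not just exit-status. Something like this (names are ours, continuing the earlier sketch; a real implementation would guard against closing exitSent twice):

```go
// Forward requests and signal once any exit-* request has gone out,
// since OpenSSH may send exit-signal instead of exit-status.
for req := range reqs {
	out.SendRequest(req.Type, req.WantReply, req.Payload)
	if strings.HasPrefix(req.Type, "exit-") {
		close(exitSent) // unblocks the EOF-propagation goroutine
	}
}
```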

@hanwen
Contributor

hanwen commented Jan 15, 2019

Well, yes, any exit-XXXX as applicable.

If you are sure that exit-XXXX always happens, you could wait indefinitely.

@julieqiu added the NeedsInvestigation label May 28, 2019
@haolongzhangm

Any update on this issue?

@ianlancetaylor
Contributor

There are no updates. Based on the comments above, I don't think we are going to change anything here. I'm going to close this issue. Please comment if you disagree.

@golang locked and limited conversation to collaborators Feb 22, 2021