x/crypto/ssh: deadlock during key renegotiation #18439

Closed
vito opened this issue Dec 27, 2016 · 20 comments

@vito

vito commented Dec 27, 2016

What version of Go are you using (go version)?

1.7.x - though in this case the more relevant bit is probably the SHA of x/crypto/ssh: a8a4eed

What operating system and processor architecture are you using (go env)?

Linux, amd64

What did you do?

Establish a long-lived connection used for SSH tunneling, including sending arbitrary amounts of data over the wire (e.g. ~200MB sent every now and then via the same connection).

What did you expect to see?

SSH traffic should continue unhindered over an arbitrary amount of time with arbitrary amounts of data flowing over the connection.

What did you see instead?

I've observed that after some time all communication over SSH will hang. The connection is still "live" in the TCP sense, but SSH-level keepalives just block waiting for a reply.

I suspect this is actually triggered by the amount of data transferred rather than by time, however; the default 1GB rekeying threshold seems a very likely culprit given the stack inspection below.

Debugging

Context

(You can skip this if you just want the nitty-gritty.)

This package is used in a CI system wherein all workers register themselves with the master via a custom-built SSH server. They open a reverse SSH tunnel so that they can be reached from a private network - the SSH server acts as a proxy to the workers. These SSH connections are arbitrarily long-lived, and have a large amount of data transferred over them as various CI resources (e.g. git repos, build artifacts) are streamed to and from the workers.

Procedure

In my case both ends (client and server) are implemented using this package. I suspect this is a server-side issue, since bouncing the server makes all clients immediately notice and reconnect. I've also looked at the stack dump of the server and (I think) found a smoking gun for the deadlock.

Code sleuthing

I observed that for every hanging client there was one goroutine stuck in a chan send for as long as the worker had been gone. The following is my theory as to how it got there (a minimal sketch of the pattern follows the list):

  • (*handshakeTransport).readOnePacket and (*handshakeTransport).writePacket check whether the number of bytes exceeds t.config.RekeyThreshold, which defaults to 1 gigabyte
  • this means that after 1GB of data transfer, the crypto/ssh package will initiate a new key exchange and set t.sentInitMsg to a non-nil value
  • (*handshakeTransport).writePacket loops on t.sentInitMsg != nil using a condition variable (i.e. it waits for that to become nil, meaning the kex has finished, or for a write error that will cause it to give up)
  • t.sentInitMsg only becomes nil in a call to (*handshakeTransport).readOnePacket
  • the only caller of readOnePacket is (*handshakeTransport).readLoop
  • in our stack, we observed readLoop is blocked on writing to the incoming channel, so it cannot call readOnePacket
  • interestingly, incoming is a buffered channel (capacity 16) - however in this case that didn't help, so either the buffer is full or no one can read from it
  • who reads from t.incoming? (*handshakeTransport).readPacket does.
  • who calls readPacket? (*mux).onePacket
  • who calls onePacket? (*mux).loop - which isn't too interesting, since it just keeps calling that method as long as it succeeds. So let's go back into onePacket and see what it does after reading a packet, since that's why it isn't reading the next one.
  • in our case, it was in the call to (*channel).handlePacket - so what's that doing?
  • in our case, it's blocked on a call to (*channel).sendMessage. things are getting interesting here: our read loop is now writing.
  • following the stack down, this ends up back in (*handshakeTransport).writePacket! which depends on the read side making progress, since readOnePacket is what completes the key exchange.
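
To make the shape of this concrete, here is a self-contained sketch of the pattern (not the actual crypto/ssh code; the type, fields, and method names are stand-ins): the read loop blocks handing a packet to a consumer, the consumer's reply blocks on a pending key exchange, and only the read loop could ever complete that exchange. Running it ends in Go's "all goroutines are asleep - deadlock!" fatal error, which is the point.

```go
// Deadlock sketch: a "read loop" feeds packets to a consumer; the consumer
// replies via writePacket; writePacket waits on a condition that only the
// read loop could ever clear (the completed key exchange).
package main

import "sync"

type transport struct {
	mu         sync.Mutex
	kexDone    *sync.Cond
	kexPending bool        // stand-in for t.sentInitMsg != nil
	incoming   chan []byte // stand-in for handshakeTransport.incoming
}

func newTransport() *transport {
	t := &transport{
		incoming:   make(chan []byte, 16),
		kexPending: true, // a rekey has just been initiated
	}
	t.kexDone = sync.NewCond(&t.mu)
	return t
}

// writePacket blocks while a key exchange is pending, as described above.
func (t *transport) writePacket(p []byte) {
	t.mu.Lock()
	for t.kexPending {
		t.kexDone.Wait() // only readOnePacket would ever clear kexPending
	}
	t.mu.Unlock()
	// ...actual write elided...
}

// readLoop is the only place the key exchange could complete, but it gets
// stuck handing packets to the consumer once the buffer fills up.
func (t *transport) readLoop() {
	for {
		pkt := []byte("chanClose") // pretend we read a packet off the wire
		t.incoming <- pkt          // blocks once nobody drains the channel
		// readOnePacket would finish the kex and clear kexPending here,
		// but we never get that far again.
	}
}

func main() {
	t := newTransport()
	go t.readLoop()

	// Consumer: mux.onePacket -> channel.handlePacket -> sendMessage -> writePacket.
	for pkt := range t.incoming {
		_ = pkt
		t.writePacket([]byte("confirm close")) // blocks forever on the pending kex
	}
}
```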
@rsc
Contributor

rsc commented Jan 4, 2017

/cc @agl

@rsc rsc added this to the Unplanned milestone Jan 4, 2017
@hanwen
Contributor

hanwen commented Jan 9, 2017

can you send the stack trace dumps? Your description makes it hard to follow what is going on.

If you think key exchanges are to blame, you can make them happen more often by tuning down the rekeying data threshold.

Can you confirm that you are servicing all channels that the SSH library returns?

@databus23

Regarding your question about servicing all channels: I think that's the case:

@hanwen
Contributor

hanwen commented Jan 9, 2017

if this is open source code, can you recreate the deadlock and SIGQUIT the process and send the resulting trace?

@vito
Author

vito commented Jan 9, 2017

@hanwen Sure - I'll configure a lower threshold and see if it happens more quickly, and then grab a stack dump.

In the meantime you can find one linked from here: concourse/concourse#750 (comment) - they'll be the ones from "web", not "worker", though "worker" will show the client side, so that may be of interest too.

@databus23

databus23 commented Jan 9, 2017

@hanwen Be sure to take one prefixed with concourse-web-issue-750- because I placed some stack dumps from other issues there as well.

@hanwen
Contributor

hanwen commented Jan 9, 2017

studied your explanation a bit more, and I think you are right: the deadlock is in the SSH package. To avoid deadlocks, we separate the reading and writing sides (see also https://codereview.appspot.com/62180043/).

Unfortunately, the spec requires that on reading a channelClose message, we should confirm it by sending a channelClose message back, so there is no way to make the read and write sides fully independent.

I will have to think for a bit on how to fix this.

@hanwen
Contributor

hanwen commented Jan 9, 2017

probably, we could have a

queuedPackets [][]byte

to which packets can be appended while a kex is in progress. We could then either let all writes succeed directly, or special-case channelClose. The former will yield more testing coverage, but may not report I/O errors in the right place. The latter will be very hard to test, since we'd have to build a test case that reproduces this hang.
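
For concreteness, a rough sketch of the first option as I read it: writes issued while a kex is in flight are buffered and reported as successful, then flushed once the exchange completes. The type, field, and method names here are made up for illustration; this is not the real package's code, and it glosses over where a later I/O error would be reported.

```go
// Hypothetical queuedPackets sketch: writePacket never blocks on a kex; it
// queues instead, and the read side flushes the queue when the kex finishes.
package main

import (
	"fmt"
	"sync"
)

type handshakeTransport struct {
	mu            sync.Mutex
	kexInProgress bool
	queuedPackets [][]byte
	rawWrite      func(p []byte) error // stand-in for the underlying packet writer
}

func (t *handshakeTransport) writePacket(p []byte) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.kexInProgress {
		// Copy, because callers may reuse the buffer after we return.
		t.queuedPackets = append(t.queuedPackets, append([]byte(nil), p...))
		return nil // caveat: a later I/O error surfaces somewhere else
	}
	return t.rawWrite(p)
}

// finishKex would be called by the read side once the key exchange completes.
func (t *handshakeTransport) finishKex() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.kexInProgress = false
	for _, p := range t.queuedPackets {
		if err := t.rawWrite(p); err != nil {
			return err
		}
	}
	t.queuedPackets = nil
	return nil
}

func main() {
	var out [][]byte
	t := &handshakeTransport{
		kexInProgress: true,
		rawWrite:      func(p []byte) error { out = append(out, p); return nil },
	}
	_ = t.writePacket([]byte("chanClose")) // queued rather than blocking
	_ = t.finishKex()                      // kex done: the queued packet goes out
	fmt.Println(len(out))                  // 1
}
```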

@hanwen
Contributor

hanwen commented Jan 9, 2017

I tried one simple-minded solution, which didn't work. I need to think this through better. For now, I recommend jacking up the rekey interval to something really high.

The 1GB default is suggested by the original RFC spec, which is from 2006, and is exceedingly conservative. We're 10 years later now, so that number can probably safely be multiplied by 100 or 1000.
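
For anyone who wants that stopgap in code: the threshold is exposed via RekeyThreshold on the ssh.Config embedded in ClientConfig/ServerConfig. A minimal sketch, assuming a reasonably recent x/crypto/ssh; the address, credentials, and the 64 GiB figure are placeholders, and the host key callback is intentionally insecure just to keep it short.

```go
// Workaround sketch: raise RekeyThreshold well above the 1GB default.
package main

import (
	"log"

	"golang.org/x/crypto/ssh"
)

func main() {
	config := &ssh.ClientConfig{
		User: "worker",
		Auth: []ssh.AuthMethod{ssh.Password("not-a-real-password")},
		// Do not ignore host keys in production; this keeps the sketch short.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Config: ssh.Config{
			RekeyThreshold: 64 << 30, // 64 GiB instead of the default
		},
	}

	client, err := ssh.Dial("tcp", "example.com:22", config)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer client.Close()
	// ...use the connection as usual...
}
```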

@hanwen
Contributor

hanwen commented Jan 9, 2017

summary of deadlock:

  1. some call to writePacket triggers a kex. All subsequent writes now block.
  2. the remote end is slow or sending lots of data. We happen to read a chanClose message before the remote end can send a kex reply back.
  3. the chanClose handler tries to echo a chanClose via writePacket; this blocks, which stalls the read loop.

@databus23

@hanwen Thanks for taking a deeper look into this. Regarding your suggested workaround of increasing the rekey interval by a factor of 100 or more: I'm not sure I understand why it's now safe to increase the rekey interval. Briefly looking at the spec, it seems the main concern when choosing the recommended interval was the computational complexity of asymmetric encryption. Wouldn't that be an argument to decrease the rekey interval 10 years later, with computers being a lot faster than when the spec was devised? In general, isn't it at least a theoretical security concern to increase the interval by huge amounts, since an attacker can observe far more data encrypted with the same key material?

@hanwen
Contributor

hanwen commented Jan 10, 2017

from what I recall, there was no real justification given for the 1GB number (it's also just a recommendation). I put it in because that's what the RFC said, but it seems ridiculously low compared to network throughput nowadays and the delay of doing a new handshake.

@databus23

databus23 commented Jan 10, 2017

@hanwen You are right that the original RFC 4253 did not provide detailed background on how the recommended value was chosen. But RFC 4344 Sections 3 and 6.1 provide more detail on the rekeying interval, and while I can't fully assess the math it presents, it clearly states that expanding the interval has negative implications for the security of the connection and that the recommended value should be used.

@hanwen
Contributor

hanwen commented Jan 11, 2017

thanks for that pointer. You shouldn't encrypt more than 2^32 (~4G) packets, because the packet counter wraps around.

L=128 for AES and the block size is 16 bytes, so you should be OK rekeying at 16 * 2^32 bytes = 64 GiB.
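
Spelling out that arithmetic (my reading of RFC 4344, Section 3.1, where L is the cipher's block length in bits and the block size is L/8 bytes):

```latex
% Rekey after at most 2^{L/4} blocks (RFC 4344, Section 3.1).
\[
  L = 128 \;\Rightarrow\; 2^{L/4} = 2^{32} \text{ blocks}
  \;\Rightarrow\; 2^{32} \times 16\,\text{bytes} = 2^{36}\,\text{bytes} = 64\,\text{GiB}.
\]
```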

Upping the interval was intended as a short-term workaround. The CL at https://go-review.googlesource.com/c/35012/ fixes this more reliably, hopefully.

@databus23

@hanwen: Awesome, thanks for taking this on so quickly. Looking forward to the proper fix.

I have one more question: from your comments above it seems like the deadlock can only be caused by a channel close message while a key exchange is in flight? Is that really the case? In the CI system where this is happening I wouldn't expect channel close messages, as each worker sets up a single persistent connection that is not torn down until one of the two ends shuts down.

@hanwen
Contributor

hanwen commented Jan 11, 2017

the stack traces I got for this bug were pointing to the line where the closeChannel message was being sent.

But it could potentially happen in any function that responds to global/channel requests.

@vito
Author

vito commented Jan 11, 2017 via email

@hanwen
Contributor

hanwen commented Jan 16, 2017

fixed in golang/crypto@2e74c77

@hanwen hanwen closed this as completed Jan 16, 2017
@vito
Author

vito commented Jan 18, 2017

@hanwen I'm now seeing handshakes fail occasionally with the following message:

input_userauth_error: bad message during authentication: type 20

Type 20 being KEXINIT. Maybe that's firing concurrently with the handshake or something? It appears to be flaky.

I can open another issue if you'd like.

@kardianos
Contributor

@vito closed issues are not tracked. Open a new issue please.

@golang golang locked and limited conversation to collaborators Jan 18, 2018
FiloSottile pushed a commit to FiloSottile/go that referenced this issue Nov 24, 2019
Use channels and a dedicated write loop for managing the rekeying
process.  This lets us collect packets to be written while a key
exchange is in progress.

Previously, the read loop ran the key exchange, and writers would
block if a key exchange was going on. If a reader wrote back a packet
while processing a read packet, it could block, stopping the read
loop, thus causing a deadlock.  Such coupled read/writes are inherent
with handling requests that want a response (eg. keepalive,
opening/closing channels etc.). The buffered channels (most channels
have capacity 16) papered over these problems, but under load SSH
connections would occasionally deadlock.

Fixes golang#18439.

Change-Id: I7c14ff4991fa3100a5d36025125d0cf1119c471d
Reviewed-on: https://go-review.googlesource.com/35012
Run-TryBot: Han-Wen Nienhuys <hanwen@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Han-Wen Nienhuys <hanwen@google.com>
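
For readers skimming the thread, here is the shape of that fix as I read the commit message - a simplified sketch, not the real implementation, with invented names: writers hand packets to a dedicated write loop over a channel, and the write loop buffers packets while a key exchange is in flight and flushes them when it completes, so no caller (including the read loop answering a request) has to block on the kex.

```go
// Sketch of a dedicated write loop that absorbs writes during a key exchange.
// Names are illustrative, not the actual x/crypto/ssh internals.
package main

import "fmt"

type writeRequest struct {
	packet []byte
}

func writeLoop(requests <-chan writeRequest, kexStart, kexDone <-chan struct{}, raw func([]byte)) {
	var pending [][]byte
	inKex := false
	for {
		select {
		case <-kexStart:
			inKex = true
		case <-kexDone:
			inKex = false
			for _, p := range pending {
				raw(p) // flush everything queued during the exchange
			}
			pending = nil
		case req := <-requests:
			if inKex {
				pending = append(pending, req.packet)
			} else {
				raw(req.packet)
			}
		}
	}
}

func main() {
	requests := make(chan writeRequest)
	kexStart := make(chan struct{})
	kexDone := make(chan struct{})
	wrote := make(chan string, 4)

	go writeLoop(requests, kexStart, kexDone, func(p []byte) { wrote <- string(p) })

	kexStart <- struct{}{}
	requests <- writeRequest{packet: []byte("chanClose")} // handed off; the sender is not stuck on the kex
	kexDone <- struct{}{}                                 // kex finishes: the queued packet is flushed

	fmt.Println(<-wrote) // chanClose
}
```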