x/crypto/ssh: deadlock during key renegotiation #18439
Comments
/cc @agl
can you send the stack trace dumps? Your description makes it hard to follow what is going on. If you think key exchanges are to blame, you can make them happen more often by tuning down the rekeying data threshold. Can you confirm that you are servicing all channels that the SSH library returns?
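(For anyone following along: the threshold in question is the `RekeyThreshold` field on `ssh.Config`, which is embedded in both `ClientConfig` and `ServerConfig`. A minimal sketch of a server config with a deliberately low threshold - the 16 KiB value and `NoClientAuth` are for illustration only:)

```go
package sshrepro

import "golang.org/x/crypto/ssh"

// lowRekeyServerConfig returns a ServerConfig that renegotiates keys after
// roughly 16 KiB instead of the 1 GB default, so the rekeying code path is
// exercised almost immediately under any traffic.
func lowRekeyServerConfig(hostKey ssh.Signer) *ssh.ServerConfig {
	config := &ssh.ServerConfig{
		Config: ssh.Config{
			RekeyThreshold: 16 * 1024, // bytes; the default is ~1 GB
		},
		NoClientAuth: true, // illustration only; real servers should authenticate
	}
	config.AddHostKey(hostKey)
	return config
}
```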
Regarding your question about servicing all channels: I think that's the case:
if this is open source code, can you recreate the deadlock and SIGQUIT the process and send the resulting trace?
@hanwen Sure - I'll configure a lower threshold and see if it happens more quickly, and then grab a stack dump. In the meantime you can find one linked from here: concourse/concourse#750 (comment) - they'll be the ones from "web", not "worker", though "worker" will show the client side, so that may be of interest too.
@hanwen Be sure to take one prefixed with concourse-web-issue-750-, because I placed some stack dumps from other issues there as well.
I studied your explanation a bit more, and I think you are right: the deadlock is in the SSH package. To avoid deadlocks, we separate the reading and writing sides (see also https://codereview.appspot.com/62180043/). Unfortunately, the spec requires that on reading a channelClose message, we confirm it by sending a channelClose message back, so there is no way to make the read and write sides fully independent. I will have to think for a bit on how to fix this.
Probably we could have a `queuedPackets [][]byte` which can be written to while a kex is in progress. We could then either let all writes succeed directly, or special-case channelClose. The former will yield more testing coverage, but may not report I/O errors in the right place. The latter will be very hard to test for, since we'd have to make a test case that repros this hang.
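(A rough sketch of that `queuedPackets` idea as described - all names here are invented for illustration and do not match the real `handshakeTransport`:)

```go
package sshsketch

import (
	"net"
	"sync"
)

// transportSketch illustrates the queued-writes idea: while a key exchange is
// in progress, writers append to queuedPackets instead of blocking, and the
// queue is flushed once the exchange completes.
type transportSketch struct {
	mu            sync.Mutex
	kexInProgress bool
	queuedPackets [][]byte
	conn          net.Conn // stand-in for the underlying packet writer
}

func (t *transportSketch) writePacket(p []byte) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.kexInProgress {
		// Succeed immediately; a real implementation would have to decide
		// where the eventual I/O error gets reported.
		t.queuedPackets = append(t.queuedPackets, append([]byte(nil), p...))
		return nil
	}
	_, err := t.conn.Write(p)
	return err
}

// finishKex is called when the key exchange completes; it drains the queue.
func (t *transportSketch) finishKex() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.kexInProgress = false
	for _, p := range t.queuedPackets {
		if _, err := t.conn.Write(p); err != nil {
			return err
		}
	}
	t.queuedPackets = nil
	return nil
}
```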
I tried one simple-minded solution, which didn't work. I need to think this through better. For now, I recommend jacking up the rekey interval to something really high. The 1 GB default is suggested by the original RFC spec, which is from 2006 and is exceedingly conservative. We're 10 years later now, so that number can probably safely be multiplied by 100 or 1000.
summary of deadlock:
@hanwen Thanks for taking a deeper look into this. Regarding your suggested workaround of increasing the re-key interval by a factor of 100 or more: I'm not sure I follow why it's now safe to increase it. Briefly looking at the spec, it seems the main concern when choosing the recommended interval was the computational cost of asymmetric encryption. Wouldn't that be an argument to decrease the re-key interval 10 years later, with computers being a lot faster than when the spec was devised? And in general, isn't it at least a theoretical security concern to increase the interval by huge amounts, since an attacker can then observe far more data encrypted with the same key material?
From what I recall, there was no real justification given for the 1 GB number (it's also just a recommendation). I put it in because that's what the RFC said, but it seems ridiculously low compared to the throughput of networks nowadays and the delay of doing a new handshake.
@hanwen You are right that the original RFC 4253 did not provide detailed background on how the recommended value was chosen. But RFC 4344 Sections 3 and 6.1 provide more detail on the re-keying interval, and while I can't fully assess the provided math, it clearly states that expanding the interval has negative implications for the security of the connection and that the recommended value should be used.
Thanks for that pointer. You shouldn't encrypt more than 4G (2^32) packets because the counter wraps around. L=128 for AES and the block size is 16 bytes, so you should be OK rekeying at 16 * 2^32 = 64G. Upping the interval was intended as a short-term workaround. The CL at https://go-review.googlesource.com/c/35012/ hopefully fixes this more reliably.
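(Spelling that arithmetic out, on the assumption it refers to RFC 4344's rule of rekeying before 2^(L/4) cipher blocks:)

```go
package main

import "fmt"

func main() {
	const blockSize uint64 = 16      // AES block size in bytes (L = 128 bits)
	const maxBlocks uint64 = 1 << 32 // assumed RFC 4344 bound: rekey before 2^(L/4) = 2^32 blocks
	total := blockSize * maxBlocks
	fmt.Printf("%d bytes = %d GiB\n", total, total>>30) // 68719476736 bytes = 64 GiB
}
```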
@hanwen: Awesome, thanks for taking this on so quickly. Looking forward to the proper fix. I have one more question: from your comments above it seems like the deadlock can only be caused by a channel close message while a key exchange is in flight? Is that really the case? In the CI system where this is happening I wouldn't expect channel close messages, as each worker sets up a single persistent connection that is not torn down until one of the two ends shuts down.
The stack traces I got for this bug were pointing to the line where the closeChannel message was being sent. But it could potentially happen in any function that responds to global/channel requests.
I could see this occurring during the keepalive request our Go client sends every N seconds.
Thanks very much for picking this up!
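(The keepalive in question is presumably the usual pattern of firing `SendRequest` on a timer; a sketch of that shape, with the request name and interval chosen for illustration - the point being that each tick is a writer waiting for a reply from the read loop:)

```go
package keepalive

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// keepAlive sends a global request every interval until the connection is
// closed. SendRequest with wantReply=true blocks until the other side
// answers, so this goroutine is exactly the kind of writer that can get
// stuck behind an in-flight key exchange.
func keepAlive(conn ssh.Conn, interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			// "keepalive@openssh.com" is the conventional request name; we
			// only care that the peer answers, not whether it accepts it.
			if _, _, err := conn.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				return
			}
		case <-done:
			return
		}
	}
}
```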
fixed in golang/crypto@2e74c77
@hanwen I'm now seeing handshakes fail occasionally with the following message:
Type 20 being SSH_MSG_KEXINIT. I can open another issue if you'd like.
@vito closed issues are not tracked. Open a new issue please. |
Use channels and a dedicated write loop for managing the rekeying process. This lets us collect packets to be written while a key exchange is in progress.

Previously, the read loop ran the key exchange, and writers would block if a key exchange was going on. If a reader wrote back a packet while processing a read packet, it could block, stopping the read loop, thus causing a deadlock. Such coupled read/writes are inherent in handling requests that want a response (e.g. keepalives, opening/closing channels, etc.). The buffered channels (most channels have capacity 16) papered over these problems, but under load SSH connections would occasionally deadlock.

Fixes golang#18439.

Change-Id: I7c14ff4991fa3100a5d36025125d0cf1119c471d
Reviewed-on: https://go-review.googlesource.com/35012
Run-TryBot: Han-Wen Nienhuys <hanwen@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Han-Wen Nienhuys <hanwen@google.com>
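(The commit above is the authoritative version; purely as a sketch of the shape it describes - a single write loop fed by channels, queuing packets while a kex is pending - with all names below invented rather than taken from the package:)

```go
package writeloop

// writeRequest pairs a packet with a channel on which its result is reported.
type writeRequest struct {
	packet []byte
	result chan error
}

// writerSketch roughly mirrors the "dedicated write loop" idea: only loop()
// touches the underlying transport, so a reader that needs to respond to a
// packet just enqueues a request and can never block the read loop.
type writerSketch struct {
	requests chan writeRequest
	kexStart chan struct{} // a value arrives when a key exchange begins
	kexDone  chan struct{} // a value arrives when it completes
	pending  [][]byte
	send     func([]byte) error // writes to the underlying packet connection
}

func (w *writerSketch) loop() {
	inKex := false
	for {
		select {
		case <-w.kexStart:
			inKex = true
		case <-w.kexDone:
			inKex = false
			for _, p := range w.pending {
				w.send(p) // per-packet errors elided in this sketch
			}
			w.pending = nil
		case req := <-w.requests:
			if inKex {
				// Queue instead of blocking; the real CL decides how the
				// eventual write error is surfaced.
				w.pending = append(w.pending, req.packet)
				req.result <- nil
				continue
			}
			req.result <- w.send(req.packet)
		}
	}
}
```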
What version of Go are you using (`go version`)?
1.7.x - though in this case the more relevant bit is probably the sha of `x/crypto/ssh`: a8a4eed
What operating system and processor architecture are you using (`go env`)?
Linux, amd64
What did you do?
Establish a long-lived connection used for SSH tunneling, including sending arbitrary amounts of data over the wire (e.g. ~200MB sent every now and then via the same connection).
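(A sketch of what a repro along those lines might look like; the address, auth config, and remote command are placeholders, and the per-iteration size just needs to push total traffic past the rekeying threshold:)

```go
package repro

import (
	"crypto/rand"
	"io"

	"golang.org/x/crypto/ssh"
)

// pushData keeps a single SSH connection open and repeatedly streams ~200 MB
// through it, so the connection crosses the 1 GB rekeying threshold after a
// handful of iterations while staying long-lived.
func pushData(addr string, config *ssh.ClientConfig) error {
	client, err := ssh.Dial("tcp", addr, config)
	if err != nil {
		return err
	}
	defer client.Close()

	for i := 0; i < 20; i++ {
		session, err := client.NewSession()
		if err != nil {
			return err
		}
		session.Stdin = io.LimitReader(rand.Reader, 200<<20) // ~200 MB of random bytes
		err = session.Run("cat > /dev/null")                 // placeholder remote command
		session.Close()
		if err != nil {
			return err
		}
	}
	return nil
}
```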
What did you expect to see?
SSH traffic should continue unhindered over an arbitrary amount of time with arbitrary amounts of data flowing over the connection.
What did you see instead?
I've observed that after some time all communication over SSH will hang. The connection is still "live" in the TCP sense, but SSH-level keepalives just block waiting for a reply.
However, I suspect this is actually related to the amount of data transferred rather than to time; the default of 1GB of transfer before key renegotiation seems a very likely culprit given the stack inspection below.
Debugging
Context
(You can skip this if you just want the nitty-gritty.)
This package is used in a CI system wherein all workers register themselves with the master via a custom-built SSH server. They'll open a reverse SSH tunnel so that they can be used from a private network - the SSH server will act as a proxy to the workers. These SSH connections are arbitrarily long lived, and will have a large amount of data transferred over them as various CI resources (i.e. git repos, build artifacts) are streamed to and from the workers.
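(To make that setup concrete, the reverse tunnel described above corresponds to the client-side remote-forwarding pattern; a sketch, with the address, port, and handler being placeholders:)

```go
package worker

import (
	"net"

	"golang.org/x/crypto/ssh"
)

// register dials the master's SSH server and asks it to listen on the
// worker's behalf; connections accepted by the master then travel back over
// this single long-lived SSH connection, which is also where the large
// artifact transfers flow.
func register(masterAddr string, config *ssh.ClientConfig) error {
	client, err := ssh.Dial("tcp", masterAddr, config)
	if err != nil {
		return err
	}
	defer client.Close()

	ln, err := client.Listen("tcp", "0.0.0.0:7777") // port chosen for illustration
	if err != nil {
		return err
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go serve(conn)
	}
}

// serve stands in for whatever actually handles the proxied worker traffic.
func serve(conn net.Conn) {
	defer conn.Close()
	// ... handle the forwarded connection ...
}
```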
Procedure
In my case both ends (client and server) are implemented using this package. I suspect this is a server-side issue, as all clients immediately notice and reconnect when the server is bounced. I've also looked at the stack dump of the server and (I think) found a smoking gun for the deadlock.
Code sleuthing
I observed that for every hanging client there was one goroutine stuck in `chan send` for the duration that the worker had been gone. The following is my theory as to how that got there:

- `(*handshakeTransport).readOnePacket` and `(*handshakeTransport).writePacket` check if the number of bytes exceeds `t.config.RekeyThreshold`, which defaults to 1 gigabyte
- once that threshold is crossed, the `crypto/ssh` package will initiate a new key exchange and set `t.sentInitMsg` to a non-nil value
- `(*handshakeTransport).writePacket` has a loop on `t.sentInitMsg != nil` using a condition variable (i.e. it's waiting for that to become `nil`, meaning the kex has finished, or for there to be a write error that will cause it to give up)
- `t.sentInitMsg` only becomes nil in a call to `(*handshakeTransport).readOnePacket`
- the only caller of `readOnePacket` is `(*handshakeTransport).readLoop`
- `readLoop` is blocked on writing to the `incoming` channel, so it cannot call `readOnePacket`
- `incoming` is a buffered channel (length 16) - however in this case that didn't help, so the buffer must be full, or no one can read from it
- so who reads from `t.incoming`? `(*handshakeTransport).readPacket` does.
- who calls `readPacket`? `(*mux).onePacket`
- who calls `onePacket`? `(*mux).loop` - which isn't too interesting though, since it literally just keeps calling that method as long as it works. so let's go back into `onePacket` and see what it's doing after reading the packet, as that'd be why it's not reading the next one.
- it calls `(*channel).handlePacket` - so what's that doing?
- it calls `(*channel).sendMessage`. things are getting interesting here: our read loop is now writing.
- `sendMessage` calls `(*handshakeTransport).writePacket`! which is dependent on us reading to succeed, as `readOnePacket` is what completes the key exchange.

A toy program below illustrates the same cycle.
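(The following is a self-contained illustration of that cycle - none of it is the ssh package's actual code, just two goroutines arranged the same way: a "read loop" that must both deliver packets and finish key exchanges, and a consumer whose replies block on a condition variable only the read loop can signal:)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeTransport mimics just enough of the shape described above.
type fakeTransport struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inKex    bool
	incoming chan string
}

// writePacket stands in for (*handshakeTransport).writePacket: wait until no
// key exchange is pending, then "write".
func (t *fakeTransport) writePacket(p string) {
	t.mu.Lock()
	for t.inKex {
		t.cond.Wait() // only readLoop would ever signal this
	}
	t.mu.Unlock()
	fmt.Println("wrote", p)
}

// readLoop stands in for (*handshakeTransport).readLoop: it is the only place
// the kex could be finished, but it also has to deliver packets on incoming.
func (t *fakeTransport) readLoop() {
	for i := 0; ; i++ {
		if i == 3 {
			t.mu.Lock()
			t.inKex = true // a rekey kicks in mid-stream
			t.mu.Unlock()
		}
		t.incoming <- fmt.Sprintf("packet-%d", i) // blocks once the buffer fills
		// The kex would be completed (and the cond signalled) further down
		// this loop, but we never get there again once the send above blocks.
	}
}

func main() {
	t := &fakeTransport{incoming: make(chan string, 2)} // small buffer, like the real 16
	t.cond = sync.NewCond(&t.mu)

	go t.readLoop()

	// The mux side: every packet read triggers a reply, the way a
	// channelClose confirmation or keepalive response would.
	go func() {
		for p := range t.incoming {
			t.writePacket("reply-to-" + p)
		}
	}()

	time.Sleep(2 * time.Second)
	fmt.Println("stuck: readLoop blocked sending to incoming, writePacket waiting for the kex to finish")
}
```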