Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/nettest: TestConn cannot succeed with some valid net.Conn implementations #46977

Open
danderson opened this issue Jun 29, 2021 · 4 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danderson
Copy link
Contributor

x/net/nettest.TestConn implements a test suite to verify that a net.Conn implementation conforms to the implicit contracts of net.Conn. However, some (I argue) valid implementations of net.Conn cannot pass this test, because it assumes that write timeouts are always recoverable.

As an example, tailscale/tailscale#2261 is an implementation of a Noise IK cryptographic transport. At a high level, it functions similarly to TLS: a handshake generates some session keys, and those session keys are used to transmit encrypted data frames of the form 2b length; Nb ciphertext.

This implementation fails 4 of the TestConn subtests:

    --- FAIL: TestConn/PastTimeout (60.02s)
    --- FAIL: TestConn/PresentTimeout (60.02s)
    --- FAIL: TestConn/FutureTimeout (60.01s)
    --- FAIL: TestConn/ConcurrentMethods (60.02s)

All of these tests set a write deadline, deliberately make it expire, then verify that the write unblocks and that the connection is still usable after resetting the deadline.

However, in the Noise protocol, if a write aborts with a partially written ciphertext frame, the stream synchronization is broken and the io.Writer API is such that we cannot recover:

  • If we truthfully return less than n bytes written, the caller will assume that the connection is still usable and that it could retransmit the remainder of the bytes - which is not true, since the receiver would then receive a truncated ciphertext frame followed by a different full frame.
  • If we return n and buffer the untransmitted frame portion to retry on a later Write call, the caller may assume that the receiver has received all the bytes it sent, and switch to blocking on a Read call waiting for a response, causing a deadlock.

After thinking this through, the conclusion I reached is that in this protocol, any error on Write, including timeouts, is fatal to the write portion of the net.Conn, and no further data can be transmitted from that point on. (Strictly speaking, it's possible to special-case timeouts that fire with 0 bytes sent, but that involves rollbacks of the cipher state that can lead to catastrophic cryptographic failures)

And yet, I think that this net.Conn is a working net.Conn that complies with the interface contract. Nothing in the contract says that timeouts must be non-fatal (indeed, there's no way to guarantee that the connection doesn't terminate for unrelated reasons immediately after a timeout). The failing tests are trying to probe whether the implemented timeout behavior is correct if timeouts are non-fatal - a valid thing to probe, but not strictly required for a net.Conn to be functional.

As another case in point, I believe tls.Conn has the same issue, since TLS data frames look and behave much the same. AFAICT, crypto/tls solves this issue by not running nettest.TestConn against tls.Conn.

(it's worth noting that tls.Conn and my noise.Conn don't have the same problem on the read side, since the Conn can gracefully buffer partial reads until the caller retries enough to get a full frame of data. So, I very much do want to run the contract tests that check for Read timeout behavior)

It would be nice to somehow partition these subtests out, to allow for net.Conns that are intolerant of any write error to still verify the rest of their behaviors.

@gopherbot gopherbot added this to the Unreleased milestone Jun 29, 2021
@danderson
Copy link
Contributor Author

cc @josharian who encouraged me to file this bug, while we were discussing how to make noise.Conn pass nettest.TestConn :)

@josharian
Copy link
Contributor

josharian commented Jun 30, 2021

According to the net.Conn docs:

// If the deadline is exceeded a call to Read or Write or to other
// I/O methods will return an error that wraps os.ErrDeadlineExceeded.
// This can be tested using errors.Is(err, os.ErrDeadlineExceeded).

One way such a cryptographic net.Conn could handle a write timeout is to return a failure that wraps os.ErrDeadlineExceeded but whose Temporary method returns false. In such a case, nettest could then infer that it oughtn't try to run the rest of the "is the conn still usable?" tests.

However, #45729 proposes deprecating the Temporary method, so maybe that's not the best choice.

Another option would be for the net.Conn to return a failure that doesn't wrap os.ErrDeadlineExceeded at all. (Maybe ErrClosed?) That is, it says: Some other thing went wrong: "you requested a write timeout now, but I can't satisfy that, because that would require this connection be resumable, and it is not". Then nettest could use that signal to infer that it ought to stop. However, that does remove the bulk of the utility of those tests.

cc @dsnet @mikioh

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 30, 2021
@neild
Copy link
Contributor

neild commented Jun 30, 2021

The net.Conn documentation states:

// A deadline is an absolute time after which I/O operations
// fail instead of blocking. The deadline applies to all future
// and pending I/O, not just the immediately following call to
// Read or Write. After a deadline has been exceeded, the
// connection can be refreshed by setting a deadline in the future.

"After a deadline has been exceeded, the connection can be refreshed by setting a deadline in the future" sounds clear to me: A connection must not become unusable if a deadline is exceeded.

I would probably handle this in a cryptographic net.Conn by documenting it as containing an internal buffer and providing a Flush method for cases where the caller needs to ensure that all written bytes have been transmitted.

@danderson
Copy link
Contributor Author

I read that sentence differently: until you set a new deadline in the future, the connection will be unusable. It does not imply that setting a deadline in the future must make the connection usable again. That's impossible to guarantee even for a TCPConn - I might have unplugged the network cable, you can't demand that the connection continues to function just because I reset the deadline.

I think the contract would more accurately be: once you've reset the deadline to some time in the future, you will not get another timeout error until that point in the future - not "you definitely won't get any errors and can rely on this conn to function at least until that future point in time", which is what the tests currently enforce.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants