net: Dial returns before connection complete #19289
Comments
So is the whole Have you tried dialing a child process in a loop while kill -9'ing the child, and it yielded no results, or ...?
It's a theory. The test cluster where we've observed this is used for "chaos" testing where we have a cron job that kills random processes periodically. I've tried to simply connect to a port where nothing is listening and haven't reproduced the problem, so that's where we get the theory that racing with the death of the remote process matters (or at least the closing of the listening socket). I also believe this requires a real network connection and not just two processes on localhost (some time needs to pass between the

It's also possible that the missing link is something going on in the client process instead of specific network activity and timing. I had a theory back in #14548 that spurious wakeups were caused by reuse of
/cc @ianlancetaylor
OK, I have a repro: https://gist.github.com/bdarnell/2d37a812368bb83090ab60d36ceae3c4

In the repro, we start a bunch of goroutines that dial an address where nothing is listening. We then cancel these dials, attempting to race the cancellation against the arrival of the "connection refused" packet. When we hit the race, here's what I think is going on:
I see two possible solutions:
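For readers who don't want to open the gist, here is a minimal sketch of the approach described above: many goroutines dial an address where nothing is listening, and each dial is cancelled after a short delay so the cancellation can race with the incoming "connection refused". This is not the code from the gist. The names `raceDials` and `classify`, the placeholder address `127.0.0.1:1`, and the delay are assumptions made for illustration, and per the discussion above a loopback address is unlikely to hit the race; a real remote host would be needed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync"
	"syscall"
	"time"
)

// classify buckets the result of a single dial attempt.
func classify(err error) string {
	switch {
	case err == nil:
		return "connected"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "refused"
	default:
		return "other" // includes dials aborted by the cancellation
	}
}

// raceDials (hypothetical helper) starts n concurrent dials to addr and
// cancels each one after delay. It returns how many attempts ended in
// each bucket. "refused-on-write" is the symptom reported in this issue:
// Dial returned a nil error, but the first Write saw ECONNREFUSED.
func raceDials(addr string, n int, delay time.Duration) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		counts = make(map[string]int)
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithCancel(context.Background())
			defer cancel()
			// Cancel shortly after the dial starts, hoping to race
			// with the arrival of the refusal from the remote host.
			timer := time.AfterFunc(delay, cancel)
			defer timer.Stop()

			var d net.Dialer
			conn, err := d.DialContext(ctx, "tcp", addr)
			key := classify(err)
			if err == nil {
				_, werr := conn.Write([]byte("x"))
				conn.Close()
				if errors.Is(werr, syscall.ECONNREFUSED) {
					key = "refused-on-write"
				}
			}
			mu.Lock()
			counts[key]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	// Placeholder address; assumes nothing is listening on this port.
	fmt.Println(raceDials("127.0.0.1:1", 100, 50*time.Microsecond))
}
```

Any non-zero "connected" or "refused-on-write" count against an address with no listener would indicate that `Dial` returned before the connection attempt had actually finished.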
@ianlancetaylor, any theories? Or punt to Go 1.10.
@bdarnell Thanks for the test case and analysis. I am able to recreate the problem. One thing I don't yet understand, though, is that even with a spurious wakeup, the call to fetch
OK, I understand now.
CL https://golang.org/cl/45815 mentions this issue.
Nice find.
Thanks for this fix, we've been chasing the same thing googleapis/google-api-go-client#220 |
What version of Go are you using (`go version`)?

go version go1.8 linux/amd64

What operating system and processor architecture are you using (`go env`)?

What did you do?

Call `net.Dial` to connect to a remote host, and if it succeeds, `Write` to the connection. Simultaneously, `kill -9` the listening process.

What did you expect to see?

I expect `Dial` to either return with nil `error` or an error like "connection refused". I expect `Write` to either succeed or return a different type of error, such as "connection reset". Specifically, I do not expect to see "connection refused" from anything but `net.Dial`.

What did you see instead?

Sometimes `Write` returns "connection refused". The fact that this is a non-temporary error (but most other errors that are possible for `Write` are temporary according to `err.Temporary()`) can cause error handling to get confused (see grpc/grpc-go#1026). This occurs with some regularity (~daily) on a CockroachDB test cluster, although we have not yet been able to reduce this to something simpler.

My hypothesis is that the net poller sometimes causes spurious wakeups, leading `Dial` to return too soon (with nil `error`) because `getsockopt(..., SO_ERROR)` does not yet have an error to return. This was observed in #14548, where the problem was more severe on Darwin and so it was given a Darwin-specific fix. If spurious wakeups are possible on Linux too (this was not determined one way or the other in #14548), one way this would manifest is connect-time errors leaking out into other system calls.