net: TestDialTimeout flake #11872

Closed
aclements opened this issue Jul 26, 2015 · 6 comments

@aclements
Member

I've been running run.bash in a loop on my linux/amd64 workstation for the past few days and out of ~500 runs, I've had three failures of TestDialTimeout with:

--- FAIL: TestDialTimeout (0.00s)
        timeout_test.go:82: #3: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL    net     2.223s

My tree has my GC changes from CL 12674 applied, but I don't think those are related. The latest commit from master is ae1ea2a. This shows up occasionally on the dashboard as well:

2015-06-05T18:51:23-d64cdde/linux-amd64-nocgo
2015-06-05T18:51:23-d64cdde/linux-amd64-noopt
2015-06-11T14:33:40-0beb931/solaris-amd64-smartos
2015-06-14T20:54:01-48d865a/linux-amd64-noopt
2015-07-05T03:36:56-1edf489/linux-amd64-noopt
2015-07-14T00:07:31-337b7e7/linux-amd64
2015-07-15T23:28:42-08dbd8a/solaris-amd64-smartos
2015-07-17T22:46:05-955c0fd/android-arm-crawshaw
2015-07-23T17:10:28-e0ac5c5/linux-386-sid

This may be related to #11474, though the error sounds different.

@mikioh

@aclements aclements added this to the Go1.5Maybe milestone Jul 26, 2015
@mikioh
Contributor

mikioh commented Jul 27, 2015

It's probably because both test cases, TestDialTimeout and TestDialTimeoutFDLeak, depend on a wrong assumption about the runtime scheduler. The flakiness in net/http looks unrelated, though.

--- FAIL: TestIdleConnChannelLeak (0.01s)
    transport_test.go:1949: Get http://foo-host-0.tld/: dial tcp 127.0.0.1:44835: getsockopt: connection refused
FAIL
FAIL    net/http    6.978s

@mikioh
Contributor

mikioh commented Jul 27, 2015

@aclements,

Can you give https://golang.org/cl/12691 a shot?

@gopherbot

CL https://golang.org/cl/12691 mentions this issue.

@mikioh mikioh closed this as completed in 68557de Jul 27, 2015
@aclements
Member Author

I don't really understand why that CL will fix this (or, indeed, what it's really fixing), but I'll update and restart my run.bash loop.

@aclements
Member Author

Previously I was able to reproduce this twice in 322 iterations. With this change I've gotten through 1,500 iterations without a failure.

@mikioh mikioh modified the milestones: Go1.5, Go1.5Maybe Jul 28, 2015
@mikioh
Contributor

mikioh commented Jul 30, 2015

Thanks for the confirmation. The root cause is the current corner-cutting implementation of the socktest package. For testing, it tracks socket calls using the socket descriptor number as a key and, for simplicity, does not account for quick recycling of socket descriptor numbers. It may therefore confuse socket descriptors in a situation like the following:

G1  CALL socket
G1  RET socket = 3
G1  socktest.register(3)
G1  CALL close(3)
G2  CALL socket
G2  RET socket = 3
G2  socktest.register(3) // eviction
G1  RET close
G1  socktest.unregister(3) // eviction
G2  CALL connect(3, ...)
G2  socktest.connect(3, ...) // socktest.switch redirects the call to syscall, which results in the "dial tcp 127.0.0.1:0: getsockopt: connection refused" error

This may happen in selfConnect, in the case of TCP simultaneous open; Linux is one of the platforms where TCP simultaneous open happens easily.
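
The interleaving above can be replayed deterministically with a small sketch. This is not the socktest package's code: the registry type, its methods, and the replay function are hypothetical stand-ins for any table keyed only by descriptor number, and the G1/G2 steps are run in a fixed order instead of on real goroutines.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a tracking table keyed by
// descriptor number. It is not the real socktest API.
type registry struct {
	socks map[int]string // descriptor number -> owner label
}

func newRegistry() *registry {
	return &registry{socks: make(map[int]string)}
}

func (r *registry) register(fd int, owner string) { r.socks[fd] = owner }
func (r *registry) unregister(fd int)             { delete(r.socks, fd) }

// connect reports whether a connect call on fd would still be tracked
// (and so handled by a test filter) or would fall through untracked.
func (r *registry) connect(fd int) string {
	if owner, ok := r.socks[fd]; ok {
		return "filtered for " + owner
	}
	return "untracked: falls through to syscall"
}

// replay runs the steps of the trace in the problematic order.
func replay() string {
	r := newRegistry()
	r.register(3, "G1") // G1: socket returns 3 and is registered
	// G1: close(3) is in progress, so the kernel may hand out 3 again.
	r.register(3, "G2") // G2: socket returns the recycled 3, replacing G1's entry
	r.unregister(3)     // G1: close returns; the late unregister drops G2's entry
	return r.connect(3) // G2: connect(3, ...) finds no entry
}

func main() {
	fmt.Println(replay()) // prints "untracked: falls through to syscall"
}
```

One way to avoid this kind of confusion, offered only as a general design note, is to key entries on something that is not reused while a close is in flight, or to perform the unregister before the descriptor can be recycled.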

liamsi added a commit to dedis/cothority that referenced this issue Dec 9, 2015
@golang golang locked and limited conversation to collaborators Aug 5, 2016