Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialTimeout flake on linux-amd64-sid #22896

Closed
bradfitz opened this issue Nov 28, 2017 · 15 comments
Closed

net: TestDialTimeout flake on linux-amd64-sid #22896

bradfitz opened this issue Nov 28, 2017 · 15 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bradfitz
Copy link
Contributor

Is it supposed to be dialing 127.0.0.1:0 (the 0 port is weird):

https://build.golang.org/log/450b8cb212512d306aed03190d275f546c332715

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net	5.178s

/cc @mikioh

@bradfitz bradfitz added help wanted NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure. labels Nov 28, 2017
@bradfitz bradfitz added this to the Go1.10 milestone Nov 28, 2017
@ianlancetaylor
Copy link
Contributor

I've looked through all the code and I've convinced myself that this can not happen. I also have not been able to reproduce it.

That said, here is another instance:

https://build.golang.org/log/95a745a6f8fed78a9b854b3cbad5986275ffc0cd

@nilsmagnus
Copy link
Contributor

if nerr, ok := err.(Error); !ok || !nerr.Timeout() {
				t.Fatalf("#%d: %v", i, err)
}

It looks to mee that there is an error in the log-message, it should have logged nerr, not err .

@nilsmagnus
Copy link
Contributor

The :0 port is for testing purposes and comes from line 71, at least according to (https://golang.org/src/net/timeout_test.go)

@ianlancetaylor
Copy link
Contributor

@nilsmagnus Logging err rather than nerr is correct; if !ok then nerr will be nil so we want to log err, and if ok then logging err gives us the same value as logging nerr anyhow.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018
@agnivade
Copy link
Contributor

agnivade commented Jan 9, 2018

Binding on port 0 is a request to the kernel to look for a free port and bind to it. It actually does not bind to 0. Therefore, it should not be possible to connect to port 0. But that should not be an issue here since the socket filter will prevent the connection from happening in the first place.

However, in both the buildlet failures, the failure happens on item #5 which has the following values -

{
timeout - 50 * time.Millisecond, 
delta- 0, 
guard- 100 * time.Millisecond, 
max- time.Second
}

@bradfitz - Anything suspicious about these values that can cause a flake ? I tried very hard to reproduce but could not.

@ianlancetaylor
Copy link
Contributor

ping @mikioh for above question.

@mikioh
Copy link
Contributor

mikioh commented Jan 12, 2018

I guess that one possibility is that the buildbot, cmd/go or testing package did run the net.test process incorrectly, without the harness of net/internal/socktest package.

@ianlancetaylor
Copy link
Contributor

@mikioh I don't see any way that that could happen.

@mikioh
Copy link
Contributor

mikioh commented Jan 12, 2018

@ianlancetaylor,

I'm also not sure the mechanism, but the evidence, the test failed within 0.00s and attempted to connect to the literal port number zero (as a boobytrap set by me), leads me to the possibility.

@mikioh
Copy link
Contributor

mikioh commented Jan 12, 2018

not sure the mechanism

Hm, the harness is naively designed to be unstrapped before the execution of benchmarking, so it depends on the execution order of testing and benchmarking. It's likely that unstrapping happens if the testing goroutine is overtaken by one of benchmarking goroutines, like #21090.

@ianlancetaylor
Copy link
Contributor

Here are the cases I can find today. It's perhaps interesting that it always seems to be case 5.

2017-04-17T18:33:08-73f283f/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-21T23:43:59-49f54e8/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-22T00:53:51-c202d4d/linux-amd64-clang:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-28T15:32:18-6511931/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-08-28T05:26:59-eb07028/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-09-23T09:24:02-6171d05/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-10-05T08:28:41-be272ec/linux-ppc64le-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-10-08T20:14:33-e285f39/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-02T03:33:53-b66a294/linux-amd64:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-03T17:45:11-14f2bfd/linux-amd64-clang:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-10T04:40:13-17f35c6/linux-amd64-nocgo:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-22T22:37:15-ecc8650/linux-amd64-sid:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-28T01:18:54-b5c7183/linux-amd64-sid:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

@ianlancetaylor
Copy link
Contributor

These failures are occurring on the builders, which as far as I know don't run the benchmarks, so I don't think the problem could be the benchmarks uninstalling the test hooks.

@mikioh
Copy link
Contributor

mikioh commented Jan 26, 2018

Then, there might be a race on the callpath switch between test cases; enableSocketConnect? sounds plausible. Adding per-test case filter feature into socktest.Switch might be a solution.

@gopherbot
Copy link

Change https://golang.org/cl/105095 mentions this issue: net, net/internal/socktest: make concurrent socket system calls in test safe

@ianlancetaylor
Copy link
Contributor

This has not happened since 2017-11-28. I don't know what fixed it, but I think it must be fixed.

@golang golang locked and limited conversation to collaborators Jun 27, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants