
net/http: flaky tests on macOS Sierra #18541

Closed
rsc opened this issue Jan 6, 2017 · 20 comments
Labels: FrozenDueToAge, help wanted, OS-Darwin, Testing (an issue that has been verified to require only test changes, not just a test failure)

rsc commented Jan 6, 2017

If I run:

for i in $(seq 100)
do
	if go test -v -short >out.$i 2>&1
	then
		echo pass
	else
		echo fail out.$i
	fi
done

Then I have yet to get 100 passes in a row. Usually I don't even get 10 in a row. There are two failure modes right now.

The first is:

--- FAIL: TestServerEmptyBodyRace_h1 (0.01s)
	serve_test.go:3798: Get http://127.0.0.1:53482: read tcp 127.0.0.1:53484->127.0.0.1:53482: read: connection reset by peer
	serve_test.go:3811: handler ran 19 times; want 20

and the second is for TestTransportPersistConnLeak to hang and cause the process to be killed by the 10 minute watchdog. I suspect that's the same problem, but the test is not written to cope correctly with failures from Get. Working on that.

This issue is about why we get "read: connection reset by peer" so consistently on macOS Sierra. Maybe it's just a flaky test, but it looks like it may be more than that.

/cc @bradfitz

@rsc rsc added this to the Go1.8Maybe milestone Jan 6, 2017
rsc commented Jan 6, 2017

Still trying to check if I got the de-wedging of the TestTransportPersistConnLeak failure right, but now I'm getting this one repeatedly from TestTransportPersistConnReadLoopEOF instead.

Looks like TestTransportPersistConnReadLoopEOF calls getConn, which kicks off a goroutine to dial, which is blocked in waitWrite on the connect. Meanwhile, the listener is blocked in waitRead on the accept. I assume those are using the same address, in which case you'd think they would complete each other.

SIGQUIT: quit
PC=0x105ab7b m=0 sigcode=0

goroutine 45 [select]:
net/http.(*Transport).getConn(0xc42011c1e0, 0xc4202e73a0, 0x0, 0x14dc9b3, 0x4, 0xc42030a7f0, 0xf, 0x0, 0x0, 0xc420030708)
	/Users/rsc/go/src/net/http/transport.go:913 +0xb92
net/http.TestTransportPersistConnReadLoopEOF(0xc4202171e0)
	/Users/rsc/go/src/net/http/transport_internal_test.go:35 +0x2c7
testing.tRunner(0xc4202171e0, 0x15037f0)
	/Users/rsc/go/src/testing/testing.go:656 +0x93
created by testing.(*T).Run
	/Users/rsc/go/src/testing/testing.go:693 +0x2c4

goroutine 46 [IO wait]:
net.runtime_pollWait(0x187f4d8, 0x72, 0x0)
	/Users/rsc/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc42028b568, 0x72, 0x0, 0xc42032a000)
	/Users/rsc/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc42028b568, 0xffffffffffffffff, 0x0)
	/Users/rsc/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc42028b500, 0x0, 0x16e6420, 0xc42032a000)
	/Users/rsc/go/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc42000e268, 0xc400000008, 0xc420326000, 0x0)
	/Users/rsc/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc42000e268, 0x1504fc0, 0xc4202fcb40, 0x1085ca2, 0x586fb229)
	/Users/rsc/go/src/net/tcpsock.go:228 +0x49
net/http.TestTransportPersistConnReadLoopEOF.func1(0xc4202fcb40, 0x16eb3e0, 0xc42000e268, 0xc4202171e0)
	/Users/rsc/go/src/net/http/transport_internal_test.go:23 +0x5f
created by net/http.TestTransportPersistConnReadLoopEOF
	/Users/rsc/go/src/net/http/transport_internal_test.go:29 +0xdf

goroutine 47 [IO wait]:
net.runtime_pollWait(0x187f418, 0x77, 0xc420051301)
	/Users/rsc/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc42028b5d8, 0x77, 0x0, 0x0)
	/Users/rsc/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitWrite(0xc42028b5d8, 0xc420014f68, 0x16ebf20)
	/Users/rsc/go/src/net/fd_poll_runtime.go:84 +0x34
net.(*netFD).connect(0xc42028b570, 0x16ebf20, 0xc420014f68, 0x0, 0x0, 0x16e66a0, 0xc4202e7400, 0x0, 0x0)
	/Users/rsc/go/src/net/fd_unix.go:156 +0x24e
net.(*netFD).dial(0xc42028b570, 0x16ebf20, 0xc420014f68, 0x16ed480, 0x0, 0x16ed480, 0xc4202c7e00, 0xc42003d520, 0x11dfade)
	/Users/rsc/go/src/net/sock_posix.go:140 +0xe9
net.socket(0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0x2, 0x1, 0x0, 0x0, 0x16ed480, 0x0, ...)
	/Users/rsc/go/src/net/sock_posix.go:92 +0x1a1
net.internetSocket(0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0x16ed480, 0x0, 0x16ed480, 0xc4202c7e00, 0x1, 0x0, ...)
	/Users/rsc/go/src/net/ipsock_posix.go:161 +0x129
net.doDialTCP(0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0x0, 0xc4202c7e00, 0x1744570, 0xc42003d720, 0x1744570)
	/Users/rsc/go/src/net/tcpsock_posix.go:62 +0xb9
net.dialTCP(0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0x0, 0xc4202c7e00, 0xc420014f68, 0x14dc62e, 0x3)
	/Users/rsc/go/src/net/tcpsock_posix.go:58 +0xe4
net.dialSingle(0x16ebf20, 0xc420014f68, 0xc4202d5200, 0x16e91e0, 0xc4202c7e00, 0x0, 0x0, 0x0, 0x0)
	/Users/rsc/go/src/net/dial.go:521 +0x479
net.dialSerial(0x16ebf20, 0xc420014f68, 0xc4202d5200, 0xc42030a840, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
	/Users/rsc/go/src/net/dial.go:489 +0x220
net.(*Dialer).DialContext(0x1727f40, 0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0xc42030a7f0, 0xf, 0x0, 0x0, 0x0, ...)
	/Users/rsc/go/src/net/dial.go:371 +0x70b
net/http.(*Transport).dial(0xc42011c1e0, 0x16ebf20, 0xc420014f68, 0x14dc62e, 0x3, 0xc42030a7f0, 0xf, 0x0, 0x0, 0x0, ...)
	/Users/rsc/go/src/net/http/transport.go:858 +0x197
net/http.(*Transport).dialConn(0xc42011c1e0, 0x16ebf20, 0xc420014f68, 0x0, 0x14dc9b3, 0x4, 0xc42030a7f0, 0xf, 0x0, 0x0, ...)
	/Users/rsc/go/src/net/http/transport.go:1008 +0x1c4c
net/http.(*Transport).getConn.func4(0xc42011c1e0, 0x16ebf20, 0xc420014f68, 0xc4202c7d10, 0xc4202fcba0)
	/Users/rsc/go/src/net/http/transport.go:908 +0x78
created by net/http.(*Transport).getConn
	/Users/rsc/go/src/net/http/transport.go:910 +0x3a3

rsc commented Jan 6, 2017

Re last comment, my system got into a state where no connections to localhost worked at all, even after stopping the Apple firewall completely. Even 'telnet 127.0.0.1' hung (instead of being rejected). Rebooting cleared that. Now I just see lots of the 'connection reset by peer' in various tests.

Still gathering data.

bradfitz commented Jan 6, 2017

After I went on my parallelization spree (which dropped short tests from ~30s to ~3s), I found I missed some global state in a few and servers & clients were being shut down by unrelated tests. I thought I'd fixed them all, but this might be another case.

I'll test on my Sierra desktop here.

@bradfitz bradfitz self-assigned this Jan 6, 2017
rsc commented Jan 6, 2017

This is typical: 18/100 failures of go test -short net/http. It's always at least ~10%.

If I turn off the Mac firewall, the failures disappear: 0/100. So the Mac firewall is breaking things.

Question is what we should do. Sometimes the read failure comes back from t.getConn (before we've even sent a request), other times from pconn.roundTrip (trying to read the response). In general, if we are issuing a GET and we get a read failure, does it make sense to retry even once inside net/http instead of handing that off to all users? I thought I had seen some code to do that at one point.

At least I understand the problem, and that it's not net/http. No longer for Go 1.8.

/cc @bradfitz

--- FAIL: TestTransportConcurrency (0.00s)
	transport_test.go:1338: error on req request-3: Get http://127.0.0.1:56615/?echo=request-3: read tcp 127.0.0.1:56616->127.0.0.1:56615: read: connection reset by peer
--
--- FAIL: TestTransportDiscardsUnneededConns (0.02s)
	clientserver_test.go:1000: Get: Get https://127.0.0.1:58528: read tcp 127.0.0.1:58532->127.0.0.1:58528: read: connection reset by peer
--
--- FAIL: TestTransportPersistConnLeak (0.00s)
	transport_test.go:1110: client fetch error: Get http://127.0.0.1:59800: read tcp 127.0.0.1:59801->127.0.0.1:59800: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.01s)
	serve_test.go:3798: Get https://127.0.0.1:62729: read tcp 127.0.0.1:62731->127.0.0.1:62729: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.02s)
	serve_test.go:3798: Get https://127.0.0.1:64327: read tcp 127.0.0.1:64329->127.0.0.1:64327: read: connection reset by peer
--
--- FAIL: TestTransportDiscardsUnneededConns (0.02s)
	clientserver_test.go:1000: Get: Get https://127.0.0.1:54481: dial tcp 127.0.0.1:54481: getsockopt: connection reset by peer
--
--- FAIL: TestTimeoutHandlerRaceHeader (0.00s)
	serve_test.go:2024: Get http://127.0.0.1:54738: read tcp 127.0.0.1:54739->127.0.0.1:54738: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.02s)
	serve_test.go:3798: Get https://127.0.0.1:59287: read tcp 127.0.0.1:59288->127.0.0.1:59287: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.11s)
	serve_test.go:3798: Get https://127.0.0.1:52769: read tcp 127.0.0.1:52772->127.0.0.1:52769: read: connection reset by peer
--
--- FAIL: TestRetryIdempotentRequestsOnError (0.00s)
	transport_test.go:2589: Get http://127.0.0.1:54153: read tcp 127.0.0.1:54154->127.0.0.1:54153: read: connection reset by peer
--
--- FAIL: TestTransportPersistConnLeak (0.00s)
	transport_test.go:1110: client fetch error: Get http://127.0.0.1:59721: read tcp 127.0.0.1:59722->127.0.0.1:59721: read: connection reset by peer
--
--- FAIL: TestRetryIdempotentRequestsOnError (0.00s)
	transport_test.go:2606: Get http://127.0.0.1:60903: read tcp 127.0.0.1:60906->127.0.0.1:60903: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.11s)
	serve_test.go:3798: Get https://127.0.0.1:60971: read tcp 127.0.0.1:60972->127.0.0.1:60971: read: connection reset by peer
--
--- FAIL: TestTransportDiscardsUnneededConns (0.12s)
	clientserver_test.go:1000: Get: Get https://127.0.0.1:64291: read tcp 127.0.0.1:64305->127.0.0.1:64291: read: connection reset by peer
--
--- FAIL: TestTransportConcurrency (0.01s)
	transport_test.go:1338: error on req request-0: Get http://127.0.0.1:50162/?echo=request-0: read tcp 127.0.0.1:50163->127.0.0.1:50162: read: connection reset by peer
--
--- FAIL: TestTransportDiscardsUnneededConns (0.12s)
	clientserver_test.go:1000: Get: Get https://127.0.0.1:53562: read tcp 127.0.0.1:53582->127.0.0.1:53562: read: connection reset by peer
--
--- FAIL: TestTransportPersistConnLeak (0.00s)
	transport_test.go:1110: client fetch error: Get http://127.0.0.1:55092: read tcp 127.0.0.1:55096->127.0.0.1:55092: read: connection reset by peer
--
--- FAIL: TestTransportConcurrency (0.10s)
	transport_test.go:1338: error on req request-0: Get http://127.0.0.1:63500/?echo=request-0: read tcp 127.0.0.1:63501->127.0.0.1:63500: read: connection reset by peer
--
--- FAIL: TestTransportPersistConnLeak (0.00s)
	transport_test.go:1110: client fetch error: Get http://127.0.0.1:51099: read tcp 127.0.0.1:51101->127.0.0.1:51099: read: connection reset by peer
--
--- FAIL: TestTimeoutHandlerRaceHeader (0.00s)
	serve_test.go:2024: Get http://127.0.0.1:58952: read tcp 127.0.0.1:58953->127.0.0.1:58952: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.11s)
	serve_test.go:3798: Get https://127.0.0.1:59166: read tcp 127.0.0.1:59168->127.0.0.1:59166: read: connection reset by peer

@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Jan 6, 2017
bradfitz commented Jan 6, 2017

Well, that explains why I couldn't reproduce. I wasn't using the firewall.

The http package does retry idempotent requests, but it makes the assumption that requests' conn reads/writes can only mysteriously break on the 2nd and subsequent requests on a conn, but not the first. That base case also has the nice side effect of preventing endless retry loops, since eventually a request will require a new conn (after exhausting the idle conn pool), and if that new conn fails, perhaps there was something just bad about the request or network.

@josharian
As mentioned in #18751, I've seen this once (and only once). I don't know what the OS X firewall is or whether I have it enabled, but I do run Little Snitch.

On the off chance that it is relevant: Little Snitch was (I suspect) the source of net test failures ages ago, because it was returning EAGAIN in places where package net did not anticipate it. Those were fixed.

crawshaw commented Feb 3, 2017

I see this on about 50% of test runs on a Google-managed laptop running 10.12.2. A sample:

--- FAIL: TestServerEmptyBodyRace_h1 (0.01s)
        serve_test.go:3798: Get http://127.0.0.1:51520: read tcp 127.0.0.1:51521->127.0.0.1:51520: read: connection reset by peer
        serve_test.go:3811: handler ran 19 times; want 20
--
--- FAIL: TestTransportConcurrency (0.01s)
        transport_test.go:1339: error on req request-7: Get http://127.0.0.1:53866/?echo=request-7: read tcp 127.0.0.1:53869->127.0.0.1:53866: read: connection reset by peer
--- FAIL: TestTransportDiscardsUnneededConns (0.02s)
        clientserver_test.go:1000: Get: Get https://127.0.0.1:54232: read tcp 127.0.0.1:54248->127.0.0.1:54232: read: connection reset by peer
--
--- FAIL: TestServerEmptyBodyRace_h1 (0.01s)
        serve_test.go:3798: Get http://127.0.0.1:56410: read tcp 127.0.0.1:56414->127.0.0.1:56410: read: connection reset by peer
        serve_test.go:3811: handler ran 19 times; want 20
--
--- FAIL: TestServerEmptyBodyRace_h1 (0.01s)
        serve_test.go:3798: Get http://127.0.0.1:57438: read tcp 127.0.0.1:57439->127.0.0.1:57438: read: connection reset by peer
        serve_test.go:3811: handler ran 19 times; want 20
--
--- FAIL: TestServerEmptyBodyRace_h2 (0.11s)
        serve_test.go:3798: Get https://127.0.0.1:58335: read tcp 127.0.0.1:58337->127.0.0.1:58335: read: connection reset by peer
        serve_test.go:3811: handler ran 19 times; want 20
--
--- FAIL: TestServerEmptyBodyRace_h1 (0.01s)
        serve_test.go:3798: Get http://127.0.0.1:58956: read tcp 127.0.0.1:58967->127.0.0.1:58956: read: connection reset by peer
        serve_test.go:3811: handler ran 19 times; want 20
--
--- FAIL: TestTransportConcurrency (0.10s)
        transport_test.go:1339: error on req request-4: Get http://127.0.0.1:59616/?echo=request-4: read tcp 127.0.0.1:59617->127.0.0.1:59616: read: connection reset by peer

Similar to @rsc's description above, if I turn off the firewall it starts always passing.

dmitshur commented Feb 3, 2017

@crawshaw What version of Go was that?

crawshaw commented Feb 3, 2017

12c58bb

zchee commented Feb 3, 2017

@crawshaw Hi,
As I said in #18751 (comment), I'm also interested in this firewall bug.
Which macOS build version are you using? The 10.12.2 stable build versions are 16C67 and 16C68.

There are two kinds, as mentioned here:
https://macops.ca/macos-installers-updated-field
They differ in small ways; the differences are listed here:
https://gist.github.com/bruienne/84c447817381b23109b3dfdbe2bea6d1

It might not be related to these, but I'd like to debug in the same environment.

crawshaw commented Feb 3, 2017

I am on macOS 10.12.2, 16C67. There may be corp-managed software on the laptop I don't know about, but as this bug responds to turning the macOS firewall on and off, I hope that is unrelated.

zchee commented Feb 3, 2017

@crawshaw Thanks.
Hmm, in my project (the docker-machine xhyve driver), there were also reports that a corp-managed MacBook couldn't create a hypervisor VM because of a network interface error.
Anyway, I have macOS installers for all builds, so I will try the test in the same environment.

Edit: If you meant that 16C67 is the corp-managed version, it isn't — a pure 16C67 is a normal build. Apple sometimes assigns a different build number when a release includes a new installer:

  • 16C67 was released 2016-12-16T02:12:47Z
  • 16C68 was released 2016-12-16T12:30:42Z

If you meant "my laptop had corp-managed software installed after the OS", then that could be a cause.

dmitshur commented Feb 3, 2017

Fair warning: if I run that enough times then my Mac stops being able to make new connections to localhost at all (for example, 'telnet localhost 12' hangs instead of getting an immediate connection refused). It appears that rebooting is the only way to clear that condition.

So this happened during all.bash installing 12c58bb (bootstrapping with go1.8rc3). The progress stalled on:

ok  	mime/quotedprintable	0.185s

The next package would be net, but it never completed after 5+ mins. At the same time, telnet localhost 12 stalls and ping 127.0.0.1 times out.

Interestingly, this is with firewall off (and macOS 10.12.3 (16D32)). It's also a personal laptop, so there's no corp-managed software here.

rsc commented Feb 3, 2017

@shurcooL did your system come out of that state (no localhost traffic), or did you have to reboot?

dmitshur commented Feb 3, 2017

@rsc I had to reboot. I waited a few minutes after ^C'ing the all.bash process, but there were no changes, telnet localhost 12 and ping 127.0.0.1 continued to fail.

After rebooting, I've run all.bash a second time and installed 12c58bb successfully.

Since then, I've turned the firewall on again and run the net/http tests continuously... ~30 runs so far, zero failures. Edit: all 100 runs passed successfully.

rsc commented Feb 3, 2017

OK, so the connection resets seem at least related to the presence of a network filter kext that Google has loaded onto our laptops. It hasn't changed in a few years, though, so this could still be a bad Sierra interaction, but one that only happens when using the kernel sflt_register functionality.

From @shurcooL's experience it sounds like the localhost network stack wedge may be a distinct problem (obviously not caused by the Google kext), but maybe still introduced in Sierra.

@josharian
I just hit another localhost network stack wedge running package net's tests. (Is that this bug or #18751?) No Google kext here, but yes com.apple.nke.applicationfirewall. In case it helps, here are dtruss output and stack traces for the hang.

I'll leave my laptop in this broken state overnight in case there's any further info I should grab before rebooting.

rsc commented Feb 6, 2017

@josharian Yes, that's the usual hang - the stack traces match what I've seen. It looks like what we really need is kernel stacks but I haven't yet figured out how to do that. Apple has things kind of locked down now it seems.

@bradfitz
This hasn't recurred (that I've noticed) since I updated the macOS Sierra builders to destroy & recreate their VMWare machines after each build.

@bradfitz bradfitz added the Testing label May 24, 2017
@bradfitz
Closing. #18751 tracks the more general Sierra weirdnesses.
