net/http: TestServerAllowsBlockingRemoteAddr flake on linux-amd64-race #19161

Closed
mdempsky opened this issue Feb 17, 2017 · 35 comments
Labels
FrozenDueToAge, help wanted, Testing (an issue that has been verified to require only test changes, not just a test failure)
Milestone

Comments

@mdempsky
Member

While trybotting https://go-review.googlesource.com/#/c/37149/

https://storage.googleapis.com/go-build-log/aa3dd3e6/linux-amd64-race_252fe50a.log

--- FAIL: TestServerAllowsBlockingRemoteAddr (1.68s)
	serve_test.go:959: Get http://127.0.0.1:41257: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	serve_test.go:1008: response 1 addr = ""; want "RA:21.21.21.21:21"
1 second passes in backend, proxygone= false
FAIL
FAIL	net/http	13.395s
@bradfitz
Contributor

bradfitz commented Mar 8, 2017

And linux-ppc64le:

https://build.golang.org/log/f580613a303c908044253788826180d2c59337c1

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
	serve_test.go:1000: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
FAIL
FAIL	net/http	6.647s

And I used to see it on Solaris a lot before we quadrupled their CPU from 0.5 to 2 vCPUs.

@bradfitz bradfitz self-assigned this Mar 8, 2017
@bradfitz bradfitz added the Testing label (an issue that has been verified to require only test changes, not just a test failure) Mar 8, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Mar 8, 2017
@bradfitz
Contributor

bradfitz commented Mar 8, 2017

I spent some time looking at this test. It's not obviously doing anything racy or wrong.

Any eyeballs welcome.

@josharian
Contributor

This is showing up a lot. Mark as flaky for the time being?

@bradfitz
Contributor

Sure, we could do that. Feel free to send a CL. But while doing so, can you try to spot the flakiness? I apparently failed.

@josharian
Contributor

I think that the test just gets starved. A few observations:

If you add

		time.Sleep(500 * time.Millisecond)

to the HandlerFunc near the beginning of the function, I reliably get a test failure:

--- FAIL: TestServerAllowsBlockingRemoteAddr (1.00s)
	serve_test.go:952: Request 1: Get http://127.0.0.1:60996: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	serve_test.go:1001: response 1 addr = ""; want "RA:21.21.21.21:21"
FAIL

As further evidence of starvation, greplogs for this test going back several months yields this distribution of builders:

 90.08%   463 solaris-amd64-smartosbuildlet
  3.50%    18 plan9-arm
  2.33%    12 plan9-amd64-9front
  0.78%     4 netbsd-amd64-70
  0.58%     3 freebsd-arm-paulzhol
  0.58%     3 linux-ppc64-buildlet
  0.39%     2 linux-386-clang
  0.39%     2 linux-amd64-clang
  0.39%     2 linux-amd64-race
  0.39%     2 linux-ppc64le-buildlet
  0.39%     2 netbsd-386-minux
  0.19%     1 linux-amd64-noopt

and we know solaris was resource-starved until recently.

This test is one of many marked as parallel in net/http, which could be leading to the starvation.

We could try increasing the http client's timeout, but given that we're already blowing through a full second, I think a better plan is to not have this test be parallel. I'll send a CL to that effect.
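For readers without serve_test.go open: the test hinges on a listener whose accepted conns block inside RemoteAddr() until the test feeds them a fake address, and the server only runs the handler once RemoteAddr() has returned. Below is a rough sketch of that shape (hedged: the type and variable names here are approximate, not the actual test code) showing why a starved goroutine plus a roughly one-second client timeout produces exactly the failure above.

package main

import (
	"fmt"
	"io/ioutil"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// blockingAddrConn's RemoteAddr blocks until the test supplies a fake address.
type blockingAddrConn struct {
	net.Conn
	addrs chan net.Addr
}

func (c *blockingAddrConn) RemoteAddr() net.Addr { return <-c.addrs }

// blockingAddrListener hands each accepted conn to the test via conns.
type blockingAddrListener struct {
	net.Listener
	conns chan *blockingAddrConn
}

func (l *blockingAddrListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	bc := &blockingAddrConn{Conn: c, addrs: make(chan net.Addr, 1)}
	l.conns <- bc
	return bc, nil
}

func main() {
	conns := make(chan *blockingAddrConn, 2)
	ts := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "RA:%s", r.RemoteAddr) // RemoteAddr comes from the conn, which blocks
	}))
	ts.Listener = &blockingAddrListener{Listener: ts.Listener, conns: conns}
	ts.Start()
	defer ts.Close()

	client := &http.Client{Timeout: time.Second} // a short timeout, as in the test
	got := make(chan string, 1)
	go func() {
		res, err := client.Get(ts.URL)
		if err != nil {
			got <- "error: " + err.Error()
			return
		}
		defer res.Body.Close()
		b, _ := ioutil.ReadAll(res.Body)
		got <- string(b)
	}()

	// The request is stuck in RemoteAddr() until we feed it an address. If this
	// part of the test is starved for about a second, the client times out first,
	// which is the "Client.Timeout exceeded while awaiting headers" failure mode.
	c := <-conns
	c.addrs <- &net.TCPAddr{IP: net.IPv4(12, 12, 12, 12), Port: 12}
	fmt.Println(<-got) // "RA:12.12.12.12:12"
}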

@gopherbot

CL https://golang.org/cl/38005 mentions this issue.

@josharian
Contributor

Re-opening and moving to unplanned, to investigate whether the test can be made robust while still marked parallel. See the discussion in CL 38005.

@josharian josharian reopened this Mar 10, 2017
@josharian josharian modified the milestones: Unplanned, Go1.9 Mar 10, 2017
@mundaym
Member

mundaym commented Mar 13, 2017

Is it worth making a new issue (or repurposing this one fully) for the response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" failure mode? We've been seeing it a lot recently even after the parallelism was removed:

linux/amd64-clang: https://build.golang.org/log/dde822640469f46493857c984fa4a2dca551817f
linux/ppc64le: https://build.golang.org/log/f80618f409ad8e07c6acb503e66d1445644c7700
linux/ppc64le: https://build.golang.org/log/724f466c6e7f7f2f3a67dea9379cdb38a6bbb1e4

I took a quick glance at the test and it looks like this error means that the same message gets sent on both channels. It seems strange to me that resource starvation would be the root cause of such an issue, so maybe it is unrelated to the timeout problem?

@bradfitz
Contributor

We can reuse this issue.

@josharian josharian modified the milestones: Go1.9, Unplanned Mar 13, 2017
@josharian
Contributor

Here are some recent failures, according to greplogs/fetchlogs:

2017-03-08T18:27:19-6fbedc1/linux-amd64-noopt
2017-03-08T18:52:12-d71f36b/linux-ppc64-buildlet
2017-03-08T18:58:30-29edf0f/linux-ppc64-buildlet
2017-03-08T22:05:52-23be728/linux-ppc64le-buildlet
2017-03-09T16:29:40-b2c391b/linux-amd64-clang
2017-03-09T16:42:46-3b68922/linux-amd64-clang
2017-03-09T18:36:49-e57350f/linux-ppc64le-buildlet
2017-03-09T21:34:14-e471ad9/linux-386-clang
2017-03-09T23:28:48-1f9f0ea/linux-386-clang
2017-03-09T23:39:01-c63ad97/linux-ppc64-buildlet
2017-03-11T00:39:23-2de773d/linux-amd64-clang
2017-03-11T01:19:13-fe34585/linux-ppc64le-buildlet
2017-03-13T04:00:12-dd0e1ac/linux-ppc64le-buildlet
2017-03-13T19:39:15-a51e4cc/netbsd-amd64-70
2017-03-13T20:22:39-2e7c3b3/linux-ppc64le-buildlet

In the hopes of bisecting, I just tried reproducing in a loop on a linux/amd64 gomote with tip and

gomote -user josharian run user-josharian-linux-amd64-0 go/bin/go test -count=1000000 -run=TestServerAllowsBlockingRemoteAddr -timeout=1h net/http

and got no failures. (I tried a similar thing locally on darwin/amd64 and also got no failures.) The linux-amd64-clang gomote won't boot at all.

I'm giving up for now, just leaving breadcrumbs for others. If anyone wants to stress-test this test (or all of the net/http tests) on some Linux hardware and see whether they can reliably reproduce it on demand, that'd be a first step.

@bradfitz
Contributor

The linux-amd64-clang gomote won't boot at all.

Did you wait a full 40-60 seconds? It's a VM still, IIRC, and not a dedicated machine or container. (It could be a container, but I don't think it's been migrated.)

@josharian
Contributor

josharian commented Mar 13, 2017

gomote waited 180s and timed out.

@bradfitz
Contributor

Maybe there was a burst of commits and we were maxed out on GCE quota. Prioritizing some types of buildlet users over others is #19178.

@siebenmann

I'm sporadically seeing the response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" test failures on my Linux machines. This is with physical hardware, not virtualized, and just running as a normal user on generally unloaded workstations.

GOARCH="amd64"
GOOS="linux"

@bradfitz
Contributor

@siebenmann, yeah, that one's been on the dashboard recently too.

I need to understand that test more. It was added in b58515b by @glasser. David, could you also take a look and figure out why it's flaky?

@glasser
Contributor

glasser commented Mar 18, 2017

Hmm. I agree that if a machine is overloaded then the timeout could be too small, but people are seeing the wrong RemoteAddr being returned (the final check of the test) without any other errors? Did this start recently? The test in question is 1.5 years old. @siebenmann, are you just seeing this in the context of bigger test runs, or are you able to reproduce it on its own?

@siebenmann

I've only seen this when building Go through ./all.bash, but I haven't tried to run this test outside of the build process. As far as I remember I've only seen this on its own (without other errors) and only occasionally, and I think it started happening only relatively recently (although I can't say how recently because I've ignored it before now).

(If the go build status is good in general, I've tended to assume that any test or build failures that I run into are specific to my machines, especially if they're sporadic.)

@glasser
Contributor

glasser commented Mar 18, 2017

Could there be a recently introduced bug in DisableKeepAlives? The main way I could imagine that particular failure is if both client.Get calls used the same connection.

If that's the case one could de-flake by using separate Transports and Clients or perhaps by setting Connection: close on one side or the other, but there's probably an actual bug to fix...
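To make that de-flaking idea concrete, here is a minimal hedged sketch (illustrative only; the handler and names are stand-ins, not the actual test code) in which each request gets its own Client and Transport with keep-alives disabled, so the second Get can never pick the first Get's connection out of an idle pool.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "RA:%s", r.RemoteAddr)
	}))
	defer ts.Close()

	for i := 0; i < 2; i++ {
		// A fresh client per request, with keep-alives off, so each Get dials
		// its own connection and tears it down afterwards.
		c := &http.Client{
			Transport: &http.Transport{DisableKeepAlives: true},
			Timeout:   time.Second,
		}
		res, err := c.Get(ts.URL)
		if err != nil {
			fmt.Println("request", i, "error:", err)
			continue
		}
		b, _ := ioutil.ReadAll(res.Body)
		res.Body.Close()
		fmt.Println("request", i, "->", string(b))
	}
}

DisableKeepAlives makes the Transport send Connection: close and use each connection for a single request, which has much the same effect as setting the header by hand on one side or the other.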

@glasser
Contributor

glasser commented Mar 18, 2017

Although hmm, that theory doesn't make much sense because we do verify that Accept happens twice.

@glasser
Contributor

glasser commented Mar 18, 2017

Oh. It looks like 3b988eb lost the use of a Transport with DisableKeepAlives? I'm on my phone now, but at a quick glance it doesn't look like ts.Client uses a non-default Transport.

@glasser
Contributor

glasser commented Mar 18, 2017

And if I'm correct, it looks like a whole bunch of other tests lost DisableKeepAlives in that commit. I suppose some of them may have merely had the flag set to avoid overlapping with other tests on the same Transport, but who knows how many actually relied on it internally. I probably should have put a more explicit comment on that line.
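If that is what happened, putting the flag back on the shared client is a small change. A hedged sketch (assuming the client returned by httptest.Server.Client carries a plain *http.Transport, which is how I understand CL 37771 wires it up):

package example_test

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestNoKeepAlivesSketch(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "RA:%s", r.RemoteAddr)
	}))
	defer ts.Close()

	c := ts.Client() // the shared client that CL 37771 switched tests over to
	// Sketch: flip DisableKeepAlives back on, as the pre-CL-37771 version of
	// the test had it, so no two requests can share a connection.
	if tr, ok := c.Transport.(*http.Transport); ok {
		tr.DisableKeepAlives = true
	}

	res, err := c.Get(ts.URL)
	if err != nil {
		t.Fatal(err)
	}
	res.Body.Close()
}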

@glasser
Contributor

glasser commented Mar 18, 2017

Note that the commit in question merged after this issue was opened but before all the comments on it.

@glasser
Contributor

glasser commented Mar 18, 2017

I'm still not clear on how the second Accept succeeded in this case. And the connection surely isn't "idle". But losing the flag is definitely wrong. Will stop updating a zillion times from my phone now :)

@glasser
Contributor

glasser commented Mar 18, 2017

I actually don't have a great understanding of when http.Transport will send two requests in parallel on the same connection. (And is HTTP/2 perhaps involved? Some of the build logs above have errors that mention http2.)
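One way to answer that empirically is net/http/httptrace, which can report whether a request's connection was reused. A minimal hedged sketch (illustrative, not part of the test suite):

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

func main() {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "RA:%s", r.RemoteAddr)
	}))
	defer ts.Close()

	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("got conn: reused=%v wasIdle=%v idleTime=%v\n", info.Reused, info.WasIdle, info.IdleTime)
		},
		PutIdleConn: func(err error) {
			fmt.Printf("put idle conn: err=%v\n", err)
		},
	}

	client := &http.Client{}
	for i := 0; i < 2; i++ {
		req, err := http.NewRequest("GET", ts.URL, nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		res, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		b, _ := ioutil.ReadAll(res.Body) // read to EOF so the conn can go back to the idle pool
		res.Body.Close()
		fmt.Printf("request %d body=%q\n", i, b)
	}
}

A Reused:true on the second request's GotConn would confirm that both responses came over a single keep-alive connection.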

@siebenmann

I just had a build failure that also had an http2 test failure:

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
        serve_test.go:1018: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
2017/03/19 01:09:21 http2: server: error reading preface from client 127.0.0.1:50232: read tcp 127.0.0.1:35221->127.0.0.1:50232: use of closed file or network connection
FAIL
FAIL    net/http        3.170s

As usual, redoing the build had everything go fine.

@bradfitz
Contributor

@siebenmann, that's log spam from an unrelated test I believe.

@eliasnaur
Contributor

FWIW, this is happening on darwin/arm64 as well:

https://build.golang.org/log/d6c544a1382073835ea733e4eaf094c243d915a3

@gopherbot

CL https://golang.org/cl/38373 mentions this issue.

@glasser
Contributor

glasser commented Mar 20, 2017

I'd love to dig in more to see if I can reproduce, but it's been a while since I've dabbled in Go core. Can somebody remind me how to run a specific test without having to run all of the big slow all.bash (the only thing mentioned in contribute.html)? Just normal go test . -run Blocking doesn't work, presumably because I'm running an unrelated version of Go. Are there docs somewhere that I'm missing?

gopherbot pushed a commit that referenced this issue Mar 20, 2017
As noted in #19161 (comment),
CL 37771 (adding use of the new httptest.Server.Client to all net/http
tests) accidentally reverted DisableKeepAlives for this test. For
many tests, DisableKeepAlives was just present to prevent goroutines
from staying active after the test exited.  In this case it might
actually be important. (We'll see)

Updates #19161

Change-Id: I11f889f86c932b51b11846560b68dbe5993cdfc3
Reviewed-on: https://go-review.googlesource.com/38373
Reviewed-by: Michael Munday <munday@ca.ibm.com>
@bradfitz
Contributor

@glasser:

$ cd /path/to/devgo
$ cd src
$ ./make.bash
$ cd net/http
$ /path/to/devgo/bin/go test -v -run=Blocking -count=500 

Or if you want to run it in race mode (often shakes things out):

$ /path/to/devgo/bin/go test -race -i    # install race deps once
$ /path/to/devgo/bin/go test -v -run=Blocking -count=500 -race

@siebenmann

I can get a test version of Go at commit d80166e to repeatedly fail go test -run=Blocking -count=10000 with the usual serve_test.go:1018: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" error. With the next commit, eb6c1dd 'net/http: deflake TestServerAllowsBlockingRemoteAddr more', my go test run no longer produces any failures even if run for longer than 10,000 iterations. Nor can I get it to fail with -race.

@bradfitz
Contributor

@siebenmann, excellent. Thanks for confirming. I wasn't able to cause the failure so I was just hoping. Glad to hear it's fixed now. Will close for now. Hopefully nobody else hits this again.

@glasser
Contributor

glasser commented Mar 20, 2017

@siebenmann I'm glad that seems to work! If you're interested in helping me figure out exactly why it was broken, do you think you could run glasser@223b373 and share what it prints on failure cases? The commit adds a bunch of t.Logs using httptrace.

@siebenmann

I cloned your Go repo and checked out that commit, and I get:

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.25s)
        serve_test.go:965: Getting conn for 127.0.0.1:46137 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:46137 ...
        serve_test.go:965: Getting conn for 127.0.0.1:46137 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:46137 ...
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:46137 = <nil>
        serve_test.go:965: got conn: {Conn:0xc42000e050 Reused:false WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:965: got conn: {Conn:0xc42000e050 Reused:true WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:1048: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:46137 = <nil>
--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
        serve_test.go:965: Getting conn for 127.0.0.1:33671 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:33671 ...
        serve_test.go:965: Getting conn for 127.0.0.1:33671 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:33671 ...
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:33671 = <nil>
        serve_test.go:965: got conn: {Conn:0xc420100028 Reused:false WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:965: got conn: {Conn:0xc420100028 Reused:true WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:33671 = <nil>
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:1048: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"

(And more copies of them.)

Running with just go test -run=Blocking -count=500 doesn't seem to reliably trigger the failure for me, for what it's worth, although sometimes it does. I thought initially it might be related to the number of sockets in TIME_WAIT or just active, but that doesn't seem to be it: I checked netstat -n --inet while go test -run=Blocking -count=500 was failing some of the time, and I can start from what looks like zero localhost sockets in any state and it will still fail sometimes.

@glasser
Contributor

glasser commented Mar 20, 2017

OK, so it clearly shows that the client is reusing the connection (Reused:true), which can certainly cause the observed behavior.

But the strange bit is that first response byte shouldn't happen until we send an address over one of the addrs channels, and that shouldn't happen until after we read conn2 = <-conns (or have a "Second Accept didn't happen" Fatal). And that shouldn't happen until something makes a second TCP connection to the port.

Is it possible that something else on the system is making a TCP connection to the server port at just the wrong time? I'm imagining something like

  • Server starts listening
  • Something random connects to the server over TCP; its net.Conn gets sent to conns
  • Client starts first request; its net.Conn gets sent to conns and its server blocks on RemoteAddr()
  • Test reads the random connection from conns
  • Client starts goroutine which will eventually get around to sending second request
  • Test reads the first request's net.Conn from conns
  • Test sends 12.12.12.12 over the first request's connection's addrs channel
  • Server returns response to first request and client sees it
  • That goroutine for the second request finally gets around to running the request. It sees an idle connection and happily uses it.
  • Server returns response for the second request, which uses the same 12.12.12.12 RemoteAddr since that's per-request.

This would all be believable if there were any reason to believe that somebody would make a TCP connection to the server port right around the time it started listening. Possibly it's even just the first request connecting twice for some error-retry reason?

@golang golang locked and limited conversation to collaborators Mar 20, 2018