Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: TestRetryRequestsOnError failures with "use of closed network connection" #30938

Closed
bradfitz opened this issue Mar 19, 2019 · 33 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Mar 19, 2019

From https://storage.googleapis.com/go-build-log/c7862f5a/linux-amd64-race_4ed6f3c3.log

--- FAIL: TestRetryRequestsOnError (0.02s)
    --- FAIL: TestRetryRequestsOnError/NothingWrittenGetBody (0.00s)
        transport_test.go:3116: i=1: Do = Post http://fake.golang: read tcp 127.0.0.1:52650->127.0.0.1:42017: use of closed network connection; log:
            Dial
            Write("POST / HTTP/1.1\r\nHost: fake.golang\r\nUser-Agent: Go-http-client/1.1\r\nContent-Length: 4\r\nAccept-Encoding: gzip\r\n\r\nfoo\n")
            Handler
            intentional write failure
2019/03/19 21:17:09 http: TLS handshake error from 127.0.0.1:38418: write tcp 127.0.0.1:43483->127.0.0.1:38418: use of closed network connection
@bradfitz bradfitz added Testing An issue that has been verified to require only test changes, not just a test failure. help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Mar 19, 2019
@bradfitz bradfitz added this to the Go1.13 milestone Mar 19, 2019
@bradfitz bradfitz self-assigned this Mar 19, 2019
@cuonglm
Copy link
Member

cuonglm commented Mar 20, 2019

@bradfitz is there a way to reliably reproduce this?

I tried:

for _ in {1..100}; do go-tip test -count=1 -race .;done

with no error.

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed NeedsFix The path to resolution is known, but the work has not been done. labels Mar 21, 2019
@bradfitz
Copy link
Contributor Author

I haven't tried.

@bradfitz
Copy link
Contributor Author

@bcmills
Copy link
Contributor

bcmills commented May 16, 2019

https://storage.googleapis.com/go-build-log/de0346c9/linux-amd64-race_86344abc.log

--- FAIL: TestRetryRequestsOnError (0.02s)
    --- FAIL: TestRetryRequestsOnError/NothingWrittenNoBody (0.00s)
        transport_test.go:3217: i=1: Do = Delete http://fake.golang: read tcp 127.0.0.1:56096->127.0.0.1:40507: use of closed network connection; log:
            Dial
            Write("DELETE / HTTP/1.1\r\nHost: fake.golang\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n")
            Handler
            intentional write failure
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49704: read tcp 127.0.0.1:33311->127.0.0.1:49704: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49712: write tcp 127.0.0.1:33311->127.0.0.1:49712: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49710: read tcp 127.0.0.1:33311->127.0.0.1:49710: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49716: read tcp 127.0.0.1:33311->127.0.0.1:49716: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49706: read tcp 127.0.0.1:33311->127.0.0.1:49706: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49708: write tcp 127.0.0.1:33311->127.0.0.1:49708: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49714: write tcp 127.0.0.1:33311->127.0.0.1:49714: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49720: read tcp 127.0.0.1:33311->127.0.0.1:49720: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49734: write tcp 127.0.0.1:33311->127.0.0.1:49734: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49722: read tcp 127.0.0.1:33311->127.0.0.1:49722: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49718: read tcp 127.0.0.1:33311->127.0.0.1:49718: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49726: read tcp 127.0.0.1:33311->127.0.0.1:49726: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49728: read tcp 127.0.0.1:33311->127.0.0.1:49728: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49724: read tcp 127.0.0.1:33311->127.0.0.1:49724: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49732: read tcp 127.0.0.1:33311->127.0.0.1:49732: use of closed network connection
2019/05/16 17:31:29 http: TLS handshake error from 127.0.0.1:49730: read tcp 127.0.0.1:33311->127.0.0.1:49730: use of closed network connection
FAIL

@cuonglm
Copy link
Member

cuonglm commented May 28, 2019

@bcmills
Copy link
Contributor

bcmills commented Jun 7, 2019

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@bcmills
Copy link
Contributor

bcmills commented Jul 31, 2019

darwin-386-10_14: https://build.golang.org/log/4742b5e51eebf76feb5249e2650939eaebb16793

See previously #25366.

@bcmills
Copy link
Contributor

bcmills commented Aug 20, 2019

android-amd64-emu (https://build.golang.org/log/92ec6ce5e426c6e197dc01557d726a680dc5134a):

--- FAIL: TestRetryRequestsOnError (0.03s)
    --- FAIL: TestRetryRequestsOnError/NothingWrittenNoBody (0.00s)
        transport_test.go:3224: i=1: Do = Delete http://fake.golang: read tcp 127.0.0.1:36280->127.0.0.1:41698: use of closed network connection; log:
            Dial
            Write("DELETE / HTTP/1.1\r\nHost: fake.golang\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n")
            Handler
            intentional write failure
FAIL
exitcode=1FAIL	net/http	9.085s

@jtarchie
Copy link

We may be experiencing this error, too.
We have a test that is testing retry logic when a connection (in <=go1.12) would receive a io.EOF when the server closed the client connection.
When we upgraded to go1.13, we no longer receive io.EOF on the closed connection. We receive tls.errClosed.
With retry logic, we were able to compare the type of io.EOF, but with tls.errClosed, we now have to inspect the error string (Error()). This doesn't feel right to do.
We don't have a reproducible simple test case at the moment, but hopefully I've provided enough information from our current test case.

jtarchie pushed a commit to pivotal-cf/om that referenced this issue Sep 11, 2019
Read for more info: golang/go#30938 (comment)

Signed-off-by: Jeremy Alvis <jalvis@pivotal.io>
@jtarchie
Copy link

It appears that tls.errClosed may not be the only culprit.
We are experiencing also a bunch of write: broken pipe in CI.
🤔

@jtarchie
Copy link

We've reverted back to go1.12.

@bcmills
Copy link
Contributor

bcmills commented Sep 16, 2019

linux-s390x-ibm: https://build.golang.org/log/7b2dcb45b72837f2cb08401d2083f7a857443d7b

(same failure mode as in #30938 (comment))

@bcmills
Copy link
Contributor

bcmills commented Oct 4, 2019

windows-amd64-2016: https://build.golang.org/log/bef07597bea372590c2de133c9ad1afeddd7a42f

Same failure, mode, but no TLS handshake error (consistent with @jtarchie's observation).

@jtarchie
Copy link

jtarchie commented Oct 4, 2019

I might be able to figure out a reproducible (flakey) test, but I don't know if that actually hunts down the issue.
I'm not sure what changed in the go 1.12 to 1.13 source for TLS. I've looked at the diffs via git.
The fact that the error types changed (io.EOF to tls.errClosed | write: broken pipe) is probably where to start.

This could also be due to changes in httptest TLS Server.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills
Copy link
Contributor

bcmills commented Oct 21, 2019

dragonfly-amd64: https://build.golang.org/log/782c3e02489de8e3203dcd50f037538956bab465
...at least we know it's not platform-dependent?

Actually, there is a pattern here: the only two cases that seem to flake are NothingWrittenNoBody and NothingWrittenGetBody. The Idempotent cases have not yet been implicated.

@bcmills
Copy link
Contributor

bcmills commented Oct 21, 2019

This note in the test may provide a clue:

// NOTE: we resend a request only if:
// - we reused a keep-alive connection
// - we haven't yet received any header data
// - either we wrote no bytes to the server, or the request is idempotent

I think the TLS handshake error is a red herring: it isn't necessarily even coming from the same test function.

@bcmills bcmills changed the title net/http: TestRetryRequestsOnError flake net/http: TestRetryRequestsOnError is flaky Nov 21, 2019
@bradfitz
Copy link
Contributor Author

I can reproduce this if I burn a bunch of CPU (https://gist.github.com/bradfitz/233894f4aec11b221cb06abd8489d2e4) and run

$ go test  -run=TestRetryRequestsOnError/NothingWrittenGetBody -race -count=500
--- FAIL: TestRetryRequestsOnError (0.02s) 
    --- FAIL: TestRetryRequestsOnError/NothingWrittenGetBody (0.02s)
        transport_test.go:3395: i=1: Do = Post "http://fake.golang": read tcp 127.0.0.1:50552->127.0.0.1:46615: use of closed network connection; log:
            Dial 
            Write("POST / HTTP/1.1\r\nHost: fake.golang\r\nUser-Agent: Go-http-client/1.1\r\nContent-Length: 4\r\nAccept-Encoding: gzip\r\n\r\nfoo\n")
            Handler
            intentional write failure      
--- FAIL: TestRetryRequestsOnError (0.01s) 
    --- FAIL: TestRetryRequestsOnError/NothingWrittenGetBody (0.00s)
        transport_test.go:3395: i=1: Do = Post "http://fake.golang": read tcp 127.0.0.1:42122->127.0.0.1:38355: use of closed network connection; log:
            Dial
            Write("POST / HTTP/1.1\r\nHost: fake.golang\r\nUser-Agent: Go-http-client/1.1\r\nContent-Length: 4\r\nAccept-Encoding: gzip\r\n\r\nfoo\n")
            Handler
            intentional write failure
FAIL
exit status 1   
FAIL    net/http        55.811s

This check involving checking for a failure under 25 milliseconds scares me:

if time.Since(t0) < MaxWriteWaitBeforeConnReuse/2 {

... because all test code involving time scares me.

@bcmills
Copy link
Contributor

bcmills commented Dec 2, 2020

2020-12-01T22:59:34-4826abb/linux-amd64-jessie
2020-11-16T15:00:31-c7233dd/dragonfly-amd64-5_8
2020-10-19T19:27:54-4d1cecd/linux-386-longtest
2020-10-19T19:27:54-4d1cecd/linux-amd64-longtest
2020-10-19T19:27:54-4d1cecd/windows-amd64-longtest
2020-10-19T18:48:17-ab541a0/linux-386-longtest
2020-10-19T18:48:17-ab541a0/linux-amd64-longtest
2020-10-19T18:48:17-ab541a0/windows-amd64-longtest
2020-10-19T18:31:53-06839e3/linux-386-longtest
2020-10-19T18:31:53-06839e3/linux-amd64-longtest
2020-10-19T18:31:53-06839e3/windows-amd64-longtest
2020-10-19T18:31:39-2ff9e01/linux-386-longtest
2020-10-19T18:31:39-2ff9e01/linux-amd64-longtest
2020-10-19T18:31:39-2ff9e01/windows-amd64-longtest
2020-10-19T18:31:05-0040adf/linux-386-longtest
2020-10-19T18:31:05-0040adf/linux-amd64-longtest
2020-10-19T18:31:05-0040adf/windows-amd64-longtest
2020-10-19T18:25:30-bb44c90/linux-386-longtest
2020-10-19T18:25:30-bb44c90/linux-amd64-longtest
2020-10-19T18:25:30-bb44c90/windows-amd64-longtest
2020-10-19T17:23:00-178b0f5/linux-386-longtest
2020-10-19T17:23:00-178b0f5/linux-amd64-longtest
2020-10-19T17:23:00-178b0f5/windows-amd64-longtest
2020-10-19T17:19:18-58eadc2/linux-386-longtest
2020-10-19T17:19:18-58eadc2/linux-amd64-longtest
2020-10-19T17:19:18-58eadc2/windows-amd64-longtest
2020-10-05T18:52:43-a9c75ec/freebsd-arm-paulzhol
2020-08-18T00:08:36-db4cda2/linux-amd64-longtest
2020-05-07T14:29:38-cb14bd8/aix-ppc64
2020-05-05T19:42:01-0e617d3/android-386-emu
2020-04-13T22:48:16-240eac3/android-386-emu
2020-04-02T22:13:55-888a0c8/windows-amd64-2008
2020-02-28T09:29:22-b8f54e5/android-386-emu
2020-02-27T23:20:01-1178269/android-386-emu
2019-12-30T09:34:53-4b43700/freebsd-amd64-11_2

@bcmills
Copy link
Contributor

bcmills commented Feb 22, 2021

@bcmills
Copy link
Contributor

bcmills commented Mar 5, 2021

@bcmills
Copy link
Contributor

bcmills commented Apr 7, 2021

@bcmills
Copy link
Contributor

bcmills commented May 7, 2021

CC @neild @empijei

@bcmills bcmills changed the title net/http: TestRetryRequestsOnError is flaky net/http: TestRetryRequestsOnError failures with "use of closed network connection" Nov 4, 2021
@bcmills
Copy link
Contributor

bcmills commented Nov 4, 2021

@bcmills
Copy link
Contributor

bcmills commented Dec 8, 2021

greplogs --dashboard -md -l -e 'FAIL: TestRetryRequestsOnError' --since=2021-11-05

2021-12-08T04:14:00-a19e72c/dragonfly-amd64
2021-11-15T21:21:51-560dc97/linux-386-longtest
2021-11-10T20:45:04-a881409/linux-amd64-staticlockranking
2021-11-10T05:08:25-17980df/netbsd-arm-bsiegert
2021-11-09T22:58:24-578ada4/linux-amd64-sid
2021-11-09T21:58:03-805b4d5/linux-amd64-unified

Marking as release-blocker via #11811. Note that many of the failures are on linux/amd64 and linux/386, which are first-class ports.

@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker and removed help wanted labels Dec 8, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Dec 8, 2021
@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@cherrymui
Copy link
Member

Any update on this? Thanks. (checking in as this is a release blocker.)

@bcmills
Copy link
Contributor

bcmills commented Jan 5, 2022

The failures this month include a couple of additional platforms, including darwin/amd64 (another first-class port).

greplogs --dashboard -md -l -e 'FAIL: TestRetryRequestsOnError' --since=2021-12-09

2021-12-21T22:26:48-60f2c12/darwin-amd64-race
2021-12-13T18:55:52-f909f81/dragonfly-amd64
2021-12-10T19:26:50-766f89b/linux-ppc64le-power9osu

@toothrot toothrot assigned neild and unassigned bradfitz Jan 19, 2022
@neild
Copy link
Contributor

neild commented Jan 19, 2022

When a write fails, persistConn.roundTrip can learn of it in one of three ways:

This test failure occurs in the last case. It's uncommon, because quite a lot needs to happen between the write loop signaling writeErrCh and roundTrip reading from it: The persistConn is closed, the read loop reads an error, the read loop sends on resc, and roundTrip reads from resc.

@gopherbot
Copy link

Change https://golang.org/cl/379554 mentions this issue: net/http: deflake request-not-written path

@dmitshur
Copy link
Contributor

Ping as a release blocker. Is it just a matter of submitting the CL above?

@jeremyfaller
Copy link
Contributor

As of last week, @bcmills was going to own this CL as @neild's availability might be all over the place this week. The discussion at the end of last week was Bryan was going to address https://golang.org/cl/379554 comments, and submit.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests