Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: io.Copy to timed-out net.Conn hangs? #4403

Closed
bradfitz opened this issue Nov 16, 2012 · 20 comments
Closed

net: io.Copy to timed-out net.Conn hangs? #4403

bradfitz opened this issue Nov 16, 2012 · 20 comments
Milestone

Comments

@bradfitz
Copy link
Contributor

on OS X, patching in un-mailed http://golang.org/cl/6851061 which adds this
net/http test:

func TestTransportDustinRace(t *testing.T) {
        mux := NewServeMux()
        mux.HandleFunc("/get", func(w ResponseWriter, r *Request) {
                io.Copy(w, neverEnding('a'))
        })
        mux.HandleFunc("/put", func(w ResponseWriter, r *Request) {
                io.Copy(ioutil.Discard, r.Body)
        })
        ts := httptest.NewServer(mux)
        defer ts.Close()

        client := &Client{
                Transport: &Transport{
                        Dial: func(n, addr string) (net.Conn, error) {
                                conn, err := net.Dial(n, addr)
                                if err != nil {
                                        return nil, err
                                }
                                conn.SetDeadline(time.Now().Add(50 * time.Millisecond))
                                return conn, nil
                        },
                        DisableKeepAlives: true,
                },
        }

        for i := 0; i < 50; i++ {
                println("iter", i)
                sres, err := client.Get(ts.URL + "/get")
                if err != nil {
                        t.Logf("Error issuing GET: %v", err)
                        return
                }
                req, _ := NewRequest("PUT", ts.URL+"/put", sres.Body)
                res, err := client.Do(req)
                println(fmt.Sprintf("iteration %d; res=%v, err=%v", i, res, err))
                t.Logf("iteration %d; res=%v, err=%v", i, res, err)
        }
        println("done")
}

the test hangs after "done", before the testserver deferred Close() call,
because the /get handler's io.Copy (from neverEnding('a') to the ResponseWriter) never
completes, always waiting for writability:

goroutine 219 [chan receive]:
net.(*pollServer).WaitWrite(0xc200088de0, 0xc2002b4cc0, 0xc200088c90, 0x23,
0xffffffffffffff01, ...)
    /Users/bradfitz/go/src/pkg/net/fd_unix.go:251 +0x63
net.(*netFD).Write(0xc2002b4cc0, 0xc200512000, 0x1000, 0x1000, 0x0, ...)
    /Users/bradfitz/go/src/pkg/net/fd_unix.go:536 +0x391
net.(*conn).Write(0xc200473c38, 0xc200512000, 0x1000, 0x1000, 0xffffffff, ...)
    /Users/bradfitz/go/src/pkg/net/net.go:129 +0xd3
bufio.(*Writer).Flush(0xc20050d200, 0xff8, 0xff8, 0xc200514000)
    /Users/bradfitz/go/src/pkg/bufio/bufio.go:456 +0xb9
bufio.(*Writer).Write(0xc20050d200, 0xc200514000, 0x8000, 0x8000, 0x0, ...)
    /Users/bradfitz/go/src/pkg/bufio/bufio.go:492 +0x1ef
net/http.(*response).Write(0xc20050d280, 0xc200514000, 0x8000, 0x8000, 0x8000, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:501 +0x461
net/http.(*writerOnly).Write(0xc20050e270, 0xc200514000, 0x8000, 0x8000, 0x8000, ...)
    /Users/bradfitz/go/src/pkg/net/http/chunked.go:0 +0x65
io.Copy(0xc2000db1e0, 0xc20050e270, 0xc2000db0f0, 0x19961, 0x8c8000, ...)
    /Users/bradfitz/go/src/pkg/io/io.go:367 +0x236
net/http.(*response).ReadFrom(0xc20050d280, 0xc2000db0f0, 0x19961, 0xc2000db120, 0x0,
...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:168 +0x1ae
io.Copy(0xc2000db0c0, 0xc20050d280, 0xc2000db0f0, 0x19961, 0x0, ...)
    /Users/bradfitz/go/src/pkg/io/io.go:357 +0xa2
net/http_test.func·104(0xc2000ea880, 0xc20050d280)
    /Users/bradfitz/go/src/pkg/net/http/transport_test.go:907 +0x9c
net/http.HandlerFunc.ServeHTTP(0x96ac0, 0xc2000ea880, 0xc20050d280, 0xc2001bf410,
0x310420, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:733 +0x3b
net/http.(*ServeMux).ServeHTTP(0xc200088cf0, 0xc2000ea880, 0xc20050d280, 0xc2001bf410,
0x3de6d, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:992 +0x61
net/http/httptest.(*waitGroupHandler).ServeHTTP(0xc2000cf320, 0xc2000ea880,
0xc20050d280, 0xc2001bf410, 0x0, ...)
    /Users/bradfitz/go/src/pkg/net/http/httptest/server.go:180 +0x8f
net/http.(*conn).serve(0xc200332f00, 0x0)
    /Users/bradfitz/go/src/pkg/net/http/server.go:696 +0x55d
created by net/http.(*Server).Serve
    /Users/bradfitz/go/src/pkg/net/http/server.go:1120 +0x36d

why?

Keeping this assigned to me, but help would be appreciated if anybody is bored.
@remyoudompheng
Copy link
Contributor

Comment 1:

Because persistConn.roudTrip doesn't check for bufio.Writer.Flush error.
http://golang.org/src/pkg/net/http/transport.go#L678

@bradfitz
Copy link
Contributor Author

Comment 2:

That's not the right code.  It's changed notably since then.
It's currently:
func (pc *persistConn) writeLoop() {
        for {
                select {
                case wr := <-pc.writech:
                        if pc.isBroken() {
                                wr.ch <- errors.New("http: can't write HTTP request on broken connection")
                                continue
                        }
                        err := wr.req.Request.write(pc.bw, pc.isProxy, wr.req.extra)
                        if err == nil {
                                err = pc.bw.Flush()
            }
                        if err != nil {
                            pc.markBroken()
                        }
                        wr.ch <- err
        case <-pc.closech:
                        return
                }
        }
}

@bradfitz
Copy link
Contributor Author

Comment 3:

Okay, I spent too much time debugging this today.  Here's what I now know:
* OS X vs Linux doesn't matter
* the /get goroutine never finishes because the io.Copy never finishes because the
net.Conn never becomes writable.
* when the client disconnects, the server never closes its connection, since it's
blocked on the write.  This bug explains the file descriptor leak in Dustin's stress
test from issue #4191
* after reading the GET request's headers, we never do another Read on the net.Conn, so
we never know the client disconnected.
* keeping a goroutine always reading from the net.Conn, preemptively reading the
upcoming request would be one way to fix this (related to pipeling and CloseNotifier in
issue #2510), but I also wondering whether the pollserver should be telling us, if we're
waiting for writability, that the client is now gone.
I'll be looking at the poll server now and testing some behavior before I do a
concurrent reader fix for this.

@bradfitz
Copy link
Contributor Author

Comment 4:

Okay, I'm getting more and more confused.
Any help debugging this would be awesome.
Here's my latest confusion:  http://golang.org/cl/6851061
Start with this mystery:
$ go test -v -run=Dustin.*1 net/http
Notice that it never times out.  (Look for the "XXX" comments in transport_test.go)
But uncomment the "return conn, nil" line before the logging wrapper return line below
and notice that it successfully times out.
Why is the logging wrapper affecting the wrapped net.Conn's timeout?  Is this some bug
in the pollserver and scheduler differences tickle it?
I'm so confused.

@alberts
Copy link
Contributor

alberts commented Nov 22, 2012

Comment 5:

Any chance that https://golang.org/issue/4384?c=12 is showing up the
same underlying pollserver issue?

@bradfitz
Copy link
Contributor Author

Comment 6:

I would believe anything at this point, but I'm looking for a smoking gun.

@bradfitz
Copy link
Contributor Author

Comment 7:

Uploaded a new version of http://golang.org/cl/6851061 with better logging. 
(the server conns weren't being labeled as "server-1" and "server-2")
Why does DustinRace1 loop forever without timeouts if the Transport dialer
(transport_test.go:914) returns the NewLoggingConn wrapper rather than returning conn
directly?  What about the wrapper makes the underlying deadlines no longer work?  But
the wrapper DOES work with the deadlines if the deadlines are smaller (or negative), so
somehow the wrapper only masks 50-100ms and longer deadlines??  Is something wrongly
renewing the deadlines?
(at least right now on Linux, but I seem to recall this also failing on OS X)

@dvyukov
Copy link
Member

dvyukov commented Nov 22, 2012

Comment 8:

Does ts.Close() calls Close() on netFD associated with the HTTP connection?
If yes, than that Close() must unblock netFD.Write().

@bradfitz
Copy link
Contributor Author

Comment 9:

Dmitry, you're looking at the patch and not looking at the problem.  :)
If you run the patch, you'll see that ts.Close isn't the problem: it's correctly waiting
on a WaitGroup, waiting for all the test handlers to complete.

@bradfitz
Copy link
Contributor Author

Comment 10:

New information:
If I replace the *loggingConn's log statements with sleeps, I find that 500 nanosecond
sleeps makes the test hang (timeouts don't work), but 200 nanoseconds makes the test end
successfully (the timeouts still work).
Looking like an race in the poll server, deep in epoll or pollster land.  Fun.
I'll check what OS X does later. (I brought two laptops with me, just for this fun)

@bradfitz
Copy link
Contributor Author

Comment 11:

The latest clue: the net package doesn't do timeouts reliably:
http://golang.org/cl/6851096/
(unrelated to HTTP)

@dvyukov
Copy link
Member

dvyukov commented Nov 23, 2012

Comment 12:

> Dmitry, you're looking at the patch and not looking at the problem.  :)
> If you run the patch, you'll see that ts.Close isn't the problem:
> it's correctly waiting on a WaitGroup, waiting for all the test handlers to complete.
But does it call netFD.Close() on the connections before waiting?
What is the exact problem? I am lost. Earlier you said that server connections do not
finish after close, and then that client connection do not finish after timeout. Are
there >=2 problems?

@mikioh
Copy link
Contributor

mikioh commented Nov 23, 2012

Comment 13:

I added a bit debug info to net/http/server.go to see what happens in net/httptest
server.
https://golang.org/cl/6843108/
As mentioned in golang-dev, on freebsd, D1 works well whether new net/net_test.go exists.
And D2 doesn't.

@davecheney
Copy link
Contributor

Comment 15:

oh dear.
os.file.close() is blocking .. indefinitely
goroutine 5 [runnable]:
syscall.Syscall()
        /home/dfc/go/src/pkg/syscall/asm_linux_amd64.s:36 +0x64
syscall.Close(0x3, 0x0, 0x0, 0x30)
        /home/dfc/go/src/pkg/syscall/zerrors_linux_amd64.go:1761 +0x52
os.(*file).close(0xc200058750, 0x1, 0xc200058810, 0x43801b)
        /home/dfc/go/src/pkg/os/file_unix.go:108 +0x78
os.(*File).Close(0xc200072260, 0x7f52e936de40, 0xc2000b1000, 0x4399a8)
        /home/dfc/go/src/pkg/os/file_unix.go:99 +0x28
net.(*netFD).decref(0xc200081000, 0x5813e8)
        /home/dfc/go/src/pkg/net/fd_unix.go:375 +0x5b
net.(*netFD).accept(0xc200081000, 0x44e630, 0x0, 0xc200058810,
0xc20007a600, ...)
        /home/dfc/go/src/pkg/net/fd_unix.go:603 +0x78
net.(*TCPListener).AcceptTCP(0xc200072268, 0xc200072410, 0x0, 0x0, 0x10, ...)
        /home/dfc/go/src/pkg/net/tcpsock_posix.go:237 +0x62
net.(*TCPListener).Accept(0xc200072268, 0x0, 0x0, 0x0, 0x0, ...)
        /home/dfc/go/src/pkg/net/tcpsock_posix.go:247 +0x49
net.func�038(0xc200070d30, 0xc200072278, 0xc200072220, 0xc200072280, 0x0, ...)
        /home/dfc/go/src/pkg/net/net_test.go:261 +0x47
created by net.testVariousDeadlines
        /home/dfc/go/src/pkg/net/net_test.go:276 +0x221

@dvyukov
Copy link
Member

dvyukov commented Nov 23, 2012

Comment 16:

It can be because the thread can't reenter Go rutime and blocked in exitsyscall(). Try
running with GOTRACEBACK=2

@davecheney
Copy link
Contributor

Comment 17:

I'm sorry, that previous report was wrong.
I am looking at
https://code.google.com/p/go/source/browse/src/pkg/net/fd_unix.go#426
wondering if this line never returned EAGAIN, will the timeout logic
implemented in WaitRead ever fire ? Especially copying over the
loopback interface there may always be data in the buffer.

@alexbrainman
Copy link
Member

Comment 18:

I agree with dave, the test is sending so much data that net read always succeeds, and
it never even get a chance to check for timeouts. Unless you move timeout check logic
before the read, like I did in windows CL.
I cannot check this theory on Linux because the new timeout test passes on my Linux pc.
Alex

@bradfitz
Copy link
Contributor Author

Comment 19:

Dave, Alex, thanks so much!
Fix and new minimal tests at http://golang.org/cl/6851096

@bradfitz
Copy link
Contributor Author

Comment 20:

Labels changed: removed priority-triage.

@bradfitz
Copy link
Contributor Author

Comment 21:

This issue was closed by revision 5fa3aeb.

Status changed to Fixed.

@bradfitz bradfitz self-assigned this Nov 24, 2012
@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants