New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net: io.Copy to timed-out net.Conn hangs? #4403
Labels
Milestone
Comments
Because persistConn.roudTrip doesn't check for bufio.Writer.Flush error. http://golang.org/src/pkg/net/http/transport.go#L678 |
That's not the right code. It's changed notably since then. It's currently: func (pc *persistConn) writeLoop() { for { select { case wr := <-pc.writech: if pc.isBroken() { wr.ch <- errors.New("http: can't write HTTP request on broken connection") continue } err := wr.req.Request.write(pc.bw, pc.isProxy, wr.req.extra) if err == nil { err = pc.bw.Flush() } if err != nil { pc.markBroken() } wr.ch <- err case <-pc.closech: return } } } |
Okay, I spent too much time debugging this today. Here's what I now know: * OS X vs Linux doesn't matter * the /get goroutine never finishes because the io.Copy never finishes because the net.Conn never becomes writable. * when the client disconnects, the server never closes its connection, since it's blocked on the write. This bug explains the file descriptor leak in Dustin's stress test from issue #4191 * after reading the GET request's headers, we never do another Read on the net.Conn, so we never know the client disconnected. * keeping a goroutine always reading from the net.Conn, preemptively reading the upcoming request would be one way to fix this (related to pipeling and CloseNotifier in issue #2510), but I also wondering whether the pollserver should be telling us, if we're waiting for writability, that the client is now gone. I'll be looking at the poll server now and testing some behavior before I do a concurrent reader fix for this. |
Okay, I'm getting more and more confused. Any help debugging this would be awesome. Here's my latest confusion: http://golang.org/cl/6851061 Start with this mystery: $ go test -v -run=Dustin.*1 net/http Notice that it never times out. (Look for the "XXX" comments in transport_test.go) But uncomment the "return conn, nil" line before the logging wrapper return line below and notice that it successfully times out. Why is the logging wrapper affecting the wrapped net.Conn's timeout? Is this some bug in the pollserver and scheduler differences tickle it? I'm so confused. |
Any chance that https://golang.org/issue/4384?c=12 is showing up the same underlying pollserver issue? |
Uploaded a new version of http://golang.org/cl/6851061 with better logging. (the server conns weren't being labeled as "server-1" and "server-2") Why does DustinRace1 loop forever without timeouts if the Transport dialer (transport_test.go:914) returns the NewLoggingConn wrapper rather than returning conn directly? What about the wrapper makes the underlying deadlines no longer work? But the wrapper DOES work with the deadlines if the deadlines are smaller (or negative), so somehow the wrapper only masks 50-100ms and longer deadlines?? Is something wrongly renewing the deadlines? (at least right now on Linux, but I seem to recall this also failing on OS X) |
New information: If I replace the *loggingConn's log statements with sleeps, I find that 500 nanosecond sleeps makes the test hang (timeouts don't work), but 200 nanoseconds makes the test end successfully (the timeouts still work). Looking like an race in the poll server, deep in epoll or pollster land. Fun. I'll check what OS X does later. (I brought two laptops with me, just for this fun) |
The latest clue: the net package doesn't do timeouts reliably: http://golang.org/cl/6851096/ (unrelated to HTTP) |
> Dmitry, you're looking at the patch and not looking at the problem. :) > If you run the patch, you'll see that ts.Close isn't the problem: > it's correctly waiting on a WaitGroup, waiting for all the test handlers to complete. But does it call netFD.Close() on the connections before waiting? What is the exact problem? I am lost. Earlier you said that server connections do not finish after close, and then that client connection do not finish after timeout. Are there >=2 problems? |
I added a bit debug info to net/http/server.go to see what happens in net/httptest server. https://golang.org/cl/6843108/ As mentioned in golang-dev, on freebsd, D1 works well whether new net/net_test.go exists. And D2 doesn't. |
oh dear. os.file.close() is blocking .. indefinitely goroutine 5 [runnable]: syscall.Syscall() /home/dfc/go/src/pkg/syscall/asm_linux_amd64.s:36 +0x64 syscall.Close(0x3, 0x0, 0x0, 0x30) /home/dfc/go/src/pkg/syscall/zerrors_linux_amd64.go:1761 +0x52 os.(*file).close(0xc200058750, 0x1, 0xc200058810, 0x43801b) /home/dfc/go/src/pkg/os/file_unix.go:108 +0x78 os.(*File).Close(0xc200072260, 0x7f52e936de40, 0xc2000b1000, 0x4399a8) /home/dfc/go/src/pkg/os/file_unix.go:99 +0x28 net.(*netFD).decref(0xc200081000, 0x5813e8) /home/dfc/go/src/pkg/net/fd_unix.go:375 +0x5b net.(*netFD).accept(0xc200081000, 0x44e630, 0x0, 0xc200058810, 0xc20007a600, ...) /home/dfc/go/src/pkg/net/fd_unix.go:603 +0x78 net.(*TCPListener).AcceptTCP(0xc200072268, 0xc200072410, 0x0, 0x0, 0x10, ...) /home/dfc/go/src/pkg/net/tcpsock_posix.go:237 +0x62 net.(*TCPListener).Accept(0xc200072268, 0x0, 0x0, 0x0, 0x0, ...) /home/dfc/go/src/pkg/net/tcpsock_posix.go:247 +0x49 net.func�038(0xc200070d30, 0xc200072278, 0xc200072220, 0xc200072280, 0x0, ...) /home/dfc/go/src/pkg/net/net_test.go:261 +0x47 created by net.testVariousDeadlines /home/dfc/go/src/pkg/net/net_test.go:276 +0x221 |
I'm sorry, that previous report was wrong. I am looking at https://code.google.com/p/go/source/browse/src/pkg/net/fd_unix.go#426 wondering if this line never returned EAGAIN, will the timeout logic implemented in WaitRead ever fire ? Especially copying over the loopback interface there may always be data in the buffer. |
I agree with dave, the test is sending so much data that net read always succeeds, and it never even get a chance to check for timeouts. Unless you move timeout check logic before the read, like I did in windows CL. I cannot check this theory on Linux because the new timeout test passes on my Linux pc. Alex |
Dave, Alex, thanks so much! Fix and new minimal tests at http://golang.org/cl/6851096 |
This issue was closed by revision 5fa3aeb. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: