net/http: Unrecoverable client panic crashes my program #4191
I pasted the wrong gist into the email. Apologies: https://gist.github.com/3831133
Labels changed: removed priority-triage. Owner changed to @bradfitz. Status changed to Accepted.
I've been having a lot of trouble trying to reproduce this in isolation. I thought it could be because of the socket timeouts (though I have no evidence that this is occurring), but if I write a test that does tons of HTTP requests in tight loops that are all timing out, it seems to do the right thing. I thought it might be something related to fetching the data from wikipedia itself, but I ran a test doing just that for hours yesterday and never saw so much as the unsolicited response warning (which I get occasionally even without a corresponding crash).

In the meantime, I've disabled the timeout handling on the client that has it. Sometimes my process crashes within fifteen minutes, and sometimes it runs for three or four hours, so it'll be a while before I have confidence that this makes a difference.

However, if it does make a difference, the way I was using it was very specific: I had a type that holds the underlying connection to an http server and sets the read timeout forward on every read. As far as I can tell, that connection had a pretty constant stream of data, and it's code I've been running continuously for months. The rest of the connections use an unmodified default client, as seen in the gist. I can't tell what the clients that are leading to the panic were connected to.
If you get me a self-contained client & server repro, I will prioritize fixing this, but it's not even obvious what the problem is. Using a custom dialer and custom net.Conns that mess with read timeouts at least gives me an excuse to point the blame finger at you, but I'm perfectly willing to believe it's my fault too; I'm just not sure exactly what to look for. I have enough free time at the moment to fix something, but not to investigate.
If it's the timeout that's causing the issue, then it's related to issue #3362, in that my timeout handling may be an attempt to work around that. I'd be happy for this to be my fault, but I'm not sure how else I can do this:

func (tc *timeoutClient) Read(p []byte) (n int, err error) {
	if tc.readTimeout > 0 {
		tc.socket.SetReadDeadline(time.Now().Add(tc.readTimeout))
	}
	return tc.body.Read(p)
}

With the timeout at 0, I don't see any evidence that things are going wrong (again, it would run for hours regardless). I've been trying pretty hard to abuse this. My latest attempt crashed OS X and deleted my source (I need to stop writing things in /tmp). I'll continue working towards a repro, assuming it's related to the above.
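[Editor's note: for context, a minimal sketch of the wrapper type the Read method above implies. The field set and imports are assumptions about the surrounding code, not the original; the Read method is repeated so the sketch is self-contained.]

import (
	"io"
	"net"
	"time"
)

// timeoutClient wraps the connection to an HTTP server together with the
// body being consumed, bumping the read deadline before each read so a
// stalled server surfaces as an i/o timeout instead of a permanent hang.
type timeoutClient struct {
	socket      net.Conn      // underlying connection the body arrives on
	body        io.Reader     // response (or request) body being read
	readTimeout time.Duration // 0 disables the deadline bumping entirely
}

func (tc *timeoutClient) Read(p []byte) (n int, err error) {
	if tc.readTimeout > 0 {
		tc.socket.SetReadDeadline(time.Now().Add(tc.readTimeout))
	}
	return tc.body.Read(p)
}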
I'm fairly certain the problem here is that I'm closing the body when the connection times out. This is definitely wrong and would explain why I haven't been able to reproduce this in isolation. Let's just call this my fault until I prove otherwise. I do wish I had another star to offer issue #3362, though. :)
It's still a bug, though: the transport should never panic the whole program. That needs to be fixed regardless. Even if you find it to be "your fault", please help me repro this. If it's your timeout code plus closing before the timeout, please replicate that combination in a nice little bradsucks.go file and I'll at least fix the panic, even if the code isn't what one should do.
I updated to tip (after seeing that new change come in) and got this today:

changeset: 14611:57736947de5a
tag: tip
user: Dave Cheney <dave@cheney.net>
date: Fri Oct 12 09:17:56 2012 +1100
summary: net/http: fix data race on countReader.n

This is on FreeBSD (the previous one was on a Mac). Same program, but:

goroutine 501870 [running]:
net/http.(*bodyEOFSignal).Read(0xf8406d45a0, 0xf8406df000, 0x8000, 0x8000, 0x5a8, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:827 +0xcc
io.(*multiReader).Read(0xf840118380, 0xf8406df000, 0x8000, 0x8000, 0x5a8, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/io/multi.go:13 +0x92
io.Copy(0xf840145480, 0xf840573d60, 0xf8401450f0, 0xf840118380, 0xe92878, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/io/io.go:360 +0x1a9
net/http.(*transferWriter).WriteBody(0xf8402dc9a0, 0xf8400cac00, 0xf8405c7c80, 0x0, 0x0, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transfer.go:194 +0x1c6
net/http.(*Request).write(0xf8405d9820, 0xf8400cac00, 0xf8405c7c80, 0x0, 0xf840118680, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/request.go:371 +0x833
net/http.(*persistConn).writeLoop(0xf840162f00, 0x0)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:661 +0x1b8
created by net/http.(*Transport).getConn
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:409 +0x5cd

The problem does go away if I disable my timeout, but I've still not reproduced it outside of this program without crashing a machine; part of that is likely from trying to make it happen too fast. Which is odd, because the timeout is straightforward: the socket has a deadline on read. Most likely I'm not reading from the body soon enough (due to processing of the stuff coming off of it), and when I go back to read from it, it breaks. This happened after about 800,000 articles and who knows how many images pulled from them today; it was several hours.
Sorry, again not a repro, but I did capture the full stack this time and wanted to get it included in the report in case there's something useful here. Note that I'm still running the build from the previous changeset and have disabled part of my timeout logic (the part I was suspicious of). That means the remaining timeouts are on the GET side and the PUT side of an individual request; at this point I'd be most suspicious of the read timeout on a PUT request. I'll start going down this path for a repro now (notice this one happened almost exactly 24 hours into a run). [comment is too long... http://pastebin.com/djEpGLpR ]
OK, using that last bit of info, I have a repro: http://play.golang.org/p/VNx9D1u4tX

This breaks in a really weird way. It does the right thing twice, hangs consuming lots of CPU for a while, and then panics:

2012/10/13 12:39:28 Error doing the thing: read tcp 127.0.0.1:62331: i/o timeout
2012/10/13 12:39:28 put: copied 8036337 bytes before http: suspiciously long trailer after chunked body
2012/10/13 12:39:30 Error doing the thing: read tcp 127.0.0.1:62331: i/o timeout
2012/10/13 12:39:30 put: copied 8355840 bytes before http: suspiciously long trailer after chunked body
2012/10/13 12:41:57 Error doing the thing: read tcp 127.0.0.1:62331: i/o timeout
panic: http: unexpected bodyEOFSignal Read after Close; see issue #1725

goroutine 28 [running]:
net/http.(*bodyEOFSignal).Read(0xf8400ca120, 0xf8401a0000, 0x8000, 0x8000, 0x7008, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:827 +0xcc
io.(*multiReader).Read(0xf840076a40, 0xf8401a0000, 0x8000, 0x8000, 0xff8, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/io/multi.go:13 +0x92
io.Copy(0xf8400cad50, 0xf8401859f0, 0xf8400cacc0, 0xf840076a40, 0x838ff8, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/io/io.go:360 +0x1a9
net/http.(*transferWriter).WriteBody(0xf840125460, 0xf8400ca0f0, 0xf840186bc0, 0x0, 0x0, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transfer.go:194 +0x1c6
net/http.(*Request).write(0xf8400d7ea0, 0xf8400ca0f0, 0xf840186bc0, 0x0, 0xf8400f2420, ...)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/request.go:371 +0x833
net/http.(*persistConn).writeLoop(0xf840126400, 0x0)
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:661 +0x1b8
created by net/http.(*Transport).getConn
	/Users/dustin/prog/eprojects/go/src/pkg/net/http/transport.go:409 +0x5cd
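[Editor's note: for readers who cannot load the playground link, here is a rough, hypothetical sketch of the pattern the repro exercises, pieced together from the log output above. The server setup, buffer sizes, loop count, and the 2-second deadline are assumptions, not the original code: a GET whose connection bumps its read deadline on every Read, with the response body streamed into a chunked PUT so the deadline fires mid-copy.]

package main

import (
	"io"
	"io/ioutil"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// deadlineConn pushes the read deadline forward on every Read, mimicking
// the timeoutClient wrapper from the earlier comments.
type deadlineConn struct{ net.Conn }

func (c deadlineConn) Read(p []byte) (int, error) {
	c.Conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	return c.Conn.Read(p)
}

func main() {
	// Source server: streams data forever, so the client's read deadline
	// eventually fires in the middle of the body.
	src := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, 32<<10)
		for {
			if _, err := w.Write(buf); err != nil {
				return
			}
		}
	}))

	// Sink server: drains PUT bodies and logs how far each one got.
	sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n, err := io.Copy(ioutil.Discard, r.Body)
		log.Printf("put: copied %d bytes before %v", n, err)
	}))

	client := &http.Client{Transport: &http.Transport{
		Dial: func(network, addr string) (net.Conn, error) {
			c, err := net.Dial(network, addr)
			if err != nil {
				return nil, err
			}
			return deadlineConn{c}, nil
		},
	}}

	for i := 0; i < 10; i++ {
		res, err := client.Get(src.URL)
		if err != nil {
			log.Printf("Error doing the thing: %v", err)
			continue
		}
		// Stream the (eventually timing-out) GET body into a chunked PUT.
		req, err := http.NewRequest("PUT", sink.URL, res.Body)
		if err != nil {
			log.Fatal(err)
		}
		if resp, err := client.Do(req); err != nil {
			log.Printf("Error doing the thing: %v", err)
		} else {
			resp.Body.Close()
		}
		res.Body.Close()
	}
}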
Dustin, could you patch in http://golang.org/cl/6739055 and let me know if that fixes it for you? It seems to fix it for me here. Your test consistently crashed in various fun ways (at least on OS X), but now seems to get the right errors (and no bogus successes in the put handler or "doing the thing").
I've had this thing running for a few hours now. It would sometimes crash after about 24h (sometimes much sooner). Hard to say authoritatively that the problem is gone, but confidence continues to rise as the program continues to not crash. :)

Also, my test program isn't crashing after running for a while. It's probably right, but my program should finish in the next day or two, and that will be the highest vote of confidence.
I still don't like this CL, though. It means a client doing a blocking read can hang forever, since it's holding the mutex, preventing a Close from closing the underlying ReadCloser. I think I changed too much in this CL. It's quite likely that just some of the bug fixes were enough, without redoing the bodyEOFSignal's locking logic. So I'll probably ask you to test a future version of the patch.
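[Editor's note: to illustrate the locking hazard described here with a hypothetical type, not the actual bodyEOFSignal code: if Read holds the mutex for the duration of a blocking read, Close can never acquire it to tear the connection down.]

import (
	"io"
	"sync"
)

type lockedBody struct {
	mu sync.Mutex
	rc io.ReadCloser
}

func (b *lockedBody) Read(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock() // held across the entire, possibly blocking, read
	return b.rc.Read(p)
}

func (b *lockedBody) Close() error {
	b.mu.Lock() // blocks forever while a Read is stuck on the network
	defer b.mu.Unlock()
	return b.rc.Close()
}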
This issue was updated by revision 56bcef0.

R=golang-dev, rsc, dave
CC=golang-dev
http://golang.org/cl/6739055
Dustin, could you try http://golang.org/cl/6846068? Thanks.
I noticed the file descriptor leak (and the OS X hangs) as well. Glad to hear it kinda works for you, but the test failure still bothers me. I also opened issue #4403 with a mystery there.
Update on this via issue #4403, comment #3: https://golang.org/issue/4403?c=3

That explains the fd leak from Dustin's stress-test program. I'll be fixing that bug first, but it's looking quite likely that these two bugs will be fixed together in the same CL.
On Linux, at ffd1e075c260 (tip) + http://golang.org/cl/6851061 patched in:

$ go test -v -run=InfiniteGetTimeout net/http

GOMAXPROCS=4, w/o NewLoggingConn: HANGS
GOMAXPROCS=4, w/ NewLoggingConn: pass
no GOMAXPROCS, w/o NewLoggingConn: pass
no GOMAXPROCS, w/ NewLoggingConn: pass

So it looks like there's still a pollster bug. It doesn't seem to be a data race; at least go test -race doesn't report any problems. Help wanted.
When it hangs you can see the following stack:

goroutine 3 [semacquire]:
sync.runtime_Semacquire(0xc2000bd718, 0xc2000bd718)
	src/pkg/runtime/zsema_darwin_amd64.c:165 +0x2e
sync.(*WaitGroup).Wait(0xc200001b20, 0x0)
	src/pkg/sync/waitgroup.go:102 +0xf2
net/http/httptest.(*Server).Close(0xc200001af0, 0x105c2100)
	src/pkg/net/http/httptest/server.go:157 +0x41
----- stack segment boundary -----
bufio.(*Reader).Read(0xc20013b060, 0xc2000dd830, 0x2, 0x2, 0xffffffffffffffff, ...)
	src/pkg/bufio/bufio.go:151 +0x303
io.ReadAtLeast(0xc2000ddd80, 0xc20013b060, 0xc2000dd830, 0x2, 0x2, ...)
	src/pkg/io/io.go:271 +0x102
io.ReadFull(0xc2000ddd80, 0xc20013b060, 0xc2000dd830, 0x2, 0x2, ...)
	src/pkg/io/io.go:290 +0x72
net/http.(*chunkedReader).Read(0xc2000dd810, 0xc2000c6000, 0x2000, 0x2000, 0x1008, ...)
	src/pkg/net/http/chunked.go:77 +0x1dd
net/http.(*body).Read(0xc2000f3240, 0xc2000c6000, 0x2000, 0x2000, 0x2000, ...)
	src/pkg/net/http/transfer.go:538 +0x9a
net/http.(*bodyEOFSignal).Read(0xc2000dd840, 0xc2000c6000, 0x2000, 0x2000, 0x2000, ...)
	src/pkg/net/http/transport.go:837 +0x1b8
io/ioutil.devNull.ReadFrom(0x0, 0xc2000dd720, 0xc2000dd840, 0x1dd6ff8, 0x0, ...)
	src/pkg/io/ioutil/ioutil.go:137 +0xcf
io.Copy(0xc20008a2a0, 0x0, 0xc2000dd720, 0xc2000dd840, 0x0, ...)
	src/pkg/io/io.go:357 +0xa2
net/http_test.TestIssue4191_InfiniteGetTimeout(0xc2000bb2d0, 0xc20c021549)
	src/pkg/net/http/transport_test.go:947 +0x359
testing.tRunner(0xc2000bb2d0, 0x4e9fe0, 0x0, 0x0)
	src/pkg/testing/testing.go:301 +0x6c
created by testing.RunTests
	src/pkg/testing/testing.go:377 +0x86e

It says that the deferred Server.Close() is called from bufio.Reader.Read(). Initially I thought it was just a bug in the unwinder. But on second thought... consider that bufio.Reader.Read() panics for some reason. It then calls the deferred Server.Close() as expected, which hangs as expected, because the client connection's Body is not closed in this case. Indeed, when I replaced defer ts.Close() with a plain ts.Close() call, I see:

panic: runtime error: slice bounds out of range [recovered]
	panic: runtime error: slice bounds out of range

goroutine 3 [running]:
testing.func·003(0x105c2fa8, 0x105c2100)
	src/pkg/testing/testing.go:296 +0xc3
----- stack segment boundary -----
bufio.(*Reader).Read(0xc2000c4ea0, 0xc200115800, 0x2, 0x2, 0xffffffffffffffff, ...)
	src/pkg/bufio/bufio.go:151 +0x303
io.ReadAtLeast(0xc20008af00, 0xc2000c4ea0, 0xc200115800, 0x2, 0x2, ...)
	src/pkg/io/io.go:271 +0x102
io.ReadFull(0xc20008af00, 0xc2000c4ea0, 0xc200115800, 0x2, 0x2, ...)
	src/pkg/io/io.go:290 +0x72
net/http.(*chunkedReader).Read(0xc2001157e0, 0xc2000c7000, 0x2000, 0x2000, 0x1008, ...)
	src/pkg/net/http/chunked.go:77 +0x1dd
net/http.(*body).Read(0xc2000f3a40, 0xc2000c7000, 0x2000, 0x2000, 0x2000, ...)
	src/pkg/net/http/transfer.go:538 +0x9a
net/http.(*bodyEOFSignal).Read(0xc200115810, 0xc2000c7000, 0x2000, 0x2000, 0x2000, ...)
	src/pkg/net/http/transport.go:837 +0x1b8
io/ioutil.devNull.ReadFrom(0x0, 0xc2000dd360, 0xc200115810, 0x1dceff8, 0x0, ...)
	src/pkg/io/ioutil/ioutil.go:137 +0xcf
io.Copy(0xc20008a2a0, 0x0, 0xc2000dd360, 0xc200115810, 0x0, ...)
	src/pkg/io/io.go:357 +0xa2
net/http_test.TestIssue4191_InfiniteGetTimeout(0xc2000bb2d0, 0xc20ad0e6d2)
	src/pkg/net/http/transport_test.go:947 +0x336
testing.tRunner(0xc2000bb2d0, 0x4e9fe0, 0x0, 0x0)
	src/pkg/testing/testing.go:301 +0x6c
created by testing.RunTests
	src/pkg/testing/testing.go:377 +0x86e

So the current test code inherently deadlocks in the presence of panics.
The solution would be to close the client connection in a defer as well (this may require a nested function); then the connection will be closed before the server close. I've not yet looked at the panic; it looks like another bug.
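[Editor's note: a sketch of the suggested fix as a hypothetical test; the handler below is a trivial stand-in for the real test's handler. Deferred calls run last-in-first-out, so registering the body close after ts.Close guarantees the client connection is released before Server.Close waits on its connections, even when a read panics.]

import (
	"io"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestDeferOrderSketch(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "hello") // trivial stand-in handler
	}))
	defer ts.Close() // registered first, so it runs *second*

	res, err := http.Get(ts.URL)
	if err != nil {
		t.Fatal(err)
	}
	defer res.Body.Close() // registered second, so it runs *first*, even on panic

	// If this copy panics, the body is still closed before ts.Close
	// waits on its connections, avoiding the deadlock described above.
	io.Copy(ioutil.Discard, res.Body)
}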
In this instance, n == -1 is leaking in from the call to syscall.Read() in net/fd_unix.go:433. If err == EAGAIN, then n == -1, and we call into WaitFD, which pauses until the timeout occurs; then the loop continues from the top, detects that the deadline has passed, and breaks out of the loop, returning -1, errTimeout. Setting n = 0 inside the err == EAGAIN condition makes the test pass, but it could probably be done more elegantly.
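[Editor's note: in code form, the loop shape being described looks roughly like this. This is a paraphrase, not the actual net/fd_unix.go source; the deadline check and the pollserver/WaitFD names are approximations of whatever the real code uses.]

for {
	if fd.rdeadline > 0 && time.Now().UnixNano() >= fd.rdeadline {
		// Deadline already passed: break out with errTimeout. Without
		// the fix below, n still holds the -1 from syscall.Read.
		err = errTimeout
		break
	}
	n, err = syscall.Read(fd.sysfd, p)
	if err == syscall.EAGAIN {
		n = 0 // the proposed fix: reset n so -1 can't leak to the caller
		if err = pollserver.WaitFD(fd, 'r'); err == nil {
			continue // woken (possibly by the deadline firing); retry from the top
		}
	}
	break
}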
This issue was closed by revision 127d2bf. Status changed to Fixed.
Comment 35 by fish@freigeist.org:

I had a similar problem, and AlekSi helped me reproduce it with this simple snippet: https://gist.github.com/AlekSi/5029142

Revision be4b8f195c96 seems to fix it: aborting the transfer works without killing the program. But:

1.) It takes a long time between resp.Body.Close() and the return of io.Copy():

2013/02/25 13:03:56 Closing...
2013/02/25 13:06:09 Closed.
2013/02/25 13:06:09 Copied 188536/84623657 bytes.
2013/02/25 13:06:09 Copy error: http: read on closed response body
exit status 1

2.) During that time it uses about 3-4% CPU on OS X and 15-20% CPU (on two cores) on Ubuntu 12.04. strace on Ubuntu shows _a lot_ of futex/epoll_wait/epoll_ctl calls:

futex(0xc2000c00c0, FUTEX_WAKE, 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=3, u64=3}}}, 10, -1) = 1
epoll_ctl(6, EPOLL_CTL_DEL, 3, NULL) = 0

$ grep epoll_wait aleksi-download.log | wc -l
55398
This bug is already fixed. You're talking about issue #3672, but let's not discuss it here.
Comment 37 by swhite@dena.jp:

I have a similar error in that I'm getting the "Unsolicited response received on idle HTTP channel starting with "H"; err=" error, but I never get a panic. Instead I'm seeing the wrong response returned for a request. Unfortunately, I have not been able to reproduce this in a non-production environment. I was hoping that those of you familiar with this fix would be able to imagine a way this might happen, or possibly point me in the right direction as far as creating a reproducible test case.

There are many unique variables in my use of the Transport. I'm setting deadlines only on writes. For reference, I'm using this code to wrap the Transport: https://github.com/smw1218/falcore/blob/master/upstream/upstream.go

The response Body returned by the Transport is also consumed in a possibly strange way, using TCP_CORK: https://github.com/smw1218/falcore/blob/master/server.go#L249

Other random info: Linux x86_64, GOMAXPROCS = 8