net: do not wait on EAGAIN from lossy packet conn #5932
Comments
Comment 2 by swhite@fitstar.com: I could not reproduce with any trivial implementation. The full setup has multiple server processes all writing to rsyslog and required a large number of concurrent requests before the issue occurred. I have tried to write several small tests that write to rsyslog or to a self-opened socket, but none have been able to reproduce the lockup. I guess this is more of a reference in the hopes someone else runs across this with a simpler setup.
> I have tried to write several small tests that write to rsyslog or to a self-opened socket but none have been able to reproduce the lockup.

Have you tried writing at the highest possible speed for a substantial amount of time? You must be able to send faster than syslog can process, so the buffers must overflow eventually.
Comment 8 by swhite@fitstar.com: I did a little more research, and most of the time EAGAIN from a unixgram socket works as expected (does not block forever). In fact, I can't get it to reproduce with anything but our full-stack test setup: HTTP load balancer/reverse proxy -> server under test with multiple writers to syslog, all on the same box. It only takes about 30s of load testing to make it happen in our test environment, so it is 100% reproducible under the right conditions.
Comment 10 by swhite@fitstar.com: Here are the things I tried that did NOT work to reproduce the bug. FYI, none of these actually run on the playground since they need file or syslog access.
- This code produces an EAGAIN from a unixgram socket: http://play.golang.org/p/lf1gs3KGvO
- This code writes to rsyslog: http://play.golang.org/p/lyws2UQj1O
- This code uses the library we were using in our code to write to syslog: http://play.golang.org/p/Am0SmrgY4f (The current version of the timber library has been patched to avoid the issue entirely, but I disabled that when I tested.)
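The playground links above are not reproduced in this thread, so here is a minimal sketch in the same spirit (not the original code, which isn't shown): a unixgram receiver that is drained very slowly and a writer that floods it, so the kernel's EAGAIN path gets exercised. The socket path and message counts are arbitrary.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const sock = "/tmp/eagain-test.sock" // hypothetical path
	os.Remove(sock)

	laddr, err := net.ResolveUnixAddr("unixgram", sock)
	if err != nil {
		panic(err)
	}
	// Receiver: a unixgram socket drained very slowly, so its receive
	// buffer fills up and the writer repeatedly hits EAGAIN in the kernel.
	rcv, err := net.ListenUnixgram("unixgram", laddr)
	if err != nil {
		panic(err)
	}
	defer rcv.Close()
	defer os.Remove(sock)

	go func() {
		buf := make([]byte, 1024)
		for {
			rcv.Read(buf)
			time.Sleep(time.Millisecond)
		}
	}()

	// Writer: floods the socket as fast as it can.
	snd, err := net.Dial("unixgram", sock)
	if err != nil {
		panic(err)
	}
	defer snd.Close()

	msg := make([]byte, 512)
	for i := 0; i < 10000; i++ {
		// Once the buffer is full, the runtime sees EAGAIN and parks this
		// goroutine in WaitWrite until the reader frees space. In the healthy
		// case it resumes; in the reported bug it never wakes up.
		if _, err := snd.Write(msg); err != nil {
			fmt.Println("write", i, "failed:", err)
			return
		}
	}
	fmt.Println("done")
}
```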
Comment 11 by swhite@fitstar.com: No, the socket is open and ready. With the timeout set, the code breaks out of the WaitWrite and I can successfully write to the socket again without any reconnect. One other piece of information is that rsyslog is throttling during the test. But I don't think that's the cause, because it also throttles in both of the examples I just posted and the EAGAINs don't block forever on those.
Comment 12 by swhite@fitstar.com: Here's a stack trace of a blocked process: http://pastebin.com/8exhXBYE Goroutine 4 is the one that's blocked. The code is part of the github.com/fitstar/timber library. There's basically one goroutine draining a channel and writing to syslog. A bunch of other goroutines are blocked waiting to write to that channel, and the situation never resolves even with no additional load.
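For context, a minimal sketch of the pattern described in that stack trace (not the actual fitstar/timber code): one goroutine drains a channel and writes each entry to syslog, while many producers send into the channel. If that single Write blocks forever, the channel fills and every producer blocks behind it. The tag and counts are made up.

```go
package main

import (
	"fmt"
	"log/syslog"
	"sync"
)

func main() {
	msgs := make(chan string, 256) // bounded buffer between producers and the writer

	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "channel-logger")
	if err != nil {
		panic(err)
	}
	defer w.Close()

	// The single writer goroutine: the only place syslog is touched.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for m := range msgs {
			// If this write parks forever in WaitWrite, nothing drains msgs
			// and every producer below ends up blocked on the channel send.
			if err := w.Info(m); err != nil {
				fmt.Println("syslog write failed:", err)
			}
		}
	}()

	var wg sync.WaitGroup
	for p := 0; p < 8; p++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				msgs <- fmt.Sprintf("producer %d message %d", id, i)
			}
		}(p)
	}
	wg.Wait()
	close(msgs)
	<-done
}
```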
Hello, I'm running into this issue too. I have an application using multiple connections to syslog (/dev/log), and with just a little volume it stalls after only a few messages. Using only one connection, the problem is harder to reproduce. This program has a 100% reproduction rate for me: http://play.golang.org/p/vu12qIyPil I tried running it on multiple servers, different rsyslog versions, etc. It doesn't really matter whether rsyslog is configured to drop (rate-limit) messages or not; even with a tuned rsyslog config this program hangs within 10k messages. I can send millions of messages repeatedly over one goroutine / one syslog connection without problems. Using two goroutines / two syslog connections, it hangs within 10k messages.
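The playground program isn't reproduced here; the following is a sketch of the same shape as described above: a couple of goroutines, each holding its own connection to the local syslog daemon, writing as fast as they can. The tag and message counts are arbitrary.

```go
package main

import (
	"fmt"
	"log/syslog"
	"sync"
)

func main() {
	const writers = 2 // two connections were enough to hang in the report above
	var wg sync.WaitGroup

	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each syslog.New call opens its own connection to the local daemon.
			s, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "eagain-test")
			if err != nil {
				fmt.Println("connect failed:", err)
				return
			}
			defer s.Close()
			for i := 0; i < 10000; i++ {
				if err := s.Info(fmt.Sprintf("writer %d message %d", id, i)); err != nil {
					fmt.Printf("writer %d: message %d failed: %v\n", id, i, err)
					return
				}
			}
			fmt.Printf("writer %d: done\n", id)
		}(w)
	}
	wg.Wait()
}
```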
Not sure I understand the issue correctly, but if it's a simple deadlock on resource starvation, perhaps returning an error from Write/WriteTo/WriteMsg might mitigate the unexpected consequences. FWIW, the attached shows the difference between Linux and OS X Mavericks. On Linux, the unixgram writer will be blocked on write waiting for the far endpoint to read (I mean, EAGAIN). On Darwin, the writer will eventually get a "no buffer space available" error.
Hi, I've reproduced this issue on two machines, with rsyslog and with syslog-ng, just by writing to syslog very fast. Also, there can be two Writers to "/dev/log": one can hang while the other is still alive. I copied the "log/syslog" code, and adding a SetWriteDeadline in writeString worked for me.
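The patched copy of log/syslog isn't shown in the thread; below is a sketch of the same idea applied to a plain unixgram connection: arm a write deadline before each write so a stuck Write comes back with a timeout error instead of parking forever. The socket path, priority prefix, and timeout value are illustrative.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// writeWithDeadline arms a write deadline, writes, then clears the deadline
// so later writes on the same connection are unaffected.
func writeWithDeadline(c net.Conn, msg []byte, d time.Duration) error {
	if err := c.SetWriteDeadline(time.Now().Add(d)); err != nil {
		return err
	}
	_, err := c.Write(msg)
	c.SetWriteDeadline(time.Time{})
	return err
}

func main() {
	c, err := net.Dial("unixgram", "/dev/log") // assumes a local syslog socket at /dev/log
	if err != nil {
		fmt.Fprintln(os.Stderr, "dial:", err)
		return
	}
	defer c.Close()

	msg := []byte("<14>eagain-test: hello") // <14> = facility user, severity info
	for i := 0; i < 100000; i++ {
		err := writeWithDeadline(c, msg, 2*time.Second)
		if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
			// Without the deadline this is where the goroutine would sit in
			// WaitWrite indefinitely; with it the caller can drop, retry, or
			// reconnect instead.
			fmt.Println("write", i, "timed out; dropping message")
			continue
		}
		if err != nil {
			fmt.Println("write", i, "failed:", err)
			return
		}
	}
}
```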
Comment 21 by rsr@skyportsystems.com: Please see https://groups.google.com/forum/#!topic/Golang-Nuts/PMm8nH0yaoA for more information; we are also seeing this. I don't think it's a simple deadlock, though.
A common theme in all reproducers seems to be several fds referring to the same datagram socket. Linux has a known issue with epoll and several fds referring to the same file description: an fd is not unregistered from epoll when that fd is closed; it is unregistered only when the file description is closed (that is, when the last fd referring to it is closed). That's why we need to do epoll_ctl(EPOLL_CTL_DEL) when an fd is closed, even though the documentation says it's unnecessary. I suspect this is a reincarnation of the same kernel bug, leading to delivery of the epoll notification to a random (first?) fd referring to the unixgram socket (even if a different fd actually got the EAGAIN). I don't have any evidence, though. But such kernel behavior would explain all the hangs that we see.
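To make those epoll semantics concrete, here is a small Linux-only sketch (using golang.org/x/sys/unix, which is an assumption for illustration; the runtime itself issues raw syscalls): a socket fd is registered with epoll, a dup of it keeps the file description alive, and closing the registered fd without an explicit EPOLL_CTL_DEL does not remove the registration, so events are still reported for the already-closed fd.

```go
//go:build linux

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	fd, err := unix.Socket(unix.AF_UNIX, unix.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	dup, err := unix.Dup(fd) // a second fd referring to the same file description
	if err != nil {
		panic(err)
	}
	defer unix.Close(dup)

	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	ev := unix.EpollEvent{Events: unix.EPOLLOUT, Fd: int32(fd)}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
		panic(err)
	}

	// Close the registered fd without an explicit EPOLL_CTL_DEL. Because dup
	// still refers to the same file description, the registration survives.
	unix.Close(fd)

	events := make([]unix.EpollEvent, 4)
	n, err := unix.EpollWait(epfd, events, 100) // timeout in ms
	if err != nil {
		panic(err)
	}
	// On Linux this still reports an event carrying the already-closed fd,
	// which is exactly the hazard described in the comment above.
	fmt.Printf("events after closing fd %d: %d\n", fd, n)
}
```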
Comment 23 by rsr@skyportsystems.com: Oy, I think you're right. I did an updated tester that's a little more flexible, if it's of any interest; see http://play.golang.org/p/1q0vrfplVT (-multi, etc.; repro without -multi is easy).
- Running with -clients 8: failure every time.
- Running with -clients 8 -multi: ran it around 10 times without an observed failure.
- Running with _4 separate processes_, -clients 1: failure in two of four every few runs <----------- bad.
It would be extremely useful if I could coax the runtime to dump (somewhere) the life of a descriptor, especially in terms of epoll activity. Otherwise it's time to get systemtap working. I'll run an extended-duration series of tests to see if I can reproduce it with a single instance and -multi; it probably won't reveal anything worthwhile, but it can run by itself.
I can't reproduce it in the following configuration:
go version devel +38f88b067042 Tue Sep 16 12:50:05 2014 -0700 linux/amd64
$ uname -a
Linux 3.13.0-32-generic Ubuntu SMP (which is Ubuntu 14.04)
I run ./udptest in one console and then ./udptest -cmode -clients 8 in another. It always says SUCCESS. I have two hypotheses: 1. We now attach sockets to epoll after connect, which is what the kernel people told me to do; that could have fixed the issue. 2. It is a kernel bug that has been fixed in recent kernels.
Hello. A user of mine is seeing a similar issue to the one in Comment 12. Is there any useful information we can provide to help with this? Thanks.
I've also seen this happen with our setup where we mount in
This may have been fixed somewhere between 1.3 and 1.3.1. The sample from comment 17 fails 100% of the time on 1.3 and succeeds 100% of the time on 1.3.1 for me.
Closing due to timeout.