
net: do not wait on EAGAIN from lossy packet conn #5932

Closed
gopherbot opened this issue Jul 22, 2013 · 29 comments
Comments

@gopherbot
Contributor

by swhite@fitstar.com:

When writing to a connection created by net.Dial("unixgram",
"/dev/log") I occasionally get EAGAIN back.  When that happens, the
internal WaitWrite call blocks forever if no deadline is set.  I can break out of it by
setting a deadline.  Writing to the connection after the timeout error works, so it looks
like WaitWrite is not working properly for this type of socket.  It happens for both
the Write and the WriteToUnix calls.

I don't know how to provoke an EAGAIN from a unixgram socket, so I don't have a test
case.  It was very difficult to reproduce this against rsyslog.  I'm hoping a socket
expert can figure out an easy way to reproduce it.

This bug started on go1.1 and is reproducible on go1.1.1 (I never saw it on 1.0.x).
amd64/linux (Ubuntu 12.04)
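
For reference, a minimal sketch of the workaround described above: set a write deadline before each write so a stuck EAGAIN wait surfaces as a timeout instead of blocking forever. The message format and retry policy are assumptions, not the reporter's actual code:

```go
// Sketch only: write to a unixgram socket with a write deadline so a
// stuck EAGAIN wait surfaces as a timeout instead of blocking forever.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log") // path taken from the report
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	msg := []byte("<14>test: hello from the deadline sketch\n")
	for i := 0; i < 3; i++ {
		// Without a deadline, an EAGAIN inside the runtime's wait can block forever.
		conn.SetWriteDeadline(time.Now().Add(100 * time.Millisecond))
		if _, err := conn.Write(msg); err != nil {
			if ne, ok := err.(net.Error); ok && ne.Timeout() {
				continue // as reported: the next write usually succeeds
			}
			log.Fatal(err)
		}
		return // written successfully
	}
	log.Print("gave up after repeated timeouts")
}
```
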
@alberts
Contributor

alberts commented Jul 22, 2013

Comment 1:

Could you post a complete program?

@gopherbot
Contributor Author

Comment 2 by swhite@fitstar.com:

I could not reproduce this with any trivial implementation.  The full setup has multiple
server processes all writing to rsyslog, and it required a large number of concurrent requests
before the issue occurred.  I have tried to write several small tests that write to
rsyslog or to a self-opened socket, but none have been able to reproduce the lockup.
I guess this is more of a reference in the hope that someone else runs across this with a
simpler setup.

@rsc
Contributor

rsc commented Jul 25, 2013

Comment 5:

If we get EAGAIN from a unixgram or udp write, I think we should just return the error.

Labels changed: added priority-later, go1.2, removed priority-triage.

Status changed to Accepted.

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 6:

>I have tried to write several small tests that write to rsyslog or to a self-opened
socket but none have been able to reproduce the lockup.

Have you tried writing at the highest possible speed for a substantial amount of time? You
should be able to send faster than syslog can process, so the buffers must overflow
eventually.
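
A sketch of the kind of flood test being suggested here, assuming a single connection to /dev/log and an arbitrary message format; this is not code from the thread:

```go
// Sketch of the suggested flood test: write to /dev/log as fast as possible
// so the datagram socket's buffers eventually overflow.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for i := 0; ; i++ {
		msg := fmt.Sprintf("<14>floodtest: message %d\n", i)
		if _, err := conn.Write([]byte(msg)); err != nil {
			log.Fatalf("write %d failed: %v", i, err)
		}
		if i%100000 == 0 {
			fmt.Println("sent", i)
		}
	}
}
```
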

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 7:

So we don't get epoll notifications for udp/unixgram sockets after EAGAIN?
Returning the error instantly is probably OK for unreliable protocols. But this is a
subtle change in behavior. E.g. a program that checks return codes will crash...

@gopherbot
Contributor Author

Comment 8 by swhite@fitstar.com:

I did a little more research, and most of the time EAGAIN from a unixgram socket works
as expected (does not block forever).  In fact, I can't get it to reproduce with
anything but our full-stack test setup:
HTTP load balancer/reverse proxy -> server under test with multiple writers to syslog,
all on the same box.
It only takes about 30s of load testing to make it happen in our test environment, so it
is 100% reproducible under the right conditions.

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 9:

Could it be that the remote side closes the socket, but we miss the close
notification (rather than the write notification)?

@gopherbot
Contributor Author

Comment 10 by swhite@fitstar.com:

Here are the things I tried that did NOT work to reproduce the bug. FYI, none of these
actually run on the playground since they need file or syslog access.
This code produces an EAGAIN from a unixgram socket:
http://play.golang.org/p/lf1gs3KGvO
This code writes to rsyslog:
http://play.golang.org/p/lyws2UQj1O
This code uses the library I was using in our code to write to syslog:
http://play.golang.org/p/Am0SmrgY4f
(The current version of the timber library has been patched to avoid the issue entirely,
but I disabled that when I tested.)
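
The linked programs are not included on this page; as a rough stand-in, here is a sketch of one way to provoke EAGAIN from a self-opened unixgram socket pair (paths, sizes, and deadlines are arbitrary assumptions):

```go
// Sketch (not the linked playground code): provoke EAGAIN on a unixgram
// socket by writing to a self-opened endpoint that never reads.
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "unixgram")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sock")

	// Receiver that deliberately never reads, so its buffer fills up.
	raddr := &net.UnixAddr{Name: path, Net: "unixgram"}
	recv, err := net.ListenUnixgram("unixgram", raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer recv.Close()

	send, err := net.DialUnix("unixgram", nil, raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer send.Close()

	buf := make([]byte, 512)
	for i := 0; ; i++ {
		// The deadline turns the internal EAGAIN wait into a visible timeout.
		send.SetWriteDeadline(time.Now().Add(50 * time.Millisecond))
		if _, err := send.Write(buf); err != nil {
			fmt.Printf("write %d failed after the buffer filled: %v\n", i, err)
			return
		}
	}
}
```
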

@gopherbot
Contributor Author

Comment 11 by swhite@fitstar.com:

No, the socket is open and ready.  With the timeout set, the code breaks out of
WaitWrite and I can successfully write to the socket again without any reconnect.
One other piece of information: rsyslog is throttling during the test.  But I
don't think that's the cause, because it also throttles in both of the examples I just
posted and the EAGAINs don't block forever in those.

@gopherbot
Contributor Author

Comment 12 by swhite@fitstar.com:

Here's a stack trace of a blocked process:
http://pastebin.com/8exhXBYE
Goroutine 4 is the one that's blocked.  The code is part of the
github.com/fitstar/timber library.  There's basically one goroutine draining a channel
and writing to syslog.  A bunch of other goroutines are blocked waiting to write to that
channel, and the situation never resolves even with no additional load.

@rsc
Contributor

rsc commented Sep 11, 2013

Comment 13:

Labels changed: added go1.2maybe, removed go1.2.

@rsc
Contributor

rsc commented Sep 16, 2013

Comment 14:

Too late.

Labels changed: added go1.3maybe, removed go1.2maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 15:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 16:

Labels changed: added repo-main.

@gopherbot
Contributor Author

Comment 17 by chiparus:

Hello, I'm running into this issue too.
I have an application using multiple connections to syslog (/dev/log), and with just a
little volume it stalls after only a few messages. Using only one connection, the
problem is harder to reproduce.
This program has a 100% reproduction rate for me: http://play.golang.org/p/vu12qIyPil
I tried running it on multiple servers, with different rsyslog versions, etc. It doesn't
really matter whether rsyslog is configured to drop (rate-limit) messages or not. Even with a
tuned rsyslog config this script hangs within 10k messages.
I can repeatedly send millions of messages in one goroutine / one syslog connection
without problems. Using 2 goroutines / two syslog connections, it hangs within 10k
messages.
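
The linked playground program is not included here; below is a sketch of the pattern described (two goroutines, each with its own connection to /dev/log), with message contents and counts chosen arbitrarily:

```go
// Sketch of the described pattern (not the linked playground code):
// two goroutines, each with its own unixgram connection to /dev/log,
// writing continuously until one of them stalls.
package main

import (
	"fmt"
	"log"
	"net"
)

func writer(id int, done chan<- int) {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for i := 0; i < 10000; i++ {
		msg := fmt.Sprintf("<14>reprotest[%d]: message %d\n", id, i)
		if _, err := conn.Write([]byte(msg)); err != nil {
			log.Printf("writer %d: write %d failed: %v", id, i, err)
			break
		}
	}
	done <- id
}

func main() {
	done := make(chan int)
	for id := 0; id < 2; id++ {
		go writer(id, done)
	}
	// With the bug, one writer may never finish because its Write blocks forever.
	for i := 0; i < 2; i++ {
		fmt.Println("writer finished:", <-done)
	}
}
```
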

@ianlancetaylor
Member

Comment 18:

Labels changed: added suggested, removed priority-later.

@mikioh
Contributor

mikioh commented Apr 1, 2014

Comment 19:

Not sure I understand the issue correctly, but if it's a simple deadlock on resource
starvation, perhaps returning an error from Write/WriteTo/WriteMsg might mitigate the
unexpected consequences. FWIW, the attachment shows the difference between Linux and OS X
Mavericks. On Linux, the unixgram writer blocks on write, waiting for the far
endpoint to read (I mean, EAGAIN). On Darwin, the writer eventually gets a "no buffer space
available" error.

Attachments:

  1. unixgram.go (1161 bytes)

@gopherbot
Contributor Author

Comment 20 by lk4d4math:

Hi, I've reproduced this issue on two machines, with rsyslog and with syslog-ng, just by
writing to syslog very fast. Also, there can be two Writers to "/dev/log": one can hang
while the other stays alive.
I copied the "log/syslog" code, and adding SetWriteDeadline in writeString worked for me.

@gopherbot
Contributor Author

Comment 21 by rsr@skyportsystems.com:

Please see https://groups.google.com/forum/#!topic/Golang-Nuts/PMm8nH0yaoA for more
information; we are also seeing this.  
I don't think it's a simple deadlock, though.

@dvyukov
Member

dvyukov commented Jul 26, 2014

Comment 22:

A common theme in all reproducers seems to be several fds referring to the same
datagram socket.
Linux has a known issue with epoll and several fds referring to the same file
description: an fd is not unregistered from epoll when that fd is closed; it's unregistered
only when the file description is closed (i.e. when the last fd referring to it is closed).
That's why we need to do epoll_ctl(EPOLL_CTL_DEL) when an fd is closed, even though the
documentation says it's unnecessary.
I suspect this is a reincarnation of the same kernel bug, which leads to delivery of the
epoll notification to a random (first?) fd referring to the unixgram socket, even if a
different fd actually got EAGAIN. I don't have any evidence, though. But such kernel
behavior would explain all the hangs that we see.
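
For context, the epoll semantics described here can be observed directly with the raw syscalls. The following Linux-only sketch (not from the thread) registers an fd, closes it while a dup'd fd keeps the file description alive, and still receives an event for the stale registration:

```go
//go:build linux

// Sketch of the epoll semantics described above: a registration is keyed to
// the open file description, not the fd, so closing the registered fd does
// not remove it from the epoll set while a dup'd fd keeps it alive.
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		log.Fatal(err)
	}

	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		log.Fatal(err)
	}

	// Register fds[0] for readability.
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fds[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fds[0], &ev); err != nil {
		log.Fatal(err)
	}

	// Keep the file description alive via a duplicate, then close the
	// registered fd. Per epoll(7), the registration survives.
	dup, err := syscall.Dup(fds[0])
	if err != nil {
		log.Fatal(err)
	}
	syscall.Close(fds[0])

	// Make the description readable.
	if _, err := syscall.Write(fds[1], []byte("ping")); err != nil {
		log.Fatal(err)
	}

	events := make([]syscall.EpollEvent, 1)
	n, err := syscall.EpollWait(epfd, events, 1000)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("events after closing the registered fd: %d (dup fd %d keeps it alive)\n", n, dup)
}
```
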

@gopherbot
Contributor Author

Comment 23 by rsr@skyportsystems.com:

Oy, I think you're right.  I made an updated tester that's a little more flexible, if
it's of any interest; see http://play.golang.org/p/1q0vrfplVT (-multi, etc.; repro
without -multi is easy).
Running with -clients 8: failure every time.
Running with -clients 8 -multi: ran it around 10 times without an observed failure.
Running with _4 separate processes_, -clients 1: failure in two of the four every few runs
<- bad.
It would be extremely useful if I could coax the runtime to dump (somewhere) the life of
a descriptor, especially in terms of epoll activity.  Otherwise it's time to get systemtap
working.
I'll run an extended-duration series of tests to see if I can reproduce it with a single
instance and -multi; it probably won't reveal anything worthwhile, but it can run by itself.

@dvyukov
Member

dvyukov commented Jul 27, 2014

Comment 24:

Uh, if it fails with multiple processes (which is reasonable if the hypothesis is
correct), then I do not see any sane way to actually fix it. We can return EAGAIN to the
caller or add timeouts.

@a13xb

a13xb commented Sep 16, 2014

Comment 25:

Setting a write deadline seems like a sensible fix to me. Write failure and reconnect are
already handled in the existing code.
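
A sketch of that combination, a write deadline plus redial on failure; the deadlineWriter helper below is hypothetical and not part of log/syslog:

```go
// Sketch of combining a write deadline with redial-on-failure, in the spirit
// of the fix discussed above. deadlineWriter is a hypothetical helper.
package main

import (
	"net"
	"time"
)

type deadlineWriter struct {
	conn net.Conn
	addr string
}

func (w *deadlineWriter) write(p []byte) error {
	if w.conn == nil {
		c, err := net.Dial("unixgram", w.addr)
		if err != nil {
			return err
		}
		w.conn = c
	}
	// Bound the write so a lost readiness notification can't hang us forever.
	w.conn.SetWriteDeadline(time.Now().Add(time.Second))
	if _, err := w.conn.Write(p); err != nil {
		// Drop the connection; the next call redials, as log/syslog does on error.
		w.conn.Close()
		w.conn = nil
		return err
	}
	return nil
}

func main() {
	w := &deadlineWriter{addr: "/dev/log"}
	_ = w.write([]byte("<14>sketch: hello\n"))
}
```
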

@dvyukov
Member

dvyukov commented Sep 17, 2014

Comment 26:

I can't reproduce it in the following configuration:
go version devel +38f88b067042 Tue Sep 16 12:50:05 2014 -0700 linux/amd64
$ uname -a
Linux 3.13.0-32-generic Ubuntu SMP
which is Ubuntu 14.04.
I run ./udptest in one console and then ./udptest -cmode -clients 8 in another. It
always says SUCCESS.
I have two hypotheses:
1. We now attach sockets to epoll after connect, which is what the kernel people told me
to do, so that may fix the issue.
2. It is a kernel bug that has been fixed in recent kernels.

@gopherbot gopherbot added the accepted and Suggested labels Sep 17, 2014
@driskell

Hello. A user of mine is seeing a similar issue to the one in Comment 12. Is there any useful information we can provide to help with this? Thanks

@chendo

chendo commented Jan 19, 2015

I've also seen this happen with our setup, where we mount /dev/log into our Docker containers to write to.

@a13xb

a13xb commented Sep 30, 2015

This may have been fixed somewhere between 1.3 and 1.3.1. The sample from comment 17 fails 100% of the time on 1.3 and succeeds 100% of the time on 1.3.1 for me.

@ianlancetaylor
Member

@driskell @chendo which version of Go are you using when you see the problem? Which kernel version? Thanks.

@bradfitz
Contributor

bradfitz commented Oct 9, 2015

Closing due to timeout.

@bradfitz bradfitz closed this as completed Oct 9, 2015
@golang golang locked and limited conversation to collaborators Oct 9, 2016