
net: do not wait on EAGAIN from lossy packet conn #5932

Closed
gopherbot opened this issue Jul 22, 2013 · 29 comments
Comments

@gopherbot
Contributor

by swhite@fitstar.com:

When writing to a connection created by net.Dial("unixgram",
"/dev/log") I occasionally get EAGAIN back.  When that happens, the
internal WaitWrite call blocks forever if no deadline is set.  I can break out of it by
setting a deadline.  Writing to the connection after the timeout error works, so it looks
like WaitWrite is not working properly for this type of socket.  It happens for both
the Write and the WriteToUnix calls.

I don't know how to provoke an EAGAIN from a unixgram socket, so I don't have a test
case.  It was very difficult to reproduce this against rsyslog.  I'm hoping a socket
expert can figure out an easy way to reproduce it.

This bug started on go1.1 and is reproducible on go1.1.1 (I never saw it on 1.0.x).
amd64/linux (Ubuntu 12.04)
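
For reference, a minimal sketch of the workaround described above: set a write deadline before each write so a stuck EAGAIN wait surfaces as a timeout instead of blocking forever. The message format and retry policy are assumptions, not the reporter's actual code:

```go
// Sketch only: write to a unixgram socket with a write deadline so a
// stuck EAGAIN wait surfaces as a timeout instead of blocking forever.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log") // path taken from the report
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	msg := []byte("<14>test: hello from the deadline sketch\n")
	for i := 0; i < 3; i++ {
		// Without a deadline, an EAGAIN inside the runtime's wait can block forever.
		conn.SetWriteDeadline(time.Now().Add(100 * time.Millisecond))
		if _, err := conn.Write(msg); err != nil {
			if ne, ok := err.(net.Error); ok && ne.Timeout() {
				continue // as reported: the next write usually succeeds
			}
			log.Fatal(err)
		}
		return // written successfully
	}
	log.Print("gave up after repeated timeouts")
}
```
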
@alberts
Contributor

alberts commented Jul 22, 2013

Comment 1:

Could you post a complete program?

@gopherbot
Contributor Author

Comment 2 by swhite@fitstar.com:

I could not reproduce this with any trivial implementation.  The full setup has multiple
server processes all writing to rsyslog, and it required a large number of concurrent requests
before the issue occurred.  I have tried to write several small tests that write to
rsyslog or to a self-opened socket, but none have been able to reproduce the lockup.
I guess this is more of a reference in the hope that someone else runs across this with a
simpler setup.

@rsc
Contributor

rsc commented Jul 25, 2013

Comment 5:

If we get EAGAIN from a unixgram or udp write, I think we should just return the error.

Labels changed: added priority-later, go1.2, removed priority-triage.

Status changed to Accepted.

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 6:

>I have tried to write several small tests that write to rsyslog or to a self-opened
socket but none have been able to reproduce the lockup.

Have you tried writing at the highest possible speed for a substantial amount of time? You
should be able to send faster than syslog can process, so the buffers must overflow
eventually.
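
A sketch of the kind of flood test being suggested here, assuming a single connection to /dev/log and an arbitrary message format; this is not code from the thread:

```go
// Sketch of the suggested flood test: write to /dev/log as fast as possible
// so the datagram socket's buffers eventually overflow.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for i := 0; ; i++ {
		msg := fmt.Sprintf("<14>floodtest: message %d\n", i)
		if _, err := conn.Write([]byte(msg)); err != nil {
			log.Fatalf("write %d failed: %v", i, err)
		}
		if i%100000 == 0 {
			fmt.Println("sent", i)
		}
	}
}
```
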

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 7:

So we don't get epoll notifications for udp/unixgram sockets after EAGAIN?
Returning the error instantly is probably OK for unreliable protocols. But this is a
subtle change in behavior. E.g. a program that checks return codes will crash...

@gopherbot
Contributor Author

Comment 8 by swhite@fitstar.com:

I did a little more research, and most of the time EAGAIN from a unixgram socket works
as expected (does not block forever).  In fact, I can't get it to reproduce with
anything but our full-stack test setup:
HTTP load balancer/reverse proxy -> server under test with multiple writers to syslog,
all on the same box.
It only takes about 30s of load testing to make it happen in our test environment, so it
is 100% reproducible under the right conditions.

@dvyukov
Member

dvyukov commented Jul 26, 2013

Comment 9:

Could it be that the remote side closes the socket, but we miss the close
notification (rather than the write notification)?

@gopherbot
Contributor Author

Comment 10 by swhite@fitstar.com:

Here are the things I tried that did NOT work to reproduce the bug. FYI, none of these
actually run on the playground since they need file or syslog access.
This code produces an EAGAIN from a unixgram socket:
http://play.golang.org/p/lf1gs3KGvO
This code writes to rsyslog:
http://play.golang.org/p/lyws2UQj1O
This code uses the library I was using in our code to write to syslog:
http://play.golang.org/p/Am0SmrgY4f
(The current version of the timber library has been patched to avoid the issue entirely,
but I disabled that when I tested.)
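
The linked programs are not included on this page; as a rough stand-in, here is a sketch of one way to provoke EAGAIN from a self-opened unixgram socket pair (paths, sizes, and deadlines are arbitrary assumptions):

```go
// Sketch (not the linked playground code): provoke EAGAIN on a unixgram
// socket by writing to a self-opened endpoint that never reads.
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "unixgram")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sock")

	// Receiver that deliberately never reads, so its buffer fills up.
	raddr := &net.UnixAddr{Name: path, Net: "unixgram"}
	recv, err := net.ListenUnixgram("unixgram", raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer recv.Close()

	send, err := net.DialUnix("unixgram", nil, raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer send.Close()

	buf := make([]byte, 512)
	for i := 0; ; i++ {
		// The deadline turns the internal EAGAIN wait into a visible timeout.
		send.SetWriteDeadline(time.Now().Add(50 * time.Millisecond))
		if _, err := send.Write(buf); err != nil {
			fmt.Printf("write %d failed after the buffer filled: %v\n", i, err)
			return
		}
	}
}
```
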

@gopherbot
Contributor Author

Comment 11 by swhite@fitstar.com:

No, the socket is open and ready.  With the timeout set, the code breaks out of
WaitWrite and I can successfully write to the socket again without any reconnect.
One other piece of information: rsyslog is throttling during the test.  But I
don't think that's the cause, because it also throttles in both of the examples I just
posted and the EAGAINs don't block forever in those.

@gopherbot
Contributor Author

Comment 12 by swhite@fitstar.com:

Here's a stack trace of a blocked process:
http://pastebin.com/8exhXBYE
Goroutine 4 is the one that's blocked.  The code is part of the
github.com/fitstar/timber library.  There's basically one goroutine draining a channel
and writing to syslog.  A bunch of other goroutines are blocked waiting to write to that
channel, and the situation never resolves even with no additional load.

@rsc
Contributor

rsc commented Sep 11, 2013

Comment 13:

Labels changed: added go1.2maybe, removed go1.2.

@rsc
Contributor

rsc commented Sep 16, 2013

Comment 14:

Too late.

Labels changed: added go1.3maybe, removed go1.2maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 15:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 16:

Labels changed: added repo-main.

@gopherbot
Contributor Author

Comment 17 by chiparus:

Hello, I'm running into this issue too.
I have an application using multiple connections to syslog (/dev/log), and with just a
little volume it stalls after only a few messages. Using only one connection, the
problem is harder to reproduce.
This program has a 100% reproduction rate for me: http://play.golang.org/p/vu12qIyPil
I tried running it on multiple servers, with different rsyslog versions, etc. It doesn't
really matter whether rsyslog is configured to drop (rate-limit) messages or not. Even with a
tuned rsyslog config this script hangs within 10k messages.
I can repeatedly send millions of messages in one goroutine / one syslog connection
without problems. Using 2 goroutines / two syslog connections, it hangs within 10k
messages.
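
The linked playground program is not included here; below is a sketch of the pattern described (two goroutines, each with its own connection to /dev/log), with message contents and counts chosen arbitrarily:

```go
// Sketch of the described pattern (not the linked playground code):
// two goroutines, each with its own unixgram connection to /dev/log,
// writing continuously until one of them stalls.
package main

import (
	"fmt"
	"log"
	"net"
)

func writer(id int, done chan<- int) {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for i := 0; i < 10000; i++ {
		msg := fmt.Sprintf("<14>reprotest[%d]: message %d\n", id, i)
		if _, err := conn.Write([]byte(msg)); err != nil {
			log.Printf("writer %d: write %d failed: %v", id, i, err)
			break
		}
	}
	done <- id
}

func main() {
	done := make(chan int)
	for id := 0; id < 2; id++ {
		go writer(id, done)
	}
	// With the bug, one writer may never finish because its Write blocks forever.
	for i := 0; i < 2; i++ {
		fmt.Println("writer finished:", <-done)
	}
}
```
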

@ianlancetaylor
Member

Comment 18:

Labels changed: added suggested, removed priority-later.

@mikioh
Contributor

mikioh commented Apr 1, 2014

Comment 19:

Not sure I understand the issue correctly, but if it's a simple deadlock on resource
starvation, perhaps returning an error from Write/WriteTo/WriteMsg might mitigate the
unexpected consequences. FWIW, the attachment shows the difference between Linux and OS X
Mavericks. On Linux, the unixgram writer blocks on write, waiting for the far
endpoint to read (I mean, EAGAIN). On Darwin, the writer eventually gets a "no buffer space
available" error.

Attachments:

  1. unixgram.go (1161 bytes)

@gopherbot
Contributor Author

Comment 20 by lk4d4math:

Hi, I've reproduced this issue on two machines, with rsyslog and with syslog-ng, just by
writing to syslog very fast. Also, there can be two Writers to "/dev/log": one can hang
while the other stays alive.
I copied the "log/syslog" code, and adding SetWriteDeadline in writeString worked for me.

@gopherbot
Contributor Author

Comment 21 by rsr@skyportsystems.com:

Please see https://groups.google.com/forum/#!topic/Golang-Nuts/PMm8nH0yaoA for more
information; we are also seeing this.  
I don't think it's a simple deadlock, though.

@dvyukov
Member

dvyukov commented Jul 26, 2014

Comment 22:

A common theme in all reproducers seems to be several fds referring to the same
datagram socket.
Linux has a known issue with epoll and several fds referring to the same file
description: an fd is not unregistered from epoll when that fd is closed; it's unregistered
only when the file description is closed (i.e. when the last fd referring to it is closed).
That's why we need to do epoll_ctl(EPOLL_CTL_DEL) when an fd is closed, even though the
documentation says it's unnecessary.
I suspect this is a reincarnation of the same kernel bug, which leads to delivery of the
epoll notification to a random (first?) fd referring to the unixgram socket, even if a
different fd actually got EAGAIN. I don't have any evidence, though. But such kernel
behavior would explain all the hangs that we see.
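
For context, the epoll semantics described here can be observed directly with the raw syscalls. The following Linux-only sketch (not from the thread) registers an fd, closes it while a dup'd fd keeps the file description alive, and still receives an event for the stale registration:

```go
//go:build linux

// Sketch of the epoll semantics described above: a registration is keyed to
// the open file description, not the fd, so closing the registered fd does
// not remove it from the epoll set while a dup'd fd keeps it alive.
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		log.Fatal(err)
	}

	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		log.Fatal(err)
	}

	// Register fds[0] for readability.
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fds[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fds[0], &ev); err != nil {
		log.Fatal(err)
	}

	// Keep the file description alive via a duplicate, then close the
	// registered fd. Per epoll(7), the registration survives.
	dup, err := syscall.Dup(fds[0])
	if err != nil {
		log.Fatal(err)
	}
	syscall.Close(fds[0])

	// Make the description readable.
	if _, err := syscall.Write(fds[1], []byte("ping")); err != nil {
		log.Fatal(err)
	}

	events := make([]syscall.EpollEvent, 1)
	n, err := syscall.EpollWait(epfd, events, 1000)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("events after closing the registered fd: %d (dup fd %d keeps it alive)\n", n, dup)
}
```
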

@gopherbot
Contributor Author

Comment 23 by rsr@skyportsystems.com:

Oy, I think you're right.  I made an updated tester that's a little more flexible, if
it's of any interest; see http://play.golang.org/p/1q0vrfplVT (-multi, etc.; repro
without -multi is easy).
Running with -clients 8: failure every time.
Running with -clients 8 -multi: ran it around 10 times without an observed failure.
Running with _4 separate processes_, -clients 1: failure in two of the four every few runs
<- bad.
It would be extremely useful if I could coax the runtime to dump (somewhere) the life of
a descriptor, especially in terms of epoll activity.  Otherwise it's time to get systemtap
working.
I'll run an extended-duration series of tests to see if I can reproduce it with a single
instance and -multi; it probably won't reveal anything worthwhile, but it can run by itself.

@dvyukov
Member

dvyukov commented Jul 27, 2014

Comment 24:

Uh, if it fails with multiple processes (which is reasonable if the hypothesis is
correct), then I do not see any sane way to actually fix it. We can return EAGAIN to the
caller or add timeouts.

@a13xb

a13xb commented Sep 16, 2014

Comment 25:

Setting a write deadline seems like a sensible fix to me. Write failure and reconnect are
already handled in the existing code.
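
A sketch of that combination, a write deadline plus redial on failure; the deadlineWriter helper below is hypothetical and not part of log/syslog:

```go
// Sketch of combining a write deadline with redial-on-failure, in the spirit
// of the fix discussed above. deadlineWriter is a hypothetical helper.
package main

import (
	"net"
	"time"
)

type deadlineWriter struct {
	conn net.Conn
	addr string
}

func (w *deadlineWriter) write(p []byte) error {
	if w.conn == nil {
		c, err := net.Dial("unixgram", w.addr)
		if err != nil {
			return err
		}
		w.conn = c
	}
	// Bound the write so a lost readiness notification can't hang us forever.
	w.conn.SetWriteDeadline(time.Now().Add(time.Second))
	if _, err := w.conn.Write(p); err != nil {
		// Drop the connection; the next call redials, as log/syslog does on error.
		w.conn.Close()
		w.conn = nil
		return err
	}
	return nil
}

func main() {
	w := &deadlineWriter{addr: "/dev/log"}
	_ = w.write([]byte("<14>sketch: hello\n"))
}
```
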

@dvyukov
Member

dvyukov commented Sep 17, 2014

Comment 26:

I can't reproduce it in the following configuration:
go version devel +38f88b067042 Tue Sep 16 12:50:05 2014 -0700 linux/amd64
$ uname -a
Linux 3.13.0-32-generic Ubuntu SMP
which is Ubuntu 14.04.
I run ./udptest in one console and then ./udptest -cmode -clients 8 in another. It
always says SUCCESS.
I have two hypotheses:
1. We now attach sockets to epoll after connect, which is what the kernel people told me
to do, so that may fix the issue.
2. It is a kernel bug that has been fixed in recent kernels.

@gopherbot gopherbot added the accepted and Suggested labels Sep 17, 2014
@driskell

Hello. A user of mine is seeing a similar issue to the one in Comment 12. Is there any useful information we can provide to help with this? Thanks

@chendo

chendo commented Jan 19, 2015

I've also seen this happen with our setup, where we mount /dev/log into our Docker containers to write to.

@a13xb

a13xb commented Sep 30, 2015

This may have been fixed somewhere between 1.3 and 1.3.1. The sample from comment 17 fails 100% of the time on 1.3 and succeeds 100% of the time on 1.3.1 for me.

@ianlancetaylor
Member

@driskell @chendo which version of Go are you using when you see the problem? Which kernel version? Thanks.

@bradfitz
Contributor

bradfitz commented Oct 9, 2015

Closing due to timeout.

@bradfitz bradfitz closed this as completed Oct 9, 2015
@golang golang locked and limited conversation to collaborators Oct 9, 2016