Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: lookupIPDeadline leaks goroutines #8602

Closed
niemeyer opened this issue Aug 27, 2014 · 24 comments
Closed

net: lookupIPDeadline leaks goroutines #8602

niemeyer opened this issue Aug 27, 2014 · 24 comments
Milestone

Comments

@niemeyer
Copy link
Contributor

The deadline support in lookupIPDeadline is apparently not a real deadline, in the sense
that a goroutine is leaking on every lookup that does not return timely.

I have a bug report for mgo where someone got to 50k+ goroutines hanging on it:

https://jira.mongodb.org/browse/MGO-41

I'll have to do something else as I need to support existing releases, but it'd be good
to get this fixed so that deadlines are real, at some point.
@ianlancetaylor
Copy link
Contributor

Comment 1:

Labels changed: added repo-main, release-go1.4.

@niemeyer
Copy link
Contributor Author

Comment 2:

More details on the issue and a proof-of-concept patch for the net package that hacks
the general cgo_unix.go file to use the GNU extension getaddrinfo_a at:
https://jira.mongodb.org/browse/MGO-41?focusedCommentId=719482
The patch avoid major changes by just enforcing the documented cap of 30 seconds. It
works, but cannot be adopted as-is as the cgo_unix.go file is shared with systems that
most likely do not support this extension.

@rsc
Copy link
Contributor

rsc commented Sep 16, 2014

Comment 3:

I'm sorry but I cannot find the patch on that page.

@niemeyer
Copy link
Contributor Author

Comment 4:

I'm sorry for not providing a more clear description. The patch is at:
    [1] http://paste.ubuntu.com/8354072/
and this is the section of the bug comment referring to it:
4. Improving the Go standard library
Improving the Go standard library appropriately will take a little while as all the code
paths leading to resolution will need to be refactored to provide a timeout, and
different APIs will need to be used in different operating systems, and the current hack
may need to moved closer to the ones that do not support a better API for this.
With that said, here is an incipient patch for Linux [1], tested on Ubuntu 14.04. As you
can see it makes use of the new getaddrinfo_a function, and enforces the documented
capping of 30 seconds. The test suite passes with it, and my cooked failure examples
pass as well. It obviously has seen little real world testing so far, so if you decide
to use it I'd recommend playing with these edges a little more before the release. It's
also important to observe that getaddrinfo_a is a GNU extension and the patch changes
code that is common to a number of operating systems, including Mac OS. If the Go
compiler is not linking with glibc, the file cgo_unix.go file will need to be duplicated
so that the proposed changes only affect Linux, and equivalent changes are done using
specific APIs for other relevant operating systems.

@rsc
Copy link
Contributor

rsc commented Oct 15, 2014

Comment 5:

Gustavo, it looks like you wrote the patch. You know how to send CLs. If you can send
one in the next few days, great. Otherwise let's leave this for 1.5.

Labels changed: added release-go1.4maybe, removed release-go1.4.

Status changed to Accepted.

@niemeyer
Copy link
Contributor Author

Comment 6:

I'm happy to work on this, but the patch cannot be applied as-is because getaddrinfo_a
is a GNU extension, and likely won't work on other systems.
I will specialize cgo_unix.go for Linux and send a proposal over the next few days.

@niemeyer
Copy link
Contributor Author

Comment 7:

Actually, I'm traveling over the next few days, and won't provide an appropriate level
of interaction on the review process for this issue. I can cook a proper patch for next
week if that works for you.

@ianlancetaylor
Copy link
Contributor

Comment 8:

The patch seems incomplete.  It seems that it ought to pass down the deadline and use
that.  But I admit that is complex with the use of singleflight.
You mention the documented cap of 30 seconds, but I think you are slightly misreading
it.  That is the maximum length of time the resolver will wait for a specific name
server.  If that name server does not respond, the resolver will try another.  And then
there is a number of times that the resolve will query a name server, which is attempts
in resolve.conf.  The default for attempts is 2 and the cap is 5.  So I think that if
you have three name servers in resolv.conf, the default time to wait is 3 * 5 (default
timeout) * 2 (default attempts) == 45 seconds, and the maximum time to wait is 3 * 30 *
5 == 450 seconds.  Imposing our own deadline of 30 seconds in all cases would mean
ignoring the values set in resolv.conf in some cases.
In the original bug report the core problem appears to be that getaddrinfo is hanging
indefinitely, despite the timeouts in resolv.conf.  That seems to be a bug in
getaddrinfo.  It's not obvious to me that using getaddrinfo_a will avoid that bug. 
getaddrinfo_a is a more complex code path.  Do you know if anybody has tried to report
the real bug?

@niemeyer
Copy link
Contributor Author

Comment 9:

The patch is definitely suboptimal and incomplete both in that it does not handle
multiple operating systems, and does not provide the real requested timeout down to the
resolving function. That's why I did not submit the patch upstream, despite knowing how
to send CLs.
You're also right in that I misread the timeout option as being a total, rather than per
server.
That said, getaddrinfo_a does seem to avoid the bug, and the right fix (for Linux, at
least) could build on it by providing the real timeout for the resolving function:
"The call blocks until one of the following occurs: (...) The  time  interval  specified
 in  timeout  elapses."
I don't understand your question about reporting the "real bug".

@ianlancetaylor
Copy link
Contributor

Comment 10:

It seems to me that the real bug is that getaddrinfo is in some cases hanging
indefinitely, and that is somehow then causing all future getaddrinfo calls to hang. 
That should be reported to the glibc maintainers.

@niemeyer
Copy link
Contributor Author

Comment 11:

Indeed, but I'm not the right person to report it. I cannot reproduce the bug nor
provide any information about the environment in which this is happening. I only have
indirect evidence that this is true.
Also, this seems like a real bug in Go itself too. Even if it is not indefinite,
goroutines are piling up for as long as the underlying call hangs, and that can happen
for several minutes as you point out. Not only that, but the logic that is in place
today prevents follow up calls from proceeding until the previous call returns.

@ianlancetaylor
Copy link
Contributor

Comment 12:

I agree that the current Go code should change.  I just don't quite see switching from
getaddrinfo to getaddrinfo_a as being the right change.  It's not wrong in itself, it
just doesn't seem to address the real issues.

@gopherbot
Copy link

Comment 13 by scintilla:

I believe this is the libc getaddrinfo bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075

@ianlancetaylor
Copy link
Contributor

Comment 14:

That bug is better known as https://sourceware.org/bugzilla/show_bug.cgi?id=15946 , and
is reportedly fixed in glibc 2.20.  It would be interesting to know which glibc version
was in use by the person who observed this bug.
Note that that bug is about having a very large number of file descriptors open
simultaneously for DNS lookups.  The reported problem is a large number of goroutines
waiting to check a DNS timeout.  You can have the latter without the former.

@ianlancetaylor
Copy link
Contributor

Comment 15:

Looking back at the original bug report, I think it's unlikely to be glibc bug 15946,
because that bug only arises when all available file descriptors have been opened.  The
original bug report is saying that many goroutines are hanging waiting on the answer to
a single getaddrinfo request that itself is hanging.  The single getaddrinfo request
will have a single file descriptor, so I don't see any reason to assume that large
numbers of file descriptors are open.

@ianlancetaylor
Copy link
Contributor

Comment 16:

I'm no longer convinced that the Go code has to change.  If getaddrinfo didn't hang,
everything would work fine.
If there were a portable mechanism we could use to add a timeout to getaddrinfo, it
might be worth pursuing for better efficiency.  But as far as I can tell we can only add
a timeout on GNU/Linux; I don't see any mechanism available on other systems.  And the
core problem here, getaddrinfo hanging, appears to be due to a bug on GNU/Linux.  So I
think we need to understand the GNU/Linux bug before we start trying to work around it
by following a different code path.

@niemeyer
Copy link
Contributor Author

Comment 17:

Even without the bug, goroutines pile up for as long as a timeout takes to arrive, and
lookup failures cascade as retrying ineffectively blocks on the same failed lookup.
Somewhat unfortunate.

@ianlancetaylor
Copy link
Contributor

Comment 18:

If you make 100 concurrent requests for a specific DNS hostname with a timeout, and the
DNS lookup is slow, then by definition you will have 100 goroutines waiting for that DNS
lookup to complete.  With the current implementation, if you set a deadline for the
lookup, you will have an extra 100 goroutines waiting.  That is inefficient but it's not
crazily inefficient.
You're right, of course, that if a DNS lookup is slow, other requests for the same
hostname will wait for the first lookup to complete.  And, in particular, if the program
retries after a timeout it will get stuck waiting for the first lookup to complete.  I
think we can fix that problem.  How does https://golang.org/cl/154610044 look
to you?

@niemeyer
Copy link
Contributor Author

Comment 19:

It looks like a good incremental improvement making retries work again. It should not be
too hard to get rid of the leaking goroutines as well, by changing the algorithm so that
the blocked cgo call is the one notifying waiters when it's done, instead of having one
extra middle man for every call.

@gopherbot
Copy link

Comment 20:

CL https://golang.org/cl/154610044 mentions this issue.

@ianlancetaylor
Copy link
Contributor

Comment 21:

This issue was closed by revision 77595e4.

Status changed to Fixed.

@niemeyer
Copy link
Contributor Author

Comment 22:

Thank you!

@gopherbot
Copy link

Comment 23:

CL https://golang.org/cl/166740043 mentions this issue.

@mikioh
Copy link
Contributor

mikioh commented Oct 28, 2014

Comment 24:

This issue was updated by revision 21a9141.

LGTM=iant
R=iant
CC=golang-codereviews
https://golang.org/cl/166740043

@golang golang locked and limited conversation to collaborators Jun 25, 2016
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jun 25, 2018
Before this CL, if the system resolver does a very slow DNS
lookup for a particular host, all subsequent requests for that
host will hang waiting for that lookup to complete.  That is
more or less expected when Dial is called with no deadline.
When Dial has a deadline, though, we can accumulate a large
number of goroutines waiting for that slow DNS lookup.  Try to
avoid this problem by restarting the DNS lookup when it is
redone after a deadline is passed.

This CL also avoids creating an extra goroutine purely to
handle the deadline.

No test because we would have to simulate a slow DNS lookup
followed by a fast DNS lookup.

Fixes golang#8602.

LGTM=bradfitz
R=bradfitz, mikioh.mikioh
CC=golang-codereviews, r, rsc
https://golang.org/cl/154610044
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jun 26, 2018
Before this CL, if the system resolver does a very slow DNS
lookup for a particular host, all subsequent requests for that
host will hang waiting for that lookup to complete.  That is
more or less expected when Dial is called with no deadline.
When Dial has a deadline, though, we can accumulate a large
number of goroutines waiting for that slow DNS lookup.  Try to
avoid this problem by restarting the DNS lookup when it is
redone after a deadline is passed.

This CL also avoids creating an extra goroutine purely to
handle the deadline.

No test because we would have to simulate a slow DNS lookup
followed by a fast DNS lookup.

Fixes golang#8602.

LGTM=bradfitz
R=bradfitz, mikioh.mikioh
CC=golang-codereviews, r, rsc
https://golang.org/cl/154610044
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jul 9, 2018
Before this CL, if the system resolver does a very slow DNS
lookup for a particular host, all subsequent requests for that
host will hang waiting for that lookup to complete.  That is
more or less expected when Dial is called with no deadline.
When Dial has a deadline, though, we can accumulate a large
number of goroutines waiting for that slow DNS lookup.  Try to
avoid this problem by restarting the DNS lookup when it is
redone after a deadline is passed.

This CL also avoids creating an extra goroutine purely to
handle the deadline.

No test because we would have to simulate a slow DNS lookup
followed by a fast DNS lookup.

Fixes golang#8602.

LGTM=bradfitz
R=bradfitz, mikioh.mikioh
CC=golang-codereviews, r, rsc
https://golang.org/cl/154610044
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jul 20, 2018
Before this CL, if the system resolver does a very slow DNS
lookup for a particular host, all subsequent requests for that
host will hang waiting for that lookup to complete.  That is
more or less expected when Dial is called with no deadline.
When Dial has a deadline, though, we can accumulate a large
number of goroutines waiting for that slow DNS lookup.  Try to
avoid this problem by restarting the DNS lookup when it is
redone after a deadline is passed.

This CL also avoids creating an extra goroutine purely to
handle the deadline.

No test because we would have to simulate a slow DNS lookup
followed by a fast DNS lookup.

Fixes golang#8602.

LGTM=bradfitz
R=bradfitz, mikioh.mikioh
CC=golang-codereviews, r, rsc
https://golang.org/cl/154610044
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jul 30, 2018
Before this CL, if the system resolver does a very slow DNS
lookup for a particular host, all subsequent requests for that
host will hang waiting for that lookup to complete.  That is
more or less expected when Dial is called with no deadline.
When Dial has a deadline, though, we can accumulate a large
number of goroutines waiting for that slow DNS lookup.  Try to
avoid this problem by restarting the DNS lookup when it is
redone after a deadline is passed.

This CL also avoids creating an extra goroutine purely to
handle the deadline.

No test because we would have to simulate a slow DNS lookup
followed by a fast DNS lookup.

Fixes golang#8602.

LGTM=bradfitz
R=bradfitz, mikioh.mikioh
CC=golang-codereviews, r, rsc
https://golang.org/cl/154610044
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants