net: ResolveIPAddr triggers glibc bug writing to wrong fd #6336

Closed
gopherbot opened this issue Sep 5, 2013 · 39 comments

Comments

@gopherbot

by arnaud.lb:

It seems that ResolveIPAddr() can write to random, unrelated file descriptors under some
unknown conditions at high concurrency.

The code at [1] does a lot of DNS requests in parallel. At the same time, it also makes
short-lived connections to a local unix socket (Dial, a few Writes, and Close).

After a few seconds, the unix socket's listener sees random garbage, presumably DNS
requests (I can sometimes see parts of domain names in the garbage).

It looks very much like ResolveIPAddr() is writing to a random file descriptor, or that it
is corrupting some buffer.

To reproduce, run ./server and ./client. The server should start receiving binary
garbage after a few seconds (and log it to stderr).

go version go1.1.2 linux/amd64

[1] https://gist.github.com/arnaud-lb/3af01cdfb6b1ee38c122
@dvyukov
Member

dvyukov commented Sep 5, 2013

Comment 1:

The symptoms sound very much like a data race on a file descriptor, where a read/write
races with a close.

@gopherbot
Author

Comment 2 by arnaud.lb:

Can this happen even if the net.Conn is not shared between goroutines?
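go func() {
    // Each goroutine dials its own connection; conn is never shared.
    conn, err := net.Dial("unix", "/tmp/test.sock")
    if err != nil {
        return
    }
    defer conn.Close()
    conn.Write([]byte("foo\n"))
}()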
Is this something Go or I can fix? Is there a workaround?

@mikioh
Contributor

mikioh commented Sep 6, 2013

Comment 3:

I don't understand how unix-domain stuff could use the DNS resolver. Is that a special
behavior of Linux or the Linux runtime libraries?

@robpike
Contributor

robpike commented Sep 6, 2013

Comment 4:

I agree that dvyukov's analysis is likely correct. Does the race detector pick up the
problem?

Labels changed: added priority-soon, go1.2, removed priority-triage.

Status changed to Accepted.

@mikioh
Contributor

mikioh commented Sep 6, 2013

Comment 5:

Sorry, I misunderstood.
Can you please try with the CGO-disabled client?
I just tried it for a few minutes and got no garbage on the listener side.
Probably the root cause is the same as issue #6232.

@gopherbot
Author

Comment 6 by arnaud.lb:

It reproduces easily with a large /proc/sys/net/core/somaxconn (e.g. 10K).
With CGO disabled, the problem disappears.
Tried with -race; I had to reduce the number of goroutines to 2K and 6K for DNS and
socket respectively, so as not to hit the limit of 8192 threads. It doesn't seem to
detect a problem; however, the bug doesn't reproduce either.

@remyoudompheng
Contributor

Comment 7:

We encounter the same problem in production, but I couldn't write a minimal example. It
seems related to the behaviour of the GNU libc, although I couldn't understand what
could be the issue. Maybe a TLS corruption? The libc caches and reuses file descriptors
of UDP sockets in a thread-local structure.

@ianlancetaylor
Contributor

Comment 8:

If it's related to TLS corruption or some sort of glibc bug, it ought to be possible to
reproduce the problem with a multi-threaded C program.

@gopherbot
Author

Comment 9 by arnaud.lb:

The attached bug6336.c reproduces the bug too. Must be a getaddrinfo() bug.

Attachments:

  1. bug6336.c (4045 bytes)

@remyoudompheng
Contributor

Comment 10:

Thanks, I can use the program to reproduce the issue too, even with as low as 20
threads. I use glibc 2.18 on Archlinux at home. But strangely I couldn't reproduce the
bug using res_query/nquery instead of getaddrinfo.

@mikioh
Contributor

mikioh commented Sep 7, 2013

Comment 11:

Platform:
- It happens on Linux, mostly
- It doesn't happen on FreeBSD
Go runtime:
- It happens when we use DNS resolver written in C via CGO, on both go1.1 and tip
C runtime:
- It happens when we use ??? libc on linux 
- It doesn't happen when we use libc: e.g., shlib version 7 libc on freebsd. It contains
the traditional DNS resolver from Berkeley and getaddrinfo from KAME
Hm, what a mess.

@gopherbot
Author

Comment 12 by sebastien.paolacci:

Arnaud's reproducer does trigger the bug on both eglibc 2.6.11 and 2.6.13 (Debian
Squeeze & Wheezy).

@gopherbot
Author

Comment 13 by arnaud.lb:

This bug is crazy. I've filed a Debian bug:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075

@davecheney
Contributor

Comment 14:

Thank you for taking it to (one of) the upstream(s). Given the number of systems
affected, it sounds like Go will need to develop its own workaround regardless.

@rsc
Contributor

rsc commented Sep 11, 2013

Comment 15:

Without knowing more about exactly what the bug is in glibc,
the only workaround I can think of is to stop using cgo for networking on Linux.
That's possible but probably too big a hammer.

@ianlancetaylor
Contributor

Comment 16:

See also issue #6232, which is about the exact same glibc code.
I suspect issue #6232 was fixed in glibc on 2012-11-19 as part of the fix for
http://sourceware.org/bugzilla/show_bug.cgi?id=14719 .  That patch should be in glibc 2.17 and
later.  Not sure if that patch also fixed this issue.  I have not been able to recreate
the problem myself using the C test case.

@rsc
Contributor

rsc commented Sep 11, 2013

Comment 17:

FWIW, it looks to me like maybe issue #6232 was _caused_ by the fix you linked to.

@remyoudompheng
Contributor

Comment 18:

iant: the C test case reproduces the issue for me with glibc 2.18.

@ianlancetaylor
Contributor

Comment 19:

Great, in that case, please file a bug report at http://sourceware.org/bugzilla .  That
is the bug queue that the glibc maintainers actually read.  See
http://www.gnu.org/software/libc/bugs.html .  Thanks.

@gopherbot
Author

Comment 20 by arnaud.lb:

I've filed a bug at https://sourceware.org/bugzilla/show_bug.cgi?id=15946
Originally filed at Debian because it may very well be a Debian-specific or
distro-specific bug.
The only workaround I've found is to use the pure-go resolver. It works really well so
far. It would be great if the pure-go resolver were exposed, or if there were a way to
"enable" it.

@rsc
Contributor

rsc commented Sep 13, 2013

Comment 21:

You can enable the pure-go resolver by rebuilding the standard library with it turned on:
go install -a -tags netgo std
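
For readers on newer toolchains, the same effect can be sketched per resolver rather than per build. This is an assumption about the reader's environment, not something available when this comment was written: the PreferGo field on net.Resolver was added in a later Go release. The helper name newPureGoResolver is hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that asks the
// net package to prefer its built-in Go DNS client over the cgo path that
// calls into the C library.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a missing or slow name server cannot hang the demo.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("addresses:", addrs)
}
```

Later releases also accept GODEBUG=netdns=go in the environment to select the Go resolver for a whole process without rebuilding anything.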

@rsc
Contributor

rsc commented Sep 13, 2013

Comment 22:

Since there is a brute force workaround (stop using cgo) and we don't understand the
actual problem better, demoting to Go1.2Maybe, although it's really Go1.2ProbablyNot.

Labels changed: added go1.2maybe, removed go1.2.

@mikioh
Contributor

mikioh commented Sep 14, 2013

Comment 23:

Here is a list of APIs related to this issue.
- Dial
- DialTimeout
- Dialer.Dial
- LookupHost
- LookupIP
- LookupPort
- LookupCNAME
- ResolveTCPAddr
- ResolveUDPAddr
- ResolveIPAddr

@rsc
Contributor

rsc commented Sep 15, 2013

Comment 24:

What are the drawbacks of not using cgo for name resolution on Linux and
the BSDs?
On the Mac, it is a must because the firewall does not allow ordinary code
to receive the incoming DNS responses.
The only thing I can think of that would cause problems on Linux would be
if the local name server configuration has some non-DNS resolution
mechanisms. For example if you are using something like Bonjour (yes it
works on Linux) to resolve names to IP addresses, Go programs won't be able
to do that. But I imagine that's rare.
Inside Google we arrange to use the pure Go version (instead of the cgo
calls into glibc) on Linux because the Go version scales so much better.
Perhaps we should make the default on Linux and the BSDs be 'not cgo'?

@davecheney
Contributor

Comment 25:

+1 
I feel, on balance, the number of support cases raised because Go does (by default) use
libc resolver extensions like ldap and bonjour/avahi, would be lower than the current
number of cases where people are hitting strange concurrency issues in their libc
resolver libraries. 
Disabling net+cgo by default gives a possible resolution for people who need ldap/etc
lookup in their Go processes, whereas the workaround of disabling cgo is either not
suitable (they have more cgo in their code) or people feel that something is being taken
away from them. 
If it were possible to change within Go 1.x I would vote for doing this for at least
*BSD and Linux.

@davecheney
Contributor

Comment 26:

+1 
I feel, on balance, the number of support cases raised because Go does NOT (by default)
use libc resolver extensions like ldap and bonjour/avahi, would be lower than the
current number of cases where people are hitting strange concurrency issues in their
libc resolver libraries. 
Disabling net+cgo by default gives a possible resolution for people who need ldap/etc
lookup in their Go processes, whereas the workaround of disabling cgo is either not
suitable (they have more cgo in their code) or people feel that something is being taken
away from them. 
If it were possible to change within Go 1.x I would vote for doing this for at least
*BSD and Linux.

@ianlancetaylor
Contributor

Comment 27:

Not using cgo on GNU/Linux would mean ignoring the /etc/nsswitch.conf file.  Some people
would find that surprising.  I don't know which would be more inexplicable for the
unsuspecting user: different DNS lookups by Go programs, or strange behaviour when doing
highly concurrent name lookups.
I don't think we should take any immediate steps now.  We should give the glibc
maintainers several days to analyze the bug report.  The nature of the bug in glibc may
suggest workarounds we can take in Go.

@mikioh
Contributor

mikioh commented Sep 16, 2013

Comment 28:

Looks like the built-in name resolver doesn't support EDNS0 (RFC 6891) because of
https://groups.google.com/d/msg/golang-nuts/F2X38c4JcKs/95nMGxjrDu0J.

@mikioh
Contributor

mikioh commented Sep 16, 2013

Comment 29:

Fortunately, the lack of EDNS0 isn't critical because the built-in name resolver supports
TCP fallback in Go 1.2.

@rsc
Contributor

rsc commented Oct 2, 2013

Comment 30:

The Debian bug has not been updated, so we still don't know what is causing this. 
The workaround is: "go install -a -tags netgo net". That only needs to be done once in
ordinary use; builds of other packages will see that net is up-to-date and use the pure
Go version.
Leaving for Go 1.3.

Labels changed: added go1.3, removed go1.2maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 31:

Labels changed: added release-go1.3.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 32:

Labels changed: removed go1.3.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 33:

Labels changed: added repo-main.

@rsc
Contributor

rsc commented Apr 3, 2014

Comment 34:

Maybe by Go 1.4 the Debian or glibc guys will know what they did wrong.

Labels changed: added release-go1.4, removed release-go1.3.

@ianlancetaylor
Contributor

Comment 35:

According to https://sourceware.org/bugzilla/show_bug.cgi?id=15946 this will be fixed in
the glibc 2.20 release.
It was a one-line patch in glibc:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blobdiff;f=resolv/res_send.c;h=af42b8aac216356a5466998df5c47c21357881d3;hp=3273d55ceb5eeb354aab61aae96224412a6ed308;hb=f9d2d03254a58d92635a311a42253eeed5a40a47;hpb=71840409ea45ab9e49d0ac70dfc1c355accf355f
If I'm reading the code correctly, the Go package could avoid the problem by querying
separately for AF_INET and AF_INET6 (by setting hints.ai_family).  I think the problem
only arises when the hints do not specify an address family, in which case the resolv
library will send two simultaneous requests, one for AF_INET and one for AF_INET6.  The
problem arises when the first request is received but the second request times out.  In
that case the second request is resent.  However, between the first and second time the
second request is sent, the network descriptor might have been closed.  If that
happens, the second request is resent on the wrong descriptor.  If you're really
unlucky, that descriptor was reopened as something else.
I do not know if it's worth working around this in the Go code.  I don't know how often
this problem arises in practice.

@bradfitz
Contributor

Comment 36:

Nice test updates in that one-line patch.

@rsc
Contributor

rsc commented Sep 16, 2014

Comment 37:

People seem to be living with this okay, so I am inclined not to make any changes for
1.4. 
If it comes up, we can tell people to update to glibc 2.20.
A problem with setting hints and making the call twice is that we won't know the
priority order that the merged call would have returned.

Status changed to Unfortunate.

@james-relyea

This seems to still be an issue with 1.6.2 on OSX (10.11.5). I was able to resolve it by disabling CGO, as suggested above.

@ianlancetaylor
Contributor

@james-relyea This issue was closed long ago, and occurred only on glibc-based systems like GNU/Linux, never on OS X. If you are seeing a problem on OS X, then that is a different problem. Please open a new issue or see https://golang.org/wiki/Questions .

@golang golang locked and limited conversation to collaborators Jun 29, 2016
This issue was closed.