net: ResolveIPAddr triggers glibc bug writing to wrong fd #6336
Comments
Sorry, I misunderstood. Can you please try with the CGO-disabled client? I just tried it for a few minutes and got no garbage on the listener side. Probably the root cause is the same as issue #6232. |
It reproduces easily with a large /proc/sys/net/core/somaxconn (e.g. 10K). With CGO disabled, the problem disappears. Tried with -race; I had to reduce the number of goroutines to 2K and 6K for DNS and socket, so as not to hit the limit of 8192 threads. It doesn't seem to detect a problem; however, the bug doesn't reproduce either. |
We encounter the same problem in production, but I couldn't write a minimal example. It seems related to the behaviour of the GNU libc, although I couldn't understand what could be the issue. Maybe a TLS corruption? The libc caches and reuses file descriptors of UDP sockets in a thread-local structure. |
The attached bug6336.c reproduces the bug too. Must be a getaddrinfo() bug. |
Platform:
- It happens on Linux, mostly
- It doesn't happen on FreeBSD

Go runtime:
- It happens when we use the DNS resolver written in C via CGO, on both go1.1 and tip

C runtime:
- It happens when we use ??? libc on Linux
- It doesn't happen when we use libc on FreeBSD, e.g. shlib version 7 libc, which contains the traditional DNS resolver from Berkeley and getaddrinfo from KAME

Hm, what a mess. |
This bug is crazy. I've filed a Debian bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075 |
See also issue #6232, which is about the exact same glibc code. I suspect issue #6232 was fixed in glibc on 2012-11-19 as part of the fix for http://sourceware.org/bugzilla/show_bug.cgi?id=14719 . That patch should be in glibc 2.17 and later. Not sure if that patch also fixed this issue. I have not been able to recreate the problem myself using the C test case. |
FWIW, it looks to me like maybe issue #6232 was _caused_ by the fix you linked to. |
Great, in that case, please file a bug report at http://sourceware.org/bugzilla . That is the bug queue that the glibc maintainers actually read. See http://www.gnu.org/software/libc/bugs.html . Thanks. |
I've filed a bug at https://sourceware.org/bugzilla/show_bug.cgi?id=15946 . I originally filed it with Debian because it may very well be a Debian-specific or distro-specific bug. The only workaround I've found is to use the pure-Go resolver. It works really well so far. It would be great if the pure-Go resolver was exposed, or if there was a way to "enable" it. |
What are the drawbacks of not using cgo for name resolution on Linux and the BSDs? On the Mac, it is a must because the firewall does not allow ordinary code to receive the incoming DNS responses. The only thing I can think of that would cause problems on Linux would be if the local name server configuration has some non-DNS resolution mechanisms. For example if you are using something like Bonjour (yes it works on Linux) to resolve names to IP addresses, Go programs won't be able to do that. But I imagine that's rare. Inside Google we arrange to use the pure Go version (instead of the cgo calls into glibc) on Linux because the Go version scales so much better. Perhaps we should make the default on Linux and the BSDs be 'not cgo'? |
+1 I feel, on balance, the number of support cases raised because Go does NOT (by default) use libc resolver extensions like ldap and bonjour/avahi would be lower than the current number of cases where people are hitting strange concurrency issues in their libc resolver libraries. Disabling net+cgo by default gives a possible resolution for people who need ldap/etc lookup in their Go processes, whereas the workaround of disabling cgo is either not suitable (they have more cgo in their code) or people feel that something is being taken away from them. If it were possible to change within Go 1.x I would vote for doing this for at least *BSD and Linux. |
Not using cgo on GNU/Linux would mean ignoring the /etc/nsswitch.conf file. Some people would find that surprising. I don't know which would be more inexplicable for the unsuspecting user: different DNS lookups by Go programs, or strange behaviour when doing highly concurrent name lookups. I don't think we should take any immediate steps now. We should give the glibc maintainers several days to analyze the bug report. The nature of the bug in glibc may suggest workarounds we can take in Go. |
Looks like built-in name resolver doesn't support EDNS0 (RFC 6891) because of https://groups.google.com/d/msg/golang-nuts/F2X38c4JcKs/95nMGxjrDu0J. |
The Debian bug has not been updated, so we still don't know what is causing this. The workaround is: "go install -a -tags netgo net". That only needs to be done once in ordinary use; builds of other packages will see that net is up-to-date and use the pure Go version. Leaving for Go 1.3. Labels changed: added go1.3, removed go1.2maybe. |
According to https://sourceware.org/bugzilla/show_bug.cgi?id=15946 this will be fixed in the glibc 2.20 release. It was a one-line patch in glibc: https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blobdiff;f=resolv/res_send.c;h=af42b8aac216356a5466998df5c47c21357881d3;hp=3273d55ceb5eeb354aab61aae96224412a6ed308;hb=f9d2d03254a58d92635a311a42253eeed5a40a47;hpb=71840409ea45ab9e49d0ac70dfc1c355accf355f If I'm reading the code correctly, the Go package could avoid the problem by querying separately for AF_INET and AF_INET6 (by setting hints.ai_family). I think the problem only arises when the hints do not specify an address family, in which case the resolv library will send two simultaneous requests, one for AF_INET and one for AF_INET6. The problem arises when the response to the first request is received but the second request times out. In that case the second request is resent. However, between the first and second times the second request is sent, the network descriptor might have been closed. If that happened, the second request is resent on the wrong descriptor. If you're really unlucky, that descriptor was reopened as something else. I do not know if it's worth working around this in the Go code. I don't know how often this problem arises in practice. |
People seem to be living with this okay, so I am inclined not to make any changes for 1.4. If it comes up, we can tell people to update to glibc 2.20. A problem with setting hints and making the call twice is that we won't know the priority order that the merged call would have returned. Status changed to Unfortunate. |
This seems to still be an issue with 1.6.2 on OSX (10.11.5). I was able to resolve it by disabling CGO, as suggested above. |
@james-relyea This issue was closed long ago, and occurred only on glibc-based systems like GNU/Linux, never on OS X. If you are seeing a problem on OS X, then that is a different problem. Please open a new issue or see https://golang.org/wiki/Questions . |
by arnaud.lb: