New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net: replicate DNS resolution behaviour of getaddrinfo(glibc) in the go dns resolver #18518
Comments
Sounds good to me. Matching glibc's behavior is our default goal, unless there's a strong reason otherwise. @mdempsky, thoughts? @rlhh, feel free to send a change. (See https://golang.org/doc/contribute.html) It's too late for Go 1.8, though, since we're about to release it. |
I agree we should prefer to match glibc behavior, but looking at glibc's source code I believe we're already doing that. The code in check_pf.c that you linked to doesn't actually implement the RFC 3484 sorting logic. That's handled by the qsort/rfc3484_sort calls at https://github.molgen.mpg.de/git-mirror/glibc/blob/master/sysdeps/posix/getaddrinfo.c#L2629, which appears to handle unconditionally as long as naddrs>1. Can you provide more evidence of what you think is going wrong? E.g., output from net.LookupIP under both GODEBUG=netdns=go and GODEBUG=netdns=cgo? |
Sure, I'm happy to provide as much information as possible. The check_pf.c code does not implement rfc3484 directly but the results of it affects the outcome of getaddrinfo. Walking through getaddrinfoIn getaddrinfo.c, check_pf can be called in either And since by default, AI_ADDRCONFIG is not set (I am not 100% sure on this), we'll look at the second scenario. In this scenario, if no IPv6 address is found in check_pf, Since All this will in turn affect the netmask that is used to do Rule 9 Comparison in rfc3484_sort. https://github.molgen.mpg.de/git-mirror/glibc/blob/master/sysdeps/posix/getaddrinfo.c#L1732 Example results from different commands
Sorry for the long reply, I felt that walking through the code and showing the results of the commands would be the most helpful. |
Thanks for elaborating. That was helpful. Tracking back through glibc commit history, it looks like that behavior was introduced in https://github.molgen.mpg.de/git-mirror/glibc/commit/df18fdb93027b2e18919707d54556f8bb5f4694b, which was reportedly to fix glibc bug 4599: https://sourceware.org/bugzilla/show_bug.cgi?id=4599 There's nothing to suggest to me that the change in behavior was intentional though. The commit/bug was simply that AI_ADDRCONFIG was meant to ignore loopback IPs, but that commit suddenly disabled RFC 3484 when there are no non-loopback IPv6 addresses. I somewhat wonder if it was a mistaken optimization. We can certainly mimic glibc's behavior here, but it seems very arbitrary to me. I don't see any obvious reason why the presence of IPv6 addresses on a machine should affect sorting of IPv4 addresses. I'm more inclined to say we should just not apply Rule 9 to IPv4 addresses at all.
We do not set AI_ADDRCONFIG: https://golang.org/src/net/cgo_linux.go |
@mdempsky Yeah, I agree that it doesn't look like the change was intentional. However it might also be because of that unintentional change that DNS round robin is working properly for a lot of people. Personally, I am happy to disable Rule 9 for IPv4 addresses but I am worried that having a different behaviour from glibc will once again confuse people in the future. |
Yeah, understood. Here's my thought process: Currently for simplicity, we're not precisely matching glibc behavior anyway: glibc only applies CommonPrefixLen to IPv4 src/dst pairs that are on the same local subnet, but we instead check for the same special IPv4 block (i.e., 192.168/16, 10/8, ...). This does the "wrong" thing when src/dst are on the same public IPv4 subnet (e.g., 8.8.8/24), or on different subnets within a special block (e.g., 192.168.0/24 and 192.168.1/24). RFC 6724 updates RFC 3484 to clarify that CommonPrefixLen should only consider the "prefix," not the "interface ID." These terms don't appear explicitly defined for IPv4-mapped addresses, but my best interpretation is it means within 192.168.1/24, we should consider 192.168.1.2, 192.168.1.3, and 192.168.1.4 as equally preferable. But currently glibc and addrselect.go treat .2 and .3 as closer than .4 (which is silly, since they're all on the same local network). If we don't fix same-subnet detection, the best we can do is to detect when the src/dst addresses are both on the same IPv4 special block and give them preference... but that seems very similar to how RFC 3484 used to classify RFC 1918 subnets as "site-local" scope, which RFC 6724 says broke things and is why they're not classified as "global" scope instead. (Admittedly, they explain it's because of 6to4 tunneling, which I have no idea whether is still in use.) So I'm currently strongly in favor of just disabling Rule 9 for IPv4 addresses, but willing to reconsider given sufficiently strong evidence for glibc's odd behavior. /cc @pmarks-net |
I don't have any particular fondness for Rule 9, but if we disable it for IPv4, then won't any code benefiting from such a change run into the same problem when it migrates to IPv6? |
If we can make it obvious that the go resolver intentionally deviates from glibc and we are willing to deviate from glibc, I would also much prefer that we disable Rule 9 for IPv4 addresses completely. As it stands right now, @pmarks-net From my understanding of RFC6724, the common prefix len of Rule 9 is suppose to be applied to only the first 64 bit of the IPv6 address (routing prefix and subnet id) and then the selection order should be whatever that is returned by the DNS. (https://tools.ietf.org/html/rfc6724 page 15 - Rule 9 and page 29 - common prefixlen). |
@pmarks-net Agreed it's possible IPv6 networks will have similar issues and we may eventually need to disable Rule 9 completely. But that doesn't appear to be the case today. I think the asymmetry is justified in that IPv4 predates Rule 9, so IPv4 network addressing has not been designed to reflect physical colocality. Also, due to address space constraints, it seems unlikely they ever will. I'm not informed well enough to gauge whether IPv6 is actually doing a better job here, but at least IPv6 network operators are hopefully aware of Rule 9 and can still design their networks accordingly. Hopefully if it ends up not working, RFC 6724 will be revised again. |
CL https://golang.org/cl/34914 mentions this issue. |
What version of Go are you using (
go version
)?go version go1.7.1 darwin/amd64
What operating system and processor architecture are you using (
go env
)?GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/gocode"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build091809800=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
Explanation
This is related to #13283
All our servers are hosted in AWS in the 10.0.0.0/8 CIDR block and because of that, Rule 9 (precisely https://github.com/golang/go/blob/master/src/net/addrselect.go#L210) of the go dns resolver always kicks in for us and causes an overwhelming number of requests to be routed to a single load balancer node.
Out of our many attempts to understand and solve this issue, we tried disabling ipv6 and falling back to cgo (because we had a better grasp of
getaddrinfo
than the go resolver) and that fixed the dns round robin issue for us. However, that forces us to use cgo to do dns resolution and it is not ideal.Upon further investigation of why it works for
getaddrinfo
and not the go resolver, we came across https://github.molgen.mpg.de/git-mirror/glibc/blob/master/sysdeps/unix/sysv/linux/check_pf.c#L266 in glibc. That block of code causes 0 source addresses to be returned if no ipv6 address is found. This in turns allowsgetaddrinfo
to fallback to the DNS's original behaviour.https://play.golang.org/p/uaLPvjjEqa
Reference:
https://www.ietf.org/rfc/rfc3484.txt
https://www.ietf.org/rfc/rfc6724.txt
Proposal:
To keep the behaviour consistent across the go dns resolver and the glibc dns resolver, I propose that we do the same filtering that is done in the
check_pf
function in glibc for go. When no IPv6 address is found for the outgoing interfaces, we should put in a default address of 0.0.0.0 to indirectly bypass the sorting algorithm of RFC3484/RFC6724.Happy to submit a PR to put this in too.
The text was updated successfully, but these errors were encountered: