net: parallel A/AAAA queries are mishandled by broken DNS servers #35521

Closed
albrro opened this issue Nov 12, 2019 · 21 comments
Labels: FrozenDueToAge, NeedsInvestigation, WaitingForInfo
Milestone: Unplanned

Comments

@albrro

albrro commented Nov 12, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13.3 linux/amd64

This is the latest available from the Fedora repository. Everything is "standard issue".

Does this issue reproduce with the latest release?

unknown

What operating system and processor architecture are you using (go env)?

Fedora 31 Workstation, fresh install

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/a/.cache/go-build"
GOENV="/home/a/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/a/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build111605125=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Fedora Toolbox (a container management program) failed to download images, complaining that it was unable to get DNS responses. Searching, I found issue 1948 on the GitHub project page for libpod (a tool that Toolbox uses), and there I found a simple program that was suggested for testing. That issue is closed, and the bug was difficult to reproduce.

The program in its entirety:

package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

func main() {
	resp, err := http.Get("https://registry.fedoraproject.org/v2/")
	fmt.Printf("Error: %v\n", err)
	if err == nil {
		defer resp.Body.Close()
		// Dump status line, headers and body (true = include the body).
		dr, _ := httputil.DumpResponse(resp, true)
		fmt.Printf("Body:\n===\n%s\n===\n", string(dr))
	}
}

This small test program fails with the same error, so I thought this would be a better place to report it.

Name resolution works fine on the laptop otherwise. Note that normal DNS query and response traffic is generated, yet it never reaches the program.

What did you expect to see?

Some HTTP data.

What did you see instead?

$ go run /tmp/foo.go
Error: Get https://registry.fedoraproject.org/v2/: dial tcp: lookup registry.fedoraproject.org on 192.168.0.1:53: read udp 192.168.0.3:37044->192.168.0.1:53: i/o timeout

@albrro
Author

albrro commented Nov 12, 2019

Fedora made Go 1.13.4 available. The error is still present.

albrro changed the title from "x/net/http2:" to "x/net/http2: unable to read DNS responses" Nov 12, 2019
@andybons
Member

@bradfitz @tombergan

andybons added the NeedsInvestigation label Nov 12, 2019
andybons added this to the Unplanned milestone Nov 12, 2019
bradfitz changed the title from "x/net/http2: unable to read DNS responses" to "net: DNS failure" Nov 12, 2019
@bradfitz
Contributor

bradfitz commented Nov 12, 2019

This error message is coming from net, not http2. Retitled.

@albrro, can you provide some evidence showing that your DNS actually works? From this bug report as it stands, I'm tempted to just say "your DNS is broken".

What does dig @192.168.0.1 registry.fedoraproject.org say?

Also, try running with GODEBUG=netdns=2, like:

$ GODEBUG=netdns=2 go run /tmp/foo.go
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(registry.fedoraproject.org) = files,dns
Error: <nil>
Body:
===
HTTP/2.0 200 OK
Content-Length: 2
Accept-Ranges: bytes
Age: 76
Apptime: D=3769
Content-Type: application/json; charset=utf-8
Date: Tue, 12 Nov 2019 17:58:45 GMT
Docker-Distribution-Api-Version: registry/2.0
Referrer-Policy: same-origin
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept
Via: 1.1 varnish (Varnish/6.0)
X-Content-Type-Options: nosniff
X-Fedora-Proxyserver: proxy06.fedoraproject.org
X-Fedora-Requestid: XcrzIvYZywcI1VhORDNzewAAAAE
X-Frame-Options: SAMEORIGIN
X-Varnish: 1261957 1261849
X-Xss-Protection: 1; mode=block
...

The contents of /etc/resolv.conf and /etc/nsswitch.conf would also be interesting.

bradfitz added the WaitingForInfo label Nov 12, 2019
@albrro
Author

albrro commented Nov 12, 2019

Apart from the go programs on my laptop, nothing else has any DNS problems. I will upload some packet captures shortly.

$ dig @192.168.01 registry.fedoraproject.org

; <<>> DiG 9.11.11-RedHat-9.11.11-1.fc31 <<>> @192.168.01 registry.fedoraproject.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19256
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 11, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;registry.fedoraproject.org. IN A

;; ANSWER SECTION:
registry.fedoraproject.org. 188 IN CNAME wildcard.fedoraproject.org.
wildcard.fedoraproject.org. 9 IN A 152.19.134.142
wildcard.fedoraproject.org. 9 IN A 152.19.134.198
wildcard.fedoraproject.org. 9 IN A 209.132.181.15
wildcard.fedoraproject.org. 9 IN A 209.132.181.16
wildcard.fedoraproject.org. 9 IN A 209.132.190.2
wildcard.fedoraproject.org. 9 IN A 8.43.85.67
wildcard.fedoraproject.org. 9 IN A 8.43.85.73
wildcard.fedoraproject.org. 9 IN A 67.219.144.68
wildcard.fedoraproject.org. 9 IN A 140.211.169.196
wildcard.fedoraproject.org. 9 IN A 140.211.169.206

;; Query time: 91 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Tue Nov 12 12:21:22 CST 2019
;; MSG SIZE rcvd: 238

$ GODEBUG=netdns=2 go run /tmp/foo.go
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(registry.fedoraproject.org) = files,dns
Error: <nil>
Body:
===
HTTP/2.0 200 OK
Content-Length: 2
Accept-Ranges: bytes
Age: 49
Apptime: D=597
ns.txt

Content-Type: application/json; charset=utf-8
Date: Tue, 12 Nov 2019 18:21:58 GMT
Docker-Distribution-Api-Version: registry/2.0
Referrer-Policy: same-origin
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept
Via: 1.1 varnish (Varnish/6.0)
X-Content-Type-Options: nosniff
X-Fedora-Proxyserver: proxy12.fedoraproject.org
X-Fedora-Requestid: Xcr4eCOgKkFYqFuwpzo@GQAAAAE
X-Frame-Options: SAMEORIGIN
X-Varnish: 2643278 2159674
X-Xss-Protection: 1; mode=block

{}
===

$ go run /tmp/foo.go
Error: Get https://registry.fedoraproject.org/v2/: dial tcp: lookup registry.fedoraproject.org on 192.168.0.1:53: read udp 192.168.0.3:44807->192.168.0.1:53: i/o timeout

@albrro
Author

albrro commented Nov 12, 2019

ns.txt

Ugh. I inserted the file containing the name service info in the middle of the previous comment.

@bradfitz
Contributor

GODEBUG=netdns=2 should only add logging and not influence the results.

Please run it a few times to see if it's just flaky and sometimes able to actually hit your DNS.

Also try GODEBUG=netdns=cgo vs GODEBUG=netdns=go to force the libc vs Go resolvers, respectively.

@albrro
Copy link
Author

albrro commented Nov 12, 2019

Done. The file go.test1.txt is the result of this script:

#!/usr/bin/bash

for i in {61..69}
do
  echo "GODEBUG=netdns=2 go run /tmp/foo.go ###################################"
  # Marker ping: the hex pad pattern $i tags this run in the packet capture.
  ping -p "$i" -c 1 one.one.one.one
  GODEBUG=netdns=2 go run /tmp/foo.go
done
echo "GODEBUG=netdns=2 go run /tmp/foo.go ###################################"

echo "GODEBUG=netdns=cgo go run /tmp/foo.go ###################################"
ping -p 6a -c 1 one.one.one.one
GODEBUG=netdns=cgo go run /tmp/foo.go

echo
echo "GODEBUG=netdns=go go run /tmp/foo.go ###################################"
ping -p 6b -c 1 one.one.one.one
GODEBUG=netdns=go go run /tmp/foo.go

The pings separate the invocations: each one is padded with a unique payload (via ping -p) so the runs can be told apart in the capture.

go.test1.txt

While the script was running I captured the packets. You can see ping resolving its target and the Go program also using DNS every time.
packets1.pcapng.txt

@bradfitz
Contributor

@mdempsky, you like DNS, right? :) Any idea why Go's resolver times out here, but cgo works, and dig works?

@mdempsky
Member

Looking.

@mdempsky
Member

@albrro To confirm, /tmp/run.go in your test is still the program from your original report? I.e., basically just http.Get("https://registry.fedoraproject.org/v2/") (as far as the network is concerned)?

@albrro
Author

albrro commented Nov 12, 2019

Same program, no changes.

Also, to make it clear, my system has ZERO problems with DNS ... except for the test program and other Go programs like Toolbox.

I thought it interesting that the invocations that time out still generated normal(ish?) DNS traffic.

@mdempsky
Member

mdempsky commented Nov 12, 2019

Whatever DNS software is running on 192.168.0.1 seems broken.

If you look in the pcap file, you'll see a bunch of lines like this:

11:37:12.704933 IP 192.168.0.3.58863 > 192.168.0.1.53: 31317+ AAAA? registry.fedoraproject.org. (44)
11:37:12.705137 IP 192.168.0.3.47056 > 192.168.0.1.53: 34616+ A? registry.fedoraproject.org. (44)
11:37:12.812334 IP 192.168.0.1.53 > 192.168.0.3.47056: 34616 5/0/0 CNAME wildcard.fedoraproject.org., AAAA 2604:1580:fe00:0:dead:beef:cafe:fed1, AAAA 2605:bc80:3010:600:dead:beef:cafe:fed9, AAAA 2605:bc80:3010:600:dead:beef:cafe:feda, AAAA 2610:28:3090:3001:dead:beef:cafe:fed3 (179)

This corresponds to one of the invocations where the Go DNS client is trying to resolve registry.fedoraproject.org. It's sending two concurrent DNS requests: one for AAAA records from port 58863 with TXID 31317, and one for A records from port 47056 with TXID 34616.

But what's happening here is the AAAA record response (i.e., for the first question) is being sent to port 47056 with TXID 34616 (i.e., for the second question). This is very wrong, and possibly a security issue.
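A stub resolver should only accept a reply whose ID matches the query and whose question section repeats the question it asked. Go's resolver performs a comparable check (in net/dnsclient_unix.go) and, as we understand it, keeps reading until a matching reply arrives or the deadline passes, which would explain why the misdirected AAAA answer surfaces as an i/o timeout. The following is not Go's code: it is a self-contained sketch using encoding/binary, with hypothetical helper names, and it assumes the misdirected packet's question section says AAAA (tcpdump prints only the answers).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

// Record types and class used below (RFC 1035, RFC 3596).
const (
	typeA    uint16 = 1
	typeAAAA uint16 = 28
	classIN  uint16 = 1
)

// question is the single question of a stub-resolver query.
type question struct {
	name  string // fully qualified, e.g. "registry.fedoraproject.org."
	qtype uint16
	class uint16
}

// buildMessage encodes a header plus one question. With isResponse set it
// stands in for a reply whose answer records are omitted.
func buildMessage(id uint16, q question, isResponse bool) []byte {
	msg := make([]byte, 12, 64)
	binary.BigEndian.PutUint16(msg[0:], id)
	flags := uint16(0x0100) // RD: recursion desired
	if isResponse {
		flags |= 0x8000 // QR: this is a response
	}
	binary.BigEndian.PutUint16(msg[2:], flags)
	binary.BigEndian.PutUint16(msg[4:], 1) // QDCOUNT
	if name := strings.TrimSuffix(q.name, "."); name != "" {
		for _, label := range strings.Split(name, ".") {
			msg = append(msg, byte(len(label)))
			msg = append(msg, label...)
		}
	}
	msg = append(msg, 0) // zero-length root label ends the name
	return append(msg, byte(q.qtype>>8), byte(q.qtype), byte(q.class>>8), byte(q.class))
}

// parseHeaderAndQuestion decodes the ID, the QR bit and the only question.
// Compression pointers are rejected rather than followed.
func parseHeaderAndQuestion(msg []byte) (id uint16, isResponse bool, q question, err error) {
	errShort := errors.New("truncated message")
	if len(msg) < 12 {
		return 0, false, q, errShort
	}
	id = binary.BigEndian.Uint16(msg[0:])
	isResponse = binary.BigEndian.Uint16(msg[2:])&0x8000 != 0
	if binary.BigEndian.Uint16(msg[4:]) != 1 {
		return 0, false, q, errors.New("expected exactly one question")
	}
	off, labels := 12, []string{}
	for {
		if off >= len(msg) {
			return 0, false, q, errShort
		}
		n := int(msg[off])
		off++
		if n == 0 {
			break
		}
		if n > 63 || off+n > len(msg) {
			return 0, false, q, errors.New("bad or compressed label")
		}
		labels = append(labels, string(msg[off:off+n]))
		off += n
	}
	if off+4 > len(msg) {
		return 0, false, q, errShort
	}
	q.name = strings.Join(labels, ".") + "."
	q.qtype = binary.BigEndian.Uint16(msg[off:])
	q.class = binary.BigEndian.Uint16(msg[off+2:])
	return id, isResponse, q, nil
}

// acceptResponse accepts resp only if it is a response carrying the query's
// ID and the same question (name compared case-insensitively).
func acceptResponse(queryID uint16, sent question, resp []byte) bool {
	id, isResp, got, err := parseHeaderAndQuestion(resp)
	if err != nil || !isResp || id != queryID {
		return false
	}
	return got.qtype == sent.qtype && got.class == sent.class &&
		strings.EqualFold(got.name, sent.name)
}

func main() {
	const name = "registry.fedoraproject.org."
	aQ := question{name, typeA, classIN}
	aaaaQ := question{name, typeAAAA, classIN}

	good := buildMessage(34616, aQ, true)       // proper reply to the A query
	swapped := buildMessage(34616, aaaaQ, true) // AAAA reply sent with the A query's ID
	fmt.Println("proper reply accepted: ", acceptResponse(34616, aQ, good))
	fmt.Println("swapped reply accepted:", acceptResponse(34616, aQ, swapped))
}
```

The ID alone cannot catch this server's mistake, since the misdirected packet carries the A query's TXID; only the question-section comparison does.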

The reason using the cgo resolver works here is that when the parallel queries fail, it retries them sequentially:

11:37:39.213314 IP 192.168.0.3.38573 > 192.168.0.1.53: 38512+ A? registry.fedoraproject.org. (44)
11:37:39.213368 IP 192.168.0.3.38573 > 192.168.0.1.53: 43877+ AAAA? registry.fedoraproject.org. (44)
11:37:39.312438 IP 192.168.0.1.53 > 192.168.0.3.38573: 43877 11/0/0 CNAME wildcard.fedoraproject.org., A 152.19.134.142, A 152.19.134.198, A 209.132.181.15, A 209.132.181.16, A 209.132.190.2, A 8.43.85.67, A 8.43.85.73, A 67.219.144.68, A 140.211.169.196, A 140.211.169.206 (227)

11:37:44.217897 IP 192.168.0.3.38573 > 192.168.0.1.53: 38512+ A? registry.fedoraproject.org. (44)
11:37:44.312194 IP 192.168.0.1.53 > 192.168.0.3.38573: 38512 11/0/0 CNAME wildcard.fedoraproject.org., A 152.19.134.142, A 152.19.134.198, A 209.132.181.15, A 209.132.181.16, A 209.132.190.2, A 8.43.85.67, A 8.43.85.73, A 67.219.144.68, A 140.211.169.196, A 140.211.169.206 (227)

11:37:44.312460 IP 192.168.0.3.38573 > 192.168.0.1.53: 43877+ AAAA? registry.fedoraproject.org. (44)
11:37:44.512137 IP 192.168.0.1.53 > 192.168.0.3.38573: 43877 5/0/0 CNAME wildcard.fedoraproject.org., AAAA 2604:1580:fe00:0:dead:beef:cafe:fed1, AAAA 2605:bc80:3010:600:dead:beef:cafe:fed9, AAAA 2605:bc80:3010:600:dead:beef:cafe:feda, AAAA 2610:28:3090:3001:dead:beef:cafe:fed3 (179)

@mdempsky
Member

If you add "options single-request" to your /etc/resolv.conf file, it should work.

But really, the DNS server is broken and should be fixed.

mdempsky changed the title from "net: DNS failure" to "net: parallel A/AAAA queries are mishandled by broken DNS servers" Nov 12, 2019
@albrro
Author

albrro commented Nov 12, 2019

Well, I have a no-name router in what it calls "bridge mode" to jump the gap all the way to where my laptop is. It is buggy. What if I make /etc/resolv.conf go directly to 1.1.1.1?
Same test script output:
go.test2.txt

Looks like the issue has a workaround (or three).
FWIW, here are the packets captured:
packets2.pcapng.txt

So it seems that the libc resolver, like you say, is more resilient to buggy servers.

Thanks!

@albrro
Author

albrro commented Nov 12, 2019

Tried again, making my "real" router (192.168.1.1) the default in resolv.conf, and there is no problem. As long as I avoid relying on the buggy router, all is well.

@mdempsky
Member

Glad to hear that both 1.1.1.1 and 192.168.1.1 work okay.

Would you mind testing my hypothesis that "options single-request" should fix it too when using the problematic router?

@albrro
Author

albrro commented Nov 12, 2019

Yes, the single request option works. It also crashes my so-called "router". Its MAC address info is not even in the MAC vendors' DB.

It is made by Xinwei. Should I delete that?

@mdempsky
Member

Yes, the single request option works. It also crashes my so-called "router".

Can you elaborate what you mean by this? How does it work, but also crash?

@albrro
Author

albrro commented Nov 12, 2019

The option allows the Go test program to work. Meanwhile, perhaps because of what we've been doing, the router lost its connection to the main router, lost all of its IP configuration, and just sits there. This is not the first time it has happened, but I've never found what triggers it. Perhaps it runs out of memory?

@mdempsky
Member

I see. That seems odd, but consistent with the broken DNS server behavior.

It sounds like you have a few workarounds here, so I'm going to close the issue. Thanks for your help looking into it.

golang locked and limited conversation to collaborators Nov 11, 2020

5 participants