Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: make DNS lookup timeouts return net.DNSError instead of poll.TimeoutError #39178

Closed
mhale opened this issue May 20, 2020 · 10 comments
Closed

Comments

@mhale
Copy link

mhale commented May 20, 2020

I get DNS lookup timeouts many times a day when running Go HTTP clients on Kubernetes, where UDP packets are sometimes dropped due to a race condition in the Linux kernel. I'd like to propose a small standard library change to help debug these scenarios.

What version of Go are you using (go version)?

$ go version
go version go1.14.3 linux/amd64

Does this issue reproduce with the latest release?

Yes. This occurs in Go 1.14.[1-3] and version devel +c88f6989e1 Tue May 19 04:10:43 2020 +0000 linux/amd64.

What operating system and processor architecture are you using (go env)?

It seems to apply to all platforms. I'm experiencing it most often on Kubernetes on Linux but the executable is compiled on a regular Linux server.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/mhale/.cache/go-build"
GOENV="/home/mhale/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/mhale/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build197760953=/tmp/go-build -gno-record-gcc-switches"

What did you do?

For any http.Client use with a timeout set on the transport, when the timeout is exceeded and an error is returned it can be challenging to discover the cause of the timeout from the error message. Is it DNS related or TCP related or something else? Consider the following simulation of a DNS lookup that does not return within the timeout.

DNS lookup timeout simulator code
package main
  
import (
    "context"
    "net"
    "net/http"
    "time"
)

var timeout = 50 * time.Millisecond
var host = "http://example.com/"

// delayedDialer introduces a delay when performing DNS lookups.
func delayedDialer(ctx context.Context, network, address string) (net.Conn, error) {
    time.Sleep(2 * timeout)
    d := net.Dialer{}
    return d.DialContext(ctx, network, address)
}

func main() {
    r := &net.Resolver{
        PreferGo: true,
        Dial:     delayedDialer,
    }
    tr := &http.Transport{
        DialContext: (&net.Dialer{
            Timeout:  timeout,
            Resolver: r,
        }).DialContext,
    }
    client := &http.Client{Transport: tr}
    _, err := client.Get(host)
    if err != nil {
        panic(err)
    }
}

In the event of a DNS lookup timeout, the output is:

panic: Get "http://example.com/": dial tcp: i/o timeout

In the event of a TCP connection timeout (comment out the time.Sleep() call) the output is:

panic: Get "http://example.com/": dial tcp 93.184.216.34:80: i/o timeout

The missing IP address in the DNS lookup timeout output is the only clue that the root cause is a DNS problem, rather than a TCP problem. If users do not expect to see an IP address in that error message, they won't know it is missing, and their debugging and testing effort may be wasted investigating possible TCP-related causes, instead of investigating DNS (which normally runs over UDP).

I think the error message could communicate the root cause to users more clearly.

What did you expect to see?

panic: Get "http://example.com/": dial tcp: lookup example.com: i/o timeout

What did you see instead?

panic: Get "http://example.com/": dial tcp: i/o timeout

Proposed solution

After a DNS lookup timeout, the error returned by client.Get() looks like this in Go 1.14.3:

&url.Error{
  Op:  "Get",
  URL: "http://example.com/",
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &poll.TimeoutError{},
  },
}

Note: The &poll.TimeoutError{} will become &net.timeoutError{} in future releases after commit
d422f54.

The resolver in net/lookup.go is the source of the *poll.TimeoutError. Instead it could return a *net.DNSError error when a lookup times out. Doing so would provide additional error context for all use cases (not only http.Client) and may save debugging time. For example client.Get() could return:

&url.Error{
  Op:  "Get",
  URL: "http://example.com/",
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &net.DNSError{
      Err:         "i/o timeout",
      Name:        "example.com",
      Server:      "",
      IsTimeout:   true,
      IsTemporary: false,
      IsNotFound:  false,
    },
  },
}

This would preserve the existing "i/o timeout" string and the ability to call err.Timeout(), while adding "lookup example.com: " to the output and the capability to explicitly check for a *net.DNSError. The output from my simulator would also be what I'm expecting to see:

panic: Get "http://example.com/": dial tcp: lookup example.com: i/o timeout

The following change to implement this works in my usage and passes the tests in all.bash.

git diff Output
$ git diff
diff --git a/src/net/lookup.go b/src/net/lookup.go
index 5f7119872a..b1a22dd4e6 100644
--- a/src/net/lookup.go
+++ b/src/net/lookup.go
@@ -314,6 +314,9 @@ func (r *Resolver) lookupIPAddr(ctx context.Context, network, host string) ([]IP
                        }()
                }
                err := mapErr(ctx.Err())
+               if t, ok := err.(timeout); ok && t.Timeout() {
+                       err = &DNSError{Err: err.Error(), Name: host, IsTimeout: true}
+               }
                if trace != nil && trace.DNSDone != nil {
                        trace.DNSDone(nil, false, err)
                }

I can submit a pull request if anyone thinks that this is a good idea and implementation.

@odeke-em odeke-em changed the title net/http: DNS lookup timeouts could return net.DNSError instead of poll.TimeoutError proposal: net/http: DNS lookup timeouts could return net.DNSError instead of poll.TimeoutError May 23, 2020
@gopherbot gopherbot added this to the Proposal milestone May 23, 2020
@rsc rsc added this to Incoming in Proposals (old) Jun 10, 2020
@rsc rsc changed the title proposal: net/http: DNS lookup timeouts could return net.DNSError instead of poll.TimeoutError proposal: net/http: make DNS lookup timeouts return net.DNSError instead of poll.TimeoutError Aug 18, 2021
@rsc
Copy link
Contributor

rsc commented Aug 18, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Aug 18, 2021
@rsc
Copy link
Contributor

rsc commented Aug 25, 2021

/cc @neild and @ianlancetaylor for thoughts

@rsc
Copy link
Contributor

rsc commented Sep 8, 2021

It looks like if we change the error on that code path to be a net.DNSError, then we would lose what Go 1.15 established, that you can use errors.Is(err, os.ErrDeadlineExceeded) to check for the exceeded deadline. That seems like a mistake.

@mhale
Copy link
Author

mhale commented Sep 13, 2021

errors.Is(err, os.ErrDeadlineExceeded) currently returns false for net.timeoutError despite a deadline being exceeded. I'm not sure if any capability would be lost when DNS times out.

@ianlancetaylor
Copy link
Contributor

@mhale You're right. What we did in 1.15 was establish that errors.Is(err, os.ErrDeadlineExceeded) works for timeouts due to explicit calls to SetDeadline and friends. We did not do that for timeouts due to context expiration or DNS timeouts. Or, for that matter, TCP timeouts.

Making the proposed change seems reasonable to me.

@neild
Copy link
Contributor

neild commented Sep 13, 2021

Making this a net.DNSError seems reasonable to me. DNSError.Timeout should be false, however, since this error doesn't result from an explicit timeout set by SetDeadline or similar.

@mhale
Copy link
Author

mhale commented Sep 14, 2021

If DNSError.IsTimeout is false, then DNSError.Timeout() will return false despite the Err string being "i/o timeout" and the error resulting from a timeout being set on the Dialer. That might clash with expectations.

@rsc
Copy link
Contributor

rsc commented Sep 15, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals (old) Sep 15, 2021
@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) Sep 22, 2021
@rsc
Copy link
Contributor

rsc commented Sep 22, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: net/http: make DNS lookup timeouts return net.DNSError instead of poll.TimeoutError net/http: make DNS lookup timeouts return net.DNSError instead of poll.TimeoutError Sep 22, 2021
@rsc rsc modified the milestones: Proposal, Backlog Sep 22, 2021
@gopherbot
Copy link

Change https://golang.org/cl/353400 mentions this issue: net: consistently return DNSError on lookup failure

@golang golang locked and limited conversation to collaborators Oct 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

5 participants