Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Resolver errors on Windows with code WSATRY_AGAIN #55050

Closed
toothrot opened this issue Sep 13, 2022 · 18 comments
Closed

net: Resolver errors on Windows with code WSATRY_AGAIN #55050

toothrot opened this issue Sep 13, 2022 · 18 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Milestone

Comments

@toothrot
Copy link
Contributor

toothrot commented Sep 13, 2022

#!watchflakes
post <- goos == "windows" && `getaddrinfow: This is usually a temporary error` && date < "2024-01-01"
--- FAIL: TestScript (0.01s)
    --- FAIL: TestScript/mod_proxy_errors (0.07s)
        script_test.go:282: 
            # (2022-09-09T21:03:34Z)
            # Server responses should be truncated to some reasonable number of lines.
            # (For now, exactly eight.) (0.041s)
            > ! go list -m vcs-test.golang.org/auth/ormanylines@latest
            [stderr]
            go: vcs-test.golang.org/auth/ormanylines@latest: unrecognized import path "vcs-test.golang.org/auth/ormanylines": https fetch: Get "https://vcs-test.golang.org/auth/ormanylines?go-get=1": dial tcp: lookup vcs-test.golang.org: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
            [exit status 1]
            > stderr '\tserver response:\n(.|\n)*\tline 8\n\t\[Truncated: too many lines.\]$'
            FAIL: testdata\script\mod_proxy_errors.txt:10: no match for `(?m)\tserver response:\n(.|\n)*\tline 8\n\t\[Truncated: too many lines.\]$` found in stderr

2022-09-09T20:29:05-54182ff/windows-amd64-longtest

@bcmills
Copy link
Contributor

bcmills commented Sep 14, 2022

This failure occurred within a cmd/go test, but does not appear to originate in cmd/go.

@bcmills
Copy link
Contributor

bcmills commented Sep 14, 2022

From what I can tell, the getaddrinfow: prefix in the error message comes from here:
https://cs.opensource.google/go/go/+/master:src/net/lookup_windows.go;l=123;drc=f9c0264107a9a36832d70781fe100cff16917855

Per https://docs.microsoft.com/en-us/windows/win32/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexw, the error text corresponds to the error code WSATRY_AGAIN.

@bcmills bcmills changed the title cmd/go: TestScript/mod_proxy_errors fails on https fetch failure net: Resolver errors on Windows with code WSATRY_AGAIN Sep 14, 2022
@bcmills bcmills added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 14, 2022
@bcmills bcmills added this to the Backlog milestone Sep 14, 2022
@bcmills
Copy link
Contributor

bcmills commented Sep 14, 2022

(attn @ianlancetaylor @neild; CC @golang/windows)

I wonder if the net.Resolver implementation should, y'know, “try again” when it receives the error WSATRY_AGAIN..? 😅

@bcmills
Copy link
Contributor

bcmills commented Sep 14, 2022

This is closely related to #52094 and #52108 (CC @rolandshoemaker).

@beoran
Copy link

beoran commented Sep 16, 2022

A casual look in the source code shows that on Unix platforms EAGAIN is used often (and rightly so)

if err == syscall.EAGAIN && fd.pd.pollable() {
.
It's probably a great idea to handle WSATRY_AGAIN similarly in all windows related code and system calls that might return that error.

@alexbrainman
Copy link
Member

I wonder if the net.Resolver implementation should, y'know, “try again” when it receives the error WSATRY_AGAIN..? 😅

How do you propose we change net.Resolver implementation to try again ?

WSATRY_AGAIN indicates temporary failure in name resolution. Do you suggest we try after 1 second or something? But 1 second might not be long enough.

We can also report temporary failure in name resolution status as part of returned error message. But I am not sure how useful that information even for the net.Resolver user.

go/src/internal/poll/fd_unix.go

This code retries reads from TCP connections. I don't see how that code is relevant to GetAddrInfoExW WSATRY_AGAIN.

Alex

@beoran
Copy link

beoran commented Sep 18, 2022

https://learn.microsoft.com/en-us/search/?terms=WSATRY_AGAIN: My meaning is that this error code is returned by several windows API functions, and should be handled for all of them that Go uses.

@qmuntal
Copy link
Contributor

qmuntal commented Sep 23, 2022

Adding a data point: dotnet/runtime retries once without backoff when calling GetAddrInfoExW and getting WSATRY_AGAIN back (code), but this retry attempt is more related to limitations on GetAddrInfoExW which don't apply to us, because we are using GetAddrInfoW.

See this issue for more context: dotnet/runtime#29935, and specially this comment: dotnet/runtime#29935 (comment).

@bcmills
Copy link
Contributor

bcmills commented Sep 23, 2022

@alexbrainman

How do you propose we change net.Resolver implementation to try again ?

WSATRY_AGAIN indicates temporary failure in name resolution. Do you suggest we try after 1 second or something? But 1 second might not be long enough.

The usual solution I would reach for is to retry with exponential backoff.

It appears that all of the Resolver methods accept context.Context arguments, so one option might be to continue to retry until one of:

  • the passed-in Context is done
  • GetAddrInfoW succeeds
  • GetAddrInfoW returns an error code other than WSATRY_AGAIN.

But I'm curious what the Unix implementation does in terms of default timeouts and retries. There are two configuration parameters here that seem relevant:
https://cs.opensource.google/go/go/+/master:src/net/dnsconfig.go;l=22-23;drc=d7df872267f9071e678732f9469824d629cac595

@bcmills
Copy link
Contributor

bcmills commented Sep 23, 2022

It looks like our default on Unix is 5 seconds and 2 attempts, unless the system's resolv.conf states otherwise:
https://cs.opensource.google/go/go/+/master:src/net/dnsconfig_unix.go;l=20-21;drc=d7df872267f9071e678732f9469824d629cac595

So that might be a good starting point, at least. (Ideally the defaults should be factored out to be platform-independent!)

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- goos == "windows" && `getaddrinfow: This is usually a temporary error`
2022-09-09 20:29 windows-amd64-longtest go@54182ff5 cmd/go.TestScript (log)
go test proxy running at GOPROXY=http://127.0.0.1:54144/mod
--- FAIL: TestScript (0.01s)
    --- FAIL: TestScript/mod_proxy_errors (0.07s)
        script_test.go:282: 
            # (2022-09-09T21:03:34Z)
            # Server responses should be truncated to some reasonable number of lines.
            # (For now, exactly eight.) (0.041s)
            > ! go list -m vcs-test.golang.org/auth/ormanylines@latest
            [stderr]
            go: vcs-test.golang.org/auth/ormanylines@latest: unrecognized import path "vcs-test.golang.org/auth/ormanylines": https fetch: Get "https://vcs-test.golang.org/auth/ormanylines?go-get=1": dial tcp: lookup vcs-test.golang.org: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
            [exit status 1]
            > stderr '\tserver response:\n(.|\n)*\tline 8\n\t\[Truncated: too many lines.\]$'
            FAIL: testdata\script\mod_proxy_errors.txt:10: no match for `(?m)\tserver response:\n(.|\n)*\tline 8\n\t\[Truncated: too many lines.\]$` found in stderr

watchflakes

@alexbrainman
Copy link
Member

@bcmills

It appears that all of the Resolver methods accept context.Context arguments, so one option might be to continue to retry until one of:

  • the passed-in Context is done
  • GetAddrInfoW succeeds
  • GetAddrInfoW returns an error code other than WSATRY_AGAIN.

Sounds reasonable to me. But I don't have time to implement that.

It looks like our default on Unix is 5 seconds and 2 attempts,

We can do 2 attempts. But I don't see how we can implement 5 seconds timeout by using just getaddrinfow. Perhaps attempts count should be the only possible DNS resolver config parameter on Windows.

Alex

@bcmills
Copy link
Contributor

bcmills commented Nov 9, 2022

I don't see how we can implement 5 seconds timeout by using just getaddrinfow.

I think the idea is to stop initiating retries after 5 seconds, not to cause each individual retry to time out at that boundary.

@bcmills
Copy link
Contributor

bcmills commented Nov 9, 2022

This issue may also explain the strong Windows bias in the test failures on #38111 and #27992.

@alexbrainman
Copy link
Member

I think the idea is to stop initiating retries after 5 seconds, not to cause each individual retry to time out at that boundary

That sounds reasonable. Thanks for explaining.

Alex

@gopherbot
Copy link

Change https://go.dev/cl/462051 mentions this issue: net: fix issue #55050

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 15, 2023
@dmitshur dmitshur modified the milestones: Backlog, Go1.21 Feb 15, 2023
johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 22, 2023
GetAddrInfoW is retried now multiple times as per the timeout and number of retries defined in func dnsReadConfig (before it was called only once)

Fixes golang#55050

Change-Id: If5369ebb164d98557a802de938756dbf9c125773
Reviewed-on: https://go-review.googlesource.com/c/go/+/462051
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Quim Muntal <quimmuntal@gmail.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
@gopherbot gopherbot reopened this Jan 25, 2024
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- goos == "windows" && `getaddrinfow: This is usually a temporary error`
2024-01-09 18:15 go1.20-windows-amd64-longtest release-branch.go1.20@a95136a8 net.TestLookupLocalPTR (log)
=== RUN   TestLookupLocalPTR
    lookup_windows_test.go:148: failed to get local ip: dial udp: lookup golang.org: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
    lookup_windows_test.go:152: failed <nil>: lookup <nil>: unrecognized address
    lookup_windows_test.go:155: no results
    lookup_windows_test.go:159: skipping failed lookup <nil> test: exit status 1: Ping request could not find host <nil>. Please check the name and try again.
--- FAIL: TestLookupLocalPTR (0.06s)
2024-01-09 18:17 go1.21-windows-amd64-race release-branch.go1.21@cc85462b net.TestLookupContextCancel (log)
=== RUN   TestLookupContextCancel
    lookup_test.go:917: starting concurrent LookupIPAddr
    lookup_test.go:940: testHookLookupIP performing lookup
    lookup_test.go:923: lookup google.com: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
--- FAIL: TestLookupContextCancel (0.01s)
2024-01-09 18:17 go1.21-windows-amd64-race release-branch.go1.21@cc85462b net.TestLookupGoogleHost (log)
=== RUN   TestLookupGoogleHost
    lookup_test.go:408: lookup google.com: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
--- FAIL: TestLookupGoogleHost (0.03s)
2024-01-09 18:17 go1.21-windows-amd64-race release-branch.go1.21@cc85462b net.TestLookupGoogleIP (log)
=== RUN   TestLookupGoogleIP
    lookup_test.go:461: lookup google.com: getaddrinfow: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server.
--- FAIL: TestLookupGoogleIP (0.00s)

watchflakes

@bcmills bcmills closed this as completed Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Projects
Status: Done
Development

No branches or pull requests

7 participants