Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialTimeout/1m/5s failures on Windows LUCI builders #62377

Closed
mknyszek opened this issue Aug 30, 2023 · 7 comments
Closed

net: TestDialTimeout/1m/5s failures on Windows LUCI builders #62377

mknyszek opened this issue Aug 30, 2023 · 7 comments
Assignees
Labels
NeedsFix The path to resolution is known, but the work has not been done. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@mknyszek
Copy link
Contributor

mknyszek commented Aug 30, 2023

=== RUN   TestDialTimeout/1ms/5s
    timeout_test.go:121: Dial returned after 0s; want ≥1ms
    timeout_test.go:129: Dial: dial tcp 127.0.0.1:50828: connectex: No connection could be made because the target machine actively refused it., want timeout
--- FAIL: TestDialTimeout/1ms/5s (0.01s)
@mknyszek
Copy link
Contributor Author

Ah, this seems related to a9d4b2d? CC @bcmills

There are also failures on the old infrastructure, but curiously only on windows-arm64-11. The reason it's showing up on LUCI on 386 and amd64 and not the old builders may be due to a discrepancy in Windows versions? I'll double-check that.

@mknyszek
Copy link
Contributor Author

Just kidding, looks like you landed a fix already. Duplicate of #62359. I'll leave this open and see if the issue still shows up on LUCI.

Fully one step behind this afternoon. 😅

@bcmills
Copy link
Contributor

bcmills commented Aug 30, 2023

What I landed earlier was an arm64-specific skip, not a proper fix. 🙃

Good to know that this isn't architecture-specific, at least.

@bcmills bcmills self-assigned this Aug 30, 2023
@bcmills bcmills added the NeedsFix The path to resolution is known, but the work has not been done. label Aug 30, 2023
@bcmills bcmills added this to the Go1.22 milestone Aug 30, 2023
@gopherbot
Copy link

Change https://go.dev/cl/524595 mentions this issue: net: retry TestDialTimeout subtests with progressively shorter timeouts

gopherbot pushed a commit that referenced this issue Aug 30, 2023
The LUCI builders seem to show that the failure mode in #62359 is not
specific to windows/arm64, but it occurs to me that we ought to be
able to eventually retry by making the timeout so short that the
remote end can't possibly respond in time (discounting the possibility
that the kernel itself might short-circuit the loopback address).

For #62377.
Updates #62359.
Updates #56876.

Change-Id: I1fb5fa4f2a5d2cfe35465f34248ed9a035f91f4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/524595
Reviewed-by: Damien Neil <dneil@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@gopherbot
Copy link

Change https://go.dev/cl/524936 mentions this issue: net: deflake TestDialTimeout on windows

gopherbot pushed a commit that referenced this issue Sep 1, 2023
The time granularity on windows is large enough that setting even an
implausibly small timeout still gives ConnectEx enough time to succeed
before the timeout expires. That causes TestDialTimeout to sometimes
flake, because it expects to be able to provoke a timeout using some
nonzero duration.

This change takes a two-pronged approach to address the problem:

1. We can set a deadline on the FD more aggressively. (If the Context
has already expired, or the deadline is already known, we can go ahead
and set it on the fd without waiting for a background goroutine to get
around to it.)

2. We can reintroduce a test hook to ensure that Dial takes a
measurable amount of time before it completes, so that setting an
implausibly short deadline sets that deadline in the past instead of
the future.

Together, these reduce the flake rate on a windows-amd64-longtest
gomote from around 1-in-10 to less than 1-in-2000.

For #62377.

Change-Id: I03975c32f61fffa9f6f84efb3c474a01ac5a0d1e
Reviewed-on: https://go-review.googlesource.com/c/go/+/524936
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@bcmills
Copy link
Contributor

bcmills commented Sep 5, 2023

I believe this is fixed by https://go.dev/cl/524936. @mknyszek, can you confirm?

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 5, 2023
@mknyszek
Copy link
Contributor Author

mknyszek commented Sep 5, 2023

I don't see any new failures on LUCI: https://ci.chromium.org/ui/test/golang/net.TestDialTimeout%2F1ms%2F5s.

I think we're good here, thanks Bryan!

@mknyszek mknyszek closed this as completed Sep 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

3 participants