Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialCancel is not compatible with new macOS ARM64 builders #52579

Closed
heschi opened this issue Apr 26, 2022 · 14 comments
Closed

net: TestDialCancel is not compatible with new macOS ARM64 builders #52579

heschi opened this issue Apr 26, 2022 · 14 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@heschi
Copy link
Contributor

heschi commented Apr 26, 2022

TestDialCancel requires that traffic to 198.18.0.254 and 2001:2::254 be blackholed. The comment on slowDialTCP notes that "In some environments, the slow IPs may be explicitly unreachable". The network we're using for the new macOS ARM64 builders is one such environment. I skipped the test in https://go.dev/cl/402181, but it might be nice for someone more knowledgeable about this than me to look and decide what should be done.

@heschi heschi added OS-Darwin Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 26, 2022
@heschi heschi added this to the Backlog milestone Apr 26, 2022
@bcmills
Copy link
Contributor

bcmills commented Apr 26, 2022

The comment on slowDialTCP notes that "In some environments, the slow IPs may be explicitly unreachable".

Hmm, I wonder if TestDialCancel should be installing slowDialTCP! It looks like the other tests that refer to slowDst4 and slowDst6 both do. (Those other tests are TestDialerFallbackDelay and TestDialParallel.)

@ianlancetaylor, @neild, @bradfitz: do you think it would make sense to install slowDialTCP for TestDialCancel?

@bcmills bcmills modified the milestones: Backlog, Go1.19 Apr 26, 2022
@neild
Copy link
Contributor

neild commented Apr 26, 2022

Looks to me like TestDialCancel (and any other test using the blackholed addresses) should install slowDialTCP. I'll send you a CL.

@gopherbot
Copy link

Change https://go.dev/cl/402454 mentions this issue: net: install slowDialTCP hook in TestDialCancel

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 27, 2022
@bcmills
Copy link
Contributor

bcmills commented May 4, 2022

@gopherbot, please backport to Go 1.17 and 1.18. This test is currently failing on the release branches too.

@gopherbot
Copy link

Backport issue(s) opened: #52705 (for 1.17), #52706 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure. and removed NeedsFix The path to resolution is known, but the work has not been done. labels May 4, 2022
@dmitshur
Copy link
Contributor

My understanding of the state of this issue is that it can in Backlog milestone without a release-blocker label. That is, the known-failing test is skipped (via CL 402181) so it's not causing noise on the dashboard (#11811), and to me this test seem safe not to run on darwin/arm64 given it's run elsewhere (i.e., the chance of regression that would affect only this one particular port is low). @bcmills Should we remove release-blocker label, or can you clarify why it should be kept? Thanks.

@bcmills
Copy link
Contributor

bcmills commented May 17, 2022

I agree that, given that the failure mode almost certainly indicates a mismatch in the test's assumptions rather than a bug in the net package per se, it would be ok to appropriately widen the test-skips and then move it to the Backlog milestone.

However, the current skip is based on the builder name, so this test may still fail if someone who uses a similar network configuration imports the net package and then runs go test all in their own module, which I argue ought to be a reasonable thing to do — especially given that darwin/arm64 is a first class port.

I suggest:

  • If we believe that the conditions that cause this failure are specific to the darwin/arm64 platform, then we could widen the skip to GOOS = "darwin" && GOARCH == "arm64" (independent of GO_BUILDER_NAME).
  • If we don't really know what the conditions are, but the failure mode is specific enough to still get a useful signal from running the test, then we could skip the test any time we see that failure mode regardless of platform.
  • If we don't know what conditions cause the failure and either we can't easily detect that failure mode (e.g. it manifests as a timeout) or the test isn't useful otherwise (e.g. that failure mode is the only thing the test is looking for to begin with), then we should probably skip the test on all platforms.

@heschi
Copy link
Contributor Author

heschi commented May 20, 2022

I don't see any immediate reason to think this is a platform-specific problem. It might be, but it seems much more likely that it's just a quirk of the network we're on. (I still hope to move them to a different network but it is taking a while.)

I don't think skipping the tests is costing us much coverage, per Dmitri's comment, and I don't think there's anything release-blocking left now that we've backported to the skip to the release branches. I would like the net maintainers to make any further decisions on what to do here. (Perhaps there's a better way to write the test?)

cc @ianlancetaylor @neild

@neild
Copy link
Contributor

neild commented May 20, 2022

Is there a builder I can run the test on that will reliably fail?

@heschi
Copy link
Contributor Author

heschi commented May 20, 2022 via email

@bcmills
Copy link
Contributor

bcmills commented May 23, 2022

@heschi

I don't think there's anything release-blocking left now that we've backported to the skip to the release branches.

I disagree: I think the remaining release-blocking action is for the net maintainers to determine whether (and how) to broaden the skip or fix and unskip the test (as I infer Damien is in the process of doing).

That doesn't necessarily need to be a time-intensive process (a CL to skip a test only takes a couple of minutes to mail the change and set Auto-Submit+1), but it does need to happen before the release in order to tie up the loose ends of supporting darwin/arm64 as a first-class port.

@millerresearch
Copy link
Contributor

I've been getting these failures since yesterday too, on the plan9-arm builders. It's not dependent on the platform, but on network connectivity. My internet provider's gateway machine is sending Network Unreachable ICMP messages for 198.18.0.254 - but TestDialCancel depends on connection timing out.

@millerresearch
Copy link
Contributor

Looking at RFC 6890, I don't think 198.18.0.0/15 should be globally routed - it seems to be a reserved address block for benchmarking within an organisation. Shouldn't a gateway router be dropping packets to this range, not sending them to the internet? My guess is that my internet provider has just been reconfigured to send Network Unreachable messages for this block instead of dropping them silently.

I've set up a static route to black-hole this address range (by forwarding to a non-existent host). It looks like that makes the net test work again.

@gopherbot gopherbot removed this from the Go1.19 milestone Aug 2, 2022
@gopherbot gopherbot added this to the Go1.20 milestone Aug 2, 2022
@gopherbot gopherbot modified the milestones: Go1.20, Go1.21 Feb 1, 2023
@gopherbot
Copy link

Change https://go.dev/cl/496037 mentions this issue: net: ignore more errors in TestDialCancel

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
Development

No branches or pull requests

6 participants