Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Test{Read,Write}TimeoutFluctuation failures with "took over 1s" #36108

Closed
bcmills opened this issue Dec 12, 2019 · 11 comments
Closed

net: Test{Read,Write}TimeoutFluctuation failures with "took over 1s" #36108

bcmills opened this issue Dec 12, 2019 · 11 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Dec 12, 2019

--- FAIL: TestWriteTimeoutFluctuation (1.09s)
    timeout_test.go:758: Write took over 1s; expected 0.1s
FAIL
FAIL	net	18.272s

2019-12-11T19:51:34-1b1fbb3/solaris-amd64-oraclerel
2019-10-21T16:42:10-7a6da21/solaris-amd64-oraclerel
2019-05-07T18:42:17-dc0388c/solaris-amd64-smartosbuildlet

See also #20838, #10775.

CC @bradfitz @ianlancetaylor

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 12, 2019
@bcmills bcmills added this to the Backlog milestone Dec 12, 2019
@bcmills bcmills changed the title net: TestWriteTimeoutFluctuation is flaky on solaris net: TestWriteTimeoutFluctuation is flaky Jan 27, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Apr 16, 2020

@bcmills bcmills changed the title net: TestWriteTimeoutFluctuation is flaky net: TestWriteTimeoutFluctuation failures with "Write took over 1s" Jan 22, 2021
@bcmills bcmills changed the title net: TestWriteTimeoutFluctuation failures with "Write took over 1s" net: Test{Read,Write}TimeoutFluctuation failures with "took over 1s" Jan 22, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Nov 18, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Nov 18, 2021

greplogs --dashboard -md -l -e 'FAIL: TestReadTimeoutFluctuation' --since=2021-01-23

2021-11-03T00:07:03-d6f7203/dragonfly-amd64
2021-04-16T01:54:27-f08c552/dragonfly-amd64-5_8

@bcmills
Copy link
Contributor Author

bcmills commented Nov 18, 2021

Part of the problem here, I think, is that the timing check relies on the runtime scheduler rather than the monotonic timer, and relies on the goroutine scheduler rather than comparing times: the timer in the test function that fires at the expected deadline and the deadline computed in the timeoutTransmitter goroutine are completely independent and separated by arbitrary scheduler latency.

It also doesn't help that the test is marked as Parallel, which may substantially increase scheduler noise (especially on small or slow builders). I presume that it is marked as Parallel because it is too slow otherwise, which comes back to the arbitrary timeouts.

I think this test can fairly easily be made both much faster and much more robust.

@gopherbot
Copy link

Change https://golang.org/cl/365334 mentions this issue: net: simplify Test*Fluctuation

@bcmills bcmills self-assigned this Nov 18, 2021
@bcmills bcmills added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 18, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Nov 18, 2021
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 18, 2021
@bcmills bcmills reopened this Nov 22, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Nov 22, 2021

Needs some fine-tuning: apparently 10% slop is not enough for OpenBSD even with a generous 512ms timeout. 😩
https://storage.googleapis.com/go-build-log/1e637cd3/openbsd-amd64-68_3585d3e7.log

@gopherbot
Copy link

Change https://golang.org/cl/366176 mentions this issue: net: allow more generous slop in Fluctuation tests

@gopherbot
Copy link

Change https://golang.org/cl/372215 mentions this issue: net: increase timing slop in TimeoutFluctuation tests on NetBSD and OpenBSD

gopherbot pushed a commit that referenced this issue Dec 15, 2021
…penBSD

Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
the only ones that miss by that much.

For #50189
Updates #36108

Change-Id: I22ac39cc7c254e40358fcd933b5a6016629602c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/372215
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc rsc unassigned bcmills Jun 23, 2022
@gopherbot
Copy link

Change https://go.dev/cl/415234 mentions this issue: os: simplify deadline fluctuation tests

gopherbot pushed a commit that referenced this issue Jun 30, 2022
This applies the net package CL 365334, CL 366176, CL 372215 to the os
package.

CL 365334:

    These tests were checking for fairly narrow timing windows, but were
    running in parallel and heavily dependent on timer and goroutine
    scheduling. This change eliminates unnecessary goroutines, runs the
    tests sequentially (dramatically shortening the timeouts to reduce the
    penalty of doing so), and uses timestamp comparison instead of
    background timers to hopefully gain some robustness from monotonic
    timestamps.

    Many of the other tests from this package would benefit from similar
    simplifications, which we can apply if and when we notice flaky
    failures or want to improve the latency of running the test.

CL 366176:

    It appears that at least the OpenBSD kernel gets sloppier the longer
    the timeout we give it, up to an observed overhead of around 25%.
    Let's give it a little more than that (33%) in the comparison, and
    also increase the growth curve to match the actual observed times
    instead of exponential initial growth.

CL 372215:

    Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
    the only ones that miss by that much.

For #36108
For #50189
Fixes #50725 (we hope)

Change-Id: I0854d27af67ca9fcf0f9d9e4ff67acff4c2effc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415234
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
This applies the net package CL 365334, CL 366176, CL 372215 to the os
package.

CL 365334:

    These tests were checking for fairly narrow timing windows, but were
    running in parallel and heavily dependent on timer and goroutine
    scheduling. This change eliminates unnecessary goroutines, runs the
    tests sequentially (dramatically shortening the timeouts to reduce the
    penalty of doing so), and uses timestamp comparison instead of
    background timers to hopefully gain some robustness from monotonic
    timestamps.

    Many of the other tests from this package would benefit from similar
    simplifications, which we can apply if and when we notice flaky
    failures or want to improve the latency of running the test.

CL 366176:

    It appears that at least the OpenBSD kernel gets sloppier the longer
    the timeout we give it, up to an observed overhead of around 25%.
    Let's give it a little more than that (33%) in the comparison, and
    also increase the growth curve to match the actual observed times
    instead of exponential initial growth.

CL 372215:

    Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
    the only ones that miss by that much.

For golang#36108
For golang#50189
Fixes golang#50725 (we hope)

Change-Id: I0854d27af67ca9fcf0f9d9e4ff67acff4c2effc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415234
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@golang golang locked and limited conversation to collaborators Jun 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

2 participants