
cmd/go: add support for dealing with flaky tests #62244

Open
bradfitz opened this issue Aug 23, 2023 · 36 comments

@bradfitz
Contributor

bradfitz commented Aug 23, 2023

Background

First off, flaky tests (tests that usually pass but sometimes fail) are the worst: you should not write them, and when they do appear you should fix them immediately.

That said, the larger teams, projects, and test suites get, the more likely flaky tests are to get introduced. And sometimes it's technically and/or politically hard to get them fixed (e.g. the person who wrote the test originally left the company or works on a new team). You're then left deciding whether to skip or delete the entire test (which might otherwise be super useful), or let your team become immune to and start ignoring test failures, which is the worst outcome: teams start submitting when CI is red and become blind to real test failures.

Google doesn't have this problem internally because Google's build system supports detecting & annotating flaky tests: https://bazel.build/reference/be/common-definitions

Tailscale has its own test wrapper that retries tests annotated with flakytest.Mark as flaky. We can't use go test -exec=... unfortunately, as that prevents caching (#27207). So instead we need a separate tool that wraps cmd/go (kinda awkwardly, as it turns out).

Go internally also uses these flaky test markers 79 times.

So, clearly flaky tests are a problem. And cmd/go isn't helping.

Proposal

I propose that Go:

  • let users annotate known flaky tests (à la tb.MarkFlaky(urlOrMessage string))
  • let cmd/go support re-running marked flaky tests up to N times (default 3, like Bazel?). Maybe add -retrycount or -maxcount alongside the existing -count flag?
  • emit machine-readable output from cmd/go (that test2json recognizes) saying that a test was flaky and failed but eventually passed, which users can plug into their dashboards/analytics to find tests that are no longer flaky
  • optional: support re-running even tests not annotated as flaky after they fail, to discover that they're flaky (perhaps still failing the test if not annotated, but with a different failure type, so users learn it's flaky and can go fix or annotate it)
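
To make the first two bullets concrete, here is a purely hypothetical usage sketch; MarkFlaky, -retrycount, and the test name are just the spellings floated above and do not exist today:

func TestControlPlaneSync(t *testing.T) {
	// Hypothetical annotation: this test is known to flake; the URL points
	// at the tracking issue.
	t.MarkFlaky("https://github.com/example/repo/issues/123")

	// ... existing test body, unchanged ...
}

The idea is that running something like go test -retrycount=3 ./... would re-run the marked test on failure up to the limit and record the retries in the test2json output.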

FAQ

Won't this encourage writing flaky tests? No. Everybody hates flaky tests.

Won't this discourage fixing flaky tests? No. We're already retrying flaking tests. We're just fighting cmd/go to do so.

Why do you have flaky tests? Unit tests of pure functions don't flake. The tests that usually flake are big ones that involve timeouts, the network, and subprocesses, even hitting localhost servers started by the tests. Many CI systems are super slow and oversubscribed; even super generous network timeouts can fail.

Wasn't there already a proposal about this? Yes, in 2018: #27181. The summary was to not write flaky tests. I'm trying again.

@gopherbot gopherbot added this to the Proposal milestone Aug 23, 2023
@bcmills
Contributor

bcmills commented Aug 23, 2023

I've investigated an awful lot of test flakes in the Go project over the past few years. Some observations:

  • Many test flakes are due to hard-coded timeouts (in the test or in the code under test).

    • Removing the arbitrary timeouts is often sufficient to deflake the test.
    • For tests that verify time-measurement functions, the fix is usually to compute upper and lower bounds based on the actual time elapsed in the test.
    • For tests that specifically check timeout behavior, the fix is usually to retry the test with exponentially longer timeouts until it succeeds.
  • Many test flakes manifest as deadlocks or hangs.

    • Retrying a test that fails in this way can increase the end-to-end time of the test by an arbitrarily large factor.
    • Reducing that factor implies running the individual test with a much shorter timeout, but a much shorter timeout increases the risk of a flake due to the timeout being too short.
      • Which leads us back to the pattern of retrying the test with exponentially longer timeouts.
  • Test flakes that are not in one of the above categories generally have a systematic (but undiagnosed) failure mode.

    • This is what allows us to use tools like watchflakes to classify errors.
    • These tests should only be retried if the failure mode matches the known pattern; otherwise, the flakiness annotation can mask real regressions in the code under test.

The common elements of these points are that flaky tests often need to be retried at a fine granularity and with a progressively increasing timeout, rather than retried a fixed number of times in the same configuration.

To me, that suggests an API addition to the testing package rather than a solution in cmd/go.
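
For reference, a minimal sketch of that exponential-timeout pattern as it can be written today, with no new API (doWork is a hypothetical operation under test; assumes the usual context, errors, testing, and time imports):

func TestWorkCompletes(t *testing.T) {
	timeout := 100 * time.Millisecond
	for {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := doWork(ctx) // hypothetical operation under test
		cancel()
		if err == nil {
			return // success
		}
		if !errors.Is(err, context.DeadlineExceeded) {
			t.Fatalf("doWork: %v", err) // a real failure, not a timeout: report it
		}
		if timeout >= time.Minute {
			t.Fatalf("doWork did not complete within %v", timeout)
		}
		t.Logf("timed out after %v; retrying with a longer timeout", timeout)
		timeout *= 2
	}
}

The boilerplate for tracking the growing timeout is exactly the part that an API in the testing package could absorb.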

@bitfield

Everybody hates flaky tests

I'm sure we all agree on that, Brad, but it doesn't follow that we should add special functionality to Go to make it easier for developers to live with flaky tests.

The summary was to not write flaky tests. I'm trying again.

I'm not sure why you expect a different answer this time. Frankly, if having flaky tests annoys you, or is inconvenient, I'd say that's a good thing.

sometimes it's technically and/or politically hard to get them fixed

Agreed. The solution is to delete them instead, not to mask the flakiness with automatic retries.

@painhardcore

I wholeheartedly support this proposal. Addressing flaky tests is a real-world necessity and providing a structured way to manage them in the standard library is a step in the right direction.
This not only facilitates better tracking and management of these tests but also synergizes well with other tools like linters, by enforcing the addition of comments with Issue numbers or JIRA tickets, thus ensuring that the issues are properly documented and tracked.
Moreover, this proposal strikes a good balance by acknowledging the existence of flaky tests without encouraging their creation or ignoring their need for eventual resolution. It's a pragmatic approach to a problem that, despite our best efforts, persists in many large-scale projects.

@vearutop
Contributor

These tests should only be retried if the failure mode matches the known pattern; otherwise, the flakiness annotation can mask real regressions in the code under test.

To me, that suggests an API addition to the testing package rather than a solution in cmd/go.

I agree. I'd rather have a conservative opt-in way to retry test failures, so that the flakiness "policy" is defined per test and is fully in the developer's control.

	t.RunRetry("name", func(t *testing.T, attempt int) {
		if attempt > 3 {
			// Give up and forgive.
			t.Skip()
			
			// Or give up and fail hard.
			// t.FailNow()
		}

		// Iterate with increasing tolerance depending on attempt.

		t.Fail()
	})

@rsc
Contributor

rsc commented Sep 7, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@dnephin
Contributor

dnephin commented Sep 10, 2023

the larger teams & projects & tests get, the more likely [flaky tests] get introduced. And sometimes it's technically and/or politically hard to get them fixed. (e.g. the person who wrote it originally left the company or works on a new team)

This is a really important point that I think is worth elaborating. The impact of flaky tests on the contributors to a software project is often directly correlated to the constraints imposed by the organization responsible for the project.

Many software projects won't suffer from flaky tests, maybe because the code has minimal concurrency, the scope of the problem is small, or the code isn't critical enough to require comprehensive end-to-end testing of all the components. Other projects with significant concurrency, large scope, and large end-to-end tests may encounter the occasional flaky test, but also don't suffer from flaky tests because the team has sufficient time and resources to deal with them. The Go project has some flaky tests, but appears to have significant infrastructure in place to identify and triage flaky tests, and to mitigate the impact of those flakes on other contributors. Not every project is so fortunate. If your experience is limited to projects in these two categories, it can be easy to dismiss the problem of flaky tests as not significant. Maybe the problem of flaky tests is not significant for the majority of projects, but it can be very real for some.

A few years ago I joined a team that was working on a well known open source project. The project has thousands of tests, many of which could flake. The team had attempted to deal with the flaky tests for years, but could never get them into a reasonable state. The project had already been around for 6 years, had hundreds of contributors who no longer worked on it (many from the open source community), and most of the flaky tests were written by someone who was no longer involved with the project. The cost of fixing a flaky test was high enough that by the time one or two were fixed, new ones had appeared. Contributors were forced to either click "rerun job" in CI multiple times (and wait hours to get a green run), or assume the test was flaky and merge with a failing CI check.

Multiple attempts were made to manually identify flaky tests by creating GitHub issues to track them. This helped a bit, but by itself didn't really address the problem. Creating GitHub issues for every CI run still took time from every contributor, and it was often easier to click re-run and not create an issue. GitHub issues on their own did not help prioritize the flakiest tests. This manual approach was not enough.

This experience is what motivated me to write the --rerun-fails feature in gotestsum. It doesn't use test markers, but otherwise seems to implement the proposal described in the original post. Roughly:

  • if more than x (default 10) tests fail in the initial run, do not re-run anything. This is to avoid re-running tests when there is an infrastructure problem, or a real problem in a core piece of code.
  • for each test that failed in the initial run, re-run it up to n times (default 2 more attempts)
  • show all the test failures in the output, so that someone looking at the run knows that some tests had to be retried
  • --rerun-fails-report=filename will output the list of tests that were re-run. This can get posted to the GitHub PR as a comment, so that the author of the PR is still aware of any potentially new flaky tests in their CI run.

This feature allowed the team to make real progress against fixing flaky tests. Instead of manually ticketing and re-running failures we could use the test2json output to find the most common flakes, and focus our efforts on those tests.

I've seen some interest in this feature from the community. This Sourcegraph search shows at least a few recognizable projects using it. I imagine there are others that are not indexed by Sourcegraph, are not open source, or use the flag in a script that doesn't match the search query.
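
For illustration, a minimal sketch of the shape of such a wrapper (not gotestsum's actual implementation): it parses go test -json output and re-runs only the failed tests.

// rerunfails is a toy wrapper: run `go test -json`, collect the names of
// failed tests from the test2json stream, and re-run just those tests.
// Real tooling also handles subtest names, regexp metacharacters, and
// per-package test binaries; this sketch ignores all of that.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// event is the subset of test2json fields this sketch needs.
type event struct {
	Action string
	Test   string
}

func failedTests(jsonOutput []byte) []string {
	var failed []string
	dec := json.NewDecoder(bytes.NewReader(jsonOutput))
	for {
		var e event
		if err := dec.Decode(&e); err != nil {
			break
		}
		if e.Action == "fail" && e.Test != "" {
			failed = append(failed, e.Test)
		}
	}
	return failed
}

func main() {
	// Initial run. The error is deliberately ignored: a test failure also
	// yields a non-zero exit status, but the JSON stream is still captured.
	out, _ := exec.Command("go", "test", "-json", "./...").Output()
	os.Stdout.Write(out)

	failed := failedTests(out)
	if len(failed) == 0 {
		return
	}
	if len(failed) > 10 {
		// Too many failures usually means real breakage, not flakiness.
		os.Exit(1)
	}

	// Re-run only the failed tests, allowing up to two more attempts.
	pattern := "^(" + strings.Join(failed, "|") + ")$"
	for attempt := 1; attempt <= 2; attempt++ {
		cmd := exec.Command("go", "test", "-run", pattern, "-count=1", "./...")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if cmd.Run() == nil {
			fmt.Printf("flaky tests passed on retry attempt %d: %v\n", attempt, failed)
			return
		}
	}
	os.Exit(1)
}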

@dnephin
Contributor

dnephin commented Sep 10, 2023

Given my experience with flaky tests above, I have a few questions about the proposal.

Why only re-run marked tests? My rationale for re-running flaky tests has generally been that an existing flaky test should not introduce friction for other contributors. If a contributor has to create a GitHub issue and edit a test they are not familiar with, that is non-zero friction and they may be tempted to just re-run the CI job instead. I would generally prefer to have an automated system notice the flakes by reading the test2json output and seeing that a test both failed and passed in the same run. That same system should be able to sort the flakes to find the most common ones.

Some indication of re-run tests is important, but given the nature of a flaky test, it seems unlikely that the person adding the mark is going to be the same person that introduced the flake.

Is the marking of a test more for the developer reading the test code, so that they understand it's a flaky test?

Do we need new output to show "that a test was flaky and failed but eventually passed"? If a test appears multiple times in the test2json output with both fail and pass, doesn't that indicate a flake?

@bcmills
Contributor

bcmills commented Sep 11, 2023

I think @vearutop is on the right track with RunRetry, but I think it's also important to be able to distinguish an outright Fail (meaning “this test had an unexpected failure mode”) from a “flake” (meaning “this test failed to produce a useful signal”).

I suggest the following additions to testing.T. The Retry method is analogous to Fail, and the Restart methods are analogous to the Skip methods.

// Restart is equivalent to Retry followed by Skip.
func (*T) Restart(args ...any)

// Restartf is equivalent to Retry followed by Skipf.
func (*T) Restartf(format string, args ...any)

// Retry marks the function as unable to produce a useful signal,
// but continues execution.
// If the test runs to completion or is skipped, and is not marked as
// having failed, it will be re-run.
func (*T) Retry()

// Retries reports the number of times the current test or subtest has been
// re-run due to a call to Retry. The first run of each test reports zero retries.
func (*T) Retries() int64

The same methods could also be added to testing.B and testing.F if desired, but note that this would not allow for flaky Example functions.

The idea is that when a test flakes, it calls Restart or Restartf with a suitable message; the re-run can then use the Retries method to compute whatever new timeouts it needs. If for some reason a test wants to limit the number of retries it attempts, it could also use the Retries counter to trigger a call to Fatalf or similar.

@bcmills
Contributor

bcmills commented Sep 11, 2023

@dnephin

Is the marking of a test more for the developer reading the test code, so that they understand it's a flaky test?

The marking of a test is so that the flake-retry mechanism doesn't mask actual regressions — such as data races or logical races — that may manifest as low-but-nonzero-probability failures in tests that were previously not flaky at all.

@willfaught
Contributor

Why return int64 instead of int?

@bcmills
Contributor

bcmills commented Sep 11, 2023

Because a very short test can plausibly fail more than 2³¹ times before it times out on a 32-bit machine?

(But int would probably be fine too — presumably nobody wants to read 2³¹ retries' worth of test logs anyway. 😅)

@willfaught
Contributor

I was thinking the number would be on the order of tens at most. 2^31 retries would be quite the flaky (and short) test. It sounded like the usual source is timeouts, which suggested longer tests to me.

@bcmills
Contributor

bcmills commented Sep 11, 2023

It sounded like the usual source is timeouts, which suggested longer tests to me.

Agreed, although I think it's important that whatever API we add is also able to gracefully handle other kinds of flakes.

(And note that for the “flake manifests as a deadlock, race, or runtime throw” sort of test, generally the test would need to reinvoke itself as a subprocess, run that subprocess using exec.CommandContext or similar, and then examine the output to decide whether to retry.)
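
A sketch of that subprocess pattern (doRiskyWork is a hypothetical function that may deadlock or trigger a runtime throw; assumes the usual context, os, os/exec, testing, and time imports):

func TestRiskyWork(t *testing.T) {
	if os.Getenv("TEST_RISKY_SUBPROCESS") == "1" {
		doRiskyWork() // hypothetical code under test that may hang or throw
		return
	}

	// Re-invoke the test binary so that a hang or runtime throw is confined
	// to the child process, where the parent can observe it, log the output,
	// and decide whether to retry.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, os.Args[0], "-test.run=^TestRiskyWork$", "-test.v")
	cmd.Env = append(os.Environ(), "TEST_RISKY_SUBPROCESS=1")
	out, err := cmd.CombinedOutput()
	if err != nil {
		t.Fatalf("subprocess failed: %v\noutput:\n%s", err, out)
	}
}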

@dnephin
Contributor

dnephin commented Sep 16, 2023

The marking of a test is so that the flake-retry mechanism doesn't mask actual regressions — such as data races or logical races — that may manifest as low-but-nonzero-probability failures in tests that were previously not flaky at all.

Data races and logical races are also common causes for flakes. Is the distinction here "a flake caused by production code" indicating a bug that needs to be fixed vs "a flake caused by test code" indicating a lower priority to fix?

Is the marking of tests actually effective in preventing masking of those failures?

Consider these scenarios.

Scenario 1: I contribute a change that introduces a new logical race. I run the tests a few times but the race only causes the test to fail 1 in 20 runs so it never fails on my change request. The change is merged into main. A few days later it fails on a change request from a different contributor. That contributor doesn't recognize the error. They have a few options.

  1. re-run the tests on the change request by pressing a button in their CI system. The test likely passes now, so they ignore the error and no one becomes aware of the flake for a longer period of time. The failure is now "masked" until someone notices it again.
  2. attempt to diagnose the source of the failure, either by themselves or with the help of someone else. They identify it as a flaky test so they add the marker and open a new issue about the problem. Eventually the responsible party hopefully becomes aware of the issue and fixes the bug.

In this scenario the test wasn't automatically re-run, but due to the nature of flaky tests not happening frequently, the flake is masked either way. In the best case scenario the bug isn't fixed until the appropriate party becomes aware and has time to look into it.


Scenario 2: As above, I contribute a change that introduces a new logical race, it doesn't fail on my change request, and the change is merged into main. This time tests are being retried without having to mark them. When tests are retried they are reported on the change request, and when new ones appear that same automation opens an issue for them.

In this scenario the appropriate party still becomes aware of the problem at about the same time (when new issues are reported), and that may even happen earlier because in this scenario you aren't relying on every single contributor to do the right thing. The automation handles it.


Scenario 3: I contribute a change that introduces a regression, causing requests to time out more often. The tests that cover this change are already marked as flaky, so the tests are retried and they pass after a couple of attempts.

In this scenario flaky test markers are used, but it still masks the regression. Until someone looks at the frequency of flakes they don't notice the regression.


In these scenarios, marking tests doesn't seem to prevent failures from being masked. In the unlikely scenario where a test flakes on the very change request that introduces the bug, it could mask the problem, but only if the retried tests are not made visible. Scenario 3 already requires that any retried tests are made visible, which seems like it prevents the failure from being masked.

The problem of flaky tests is arguably often more a cultural one than a technical one. That's why this point from the original post is so important:

the larger teams & projects & tests get, the more likely [flaky tests] get introduced. And sometimes it's technically and/or politically hard to get them fixed.

The API proposed here seems like it would be great when the flaky behaviour in a test is already well understood.

From the Retry godoc suggested above:

// If the test runs to completion or is skipped, and is not marked as having failed, it will be re-run.

If tests that fail with t.Error or t.Fatal aren't retried, it seems like this mechanism would not help existing test suites that are struggling with flaky tests. It would require making changes to many tests, and that was already the problem.

In my experience, and based on my read of the original proposal, this doesn't really address the most painful parts of dealing with flaky tests. If Retry worked on tests marked failed, that might help.

@dnephin
Contributor

dnephin commented Sep 18, 2023

For some kinds of tests, particularly end-to-end tests of large systems, any failure would benefit from being retried. It's easy to introduce flaky behaviour into these kinds of tests, and a human is going to have to rerun them to determine whether the failure is reliable or flaky. Rerunning these automatically would be helpful.

Maybe a new -retrycount[=n] flag could enable that behaviour, but the flag would only enable reruns of failures, it wouldn't change the pass/fail results of tests. Any test that fails at least once would be considered failed.

Something like t.Retry (or t.Unreliable(reason string) ?) would allow a test author to change the pass/fail result of a test that was retried. Any test marked unreliable would only require a single run to pass to be considered successful. If these new unreliable failures are reported as a new Action type in the test2json output (maybe undetermined or unreliable?) that should allow automated systems to track the failure rate.

That approach should provide good signals on new failures (they aren't masked as passing tests) while still making it easier for developers to work on projects that are prone to flaky tests.

@bcmills
Contributor

bcmills commented Sep 18, 2023

Data races and logical races are also common causes for flakes. Is the distinction here "a flake caused by production code" indicating a bug that needs to be fixed vs "a flake caused by test code" indicating a lower priority to fix?

No. The distinction is “a flake caused by something known to be a preexisting issue” vs. “a flake whose cause has not been triaged”. If no one has even looked at the failure mode before, how can we assume anything at all about its priority?

Is the marking of tests actually effective in preventing masking of those failures?

Marking of tests? No, because if you mark the whole test as flaky then any failure at all will be retried. But marking of failure modes is effective: if the test is only marked as a flake if its failure matches a known issue, then a new failure mode will still cause the failure to be reported.

@bcmills
Contributor

bcmills commented Sep 18, 2023

@dnephin, regarding your more specific scenarios.

Scenario 1: … re-run the tests on the change request by pressing a button in their CI system. The test likely passes now, so they ignore the error and no one becomes aware of the flake for a longer period of time. The failure is now "masked" until someone notices it again.

If someone is doing code reviews, ideally they will notice the CI failure when they review the change. But yes: if contributors and maintainers are in the habit of hitting a “retry” button without also reporting the test failure, then the test failure will presumably go unreported. That's why CI systems are usually run as “continuous” (post-commit) tests rather than just pre-commit.

If the failure happens during post-commit testing, then it will presumably be reported in the usual way.

Scenario 2: … When tests are retried they are reported on the change request, and when new ones appear that same automation opens an issue for them. … In this scenario the appropriate party still becomes aware of the problem at about the same time … The automation handles it.

That assumes that someone is triaging the failures reported by the automation. If a project isn't keeping up with triage for its test failures, why would we expect it to keep up with triage for automated failure reports? (Aren't those more-or-less equivalent?)

Scenario 3: I contribute a change that introduces a regression, causing requests to timeout more often. The tests that cover this change are already marked as flaky, so the tests are retried and they pass after a couple attempts. … Until someone looks at the frequency of flakes they don't notice the regression.

Yes, that's the downside of marking a given failure mode as a flake: it masks all other regressions with the same failure mode. To minimize the masking effect, it is important to mark the most specific failure mode that is feasible.

But this is closely related to performance monitoring in general. If you want to test timing- or performance-related characteristics, you need a testing setup that is fundamentally different from the usual heavily-loaded CI environment.

@bcmills
Contributor

bcmills commented Sep 18, 2023

If the tests aren't retried because of any t.Error or t.Fatal it … would require making changes to many tests, and that was already the problem.

In my experience, and based on my read of the original proposal, this doesn't really address the most painful parts of dealing with flaky tests. If Retry worked on tests marked failed, that might help.

In my experience, the most painful part of dealing with flaky tests is the time and effort it takes to understand the failure, not to mark it. It is usually much easier to mark a failure mode as flaky than to diagnose and fix its underlying cause.

@dnephin
Contributor

dnephin commented Sep 19, 2023

Thank you for the detailed response. I now better understand the motivation for the proposed solution.

If a project isn't keeping up with triage for its test failures, why would we expect it to keep up with triage for automated failure reports? (Aren't those more-or-less equivalent?)

The number of flaky test failures can often be due to historical compromises. It's common for organizations to make technical compromises to deliver a product. Even if a team is able to spend enough time to deal with new flakes being caused today, that doesn't mean they're able to immediately address the large number of flaky tests built up over previous cycles due to those compromises.

The key difference is the "always retry and track using automation" approach gives the team flexibility about when they spend the time to triage. The investigation can be scheduled for a convenient time, and all the flakes don't need to be fixed each time.

If no one has even looked at the failure mode before, how can we assume anything at all about its priority?

No doubt it is difficult to assume too much without doing some investigation, but there are some things that can help prioritize the investigation. A flake in a test for a critical system is going to be higher priority than a test failure in a less critical system. A flake that happens 1/10 runs is also potentially higher priority than one that occurs 1/1000. When a flake first appeared can also help prioritize. A flake that has existed for years is likely lower priority than one that just started in the last week.

In my experience, the most painful part of dealing with flaky tests is the time and effort it takes to understand the failure, not to mark it.

Definitely the most time consuming part of dealing with flaky tests is understanding the failure, but I would argue that doesn't necessarily make it the most painful. I often enjoy the challenge, especially if I have the support of my team and organization to do that work.

I would argue the painful part of dealing with flaky tests is the disruption to other work. Someone trying to contribute a new feature or a bug fix shouldn't be interrupted by having to understand a test failure in a completely different part of the system.

If marking a test is simply adding a line at the beginning of the test (e.g. t.MarkFlaky), that is a pretty reasonable amount of work. My understanding of the t.Retry proposal is that the test can't fail. So a contributor has to find every place in the test that could flake and replace the t.Fatalf with a t.Restartf. In a large end-to-end test that's a significant amount of work. There are often helpers that are re-used across many tests, and trying to change all the t.Fatalf calls may not be practical.

t.Retry seems useful for flaky behaviour that is well understood. Some mechanism to defer investigation (e.g. t.Unreliable or t.MarkFlaky) would also be useful.

@rsc
Contributor

rsc commented Oct 24, 2023

It seems like there are two main proposals on the table: Brad's and Bryan's. They differ in a few details but it sounds like there is general agreement we should try to do something here. If I can simplify them to their cores:

  1. Brad's proposal is to add t.Flaky("reason"), called at the start of the test. If the test later fails, it is retried, up to the -maxretry flag (default 2).

  2. Bryan's proposal is to add t.Retry("reason"), called when the test would otherwise fail. t.Retry is like t.Skip in that it ends the test, but it also queues a retry of the test, up to -maxretry. If the retry limit is reached, t.Retry is like t.Fatal instead.

For both assume an f form too: t.Flakyf and t.Retryf. Let me know if I've grossly misunderstood or oversimplified.

A benefit of t.Flaky is that monitoring can notice purely from the test logs that a particular test is marked as flaky but not flaking. To do this with t.Retry, you'd need both test logs and a scan of the source code to find the Retry calls and check for ones that never appear in recent logs.

A benefit of t.Retry is that it is more precise about the specific reason for the flake. If the test starts failing for some other reason, that's a fail, not a retry. For example, in this Tailscale code there are many t.Fatal lines, and probably only a couple are implicated in flakiness. For example, probably key.ParseNodePublicUntyped is not flaky.

But this precision comes at a cost. One might say that a benefit of t.Flaky is that it's not precise, so that you don't have to chase down and rewrite a specific t.Fatal line just to mark the test flaky. This is not a big deal if the t.Fatal line is in the test itself, but the t.Fatal line may be in a helper package that is less easily modified. (On the other hand, maybe that's the fault of the helper package, but people do like their assert packages.)

Bryan's full proposal also gives the test more control over what happens next, with the Retries method. That said, I am not sure exactly how a test would use that extra control, so I left it out above. It seems better for the testing package and the command line to control retries.

Looking at Go's main repo, we use SkipFlaky and SkipFlakyNet almost exactly like t.Retry, which is not terribly surprising.

Looking at Tailscale uses of flakytest.Mark, it does give me pause not to know where exactly the flaky failures happen. So I would lean toward trying t.Retry first.

What do people think about adding just the one method t.Retry, as described above, along with the -maxretry flag?

@bcmills
Contributor

bcmills commented Oct 24, 2023

I think a -maxretry flag would be a mistake. I would instead leave the retry policy up to the test itself (it can call Retry or not), and rely on timeouts (particularly per-test timeouts as in #48157) as the backstop.

Also note that for many classes of test, arbitrarily many calls to Retry may occur in normal operation, and should not be considered a problem unless they are adding significant latency. For example, a test of a timeout-related API may want to start with a very aggressive timeout to keep running time short on fast, unloaded developer machines, but retry with exponentially longer timeouts to avoid spurious failures on slow or heavily-loaded CI machines.

The main purpose of the Retries method is to simplify that sort of exponential backoff for test functions that may be run multiple times, either as a result of the -count flag or due to reuse in multiple subtests. Without the Retries method, one needs to somehow track the retry count (or running time) out-of-band. That can be done (for example, by factoring out a subtest and storing the retry counter in the parent test's call frame), but it's tedious and awkward.

@bcmills
Contributor

bcmills commented Oct 24, 2023

Curiously, the Retries method is also useful for flakes that manifest in certain other ways, such as deadlocks. In general, the pattern for fixing a deadlock-prone test consists of two steps:

  1. Add a short timeout on the operation, which will trigger if the deadlock occurs.
  2. At each failure, retry with a longer timeout.

Adding the timeout gives the test enough time for a meaningful retry after a deadlock, but introduces the possibility of timeouts in normal operation of the test. An exponentially increasing timeout allows the test to bound the amount of time wasted on these normal-operation retries to be linearly proportional to the running time of the success case.

(And, again, the Retries method would greatly simplify that exponential backoff.)
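
Concretely, a sketch of that deadlock-retry pattern using the Retry and Retries methods proposed above (not an existing API; doConcurrentWork is a hypothetical operation that may deadlock):

func TestNoDeadlock(t *testing.T) {
	// Exponentially longer timeout on each re-run: 1s, 2s, 4s, ...
	timeout := time.Second << t.Retries()

	done := make(chan error, 1)
	go func() { done <- doConcurrentWork() }()

	select {
	case err := <-done:
		if err != nil {
			t.Fatal(err)
		}
	case <-time.After(timeout):
		// Either the deadlock occurred or the machine is just slow:
		// abandon this attempt (the stuck goroutine is leaked) and
		// re-run the test with a longer timeout.
		t.Logf("no result after %v; retrying", timeout)
		t.Retry()
	}
}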

@rsc
Contributor

rsc commented Oct 26, 2023

I looked at another random flaky Tailscale failure and found a test that uses c.Assert everywhere. There's nowhere to put the t.Retry call, because the Fatal is buried in the Assert. The same problem happens with any test helpers, not just ones as general as "Assert". Brad also pointed out that adding t.Flaky to the top of a test is something an automated process can do, whereas finding the right place to put t.Retry is not (in the Assert case, even a human can't do it). His vision is automation to file a bug and add the t.Flaky(bugURL), and then also automation to remove the t.Flaky when the flake is fixed. This does sound like a good set of tools to enable building.

So I am leaning more toward t.Flaky again rather than t.Retry.

@bcmills
Contributor

bcmills commented Oct 31, 2023

It seems to me that there is a very minor change to Retry that would allow it to address that use-case as well: namely, allow Retry to retry a test that has already failed.

Then the “tool-automated tagging helper” use-case looks like:

func Mark(t testing.TB, issue string) {
	t.Cleanup(func() {
		if t.Failed() {
			t.Logf("flakytest: retrying flaky test (issue %v)", issue)
			t.Retry()
		}
	})
}

That comes at the expense of needing somewhat more careful usage to avoid retrying unexpected failures:

	if errors.Is(err, context.DeadlineExceeded) && !t.Failed() {
		t.Logf("retrying with longer timeout")
		t.Retry()
	}

Crucially, though, it still does support the use-case of retrying in normal operation. As far as I can tell, t.Flaky does not, and I don't see a way to adapt it for that kind of usage.

@rsc
Contributor

rsc commented Nov 1, 2023

Nice observation. So if we go with the t.Retry, are there any remaining concerns?

@thepudds
Contributor

thepudds commented Nov 1, 2023

Would someone mind summarizing what would be included with t.Retry under the latest incarnation of the proposal?

(For example, would it just be t.Retry, or would it include a -maxretry flag from #62244 (comment), or a t.Retries method that Bryan was explaining a few comments ago in #62244 (comment) and elsewhere, or anything else?)

@rsc
Contributor

rsc commented Nov 2, 2023

I think it would rely on per-test timeouts as @bcmills pointed out, but that means we should probably still include Retries() int too.

@bcmills
Contributor

bcmills commented Nov 2, 2023

Note that individual tests (or test helpers) could add their own -maxretry flag if desired:

if t.Failed() && t.Retries() < *maxRetry {
	t.Logf("flakytest: retrying flaky test (issue %v)", issue)
	t.Retry()
}

@ChrisHines
Contributor

ChrisHines commented Nov 2, 2023

Note that individual tests (or test helpers) could add their own -maxretry flag if desired:

if t.Failed() && t.Retries() < *maxRetry {
	t.Logf("flakytest: retrying flaky test (issue %v)", issue)
	t.Retry()
}

Packages that add their own flags to test binaries are hostile to go test -maxretry ./... if not all of the packages in the pattern recognize the flag.

@bcmills
Contributor

bcmills commented Nov 2, 2023

@ChrisHines, so use an environment variable instead? (My point is just that if a test helper wants to set an arbitrary cap on the number of retries, it can do that fairly easily using the Retries method.)

@ChrisHines
Contributor

ChrisHines commented Nov 2, 2023

@ChrisHines, so use an environment variable instead? (My point is just that if a test helper wants to set an arbitrary cap on the number of retries, it can do that fairly easily using the Retries method.)

Fair enough, thanks for clarifying your point. 👍

I guess the question becomes: How important is it to have a standard "global" way to set the max retries vs. potentially multiple ways which could allow different helpers to have different configurations but also make it hard to know all the different knobs to set when setting up CI pipelines?

@rsc
Contributor

rsc commented Nov 2, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

Add to testing.T only (not B or F):

// Retry causes the test to clean up and exit, 
// as if [T.FailNow] had been called, and then start again.
// If the test is already cleaning up and exiting, 
// such as when called as part of a func deferred or registered with [T.Cleanup],
// Retry lets that process finish and then starts the test again.
func (*T) Retry()

// Retries returns the number of times this test has been retried during this run.
func (*T) Retries() int

In the testing output, calling Retry results in an output line like:

=== RETRY TestCase

In the test2json output, this line produces

{"Action":"retry","Test":"TestCase"}
{"Action":"retry","Test":"TestCase","Output":"=== RETRY TestCase\n"}

@rsc
Contributor

rsc commented Nov 10, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

Add to testing.T only (not B or F):

// Retry causes the test to clean up and exit, 
// as if [T.FailNow] had been called, and then start again.
// If the test is already cleaning up and exiting, 
// such as when called as part of a func deferred or registered with [T.Cleanup],
// Retry lets that process finish and then starts the test again.
func (*T) Retry()

// Retries returns the number of times this test has been retried during this run.
func (*T) Retries() int

In the testing output, calling Retry results in an output line like:

=== RETRY TestCase

In the test2json output, this line produces

{"Action":"retry","Test":"TestCase"}
{"Action":"retry","Test":"TestCase","Output":"=== RETRY TestCase\n"}

@rsc rsc changed the title proposal: cmd/go: add support for dealing with flaky tests cmd/go: add support for dealing with flaky tests Nov 10, 2023
@rsc rsc modified the milestones: Proposal, Backlog Nov 10, 2023
@bcmills
Contributor

bcmills commented Nov 10, 2023

Hmm. I don't think Retry should abort the test like how FailNow does — that makes it unsafe to use in a child goroutine (at least in most cases). It should instead work like Fail (just mark the test), and if a test helper wants to skip the rest of the test they can pair it with a SkipNow.

(An exception for “already cleaning up and exiting” seems too fragile and too ambiguous to me — what would that mean if Retry was called from a defer in an ordinary test function, and would that be different from being called in a t.Cleanup callback?)

I think that would be something like:

// Retry marks the function as unable to produce a useful signal,
// but continues execution.
// When a test or subtest finishes its cleanup, if Retry was called on that
// test its status is discarded and the test function is run again immediately.

@bcmills
Contributor

bcmills commented Nov 10, 2023

We should also specify the behavior for subtests. Probably we retry the individual subtest function, rather than retrying the entire chain of parents.

But then what happens if a subtest fails and its parent test calls Retry afterward? If we're running in -json mode and streaming output, we may have already written the FAIL status for the subtest at that point. 🤔

@rsc
Contributor

rsc commented Dec 5, 2023

I think both @bcmills's comments are correct and should be reflected in the final implementation, whenever that happens. In particular if someone is foolish enough to call t.Retry in a parent after a subtest has failed, seeing an overall PASS seems OK.
