
x/build/cmd/coordinator: transparent TryBot retries mask stalls that should be fixed #42699

Closed
bcmills opened this issue Nov 18, 2020 · 13 comments
Labels: Builders (x/build issues: builders, bots, dashboards), NeedsFix (the path to resolution is known, but the work has not been done)
Milestone: Unreleased

Comments

@bcmills
Contributor

bcmills commented Nov 18, 2020

I noticed a stall on a SlowBot (#39665 (comment)), and when I went to check on it again I noticed that the counter had reset and the builder had restarted.

This behavior closely mimics the js-wasm stall-and-restart behavior reported in #35170, and I wonder if it explains other observed stalls too (probably #39349; perhaps #36482 and/or #32632 as well).

This timeout-and-retry behavior papers over a systemic problem to make it less noticeable. If it were instead more noticeable, we might have a better picture of how and why it occurs, and might be better equipped to fix the underlying problem (and improve perceived TryBot speed by eliminating the spurious timeouts entirely).

CC @golang/release

@gopherbot gopherbot added the Builders label Nov 18, 2020
@gopherbot gopherbot added this to the Unreleased milestone Nov 18, 2020
@bcmills bcmills changed the title x/build: transparent retries mask stalls that should be fixed x/build: transparent TryBot retries mask stalls that should be fixed Nov 18, 2020
@dmitshur dmitshur added the NeedsInvestigation label Nov 26, 2020
@bcmills
Contributor Author

bcmills commented Jan 5, 2021

See previously #13026.

@dmitshur dmitshur changed the title x/build: transparent TryBot retries mask stalls that should be fixed x/build/cmd/coordinator: transparent TryBot retries mask stalls that should be fixed Apr 30, 2021
@dmitshur
Contributor

dmitshur commented Dec 7, 2021

This timeout-and-retry behavior

@bcmills Is the issue here that there's an unbounded number of transparent retries, or that even a bounded number of retries—like 3—is considered problematic in this context? (Or is it not known whether these are bounded and needs investigation?)

If they're unbounded, we should definitely cap them at a fixed number and promote the final failure to a fatal error. That should be comparatively easy and high value. If they're already bounded, then improving this situation is more involved, especially if there isn't a good way to differentiate a network communication issue from a build stall.
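
A minimal sketch of the bounded-retry idea, assuming a hypothetical runBuild attempt function and a maxRetries cap; this is illustrative only, not the coordinator's actual code:

// Illustrative sketch only: cap transparent retries and promote the
// last failure to a fatal, user-visible error instead of retrying forever.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const maxRetries = 3 // hypothetical cap

var errStalled = errors.New("build stalled")

// runBuild stands in for one attempt at running a build shard.
func runBuild(ctx context.Context) error {
	return errStalled // pretend every attempt stalls
}

// runWithRetries retries up to maxRetries times, then fails loudly.
func runWithRetries(ctx context.Context) error {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if lastErr = runBuild(ctx); lastErr == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, lastErr)
		time.Sleep(time.Second) // brief backoff before retrying
	}
	return fmt.Errorf("build failed after %d attempts: %w", maxRetries, lastErr)
}

func main() {
	if err := runWithRetries(context.Background()); err != nil {
		fmt.Println("fatal:", err)
	}
}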

@bcmills
Contributor Author

bcmills commented Dec 7, 2021

I would consider even a bounded number of retries to be problematic for the GCE builders. If we've tuned the builders properly, no retries should ever be needed, and even a single retry doubles the latency of its shard. (If there is a network communication issue to the GCE builders that occurs frequently enough to matter, we should diagnose and fix that.)

I'm more sympathetic to allowing a few bounded retries for the reverse builders, since those could reasonably have less-reliable consumer network connections. But even then, I think it would be a better experience to have an explicit SlowBot failure (for which I can remove the missing SlowBot) than to have a whole SlowBot run stalled on some particular builder that has gone AWOL.

@bcmills
Contributor Author

bcmills commented Jan 28, 2022

Now that the GopherBot leaves more detailed status comments, the problematic retry behavior is becoming clearer.

For example, on https://go.dev/cl/381734 I requested two plan9 SlowBots. The Gopher Robot has now commented three times on “SlowBots beginning”, and has “Superseded” itself twice without further explanation.

@bcmills
Contributor Author

bcmills commented Mar 18, 2022

This caused a 20-minute SlowBot delay on https://go.dev/cl/391805 PS15.

The linux-arm64-aws TryBot got stuck in testsanitizers for 20 minutes, and instead of producing a diagnosable failure it timed out the machine and started over. 😠

https://farmer.golang.org/temporarylogs?name=linux-arm64-aws&rev=6553c0d1d16000f564fc64b13600792f8662dcd5&st=0xc039bf4680:

  builder: linux-arm64-aws
      rev: 6553c0d1d16000f564fc64b13600792f8662dcd5
 buildlet: https://3.17.152.2 EC2 VM: buildlet-linux-arm64-aws-rnc35ea30
  started: 2022-03-18 21:47:04.595869047 +0000 UTC m=+18819.820715051
   status: still running

Events:
…
  2022-03-18T22:10:37Z finish_run_test:testsanitizers after 20m0s; err=error copying response: context deadline exceeded; 52.15.88.103:443
  2022-03-18T22:10:37Z test_helper_is_broken 52.15.88.103:443
  2022-03-18T22:10:37Z closed_helper 52.15.88.103:443
  2022-03-18T22:10:40Z run_test:testsanitizers 3.17.152.2:443
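
For comparison, a hedged sketch of what a diagnosable timeout could look like: run the test with a deadline and, on expiry, ask the Go process for a goroutine dump before reporting a hard failure rather than silently restarting. Everything here (runTestWithDeadline, the command invocation, the 20-minute deadline) is an assumption for illustration, not what the coordinator actually does.

// Illustrative only: run a test with a deadline and, on timeout, request a
// goroutine dump (SIGQUIT) so the stall is diagnosable, then fail instead
// of silently restarting. Unix-only; not coordinator code.
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func runTestWithDeadline(name string, args []string, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// SIGQUIT makes the Go runtime print all goroutine stacks,
		// exactly the artifact that a silent restart throws away.
		cmd.Process.Signal(syscall.SIGQUIT)
		select {
		case <-done:
		case <-time.After(10 * time.Second):
			cmd.Process.Kill()
			<-done
		}
		return fmt.Errorf("%s stalled after %v", name, d)
	}
}

func main() {
	if err := runTestWithDeadline("go", []string{"test", "./..."}, 20*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, "FAIL:", err)
		os.Exit(1)
	}
}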

@bcmills
Contributor Author

bcmills commented Apr 7, 2022

I noticed a 1-minute delay on https://go.dev/cl/360354 PS9 due to an early failure on the windows-amd64-2016 builder.

I didn't catch the error message in the logs before the builder restarted.

@gopherbot

Change https://go.dev/cl/406216 mentions this issue: cmd/coordinator: consolidate and increase global VM deletion timeout

gopherbot pushed a commit to golang/build that referenced this issue May 16, 2022
We had a lot of flexibility over timeouts, which made them harder to maintain.
Consolidate them into a single timeout in the pool package, and
increase it from 45 minutes to 2 hours.

There's room for improvement in how we maintain this timeout,
but I'm leaving that for future work (with a tracking issue).

Fixes golang/go#52591.
Updates golang/go#52929.
Updates golang/go#49666.
Updates golang/go#42699.

Change-Id: I2ad92648d89a714397bd8b0e1ec490fc9f6d6790
Reviewed-on: https://go-review.googlesource.com/c/build/+/406216
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
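
A rough illustration of the consolidation described in that commit; the constant name vmDeleteTimeout and the deleteVMAfter helper are assumptions for the sketch, not the real pool package API.

// Hypothetical sketch of a single, package-level VM deletion timeout in
// the spirit of the consolidation above; vmDeleteTimeout and deleteVMAfter
// are illustrative names, not the real x/build pool API.
package pool

import (
	"context"
	"time"
)

// vmDeleteTimeout is the one place the VM lifetime is configured,
// replacing several scattered per-call timeouts.
const vmDeleteTimeout = 2 * time.Hour

// deleteVMAfter calls del once a VM has lived for vmDeleteTimeout,
// unless ctx is canceled first.
func deleteVMAfter(ctx context.Context, del func()) {
	t := time.NewTimer(vmDeleteTimeout)
	defer t.Stop()
	select {
	case <-t.C:
		del()
	case <-ctx.Done():
	}
}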
@cagedmantis cagedmantis added this to Planned in Go Release Team May 17, 2022
@dmitshur
Contributor

I ran into this most recently while sending a trivial fix in https://go.dev/cl/406897, where a linux-amd64-longtest SlowBot request took almost 2 hours instead of the usual 10-15 minutes, and I spent the wait making progress on investigating this issue. I found that the new behavior was introduced during a cmd/coordinator refactor in late 2019, and that it was a bug rather than an intentional change. We also filed issue #36629 in 2020 to investigate what happened; I've now closed it in favor of this issue. I've also opened #52998 to track progress on the testsanitizers hang specifically (also pointed out above in #42699 (comment)).

Moving this issue to NeedsFix because I don't believe further investigation is needed here—I know where the problem is. I plan to discuss this with the team and will update this issue soon after that.

@dmitshur dmitshur added the NeedsFix label and removed the NeedsInvestigation label May 19, 2022
@gopherbot

Change https://go.dev/cl/407555 mentions this issue: buildlet: fix Exec to return ErrTimeout on timeout

gopherbot pushed a commit to golang/build that referenced this issue May 27, 2022
The coordinator relies on Exec reporting that the given timeout was
exceeded in order to mark a build as failed instead of retrying it.
A refactor resulted in Exec no longer doing that, despite what its
documentation promises, so fix that.

Also add a test since evidence shows that catching a regression can
be helpful.

For golang/go#42699.
Updates golang/go#35707.

Change-Id: Iacef90b83e7b81fad88a33baa6489d5157e3528f
Reviewed-on: https://go-review.googlesource.com/c/build/+/407555
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
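
A hedged sketch of the pattern this CL restores: the Exec-like call returns a distinguishable timeout error, and the caller treats it as a real build failure rather than something to retry. The errTimeout sentinel stands in for the buildlet package's exported ErrTimeout; the rest of the names are illustrative.

// Sketch of the pattern CL 407555 restores (illustrative names): an
// Exec-like call returns a distinguishable timeout error, and the caller
// marks the build failed instead of silently retrying.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTimeout plays the role of buildlet's exported ErrTimeout: the command
// hit its deadline, as opposed to failing for some transient reason.
var errTimeout = errors.New("process timed out")

// execCmd simulates running a remote command under a timeout.
func execCmd(ctx context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case <-time.After(2 * timeout): // pretend the test wedges past its deadline
		return nil
	case <-ctx.Done():
		return errTimeout // translate the deadline into the sentinel callers check
	}
}

func main() {
	err := execCmd(context.Background(), 100*time.Millisecond)
	switch {
	case errors.Is(err, errTimeout):
		fmt.Println("build failed: test timed out") // a stall: fail, do not retry
	case err != nil:
		fmt.Println("transient error, may retry:", err)
	default:
		fmt.Println("build passed")
	}
}

Checking for a sentinel with errors.Is keeps the fail-versus-retry decision in one place, which is what the lost ErrTimeout return value had broken.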
@dmitshur
Contributor

I believe the change in CL 407555 should be sufficient to resolve this issue. If there's any evidence to the contrary, it'd be helpful to know.

There were some additional timeout efficiency improvements that I considered and discussed with the team. There's agreement that we can make those improvements, but they're not necessary for resolving this problem.

@bcmills
Contributor Author

bcmills commented May 31, 2022

Hmm. As far as I can tell CL 407555 fixes the main source of the failures that led to retries, but doesn't change the fact that retries would still be attempted (and would still mask stalls) for other issues, such as communication errors with reverse SlowBots.

@bcmills
Contributor Author

bcmills commented May 31, 2022

At the very least, though, I would say that with that CL in, we can probably deprioritize further work on this issue. Reverse-builder SlowBot failures are much rarer than failures on the js-wasm TryBots or linux-amd64-longtest SlowBots.

@dmitshur
Contributor

The LUCI infrastructure doesn't have this problem. For a recent instance and comparison, see here.

At this point of the LUCI migration I don't expect we'll resolve this issue in the old coordinator infrastructure.

@dmitshur dmitshur closed this as not planned Jan 29, 2024