x/build/cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck even after quota becomes available #55947

Closed
dmitshur opened this issue Sep 29, 2022 · 4 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.

@dmitshur
Contributor

dmitshur commented Sep 29, 2022

Consider the TryBot run on CL 431956. Most of its builds have finished, with the exception of linux-amd64-boringcrypto, linux-amd64-nounified, and linux-amd64-unified, which have been in the waiting_for_machine state for more than 20 hours.

Those TryBot runs may have been started at a busy time, but they aren't being scheduled even after the GCE pool capacity is available.

GCE pool capacity: 3/20000 instances; 0/10000 CPUs, 24/200 C2_CPUS, 0/800 N2_CPUS, 0/800 N2D_CPUS

This is a problem for TryBot speed (meta tracking issue #17104).

CC @golang/release, @thanm, @cherrymui.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 29, 2022
@dmitshur dmitshur added this to the Unreleased milestone Sep 29, 2022
@dmitshur dmitshur added the Builders x/build issues (builders, bots, dashboards) label Sep 29, 2022
@dmitshur dmitshur changed the title cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck x/build/cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck Sep 29, 2022
@thanm
Contributor

thanm commented Sep 29, 2022

Nice detective work. Infinite loop in checkDep (or whatever it is) looks more actionable than general slowness...

@gopherbot

Change https://go.dev/cl/435496 mentions this issue: cmd/coordinator: always return on third failure in checkDep

gopherbot pushed a commit to golang/build that referenced this issue Oct 4, 2022
The previous logic in checkDep had a small possibility of looping
forever if the third call to maintner was okay, but all successive
ones fail. I'm not sure if that's what happened in go.dev/issue/55947,
but rewrite the code to avoid that possibility anyway.

For golang/go#55947.

Change-Id: I28cd14cf8aa82b80d446ec9dbc3b118d4ef8b0fc
Reviewed-on: https://go-review.googlesource.com/c/build/+/435496
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
@dmitshur
Contributor Author

dmitshur commented Oct 4, 2022

We should wait until CL 435496 is deployed and see whether this happens again. It's possible that the CL is all that's needed to fix this problem.

@dmitshur dmitshur added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Oct 4, 2022
@dmitshur dmitshur self-assigned this Oct 4, 2022
@dmitshur
Contributor Author

We have learned that the overall problem of quota exhaustion still happens after coordinator restarts during a busy time; @prattmic and @heschi have made more progress on understanding that, but it's not the problem that this issue was tracking.

I suspect this particular instance was really a rare case of checkDep stalling, and I haven't seen it happen again after CL 435496. Closing for now.

@dmitshur dmitshur changed the title x/build/cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck x/build/cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck even after quota becomes available Oct 28, 2022
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 28, 2022
@golang golang locked and limited conversation to collaborators Oct 28, 2023
3 participants