x/build/cmd/coordinator: failing to find TryBot work because it cannot handle multiple CLs with same Change-Id #43312
Thanks for reporting this. Coordinator finds work by using Gerrit's API, so if Gerrit is having issues, that would explain trybots not starting. We should figure out if the problem is on the side of Gerrit or not. I see at https://farmer.golang.org/#trybots there are active trybot runs right now, so maybe this is an issue only affecting a subset of CLs?
They all seem to be stuck though. All builds have completed, yet the trybot run isn't completed.
From coordinator logs, it seems clear that the problem is coordinator is failing to find work:
Coordinator will need to be updated to handle it without an error.
Is it possible to Gerrit super-user force delete https://go-review.googlesource.com/c/build/+/279515 or something? Maybe that would help unstick the situation?
Yes, deleting the CL may work. We should also make coordinator more resilient so this won't need to be dealt with manually if a duplicate Gerrit CL happens again in the future. I suspect the problem could be how we're specifying the ID in the call here:
	comments, err := gerritc.ListChangeComments(ctx, ci.ID)
	if err != nil {
		return nil, err
	}
But I can't reproduce it locally with
Probably. I think it's a bug that Gerrit was able to get into this state in the first place though. I thought that Gerrit is supposed to ensure a 1:1 correspondence between Branch+Change-Id and CL. I'm not sure how @aclements was able to create those duplicate CLs in the first place.
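For what it's worth, a violation of that 1:1 expectation is easy to detect on the client side. The sketch below groups CLs by project, branch, and Change-Id and reports any group with more than one CL number. The `cl` type, the `duplicates` helper, and the sample values in `main` are all made up for illustration; the `project~branch~Change-Id` key simply mirrors the triplet form Gerrit accepts as a change identifier.

```go
package main

import (
	"fmt"
	"sort"
)

// cl is a hypothetical, pared-down change record.
type cl struct {
	Project  string
	Branch   string
	ChangeID string // the Change-Id footer value
	Number   int    // the numeric CL number
}

// duplicates groups CLs by project, branch and Change-Id and returns
// only the groups that contain more than one CL number.
func duplicates(cls []cl) map[string][]int {
	groups := make(map[string][]int)
	for _, c := range cls {
		key := c.Project + "~" + c.Branch + "~" + c.ChangeID
		groups[key] = append(groups[key], c.Number)
	}
	for key, nums := range groups {
		if len(nums) < 2 {
			delete(groups, key) // deleting during range is safe in Go
			continue
		}
		sort.Ints(nums) // stable output order for logging
	}
	return groups
}

func main() {
	// Made-up sample data: two CLs share a project, branch and Change-Id.
	cls := []cl{
		{Project: "build", Branch: "master", ChangeID: "Iexample", Number: 101},
		{Project: "build", Branch: "master", ChangeID: "Iexample", Number: 100},
		{Project: "go", Branch: "master", ChangeID: "Iexample", Number: 102},
	}
	for key, nums := range duplicates(cls) {
		fmt.Printf("%s is shared by CLs %v\n", key, nums)
	}
}
```

A check like this could let a client fall back to the numeric CL number, which stays unique per CL, whenever the triplet turns out to be ambiguous.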
Ok, I could reproduce it by hitting the maintapi server directly, after some tries:
It succeeds more often than it fails, so maybe this isn't what's causing TryBots not to start. (Although it's possible it is what caused coordinator to get into a bad/stuck state, at least in its ability to start/complete TryBot runs.) I think a good next step will be to restart coordinator and see how it behaves, before we try to do more. I'll do that now.
Restarting alone doesn't seem to make a big difference: TryBot runs are still reaching "Builds remaining: 0" and not completing (e.g., https://farmer.golang.org/try?commit=0ca67663 reached that state some time ago and is still "Active"). This seems to be a problem that needs to be understood and fixed in addition to the "findTryWork" one. It needs more investigation, and I'll resume working on it tomorrow morning since it affects TryBot runs only.
Change https://golang.org/cl/279672 mentions this issue:
I've deployed CL 279672 for the problem spotted in #43312 (comment), and I'm seeing that much TryBot activity has resumed.
This problem is resolved now. I've filed 3 issues for some follow-up tasks:
Go Bot hasn't acked any Run-TryBot+1 requests since about 1:15pm.
https://go-review.googlesource.com/c/go/+/276653 (patch set 10) is the most recent CL I've seen to get acknowledged by Go Bot, after Run-TryBot+1 at 1:00pm.
https://go-review.googlesource.com/c/go/+/277934 (patch set 7) is the earliest CL I've seen to not get acknowledged by Go Bot, after Run-TryBot+1 at 1:17pm.
@aclements mentions having uploaded duplicate CLs to Gerrit around this time window, which they said seemed to have broken Gerrit. Perhaps the CLs broke Go Bot too?
/cc @golang/release