x/build/cmd/coordinator: failing to find TryBot work because it cannot handle multiple CLs with same Change-Id #43312

Closed
mdempsky opened this issue Dec 21, 2020 · 11 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Soon This needs to be done soon. (regressions, serious bugs, outages)
Milestone

Comments

@mdempsky
Member

mdempsky commented Dec 21, 2020

Go Bot hasn't acked any Run-TryBot+1 requests since about 1:15pm.

https://go-review.googlesource.com/c/go/+/276653 (patch set 10) is the most recent CL I've seen to get acknowledged by Go Bot, after Run-TryBot+1 at 1:00pm.

https://go-review.googlesource.com/c/go/+/277934 (patch set 7) is the earliest CL I've seen to not get acknowledged by Go Bot, after Run-TryBot+1 at 1:17pm.

@aclements mentions having uploaded duplicate CLs to Gerrit around this time window, which they said seemed to have broken Gerrit. Perhaps the CLs broke Go Bot too?

/cc @golang/release

@mdempsky mdempsky added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Soon This needs to be done soon. (regressions, serious bugs, outages) labels Dec 21, 2020
@mdempsky mdempsky added this to the Go1.16 milestone Dec 21, 2020
@dmitshur dmitshur changed the title build: trybots are down x/build: trybots are down Dec 21, 2020
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Dec 21, 2020
@dmitshur
Contributor

Thanks for reporting this.

Coordinator finds work by using Gerrit's API, so if Gerrit is having issues, that would explain trybots not starting. We should figure out if the problem is on the side of Gerrit or not.

I see at https://farmer.golang.org/#trybots there are active trybot runs right now, so maybe this is an issue only affecting a subset of CLs?

@dmitshur
Contributor

> I see at https://farmer.golang.org/#trybots there are active trybot runs right now,

They all seem to be stuck though. All builds have completed, yet the trybot run isn't completed.

@dmitshur
Contributor

From the coordinator logs, it seems clear that the problem is that coordinator is failing to find work:

$ kubectl logs  coordinator-deployment-6cff4d4d8d-5j6f5  | grep "failed to find trybot work:"
[...]
2020/12/21 23:37:21 failed to find trybot work: rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A
2020/12/21 23:37:36 failed to find trybot work: rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A
2020/12/21 23:37:52 failed to find trybot work: rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A
2020/12/21 23:38:06 failed to find trybot work: rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A
2020/12/21 23:38:22 failed to find trybot work: rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A

Ib1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5 is the Change-Id of the duplicate CL that was mentioned.

Coordinator will need to be updated to handle it without an error.
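
For readers decoding the log lines above: the identifier in the error is URL-escaped (`%7E` is `~`, `%0A` is a trailing newline). A minimal sketch, assuming the identifier is in Gerrit's `project~branch~Change-Id` form; `parseTriplet` is a hypothetical helper for illustration, not part of coordinator or maintner:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseTriplet (hypothetical helper) decodes a URL-escaped identifier of the
// form project~branch~Change-Id, as it appears in the error message.
func parseTriplet(escaped string) (project, branch, changeID string, err error) {
	raw, err := url.PathUnescape(escaped)
	if err != nil {
		return "", "", "", err
	}
	raw = strings.TrimSpace(raw) // drop the trailing newline (%0A)
	parts := strings.Split(raw, "~")
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("want project~branch~Change-Id, got %q", raw)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	p, b, c, err := parseTriplet("build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A")
	if err != nil {
		panic(err)
	}
	fmt.Println("project:", p)
	fmt.Println("branch:", b)
	fmt.Println("Change-Id:", c)
}
```

So the failing lookup is for project `build`, branch `master`, and the Change-Id named above.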

@dmitshur dmitshur changed the title x/build: trybots are down x/build/cmd/coordinator: failing to find TryBot work because it cannot handle multiple CLs with same Change-Id Dec 21, 2020
@dmitshur dmitshur pinned this issue Dec 21, 2020
@mdempsky
Member Author

Is it possible for a Gerrit super-user to force-delete https://go-review.googlesource.com/c/build/+/279515 or something? Maybe that would help unstick the situation?

@dmitshur
Contributor

Yes, deleting the CL may work.

We should also make coordinator more resilient so this won't need to be dealt with manually if a duplicate Gerrit CL happens again in the future.

I suspect the problem could be how we're specifying the ID in the call here:

comments, err := gerritc.ListChangeComments(ctx, ci.ID)
if err != nil {
	return nil, err
}

But I can't reproduce it locally with maintq try-work. Not sure why yet.

@mdempsky
Member Author

> We should also make coordinator more resilient so this won't need to be dealt with manually if a duplicate Gerrit CL happens again in the future.

Probably. I think it's a bug that Gerrit was able to get into this state in the first place though. I thought that Gerrit is supposed to ensure a 1:1 correspondence between Branch+Change-Id and CL. I'm not sure how @aclements was able to create those duplicate CLs in the first place.

@dmitshur
Contributor

Ok, I could reproduce it by hitting the maintapi server directly, after some tries:

$ maintq try-work
rpc error: code = 2 desc = grpc: HTTP status 404 Not Found; Multiple changes found for build%7Emaster%7EIb1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5%0A
exit status 1

It succeeds more often than it fails, so maybe this isn't what's causing TryBots not to start. (Although it's possible it is what caused coordinator to get into a bad/stuck state, at least in its ability to start/complete TryBot runs.)

I think a good next step will be to restart coordinator and see how it behaves, before we try to do more. I'll do that now.

@dmitshur
Contributor

dmitshur commented Dec 22, 2020

Restarting alone doesn't seem to make a big difference: TryBot runs are still reaching "Builds remaining: 0" and not completing (e.g., https://farmer.golang.org/try?commit=0ca67663 reached that state some time ago and is still "Active"). This seems to be a problem that needs to be understood and fixed in addition to the "findTryWork" one.

This needs some more investigation, and I'll resume working on it tomorrow morning since it affects TryBot runs only.

@dmitshur dmitshur self-assigned this Dec 22, 2020
@gopherbot

Change https://golang.org/cl/279672 mentions this issue: maintner/maintnerd/maintapi: switch to project~numericId Gerrit ID type
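
To illustrate the difference the CL title refers to: Gerrit's REST API accepts a change identifier either as `project~branch~Change-Id` or as `project~number`. The first form matches more than one CL when a Change-Id is duplicated on a branch, while the second names a single CL by its number. A minimal sketch follows; `tripletID` and `numericID` are hypothetical helpers, not the functions touched by the CL, and the change number is the one from the CL URL mentioned earlier in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// tripletID (hypothetical) builds the project~branch~Change-Id form.
// It is ambiguous if two CLs share a Change-Id on the same branch.
func tripletID(project, branch, changeID string) string {
	return url.PathEscape(project) + "~" + url.PathEscape(branch) + "~" + changeID
}

// numericID (hypothetical) builds the project~number form, which refers to
// one CL by its change number.
func numericID(project string, number int) string {
	return fmt.Sprintf("%s~%d", url.PathEscape(project), number)
}

func main() {
	fmt.Println(tripletID("build", "master", "Ib1c7ae1e914116dd8a4440db8ee46d6af3ed1ad5"))
	fmt.Println(numericID("build", 279515))
}
```

The project name is escaped because names containing a slash have to be URL-encoded when used in a request path.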

@dmitshur
Contributor

I've deployed CL 279672 for the problem spotted in #43312 (comment), and I'm seeing that much TryBot activity has resumed.

@dmitshur
Contributor

This problem is resolved now.

I've filed 3 issues for some follow-up tasks:

@rolandshoemaker rolandshoemaker unpinned this issue Dec 22, 2020
@golang golang locked and limited conversation to collaborators Dec 22, 2021

3 participants