x/build/cmd/coordinator: gitmirror health check is flaky because it reports status of a single gitmirror instance, as determined by load balancer #37828

Closed
dmitshur opened this issue Mar 12, 2020 · 5 comments
Labels: Builders, FrozenDueToAge, NeedsFix

@dmitshur
Contributor

https://farmer.golang.org/status/gitmirror is intermittently reporting an issue with git mirroring of the main repository:

# "gitmirror" status: Git mirroring
# Notes: https://github.com/golang/build/tree/master/cmd/gitmirror
Error: repo go: hung? no activity since last success 4h13m30.458762593s ago

image

It's normal for it to not have activity for some number of minutes, but 4+ hours seems unexpected.

Edit: This may be a false positive of some sort. I refreshed the status of https://farmer.golang.org/status/gitmirror just a few minutes after starting to type this, and it suddenly became "ok". I refreshed again, and it came back to reporting a problem. Now it's back to "ok". It should not be going from "ok" to "no activity for 4+ hours".

/cc @cagedmantis @toothrot

@dmitshur added the Builders and NeedsInvestigation labels Mar 12, 2020
@dmitshur added this to the Unreleased milestone Mar 12, 2020
@toothrot
Contributor

There are 2 gitmirror instances. This was resolved by restarting the failing gitmirror instance (deleting the pod), which we discovered by proxying to the pods and investigating their statuses.

One idea would be to add Stackdriver metrics to gitmirror for monitoring this. Another would be to improve the health check so that it checks all the pods, rather than relying on the kubedns load balancing to land on a healthy one.
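
To illustrate the second idea, here is a minimal sketch of a health check that queries every instance and reports all failures, so the result no longer depends on which instance a load balancer happens to pick. This is not the actual coordinator code: the `instance` type, the `checkAll`, `healthy`, and `failing` functions, and the instance names are all hypothetical, and the per-instance check is stubbed out instead of making a real request to a pod.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// instance is a hypothetical stand-in for one gitmirror pod.
// Check returns nil when that pod considers itself healthy.
type instance struct {
	Name  string
	Check func() error
}

// healthy is a stub check that always succeeds.
func healthy() error { return nil }

// failing returns a stub check that always fails with msg.
func failing(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// checkAll runs the check on every instance and combines all failures
// into one error. It returns nil only if every instance is healthy.
func checkAll(instances []instance) error {
	if len(instances) == 0 {
		return errors.New("no instances found")
	}
	var problems []string
	for _, in := range instances {
		if err := in.Check(); err != nil {
			problems = append(problems, fmt.Sprintf("instance %s: %v", in.Name, err))
		}
	}
	if len(problems) == 0 {
		return nil
	}
	return errors.New(strings.Join(problems, "; "))
}

func main() {
	// Hypothetical instance names; one healthy pod and one stuck pod.
	instances := []instance{
		{Name: "gitmirror-a", Check: healthy},
		{Name: "gitmirror-b", Check: failing("repo go: hung? no activity since last success")},
	}
	if err := checkAll(instances); err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("ok")
}
```

Because every instance is consulted on every check, a single stuck pod makes the status fail consistently instead of flipping between "ok" and an error from one refresh to the next.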

@dmitshur
Contributor Author

This issue is happening again now. Refreshing farmer.golang.org sometimes shows that Git mirroring is okay, other times it's showing "repo go: hung? no activity since last success 8h41m49.610989692s ago":

image

There isn't a real mirroring problem, because a 4-hour-old commit in the main Go repo has been successfully mirrored to GitHub.

From looking at the logs for the 2 gitmirror instances, I see a number of errors from the Gerrit servers today. That may be what caused at least one of the instances to get into a bad state.

@dmitshur self-assigned this Mar 31, 2020
@dmitshur changed the title from "x/build/cmd/gitmirror: reporting "Go repo mirroring hung for 4+ hours" then "ok" then again" to "x/build/cmd/coordinator: gitmirror health check is flaky because it reports status of a single gitmirror instance, as determined by load balancer" Mar 31, 2020
@gopherbot

Change https://golang.org/cl/226678 mentions this issue: cmd/coordinator: report errors on all gitmirror instances

@dmitshur
Contributor Author

> to improve the health check so that it checks all the pods

I've sent CL 226678 that implements this.

@dmitshur added the NeedsFix label and removed the NeedsInvestigation label Apr 1, 2020
@dmitshur
Contributor Author

dmitshur commented Apr 1, 2020

Tested the CL above, it worked as expected:

image

No more flakiness, refreshing the page always produces the same result. 🎉

I've redeployed gitmirror to resolve its current issue. Now that the reporting at https://farmer.golang.org/#health is reliable, we can deal with gitmirror problems if they come up in the future.
