x/build: update k8s clusters #33529

Closed
andybons opened this issue Aug 7, 2019 · 10 comments
Labels: Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done)

@andybons
Member

andybons commented Aug 7, 2019

Placeholder issue for upgrading our k8s clusters

@andybons andybons added Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done. labels Aug 7, 2019
@gopherbot gopherbot added this to the Unreleased milestone Aug 7, 2019
@andybons
Member Author

andybons commented Aug 7, 2019

[Development] successfully disabled Kubernetes Dashboard

$ gcloud container clusters update buildlets --project=<dev project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED
$ gcloud container clusters update go --project=<dev project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED

@andybons
Member Author

andybons commented Aug 7, 2019

[Production] successfully disabled Kubernetes Dashboard add-on

$ gcloud container clusters update buildlets --project=<project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED
$ gcloud container clusters update go --project=<project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED
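To confirm that the add-on is really off after an update like the ones above, the cluster's add-on config can be read back. This is a minimal sketch, not something run in this thread: it assumes the `addonsConfig.kubernetesDashboard.disabled` field of the GKE cluster resource, and `dashboard_disabled` is a hypothetical helper name. The project name stays a placeholder, as in the commands above.

```shell
# dashboard_disabled CLUSTER PROJECT ZONE
# Succeeds (exit 0) if the cluster reports the Kubernetes Dashboard add-on
# as disabled, fails (exit 1) otherwise, and returns 2 if gcloud itself fails.
# Assumption: the add-on state is exposed as
# addonsConfig.kubernetesDashboard.disabled on the cluster resource.
dashboard_disabled() {
  cluster="$1"
  project="$2"
  zone="$3"
  value="$(gcloud container clusters describe "$cluster" \
    --project="$project" --zone="$zone" \
    --format='value(addonsConfig.kubernetesDashboard.disabled)')" || return 2
  # Accept either capitalization of the boolean that gcloud prints.
  case "$value" in
    [Tt]rue) return 0 ;;
    *) return 1 ;;
  esac
}
```

Example use: `dashboard_disabled buildlets "<project name>" us-central1-f && echo "dashboard disabled"`.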

@andybons
Member Author

andybons commented Aug 7, 2019

Stepping away until 4pm ET. Will begin upgrade of k8s clusters then.

@andybons
Member Author

andybons commented Aug 7, 2019

[Dev]
Updated masters and nodes to 1.12.7-gke.25

No issues found so far.

Will continue with prod tomorrow morning.

@andybons andybons self-assigned this Aug 7, 2019
@andybons
Member Author

Delayed, but upgrading prod cluster now.

@andybons
Member Author

andybons commented Aug 12, 2019

Updated masters and nodes to 1.12.7-gke.25

Scaleway builders are having issues. From https://farmer.golang.org:

# "scaleway" status: Scaleway linux/arm machines
# Notes: https://github.com/golang/build/tree/master/env/linux-arm/scaleway
Warn: scaleway-prod-16 missing, never seen (at least 12m7s)
Warn: scaleway-prod-17 missing, never seen (at least 12m7s)
Warn: scaleway-prod-18 missing, never seen (at least 12m7s)
Warn: scaleway-prod-20 missing, not seen for 11m20s
Warn: scaleway-prod-24 missing, not seen for 11m18s
Warn: scaleway-prod-25 missing, not seen for 11m18s
Warn: scaleway-prod-26 missing, not seen for 11m9s
Warn: scaleway-prod-27 missing, not seen for 11m20s
Warn: scaleway-prod-30 missing, not seen for 11m8s
Warn: scaleway-prod-31 missing, not seen for 11m15s
Error: 10 machines missing, 20% of capacity

Investigating

@andybons
Member Author

From the scaleway machine (scaleway-prod-16):

systemctl status rundockerbuildlet.service
● rundockerbuildlet.service - Run Buildlets in Docker
   Loaded: loaded (/etc/systemd/user/rundockerbuildlet.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-08-12 15:41:30 UTC; 7s ago
 Main PID: 6142 (rundockerbuildl)
   Memory: 1.6M
      CPU: 1.387s
   CGroup: /system.slice/rundockerbuildlet.service
           └─6142 /usr/local/bin/rundockerbuildlet -basename=scaleway -image=gobuilder-arm-scaleway:latest -n=1

Aug 12 15:41:34 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:35 Creating scaleway-prod-16 ...
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:35 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:36 Creating scaleway-prod-16 ...
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:36 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:37 Creating scaleway-prod-16 ...
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:37 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.

@andybons
Member Author

Cleaned up stopped container noted in that output. Will continue on other machines.

@andybons
Member Author

ssh -i ~/keys/id_ed25519_golang1 root@IP 'docker rm $(docker ps -a -q)'

All Scaleway Docker containers are back up now and talking to the coordinator.

No other issues seen. Will reopen bug if another issue comes up.
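The `docker rm $(docker ps -a -q)` command above attempts to remove every container on the host and relies on Docker refusing to remove running ones. A narrower sketch, assuming only the container holding the conflicting name needs to go; `remove_stale_container` is a hypothetical helper, not part of rundockerbuildlet:

```shell
# remove_stale_container NAME
# Removes the named container unless Docker reports it as running.
# Does nothing if no container with that name exists.
remove_stale_container() {
  name="$1"
  # docker inspect prints the container state, e.g. "exited" or "running";
  # it exits non-zero when the container does not exist.
  status="$(docker inspect --format '{{.State.Status}}' "$name" 2>/dev/null)" || return 0
  case "$status" in
    running) echo "skip $name: still running" ;;
    # Without -f, docker rm may still refuse some states (e.g. paused).
    *) docker rm "$name" ;;
  esac
}
```

For the host in the log above this would be called as `remove_stale_container scaleway-prod-16`.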

@toothrot
Contributor

rundockerbuildlet tries to clean up these containers itself, but is failing to remove them.

It tries to remove exited containers here: https://github.com/golang/build/blob/master/cmd/rundockerbuildlet/rundockerbuildlet.go#L91-L93

It also tries to remove "Created" containers that never reach the running status here: https://github.com/golang/build/blob/master/cmd/rundockerbuildlet/rundockerbuildlet.go#L111-L114

Based on the log lines you posted, I would expect to see an error if it tried and failed to remove one of these containers. I'm guessing it's not detecting the stale container for some reason.

It's possible the container ended up in a status we're not handling:

status | One of created, restarting, running, removing, paused, exited, or dead.

Even if that were the case, I would expect the logic in L111-L114 to handle this correctly.

It's also possible that our logic in L91-L93 isn't fetching the status properly from the formatted string (an extra space in a name? multiple names?).

It's hard to tell the root cause at this point as all the impacted hosts have been fixed manually, so we can't inspect the conflicting container status.
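To illustrate the formatted-string hypothesis above, here is a sketch of a parse that tolerates spaces in the status text and several names on one line. It is not the rundockerbuildlet logic. It assumes input produced by something like `docker ps -a --format '{{.Names}}\t{{.Status}}'`, that running containers have a status beginning with "Up", and that multiple names are comma-separated; `stale_containers` is a hypothetical name.

```shell
# stale_containers reads "names<TAB>status" lines on stdin and prints one
# container name per line for every entry whose status does not start
# with "Up" (assumed to mean "running").
stale_containers() {
  # Split on tab only, so spaces inside the status text stay intact.
  while IFS="$(printf '\t')" read -r names status; do
    case "$status" in
      Up*) continue ;;
    esac
    # A container may carry several comma-separated names; emit each one
    # with surrounding spaces trimmed.
    printf '%s\n' "$names" | tr ',' '\n' | sed 's/^ *//; s/ *$//'
  done
}
```

Using a tab as the separator avoids guessing where the name ends and the status begins, which is the kind of ambiguity an extra space or a second name could introduce.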

@golang golang locked and limited conversation to collaborators Aug 11, 2020