x/build: one linux/arm builder (scaleway-prod-50) is missing and not coming back #32801

Closed
dmitshur opened this issue Jun 27, 2019 · 2 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
@dmitshur

From https://farmer.golang.org/#health:

Warn: scaleway-prod-50 missing, not seen for 66h58m42s
Warn: 1 machines missing, 2% of capacity

Only 1 machine out of 50 is missing (98% of linux/arm builder capacity is still there), so this isn't a serious or urgent issue.

But we need to investigate why it's not coming back. The x/build/cmd/scaleway daemon is supposed to monitor and restart wedged instances.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 27, 2019
@dmitshur dmitshur added this to the Unreleased milestone Jun 27, 2019
@dmitshur dmitshur changed the title x/build: one scaleway-prod-50 builder missing and not coming back x/build: one linux/arm builder (scaleway-prod-50) is missing and not coming back Jun 27, 2019
@dmitshur dmitshur self-assigned this Jun 27, 2019
@dmitshur

dmitshur commented Jun 27, 2019

The scaleway deployment had 2 restarts:

$ kubectl get pods | grep scaleway
scaleway-deployment-6646f89654-ln8ts       1/1       Running   2          4d

x/build/cmd/scaleway is not very tolerant of network errors; it tends to treat them as fatal and bail out. That's what happened 4 days ago:

$ kubectl logs -p scaleway-deployment-6646f89654-ln8ts
2019/06/24 16:15:52 scaleway instance checker daemon running.
2019/06/24 16:16:22 Get https://api.scaleway.com/servers?per_page=100: dial tcp 212.47.225.71:443: i/o timeout

(Not sure what happened in the restart prior to that one, but likely a similar timeout issue.)

Since then, scaleway has been valiantly trying to restart the 50th server with no luck:

$ kubectl logs  scaleway-deployment-6646f89654-ln8ts
2019/06/24 16:16:35 scaleway instance checker daemon running.
2019/06/24 16:16:52 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 16:16:52 reboot("scaleway-prod-50"): <nil>
2019/06/24 16:26:54 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 16:27:03 reboot("scaleway-prod-50"): <nil>
2019/06/24 16:37:05 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 16:37:13 reboot("scaleway-prod-50"): <nil>
2019/06/24 16:47:16 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 16:47:24 reboot("scaleway-prod-50"): <nil>
2019/06/24 16:57:26 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 16:57:35 reboot("scaleway-prod-50"): <nil>
[...]
2019/06/24 20:09:40 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 20:09:48 reboot("scaleway-prod-50"): <nil>
2019/06/24 20:19:50 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 20:19:59 reboot("scaleway-prod-50"): <nil>
2019/06/24 20:30:01 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 20:30:10 reboot("scaleway-prod-50"): <nil>
2019/06/24 20:40:12 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/24 20:40:21 reboot("scaleway-prod-50"): <nil>
[...]
2019/06/25 00:33:35 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/25 00:33:35 reboot("scaleway-prod-50"): <nil>
2019/06/25 00:43:37 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/25 00:43:38 reboot("scaleway-prod-50"): <nil>
[...]
2019/06/25 16:53:58 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/25 16:54:07 reboot("scaleway-prod-50"): <nil>
2019/06/25 17:04:09 rebooting old running-but-disconnected "scaleway-prod-50" server...
[...]
2019/06/26 06:53:57 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/26 06:53:57 reboot("scaleway-prod-50"): <nil>
2019/06/26 07:03:59 rebooting old running-but-disconnected "scaleway-prod-50" server...
[...]
2019/06/26 18:21:33 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/26 18:21:42 reboot("scaleway-prod-50"): <nil>
2019/06/26 18:31:44 rebooting old running-but-disconnected "scaleway-prod-50" server...
[...]
2019/06/27 01:36:58 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/27 01:37:06 reboot("scaleway-prod-50"): <nil>
2019/06/27 01:47:08 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/27 01:47:09 reboot("scaleway-prod-50"): <nil>
2019/06/27 01:57:11 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/27 01:57:19 reboot("scaleway-prod-50"): <nil>
2019/06/27 02:07:23 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/27 02:07:23 reboot("scaleway-prod-50"): <nil>
2019/06/27 02:17:26 rebooting old running-but-disconnected "scaleway-prod-50" server...
2019/06/27 02:17:26 reboot("scaleway-prod-50"): <nil>
2019/06/27 02:27:28 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 02:37:30 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 02:47:32 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 02:57:34 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:07:37 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:17:39 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:27:41 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:37:43 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:47:46 rebooting old running-but-disconnected "scaleway-prod-36" server...
2019/06/27 03:47:46 reboot("scaleway-prod-36"): <nil>
2019/06/27 03:47:46 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 03:57:48 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 04:07:50 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 04:17:52 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 04:27:55 server "scaleway-prod-50" in state "stopping"; not creating
2019/06/27 04:37:57 server "scaleway-prod-50" in state "stopping"; not creating

It was hopelessly wedged. I manually logged in to the Scaleway UI and moved it out of the way. scaleway was then able to successfully create a new instance of "scaleway-prod-50":

2019/06/27 04:47:59 Doing req "[...]"
2019/06/27 04:48:00 Create of 50: 201 Created
2019/06/27 04:48:02 Powering on scaleway-prod-50 ([...]) = <nil>

https://farmer.golang.org/status/scaleway now says "ok".

This issue is resolved now. (If failures such as this one become very common, we can do more work to automate their resolution.)
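As a rough sketch of what a first step toward that automation might look like, the daemon could count consecutive reboots per server and flag a server once it crosses a threshold, instead of rebooting it every 10 minutes indefinitely. The `rebootTracker` type, its methods, and the limit of 3 are all hypothetical and are not part of x/build/cmd/scaleway. This only detects the wedged state; moving the instance out of the way is a separate step that is not covered here.

```go
package main

import "fmt"

// rebootTracker counts consecutive reboot attempts per server name.
// Hypothetical helper, not part of x/build.
type rebootTracker struct {
	limit    int
	attempts map[string]int
}

func newRebootTracker(limit int) *rebootTracker {
	return &rebootTracker{limit: limit, attempts: make(map[string]int)}
}

// recordReboot notes one more reboot of name and reports whether the
// server has reached the limit and should be escalated.
func (t *rebootTracker) recordReboot(name string) (escalate bool) {
	t.attempts[name]++
	return t.attempts[name] >= t.limit
}

// recordConnected resets the count once the server is seen again.
func (t *rebootTracker) recordConnected(name string) {
	delete(t.attempts, name)
}

func main() {
	tracker := newRebootTracker(3)
	for i := 1; i <= 4; i++ {
		if tracker.recordReboot("scaleway-prod-50") {
			fmt.Printf("attempt %d: escalate (for example, alert a human)\n", i)
		} else {
			fmt.Printf("attempt %d: reboot and wait\n", i)
		}
	}
}
```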

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 27, 2019
@dmitshur

Note: I'm getting an "Error: server should be running" error from the Scaleway UI when trying to forcibly shut down or delete the wedged instance, so resolving this kind of issue is unlikely to be automatable.

The wedged instance is moved out of the way so it's not a problem; I'll try again in a few days and file a ticket with them if it's still un-deletable by then.
