x/build/cmd/coordinator: add health check for root filesystem of the Mac bastion host not being read-only #32449

Closed
bradfitz opened this issue Jun 5, 2019 · 12 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bradfitz
Contributor

bradfitz commented Jun 5, 2019

The Macs are down again:

https://farmer.golang.org/status/macs

# "macs" status: MacStadium Mac VMs
# Notes: https://github.com/golang/build/tree/master/env/darwin/macstadium
Warn: macstadium_host01a missing, not seen for 46h18m23s
Warn: macstadium_host01b missing, not seen for 54h25m13s
Warn: macstadium_host02a missing, not seen for 54h25m0s
Warn: macstadium_host02b missing, not seen for 48h3m37s
Warn: macstadium_host04b missing, not seen for 47h55m10s
Warn: macstadium_host07a missing, not seen for 46h17m36s
Warn: macstadium_host08a missing, not seen for 48h0m48s
Warn: macstadium_host08b missing, not seen for 46h9m34s
Warn: macstadium_host09a missing, not seen for 46h23m44s
Warn: macstadium_host10a missing, not seen for 112h46m24s
Warn: macstadium_host10b missing, not seen for 112h47m30s
Error: 11 machines missing, 55% of capacity

Looking at the macstadiumd host's logs:

gopher@godns:~$ sudo journalctl -f -u makemac
-- Logs begin at Wed 2019-06-05 07:30:30 PDT. --
Jun 05 08:24:56 godns makemac[2341]: 2019/06/05 08:24:56 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:56 godns makemac[2341]: 2019/06/05 08:24:56 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:24:57 godns makemac[2341]: 2019/06/05 08:24:57 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:57 godns makemac[2341]: 2019/06/05 08:24:57 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:58 godns makemac[2341]: 2019/06/05 08:24:58 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:24:59 godns makemac[2341]: 2019/06/05 08:24:59 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:59 godns makemac[2341]: 2019/06/05 08:24:59 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:00 godns makemac[2341]: 2019/06/05 08:25:00 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:00 godns makemac[2341]: 2019/06/05 08:25:00 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:01 godns makemac[2341]: 2019/06/05 08:25:01 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:02 godns makemac[2341]: 2019/06/05 08:25:02 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"

Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:04 godns makemac[2341]: 2019/06/05 08:25:04 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:05 godns makemac[2341]: 2019/06/05 08:25:05 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:05 godns makemac[2341]: 2019/06/05 08:25:05 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:06 godns makemac[2341]: 2019/06/05 08:25:06 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:06 godns makemac[2341]: 2019/06/05 08:25:06 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:07 godns makemac[2341]: 2019/06/05 08:25:07 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:07 godns makemac[2341]: 2019/06/05 08:25:07 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:08 godns makemac[2341]: 2019/06/05 08:25:08 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:09 godns makemac[2341]: 2019/06/05 08:25:09 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:09 godns makemac[2341]: 2019/06/05 08:25:09 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:10 godns makemac[2341]: 2019/06/05 08:25:10 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:10 godns makemac[2341]: 2019/06/05 08:25:10 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"

Something's wrong with the cluster.

Related: since the coordinator now polls the makemac JSON status URL (and it's currently reporting healthy), we should include errors like "getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF" in the makemac daemon's status response JSON, so they can be shown in the coordinator health output.

/cc @andybons @bcmills

@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 5, 2019
@gopherbot gopherbot added this to the Unreleased milestone Jun 5, 2019
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Jun 5, 2019
@dmitshur
Contributor

dmitshur commented Jun 5, 2019

Do you know if the remaining 9 machines are operating correctly? As I understood the output, the Mac builders are at half capacity rather than completely down. Or is there a problem that's keeping it from reporting that the remaining 9 machines aren't operating either?

Based on builds at https://build.golang.org/ and details at https://farmer.golang.org/#pools, it seems the remaining 9 machines are connected, but they offer host-darwin-10_10 and host-darwin-10_11 only. So we're missing 100% of builders for macOS 10.12 and 10.14, which is a more severe problem. Adding Soon label.

@dmitshur dmitshur added the Soon This needs to be done soon. (regressions, serious bugs, outages) label Jun 5, 2019
@bradfitz
Contributor Author

bradfitz commented Jun 5, 2019

Each of the hosts can run any VM type, but it doesn't actively rebalance. (And even if it did, in this case it can't even make API calls to the VMware API server, so it wouldn't be able to anyway.)

The health checker could also report failures on the connected guest types.

@bradfitz
Contributor Author

bradfitz commented Jun 5, 2019

> Do you know if the remaining 9 machines are operating correctly?

Yes, they'll probably each run fine for one build each and then kill themselves after the build and get stuck in the same state as the other 11.

@dmitshur
Contributor

dmitshur commented Jun 5, 2019

I'm starting to investigate this now.

> Yes, they'll probably each run fine for one build each and then kill themselves after the build and get stuck in the same state as the other 11.

One of them has been running for 8 minutes now:

macstadium_host03a (207.254.3.58:49234) version 23, host-darwin-10_10: connected 34h49m45.2s, idle for 9.5s
macstadium_host03b (207.254.3.58:49219) version 23, host-darwin-10_10: connected 31h30m27s, idle for 9.63s
macstadium_host04a (207.254.3.58:49231) version 23, host-darwin-10_10: connected 31h29m14.1s, idle for 2.53s
macstadium_host05a (207.254.3.58:60994) version 23, host-darwin-10_11: connected 17h18m44.5s, working for 5m0.5s
macstadium_host05b (207.254.3.58:51481) version 23, host-darwin-10_11: connected 17h8m21s, idle for 2.13s
macstadium_host06a (207.254.3.58:62984) version 23, host-darwin-10_11: connected 14h58m26.8s, idle for 700.6ms
macstadium_host06b (207.254.3.58:50924) version 23, host-darwin-10_11: connected 15h2m47.7s, idle for 2.45s
macstadium_host07b (207.254.3.58:61029) version 23, host-darwin-10_11: connected 14h47m0.6s, idle for 817.4ms
macstadium_host09b (207.254.3.58:55217) version 23, host-darwin-10_11: connected 18h52m45s, working for 8m8.3s

So we'll see what happens to it after it's done.

@dmitshur
Contributor

dmitshur commented Jun 6, 2019

That host completed the build and re-connected successfully after that.

I tried restarting the physical host04 node via MacStadium UI. It previously had one of the VMs missing but another present. Now they're both missing and not coming back. So the problem is not there.

The most immediate problem seems to be the "getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF" error in makemac output. That error happens when makemac tries to run "govc ls -json /MacStadium-ATL/host/MacMini_Cluster". I tried running it manually on the bastion host to get more information about why makemac is getting EOF, and got this:

$ govc ls -json /MacStadium-ATL/host/MacMini_Cluster
govc: open /home/gopher/.govmomi/sessions/d57643d[...]: read-only file system
$ echo $?
1

I haven't used govc in the past, so I don't know what the normal state is here. Is it normal that it's a read-only filesystem? @bradfitz Does that tell you anything?

@dmitshur
Contributor

dmitshur commented Jun 6, 2019

There was a chance makemac was running as another user (instead of the gopher user that I was SSHed in as), but that doesn't seem to be the case (see here).

The entire root filesystem on the bastion host appears to be read-only. That isn't intentional, is it? I wonder if something changed recently that would result in that being the case.

@bradfitz
Contributor Author

bradfitz commented Jun 6, 2019

It shouldn't be read-only. The filesystem probably crapped itself. Check dmesg. A reboot should fix it.

That's another thing that should be exported in makemac's status JSON. (My "Related: ..." comment above).

@dmitshur
Contributor

dmitshur commented Jun 6, 2019

Yeah, dmesg confirmed there were some IO errors and timeouts, which ultimately resulted in:

[2371027.998161] EXT4-fs (sda1): Remounting filesystem read-only

@dmitshur
Contributor

dmitshur commented Jun 6, 2019

I've restarted the bastion host. It came back up, and its filesystem is now writeable.

As a result, calls to govc have started to succeed, and makemac is making progress on re-creating the missing Mac VMs, so the cluster is coming back up for the most part (some more work may need to be done to get it to 100% up).

I suspect this will be enough to resolve the immediate issue, but there are more followup tasks to improve monitoring so we can spot some of these issues sooner.

Edit: By now, all Mac hosts are up, and https://build.golang.org is catching up on missed Mac builds.

https://farmer.golang.org/status/macs

ok

Some Mac VMs still do disappear occasionally. That has a different cause than this original outage and will need to be investigated separately.

@dmitshur dmitshur removed the Soon This needs to be done soon. (regressions, serious bugs, outages) label Jun 6, 2019
@gopherbot

Change https://golang.org/cl/181217 mentions this issue: cmd/makemac: export warnings and error state in status JSON endpoint

gopherbot pushed a commit to golang/build that referenced this issue Jun 10, 2019
…oordinator

This adds information on warnings & errors to makemac's JSON status
handler that is then parsed by the coordinator's health checking code,
which already polls this JSON endpoint.

Updates golang/go#32449
Updates golang/go#15760

Change-Id: I69bea7b07c184d1f62a358bc317376aa97018230
Reviewed-on: https://go-review.googlesource.com/c/build/+/181217
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@andybons andybons changed the title x/build: Macs down x/build: follow up on Mac outage monitoring improvements Jun 26, 2019
@dmitshur dmitshur changed the title x/build: follow up on Mac outage monitoring improvements x/build/cmd/coordinator: add health check for root filesystem of the Mac bastion host not being read-only Oct 23, 2019
@dmitshur
Contributor

This happened again as part of #35109. It will be helpful to add a health check for the bastion host root filesystem not being mounted as read-only to help diagnose this kind of issue in the future.

@gopherbot

Change https://golang.org/cl/202822 mentions this issue: cmd/makemac: report error if filesystem is unwritable

codebien pushed a commit to codebien/build that referenced this issue Nov 13, 2019
Fixes golang/go#32449

Change-Id: I35d059778ab96ef4d57236aaccb41698314d6fac
Reviewed-on: https://go-review.googlesource.com/c/build/+/202822
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@golang golang locked and limited conversation to collaborators Oct 23, 2020