x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically #37001

dmitshur · 2020-02-03T20:46:48Z

@ianlancetaylor has reported in #36996 (comment) that the openbsd-386-62 gomote instance was crashing, which made debugging an OpenBSD issue more difficult and time-consuming:

Unfortunately, the gomote then crashed before I could look at all the data. The gomote continues to crash periodically, forcing me to rebuild everything before I can do more testing.

We should investigate and try to fix that, or find another solution to make it easier to debug OpenBSD issues. This is the tracking issue for that. /cc @cagedmantis @toothrot

toothrot · 2020-02-03T20:57:04Z

@ianlancetaylor Can you post the date and time on this bug the next time this occurs? I want to check our instance and coordinator logs.

ianlancetaylor · 2020-02-03T21:22:41Z

It just happened a few minutes ago.

I'll try to capture the exact time next time.

ianlancetaylor · 2020-02-03T21:50:01Z

The openbsd-386-62 gomote just crashed again, between 13:46:57 PST and 13:49:09 PST.

toothrot · 2020-02-03T21:51:54Z

Perfect, thanks. I'll take a look.

toothrot · 2020-02-03T22:09:27Z

As I suspected, the coordinator is killing your instances, but I do not know why:

2020/02/03 21:01:25 created buildlet user-iant-openbsd-386-62-1 for user-iant (GCE VM: buildlet-openbsd-386-62-rnb65ce41)
...
2020/02/03 21:45:47 deleting VM "buildlet-openbsd-386-62-rnb65ce41" in zone "us-central1-c"; delete-at expiration ...

It looks like it was up for about 45 minutes. I'll try to figure out how this is implemented.

toothrot · 2020-02-03T22:20:11Z

The default timeout is 45 minutes: https://github.com/golang/build/blob/17a7d8724fa7128cd79bcb78e1fbe087043bf810/cmd/coordinator/coordinator.go#L140

I'm still tracing through this code, but it seems like this should happen for all GCE VMs. I'll keep digging.

bcmills · 2020-02-04T00:56:12Z

45 minutes sounds like roughly the same timescale as in #28365. Perhaps they have the same root cause?

ianlancetaylor · 2020-02-04T01:13:29Z

Thanks for looking at this. My understanding was that the coordinator would not kill the instance if I was connected to it via gomote ssh, as was the case here. But maybe my understanding was incorrect.

dmitshur · 2020-02-04T01:22:48Z

That was my understanding too, and I've seen gomote instances hang around for a long time (many hours) due to an active ssh connection. Perhaps it works for some builder types but not others.

I just remembered that issue #36802 still needs a resolution. There aren't any gomote sessions right now, so I'll take this chance to redeploy the coordinator, so we know the latest version is in use. Edit: Done, see #36802 (comment).

toothrot · 2020-02-04T17:27:48Z

My current belief is that this issue is specifically related to GCE VMs, which are a narrow-ish subset of our VMs. I'm still reading through the coordinator code to fully understand how it works before saying with confidence what is causing it, but I have my suspicions.

toothrot · 2020-02-04T18:28:01Z

OK! I believe I have tracked it down.

When using gomote ssh, a property named Expires on RemoteBuildlet is updated every minute while an SSH session is active: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/buildlet/remote.go#L171

For GCE VMs, we also track a different attribute, delete-at, in instance metadata. This attribute is not updated while SSHing, meaning we will eventually hit the default 45-minute timeout on these VMs and expire them here: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/cmd/coordinator/gce.go#L577

We could do one or more of the following:

1. Update the SSH session to also bump the instance metadata attribute where applicable.
2. Improve the expiration check in gce.go to account for active SSH sessions.
3. Not set a delete-at when SSHing, as we'll rely on the remote buildlet cleanup instead.

I'm not sure which is best yet, or if some combination of them is best. I'll keep looking. I believe most of the knowledge of this code is tied up in @bradfitz and @crawshaw.

bradfitz · 2020-02-04T18:33:14Z

I'd do (2) .... "Improve the expiration check in gce.go to account for active SSH sessions"

Sorry, I thought it already did that.

gopherbot · 2020-02-04T20:00:16Z

Change https://golang.org/cl/217722 mentions this issue: cmd/coordinator,buildlet: keep active GCE SSH sessions alive

This CL skips deleting active remote buildlets.

The coordinator has multiple ways of tracking stale buildlets. For our GCE buildlets, we periodically delete old VMs after their expiration time, typically 45 minutes after their creation. The expiration tracking in coordinator/gce.go does not account for remote buildlets, which are buildlets created by users or cmd/release. Remote buildlets have their own staleness checks and cleanup process, so we should skip the GCE specific cleanup logic for them.

This adds an additional field to the buildlet Client in order to correlate a GCE VM with a buildlet.

Updates golang/go#37001

Change-Id: Ib0acdf79c4dfbee6e0061c513f98b749d4b9cc64
Reviewed-on: https://go-review.googlesource.com/c/build/+/217722
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

toothrot · 2020-02-06T22:03:08Z

This change has been deployed. I've managed to keep a GCE based gomote session alive for hours.

I'm going to close this issue.

@ianlancetaylor Hopefully you can finish your debugging of OpenBSD now!

toothrot · 2020-02-06T22:33:16Z

I just verified that my test instance was successfully deleted by the remote buildlet cleanup process (as opposed to the abandoned VM process), as intended.

ianlancetaylor · 2020-02-06T22:53:22Z

Thanks!

dmitshur added OS-OpenBSD Builders NeedsInvestigation labels Feb 3, 2020

gopherbot added this to the Unreleased milestone Feb 3, 2020

dmitshur changed the title ~~x/build/env/openbsd-386: gomote instance is not reliable~~ x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically Feb 3, 2020

toothrot self-assigned this Feb 3, 2020

toothrot closed this as completed Feb 6, 2020

toothrot removed the NeedsInvestigation label Feb 6, 2020

dmitshur mentioned this issue May 27, 2020

x/build/cmd/gomote: 502 Bad Gateway error #28365

Open

golang locked and limited conversation to collaborators Feb 5, 2021

gopherbot added the FrozenDueToAge label Feb 5, 2021

rsc unassigned toothrot Jun 23, 2022
