x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically #37001

Closed
dmitshur opened this issue Feb 3, 2020 · 16 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge OS-OpenBSD

@dmitshur
Contributor

dmitshur commented Feb 3, 2020

@ianlancetaylor reported in #36996 (comment) that the openbsd-386-62 gomote instance was crashing, which made debugging an OpenBSD issue more difficult and time-consuming:

Unfortunately, the gomote then crashed before I could look at all the data. The gomote continues to crash periodically, forcing me to rebuild everything before I can do more testing.

We should investigate and try to fix that, or find another solution to make it easier to debug OpenBSD issues. This is the tracking issue for that. /cc @cagedmantis @toothrot

@dmitshur dmitshur added OS-OpenBSD Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 3, 2020
@gopherbot gopherbot added this to the Unreleased milestone Feb 3, 2020
@dmitshur dmitshur changed the title x/build/env/openbsd-386: gomote instance is not reliable x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically Feb 3, 2020
@toothrot
Contributor

toothrot commented Feb 3, 2020

@ianlancetaylor Can you post the date and time on this issue the next time this occurs? I want to check our instance and coordinator logs.

@ianlancetaylor
Contributor

It just happened a few minutes ago.

I'll try to capture the exact time next time.

@ianlancetaylor
Contributor

The openbsd-386-62 gomote just crashed again, between 13:46:57 PST and 13:49:09 PST.

@toothrot
Contributor

toothrot commented Feb 3, 2020

Perfect, thanks. I'll take a look.

@toothrot
Contributor

toothrot commented Feb 3, 2020

As I suspected, the coordinator is killing your instances, but I do not know why:

"2020/02/03 21:01:25 created buildlet user-iant-openbsd-386-62-1 for user-iant (GCE VM: buildlet-openbsd-386-62-rnb65ce41)
...
"2020/02/03 21:45:47 deleting VM "buildlet-openbsd-386-62-rnb65ce41" in zone "us-central1-c"; delete-at expiration ..."

It looks like it was up for about 45 minutes. I'll try to figure out how this is implemented.

@toothrot toothrot self-assigned this Feb 3, 2020
@toothrot
Contributor

toothrot commented Feb 3, 2020

The default timeout is 45 minutes: https://github.com/golang/build/blob/17a7d8724fa7128cd79bcb78e1fbe087043bf810/cmd/coordinator/coordinator.go#L140

I'm still tracing through this code, but it seems like this should happen for all GCE VMs. I'll keep digging.

@bcmills
Contributor

bcmills commented Feb 4, 2020

45 minutes sounds like roughly the same timescale as in #28365. Perhaps they have the same root cause?

@ianlancetaylor
Contributor

Thanks for looking at this. My understanding was that the coordinator would not kill the instance if I was connected to it via gomote ssh, as was the case here. But maybe my understanding was incorrect.

@dmitshur
Contributor Author

dmitshur commented Feb 4, 2020

That was my understanding too, and I've seen gomote instances hang around for a long time (many hours) due to an active ssh connection. Perhaps it works for some builder types but not others.

I've remembered that issue #36802 still needs a resolution. There aren't any gomote sessions right now, so I'll use this as a chance to redeploy the coordinator, so we know that the latest version is in use. Edit: Done, see #36802 (comment).

@toothrot
Contributor

toothrot commented Feb 4, 2020

My current belief is that this issue is specific to GCE VMs, which are a narrow-ish subset of our VMs. I'm still reading through the coordinator code to fully understand how it works before saying with confidence what is causing it, but I have my suspicions.

@toothrot
Contributor

toothrot commented Feb 4, 2020

OK! I believe I have tracked it down.

When using gomote ssh, a property named Expires on RemoteBuildlet is updated every minute while an SSH session is active: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/buildlet/remote.go#L171

For GCE VMs, we also track a different attribute, delete-at, in the instance metadata. This attribute is not updated while SSHing, meaning we will eventually hit the default 45 minute timeout on these VMs and expire them here: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/cmd/coordinator/gce.go#L577

We could do one or more of the following:

  • Update the SSH session to also bump the instance metadata attribute where applicable
  • Improve the expiration check in gce.go to account for active SSH sessions
  • Not set a delete-at when SSHing, as we'll rely on the remote buildlet cleanup instead.

I'm not sure which is best yet, or if some combination of them is best. I'll keep looking. I believe most of the knowledge of this code is tied up in @bradfitz and @crawshaw.

@bradfitz
Contributor

bradfitz commented Feb 4, 2020

I'd do (2) .... "Improve the expiration check in gce.go to account for active SSH sessions"

Sorry, I thought it already did that.

@gopherbot

Change https://golang.org/cl/217722 mentions this issue: cmd/coordinator,buildlet: keep active GCE SSH sessions alive

gopherbot pushed a commit to golang/build that referenced this issue Feb 6, 2020
This CL skips deleting active remote buildlets.

The coordinator has multiple ways of tracking stale buildlets. For our
GCE buildlets, we periodically delete old VMs after their expiration
time, typically 45 minutes after their creation. The expiration tracking
in coordinator/gce.go does not account for remote buildlets, which are
buildlets created by users or cmd/release. Remote buildlets have their
own staleness checks and cleanup process, so we should skip the GCE
specific cleanup logic for them.

This adds an additional field to the buildlet Client in order to
correlate a GCE VM with a buildlet.

Updates golang/go#37001

Change-Id: Ib0acdf79c4dfbee6e0061c513f98b749d4b9cc64
Reviewed-on: https://go-review.googlesource.com/c/build/+/217722
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@toothrot
Contributor

toothrot commented Feb 6, 2020

This change has been deployed. I've managed to keep a GCE-based gomote session alive for hours.

I'm going to close this issue.

@ianlancetaylor Hopefully you can finish your debugging of OpenBSD now!

@toothrot toothrot closed this as completed Feb 6, 2020
@toothrot toothrot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 6, 2020
@toothrot
Contributor

toothrot commented Feb 6, 2020

I just verified that my test instance was successfully deleted by the remote buildlet cleanup process (as opposed to the abandoned VM process), as intended.

@ianlancetaylor
Contributor

Thanks!

@golang golang locked and limited conversation to collaborators Feb 5, 2021