x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically #37001
Comments
@ianlancetaylor Can you let us know on this bug a date and time next time this occurs? I want to investigate in our instance and coordinator logs.
It just happened a few minutes ago. I'll try to capture the exact time next time.
The openbsd-386-62 gomote just crashed again, between 13:46:57 PST and 13:49:09 PST.
Perfect thanks, I'll take a look.
As I suspected, the coordinator is killing your instances, but I do not know why:
It looks like it was up for about 45 minutes. I'll try to figure out how this is implemented.
The default timeout is 45 minutes: https://github.com/golang/build/blob/17a7d8724fa7128cd79bcb78e1fbe087043bf810/cmd/coordinator/coordinator.go#L140 I'm still tracing through this code, but it seems like this should happen for all GCE VMs. I'll keep digging.
45 minutes sounds like roughly the same timescale as in #28365. Perhaps they have the same root cause?
Thanks for looking at this. My understanding was that the coordinator would not kill the instance if I was connected to it via
That was my understanding too, and I've seen gomote instances hang around for a long time (many hours) due to an active ssh connection. Perhaps it works for some builder types but not others. I've remembered about issue #36802 still needing a resolution. There aren't any gomote sessions right now, so I'll use this as a chance to redeploy coordinator, so we can know that the latest version is in use. Edit: Done, see #36802 (comment).
My current belief is this issue is specifically related to GCE VMs, which is a narrow-ish subset of our VMs. I'm still reading through the coordinator code to fully understand how it works before saying with confidence what is causing it, but I have my suspicions.
OK! I believe I have tracked it down. When using For GCE VMs, we also track a different attribute, We could do one or more of the following:
I'm not sure which is best yet, or if some combination of them is best. I'll keep looking. I believe the majority of the knowledge of this code is tied up in @bradfitz and @crawshaw.
I'd do (2) .... "Improve the expiration check in gce.go to account for active SSH sessions" Sorry, I thought it already did that. |
Change https://golang.org/cl/217722 mentions this issue:
This CL skips deleting active remote buildlets.

The coordinator has multiple ways of tracking stale buildlets. For our GCE buildlets, we periodically delete old VMs after their expiration time, typically 45 minutes after their creation. The expiration tracking in coordinator/gce.go does not account for remote buildlets, which are buildlets created by users or cmd/release.

Remote buildlets have their own staleness checks and cleanup process, so we should skip the GCE specific cleanup logic for them. This adds an additional field to the buildlet Client in order to correlate a GCE VM with a buildlet.

Updates golang/go#37001

Change-Id: Ib0acdf79c4dfbee6e0061c513f98b749d4b9cc64
Reviewed-on: https://go-review.googlesource.com/c/build/+/217722
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
This change has been deployed. I've managed to keep a GCE based gomote session alive for hours. I'm going to close this issue. @ianlancetaylor Hopefully you can finish your debugging of OpenBSD now!
I just verified that my test instance was successfully deleted by the remote buildlet cleanup process (as opposed to the abandoned VM process), as intended.
Thanks!
@ianlancetaylor has reported in #36996 (comment) that the openbsd-386-62 gomote instance was crashing, which made debugging an OpenBSD issue more difficult and time consuming:

We should investigate and try to fix that, or find another solution to make it easier to debug OpenBSD issues. This is the tracking issue for that.

/cc @cagedmantis @toothrot