Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/cmd/gomote: gomote create, run, etc timeout after 2 hours #56423

Open
prattmic opened this issue Oct 25, 2022 · 4 comments
Open

x/build/cmd/gomote: gomote create, run, etc timeout after 2 hours #56423

prattmic opened this issue Oct 25, 2022 · 4 comments
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@prattmic
Copy link
Member

After two hours, gomote create and run (and presumably other commands) timeout with an error like:

Error running create: failed to create buildlet: stream terminated by RST_STREAM with error code: INTERNAL_ERROR

This is rather annoying, particularly for gomote run. gomote create can be manually retried, but gomote run doesn't have a good workaround [1].

We believe this is because the load balancer has a 2 hour timeout, after which the gRPC connection is closed.

As a short term mitigation, I propose increasing this timeout, perhaps to 24 hours.

However the GCP docs note that connections with long timeout are at risk of disconnecting due to maintainence restarts of the load balancer. Longer term, it would probably be better to make the operations retry-able. e.g., perhaps GomoteServer.ExecuteCommand immediately returns an execution ID, and a new GomoteServer.StreamOutput could be used to stream output from that execution? If that is closed, the client can simply reconnect and continue streaming. Alternatively, maybe gomote ssh should be more featureful and could replace gomote run, since it also seems to avoid timeout issues.

[1] At the moment I have a terrible hack to scrap the ssh command out of gomote ssh, and then fake a PTY to feed commands into it.

cc @cagedmantis @golang/release

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Oct 25, 2022
@gopherbot gopherbot added this to the Unreleased milestone Oct 25, 2022
@heschi heschi added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 25, 2022
@prattmic
Copy link
Member Author

Alternatively, maybe gomote ssh should be more featureful and could replace gomote run, since it also seems to avoid timeout issues.

Note that the coordinator SSH server does not support commands without a PTY (e.g., ssh <hostname> /bin/ls). Attempting to do so with the SSH command printed by gomote ssh results in the error message for #21140.

scp etc not yet supported; https://golang.org/issue/21140                                                                                    

A workaround is to pass -t -t to ssh, which forces PTY allocation and pipe interactive commands into ssh. e.g.,

$ echo "ls\nexit" | ssh -t -t <args from gomote ssh>

@gopherbot
Copy link

Change https://go.dev/cl/445435 mentions this issue: deploy: increase coordinator ingress timeout to 24 hours

@toothrot toothrot added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Nov 1, 2022
gopherbot pushed a commit to golang/build that referenced this issue Nov 2, 2022
Connections connected for long than the timeout are automatically closed
by the load balancer. gomote create (CreateInstance) and gomote run
(ExecuteCommand) are implemented as single, long-running gRPC calls.
Currently, if one of these exceeds 2 hours, the connection is closed and
the call fails.

Increase the limit to 24 hr as a mitigation to give long-running
commands more time to complete. As noted at
https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries,
these connections are still at risk of reset due to restarts of the load
balance itself, so ideally gomote eventually migrates to RPCs that
support retry/continue.

For golang/go#56423.

Change-Id: Ia10faea1ca8558373d2d6b45abcf99c476317270
Reviewed-on: https://go-review.googlesource.com/c/build/+/445435
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
@cagedmantis
Copy link
Contributor

@prattmic Should I close this out? Would you like some more time to test the change?

@prattmic
Copy link
Member Author

prattmic commented Nov 4, 2022

My immediate pain is fixed, though there is still a timeout, plus GCLB restarts can still reset connections at any time. Thus, long term I'd still like to eventually see RPCs move to retry-able operations that can survive these resets.

Perhaps Unplanned is the right milestone for this?

@prattmic prattmic modified the milestones: Unreleased, Unplanned Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done.
Projects
Status: Planned
Development

No branches or pull requests

5 participants