x/build/cmd/gomote: SSH to AWS ARM gomotes sometimes fails #49489

Open
prattmic opened this issue Nov 9, 2021 · 8 comments
Labels
Builders: x/build issues (builders, bots, dashboards)
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@prattmic
Member

prattmic commented Nov 9, 2021

It seems like some gomotes are permanently broken, but new ones are OK?

$ gomote ssh user-mpratt-linux-arm64-aws-0
$ ssh -p 2222 user-mpratt-linux-arm64-aws-0@farmer.golang.org # auth using https://github.com/prattmic.keys
# Welcome to the gomote ssh proxy, mpratt.
# Connecting to/starting remote ssh...
#
failed to connect to ssh on user-mpratt-linux-arm64-aws-0: unexpected /connect-ssh response: 502 Bad Gateway, dial tcp 127.0.0.1:22: connect: connection refused

Connection to farmer.golang.org closed.
$ gomote ssh user-mpratt-linux-arm64-aws-0
$ ssh -p 2222 user-mpratt-linux-arm64-aws-0@farmer.golang.org # auth using https://github.com/prattmic.keys
# Welcome to the gomote ssh proxy, mpratt.
# Connecting to/starting remote ssh...
#
failed to connect to ssh on user-mpratt-linux-arm64-aws-0: unexpected /connect-ssh response: 502 Bad Gateway, dial tcp 127.0.0.1:22: connect: connection refused

Connection to farmer.golang.org closed.
$ gomote create linux-arm64-aws
# still creating linux-arm64-aws after 5s; 0 requests ahead of you
# still creating linux-arm64-aws after 10s; 0 requests ahead of you
# still creating linux-arm64-aws after 15s; 0 requests ahead of you
# still creating linux-arm64-aws after 20s; 0 requests ahead of you
# still creating linux-arm64-aws after 25s; 0 requests ahead of you
# still creating linux-arm64-aws after 30s; 0 requests ahead of you
# still creating linux-arm64-aws after 35s; 0 requests ahead of you
user-mpratt-linux-arm64-aws-1
$ gomote ssh user-mpratt-linux-arm64-aws-1
$ ssh -p 2222 user-mpratt-linux-arm64-aws-1@farmer.golang.org # auth using https://github.com/prattmic.keys
# Welcome to the gomote ssh proxy, mpratt.
# Connecting to/starting remote ssh...
#
# `gomote push` and the builders use:
# - workdir: /workdir
# - GOROOT: /workdir/go
# - GOPATH: /workdir/gopath
# - env: GO_BUILDER_NAME=linux-arm64-aws GOROOT_BOOTSTRAP=/usr/local/go-bootstrap
# Happy debugging.
Warning: Permanently added '[localhost]:38283' (ECDSA) to the list of known hosts.
Linux 68c8a20ccbb8 4.19.0-12-arm64 #1 SMP Debian 4.19.152-1 (2020-10-18) aarch64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
debug1: PAM: reinitializing credentials
debug1: permanently_set_uid: 0/0
Environment:
  USER=root
  LOGNAME=root
  HOME=/root
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  MAIL=/var/mail/root
  SHELL=/bin/bash
  TERM=screen
  SSH_CLIENT=127.0.0.1 45848 22
  SSH_CONNECTION=127.0.0.1 45848 127.0.0.1 22
  SSH_TTY=/dev/pts/0
root@68c8a20ccbb8:~#

i.e., the old instance doesn't seem to work at all. The new instance works fine. I've had this SSH issue on two different instances today (linux-arm-aws and linux-arm64-aws). FWIW, the broken ones I did various gomote run commands on (successfully) before attempting to SSH, while the working ones I SSH'd immediately.

cc @golang/release

prattmic added the NeedsInvestigation label Nov 9, 2021
gopherbot added the Builders label Nov 9, 2021
gopherbot added this to the Unreleased milestone Nov 9, 2021
@cherrymui
Member

> FWIW, the broken ones I did various gomote run commands on (successfully) before attempting to SSH, while the working ones I SSH'd immediately.

I've seen this on other gomotes as well, and it has probably been like this for a long time. So when I need gomote ssh, I always run it up front.

@dmitshur
Contributor

dmitshur commented Nov 9, 2021

> FWIW, the broken ones I did various gomote run commands on (successfully) before attempting to SSH, while the working ones I SSH'd immediately.

This prompted me to check one idea, and I suspect it's a likely explanation.

gomote run has a -firewall flag:

$ gomote run -help 2>&1 | grep firewall
  -firewall
    	Enable outbound firewall on machine. This is on by default on many builders (where supported) but disabled by default on gomote for ease of debugging. Once any command has been run with the -firewall flag on, it's on for the lifetime of that gomote instance.

Some of the builders set GO_DISABLE_OUTBOUND_NETWORK=1 in the environment (via the x/build/dashboard package or another means), and the buildlet's /exec endpoint (used by gomote run) interprets it and disables outbound network access.
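
To make the mechanism concrete, here is a minimal sketch of how an exec handler could detect that variable in a command's environment. It is not the buildlet's actual code: the function name outboundFirewallRequested is hypothetical, and treating the last assignment as the winner is an assumption.

```go
package main

import "strings"

// outboundFirewallRequested reports whether env, a list of "KEY=value"
// strings in the style of os.Environ, sets GO_DISABLE_OUTBOUND_NETWORK=1.
// Assumption: if the key appears more than once, the last entry wins.
// This is a hypothetical helper, not code from the buildlet.
func outboundFirewallRequested(env []string) bool {
	const key = "GO_DISABLE_OUTBOUND_NETWORK"
	enabled := false
	for _, kv := range env {
		k, v, ok := strings.Cut(kv, "=")
		if ok && k == key {
			enabled = v == "1"
		}
	}
	return enabled
}
```

With an environment like the one printed for linux-arm64-aws (GO_BUILDER_NAME and GOROOT_BOOTSTRAP only), this returns false.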

@dmitshur
Contributor

dmitshur commented Nov 9, 2021

Actually, that flag's default value is false, and linux-arm64-aws's environment doesn't include that env var. Maybe it's possible some commands you ran did trigger it, or it could be something else after all.

@dmitshur
Contributor

dmitshur commented Nov 9, 2021

Pretty sure it's not GO_DISABLE_OUTBOUND_NETWORK in this case because this builder doesn't have the /sbin/iptables binary.

I've noticed that the 502 happened on a linux-arm64-aws gomote when I was trying to connect a second time to the same instance. Once I disconnected the first SSH connection, I could establish a new one.

@cagedmantis
Contributor

@cherrymui Have they always been linux-arm[64]-aws gomote sessions?

@cherrymui
Member

If I recall correctly, I have seen it on other (non-arm) gomotes as well.

@cagedmantis
Contributor

It seems that the error originates from the buildletclient. It occurs on the instance when the buildlet tries to start and connect to the local SSH server. There is some room for improvement in the SSH server start/connection logic, especially if it only starts the SSH daemon once.
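
As a rough illustration of one possible improvement (a sketch, not the buildlet's actual code): instead of assuming that a daemon started once is still running, the connect path could retry the local dial and call a start hook whenever the dial fails. The names dialLocalSSH, demoRestart, and the start hook are all hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialLocalSSH dials addr up to attempts times. After each failed dial
// (except the last) it calls start, a hypothetical hook that launches
// the SSH daemon, waits, and tries again.
func dialLocalSSH(addr string, attempts int, wait time.Duration, start func() error) (net.Conn, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		if i == attempts-1 {
			break
		}
		if err := start(); err != nil {
			return nil, fmt.Errorf("starting ssh daemon: %w", err)
		}
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("dial %s failed after %d attempts: %w", addr, attempts, lastErr)
}

// demoRestart simulates a daemon that is not running. It reserves a free
// local port, closes it so that dials are refused, and passes a start
// hook that begins listening on the same address again. It returns how
// many times the hook ran.
func demoRestart() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	addr := l.Addr().String()
	l.Close() // nothing is listening on addr now

	starts := 0
	var restarted net.Listener
	start := func() error {
		starts++
		var err error
		restarted, err = net.Listen("tcp", addr)
		return err
	}
	conn, err := dialLocalSSH(addr, 3, 10*time.Millisecond, start)
	if restarted != nil {
		defer restarted.Close()
	}
	if err != nil {
		return starts, err
	}
	conn.Close()
	return starts, nil
}
```

In a real buildlet the start hook would launch sshd rather than open a listener; demoRestart only exists to exercise the retry path locally. Whether this addresses the 502 seen in this issue is untested.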

@cagedmantis
Contributor

Just out of curiosity, is this still a common problem that people are encountering?
