x/build: LUCI linux-riscv64-mengzhuo bot has a flaky network connection #65464

Open
mknyszek opened this issue Feb 2, 2024 · 3 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@mknyszek
Contributor

mknyszek commented Feb 2, 2024

The linux-riscv64-mengzhuo machine has a flaky network connection. Recent builds have timed out trying to obtain things like the tip-of-tree for a repository and trying to download prebuilt toolchains from the CAS service. The consecutive failures of the latter led the machine to be quarantined, draining the pool of linux-riscv64 machines for some time.

I will note that some of the work cmd/coordinator used to do is now done on the bot itself. This is just a consequence of the switch to LUCI's execution model.

@mengzhuo is there anything that can be done either by you or us to improve the stability of the machine? For instance, if we're OK with making builds slower, we can skip the prebuild on this platform and always run make.bash, even for subrepos. Fetching the tip-of-tree is a bit unavoidable for subrepo builds though, because those builds can only be triggered by one source. Perhaps we can target a different service, or add some other exception.

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Feb 2, 2024
@gopherbot gopherbot added this to the Unreleased milestone Feb 2, 2024
@mknyszek mknyszek added OS-Linux NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-riscv Issues solely affecting the riscv64 architecture. and removed Builders x/build issues (builders, bots, dashboards) labels Feb 2, 2024
@mengzhuo
Contributor

mengzhuo commented Feb 4, 2024

All my riscv-builders are located in my homelab in Shenzhen, China, which is on a home ADSL network.
Maybe I can move these RISC-V development boards/boxes to a server farm, if possible, for a more stable network with better QoS. (Looking for funding from RISC-V companies here.)

This is a ping latency graph for go.googlesource.com/*.appspot.com over 24 hours. It looks pretty good to me, given that all packets travel around the whole planet (<180ms).
image

@mengzhuo mengzhuo added Builders x/build issues (builders, bots, dashboards) and removed OS-Linux arch-riscv Issues solely affecting the riscv64 architecture. labels Feb 4, 2024
@mengzhuo
Contributor

mengzhuo commented Feb 8, 2024

I took a look at some of the failures: "go.googlesource.com" resolved to "173.194.174.82", an IP that my firewall did not allow.
I have now allowed "173.194.0.0/16".

https://chromium-swarm.appspot.com/task?id=679f0bed18bf1910
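
For anyone checking a similar allowlist, here is a minimal Go sketch that tests whether a resolved address falls inside an allowed CIDR range. The range and the address are the ones quoted above; the helper name `allowedByPrefix` is hypothetical and is not part of any builder tooling. Note that the host can resolve to addresses in other ranges at other times, and this check would report those as not allowed.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedByPrefix reports whether ip falls inside the CIDR range cidr.
// Hypothetical helper for illustration only.
func allowedByPrefix(cidr, ip string) (bool, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, fmt.Errorf("parse prefix %q: %w", cidr, err)
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, fmt.Errorf("parse address %q: %w", ip, err)
	}
	return prefix.Contains(addr), nil
}

func main() {
	// Range and address taken from the comment above.
	ok, err := allowedByPrefix("173.194.0.0/16", "173.194.174.82")
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // prints "true"
}
```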

@mknyszek
Contributor Author

mknyszek commented Feb 9, 2024

Thanks for looking into that! The other category of issues I've seen is that the CAS download step fails (because the initial download attempt and all of its retries fail). I can provide some instructions on how to try to reproduce that if it would be helpful.

Also, it looks like the bot went down recently? https://chromium-swarm.appspot.com/bot?id=linux-riscv64-mengzhuo has status dead and the last task ended with BOT_DIED.
