cherrymui added the NeedsFix (the path to resolution is known, but the work has not been done) and Builders (x/build issues: builders, bots, dashboards) labels and removed the Builders label on Apr 22, 2024.
I got an access grant and logged into the VMs to inspect them. Both were up (not hung or dead), but the "swarming" user was completely inactive, which is not supposed to happen if the systems are healthy. I inspected the system event logs but didn't see any red flags: the last log entry for anything useful done by swarming is on Apr 7th, and after that the user just vanishes.
The Apr 7th swarming bot log ("C:\Users\swarming.swarming\logs\bot_stdout.log.1") contains this:
Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531
Sleeping for 300 secs
We have SWARMING_NEVER_REBOOT set to true for these VMs, but the code in question doesn't seem to respect that.
Of course, that doesn't explain why we would have two copies of the swarming bot running at the same time in the first place. It is also a mystery why we don't get a proper auto-logon of the swarming user after this happens (when I restart manually we don't seem to have this issue). If anyone has any ideas on how to debug this, let me know.
I restarted both VMs and they seem to be processing jobs again.
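One way to check whether the same workaround fired at other times is to scan the rotated bot logs for the message quoted above. The following is a minimal sketch, not part of the swarming bot: the function names are hypothetical, and it assumes the rotated logs sit next to each other and share the `bot_stdout.log` prefix, as the one path quoted above suggests.

```python
import re
from pathlib import Path

# Matches the message quoted from bot_stdout.log.1 above.
# The captured group is the PID of the previously running bot.
REBOOT_RE = re.compile(r"Found a previous bot, (\d+) rebooting as a workaround")


def find_reboot_lines(lines):
    """Return (line_number, pid) for every line containing the reboot message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REBOOT_RE.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


def scan_log_dir(log_dir):
    """Scan every bot_stdout.log* file in log_dir (an assumed layout).

    Returns a dict mapping file name to its list of hits; files without
    hits are omitted.
    """
    results = {}
    for path in sorted(Path(log_dir).glob("bot_stdout.log*")):
        with path.open(errors="replace") as handle:
            hits = find_reboot_lines(handle)
        if hits:
            results[path.name] = hits
    return results
```

Running `scan_log_dir` against the bot's log directory would list each file and line where the bot decided to reboot, which could then be compared with the timestamps in the system event logs.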
From what I can tell, SWARMING_NEVER_REBOOT takes effect for the most frequent causes of a reboot, but it doesn't catch all of them. The swarming bot still seems to occasionally trigger a reboot in some edge cases.
We can try to catch and report those edge cases, and aim to get them fixed so the variable does as its name implies in all situations. There may still be future instances that get missed and a restart happens unintentionally anyway.
Other options include making this builder come back automatically after a restart, which would remove the need to set the variable, or simply handling the occasional restart manually when it happens.
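To illustrate what "catch and report" could look like, here is a minimal sketch of a single gate that every reboot path would go through. It is not the swarming bot's actual code: `never_reboot`, `request_reboot`, and the `do_reboot` and `report` callbacks are hypothetical names, and the set of values treated as "true" is an assumption, since the thread only says the variable is set to true.

```python
import os

# Assumed spellings of "true"; how the real bot parses the variable
# is not shown in this thread.
TRUTHY = {"1", "true", "yes"}


def never_reboot(environ=None):
    """Return True if SWARMING_NEVER_REBOOT is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("SWARMING_NEVER_REBOOT", "").strip().lower() in TRUTHY


def request_reboot(reason, do_reboot, report, environ=None):
    """Hypothetical single entry point for all reboot requests.

    If the variable is set, the reboot is skipped and the reason is
    reported so the edge case becomes visible. Returns True only when
    do_reboot was actually called.
    """
    if never_reboot(environ):
        report(f"reboot suppressed by SWARMING_NEVER_REBOOT: {reason}")
        return False
    do_reboot(reason)
    return True
```

The point of routing through one function is that a path such as the "previous bot" workaround cannot bypass the variable, and each suppressed reboot leaves a record that can be used to find and fix the edge case.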
Since the builders are now back online and working, let's close this particular issue. Thanks.
https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64
It seems all builders are offline.
cc @golang/release @thanm