x/build/cmd/coordinator: trybot (slowbot) was in a "running" state for 2+ hours #35700

Closed
dmitshur opened this issue Nov 19, 2019 · 3 comments
Labels
Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Milestone

Comments

@dmitshur
Contributor

While deploying a new version of coordinator with @cagedmantis, I noticed that one of the active trybot runs was a slowbot run for patch set 2 of CL 207858. What stood out was that the aix-ppc64 builder had been running for an excessively long time and was timing out:


Change-ID: sys~master~Ia0b452f96b08cbb69a0b64cc2a467b5cf94e72da Commit: 25a8c4258c1770d56358edb5ade42803d2c7dca0 (status)
   Remain: 1, fails: []
  android-amd64-emu: running=false
  freebsd-386-11_2: running=false
  freebsd-amd64-11_2: running=false
  freebsd-amd64-12_0: running=false
  linux-386: running=false
  linux-amd64: running=false
  linux-amd64-race: running=false
  netbsd-amd64-8_0: running=false
  openbsd-386-64: running=false
  openbsd-amd64-64: running=false
  windows-386-2008: running=false
  windows-amd64-2016: running=false
  aix-ppc64: running=true
  darwin-amd64-10_14: running=false
  dragonfly-amd64: running=false
  illumos-amd64: running=false
  solaris-amd64-oraclerel: running=false
  linux-amd64: running=false
  linux-amd64: running=false

Its temporary log was:

  builder: aix-ppc64
      rev: 8cf5293caa7071601fa90358abdd20a0b787e178
 buildlet: (nil *buildlet.Client)
  started: 2019-11-19 15:02:52.179254691 +0000 UTC m=+317165.846668715
   status: still running

Events:
  2019-11-19T15:02:52Z checking_for_snapshot 
  2019-11-19T15:02:52Z finish_checking_for_snapshot after 25.8ms
  2019-11-19T15:02:52Z get_buildlet 
 +11724.5s (now)

Build log:


(buildlet still starting; no live streaming. reload manually to see status)
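
For scale, the `+11724.5s (now)` offset on the `get_buildlet` event is a little over three and a quarter hours spent waiting for a buildlet. A minimal Go sketch of that conversion follows; the `formatElapsed` helper is a made-up name for illustration and is not part of x/build.

```go
package main

import (
	"fmt"
	"time"
)

// formatElapsed is a hypothetical helper (not part of x/build) that turns an
// offset in seconds, as printed in the event log, into a readable duration.
func formatElapsed(seconds float64) string {
	return time.Duration(seconds * float64(time.Second)).String()
}

func main() {
	// Offset copied from the get_buildlet event in the log above.
	fmt.Println(formatElapsed(11724.5)) // prints 3h15m24.5s
}
```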

Relevant details for the host-aix-ppc64-osuosl builder at the time were as follows.

Scheduler state

  • host-aix-ppc64-osuosl: 39 waiting (oldest 90h48m54s, newest 1h24m35s, progress 3h44m22s)
    • try: 1 (oldest 2h46m21s, newest 2h46m21s)
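
To compare snapshots of this scheduler line over time, it can be parsed mechanically, since the durations are already in the form Go's `time.ParseDuration` accepts. The sketch below assumes the line format seen in this issue only; the `schedState` type, the `parseSchedLine` function, and the regular expression are illustrative and are not taken from the coordinator's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// schedLine matches a host-type line as it appears in this issue, e.g.
// "host-aix-ppc64-osuosl: 39 waiting (oldest 90h48m54s, newest 1h24m35s, progress 3h44m22s)".
// The format is inferred from the samples here, not from the coordinator source.
var schedLine = regexp.MustCompile(
	`^([\w-]+): (\d+) waiting \(oldest ([0-9a-z.]+), newest ([0-9a-z.]+), progress ([0-9a-z.]+)\)$`)

// schedState is a hypothetical container for the parsed fields.
type schedState struct {
	HostType                 string
	Waiting                  int
	Oldest, Newest, Progress time.Duration
}

// parseSchedLine parses one scheduler status line, ignoring a leading bullet.
func parseSchedLine(line string) (schedState, error) {
	line = strings.TrimSpace(strings.TrimLeft(line, " \t•"))
	m := schedLine.FindStringSubmatch(line)
	if m == nil {
		return schedState{}, fmt.Errorf("unrecognized scheduler line: %q", line)
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return schedState{}, err
	}
	var d [3]time.Duration
	for i := range d {
		if d[i], err = time.ParseDuration(m[3+i]); err != nil {
			return schedState{}, err
		}
	}
	return schedState{HostType: m[1], Waiting: n, Oldest: d[0], Newest: d[1], Progress: d[2]}, nil
}

func main() {
	s, err := parseSchedLine("  • host-aix-ppc64-osuosl: 39 waiting (oldest 90h48m54s, newest 1h24m35s, progress 3h44m22s)")
	if err != nil {
		panic(err)
	}
	// prints: host-aix-ppc64-osuosl: 39 waiting, oldest 3.8 days
	fmt.Printf("%s: %d waiting, oldest %.1f days\n", s.HostType, s.Waiting, s.Oldest.Hours()/24)
}
```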

Buildlet pools

Reverse pool by host type (in use / total)

  • host-aix-ppc64-osuosl: 1/1

Reverse pool machine detail

power8-aix-host1 (140.211.9.26:37646) version 23, host-aix-ppc64-osuosl: connected 3h44m22.7s, working for 3h44m22.7s

Active builds

aix-ppc64 rev 7719016e; running; http://power8-aix-host1 reverse peer power8-aix-host1/140.211.9.26:37646 for host type host-aix-ppc64-osuosl, 90h48m51s ago
...
  2019-11-19T17:48:03Z still_waiting_on_test go_test:net/http
  2019-11-19T17:48:33Z still_waiting_on_test go_test:net/http
  2019-11-19T17:49:03Z still_waiting_on_test go_test:net/http
   +9.7s (now)

Filing this issue so we can discuss and investigate if needed after coordinator is deployed. /cc @bradfitz @cagedmantis @toothrot

At first glance, I thought it was a problem that a trybot didn't time out, retry, or fail after 2 hours, but given that it was in the "waiting_for_machine" state, maybe waiting indefinitely is the right thing?

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 19, 2019
@dmitshur dmitshur added this to the Unreleased milestone Nov 19, 2019
@dmitshur dmitshur changed the title from x/build/cmd/coordinator: to x/build/cmd/coordinator: trybot (slowbot) was in a "running" state for 2+ hours Nov 19, 2019
@bradfitz
Contributor

I don't see a problem.

AIX is slow & backlogged. I'll be surfacing the scheduler state in more places (including here). I just sent out https://go-review.googlesource.com/c/build/+/207841 which is a start.

@bradfitz
Contributor

https://farmer.golang.org/#sched says:

host-aix-ppc64-osuosl: 42 waiting (oldest 91h33m35s, newest 15m40s, progress 4h29m2s)
try: 1 (oldest 3h31m2s, newest 3h31m2s)

And down below, the single AIX machine:

power8-aix-host1 (140.211.9.26:37646) version 23, host-aix-ppc64-osuosl: connected 4h29m3.6s, working for 4h29m3.6s

It's been working for 4 hours.

That slowness should be fixed by https://go-review.googlesource.com/c/go/+/207497

@dmitshur
Contributor Author

Ok, so the situation was that the AIX builder was already occupied doing another build, so the slowbot was waiting for the builder to free up. That does sound like things are working as intended.

I misread it as the AIX builder being free and yet not being scheduled. I see now that it was 1/1 in use.
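
The behavior can be illustrated with a toy model: a pool holding a single machine, where a second request blocks until the first build releases it. This is only a sketch under that assumption, using a buffered channel as a semaphore; the `pool` type and its methods are hypothetical and do not reflect how the coordinator's scheduler is actually implemented.

```go
package main

import (
	"fmt"
	"time"
)

// pool is a hypothetical stand-in for a reverse buildlet pool with a fixed
// number of machines. Each token in slots represents one free machine.
type pool struct{ slots chan struct{} }

func newPool(n int) *pool {
	p := &pool{slots: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		p.slots <- struct{}{}
	}
	return p
}

// getWithin waits up to ms milliseconds for a free machine and reports
// whether one was obtained. A real scheduler might wait indefinitely instead.
func (p *pool) getWithin(ms int) bool {
	select {
	case <-p.slots:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		return false
	}
}

// put returns a machine to the pool.
func (p *pool) put() { p.slots <- struct{}{} }

// inUse formats usage the way the pool summary does, e.g. "1/1".
func (p *pool) inUse() string {
	return fmt.Sprintf("%d/%d", cap(p.slots)-len(p.slots), cap(p.slots))
}

func main() {
	p := newPool(1)
	p.getWithin(10) // a long-running build takes the only machine
	fmt.Println("in use:", p.inUse())                     // in use: 1/1
	fmt.Println("try run got machine:", p.getWithin(50))  // false: still busy
	p.put()                                               // the earlier build finishes
	fmt.Println("try run got machine:", p.getWithin(50))  // true
}
```

With a pool of one, a waiting request says nothing about the machine being idle; it only shows that demand exceeds capacity.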

Thanks for confirming, Brad. 👍

@golang golang locked and limited conversation to collaborators Nov 18, 2020