
x/build: frequent timeouts running js-wasm TryBots #35170

Closed
bcmills opened this issue Oct 26, 2019 · 19 comments
Labels
arch-wasm (WebAssembly issues), Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Testing (an issue that has been verified to require only test changes, not just a test failure)
Milestone
Unreleased

Comments

@bcmills
Contributor

bcmills commented Oct 26, 2019

I've noticed that the js-wasm TryBots on my recent CLs seem to start with one or more spurious failures before finally running successfully.

I suspect that this increases load on the TryBots generally, and may increase overall TryBot latency in some circumstances.

A log from a recent example is here (https://farmer.golang.org/temporarylogs?name=js-wasm&rev=9516a47489b92b64e9cf5633e3ab7e37def3353f&st=0xc00ab39b80):

  builder: js-wasm
      rev: 9516a47489b92b64e9cf5633e3ab7e37def3353f
 buildlet: (nil *buildlet.Client)
  started: 2019-10-26 01:37:51.137407213 +0000 UTC m=+287652.939853079
    ended: 2019-10-26 01:38:21.20458221 +0000 UTC m=+287683.007028020
  success: false

Events:
  2019-10-26T01:37:51Z ask_maintner_has_ancestor 
  2019-10-26T01:38:21Z finish_ask_maintner_has_ancestor after 30s; err=context deadline exceeded

Build log:
Error checking whether commit 9516a47489b92b64e9cf5633e3ab7e37def3353f includes ancestor 3dced519cbabc213df369d9112206986e62687fa: context deadline exceeded


Error: context deadline exceeded

CC @dmitshur @toothrot @bradfitz

@bcmills bcmills added the Testing, Builders, NeedsInvestigation, and arch-wasm labels Oct 26, 2019
@gopherbot gopherbot added this to the Unreleased milestone Oct 26, 2019
@agnivade
Contributor

One change was that the builder was recently bumped to Node 13.

@bradfitz
Contributor

I think that's unrelated. When I've seen this, the error is also as Bryan quoted above: a timeout doing a gRPC call to maintner, asking whether we should even do this build. We used to need to skip builds for js/wasm if the git history of the rev to be built didn't include js-wasm support. But now that we've supported js/wasm for a number of releases (since Go 1.11) and we don't even support Go 1.10 (or even Go 1.11) any more, we can just remove that condition on that builder.

But also: maintner shouldn't be so slow at that query.
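For reference, a minimal sketch of the kind of gated check described above, assuming a maintner-style has-ancestor query guarded by a 30-second context deadline. The maintnerClient interface, shouldBuildJSWasm, and slowMaintner names here are illustrative stand-ins, not the real coordinator or maintner API:

package main

import (
	"context"
	"fmt"
	"time"
)

// maintnerClient is a stand-in for the real maintner API client.
type maintnerClient interface {
	HasAncestor(ctx context.Context, rev, ancestor string) (bool, error)
}

// shouldBuildJSWasm reports whether rev includes the commit that added
// js/wasm support, so a build can be skipped if it does not.
func shouldBuildJSWasm(mc maintnerClient, rev string) (bool, error) {
	const jsWasmAncestor = "3dced519cbabc213df369d9112206986e62687fa"

	// Bound the query: if maintner stalls, this deadline is what turns
	// the call into a "context deadline exceeded" error after 30s.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	ok, err := mc.HasAncestor(ctx, rev, jsWasmAncestor)
	if err != nil {
		return false, fmt.Errorf("checking whether commit %s includes ancestor %s: %w", rev, jsWasmAncestor, err)
	}
	return ok, nil
}

// slowMaintner simulates a maintner backend that takes longer than the deadline.
type slowMaintner struct{}

func (slowMaintner) HasAncestor(ctx context.Context, rev, ancestor string) (bool, error) {
	select {
	case <-time.After(35 * time.Second):
		return true, nil
	case <-ctx.Done():
		return false, ctx.Err() // context deadline exceeded
	}
}

func main() {
	// Waits the full 30s, then prints a wrapped "context deadline exceeded" error.
	_, err := shouldBuildJSWasm(slowMaintner{}, "9516a47489b92b64e9cf5633e3ab7e37def3353f")
	fmt.Println("Error:", err)
}

Once the 30-second deadline expires, this prints the same wrapped "context deadline exceeded" error shape that appears in the build log above.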

@gopherbot

Change https://golang.org/cl/203757 mentions this issue: dashboard: stop checking whether the branch/rev supports js/wasm for js/wasm

gopherbot pushed a commit to golang/build that referenced this issue Oct 28, 2019
dashboard: stop checking whether the branch/rev supports js/wasm for js/wasm

It's been supported for ages. We no longer support or build branches
that didn't have js/wasm support.

See golang/go#35170 (comment)

Updates golang/go#35170 (still need a maintner fix, ideally)

Change-Id: If5489dd266dea9a3a48bbc29f83528c0f939d23b
Reviewed-on: https://go-review.googlesource.com/c/build/+/203757
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@bradfitz
Contributor

Not very scientific, but the has-ancestor RPCs are taking only 100ms for me (including TCP+TLS setup, etc.), nowhere near 30 seconds.

$ time maintq has-ancestor 0ae9389609f23dc905c58fc2ad7bcc16b770f337 3dced519cbabc213df369d9112206986e62687fa
has_ancestor:true

real    0m0.105s
user    0m0.076s
sys     0m0.028s

codebien pushed a commit to codebien/build that referenced this issue Nov 13, 2019 (the same change as above).
@agnivade
Contributor

agnivade commented May 8, 2020

@dmitshur - Would you be able to check if this is still happening?

@neelance
Member

@dmitshur Do you know if this is still an issue?

@dmitshur
Contributor

dmitshur commented Oct 28, 2020

I feel like I've seen some discussion about this still happening occasionally quite recently, more so than the comments in this issue would suggest. Perhaps it was in another thread. I'll keep an eye out for this.

@rsc
Contributor

rsc commented Oct 29, 2020

Yes, it is still happening. Last week I caught the js-wasm trybot taking 25 minutes when most were done in 15,
and just now I caught it taking 34 minutes when everything else was done in 16.
That one was https://go-review.googlesource.com/c/go/+/266357 and I gave up waiting and just submitted the CL.

Any advice about what to do when js-wasm looks like it is stuck/very slow and how to debug further would be greatly appreciated.

@dmitshur
Contributor

> Any advice about what to do when js-wasm looks like it is stuck/very slow and how to debug further would be greatly appreciated.

I suggest trying to determine whether there is a specific place where the build progresses more slowly than others. For example: is the time to first test delayed and then everything is fast, is one of the package tests very slow while the rest are fast, or is the slowness distributed evenly across all packages?
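One rough way to do that triage, assuming the saved trybot log contains the standard "ok <package> <time>s" summary lines that go test prints (an assumption about the log format, not something the dashboard guarantees), is a small filter that reports the slowest packages:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

type pkgTime struct {
	pkg     string
	seconds float64
}

func main() {
	var times []pkgTime
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // allow long log lines
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expect lines like: ok  archive/tar  1.23s
		if len(fields) >= 3 && fields[0] == "ok" && strings.HasSuffix(fields[2], "s") {
			if secs, err := strconv.ParseFloat(strings.TrimSuffix(fields[2], "s"), 64); err == nil {
				times = append(times, pkgTime{fields[1], secs})
			}
		}
	}
	// Print the ten slowest packages, longest first.
	sort.Slice(times, func(i, j int) bool { return times[i].seconds > times[j].seconds })
	for i, t := range times {
		if i == 10 {
			break
		}
		fmt.Printf("%8.2fs  %s\n", t.seconds, t.pkg)
	}
}

Feed it a saved build log on stdin; if one package dominates, that points at a slow test rather than a scheduling stall.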

I'll also try to look into logs for the two occurrences you've shared and see if I can determine it from that.

@dmitshur
Contributor

dmitshur commented Oct 29, 2020

From CL 266374 where I recently started trybots:

[screenshot: trybot run status for CL 266374, with js-wasm still pending]

In this instance, the js-wasm builder is in waiting_for_machine state for over 10 minutes. That is very unexpected—it's a GCP builder that we should be able to spin up without running out of physical machines. Perhaps we're running out of some quota. Not sure why only js-wasm and not other GCP builders. But this is a lead.

Edit: Here's the scheduler state at the time:

[screenshot: scheduler state at the time]

Issues #31282, #36088 may be related.

@bcmills
Contributor Author

bcmills commented Oct 30, 2020

In several occurrences that I saw, the js-wasm TryBot started running and partially completed, then descheduled and started over again from the beginning. I think there is some sort of failure being buried by retry logic, rather than just a slow test.

If we don't have the bandwidth to diagnose this at the moment, perhaps we should temporarily demote js-wasm to an opt-in SlowBot?

@thanm
Contributor

thanm commented Oct 30, 2020

I'm also experiencing js-wasm slowdowns for my trybot runs. Here is an example:

[screenshot of the trybot run, 2020-10-30, with js-wasm still running]

Log (https://farmer.golang.org/temporarylogs?name=js-wasm&rev=ff065bd219cd7a9df01466e338b28f1891671da3&st=0xc0184e91e0):

  builder: js-wasm
      rev: ff065bd219cd7a9df01466e338b28f1891671da3
 buildlet: (nil *buildlet.Client)
  started: 2020-10-30 16:46:48.188839196 +0000 UTC m=+727935.299326955
   status: still running

Events:
  2020-10-30T16:46:48Z checking_for_snapshot 
  2020-10-30T16:46:48Z finish_checking_for_snapshot after 23.9ms
  2020-10-30T16:46:48Z get_buildlet 
 +2093.2s (now)

Build log:

(buildlet still starting; no live streaming. reload manually to see status)

Seems as though it is not getting past "get buildlet".

[Edit: did finally make progress after 40 min wait.]

@neelance
Member

I don't know much about farmer, but on https://farmer.golang.org/#sched it looks as if multiple trybot runs are waiting on the js-wasm tests, yet I can only see very few buildlets with the prefix GCE VM: buildlet-js-wasm-. Is something preventing these buildlets from spawning?

@dmitshur
Contributor

> Is something preventing these buildlets from spawning?

It seems so. I don't yet have an idea of what it might be. It may be related to #42285. I'm going to look more later today.

@dmitshur
Contributor

I extracted some raw data from some Wasm trybot runs, which I was going to analyze more and post something here, but then this work got pre-empted by other work, and by now I've lost it due to a forced browser update.

From memory, the high-level summary was that the Wasm trybot runs are generally quick, with two components that contribute a fairly significant proportion of the overall time:

  • test directory
  • reboot test

There's a commonly used dist-test adjust policy fasterTrybots that skips precisely those two components during pre-submit tests (but not during post-submit tests), which is an option available to speed up Wasm trybots.
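For context, a dist-test adjust policy of that kind has roughly the following shape (paraphrased from memory; the exact signature, field wiring, and dist test names in golang.org/x/build/dashboard may differ):

package main

import (
	"fmt"
	"strings"
)

// fasterTrybots mirrors the shape of the dist-test adjust policy discussed
// above: during pre-submit (trybot) runs it skips the "test:*" directory
// shards and the reboot test, while post-submit builds still run everything.
func fasterTrybots(run bool, distTest string, isTry bool) bool {
	if isTry && (strings.HasPrefix(distTest, "test:") || distTest == "reboot") {
		return false // skip during trybot runs
	}
	return run
}

func main() {
	// Illustrative dist test names; actual names come from cmd/dist.
	for _, dt := range []string{"go_test:archive/tar", "test:0_5", "reboot"} {
		fmt.Printf("trybot runs %-20s: %v\n", dt, fasterTrybots(true, dt, true))
	}
}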

However, it seems the root problem here was with the scheduling of the Wasm builders and starting the trybot runs, not the length of trybot test execution itself under a typical "happy" run. There are many possible places to look next. As @bcmills also pointed out in #42699, there is a problem of stalls sometimes resulting in extended test executions instead of what should be failures. Stalls could be due to something Wasm-specific (e.g., #31282) or something more general. We also have some possible blind spots on the coordinator side (e.g., #39349, #39665). As I recently experienced during #42379 (comment), it's possible for stalls to happen when a test execution is sharded and not happen during a sequential all.bash test run.

All this is to say that more work needs to be done to get closer to finding and resolving the root issue(s) here. I think we have confirmation that this problem is intermittent and skipping a deterministically slow test is not an available option.

@gopherbot

Change https://golang.org/cl/347295 mentions this issue: dashboard: disable reboot test on js-wasm builder

gopherbot pushed a commit to golang/build that referenced this issue Sep 2, 2021
The reboot test verifies that the modified version of Go can be used
for bootstrapping purposes. The js-wasm builder is configured with
GOHOSTOS=linux and GOHOSTARCH=amd64 so that it builds packages and
commands for host, linux/amd64, which is used during bootstrapping
in the reboot test. Since this doesn't provide additional coverage
beyond what we get from linux-amd64's execution of the reboot test,
disable this test on the js-wasm builder.

Also remove some accidental duplicate code.

For golang/go#35170.

Change-Id: Ic770d04f6f5adee7cb836b1a8440dc818fdc5e5b
Reviewed-on: https://go-review.googlesource.com/c/build/+/347295
Trust: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
@dmitshur
Contributor

dmitshur commented May 31, 2022

Is anyone still observing this issue?

> Is something preventing these buildlets from spawning?

A newer idea is that it might've been hitting the GCE VM quota. Some of those quotas are visible in the first line of https://farmer.golang.org/#pools.

If there are no frequent timeouts affecting this builder by now, perhaps we should close this.

@bcmills
Contributor Author

bcmills commented Jun 1, 2022

I haven't seen any of these recently, but I also haven't been looking for them.

I suspect that the “failure being buried by retry logic” that I mentioned in #35170 (comment) may have been the one you fixed in CL 407555.

Since we can identify a specific change that plausibly could have fixed this issue, I'd be ok with closing it until / unless we observe it again.

@dmitshur
Contributor

dmitshur commented Jan 9, 2023

Sounds good. Closing since I haven't seen reports of this in a long while; we can reopen or file another issue if this comes up again.

From looking at the initial report, there's also a chance that the same problem as in issue #55947 (fixed as of Oct 2022) may have contributed here too.

@dmitshur dmitshur closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 9, 2023
@golang golang locked and limited conversation to collaborators Jan 9, 2024