x/build/dashboard: linux-mips64le-mengzhuo repeatedly timing out while writing snapshot #52235

Open
dmitshur opened this issue Apr 8, 2022 · 4 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@dmitshur
Contributor

dmitshur commented Apr 8, 2022

Almost all recent commits don't have a build result for this builder yet, as seen on build.golang.org:

[image: screenshot of the build.golang.org dashboard]

From coordinator logs:

$ kubectl logs coordinator-deployment-86f6c447fc-qkkhc | grep 'failed to write snapshot to GCS'
2022/04/08 14:28:29 [build linux-mips64le-mengzhuo 3e7ffb862f550c38ce0611b970a4dce10a01226e]: failed to write snapshot to GCS after copying 6622422 bytes: context deadline exceeded
2022/04/08 14:41:16 [build linux-mips64le-mengzhuo 3e387528e54971d6009fe8833dcab6fc08737e04]: failed to write snapshot to GCS after copying 5668283 bytes: context deadline exceeded
2022/04/08 14:54:19 [build linux-mips64le-mengzhuo 3e7ffb862f550c38ce0611b970a4dce10a01226e]: failed to write snapshot to GCS after copying 7044318 bytes: context deadline exceeded
2022/04/08 15:06:47 [build linux-mips64le-mengzhuo 3e387528e54971d6009fe8833dcab6fc08737e04]: failed to write snapshot to GCS after copying 5897663 bytes: context deadline exceeded
2022/04/08 15:19:49 [build linux-mips64le-mengzhuo 3e387528e54971d6009fe8833dcab6fc08737e04]: failed to write snapshot to GCS after copying 5864896 bytes: context deadline exceeded
2022/04/08 15:33:23 [build linux-mips64le-mengzhuo 3e387528e54971d6009fe8833dcab6fc08737e04]: failed to write snapshot to GCS after copying 5930430 bytes: context deadline exceeded

The snapshot write timeout is currently 5 minutes, and this builder hits it repeatedly after writing only about 5-7 MB. A typical snapshot is around 150 MB, so at that rate (roughly 1.1-1.4 MB per minute) a full upload would take on the order of 2 hours.

Is this an unexpected issue with the builder and/or its internet uplink speed, or should the builder be configured with SkipSnapshot: true? I see that it already has FlakyNet: true set. @mengzhuo, are you able to take a look and advise how you'd like to proceed? Thanks.
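For readers unfamiliar with the two settings mentioned, here is a rough sketch of how such flags might gate the snapshot step. The `builderConfig` type and `shouldWriteSnapshot` helper are hypothetical stand-ins written for this example. Only the field names `SkipSnapshot` and `FlakyNet` come from the issue, and the real configuration type in x/build may look different.

```go
package main

import "fmt"

// builderConfig is a hypothetical stand-in for a builder's configuration.
// Only the two boolean field names are taken from the issue text.
type builderConfig struct {
	Name         string
	FlakyNet     bool // already set for this builder, per the issue
	SkipSnapshot bool // if true, do not write a snapshot at all
}

// shouldWriteSnapshot reports whether the snapshot upload step should run.
// This helper is illustrative and is not part of any real package.
func shouldWriteSnapshot(c builderConfig) bool {
	return !c.SkipSnapshot
}

func main() {
	c := builderConfig{
		Name:         "linux-mips64le-mengzhuo",
		FlakyNet:     true,
		SkipSnapshot: true, // the option being proposed in the issue
	}
	fmt.Printf("%s: write snapshot? %v\n", c.Name, shouldWriteSnapshot(c))
}
```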

CC @golang/release.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 8, 2022
@dmitshur dmitshur added this to the Unreleased milestone Apr 8, 2022
@gopherbot

Change https://go.dev/cl/398697 mentions this issue: cmd/coordinator: triple writeSnapshot timeout for reverse builders

gopherbot pushed a commit to golang/build that referenced this issue Apr 8, 2022
This timeout is meant to be an upper bound, and some reverse builders
have been observed to need a bit over 5 minutes to finish the upload.
Give them more time and update the comment to describe the 2022 state.

Also log how many bytes they've managed to copy before failing.

Updates golang/go#52235.
Updates golang/go#49149.

Change-Id: I20f850620f0aa8126968862f2ad9a096fa32ce03
Reviewed-on: https://go-review.googlesource.com/c/build/+/398697
Trust: Carlos Amedee <carlos@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@dmitshur
Contributor Author

dmitshur commented Apr 8, 2022

Based on the comment in CL 399034, one possible explanation is that the builder is slow at compression rather than at upload, in which case the 2 hour estimate might be way off.

If so, we should be able to see the effect of CL 398697 (it's deployed now).

@dmitshur
Contributor Author

dmitshur commented Apr 9, 2022

@mengzhuo Result after making the timeout 15 minutes instead of 5:

2022/04/09 00:10:30 [build linux-mips64le-mengzhuo 0f0c89243044a5a5de142e51da3a98f082fd3771]: failed to write snapshot to GCS after copying 18973578 bytes: context deadline exceeded

It was able to transfer an additional ~10 MB in the 10 extra minutes. It seems even 20 minutes would not be nearly enough.

Edit: A newer run completed in 9 minutes:

2022-04-09T00:58:24Z finish_write_snapshot_to_gcs after 8m51.1s

@mengzhuo
Contributor

mengzhuo commented Apr 9, 2022

@dmitshur Thank you for the detailed update. I've changed the proxy and it works now.
