x/build: coordinator restarts #22042
Comments
More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders). For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.
That is #19178.
Thanks to @kelseyhightower for helping me debug this. Coordinator had a resource limit of 2Gi memory in deployment config.
Sarah, can you or Kelsey note here how this was debugged, for future reference? The limit is at https://github.com/golang/build/blob/master/cmd/coordinator/deployment-prod.yaml#L26
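The backpressure and priority ideas above could be sketched roughly as follows. This is only an illustration under stated assumptions: `buildRequest`, `boundedQueue`, the priority values, and the builder names are all hypothetical and are not taken from the coordinator's code. The queue rejects new work once it is full (backpressure) and hands out higher-priority requests first, keeping FIFO order within a priority.

```go
package main

import (
	"container/heap"
	"errors"
	"fmt"
)

// buildRequest is a hypothetical unit of work; the real coordinator's types differ.
type buildRequest struct {
	builder  string
	priority int // higher runs first, e.g. release builds get a larger value
	seq      int // insertion order, keeps FIFO order within one priority
}

// requestHeap implements heap.Interface ordered by priority, then by seq.
type requestHeap []*buildRequest

func (h requestHeap) Len() int { return len(h) }
func (h requestHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h requestHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *requestHeap) Push(x any)   { *h = append(*h, x.(*buildRequest)) }
func (h *requestHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return item
}

var errQueueFull = errors.New("build queue full")

// boundedQueue refuses new requests once max entries are waiting.
type boundedQueue struct {
	h       requestHeap
	max     int
	nextSeq int
}

func newBoundedQueue(max int) *boundedQueue { return &boundedQueue{max: max} }

// enqueue adds a request, or returns errQueueFull so the caller can back off.
func (q *boundedQueue) enqueue(builder string, priority int) error {
	if len(q.h) >= q.max {
		return errQueueFull
	}
	heap.Push(&q.h, &buildRequest{builder: builder, priority: priority, seq: q.nextSeq})
	q.nextSeq++
	return nil
}

// dequeue returns the highest-priority waiting request, if any.
func (q *boundedQueue) dequeue() (*buildRequest, bool) {
	if len(q.h) == 0 {
		return nil, false
	}
	return heap.Pop(&q.h).(*buildRequest), true
}

func main() {
	q := newBoundedQueue(3)
	q.enqueue("linux-amd64", 0)
	q.enqueue("release-darwin", 10) // hypothetical release builder
	q.enqueue("windows-386", 0)
	if err := q.enqueue("freebsd-amd64", 0); err != nil {
		fmt.Println("rejected:", err)
	}
	for {
		r, ok := q.dequeue()
		if !ok {
			break
		}
		fmt.Println(r.builder)
	}
}
```

Rejecting at enqueue time pushes the decision back to the caller, which can wait and resubmit instead of piling more load onto a shared resource.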
Yes, totally. Good idea.
then under the
I will submit the change to the prod.yaml and ref this issue. Thanks, Brad!
Change https://golang.org/cl/75532 mentions this issue:
Can we also monitor the amount of memory being used by each builder? GCP has a good metrics aggregator, right?
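As a lightweight complement to external metrics, a Go process can also report its own memory use relative to a configured limit. The sketch below is an assumption-laden illustration, not how the coordinator or the builders actually report memory: `memLimitBytes`, `memUsage`, `readMemUsage`, and `fractionOfLimit` are hypothetical names, and the 2Gi value is simply the limit mentioned in this thread. Note that `runtime.MemStats.Sys` is the Go runtime's own accounting and will not exactly match what the container runtime measures.

```go
package main

import (
	"fmt"
	"runtime"
)

// memLimitBytes is a hypothetical constant mirroring the 2Gi limit discussed here.
const memLimitBytes = 2 << 30

// memUsage is a snapshot of the Go runtime's view of memory.
type memUsage struct {
	heapAlloc uint64 // bytes of allocated heap objects
	sys       uint64 // total bytes obtained from the OS by the Go runtime
}

// readMemUsage takes a snapshot using runtime.ReadMemStats.
func readMemUsage() memUsage {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return memUsage{heapAlloc: ms.HeapAlloc, sys: ms.Sys}
}

// fractionOfLimit reports used/limit, or 0 when no limit is set.
func fractionOfLimit(used, limit uint64) float64 {
	if limit == 0 {
		return 0
	}
	return float64(used) / float64(limit)
}

func main() {
	u := readMemUsage()
	fmt.Printf("heap=%d sys=%d (%.1f%% of limit)\n",
		u.heapAlloc, u.sys, 100*fractionOfLimit(u.sys, memLimitBytes))
}
```

Logging such a line periodically would at least show whether memory was climbing toward the limit before a restart.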
Coordinator sometimes restarts unexpectedly, causing build failures and gomote sessions to be terminated.
Trends:
The rest of this issue is dedicated to documenting specific instances of when I observed these restarts.
1. 08/11/2017
Maintner went down, many builders queued. (see #21383).
When we got maintner back up, all queued builders tried to run, and many (most?) failed.
Seemed to me that the failure had to be caused by all the builders sharing some resource, and hitting some limit.
@aclements helped me debug a bit; we concluded that we were probably hitting a disk write IOPS limit.
Cannot remember if we observed coordinator restarts.
2. 09/22/2017
The second instance of this that I've seen is @rsc's golang.org/x/build/cmd/release (binary named build.release below) failures.
The last 3 failures were due to coordinator restarting in the middle of handling the requests.
There is no indication that coordinator restarted due to any program error - there are no panics, or any other indication of any error on the coordinator side.
Buildlet logs were also interspersed with non-program builder failures (just exit status 1) at the same time.
@rsc tried again a few days later to release and everything worked fine; no changes to the builder pipeline (that I am aware of).
3. 10/25/2017
Timestamps (Eastern): 15:06, 14:59, 13:56
cc @bradfitz @andybons @broady