
x/build/cmd/coordinator: failures are sometimes missing error output #39349

Closed
bcmills opened this issue Jun 1, 2020 · 24 comments
Labels: Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), release-blocker, Testing (an issue that has been verified to require only test changes, not just a test failure)
Milestone: Go1.19

Comments

@bcmills (Contributor) commented Jun 1, 2020

https://build.golang.org/log/dab786adef1a18622f61641285864ac9c63fb7e3 is marked as a failure on the dashboard, but the word FAIL does not appear anywhere in the output file.

Either the output is truncated, or the last test in the log (misc/cgo/testshared) exited with a nonzero status and no output.

CC @dmitshur @toothrot @cagedmantis

@bcmills bcmills added the Builders and NeedsInvestigation labels Jun 1, 2020
@gopherbot gopherbot added this to the Unreleased milestone Jun 1, 2020
@bcmills (Contributor, Author) commented Aug 27, 2021

The failures are still occurring, but they now have one extra line of output: a `signal: terminated` line at the end of the test output.

$ greplogs --dashboard -md -l -e '(?m)##### .*\n.*signal: terminated\n\z'

2021-08-27T01:08:12-c927599/linux-386-longtest
2021-08-26T17:19:04-166b691/linux-386-longtest
2021-06-02T17:34:43-e11d142/linux-386-longtest
2021-05-27T15:00:58-950fa11/linux-386-longtest
2021-04-28T18:50:41-ea65a12/linux-386-longtest
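
For reference, a minimal Go sketch of what this pattern matches; the sample log tail below is hypothetical, constructed only to illustrate the failure shape described above:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The pattern passed to greplogs above: a "#####" section header whose
	// following line ends in "signal: terminated", with nothing after it.
	pat := regexp.MustCompile(`(?m)##### .*\n.*signal: terminated\n\z`)

	// Hypothetical log tail shaped like the failures in question.
	logTail := "ok  \tcmd/go\t312.456s\n" +
		"##### ../misc/cgo/testshared\n" +
		"signal: terminated\n"

	fmt.Println(pat.MatchString(logTail)) // true
}
```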

@bcmills (Contributor, Author) commented Jan 5, 2022

greplogs --dashboard -md -l -e '(?m)##### .*\n.*signal: terminated\n\z' --since=2021-08-28

2022-01-04T22:27:20-f154f8b/linux-386-longtest
2022-01-04T21:10:15-1c8f9d2/linux-386-longtest
2021-12-23T20:03:38-b357b05/linux-386-longtest
2021-12-22T18:43:55-0f3becf/linux-386-longtest
2021-12-13T15:47:14-36db10f/linux-386-longtest
2021-12-07T00:58:10-0eb39ca/linux-386-longtest
2021-12-06T17:34:53-2cb9042/linux-386-longtest
2021-12-04T01:07:10-ba83aa7/linux-386-longtest
2021-10-18T16:32:23-4251541/linux-386-longtest
2021-10-18T16:06:08-417100e/linux-386-longtest
2021-10-13T17:02:43-b5904f3/linux-386-longtest
2021-10-13T16:41:20-aded167/linux-386-longtest
2021-10-13T15:38:39-b8e4df0/linux-386-longtest
2021-10-11T19:20:12-65ffee6/linux-386-longtest
2021-10-06T20:30:12-4a37a1d/linux-386-longtest
2021-10-06T20:29:59-8238f82/linux-386-longtest
2021-10-01T17:40:49-243d65c/linux-386-longtest
2021-09-21T20:29:00-f6f6621/linux-386-longtest
2021-09-17T23:44:10-4b654c0/linux-386-longtest
2021-09-17T23:19:17-163871f/linux-386-longtest
2021-09-17T18:33:15-70493b3/linux-386-longtest
2021-09-16T23:57:28-af9da13/linux-386-longtest
2021-09-15T03:29:43-738cebb/linux-386-longtest
2021-09-14T23:07:15-137543b/linux-386-longtest
2021-09-14T14:27:57-181e8cd/linux-386-longtest
2021-09-14T00:49:39-4a4221e/linux-386-longtest
2021-09-12T16:46:58-ad97d20/linux-386-longtest
2021-09-09T22:18:05-426ff37/linux-386-longtest
2021-09-08T14:25:12-64bdad2/linux-386-longtest
2021-09-08T06:38:19-da790cc/linux-386-longtest
2021-09-08T00:02:45-963218c/linux-386-longtest
2021-09-07T18:37:34-6640171/linux-386-longtest
2021-09-04T15:01:53-5c224ec/linux-386-longtest
2021-09-03T19:56:09-04d8d24/linux-386-longtest
2021-09-03T04:24:17-17910ed/linux-386-longtest
2021-08-30T20:22:43-8250141/linux-386-longtest
2021-08-28T00:51:39-5fb1771/linux-386-longtest

@bcmills (Contributor, Author) commented Jan 5, 2022

This seems like a very high failure rate, and linux-386 is a first-class port. Marking as release-blocker.

Given that the recent failures are only on the 386-longtest builder, I suspect that this is due to some test running out of address space. Probably either the builder itself needs to be tuned or the ../test runner harness needs to reduce its parallelism on 32-bit platforms.
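
As a rough illustration of the second mitigation, a hypothetical harness could cap its worker count on 32-bit platforms along these lines (the function name and the halving heuristic are illustrative, not the actual cmd/dist or builder logic):

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// testParallelism picks a worker count for a hypothetical test harness,
// halving it on 32-bit platforms to reduce address-space pressure.
func testParallelism() int {
	n := runtime.NumCPU()
	if strconv.IntSize == 32 && n > 2 {
		n /= 2
	}
	return n
}

func main() {
	fmt.Printf("GOARCH=%s workers=%d\n", runtime.GOARCH, testParallelism())
}
```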

@bcmills bcmills modified the milestones: Unreleased, Go1.18 Jan 5, 2022
@bcmills bcmills added release-blocker Testing An issue that has been verified to require only test changes, not just a test failure. labels Jan 5, 2022
@bcmills (Contributor, Author) commented Jan 5, 2022

I wonder if this could also just be a timing issue. Maybe the coordinator isn't giving the linux-386-longtest builder long enough to run all of the tests? (CC @golang/release)

@aclements (Member) commented:

Some of these failures from the second comment could be because dist doesn't implement timeouts on ../test and some ../misc/cgo tests. That wouldn't explain truncation on any of the standard library tests, or the more recent pattern with "signal: terminated".
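
A rough sketch of what a per-section timeout could look like if dist (or a wrapper around it) imposed one; the 20-minute deadline and the `go test` invocation are illustrative stand-ins, not what cmd/dist actually runs:

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Give the subordinate test run a hard deadline; if it wedges, the
	// context kills it and the log still ends with a recognizable error.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "go", "test", "-count=1", "./...")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			log.Fatalf("test run timed out after 20m: %v", err)
		}
		log.Fatalf("test run failed: %v", err)
	}
}
```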

@toothrot toothrot added this to Planned in Go Release Team Jan 25, 2022
@ianlancetaylor (Contributor) commented:

The failures all indicate that cmd/dist failed due to a SIGTERM signal. I don't understand where that signal is coming from. The Linux kernel will never send a SIGTERM to a process. I don't see any case in which the coordinator sends a SIGTERM.

It appears that Docker will send a SIGTERM signal when the Docker container is being shut down. Perhaps that is what is happening here. It does seem that there is a timeout for the instance, but the code is complicated enough that I don't understand how it is set. For these tests, the runtime test alone takes over 10 minutes, so the full run does take a while.

Is there any logging of timeouts done by the coordinator? I don't know this code myself.
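
If the SIGTERM hypothesis is right, one way to make the termination visible in the logs would be something like the sketch below. This is a hedged illustration, not the coordinator's or buildlet's actual code, and the `go tool dist test` invocation is just a stand-in:

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Cancel the context when the environment (e.g. a Docker shutdown)
	// delivers SIGTERM, so the log records why the run stopped instead of
	// simply ending mid-output.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	cmd := exec.CommandContext(ctx, "go", "tool", "dist", "test")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	switch {
	case ctx.Err() != nil:
		log.Printf("run aborted by SIGTERM from the environment: %v", err)
	case err != nil:
		log.Printf("run failed: %v", err)
	}
}
```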

I'm not sure this is a release blocker because it looks to me like an issue with the builder system rather than something that will be fixed in the release.

@bcmills (Contributor, Author) commented Jan 28, 2022

I'm not sure this is a release blocker because it looks to me like an issue with the builder system rather than something that will be fixed in the release.

My understanding from https://go.dev/wiki/PortingPolicy#first-class-ports is that first-class ports are expected to have passing tests at each release, and linux/386 is listed as a first-class port. I don't know of anyone running the linux/386 tests locally with any regularity, so if they aren't even reliably running to completion on the Go builders I don't think we can say with certainty that they are consistently passing.

The inflection point of the failure rate of this builder appears to have been around August 2020. Considering the state of the world in August 2020, I can understand why it wasn't addressed immediately at that time, but I do not know why it remained unaddressed at the time of the Go 1.17 release in 2021. I assume that it was overlooked because it lacked the release-blocker label, so when I was re-triaging the failures on the build dashboard I added that label to ensure that it would not be similarly overlooked for the current release.

@bcmills (Contributor, Author) commented Jan 28, 2022

it looks to me like an issue with the builder system rather than something that will be fixed in the release.

In addition to the interaction with the porting policy, to me this kind of issue is also a matter of equity.

I watch the builders to check whether there have been regressions in the parts of the project for which I am responsible, and failures in the linux-386-longtest builder matter to me: for one, many of the cmd/go tests only run on the longtest builders, and for another, many of the fuzzing tests I have reviewed this cycle have behaviors unique to the linux-386-longtest builder, because that builder has the unique combination of running non-short tests with a non-amd64 GOARCH (which affects fuzzing instrumentation).

So when I see failures on this builder, I check them. A significant rate of false-positive failures causes a significant amount of unproductive, avoidable triage work, and that in turn contributes to feelings of frustration and burnout. Since the Go project does not seem to have anyone else triaging new or existing builder failures with any regularity, I feel that the costs of this ongoing flakiness have been externalized on to me.

#33598 went through a trajectory very similar to this issue: we had a series of recurring failures on the builders for darwin/amd64, which is also nominally a first-class port. I identified a way to reliably reproduce the problem in March 2020, and the issue remained unaddressed until I diagnosed and fixed it myself in October 2021 (CL 353549).

#39665 was also similar: Dmitri reported longtest failures on windows/amd64 (also a first-class port) in June 2020, and no apparent progress was made on even a diagnosis until I reported a new failure mode in November 2021 (in #49457) and marked it as a release-blocker, at which point the underlying issue was apparently fixed.

If we consider subrepo tests, there are many more examples. As I understand it from #11811 the policy of the project is that subrepo tests should be passing before a release, but for at least the past couple of years we have cut releases with frequently- or persistently-broken subrepo builders. (Some examples: #45700, #31567, #36163; the latter on windows/amd64, which is a first-class port.)

My takeaway from those cases is that persistent builder issues generally will not be addressed unless I address them myself (as in #33598, #31567, and #36163), or they actively interfere with running x/build/cmd/releasebot, or they are explicitly marked with the release-blocker label.

Letting these kinds of persistent issues linger was understandable as a short-term situation in 2020, but it isn't tenable as a steady state for a large, staffed project. We all want to land releases, and our existing policy (at least as I understand it) is that to land a release we also need to keep the build healthy. Adhering to that policy provides some backpressure on the accumulation of unplanned technical debt, and helps to “internalize the externality” of flaky and broken builders.

@ianlancetaylor (Contributor) commented:

I feel your pain, and many thanks for doing this unrewarding work.

Still, this seems like a process problem. I see no reason that an issue like this should block a release. If the linux-386-longtest builder failed 50% of the time then it would have to block the release because it might be hiding other problems. But that's not the case here; we are getting enough successful runs to have reason to believe that the build is basically OK.

You are pointing out that if these issues are not marked as release blockers, then they will never be addressed. That is the problem to fix. We shouldn't get ourselves into a position where we have to hold up a release because of a problem with the builders, where we have no reason to believe that the problem is or is hiding a problem with the release.

So I sympathize with the approach that you are taking, but I think we need to develop a different approach.

@bcmills (Contributor, Author) commented Jan 28, 2022

Still, this seems like a process problem.

I agree, but I think we already have a (documented but unenforced) solution to the process problem: namely, to align the incentives between shipping the release and maintaining a healthy build.

We shouldn't get ourselves into a position where we have to hold up a release because of a problem with the builders, where we have no reason to believe that the problem is or is hiding a problem with the release.

I agree, but we could also avoid getting ourselves into that position by identifying and addressing new failures when they are introduced.

If the builders are consistently passing, then they will almost always remain so unless failures are introduced or exposed by specific code or configuration changes. If we respond to new failures immediately, we can more easily identify the likely causes, and either roll back or adapt to the changes that introduced them. As I understand it, the release freeze exists precisely to allow time for new problems to be addressed before the scheduled release.

For example, the spike in the linux-386-longtest failure rate in August 2020 was at essentially the same time that the tree opened for development on Go 1.16, with six failures that month. If we had identified the failures and started investigating in depth then, we would have still had five months of development and freeze remaining in which to diagnose and mitigate the problem before even a single release window passed.

For another example, the failure rate for the NetBSD and OpenBSD builders spiked during the current cycle (#49209 and related issues) due to a latent platform bug exposed by a configuration change in mid-October. The compiler & runtime team began diagnosing it in earnest in early November, the builders were reconfigured to diagnose the problem, and a configuration that works around the problem was identified on Nov. 29 (#49209 (comment)) — all within the normal release freeze window (if barely). The process worked, but I suspect that it only even started when it did because I tagged that issue release-blocker on Oct. 28.

If we follow the process and fix the problems as they are introduced, it doesn't need to hold up the release. The apparent tension between fixing the builds and cutting the release on time is, as far as I can tell, a consequence of putting off the build-fixes until the release is already looming.

@bcmills (Contributor, Author) commented Jan 28, 2022

If we want to call a mulligan on the process problem, I suggest that we treat this issue as a release-blocker, but for Go 1.19 rather than 1.18.

I firmly believe that it should be a release-blocker for some release in the near future, because otherwise the incentives of the process are misaligned and we have no check on accumulating an unbounded amount of unplanned technical debt.

However, I agree that it does not indicate a regression specific to the 1.18 release, which is by all accounts a busy release cycle and major milestone and already delayed enough as it is.

Making it a release-blocker for Go 1.19 would allow adequate time to ship the 1.18 release, and provide an entire development window in which to investigate and mitigate this issue.

@dmitshur (Contributor) commented:

In our team meeting this week, we discussed this and agreed that the next step is to move this to the Go 1.19 milestone, to give us a chance to prioritize investigating it during the 1.19 dev cycle. To make progress here, we need to do development work on x/build/cmd/coordinator, and the 1.18 freeze isn't the right period for that work.

For the 1.18 freeze, if there are linux-386-longtest test failures that happen at a rate that is more than can be attributed to flaky failures, we'll need to look for optimal ways to reproduce and understand them even if the build logs do not include as much information as we'd like them to.

Updating this issue so that it reflects the latest reality; we can adjust it further as needed.

@dmitshur dmitshur modified the milestones: Go1.18, Go1.19 Jan 28, 2022
@dmitshur dmitshur added the early-in-cycle (a change that should be done early in the 3-month dev cycle) label Jan 28, 2022
@dmitshur dmitshur changed the title x/build: longtest failures missing error output x/build/cmd/coordinator: longtest failures are sometimes missing error output Jan 28, 2022
@gopherbot commented:

This issue is currently labeled as early-in-cycle for Go 1.19.
That time is now, so a friendly reminder to look at it again.

@heschi heschi removed this from Planned in Go Release Team Mar 16, 2022
@dmitshur dmitshur added this to Planned in Go Release Team Mar 29, 2022
@bcmills (Contributor, Author) commented Mar 30, 2022

It appears that the regexp I used in #39349 (comment) was even under-counting: the `signal: terminated` message does not occur on Windows. Widening the regexp to cover that case, we have:

greplogs --dashboard -md -l -e '(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z' --since=2022-01-05

2022-03-21T18:58:42-79103fa/windows-386-2008
2022-01-28T22:21:43-25b4b86/linux-386-longtest
2022-01-27T00:03:31-f4aa021/linux-386-longtest
2022-01-24T12:26:25-0ef6dd7/linux-386-longtest
2022-01-07T17:55:52-98ed916/linux-386-longtest
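
The difference from the earlier pattern is the `(?:…|\s*)` alternation, which also accepts a section header followed by nothing at all. A small Go check of both shapes (the sample tails are hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	pat := regexp.MustCompile(`(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z`)

	// Linux-style tail: header plus a "signal: terminated" line.
	fmt.Println(pat.MatchString("##### ../misc/cgo/testshared\nsignal: terminated\n")) // true
	// Windows-style tail: header with no output after it at all.
	fmt.Println(pat.MatchString("##### ../test\n")) // true
}
```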

@bcmills (Contributor, Author) commented Apr 5, 2022

One more today:

greplogs --dashboard -md -l -e '(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z' --since=2022-03-22

2022-04-05T09:26:01-a041a75/windows-386-2008

@bcmills bcmills changed the title x/build/cmd/coordinator: longtest failures are sometimes missing error output x/build/cmd/coordinator: failures are sometimes missing error output Apr 7, 2022
@dmitshur dmitshur removed the early-in-cycle label May 11, 2022
@aclements (Member) commented:

Coordinator folks, what builder logs would we expect if cmd/dist itself got stuck? Does the buildlet or coordinator impose a timeout that would be logged, or would we expect the coordinator to just kill off the VM and the log to stop?

@aclements (Member) commented:

@dmitshur says that until two weeks ago, if cmd/dist wedged, the builder would eventually get killed and the build retried. As of two weeks ago, it will now get killed and clearly marked as timed out, which suggests this is not cmd/dist wedging.

@dmitshur (Contributor) commented:

@aclements The relevant issue for the recent change is #42699 (comment). (Maybe it was 1 week ago instead of 2.)

@cagedmantis also pointed out that the occasional coordinator restarts may be contributing to this problem. Issue #51057 is related to that.

@toothrot: this comment was marked as off-topic.

@heschi (Contributor) commented Nov 8, 2022

Do we know if this is still happening? I have no idea if we've done anything to fix it, but there also haven't been complaints for months.

@bcmills (Contributor, Author) commented Nov 9, 2022

greplogs suggests that this probably hasn't happened since April, assuming I downloaded enough logs (I fetched -n=8192 from the main repo just now).

`greplogs --dashboard -md -l -e '(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z' --since=2022-04-05`
[2022-04-07T10:41:34-3a0cda4/windows-386-2008](https://build.golang.org/log/9c18da3fd5f1d9e0c736b0ac7e500d3238024dc6)
[2022-04-06T20:46:47-81ae993/windows-386-2008](https://build.golang.org/log/570256809bc996f56ae5829c2058bcfd003a8862)
[2022-04-05T09:26:01-a041a75/windows-386-2008](https://build.golang.org/log/20822144ed3fd0f9c91c733b4042ed4bd85dd1b7)

This may have been triggered by some underlying hang in a test in the test or misc section. Especially given the amount of ongoing cleanup happening in that part of the repo (especially @aclements' work on misc and cmd/dist), I'd be ok with closing this as “probably obsolete” for now.

@heschi (Contributor) commented Nov 9, 2022

Closing as "probably obsolete" :)

@heschi heschi closed this as completed Nov 9, 2022
@golang golang locked and limited conversation to collaborators Nov 9, 2023