Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/cmd/coordinator: trybots don't work on golang.org/dl repo #35581

Closed
dmitshur opened this issue Nov 14, 2019 · 4 comments
Closed

x/build/cmd/coordinator: trybots don't work on golang.org/dl repo #35581

dmitshur opened this issue Nov 14, 2019 · 4 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented Nov 14, 2019

There are two parts to this issues:

  1. The trybot builds start, fail, and keep retrying indefinitely. It's been 2 hours since https://go-review.googlesource.com/c/dl/+/207097/1#message-2732030f20d3db705eb32433ddacbb459f630a61, and it's still doing this:

    It should give up at some point and report a failure.

  2. The builds fail due to something being misplaced, likely because golang.org/dl is the only repo that doesn't have the 'x' element in the import path.

      builder: freebsd-amd64-12_0
          rev: e77106cc54af58d3eedbb134310668c6993474c7
     buildlet: http://10.240.0.11 GCE VM: buildlet-freebsd-12-0-rndfed6a2
      started: 2019-11-14 03:36:14.524802483 +0000 UTC m=+162177.592170276
        ended: 2019-11-14 03:36:46.471944701 +0000 UTC m=+162209.539312514
      success: false
    
    Events:
      2019-11-14T03:36:14Z checking_for_snapshot 
      2019-11-14T03:36:14Z finish_checking_for_snapshot after 18.9ms
      2019-11-14T03:36:14Z get_buildlet 
      2019-11-14T03:36:14Z awaiting_gce_quota 
      2019-11-14T03:36:14Z finish_awaiting_gce_quota after 0s
      2019-11-14T03:36:14Z create_gce_buildlet buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:14Z create_gce_instance buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:23Z finish_create_gce_instance after 8.29s; buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:23Z wait_buildlet_start buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:23Z got_instance_info waiting_for_buildlet...
      2019-11-14T03:36:41Z finish_wait_buildlet_start after 18.3s; buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:41Z finish_create_gce_buildlet after 26.8s; buildlet-freebsd-12-0-rndfed6a2
      2019-11-14T03:36:41Z finish_get_buildlet after 27s
      2019-11-14T03:36:41Z using_buildlet 10.240.0.11:80
      2019-11-14T03:36:41Z write_snapshot_tar 
      2019-11-14T03:36:46Z finish_write_snapshot_tar after 4.42s
      2019-11-14T03:36:46Z make_and_test 
      2019-11-14T03:36:46Z fetching_subrepo dl
      2019-11-14T03:36:46Z get_source 
      2019-11-14T03:36:46Z finish_get_source after 0s
      2019-11-14T03:36:46Z finish_make_and_test after 86.2ms; err=runTests: failed to determine go env GOMOD value: go env error: chdir /tmp/workdir/gopath/src/golang.org/dl: no such file or directory
    output:
    
    
    Build log:
    freebsd-amd64-12_0 at e77106cc54af58d3eedbb134310668c6993474c7 building dl at b01b4c2ff1bd61a7c9de81596442130f2ff07ca7
    
    
    
    Error: runTests: failed to determine go env GOMOD value: go env error: chdir /tmp/workdir/gopath/src/golang.org/dl: no such file or directory
    output:
    

The second issue is related to #30852. It seems a little more needs to be done for testing in module mode.

/cc @bcmills @bradfitz

@dmitshur dmitshur added Testing An issue that has been verified to require only test changes, not just a test failure. Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 14, 2019
@gopherbot gopherbot added this to the Unreleased milestone Nov 14, 2019
@dmitshur
Copy link
Contributor Author

For part 1 of this issue, the problem is as follows.

In runSubrepoTests, we're doing the work of determining if we're invoked in module mode or not:

func (st *buildStatus) runSubrepoTests() (remoteErr, err error) {
	[...]
	// Determine if we're invoked in module mode.
	// If using module mode, the absolute path to the go.mod of the main module.
	// If using GOPATH mode, the empty string.
	goMod, err := st.goMod(importPathOfRepo(st.SubName), goroot, gopath)
	if err != nil {
		return nil, fmt.Errorf("failed to determine go env GOMOD value: %v", err)
	}
	[...]

A failure to determine go env GOMOD is currently being reported as a "communications failure" rather than a "real test failure". This was chosen because we didn't know of a way in which go env GOMOD can fail due to a real problem in the source code.

The reason it's failing now is because the directory corresponding to the root of the module does not exist (for some reason, determining that reason is part 2 of this issue):

failed to determine go env GOMOD value: go env error: chdir /workdir/gopath/src/golang.org/dl: no such file or directory

It is intentional to keep retrying "communications failures" forever, because the expectation is that they should eventually succeed. There is a special case where some repeated communication errors are promoted to terminal errors:

// If a build fails multiple times due to communication
// problems with the buildlet, assume something's wrong with
// the buildlet or machine and fail the build, rather than
// looping forever. This promotes the err (communication
// error) to a remoteErr (an error that occurred remotely and
// is terminal).
if rerr := st.repeatedCommunicationError(err); rerr != nil {
	remoteErr = rerr
	err = nil
	doneMsg = "communication error to buildlet (promoted to terminal error): " + rerr.Error()
	fmt.Fprintf(st, "\n%s\n", doneMsg)
}

In this case, it's a bug in coordinator that's causing go env GOMOD to fail. We shouldn't promote it to a "real test failure" because it isn't a problem in the golang.org/dl module, it's a bug somewhere in coordinator. Once we fix this, this will not happen again.

(It is an option to add a "knownInternalCoordinatorBugError" helper that would promote a "failed to determine go env GOMOD value" error to be terminal, but I don't think it's worth doing at this time.)

So, what's left here is to understand why part 2 is happening and fix it.

@dmitshur dmitshur self-assigned this Dec 19, 2019
@dmitshur
Copy link
Contributor Author

dmitshur commented Dec 19, 2019

I see where the problem for part 2 is.

One of the first steps in runSubrepoTests is to:

// Check out the provided sub-repo to the buildlet's workspace.
// Need to do this first, so we can run go env GOMOD in it.
err = buildgo.FetchSubrepo(st.ctx, st, st.bc, st.SubName, st.SubRev)
if err != nil {
	return nil, err
}

That call is placing the golang.org/dl module in the wrong place. It's being placed in go/src/golang.org/x/dl instead of go/src/golang.org/dl.

The fix is trivial.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 19, 2019
@gopherbot
Copy link

Change https://golang.org/cl/212023 mentions this issue: cmd/coordinator: fix trybots for the "dl" repo more

@dmitshur
Copy link
Contributor Author

It is intentional to keep retrying "communications failures" forever, because the expectation is that they should eventually succeed.

I'm seeing now that this isn't quite true, so there is some room for improvement here. I opened issue #36226 about it for tracking purposes, but it's not a high priority.

@golang golang locked and limited conversation to collaborators Dec 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

2 participants