Skip to content

x/build/cmd/relui: sometimes impossible to retry/approve tasks in Go minor release workflow, appearing as an "unknown task" error #70249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dmitshur opened this issue Nov 7, 2024 · 5 comments
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented Nov 7, 2024

During the recently Go minor release workflow, an issue was uncovered where some tasks that needed to be retried and/or approved couldn't be. Retrying the task, which certainly existed, produced the following error:

2024/11/06 17:39:04 s.w.RetryTask(_, "27c1b153-bb9f-4ab0-9b9a-935e6362b4ef"): unknown task "Run advisory builder go1.23-linux-amd64-msan-clang15"

Notably, other very similar tasks such as "Run advisory builder go1.22-linux-amd64-msan-clang15" were possible to successfully restart.

2024/11/06 17:43:17 TaskStateChanged("27c1b153-bb9f-4ab0-9b9a-935e6362b4ef", "Run advisory builder go1.22-linux-amd64-msan-clang15", &workflow.TaskState{Name:"Run advisory builder go1.22-linux-amd64-msan-clang15", Started:false, Finished:false, Result:interface {}(nil), SerializedResult:[]uint8(nil), Error:"", RetryCount:0})
2024/11/06 17:43:17 TaskStateChanged("27c1b153-bb9f-4ab0-9b9a-935e6362b4ef", "Run advisory builder go1.22-linux-amd64-msan-clang15", &workflow.TaskState{Name:"Run advisory builder go1.22-linux-amd64-msan-clang15", Started:true, Finished:false, Result:interface {}(nil), SerializedResult:[]uint8(nil), Error:"", RetryCount:0})

Restarting relui once didn't make a difference to the existing workflow, but restarting the entire workflow was an effective workaround.

This is the tracking issue to understand what was causing this condition to happen, and to fix it so it won't happen in future release workflows.

CC @golang/release.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 7, 2024
@dmitshur dmitshur added this to the Unreleased milestone Nov 7, 2024
@dmitshur dmitshur self-assigned this Nov 7, 2024
@dmitshur dmitshur changed the title x/build/cmd/relui: sometimes impossible to retry/approve tasks in Go minor release workflow x/build/cmd/relui: sometimes impossible to retry/approve tasks in Go minor release workflow, appearing as an "unknown task" error Nov 7, 2024
@dmitshur dmitshur moved this to In Progress in Go Release Nov 7, 2024
@dmitshur
Copy link
Contributor Author

dmitshur commented Nov 7, 2024

I was able to successfully reproduce this "unknown task" error condition locally with a simplified dry-run release workflow:

image

During the workflow for Go minor releases, both Go releases are built in parallel. Each one has a "Plan builders" expansion intended to plan out the appropriate builders dynamically and create "Run advisory builder" tasks for each one.

The workflow package documentation notes an important limitation of expansions:

running multiple expansions concurrently, is an error that will corrupt the workflow's state.

But there's nothing that guarantees the two "Plan builders" expansions don't run concurrently.

I looked more into why concurrent expansions aren't supported, and the issue is that each one begins with a copy of the workflow, modifies it as it runs, and then updates the workflow. If two run at once, one of them ends up not observing the result of the other, which explains why it was possible to retry the 1.22 tasks, but not 1.23 ones. (Also, restarting relui had a 50/50% chance of them running in a different order, such that it would've allowed resuming the workflow.)

I have the following test to capture the problem:

func TestManualRetryMultipleExpansions(t *testing.T) {
	var counters, retried [2]int
	wd := wf.New(wf.ACL{})
	sub1 := wd.Sub("sub1")
	sub2 := wd.Sub("sub2")
	for i, wd := range []*wf.Definition{sub1, sub2} {
		out := wf.Expand0(wd, fmt.Sprintf("expand %d", i+1), func(wd *wf.Definition) (wf.Value[string], error) {
			return wf.Task0(wd, fmt.Sprintf("work %d", i+1), func(ctx *wf.TaskContext) (string, error) {
				ctx.DisableRetries()
				counters[i]++
				if counters[i] == 1 {
					return "", fmt.Errorf("first try fail")
				}
				return "", nil
			}), nil
		})
		wf.Output(wd, "out", out)
	}

	w := startWorkflow(t, wd, nil)
	listener := &errorListener{
		taskName: "work 1",
		callback: func(string) {
			go func() {
				retried[0]++
				err := w.RetryTask(context.Background(), "work 1")
				if err != nil {
					t.Errorf(`RetryTask("work 1") failed: %v`, err)
					return
				}
			}()
		},
		Listener: &errorListener{
			taskName: "work 2",
			callback: func(string) {
				go func() {
					retried[1]++
					err := w.RetryTask(context.Background(), "work 2")
					if err != nil {
						t.Errorf(`RetryTask("work 2") failed: %v`, err)
						return
					}
				}()
			},
			Listener: &verboseListener{t},
		},
	}
	runWorkflow(t, w, listener)
	if counters[0] != 2 {
		t.Errorf("sub1 task ran %v times, wanted 2", counters[0])
	}
	if retried[0] != 1 {
		t.Errorf("sub1 task was retried %v times, wanted 1", retried[0])
	}
	if counters[1] != 2 {
		t.Errorf("sub2 task ran %v times, wanted 2", counters[1])
	}
	if retried[1] != 1 {
		t.Errorf("sub2 task was retried %v times, wanted 1", retried[1])
	}
}

At tip of x/build, it fails due to the "unknown task" error:

$ go test -v -race -run=^TestManualRetryMultipleExpansions$ ./internal/workflow
=== RUN   TestManualRetryMultipleExpansions
    workflow_test.go:667: task sub1: expand 1: started
    workflow_test.go:667: task sub2: expand 2: started
    workflow_test.go:671: task sub2: expand 2: done: <nil>
    workflow_test.go:667: task work 2    : started
    workflow_test.go:671: task sub1: expand 1: done: <nil>
    workflow_test.go:667: task work 1    : started
    workflow_test.go:669: task work 1    : error: first try fail
    workflow_test.go:669: task work 2    : error: first try fail
    workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
    workflow_test.go:686: task work 1    : LOG: Manual retry requested
    workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
    workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
    workflow_test.go:667: task work 1    : started
    workflow_test.go:374: RetryTask("work 2") failed: unknown task "work 2"
    workflow_test.go:671: task work 1    : done: 
    workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
    workflow_test.go:382: w.Run() = _, context deadline exceeded, wanted no error
--- FAIL: TestManualRetryMultipleExpansions (10.00s)
FAIL
FAIL	golang.org/x/build/internal/workflow	10.272s
FAIL

@dmitshur
Copy link
Contributor Author

dmitshur commented Nov 7, 2024

I considered how we can fix this. One path is to modify the workflow so that it can't run two expansions in parallel. This is tricky, and would make the workflow code less readable - it currently calls addSingleReleaseWorkflow twice for 2 minor releases, a nice property. It'd be great if it were viable to add support for running expansions in parallel, but that's not something I want to try to tackle now.

Instead, it seems to work well to limit any given workflow to run no more than one expansion at a time. The change is small, easier to reason about, and handles our current needs for expansions. I'll send a CL for that.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. FixPending Issues that have a fix which has not yet been reviewed or submitted. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 7, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/626336 mentions this issue: internal/workflow: fix sub-workflow prefix for tasks added via expansion

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/626335 mentions this issue: internal/workflow: run at most one expansion at a time

gopherbot pushed a commit to golang/build that referenced this issue Nov 19, 2024
Tasks added by expansions, in contrast to all other tasks, didn't handle
the sub-workflow prefix. A while back I looked into it briefly, and saw
that shallowClone didn't clone that one field, so figured CL 546235
would fix it.

It didn't, and now it's apparent to me why not. The expansion always
gets a copy of the top-level workflow definition, even if it was made
by a task with some prefix. Modify addExpansion to record not just that
a given task is an expansion, but also the workflow prefix at that time.

We also have a test that makes it easy to verify this works now.
Keep shallowClone as is, to match the behavior implied by its name,
even though by now the namePrefix it clones gets overwritten anyway.

For golang/go#70249.

Change-Id: Ib1cb34ef121baf946fe6f2500c4bf1611aaa6db7
Reviewed-on: https://go-review.googlesource.com/c/build/+/626336
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Release Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Projects
Archived in project
Development

No branches or pull requests

3 participants