-
Notifications
You must be signed in to change notification settings - Fork 18k
x/build/cmd/relui: sometimes impossible to retry/approve tasks in Go minor release workflow, appearing as an "unknown task" error #70249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I was able to successfully reproduce this "unknown task" error condition locally with a simplified dry-run release workflow: During the workflow for Go minor releases, both Go releases are built in parallel. Each one has a "Plan builders" expansion intended to plan out the appropriate builders dynamically and create "Run advisory builder" tasks for each one. The
But there's nothing that guarantees the two "Plan builders" expansions don't run concurrently. I looked more into why concurrent expansions aren't supported, and the issue is that each one begins with a copy of the workflow, modifies it as it runs, and then updates the workflow. If two run at once, one of them ends up not observing the result of the other, which explains why it was possible to retry the 1.22 tasks, but not 1.23 ones. (Also, restarting relui had a 50/50% chance of them running in a different order, such that it would've allowed resuming the workflow.) I have the following test to capture the problem: func TestManualRetryMultipleExpansions(t *testing.T) {
var counters, retried [2]int
wd := wf.New(wf.ACL{})
sub1 := wd.Sub("sub1")
sub2 := wd.Sub("sub2")
for i, wd := range []*wf.Definition{sub1, sub2} {
out := wf.Expand0(wd, fmt.Sprintf("expand %d", i+1), func(wd *wf.Definition) (wf.Value[string], error) {
return wf.Task0(wd, fmt.Sprintf("work %d", i+1), func(ctx *wf.TaskContext) (string, error) {
ctx.DisableRetries()
counters[i]++
if counters[i] == 1 {
return "", fmt.Errorf("first try fail")
}
return "", nil
}), nil
})
wf.Output(wd, "out", out)
}
w := startWorkflow(t, wd, nil)
listener := &errorListener{
taskName: "work 1",
callback: func(string) {
go func() {
retried[0]++
err := w.RetryTask(context.Background(), "work 1")
if err != nil {
t.Errorf(`RetryTask("work 1") failed: %v`, err)
return
}
}()
},
Listener: &errorListener{
taskName: "work 2",
callback: func(string) {
go func() {
retried[1]++
err := w.RetryTask(context.Background(), "work 2")
if err != nil {
t.Errorf(`RetryTask("work 2") failed: %v`, err)
return
}
}()
},
Listener: &verboseListener{t},
},
}
runWorkflow(t, w, listener)
if counters[0] != 2 {
t.Errorf("sub1 task ran %v times, wanted 2", counters[0])
}
if retried[0] != 1 {
t.Errorf("sub1 task was retried %v times, wanted 1", retried[0])
}
if counters[1] != 2 {
t.Errorf("sub2 task ran %v times, wanted 2", counters[1])
}
if retried[1] != 1 {
t.Errorf("sub2 task was retried %v times, wanted 1", retried[1])
}
} At tip of x/build, it fails due to the "unknown task" error: $ go test -v -race -run=^TestManualRetryMultipleExpansions$ ./internal/workflow
=== RUN TestManualRetryMultipleExpansions
workflow_test.go:667: task sub1: expand 1: started
workflow_test.go:667: task sub2: expand 2: started
workflow_test.go:671: task sub2: expand 2: done: <nil>
workflow_test.go:667: task work 2 : started
workflow_test.go:671: task sub1: expand 1: done: <nil>
workflow_test.go:667: task work 1 : started
workflow_test.go:669: task work 1 : error: first try fail
workflow_test.go:669: task work 2 : error: first try fail
workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
workflow_test.go:686: task work 1 : LOG: Manual retry requested
workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
workflow_test.go:667: task work 1 : started
workflow_test.go:374: RetryTask("work 2") failed: unknown task "work 2"
workflow_test.go:671: task work 1 : done:
workflow_test.go:658: workflow "cd451b52-c1e7-409a-8fd4-f9fb391cb864": stalled
workflow_test.go:382: w.Run() = _, context deadline exceeded, wanted no error
--- FAIL: TestManualRetryMultipleExpansions (10.00s)
FAIL
FAIL golang.org/x/build/internal/workflow 10.272s
FAIL |
I considered how we can fix this. One path is to modify the workflow so that it can't run two expansions in parallel. This is tricky, and would make the workflow code less readable - it currently calls addSingleReleaseWorkflow twice for 2 minor releases, a nice property. It'd be great if it were viable to add support for running expansions in parallel, but that's not something I want to try to tackle now. Instead, it seems to work well to limit any given workflow to run no more than one expansion at a time. The change is small, easier to reason about, and handles our current needs for expansions. I'll send a CL for that. |
Change https://go.dev/cl/626336 mentions this issue: |
Change https://go.dev/cl/626335 mentions this issue: |
Tasks added by expansions, in contrast to all other tasks, didn't handle the sub-workflow prefix. A while back I looked into it briefly, and saw that shallowClone didn't clone that one field, so figured CL 546235 would fix it. It didn't, and now it's apparent to me why not. The expansion always gets a copy of the top-level workflow definition, even if it was made by a task with some prefix. Modify addExpansion to record not just that a given task is an expansion, but also the workflow prefix at that time. We also have a test that makes it easy to verify this works now. Keep shallowClone as is, to match the behavior implied by its name, even though by now the namePrefix it clones gets overwritten anyway. For golang/go#70249. Change-Id: Ib1cb34ef121baf946fe6f2500c4bf1611aaa6db7 Reviewed-on: https://go-review.googlesource.com/c/build/+/626336 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
During the recently Go minor release workflow, an issue was uncovered where some tasks that needed to be retried and/or approved couldn't be. Retrying the task, which certainly existed, produced the following error:
Notably, other very similar tasks such as "Run advisory builder go1.22-linux-amd64-msan-clang15" were possible to successfully restart.
Restarting relui once didn't make a difference to the existing workflow, but restarting the entire workflow was an effective workaround.
This is the tracking issue to understand what was causing this condition to happen, and to fix it so it won't happen in future release workflows.
CC @golang/release.
The text was updated successfully, but these errors were encountered: