Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/internal/workflow: apparent deadlock in TestResume #49318

Closed
bcmills opened this issue Nov 3, 2021 · 7 comments
Closed

x/build/internal/workflow: apparent deadlock in TestResume #49318

bcmills opened this issue Nov 3, 2021 · 7 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 3, 2021

Spotted in a linux-amd64-race TryBot (https://storage.googleapis.com/go-build-log/6458b2e8/linux-amd64-race_089ef1b2.log):

panic: test timed out after 10m0s

goroutine 14 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1999 +0xbb
created by time.goFunc
	/workdir/go/src/time/sleep.go:176 +0x48

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00010eb60, {0x733ba5, 0xa}, 0x7423f8)
	/workdir/go/src/testing/testing.go:1457 +0x750
testing.runTests.func1(0x0?)
	/workdir/go/src/testing/testing.go:1809 +0x9a
testing.tRunner(0xc00010eb60, 0xc000105ba0)
	/workdir/go/src/testing/testing.go:1409 +0x214
testing.runTests(0xc0001145a0?, {0x8ecc60, 0x6, 0x6}, {0x40?, 0x7fb009723438?, 0x8f1c00?})
	/workdir/go/src/testing/testing.go:1807 +0x7e5
testing.(*M).Run(0xc0001145a0)
	/workdir/go/src/testing/testing.go:1689 +0xa72
main.main()
	_testmain.go:59 +0x2e5

goroutine 10 [select]:
golang.org/x/build/internal/workflow.(*Workflow).Run(0xc000076ff0, {0x78eb38?, 0xc00001a078}, {0x78da10, 0xc00000e258})
	/workdir/gopath/src/golang.org/x/build/internal/workflow/workflow.go:423 +0x549
golang.org/x/build/internal/workflow_test.runWorkflow(0xc00017c000, 0xc000219e60?, {0x78da10, 0xc00000e258})
	/workdir/gopath/src/golang.org/x/build/internal/workflow/workflow_test.go:257 +0xed
golang.org/x/build/internal/workflow_test.TestResume(0xc00017c000)
	/workdir/gopath/src/golang.org/x/build/internal/workflow/workflow_test.go:208 +0xba9
testing.tRunner(0xc00017c000, 0x7423f8)
	/workdir/go/src/testing/testing.go:1409 +0x214
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1456 +0x725
FAIL	golang.org/x/build/internal/workflow	600.158s
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 3, 2021
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Nov 3, 2021
@gopherbot gopherbot added this to the Unreleased milestone Nov 3, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Nov 3, 2021

@heschi
Copy link
Contributor

heschi commented Nov 3, 2021

It appears that the workflow has nothing to do, but it hasn't succeeded. That must mean that a task failed. The current design of the workflow package is that it will simply hang it that scenario waiting for someone to (e.g.) click a button to retry the task.

It's a little embarrassing, but I don't see how that can happen in this test, and I can't reproduce it in 10K local runs. For now, I've written a CL that adds timeouts to most tests and has the workflow give up if it's stuck.

@gopherbot
Copy link

Change https://golang.org/cl/361195 mentions this issue: internal/workflow: hang less

gopherbot pushed a commit to golang/build that referenced this issue Nov 3, 2021
We don't want to abort a workflow when a single task fails -- we want to
let other tasks finish if they can. However, once everything that can
run has, there's no point in sticking around. Exit in that case.

Also set timeouts on most tests.

Updates golang/go#49318.

Change-Id: I3dcdb7eb5703e6502cf7f155213cdad6595c4bac
Reviewed-on: https://go-review.googlesource.com/c/build/+/361195
Trust: Heschi Kreinick <heschi@google.com>
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Nov 10, 2021

The change to make the test give up if stuck seems to have at least changed the failure mode!

greplogs --dashboard -md -E 'FAIL: TestResume' --since=2021-11-04

2021-11-10T05:08:25-1e70b36-17980df/linux-amd64-race:

--- FAIL: TestResume (0.00s)
    workflow_test.go:287: task run once  : started
    workflow_test.go:291: task run once  : done: ran
    workflow_test.go:287: task block     : started
    workflow_test.go:289: task block     : error: context canceled
    workflow_test.go:210: canceled workflow returned error workflow has progressed as far as it can, wanted Canceled
FAIL
FAIL	golang.org/x/build/internal/workflow

2021-11-04T20:01:22-6cdc019-6d1fffa/linux-amd64-race:

--- FAIL: TestResume (0.00s)
    workflow_test.go:287: task run once  : started
    workflow_test.go:291: task run once  : done: ran
    workflow_test.go:287: task block     : started
    workflow_test.go:289: task block     : error: context canceled
    workflow_test.go:210: canceled workflow returned error workflow has progressed as far as it can, wanted Canceled
FAIL
FAIL	golang.org/x/build/internal/workflow

@bcmills
Copy link
Contributor Author

bcmills commented Nov 10, 2021

I can reproduce the failure about 50% of the time by adding time.Sleep(1 * time.Millisecond) at these two points together:

@bcmills
Copy link
Contributor Author

bcmills commented Nov 10, 2021

Mailed a fix. It's a race in (*Workflow).Run analogous to the one that errgroup.Group was designed to avoid.

@gopherbot
Copy link

Change https://golang.org/cl/362974 mentions this issue: internal/workflow: do not report errors to the listener after cancellation

@bcmills bcmills self-assigned this Nov 10, 2021
@rsc rsc unassigned bcmills Jun 23, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants