Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: apparent deadlock in TestWaitGroupMisuse3 #35774

Closed
bcmills opened this issue Nov 22, 2019 · 7 comments
Closed

sync: apparent deadlock in TestWaitGroupMisuse3 #35774

bcmills opened this issue Nov 22, 2019 · 7 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 22, 2019

Looks like there has been a recent regression in sync.TestWaitGroupMisuse3.

It's not obvious to me whether this is a bug in the test itself, or a bug in the runtime (or the sync package) exposed by the test.

panic: test timed out after 3m0s

goroutine 3650 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1422 +0x145
created by time.goFunc
	/workdir/go/src/time/sleep.go:168 +0x46

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0xc00014c7e0, 0x60f12e, 0x14, 0x617d10, 0x5dd71d00)
	/workdir/go/src/testing/testing.go:1006 +0x68e
testing.runTests.func1(0xc00014c000)
	/workdir/go/src/testing/testing.go:1247 +0xdc
testing.tRunner(0xc00014c000, 0xc00011bc70)
	/workdir/go/src/testing/testing.go:954 +0x19c
testing.runTests(0xc00012c020, 0x746ba0, 0x1f, 0x1f, 0x7f7af6be0100)
	/workdir/go/src/testing/testing.go:1245 +0x3a5
testing.(*M).Run(0xc000148000, 0x0)
	/workdir/go/src/testing/testing.go:1162 +0x454
main.main()
	_testmain.go:198 +0xc1

goroutine 3444 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc000466078)
	/workdir/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc000466070)
	/workdir/go/src/sync/waitgroup.go:130 +0xd6
sync_test.TestWaitGroupMisuse3(0xc00014c7e0)
	/workdir/go/src/sync/waitgroup_test.go:174 +0x2b0
testing.tRunner(0xc00014c7e0, 0x617d10)
	/workdir/go/src/testing/testing.go:954 +0x19c
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1005 +0x660
FAIL	sync	180.007s

2019-11-21T23:15:28-bbbc658/linux-amd64-noopt
2019-11-03T01:44:46-d2c039f/openbsd-amd64-62
2019-11-01T07:59:09-5b31021/linux-386-387

CC @aclements @ianlancetaylor @cherrymui @rhysh @CAFxX

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 22, 2019
@bcmills bcmills added this to the Go1.14 milestone Nov 22, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Nov 22, 2019

Hmm, the first failure predates CL 206180, so that can't be implicated. I don't see any other changes to sync, so this looks like a runtime bug to me.

@rhysh
Copy link
Contributor

rhysh commented Nov 23, 2019

I took a shot at pinning this down, but it appears present (on linux/amd64) in go1.13.3 and as long ago as go1.7.6 (the oldest I can easily cross-compile). It's possible that it's become more frequent.

With -test.run=TestWaitGroupMisuse3 and without -test.timeout, it shows up as fatal error: all goroutines are asleep - deadlock!. For example, with go1.13.3 on a big linux/amd64 machine with GOMAXPROCS trimmed down to 4, using stress:

492232 runs so far, 5 failures

/tmp/go-stress101518475
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0000ac100, 0x5ac2f4, 0x14, 0x5b4a70, 0x483601)
	/usr/local/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc0000ac000)
	/usr/local/go/src/testing/testing.go:1202 +0x78
testing.tRunner(0xc0000ac000, 0xc00007edc0)
	/usr/local/go/src/testing/testing.go:909 +0xc9
testing.runTests(0xc00000c0e0, 0x6e8ac0, 0x1f, 0x1f, 0x0)
	/usr/local/go/src/testing/testing.go:1200 +0x2a7
testing.(*M).Run(0xc00009e000, 0x0)
	/usr/local/go/src/testing/testing.go:1117 +0x176
main.main()
	_testmain.go:198 +0x135

goroutine 6 [semacquire]:
sync.runtime_Semacquire(0xc000012138)
	/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc000012130)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
sync_test.TestWaitGroupMisuse3(0xc0000ac100)
	/usr/local/go/src/sync/waitgroup_test.go:174 +0x206
testing.tRunner(0xc0000ac100, 0x5b4a70)
	/usr/local/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:960 +0x350


ERROR: exit status 2

@ianlancetaylor
Copy link
Contributor

This hasn't happened since November 21. The failure was uncommon before then so this doesn't prove that it's been fixed. But it could have been fixed by https://golang.org/cl/208379, committed November 22. CC @mknyszek .

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor
Copy link
Contributor

When there are concurrent calls to Add and Wait, there is a race: Add using a negative number (as when called from Done) checks whether *statep has changed, panics if it has, and otherwise sets it to 0; these operations are not atomic. Wait does an atomic increment on *statep. If the atomic increment occurs between the check of whether *statep has changed and when it is set to 0, the WaitGroup is out of sync. At that point it is possible that nothing will wake up the goroutine sleeping in Wait, which is what we see.

@ianlancetaylor
Copy link
Contributor

Ah, never mind, it's a lot simpler than that. If one of the helper goroutines panics, the main goroutine's Wait may hang forever.

@gopherbot
Copy link

Change https://golang.org/cl/210218 mentions this issue: sync: deflake TestWaitGroupMisuse3

@golang golang locked and limited conversation to collaborators Dec 5, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

4 participants