runtime: apparent deadlock in TestCgoNumGoroutine #39024

Closed
bcmills opened this issue May 12, 2020 · 13 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Contributor

bcmills commented May 12, 2020

2020-05-11T22:38:32-8c1db77/openbsd-amd64-64

--- FAIL: TestCgoNumGoroutine (60.25s)
    crash_test.go:95: testprogcgo NumGoroutine exit status: exit status 2
    crash_cgo_test.go:417: expected "OK\n" got SIGQUIT: quit
        PC=0x469fff m=6 sigcode=0
        
        goroutine 0 [idle]:
        runtime.thrsleep(0xc00002f738, 0x200000003, 0x0, 0x0, 0xc00002f738, 0x58, 0xc00006a000, 0x8000, 0x0, 0xc000001980, ...)
        	/tmp/workdir/go/src/runtime/sys_openbsd_amd64.s:72 +0x1f
        runtime.semasleep(0xffffffffffffffff, 0x200e7039c)
        	/tmp/workdir/go/src/runtime/os_openbsd.go:167 +0xb4
        runtime.notesleep(0x88c178)
        	/tmp/workdir/go/src/runtime/lock_sema.go:181 +0xcf
        runtime.templateThread()
        	/tmp/workdir/go/src/runtime/proc.go:1863 +0xfa
        runtime.mstart1()
        	/tmp/workdir/go/src/runtime/proc.go:1156 +0xc8
        runtime.mstart()
        	/tmp/workdir/go/src/runtime/proc.go:1121 +0x6e
        
        goroutine 1 [syscall]:
        main._Cfunc_CheckNumGoroutine()
        	_cgo_gotypes.go:139 +0x45
        main.NumGoroutine()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/numgoroutine.go:49 +0x59
        main.main()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/main.go:34 +0x1da
        
        rax    0x58
        rbx    0xc00002f400
        rcx    0x469fff
        rdx    0x0
        rdi    0xc00002f738
        rsi    0x3
        rbp    0x200e70370
        rsp    0x200e70310
        r8     0xc00002f738
        r9     0x0
        r10    0x0
        r11    0x246
        r12    0x5205e0
        r13    0x7f7ffffd72f0
        r14    0xc000001980
        r15    0x4398c0
        rip    0x469fff
        rflags 0x246
        cs     0x2b
        fs     0x0
        gs     0x0
FAIL
FAIL	runtime	105.009s

Marking as release-blocker until we understand whether this is a regression.

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels May 12, 2020
@bcmills bcmills added this to the Go1.15 milestone May 12, 2020
@aclements
Member

/cc @mknyszek

@aclements
Member

/cc @prattmic

@mknyszek
Contributor

I'll take a look at this one after the runtime/trace test failures, if no one else gets to it first.

@prattmic
Member

Similar failure in another openbsd cgo test:

2020-05-12T19:15:34-cb11c98/openbsd-386-62

--- FAIL: TestEnsureDropM (120.07s)
    crash_test.go:95: testprogcgo EnsureDropM exit status: exit status 2
    crash_cgo_test.go:174: expected "OK\n", got SIGQUIT: quit
        PC=0x1c05a717 m=6 sigcode=0
        
        goroutine 0 [idle]:
        runtime.thrsleep(0x3c42cacc, 0x3, 0x0, 0x0, 0x3c42cacc, 0x4, 0x1c041aa2, 0x7c28a9ec, 0x0, 0x0, ...)
        	/tmp/workdir/go/src/runtime/sys_openbsd_386.s:384 +0x7
        runtime.semasleep(0xffffffff, 0xffffffff, 0x3c42c900)
        	/tmp/workdir/go/src/runtime/os_openbsd.go:167 +0xcc
        runtime.notesleep(0x3c1223fc)
        	/tmp/workdir/go/src/runtime/lock_sema.go:181 +0xda
        runtime.templateThread()
        	/tmp/workdir/go/src/runtime/proc.go:1863 +0xd4
        runtime.mstart1()
        	/tmp/workdir/go/src/runtime/proc.go:1156 +0x8f
        runtime.mstart()
        	/tmp/workdir/go/src/runtime/proc.go:1121 +0x4f
        
        goroutine 1 [syscall]:
        main._Cfunc_CheckM()
        	_cgo_gotypes.go:127 +0x2d
        main.EnsureDropM()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/dropm.go:57 +0x14
        main.main()
        	/tmp/workdir/go/src/runtime/testdata/testprogcgo/main.go:34 +0x148
        
        eax    0x4
        ebx    0xffffffff
        ecx    0x0
        edx    0x3c42cacc
        edi    0x1c031b40
        esi    0x3c400d20
        ebp    0x1
        esp    0x7c28a9c4
        eip    0x1c05a717
        eflags 0x206
        cs     0x2b
        fs     0x7c24005b
        gs     0x9720063

@mknyszek mknyszek self-assigned this May 19, 2020
@mknyszek
Contributor

I've been running this test continuously (directly by compiling the testprog, and via go test) on openbsd-amd64-64 for about an hour now and haven't been able to reproduce at tip.

Going by the failures above, the template thread is sleeping (not unexpected) and goroutine 1 is in C code. Both failures look like a timeout while the goroutine sits in C code, but there isn't much to go on here. I might try to see whether it reproduces more easily on 386.

@mknyszek
Contributor

Both EnsureDropM and NumGoroutine call a C function that creates a C thread, which then calls into Go code. In both deadlocks above, the goroutine is blocked in C code, so it is probably waiting for that other C thread to do what it needs to do, and that thread may not have reached Go code yet.

I pored over the needm logic and can't think of a way it could deadlock, e.g. via a C->Go thread waiting indefinitely for an m that never comes, so I don't think the problem is there unless we get more evidence.

@aclements
Member

Similar recent failures:

$ greplogs -dashboard -e "crash_cgo_test.*SIGQUIT" -md -l

2020-05-12T19:15:34-cb11c98/openbsd-386-62
2020-05-11T22:38:32-8c1db77/openbsd-amd64-64
2020-05-01T05:25:54-e1d1684/freebsd-arm64-dmgk
2020-04-29T20:33:31-197a2a3/netbsd-amd64-9_0

Before those, there's nothing going back to 2018. The latter two (freebsd-arm64-dmgk and netbsd-amd64-9_0) seem to have a lot more going on, so they might not be the same failure.

@cagedmantis
Contributor

I think that it is ok to work on this after beta1.

@cagedmantis cagedmantis added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 21, 2020
@toothrot toothrot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2020
@aclements
Member

Friendly ping @mknyszek, did you happen to make any more progress on this?

@mknyszek
Contributor

No update, sorry about that. I tried again but haven't been able to reproduce this at all.

@aclements
Member

There haven't been any failures since 2020-05-12. I've started a stress test at 8c1db77 on openbsd-amd64-64.

gopool2 create -setup 'gomote push $VM && gomote run $VM go/src/make.bash' openbsd-amd64-64
stress2 -p 8 -max-passes 100000 -pass '^ok' gopool2 run 'gomote run -e GO_TEST_TIMEOUT_SCALE=2 $VM go/bin/go test -short runtime'

@aclements
Member

100,000 runs with no failures.

@aclements
Member

Still no failures since 2020-05-12. I'm inclined to close this. I don't think we have enough information to debug this, and it seems rare to the point of possibly not happening any more.

@golang golang locked and limited conversation to collaborators Jun 25, 2021