
runtime: apparent deadlock in image/gif test on linux-ppc64-buildlet #32613

Closed
bcmills opened this issue Jun 14, 2019 · 11 comments
Labels
FrozenDueToAge
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Testing: An issue that has been verified to require only test changes, not just a test failure.
Comments

@bcmills
Contributor

bcmills commented Jun 14, 2019

There was a timeout in the image/gif test, but from the symptoms it looks more like a runtime bug to me: one of the threads is idle on runtime.futex via runtime.mcall, and the other one says "goroutine running on other thread; stack unavailable".

That combination of symptoms is similar to #32327, although the path from runtime.mcall to runtime.futex differs.

https://build.golang.org/log/e08f0037958f84cf1b1fe6b9f80c8208d332104c

SIGQUIT: quit
PC=0x6be3c m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x2ad150, 0x8000000002, 0x0, 0x0, 0x12000, 0x46d40, 0x46704, 0xc00035e780, 0x46d7c, 0xc0000384c8, ...)
	/tmp/workdir-host-linux-ppc64-osu/go/src/runtime/sys_linux_ppc64x.s:472 +0x1c
runtime.futexsleep(0x2ad150, 0x200046708, 0xffffffffffffffff)
	/tmp/workdir-host-linux-ppc64-osu/go/src/runtime/os_linux.go:44 +0x3c
runtime.lock(0x2ad150)
	/tmp/workdir-host-linux-ppc64-osu/go/src/runtime/lock_futex.go:102 +0x1bc
runtime.exitsyscall0(0xc000076180)
	/tmp/workdir-host-linux-ppc64-osu/go/src/runtime/proc.go:3119 +0x7c
runtime.mcall(0xc000076180)
	/tmp/workdir-host-linux-ppc64-osu/go/src/runtime/asm_ppc64x.s:202 +0x58

goroutine 1 [chan receive, 3 minutes]:
testing.(*T).Run(0xc0000acd00, 0x18dbfc, 0x1b, 0x193748, 0x100000000000349)
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:961 +0x304
testing.runTests.func1(0xc0000ac000)
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:1207 +0x78
testing.tRunner(0xc0000ac000, 0xc000058d48)
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:909 +0xc8
testing.runTests(0xc00008c080, 0x2a7dc0, 0x18, 0x18, 0x0)
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:1205 +0x27c
testing.(*M).Run(0xc0000a8000, 0x0)
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:1122 +0x158
main.main()
	_testmain.go:96 +0x130

goroutine 30 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/tmp/workdir-host-linux-ppc64-osu/go/src/testing/testing.go:960 +0x2e8

r0   0xdd	r1   0x3fffd047b228
r2   0x8000000000	r3   0x2ad150
r4   0x80	r5   0x2
r6   0x0	r7   0x0
r8   0x0	r9   0x0
r10  0x0	r11  0x0
r12  0x0	r13  0x0
r14  0x1a0b0	r15  0xc000030e78
r16  0x0	r17  0xc000025900
r18  0x0	r19  0x1b4782
r20  0xc000020010	r21  0x2ad840
r22  0x0	r23  0x0
r24  0x8	r25  0x3fff8b55a520
r26  0x3fff8b55a578	r27  0x73
r28  0x72	r29  0x1
r30  0x2ad2a0	r31  0x19bbc
pc   0x6be3c	ctr  0x0
link 0x3a62c	xer  0x0
ccr  0x54400002	trap 0xc00
*** Test killed with quit: ran too long (4m0s).
FAIL	image/gif	240.002s

CC @laboger @aclements @mknyszek @randall77

@bcmills bcmills added the Testing and NeedsInvestigation labels Jun 14, 2019
@bcmills bcmills added this to the Go1.13 milestone Jun 14, 2019
@bcmills bcmills changed the title runtime: strange apparent deadlock in image/gif test on linux-ppc64-buildlet runtime: apparent deadlock in image/gif test on linux-ppc64-buildlet Jun 14, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@laboger
Contributor

laboger commented Aug 27, 2019

I looked through the dashboard failures; the timeouts in image/gif and runtime have happened on both the ppc64 and ppc64le power8 builders. We had not seen these timeouts on the power9 builder before today. (I do see a new power9 failure from today, but it is not the same as these.)

The power9 builder uses Debian 9. I believe Brad told me once that the power8 ppc64le builder used Debian 7, and I am not aware it has been upgraded since. Can we verify it's not the distro or kernel before spending too much time on these failures? It could be a kernel bug that has since been fixed.

Brad suggested using gomote to figure out the distro, but it looks like that requires a token, which I don't have.

@bcmills
Contributor Author

bcmills commented Sep 11, 2019

One more today (linux-ppc64-buildlet): https://build.golang.org/log/c6b2fbeb285314f08b608b6448f2e967415876cc

@dmitshur @toothrot: could you help with either a gomote token or distro information?

@laboger
Contributor

laboger commented Sep 12, 2019

I no longer think the distro is at fault here. Based on some comments I found, these builders are using Debian 8; my concern was that they might still be on Debian 7.

Can we bump the timeout for ppc64 up to 10m or so, just to rule out a deadlock versus something simply taking a long time?
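
Locally, the equivalent experiment is just the standard go test -timeout flag (this is not what the builder change itself would do):

    go test -timeout=10m image/gif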

@gopherbot

Change https://golang.org/cl/197237 mentions this issue: x/build: increase timeout for ppc64/ppc64le builders

@laboger
Contributor

laboger commented Sep 25, 2019

I was not aware that the builder machines only have 2 processors each. I will see whether that helps me reproduce the problem (our machines have at least 16, some many more).

So the default test parallelism should be 2, yet I've seen failure logs with many more than 2 goroutine stacks running tests. I guess that is fine, but I did not expect it; a sketch of why it happens is below.
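
For what it's worth, a minimal sketch (not taken from the image/gif tests; the test below is made up) of why that happens: testing.T.Run starts one goroutine per subtest, and the -parallel limit (which defaults to GOMAXPROCS) only caps how many tests that call t.Parallel() execute at once, not how many goroutines exist.

package example

import "testing"

// TestMany spawns eight subtest goroutines. With GOMAXPROCS=2 (and hence
// -parallel=2) at most two of them execute concurrently, but all eight
// goroutines exist and would show up in a full traceback.
func TestMany(t *testing.T) {
	for i := 0; i < 8; i++ {
		t.Run("sub", func(t *testing.T) {
			t.Parallel() // signals the parent to continue; this goroutine parks until it is scheduled
		})
	}
}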

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@laboger
Contributor

laboger commented Oct 11, 2019

I have not been able to reproduce this one with GOMAXPROCS=2. It would help to know what is on the stack that is unavailable; is there any way to get that information?

@bcmills
Contributor Author

bcmills commented Oct 11, 2019

It would help to know what is on the stack that is unavailable; is there any way to get that information?

@aclements, @ianlancetaylor: any tips on coaxing the runtime into providing more stacks?

@ianlancetaylor
Contributor

You can probably get more stacks by running with GOTRACEBACK=crash.
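
If you'd rather not touch the builder environment, here is a minimal sketch of doing the same thing from inside the test binary (the TestMain below is hypothetical, not part of the real image/gif tests): runtime/debug.SetTraceback accepts the same levels as GOTRACEBACK and should have the same effect as setting GOTRACEBACK=crash.

package gif

import (
	"os"
	"runtime/debug"
	"testing"
)

// TestMain is a hypothetical hook added only for this experiment.
func TestMain(m *testing.M) {
	// "crash" makes a SIGQUIT dump the stacks of goroutines running on other
	// OS threads (instead of "stack unavailable") and then abort with a core
	// dump.
	debug.SetTraceback("crash")
	os.Exit(m.Run())
}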

@gopherbot

Change https://golang.org/cl/203886 mentions this issue: env/linux-ppc64/osuosl: add Docker setup notes

gopherbot pushed a commit to golang/build that referenced this issue Nov 4, 2019
Collaboration with @tiborvass at Docker who got Docker running on
big-endian PPC64. Go for ppc64 doesn't support cgo or external
linking, so runc doesn't work, but a new OCI-compliant runc
implementation written in C (https://github.com/containers/crun) means
we can run Docker after all. See NOTES & build-*.sh

Then add a Dockerfile & associated cleanup in buildlet & stage0 to use
rundockerbuildlet.

Once done, might help with golang/go#35188, golang/go#32613, etc.

Fixes golang/go#34830
Updates golang/go#21260

Change-Id: I43d7afa1d58bbdfa16e3c57670bc41f1d1932d80
Reviewed-on: https://go-review.googlesource.com/c/build/+/203886
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
codebien pushed a commit to codebien/build that referenced this issue Nov 13, 2019 (same commit message as above)
@laboger
Contributor

laboger commented Dec 3, 2019

I don't think we've seen this since the move to Docker with a more recent kernel.

@bcmills
Contributor Author

bcmills commented Dec 4, 2019

Thanks. Closing on the theory that this was a kernel bug.

@bcmills bcmills closed this as completed Dec 4, 2019
@golang golang locked and limited conversation to collaborators Dec 3, 2020