Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: timeout in TestAllThreadsSyscallSignals #44193

Closed
bcmills opened this issue Feb 9, 2021 · 19 comments
Closed

os/signal: timeout in TestAllThreadsSyscallSignals #44193

bcmills opened this issue Feb 9, 2021 · 19 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Feb 9, 2021

It's not clear to me whether this is a deadlock, a livelock, or just a slow test, but the similarity to #43149 is concerning (CC @AndrewGMorgan @ianlancetaylor).

panic: test timed out after 3m0s

goroutine 5 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1701 +0xcc
created by time.goFunc
	/workdir/go/src/time/sleep.go:180 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00009c300, 0x17407a, 0x1c, 0x179b58, 0xf)
	/workdir/go/src/testing/testing.go:1240 +0x280
testing.runTests.func1(0xc00009c180)
	/workdir/go/src/testing/testing.go:1512 +0x78
testing.tRunner(0xc00009c180, 0xc0000a7d58)
	/workdir/go/src/testing/testing.go:1194 +0xd8
testing.runTests(0xc0000b4018, 0x259a60, 0x12, 0x12, 0xc000d45ed867509a, 0x29e8e2495d, 0x25d820, 0x1705ec)
	/workdir/go/src/testing/testing.go:1510 +0x2b4
testing.(*M).Run(0xc0000c6000, 0x0)
	/workdir/go/src/testing/testing.go:1418 +0x1a0
main.main()
	_testmain.go:81 +0x130

goroutine 18 [runnable]:
syscall.runtime_doAllThreadsSyscall(0xc00008a4d0)
	/workdir/go/src/runtime/proc.go:1669 +0x384
syscall.AllThreadsSyscall(0xab, 0x8, 0x0, 0x0, 0xdcc10, 0x229ed0, 0x7a32c)
	/workdir/go/src/syscall/syscall_linux.go:1064 +0xa8
os/signal.TestAllThreadsSyscallSignals(0xc00009c300)
	/workdir/go/src/os/signal/signal_linux_test.go:23 +0x40
testing.tRunner(0xc00009c300, 0x179b58)
	/workdir/go/src/testing/testing.go:1194 +0xd8
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1239 +0x264
FAIL	os/signal	180.104s

2021-02-09T18:40:13-e9c9683/linux-ppc64-buildlet

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 9, 2021
@bcmills bcmills added this to the Go1.16 milestone Feb 9, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Feb 9, 2021

Tentatively milestoned to Go 1.16 because the function under test is new (https://tip.golang.org/doc/go1.16#syscall).

@ceseo
Copy link
Contributor

ceseo commented Feb 10, 2021

@laboger FYI

@AndrewGMorgan
Copy link
Contributor

Please assign this to me. I have no idea what might have happened here, yet, but I'll try to figure it out.

Looking at the build.golang.org status page, is this a one-off failure?

@ianlancetaylor
Copy link
Contributor

This does seem to have occurred only once.

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Feb 10, 2021

Quickly looking at the code https://tip.golang.org/src/os/signal/signal_linux_test.go it looks like the test was evaluating whether or not it should be skipped at the time of the 3 minute timeout.

Capturing the following just in case it otherwise gets lost at some point.

linux-ppc64-buildlet at e9c96835971044aa4ace37c7787de231bbde05d9

:: Running /workdir/go/src/make.bash with args ["/workdir/go/src/make.bash"] and env ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" "HOSTNAME=ppc64_01" "GO_BUILDER_ENV=host-linux-ppc64-osu" "DEBIAN_FRONTEND=noninteractive" "GOROOT_BOOTSTRAP=/workdir/go1.4" "GO_BUILD_KEY_DELETE_AFTER_READ=true" "GO_BUILD_KEY_PATH=/buildkey/gobuildkey" "HOME=/root" "USER=root" "GO_STAGE0_NET_DELAY=500ms" "GO_STAGE0_DL_DELAY=800ms" "WORKDIR=/workdir" "GO_BUILDER_NAME=linux-ppc64-buildlet" "GO_BUILDER_FLAKY_NET=1" "GOROOT_BOOTSTRAP=/usr/local/go-bootstrap" "GOBIN=" "TMPDIR=/workdir/tmp" "GOCACHE=/workdir/gocache" "GOROOT_BOOTSTRAP=/usr/local/go-bootstrap"] in dir /workdir/go/src

Building Go cmd/dist using /usr/local/go-bootstrap. (go1.8 linux/ppc64)
Building Go toolchain1 using /usr/local/go-bootstrap.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/ppc64.
---
Installed Go for linux/ppc64 in /workdir/go
Installed commands in /workdir/go/bin
[... see failure at top of bug for end of this log...]

@dmitshur dmitshur modified the milestones: Go1.16, Go1.17 Feb 11, 2021
@AndrewGMorgan
Copy link
Contributor

#42178 (comment) concerning the ppc64 build supports the detail that this code is CGO_ENABLED=0. My recollection of working on resolving that issue was that this architecture was noticeably slower overall than the systems I typically work with. This lends some support for a timeout being more likely on this architecture. However, the test should be nowhere near 3 minutes of runtime.

That being said, the code in this crash trace does not appear to have timed out while running the loop inside the test, but before that loop even starts and in the code after all the syscall parts on all the threads have run and just as code is trying to unstop the world (and reenable GC).

I've not yet reproduced the failure, and the build servers appear not to have failed again. But I'm still investigating.

@bcmills
Copy link
Contributor Author

bcmills commented Feb 12, 2021

@AndrewGMorgan, another theory to consider: perhaps the actual slowdown was in some other os/signal test, and there simply wasn't enough time remaining, and TestAllThreadsSyscallSignals just happened to be the one of the remaining tests that was running when the timeout passed.

@AndrewGMorgan
Copy link
Contributor

Did this ever repeat on the build servers?

Running against HEAD, the whole os/signal (nocgo) test normally completes in about 8.5 seconds. So, a 3 minute timeout suggests something very wrong happened.

Compressed into one bug update, hammering on this test for many hundreds of iterations, with some print() instrumentation, I have been able to reproduce the failure on the build server (linux-ppc64-buildlet-0)!

It's elusive and I'm not convinced it reproduces in the same place every time (different print()s are completing before the panic strikes, but I have also found that too many print()s can make it clear up - compiler subtlety?)... All the lockups, however, do seem to be within the startTheWorldGC() call at the very end of the syscall_runtime_doAllThreadsSyscall() function.

I'm using this command to reproduce the failure (it happens after a few attempts in a row, and a pass takes about 23s, so a 40s timeout seems fair):

~/go/bin/gomote run user-${USER}-linux-ppc64-buildlet-0 go/bin/go test os/signal -test.run TestAllThreadsSyscall -test.count=1000 -test.timeout=40s 

I'll keep trying to narrow it down, but we might want to get someone more familiar with startTheWorldGC() etc to take a crack at it. I'm less and less certain the issue is with the actual thread fixup support - which is the novel part of syscall_runtime_doAllThreadsSyscall()...

@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2021

Did this ever repeat on the build servers?

Turns out it did! Just not on ppc64:
2021-02-23T15:04:12-ab331c0/linux-amd64-nocgo

@bcmills bcmills changed the title os/signal: timeout in TestAllThreadsSyscallSignals on linux-ppc64-buildlet os/signal: timeout in TestAllThreadsSyscallSignals Mar 17, 2021
@AndrewGMorgan
Copy link
Contributor

Interesting. This should be easier to debug. That being said, so far, my recipe for reproducing it does not work on this architecture for me.

Just to keep all the crash dumps in the same place this linux-amd64-nocgo failure was:

panic: test timed out after 3m0s

goroutine 17 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1701 +0xe5
created by time.goFunc
	/workdir/go/src/time/sleep.go:180 +0x45

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000001380, 0x56e655, 0x1c, 0x574180, 0x487dc6)
	/workdir/go/src/testing/testing.go:1240 +0x2da
testing.runTests.func1(0xc000001200)
	/workdir/go/src/testing/testing.go:1512 +0x78
testing.tRunner(0xc000001200, 0xc000041de0)
	/workdir/go/src/testing/testing.go:1194 +0xef
testing.runTests(0xc00000c030, 0x651a60, 0x12, 0x12, 0xc00564c8d5926aa1, 0x29e8dd82f0, 0x655840, 0x56ac27)
	/workdir/go/src/testing/testing.go:1510 +0x305
testing.(*M).Run(0xc000106000, 0x0)
	/workdir/go/src/testing/testing.go:1418 +0x1eb
main.main()
	_testmain.go:81 +0x138

goroutine 6 [chan send]:
syscall.runtime_doAllThreadsSyscall(0xc0001a0000)
	/workdir/go/src/runtime/proc.go:1585 +0x6d
syscall.AllThreadsSyscall(0x9d, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/syscall/syscall_linux.go:1064 +0xa6
os/signal.TestAllThreadsSyscallSignals(0xc000001380)
	/workdir/go/src/os/signal/signal_linux_test.go:32 +0x10e
testing.tRunner(0xc000001380, 0x574180)
	/workdir/go/src/testing/testing.go:1194 +0xef
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1239 +0x2b3

goroutine 33 [runnable]:
os/signal.loop()
	/workdir/go/src/os/signal/signal_unix.go:22
created by os/signal.Notify.func1.1
	/workdir/go/src/os/signal/signal.go:151 +0x45
FAIL	os/signal	180.038s
FAIL
go tool dist: Failed: exit status 1

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Mar 18, 2021

A modified recipe.

Compile the test into a stand alone binary CGO_ENABLED=0 ../bin/go test -c os/signal. Then run the test -test.count=10000 -test.v=1 -test.run=TestAllThreadsSyscall as a regular user. If it eventually hangs, from another terminal sudo gdb ./signal.test (yes sudo), and from the gdb prompt attach <PID>.

Variants of the above sequence don't seem to be able to break into the hung program. Very strange.

@AndrewGMorgan
Copy link
Contributor

I've found at least one issue that leads to a dead lock. It explains why I needed to use sudo to debug, but I can't decide if it is a bug in the stop the world code or the all threads fix-up mechanism. It relates to a parked thread being woken up by a signal. I plan to find a way to work around this, and then explore if the problem predates the all threads mechanism or not.

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Mar 26, 2021

Tl;dr this is, indeed, a pair of bugs with the runtime.runtime_syscall_doAllThreadsSyscall() mechanism. I have a fix.

I've added a signal test (os/signal:signal_test.go:TestSignalTrace()) for another use of stop/start the world and it doesn't fail with or without CGO_ENABLED=0. It doesn't invoke runtime.runtime_syscall_doAllThreadsSyscall(), but I'm leaving that added test in my fix for this present bug because it seems there are subtleties with handling signals in Go that I'd not understood when implementing runtime.runtime_syscall_doAllThreadsSyscall() and I can see future modifications to the stop/start the world functions or other runtime changes falling afoul of them too.

I'm probably being a little loose with technical language here, but hopefully the following explains the problematic conditions effectively...

First a recap for context:

  • The runtime.runtime_syscall_doAllThreadsSyscall() works by stopping the world (which attempts to park all but one OS thread - which we will call the coordinator) and then cause each other OS thread to briefly unpark solely to execute the coordinated system call. Once the coordinator thread can confirm that all the OS threads have performed this syscall, the coordinator starts the world again.
  • Signal handling in the Go runtime effectively requires two OS threads: one to receive the raw signal (in the context of an OS signal handler), and the other to ingest the signal in the regular Go runtime context.

The two issues, with the runtime.runtime_syscall_doAllThreadsSyscall() mechanism, when handling signals in Go are as follows:

  • First, the coordinator cannot be allowed to receive the incoming signal. That is, since it is the only thread running normally during a runtime.stopTheWorldGC(), this signal handling will deadlock waiting for some other thread to ingest the signal and there are no other threads available to do this; they are all parked...
    • The solution to this is to block all signal handling on the coordinator thread for the duration of the runtime.runtime_syscall_doAllThreadsSyscall() call.
  • Second, with the above blocking in effect, a runtime.mPark()d OS thread is yanked out of its mPark()ing condition in the signal handler context, to handle the signal. That handler thread will block until some other OS thread is allowed to run the regular runtime code to ingest the signal. Since the world is stopped with the one thread that could do it (the coordinator) blocked, this never happens. Which itself is two types of bad:
    • the signal handling thread doesn't get a chance to run runtime.mDoFixup() which means the whole all-threads mechanism will deadlock...
    • [Edit: on further investigation, this bullet is incorrect. Thanks to Ian for making me look harder!: following the handler running, the interrupted notesleep(&g.m.park) is not guaranteed to get its companion noteclear(&g.m.park) to execute, which means the next time this OS thread gets runtime.mPark()d, it will effectively notesleep() twice in a row and panic the runtime.-
    • The solution for this is to recognize this special case and execute the runtime.mDoFixup() [Edit: this next part is, thus, not needed: and noteclear(&g.m.park)] from the signal handler context.

So far, I've run my fix against HEAD for 10000 iterations without error on amd64 (which took 30 minutes to run):

$ CGO_ENABLED=0 ../bin/go test -c os/signal
$ time ./signal.test -test.v=1 -test.run 'SignalTrace|TestAllThreads' -test.count=10000

I'll give it a whirl on the linux-ppc64-buildlet before attempting to push the patch for review.

@AndrewGMorgan
Copy link
Contributor

The mentioned fix is https://go-review.googlesource.com/c/go/+/305149 .

@gopherbot
Copy link

Change https://golang.org/cl/305149 mentions this issue: syscall: syscall.AllThreadsSyscall signal handling fixes

@AndrewGMorgan
Copy link
Contributor

Since this is a bug in 1.16 should this issue be tagged with that too?

@ianlancetaylor
Copy link
Contributor

@gopherbot Please open backport to 1.16

This bug can cause a rare deadlock if a signal is received during a call to syscall.Setuid or various similar functions on GNU/Linux.

@gopherbot
Copy link

Backport issue(s) opened: #45307 (for 1.16).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/316869 mentions this issue: [release-branch.go1.16] syscall: syscall.AllThreadsSyscall signal handling fixes

gopherbot pushed a commit that referenced this issue May 4, 2021
…dling fixes

The runtime support for syscall.AllThreadsSyscall() functions had
some corner case deadlock issues when signal handling was in use.
This was observed in at least 3 build test failures on ppc64 and
amd64 architecture CGO_ENABLED=0 builds over the last few months.

The fixes involve more controlled handling of signals while the
AllThreads mechanism is being executed. Further details are
discussed in bug #44193.

The all-threads syscall support is new in go1.16, so earlier
releases are not affected by this bug.

Fixes #45307

Change-Id: I01ba8508a6e1bb2d872751f50da86dd07911a41d
Reviewed-on: https://go-review.googlesource.com/c/go/+/305149
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Pratt <mpratt@google.com>
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
(cherry picked from commit 7e97e4e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/316869
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants