Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: livelock in cpuProfile.add #48782

Closed
rhysh opened this issue Oct 4, 2021 · 3 comments
Closed

runtime: livelock in cpuProfile.add #48782

rhysh opened this issue Oct 4, 2021 · 3 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Oct 4, 2021

What version of Go are you using (go version)?

go1.14.7, on linux/amd64

Does this issue reproduce with the latest release?

I've only seen this once, across a large fleet running for several years. The relevant pieces of code look the same in the latest release.

What operating system and processor architecture are you using (go env)?

The problem appeared on a linux/amd64 machine with two processors, each with 10 cores / 20 hyperthreads for a default GOMAXPROCS of 40.

What did you do?

The process collected a CPU profile (with runtime/pprof.StartCPUProfile / StopCPUProfile).

What did you expect to see?

I expected the process would continue running unimpeded.

What did you see instead?

The process stopped doing productive work, and instead consumed all available CPU resources. GDB with thread apply all bt showed 43 threads, all of which were in calls from runtime.sigtramp to runtime.sigtrampgo to runtime.sighandler to runtime.sigprof to runtime.(*cpuProfile).add.

All of them were in https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L3924, the call from runtime.sigprof to runtime.(*cpuProfile).add. Of those, 41 were at the resulting osyield call https://github.com/golang/go/blob/go1.14.7/src/runtime/cpuprof.go#L94 and 2 were at the atomic.Cas call https://github.com/golang/go/blob/go1.14.7/src/runtime/cpuprof.go#L93.

func (p *cpuProfile) add(gp *g, stk []uintptr) {
	// Simple cas-lock to coordinate with setcpuprofilerate.
	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}

It looks to me like the locking in runtime.setprofilerate near https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L4010 isn't sufficient to protect against this, especially when disabling the profiler.

	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}
	if prof.hz != hz {
		setProcessCPUProfiler(hz)
		prof.hz = hz
	}
	atomic.Store(&prof.signalLock, 0)

Consider the case where the thread that is trying to disable the profiler earns a SIGPROF delivery, and the signal arrives at the start of the call to setProcessCPUProfiler. The value in prof.signalLock will be 1, and the value in prof.hz will not yet be 0. When the signal handler runs on that thread, the check for prof.hz != 0 at https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L3923 will be true, allowing a call to cpuprof.add, which will then spin endlessly trying to grab prof.signalLock.

	if prof.hz != 0 {
		cpuprof.add(gp, stk[:n])
	}

Other threads will either earn their own SIGPROF deliveries which will also get stuck, or the work of waiting on the spinlock will earn additional process-targeted SIGPROF deliveries (because setitimer is active, generating process-targeted signals which the kernel best-effort delivers to the active thread). But because those threads are in the middle of handling signals already, the kernel (maybe?) will deliver them to any other eligible thread. This leads (it seems) to every thread in the process being stuck in SIGPROF handlers trying to obtain that lock.

I plan to send a CL that allows the lock-holder to communicate "please try later" vs "do not retry this lock".

@mknyszek mknyszek added this to the Backlog milestone Oct 4, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 4, 2021
@mknyszek
Copy link
Contributor

mknyszek commented Oct 4, 2021

Based on this, it looks like this is still an issue today? This code hasn't been touched in a while, so I wouldn't be surprised.

CC @prattmic too.

@gopherbot
Copy link

Change https://golang.org/cl/353931 mentions this issue: runtime: avoid reentrant locking by cpuprof.add

@prattmic
Copy link
Member

prattmic commented Oct 5, 2021

As pointed out by @ianlancetaylor and @rhysh on the CL, this should be fixed in 1.16+ with http://golang.org/cl/353931.

@prattmic prattmic closed this as completed Oct 5, 2021
@golang golang locked and limited conversation to collaborators Oct 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants