runtime: livelock in cpuProfile.add #48782
Labels
FrozenDueToAge
NeedsInvestigation
What version of Go are you using (go version)?
go1.14.7, on linux/amd64
Does this issue reproduce with the latest release?
I've only seen this once, across a large fleet running for several years. The relevant pieces of code look the same in the latest release.
What operating system and processor architecture are you using (go env)?
The problem appeared on a linux/amd64 machine with two processors, each with 10 cores / 20 hyperthreads for a default GOMAXPROCS of 40.
What did you do?
The process collected a CPU profile (with runtime/pprof.StartCPUProfile/StopCPUProfile).
What did you expect to see?
I expected the process would continue running unimpeded.
What did you see instead?
The process stopped doing productive work, and instead consumed all available CPU resources. GDB with thread apply all bt showed 43 threads, all of which were in calls from runtime.sigtramp to runtime.sigtrampgo to runtime.sighandler to runtime.sigprof to runtime.(*cpuProfile).add.
All of them were in https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L3924, the call from runtime.sigprof to runtime.(*cpuProfile).add. Of those, 41 were at the resulting osyield call https://github.com/golang/go/blob/go1.14.7/src/runtime/cpuprof.go#L94 and 2 were at the atomic.Cas call https://github.com/golang/go/blob/go1.14.7/src/runtime/cpuprof.go#L93.
It looks to me like the locking in runtime.setprofilerate near https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L4010 isn't sufficient to protect against this, especially when disabling the profiler.
Consider the case where the thread that is trying to disable the profiler earns a SIGPROF delivery, and the signal arrives at the start of the call to setProcessCPUProfiler. The value in prof.signalLock will be 1, and the value in prof.hz will not yet be 0. When the signal handler runs on that thread, the check for prof.hz != 0 at https://github.com/golang/go/blob/go1.14.7/src/runtime/proc.go#L3923 will be true, allowing a call to cpuprof.add, which will then spin endlessly trying to grab prof.signalLock.
Other threads will either earn their own SIGPROF deliveries which will also get stuck, or the work of waiting on the spinlock will earn additional process-targeted SIGPROF deliveries (because setitimer is active, generating process-targeted signals which the kernel best-effort delivers to the active thread). But because those threads are in the middle of handling signals already, the kernel (maybe?) will deliver them to any other eligible thread. This leads (it seems) to every thread in the process being stuck in SIGPROF handlers trying to obtain that lock.
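The suspected sequence can be modeled in ordinary Go. This is only a sketch of the hypothesis above, not the runtime's code: the signal is simulated by a callback invoked while the lock is held, handleSigprof and disableProfiler are illustrative names, and the maxSpins bound exists only so the model terminates (the real handler spins without limit).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// prof loosely models the profiler state named in the issue:
// a spinlock word and the profiling rate.
var prof struct {
	signalLock uint32
	hz         int32
}

// handleSigprof stands in for the sigprof -> cpuProfile.add path.
// It reports whether it managed to take the lock within maxSpins attempts.
func handleSigprof(maxSpins int) bool {
	if atomic.LoadInt32(&prof.hz) == 0 {
		return false // profiling is off, nothing to record
	}
	for i := 0; i < maxSpins; i++ {
		if atomic.CompareAndSwapUint32(&prof.signalLock, 0, 1) {
			atomic.StoreUint32(&prof.signalLock, 0)
			return true
		}
		// The runtime calls osyield() here and tries again.
	}
	return false // the unbounded version would still be spinning
}

// disableProfiler models the disable path. The interrupt callback marks
// the point where a SIGPROF could land on the lock-holding thread.
func disableProfiler(interrupt func()) {
	for !atomic.CompareAndSwapUint32(&prof.signalLock, 0, 1) {
	}
	interrupt() // lock is held and hz is still nonzero
	atomic.StoreInt32(&prof.hz, 0)
	atomic.StoreUint32(&prof.signalLock, 0)
}

func main() {
	atomic.StoreInt32(&prof.hz, 100)
	disableProfiler(func() {
		fmt.Println("handler made progress:", handleSigprof(1000))
	})
}
```

Because the handler runs on the same thread that holds the lock, the holder cannot resume to release it until the handler returns, so no number of retries helps.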
I plan to send a CL that allows the lock-holder to communicate "please try later" vs "do not retry this lock".
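One way to picture that idea is a lock word with more than two states. The sketch below is hypothetical and is not the CL itself: the state names, the signalLock type, and its methods are assumptions made for illustration, and runtime.Gosched stands in for osyield.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// Hypothetical lock states; go1.14.7 uses only 0 and 1.
const (
	unlocked   uint32 = iota // free to acquire
	lockedBusy               // held briefly: "please try later"
	lockedStop               // held while disabling: "do not retry this lock"
)

type signalLock struct{ state uint32 }

// acquire spins while the holder says "try later" and gives up as soon
// as the holder says "do not retry".
func (l *signalLock) acquire() bool {
	for {
		if atomic.CompareAndSwapUint32(&l.state, unlocked, lockedBusy) {
			return true
		}
		if atomic.LoadUint32(&l.state) == lockedStop {
			return false // caller drops its sample instead of spinning
		}
		runtime.Gosched()
	}
}

// acquireForShutdown takes the lock and tells waiters to stop retrying.
func (l *signalLock) acquireForShutdown() {
	for !atomic.CompareAndSwapUint32(&l.state, unlocked, lockedStop) {
		runtime.Gosched()
	}
}

func (l *signalLock) release() { atomic.StoreUint32(&l.state, unlocked) }

func main() {
	var l signalLock
	l.acquireForShutdown()
	// A handler interrupting the holder now returns instead of spinning.
	fmt.Println("handler acquired lock:", l.acquire())
	l.release()
	fmt.Println("after release:", l.acquire())
}
```

The trade-off in this sketch is that a handler which sees the "do not retry" state loses its sample, which seems acceptable when the profiler is being turned off anyway.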