runtime/pprof: livelock in cpuProfile.add #25785
Comments
Change https://golang.org/cl/117057 mentions this issue:
Does this need a backport to 1.10 (and 1.9)?
CL 117057 might need a backport in order to fix #24260. The fix to this issue would be just a backport, as I have been unable to reproduce this issue at tip. This suggests the issue has something to do with the kernel helper calls that are no longer present at tip.
On 32-bit architectures without native 64-bit atomic instructions, 64-bit atomics are emulated using spinlocks. However, the sigprof handling code expects to be able to perform 64-bit atomic operations in signal handlers. Spinning on an acquired spinlock in a signal handler leads to a livelock. This is issue #20146.

The original fix for #20146 did not include arm in the list of architectures that need to work around portability issues in the sigprof handler code. The unit test designed to catch this issue does not fail on arm builds because arm uses striped spinlocks, and thus the livelock takes many minutes to reproduce. This is issue #24260.

(This patch doesn't completely fix #24260 on go1.10.2 due to issue #25785, which is probably related to the arm cas kernel helpers. Those have been removed at tip.)

With this patch applied, I was able to run the reproducer for issue #24260 for more than 90 minutes without reproducing the livelock. Without this patch, the livelock took as little as 8 minutes to reproduce.

Fixes #20146
Updates #24260

Change-Id: I64bf53a14d53c4932367d919ac55e17c99d87484
Reviewed-on: https://go-review.googlesource.com/117057
Run-TryBot: Philip Hofer <phofer@umich.edu>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
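
For readers unfamiliar with the emulation the commit message refers to, here is a minimal user-space sketch of a 64-bit atomic add built on striped spinlocks. It is not the runtime's implementation: `numStripes`, `stripeFor`, `emulatedAdd64`, and `parallelSum` are hypothetical names made up for this illustration, and the stripe count is arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"unsafe"
)

// numStripes is an arbitrary table size chosen for this sketch (assumption).
const numStripes = 57

// spinlock is a simple CAS-based lock: 0 means free, 1 means held.
type spinlock struct{ held uint32 }

func (l *spinlock) lock() {
	for !atomic.CompareAndSwapUint32(&l.held, 0, 1) {
		runtime.Gosched() // let another goroutine run while we wait
	}
}

func (l *spinlock) unlock() { atomic.StoreUint32(&l.held, 0) }

// stripes is the lock table; each address hashes to one of these locks.
var stripes [numStripes]spinlock

// stripeFor picks the lock guarding addr. Distinct addresses usually map to
// distinct locks, which is why contention on any single lock is rare.
func stripeFor(addr *uint64) *spinlock {
	return &stripes[(uintptr(unsafe.Pointer(addr))>>3)%numStripes]
}

// emulatedAdd64 performs a 64-bit add under the stripe lock for addr and
// returns the new value.
func emulatedAdd64(addr *uint64, delta uint64) uint64 {
	l := stripeFor(addr)
	l.lock()
	*addr += delta
	v := *addr
	l.unlock()
	return v
}

// parallelSum increments one counter from several goroutines.
func parallelSum(workers, perWorker int) uint64 {
	var counter uint64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				emulatedAdd64(&counter, 1)
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(parallelSum(8, 1000))
}
```

The hazard described above arises when a signal handler calls something like `emulatedAdd64` on a thread that was interrupted while holding the same stripe lock: the handler spins, and the interrupted code never gets to call `unlock`. With many stripes, that collision is rare, which fits the report that the livelock takes minutes to appear.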
Upgrading to 1.11 is a valid workaround, so this is not eligible for backport.
Reproduced on linux/arm, go1.10.2, while investigating #24260
The relevant code snippets are as follows:
and
The trouble is that `setThreadCPUProfiler(0)` doesn't actually do what the comment above it says it is supposed to do. On Linux, SIGPROF is delivered by arming a timer with `setitimer(2)`, and `setThreadCPUProfiler` disarms the timer by calling `setitimer(ITIMER_PROF, &it, NULL)` with an empty `struct itimerval`. `setitimer(2)` delivers signals to the process, not the thread. Disarming a timer on thread 0 doesn't mean that thread 1 hasn't already queued a SIGPROF that may or may not be delivered to thread 0. See #14434. (Also see the BUGS section of the man page, where it points out that signal generation and signal delivery are distinct events. We might conclude that it's possible to receive at least one SIGPROF delivery after the timer has been disarmed; I'm currently spelunking in the linux source tree to see if that's actually true.)

A confounding issue is that `atomic.Cas` on arm (up through go1.10.2, but not at tip) calls a kernel helper that leads the kernel to call `sched_yield`, so the loop in question actually yields two time slices and not one. I don't know if this is necessarily related to the issue, but it may exacerbate it, since it means that the loop could actually sleep longer in each iteration than the SIGPROF timer interval.

I'm trying to reproduce this at tip. I'm also going to capture the output of `perf record` so that there's a little more evidence that this is actually what's happening. Frustratingly, the fastest repro I have still takes 20+ minutes to hit, so debugging this issue is slow going.
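
The suspected failure mode can be modeled in a few lines of ordinary Go. This is only a sketch of the hypothesis, not the runtime code: `signalLock`, `handlerTryLock`, and `simulate` are hypothetical names for this illustration. The "handler" is called as a plain function to mimic a signal interrupting the thread that holds the lock, and its spin is bounded so the program terminates. A real handler would spin without a bound.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// signalLock stands in for a cas-lock shared between the profiler setup
// path and the SIGPROF handler (hypothetical name for this sketch).
var signalLock uint32

// handlerTryLock models the handler's spin loop. It gives up after maxSpins
// attempts so the simulation terminates; the real loop has no such bound.
func handlerTryLock(maxSpins int) bool {
	for i := 0; i < maxSpins; i++ {
		if atomic.CompareAndSwapUint32(&signalLock, 0, 1) {
			atomic.StoreUint32(&signalLock, 0) // handler done, release
			return true
		}
	}
	return false
}

// simulate models the setup path taking the lock after it has disarmed the
// timer. If lateSignal is true, an already-queued SIGPROF is delivered on
// this same thread while the lock is held, so the handler runs on top of the
// lock holder, which cannot make progress until the handler returns.
// It reports whether the handler managed to acquire the lock.
func simulate(lateSignal bool) bool {
	atomic.StoreUint32(&signalLock, 0) // start from a clean state
	if !atomic.CompareAndSwapUint32(&signalLock, 0, 1) {
		panic("setup path could not take a free lock")
	}
	if lateSignal {
		// The handler spins, but the holder is suspended underneath it.
		return handlerTryLock(1000)
	}
	atomic.StoreUint32(&signalLock, 0) // setup path finishes and releases
	return handlerTryLock(1000)        // a later signal finds the lock free
}

func main() {
	fmt.Println("late signal, handler acquired lock:", simulate(true))
	fmt.Println("no late signal, handler acquired lock:", simulate(false))
}
```

In this model the handler only fails in the late-signal case, which corresponds to a SIGPROF that was generated before the timer was disarmed but delivered afterwards.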