Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: StopCPUProfile occasionally stuck with 100% CPU and process hang #52912

Closed
breezewish opened this issue May 15, 2022 · 13 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@breezewish
Copy link

breezewish commented May 15, 2022

What version of Go are you using (go version)?

go1.18.0

Does this issue reproduce with the latest release?

Not sure.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/breezewish/Library/Caches/go-build"
GOENV="/Users/breezewish/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/breezewish/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/breezewish/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.18.1/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.18.1/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.18.1"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/6m/phy2_frd0vd1hb7g1py32qz40000gn/T/go-build2219610054=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version go1.18.1 darwin/arm64
GOROOT/bin/go tool compile -V: compile version go1.18.1
uname -v: Darwin Kernel Version 21.4.0: Fri Mar 18 00:46:32 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T6000
ProductName:	macOS
ProductVersion:	12.3.1
BuildVersion:	21E258
lldb --version: lldb-1316.0.9.41
Apple Swift version 5.6 (swiftlang-5.6.0.323.62 clang-1316.0.20.8)

What did you do?

In tidb-server there is a feature with a pattern that repeatedly run pprof CPU profiler for 1 second (StartCPUProfile -> wait 1 sec -> StopCPUProfile -> StartCPUProfile -> wait 1 sec -> ...).

Recently in my MacOS M1 with this feature enabled, I observed that the tidb-server process was hang with 100% (1 core) CPU and it cannot process any requests.

According to the CPU profiling data provided by Instruments, looks like StopCPUProfile was looping infinitely at

for !atomic.Cas(&prof.signalLock, 0, 1) {
:

image

I have no idea how this issue can be reliably reproduced. Hope the stack provided by the Instruments helps.

What did you expect to see?

StopCPUProfile should not cause process hang.

What did you see instead?

Process was hanging.

@seankhliao seankhliao changed the title pprof.StopCPUProfile occasionally stuck with 100% CPU and process hang runtime/pprof: StopCPUProfile occasionally stuck with 100% CPU and process hang May 16, 2022
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 16, 2022
@seankhliao
Copy link
Member

try killing it with SIGABRT and see what it's doing?

@breezewish
Copy link
Author

@seankhliao Not reproduced recent days, will try with SIGABRT when it happened. Meanwhile do you have other ideas or suggested measurements I can take to keep some necessary information when it reproduce the next time?

@rhysh
Copy link
Contributor

rhysh commented May 23, 2022

Could this be a memory-order / weakness problem with M1 / ARM64? The defense against this deadlock is setting m.profilehz = 0 (in https://github.com/golang/go/blob/go1.18/src/runtime/proc.go#L4569 / https://github.com/golang/go/blob/go1.18/src/runtime/signal_unix.go#L328), which gets checked at the top of the SIGPROF handler (at https://github.com/golang/go/blob/go1.18/src/runtime/proc.go#L4450).

The store and load of m.profilehz are non-atomic, which seems like it could be a problem on ARM64. But those take place on the same thread (store on the curg and load on the gsignal of a particular M), which seems like it could mitigate that problem. @ianlancetaylor , do you know if there's something to that, or would brokenness there show up as even bigger problems? Thanks.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@prattmic prattmic added this to the Go1.20 milestone Jul 27, 2022
@prattmic prattmic self-assigned this Jul 27, 2022
@prattmic
Copy link
Member

If you reproduce this again (or still have the profile), could you expand the _sigtramp part of the profile? That is probably holding signalLock and does seem to be stuck somewhere.

@prattmic
Copy link
Member

@rhysh I see where you are coming from, but if we hit that race, shouldn't we be stuck spinning in the signal handler when the deadlock occurs, not setcpuprofilerate? To also get stuck there I think we would need a second concurrent caller to StopCPUProfile.

@prattmic
Copy link
Member

sigprofNonGoPC takes prof.signalLock via cpuprof.addNonGo, and is only protected by a check of prof.hz, which (1) is non-atomic, and (2) we don't modify in setcpuprofilerate until after taking signalLock.

Thus I believe it would be possible to get a deadlock in the signal handler if we go down the addNonGo path.

That said, we are in fact running Go code, so we should avoid that path. setProcessCPUProfilerTimer makes a few libc calls. Perhaps if those get interrupted at just the right point we are unable to get the G?

There would also need to be a second call to StopCPUProfile on another thread to have the stuckness in both StopCPUProfile and sigtramp that is shown in the profile.

@gopherbot
Copy link

Change https://go.dev/cl/420196 mentions this issue: runtime: convert prof.hz to atomic type

@prattmic
Copy link
Member

https://go.dev/cl/420196 makes prof.hz atomic. It is part of a general atomics cleanup and I don't think it will make a difference here, but if this is easily reproducible then it may be worth testing.

@prattmic
Copy link
Member

FWIW, I have been unable to trivially reproduce this on darwin-arm64 calling StartCPUProfile/StopCPUProfile in a loop (with some extra work to ensure we get SIGPROFs).

@prattmic prattmic added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 4, 2022
gopherbot pushed a commit that referenced this issue Aug 12, 2022
This converts several unsynchronized reads (reads without holding
prof.signalLock) into atomic reads.

For #53821.
For #52912.

Change-Id: I421b96a22fbe26d699bcc21010c8a9e0f4efc276
Reviewed-on: https://go-review.googlesource.com/c/go/+/420196
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@joedian
Copy link

joedian commented Aug 18, 2022

@prattmic does the merged fix solve the issue or is more info needed?

@prattmic
Copy link
Member

I don't expect it to make a difference, but it is possible.

@breezewish can you reproduce this at tip?

@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@breezewish
Copy link
Author

It seems to be hard to reproduce in my M1 in recent days. We have some production env using this Go version now and I will ping back when encountering more deadlocks like this. Thanks a lot for the dig!

@golang golang locked and limited conversation to collaborators Sep 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

6 participants