runtime/pprof: SIGPROF interrupt causes infinite loop #55243
Labels
compiler/runtime
Issues related to the Go compiler and/or runtime.
FrozenDueToAge
WaitingForInfo
Issue is not actionable because of missing required information, which needs to be provided.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What did you do?
I work on a library that ends up calling the bpf(2) syscall via unix.Syscall. bpf(2) has a number of subcommands, one of which loads BPF bytecode into the kernel. The kernel analyzes the bytecode, which can be "slow": anywhere from milliseconds to seconds, depending on kernel config. Since Linux 4.20, the kernel interrupts the verification process if the calling process has signals pending.
This leads to an unfortunate interaction with the runtime CPU profiler, since it uses SIGPROF to trigger sampling. If the interval between SIGPROF signals is shorter than the time it takes to fully analyze a BPF program, we enter an infinite loop:
https://github.com/cilium/ebpf/blob/713c8dc84f60f2b599caabd6dfb2fb33b7878bc1/internal/sys/syscall.go#L19-L37
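The shape of such a retry loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's actual code: `errInterrupted`, `retryForever`, and the stub `fn` are hypothetical names, and the stub stands in for the real bpf(2) call. If every attempt is interrupted before verification finishes, the loop never exits.

```go
package main

import (
	"errors"
	"fmt"
)

// errInterrupted is a hypothetical stand-in for whatever error the kernel
// reports when a pending signal cuts the call short.
var errInterrupted = errors.New("interrupted by signal")

// retryForever calls fn until it returns anything other than errInterrupted
// and reports how many attempts were needed. It has no upper bound: if fn is
// interrupted every time, it never returns.
func retryForever(fn func() error) (int, error) {
	for attempts := 1; ; attempts++ {
		err := fn()
		if !errors.Is(err, errInterrupted) {
			return attempts, err
		}
	}
}

func main() {
	// Stub: the first two calls are "interrupted", the third succeeds.
	calls := 0
	fn := func() error {
		calls++
		if calls < 3 {
			return errInterrupted
		}
		return nil
	}
	attempts, err := retryForever(fn)
	fmt.Println(attempts, err)
}
```

With a profiler delivering SIGPROF more often than the call needs to complete, the stub would effectively return `errInterrupted` on every call.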
One of our users encountered this in the wild and blogged about it: https://dxuuu.xyz/bpf-go-pprof.html
You can find a small reproducer for x86 / arm64 Linux here: https://github.com/lmb/sigprof-repro
There was a similar issue (I think?) in the runtime related to fork: #5517. Other profilers / runtimes also experience similar problems with other syscalls: async-profiler/async-profiler#97.
Somewhat related: non-cooperative preemption of goroutines also uses signals, and can therefore trigger the same problem. We've not received bug reports for that, though.
We've considered limiting the number of syscall retries on interruption, and returning an error if we retry too often. This has the very unfortunate side effect that enabling CPU profiling can make your application fail where it otherwise wouldn't have. For a library, that is really undesirable, since we have no idea who can trigger profiling. The Go runtime itself also retries an infinite number of times for some syscalls that return EINTR, so there is some precedent for infinite retry.
This leaves us with trying to avoid receiving SIGPROF (and other?) signals on the thread calling bpf(2) in the first place, similar to syscall.runtime_BeforeFork. We're not sure how best to go about this, however, since the runtime has no suitable API for blocking SIGPROF only on the syscall thread. We could call runtime.LockOSThread and invoke sigprocmask directly, but that has a good chance of breaking in subtle ways due to runtime / CGO / libc interactions.
My questions are therefore: