runtime: goroutine in C code hangs with async preemption under macOS Big Sur #45558
Comments
CC @cherrymui |
This appears to have been caused by |
Here's a simple reproduction that continuously allocates and frees 1 GB of memory in two goroutines, and outputs the goroutine stacks every 30 seconds. Both goroutines will eventually block, the first one within about 20 seconds.

package main

import (
	// #include <stdlib.h>
	"C"
	"fmt"
	"math"
	"runtime"
	"time"
	"unsafe"

	_ "github.com/benesch/cgosymbolizer"
)

const (
	concurrency = 2
	size        = 32768
	count       = int(1e9) / size // 1 GB per goroutine
)

func main() {
	for w := 0; w < concurrency; w++ {
		go func(w int) {
			for i := 0; ; i++ {
				fmt.Printf("goroutine:%d cycles:%d\n", w, i)
				allocs := make([][]byte, 0, count)
				for i := 0; i < count; i++ {
					allocs = append(allocs, alloc(size))
				}
				for _, a := range allocs {
					free(a)
				}
			}
		}(w)
	}
	buf := make([]byte, 65536)
	for {
		time.Sleep(30 * time.Second)
		n := runtime.Stack(buf, true)
		fmt.Printf("%s\n", buf[:n])
	}
}

func alloc(n int) []byte {
	ptr := C.calloc(C.size_t(n), 1)
	if ptr == nil {
		panic("out of memory")
	}
	return (*[math.MaxInt32]byte)(ptr)[:n:n]
}

func free(b []byte) {
	C.free(unsafe.Pointer(&b[0]))
}
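The reproduction above prints every goroutine stack every 30 seconds, so spotting the blocked goroutines means reading the dump by eye. As a rough aid, the sketch below scans a dump in the format produced by `runtime.Stack` and lists the goroutines whose stack contains a `runtime.cgocall` frame. The helper name `inCgo` and the sample dump are hypothetical; the sample is hand-written for illustration and is not output from this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// header matches goroutine header lines such as
// "goroutine 18 [syscall, 3 minutes]:".
var header = regexp.MustCompile(`^goroutine \d+ \[[^\]]*\]:$`)

// sampleDump is a hand-written example, not real output from the issue.
const sampleDump = "goroutine 1 [sleep]:\n" +
	"time.Sleep(0x6fc23ac00)\n" +
	"\t/usr/local/go/src/runtime/time.go:193 +0xd2\n" +
	"main.main()\n" +
	"\t/tmp/main.go:42 +0x9f\n" +
	"\n" +
	"goroutine 18 [syscall, 3 minutes]:\n" +
	"runtime.cgocall(0x40a1b20, 0xc000046f58)\n" +
	"\t/usr/local/go/src/runtime/cgocall.go:154 +0x5b\n" +
	"main._Cfunc_calloc(0x8000, 0x1)\n" +
	"\t_cgo_gotypes.go:47 +0x49\n"

// inCgo returns the header line of every goroutine in dump whose stack
// contains a runtime.cgocall frame, i.e. goroutines inside a cgo call.
func inCgo(dump string) []string {
	var out []string
	current := ""     // header of the goroutine being scanned
	reported := false // whether current was already added to out
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := sc.Text()
		if header.MatchString(line) {
			current, reported = line, false
			continue
		}
		if current != "" && !reported && strings.HasPrefix(line, "runtime.cgocall(") {
			out = append(out, current)
			reported = true
		}
	}
	return out
}

func main() {
	for _, h := range inCgo(sampleDump) {
		fmt.Println(h)
	}
}
```

A goroutine that appears in this list across several consecutive dumps, with a growing wait time in its header, is a candidate for a hung cgo call; a single appearance only means it happened to be in C code at that moment.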
Here's an even smaller test case. Interestingly, this will have hung goroutines even with

package main

import (
	// #include <stdlib.h>
	"C"
	"fmt"
	"os"
	"runtime"
	"syscall"
	"time"

	_ "github.com/benesch/cgosymbolizer"
)

const concurrency = 2

func main() {
	envName := C.CString("GODEBUG")
	for w := 0; w < concurrency; w++ {
		go func(w int) {
			for i := 0; ; i++ {
				if i%1e6 == 0 {
					fmt.Printf("goroutine:%d calls:%d\n", w, i)
					runtime.Gosched()
				}
				C.getenv(envName)
			}
		}(w)
	}
	go func() {
		p, _ := os.FindProcess(os.Getpid())
		for {
			p.Signal(syscall.SIGUSR1)
			time.Sleep(time.Millisecond)
		}
	}()
	buf := make([]byte, 65536)
	for {
		time.Sleep(30 * time.Second)
		n := runtime.Stack(buf, true)
		fmt.Printf("%s\n", buf[:n])
	}
}
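In the test case above, a hang shows up only as a worker whose `calls:` line stops appearing. One possible way to detect that automatically, sketched here in pure Go, is to give each worker a counter and compare two snapshots taken some time apart. The helpers `snapshot` and `stalled` are hypothetical names, and the worker that never increments merely stands in for a goroutine stuck in a C call.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// snapshot atomically reads every worker counter.
func snapshot(counters []uint64) []uint64 {
	out := make([]uint64, len(counters))
	for i := range counters {
		out[i] = atomic.LoadUint64(&counters[i])
	}
	return out
}

// stalled returns the indexes of workers whose counter did not advance
// between the two snapshots.
func stalled(prev, cur []uint64) []int {
	var idx []int
	for i := range cur {
		if i < len(prev) && cur[i] == prev[i] {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	counters := make([]uint64, 2)

	// Worker 0 keeps making progress. Worker 1 is never started, which
	// stands in for a goroutine that is stuck and no longer counting.
	go func() {
		for {
			atomic.AddUint64(&counters[0], 1)
			time.Sleep(time.Millisecond)
		}
	}()

	prev := snapshot(counters)
	time.Sleep(50 * time.Millisecond)
	cur := snapshot(counters)
	fmt.Println("stalled workers:", stalled(prev, cur))
}
```

The interval between snapshots has to be comfortably longer than the slowest legitimate iteration, otherwise a slow worker is reported as stalled.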
Now I'm curious whether github.com/ianlancetaylor/cgosymbolizer works any better. |
Gave that a try too, same behavior. |
In the original issue report, the deadlock is happening because the program is trying to fetch a stack trace while calling the C function
The github.com/ianlancetaylor/cgosymbolizer package, on the other hand, does not call
In any case it's not clear to me that this is a bug in the Go toolchain, or that there is anything that the Go project can change to fix this. You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function
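The explanation above describes a stack trace being fetched while a C function is still running. One common way this kind of hang arises, stated here as an assumption since the thread does not spell out the locking details, is that the interrupted code holds a non-reentrant lock and the code collecting the stack trace then needs the same lock. The sketch below is only a Go analogy: the `allocator` type and its methods are hypothetical, and `TryLock` is used so the program reports the conflict instead of actually hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// allocator stands in for a C allocator guarded by a non-reentrant lock.
// It is an analogy, not a model of any real allocator.
type allocator struct {
	mu sync.Mutex
}

// handlerCanAllocate plays the role of a handler that wants to allocate.
// It uses TryLock so the demo reports a conflict instead of blocking
// forever, which is what a plain Lock would do here.
func (a *allocator) handlerCanAllocate() bool {
	if !a.mu.TryLock() {
		return false
	}
	a.mu.Unlock()
	return true
}

// allocInterrupted simulates an allocation that is interrupted while the
// lock is held, and reports whether the handler could have allocated.
func (a *allocator) allocInterrupted() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.handlerCanAllocate()
}

func main() {
	var a allocator
	fmt.Println("handler can allocate during interrupted alloc:", a.allocInterrupted())
	fmt.Println("handler can allocate when idle:", a.handlerCanAllocate())
}
```

With a blocking `Lock` in the handler, the first call would never return, because the only code that could release the lock is the code that was interrupted.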
Irrespective of this bug, @erikgrinaker, Cockroach should definitely switch to using Ian’s cgosymbolizer if it supports Mach-O now. |
Oh, I see, you’re removing it entirely via cockroachdb/cockroach#63737. That seems reasonable. |
Thank you for the explanation @ianlancetaylor, that makes a lot of sense. We did see libunwind come up when looking at the hung threads in a debugger, which fits well with what you're saying.
This does indeed solve the issue, as one would expect.
I agree, nothing here indicates a problem in Go, so I'll close the issue. Thanks for having a look!
We don't really have a big need for this now that we no longer use RocksDB, but otherwise that might have been worth exploring. Thanks for the suggestion! |
Something is still unexplained here, though. If the issue was that benesch/cgosymbolizer can sometimes call malloc, then why didn’t switching to ianlancetaylor/cgosymbolizer, which supposedly does not call malloc, fix the issue? |
That's true. I had a look, and it's blocked on
On macOS,
#if defined(__APPLE__) && __has_include_next(<unwind.h>)
/* Darwin (from 11.x on) provide an unwind.h. If that's available,
 * use it. libunwind wraps some of its definitions in #ifdef _GNU_SOURCE,
 * so define that around the include.*/
I think that ends up being the same implementation as
It would be extremely unfortunate if the LLVM implementation of |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, also with master at 49e933f.
What operating system and processor architecture are you using (go env)?
macOS 11.2.3 (Big Sur)
amd64
What did you do?
We haven't been able to come up with a reduced test case that reproduces this yet, but we're working on it.
When running CockroachDB on macOS Big Sur with async preemption enabled and under some load, we occasionally see goroutines get stuck executing Cgo functions (specifically calloc). When this happens, the process pegs the CPU at 100% (one core), and the Cgo call never returns. It appears to be somewhat correlated with resource contention. We see this with Go 1.16.3, 1.15.11, and 1.14.15.
This does not happen with GODEBUG=asyncpreemptoff=1, nor does it happen with macOS Catalina (or Linux), nor if we disable Cgo calloc and use the Go memory allocator instead.
It can be reproduced practically every time by running a five-node cluster and generating some load as follows (see also macOS build instructions):
(to tear down the cluster, run ./bin/roachprod destroy local)
Within a few minutes, one of the processes should have a goroutine stuck on calloc. This may or may not affect the running query, depending on which goroutine blocks. Blocked goroutines can be found with e.g.:
The node URL that's output has a pprof endpoint at /debug/pprof. A CPU profile captured during a hung goroutine, while the process is pegged at 100% CPU, shows much of the time spent in the runtime:
The relevant Cgo code in the Pebble storage engine is fairly simple, consisting of calloc and free calls:
https://github.com/cockroachdb/pebble/blob/3d4c32f510a80f21e787caabf360edffe1431677/internal/manual/manual.go
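Because the report depends on whether GODEBUG=asyncpreemptoff=1 is set, it can help to log that setting when a reproduction starts, so each run records which mode it was in. The sketch below parses the comma-separated `key=value` form of the `GODEBUG` environment variable. The helper `godebugValue` is a hypothetical name, and returning the last occurrence of a repeated key is a choice made for this sketch rather than a statement about how the runtime resolves duplicates.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// godebugValue returns the value of key in a GODEBUG-style string
// ("k1=v1,k2=v2") and whether key was present. If key is repeated,
// the last occurrence is returned (a choice made for this sketch).
func godebugValue(godebug, key string) (string, bool) {
	val, found := "", false
	for _, kv := range strings.Split(godebug, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			continue // skip empty or malformed entries
		}
		if strings.TrimSpace(parts[0]) == key {
			val, found = strings.TrimSpace(parts[1]), true
		}
	}
	return val, found
}

func main() {
	v, ok := godebugValue(os.Getenv("GODEBUG"), "asyncpreemptoff")
	fmt.Printf("asyncpreemptoff set: %v, async preemption disabled: %v\n", ok, ok && v == "1")
}
```

This only reports what was requested through the environment; it does not inspect the runtime's internal state.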
What did you expect to see?
Cgo calls returning as normal.
What did you see instead?
Cgo calls never returning, blocking the goroutine forever.