Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: go stuck in a dead lock between threads epoll_pwait/futex(..., FUTEX_WAKE_PRIVATE, 1) on arm64 #55120

Closed
schaefi opened this issue Sep 17, 2022 · 10 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@schaefi
Copy link

schaefi commented Sep 17, 2022

go version
go version go1.18.1 linux/arm64

Does this issue reproduce with the latest release?

yes I tried go 1.18 and go 1.19

What operating system and processor architecture are you using (go env)?

go env

GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/admin/.cache/go-build"
GOENV="/home/admin/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/admin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/admin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.18/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/admin/__mego/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1438079104=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm working with go and go compiled binaries on an NXP S32g devel board and observed a dead lock of the
programs that makes them stuck forever. The problem seems to be related to how go applications synchronize
in its threads and the mutex locks. For a simple reproduced I wrote the following code:

  • main.go

    package main
    
    func main() {
            mprint()
    }
    
  • mprint.go

    package main
    
    import "fmt"
    
    func mprint() {
            list := []int{1, 2, 3}
            list = append(list, 10)
            for _, v := range list {
                    fmt.Println(v)
            }
    }
    

The actual program code is immaterial as the issue appears when compiling with

go build mprint.go main.go

It actually should be done in a second but the compilation is stuck forever. The process list shows me

ps ax -T

32240   32240 pts/1    Sl+    0:06 go build mprint.go main.go
32240   32241 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32242 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32243 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32244 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32245 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32246 pts/1    Rl+    0:06 go build mprint.go main.go

Checking with strace what's going on exposes the following:

sudo strace -p 32240

strace: Process 32240 attached
futex(0x95efc8, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32240 detached
 <detached ...>

The same for: 32242, 32243, 32244, 32245

sudo strace -p 32241

strace: Process 32241 attached
restart_syscall(<... resuming interrupted io_setup ...>) = 0
epoll_pwait(4, [], 128, 0, NULL, 512266007) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 523616458) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 534857704) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 545530928) = 0
getpid()

... NEVER ENDS and makes it to consume 100% CPU usage

sudo strace -p 32246

rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274881856320
futex(0x9605f0, FUTEX_WAKE_PRIVATE, 1)  = 0
futex(0x9604f8, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x4000104d48, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32246 detached
 <detached ...>

So it's in the epoll_pwait() forever. Neither the descriptor nor the connected SIGURG signals seems
to get it out of this polling. I couldn't get a clue on this and after some time I decided to write
this report here.

Unfortunately this seems to be a non deterministic behavior because I cannot reproduce it to this
extend on other arm boards (e.g raspberry) or on other archs (e.g x86_64). However I was able
to get into the dead lock on x86_64 if I run go compiled code with ltrace, e.g ltrace podman --version

So I think something is causing a race condition and it might also be related to the board
hardware such that I can see it on this board in any case but only occasionally elsewhere. I know this makes
it a bad issue for you but I really hope you can give me some ideas to try, patches for testing
or just abuse me as tester of ideas on the board.

I did several tests with other programming languages, e.g C (simple mutex testing) or rust (which also has
some thread safe programming model) and they did not expose this sort of issues. Thus I believe it's not
a generaly broken board hardware but some sort of unfortunate circumstances.

Another fun fact; If the processes are stuck they can be made unstuck if you ssh into the machine.
Sounds strange yes but my assumption with this is that this behavior could be related
to ssh which avoid orphaned processes by using TCP out-of-band data also generating a SIGURG.
In go it seems SIGURG is also used for thread preemption. It feels like that it has race conditions which
are triggered any time on my arm board but does happen only occasionally on other hardware.
So if I ssh, a stuck go program can be made to continue until the next dead lock condition ;)

What did you expect to see?

I expected the go compiler not to hang in a dead lock. I also wondered about compiled go based programs
to run into dead locks with the same behavior as soon as threads are used.

What did you see instead?

go itself and go compiled binaries are stuck in a dead lock as described.
I'm running out of ideas what else I could do and kindly ask for help from the experts.

If wished I can provide temporary ssh access to the arm board I'm using which allows to
reproduce the problem as described.

Thanks a ton

Cheers,
Marcus

@ianlancetaylor
Copy link
Contributor

I don't know what is happening here, but the pattern you are showing is normal. You have to pass the -f option to strace in order to get a useful trace of a Go program. Otherwise you are just tracing a single thread.

I'm not saying that there isn't a problem, but we'll need more information to diagnose it. Especially since other people on linux-arm64 are not reporting similar problems. Thanks.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 18, 2022
@schaefi
Copy link
Author

schaefi commented Sep 19, 2022

@ianlancetaylor thanks for looking into this. I did as you suggested with the following example

podman --version

... hangs, ok

strace -f -p 35511 2> podman-version.strace

stopped after some time Ctrl-C, as you can see from the trace it repeats in chunks

and attaching podman-version.strace.txt

just for clarity I picked podman as an example but this behavior can also be seen with any other
go program which uses threads even go itself and it does not happen always but on this machine in ~8 out of 10 attempts

podman-version.strace.txt

Thanks

@ianlancetaylor ianlancetaylor changed the title go stuck in a dead lock between threads epoll_pwait/futex(..., FUTEX_WAKE_PRIVATE, 1) on arm64 runtime: go stuck in a dead lock between threads epoll_pwait/futex(..., FUTEX_WAKE_PRIVATE, 1) on arm64 Sep 20, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 20, 2022
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Sep 20, 2022
@ianlancetaylor ianlancetaylor added this to the Go1.20 milestone Sep 20, 2022
@ianlancetaylor
Copy link
Contributor

CC @golang/runtime

@m-kat
Copy link

m-kat commented Sep 20, 2022

Hi,
some additional information to that deadlock behaviour that we observed, when using go build with -race flag on an NXP board, sometimes the deadlock does not occur, but the behaviour is not deterministic.
Hopefully, this information can help you to narrow the issue.

Thanks,
Marcin

@prattmic
Copy link
Member

When the program is stuck, could you attach with gdb and see where all of the threads are (thread apply all backtrace)?

@m-kat
Copy link

m-kat commented Sep 22, 2022

Hi,
gdb traces on the golang build execution gives no meaningful information (as it is not compiled with debugging symbols only addresses are printed out). We have developed simpler two-threaded code that has the same issue:

package main

import (
	"fmt"
	"time"
)

func main() {
	finished := make(chan int)
	go spinner(100*time.Millisecond, finished)
	const n = 45
	fibN := fib(n) // slow
	finished <- 0
	fmt.Printf("\rFibonacci(%d) = %d\n", n, fibN)
}

func spinner(delay time.Duration, in <-chan int) {
	end := true
	for end {
		for _, r := range `-\|/` {
			fmt.Printf("\r%c", r)
			time.Sleep(delay)
		}
		select {
		case <-in:
			end = false
		default:
			continue
		}
	}
}

func fib(x int) int {
	if x < 2 {
		return x
	}
	return fib(x-1) + fib(x-2)
}

in working environment this code takes up to 5 seconds to run, on before mentioned NXP platform it was running 1 minute and after I saw it has stucked (one thread taking 100% CPU time and 4 threads waiting on futexes or nano sleeping ) I have connected gdb to it and I've got such result from thread backtrace:

(gdb) thread apply all backtrace

Thread 5 (LWP 2805 "thex"):
#0  runtime.epollwait () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:731
#1  0x000000000003b280 in runtime.netpoll (delay=<optimized out>, ~r0=...) at /usr/lib/go-1.18/src/runtime/netpoll_epoll.go:126
#2  0x0000000000046df4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2767
#3  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#4  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#5  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 4 (LWP 2804 "thex"):
#0  0x000000000008dba8 in main.fib (x=2, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#1  0x000000000008dba0 in main.fib (x=3, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#2  0x000000000008dbb4 in main.fib (x=5, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#3  0x000000000008dba0 in main.fib (x=6, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#4  0x000000000008dbb4 in main.fib (x=8, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#5  0x000000000008dba0 in main.fib (x=9, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#6  0x000000000008dba0 in main.fib (x=10, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#7  0x000000000008dba0 in main.fib (x=11, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#8  0x000000000008dbb4 in main.fib (x=13, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#9  0x000000000008dbb4 in main.fib (x=15, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#10 0x000000000008dbb4 in main.fib (x=17, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#11 0x000000000008dba0 in main.fib (x=18, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#12 0x000000000008dbb4 in main.fib (x=20, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#13 0x000000000008dbb4 in main.fib (x=22, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#14 0x000000000008dba0 in main.fib (x=23, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#15 0x000000000008dbb4 in main.fib (x=25, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#16 0x000000000008dba0 in main.fib (x=26, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#17 0x000000000008dba0 in main.fib (x=27, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#18 0x000000000008dba0 in main.fib (x=28, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#19 0x000000000008dbb4 in main.fib (x=30, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#20 0x000000000008dbb4 in main.fib (x=32, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#21 0x000000000008dba0 in main.fib (x=33, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#22 0x000000000008dba0 in main.fib (x=34, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#23 0x000000000008dbb4 in main.fib (x=36, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#24 0x000000000008dbb4 in main.fib (x=38, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#25 0x000000000008dba0 in main.fib (x=39, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#26 0x000000000008dbb4 in main.fib (x=41, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#27 0x000000000008dba0 in main.fib (x=42, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#28 0x000000000008dbb4 in main.fib (x=44, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#29 0x000000000008dba0 in main.fib (x=45, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#30 0x000000000008d91c in main.main () at /home/admin/_mthread/main.go:12

Thread 3 (LWP 2803 "thex"):
#0  runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:596
#1  0x000000000003b55c in runtime.futexsleep (addr=<optimized out>, val=128, ns=0) at /usr/lib/go-1.18/src/runtime/os_linux.go:66
#2  0x0000000000019144 in runtime.notesleep (n=0x400003a548) at /usr/lib/go-1.18/src/runtime/lock_futex.go:159
#3  0x0000000000045a18 in runtime.mPark () at /usr/lib/go-1.18/src/runtime/proc.go:1449
#4  runtime.stopm () at /usr/lib/go-1.18/src/runtime/proc.go:2228
#5  0x00000000000470e4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2804
#6  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#7  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#8  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (LWP 2802 "thex"):
#0  runtime.usleep () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:150
#1  0x000000000004cc6c in runtime.sysmon () at /usr/lib/go-1.18/src/runtime/proc.go:5066
#2  0x000000000004411c in runtime.mstart1 () at /usr/lib/go-1.18/src/runtime/proc.go:1418
#3  0x0000000000044068 in runtime.mstart0 () at /usr/lib/go-1.18/src/runtime/proc.go:1376
#4  0x00000000000677f0 in runtime.mstart () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:114

Thread 1 (LWP 2801 "thex"):
#0  runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:596
#1  0x000000000003b55c in runtime.futexsleep (addr=<optimized out>, val=128, ns=0) at /usr/lib/go-1.18/src/runtime/os_linux.go:66
#2  0x0000000000019144 in runtime.notesleep (n=0x138aa8 <runtime.m0+328>) at /usr/lib/go-1.18/src/runtime/lock_futex.go:159
#3  0x0000000000045a18 in runtime.mPark () at /usr/lib/go-1.18/src/runtime/proc.go:1449
#4  runtime.stopm () at /usr/lib/go-1.18/src/runtime/proc.go:2228
#5  0x00000000000470e4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2804
#6  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#7  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#8  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

As you can see, thread 1 and 3 are waiting on mutex, thread 4 has already finished calculating the Fibonacci(45) value.
After taking that backtrace I run continue on the gdb, waited another minute and took another backtrace:

(gdb) continue
Continuing.
^C
Thread 1 "thex" received signal SIGINT, Interrupt.
runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:596
596		SVC
(gdb) thread apply all backtrace

Thread 5 (LWP 2805 "thex"):
#0  runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:596
#1  0x000000000003b55c in runtime.futexsleep (addr=<optimized out>, val=128, ns=0) at /usr/lib/go-1.18/src/runtime/os_linux.go:66
#2  0x0000000000019144 in runtime.notesleep (n=0x4000080148) at /usr/lib/go-1.18/src/runtime/lock_futex.go:159
#3  0x0000000000045a18 in runtime.mPark () at /usr/lib/go-1.18/src/runtime/proc.go:1449
#4  runtime.stopm () at /usr/lib/go-1.18/src/runtime/proc.go:2228
#5  0x00000000000470e4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2804
#6  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#7  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#8  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 4 (LWP 2804 "thex"):
#0  0x000000000008db68 in main.fib (x=6, ~r0=<optimized out>) at /home/admin/_mthread/main.go:33
#1  0x000000000008dba0 in main.fib (x=7, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#2  0x000000000008dba0 in main.fib (x=8, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#3  0x000000000008dba0 in main.fib (x=9, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#4  0x000000000008dba0 in main.fib (x=10, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#5  0x000000000008dbb4 in main.fib (x=12, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#6  0x000000000008dba0 in main.fib (x=13, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#7  0x000000000008dba0 in main.fib (x=14, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#8  0x000000000008dbb4 in main.fib (x=16, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#9  0x000000000008dba0 in main.fib (x=17, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#10 0x000000000008dbb4 in main.fib (x=19, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#11 0x000000000008dba0 in main.fib (x=20, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#12 0x000000000008dbb4 in main.fib (x=22, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#13 0x000000000008dbb4 in main.fib (x=24, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#14 0x000000000008dba0 in main.fib (x=25, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#15 0x000000000008dba0 in main.fib (x=26, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#16 0x000000000008dbb4 in main.fib (x=28, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#17 0x000000000008dbb4 in main.fib (x=30, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#18 0x000000000008dba0 in main.fib (x=31, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#19 0x000000000008dbb4 in main.fib (x=33, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#20 0x000000000008dbb4 in main.fib (x=35, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#21 0x000000000008dba0 in main.fib (x=36, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#22 0x000000000008dbb4 in main.fib (x=38, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#23 0x000000000008dbb4 in main.fib (x=40, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#24 0x000000000008dbb4 in main.fib (x=42, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#25 0x000000000008dba0 in main.fib (x=43, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#26 0x000000000008dbb4 in main.fib (x=45, ~r0=<optimized out>) at /home/admin/_mthread/main.go:37
#27 0x000000000008d91c in main.main () at /home/admin/_mthread/main.go:12

Thread 3 (LWP 2803 "thex"):
#0  runtime.epollwait () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:731
#1  0x000000000003b280 in runtime.netpoll (delay=<optimized out>, ~r0=...) at /usr/lib/go-1.18/src/runtime/netpoll_epoll.go:126
#2  0x0000000000046df4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2767
#3  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#4  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#5  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (LWP 2802 "thex"):
#0  runtime.usleep () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:150
#1  0x000000000004cc6c in runtime.sysmon () at /usr/lib/go-1.18/src/runtime/proc.go:5066
#2  0x000000000004411c in runtime.mstart1 () at /usr/lib/go-1.18/src/runtime/proc.go:1418
#3  0x0000000000044068 in runtime.mstart0 () at /usr/lib/go-1.18/src/runtime/proc.go:1376
#4  0x00000000000677f0 in runtime.mstart () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:114

Thread 1 (LWP 2801 "thex"):
#0  runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_arm64.s:596
#1  0x000000000003b55c in runtime.futexsleep (addr=<optimized out>, val=128, ns=0) at /usr/lib/go-1.18/src/runtime/os_linux.go:66
#2  0x0000000000019144 in runtime.notesleep (n=0x138aa8 <runtime.m0+328>) at /usr/lib/go-1.18/src/runtime/lock_futex.go:159
#3  0x0000000000045a18 in runtime.mPark () at /usr/lib/go-1.18/src/runtime/proc.go:1449
#4  runtime.stopm () at /usr/lib/go-1.18/src/runtime/proc.go:2228
#5  0x00000000000470e4 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>) at /usr/lib/go-1.18/src/runtime/proc.go:2804
#6  0x0000000000048278 in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3187
#7  0x00000000000487f4 in runtime.park_m (gp=0x400010a680) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#8  0x0000000000067864 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_arm64.s:186
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

as you can see nothing has changed. As previously Marcus stated, it has non-deterministic behaviour, playing with SIGINT and "continue" a few times in the gdb has unblocked the program and it has finished successfully. The amount of SIGINT - continue phases vary, once it is one time another time it is 4 times to break and continue the process. As already mentioned, creating a new ssh connection to the board unblocks always stuck threads and the program finishes successfully.

Important: whenever the program got stuck, the thread responsible for displaying the "simplified progress notification" in the routine spinner was printing the values properly the whole time

@isbm
Copy link

isbm commented Sep 22, 2022

@ianlancetaylor @prattmic Another point: I just compiled this little Fibonacci program with 1.13.15 version, latest before 1.14 version, and observed while it works ~1 minute vs expected 5-7 seconds, it still performs reliably and deterministically slow, without being stuck.

This is a trace from it, so as you see no SIGURG noted:

[pid  3243] futex(0x177440, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=99830993} <unfinished ...>
[pid  3242] futex(0x40000364c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3241] <... nanosleep resumed>NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid  3243] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  3243] futex(0x40000364c8, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  3242] <... futex resumed>)        = 0
[pid  3243] write(1, "\r/", 2 <unfinished ...>
[pid  3242] futex(0x40000364c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3243] <... write resumed>)        = 2
[pid  3243] futex(0x40000364c8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  3243] futex(0x177340, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=99834393} <unfinished ...>
[pid  3242] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid  3242] futex(0x40000364c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3241] <... nanosleep resumed>NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid  3241] nanosleep({tv_sec=0, tv_nsec=10000000}, ^Cstrace: Process 3240 detached
strace: Process 3241 detached
<detached ...>

Does it ring a bell?

@prattmic
Copy link
Member

1.14 added asynchronous preemption. Could you try setting the GODEBUG=asyncpreemptoff=1 environment variable on a later version to see if it has the same behavior as 1.13?

@schaefi
Copy link
Author

schaefi commented Sep 26, 2022

@prattmic thanks for the hints. We have tested it on the board and the issue persists with and without the asynchronous preemption. After further digging into this we believe this is not a go issue. The board which is used here has 4 cores. 2 real CPUs exposing two cores each. If we run applications limited to the cores of one CPU (taskset --cpu-list 0,1 ...) the dead lock issue does not appear and the behavior becomes deterministic. Thus I think the issue lives on the kernel/hardware side of things.

The reason why we thought go could be the problem was because go uses threads a lot and therefore was constantly running into the dead lock issue when all other applications worked as expected. But none of them uses more than two threads. So we were running into the wrong direction, really sorry for that but all your tips helped to debug the problem.
Thanks much.

There is still no solution for us but I wanted to share our findings here such that you know

If you don't mind I'd like to keep this open for a few more days even though I don't expect any fundamental changes with regards to the reasons of the issue.

Again thanks for your help

@ianlancetaylor
Copy link
Contributor

Thanks for following up. I'm going to close this issue, but please feel free to comment if it does seem to be a Go problem after all.

@ianlancetaylor ianlancetaylor closed this as not planned Won't fix, can't repro, duplicate, stale Sep 26, 2022
@golang golang locked and limited conversation to collaborators Sep 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants