runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

cocktail828 · 2023-04-12T10:01:30Z

What version of Go are you using (`go version`)?

$ go version
go version go1.19 linux/amd64

Does this issue reproduce with the latest release?

Sure

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=
GONOSUMDB=
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=
GOPROXY="https://goproxy.cn,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2188387714=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The bug found in go1.10.3 at first, the I update go to go1.19, it happends too. I find the root case lives here:

go/src/runtime/signal_unix.go

Line 964 in de475e8

In almost all real cases the program is about to crash, but in my situation, it does not after usleep. But why, I cannot explain!!!.

Then I try fix it myself with the following patch:

	raise(sig)
	// Give the signal a chance to be delivered.
	// In almost all real cases the program is about to crash,
	// so sleeping here is not a waste of time.
	usleep(1000)

	switch sig {
	case _SIGSEGV, _SIGABRT, _SIGFPE, _SIGILL, _SIGBUS:
		print("fatal error: unexpected signal during runtime execution\n")
		signame := signame(sig)
		if signame != "" {
			print("[signal ", signame)
		} else {
			print("[signal ", hex(sig))
		}
		print(" code=", hex(c.sigcode()), " addr=", hex(c.sigaddr()), "]\n")
		dieFromSignal(sig)

	default:
		// If the signal didn't cause the program to exit, restore the
		// Go signal handler and carry on.
		//
		// We may receive another instance of the signal before we
		// restore the Go handler, but that is not so bad: we know
		// that the Go program has been ignoring the signal.
		setsig(sig, abi.FuncPCABIInternal(sighandler))
	}

What did you expect to see?

panic

What did you see instead?

See #56649

The text was updated successfully, but these errors were encountered:

seankhliao · 2023-04-12T12:09:42Z

We will still need a way to reproduce this issue to verify any fixes

prattmic · 2023-04-12T14:06:23Z

What does “containered CentOS 7” mean? Does this change the semantics of rt_sigaction? We set the action for SIGSEGV to DFL, so it shouldn’t be possible to deliver to a handler anymore.

cocktail828 · 2023-04-12T15:29:18Z

First thanks for your help.

What does “containered CentOS 7” mean? Does this change the semantics of rt_sigaction?

The container is built from centos 7.
The host is running kernel 3.10.0 and we don't make any change so rt_sigaction should work as expect.

We set the action for SIGSEGV to DFL, so it shouldn’t be possible to deliver to a handler anymore.

Yes, I can see that from strace. It should crash after usleep, but it doesn't. That's what I confused. Maybe the kernel bug???
The issue can only reproduce in small probility.

I can keep an environment for debug the next time it reproduce, if anybody interest enough to debug the issue.

cocktail828 · 2023-04-12T15:34:46Z

I don't want to merge the patch, but want somebody professional enough to review the changes.

ianlancetaylor · 2023-04-12T15:59:39Z

Can you show us the output from strace -f? Thanks.

prattmic · 2023-04-12T18:10:23Z

@ianlancetaylor the strace is on #56649 (we probably should have reopened that one rather than making a new bug)

ianlancetaylor · 2023-04-12T18:57:23Z

Thanks. I also don't understand that strace. It appears to be running the code in raisebadsignal. That means that the SIGSEGV was received on a thread that was not started by Go code.

// unblocksig(sig)
[pid 108096] rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
// setsig(sig, handler) where handler == SIG_DFL
[pid 108096] rt_sigaction(SIGSEGV, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7f695ef1d630}, NULL, 8) = 0
// raise(sig)
[pid 108096] getpid()                   = 1
[pid 108096] gettid()                   = 223
[pid 108096] tgkill(1, 223, SIGSEGV)    = 0
[pid 108096] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
// usleep(1000)
[pid 108096] nanosleep({0, 1000000}, NULL) = 0

In other words the process receives a SIGSEGV signal while the signal handler is set to SIG_DFL and the signal is not blocked. But the program just keeps running. That should not happen.

You said that this only happens occasionally. What does the strace -f output look like when the program does not go into an endless loop? Thanks.

cocktail828 · 2023-04-13T01:37:45Z

In other words the process receives a SIGSEGV signal while the signal handler is set to SIG_DFL and the signal is not blocked. But the program just keeps running. That should not happen.

Yep. I agree with u. So this maybe the kernel bug, I guess. Recently, we also request our SRE to record the kernel version of the host which reproduce the issue.

You said that this only happens occasionally. What does the strace -f output look like when the program does not go into an endless loop? Thanks.

It take some times. I will provide the log and keep an environment for debug the next time it reproduce.
Thank you very much.

cocktail828 · 2023-05-05T14:16:09Z

@ianlancetaylor
The following attachment is the output by strace -f -p ${pid}. May you get something useful from it.
strace-f.txt

Thanks in advance.

ianlancetaylor · 2023-05-05T23:30:23Z

Thanks. I just noticed si_pid=1 in the strace. I think that must be the problem. It sounds like you are running this program in a container, and you are running it as process 1. As it happens, process 1 is magic in Unix kernels: the kernel only delivers signals that actually have signal handlers. So when the Go runtime removes the signal handlers and raises the signal again, nothing happens. Now I don't know why this ever works at all, at least when running as process 1.

I don't know how much we care about this edge case. But we could change raisebadsignal to check for a process ID of 1 and a signal expected to kill the process.

cocktail828 · 2023-05-06T02:05:38Z

It's weird, the process crash and exit in almost all case, and they all run as pid 1 in a container. In other words, a container with only one process(no matter it comes from C/C++, Golang, or others) should crash and exit for SIGSEGV, or something unexpected should happed. Image a case, that all container crash but not exit... So I still think there is a bug hide in go or the kernel.

Thanks for your attention!

ianlancetaylor · 2023-05-10T00:13:01Z

The failure will only occur if the signal is delivered on a thread that was created by C, not by Go. So there is some unpredictability as to precisely how it will be handled. It is only when the signal is delivered to a C thread that we go into the endless loop of signal delivery due to running on PID 1.

prattmic · 2023-05-10T19:22:21Z

For reference, the kernel ignores signals with SIG_DFL handler for init tasks (SIGNAL_UNKILLABLE) unless it is a "forced" signal: https://elixir.bootlin.com/linux/v6.3/source/kernel/signal.c#L89

Kernel generated signals (such as SIGSEGV from a page fault) are "forced": https://elixir.bootlin.com/linux/v6.3/source/kernel/signal.c#L1245

So we should be able to raise the signal by explicitly faulting after removing the signal handler.

cocktail828 · 2023-05-11T03:15:34Z

Yep, you are right. I can reproduce the issue myself with follwing step.
And the issue will not be reproduce when not ru at pid 1

step1: compile the code and get an executable a.out

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#ifndef SYS_gettid
#error"SYS_gettid unavailable on this system"
#endif
#define gettid() ((pid_t)syscall(SYS_gettid))

long sys_tgkill(pid_t pid, pid_t tid, int signo) {
    printf(">>>>>%ld\n", syscall(SYS_tgkill, pid, tid, signo));
}

pid_t tid_send_signal;
pid_t tid_recv_signal;

void* thread_func(void* arg) {
    pid_t tid = gettid();
    if (*(int*)arg)
        tid_send_signal = tid;
    else
        tid_recv_signal = tid;
    printf("======%d %ld\n", *(int*)arg, tid);

    pthread_detach(pthread_self());
    sleep(1);

    while (1) {
        if (*(int*)arg)
                sys_tgkill(getpid(), tid_recv_signal, SIGSEGV);
        sleep(1);
    }
}

int main(int argc, char* argv[]) {
    pthread_t t1, t2;
    int should_send_signal = 1;
    pthread_create(&t1, NULL, thread_func, &should_send_signal);

    int should_recv_signal = 0;
    pthread_create(&t2, NULL, thread_func, &should_recv_signal);
    while (1) { sleep(100); }
    return 0;
}

step2: make an docker image, and add the a.out into the image.

cat > Dockerfile <<EOF
from debian:bullseye-slim
add a.out /root
cmd ["/root/a.out"]
EOF

step3: run the image and the issue reproduce...

[root@sentry-dev ~]# strace -ff -p `pidof a.out`
strace: Process 23367 attached with 3 threads
[pid 23399] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 23367] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 23399] <... restart_syscall resumed>) = 0
[pid 23399] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23398] <... restart_syscall resumed>) = 0
[pid 23398] getpid()                    = 1
[pid 23398] tgkill(1, 8, SIGSEGV <unfinished ...>
[pid 23399] <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=983584817}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 23398] <... tgkill resumed>)       = 0
[pid 23399] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
[pid 23398] write(1, ">>>>>0\n", 7 <unfinished ...>
[pid 23399] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] <... write resumed>)        = 7
[pid 23398] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23399] <... restart_syscall resumed>) = 0
[pid 23399] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 23398] <... clock_nanosleep resumed>0x7f72f8363e80) = 0
[pid 23398] getpid()                    = 1
[pid 23398] tgkill(1, 8, SIGSEGV <unfinished ...>
[pid 23399] <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=983160829}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 23398] <... tgkill resumed>)       = 0
[pid 23399] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
[pid 23399] restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 23398] write(1, ">>>>>0\n", 7)     = 7
[pid 23398] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, ^Cstrace: Process 23367 detached
strace: Process 23398 detached
 <detached ...>
strace: Process 23399 detached
[root@sentry-dev ~]#

Notice:
run with --init to make the process not start with pid 1, will fix this issue!

[root@sentry-dev xxx]# docker run -it --rm --init xxx
======1 8
======0 9
>>>>>0 # the process exit successfully

seankhliao added the WaitingForInfo label Apr 12, 2023

seankhliao changed the title ~~affected/package: runtime~~ runtime: cgo stuck because of go signal handler went into dead loop Apr 12, 2023

gopherbot added the compiler/runtime label Apr 12, 2023

mknyszek added this to Go Compiler / Runtime Apr 12, 2023

ianlancetaylor removed the WaitingForInfo label May 5, 2023

cherrymui changed the title ~~runtime: cgo stuck because of go signal handler went into dead loop~~ runtime: cgo stuck because of go signal handler went into dead loop if pid=1 May 9, 2023

cherrymui added the NeedsDecision label May 9, 2023

cherrymui added this to the Backlog milestone May 9, 2023

mknyszek assigned prattmic May 10, 2023

mknyszek moved this to Todo in Go Compiler / Runtime May 10, 2023

RSWilli mentioned this issue Oct 19, 2023

Handling SIGSEGV when in docker container go-gst/go-gst#55

Closed

gabyhelp mentioned this issue Mar 3, 2025

Go process receiving SIGABRT or SIGSEGV signals is exiting with 2 exit status #72084

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

cocktail828 commented Apr 12, 2023

seankhliao commented Apr 12, 2023

prattmic commented Apr 12, 2023

cocktail828 commented Apr 12, 2023

cocktail828 commented Apr 12, 2023

ianlancetaylor commented Apr 12, 2023

prattmic commented Apr 12, 2023

ianlancetaylor commented Apr 12, 2023

cocktail828 commented Apr 13, 2023

cocktail828 commented May 5, 2023

ianlancetaylor commented May 5, 2023

cocktail828 commented May 6, 2023

ianlancetaylor commented May 10, 2023

prattmic commented May 10, 2023

cocktail828 commented May 11, 2023 •

edited

Loading

runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

runtime: cgo stuck because of go signal handler went into dead loop if pid=1 #59569

Comments

cocktail828 commented Apr 12, 2023

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

seankhliao commented Apr 12, 2023

prattmic commented Apr 12, 2023

cocktail828 commented Apr 12, 2023

cocktail828 commented Apr 12, 2023

ianlancetaylor commented Apr 12, 2023

prattmic commented Apr 12, 2023

ianlancetaylor commented Apr 12, 2023

cocktail828 commented Apr 13, 2023

cocktail828 commented May 5, 2023

ianlancetaylor commented May 5, 2023

cocktail828 commented May 6, 2023

ianlancetaylor commented May 10, 2023

prattmic commented May 10, 2023

cocktail828 commented May 11, 2023 • edited Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

cocktail828 commented May 11, 2023 •

edited

Loading