runtime: async signals not reliably delivered to Go threads under TSAN #18717

bcmills · 2017-01-19T19:05:31Z

go version devel +f4be767501 Wed Jan 18 16:21:28 2017 -0500 linux/amd64

The fix for #17753 causes signal delivery in Go programs running under TSAN to pass through the TSAN interceptors.

Unfortunately, it appears that the TSAN interceptors defer delivery of asynchronous signals until the next blocking libc call on the same thread. This interacts poorly with the Go runtime's use of direct 'futex' syscalls to park idle threads: if the thread receiving the signal happens to be one that the runtime is about to park, it may never make a libc call and the signal will be lost.

A simple program illustrating this behavior:

src/tsansig/tsansig.go:

package main

import (
	"os"
	"os/signal"
	"syscall"
)

/*
#cgo CFLAGS: -g -fsanitize=thread
#cgo LDFLAGS: -g -fsanitize=thread
*/
import "C"

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	defer signal.Stop(c)
	syscall.Kill(syscall.Getpid(), syscall.SIGUSR1)
	<-c
}

$ go build tsansig
$ ./tsansig

It should exit with status 0 almost immediately. Instead, it hangs forever.

One might expect that injecting a blocking libc call would help:

package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

/*
#cgo CFLAGS: -g -fsanitize=thread
#cgo LDFLAGS: -g -fsanitize=thread

#include <stddef.h>
#include <time.h>
*/
import "C"

func nanosleep(d time.Duration) error {
	req := C.struct_timespec{
		tv_sec:  C.__time_t(d / time.Second),
		tv_nsec: C.__syscall_slong_t(d % time.Second),
	}
	var rem C.struct_timespec
	for {
		switch _, err := C.nanosleep(&req, &rem); err {
		case nil:
			return nil
		case syscall.EINTR:
			req = rem
			continue
		default:
			return err
		}
	}
}

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	defer signal.Stop(c)
	syscall.Kill(syscall.Getpid(), syscall.SIGUSR1)

	for {
		select {
		case <-c:
			return
		default:
			if err := nanosleep(10 * time.Nanosecond); err != nil {
				panic(err)
			}
		}
	}
}

But that attempt at a workaround does not succeed: the thread that receives the signal happens not to be the one the runtime uses for the subsequent cgo calls.

Impact on Go 1.8

This is (unfortunately) a significant regression over Go 1.7. In 1.7, Go programs without significant C signal behavior would exhibit correct Go signal behavior. In the current 1.8 RC, the C signal behavior in the presence of the Go runtime is improved, but the Go signal behavior is broken even for programs with no interesting C signal activity.

The text was updated successfully, but these errors were encountered:

bcmills · 2017-01-19T19:44:38Z

One possible fix would be to have something in the runtime make an explicit call (in cgo-enabled builds) to some no-op libc function (e.g. sleep(0)) to trigger signal delivery. (runtime.findrunnable seems like a likely candidate.)

However, it isn't obvious to me that any call to runtime.futexsleep with a negative ns parameter can be made safely in the presence of TSAN, since the signal could be delivered at any point between the last libc call and the futex syscall. Since futexsleep allows spurious wakeups, perhaps we also need to inject an arbitrary timeout to futexsleep when TSAN is enabled to bound the delay on signal delivery.

bcmills · 2017-01-19T21:29:44Z

(@ianlancetaylor @dvyukov )

bcmills · 2017-01-19T21:51:38Z

https://go-review.googlesource.com/#/c/35494/ has a potential fix.

dvyukov · 2017-01-20T09:17:44Z

Tsan has a notion of "blocking regions" of code (e.g. pthread_cond_wait). It delivers signals synchronously in such regions as it knows that no activity happens on the thread. We could mark whole Go world as "blocking region", then tsan will deliver signals synchronously while in Go code. However, we call some tsan annotations (__tsan_acquire/release) from Go code, so we will need to somehow mark them as well.
This will require new hooks in tsan runtime, so we will need to handle versioning somehow as old tsan runtime won't have the hooks.
What release is this for?

bcmills · 2017-01-20T21:35:45Z

We could mark whole Go world as "blocking region", then tsan will deliver signals synchronously while in Go code.
…
This will require new hooks in tsan runtime

That seems much more complicated than making periodic libc calls on idle Ms — we would also need to unblock the regions during cgo calls.

What release is this for?

I was trying to fix it in 1.8, but the changes are invasive enough that Ian suggested we delay until 1.9.

gopherbot · 2017-01-24T21:15:58Z

CL https://golang.org/cl/35494 mentions this issue.

bcmills · 2017-03-08T18:22:38Z

I've been thinking more about @dvyukov's comments, and I'm starting to see the logic to it.

I agree that marking the whole Go runtime as a "blocking region" for TSAN purposes is the right first-principles solution. It seems that as far as TSAN is concerned, the important property of a "blocking call" is that it does not acquire any libc locks, and Go only acquires libc locks when executing cgo function calls (including the x_cgo hooks in place for other runtime calls).

That said, the "periodic yielding" strategy does seem to improve responsiveness significantly in practice, so I think that's probably the right fallback for versions of TSAN which do not support the proposed new hooks.

gopherbot · 2017-03-09T05:05:40Z

CL https://golang.org/cl/37867 mentions this issue.

There are a few problems from change 35494, discovered during testing of change 37852. 1. I was confused about the usage of n.key in the sema variant, so we were looping on the wrong condition. The error was not caught by the TryBots (presumably due to missing TSAN coverage in the BSD and darwin builders?). 2. The sysmon goroutine sometimes skips notetsleep entirely, using direct usleep syscalls instead. In that case, we were not calling _cgo_yield, leading to missed signals under TSAN. 3. Some notetsleep calls have long finite timeouts. They should be broken up into smaller chunks with a yield at the end of each chunk. updates #18717 Change-Id: I91175af5dea3857deebc686f51a8a40f9d690bcc Reviewed-on: https://go-review.googlesource.com/37867 Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>

bcmills changed the title ~~runtime: async signals not reliably delivered to Go threads under TSAN/MSAN~~ runtime: async signals not reliably delivered to Go threads under TSAN Jan 20, 2017

bcmills mentioned this issue Jan 27, 2017

runtime: tight loops should be preemptible #10958

Closed

gopherbot closed this as completed in 29edf0f Mar 8, 2017

golang locked and limited conversation to collaborators Mar 9, 2018

gopherbot added the FrozenDueToAge label Mar 9, 2018

bcmills mentioned this issue Apr 5, 2019

os/signal: test flaky on darwin #31264

Closed

bcmills removed the FrozenDueToAge label Dec 4, 2019

golang unlocked this conversation Dec 4, 2019

bcmills mentioned this issue Dec 4, 2019

runtime: Windows binaries built with -race occasionally deadlock #35775

Closed

bcmills mentioned this issue Nov 13, 2020

runtime: hangs on Windows due to McAfee's mfehcinj.dll apparently hijacking or suspending threads #42528

Closed

golang locked and limited conversation to collaborators Dec 3, 2020

gopherbot added the FrozenDueToAge label Dec 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: async signals not reliably delivered to Go threads under TSAN #18717

runtime: async signals not reliably delivered to Go threads under TSAN #18717

bcmills commented Jan 19, 2017 •

edited

Loading

bcmills commented Jan 19, 2017

bcmills commented Jan 19, 2017

bcmills commented Jan 19, 2017

dvyukov commented Jan 20, 2017

bcmills commented Jan 20, 2017

gopherbot commented Jan 24, 2017

bcmills commented Mar 8, 2017

gopherbot commented Mar 9, 2017

runtime: async signals not reliably delivered to Go threads under TSAN #18717

runtime: async signals not reliably delivered to Go threads under TSAN #18717

Comments

bcmills commented Jan 19, 2017 • edited Loading

Impact on Go 1.8

bcmills commented Jan 19, 2017

bcmills commented Jan 19, 2017

bcmills commented Jan 19, 2017

dvyukov commented Jan 20, 2017

bcmills commented Jan 20, 2017

gopherbot commented Jan 24, 2017

bcmills commented Mar 8, 2017

gopherbot commented Mar 9, 2017

bcmills commented Jan 19, 2017 •

edited

Loading