Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: timeouts in os/signal tests #27520

Closed
randall77 opened this issue Sep 5, 2018 · 11 comments
Closed

runtime: timeouts in os/signal tests #27520

randall77 opened this issue Sep 5, 2018 · 11 comments
Milestone

Comments

@randall77
Copy link
Contributor

From @mwf on issue #25696 :

Hi! I ran into a similar problem here: #26317 (comment)

$ time go test all
...

*** Test killed: ran too long (10m0s).
FAIL os/signal 605.012s
...
go test all 1521.37s user 69.95s system 228% cpu 11:35.33 total
It hangs even with -short flag passed.
The test passes without any problem in go1.10.3.

$ go version && go test os/signal
go version go1.10.3 darwin/amd64
ok  	os/signal	7.701s

The most interesting thing that go test -short os/signal test passes OK, if you've never run go test os/signal after installing go. Once you run it, short tests also hang, even go clean -testcache doesn't help.

$ go test -short os/signal
ok  	os/signal	4.742s
$ go clean -testcache
$ go test os/signal
signal: killed
FAIL	os/signal	59.544s
$ go test -short os/signal
signal: killed
FAIL	os/signal	132.764s

Sorry, even the -short sometimes hangs.

Please try running this:

for i in {1..20}
do
    go test -short os/signal;
    go clean -testcache;
done
signal: killed
FAIL	os/signal	26.467s
ok  	os/signal	4.735s
ok  	os/signal	4.716s
ok  	os/signal	4.719s
signal: killed
FAIL	os/signal	25.518s
ok  	os/signal	4.705s
signal: killed
FAIL	os/signal	13.824s
^C^C^C%

I killed the hanging tests manually with ps aux | grep signal.test | grep -v "grep" | awk '{print $2}' | xargs kill -9

So it has nothing to do with short/non-short difference.

@randall77 randall77 self-assigned this Sep 5, 2018
@randall77 randall77 added this to the Go1.11.1 milestone Sep 5, 2018
@randall77
Copy link
Contributor Author

I traced this into a bug in semasleep in runtime/os_darwin.go.
When the condition variable wait has a spurious wakeup (due to signals in this test, I believe) we retry the wait with the same timeout again, instead of subtracting the time used so far. For this test, this ends up with a time.Sleep taking forever, as it gets spurious wakeups at a faster rate than the timeout.

@gopherbot
Copy link

Change https://golang.org/cl/133655 mentions this issue: runtime: in semasleep, subtract time spent so far from timeout

@randall77 randall77 modified the milestones: Go1.11.1, Go1.12 Sep 5, 2018
@randall77
Copy link
Contributor Author

@gopherbot, please open an issue for backport to 1.11.1.

@gopherbot
Copy link

Backport issue(s) opened: #27521 (for 1.11).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@odeke-em
Copy link
Member

@randall77 you were right on target with your prognosis and fix of this issue, nice!

I got some time tonight to work on a repro for a regression test, as I had promised in a tertiary issue.
The repro is really simple, it sends an asynchronous non-terminal signal to a program that uses semasleep e.g. on OS X, with <-time.After(period)

This simple program can reproduce this issue on unpatched versions

//+build !windows

package main

import (
	"log"
	"os"
	"time"
)

func main() {
	log.Printf("PID: %d", os.Getpid())
	<-time.After(4 * time.Second)
}

and given this bash script

#!/bin/bash

function si {
    for ((i=0; i<40; i++))
    do
        kill -SIGIO $1 && echo "Sent SIGIO" || break
        sleep 0.6
    done
}
si $1

Running the pair

If you run time go run main.go and quickly grab that PID that's printed, then in another shell run that bash script e.g. chmod +x stopper.sh && ./stopper.sh 77055

With the fix

The Go program will return immediately after the fixed number of seconds. The bash shell only prints say 4 times and exits, the Go program will properly exit

$ time go run samp.go 
2018/09/12 02:06:09 PID: 77259

real	0m4.212s
user	0m0.201s
sys	0m0.122s

Without the fix

The bash shell will print 40 times

Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO
Sent SIGIO

and then finally the Go program will exit after >= 40 * 0.6s: >= 24s

$ time go run samp.go 
2018/09/12 02:03:47 PID: 77177

real	0m31.535s
user	0m0.191s
sys	0m0.118s

Go regression test

I tried to work on a repro but perhaps I need a fresh mind in the morning as the repro when translated should be trivial. Nonetheless here it is

//+build !windows

package main

import (
	"io/ioutil"
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	tempDir, err := ioutil.TempDir("", "signal-refresh")
	if err != nil {
		log.Fatalf("Failed to create the temp directory: %v", err)
	}
	defer os.RemoveAll(tempDir)

	// Given the simple program below
	repro := `
    package main

    import "time"

    func main() {
        <-time.After(2 * time.Second)
    }
    `
	reproPath := filepath.Join(tempDir, "repro.go")
	if err := ioutil.WriteFile(reproPath, []byte(repro), 0755); err != nil {
		log.Printf("Failed to create temp file for repro.go: %v", err)
		os.Exit(-1)
	}

	// Once it is written now we need to run it
	cmd := exec.Command("go", "run", reproPath)
	if err := cmd.Start(); err != nil {
		log.Printf("Failed to start command: %v", err)
		os.Exit(-1)
	}
	doneCh := make(chan error)
	go func() {
		doneCh <- cmd.Wait()
	}()

	// Now that we've started the repro, we
	// can continuously send to it signal SIGIO.
	unfixedTimer := time.NewTimer(4 * time.Second)
	for {
		select {
		case <-time.After(600 * time.Millisecond):
			// Send the pesky signal that toggle spinning
			// till infinity if #27520 is not fixed!!
                        cmd.Process.Signal(syscall.SIGIO)
			log.Println("Sent SIGIO")

		case <-unfixedTimer.C:
			log.Println("Unfortunately the issue hasn't yet been fixed!")
			cmd.Process.Signal(syscall.SIGKILL)
			return

		case err := <-doneCh:
			if err != nil {
				log.Printf("The program returned but unfortunately with an error: %v", err)
			} else {
				log.Println("Hooray, the issue is fixed!!")
			}
			return
		}
	}
}

but something isn't right with my Go translation and unfortunately it doesn't reproduce the issue as a standalone, but the previous go and shell program reproduce it reliably. Perhaps it being the parent sending the SIGIO? I'll also kindly page @ianlancetaylor to help me as this involves signals.

@ianlancetaylor
Copy link
Contributor

When you use go run, you're sending the signal to the cmd/go process, not to your program. Use go build and run the program directly rather than via go run.

@gopherbot
Copy link

Change https://golang.org/cl/135015 mentions this issue: runtime: regression test for semasleep indefinite hang

@odeke-em
Copy link
Member

When you use go run, you're sending the signal to the cmd/go process, not to your program. Use go build and run the program directly rather than via go run.

@ianlancetaylor awesome, thank you for pointing that out! In deed, that was my problem.

With that update, this standalone Go repro below can now reliably serve as a regression test

//+build !windows

package main

import (
	"io/ioutil"
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	tempDir, err := ioutil.TempDir("", "signal-refresh")
	if err != nil {
		log.Fatalf("Failed to create the temp directory: %v", err)
	}
	defer os.RemoveAll(tempDir)

	// Given the simple program below
	repro := `
    package main

    import "time"

    func main() {
        <-time.After(2 * time.Second)
    }
    `
	mainPath := filepath.Join(tempDir, "main.go")
	if err := ioutil.WriteFile(mainPath, []byte(repro), 0755); err != nil {
		log.Printf("Failed to create temp file for repro.go: %v", err)
		return
	}
	binaryPath := filepath.Join(tempDir, "binary")

	out, err := exec.Command("go", "build", "-o", binaryPath, mainPath).CombinedOutput()
	if err != nil {
		log.Printf("Failed to compile the binary: err: %v\nOutput: %s\n", err, out)
		return
	}
	if err := os.Chmod(binaryPath, 0755); err != nil {
		log.Printf("Failed to chmod binary: %v", err)
		return
	}

	// Now run the binary
	cmd := exec.Command(binaryPath)
	if err := cmd.Start(); err != nil {
		log.Printf("Failed to start command: %v", err)
		return
	}
	doneCh := make(chan error)
	go func() {
		doneCh <- cmd.Wait()
	}()

	// Now that we've started the repro, we
	// can continuously send to it signal SIGIO.
	unfixedTimer := time.NewTimer(4 * time.Second)
	for {
		select {
		case <-time.After(600 * time.Millisecond):
			// Send the pesky signal that toggle spinning
			// till infinity if #27520 is not fixed!!
			cmd.Process.Signal(syscall.SIGIO)
			log.Println("Sent SIGIO")

		case <-unfixedTimer.C:
			log.Println("Unfortunately the issue hasn't yet been fixed!")
			cmd.Process.Signal(syscall.SIGKILL)
			return

		case err := <-doneCh:
			if err != nil {
				log.Printf("The program returned but unfortunately with an error: %v", err)
			} else {
				log.Println("Hooray, the issue is fixed!!")
			}
			return
		}
	}
}

Results

  • On unpatched versions of Go
$ go run main.go 
2018/09/12 11:48:33 Sent SIGIO
2018/09/12 11:48:34 Sent SIGIO
2018/09/12 11:48:35 Sent SIGIO
2018/09/12 11:48:35 Sent SIGIO
2018/09/12 11:48:36 Sent SIGIO
2018/09/12 11:48:36 Sent SIGIO
2018/09/12 11:48:37 Unfortunately the issue hasn't yet been fixed!
  • On patched versions of Go
$ go run main.go 
2018/09/12 11:49:03 Sent SIGIO
2018/09/12 11:49:04 Sent SIGIO
2018/09/12 11:49:04 Sent SIGIO
2018/09/12 11:49:05 Hooray, the issue is fixed!!

With that, I've mailed https://go-review.googlesource.com/c/go/+/135015

gopherbot pushed a commit that referenced this issue Sep 13, 2018
A regression test in which: for a program that invokes semasleep,
we send non-terminal signals such as SIGIO.
Since the signal wakes up pthread_cond_timedwait_relative_np,
after CL 133655, we should only re-spin for the amount of
time left, instead of re-spinning with the original duration
which would cause an indefinite spin.

Updates #27520

Change-Id: I744a6d04cf8923bc4e13649446aff5e42b7de5d8
Reviewed-on: https://go-review.googlesource.com/135015
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@mikesmitty
Copy link

mikesmitty commented Sep 4, 2019

Not sure if this is the right place for it, but on go1.13 I'm seeing this issue pop up intermittently. In a CentOS 6 chroot on CentOS 7 x86_64 (using mock with the epel-6-x86_64 target).

--- FAIL: TestSpuriousWakeupsNeverHangSemasleep (2.86s)
    semasleep_test.go:51: Program failed to return on time and has to be killed, issue #27520 still exists
FAIL
FAIL    runtime 99.958s

@ianlancetaylor
Copy link
Contributor

@mikesmitty This issue has been closed for almost a year and we think it is fixed. Please open a new issue will full details. Thanks.

@gopherbot
Copy link

Change https://golang.org/cl/233361 mentions this issue: runtime: deflake TestSpuriousWakeupsNeverHangSemasleep

@golang golang locked and limited conversation to collaborators May 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants