Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal error: stopTheWorld: not stopped #17442

Closed
dvyukov opened this issue Oct 14, 2016 · 13 comments
Closed

runtime: fatal error: stopTheWorld: not stopped #17442

dvyukov opened this issue Oct 14, 2016 · 13 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Oct 14, 2016

go version devel +d13fa4d Fri Oct 14 03:58:15 2016 +0000 linux/amd64

I am seeing the following crashes:

fatal error: stopTheWorld: not stopped
SIGABRT: abort
PC=0x448321 m=0

goroutine 0 [idle]:
runtime.futex(0x4ccb10, 0x0, 0x0, 0x0, 0x7f8f00000000, 0x7ffdfc0c2030, 0x0, 0x0, 0x7ffdfc0c2060, 0x409c7b, ...)
    /tmp/go-fuzz-build962140472/goroot/src/runtime/sys_linux_amd64.s:387 +0x21 fp=0x7ffdfc0c1ff0 sp=0x7ffdfc0c1fe8
runtime.futexsleep(0x4ccb10, 0x0, 0xffffffffffffffff)
    /tmp/go-fuzz-build962140472/goroot/src/runtime/os_linux.go:45 +0x62 fp=0x7ffdfc0c2040 sp=0x7ffdfc0c1ff0
runtime.notesleep(0x4ccb10)
    /tmp/go-fuzz-build962140472/goroot/src/runtime/lock_futex.go:145 +0x6b fp=0x7ffdfc0c2070 sp=0x7ffdfc0c2040
runtime.stopm()
    /tmp/go-fuzz-build962140472/goroot/src/runtime/proc.go:1596 +0xad fp=0x7ffdfc0c2098 sp=0x7ffdfc0c2070
runtime.gcstopm()
    /tmp/go-fuzz-build962140472/goroot/src/runtime/proc.go:1800 +0xb7 fp=0x7ffdfc0c20c0 sp=0x7ffdfc0c2098
runtime.schedule()
    /tmp/go-fuzz-build962140472/goroot/src/runtime/proc.go:2086 +0x110 fp=0x7ffdfc0c2100 sp=0x7ffdfc0c20c0
runtime.goschedImpl(0xc4200001a0)
    /tmp/go-fuzz-build962140472/goroot/src/runtime/proc.go:2200 +0xfb fp=0x7ffdfc0c2120 sp=0x7ffdfc0c2100
runtime.gopreempt_m(0xc4200001a0)
    /tmp/go-fuzz-build962140472/goroot/src/runtime/proc.go:2215 +0x36 fp=0x7ffdfc0c2138 sp=0x7ffdfc0c2120
runtime.newstack()
    /tmp/go-fuzz-build962140472/goroot/src/runtime/stack.go:1052 +0x2d6 fp=0x7ffdfc0c22b0 sp=0x7ffdfc0c2138
runtime.morestack()
    /tmp/go-fuzz-build962140472/goroot/src/runtime/asm_amd64.s:366 +0x7f fp=0x7ffdfc0c22b8 sp=0x7ffdfc0c22b0

Full crash message:
https://gist.github.com/dvyukov/7968c10af421d0ef5c5dcc6a9cdfff43

To reproduce:
check out github.com/dvyukov/go-fuzz on 430caef0d9e2de9993a25311ada71d9857cf9f80
apply:

diff --git a/examples/test/test.go b/examples/test/test.go
index 04740b8..3d4c5d28 100644
--- a/examples/test/test.go
+++ b/examples/test/test.go
@@ -34,7 +34,7 @@ func init() {
 }

 func Fuzz(data []byte) int {
-       // runtime.GOMAXPROCS(runtime.NumCPU())
+       runtime.GOMAXPROCS(runtime.NumCPU())
        if len(data) == 1 {
                if data[0] == '!' || data[0] == '#' {
                        panic("bingo 0")

from go-fuzz dir:
go install ./go-fuzz ./go-fuzz-build
go-fuzz-build github.com/dvyukov/go-fuzz/examples/test
go-fuzz -bin test-fuzz.zip -workdir examples/test
after some time the crash report appears in examples/test/crashers/*.output
if go-fuzz crashes with a panic, restart it, it crashes for unrelated reason

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 17, 2016
@quentinmit
Copy link
Contributor

The linked issue (dvyukov/go-fuzz#145) suggests that the program has a data race. Aren't all bets off if there is a data race?

@quentinmit quentinmit added this to the Go1.8Maybe milestone Oct 17, 2016
@dvyukov
Copy link
Member Author

dvyukov commented Oct 18, 2016

@quentinmit The crashing process has races on integers. @rsc says races on integers are safe.

@rsc
Copy link
Contributor

rsc commented Oct 21, 2016

Yes: races on integers are not supposed to crash a program.

@rsc rsc modified the milestones: Go1.8, Go1.8Maybe Nov 1, 2016
@aclements
Copy link
Member

I was able to reproduce this in about 10 minutes at tip (96414ca) using the following:

cd 17442
GOPATH=$PWD
go get -d github.com/dvyukov/go-fuzz
cd src/github.com/dvyukov/go-fuzz
git checkout 430caef0d9e2de9993a25311ada71d9857cf9f80
sed -i -e 's,// runtime,runtime,' examples/test/test.go
go install ./go-fuzz ./go-fuzz-build
$GOPATH/bin/go-fuzz-build github.com/dvyukov/go-fuzz/examples/test
# Get core files from crashes.
echo 'core.%p' | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
export GOTRACEBACK=crash
$GOPATH/bin/go-fuzz -bin test-fuzz.zip -workdir examples/test

I applied the following patch so I could match crashes with core files:

diff --git a/go-fuzz/testee.go b/go-fuzz/testee.go
index c918632..be3f0d3 100644
--- a/go-fuzz/testee.go
+++ b/go-fuzz/testee.go
@@ -114,6 +114,7 @@ func (bin *TestBinary) test(data []byte) (res int, ns uint64, cover, sonar, outp
                                hdr := fmt.Sprintf("program hanged (timeout %v seconds)\n\n", *flagTimeout)
                                output = append([]byte(hdr), output...)
                        }
+                       output = append(output, fmt.Sprintf("\nPID: %d\n", bin.testee.cmd.Process.Pid)...)
                        bin.testee = nil
                        return
                }

To debug the core file:

unzip test-fuzz.zip
gdb cover.exe core.$BADPID

I haven't dug into the core yet.

@aclements
Copy link
Member

I believe this is a race between freezetheworld and stopTheWorldWithSema. stopTheWorldWithSema is panicking because it expects sched.stopwait to be 0, but it's 0x7fffffff, which is the value set by freezetheworld. This suggests the process was just about to die anyway from an uncaught panic.

@aclements
Copy link
Member

This suggests the process was just about to die anyway from an uncaught panic.

Further evidence: in the core I have, both M 0 and M 5 have a non-zero "dying". stopTheWorldWithSema is running on M 5 from goroutine 1. It must have come in second because it's blocked acquiring paniclk in startpanic_m. This also indicates that the bad stopwait value isn't because stopTheWorldWithSema itself panicked, since it hasn't reached the freezetheworld yet.

M 5's stack trace
SIGQUIT: quit
PC=0x44a2f1 m=5 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x4eacd8, 0x200000000, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0xc420089ef8, 0x40a417, ...)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/sys_linux_amd64.s:405 +0x21 fp=0xc420089e70 sp=0xc420089e68
runtime.futexsleep(0x4eacd8, 0x2, 0xffffffffffffffff)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/os_linux.go:45 +0x62 fp=0xc420089ec0 sp=0xc420089e70
runtime.lock(0x4eacd8)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/lock_futex.go:101 +0x157 fp=0xc420089f08 sp=0xc420089ec0
runtime.startpanic_m()
	/tmp/go-fuzz-build874034407/goroot/src/runtime/panic.go:642 +0xc1 fp=0xc420089f30 sp=0xc420089f08
runtime.systemstack(0x48a768)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/asm_amd64.s:330 +0xab fp=0xc420089f38 sp=0xc420089f30
runtime.startpanic()
	/tmp/go-fuzz-build874034407/goroot/src/runtime/panic.go:569 +0x1e fp=0xc420089f50 sp=0xc420089f38
runtime.throw(0x487a29, 0x19)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/panic.go:595 +0x88 fp=0xc420089f70 sp=0xc420089f50
runtime.stopTheWorldWithSema()
	/tmp/go-fuzz-build874034407/goroot/src/runtime/proc.go:1022 +0x1f2 fp=0xc420089fb8 sp=0xc420089f70
runtime.systemstack(0xc42044a520)
	/tmp/go-fuzz-build874034407/goroot/src/runtime/asm_amd64.s:314 +0x79 fp=0xc420089fc0 sp=0xc420089fb8
runtime.mstart()
	/tmp/go-fuzz-build874034407/goroot/src/runtime/proc.go:1102 fp=0xc420089fc8 sp=0xc420089fc0
goroutine 1 [running]:
rax    0xca
rbx    0x1
rcx    0x44a2f3
rdx    0x2
rdi    0x4eacd8
rsi    0x0
rbp    0xc420089eb0
rsp    0xc420089e68
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0xffffffee
r14    0x55
r15    0x55
rip    0x44a2f1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

Strangely, M 0, despite having a non-zero dying, is just in notesleep in stopm.

@aclements
Copy link
Member

I figured out a little more. The racing panic is specifically caused by a SIGABRT from the test harness (when the test times out), which enters the panic path of SIGABRT. I haven't figured out the exact circumstances of the race, but this could explain why I couldn't reproduce this with a normal panic.

program hanged (timeout 10 seconds)

sighandler startpanic sig=6 goid=0 m=0
bad stopwait  0x7fffffff
fatal error: stopTheWorld: not stopped
throw startpanic goid=0 m=5 curg=1
SIGABRT: abort
PC=0x44c861 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x4d4fb0, 0x0, 0x0, 0x0, 0x7fff00000000, 0x42fb4b, 0x0, 0x0, 0x7ffff1194fb0, 0x40c65b, ...)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/sys_linux_amd64.s:405 +0x21
runtime.futexsleep(0x4d4fb0, 0x0, 0xffffffffffffffff)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/os_linux.go:45 +0x62
runtime.notesleep(0x4d4fb0)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/lock_futex.go:145 +0x6b
runtime.stopm()
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:1629 +0xad
runtime.findrunnable(0xc420030000, 0x0)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:2083 +0x2e4
runtime.schedule()
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:2203 +0x14c
runtime.goschedImpl(0xc420001520)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:2281 +0x106
runtime.gosched_m(0xc420001520)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:2289 +0x36
runtime.mcall(0x7ffff1195140)
        /tmp/go-fuzz-build092816926/goroot/src/runtime/asm_amd64.s:256 +0x5b

goroutine 1 [running]:
        goroutine running on other thread; stack unavailable

goroutine 4 [runnable]:
runtime.Gosched()
        /tmp/go-fuzz-build092816926/goroot/src/runtime/proc.go:240 +0x1e
github.com/dvyukov/go-fuzz/examples/test.init.1.func1()
        /tmp/go-fuzz-build092816926/gopath/src/github.com/dvyukov/go-fuzz/examples/test/test.go:24 +0x7d
created by github.com/dvyukov/go-fuzz/examples/test.init.1
        /tmp/go-fuzz-build092816926/gopath/src/github.com/dvyukov/go-fuzz/examples/test/test.go:33 +0x65

rax    0xca
rbx    0x0
rcx    0x44c863
rdx    0x0
rdi    0x4d4fb0
rsi    0x0
rbp    0x7ffff1194f80
rsp    0x7ffff1194f38
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0xffffffee
r14    0x15c0
r15    0x55
rip    0x44c861
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

@aclements
Copy link
Member

The following program reproduces this for me in around a second. It starts a subprocess that gets multiple Ms running, one of which repeatedly stops the world. It then sends it a SIGABRT. It's important that there be multiple Ms and that the one that receives the signal be different from the one doing the STW.

It's not too sensitive to the timing of the SIGABRT, but the timing I have below worked reliably on my laptop.

The timeout handling in the test isn't necessary for reproduction, but sometimes the child gets wedged (I think for unrelated reasons), so it's nice to have.

package main

import (
	"bytes"
	"fmt"
	"log"
	"math/rand"
	"os"
	"os/exec"
	"runtime"
	"strings"
	"syscall"
	"time"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "-sub" {
		go func() {
			for {
				runtime.GC()
			}
		}()
		for {
			time.Now()
		}
	}

	self, err := os.Executable()
	if err != nil {
		log.Fatal(err)
	}
	for {
		var buf bytes.Buffer
		cmd := exec.Command(self, "-sub")
		cmd.Stdout = &buf
		cmd.Stderr = &buf
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(2*time.Millisecond + time.Duration(rand.Intn(int(5*time.Millisecond))))
		cmd.Process.Signal(syscall.SIGABRT)
		done := make(chan bool)
		dump := false
		go func() {
			for {
				t := time.NewTimer(time.Second)
				select {
				case <-done:
					t.Stop()
					return
				case <-t.C:
					fmt.Println("timed out; sending KILL")
					dump = true
					cmd.Process.Signal(syscall.SIGKILL)
				}
			}
		}()
		cmd.Wait()
		done <- true
		fail := strings.Contains(buf.String(), "not stopped")
		if dump || fail {
			fmt.Println(buf.String())
			if fail {
				break
			}
		}
	}
}

@aclements
Copy link
Member

So, at this point, the specific sequence of events is pretty clear:

  1. Something initiates a STW. stopTheWorldWithSema sets sched.stopwait to gomaxprocs and starts preempting Ps.
  2. A SIGABRT (or some _SigThrow signal) comes in on a thread that STW has already preempted. sighandler sees that crashing == 0 and calls startpanic, which eventually calls freezetheworld, which sets sched.stopwait to 0x7fffffff. This thread spends about 7 ms in freezetheworld because it can't preempt the goroutine running stopTheWorldWithSema.
  3. Meanwhile, stopTheWorldWithSema decides (correctly) that it has successfully stopped all Ps. It does a final sanity check that sched.stopwait == 0, which fails.

I'm inclined to fix this by having stopTheWorldWithSema check panicking in this failure path and simply suppress the throw (probably by just sleep-spinning). This is atomically incremented before entering freezetheworld, so it should be a good indicator. And spinning should be fine because the panic caused by the signal otherwise does the right thing (the only thing that's wrong about the panic message is the "not stopped" message).

@ianlancetaylor, are there maybe cleaner ways to solve this in the signal handler?

@ianlancetaylor
Copy link
Contributor

From what you are saying, the same problem would happen if one goroutine calls panic, the panic is not recovered, and the goroutine is about to call startpanic when the thread it is running on is preempted by the OS. Then some other goroutine on some other thread calls stopTheWorldWithSema. Then the first thread is scheduled again, calls startpanic and then freezetheworld (on the g0 stack), and we have the scenario you describe above. So the problem seems independent of signal handling as such.

Since freezetheworld is intended to permanently stop all goroutines, I think the best fix is simply for stopTheWorldWithSema to check for it, and to suspend itself, as you suggest.

@aclements
Copy link
Member

From what you are saying, the same problem would happen if one goroutine calls panic, the panic is not recovered, and the goroutine is about to call startpanic when the thread it is running on is preempted by the OS. Then some other goroutine on some other thread calls stopTheWorldWithSema. Then the first thread is scheduled again, calls startpanic and then freezetheworld (on the g0 stack), and we have the scenario you describe above. So the problem seems independent of signal handling as such.

That can't happen because the panicking goroutine will block the STW, so stopTheWorldWithSema will never get to the point where it observes the bad stopwait value. The problem with signals is that they can initiate a panic on an already "stopped" M, so unlike a normal panic, they can race with STW.

@ianlancetaylor
Copy link
Contributor

Ah, I see.

I can't think of anything clever we could do in the signal handler. We need it to crash the program. I suppose we could skip the call to freezetheworld if sched.stopwait > 0. I don't know why that would be any better than your suggestion, though.

@gopherbot
Copy link

CL https://golang.org/cl/34611 mentions this issue.

@golang golang locked and limited conversation to collaborators Dec 20, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants