runtime: forEachP: P did not run fn #10618

bradfitz · 2015-04-29T15:10:13Z

Seen on http://build.golang.org/log/fab14a23af2926e4f67dc55c4f859f0efa04159c for dragonfly at 73e791a

##### GOMAXPROCS=2 runtime -cpu=1,2,4
fatal error: forEachP: P did not run fn

runtime stack:
runtime.throw(0x699d90, 0x1a)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/panic.go:543 +0x96
runtime.forEachP(0x6e1970)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/proc1.go:826 +0x3cf
runtime.gc.func1()
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/mgc.go:802 +0xb6
runtime.systemstack(0xc2080a3500)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/proc1.go:686

The text was updated successfully, but these errors were encountered:

aclements · 2015-05-14T15:45:09Z

At this point, we've seen this failure on the following commits and builders:

2015-04-29T04:48:55-73e791a/dragonfly-amd64
2015-04-29T18:07:43-c26fc88/dragonfly-amd64
2015-05-07T23:54:28-a3dfcf5/linux-amd64-clang
2015-05-12T16:26:47-18d98bc/linux-386-sid
2015-05-12T19:34:46-9b379d7/linux-amd64
2015-05-12T23:39:00-c06b856/linux-amd64-nocgo

On all of these, the stack has been essentially the same. It's always in the runtime test, and it's always executing both TestFutexsleep.func1 and TestGoroutineProfile. TestFutexsleep itself is not running: it created two goroutines and woke them up through a futex, but they haven't been scheduled to let them exit yet.

Here's the exact stack from the most recent failure:

fatal error: forEachP: P did not run fn

runtime stack:
runtime.throw(0x6a50f0, 0x1a)
        /tmp/buildlet-scatch235740004/go/src/runtime/panic.go:527 +0x96
runtime.forEachP(0x6e3ab8)
        /tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:822 +0x3cf
runtime.gc.func1()
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:832 +0x98
runtime.systemstack(0xc20801a000)
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:682

goroutine 160 [running]:
runtime.systemstack_switch()
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:216 fp=0xc2083ce5a8 sp=0xc2083ce5a0
runtime.gc(0x0)
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:833 +0x1dc fp=0xc2083ce7b0 sp=0xc2083ce5a8
runtime.backgroundgc()
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:753 +0x40 fp=0xc2083ce7e0 sp=0xc2083ce7b0
runtime.goexit()
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2083ce7e8 sp=0xc2083ce7e0
created by runtime.startGC
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:726 +0x139

goroutine 1 [chan receive, locked to thread]:
testing.RunTests(0x6e37f0, 0x7b5da0, 0x91, 0x91, 0x7f1bd0bd8301)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:563 +0xa6a
testing.(*M).Run(0xc20821ff38, 0x4b3fed)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:492 +0x73
main.main()
        runtime/_test/_testmain.go:822 +0x119

goroutine 2592 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x4e94914f0000, 0x693930, 0x14, 0xc208018120)
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 2593 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x1dcd65174876e800, 0x6932f0, 0x13, 0xc208018180)
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 82776 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82778 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82775 [semacquire]:
runtime.GoroutineProfile(0xc208141f68, 0x80, 0x80, 0x10, 0x0)
        /tmp/buildlet-scatch235740004/go/src/runtime/mprof.go:524 +0x80
runtime_test.TestGoroutineProfile(0xc208536d80)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:47 +0x1eb
testing.tRunner(0xc208536d80, 0x7b6910)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:454 +0x9b
created by testing.RunTests
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:562 +0xa2a

goroutine 82777 [running]:
        goroutine running on other thread; stack unavailable
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82779 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130
FAIL    runtime 17.371s

dr2chase · 2015-05-14T16:06:41Z

Is this perhaps relevant?
https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

On Thu, May 14, 2015 at 11:45 AM, Austin Clements notifications@github.com
wrote:

At this point, we've seen this failure on the following commits and
builders:

2015-04-29T04:48:55-73e791a/dragonfly-amd64
2015-04-29T18:07:43-c26fc88/dragonfly-amd64
2015-05-07T23:54:28-a3dfcf5/linux-amd64-clang
2015-05-12T16:26:47-18d98bc/linux-386-sid
2015-05-12T19:34:46-9b379d7/linux-amd64
2015-05-12T23:39:00-c06b856/linux-amd64-nocgo

On all of these, the stack has been essentially the same. It's always in
the runtime test, and it's always executing both TestFutexsleep.func1 and
TestGoroutineProfile. TestFutexsleep itself is not running: it created two
goroutines and woke them up through a futex, but they haven't been
scheduled to let them exit yet.

Here's the exact stack from the most recent failure:

fatal error: forEachP: P did not run fn

runtime stack:
runtime.throw(0x6a50f0, 0x1a)
/tmp/buildlet-scatch235740004/go/src/runtime/panic.go:527 +0x96
runtime.forEachP(0x6e3ab8)
/tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:822 +0x3cf
runtime.gc.func1()
/tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:832 +0x98
runtime.systemstack(0xc20801a000)
/tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
/tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:682

goroutine 160 [running]:
runtime.systemstack_switch()
/tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:216 fp=0xc2083ce5a8 sp=0xc2083ce5a0
runtime.gc(0x0)
/tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:833 +0x1dc fp=0xc2083ce7b0 sp=0xc2083ce5a8
runtime.backgroundgc()
/tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:753 +0x40 fp=0xc2083ce7e0 sp=0xc2083ce7b0
runtime.goexit()
/tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2083ce7e8 sp=0xc2083ce7e0
created by runtime.startGC
/tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:726 +0x139

goroutine 1 [chan receive, locked to thread]:
testing.RunTests(0x6e37f0, 0x7b5da0, 0x91, 0x91, 0x7f1bd0bd8301)
/tmp/buildlet-scatch235740004/go/src/testing/testing.go:563 +0xa6a
testing.(*M).Run(0xc20821ff38, 0x4b3fed)
/tmp/buildlet-scatch235740004/go/src/testing/testing.go:492 +0x73
main.main()
runtime/_test/_testmain.go:822 +0x119

goroutine 2592 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x4e94914f0000, 0x693930, 0x14, 0xc208018120)
/tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
/tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 2593 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x1dcd65174876e800, 0x6932f0, 0x13, 0xc208018180)
/tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
/tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 82776 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
/tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
/tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82778 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
/tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
/tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82775 [semacquire]:
runtime.GoroutineProfile(0xc208141f68, 0x80, 0x80, 0x10, 0x0)
/tmp/buildlet-scatch235740004/go/src/runtime/mprof.go:524 +0x80
runtime_test.TestGoroutineProfile(0xc208536d80)
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:47 +0x1eb
testing.tRunner(0xc208536d80, 0x7b6910)
/tmp/buildlet-scatch235740004/go/src/testing/testing.go:454 +0x9b
created by testing.RunTests
/tmp/buildlet-scatch235740004/go/src/testing/testing.go:562 +0xa2a

goroutine 82777 [running]:
goroutine running on other thread; stack unavailable
created by runtime_test.TestGoroutineProfile
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82779 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
/tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
/tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
/tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130
FAIL runtime 17.371s

—
Reply to this email directly or view it on GitHub
#10618 (comment).

aclements · 2015-05-14T16:21:02Z

Is this perhaps relevant?
https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

Interesting, but I don't think so. Because of how TestFutexsleep is structured, it's actually fine if the futex sleep in TestFutexsleep.func1 never returns (at worst, I would expect the runtime test to time out if this happened, but I'm not sure if even that would happen). I'm also fairly certain we're not depending on the syscall itself to act as a memory barrier in this case, since the synchronization happens through atomic operations on p.status and p.runSafePointFn.

aclements · 2015-05-14T18:09:24Z

I think the fact that TestGoroutineProfile is running is mostly irrelevant, since the actual GorountineProfile call is just blocking on acquiring the worldsema. However, TestGoroutineProfile first increases GOMAXPROCS to 100 (while TestFutexsleep.func1 is still in the syscall path), which might be relevant.

aclements · 2015-05-14T18:11:11Z

This command line reproduces this within a few seconds for me:
GOTRACEBACK=crash stress ./runtime.test -test.run TestFutexsleep|TestGoroutineProfile

@RLH

aclements · 2015-05-14T18:43:59Z

We've found one cause of this: holding worldsema is not enough to prevent gomaxprocs from changing because we always release worldsema before calling starttheworld. Swapping the release to be after the starttheworld in all places dramatically reduces the probability of this failing. It doesn't eliminate it, though.

I'm almost positive this has the same root cause as the "fatal error: forEachP: not stopped" failures we've also been seeing. What appears to be happening is that our stopwait accounting is sometimes off by one. When this happens, the second to last P to run the safe point function decrements stopwait to 0 and wakes up forEachP. Depending on when the actual last P runs the safe point function, we either see that stopwait is -1, or we see that the last P hasn't run its safe point function yet.

aclements · 2015-05-15T19:20:16Z

We figured out the remaining problem once worldsema is released after starttheworld. I'm preparing a CL sequence (a cleanup of Rick's prototype fix in CL 10110).

Our dance involves three parties.

G1 is running on M1 on P1. P1 has an empty run queue.

G2/M2 is in a blocked syscall and has lost its P. (The details of this don't matter, it just needs to be in a position where it needs to grab an idle P.)

GC just started on G3/M3/P3. (These aren't very involved, they just have to be separate from the other G's, M's, and P's.)

GC calls stoptheworld(), which sets sched.gcwaiting to 1.

G1/M1 begins to enter a syscall:
2. G1/M1 invokes reentersyscall, which sets the P1's status to _Psyscall.
3. G1/M1's reentersyscall observes gcwaiting != 0 and calls entersyscall_gcwait.
4. G1/M1's entersyscall_gcwait blocks acquiring sched.lock.

Back on GC:
5. stoptheworld cas's P1's status to _Pgcstop, does other stuff, and returns.
6. GC does stuff and then calls starttheworld().
7. starttheworld() calls procresize(), which sets P1's status to _Pidle and puts P1 on the idle list.

G2/M2 returns from its syscall and takes over P1:
8. G2/M2 returns from its blocked syscall and pulls P1 off the idle list.
9. G2/M2 acquires P1, which sets P1's status to _Prunning.
10. G2/M2 starts a new syscall and invokes reentersyscall, which sets P1's status to _Psyscall.

Back on G1/M1:
11. G1/M1 finally acquires sched.lock in entersyscall_gcwait.

At this point, G1/M1 still thinks it's running on P1. P1's status is _Psyscall, which is consistent with what G1/M1 is doing, but it's _Psyscall because G2/M2 put it in to _Psyscall, not G1/M1. This is basically an ABA race on P1's status.

The problem arises because forEachP currently shares stopwait with stoptheworld. G1/M1's entersyscall_gcwait observes the non-zero stopwait set by forEachP, but mistakes it for a stoptheworld. It cas's P1's status from _Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement stopwait one more time than forEachP was expecting.

The plan (prototyped in Rick's CL 10110) is to give forEachP its own stopwait and stopnote that aren't shared with stoptheworld.

Of course, all of this raises the question of how this worked before forEachP existed, since the ABA race on the P status can happen without forEachP. This took me a while to figure out. Here's a sketch:
12. G1/M1's entersyscall_gcwait sees that stopwait is 0 and returns. The syscall happens and returns. G1/M1 invokes exitsyscall, which invokes exitsyscallfast.
13. G1/M1's exitsyscallfast successfully cas's P1's status from _Psyscall to _Prunning (even though this _Psyscall status was set by G2/M2, not G1/M1).
14. G1's returns to executing user code, running on P1.

And, finally, back on G2/M2:
15. G2/M2 executes its system call and then calls exitsyscall, which invokes exitsyscallfast.
16. exitsyscallfast tries to cas P1's status from _Psyscall to _Prunning, but this fails because P1 is in _Prunning (because G1/M1 is using it).

In fact, if the timing is just right, a P could bounce back and forth between two M's this way without ever passing through the idle list again. But this would be okay because only one of them would be executing user code at any given moment.

gopherbot · 2015-05-15T23:00:28Z

CL https://golang.org/cl/10156 mentions this issue.

gopherbot · 2015-05-15T23:00:29Z

CL https://golang.org/cl/10157 mentions this issue.

Currently, startTheWorld releases worldsema before starting the world. Since startTheWorld can change gomaxprocs after allowing Ps to run, this means that gomaxprocs can change while another P holds worldsema. Unfortunately, the garbage collector and forEachP assume that holding worldsema protects against changes in gomaxprocs (which it *almost* does). In particular, this is causing somewhat frequent "P did not run fn" crashes in forEachP in the runtime tests because gomaxprocs is changing between the several loops that forEachP does over all the Ps. Fix this by only releasing worldsema after the world is started. This relates to issue #10618. forEachP still fails under stress testing, but much less frequently. Change-Id: I085d627b70cca9ebe9af28fe73b9872f1bb224ff Reviewed-on: https://go-review.googlesource.com/10156 Reviewed-by: Russ Cox <rsc@golang.org>

bradfitz assigned aclements Apr 29, 2015

bradfitz added this to the Go1.5 milestone Apr 29, 2015

aclements closed this as completed in f0dd002 May 18, 2015

golang locked and limited conversation to collaborators Jun 25, 2016

gopherbot added the FrozenDueToAge label Jun 25, 2016

rsc unassigned aclements Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: forEachP: P did not run fn #10618

runtime: forEachP: P did not run fn #10618

bradfitz commented Apr 29, 2015

aclements commented May 14, 2015

dr2chase commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 15, 2015

gopherbot commented May 15, 2015

gopherbot commented May 15, 2015

runtime: forEachP: P did not run fn #10618

runtime: forEachP: P did not run fn #10618

Comments

bradfitz commented Apr 29, 2015

aclements commented May 14, 2015

dr2chase commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 14, 2015

aclements commented May 15, 2015

gopherbot commented May 15, 2015

gopherbot commented May 15, 2015