
runtime: forEachP: P did not run fn #10618

Closed
bradfitz opened this issue Apr 29, 2015 · 9 comments

@bradfitz (Contributor)

Seen on http://build.golang.org/log/fab14a23af2926e4f67dc55c4f859f0efa04159c for dragonfly at 73e791a

##### GOMAXPROCS=2 runtime -cpu=1,2,4
fatal error: forEachP: P did not run fn

runtime stack:
runtime.throw(0x699d90, 0x1a)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/panic.go:543 +0x96
runtime.forEachP(0x6e1970)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/proc1.go:826 +0x3cf
runtime.gc.func1()
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/mgc.go:802 +0xb6
runtime.systemstack(0xc2080a3500)
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
    /tmp/gobuilder/dragonfly-amd64-73e791a38a71/go/src/runtime/proc1.go:686
@bradfitz bradfitz added this to the Go1.5 milestone Apr 29, 2015
@aclements (Member)

At this point, we've seen this failure on the following commits and builders:

2015-04-29T04:48:55-73e791a/dragonfly-amd64
2015-04-29T18:07:43-c26fc88/dragonfly-amd64
2015-05-07T23:54:28-a3dfcf5/linux-amd64-clang
2015-05-12T16:26:47-18d98bc/linux-386-sid
2015-05-12T19:34:46-9b379d7/linux-amd64
2015-05-12T23:39:00-c06b856/linux-amd64-nocgo

On all of these, the stack has been essentially the same. It's always in the runtime test, and it's always executing both TestFutexsleep.func1 and TestGoroutineProfile. TestFutexsleep itself is not running: it created two goroutines and woke them up through a futex, but they haven't been scheduled to let them exit yet.

Here's the exact stack from the most recent failure:

fatal error: forEachP: P did not run fn

runtime stack:
runtime.throw(0x6a50f0, 0x1a)
        /tmp/buildlet-scatch235740004/go/src/runtime/panic.go:527 +0x96
runtime.forEachP(0x6e3ab8)
        /tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:822 +0x3cf
runtime.gc.func1()
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:832 +0x98
runtime.systemstack(0xc20801a000)
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /tmp/buildlet-scatch235740004/go/src/runtime/proc1.go:682

goroutine 160 [running]:
runtime.systemstack_switch()
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:216 fp=0xc2083ce5a8 sp=0xc2083ce5a0
runtime.gc(0x0)
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:833 +0x1dc fp=0xc2083ce7b0 sp=0xc2083ce5a8
runtime.backgroundgc()
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:753 +0x40 fp=0xc2083ce7e0 sp=0xc2083ce7b0
runtime.goexit()
        /tmp/buildlet-scatch235740004/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2083ce7e8 sp=0xc2083ce7e0
created by runtime.startGC
        /tmp/buildlet-scatch235740004/go/src/runtime/mgc.go:726 +0x139

goroutine 1 [chan receive, locked to thread]:
testing.RunTests(0x6e37f0, 0x7b5da0, 0x91, 0x91, 0x7f1bd0bd8301)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:563 +0xa6a
testing.(*M).Run(0xc20821ff38, 0x4b3fed)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:492 +0x73
main.main()
        runtime/_test/_testmain.go:822 +0x119

goroutine 2592 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x4e94914f0000, 0x693930, 0x14, 0xc208018120)
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 2593 [syscall]:
runtime_test.TestFutexsleep.func1(0x0, 0x1dcd65174876e800, 0x6932f0, 0x13, 0xc208018180)
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:47 +0x2d
created by runtime_test.TestFutexsleep
        /tmp/buildlet-scatch235740004/go/src/runtime/futex_test.go:51 +0x26d

goroutine 82776 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82778 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82775 [semacquire]:
runtime.GoroutineProfile(0xc208141f68, 0x80, 0x80, 0x10, 0x0)
        /tmp/buildlet-scatch235740004/go/src/runtime/mprof.go:524 +0x80
runtime_test.TestGoroutineProfile(0xc208536d80)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:47 +0x1eb
testing.tRunner(0xc208536d80, 0x7b6910)
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:454 +0x9b
created by testing.RunTests
        /tmp/buildlet-scatch235740004/go/src/testing/testing.go:562 +0xa2a

goroutine 82777 [running]:
        goroutine running on other thread; stack unavailable
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130

goroutine 82779 [runnable]:
syscall.Syscall(0x3, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x9)
        /tmp/buildlet-scatch235740004/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xffffffffffffffff, 0x0, 0x0)
        /tmp/buildlet-scatch235740004/go/src/syscall/zsyscall_linux_amd64.go:294 +0x51
runtime_test.TestGoroutineProfile.func1(0xc2080110e0, 0xc20800e9a0)
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:35 +0x3e
created by runtime_test.TestGoroutineProfile
        /tmp/buildlet-scatch235740004/go/src/runtime/runtime_unix_test.go:38 +0x130
FAIL    runtime 17.371s

@dr2chase (Contributor)

Is this perhaps relevant?
https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

@aclements (Member)

Is this perhaps relevant?
https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

Interesting, but I don't think so. Because of how TestFutexsleep is structured, it's actually fine if the futex sleep in TestFutexsleep.func1 never returns (at worst, I would expect the runtime test to time out if this happened, but I'm not sure if even that would happen). I'm also fairly certain we're not depending on the syscall itself to act as a memory barrier in this case, since the synchronization happens through atomic operations on p.status and p.runSafePointFn.

@aclements (Member)

I think the fact that TestGoroutineProfile is running is mostly irrelevant, since the actual GoroutineProfile call is just blocking on acquiring the worldsema. However, TestGoroutineProfile first increases GOMAXPROCS to 100 (while TestFutexsleep.func1 is still in the syscall path), which might be relevant.

@aclements (Member)

This command line reproduces this within a few seconds for me:
GOTRACEBACK=crash stress ./runtime.test -test.run 'TestFutexsleep|TestGoroutineProfile'

@RLH

@aclements (Member)

We've found one cause of this: holding worldsema is not enough to prevent gomaxprocs from changing because we always release worldsema before calling starttheworld. Swapping the release to be after the starttheworld in all places dramatically reduces the probability of this failing. It doesn't eliminate it, though.

I'm almost positive this has the same root cause as the "fatal error: forEachP: not stopped" failures we've also been seeing. What appears to be happening is that our stopwait accounting is sometimes off by one. When this happens, the second to last P to run the safe point function decrements stopwait to 0 and wakes up forEachP. Depending on when the actual last P runs the safe point function, we either see that stopwait is -1, or we see that the last P hasn't run its safe point function yet.

@aclements (Member)

We figured out the remaining problem once worldsema is released after starttheworld. I'm preparing a CL sequence (a cleanup of Rick's prototype fix in CL 10110).

Our dance involves three parties.

G1 is running on M1 on P1. P1 has an empty run queue.

G2/M2 is in a blocked syscall and has lost its P. (The details of this don't matter, it just needs to be in a position where it needs to grab an idle P.)

GC just started on G3/M3/P3. (These aren't very involved, they just have to be separate from the other G's, M's, and P's.)

  1. GC calls stoptheworld(), which sets sched.gcwaiting to 1.

G1/M1 begins to enter a syscall:
2. G1/M1 invokes reentersyscall, which sets P1's status to _Psyscall.
3. G1/M1's reentersyscall observes gcwaiting != 0 and calls entersyscall_gcwait.
4. G1/M1's entersyscall_gcwait blocks acquiring sched.lock.

Back on GC:
5. stoptheworld cas's P1's status to _Pgcstop, does other stuff, and returns.
6. GC does stuff and then calls starttheworld().
7. starttheworld() calls procresize(), which sets P1's status to _Pidle and puts P1 on the idle list.

G2/M2 returns from its syscall and takes over P1:
8. G2/M2 returns from its blocked syscall and pulls P1 off the idle list.
9. G2/M2 acquires P1, which sets P1's status to _Prunning.
10. G2/M2 starts a new syscall and invokes reentersyscall, which sets P1's status to _Psyscall.

Back on G1/M1:
11. G1/M1 finally acquires sched.lock in entersyscall_gcwait.

At this point, G1/M1 still thinks it's running on P1. P1's status is _Psyscall, which is consistent with what G1/M1 is doing, but it's _Psyscall because G2/M2 put it into _Psyscall, not G1/M1. This is basically an ABA race on P1's status.

The problem arises because forEachP currently shares stopwait with stoptheworld. G1/M1's entersyscall_gcwait observes the non-zero stopwait set by forEachP, but mistakes it for a stoptheworld. It cas's P1's status from _Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement stopwait one more time than forEachP was expecting.

The plan (prototyped in Rick's CL 10110) is to give forEachP its own stopwait and stopnote that aren't shared with stoptheworld.

Of course, all of this raises the question of how this worked before forEachP existed, since the ABA race on the P status can happen without forEachP. This took me a while to figure out. Here's a sketch:
12. G1/M1's entersyscall_gcwait sees that stopwait is 0 and returns. The syscall happens and returns. G1/M1 invokes exitsyscall, which invokes exitsyscallfast.
13. G1/M1's exitsyscallfast successfully cas's P1's status from _Psyscall to _Prunning (even though this _Psyscall status was set by G2/M2, not G1/M1).
14. G1 returns to executing user code, running on P1.

And, finally, back on G2/M2:
15. G2/M2 executes its system call and then calls exitsyscall, which invokes exitsyscallfast.
16. exitsyscallfast tries to cas P1's status from _Psyscall to _Prunning, but this fails because P1 is in _Prunning (because G1/M1 is using it).

In fact, if the timing is just right, a P could bounce back and forth between two M's this way without ever passing through the idle list again. But this would be okay because only one of them would be executing user code at any given moment.

@gopherbot (Contributor)

CL https://golang.org/cl/10156 mentions this issue.

@gopherbot (Contributor)

CL https://golang.org/cl/10157 mentions this issue.

aclements added a commit that referenced this issue May 18, 2015
Currently, startTheWorld releases worldsema before starting the
world. Since startTheWorld can change gomaxprocs after allowing Ps to
run, this means that gomaxprocs can change while another P holds
worldsema.

Unfortunately, the garbage collector and forEachP assume that holding
worldsema protects against changes in gomaxprocs (which it *almost*
does). In particular, this is causing somewhat frequent "P did not run
fn" crashes in forEachP in the runtime tests because gomaxprocs is
changing between the several loops that forEachP does over all the Ps.

Fix this by only releasing worldsema after the world is started.

This relates to issue #10618. forEachP still fails under stress
testing, but much less frequently.

Change-Id: I085d627b70cca9ebe9af28fe73b9872f1bb224ff
Reviewed-on: https://go-review.googlesource.com/10156
Reviewed-by: Russ Cox <rsc@golang.org>
@golang golang locked and limited conversation to collaborators Jun 25, 2016