-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: forEachP: P did not run fn #10618
Comments
At this point, we've seen this failure on the following commits and builders: 2015-04-29T04:48:55-73e791a/dragonfly-amd64 On all of these, the stack has been essentially the same. It's always in the runtime test, and it's always executing both TestFutexsleep.func1 and TestGoroutineProfile. TestFutexsleep itself is not running: it created two goroutines and woke them up through a futex, but they haven't been scheduled to let them exit yet. Here's the exact stack from the most recent failure:
|
Is this perhaps relevant? On Thu, May 14, 2015 at 11:45 AM, Austin Clements notifications@github.com
|
Interesting, but I don't think so. Because of how TestFutexsleep is structured, it's actually fine if the futex sleep in TestFutexsleep.func1 never returns (at worst, I would expect the runtime test to time out if this happened, but I'm not sure if even that would happen). I'm also fairly certain we're not depending on the syscall itself to act as a memory barrier in this case, since the synchronization happens through atomic operations on p.status and p.runSafePointFn. |
I think the fact that TestGoroutineProfile is running is mostly irrelevant, since the actual GorountineProfile call is just blocking on acquiring the worldsema. However, TestGoroutineProfile first increases GOMAXPROCS to 100 (while TestFutexsleep.func1 is still in the syscall path), which might be relevant. |
This command line reproduces this within a few seconds for me: |
We've found one cause of this: holding worldsema is not enough to prevent gomaxprocs from changing because we always release worldsema before calling starttheworld. Swapping the release to be after the starttheworld in all places dramatically reduces the probability of this failing. It doesn't eliminate it, though. I'm almost positive this has the same root cause as the "fatal error: forEachP: not stopped" failures we've also been seeing. What appears to be happening is that our stopwait accounting is sometimes off by one. When this happens, the second to last P to run the safe point function decrements stopwait to 0 and wakes up forEachP. Depending on when the actual last P runs the safe point function, we either see that stopwait is -1, or we see that the last P hasn't run its safe point function yet. |
We figured out the remaining problem once worldsema is released after starttheworld. I'm preparing a CL sequence (a cleanup of Rick's prototype fix in CL 10110). Our dance involves three parties. G1 is running on M1 on P1. P1 has an empty run queue. G2/M2 is in a blocked syscall and has lost its P. (The details of this don't matter, it just needs to be in a position where it needs to grab an idle P.) GC just started on G3/M3/P3. (These aren't very involved, they just have to be separate from the other G's, M's, and P's.)
G1/M1 begins to enter a syscall: Back on GC: G2/M2 returns from its syscall and takes over P1: Back on G1/M1: At this point, G1/M1 still thinks it's running on P1. P1's status is _Psyscall, which is consistent with what G1/M1 is doing, but it's _Psyscall because G2/M2 put it in to _Psyscall, not G1/M1. This is basically an ABA race on P1's status. The problem arises because forEachP currently shares stopwait with stoptheworld. G1/M1's entersyscall_gcwait observes the non-zero stopwait set by forEachP, but mistakes it for a stoptheworld. It cas's P1's status from _Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement stopwait one more time than forEachP was expecting. The plan (prototyped in Rick's CL 10110) is to give forEachP its own stopwait and stopnote that aren't shared with stoptheworld. Of course, all of this raises the question of how this worked before forEachP existed, since the ABA race on the P status can happen without forEachP. This took me a while to figure out. Here's a sketch: And, finally, back on G2/M2: In fact, if the timing is just right, a P could bounce back and forth between two M's this way without ever passing through the idle list again. But this would be okay because only one of them would be executing user code at any given moment. |
CL https://golang.org/cl/10156 mentions this issue. |
CL https://golang.org/cl/10157 mentions this issue. |
Currently, startTheWorld releases worldsema before starting the world. Since startTheWorld can change gomaxprocs after allowing Ps to run, this means that gomaxprocs can change while another P holds worldsema. Unfortunately, the garbage collector and forEachP assume that holding worldsema protects against changes in gomaxprocs (which it *almost* does). In particular, this is causing somewhat frequent "P did not run fn" crashes in forEachP in the runtime tests because gomaxprocs is changing between the several loops that forEachP does over all the Ps. Fix this by only releasing worldsema after the world is started. This relates to issue #10618. forEachP still fails under stress testing, but much less frequently. Change-Id: I085d627b70cca9ebe9af28fe73b9872f1bb224ff Reviewed-on: https://go-review.googlesource.com/10156 Reviewed-by: Russ Cox <rsc@golang.org>
Seen on http://build.golang.org/log/fab14a23af2926e4f67dc55c4f859f0efa04159c for dragonfly at 73e791a
The text was updated successfully, but these errors were encountered: