runtime: broken work conservation due to CL 310850 #45867

Closed

bcmills opened this issue Apr 30, 2021 · 41 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bcmills
Contributor

bcmills commented Apr 30, 2021

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
SIGQUIT: quit
PC=0x4747c1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
	/workdir/go/src/runtime/sys_linux_amd64.s:519 +0x21 fp=0x7ffffcc5a8e8 sp=0x7ffffcc5a8e0 pc=0x4747c1
runtime.futexsleep(0xc000024000, 0x7f6ae0, 0xc000036800)
	/workdir/go/src/runtime/os_linux.go:44 +0x36 fp=0x7ffffcc5a938 sp=0x7ffffcc5a8e8 pc=0x435776
runtime.notesleep(0x7f6c30)
	/workdir/go/src/runtime/lock_futex.go:160 +0x87 fp=0x7ffffcc5a970 sp=0x7ffffcc5a938 pc=0x40da47
runtime.mPark()
	/workdir/go/src/runtime/proc.go:1441 +0x2a fp=0x7ffffcc5a990 sp=0x7ffffcc5a970 pc=0x43ef2a
runtime.stoplockedm()
	/workdir/go/src/runtime/proc.go:2600 +0x65 fp=0x7ffffcc5a9e0 sp=0x7ffffcc5a990 pc=0x440b85
runtime.schedule()
	/workdir/go/src/runtime/proc.go:3287 +0x3d fp=0x7ffffcc5aa30 sp=0x7ffffcc5a9e0 pc=0x44267d
runtime.park_m(0xc0000001a0)
	/workdir/go/src/runtime/proc.go:3504 +0x14d fp=0x7ffffcc5aa60 sp=0x7ffffcc5aa30 pc=0x442dcd
runtime.mcall()
	/workdir/go/src/runtime/asm_amd64.s:307 +0x43 fp=0x7ffffcc5aa70 sp=0x7ffffcc5aa60 pc=0x470783

…

goroutine 23533 [running]:
	goroutine running on other thread; stack unavailable
created by runtime_test.testGoroutineParallelism2
	/workdir/go/src/runtime/proc_test.go:174 +0x325

goroutine 23532 [runnable]:
runtime_test.testGoroutineParallelism2.func1()
	/workdir/go/src/runtime/proc_test.go:174 fp=0xc00016cfe0 sp=0xc00016cfd8 pc=0x5aa140
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00016cfe8 sp=0xc00016cfe0 pc=0x4729c1
created by runtime_test.testGoroutineParallelism2
	/workdir/go/src/runtime/proc_test.go:174 +0x325

goroutine 23531 [running]:
	goroutine running on other thread; stack unavailable
created by runtime_test.testGoroutineParallelism2
	/workdir/go/src/runtime/proc_test.go:174 +0x325

goroutine 23511 [runnable]:
runtime.gopark(0xc0000623b8, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc000474ee8 sp=0xc000474ec8 pc=0x43c5d6
runtime.chansend(0x272, 0x657a90, 0x1, 0x0)
	/workdir/go/src/runtime/chan.go:257 +0x435 fp=0xc000474f78 sp=0xc000474ee8 pc=0x407435
runtime.chansend1(0x0, 0x0)
	/workdir/go/src/runtime/chan.go:143 +0x1d fp=0xc000474fa8 sp=0xc000474f78 pc=0x406fdd
runtime_test.testGoroutineParallelism2.func2.1(0xce)
	/workdir/go/src/runtime/proc_test.go:212 +0x5a fp=0xc000474fc8 sp=0xc000474fa8 pc=0x5aa0ba
runtime_test.testGoroutineParallelism2.func2·dwrap·87()
	/workdir/go/src/runtime/proc_test.go:213 +0x2d fp=0xc000474fe0 sp=0xc000474fc8 pc=0x5aa02d
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000474fe8 sp=0xc000474fe0 pc=0x4729c1
created by runtime_test.testGoroutineParallelism2.func2
	/workdir/go/src/runtime/proc_test.go:205 +0x46

goroutine 23505 [chan receive]:
runtime.gopark(0x40f25b, 0xc0003680c0, 0x20, 0x0, 0x7fd204238c40)
	/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc0000afd20 sp=0xc0000afd00 pc=0x43c5d6
runtime.chanrecv(0x0, 0x0, 0x1)
	/workdir/go/src/runtime/chan.go:576 +0x54c fp=0xc0000afdb8 sp=0xc0000afd20 pc=0x4082cc
runtime.chanrecv1(0xffffffff, 0xc000051120)
	/workdir/go/src/runtime/chan.go:439 +0x18 fp=0xc0000afde0 sp=0xc0000afdb8 pc=0x407d18
runtime_test.testGoroutineParallelism2(0xbf0b36aee8, 0x1, 0x0)
	/workdir/go/src/runtime/proc_test.go:183 +0x20c fp=0xc0000aff48 sp=0xc0000afde0 pc=0x5a9b0c
runtime_test.TestGoroutineParallelism2(0x406fdd)
	/workdir/go/src/runtime/proc_test.go:146 +0x25 fp=0xc0000aff70 sp=0xc0000aff48 pc=0x5a98a5
testing.tRunner(0xc0002201a0, 0x6593d0)
	/workdir/go/src/testing/testing.go:1242 +0x102 fp=0xc0000affc0 sp=0xc0000aff70 pc=0x4dff02
testing.(*T).Run·dwrap·21()
	/workdir/go/src/testing/testing.go:1289 +0x2a fp=0xc0000affe0 sp=0xc0000affc0 pc=0x4e0c4a
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000affe8 sp=0xc0000affe0 pc=0x4729c1
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1289 +0x365

…
*** Test killed with quit: ran too long (6m0s).
FAIL	runtime	360.133s
FAIL

2021-04-30T00:32:42-303b194/linux-amd64-noopt
2020-06-26T17:07:58-d1015f3/freebsd-386-11_2

Also seen on a recent TryBot:
https://storage.googleapis.com/go-build-log/eb7ebec0/freebsd-amd64-12_2_913d0c87.log

CC @jeremyfaller @aclements

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 30, 2021
@bcmills bcmills added this to the Backlog milestone Apr 30, 2021
@bcmills
Contributor Author

bcmills commented Nov 29, 2021

greplogs --dashboard -md -l -e '(?m)(?:SIGQUIT|test timed out).*\n(?:.*\n)*^runtime\.chansend1.*\n\t.*\nruntime_test\.testGoroutineParallelism2' --since=2021-11-03

2021-11-17T04:31:29-fceca2c/freebsd-arm64-dmgk
2021-11-15T18:02:28-1dc9af5/freebsd-arm64-dmgk
2021-11-12T18:48:59-5d24203/freebsd-arm64-dmgk
2021-11-09T22:11:33-4aa0746/freebsd-arm64-dmgk

@bcmills
Contributor Author

bcmills commented Nov 29, 2021

Marking as release-blocker for Go 1.18 due to the high failure rate.

(Note that at least one failure within the current cycle is on a linux/386 builder, which tests a first-class port.)

@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker labels Nov 29, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Nov 29, 2021
@mknyszek
Contributor

mknyszek commented Dec 6, 2021

Looking at the test, and specifically the comment

 162     // If runtime triggers a forced GC during this test then it will deadlock,
 163     // since the goroutines can't be stopped/preempted.
 164     // Disable GC for this test (see issue #10958).
 165     defer debug.SetGCPercent(debug.SetGCPercent(-1))

I suspect that what's wrong here is that a GC is already in progress, and this is fallout from the new minimum heap size making that more likely. This is probably already mitigated by https://go.dev/cl/368137, but a failure here is still theoretically possible. Indeed, the last failure was on Dec 2nd (going by the greplogs in #45867 (comment)), which is when I landed that CL, supporting that theory.

I'll send a CL to fix this for good.
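
For illustration, here is roughly the shape of the precaution CL 369747 adds (a paraphrase, not the actual test code; the function name is made up): run a full GC cycle before disabling GC, so no concurrent mark phase can still be in flight when the test's non-preemptible goroutines start spinning.

    package runtime_test

    import (
        "runtime"
        "runtime/debug"
        "testing"
    )

    // Sketch of the precaution from CL 369747 (illustrative; name is hypothetical).
    func testWithGCDisabled(t *testing.T) {
        // Finish any GC cycle another test may have left in progress.
        runtime.GC()
        // Disable GC for this test (see issue #10958) and restore on exit.
        defer debug.SetGCPercent(debug.SetGCPercent(-1))
        // ... test body with non-preemptible spinning goroutines ...
    }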

@mknyszek mknyszek removed release-blocker okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 labels Dec 6, 2021
@gopherbot

Change https://golang.org/cl/369747 mentions this issue: runtime: call runtime.GC in several tests that disable GC

@mknyszek mknyszek added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker labels Dec 6, 2021
@mknyszek mknyszek closed this as completed Dec 6, 2021
gopherbot pushed a commit that referenced this issue Dec 6, 2021
These tests disable GC because of the potential for a deadlock, but
don't consider that a GC could be in progress due to other tests. The
likelihood of this case was increased when the minimum heap size was
lowered during the Go 1.18 cycle. The issue was then mitigated by
CL 368137 but in theory is always a problem.

This change is intended specifically for #45867, but I just walked over
a whole bunch of other tests that don't take this precaution where it
seems like it could be relevant (some tests it's not, like the
UserForcedGC test, or testprogs where no other code has run before it).

Fixes #45867.

Change-Id: I6a1b4ae73e05cab5a0b2d2cce14126bd13be0ba5
Reviewed-on: https://go-review.googlesource.com/c/go/+/369747
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: David Chase <drchase@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@mknyszek
Contributor

mknyszek commented Dec 7, 2021

Update on the root cause here: the fixes for #49680, #49695, and #45867 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does. The cause of those failures is that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code that has non-preemptible goroutines that create the potential for a deadlock. In short, the fix was correct, but the reasoning was incorrect.

@gopherbot

Change https://golang.org/cl/369815 mentions this issue: runtime: fix comments on the behavior of SetGCPercent

gopherbot pushed a commit that referenced this issue Dec 7, 2021
Fixes for #49680, #49695, #45867, and #49370 all assumed that
SetGCPercent(-1) doesn't block until the GC's mark phase is done, but
it actually does. The cause of 3 of those 4 failures comes from the fact
that at the beginning of the sweep phase, the GC does try to preempt
every P once, and this may run concurrently with test code. In the
fourth case, the issue was likely that only *one* of the debug_test.go
tests was missing a call to SetGCPercent(-1). Just to be safe, leave a
TODO there for now to remove the extraneous runtime.GC calls, but leave
the calls in.

Updates #49680, #49695, #45867, and #49370.

Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/369815
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Contributor Author

bcmills commented Dec 8, 2021

Unfortunately this symptom is still present: https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 occurred after CL 369747.

@bcmills bcmills reopened this Dec 8, 2021
@bcmills
Contributor Author

bcmills commented Dec 8, 2021

Hmm. The test still deadlocked, but runtime.chansend1 isn't present in that stack dump at all. 🤔

Perhaps at least the failure mechanism changed!

@bcmills
Contributor Author

bcmills commented Dec 8, 2021

This regexp matches the one new failure observed since that change.

greplogs --dashboard -md -l -e '^goroutine \d+ \[chan receive.*\]:\n(?:.+\n\t.+\n)*runtime\.chanrecv1.*\n\t.*\nruntime_test\.testGoroutineParallelism2' --since=2021-12-05

2021-12-07T21:16:00-daf9018/linux-ppc64-buildlet

@bcmills
Contributor Author

bcmills commented Dec 9, 2021

Two makes it a pattern, right? 😅

greplogs --dashboard -md -l -e '^goroutine \d+ \[chan receive.*\]:\n(?:.+\n\t.+\n)*runtime\.chanrecv1.*\n\t.*\nruntime_test\.testGoroutineParallelism2' --since=2021-12-05

2021-12-08T17:55:13-c1c303f/freebsd-arm64-dmgk
2021-12-07T21:16:00-daf9018/linux-ppc64-buildlet

@mknyszek
Contributor

mknyszek commented Dec 9, 2021

I'm having a really hard time reproducing this on ppc64. I'm spinning it in a loop, and after about 60 mins in 1 gomote I've got nothing. I may scale this up.

Looking more closely at the failure we do have, it looks like a problem in the scheduler! Note that in https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 we're currently in the if block starting at line 176 in proc_test.go:

        if load {
            // Create P goroutines and wait until they all run.
            // When we run the actual test below, worker threads
            // running the goroutines will start parking.
            done := make(chan bool)
            x := uint32(0)
            for p := 0; p < P; p++ {
                go func() { // <-- Point A
                    if atomic.AddUint32(&x, 1) == uint32(P) {
                        done <- true
                        return
                    }
                    for atomic.LoadUint32(&x) != uint32(P) { // <-- Point B
                    }
                }()
            }
            <-done // <-- Point C
        }

The main goroutine (177239) is blocked on the done channel at point C, 3 of 4 goroutines (177299, 177300, 177302) created at line 182 (point A) are spinning at line 187 (I assume, stack unavailable) at point B, and the last one (177301) appears to be stopped at line 182, at point A. Notably, this last one is runnable. It doesn't appear to have been scheduled yet at all.

If that goroutine (177301) were to run, the test would get unstuck. But the question is... why is it not running? What scheduler list is it sitting in for 10 whole minutes such that nothing can see it? Sure, 3 of 4 Ps are taken up by these non-preemptible goroutines that are actively spinning, but 1 P should be free to run code. There are no other running goroutines that appear to be taking up the 4th P. This suggests that what we have here is a missing wakeup somewhere.

I don't see any major changes that landed in the scheduler this cycle. I do see two potential culprit CLs in Gerrit history (my search methods were crude, so perhaps there are more). There's https://golang.org/cl/310850 from May, but if that was the culprit we'd be seeing this go back much further in time to the 1.17 cycle. Then, the only other CL I see is https://golang.org/cl/317509 from August. If I could reproduce, I'd try reverting either of these, but so far no dice.

(Side note: taking a more fundamental approach to the code... I'm also curious why this test can deadlock at all (even before I added the runtime.GC). In theory asynchronous preemption should take care of it?)

@mknyszek
Contributor

mknyszek commented Dec 9, 2021

FWIW I think if we had a core dump, then the answer would be readily clear. We could poke at the scheduler state, see where the unscheduled goroutine is sitting, and make an inference given how few changes to the scheduler happened this cycle.

Looking back further into the past, looks like we have quite a few of these failures, mainly on linux-ppc64 and freebsd-arm64.

greplogs --dashboard -md -l -e '^goroutine \d+ \[chan receive.*\]:\n(?:.+\n\t.+\n)*runtime\.chanrecv1.*\n\t.*\nruntime_test\.testGoroutineParallelism2'

2021-12-07T21:16:00-daf9018/linux-ppc64-buildlet
2021-12-02T20:45:47-3c6295d/freebsd-arm64-dmgk
2021-11-29T16:08:23-a59ab29/freebsd-arm64-dmgk
2021-11-19T18:35:56-ac0da79/freebsd-arm64-dmgk
2021-11-17T04:31:29-fceca2c/freebsd-arm64-dmgk
2021-11-16T05:36:07-8656895/linux-ppc64-buildlet
2021-11-15T18:02:28-1dc9af5/freebsd-arm64-dmgk
2021-11-13T01:39:09-e658c42/freebsd-arm64-dmgk
2021-11-12T18:48:59-5d24203/freebsd-arm64-dmgk
2021-11-09T22:14:19-cb908f1/freebsd-arm64-dmgk
2021-11-09T22:11:33-4aa0746/freebsd-arm64-dmgk
2021-11-03T20:30:17-7f2463c/freebsd-arm64-dmgk
2021-11-02T03:55:19-6f1e9a9/freebsd-arm64-dmgk
2021-11-01T13:12:37-4056934/freebsd-arm64-dmgk
2021-10-25T20:39:31-0ae0d5c/freebsd-arm64-dmgk
2021-10-25T18:59:26-8c94aa4/freebsd-386-12_2
2021-10-25T12:07:15-fcd2d9c/freebsd-arm64-dmgk
2021-10-22T00:56:18-23e57e5/freebsd-arm64-dmgk
2021-10-07T20:02:29-0f52292/freebsd-arm64-dmgk
2021-10-01T23:51:16-c129af9/linux-amd64-buster
2021-10-01T04:32:04-8ac5cbe/freebsd-arm64-dmgk
2021-09-30T19:56:27-70b1a45/linux-amd64-fedora
2021-09-30T15:20:44-7162c4c/freebsd-arm64-dmgk
2021-09-28T17:19:19-3a55597/freebsd-arm64-dmgk
2021-09-22T18:44:36-7a03ca6/freebsd-386-12_2
2021-09-22T18:32:21-ccfc41e/freebsd-arm64-dmgk
2021-09-16T17:44:44-b1bedc0/freebsd-arm64-dmgk
2021-09-15T22:53:17-59a9a03/freebsd-arm64-dmgk
2021-09-15T00:06:54-c7f2f51/freebsd-arm64-dmgk
2021-09-09T15:07:50-a295b3c/freebsd-arm64-dmgk
2021-09-02T19:36:22-a8aa6cf/linux-amd64-wsl
2021-09-02T14:44:50-014a972/freebsd-arm64-dmgk
2021-09-02T09:24:59-17e9d14/linux-386-sid
2021-08-30T20:22:38-5f0d821/freebsd-arm64-dmgk
2021-08-28T17:05:43-6df3aac/freebsd-arm64-dmgk
2021-08-28T16:58:26-5afa555/freebsd-arm64-dmgk
2021-08-25T18:27:41-d2f002c/freebsd-arm64-dmgk
2021-08-25T01:57:42-de1c934/freebsd-arm64-dmgk
2021-08-24T22:10:11-5d863f8/freebsd-arm64-dmgk
2021-08-19T09:11:02-3bdc179/linux-amd64
2021-08-16T18:44:57-631af58/freebsd-arm64-dmgk

Some of these failures don't look exactly like the same failure mode, and
2021-10-01T23:51:16-c129af9/linux-amd64-buster
looks especially familiar (linux/amd64!).

The failures before this look different. They fail in a different part of the code, and they're also much rarer. Also, the first failure in the list above that looks really close to the ones reported above is

2021-08-25T01:57:42-de1c934/freebsd-arm64-dmgk

I'll poke around on what happened that day.

@prattmic
Member

I've got a couple of core files now. (Including one with the same symptoms in TestGoroutineParallelism rather than TestGoroutineParallelism2.)

What I am seeing is that the runnable goroutines are on local runqueues, but there is a P that is unexpectedly idle. e.g.,

(gdb) info goroutines
...
* 85461 allgs[215] 0xc000ad1040 running  runtime_test.testGoroutineParallelism2.func2.2
...
* 85438 allgs[291] 0xc000545ba0 running  runtime_test.testGoroutineParallelism2.func2.2
...
* 85437 allgs[326] 0xc0006016c0 running  runtime_test.testGoroutineParallelism2.func2.2
...
  85460 allgs[391] 0xc0007cba00 runnable runtime_test.testGoroutineParallelism2.func2.2

Goroutine 85460 is on P 1's run queue:

(gdb) p 'runtime.allp'.array[1]->runqhead % 256
$4 = 157
(gdb) p 'runtime.allp'.array[1]->runqtail % 256
$5 = 158                                              
(gdb) p /x 'runtime.allp'.array[1]->runq[157]
$7 = 0xc0007cba00
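
(The % 256 above is because each P's local run queue is a fixed-size ring buffer; roughly, paraphrasing the relevant fields of the runtime's p struct:)

    // Paraphrased from the p struct in runtime/runtime2.go:
    type p struct {
        // ...
        runqhead uint32        // consumer index into runq
        runqtail uint32        // producer index into runq
        runq     [256]guintptr // local run queue (ring buffer)
        runnext  guintptr      // G to run next, if any
        // ...
    }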

P 3 is not running:

(gdb) p $p0->status
$8 = 1
(gdb) p $p1->status
$9 = 1
(gdb) p $p2->status
$10 = 1
(gdb) p $p3->status
$11 = 0
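
(For reference, the status values printed above correspond to the P status constants in the runtime; paraphrasing runtime/runtime2.go, status 1 is _Prunning and status 0 is _Pidle:)

    // P status constants (paraphrased from runtime/runtime2.go):
    const (
        _Pidle    = iota // 0: P is not running user code or the scheduler
        _Prunning        // 1: P is owned by an M and running user code or the scheduler
        _Psyscall        // 2: P's G is in a syscall
        _Pgcstop         // 3: P is halted for the GC
        _Pdead           // 4: P is unused
    )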

P 3 not running is a bug, and likely the root cause of the deadlock. We have available work and an idle P, so we should have woken a P to run this work. This is where I will focus investigation.

That said, in theory async preemption should still preempt the running goroutines and allow the runnable goroutine to execute. It is possible that the tight loops those goroutines are in are not async-safe points, in which case we would indeed deadlock here. I'm not sure if this is the case or not. The loop [1] seems like it should be preemptible to me, but I haven't dug into the PCDATA.

[1]

   0x00000000005b76a4 <+100>:   mov    (%rcx),%r9d
=> 0x00000000005b76a7 <+103>:   cmp    %edx,%r9d
   0x00000000005b76aa <+106>:   jne    0x5b76a4 <runtime_test.testGoroutineParallelism2.func2.1+100>

@cherrymui
Member

Yeah, that loop is async preemptible.

@prattmic
Member

End-of-evening summary:

I've added decent tracing, here's a simple case:

[5.830560975 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | G 0xc00007cb60 -> runnable (newproc)
[5.830561840 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | wakep -> no idle Ps
[5.830562906 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | G 0xc000071520 -> runnable (newproc)
[5.830563706 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | wakep -> no idle Ps
[5.830565129 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | G 0xc0000704e0 -> runnable (newproc)
[5.830565751 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | wakep -> no idle Ps
[5.830566648 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | G 0xc000070680 -> runnable (newproc)
[5.830567450 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | wakep -> no idle Ps
[5.830568447 P 3] | M 8 0xc000081200 | P 3 0xc000030f00 | G 0xc000070680 -> running
[5.832034229 P 2] | M 7 0xc000080d80 | P 2 0xc00002ea00 | wakep -> no idle Ps
[5.832035726 P 2] | M 7 0xc000080d80 | P 2 0xc00002ea00 | G 0xc000071520 -> running
[5.834019009 P -1] | M 10 0xc000041200 | P nil | unpark next P 0 0xc00002a000
[5.834024988 P 0] | M 10 0xc000041200 | P 0 0xc00002a000 | spinning
[5.834027052 P 0] | M 10 0xc000041200 | P 0 0xc00002a000 | wakep -> no idle Ps
[5.834027830 P 0] | M 10 0xc000041200 | P 0 0xc00002a000 | G 0xc00007cb60 -> running
[5.844149048 P -1] | M 4 0xc000080000 | P nil | pidleput P 1 0xc00002c500
[5.844153811 P -1] | M 4 0xc000080000 | P nil | netpoll waitms -1
>> begin log 0 <<
[300.111891685 P -1] | M 0 0x7e88a0 | P nil | dying

G 0xc0000704e0 is the goroutine that is stuck in runnable. Based on my understanding, M 4 at the end should have been woken from its netpoll via netpollBreak to handle the remaining work after M 10 grabbed some work; for some reason this didn't happen. Alternatively, M 10 should have started another spinning M.

@prattmic
Member

I'm at end-of-day fried brain status, but this feels like a bug in the wakep conditions.

When a spinning thread finds work, it must wake another spinning thread to compensate for new work submission (see long comment at the top of proc.go). This is the job of resetspinning(). However, it calls wakep, which does nothing if there are no idle Ps.

But one of these Ps may be held by a non-spinning M in the final steps of releasing the P. In such a case, the non-spinning M will not discover the remaining work and will block in netpoll. Thus we lose work conservation. Normally we could still make progress [1] because the other running Ps would eventually enter the scheduler and discover the work, but these tight loops are never preempted (even with async preemption, for unclear reasons).

[1] That said, I don't see a fundamental reason why all other Ps couldn't be in the final steps of releasing the P, in which case we would completely deadlock. (In particular because the spinning check doesn't use CAS.)
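
For context, a rough paraphrase of the wakep fast path being described (illustrative only; sched, startm, and the atomics are runtime internals, and this is not the exact source):

    // Rough paraphrase of wakep's behavior at the time (not the actual source).
    func wakep() {
        // No idle P visible yet, so do nothing -- even though a non-spinning M
        // may be just about to release its P and block in netpoll.
        if atomic.Load(&sched.npidle) == 0 {
            return
        }
        // Be conservative about spinning threads: only one wakeup at a time.
        if atomic.Load(&sched.nmspinning) != 0 || !atomic.Cas(&sched.nmspinning, 0, 1) {
            return
        }
        startm(nil, true) // start an M in spinning mode to find the work
    }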

@prattmic
Member

To summarize, what I see in the trace above is:

  • M 10 is a spinning M. It finds some work and calls resetspinning -> wakep which does not wake a new M because there are no idle Ps.
  • Meanwhile, M 4 is a non-spinning M which has failed to find work. It drops P 1 (after the wakep above) and goes directly into blocking netpoll.
  • We now have an idle P and runnable work and no spinning M to take the P and work.

@prattmic
Member

@aclements pointed out that "[M 4] drops P 1 ... and goes directly into blocking netpoll." is new behavior in https://go.dev/cl/310850, which was submitted just before the first one of these test failures.

@prattmic
Member

#50749 explains why async preemption did not kick in to allow forward progress. Disabling TestFutexsleep makes this failure disappear. But we still have a work conservation bug here.

A few ideas I've brainstormed:

  1. Roll back https://go.dev/cl/310850.
  2. resetspinning starts a new spinning M if there are no other spinning Ms, even if no Ps are available. This would get a new spinning M up to find the work and allow the non-spinning M to drop its P and sleep.
  • I don't love this because in the normal case there are no Ps and won't be any time soon, so we're just creating extra churn.
  3. Optimize this slightly to synchronize with non-spinning Ms dropping the P. i.e., resetspinning sets sched.needspinning if there are no other spinning Ms and no idle Ps. After the non-spinning M drops its P, if needspinning is set it takes the P back and becomes a spinning M. (A rough sketch follows below.)
  • Subpar primarily because yay! more complexity!
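
For illustration, a rough sketch of what option 3 could look like (this only illustrates the idea as stated above, not the actual fix in CL 389014; the helpers dropPAndBlock and becomeSpinning are made up, while releasep, acquirep, and pidleput are existing runtime internals):

    // Illustrative sketch of option 3 (not the actual implementation).
    func wakep() {
        if atomic.Load(&sched.npidle) == 0 {
            if atomic.Load(&sched.nmspinning) == 0 {
                // There is work, no idle P to hand it to, and nobody spinning
                // to find it: ask the next M that releases a P to keep it and spin.
                atomic.Store(&sched.needspinning, 1)
            }
            return
        }
        // ... existing path: wake a spinning M for an idle P ...
    }

    // In the non-spinning M, just before blocking (made-up helper name).
    func dropPAndBlock(pp *p) {
        releasep()
        if atomic.Cas(&sched.needspinning, 1, 0) {
            // Work was published while we were on our way out: take the P back
            // and become a spinning M instead of blocking.
            acquirep(pp)
            becomeSpinning() // made-up helper
            return
        }
        pidleput(pp)
        // ... block in netpoll / notesleep ...
    }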

@prattmic prattmic self-assigned this Jan 21, 2022
@prattmic
Member

FWIW, explicitly setting GODEBUG=asyncpreemptoff=1 rather than depending on the TestFutexsleep race increases my repro rate from ~5/10k to ~25/10k runs.

@cherrymui
Member

Interesting. That explains why I never reproduced the failure when I tried to run that test in isolation...

@aclements
Member

@prattmic, CL 380058 fixes the test problem, right? If it's just the work conservation problem remaining, I think we should punt this to 1.19 because that's only a performance issue that requires a pretty specific and subtle setup to trigger.

@prattmic
Member

prattmic commented Feb 2, 2022

CL 380058 fixes the test problem, right?

Correct, the test should no longer fail.

I think we should punt this to 1.19 because that's only a performance issue that requires a pretty specific and subtle setup to trigger.

To be clear, this isn't quite "only a performance issue". The failing tests were deadlocked and would never recover. You are correct that "requires a pretty specific and subtle setup to trigger."

I think dropping this as a release blocker is OK, as this issue is new in 1.17, not 1.18. We may still want to backport the eventual fix to both 1.17 and 1.18.

@prattmic prattmic added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 18, 2022
@prattmic prattmic changed the title runtime: timeouts in TestGoroutineParallelism2 runtime: broken work conservation due to CL 310850 Feb 18, 2022
@gopherbot

Change https://go.dev/cl/389014 mentions this issue: runtime: synchronize P wakeup and dropping Ps

@heschi heschi modified the milestones: Go1.18, Go1.19 Mar 15, 2022
@ianlancetaylor
Contributor

@prattmic This issue is in the 1.19 milestone. I'm not clear on the status. Should this be 1.20? Backlog? Closed? Thanks.

@prattmic prattmic modified the milestones: Go1.19, Go1.20 Jun 27, 2022
@prattmic prattmic added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Aug 1, 2022
@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 11, 2022
@golang golang locked and limited conversation to collaborators Aug 12, 2023