runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

bcmills · 2024-01-19T21:39:46Z

#!watchflakes
post <- goarch == "wasm" && pkg == "golang.org/x/net/http2" && test == "TestServer_Push_RejectAfterGoAway" && `timeout`

Go version

go version devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000 linux/amd64

Output of `go env` in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/usr/local/google/home/bcmills/.cache/go-build'
GOENV='/usr/local/google/home/bcmills/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/usr/local/google/home/bcmills/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/usr/local/google/home/bcmills'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/google/home/bcmills/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/google/home/bcmills/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000'
GCCGO='/usr/bin/gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/usr/local/google/home/bcmills/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1312466808=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Add the following test to the time package:

func TestAfterFuncStarvation(t *testing.T) {
	var (
		wg sync.WaitGroup
		stop atomic.Bool
		c = make(chan bool, 1)
	)

	wg.Add(2)
	go func() {
		for !stop.Load() {
			c <- true
		}
		close(c)
		wg.Done()
	}()
	go func() {
		for range c {}
		wg.Done()
	}()

	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	wg.Wait()
}

Then run it using:

GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000

What did you see happen?

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000
panic: test timed out after 10m0s
running tests:
        TestAfterFuncStarvation (9m56s)

goroutine 13835 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x178b040, {0x6a007, 0x13}, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x178b040)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x178b040, 0x1499c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a2a0, {0x37c140, 0x7f, 0x7f}, {0xc162d83908280698, 0x8bb3084da4, 0x37faa0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144e140)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 13833 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 13831 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0x1790198)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x1790190)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x178b1e0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x178b1e0, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 13832 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 13834 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    600.293s
FAIL

What did you expect to see?

The test completes 1000 iterations on wasip1/wasm in about 1.2 seconds on my machine, so 10000 iterations should only take ~12 seconds:

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=1000
ok      time    1.228s

Running native code on linux/amd64 is much faster still, but that's to be expected:

go test time -run=TestAfterFuncStarvation -count=10000
ok      time    0.587s

(attn @golang/wasm)

The text was updated successfully, but these errors were encountered:

bcmills · 2024-01-19T21:49:35Z

This also affects js/wasm: 100 iterations take <600ms, so 1000 iterations should take <6s (but times out after a full minute):

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=100
ok      time    0.519s

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=1000 -timeout=1m
panic: test timed out after 1m0s
running tests:
        TestAfterFuncStarvation (1m0s)

goroutine 2980 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x151fa00, {0x6e19f, 0x17}, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x151fa00)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x151fa00, 0x1493c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a360, {0x37fc60, 0x7f, 0x7f}, {0xc162d954b7abb580, 0xdf9e00901, 0x38b6e0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144c3c0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 6 [waiting]:
runtime.gopark(0x0, 0x0, 0x0, 0x0, 0x1)
        /usr/local/google/home/bcmills/go/src/runtime/proc.go:402 +0x25
runtime.handleEvent()
        /usr/local/google/home/bcmills/go/src/runtime/lock_js.go:296 +0x25
runtime.goexit({})
        /usr/local/google/home/bcmills/go/src/runtime/asm_wasm.s:434 +0x1

goroutine 2976 [semacquire]:
sync.runtime_Semacquire(0x140c188)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x140c180)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x151fba0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x151fba0, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 2977 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 2978 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 2979 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    60.358s
FAIL

prattmic · 2024-01-19T21:55:39Z

Does this also affect linux-amd64 if GOMAXPROCS=1?

bcmills · 2024-01-19T22:13:08Z

No, it does not: if I update the test to explicitly call runtime.GOMAXPROCS(1), it continues to pass after even 100000 runs.

gopherbot · 2024-01-19T22:14:26Z

Change https://go.dev/cl/557037 mentions this issue: net: work around runtime scheduler starvation on js and wasip1

gopherbot · 2024-01-19T22:14:27Z

Change https://go.dev/cl/557038 mentions this issue: time: add a regression test for time.AfterFunc goroutine starvation

prattmic · 2024-01-19T22:18:49Z

At a theoretical level, this doesn't surprise me, and I would expect it to impact linux-amd64 GOMAXPROCS=1.

This test always has at least one runnable goroutine (one of the channel send/recv goroutines). In fact, because the channel is buffered in theory both goroutines may be runnable all of the time.

When the timer fires, a third goroutine will become runnable [1].

We have one thread on which to run three goroutines, so there must be some scheduling delay. Go does not have a real-time scheduler, or even a best-effort deadline-based scheduler, so it can theoretically have unbounded scheduling delay.

Readying a goroutine typically puts it on the head of the run queue to get it to run soon. But since that applies to almost every ready, the timer firing could put goroutine 3 on the head of the run queue, and then goroutine 1 or 2 could replace the head before the scheduler runs again.

If you dump the /sched/latencies:seconds histogram from runtime/metrics, I'd expect to see a long tail of long scheduling delays.

There are certainly many improvements that could go into the scheduler to try to make this situation better (#51071 comes to mind).

The more interesting case would be if this is actually completely stuck. i.e., the timer expiration or ready goroutine was somehow lost altogether. Figuring that out by running a test is a bit of a halting problem situation, but looking at runtime/metrics is probably a good start. Do we have many cases of long scheduling delay (but not enough to cause a timeout)? Or do things seem totally fine until one case gets stuck forever?

[1] We always run timers before looking for other work in the scheduler, so the timer should always fire. But an AfterFunc timer just starts a goroutine to run the func, which is the third goroutine.

bcmills · 2024-01-19T22:29:47Z

I agree that the runtime is theoretically allowed to produce this behavior according to the language spec.

However, the addition of non-cooperative preemption (#24543) set up a strong expectation that every runnable goroutine will eventually be scheduled, and resolved a large number of previous scheduler fairness issues (linked from that proposal).

We also explicitly made mutex locking starvation-free in #13086. To me, that suggests that this sort of goroutine starvation is not intended behavior.

This sort of starvation shows up in realistic programs that use time.AfterFunc to interrupt two or more running goroutines, as might be the case in a benchmark or stress-test. (See, for example, #65177.) It is sometimes possible to work around that starvation by making explicit calls to runtime.Gosched (as in https://go.dev/cl/557037), but arguably that's exactly the sort of explicit scheduler management that #24543 was intended to address.

Fixes #65177. Updates #65178. Updates #64321. Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32 Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero Reviewed-on: https://go-review.googlesource.com/c/go/+/557037 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com>

prattmic · 2024-01-20T00:04:48Z

I think I poorly worded my previous message and came across too strongly. I don't think this is good behavior, I meant that there are two broad classes this problem may be in:

There is nothing fundamentally wrong, but the existing design is producing poor behavior.
Something is fundamentally broken, and this wakeup is lost/gone and no amount of waiting would resolve it.

bcmills · 2024-01-20T00:15:54Z

Ah, yeah. I think the evidence so far supports (1) — the goroutine dumps from the timed-out tests list the AfterFunc goroutine as runnable, it just isn't actually being run.

(The fact that it is listed as runnable, and the fact that adding an explicit runtime.Gosched in https://go.dev/cl/557037 seems to fix the starvation there, suggest that the goroutine is successfully being marked schedulable, it just isn't getting scheduled.)

Jorropo · 2024-01-22T05:32:17Z

I thought this could be working on amd64 because the goroutines are being non-cooperatively preempted by GC which could randomly restart execution. But trying out with GOGC=off GOMAXPROCS=1 doesn't let me reproduce.

Jorropo · 2024-01-22T05:37:50Z

Even more minified example reproduce for me on wasip1:

 func TestAfterFuncStarvation(t *testing.T) {
 	var (
 		wg sync.WaitGroup
 		stop atomic.Bool
-		c = make(chan bool, 1)
+		c = make(chan bool)
 	)
 
 	wg.Add(2)
 	go func() {
 		for !stop.Load() {
 			c <- true
 		}
 		close(c)
 		wg.Done()
 	}()
 	go func() {
 		for range c {}
 		wg.Done()
 	}()
 
 	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
 	wg.Wait()
 }

It wouldn't reliably fail with -count=1000 anymore however, but with -count=10000 it would (tested 10/10 times), I don't know what this could mean however.

Jorropo · 2024-01-22T05:41:49Z

Even more surprisingly this works on GOARCH=amd64 GOMAXPROCS=1 (even with GOGC=off):

func TestAfterFuncStarvation2(t *testing.T) {
	var stop atomic.Bool
	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	for !stop.Load() {}
}

But expectedly never works on wasip1.

Jorropo · 2024-01-22T05:56:49Z

I am able to repro on GOARCH=amd64 GOMAXPROCS=1 by adding && false this if clause:

go/src/runtime/proc.go

Line 170 in f19f31f

if GOARCH != "wasm" { // no threads on wasm yet, so no sysmon

-count=10000 works in 0.173s but -count=100000 timeout with equivalent stack traces.

This explains why wasm is so special here, other arches have a watchdog.

The test is skipped on wasm platforms for now, because it successfully detects a starvation bug on those platforms. For #65178. Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb Reviewed-on: https://go-review.googlesource.com/c/go/+/557038 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com>

prattmic · 2024-01-22T19:30:51Z

Thanks for digging @Jorropo! sysmon preempts long-running goroutines, so without that preemption, the tight stop.Load() loop won't ever yield. That said, even with sysmon, the right combinations of goroutines should still theoretically be able to trigger long wait times.

Still, the lack of sysmon on wasm makes me wonder if we should be more aggressive about self-preemption there.

Jorropo · 2024-01-22T19:39:51Z

I had the same question, I don't see how to solve the tight loop without the compiler aggressively injecting runtime.Gosched() in loops (maybe with a local counter to only yield after X iterations ? or based on monotonic time if this is cheap ?).

I find it interesting that the original example repro on amd64 after I disable sysmon.
I havn't looked at the scheduler code in a while, but if I remember there is a special spot in the worklist for a hot goroutine, could it be that the two goroutines keep exchanging themselves from the hot slot ?
I guess after every ~~4096 schedule (or whatever)~~ should be based on monotonic time, we should pick a random goroutine in our work queue instead of the hot one (or whatever is the first one in the queue).

aclements · 2024-01-22T19:54:05Z

I guess after every ~~4096 schedule (or whatever)~~ should be based on monotonic time, we should pick a random goroutine in our work queue instead of the hot one (or whatever is the first one in the queue).

On non-wasm platforms, we already do this using the "inheritTime" result from runqget. What's supposed to be happening is that goroutines can switch off in runnext (the "hot slot") but they share a single time slice, and will get preempted for another goroutine if that time slice runs out. But sysmon is responsible for time slice preemption. You're right that without that, things can be arbitrarily unfair.

gopherbot · 2024-01-22T21:30:09Z

Change https://go.dev/cl/557615 mentions this issue: http2: avoid goroutine starvation in TestServer_Push_RejectAfterGoAway

CL 557037 added a runtime.Gosched to prevent goroutine starvation in the wasm fake-net stack. Unfortunately, that Gosched causes the scheduler to enter a very similar starvation loop in this test. Add another runtime.Gosched to break this new loop. For golang/go#65178. Change-Id: I24b3f50dd728800462f71f27290b0d8f99d5ae5b Cq-Include-Trybots: luci.golang.try:x_net-gotip-wasip1-wasm_wasmtime,x_net-gotip-wasip1-wasm_wazero,x_net-gotip-js-wasm Reviewed-on: https://go-review.googlesource.com/c/net/+/557615 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>

gopherbot · 2024-01-22T23:07:17Z

Change https://go.dev/cl/557437 mentions this issue: runtime: disable use of runnext on wasm

prattmic · 2024-01-22T23:08:11Z

With https://go.dev/cl/557437, the original repro no longer hangs. This does not do anything to impact goroutines that simply never yield at all, like #65178 (comment).

bcmills · 2024-01-22T23:21:12Z

That sounds great! I think it's reasonable to only fix goroutines that actually yield for now, since that sort of live-lock is relatively much easier to diagnose. (For the single-goroutine case, the live-locked goroutine will actually be running in a goroutine dump, as opposed to “running some of the time”.)

Fixes golang#65177. Updates golang#65178. Updates golang#64321. Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32 Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero Reviewed-on: https://go-review.googlesource.com/c/go/+/557037 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com>

The test is skipped on wasm platforms for now, because it successfully detects a starvation bug on those platforms. For golang#65178. Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb Reviewed-on: https://go-review.googlesource.com/c/go/+/557038 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com>

When readying a goroutine, the scheduler typically places the readied goroutine in pp.runnext, which will typically be the next goroutine to run in the schedule. In order to prevent a set of ping-pong goroutines from simply switching back and forth via runnext and starving the rest of the run queue, a goroutine scheduled via runnext shares a time slice (pp.schedtick) with the previous goroutine. sysmon detects "long-running goroutines", which really means Ps using the same pp.schedtick for too long, and preempts them to allow the rest of the run queue to run. Thus this avoids starvation via runnext. However, wasm has no threads, and thus no sysmon. Without sysmon to preempt, the possibility for starvation returns. Avoid this by disabling runnext entirely on wasm. This means that readied goroutines always go on the end of the run queue and thus cannot starve via runnext. Note that this CL doesn't do anything about single long-running goroutines. Without sysmon to preempt them, a single goroutine that fails to yield will starve the run queue indefinitely. For golang#65178. Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero Change-Id: I7dffe1e72c6586474186b72f8068cff77b661eae Reviewed-on: https://go-review.googlesource.com/c/go/+/557437 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>

This reverts CL 557437. Reason for revert: Appears to have broken wasip1 builders. For golang#65178. Change-Id: I59c1a310eb56589c768536fe444c1efaf862f8b0 Reviewed-on: https://go-review.googlesource.com/c/go/+/559237 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>

bcmills · 2024-02-22T19:32:33Z

@gopherbot, please backport to Go 1.22. The additional tests enabled in https://go.dev/cl/526117 made it more likely for this bug to be exposed in test runs, which may cause flakiness in pre-release (and backport CL) testing.

gopherbot · 2024-02-22T19:33:03Z

Backport issue(s) opened: #65883 (for 1.22).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot · 2024-02-22T19:38:32Z

Change https://go.dev/cl/566175 mentions this issue: [release-branch.go1.22] net: work around runtime scheduler starvation on js and wasip1

This breaks an unbounded client-side retry loop if the server's timeout happens to fire during its final read of the TLS handshake. The retry loop was observed on wasm platforms at CL 557437. I was also able to reproduce chains of dozens of retries on my linux/amd64 workstation by adjusting some timeouts and adding a couple of sleeps, as in this patch: https://gist.github.com/bcmills/d0a0a57e5f64eebc24e8211d8ea502b3 However, on linux/amd64 on my workstation the test always eventually breaks out of the retry loop due to timing jitter. I couldn't find a retry-specific hook in the http.Client, http.Transport, or tls.Config structs, so I have instead abused the Transport.Proxy hook for this purpose. Separately, we may want to consider adding a retry-specific hook, or changing the net/http implementation to avoid transparently retrying in this case. Fixes #65410. Updates #65178. Change-Id: I0e43c039615fe815f0a4ba99a8813c48b1fdc7e6 Reviewed-on: https://go-review.googlesource.com/c/go/+/559835 Reviewed-by: Damien Neil <dneil@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>

When readying a goroutine, the scheduler typically places the readied goroutine in pp.runnext, which will typically be the next goroutine to run in the schedule. In order to prevent a set of ping-pong goroutines from simply switching back and forth via runnext and starving the rest of the run queue, a goroutine scheduled via runnext shares a time slice (pp.schedtick) with the previous goroutine. sysmon detects "long-running goroutines", which really means Ps using the same pp.schedtick for too long, and preempts them to allow the rest of the run queue to run. Thus this avoids starvation via runnext. However, wasm has no threads, and thus no sysmon. Without sysmon to preempt, the possibility for starvation returns. Avoid this by disabling runnext entirely on wasm. This means that readied goroutines always go on the end of the run queue and thus cannot starve via runnext. Note that this CL doesn't do anything about single long-running goroutines. Without sysmon to preempt them, a single goroutine that fails to yield will starve the run queue indefinitely. For #65178. Change-Id: I10859d088776125a2af8c9cd862b6e071da628b5 Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero Reviewed-on: https://go-review.googlesource.com/c/go/+/559798 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

… on js and wasip1 For #65883. Updates #65177. Updates #65178. Updates #64321. Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32 Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero Reviewed-on: https://go-review.googlesource.com/c/go/+/557037 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com> (cherry picked from commit f19f31f) Reviewed-on: https://go-review.googlesource.com/c/go/+/566175 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>

bcmills · 2024-03-04T20:59:31Z

@prattmic, with https://go.dev/cl/559798 merged, should we close this issue? (Or is there more to do?)

bcmills · 2024-03-04T21:00:23Z

I guess this doesn't address the case where a single long-running goroutine never yields. Not sure whether we want to consider that in-scope for this issue, or treat it as a separate feature request to support goroutine preemption on js/wasm. 🤔

… on js and wasip1 For golang#65883. Updates golang#65177. Updates golang#65178. Updates golang#64321. Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32 Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero Reviewed-on: https://go-review.googlesource.com/c/go/+/557037 Auto-Submit: Bryan Mills <bcmills@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Damien Neil <dneil@google.com> (cherry picked from commit f19f31f) Reviewed-on: https://go-review.googlesource.com/c/go/+/566175 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>

prattmic · 2024-03-07T21:14:55Z

I think that should be a separate issue, or at least we should rename this to be more clearly about async preemption.

labulakalia · 2024-12-30T08:07:34Z

this issue exist on branch master
this is code example

Build Wasm

$ gotip version
go version devel go1.24-2b794ed Fri Dec 27 17:23:24 2024 -0800 linux/amd64

example code

//net.go
package main

import (
	"fmt"
	"log"
	"log/slog"
	"net"
	"os"
	"time"
)

func main() {
	slog.SetLogLoggerLevel(slog.LevelDebug)
	f := os.NewFile(4, "")
	l, err := net.FileListener(f)
	defer f.Close()
	if err != nil {
		log.Fatalln("111", err)
	}
	fmt.Println("start listen")
	defer l.Close()
	for {
		fmt.Println("start accept")

		conn, err := l.Accept()
		if err != nil {
			log.Fatalln(err)
		}
		fmt.Println("start accept", conn.RemoteAddr())
		go func() {
			for {
				fmt.Println("start read")
				data := make([]byte, 1024)
				_, err = conn.Read(data)
				if err != nil {
					log.Println(err)
				}
			}
		}()
		conn.Write([]byte("test"))
		// time.Sleep(time.Second) // if not this line, `fmt.Println("start read")` never run
	}
}

build wasm

GOOS=wasip1 GOARCH=wasm gotip build -o net.wasm net.go

Run wasm use wazero

run wasm on wazero runtime

//wazero.go
package main

import (
	"context"
	"crypto/rand"
	_ "embed"
	"log"
	"log/slog"
	"net/http"
	"os"

	"github.com/tetratelabs/wazero"
	experimentalsock "github.com/tetratelabs/wazero/experimental/sock"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)


//go:embed net.wasm
var netWasm []byte
func main() {
	slog.SetLogLoggerLevel(slog.LevelDebug)

	sockCfg := experimentalsock.NewConfig().WithTCPListener("127.0.0.1", 19971)
	ctx := experimentalsock.WithConfig(context.Background(), sockCfg)

	r := wazero.NewRuntime(ctx)
	wasi_snapshot_preview1.MustInstantiate(ctx, r)
	conf := wazero.NewModuleConfig().
		WithFS(os.DirFS(".")).
		WithStdout(os.Stdout).
		WithStderr(os.Stderr).
		WithStdin(os.Stdin).
		WithRandSource(rand.Reader).
		WithSysNanosleep().
		WithSysNanotime().
		WithSysWalltime().WithArgs("wasi", "sock")
	_, err := r.InstantiateWithConfig(ctx, netWasm, conf)
	if err != nil {
		log.Panicln(err)
	}
}

Send msg

nc 127.0.0.1 19971
test1
1
1
1
1

$ gotip run example.go 
start listen
start accept
start accept :0
start accept

note: this line code fmt.Println("start read") not run,if add time.Sleep will prinf start read
@prattmic

labulakalia · 2024-12-30T14:28:19Z

https://go.dev/blog/wasi
Wasm is a single threaded architecture with no parallelism. The scheduler can still schedule goroutines to run concurrently, and standard in/out/error is non-blocking, so a goroutine can execute while another reads or writes, but any host function calls (such as requesting random data using the example above) will cause all goroutines to block until the host function call has returned.
it is right

prattmic · 2025-01-06T07:37:01Z

@labulakalia If I understand correctly, the conn.Write blocks until there is a reader, but because we are blocked in write, the reader will never run. You seem to have found some documentation about this with #65178 (comment).

I don't think there is much we can do about this (FD 4 needs to be non-blocking, maybe we can make that easier somehow).

Regardless, this isn't the same issue that was being discussed here (though similar), so please open a new issue if you think we could do better here.

Per #65178 (comment), this issue should be closed in favor of #71134.

gopherbot added the compiler/runtime label Jan 19, 2024

bcmills mentioned this issue Jan 19, 2024

net/http: timeout in TestIssue4191_InfiniteGetTimeout on wasip1/wasm #65177

Closed

bcmills added the arch-wasm label Jan 19, 2024

bcmills changed the title ~~runtime: scheduler sometimes starves a runnable goroutine on wasip1/wasm~~ runtime: scheduler sometimes starves a runnable goroutine on wasm platforms Jan 19, 2024

cherrymui added the NeedsInvestigation label Jan 23, 2024

cherrymui added this to the Backlog milestone Jan 23, 2024

mknyszek added this to Go Compiler / Runtime Jan 24, 2024

prattmic mentioned this issue Jan 31, 2024

net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

Closed

gopherbot mentioned this issue Feb 22, 2024

runtime: scheduler sometimes starves a runnable goroutine on wasm platforms [1.22 backport] #65883

Closed

gopherbot added the Testing label Feb 22, 2024

bcmills removed the Testing label Feb 22, 2024

cagedmantis mentioned this issue Jan 3, 2025

os: TestRootConcurrentClose fails when runtime.AddCleanup is used #71117

Closed

kkoreilly mentioned this issue Jan 5, 2025

Apps with unyielding loops in goroutines hang on web cogentcore/core#1416

Open

prattmic mentioned this issue Jan 6, 2025

runtime: async preemption (or alternative) for wasm #71134

Open

prattmic closed this as completed Jan 6, 2025

github-project-automation bot moved this to Done in Test Flakes Jan 6, 2025

github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Jan 6, 2025

gabyhelp mentioned this issue Mar 11, 2025

context: canceling a context should clear the timeout in nodejs #72797

Open

gabyhelp mentioned this issue Mar 26, 2025

runtime: Channel benchmark causes extremely high idle wake ups on macOS Sequoia #73055

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

bcmills commented Jan 19, 2024 •

edited by cherrymui

Loading

bcmills commented Jan 19, 2024

prattmic commented Jan 19, 2024

bcmills commented Jan 19, 2024 •

edited

Loading

gopherbot commented Jan 19, 2024

gopherbot commented Jan 19, 2024

prattmic commented Jan 19, 2024 •

edited

Loading

bcmills commented Jan 19, 2024 •

edited

Loading

prattmic commented Jan 20, 2024

bcmills commented Jan 20, 2024

Jorropo commented Jan 22, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024

Jorropo commented Jan 22, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024 •

edited

Loading

prattmic commented Jan 22, 2024

Jorropo commented Jan 22, 2024 •

edited

Loading

aclements commented Jan 22, 2024

gopherbot commented Jan 22, 2024

gopherbot commented Jan 22, 2024

prattmic commented Jan 22, 2024

bcmills commented Jan 22, 2024

bcmills commented Feb 22, 2024

gopherbot commented Feb 22, 2024

gopherbot commented Feb 22, 2024

bcmills commented Mar 4, 2024

bcmills commented Mar 4, 2024

prattmic commented Mar 7, 2024

labulakalia commented Dec 30, 2024 •

edited

Loading

labulakalia commented Dec 30, 2024

prattmic commented Jan 6, 2025

runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

Comments

bcmills commented Jan 19, 2024 • edited by cherrymui Loading

Go version

Output of go env in your module/workspace:

What did you do?

What did you see happen?

What did you expect to see?

bcmills commented Jan 19, 2024

prattmic commented Jan 19, 2024

bcmills commented Jan 19, 2024 • edited Loading

gopherbot commented Jan 19, 2024

gopherbot commented Jan 19, 2024

prattmic commented Jan 19, 2024 • edited Loading

bcmills commented Jan 19, 2024 • edited Loading

prattmic commented Jan 20, 2024

bcmills commented Jan 20, 2024

Jorropo commented Jan 22, 2024 • edited Loading

Jorropo commented Jan 22, 2024

Jorropo commented Jan 22, 2024 • edited Loading

Jorropo commented Jan 22, 2024 • edited Loading

prattmic commented Jan 22, 2024

Jorropo commented Jan 22, 2024 • edited Loading

aclements commented Jan 22, 2024

gopherbot commented Jan 22, 2024

gopherbot commented Jan 22, 2024

prattmic commented Jan 22, 2024

bcmills commented Jan 22, 2024

bcmills commented Feb 22, 2024

gopherbot commented Feb 22, 2024

gopherbot commented Feb 22, 2024

bcmills commented Mar 4, 2024

bcmills commented Mar 4, 2024

prattmic commented Mar 7, 2024

labulakalia commented Dec 30, 2024 • edited Loading

Build Wasm

Run wasm use wazero

Send msg

labulakalia commented Dec 30, 2024

prattmic commented Jan 6, 2025

bcmills commented Jan 19, 2024 •

edited by cherrymui

Loading

Output of `go env` in your module/workspace:

bcmills commented Jan 19, 2024 •

edited

Loading

prattmic commented Jan 19, 2024 •

edited

Loading

bcmills commented Jan 19, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024 •

edited

Loading

Jorropo commented Jan 22, 2024 •

edited

Loading

labulakalia commented Dec 30, 2024 •

edited

Loading