runtime: send to empty buffered channel can fail on select statement with default case #48433

changkun · 2021-09-17T12:30:55Z

What version of Go are you using (`go version`)?

$ go version
go version devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE="auto"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/dev/godev/go-gerrit"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/dev/godev/go-gerrit/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g2/6fmr1qzx0ns3shq74zrp6bd40000gn/T/go-build1056299778=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

In the practice of using golang.org/x/mobile, we observed #48434 that the rendering loop in the golang.org/x/mobile/app package can pause. After debugging and logging (see CL 350212), we captured a log case which seems to suggest the following situation may execute the default case:

ch := make(chan struct{}, 1)
select {
case ch <- struct{}{}:
default:
    panic("should be impossible?")
}

What did you expect to see?

always do the case of send.

What did you see instead?

the default case is executed

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2021-09-17T19:40:00Z

That would certainly be a very serious problem, but I think we're going to need a test case.

changkun · 2021-09-17T22:28:47Z

I really wish I could write a minimal test case, but sadly the problem only happens occasionally in a very complex application.
Based on the current collected log, and a logical causality analysis on it, it eventually leads me to open this issue.

Nevertheless, I've written a minimum example to demonstrate the key components. Again, this minimum example does not reproduce the problem (at least I haven't run long enough to really see if it can reproduce):

package main

/*
void processGLcall() {
	for (int i = 0; i < 100000; i++) {
		// simulate some work
	}
}
*/
import "C"
import (
	"math/rand"
	"time"
)

func init() {
	rand.Seed(time.Now().Unix())
}

func main() {
	go G1()
	go G2()
	select {}
}

type fn struct{ blocking bool }

var (
	workAvailable = make(chan struct{}, 1)
	work          = make(chan fn, 1)
	retvalue      = make(chan int)
)

func G1() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		select {
		case <-workAvailable:
			println("G1: onDraw")
			doWork()
		case <-time.After(100 * time.Millisecond):
			println("G1: timeout")
		}
	}
}

func G2() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		n := rand.Intn(1000)
		for i := 0; i < n; i++ {
			callFn(fn{blocking: rand.Intn(4) == 0})
		}
	}
}

func doWork() {
	for {
		select {
		case w := <-work:
			println("G1: doWork consume work")
			C.processGLcall()
			if w.blocking {
				println("G1: before returning a value")
				retvalue <- 1
				println("G1: after value returned")
			}
		default:
			println("G1: doWork returned")
			return
		}
	}
}

func callFn(c fn) {
	work <- c

	select {
	case workAvailable <- struct{}{}:
		println("G2: workAvaliable success")
	default:
		println("G2: workAvaliable failed")
	}

	if c.blocking {
		println("G2: callFn, wait for ret")
		<-retvalue
		println("G2: callFn, after get ret")
	}
}

and the log (collected from real work application) prints:

G2 workAvaliable success  (a)
G2 workAvaliable success  (b)
G2 workAvaliable failed   (c)
G1 doWork returned        (d)
G1 doWork returned        (e)
G2 workAvaliable success  (f)
G2 workAvaliable success  (g)
G1 onDraw                 (h)
G1 doWork returned        (i)
G1 doWork returned        (j)
G2 workAvaliable failed   (k)
G1 timeout                (l)
G1 timeout
G1 timeout
G1 timeout
G1 timeout
G1 timeout
...
onDraw is never executed again.

The workAvaliable failed in (k) seems to suggest the issue.

ianlancetaylor · 2021-09-17T22:46:03Z

The question for a failure like this is: how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

As far as I can tell, in your sample program, if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error. It's not likely. But I don't see anything preventing it, so it's possible. A time.Ticker is best effort in Go; there is no hard real time guarantee.

changkun · 2021-09-18T07:36:45Z

how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

This is easy to answer: Because the cap(workAvaliable) == 1, then workAvaliable success means its previous work <- c in the callFn succeed. Hence len(work) should be greater or equal than 1, and G1 will result in a doWork return. Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error.

I am not quite sure I follow here. G1 can't make two separate calls to callFn, because cap(work) == 1.

A time.Ticker is best effort in Go; there is no hard real time guarantee.

the time.Ticker is for demonstration purposes. We don't need to discuss its guarantee here. In the real application, the G1 wakes periodically by OS, and executes the select statement inside the loop. (Note that OS failed to wake the G1 is a false statement, because the timeout keep getting printed hence the call is indeed get called)

ianlancetaylor · 2021-09-20T22:59:59Z

When I look at your small example program, I see this possible execution sequence:

G1 calls doWork
G2 calls callFn (let's say that blocking is false)
callFn sends c on the channel work
callFn sends an empty struct on the channel workAvailable
callFn returns
doWork receives w from the channel work
doWork is preempted before doing anything else
G2 calls callFn again
callFn sends c on the channel work (this succeeds because the channel is empty)
callFn tries to send an empty struct on the channel workAvailable
the send on workAvailable fails because the channel is full
callFn prints "G2: workAvaliable failed"

This sequence of events seems entirely possible to me, and it does not indicate any problem with the channel implementation.

Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

But doWork can read from the channel work before it prints anything. That means that work can be empty while workAvailable is not empty.

I understand that your small example program is not the real program. But perhaps the same kind of thing can happen in the real program. The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

changkun · 2021-09-21T04:23:05Z

doWork receives w from the channel work
...

the send on workAvailable fails because the channel is full
...
This sequence of events seems entirely possible to me

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

But doWork can read from the channel work before it prints anything.

This seems not true neither. From the log, (j) happened, and a subsequent (h) did not appear, which should be executed before doWork's receive of the work channel. If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

work can be empty while workAvailable is not empty

The work channel can be empty while workAvailable is not empty, but then in the next tick, workAvailable will be consumed, which won't lead to a forever freeze of G2. This is because a further callFn will enqueue another call (work is empty at the moment), and sends another signal to workAvailalbe, hence, in one more tick, the work channel will receive a new call and guarantee all works can proceed eventually. This contradicts the freeze we observed in practice.

I understand that your small example program is not the real program.

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

See comment in above. I still think the synchronization is guaranteed implicitly (by causality).

hajimehoshi · 2021-09-21T05:06:14Z

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

My current vague guess is that this issue happens only on Android. I think we should make a minimized case by adjusting workbufLen and e.g., flooding GL commands...

changkun · 2021-09-21T05:10:54Z

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

Thanks for the confirmation :) Sorry that I didn't read through the entire issue before.

My current vague guess is that this issue happens only on Android.

I can confirm the issue also happens on iOS (surprisingly not reproducible on macOS with M1 chip).

ianlancetaylor · 2021-09-21T17:04:36Z

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

doWork is a loop that keeps reading from the work channel as long as there is something to read. There is no required correlation between the number of values that doWork reads from work and the number of values that G1 reads from workAvailable. it is possible for callFn to be called several times while doWork is in the loop reading from work, before doWork returns to G1.

If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

Eventually, sure, but the delay between them can be arbitrary long.

This contradicts the freeze we observed in practice.

I haven't seen discussion of a freeze, though maybe I missed it. I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty. As far as I can tell that is entirely possible.

changkun · 2021-09-21T19:00:01Z

Eventually, sure, but the delay between them can be arbitrary long.

Sorry if the previous description was misleading, but certainly the "Eventually" means two ticks (clearly bounded less than a second).

As said previously, "If the delay between them can be arbitrary long" contradicts the current observation, except the runtime preemption can cause an arbitrarily long delay. If this is what is currently suggesting, it seems to be another serious problem in the runtime too.

I haven't seen discussion of a freeze, though maybe I missed it.

I am sorry again if the previous wording was a little bit misleading. The initial term that is used on top of the issue was "pause". I thought linking to another issue might clarify it more, but apparently not yet.

I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty.

Based on a causality analysis, this seems to be the only cause of a rendering freeze issue, see #48434.

As far as I can tell that is entirely possible.

See comments above. But allow me to clarify: it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine. Is my understanding correct? If not, would you mind clarify the possibility again that leads to a freeze of the observation that doWork is never called again (after at least minutes)

ianlancetaylor · 2021-09-21T21:11:16Z

it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine.

In a long-running program, there are many possible reasons why a goroutine may be arbitrarily delayed. For example, although the garbage collector is fast, it does freeze all other goroutines briefly. This requires preempting goroutines, which requires signals. The kernel can and does interpose arbitrary delays on signal delivery, depending on system load. Until the signal is delivered, the goroutine will keep running. If your program relies on any kind of synchronization, you can't assume anything about how long it takes a goroutine is delayed.

That said, I want to stress that I don't think this has anything to do with the behavior of your sample program. In your sample program calls to callFn, which write to work and workAvailable, are entirely independent of doWork, which reads from work in a loop. If there are two calls to callFn while doWork is running, then the second call to callFn will print G2: workAvaliable failed. Nothing prevents that. Discussions of tickers are irrelevant to that fact. That will happen if doWork is temporarily suspended while G2 calls callFn, and that can happen for any reason.

changkun · 2021-09-25T15:01:05Z

Unfortunately, I cannot find any reliable reproducer as a convincer in a short frame but only a log-based logical deduction. Close until I got any new evidence.

mapleray · 2022-07-28T10:39:57Z

met the same problem.

$  go version
go version go1.18.4 linux/amd64

simplification code:

ch := make(chan *job, 10000)
select {
case ch <- job:
default:
	logger.Errorf("full channel with size: %d", len(ch))
}

From log found that default is excuted but channel is not full, channel size is less than 100.

btittelbach · 2023-04-28T15:19:43Z

we also ran into this problem.

channel capacity is 50
channel len 0 when observed right after select-default executed
publish into channel using select, then default statement is executed
happens irregularly in a very complicated setup with many goroutines and many many channels
seems to happen if the func doing the select publish into channel is called twice within a s or so.
bug is somewhat reproduceable about 1 out of 20 times
bug is somewhat reproduceable on different device, so not a memory issue
go version go1.18.1 linux/amd64
compiled for GOARCH=arm GOOS=linux without cgo

to my understanding, the default-path of select should never be executed, unless the buffered channel is full.
We are absolutely sure the channel was never anywhere close to being more full than 2 at any time , yet the default path of select was executed.

ianlancetaylor · 2023-04-28T17:23:33Z

@btittelbach Please open a new issue with a test case that we can use to reproduce the problem. Thanks.

btittelbach · 2023-04-30T03:08:26Z

I took some time over the weekend to try and reproduce the error in a simpler setup.
Which led to the discovery of a quite old bug in my code 😅

Thus I happily retract my bug-report

changkun mentioned this issue Sep 17, 2021

x/mobile: rendering freeze #48434

Open

cagedmantis added the WaitingForInfo label Sep 17, 2021

changkun closed this as completed Sep 25, 2021

golang locked and limited conversation to collaborators Dec 19, 2024

gopherbot added the FrozenDueToAge label Dec 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: send to empty buffered channel can fail on select statement with default case #48433

runtime: send to empty buffered channel can fail on select statement with default case #48433

changkun commented Sep 17, 2021 •

edited

Loading

ianlancetaylor commented Sep 17, 2021

changkun commented Sep 17, 2021

ianlancetaylor commented Sep 17, 2021

changkun commented Sep 18, 2021

ianlancetaylor commented Sep 20, 2021 •

edited

Loading

changkun commented Sep 21, 2021

hajimehoshi commented Sep 21, 2021

changkun commented Sep 21, 2021 •

edited

Loading

ianlancetaylor commented Sep 21, 2021

changkun commented Sep 21, 2021

ianlancetaylor commented Sep 21, 2021

changkun commented Sep 25, 2021

mapleray commented Jul 28, 2022 •

edited

Loading

btittelbach commented Apr 28, 2023 •

edited

Loading

ianlancetaylor commented Apr 28, 2023

btittelbach commented Apr 30, 2023

runtime: send to empty buffered channel can fail on select statement with default case #48433

runtime: send to empty buffered channel can fail on select statement with default case #48433

Comments

changkun commented Sep 17, 2021 • edited Loading

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

ianlancetaylor commented Sep 17, 2021

changkun commented Sep 17, 2021

ianlancetaylor commented Sep 17, 2021

changkun commented Sep 18, 2021

ianlancetaylor commented Sep 20, 2021 • edited Loading

changkun commented Sep 21, 2021

hajimehoshi commented Sep 21, 2021

changkun commented Sep 21, 2021 • edited Loading

ianlancetaylor commented Sep 21, 2021

changkun commented Sep 21, 2021

ianlancetaylor commented Sep 21, 2021

changkun commented Sep 25, 2021

mapleray commented Jul 28, 2022 • edited Loading

btittelbach commented Apr 28, 2023 • edited Loading

ianlancetaylor commented Apr 28, 2023

btittelbach commented Apr 30, 2023

changkun commented Sep 17, 2021 •

edited

Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

ianlancetaylor commented Sep 20, 2021 •

edited

Loading

changkun commented Sep 21, 2021 •

edited

Loading

mapleray commented Jul 28, 2022 •

edited

Loading

btittelbach commented Apr 28, 2023 •

edited

Loading