Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: send to empty buffered channel can fail on select statement with default case #48433

Closed
changkun opened this issue Sep 17, 2021 · 16 comments
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@changkun
Copy link
Member

changkun commented Sep 17, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/dev/godev/go-gerrit"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/dev/godev/go-gerrit/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g2/6fmr1qzx0ns3shq74zrp6bd40000gn/T/go-build1056299778=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

In the practice of using golang.org/x/mobile, we observed #48434 that the rendering loop in the golang.org/x/mobile/app package can pause. After debugging and logging (see CL 350212), we captured a log case which seems to suggest the following situation may execute the default case:

ch := make(chan struct{}, 1)
select {
case ch <- struct{}{}:
default:
    panic("should be impossible?")
}

What did you expect to see?

always do the case of send.

What did you see instead?

the default case is executed

@ianlancetaylor
Copy link
Contributor

That would certainly be a very serious problem, but I think we're going to need a test case.

@changkun
Copy link
Member Author

I really wish I could write a minimal test case, but sadly the problem only happens occasionally in a very complex application.
Based on the current collected log, and a logical causality analysis on it, it eventually leads me to open this issue.

Nevertheless, I've written a minimum example to demonstrate the key components. Again, this minimum example does not reproduce the problem (at least I haven't run long enough to really see if it can reproduce):

package main

/*
void processGLcall() {
	for (int i = 0; i < 100000; i++) {
		// simulate some work
	}
}
*/
import "C"
import (
	"math/rand"
	"time"
)

func init() {
	rand.Seed(time.Now().Unix())
}

func main() {
	go G1()
	go G2()
	select {}
}

type fn struct{ blocking bool }

var (
	workAvailable = make(chan struct{}, 1)
	work          = make(chan fn, 1)
	retvalue      = make(chan int)
)

func G1() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		select {
		case <-workAvailable:
			println("G1: onDraw")
			doWork()
		case <-time.After(100 * time.Millisecond):
			println("G1: timeout")
		}
	}
}

func G2() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		n := rand.Intn(1000)
		for i := 0; i < n; i++ {
			callFn(fn{blocking: rand.Intn(4) == 0})
		}
	}
}

func doWork() {
	for {
		select {
		case w := <-work:
			println("G1: doWork consume work")
			C.processGLcall()
			if w.blocking {
				println("G1: before returning a value")
				retvalue <- 1
				println("G1: after value returned")
			}
		default:
			println("G1: doWork returned")
			return
		}
	}
}

func callFn(c fn) {
	work <- c

	select {
	case workAvailable <- struct{}{}:
		println("G2: workAvaliable success")
	default:
		println("G2: workAvaliable failed")
	}

	if c.blocking {
		println("G2: callFn, wait for ret")
		<-retvalue
		println("G2: callFn, after get ret")
	}
}

and the log (collected from real work application) prints:

G2 workAvaliable success  (a)
G2 workAvaliable success  (b)
G2 workAvaliable failed   (c)
G1 doWork returned        (d)
G1 doWork returned        (e)
G2 workAvaliable success  (f)
G2 workAvaliable success  (g)
G1 onDraw                 (h)
G1 doWork returned        (i)
G1 doWork returned        (j)
G2 workAvaliable failed   (k)
G1 timeout                (l)
G1 timeout
G1 timeout
G1 timeout
G1 timeout
G1 timeout
...
onDraw is never executed again.

The workAvaliable failed in (k) seems to suggest the issue.

@ianlancetaylor
Copy link
Contributor

The question for a failure like this is: how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

As far as I can tell, in your sample program, if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error. It's not likely. But I don't see anything preventing it, so it's possible. A time.Ticker is best effort in Go; there is no hard real time guarantee.

@cagedmantis cagedmantis added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 17, 2021
@changkun
Copy link
Member Author

how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

This is easy to answer: Because the cap(workAvaliable) == 1, then workAvaliable success means its previous work <- c in the callFn succeed. Hence len(work) should be greater or equal than 1, and G1 will result in a doWork return. Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error.

I am not quite sure I follow here. G1 can't make two separate calls to callFn, because cap(work) == 1.

A time.Ticker is best effort in Go; there is no hard real time guarantee.

the time.Ticker is for demonstration purposes. We don't need to discuss its guarantee here. In the real application, the G1 wakes periodically by OS, and executes the select statement inside the loop. (Note that OS failed to wake the G1 is a false statement, because the timeout keep getting printed hence the call is indeed get called)

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Sep 20, 2021

When I look at your small example program, I see this possible execution sequence:

  • G1 calls doWork
  • G2 calls callFn (let's say that blocking is false)
  • callFn sends c on the channel work
  • callFn sends an empty struct on the channel workAvailable
  • callFn returns
  • doWork receives w from the channel work
  • doWork is preempted before doing anything else
  • G2 calls callFn again
  • callFn sends c on the channel work (this succeeds because the channel is empty)
  • callFn tries to send an empty struct on the channel workAvailable
  • the send on workAvailable fails because the channel is full
  • callFn prints "G2: workAvaliable failed"

This sequence of events seems entirely possible to me, and it does not indicate any problem with the channel implementation.

Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

But doWork can read from the channel work before it prints anything. That means that work can be empty while workAvailable is not empty.

I understand that your small example program is not the real program. But perhaps the same kind of thing can happen in the real program. The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

@changkun
Copy link
Member Author

  • doWork receives w from the channel work
    ...
  • the send on workAvailable fails because the channel is full
    ...
    This sequence of events seems entirely possible to me

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

But doWork can read from the channel work before it prints anything.

This seems not true neither. From the log, (j) happened, and a subsequent (h) did not appear, which should be executed before doWork's receive of the work channel. If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

work can be empty while workAvailable is not empty

The work channel can be empty while workAvailable is not empty, but then in the next tick, workAvailable will be consumed, which won't lead to a forever freeze of G2. This is because a further callFn will enqueue another call (work is empty at the moment), and sends another signal to workAvailalbe, hence, in one more tick, the work channel will receive a new call and guarantee all works can proceed eventually. This contradicts the freeze we observed in practice.

I understand that your small example program is not the real program.

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

See comment in above. I still think the synchronization is guaranteed implicitly (by causality).

@hajimehoshi
Copy link
Member

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

My current vague guess is that this issue happens only on Android. I think we should make a minimized case by adjusting workbufLen and e.g., flooding GL commands...

@changkun
Copy link
Member Author

changkun commented Sep 21, 2021

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

Thanks for the confirmation :) Sorry that I didn't read through the entire issue before.

My current vague guess is that this issue happens only on Android.

I can confirm the issue also happens on iOS (surprisingly not reproducible on macOS with M1 chip).

@ianlancetaylor
Copy link
Contributor

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

doWork is a loop that keeps reading from the work channel as long as there is something to read. There is no required correlation between the number of values that doWork reads from work and the number of values that G1 reads from workAvailable. it is possible for callFn to be called several times while doWork is in the loop reading from work, before doWork returns to G1.

If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

Eventually, sure, but the delay between them can be arbitrary long.

This contradicts the freeze we observed in practice.

I haven't seen discussion of a freeze, though maybe I missed it. I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty. As far as I can tell that is entirely possible.

@changkun
Copy link
Member Author

Eventually, sure, but the delay between them can be arbitrary long.

Sorry if the previous description was misleading, but certainly the "Eventually" means two ticks (clearly bounded less than a second).

As said previously, "If the delay between them can be arbitrary long" contradicts the current observation, except the runtime preemption can cause an arbitrarily long delay. If this is what is currently suggesting, it seems to be another serious problem in the runtime too.

I haven't seen discussion of a freeze, though maybe I missed it.

I am sorry again if the previous wording was a little bit misleading. The initial term that is used on top of the issue was "pause". I thought linking to another issue might clarify it more, but apparently not yet.

I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty.

Based on a causality analysis, this seems to be the only cause of a rendering freeze issue, see #48434.

As far as I can tell that is entirely possible.

See comments above. But allow me to clarify: it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine. Is my understanding correct? If not, would you mind clarify the possibility again that leads to a freeze of the observation that doWork is never called again (after at least minutes)

@ianlancetaylor
Copy link
Contributor

it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine.

In a long-running program, there are many possible reasons why a goroutine may be arbitrarily delayed. For example, although the garbage collector is fast, it does freeze all other goroutines briefly. This requires preempting goroutines, which requires signals. The kernel can and does interpose arbitrary delays on signal delivery, depending on system load. Until the signal is delivered, the goroutine will keep running. If your program relies on any kind of synchronization, you can't assume anything about how long it takes a goroutine is delayed.

That said, I want to stress that I don't think this has anything to do with the behavior of your sample program. In your sample program calls to callFn, which write to work and workAvailable, are entirely independent of doWork, which reads from work in a loop. If there are two calls to callFn while doWork is running, then the second call to callFn will print G2: workAvaliable failed. Nothing prevents that. Discussions of tickers are irrelevant to that fact. That will happen if doWork is temporarily suspended while G2 calls callFn, and that can happen for any reason.

@changkun
Copy link
Member Author

Unfortunately, I cannot find any reliable reproducer as a convincer in a short frame but only a log-based logical deduction. Close until I got any new evidence.

@mapleray
Copy link

mapleray commented Jul 28, 2022

met the same problem.

$  go version
go version go1.18.4 linux/amd64

simplification code:

ch := make(chan *job, 10000)
select {
case ch <- job:
default:
	logger.Errorf("full channel with size: %d", len(ch))
}

From log found that default is excuted but channel is not full, channel size is less than 100.

@btittelbach
Copy link

btittelbach commented Apr 28, 2023

we also ran into this problem.

  • channel capacity is 50
  • channel len 0 when observed right after select-default executed
  • publish into channel using select, then default statement is executed
  • happens irregularly in a very complicated setup with many goroutines and many many channels
  • seems to happen if the func doing the select publish into channel is called twice within a s or so.
  • bug is somewhat reproduceable about 1 out of 20 times
  • bug is somewhat reproduceable on different device, so not a memory issue
  • go version go1.18.1 linux/amd64
  • compiled for GOARCH=arm GOOS=linux without cgo

to my understanding, the default-path of select should never be executed, unless the buffered channel is full.
We are absolutely sure the channel was never anywhere close to being more full than 2 at any time , yet the default path of select was executed.

@ianlancetaylor
Copy link
Contributor

@btittelbach Please open a new issue with a test case that we can use to reproduce the problem. Thanks.

@btittelbach
Copy link

I took some time over the weekend to try and reproduce the error in a simpler setup.
Which led to the discovery of a quite old bug in my code 😅

Thus I happily retract my bug-report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

6 participants