runtime: send to empty buffered channel can fail on select statement with default case #48433
Comments
That would certainly be a very serious problem, but I think we're going to need a test case. |
I really wish I could write a minimal test case, but sadly the problem only happens occasionally in a very complex application. Nevertheless, I've written a minimal example to demonstrate the key components. Again, this example does not reproduce the problem (at least, I haven't run it long enough to see whether it can):
package main
/*
void processGLcall() {
for (int i = 0; i < 100000; i++) {
// simulate some work
}
}
*/
import "C"
import (
"math/rand"
"time"
)
func init() {
rand.Seed(time.Now().Unix())
}
func main() {
go G1()
go G2()
select {}
}
type fn struct{ blocking bool }
var (
workAvailable = make(chan struct{}, 1)
work = make(chan fn, 1)
retvalue = make(chan int)
)
func G1() {
tk := time.NewTicker(time.Second / 30)
for range tk.C {
select {
case <-workAvailable:
println("G1: onDraw")
doWork()
case <-time.After(100 * time.Millisecond):
println("G1: timeout")
}
}
}
func G2() {
tk := time.NewTicker(time.Second / 30)
for range tk.C {
n := rand.Intn(1000)
for i := 0; i < n; i++ {
callFn(fn{blocking: rand.Intn(4) == 0})
}
}
}
func doWork() {
for {
select {
case w := <-work:
println("G1: doWork consume work")
C.processGLcall()
if w.blocking {
println("G1: before returning a value")
retvalue <- 1
println("G1: after value returned")
}
default:
println("G1: doWork returned")
return
}
}
}
func callFn(c fn) {
work <- c
select {
case workAvailable <- struct{}{}:
println("G2: workAvaliable success")
default:
println("G2: workAvaliable failed")
}
if c.blocking {
println("G2: callFn, wait for ret")
<-retvalue
println("G2: callFn, after get ret")
}
}
and the log (collected from the real-world application) prints:
The "workAvaliable failed" line in (k) seems to point to the issue. |
The question for a failure like this is: how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel? As far as I can tell, in your sample program, if |
This is easy to answer: Because the cap(workAvaliable) == 1, then
I am not quite sure I follow here. G1 can't make two separate calls to callFn, because cap(work) == 1.
The time.Ticker is for demonstration purposes; we don't need to discuss its guarantees here. In the real application, G1 is woken periodically by the OS and executes the select statement inside the loop. (Note that "the OS failed to wake G1" is a false statement, because the |
When I look at your small example program, I see this possible execution sequence:
This sequence of events seems entirely possible to me, and it does not indicate any problem with the channel implementation.
But I understand that your small example program is not the real program. But perhaps the same kind of thing can happen in the real program. The problem in the small example program is that there is no synchronization between the |
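As an illustration only, the sketch below replays one possible interleaving by hand in a single goroutine. It is an assumption of this sketch, not a reconstruction of the sequence described in the comment above, and it simplifies the sample program (no cgo call, no blocking items). The names `replay`, `signal`, and `drain` are hypothetical. It shows that the non-blocking send can take the default case because of a stale token left in `workAvailable`, without any work being lost.

```go
package main

import "fmt"

// replay steps through one producer/consumer interleaving by hand and
// returns the result of each non-blocking signal plus the number of work
// items consumed. All names here are hypothetical.
func replay() (signals []bool, consumed int) {
	workAvailable := make(chan struct{}, 1)
	work := make(chan int, 1)

	// signal mirrors the select-with-default send in callFn.
	signal := func() {
		select {
		case workAvailable <- struct{}{}:
			signals = append(signals, true)
		default:
			signals = append(signals, false)
		}
	}
	// drain mirrors the loop in doWork: consume until no work is queued.
	drain := func() {
		for {
			select {
			case <-work:
				consumed++
			default:
				return
			}
		}
	}

	work <- 1       // producer queues item 1
	signal()        // producer: token buffered
	<-workAvailable // consumer wakes up and takes the token
	<-work          // consumer takes item 1 inside its drain loop
	consumed++
	work <- 2       // producer queues item 2 while the consumer is still draining
	signal()        // succeeds: the token buffer is empty again
	drain()         // consumer takes item 2 in the same loop, then returns
	work <- 3       // producer queues item 3
	signal()        // default case: the token from the previous signal is still buffered
	<-workAvailable // consumer picks up the stale token
	drain()         // and consumes item 3, so nothing is lost
	return signals, consumed
}

func main() {
	signals, consumed := replay()
	fmt.Println(signals, consumed) // [true true false] 3
}
```

In this interleaving the third signal fails even though the consumer is idle at that moment, which is the kind of log line that can look like a channel bug when read in isolation.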
Isn't here leads to
This does not seem true either. From the log, (j) happened but a subsequent (h) did not appear, even though it should be executed before
The
It is difficult to produce a minimal reproducer. Even in a bigger program, the problem reproduces rarely (but it is a disaster when it occurs). Quote: "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi
See the comment above. I still think the synchronization is guaranteed implicitly (by causality). |
I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the My current vague guess is that this issue happens only on Android. I think we should make a minimized case by adjusting |
Thanks for the confirmation :) Sorry that I didn't read through the entire issue before.
I can confirm the issue also happens on iOS (surprisingly not reproducible on macOS with M1 chip). |
Eventually, sure, but the delay between them can be arbitrary long.
I haven't seen discussion of a freeze, though maybe I missed it. I thought the issue was that the |
Sorry if the previous description was misleading, but "eventually" here certainly means within two ticks (clearly bounded by less than a second). As said previously, "the delay between them can be arbitrary long" contradicts the current observation, unless runtime preemption can cause an arbitrarily long delay. If that is what is being suggested, it seems to be another serious problem in the runtime.
I am sorry again if the previous wording was a little misleading. The term originally used at the top of the issue was "pause". I thought linking to another issue might clarify it further, but apparently it has not yet.
Based on a causality analysis, this seems to be the only cause of a rendering freeze issue, see #48434.
See the comments above. But allow me to clarify: it is entirely possible if and only if runtime preemption can cause an arbitrarily long delay in the scheduling of a goroutine. Is my understanding correct? If not, would you mind clarifying again the possibility that leads to a freeze of the observation that |
In a long-running program, there are many possible reasons why a goroutine may be arbitrarily delayed. For example, although the garbage collector is fast, it does freeze all other goroutines briefly. This requires preempting goroutines, which requires signals. The kernel can and does interpose arbitrary delays on signal delivery, depending on system load. Until the signal is delivered, the goroutine will keep running. If your program relies on any kind of synchronization, you can't assume anything about how long a goroutine is delayed. That said, I want to stress that I don't think this has anything to do with the behavior of your sample program. In your sample program calls to |
Unfortunately, I cannot find a reliable reproducer in a short time frame to serve as convincing evidence, only a log-based logical deduction. Closing until I get new evidence. |
I ran into the same problem.
Simplified code:
From the log, I found that the default case is executed even though the channel is not full; the channel size is less than 100. |
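One thing worth ruling out with this kind of log is that the channel length was sampled after the select had already failed. The sketch below is hypothetical (the commenter's code is not shown here, and `trySendObserved` and its `between` callback are names invented for illustration). The callback stands in for whatever other goroutines do between the failed send and the log statement; here it receives one value.

```go
package main

import "fmt"

// trySendObserved attempts a non-blocking send and reports whether it
// succeeded, together with len(ch) as it would be seen by a log statement
// that runs afterwards. Hypothetical helper for illustration only.
func trySendObserved(ch chan int, v int, between func()) (sent bool, observedLen int) {
	select {
	case ch <- v:
		return true, len(ch)
	default:
		// The send failed because the buffer was full at this instant.
		// Anything may happen before the length is sampled for logging.
		between()
		return false, len(ch)
	}
}

func main() {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2 // buffer is now full

	sent, n := trySendObserved(ch, 3, func() { <-ch })
	fmt.Println(sent, n, cap(ch)) // false 1 2
}
```

The logged length is below the capacity even though the default case was taken correctly, so a log of this shape does not by itself show a runtime bug.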
We also ran into this problem.
To my understanding, the default path of a select should never be executed unless the buffered channel is full. |
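The expected semantics can be checked deterministically when only one goroutine touches the channel. The following is a minimal sketch; `trySend` is a hypothetical helper, not something from the code discussed in this thread.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether the value was
// placed in the channel. Hypothetical helper for illustration.
func trySend(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan struct{}, 1)

	fmt.Println(trySend(ch)) // true: the buffer was empty
	fmt.Println(trySend(ch)) // false: the buffer is full, default is taken
	<-ch                     // make room again
	fmt.Println(trySend(ch)) // true: the buffer was empty again
}
```

With concurrent receivers the same rule applies, but "full" is evaluated at the instant of the select, which may differ from what a later log line or `len(ch)` call observes.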
@btittelbach Please open a new issue with a test case that we can use to reproduce the problem. Thanks. |
I took some time over the weekend to try and reproduce the error in a simpler setup. Thus I happily retract my bug-report |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
While using golang.org/x/mobile, we observed (#48434) that the rendering loop in the golang.org/x/mobile/app package can pause. After debugging and logging (see CL 350212), we captured a log that seems to suggest the default case may be executed in the following situation:
What did you expect to see?
The send case is always executed.
What did you see instead?
The default case is executed.