
sync: potential race when using cond.Broadcast #14064

Closed
ianlancetaylor opened this issue Jan 22, 2016 · 8 comments

@ianlancetaylor
Contributor

Copied from a posting to golang-nuts by Wedson Almeida Filho. I (Ian) don't think there is a problem with cond.Signal, but there does seem to be a problem with cond.Broadcast.

I'm under the impression that the current implementation of condition variables in Go is incorrect and can lead to deadlocks.

cond.Wait does the following:

  1. [Atomically] increments the waiter count
  2. Releases the mutex
  3. Waits on the semaphore
  4. Re-acquires mutex

It looks like it's possible for one thread to "steal" a signal/broadcast intended for another one if it loses the CPU between steps 2 and 3 above. For example, the following sequence seems problematic:

  1. Thread A: calls cond.Wait, loses the CPU between steps 2 and 3 above. (Preempted by the kernel scheduler.)
  2. Thread B: calls cond.Signal or cond.Broadcast; the semaphore gets incremented to 1.
  3. Thread C: calls cond.Wait; the wait on the semaphore is immediately satisfied by the count stored in the previous step.
  4. Thread A: regains the CPU and sleeps on the semaphore. It never gets the notification sent by thread B.
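
Sketched in code, the Wait implementation described above looks roughly like this (a simplified sketch: the copy checker and race-detector hooks are omitted, and the waiter-count update is shown schematically; c.sema and runtime_Syncsemacquire are the names used by the implementation at the time):

func (c *Cond) Wait() {
        atomic.AddUint32(&c.waiters, 1) // step 1: atomically increment the waiter count
        c.L.Unlock()                    // step 2: release the mutex
        // A preemption here is the race window: a Signal/Broadcast can run now,
        // and a later Wait can consume the semaphore count meant for this waiter.
        runtime_Syncsemacquire(&c.sema) // step 3: wait on the semaphore
        c.L.Lock()                      // step 4: re-acquire the mutex
}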

A more concrete example might be a [simplified] consumer/producer implementation (I'm only keeping track of a count):

// m, cond, count, and max are shared state: a sync.Mutex, a condition
// variable created with sync.NewCond(&m), the current item count, and
// the buffer capacity.

func produce() {
        m.Lock()
        for count == max {
                cond.Wait()
        }

        count++
        if count == 1 {
                cond.Broadcast()
        }
        m.Unlock()
}

func consume() {
        m.Lock()
        for count == 0 {
                cond.Wait()
        }

        count--
        if count == max-1 {
                cond.Broadcast()
        }
        m.Unlock()
}

For the code above, suppose count == 0. The following sequence is possible:

  1. goroutine A calls consume, which will call cond.Wait; suppose it loses the CPU between steps 2 and 3.
  2. goroutine B calls produce, which will call cond.Broadcast, which will cause the semaphore to go to 1.
  3. goroutine B calls produce repeatedly, until count reaches the max.
  4. goroutine B will itself call cond.Wait, which will be satisfied immediately.
  5. goroutine B will call cond.Wait again, and sleep.
  6. goroutine A regains the CPU, and sleeps on the semaphore as well.

The goroutines are now deadlocked. Granted, this is a race condition that is hard to hit, but it is still a correctness issue.

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Jan 22, 2016
@dvyukov
Member

dvyukov commented Jan 22, 2016

Cond uses a special sync semaphore (think of a rendezvous on a synchronous channel).
As a result, a goroutine cannot consume its own signal.
If a goroutine signals while holding the mutex, then exactly the designated set of goroutines will receive the signals (new goroutines can't enter the critical section).
If a goroutine signals outside of the mutex, then there is inherent non-determinism with respect to which goroutines receive them (new goroutines can enter the critical section and wait on the cv). But this is expected.
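
To illustrate the two cases (a hedged sketch; mu, cond, and ready are placeholder names for the user's own state, with cond created via sync.NewCond(&mu)):

// Broadcast while holding the mutex: only goroutines already blocked in
// cond.Wait can be woken, because no new goroutine can enter the critical
// section and start waiting until we unlock.
func setReadyLocked() {
        mu.Lock()
        ready = true
        cond.Broadcast()
        mu.Unlock()
}

// Broadcast outside the mutex: a goroutine that is just entering cond.Wait
// may or may not observe this notification; which waiters wake up is
// inherently non-deterministic.
func setReadyUnlocked() {
        mu.Lock()
        ready = true
        mu.Unlock()
        cond.Broadcast()
}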

So the described scenario can't happen.

@dvyukov dvyukov closed this as completed Jan 22, 2016
@wedsonaf
Contributor

Dmitry,

I think there's still a race in the Broadcast case outside the mutex.

Suppose there are N goroutines waiting on a cv (but not on the semaphore yet). Then another goroutine calls Broadcast while, concurrently, yet another goroutine calls Wait on it.

While what happens to that last goroutine is indeed non-deterministic (it may or may not get the notification), what happens to the N goroutines that were already waiting is deterministic: they should all get the notification.

Under this implementation, it looks like one of them may not get the notification when the "new" goroutine steals its notification.

@dvyukov
Member

dvyukov commented Jan 22, 2016

@wedsonaf please provide a full example program that will misbehave.

@wedsonaf
Contributor

This is hard to reproduce because we need the thread to lose the CPU between releasing the lock and sleeping on the semaphore. So to make it easier, I modified the sync package to optionally introduce an artificial 1s delay at the spot we care about. This is controlled by sync.ForceWait.

The code below will result in a deadlock.

Note that if we remove the extra waiter altogether, the code runs to completion after the 1s wait.

package main

import (
        "runtime"
        "sync"
        "time"
)

func main() {
        var m sync.Mutex
        cond := sync.NewCond(&m)

        // Start a waiter.
        ch := make(chan struct{})
        sync.ForceWait = true
        go func() {
                runtime.LockOSThread()
                m.Lock()
                ch <- struct{}{}
                cond.Wait()
                m.Unlock()

                ch <- struct{}{}
        }()

        <-ch
        m.Lock()
        m.Unlock()

        // We know that the waiter is in the cond.Wait() call because we
        // synchronized with it, then acquired/released the mutex it was
        // holding when we synchronized.

        go func() {
                cond.Broadcast()
        }()

        // Sleep before we wait to give Broadcast a chance to wait for the waiter.
        time.Sleep(20 * time.Millisecond)

        // Start the extra waiter, without the forced wait.
        sync.ForceWait = false
        go func() {
                m.Lock()
                cond.Wait()
                m.Unlock()
        }()

        // This will only be satisfied when the first waiter wakes up. We'll
        // deadlock if it never wakes up.
        <-ch
}

Below is the diff to the runtime and sync packages:

diff --git a/src/runtime/sema.go b/src/runtime/sema.go
index b54621b..0ea5d63 100644
--- a/src/runtime/sema.go
+++ b/src/runtime/sema.go
@@ -291,3 +291,8 @@ func syncsemcheck(sz uintptr) {
                throw("bad syncSema size")
        }
 }
+
+//go:linkname sleep sync.runtime_Usleep
+func sleep(us uint32) {
+       usleep(us)
+}
diff --git a/src/sync/cond.go b/src/sync/cond.go
index 0aefcda..a9f7b4b 100644
--- a/src/sync/cond.go
+++ b/src/sync/cond.go
@@ -10,6 +10,8 @@ import (
        "unsafe"
 )

+var ForceWait = false
+
 // Cond implements a condition variable, a rendezvous point
 // for goroutines waiting for or announcing the occurrence
 // of an event.
@@ -60,6 +62,11 @@ func (c *Cond) Wait() {
                race.Enable()
        }
        c.L.Unlock()
+
+       if ForceWait {
+               runtime_Usleep(1000000)
+       }
+
        runtime_Syncsemacquire(&c.sema)
        c.L.Lock()
 }
diff --git a/src/sync/runtime.go b/src/sync/runtime.go
index c66d2de..8874b18 100644
--- a/src/sync/runtime.go
+++ b/src/sync/runtime.go
@@ -45,3 +45,5 @@ func runtime_canSpin(i int) bool

 // runtime_doSpin does active spinning.
 func runtime_doSpin()
+
+func runtime_Usleep(usec uint32)

@wedsonaf
Contributor

Here's a proposed fix: https://go-review.googlesource.com/18892
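
For illustration only (this is not the content of that CL), the property a fix needs is: Wait must register the waiter while the caller's mutex is still held, and Broadcast must wake exactly the goroutines registered at that moment, rather than bumping a shared count that a later waiter can consume. A minimal model of that idea:

// listCond is an illustrative model, not the proposed change: each waiter
// registers its own wakeup channel before releasing the caller's mutex, so
// Broadcast wakes exactly the goroutines that were waiting when it ran, and
// a later Wait cannot steal an earlier waiter's notification.
type listCond struct {
        L       sync.Locker     // held by the caller around Wait, like sync.Cond
        mu      sync.Mutex      // protects waiters
        waiters []chan struct{} // one wakeup channel per blocked goroutine
}

func (c *listCond) Wait() {
        ch := make(chan struct{})
        c.mu.Lock()
        c.waiters = append(c.waiters, ch) // registered while c.L is still held
        c.mu.Unlock()

        c.L.Unlock()
        <-ch // even if preempted here, our channel is already on the list
        c.L.Lock()
}

func (c *listCond) Broadcast() {
        c.mu.Lock()
        for _, ch := range c.waiters {
                close(ch) // wake every goroutine registered before this call
        }
        c.waiters = nil
        c.mu.Unlock()
}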

@gopherbot

CL https://golang.org/cl/18892 mentions this issue.

@dvyukov
Member

dvyukov commented Jan 25, 2016

@wedsonaf yes, this is a bug in Cond. Have you actually hit this scenario in real life, or is it a thought experiment?

@wedsonaf
Contributor

It was a "directed" thought experiment: I was investigating an issue where it seemed like a waiting goroutine wasn't waking up on a broadcast, but it turned out to be something else.

@dvyukov dvyukov reopened this Feb 28, 2016
@golang golang locked and limited conversation to collaborators Mar 19, 2017
@rsc rsc unassigned dvyukov Jun 23, 2022