
sync: potential race when using cond.Broadcast #14064

Closed
ianlancetaylor opened this issue Jan 22, 2016 · 8 comments

@ianlancetaylor
Contributor

Copied from a posting to golang-nuts by Wedson Almeida Filho. I (Ian) don't think there is a problem with cond.Signal, but there does seem to be a problem with cond.Broadcast.

I'm under the impression that the current implementation of condition variables in Go is incorrect and can lead to deadlocks.

cond.Wait does the following:

  1. [Atomically] increments the waiter count
  2. Releases the mutex
  3. Waits on the semaphore
  4. Re-acquires mutex

It looks like it's possible for one thread to "steal" a signal/broadcast intended for another one if it loses the CPU between steps 2 and 3 above. For example, the following sequence seems problematic:

  1. Thread A: calls cond.Wait, loses the CPU between steps 2 and 3 above. (Preempted by the kernel scheduler.)
  2. Thread B: calls cond.Signal or cond.Broadcast; the semaphore gets incremented to 1.
  3. Thread C: calls cond.Wait; the wait on the semaphore is immediately satisfied by the count stored in the previous step.
  4. Thread A: regains the CPU and sleeps on the semaphore. It never gets the notification sent by thread B.
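
Sketched in code, the Wait implementation described above looks roughly like this (a simplified sketch: the copy checker and race-detector hooks are omitted, and the waiter-count update is shown schematically; c.sema and runtime_Syncsemacquire are the names used by the implementation at the time):

func (c *Cond) Wait() {
        atomic.AddUint32(&c.waiters, 1) // step 1: atomically increment the waiter count
        c.L.Unlock()                    // step 2: release the mutex
        // A preemption here is the race window: a Signal/Broadcast can run now,
        // and a later Wait can consume the semaphore count meant for this waiter.
        runtime_Syncsemacquire(&c.sema) // step 3: wait on the semaphore
        c.L.Lock()                      // step 4: re-acquire the mutex
}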

A more concrete example might be a [simplified] consumer/producer implementation (I'm only keeping track of a count):

// m, cond, count, and max are shared state: a sync.Mutex, a condition
// variable created with sync.NewCond(&m), the current item count, and
// the buffer capacity.

func produce() {
        m.Lock()
        for count == max {
                cond.Wait()
        }

        count++
        if count == 1 {
                cond.Broadcast()
        }
        m.Unlock()
}

func consume() {
        m.Lock()
        for count == 0 {
                cond.Wait()
        }

        count--
        if count == max-1 {
                cond.Broadcast()
        }
        m.Unlock()
}

For the code above, suppose count == 0. The following sequence is possible:

  1. goroutine A calls consume, which will call cond.Wait; suppose it loses the CPU between steps 2 and 3.
  2. goroutine B calls produce, which will call cond.Broadcast, which will cause the semaphore to go to 1.
  3. goroutine B calls produce repeatedly, until count reaches the max.
  4. goroutine B will itself call cond.Wait, which will be satisfied immediately.
  5. goroutine B will call cond.Wait again, and sleep.
  6. goroutine A regains the CPU, and sleeps on the semaphore as well.

The goroutines are now deadlocked. Granted, this is a race condition that is hard to hit, but it is still a correctness issue.

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Jan 22, 2016
@dvyukov
Member

dvyukov commented Jan 22, 2016

Cond uses a special sync semaphore (think of a rendezvous on a synchronous channel).
As a result, a goroutine cannot consume its own signal.
If a goroutine signals while holding the mutex, then exactly the designated set of goroutines will receive the signals (new goroutines can't enter the critical section).
If a goroutine signals outside of the mutex, then there is inherent non-determinism with respect to which goroutines receive them (new goroutines can enter the critical section and wait on the cv). But this is expected.
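
To illustrate the two cases (a hedged sketch; mu, cond, and ready are placeholder names for the user's own state, with cond created via sync.NewCond(&mu)):

// Broadcast while holding the mutex: only goroutines already blocked in
// cond.Wait can be woken, because no new goroutine can enter the critical
// section and start waiting until we unlock.
func setReadyLocked() {
        mu.Lock()
        ready = true
        cond.Broadcast()
        mu.Unlock()
}

// Broadcast outside the mutex: a goroutine that is just entering cond.Wait
// may or may not observe this notification; which waiters wake up is
// inherently non-deterministic.
func setReadyUnlocked() {
        mu.Lock()
        ready = true
        mu.Unlock()
        cond.Broadcast()
}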

So the described scenario can't happen.

@dvyukov dvyukov closed this as completed Jan 22, 2016
@wedsonaf
Contributor

Dmitry,

I think there's still a race in the Broadcast case outside the mutex.

Suppose there are N goroutines waiting on a cv (but not on the semaphore yet). Then another goroutine calls Broadcast while, concurrently, yet another goroutine calls Wait on it.

While what happens to that last goroutine is indeed non-deterministic (it may or may not get the notification), what happens to the N goroutines that were already waiting is deterministic: they should all get the notification.

Under this implementation, it looks like one of them may not get the notification when the "new" goroutine steals its notification.

@dvyukov
Member

dvyukov commented Jan 22, 2016

@wedsonaf please provide a full example program that will misbehave.

@wedsonaf
Contributor

This is hard to reproduce because we need the thread to lose the CPU between releasing the lock and sleeping on the semaphore. So to make it easier, I modified the sync package to optionally introduce an artificial 1s delay at the spot we care about. This is controlled by sync.ForceWait.

The code below will result in a deadlock.

Note that if we remove the extra waiter altogether, the code runs to completion after the 1s wait.

package main

import (
        "runtime"
        "sync"
        "time"
)

func main() {
        var m sync.Mutex
        cond := sync.NewCond(&m)

        // Start a waiter.
        ch := make(chan struct{})
        sync.ForceWait = true
        go func() {
                runtime.LockOSThread()
                m.Lock()
                ch <- struct{}{}
                cond.Wait()
                m.Unlock()

                ch <- struct{}{}
        }()

        <-ch
        m.Lock()
        m.Unlock()

        // We know that the waiter is in the cond.Wait() call because we
        // synchronized with it, then acquired/released the mutex it was
        // holding when we synchronized.

        go func() {
                cond.Broadcast()
        }()

        // Sleep before we wait to give Broadcast a chance to wait for the waiter.
        time.Sleep(20 * time.Millisecond)

        // Start the extra waiter, without the forced wait.
        sync.ForceWait = false
        go func() {
                m.Lock()
                cond.Wait()
                m.Unlock()
        }()

        // This will only be satisfied when the first waiter wakes up. We'll
        // deadlock if it never wakes up.
        <-ch
}

Below is the diff to the runtime and sync packages:

diff --git a/src/runtime/sema.go b/src/runtime/sema.go
index b54621b..0ea5d63 100644
--- a/src/runtime/sema.go
+++ b/src/runtime/sema.go
@@ -291,3 +291,8 @@ func syncsemcheck(sz uintptr) {
                throw("bad syncSema size")
        }
 }
+
+//go:linkname sleep sync.runtime_Usleep
+func sleep(us uint32) {
+       usleep(us)
+}
diff --git a/src/sync/cond.go b/src/sync/cond.go
index 0aefcda..a9f7b4b 100644
--- a/src/sync/cond.go
+++ b/src/sync/cond.go
@@ -10,6 +10,8 @@ import (
        "unsafe"
 )

+var ForceWait = false
+
 // Cond implements a condition variable, a rendezvous point
 // for goroutines waiting for or announcing the occurrence
 // of an event.
@@ -60,6 +62,11 @@ func (c *Cond) Wait() {
                race.Enable()
        }
        c.L.Unlock()
+
+       if ForceWait {
+               runtime_Usleep(1000000)
+       }
+
        runtime_Syncsemacquire(&c.sema)
        c.L.Lock()
 }
diff --git a/src/sync/runtime.go b/src/sync/runtime.go
index c66d2de..8874b18 100644
--- a/src/sync/runtime.go
+++ b/src/sync/runtime.go
@@ -45,3 +45,5 @@ func runtime_canSpin(i int) bool

 // runtime_doSpin does active spinning.
 func runtime_doSpin()
+
+func runtime_Usleep(usec uint32)

@wedsonaf
Contributor

Here's a proposed fix: https://go-review.googlesource.com/18892
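
For illustration only (this is not the content of that CL), the property a fix needs is: Wait must register the waiter while the caller's mutex is still held, and Broadcast must wake exactly the goroutines registered at that moment, rather than bumping a shared count that a later waiter can consume. A minimal model of that idea:

// listCond is an illustrative model, not the proposed change: each waiter
// registers its own wakeup channel before releasing the caller's mutex, so
// Broadcast wakes exactly the goroutines that were waiting when it ran, and
// a later Wait cannot steal an earlier waiter's notification.
type listCond struct {
        L       sync.Locker     // held by the caller around Wait, like sync.Cond
        mu      sync.Mutex      // protects waiters
        waiters []chan struct{} // one wakeup channel per blocked goroutine
}

func (c *listCond) Wait() {
        ch := make(chan struct{})
        c.mu.Lock()
        c.waiters = append(c.waiters, ch) // registered while c.L is still held
        c.mu.Unlock()

        c.L.Unlock()
        <-ch // even if preempted here, our channel is already on the list
        c.L.Lock()
}

func (c *listCond) Broadcast() {
        c.mu.Lock()
        for _, ch := range c.waiters {
                close(ch) // wake every goroutine registered before this call
        }
        c.waiters = nil
        c.mu.Unlock()
}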

@gopherbot

CL https://golang.org/cl/18892 mentions this issue.

@dvyukov
Member

dvyukov commented Jan 25, 2016

@wedsonaf yes, this is a bug in Cond. Have you actually hit this scenario in real life, or is it a thought experiment?

@wedsonaf
Contributor

It was a "directed" thought experiment: I was investigating an issue where it seemed like a waiting goroutine wasn't waking up on a broadcast, but it turned out to be something else.

@dvyukov dvyukov reopened this Feb 28, 2016
@golang golang locked and limited conversation to collaborators Mar 19, 2017
@rsc rsc unassigned dvyukov Jun 23, 2022