Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/race: potential false positive from race detector #39052

Open
EVODelavega opened this issue May 13, 2020 · 6 comments
Open

runtime/race: potential false positive from race detector #39052

EVODelavega opened this issue May 13, 2020 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Milestone

Comments

@EVODelavega
Copy link

I've gone over this so many times, thinking I must've missed something, but I do believe I have found a case where the race detector returns a false positive (ie data race where there really isn't a data race). It seems to be something that happens when writing to a channel in a select-case statement directly.

The unit tests trigger the race detector, even though I'm ensuring all calls accessing the channel have been made using a callback and a waitgroup.

I have the channels in a map, which I access through a mutex. The data race vanishes the moment I explicitly remove the type that holds the channel from this map. The only way I am able to do is because the mutex is released, so once again: I'm certain everything behaves correctly. Code below

What version of Go are you using (go version)?

$ go version

go version go1.14.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/XXXX/.cache/go-build"
GOENV="/home/XXXX/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/XXXX/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/XXXX/projects/datarace/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build729449265=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm writing a simple message event bus where the broker pushes data onto a channel of its subscribers/consumers. If the channel buffer is full, I don't want the broker to block, so I'm using routines, and a select statement to skip writes to a channel with a full buffer. To make life easier WRT testing, I'm mocking a subscriber interface, and I'm exposing the channels through functions (similar to context.Context.Done() and the like).

My tests all pass, and everything behaves as expected. However, running the same tests with the race detector, I'm getting what I believe to be a false positive. I have a test where I send data to a subscriber that isn't consuming the messages. The channel buffer is full, and I want to ensure that the broker doesn't block. To make sure I've tried to send all data, I'm using a waitgroup to check if the subscriber has indeed been accessed N number of times (where N is the number of events I'm sending). Once the waitgroup is done, I validate what data is on the channel, make sure it's empty, and then close it. The statement where I close the channel is marked as a data race.

If I do the exact same thing, but remove the subscriber from the broker, the data race magically is no more. Here's the code to reproduce the issue:

broker.go

package broker

import (
	"context"
	"log"
	"sync"
)

//go:generate go run github.com/golang/mock/mockgen -destination mocks/sub_mock.go -package mocks my.pkg/race/broker Sub
type Sub interface {
	C() chan<- interface{}
	Done() <-chan struct{}
}

type Broker struct {
	mu   sync.Mutex
	ctx  context.Context
	subs map[int]Sub
	keys []int
}

func New(ctx context.Context) *Broker {
	return &Broker{
		ctx:  ctx,
		subs: map[int]Sub{},
		keys: []int{},
	}
}

func (b *Broker) Send(v interface{}) {
	b.mu.Lock()
	go func() {
		rm := make([]int, 0, len(b.subs))
		defer func() {
			if len(rm) > 0 {
				b.unsub(rm...)
			}
			b.mu.Unlock()
		}()
		for k, s := range b.subs {
			select {
			case <-b.ctx.Done():
				return
			case <-s.Done():
				rm = append(rm, k)
			case s.C() <- v:
				continue
			default:
				log.Printf("Skipped sub %d", k)
			}
		}
	}()
}

func (b *Broker) Subscribe(s Sub) int {
	b.mu.Lock()
	k := b.key()
	b.subs[k] = s
	b.mu.Unlock()
	return k
}

func (b *Broker) Unsubscribe(k int) {
	b.mu.Lock()
	b.unsub(k)
	b.mu.Unlock()
}

func (b *Broker) key() int {
	if len(b.keys) > 0 {
		k := b.keys[0]
		b.keys = b.keys[1:]
		return k
	}
	return len(b.subs) + 1
}

func (b *Broker) unsub(keys ...int) {
	for _, k := range keys {
		if _, ok := b.subs[k]; !ok {
			return
		}
		delete(b.subs, k)
		b.keys = append(b.keys, k)
	}
}

broker_test.go

package broker_test

import (
	"context"
	"sync"
	"testing"

	"my.pkg/race/broker"
	"my.pkg/race/broker/mocks"

	"github.com/golang/mock/gomock"
	"github.com/tj/assert"
)

type tstBroker struct {
	*broker.Broker
	cfunc context.CancelFunc
	ctx   context.Context
	ctrl  *gomock.Controller
}

func getBroker(t *testing.T) *tstBroker {
	ctx, cfunc := context.WithCancel(context.Background())
	ctrl := gomock.NewController(t)
	return &tstBroker{
		Broker: broker.New(ctx),
		cfunc:  cfunc,
		ctx:    ctx,
		ctrl:   ctrl,
	}
}

func TestRace(t *testing.T) {
	broker := getBroker(t)
	defer broker.Finish()
	sub := mocks.NewMockSub(broker.ctrl)
	cCh, dCh := make(chan interface{}, 1), make(chan struct{})
	vals := []interface{}{1, 2, 3}
	wg := sync.WaitGroup{}
	wg.Add(len(vals))
	sub.EXPECT().Done().Times(len(vals)).Return(dCh)
	sub.EXPECT().C().Times(len(vals)).Return(cCh).Do(func() {
		wg.Done()
	})
	k := broker.Subscribe(sub)
	assert.NotZero(t, k)
	for _, v := range vals {
		broker.Send(v)
	}
	wg.Wait()
	// I've tried to send all 3 values, channels should be safe to close now
	close(dCh)
	// channel had buffer of 1, so first value should be present
	assert.Equal(t, vals[0], <-cCh)
	// other values should be skipped due to default select
	assert.Equal(t, 0, len(cCh))
	close(cCh)
}

func TestNoRace(t *testing.T) {
	broker := getBroker(t)
	defer broker.Finish()
	sub := mocks.NewMockSub(broker.ctrl)
	cCh, dCh := make(chan interface{}, 1), make(chan struct{})
	vals := []interface{}{1, 2, 3}
	wg := sync.WaitGroup{}
	wg.Add(len(vals))
	sub.EXPECT().Done().Times(len(vals)).Return(dCh)
	sub.EXPECT().C().Times(len(vals)).Return(cCh).Do(func() {
		wg.Done()
	})
	k := broker.Subscribe(sub)
	assert.NotZero(t, k)
	for _, v := range vals {
		broker.Send(v)
	}
	wg.Wait()
	// I've tried to send all 3 values, channels should be safe to close now
	close(dCh)
	// channel had buffer of 1, so first value should be present
	assert.Equal(t, vals[0], <-cCh)
	// other values should be skipped due to default select
	assert.Equal(t, 0, len(cCh))
	// add this line, and data race magically vanishes
	broker.Unsubscribe(k)
	close(cCh)
}

func (b *tstBroker) Finish() {
	b.cfunc()
	b.ctrl.Finish()
}

See the data race by running: go test -v -race ./broker/... -run TestRace

What did you expect to see?

I expect to see log output showing that the subscriber was skipped twice (output I do indeed see), and no data race

What did you see instead?

I still saw the code behaved as expected, but I do see a data race reported:

go test -v -race ./broker/...  -run TestRace                                                                                                                                                                                                                         
=== RUN   TestRace
2020/05/13 16:24:06 Skipped sub 1
2020/05/13 16:24:06 Skipped sub 1
==================
WARNING: DATA RACE
Write at 0x00c00011a4f0 by goroutine 7:
  runtime.closechan()
      /usr/lib/golang/src/runtime/chan.go:335 +0x0
  my.pkg/race/broker_test.TestRace()
      /home/XXXX/projects/race/broker/stuff_test.go:56 +0x7c8
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:991 +0x1eb

Previous read at 0x00c00011a4f0 by goroutine 10:
  runtime.chansend()
      /usr/lib/golang/src/runtime/chan.go:142 +0x0
  my.pkg/race/broker.(*Broker).Send.func1()
      /home/XXXX/projects/race/broker/stuff.go:46 +0x369

Goroutine 7 (running) created at:
  testing.(*T).Run()
      /usr/lib/golang/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /usr/lib/golang/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /usr/lib/golang/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /usr/lib/golang/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:48 +0x223

Goroutine 10 (finished) created at:
  my.pkg/race/broker.(*Broker).Send()
      /home/XXXX/projects/race/broker/stuff.go:32 +0x70
  my.pkg/race/broker_test.TestRace()
      /home/XXXX/projects/race/broker/stuff_test.go:47 +0x664
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:991 +0x1eb
==================
    TestRace: testing.go:906: race detected during execution of test
--- FAIL: TestRace (0.00s)
    : testing.go:906: race detected during execution of test
FAIL
FAIL	my.pkg/race/broker	0.009s
?   	my.pkg/race/broker/mocks	[no test files]

Though I'm not certain, my guess is that the expression s.C() <- v, because it's a case expression, is what trips the race detector up here. The channel buffer is full, so any writes would be blocking if I'd put the channel write in the default case. As it stands, the write cannot possibly be executed, so instead my code logs the fact that a subscriber is being skipped, the routine ends (defer func unlocks the mutex), and the mock callback decrements the waitgroup. Once the waitgroup is empty, all calls to my mock subscriber have been made, and the channel can be safely closed.

It seems, however, that I need to add the additional call, removing the mock from the broker to "reset" the race detector state. I'll try and have a look at the source, maybe something jumps out.

@ALTree ALTree added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 13, 2020
@ALTree ALTree added this to the Backlog milestone May 13, 2020
@ALTree ALTree changed the title Potential false positive from race detector runtime/race: potential false positive from race detector May 13, 2020
@randall77
Copy link
Contributor

That looks like a race between reading the length of the channel and writing to the channel. That looks like a correct report to me.

@EVODelavega
Copy link
Author

@randall77 Thanks for looking at this. Commenting out the assertion on the channel length doesn't change things, so I don't think that's what is causing the report. I could be wrong, obviously, but considering what statements are being referred to in the report, len(cCh) doesn't show up

@ianlancetaylor
Copy link
Contributor

It would help if you could provide complete stand-alone code for the problem, and if the race report that you show corresponds to that code.

That said, I see that (*Broker).Send starts a goroutine. In TestRace I see that the code calls (*Broker).Send three times. That starts three goroutines. If I'm reading your mocking code correctly, it calls wg.Done as soon as the calling code calls s.C, which is to say as soon as the goroutine enters the select statement. So the goroutine calls wg.Done when it enters the select statement, but before it chooses any case from the select statement. Then TestRace reads from the channel. While I can't claim to understand this code, it seems possible that after reading from the channel, the select statement sends to the channel. And then TestRace closes the channel. There is no synchronization between that send and that close, so you get a race report.

@randall77
Copy link
Contributor

The len(cCh) is the access that caused the race in the report you showed us. At least, assuming broker_test.go you gave us and stuff_test.go you ran with are identical.
If there is still a race without that line, show us that updated race report.
(It will probably be on the close(cCh) because that also races with the channel send.)

@EVODelavega
Copy link
Author

EVODelavega commented May 14, 2020

@ianlancetaylor The mock .Do() function that calls wg.Done() is invoked after the mock has returned the channel. I've created a repo with the code so it's easy to recreate the issue: https://github.com/EVODelavega/go-race

The channel is closed after the wg is done, so after the channel was returned to the 3 routines. The routines run sequentially by definition (each routine requires a mutex lock), so only the last call to Send could be the time a data race happens. Then again, the wg.Wait call makes sure that the routine has indeed called sub.C(). Between my waiting for the waitgroup, and my closing the channel, I close the done channel, read from the channel, check its length, etc... and still, the data race is reported. The only way to get around it is to make an explicit call to the broker, removing the mock, which acquires a lock, and thus wait for the go routine to return.

@randall77 As you can see in the repo I created: the code to reproduce the issue is indeed identical to what I posted here. the len(cCh) statement is 100% not a data race, just looking at the language specifications:

A single channel may be used in send statements, receive operations, and calls to the built-in functions cap and len by any number of goroutines without further synchronization. Channels act as first-in-first-out queues. For example, if one goroutine sends values on a channel and a second goroutine receives them, the values are received in the order sent.

checking length and cap of a channel doesn't require synchronisation, so there is no data race possible there. The closing of the channel is where the race detector takes issue with, unless I remove the mock from the broker first. The issue I have is that: the behaviour of the unit tests, and indeed the broker are 100% deterministic. I use the waitgroup for synchronisation, and close(dCh) is not flagged up. Of course, the routines only read from the done channel, but looking at the coverage rapport: this case is never selected anyway, and instead I see the broker outputs "Skipped broker 1" twice (which tells me the select executed the default case).

Try as I like, to me there's only 2 ways around this issue:

  • After the wg.Wait() statement, I have to make a second call to the broker (Unsubscribe or Subscribe), which acquires a mutex lock, ensuring the routines have returned, or call Send a fourth time after closing the dCh. In both cases, it really does feel like I'm appeasing the race detector, because -as I keep saying- the behaviour of the code is deterministic to the best of my knowledge.
  • Never close the channel, but that's just bad form

@randall77
Copy link
Contributor

the len(cCh) statement is 100% not a data race, just looking at the language specifications:

You're right, this is not a race. Why then did it report the line number of the len call?
Your repro reports the close call on the next line.

I agree with Ian. The compiler converts:

select {
case ...:
case s.C() <- v:
}

To

tmp := s.C()
select {
case ...:
case tmp <- v:
}

In between the assignment and the select statement, the wait group can be decremented to 0 (inside the 3rd C call), allowing the main routine to run to the close statement and cause a race with the select case.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Projects
Status: Triage Backlog
Development

No branches or pull requests

5 participants