runtime/race: race detector sometimes misses race on []byte copy #36794

Closed
rogpeppe opened this issue Jan 27, 2020 · 6 comments
Labels
FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done), RaceDetector
Milestone
Go1.15

Comments

@rogpeppe
Contributor

What version of Go are you using (go version)?

$ go version
go version devel +6dbcc8b865 Thu Jan 9 15:38:00 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rog/.cache/go-build"
GOENV="/home/rog/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY="github.com/heetch"
GONOSUMDB="github.com/heetch"
GOOS="linux"
GOPATH="/home/rog/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/rog/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/rog/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/m/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build752416385=/tmp/go-build -gno-record-gcc-switches"

What did you do?

When this test code runs under the race detector, I don't see a race detected, even though there is a race. When I add a sleep (set bogus to true), the race detector does find the race.

https://play.golang.org/p/ECoOELB1fC1

package test

import (
	"sync"
	"testing"
	"time"
)

const bogus = false

func corrupt(data []byte) {
	copy(data, data[2:])
}

var saved string

func TestFoo(t *testing.T) {
	data := []byte("hello there")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		saved = string(data)
		wg.Done()
	}()
	if bogus {
		time.Sleep(time.Millisecond)
	}
	corrupt(data)
	wg.Wait()
	t.Logf("saved %q; data %q", saved, data)
}
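For reference, this is with the race detector enabled, e.g.:

$ go test -race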
@myitcv
Member

myitcv commented Jan 27, 2020

cc @dvyukov

@josharian
Contributor

cc @mdempsky

@randall77
Contributor

Definitely strange. It seems very fragile. Changing the string length affects it (only lengths 10 and 11 demonstrate the bug).
Replacing the time.Sleep with a simple write to a global integer variable still reproduces the bug. Anything that calls into the race detector will do, really.
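For concreteness, the substitution might look something like this (a hypothetical sketch that reuses corrupt, saved and the imports from the test above; sink is a made-up name):

var sink int // hypothetical package-level variable; the write below stands in for the time.Sleep

func TestFooGlobalWrite(t *testing.T) {
	data := []byte("hello there")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		saved = string(data)
		wg.Done()
	}()
	sink = 1 // a plain global write in place of the time.Sleep call
	corrupt(data)
	wg.Wait()
	t.Logf("saved %q; data %q", saved, data)
}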

The calls out to the race detector from the runtime look fine, and are the same in both cases.

slicecopy write 0xc00011c090 9
slicecopy read 0xc00011c092 9
slicebytetostring read 0xc00011c090 11

I suspect something in the race detector itself. Possibly this is just a limitation of the race detector: something is colliding in the shadow space in one case and not in the other. It's not guaranteed to find all races.

@dvyukov
Member

dvyukov commented Jan 28, 2020

I suspect this is related to one of the deep limitations of the tsan runtime: it memorizes at most 4 unique previous memory accesses for each 8-byte-aligned granule of application memory (it is infeasible to remember every memory access that ever happened in an application, with full stack traces).

When all 4 slots are already occupied, tsan either replaces one of the "weaker" (for the purposes of race detection) memory accesses (e.g. an existing read when the new access is a write to the same memory), or, if there are no weaker ones, replaces one pseudo-randomly.

There are several aspects that combine here to give this stable effect in this test:

  1. One of the accesses is unaligned (with respect to the 8-byte granule).
    The slice range accesses end up in tsan's MemoryAccessRange function:
    https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_rtl_thread.cpp#L376-L405
    For aligned ranges (and the middle part of large accesses), tsan models 8-byte accesses, so these take just 1 of the 4 available slots. However, for an unaligned start/end it simply models them as N 1-byte accesses, so these may consume lots of slots.

  2. The unaligned offset is 2, so tsan models 6 individual 1-byte accesses to the beginning of the slice. These probably evict everything (the writes that should later be detected as racing).
    There is potential for improvement here: tsan could model this 6-byte range as one 2-byte access plus one 4-byte access instead, which would take only 2 slots (it supports only 1/2/4/8-byte accesses).

  3. The exact behavior of the pseudo-random eviction (potentially the writes might not have been evicted even by the 6 subsequent reads). That's probably why inserting even one more access surfaces the race.

  4. The fact that slicecopy first does racewriterangepc followed by racereadrangepc. As a result, the split reads evict the previous writes.
    I think this is a "bug" (for some definition of bug, of course, because it was totally non-obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (minus a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access (slicebytetostring) is also a read, and reads don't race with reads, the race is not detected. If we keep the writes, it should be detected. (A worked example with the addresses from the trace above follows this list.)

  5. As a bonus point, if the ranges overlap, we could instrument the overlapping part only as a write (which is also faster).
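To make points 1, 2 and 4 concrete with the addresses from the trace above (a rough reading; the slot counts are approximate):

data starts at 0xc00011c090, which is 8-byte aligned, so the first shadow granule covers 0x...90 through 0x...97.

slicecopy write 0xc00011c090 9 — aligned start: roughly one 8-byte access for the first granule plus a 1-byte access for the trailing byte, so about one slot in the first granule.
slicecopy read 0xc00011c092 9 — unaligned start: 6 individual 1-byte accesses for 0x...92 through 0x...97, which compete for the 4 slots of the first granule and can evict the write recorded just before.
slicebytetostring read 0xc00011c090 11 (the other goroutine) — by the time this read is checked against the first granule, the racing write may already have been evicted; what remains are reads, and reads don't race with reads, so nothing is reported.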

Other range instrumentation sites should be audited for similar cases too (points 4 and 5).

@rogpeppe thanks for the nice test case! False positives are worse, but they are always loud and do not go unnoticed, whereas false negatives very frequently stay unnoticed, in large quantities, for a long time, because they are silent. So each of them is in some sense way more important, and always requires some manual attention and a bit of healthy mistrust in things that are "known to work" :)

@randall77
Contributor

I think this is a "bug" (for some definition of bug, of course, because it was totally non-obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (minus a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access (slicebytetostring) is also a read, and reads don't race with reads, the race is not detected. If we keep the writes, it should be detected.

Yes, I noticed this as well, and swapping the order fixes this particular issue. I didn't understand why; now I do.
I'll submit a CL for that for 1.15.
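For reference, a simplified sketch of what the reordered race-detector hooks in slicecopy could look like (not the actual CL; the parameter names here are approximate):

if raceenabled {
	callerpc := getcallerpc()
	pc := funcPC(slicecopy)
	// Report the read of the source range first and the write of the
	// destination range last, so the write entries are the most recently
	// recorded ones in the shadow slots and survive when the ranges overlap.
	racereadrangepc(fromPtr, size, callerpc, pc)
	racewriterangepc(toPtr, size, callerpc, pc)
}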

@cagedmantis cagedmantis added this to the Go1.15 milestone Feb 3, 2020
@cagedmantis cagedmantis added the NeedsFix (the path to resolution is known, but the work has not been done) label Feb 3, 2020
@gopherbot

Change https://golang.org/cl/220685 mentions this issue: runtime: reorder race detector calls in slicecopy

@golang golang locked and limited conversation to collaborators Mar 1, 2021