runtime/race: race detector sometimes misses race on []byte copy #36794

Closed
rogpeppe opened this issue Jan 27, 2020 · 6 comments
Labels
FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done), RaceDetector
Milestone
Go1.15

Comments

@rogpeppe
Contributor

What version of Go are you using (go version)?

$ go version
go version devel +6dbcc8b865 Thu Jan 9 15:38:00 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rog/.cache/go-build"
GOENV="/home/rog/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY="github.com/heetch"
GONOSUMDB="github.com/heetch"
GOOS="linux"
GOPATH="/home/rog/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/rog/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/rog/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/m/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build752416385=/tmp/go-build -gno-record-gcc-switches"

What did you do?

When this test code runs under the race detector, I don't see a race detected, even though there is a race. When I add a sleep (set bogus to true), the race detector does find the race.

https://play.golang.org/p/ECoOELB1fC1

package test

import (
	"sync"
	"testing"
	"time"
)

const bogus = false

func corrupt(data []byte) {
	copy(data, data[2:])
}

var saved string

func TestFoo(t *testing.T) {
	data := []byte("hello there")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		saved = string(data)
		wg.Done()
	}()
	if bogus {
		time.Sleep(time.Millisecond)
	}
	corrupt(data)
	wg.Wait()
	t.Logf("saved %q; data %q", saved, data)
}
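For reference, this is with the race detector enabled, e.g.:

$ go test -race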
@myitcv
Member

myitcv commented Jan 27, 2020

cc @dvyukov

@josharian
Contributor

cc @mdempsky

@randall77
Contributor

Definitely strange. It seems very fragile. Changing the string length affects it (only lengths 10 and 11 demonstrate the bug).
Replacing the time.Sleep with a simple write to a global integer variable still reproduces the bug. Anything that calls into the race detector will do, really.
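For concreteness, the substitution might look something like this (a hypothetical sketch that reuses corrupt, saved and the imports from the test above; sink is a made-up name):

var sink int // hypothetical package-level variable; the write below stands in for the time.Sleep

func TestFooGlobalWrite(t *testing.T) {
	data := []byte("hello there")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		saved = string(data)
		wg.Done()
	}()
	sink = 1 // a plain global write in place of the time.Sleep call
	corrupt(data)
	wg.Wait()
	t.Logf("saved %q; data %q", saved, data)
}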

The calls out to the race detector from the runtime look fine, and are the same in both cases.

slicecopy write 0xc00011c090 9
slicecopy read 0xc00011c092 9
slicebytetostring read 0xc00011c090 11

I suspect something in the race detector itself. Possibly this is just a limitation of the race detector: something is colliding in the shadow space in one case and not in the other. It's not guaranteed to find all races.

@dvyukov
Member

dvyukov commented Jan 28, 2020

I suspect this is related to one of the deep limitations of the tsan runtime: it memorizes at most 4 unique previous memory accesses for each 8-byte-aligned granule of application memory (it is infeasible to remember every memory access that ever happened in an application, with full stack traces).

When all 4 slots are already occupied, tsan either replaces one of the "weaker" (for the purposes of race detection) memory accesses (e.g. an existing read when the new access is a write to the same memory), or, if there are no weaker ones, replaces one pseudo-randomly.

There are several aspects that combine here to give this stable effect in this test:

  1. One of the accesses is unaligned (with respect to the 8-byte granule).
    The slice range accesses end up in tsan's MemoryAccessRange function:
    https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_rtl_thread.cpp#L376-L405
    For aligned ranges (and the middle part of large accesses), tsan models 8-byte accesses, so these take just 1 of the 4 available slots. However, for an unaligned start/end it simply models them as N 1-byte accesses, so these may consume lots of slots.

  2. The unaligned offset is 2, so tsan models 6 individual 1-byte accesses to the beginning of the slice. These probably evict everything (the writes that should later be detected as racing).
    There is potential for improvement here: tsan could model this 6-byte range as one 2-byte access plus one 4-byte access instead, which would take only 2 slots (it supports only 1/2/4/8-byte accesses).

  3. The exact behavior of the pseudo-random eviction (potentially the writes might not have been evicted even by the 6 subsequent reads). That's probably why inserting even one more access surfaces the race.

  4. The fact that slicecopy first does racewriterangepc followed by racereadrangepc. As a result, the split reads evict the previous writes.
    I think this is a "bug" (for some definition of bug, of course, because it was totally non-obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (minus a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access (slicebytetostring) is also a read, and reads don't race with reads, the race is not detected. If we keep the writes, it should be detected. (A worked example with the addresses from the trace above follows this list.)

  5. As a bonus point, if the ranges overlap, we could instrument the overlapping part only as a write (which is also faster).
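To make points 1, 2 and 4 concrete with the addresses from the trace above (a rough reading; the slot counts are approximate):

data starts at 0xc00011c090, which is 8-byte aligned, so the first shadow granule covers 0x...90 through 0x...97.

slicecopy write 0xc00011c090 9 — aligned start: roughly one 8-byte access for the first granule plus a 1-byte access for the trailing byte, so about one slot in the first granule.
slicecopy read 0xc00011c092 9 — unaligned start: 6 individual 1-byte accesses for 0x...92 through 0x...97, which compete for the 4 slots of the first granule and can evict the write recorded just before.
slicebytetostring read 0xc00011c090 11 (the other goroutine) — by the time this read is checked against the first granule, the racing write may already have been evicted; what remains are reads, and reads don't race with reads, so nothing is reported.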

Other range instrumentation sites should be audited for similar cases too (points 4 and 5).

@rogpeppe thanks for the nice test case! False positives are worse, but they are always loud and do not go unnoticed, whereas false negatives very frequently stay unnoticed, in large quantities, for a long time, because they are silent. So each of them is in some sense way more important, and always requires some manual attention and a bit of healthy mistrust in things that are "known to work" :)

@randall77
Contributor

I think this is a "bug" (for some definition of bug, of course, because it was totally non-obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (minus a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access (slicebytetostring) is also a read, and reads don't race with reads, the race is not detected. If we keep the writes, it should be detected.

Yes, I noticed this as well, and swapping the order fixes this particular issue. I didn't understand why; now I do.
I'll submit a CL for that for 1.15.
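For reference, a simplified sketch of what the reordered race-detector hooks in slicecopy could look like (not the actual CL; the parameter names here are approximate):

if raceenabled {
	callerpc := getcallerpc()
	pc := funcPC(slicecopy)
	// Report the read of the source range first and the write of the
	// destination range last, so the write entries are the most recently
	// recorded ones in the shadow slots and survive when the ranges overlap.
	racereadrangepc(fromPtr, size, callerpc, pc)
	racewriterangepc(toPtr, size, callerpc, pc)
}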

@cagedmantis cagedmantis added this to the Go1.15 milestone Feb 3, 2020
@cagedmantis cagedmantis added the NeedsFix (the path to resolution is known, but the work has not been done) label Feb 3, 2020
@gopherbot

Change https://golang.org/cl/220685 mentions this issue: runtime: reorder race detector calls in slicecopy

@golang golang locked and limited conversation to collaborators Mar 1, 2021