Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent map access detection in hashmap.go only catches one stack #26703

Closed
zamsden opened this issue Jul 30, 2018 · 10 comments
Closed
Milestone

Comments

@zamsden
Copy link

zamsden commented Jul 30, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.10.2 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/zamsden/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/zamsden/go"
GORACE=""
GOROOT="/tt/go1.10.2"
GOTMPDIR=""
GOTOOLDIR="/tt/go1.10.2/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build492558083=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Debugging a concurrent map race, I noticed I only ever get one stack in the list of running goroutines that could be modifying the map.

package main

import (
	"sync"
)

var x map[int]string = make(map[int]string)

func f(s string, wg *sync.WaitGroup) {
	x[0] = s
	wg.Done()
}

func g(s string, wg *sync.WaitGroup) {
	x[1] = s
	wg.Done()
}

func main() {
	for {
		var wg sync.WaitGroup
		wg.Add(2)
		go f("Hello", &wg)
		go g("Playground", &wg)
		wg.Wait()
	}
}

What did you expect to see?

Both f() and g() in the stack trace.

What did you see instead?

Only one of them is ever listed. This makes it harder to debug concurrent access, as you have to catch both racing threads in the race. The trivial example here makes it pretty obvious which threads are racing, but a rare race in production might not be debug-able without this information. Here's an example dump I get:

fatal error: concurrent map writes

goroutine 10643 [running]:
runtime.throw(0x46cda3, 0x15)
	/tt/go1.10.2/src/runtime/panic.go:616 +0x81 fp=0xc42004df50 sp=0xc42004df30 pc=0x421c81
runtime.mapassign_fast64(0x45bec0, 0xc42007a000, 0x1, 0x46fc88)
	/tt/go1.10.2/src/runtime/hashmap_fast.go:531 +0x2f6 fp=0xc42004df98 sp=0xc42004df50 pc=0x408276
main.g(0x46be1d, 0xa, 0xc4200b4320)
	/tt/f.go:15 +0x42 fp=0xc42004dfc8 sp=0xc42004df98 pc=0x44d442
runtime.goexit()
	/tt/go1.10.2/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42004dfd0 sp=0xc42004dfc8 pc=0x4474e1
created by main.main
	/tt/f.go:24 +0xbb

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc4200b432c)
	/tt/go1.10.2/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc4200b4320)
	/tt/go1.10.2/src/sync/waitgroup.go:129 +0x72
main.main()
	/tt/f.go:25 +0xc9
@randall77
Copy link
Contributor

You can get both stacks if you run with the race detector (go run -race).

We can't print both stacks normally because it is hard to stop the first writer in time. It may well have moved on to other operations by the time the second writer is able to stop it. We don't want to add extra overhead to write operations (e.g., some synchronization, or recording the stack, ...) to handle this case. This detection code is best effort; if you want reliable and complete error traces use the race detector.

@zamsden
Copy link
Author

zamsden commented Jul 31, 2018

@randall77 can you please re-open? I have a patch that fixes this and doesn't add overhead to handle this case. Getting both stacks when possible is invaluable for debugging rare production races.

Waiting on my company's legal department review of the Google contributer's agreement before I can submit the code.

@zamsden
Copy link
Author

zamsden commented Jul 31, 2018

Output with patch applied:

[zamsden@localhost tt]$ ./f 
fatal error: concurrent map writes
fatal error: concurrent map writes

goroutine 18 [running]:
runtime.throw(0x46cda3, 0x15)
	/tt/go1.10.2/src/runtime/panic.go:616 +0x81 fp=0xc420046f50 sp=0xc420046f30 pc=0x421c81
runtime.mapassign_fast64(0x45bec0, 0xc420092000, 0x1, 0x0)
	/tt/go1.10.2/src/runtime/hashmap_fast.go:602 +0x2db fp=0xc420046f98 sp=0xc420046f50 pc=0x40825b
main.g(0x46be1d, 0xa, 0xc420094000)
	/tt/f.go:15 +0x42 fp=0xc420046fc8 sp=0xc420046f98 pc=0x44d442
runtime.goexit()
	/tt/go1.10.2/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420046fd0 sp=0xc420046fc8 pc=0x4474e1
created by main.main
	/tt/f.go:24 +0xbb

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc42009400c)
	/tt/go1.10.2/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420094000)
	/tt/go1.10.2/src/sync/waitgroup.go:129 +0x72
main.main()
	/tt/f.go:25 +0xc9

goroutine 17 [running]:
	goroutine running on other thread; stack unavailable
created by main.main
	/tt/f.go:23 +0x84

goroutine 17 [running]:
runtime.throw(0x46cda3, 0x15)
	/tt/go1.10.2/src/runtime/panic.go:616 +0x81 fp=0xc420046750 sp=0xc420046730 pc=0x421c81
runtime.mapassign_fast64(0x45bec0, 0xc420092000, 0x0, 0x0)
	/tt/go1.10.2/src/runtime/hashmap_fast.go:602 +0x2db fp=0xc420046798 sp=0xc420046750 pc=0x40825b
main.f(0x46b8a2, 0x5, 0xc420094000)
	/tt/f.go:10 +0x42 fp=0xc4200467c8 sp=0xc420046798 pc=0x44d3b2
runtime.goexit()
	/tt/go1.10.2/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200467d0 sp=0xc4200467c8 pc=0x4474e1
created by main.main
	/tt/f.go:23 +0x84

@randall77
Copy link
Contributor

Sure, if you have a patch I'll reopen.
Can you describe your technique?

@randall77 randall77 reopened this Jul 31, 2018
@zamsden
Copy link
Author

zamsden commented Jul 31, 2018

@randall77

Instead of h.flags |= for adding the hashWriting flag, use h.flags = h.flags ^ hashWriting. Leave the h.flags &^= hashWriting to clear on the exit path exactly the same.

Now, two concurrent writers both cancel each other before the check that h.flags & hashWriting == 0. Ideally they both hit the check on the way out. There are still a few races that can't be detected due to single instruction race windows and the fact that the read of h.flags is non-atomic and can be pre-fetched.

In practice, this makes both goroutines simultaneously panic and generate their own stack trace together around > 50% of the time (as seen in the output above). Without this change, you will only ever get one goroutine stack trace,

This never fails to generate a concurrent access that actually happened which could have been reasonably detected (either we flip the bit to zero, and all racing goroutines fail, or with an odd number of concurrent writers, we flip the bit back to one, exactly equivalent to the current code), and never leaves behind an accidentally set bit (which is why we don't use XOR on the exit path to clear the bit - this would leave another non-racing goroutine vulnerable to hitting the bit being set on entry, generating a false positive).

It also shouldn't even generate more instructions in the emitted output; OR should simply be replaced with XOR in x86 machine code.

@zamsden
Copy link
Author

zamsden commented Jul 31, 2018

Got permission to accept the CLA and post the patch. Will upload this through the proper channel, which I understand to be go-review.googlesource.com's PolyGerrit service tomorrow. For now, here is the patch.
hashmap-stack.txt

@agnivade
Copy link
Contributor

You can also send a PR the normal way.

@ALTree ALTree changed the title Concurrent map access detection in hashmap.go only catches one stack runtime: concurrent map access detection in hashmap.go only catches one stack Jul 31, 2018
@ALTree ALTree added this to the Go1.12 milestone Jul 31, 2018
@gopherbot
Copy link

Change https://golang.org/cl/126936 mentions this issue: runtime: catch concurrent stacks more often

@ghost
Copy link

ghost commented Aug 23, 2018

I should note that the comments like

        // Set hashWriting after calling alg.hash for consistency with mapdelete
	h.flags ^= hashWriting

do not correctly describe the code after the commit.

@josharian
Copy link
Contributor

Want to send a CL changing 'Set' to 'Update'? (Or perhaps 'Toggle'?)

@golang golang locked and limited conversation to collaborators Aug 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants