-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: fatal error: sweep increased allocation count, go1.9.x #22781
Comments
If the code triggers a race condition warning, then there's a race condition in the code and the program is not valid. Can you explain why do you believe that the reported race is not actually a race? Are you suggesting there's a bug in the race detector? |
Nevertheless, Go tries not to crash just because there's a data race. (Go doesn't makes as many promises as Java in the face of a data race, but it does try not to crash.) |
These path datatypes represent the moves a chess piece can make on the board. The function Those path generating functions are called multiple times in these Board methods:
The application code (such as what happens when a player requests what moves are currently available for all pieces via a HTTP GET request) calls into these methods after reading the latest representation of the board from the database without sharing these data structures - each request is one goroutine, and the only shared memory is the postgres database and game state diffs sent to a websocket goroutine for each player currently on the game page. This shared memory for sending diffs is not pointed to in these race conditions.
The race is that an AbsPath was written into memory, then Truncated is read. In this case the writing goroutine is one of those listening goroutines - it has put together a representation of all moves possible for the board state because it is determining if a checkmate state exists and would have returned if so (the goroutine received a notification that a move was made, or has just started and is checking initial conditions). The reading goroutine is a player requesting what all moves are for the current game state, likely because they just moved and need the next move set. There is no shared memory here besides the base representation of the moves for each piece, which is used to create new AbsPaths and is not referenced outside of
I've asked in the golang-nuts thread for a review of the
The server does not crash but does indicate a data race if I simulate four games or less (on a four core processor) but does crash within 30 seconds if the number of games increases. My test client is making moves and requesting available moves as fast as possible (~5 milliseconds per request) unsynchronized except for the interlock of playing the game. |
Crash on go1.9.2 darwin/amd64 testing localhost (macOS 10.12.6) with client count 40:
Full trace: https://gist.github.com/pciet/37e676ab8cbeac40bf3c4b0daed7dca4 No crash for two games (four clients). Crashes on three games (six clients): https://gist.github.com/pciet/8c2b29c07fd548254b1bc3b02984a0ba Crashes on four games (eight clients):
Full trace: https://gist.github.com/pciet/8189ffbb66bc46025186ceebccf2f6d6 This is a two core machine. |
The race detector has very very few false positives. It is telling you that there is no synchronization between the code that creates the |
The GC keeps count of how many objects have been allocated since the last
GC and how many were live at the last GC. This puts an upper bound on the
number of live objects the GC expects to find live at the next GC. When the
GC sweeps the span it checks to see if this upper bound was exceeded. In
the past we have seen this check triggered by races resulting in pointers
being hidden from the GC. We have also seen it when there is unsafe casting
of pointers to uintptrs that the GC does not recognize as valid roots. And,
of course, due to bugs in the GC. The first step is to get a minimized race
free reproducer that doesn't use unsafe constructs.
…On Sun, Nov 19, 2017 at 12:24 AM, Ian Lance Taylor ***@***.*** > wrote:
The race detector has very very few false positives. It is telling you
that there is no synchronization between the code that creates the AbsPath
value and the code that reads the Truncated field from it. You should
look carefully at your locking.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#22781 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA7WnyIGablKVqaG1gkdtGQieWYjil7bks5s37uCgaJpZM4QiKdE>
.
|
Here's a simplified program that removes the application code and triggers the race and crash readily. The package wichessing does not contain the keyword 'go' and is an exact duplicate of the code in the original repository - the computation should have no concurrency. The only shared memory are global variables that are read and not written. My understanding is that there are no unsafe constructs. https://github.com/pciet/debuggo22781 package main
import (
"github.com/pciet/debuggo22781/wichessing"
)
const clients = 6
func main() {
for i := 0; i < clients; i++ {
go simulateClient()
}
select {} // block forever
}
func simulateClient() {
for {
board := wichessing.Board{}
for i := 0; i < 64; i++ {
board[i] = wichessing.Point{
AbsPoint: wichessing.AbsPoint{wichessing.FileFromIndex(uint8(i)), wichessing.RankFromIndex(uint8(i))},
}
}
setChessStart(&board)
_, _, _ = board.Moves(wichessing.White, wichessing.AbsPoint{0, 8}, wichessing.AbsPoint{0, 8})
}
}
func setChessStart(b *wichessing.Board) {
b[0].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.White,
Base: wichessing.Rook,
}
b[1].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.White,
Base: wichessing.Knight,
}
b[2].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.White,
Base: wichessing.Bishop,
}
b[3].Piece = &wichessing.Piece{
Kind: wichessing.Queen,
Orientation: wichessing.White,
Base: wichessing.Queen,
}
b[4].Piece = &wichessing.Piece{
Kind: wichessing.King,
Orientation: wichessing.White,
Base: wichessing.King,
}
b[5].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.White,
Base: wichessing.Bishop,
}
b[6].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.White,
Base: wichessing.Knight,
}
b[7].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.White,
Base: wichessing.Rook,
}
for i := 0; i < 8; i++ {
b[8+i].Piece = &wichessing.Piece{
Kind: wichessing.Pawn,
Orientation: wichessing.White,
Base: wichessing.Pawn,
}
}
b[63].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.Black,
Base: wichessing.Rook,
}
b[62].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.Black,
Base: wichessing.Knight,
}
b[61].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.Black,
Base: wichessing.Bishop,
}
b[60].Piece = &wichessing.Piece{
Kind: wichessing.King,
Orientation: wichessing.Black,
Base: wichessing.King,
}
b[59].Piece = &wichessing.Piece{
Kind: wichessing.Queen,
Orientation: wichessing.Black,
Base: wichessing.Queen,
}
b[58].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.Black,
Base: wichessing.Bishop,
}
b[57].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.Black,
Base: wichessing.Knight,
}
b[56].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.Black,
Base: wichessing.Rook,
}
for i := 0; i < 8; i++ {
b[63-16+i].Piece = &wichessing.Piece{
Kind: wichessing.Pawn,
Orientation: wichessing.Black,
Base: wichessing.Pawn,
}
}
for i := 0; i < 64; i++ {
if b[i].Piece == nil {
continue
}
p := b[i].Piece.SetKindFlags()
b[i].Piece = &p
}
} The race and crash are easy to trigger on my machine, although the crash signature is different.
Races and crash trace here: https://gist.github.com/pciet/9c2251324184adbe3fdd25e6ded9160b |
The race should be fixed first to see if that resolves the issue. One bug
at a time. The GC is concurrent and it is not hard to imagine that an
application race could hide pointers from the GC.
…On Sun, Nov 19, 2017 at 1:18 PM, Matt Juran ***@***.***> wrote:
Here's a simplified program that removes the application code and triggers
the race and crash readily. The package wichessing does not contain the
keyword 'go' and is an exact duplicate of the code in the original
repository - the computation should have no concurrency. The only shared
memory are global variables that are read and not written. My understanding
is that there are no unsafe constructs.
https://github.com/pciet/debuggo22781
package main
import (
"github.com/pciet/debuggo22781/wichessing"
)
const clients = 6
func main() {
for i := 0; i < clients; i++ {
go simulateClient()
}
select {} // block forever
}
func simulateClient() {
for {
board := wichessing.Board{}
for i := 0; i < 64; i++ {
board[i] = wichessing.Point{
AbsPoint: wichessing.AbsPoint{wichessing.FileFromIndex(uint8(i)), wichessing.RankFromIndex(uint8(i))},
}
}
setChessStart(&board)
_, _, _ = board.Moves(wichessing.White, wichessing.AbsPoint{0, 8}, wichessing.AbsPoint{0, 8})
}
}
func setChessStart(b *wichessing.Board) {
b[0].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.White,
Base: wichessing.Rook,
}
b[1].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.White,
Base: wichessing.Knight,
}
b[2].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.White,
Base: wichessing.Bishop,
}
b[3].Piece = &wichessing.Piece{
Kind: wichessing.Queen,
Orientation: wichessing.White,
Base: wichessing.Queen,
}
b[4].Piece = &wichessing.Piece{
Kind: wichessing.King,
Orientation: wichessing.White,
Base: wichessing.King,
}
b[5].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.White,
Base: wichessing.Bishop,
}
b[6].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.White,
Base: wichessing.Knight,
}
b[7].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.White,
Base: wichessing.Rook,
}
for i := 0; i < 8; i++ {
b[8+i].Piece = &wichessing.Piece{
Kind: wichessing.Pawn,
Orientation: wichessing.White,
Base: wichessing.Pawn,
}
}
b[63].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.Black,
Base: wichessing.Rook,
}
b[62].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.Black,
Base: wichessing.Knight,
}
b[61].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.Black,
Base: wichessing.Bishop,
}
b[60].Piece = &wichessing.Piece{
Kind: wichessing.King,
Orientation: wichessing.Black,
Base: wichessing.King,
}
b[59].Piece = &wichessing.Piece{
Kind: wichessing.Queen,
Orientation: wichessing.Black,
Base: wichessing.Queen,
}
b[58].Piece = &wichessing.Piece{
Kind: wichessing.Bishop,
Orientation: wichessing.Black,
Base: wichessing.Bishop,
}
b[57].Piece = &wichessing.Piece{
Kind: wichessing.Knight,
Orientation: wichessing.Black,
Base: wichessing.Knight,
}
b[56].Piece = &wichessing.Piece{
Kind: wichessing.Rook,
Orientation: wichessing.Black,
Base: wichessing.Rook,
}
for i := 0; i < 8; i++ {
b[63-16+i].Piece = &wichessing.Piece{
Kind: wichessing.Pawn,
Orientation: wichessing.Black,
Base: wichessing.Pawn,
}
}
for i := 0; i < 64; i++ {
if b[i].Piece == nil {
continue
}
p := b[i].Piece.SetKindFlags()
b[i].Piece = &p
}
}
The race and crash are easy to trigger on my machine, although the crash
signature is different.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1 pc=0x10f9884]
Races and crash trace here: https://gist.github.com/pciet/
9c2251324184adbe3fdd25e6ded9160b
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#22781 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA7Wn1XGYy7fUT0jTs-5hMBIgzOB_nEpks5s4HEIgaJpZM4QiKdE>
.
|
I am making the claim that there is no fixable race in this program as each goroutine is independent by only reading from shared memory (no shared writes). The shared reads are in the first two functions of wichessing/path.go where written memory is created independently for each calling goroutine (this is also where the race is indicated). The shared memory is read for the for loop computation and otherwise unused (len, addition, then an addition or subtraction). The written memory is not shared between goroutines. |
Normally when somebody blames the race detector, we're all very skeptical (see @ianlancetaylor's comment earlier), and I admit (from looking at this a few minutes) that the race detector isn't super helpful here. Notes: (all on Linux) -- there is only one "go" statement in the code. The missing stacks are usually lost:
When they're not lost, they point to:
Where line 19 is: // This function will truncate moves that leave the board and convert the defined relative points to actual board points but does \
not remove points with other pieces or other similar constraints.
func TruncatedAbsPathsForKind(the Kind, from AbsPoint, with Orientation) AbsPathSetMap {
absmap := make(AbsPathSetMap)
for movetype, paths := range RelPathMapForKind(the) {
availablepaths := make(AbsPathSet)
for path, _ := range paths {
availablepath := AbsPath{ // <---- Line 19
Points: make([]AbsPoint, 0, len(*path)),
} I highly suspect the problem is this line: availablepath.Points = append(availablepath.Points, AbsPoint{File: uint8(absfile), Rank: uint8(absrank)}) Appends on shared memory are always a red flag in concurrent code. I'm busy now for a few hours, but I can test that theory later unless others beat me to it. |
(although I'm not convinced that memory is shared yet, either.) The append was just my next area for investigation. |
I also noticed this, so I started bisecting. The first commit that triggers a race report on the reproducer above is ec091b6 (runtime: add mapassign_fast*). It's possible that the race is real, but it's triggered by bad code generation and not by an error in the Go code itself. Next step: investigate changes in code generation introduced in the reproducer by that commit. |
Here's what's happening in that trace (reversed):
The main goroutine put together the initial chess board state and then asks for all available moves for both players on that board. Each of the goroutines created is working on an independent board.
Before determining moves it looks for a checkmate state.
If the king is not in check then its not a checkmate. // board_check.go
46 // some cases cause reactions that remove the King, such as an enemy detonator move and friendly guard adjacent
47 _, has := b.AfterMove(orig, *pt, orientation, previousFrom, previousTo).KingLocationFor(turn)
48 if has == false {
49 return true
50 } The loop this is in is looking at every move the opponent can make and returning true if the king can be taken directly or indirectly (check). This AfterMove call is used to look for indirect takes.
What does the board look like after a move is executed? The caller of AfterMove iterates over this looking for the king location.
Execute the move for the board returned by AfterMove. The call in Move is verifying the requested move is legal. Everything computed in here is discarded: // board_move.go
126 if b.MovesFromPoint(fromPoint, previousFrom, previousTo).Has(to) == false {
127 return PointSet{}, pieceset
128 }
What moves are available from this square on the board?
What actual paths are there for this kind of piece at this square on the board?
The written memory in TruncatedAbsPathsForKind is discarded after two returns in this case. I do not believe those global variables make it past the reads in this function - the returned memory (written serially by a number of appends) is isolated in this call stack. |
This smaller program crashes but does not race: package main
import (
"github.com/pciet/debuggo22781/wichessing"
)
const clients = 6
func main() {
for i := 0; i < clients; i++ {
go func() {
for {
// in wichessing/path.go
_ = wichessing.TruncatedAbsPathsForKind(wichessing.Pawn, wichessing.AbsPoint{0, 1}, wichessing.White)
}
}()
}
select {} // block forever
} https://gist.github.com/pciet/8ad292c81bdae9046bf5484797b333f2
type AbsPathSet map[*AbsPath]struct{}
type AbsPath struct {
Points []AbsPoint
Truncated bool
}
// Absolute Point represents a specific point on the board.
type AbsPoint struct {
File uint8
Rank uint8
}
// The PathType is used for pieces with varying moves; the pawn is the base chess example with different first move, take moves, and regular moves.
type PathType int
const (
First PathType = iota
Move
Take
RallyMove
)
type AbsPathSetMap map[PathType]AbsPathSet
type RelPathSetMap map[PathType]RelPathSet
// All relative paths for a piece, used to calculate actual paths for a board state.
type RelPathSet map[*RelPath]struct{}
type RelPath []RelPoint
type RelPoint struct {
XOffset int8
YOffset int8
}
13 // This function will truncate moves that leave the board and convert the defined relative points to actual board points but does not remove points with other pieces or other similar constraints.
14 func TruncatedAbsPathsForKind(the Kind, from AbsPoint, with Orientation) AbsPathSetMap {
15 absmap := make(AbsPathSetMap)
16 for movetype, paths := range RelPathMapForKind(the) {
17 availablepaths := make(AbsPathSet)
18 for path, _ := range paths {
19 availablepath := AbsPath{
20 Points: make([]AbsPoint, 0, len(*path)),
21 }
22 truncated := false
23 for _, point := range *path {
24 absfile := int8(from.File) + point.XOffset
25 if (absfile > 7) || (absfile < 0) {
26 truncated = true
27 break
28 }
29 var absrank int8
30 if with == White {
31 absrank = int8(from.Rank) + point.YOffset
32 } else {
33 absrank = int8(from.Rank) - point.YOffset
34 }
35 if (absrank > 7) || (absrank < 0) {
36 truncated = true
37 break
38 }
39 availablepath.Points = append(availablepath.Points, AbsPoint{File: uint8(absfile), Rank: uint8(absrank)})
40 }
41 availablepath.Truncated = truncated
42 if len(availablepath.Points) != 0 {
43 availablepaths[&availablepath] = struct{}{}
44 }
45 }
46 if len(availablepaths) != 0 {
47 absmap[movetype] = availablepaths
48 }
49 }
50 return absmap
51 } |
|
I got a slightly different crash message (always on 1.9.2):
There is no unsafe or cgo in the code, and no race reports (apparently), so there's definitely something fishy going on here. |
Here's a small autocontained reproducer:
It crashes fairly quickly (usually within one minute) when running |
Is the map key freed prematurely when the only reference is within mapassign_fast64?
|
@chlunde Thanks. That makes this seem related to #21297. CC @randall77 |
With 1.9 runtime.KeepAlive still crashes, but not on tip |
It does certainly seem similar to #21297. I have a core dump on a failure, and there is a map with a pointer key, where the referent has been GCd. Maybe we need to separate the pointer and non-pointer versions of mapassign_fast64 and friends. |
Looking at the generated code, the key pointer seems to be on the stack across the call to |
Ah, but according to the liveness information |
This does seem to roughly correspond to the original bug report, which also appears to be storing a pointer to a local variable into a map, where the local variable is defined in a loop. |
@pciet, can you file a separate bug for the race detector false positive, if that remains an issue? |
With https://go-review.googlesource.com/c/go/+/79018 patched onto the latest in the repo my application does not crash or race under the test load that triggered the crash initially. I'll run it overnight to be sure. Thanks for the help, I'm looking forward to the next release and will stick with 1.8.5 until then. |
Confirmed that without the patch ( |
Reopening for 1.9.3. |
What version of Go are you using (
|
This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector . |
@davecheney I run |
Use that race enabled version of the application and try to reproduce the conditions that caused the crash.
… On 11 Jan 2018, at 21:17, Cruth kvinc ***@***.***> wrote:
@davecheney I run go build -race myapp and nothing happened. What need I do?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@oiooj if this is the same issue then it isn't present in 1.8.x nor in the 1.10 beta (download here https://groups.google.com/forum/#!topic/golang-dev/SCPpL9PqVbs). If you see the crash with 1.10 beta or 1.8.5 then please open a new issue. |
CL 88635 OK for Go 1.9.3. |
Change https://golang.org/cl/88635 mentions this issue: |
…pointer keys The signature of the mapassign_fast* routines need to distinguish the pointerness of their key argument. If the affected routines suspend part way through, the object pointed to by the key might get garbage collected because the key is typed as a uint{32,64}. This is not a problem for mapaccess or mapdelete because the key in those situations do not live beyond the call involved. If the object referenced by the key is garbage collected prematurely, the code still works fine. Even if that object is subsequently reallocated, it can't be written to the map in time to affect the lookup/delete. Fixes #22781 Change-Id: I0bbbc5e9883d5ce702faf4e655348be1191ee439 Reviewed-on: https://go-review.googlesource.com/79018 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Austin Clements <austin@google.com> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-on: https://go-review.googlesource.com/88635 Run-TryBot: Andrew Bonventre <andybons@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
go1.9.3 has been packaged and includes:
The release is posted at golang.org/dl. — golang.org/x/build/cmd/releasebot, Jan 22 21:02:56 UTC |
Change https://golang.org/cl/93715 mentions this issue: |
This test runs independent goroutines modifying a comprehensive variety of local vars to look for garbage collector regressions. This test has been verified to trigger issue 22781 on the go1.9.2 tag. This test expands on test/fixedbugs/issue22781.go. Tests #22781 Change-Id: Id32f8dde7ef650aea1b1b4cf518e6d045537bfdc Reviewed-on: https://go-review.googlesource.com/93715 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
What version of Go are you using (
go version
)?go version go1.9.2 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?Ubuntu Server 16.04.3 LTS
withPostgreSQL 9.5.10
andOpenSSH_7.2p2
selected during install.Intel(R) Core(TM) i5-6400 CPU
What did you do?
Load testing by acting as 40 clients automatically without delays between actions playing my chess game implemented as a Go HTTP server with a postgres database, which should partially simulate a load of hundreds of players. Crash within 30 seconds as the number of active clients ramps up. The clients are making HTTP calls from a separate 2015 MacBook Pro on the local network.
What did you expect to see?
The load test should run indefinitely.
What did you see instead?
Crash with the signature "fatal error: sweep increased allocation count" reproducible within 30 seconds. The crash does not happen with a client count of eight or less. The crash does not happen with go1.6.4, go1.7.6, or go1.8.4, but does with go1.9, go1.9.1, and go1.9.2.
A race condition is warned with
go build -race
set, but the memory indicated should not have been shared between goroutines.My investigation is detailed here, with stack traces (with and without GODEBUG=gccheckmark=1) and the race warnings: https://github.com/pciet/wichess/issues/19
My email to golang-nuts: https://groups.google.com/forum/#!topic/golang-nuts/7lcongdGOMM
I'll try to put together a case with less code and without the database.
The text was updated successfully, but these errors were encountered: