Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(818)

Issue 179680043: code review 179680043: runtime: fix hang in GC due to shrinkstack vs netpoll race (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
9 years, 4 months ago by rsc
Modified:
9 years, 4 months ago
Reviewers:
gobot, dfc, iant, rlh, bradfitz
CC:
austin, rlh, dvyukov, golang-codereviews, khr, r
Visibility:
Public.

Description

runtime: fix hang in GC due to shrinkstack vs netpoll race During garbage collection, after scanning a stack, we think about shrinking it to reclaim some memory. The shrinking code (called while the world is stopped) checked that the status was Gwaiting or Grunnable and then changed the state to Gcopystack, to essentially lock the stack so that no other GC thread is scanning it. The same locking happens for stack growth (and is more necessary there). oldstatus = runtime·readgstatus(gp); oldstatus &= ~Gscan; if(oldstatus == Gwaiting || oldstatus == Grunnable) runtime·casgstatus(gp, oldstatus, Gcopystack); // oldstatus is Gwaiting or Grunnable else runtime·throw("copystack: bad status, not Gwaiting or Grunnable"); Unfortunately, "stop the world" doesn't stop everything. It stops all normal goroutine execution, but the network polling thread is still blocked in epoll and may wake up. If it does, and it chooses a goroutine to mark runnable, and that goroutine is the one whose stack is shrinking, then it can happen that between readgstatus and casgstatus, the status changes from Gwaiting to Grunnable. casgstatus assumes that if the status is not what is expected, it is a transient change (like from Gwaiting to Gscanwaiting and back, or like from Gwaiting to Gcopystack and back), and it loops until the status has been restored to the expected value. In this case, the status has changed semi-permanently from Gwaiting to Grunnable - it won't change again until the GC is done and the world can continue, but the GC is waiting for the status to change back. This wedges the program. To fix, call a special variant of casgstatus that accepts either Gwaiting or Grunnable as valid statuses. Without the fix bug with the extra check+throw in casgstatus, the program below dies in a few seconds (2-10) with GOMAXPROCS=8 on a 2012 Retina MacBook Pro. With the fix, it runs for minutes and minutes. package main import ( "io" "log" "net" "runtime" ) func main() { const N = 100 for i := 0; i < N; i++ { l, err := net.Listen("tcp", "127.0.0.1:0") if err != nil { log.Fatal(err) } ch := make(chan net.Conn, 1) go func() { var err error c1, err := net.Dial("tcp", l.Addr().String()) if err != nil { log.Fatal(err) } ch <- c1 }() c2, err := l.Accept() if err != nil { log.Fatal(err) } c1 := <-ch l.Close() go netguy(c1, c2) go netguy(c2, c1) c1.Write(make([]byte, 100)) } for { runtime.GC() } } func netguy(r, w net.Conn) { buf := make([]byte, 100) for { bigstack(1000) _, err := io.ReadFull(r, buf) if err != nil { log.Fatal(err) } w.Write(buf) } } var g int func bigstack(n int) { var buf [100]byte if n > 0 { bigstack(n - 1) } g = int(buf[0]) + int(buf[99]) } Fixes issue 9186.

Patch Set 1 #

Patch Set 2 : diff -r 321d04dea9d6a457cb4c7375ce3cf625f8836a19 https://code.google.com/p/go/ #

Patch Set 3 : diff -r 321d04dea9d6a457cb4c7375ce3cf625f8836a19 https://code.google.com/p/go/ #

Patch Set 4 : diff -r 321d04dea9d6a457cb4c7375ce3cf625f8836a19 https://code.google.com/p/go/ #

Total comments: 2
Unified diffs Side-by-side diffs Delta from patch set Stats (+35 lines, -6 lines) Patch
M src/runtime/proc.c View 1 2 3 chunks +32 lines, -0 lines 2 comments Download
M src/runtime/runtime.h View 1 2 1 chunk +2 lines, -0 lines 0 comments Download
M src/runtime/stack.c View 1 1 chunk +1 line, -6 lines 0 comments Download

Messages

Total messages: 7
rsc
Hello austin, rlh (cc: dvyukov, golang-codereviews@googlegroups.com, iant, khr, r), I'd like you to review this ...
9 years, 4 months ago (2014-12-01 21:25:58 UTC) #1
rlh
LGTM On Mon, Dec 1, 2014 at 4:25 PM, <rsc@golang.org> wrote: > Reviewers: austin, rlh, ...
9 years, 4 months ago (2014-12-01 21:28:06 UTC) #2
rsc
*** Submitted as https://code.google.com/p/go/source/detail?r=752cd9199639 *** runtime: fix hang in GC due to shrinkstack vs netpoll ...
9 years, 4 months ago (2014-12-01 21:32:13 UTC) #3
gobot
This CL appears to have broken the linux-386-sid builder. See http://build.golang.org/log/5021c87102e8376f4500b8f8f660467ecde3fbfd
9 years, 4 months ago (2014-12-01 21:43:08 UTC) #4
dfc
False alarm, I think the builder might be sick # Checking API compatibility. Error running ...
9 years, 4 months ago (2014-12-01 21:46:02 UTC) #5
bradfitz
Not a disk space problem. (as it was previously) $ sudo df Filesystem 1K-blocks Used ...
9 years, 4 months ago (2014-12-01 21:49:30 UTC) #6
iant
9 years, 4 months ago (2014-12-01 23:13:17 UTC) #7
Message was sent while issue was closed.
https://codereview.appspot.com/179680043/diff/60001/src/runtime/proc.c
File src/runtime/proc.c (right):

https://codereview.appspot.com/179680043/diff/60001/src/runtime/proc.c#newcod...
src/runtime/proc.c:427: if(oldval == Gwaiting && gp->atomicstatus == Grunnable)
{
This test seems conservative, which is probably good for 1.4.  For 1.5 if oldval
!= gp->atomicstatus &^ Gscan we should probably have a switch that only accepts
the known permitted cases, to more quickly catch this kind of problem in the
future.

https://codereview.appspot.com/179680043/diff/60001/src/runtime/proc.c#newcod...
src/runtime/proc.c:454: runtime·casgcopystack(G *gp)
This has become a somewhat goofy name, since from the caller's perspective there
is no compare-and-swap operation.  All the other casg functions take oldval and
newval.  This function simply sets the status to Gcopystack and returns the old
status.  No need to worry about it for 1.4, of course.
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b