Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(4646)

Issue 184030043: code review 184030043: [release-branch.go1.4] runtime: fix hang in GC due to s... (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
9 years, 4 months ago by rsc
Modified:
9 years, 4 months ago
Reviewers:
CC:
rlh, golang-codereviews
Visibility:
Public.

Description

[release-branch.go1.4] runtime: fix hang in GC due to shrinkstack vs netpoll race ««« CL 179680043 / 752cd9199639 runtime: fix hang in GC due to shrinkstack vs netpoll race During garbage collection, after scanning a stack, we think about shrinking it to reclaim some memory. The shrinking code (called while the world is stopped) checked that the status was Gwaiting or Grunnable and then changed the state to Gcopystack, to essentially lock the stack so that no other GC thread is scanning it. The same locking happens for stack growth (and is more necessary there). oldstatus = runtime·readgstatus(gp); oldstatus &= ~Gscan; if(oldstatus == Gwaiting || oldstatus == Grunnable) runtime·casgstatus(gp, oldstatus, Gcopystack); // oldstatus is Gwaiting or Grunnable else runtime·throw("copystack: bad status, not Gwaiting or Grunnable"); Unfortunately, "stop the world" doesn't stop everything. It stops all normal goroutine execution, but the network polling thread is still blocked in epoll and may wake up. If it does, and it chooses a goroutine to mark runnable, and that goroutine is the one whose stack is shrinking, then it can happen that between readgstatus and casgstatus, the status changes from Gwaiting to Grunnable. casgstatus assumes that if the status is not what is expected, it is a transient change (like from Gwaiting to Gscanwaiting and back, or like from Gwaiting to Gcopystack and back), and it loops until the status has been restored to the expected value. In this case, the status has changed semi-permanently from Gwaiting to Grunnable - it won't change again until the GC is done and the world can continue, but the GC is waiting for the status to change back. This wedges the program. To fix, call a special variant of casgstatus that accepts either Gwaiting or Grunnable as valid statuses. Without the fix bug with the extra check+throw in casgstatus, the program below dies in a few seconds (2-10) with GOMAXPROCS=8 on a 2012 Retina MacBook Pro. With the fix, it runs for minutes and minutes. package main import ( "io" "log" "net" "runtime" ) func main() { const N = 100 for i := 0; i < N; i++ { l, err := net.Listen("tcp", "127.0.0.1:0") if err != nil { log.Fatal(err) } ch := make(chan net.Conn, 1) go func() { var err error c1, err := net.Dial("tcp", l.Addr().String()) if err != nil { log.Fatal(err) } ch <- c1 }() c2, err := l.Accept() if err != nil { log.Fatal(err) } c1 := <-ch l.Close() go netguy(c1, c2) go netguy(c2, c1) c1.Write(make([]byte, 100)) } for { runtime.GC() } } func netguy(r, w net.Conn) { buf := make([]byte, 100) for { bigstack(1000) _, err := io.ReadFull(r, buf) if err != nil { log.Fatal(err) } w.Write(buf) } } var g int func bigstack(n int) { var buf [100]byte if n > 0 { bigstack(n - 1) } g = int(buf[0]) + int(buf[99]) } Fixes issue 9186. LGTM=rlh R=austin, rlh CC=dvyukov, golang-codereviews, iant, khr, r https://codereview.appspot.com/179680043 »»»

Patch Set 1 #

Patch Set 2 : diff -r cd4772ba95732b5dd568d1cd60cf6b5e7356a9b3 https://code.google.com/p/go/ #

Unified diffs Side-by-side diffs Delta from patch set Stats (+35 lines, -6 lines) Patch
M src/runtime/proc.c View 1 3 chunks +32 lines, -0 lines 0 comments Download
M src/runtime/runtime.h View 1 1 chunk +2 lines, -0 lines 0 comments Download
M src/runtime/stack.c View 1 1 chunk +1 line, -6 lines 0 comments Download

Messages

Total messages: 2
rsc
Hello rlh (cc: golang-codereviews@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go/
9 years, 4 months ago (2014-12-01 21:42:41 UTC) #1
rsc
9 years, 4 months ago (2014-12-01 21:42:47 UTC) #2
*** Submitted as https://code.google.com/p/go/source/detail?r=9e6afb82e1eb ***

[release-branch.go1.4] runtime: fix hang in GC due to shrinkstack vs netpoll
race

««« CL 179680043 / 752cd9199639
runtime: fix hang in GC due to shrinkstack vs netpoll race

During garbage collection, after scanning a stack, we think about
shrinking it to reclaim some memory. The shrinking code (called
while the world is stopped) checked that the status was Gwaiting
or Grunnable and then changed the state to Gcopystack, to essentially
lock the stack so that no other GC thread is scanning it.
The same locking happens for stack growth (and is more necessary there).

        oldstatus = runtime·readgstatus(gp);
        oldstatus &= ~Gscan;
        if(oldstatus == Gwaiting || oldstatus == Grunnable)
                runtime·casgstatus(gp, oldstatus, Gcopystack); // oldstatus is
Gwaiting or Grunnable
        else
                runtime·throw("copystack: bad status, not Gwaiting or
Grunnable");

Unfortunately, "stop the world" doesn't stop everything. It stops all
normal goroutine execution, but the network polling thread is still
blocked in epoll and may wake up. If it does, and it chooses a goroutine
to mark runnable, and that goroutine is the one whose stack is shrinking,
then it can happen that between readgstatus and casgstatus, the status
changes from Gwaiting to Grunnable.

casgstatus assumes that if the status is not what is expected, it is a
transient change (like from Gwaiting to Gscanwaiting and back, or like
from Gwaiting to Gcopystack and back), and it loops until the status
has been restored to the expected value. In this case, the status has
changed semi-permanently from Gwaiting to Grunnable - it won't
change again until the GC is done and the world can continue, but the
GC is waiting for the status to change back. This wedges the program.

To fix, call a special variant of casgstatus that accepts either Gwaiting
or Grunnable as valid statuses.

Without the fix bug with the extra check+throw in casgstatus, the
program below dies in a few seconds (2-10) with GOMAXPROCS=8
on a 2012 Retina MacBook Pro. With the fix, it runs for minutes
and minutes.

package main

import (
        "io"
        "log"
        "net"
        "runtime"
)

func main() {
        const N = 100
        for i := 0; i < N; i++ {
                l, err := net.Listen("tcp", "127.0.0.1:0")
                if err != nil {
                        log.Fatal(err)
                }
                ch := make(chan net.Conn, 1)
                go func() {
                        var err error
                        c1, err := net.Dial("tcp", l.Addr().String())
                        if err != nil {
                                log.Fatal(err)
                        }
                        ch <- c1
                }()
                c2, err := l.Accept()
                if err != nil {
                        log.Fatal(err)
                }
                c1 := <-ch
                l.Close()
                go netguy(c1, c2)
                go netguy(c2, c1)
                c1.Write(make([]byte, 100))
        }
        for {
                runtime.GC()
        }
}

func netguy(r, w net.Conn) {
        buf := make([]byte, 100)
        for {
                bigstack(1000)
                _, err := io.ReadFull(r, buf)
                if err != nil {
                        log.Fatal(err)
                }
                w.Write(buf)
        }
}

var g int

func bigstack(n int) {
        var buf [100]byte
        if n > 0 {
                bigstack(n - 1)
        }
        g = int(buf[0]) + int(buf[99])
}

Fixes issue 9186.

LGTM=rlh
R=austin, rlh
CC=dvyukov, golang-codereviews, iant, khr, r
https://codereview.appspot.com/179680043
»»»

TBR=rlh
CC=golang-codereviews
https://codereview.appspot.com/184030043
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b