
runtime: goroutine execution stalled during GC #16293

Closed
cespare opened this issue Jul 8, 2016 · 22 comments

@cespare
Contributor

cespare commented Jul 8, 2016

This issue is occurring with Go 1.6.2 on linux/amd64.

We have a Go server that exhibits large latency spikes (hundreds of ms) during GCs even though the STW periods are small (1-2ms). The symptoms are somewhat similar to #14812; I'm reporting a new issue on @RLH's guidance.

Background on the Go program: it is a key/value store that fulfills put/get requests from clients via a simple custom protocol over TCP. The server creates a goroutine per connection, like net/http. The heap contains an extremely large map (>100M keys). The server is written to be reasonably careful about memory allocation and the only GCs are those forced by sysmon every 2 minutes.
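For readers unfamiliar with that shape, a goroutine-per-connection server looks roughly like the following minimal sketch (the address, handler body, and protocol details are placeholders, not the actual server's code):

package main

import (
    "log"
    "net"
)

func main() {
    // Placeholder address; the real server speaks a custom put/get protocol.
    ln, err := net.Listen("tcp", ":7000")
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Print(err)
            continue
        }
        // One goroutine per connection, as net/http does.
        go handleConn(conn)
    }
}

func handleConn(conn net.Conn) {
    defer conn.Close()
    // Read put/get requests, consult the large in-memory map,
    // write responses (details omitted).
}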

When the server starts, the maximum request latency is a few ms. Requests touch disk so occasionally there are latency spikes of at most a few dozen ms. After about 5 days, the server is exhibiting ~500ms latency spikes at every GC. After a week or two, the spikes can grow to 1s or more.

The latency spikes do not affect every request that arrives during the GC; only a few requests seem to be affected (maybe just one?).

Last week I restarted this server using GODEBUG=gctrace=1,gcpacertrace=1 and with net/http/pprof enabled. Then, a couple of days ago, I captured both the GC logs from a spike and a runtime execution trace.

Here is what the spikes look like:

[Screenshot: request latency time series with spikes coinciding with GCs]

(The time series is bucketed at 10s intervals; the mis-alignment between the spike and the GC line is partly due to that and partly an artifact of how we collect runtime metrics.)

We'll take a closer look at the 18:35 spike. Here are the logs:

2016/07/06 18:35:09 GC forced
2016/07/06 18:35:09 pacer: assist ratio=+8.551347e-003 (scan 81 MB in 9720->19297 MB) workers=2+0
2016/07/06 18:35:09 pacer: H_m_prev=10117323976 h_t=+9.500000e-001 H_T=19728781753 h_a=+7.435357e-003 H_a=10192549896 h_g=+1.000000e+000 H_g=20234647952 u_a=+3.749354e-001 u_g=+2.500000e-001 W_a=85852208 goalΔ=+5.000000e-002 actualΔ=-9.425646e-001 u_a/u_g=+1.499742e+000
2016/07/06 18:35:09 gc 3407 @409224.383s 0%: 0.099+504+0.77 ms clock, 0.79+504/508/9.2+6.2 ms cpu, 9720->9720->9648 MB, 19297 MB goal, 8 P
2016/07/06 18:35:10 pacer: sweep done at heap size 9649MB; allocated 0MB of spans; swept 1246890 pages

Here's a screenshot of the trace:

[Screenshot: execution trace showing ~500ms GC and gcBgMarkWorker spans alongside a stalled (*Server).handleConn goroutine]

Note the long "GC" and "gcBgMarkWorker" lines that run for about 500ms and the single (*Server).handleConn line that runs for the same period. It seems that the goroutine managing that connection was stalled during the GC, yet many other requests were served on other connections in a few ms or less.

I have added some further instrumentation to my code to pinpoint the delays, so I will attach more information later. (But I'll have to wait a few days.)

The program isn't open-source but I can provide the binary and execution trace privately. I might also be able to send over the source in some form, but it will take some work on my end to come up with a driver that can reproduce the problem.

@cespare
Contributor Author

cespare commented Jul 8, 2016

/cc @RLH @aclements

@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone Jul 8, 2016
@aclements
Member

@cespare, could you try patching in https://go-review.googlesource.com/c/23540/? You mentioned that there's a large map, so my guess is that one of the mutator assists (handleConn in the screenshot you posted) is grabbing that and spending a long time scanning it. That would also be consistent with just one of the requests being delayed per GC, and sometimes no request being delayed (if the background worker was the one to grab the map).

2016/07/06 18:35:09 pacer: assist ratio=+8.551347e-003 (scan 81 MB in 9720->19297 MB) workers=2+0

Just checking: is the vast majority of your heap non-pointer data? What's the type of your map? It's also a little weird it took a whole second to scan just 81 MB of pointer data...

@aclements
Member

Would it also be possible to post a longer gctrace? I'm particularly interested in whether things have obviously changed between when the server has recently started and has low latency and when it's older and exhibiting high latency. Some snippets of the gctrace/gcpacertrace at different ages would also be fine.

@cespare
Contributor Author

cespare commented Jul 11, 2016

@aclements thanks for taking a look. I'll patch in CL 23540 and report back in a week or so. As for the rest of your questions:

is the vast majority of your heap non-pointer data? What's the type of your map?

Yes. The vast majority of the heap is that single map. The type is map[keyHash]RecordRef:

type keyHash [12]byte

type RecordRef struct {
    seq    uint32
    offset uint32
}

(This server is a later version of the same program that caused me to file #9477 a couple of years ago; at that time it was important to eliminate all pointers in order to keep pause times under control.)

If your theory is correct, it sounds like I could simply shard the map to partially work around the problem.
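For reference, a minimal sketch of what such sharding could look like, using the keyHash and RecordRef types above (the shard count, the use of the hash's first byte as the shard index, and all names other than those two types are assumptions; locking is omitted):

// Assumes the keyHash and RecordRef types defined above.
const numShards = 256

// shardedIndex splits one giant map into many smaller maps so that no
// single object dominates a GC scan. Synchronization is omitted for brevity.
type shardedIndex struct {
    shards [numShards]map[keyHash]RecordRef
}

func newShardedIndex() *shardedIndex {
    s := &shardedIndex{}
    for i := range s.shards {
        s.shards[i] = make(map[keyHash]RecordRef)
    }
    return s
}

// Since keyHash is already a hash, its first byte is a reasonable shard index.
func (s *shardedIndex) get(k keyHash) (RecordRef, bool) {
    r, ok := s.shards[int(k[0])%numShards][k]
    return r, ok
}

func (s *shardedIndex) put(k keyHash, r RecordRef) {
    s.shards[int(k[0])%numShards][k] = r
}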

Would it also be possible to post a longer gctrace? I'm particularly interested in whether things have obviously changed between when the server has recently started and has low latency and when it's older and exhibiting high latency. Some snippets of the gctrace/gcpacertrace at different ages would also be fine.

I restarted the process at around 2016-07-02T00:55:17Z. Here are some snippets from the log from each day between then and the logs I posted above:

2016/07/02 12:01:40 GC forced
2016/07/02 12:01:40 pacer: assist ratio=+2.138854e-003 (scan 17 MB in 8485->16878 MB) workers=2+0
2016/07/02 12:01:40 pacer: H_m_prev=8849114144 h_t=+9.500000e-001 H_T=17255772580 h_a=+5.442970e-003 H_a=8897279608 h_g=+1.000000e+000 H_g=17698228288 u_a=+2.503884e-001 u_g=+2.500000e-001 W_a=18793824 goalΔ=+5.000000e-002 actualΔ=-9.445570e-001 u_a/u_g=+1.001554e+000
2016/07/02 12:01:40 gc 339 @40015.564s 0%: 0.12+101+0.51 ms clock, 0.98+0.31/6.0/110+4.1 ms cpu, 8485->8485->8436 MB, 16878 MB goal, 8 P
2016/07/02 12:01:40 pacer: sweep done at heap size 8436MB; allocated 0MB of spans; swept 1086754 pages
2016/07/02 12:02:15 scvg266: inuse: 8462, idle: 5311, sys: 13774, released: 5260, consumed: 8513 (MB)
...
2016/07/03 12:01:35 GC forced
2016/07/03 12:01:35 pacer: assist ratio=+3.950049e-003 (scan 34 MB in 8738->17379 MB) workers=2+0
2016/07/03 12:01:36 pacer: H_m_prev=9111766208 h_t=+9.500000e-001 H_T=17767944105 h_a=+5.644927e-003 H_a=9163201464 h_g=+1.000000e+000 H_g=18223532416 u_a=+2.500166e-001 u_g=+2.500000e-001 W_a=35753936 goalΔ=+5.000000e-002 actualΔ=-9.443551e-001 u_a/u_g=+1.000066e+000
2016/07/03 12:01:36 gc 1058 @126411.045s 0%: 0.34+197+0.91 ms clock, 2.7+0.026/6.4/208+7.3 ms cpu, 8738->8738->8693 MB, 17379 MB goal, 8 P
2016/07/03 12:01:36 pacer: sweep done at heap size 8694MB; allocated 0MB of spans; swept 1119763 pages
2016/07/03 12:02:15 scvg842: 1 MB released
2016/07/03 12:02:15 scvg842: inuse: 8722, idle: 5572, sys: 14295, released: 5511, consumed: 8783 (MB)
...
2016/07/04 12:00:55 GC forced
2016/07/04 12:00:55 pacer: assist ratio=+5.573086e-003 (scan 52 MB in 9556->19028 MB) workers=2+0
2016/07/04 12:00:55 pacer: H_m_prev=9976379976 h_t=+9.500000e-001 H_T=19453940953 h_a=+4.495774e-003 H_a=10021231528 h_g=+1.000000e+000 H_g=19952759952 u_a=+2.500208e-001 u_g=+2.500000e-001 W_a=55319360 goalΔ=+5.000000e-002 actualΔ=-9.455042e-001 u_a/u_g=+1.000083e+000
2016/07/04 12:00:55 gc 1776 @212770.689s 0%: 0.13+333+1.0 ms clock, 1.1+0.055/336/13+8.3 ms cpu, 9556->9556->9514 MB, 19028 MB goal, 8 P
2016/07/04 12:00:55 pacer: sweep done at heap size 9515MB; allocated 0MB of spans; swept 1225150 pages
2016/07/04 12:02:16 scvg1418: 3 MB released
2016/07/04 12:02:16 scvg1418: inuse: 9564, idle: 4730, sys: 14295, released: 4716, consumed: 9578 (MB)
...
2016/07/05 12:01:45 GC forced
2016/07/05 12:01:45 pacer: assist ratio=+8.636990e-003 (scan 81 MB in 9576->19057 MB) workers=2+0
2016/07/05 12:01:46 pacer: H_m_prev=9991664936 h_t=+9.500000e-001 H_T=19483746625 h_a=+4.967289e-003 H_a=10041296424 h_g=+1.000000e+000 H_g=19983329872 u_a=+2.501286e-001 u_g=+2.500000e-001 W_a=85842496 goalΔ=+5.000000e-002 actualΔ=-9.450327e-001 u_a/u_g=+1.000515e+000
2016/07/05 12:01:46 gc 2494 @299221.000s 0%: 0.27+436+0.80 ms clock, 2.1+0.44/7.8/442+6.4 ms cpu, 9576->9576->9528 MB, 19057 MB goal, 8 P
2016/07/05 12:01:46 pacer: sweep done at heap size 9529MB; allocated 0MB of spans; swept 1228103 pages
2016/07/05 12:02:17 scvg1994: 3 MB released
2016/07/05 12:02:17 scvg1994: inuse: 9563, idle: 4731, sys: 14295, released: 4692, consumed: 9602 (MB)
...
2016/07/06 12:01:29 GC forced
2016/07/06 12:01:29 pacer: assist ratio=+8.515688e-003 (scan 81 MB in 9701->19318 MB) workers=2+0
2016/07/06 12:01:30 pacer: H_m_prev=10128628744 h_t=+9.500000e-001 H_T=19750826050 h_a=+4.366870e-003 H_a=10172859152 h_g=+1.000000e+000 H_g=20257257488 u_a=+2.500632e-001 u_g=+2.500000e-001 W_a=85846568 goalΔ=+5.000000e-002 actualΔ=-9.456331e-001 u_a/u_g=+1.000253e+000
2016/07/06 12:01:30 gc 3211 @385604.769s 0%: 0.36+498+0.97 ms clock, 2.9+0.25/8.7/502+7.8 ms cpu, 9701->9701->9655 MB, 19318 MB goal, 8 P
2016/07/06 12:01:30 pacer: sweep done at heap size 9655MB; allocated 0MB of spans; swept 1244427 pages
2016/07/06 12:02:18 scvg2570: 0 MB released
2016/07/06 12:02:18 scvg2570: inuse: 9699, idle: 4595, sys: 14295, released: 4560, consumed: 9734 (MB)

@cespare
Contributor Author

cespare commented Jul 21, 2016

I'm having trouble reproducing the issue now. Once I can reliably reproduce it again I'll patch in CL 23540 and report back (otherwise it will be hard to say if it's helping).

One hint that @rhysh gave me was that I could be running into #16070 as well. It's possible I wouldn't notice the leaked memory because the size of the overflow buckets is dwarfed by the size of map data, and it could explain how the latency spikes grow worse over time.

@rhysh
Contributor

rhysh commented Aug 23, 2016

I've seen this recently in some of my applications with go1.6.x and go1.7—but with large slices and large channels (both with pointers) instead of large pointerless maps.

I have a reproducer for this. Using go1.6.3 and go1.7 with a live heap of around 1GB, it can show mutator unavailability in excess of a second (in "slice-livepointer" mode). It can show mutator unavailability of tens of milliseconds with a large pointer-free map, after churning the map for several minutes to simulate what I suspect are @cespare's program's operating conditions.

Applying https://golang.org/cl/23540 PS3 to go1.7 brings the pauses for all test modes down to only a few milliseconds.


package main

import (
    "flag"
    "log"
    "os"
    "os/signal"
    "runtime/trace"
    "time"
)

const (
    ballastBytes      = 1000 << 20
    tickInterval      = 1 * time.Millisecond
    desiredGCInterval = 1 * time.Second
    workers           = 10
    testDuration      = 10 * time.Second

    ptrBytes = 8 // 8 bytes per pointer on amd64
)

var (
    ballast interface{}
    sink    = make([]interface{}, workers)
)

// makeBallast controls which GC pause experiment is active.
var makeBallast func() interface{} = makeMapNopointer

// makeSliceNopointer creates a large pointer-free slice.
//
// In go1.6.3 and go1.7, this has no measured effect on mutator availability.
func makeSliceNopointer() interface{} { return make([]uintptr, ballastBytes/ptrBytes) }

// makeSlicePointer creates a large slice of nil pointers.
//
// In go1.6.3 and go1.7, this leads to long periods of mutator unavailability
// (hundreds of milliseconds).
func makeSlicePointer() interface{} { return make([]*uintptr, ballastBytes/ptrBytes) }

// makeSliceLivePointer creates a large slice containing pointers to many
// pieces of live memory.
//
// In go1.6.3 and go1.7, this leads to long periods of mutator unavailability
// (around a second).
func makeSliceLivePointer() interface{} {
    b := make([]*uintptr, ballastBytes/ptrBytes/2)
    for i := range b {
        b[i] = new(uintptr)
    }
    return b
}

// makeChanPointer creates a large channel of nil pointers.
//
// In go1.6.3 and go1.7, this leads to long periods of mutator unavailability
// (hundreds of milliseconds).
func makeChanPointer() interface{} { return make(chan *uintptr, ballastBytes/ptrBytes) }

// makeChanLivePointer creates a large channel of pointers to live memory.
//
// In go1.6.3 and go1.7, this leads to long periods of mutator unavailability
// (around a second).
func makeChanLivePointer() interface{} {
    c := make(chan *uintptr, ballastBytes/ptrBytes/2)
    for i := 0; i < cap(c); i++ {
        c <- new(uintptr)
    }
    return c
}

// makeMapNopointer creates a large map with non-pointer keys and values.
//
// In go1.6.3 and go1.7, this leads to long periods of mutator unavailability
// (tens of milliseconds).
//
// Populating and churning the map may take a few minutes.
func makeMapNopointer() interface{} {
    const (
        // mapMemoryInefficiencyEstimate estimates how many bytes of memory
        // a map requires to store one additional byte of key+value.
        mapMemoryInefficiencyEstimate = 2

        kvsize     = 4
        mapsize    = (ballastBytes / ptrBytes) / kvsize / mapMemoryInefficiencyEstimate
        churnTimes = 30
    )

    b := make(map[uintptr][kvsize - 1]uintptr)
    // the key size and value size must both be below 128 bytes:
    // https://github.com/golang/go/blob/go1.7/src/runtime/hashmap.go#L70-L75

    // populate and churn the map to grow (https://golang.org/issue/16070) the
    // overflow bucket list:
    // https://github.com/golang/go/blob/go1.7/src/runtime/hashmap.go#L117-L126
    for i := 0; i < mapsize*(1+churnTimes); i++ {
        b[uintptr(i)] = [kvsize - 1]uintptr{}
        delete(b, uintptr(i-mapsize))
    }
    return b
}

func main() {
    tname := flag.String("trace", "", "Execution trace file name")
    exp := flag.String("experiment", "",
        "Name of experiment (slice-nopointer, slice-pointer, slice-livepointer, map-nopointer, chan-pointer, chan-livepointer)")
    flag.Parse()

    if *tname == "" {
        log.Fatalf("The -trace flag must be used to set the execution trace file destination")
    }

    switch *exp {
    case "slice-nopointer":
        makeBallast = makeSliceNopointer
    case "slice-pointer":
        makeBallast = makeSlicePointer
    case "slice-livepointer":
        makeBallast = makeSliceLivePointer
    case "map-nopointer":
        makeBallast = makeMapNopointer
    case "chan-pointer":
        makeBallast = makeChanPointer
    case "chan-livepointer":
        makeBallast = makeChanLivePointer
    case "":
    default:
        log.Fatalf("Unknown -experiment %q", *exp)
    }

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, os.Interrupt)

    tfile, err := os.Create(*tname)
    if err != nil {
        log.Fatalf("trace open: %v", err)
    }
    defer func() {
        err := tfile.Close()
        if err != nil {
            log.Fatalf("trace close: %v", err)
        }
    }()

    err = trace.Start(tfile)
    if err != nil {
        log.Fatalf("trace start: %v", err)
    }
    defer trace.Stop()

    ballast = makeBallast()

    ticker := time.NewTicker(tickInterval)
    defer ticker.Stop()

    for i := 0; i < workers; i++ {
        go work(ticker.C, i)
    }

    select {
    case <-sigs:
    case <-time.After(testDuration):
    }
}

func work(ch <-chan time.Time, i int) {
    for t := range ch {
        d := time.Since(t)
        if d > 1*time.Millisecond {
            log.Printf("long pause: %s", d)
        }

        sink[i] = make([]byte, ballastBytes/(desiredGCInterval/tickInterval))
    }
}
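To run the reproducer, an invocation along these lines should work (assuming the program is saved as main.go; the binary and trace file names are illustrative):

go build -o gcpause main.go
./gcpause -trace trace.out -experiment slice-livepointer
go tool trace gcpause trace.out

The "long pause" log lines give a quick signal, and the execution trace shows which goroutines were blocked during each GC.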

@rhysh
Contributor

rhysh commented Aug 26, 2016

@cespare — does this look like it reproduces your issue, or does it illustrate a different bug?

@aclements — this makes garbage collections pause all mutators that allocate memory for nearly the whole duration of the GC. My company has a couple of programs I'm aware of that pause for ~100ms whenever a GC happens, so for some applications (those with goroutines that allocate, and whose live heaps are concentrated in a few giant allocations), GCs are effectively stop-the-world.

Do you have recommendations for how to work around this, particularly for applications where breaking the few large allocations into many smaller ones would require significant effort?

Is there a safe runtime change to address this that could make it into a 1.7.x point release? You seemed skeptical of https://golang.org/cl/23540 for 1.7.

Thanks!

@cespare
Contributor Author

cespare commented Aug 26, 2016

@rhysh — thanks for the repro case. I think it does replicate the problem I've been seeing.

I'm able to reproduce the problem again on my end and I've now patched in CL 23540 PS3. I'll report back later about whether that fixed the problem.

@aclements
Member

@rhysh, that's a great benchmark. Can I have your permission to adapt it into the suite at https://github.com/aclements/go-gcbench/ (which will one day become part of x/benchmarks)?

Do you have recommendations for how to work around this, particularly for applications where breaking the few large allocations into many smaller ones would require significant effort?

If you can't break up the large allocations (or patch in CL 23540), the only other trick I can think of is to raise GOMAXPROCS. Given N large objects, at most N OS threads can get tied up in scanning those large objects, while the others continue to run (mostly) normally. You also need enough goroutines for this workaround to work, since it would still be possible for N goroutines to get tied up on those N OS threads.
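A minimal sketch of that GOMAXPROCS workaround, for concreteness (the factor of two is an arbitrary illustration, not a recommendation):

package main

import "runtime"

func main() {
    // With GOMAXPROCS above the CPU count, the few Ps that get tied up
    // scanning the large objects leave other Ps free to keep running user
    // goroutines. This only helps if there are enough runnable goroutines
    // to use the extra Ps.
    runtime.GOMAXPROCS(2 * runtime.NumCPU())

    // ... rest of the server ...
}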

Is there a safe runtime change to address this that could make it into a 1.7.x point release? You seemed skeptical of https://golang.org/cl/23540 for 1.7.

Given that this isn't a regression and I can't think of a super-low-risk way to do this, it doesn't fit the requirements for a point release.

I have a rebased and updated version of CL 23540 ready and am running benchmarks as I write this, so I should have a real version of that ready soon.

@aclements
Member

@cespare, did you get a chance to try CL 23540? I just uploaded a rebased and updated version, but the code is no different from PS 3.

@cespare
Contributor Author

cespare commented Aug 29, 2016

@aclements I patched it into my server two days ago. I'll report back later this week with the results.

@aclements
Member

@cespare, great! Looking forward to the results.

@rhysh
Contributor

rhysh commented Aug 30, 2016

Can I have your permission to adapt it

Yes, please do!

Given N large objects, at most N OS threads can get tied up in scanning those large objects, while the others continue to run (mostly) normally.

That's part of the story, @aclements. There's an additional implication of this bug that I think makes the workaround less practical than that.

My view is that this bug affects availability of all goroutines that allocate. With most of the process's memory tied up in N allocations (with N==1 for my reproducer), N goroutines that allocate during the concurrent mark phase will be assigned to scanning those large allocations. They'll tie up a P, and the G will be unable to run user code. Any additional goroutines that allocate during the concurrent mark phase will wait until the large scan tasks are done. These goroutines won't consume a P but their G will not be runnable, at least until there's scan credit available to steal (which might not be until nearly the end of the cycle). So long as there's a P available, each goroutine will be able to run until it attempts to allocate memory. At that point, they end up stuck at the end of runtime.gcAssistAlloc on goparkunlock(&work.assistQueue.lock, "GC assist wait", traceEvGoBlock, 2).

Given the "GC forced" logs in the first message on this issue, I suspect that @cespare 's server is careful to not allocate any memory at all in most goroutines which is why there continues to be a fair amount of goroutine scheduling activity and "Network" activity in the first message's execution trace.

Less finely tuned servers—including ones built with net/http.Server—allocate at least a little in many goroutines. Those goroutines end up un-runnable for long periods of the GC cycle.

In servers built with net/http.Server, this bug affects a small number of threads and a large number of goroutines. All of our http.Handlers end up stalled during the GC, not just the unluckiest one.
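A small hypothetical probe (separate from the reproducer above, with an arbitrary ballast size) makes this concrete: one huge pointer-bearing allocation dominates the heap, one goroutine allocates enough to keep GC cycles coming, and a second goroutine does nothing but time a 1-byte allocation. On go1.6/go1.7 without CL 23540, that tiny allocation can intermittently stall for roughly the length of the large object's scan, showing up as "GC assist wait" in an execution trace:

package main

import (
    "log"
    "time"
)

var (
    ballast []*int // a single ~1 GB allocation of pointer slots
    garbage []byte // churn that keeps GC cycles coming
    sink    []byte // result of the timed tiny allocation
)

func main() {
    ballast = make([]*int, 1<<27) // 128M pointer slots ≈ 1 GB in one object

    // Heavy allocator: drives a GC roughly every second and is a likely
    // candidate to be handed the big scan job.
    go func() {
        for {
            garbage = make([]byte, 1<<20)
            time.Sleep(time.Millisecond)
        }
    }()

    // Probe: does nothing but time a 1-byte allocation.
    for {
        start := time.Now()
        sink = make([]byte, 1)
        if d := time.Since(start); d > 10*time.Millisecond {
            log.Printf("1-byte allocation stalled for %s", d)
        }
        time.Sleep(time.Millisecond)
    }
}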

Given that this isn't a regression and I can't think of a super-low-risk way to do this, it doesn't fit the requirements for a point release.

Thank you for considering it.

@cespare
Contributor Author

cespare commented Aug 30, 2016

I agree with @rhysh -- it seems like for the "have extra goroutines" workaround to help, I'd need to make some goroutines that only do dummy allocations purely to maximize the probability that they'll take the hit.

In my code, the place where a goroutine actually gets stuck for hundreds of ms is a 1-byte allocation!

    var scratch [1]byte // <-- the goroutine stalls here for hundreds of ms
    if _, err := io.ReadFull(r, scratch[:]); err != nil {
        ...
    }

@aclements
Member

Yes, please do!

Thanks!

These goroutines won't consume a P but their G will not be runnable, at least until there's scan credit available to steal (which might not be until nearly the end of the cycle).

Shoot. I was thinking that scanobject would flush scan credit as it went and wake up these assists, but you're right. That flush is one level higher, in gcDrain, so it doesn't happen incrementally. I guess that's another good reason for oblets.

@dr2chase
Contributor

Just a workaround suggestion, but do you care about what is read into that one-byte allocation? If it's throwaway, maybe you could preallocate a static (or a static array of bytes, randomly accessed to avoid fighting over cache lines)? I was looking at it with an eye to escape analysis, and had that thought instead.
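One shape that suggestion could take, avoiding a shared global buffer entirely: since each connection is served by a single goroutine, the scratch byte can live on a per-connection handler and be reused. A hedged sketch with hypothetical type and method names, not the server's actual code:

package kvserver // hypothetical package

import "io"

// connHandler stands in for whatever per-connection state the server keeps.
type connHandler struct {
    r       io.Reader
    scratch [1]byte // reused across requests, so no per-request allocation
}

func (h *connHandler) readByte() (byte, error) {
    // h.scratch lives inside the long-lived handler, so slicing it does not
    // allocate on each call the way a local "var scratch [1]byte" that
    // escapes into io.ReadFull does.
    if _, err := io.ReadFull(h.r, h.scratch[:]); err != nil {
        return 0, err
    }
    return h.scratch[0], nil
}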

@bradfitz
Contributor

bradfitz commented Aug 30, 2016

@dr2chase, and special-case it in the race detector so it's not reported as a race? I think we'd all agree that's super gross.

But I think @cespare just cited that as the first example. There are allocations after that.

@aclements
Member

@cespare: Friendly ping to see if you have results with the CL.

@cespare
Contributor Author

cespare commented Sep 2, 2016

@aclements thanks for the reminder. I just checked the numbers and indeed, the large latency spikes are not happening anymore—CL 23540 fixes this problem for me. Thanks!

@aclements
Member

@cespare, awesome! I'll finish polishing that CL up and get it in.

@ngrilly

ngrilly commented Sep 15, 2016

It looks like this commit solves the high latency measured in https://github.com/spion/hashtable-latencies.

Today I reran the benchmark with the current Go master branch (78c4658), and Go's latency is now similar to, and even lower than, OCaml's:

[Histogram: request latency comparison, Go vs. OCaml]

Thanks @aclements!

@mrjrieke

Super cool.
