
runtime: GC's mark termination STW time extended by Linux NUMA migration #14406

Closed
rhysh opened this issue Feb 19, 2016 · 9 comments

@rhysh
Contributor

rhysh commented Feb 19, 2016

I have a process that uses a large number of goroutines and sees large mark termination STW times (this is the same program described in #12967). After disabling stack shrinking, it still has large (and variable) mark termination times. The program runs on linux/amd64, with linux version 3.13.

I've profiled with perf after enabling GOEXPERIMENT=framepointer, so I can see full call stacks of the GC and kernel. This profiling indicates that 1) stack shrinking significantly increases pause time, #12967, and 2) with stack shrinking disabled via GODEBUG=gcshrinkstackoff=1 the mark termination phase includes many calls to the kernel page_fault function, which six frames deeper leads to do_numa_page and then to migrate_misplaced_page.

It appears that the garbage collector makes sufficient accesses to memory to trick Linux's NUMA logic into moving physical memory pages to be closer to the GC workers that are accessing particular pages. This expensive migration increases GC pause times when it happens during mark termination. I suspect that the affinity that Gs have to Ps and that Ps have to CPU cores means that the mutator and GC fight back and forth on where the pages should be placed.

Setting the process's memory policy to MPOL_BIND via either mbind(2) or set_mempolicy(2) yields a significant reduction in mark termination time (and much more consistent timings), and effectively eliminates time spent on page faults during mark termination.

I have a small reproducer for this, included below (it's based on the reproducer for #12967 but doesn't have buggy use of WaitGroups, includes a large number of mostly-idle goroutines, and disables go1.6's improved inlining). It shows very consistent GC timings when the process's memory has been marked as MPOL_BIND via numactl, even when it's bound to all available nodes.

$ go version
go version go1.6 darwin/amd64
$ # I'm cross-compiling - building on darwin/amd64, running on linux/amd64
$ go tool compile -V
compile version go1.6 X:framepointer
$ uname -r
3.13.0-62-generic

The included test output was recorded on a c4.8xlarge EC2 instance.

I'd expect the Go GC to be able to indicate to the kernel that the memory accesses it's making are temporary and somewhat random, advising the kernel not to migrate memory to match those access patterns.


Here 100 goroutines grow and shrink their stack requirements while 1,000,000 others read/write small pieces of their stacks without making function calls. Stack shrinking is disabled. Note the mark termination times of between 121ms and 1010ms, 22 million page faults, and 1316 seconds of system time over the 120-second test run.

$ /usr/bin/time env GODEBUG=gctrace=1,gcshrinkstackoff=1 ./issue12967 -d 2m -growers 100 -livers 1000000
gc 1 @0.005s 5%: 0.17+1.7+0.87 ms clock, 2.4+0.13/2.6/1.1+12 ms cpu, 100->100->100 MB, 101 MB goal, 36 P
gc 2 @0.023s 5%: 0.38+3.1+0.94 ms clock, 5.0+0.012/19/2.4+12 ms cpu, 101->101->101 MB, 200 MB goal, 36 P
gc 3 @0.353s 2%: 1.6+19+5.4 ms clock, 32+4.4/156/198+109 ms cpu, 151->152->129 MB, 202 MB goal, 36 P
gc 4 @0.909s 2%: 4.7+28+12 ms clock, 86+0.55/244/573+233 ms cpu, 231->232->176 MB, 257 MB goal, 36 P
gc 5 @1.772s 3%: 8.1+57+27 ms clock, 228+0/482/925+761 ms cpu, 324->337->239 MB, 352 MB goal, 36 P
gc 6 @2.925s 4%: 13+102+61 ms clock, 339+33/713/1831+1533 ms cpu, 437->455->335 MB, 453 MB goal, 36 P
gc 7 @4.649s 6%: 18+122+125 ms clock, 599+38/1015/2690+4014 ms cpu, 620->637->438 MB, 637 MB goal, 36 P
gc 8 @7.523s 20%: 117+15452+1010 ms clock, 3174+321/138568/413980+27277 ms cpu, 832->848->581 MB, 844 MB goal, 36 P
gc 9 @28.610s 18%: 48+2893+484 ms clock, 1153+0.099/21720/37561+11638 ms cpu, 1111->1131->666 MB, 1130 MB goal, 36 P
gc 10 @38.246s 15%: 32+300+121 ms clock, 1024+13/2499/7467+3900 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 11 @45.041s 14%: 45+1464+222 ms clock, 1456+20/12898/32722+7106 ms cpu, 1267->1287->666 MB, 1294 MB goal, 36 P
gc 12 @52.895s 12%: 34+465+221 ms clock, 384+21/4017/8625+2435 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
gc 13 @59.619s 12%: 46+595+426 ms clock, 1500+0.10/5118/14312+13652 ms cpu, 1267->1297->676 MB, 1294 MB goal, 36 P
gc 14 @66.699s 11%: 64+712+179 ms clock, 2073+0.21/6241/18584+5733 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
gc 15 @73.656s 10%: 37+215+121 ms clock, 1203+0.006/1774/5229+3902 ms cpu, 1266->1277->656 MB, 1293 MB goal, 36 P
gc 16 @80.291s 10%: 63+572+157 ms clock, 1653+0.30/4999/10443+4099 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
gc 17 @87.130s 10%: 89+925+247 ms clock, 2873+0.017/7985/21239+7935 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
gc 18 @94.465s 9%: 36+487+220 ms clock, 1152+0.24/4188/12270+7071 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
gc 19 @101.246s 9%: 32+267+121 ms clock, 754+0/2281/4436+2799 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 20 @107.862s 8%: 27+293+305 ms clock, 628+0.005/2452/5118+7015 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 21 @114.597s 8%: 54+426+181 ms clock, 1750+6.0/3620/10683+5796 ms cpu, 1266->1296->676 MB, 1293 MB goal, 36 P
251.54user 1316.06system 2:00.37elapsed 1302%CPU (0avgtext+0avgdata 21715744maxresident)k
0inputs+0outputs (0major+22501185minor)pagefaults 0swaps

Here's the same test, but run with numactl --membind 0,1. Note the mark termination times tightly grouped between 124ms and 138ms, 1 million page faults, and only 37 seconds of system time. (I'm interested in the time spent on page faults, not their number, but this is what time provides.)

$ /usr/bin/time numactl --membind 0,1 -- env GODEBUG=gctrace=1,gcshrinkstackoff=1 ./issue12967 -d 2m -growers 100 -livers 1000000
gc 1 @0.005s 6%: 0.24+0.83+0.89 ms clock, 3.4+0.11/2.2/0.049+12 ms cpu, 100->100->100 MB, 101 MB goal, 36 P
gc 2 @0.025s 5%: 0.18+2.7+1.4 ms clock, 2.3+0.16/17/2.8+18 ms cpu, 101->101->101 MB, 200 MB goal, 36 P
gc 3 @0.365s 2%: 1.2+15+5.5 ms clock, 33+0.005/123/174+143 ms cpu, 151->152->129 MB, 202 MB goal, 36 P
gc 4 @0.832s 2%: 3.7+32+14 ms clock, 59+1.2/279/517+226 ms cpu, 224->226->170 MB, 257 MB goal, 36 P
gc 5 @1.631s 4%: 7.4+55+31 ms clock, 209+23/435/1120+895 ms cpu, 315->317->228 MB, 338 MB goal, 36 P
gc 6 @2.837s 4%: 12+100+49 ms clock, 259+24/853/1579+1035 ms cpu, 445->448->313 MB, 456 MB goal, 36 P
gc 7 @4.718s 4%: 20+152+77 ms clock, 550+19/1188/2633+2090 ms cpu, 606->625->436 MB, 621 MB goal, 36 P
gc 8 @7.194s 4%: 27+190+110 ms clock, 641+20/1588/3803+2533 ms cpu, 822->838->585 MB, 837 MB goal, 36 P
gc 9 @11.868s 4%: 38+214+132 ms clock, 1230+0/1769/5194+4235 ms cpu, 1121->1141->666 MB, 1141 MB goal, 36 P
gc 10 @18.384s 4%: 30+203+127 ms clock, 983+0/1642/4815+4084 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 11 @24.959s 3%: 44+204+124 ms clock, 1422+0/1681/4933+3980 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 12 @31.505s 3%: 30+214+128 ms clock, 580+0/1796/4596+2439 ms cpu, 1266->1276->656 MB, 1292 MB goal, 36 P
gc 13 @38.057s 3%: 46+212+129 ms clock, 1481+0/1747/5116+4147 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 14 @44.618s 3%: 41+200+134 ms clock, 1341+0/1650/4775+4312 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 15 @51.171s 3%: 34+252+138 ms clock, 656+0/2097/4271+2624 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 16 @57.653s 3%: 40+211+131 ms clock, 1298+0/1743/4642+4211 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 17 @64.204s 3%: 33+200+135 ms clock, 1087+0/1654/4892+4328 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 18 @70.745s 3%: 46+210+125 ms clock, 1478+0/1737/4750+4029 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 19 @77.308s 3%: 38+201+131 ms clock, 1238+0/1661/4877+4200 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 20 @83.855s 3%: 29+254+131 ms clock, 503+0/2093/4239+2227 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 21 @90.335s 3%: 45+204+131 ms clock, 1460+0/1682/4774+4201 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 22 @96.880s 3%: 38+203+129 ms clock, 1220+0/1681/4936+4139 ms cpu, 1266->1276->656 MB, 1293 MB goal, 36 P
gc 23 @103.423s 3%: 44+218+136 ms clock, 1418+0/1764/4832+4369 ms cpu, 1266->1286->666 MB, 1293 MB goal, 36 P
gc 24 @109.898s 3%: 43+204+127 ms clock, 1406+0/1684/4902+4083 ms cpu, 1266->1276->656 MB, 1292 MB goal, 36 P
gc 25 @116.454s 3%: 30+228+130 ms clock, 602+0/1922/4525+2602 ms cpu, 1266->1286->676 MB, 1293 MB goal, 36 P
458.58user 37.48system 2:00.36elapsed 412%CPU (0avgtext+0avgdata 21675600maxresident)k
0inputs+0outputs (0major+1055953minor)pagefaults 0swaps

The test program follows:

package main

import (
    "flag"
    "os"
    "time"
)

const (
    ballastSize   = 100 << 20
    garbageSize   = 10 << 20
    garbagePeriod = 100 * time.Millisecond

    stackSize   = 10 << 10
    stackPeriod = 5000 * time.Millisecond
    liveSize    = 1 << 10
)

var (
    ballast []byte
    garbage []byte
)

func churn() {
    ballast = make([]byte, ballastSize)

    for {
        time.Sleep(garbagePeriod)
        garbage = make([]byte, garbageSize)
    }
}

func run(ch chan struct{}) {
    for {
        grow(ch)
        <-ch
    }
}

//go:noinline
func grow(ch chan struct{}) *byte {
    var s [stackSize / 8]*byte
    <-ch
    return s[0]
}

func live(ch chan struct{}) {
    var s [liveSize]byte
    for {
        <-ch
        // read/write the stack, so it doesn't go unused
        for i := 0; i < len(s); i += 256 {
            s[i]++
        }
    }
}

func main() {
    d := flag.Duration("d", 10*time.Second, "Run duration")
    growers := flag.Int("growers", 1e4, "Number of goroutines with growing/shrinking stacks")
    livers := flag.Int("livers", 1e5, "Number of goroutines with permanent live stacks")
    flag.Parse()

    time.AfterFunc(*d, func() { os.Exit(0) })

    go churn()

    var chs []chan struct{}

    // some goroutines will periodically grow their stacks
    for i := 0; i < *growers; i++ {
        chs = append(chs, make(chan struct{}))
        go run(chs[len(chs)-1])
    }

    // some goroutines will have large live stacks, which they periodically access
    for i := 0; i < *livers; i++ {
        chs = append(chs, make(chan struct{}))
        go live(chs[len(chs)-1])
    }

    for range time.Tick(stackPeriod / 2) {
        for _, ch := range chs {
            ch <- struct{}{}
        }
    }
}
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Feb 19, 2016
@ianlancetaylor
Contributor

CC @RLH @aclements

@aclements
Member

This is rather unfortunate. Linux has gone and made non-local access not just kind of expensive but sometimes insanely expensive. I'm a little confused about why this doesn't make concurrent mark also insanely expensive, since it's all random access. I see your sweep termination times are also very high. That also surprises me. We hardly do anything during sweep termination. We certainly aren't hitting all of the stacks or anything like that.

Of course, the "right" solution to this is to be more NUMA friendly. That would improve things on NUMA systems whether or not they're still doing migration tricks. That may be necessary in the long term, but for now it would be great if we could just disable the migration.

@rhysh, you may be more familiar with this feature. Do you happen to know if there's a way to just turn it off? I didn't see anything obvious in mbind or set_mempolicy. If you don't know, I'll dig around.

@rhysh
Contributor Author

rhysh commented Feb 19, 2016

The migration can be disabled by binding the memory to particular nodes—even the set of all nodes. Within mpol_misplaced, if the page's policy is set to MPOL_BIND and the page is already on an allowed node, the kernel will leave it where it is. This is why numactl --membind 0,1 has the effect it does (the c4.8xlarge machine has just two nodes), even though it would logically seem to do nothing.

http://lxr.free-electrons.com/source/mm/mempolicy.c?v=3.13#L2353
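In other words, the relevant decision reduces to something like this (a simplified Go model of the mpol_misplaced check linked above; the constants and helper names are illustrative, not the kernel's):

```go
package main

import "fmt"

// Policy values loosely mirroring include/uapi/linux/mempolicy.h.
const (
	MPOL_DEFAULT = 0
	MPOL_BIND    = 2
)

// shouldMigrate reports whether the NUMA-balancing code would consider
// migrating a page that currently lives on pageNode. allowedNodes is a
// bitmask of nodes permitted by the memory policy. This is a sketch of
// the decision only, not the kernel's full heuristics.
func shouldMigrate(policy int, pageNode int, allowedNodes uint64) bool {
	if policy == MPOL_BIND && allowedNodes&(1<<uint(pageNode)) != 0 {
		// MPOL_BIND: the page is already on an allowed node, so the
		// kernel leaves it where it is.
		return false
	}
	// Other policies fall through to the "move toward the faulting
	// CPU's node" heuristics (not modeled here).
	return true
}

func main() {
	// Two-node machine, bound to both nodes (numactl --membind 0,1):
	fmt.Println(shouldMigrate(MPOL_BIND, 0, 3))    // false: never migrated
	fmt.Println(shouldMigrate(MPOL_DEFAULT, 0, 3)) // true: eligible to move
}
```

This is why binding to the full set of nodes is not a no-op: it changes which branch of the check the page hits, without restricting where memory may actually live.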

@aclements, is the GC able to bind all memory to the full set of NUMA nodes for the duration of the STW phases via set_mempolicy with MPOL_BIND?
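For reference, a process could make such a call itself. Here is a rough sketch (the raw syscall number, MPOL_BIND value, and helper names are my own illustration for linux/amd64, not a Go runtime API):

```go
// Sketch only: bind the process's memory policy to a set of NUMA nodes
// with set_mempolicy(2). The raw syscall number (238) and MPOL_BIND (2)
// are linux/amd64 values; other platforms differ.
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

const (
	sysSetMempolicy = 238 // SYS_SET_MEMPOLICY on linux/amd64
	mpolBind        = 2   // MPOL_BIND
)

// nodemask builds the bitmask argument: one bit per allowed NUMA node.
func nodemask(nodes ...uint) uint64 {
	var m uint64
	for _, n := range nodes {
		m |= 1 << n
	}
	return m
}

// bindToNodes applies MPOL_BIND for the given nodes to the whole process.
func bindToNodes(nodes ...uint) error {
	if runtime.GOOS != "linux" {
		return fmt.Errorf("set_mempolicy is Linux-only")
	}
	mask := nodemask(nodes...)
	// maxnode covers the 64 bits of our single-word mask.
	_, _, errno := syscall.Syscall(sysSetMempolicy, mpolBind,
		uintptr(unsafe.Pointer(&mask)), 64)
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Equivalent in spirit to `numactl --membind 0,1`:
	if err := bindToNodes(0, 1); err != nil {
		fmt.Println("bind failed:", err)
	}
}
```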

@aclements
Member

Ah, perfect. I hadn't put together that that was what you were accomplishing with the numactl. Yes, we should be able to set_mempolicy around the STW phases. However, even with the numactl, your pause times are still really high. Is that just because of the stack shrinking problem (which I have a CL series almost ready for)?

@aclements aclements self-assigned this Feb 19, 2016
@rhysh
Contributor Author

rhysh commented Feb 19, 2016

I was so happy to see you mention your concurrent stack shrinking change in the reddit AMA, and look forward to seeing its details! Did you end up pursuing in-place shrinking with a buddy allocator?

For the real application, I've been able to get pause times to around 10-15ms by disabling stack shrinking and confining the program to a single NUMA node (before I learned of the MPOL_BIND trick), when running with around 250,000 goroutines. It's great to know they'll only be temporary workarounds. Thanks!

Stack shrinking is turned off for both of the test outputs I shared in this bug. I haven't dug into why the pauses are still 130ms when stack shrinking and NUMA migration are both disabled. How fast do you expect the pauses to be for the program I've included when running with 1e6 goroutines with 1kB stacks? How big does a program have to be in any dimension before the 10ms goal no longer applies?

@aclements
Member

> Stack shrinking is turned off for both of the test outputs I shared in this bug. I haven't dug into why the pauses are still 130ms when stack shrinking and NUMA migration are both disabled. How fast do you expect the pauses to be for the program I've included when running with 1e6 goroutines with 1kB stacks? How big does a program have to be in any dimension before the 10ms goal no longer applies?

I haven't fixed the NUMA thing yet, but could you try https://go-review.googlesource.com/20700 with the MPOL_BIND trick?

How fast I would expect this to be depends on both how many Gs there are and how many Gs have run during concurrent mark. The series at CL 20700 completely eliminates the cost of idle Gs, which is currently something like 30--40ns per idle goroutine plus a small C*log(stack size). The other major known cost is rescanning of Gs that have run during concurrent mark, which is harder to quantify, but is a minimum of something like 250ns per goroutine and goes up from there depending on how much of the stack has been dirtied and its complexity (number of frames and pointers). This is actually the last significant O(something the user controls) STW bottleneck we know of and we have some thoughts on how to address it, but it's not easy.

@rhysh
Contributor Author

rhysh commented Mar 17, 2016

Hi @aclements — I tried out CL 20700 PS 1, and it's had a wonderful impact on mark termination times.

I ran my application against:

  1. go1.5.1
  2. the merge-base of CL 20700 PS 1 and master (a4e31d4 at the time, before you merged CL 20044)
  3. the parent of CL 20700 PS 1
  4. CL 20700 PS 1
  5. CL 20700 PS 1 with MPOL_BIND set to all numa nodes

Running on a two-socket host, when the application has GOMAXPROCS=16, GODEBUG=gctrace=1, ~60k goroutines, and ~200MB live heap, the sweep termination and mark termination pauses are as follows:

  1. 3.5ms and 42ms
  2. 0.70ms and 19ms # this data was collected several hours later than the others, so may not be directly comparable
  3. 0.17ms and 11ms
  4. 0.21ms and 1.3ms
  5. 0.085ms and 1.0ms

Disabling memory migration via MPOL_BIND (configuration 5 vs 4) still has a measurable effect on the STW times, but both configurations are well within the 10ms budget now. Thank you!

@aclements
Member

@rhysh, would you mind trying out tip and seeing if this still has a measurable and/or non-trivial effect?

Based on the effect of CL 20700 you reported, it seems like the NUMA migration was actually hitting us on access to metadata structures (like the g's). We've made a few changes since 20700 that I believe should reduce metadata scanning even more.

If we're still well below 10ms and the NUMA migration doesn't have a large effect any more, I think I'll go ahead and consider this fixed.

@aclements
Member

Closing because my understanding is that we no longer trigger the really bad NUMA migration behavior. But please reopen (or let me know and I'll reopen) if that's incorrect.

@golang golang locked and limited conversation to collaborators May 25, 2017