runtime: GC's mark termination STW time extended by Linux NUMA migration #14406
CC @RLH @aclements
This is rather unfortunate. Linux has gone and made non-local access not just kind of expensive but sometimes insanely expensive. I'm a little confused about why this doesn't make concurrent mark also insanely expensive, since it's all random access. I see your sweep termination times are also very high. That also surprises me; we hardly do anything during sweep termination, and we certainly aren't hitting all of the stacks or anything like that. Of course, the "right" solution to this is to be more NUMA friendly. That would improve things on NUMA systems whether or not they're still doing migration tricks. That may be necessary in the long term, but for now it would be great if we could just disable the migration. @rhysh, you may be more familiar with this feature. Do you happen to know if there's a way to just turn it off? I didn't see anything obvious in mbind or set_mempolicy. If you don't know, I'll dig around.
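(One machine-wide knob worth noting, as an aside not raised in the thread: on kernels with automatic NUMA balancing, Linux 3.8 and later, the feature can be switched off entirely through the kernel.numa_balancing sysctl. A minimal sketch, assuming root and that the sysctl is present on the machine:)

```go
// Sketch: disable automatic NUMA balancing machine-wide by writing 0
// to the numa_balancing sysctl. Assumes Linux >= 3.8 with
// CONFIG_NUMA_BALANCING and root privileges.
package main

import (
	"fmt"
	"os"
)

func main() {
	// "0" turns off the page-migration heuristics behind
	// do_numa_page/migrate_misplaced_page for every process.
	err := os.WriteFile("/proc/sys/kernel/numa_balancing", []byte("0\n"), 0644)
	if err != nil {
		fmt.Fprintln(os.Stderr, "disabling NUMA balancing:", err)
		os.Exit(1)
	}
}
```

(The equivalent one-liner is echo 0 > /proc/sys/kernel/numa_balancing; note it affects the whole machine, not just the Go process.)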
The migration can be disabled by binding the memory to particular nodes—even the set of all nodes; see http://lxr.free-electrons.com/source/mm/mempolicy.c?v=3.13#L2353. @aclements, is the GC able to bind all memory to the full set of NUMA nodes for the duration of the STW phases via set_mempolicy(2)?
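(A minimal sketch of what such a call could look like, assuming linux/amd64 and using the raw syscall; the MPOL_BIND value is copied from <linux/mempolicy.h>. This is an illustration, not what the runtime does:)

```go
// Sketch of binding a process's memory to a set of NUMA nodes with
// set_mempolicy(2), which prevents automatic page migration even when
// the mask names every node. linux/amd64 only.
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

const mpolBind = 2 // MPOL_BIND from <linux/mempolicy.h>

// bindMemoryToNodes sets the calling process's memory policy to
// MPOL_BIND over the given node mask.
func bindMemoryToNodes(nodemask uint64) error {
	// set_mempolicy(MPOL_BIND, &nodemask, maxnode)
	_, _, errno := syscall.Syscall(syscall.SYS_SET_MEMPOLICY,
		mpolBind,
		uintptr(unsafe.Pointer(&nodemask)),
		64) // the mask is a single 64-bit word
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Bind to nodes 0 and 1 — the full set on a two-socket machine.
	if err := bindMemoryToNodes(0x3); err != nil {
		fmt.Println("set_mempolicy:", err)
	}
}
```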
Ah, perfect. I hadn't put together that that was what you were accomplishing with the numactl. Yes, we should be able to set_mempolicy around the STW phases. However, even with the numactl, your pause times are still really high. Is that just because of the stack shrinking problem (which I have a CL series almost ready for)?
I was so happy to see you mention your concurrent stack shrinking change in the reddit AMA, and look forward to seeing its details! Did you end up pursuing in-place shrinking with a buddy allocator? For the real application, I've been able to get pause times to around 10–15ms by disabling stack shrinking and confining the program to a single NUMA node (before I learned of the MPOL_BIND trick), when running with around 250,000 goroutines. It's great to know they'll only be temporary workarounds. Thanks! Stack shrinking is turned off for both of the test outputs I shared in this bug. I haven't dug into why the pauses are still 130ms when stack shrinking and NUMA migration are both disabled. How fast do you expect the pauses to be for the program I've included when running with 1e6 goroutines with 1kB stacks? How big does a program have to be, in any dimension, before the 10ms goal no longer applies?
I haven't fixed the NUMA thing yet, but could you try https://go-review.googlesource.com/20700 with the MPOL_BIND trick? How fast I would expect this to be depends on both how many Gs there are and how many Gs have run during concurrent mark. The series at CL 20700 completely eliminates the cost of idle Gs, which is currently something like 30–40ns per idle goroutine plus a small C*log(stack size). The other major known cost is rescanning Gs that have run during concurrent mark, which is harder to quantify, but it's a minimum of something like 250ns per goroutine and goes up from there depending on how much of the stack has been dirtied and its complexity (number of frames and pointers). This is actually the last significant O(something the user controls) STW bottleneck we know of, and we have some thoughts on how to address it, but it's not easy.
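(Back-of-envelope, using the figures above: 1e6 idle goroutines at ~35ns each is roughly 35ms of STW work, past the 10ms budget from idle goroutines alone, and exactly the cost the CL 20700 series eliminates; 1e6 goroutines that all ran during concurrent mark would add at least 1e6 × 250ns ≈ 250ms of rescanning.)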
Hi @aclements — I tried out CL 20700 PS 1, and it's had a wonderful impact on mark termination times. I ran my application against several configurations.
Running on a two-socket host, when the application has GOMAXPROCS=16, GODEBUG=gctrace=1, ~60k goroutines, and ~200MB live heap, the sweep termination and mark termination pauses are much improved. Disabling memory migration via MPOL_BIND (configuration 5 vs 4) still has a measurable effect on the STW times, but both configurations are well within the 10ms budget now. Thank you!
@rhysh, would you mind trying out tip and seeing if this still has a measurable and/or non-trivial effect? Based on the effect of CL 20700 you reported, it seems like the NUMA migration was actually hitting us on access to metadata structures (like the g's). We've made a few changes since 20700 that I believe should reduce metadata scanning even more. If we're still well below 10ms and the NUMA migration doesn't have a large effect any more, I think I'll go ahead and consider this fixed.
Closing because my understanding is that we no longer trigger the really bad NUMA migration behavior. But please reopen (or let me know and I'll reopen) if that's incorrect.
I have a process that uses a large number of goroutines and sees large mark termination STW times (this is the same program described in #12967). After disabling stack shrinking, it still has large (and variable) mark termination times. The program runs on linux/amd64 with Linux kernel version 3.13.
I've profiled with perf after enabling GOEXPERIMENT=framepointer, so I can see full call stacks of the GC and kernel. This profiling indicates that 1) stack shrinking significantly increases pause time (#12967), and 2) with stack shrinking disabled via GODEBUG=gcshrinkstackoff=1, the mark termination phase includes many calls to the kernel's page_fault function, which six frames deeper leads to do_numa_page and then to migrate_misplaced_page.
It appears that the garbage collector makes enough accesses to memory to trick Linux's NUMA balancing logic into moving physical memory pages closer to the GC workers that are accessing them. This expensive migration increases GC pause times when it happens during mark termination. I suspect that the affinity that Gs have to Ps, and that Ps have to CPU cores, means that the mutator and GC fight back and forth over where the pages should be placed.
Setting the process's memory policy to MPOL_BIND via either mbind(2) or set_mempolicy(2) shows a significant reduction in—and improvement in consistency of—mark termination time, and effectively eliminates time spent on page faults during mark termination.
I have a small reproducer for this, included below (it's based on the reproducer for #12967, but doesn't have buggy use of WaitGroups, includes a large number of mostly-idle goroutines, and disables go1.6's improved inlining). It shows very consistent GC timings when the process's memory has been marked as MPOL_BIND via numactl, even when it's bound to all available nodes.
The included test output was recorded on a c4.8xlarge EC2 instance.
I'd expect the Go GC to be able to indicate to the kernel that the memory accesses it's making are temporary and somewhat random, advising the kernel not to migrate memory to match those access patterns.
Here 100 goroutines grow and shrink their stack requirements while 1,000,000 others read/write small pieces of their stacks without making function calls. Stack shrinking is disabled. Note the mark termination times of between 121ms and 1010ms, 22 million page faults, and 1316 seconds of system time over the 120-second test run.
Here's the same test, but run with numactl --membind 0,1. Note the mark termination times tightly grouped between 124ms and 138ms, 1 million page faults, and only 37 seconds of system time. (I'm interested in the time spent on page faults, not their number, but this is what time provides.)
The test program follows: