runtime: heap goal overrun due to scheduler delays in mark termination #52433

Open
mknyszek opened this issue Apr 19, 2022 · 2 comments
@mknyszek
Contributor

Summary

In tests that have goroutines which heap-allocate and do nothing else, I've observed these goroutines allocate beyond the heap goal by up to tens of MiB. This happens because of an inopportune scheduling delay, by either the OS scheduler or the Go scheduler, of the thread trying to terminate the GC's mark phase, a window during which simply assisting the GC with scan work is not enough to hold allocation back.

This is unlikely to affect real Go programs, but when it does, it is more likely to affect those with smaller heaps.

Full explanation

While writing a test for the implementation of #48409, I noticed that the memory limit could be exceeded.

The test in question has a single goroutine allocating heap memory in a loop, and a goroutine checking stats to ensure the memory limit is maintained. Occasionally, with a target limit of 64 MiB, the runtime could exceed the memory limit. Raising the limit to 512 MiB or more effectively mitigated the problem.
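
Roughly, the test has this shape (a minimal sketch, not the actual test code; the 64 MiB limit, allocation sizes, and the use of HeapAlloc as a rough proxy for runtime memory use are all illustrative assumptions):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

var sink []byte // package-level so the allocations aren't optimized away

func main() {
	const limit = 64 << 20 // 64 MiB, matching the limit mentioned above
	debug.SetMemoryLimit(limit)

	done := make(chan struct{})
	go func() {
		// Checker goroutine: poll stats and report apparent overruns.
		var m runtime.MemStats
		for {
			select {
			case <-done:
				return
			default:
				runtime.ReadMemStats(&m)
				if m.HeapAlloc > limit {
					fmt.Printf("over limit: HeapAlloc = %d MiB\n", m.HeapAlloc>>20)
				}
				time.Sleep(time.Millisecond)
			}
		}
	}()

	// Allocating goroutine (here, main): heap-allocate in a loop and do
	// nothing else with the memory.
	for i := 0; i < 1_000_000; i++ {
		sink = make([]byte, 1<<16) // 64 KiB per allocation
	}
	close(done)
}
```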

Digging deeper, I discovered the cause: the runtime was allowing the allocating goroutine to exceed the heap goal. GC assists weren't kicking in because there was no mark work left to do anyway. With more logging I found that the GC mark worker, which was in mark termination, would disappear entirely for 1-3 ms when dropping into a sleep with a target of 100µs. Replacing that sleep with an osyield reproduced the same issue. I could also reproduce it by removing the sleep entirely and having the worker spin, though the point at which the thread disappeared was then more variable. The problem was easier to reproduce with multiple instances running, but in all cases I made sure the CPU hardware still had plenty of breathing room (about 50% headroom in number of cores).
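
For illustration, here is a standalone sketch of the effect (my own illustration, not runtime code; time.Sleep is not the mechanism the mark worker uses internally, but the phenomenon being measured, a thread disappearing for far longer than the requested 100µs, is the same):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var worst time.Duration
	for i := 0; i < 10000; i++ {
		start := time.Now()
		time.Sleep(100 * time.Microsecond)
		// Track the worst overshoot; on a loaded machine this can land
		// in the millisecond range.
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Println("worst observed wake-up latency for a 100µs sleep:", worst)
}
```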

When investigating the spinning case, I also noticed that if a GC mark worker running the mark termination algorithm notices that there's still work to do, it will by default park its goroutine. By the time that worker woke back up and ran mark termination again, the allocating goroutine would have blown past the heap goal. This was another common source of overrun.

Outside of this parking issue, however, I suspected an OS scheduler latency issue (this was on Linux). To confirm this, I traced OS scheduler events (perf sched record -k CLOCK_MONOTONIC) and lined them up with my logging to discover that the thread was indeed getting context-switched away for a couple of milliseconds in at least the osyield and spinning cases (I haven't confirmed the original sleep case, but the symptoms are identical). It appears that due to kernel workers and background processes, the system could become transiently oversubscribed, and tasks could get kicked across CPUs, incurring this degree of latency.

But this isn't necessarily a problem on the OS's side; OS schedulers are always going to have some latency. The more fundamental issue is that the runtime isn't always robust to this kind of external latency.

Aside from this test I wrote, the Go team has observed two other tests occasionally failing with mysterious heap overrun issues: #37331 and #49564. I suspect, but have not confirmed yet, that these flakes arise from precisely the same issue. Both involve tests that perform heap allocations in a loop.

Note that in practice this kind of heap overrun is rarely, if ever, an issue. In order to allocate at this rate, the application would have to do nothing with that memory, not even write to it or read from it, which is pointless. The overrun is also proportional to these external latency factors, which are effectively constant, so this is typically a lot less noticeable for larger heaps (i.e. going over by a few MiB is a much bigger fraction of the heap goal for 64 MiB than for 1 GiB).
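
To put rough, purely illustrative numbers on it (the allocation rate is an assumption, not a measurement from these tests): an allocator sustaining on the order of 1 GiB/s will allocate about 2 MiB during a 2 ms scheduling delay of the kind described above. That is roughly 3% of a 64 MiB heap goal, but only about 0.2% of a 1 GiB goal.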

Mitigation

The short-term mitigation for tests is simply to allocate at a slower rate, reducing the amount of possible overrun (a sketch of what that looks like follows the list below). Long-term, the runtime should be more robust to these situations. A few ideas on how to do that:

  • For the situation where a mark worker parks after failing to terminate the mark phase, forcing allocating goroutines to assist more often would actually help, because assists can trigger mark termination as well; they can take over from the parked worker.
  • @aclements notes that typically, wait-free algorithms are used to combat these sorts of situations. A wait-free distributed mark termination algorithm is a tall order, but that would tackle this particular situation directly, as several goroutines could be helping each other through mark termination, as opposed to one goroutine coordinating it at a time.
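
As a concrete illustration of the short-term mitigation mentioned above, here is a minimal sketch (assumed shape and sizes, not code from any particular test) of an allocation loop slowed down with runtime.Gosched so a delayed GC worker can finish mark termination before the heap goal is badly overrun:

```go
package main

import "runtime"

var sink [][]byte // keep allocations reachable so the loop really grows the heap

func main() {
	for i := 0; i < 1000; i++ {
		sink = append(sink, make([]byte, 64<<10)) // 64 KiB per iteration (illustrative)
		runtime.Gosched()                         // yield: give a delayed GC worker a chance to run mark termination
	}
}
```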

CC @prattmic @aclements

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 19, 2022
@mknyszek mknyszek added this to the Backlog milestone Apr 19, 2022
@mknyszek mknyszek self-assigned this Apr 19, 2022
@bcmills
Contributor

bcmills commented Apr 20, 2022

(This might also explain #51019, although it doesn't really explain why that would occur on the -11 builder and not the -10.)

@gopherbot

Change https://go.dev/cl/408825 mentions this issue: test/heapsampling.go: slow down allocation rate and reduce iterations

gopherbot pushed a commit that referenced this issue May 27, 2022
As far as I can tell, this test suffers from #52433. For some reason,
this seems to be more of a problem on windows/386 than anywhere else.
This CL is an attempt at a mitigation: it slows down the allocation
rate by inserting a runtime.Gosched call in the inner loop. It also
cuts the iteration count, which should help too (less memory is
allocated in total), but the main motivation there is to make sure the
test doesn't take too long to run.

Fixes #49564.

Change-Id: I8cc622b06a69cdfa66f680a30e1ccf334eea2164
Reviewed-on: https://go-review.googlesource.com/c/go/+/408825
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022