
runtime/pprof: regression in TestMemoryProfiler/debug=1 starting in April 2021 #46500

Closed
bcmills opened this issue Jun 1, 2021 · 25 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Contributor

bcmills commented Jun 1, 2021

2021-05-30T02:37:38-1607c28/linux-amd64-sid
2021-04-27T21:55:07-214c8dd/linux-amd64-nocgo
2021-04-27T21:44:16-645cb62/linux-amd64-nocgo

This test has otherwise been passing consistently since it was last fixed in November 2019, so this looks like a 1.17 regression (CC @golang/release).

2019 failures

2019-11-11T12:40:04-f07059d/linux-amd64-noopt
2019-11-10T20:36:44-47bc240/linux-amd64-noopt
2019-11-10T17:13:25-298be61/linux-amd64-noopt
2019-11-10T17:12:35-46c9fd0/linux-amd64-noopt
2019-11-10T17:12:15-40ebcfa/linux-amd64-noopt
2019-11-10T17:11:34-cd53fdd/linux-amd64-noopt
2019-11-10T13:41:45-4d4ddd8/linux-amd64-noopt
2019-11-10T12:12:46-9eb9c7b/linux-amd64-noopt
2019-11-10T04:23:22-e6fb39a/linux-amd64-noopt
2019-11-09T22:14:01-78d4560/linux-amd64-noopt
2019-11-09T20:08:06-29cfb4d/linux-amd64-noopt
2019-11-09T19:31:32-7148478/linux-amd64-noopt
2019-11-09T19:25:46-6e11195/linux-amd64-noopt
2019-11-09T05:51:04-a0262b2/linux-amd64-noopt
2019-11-09T00:36:15-bde1968/linux-amd64-noopt
2019-11-08T23:22:06-11da2b2/linux-amd64-noopt
2019-11-08T22:44:29-42db1da/linux-amd64-noopt
2019-11-08T21:32:23-0bbcce9/linux-amd64-noopt
2019-11-08T21:27:51-b7d097a/linux-amd64-noopt
2019-11-08T21:05:17-9ee6ba0/linux-amd64-noopt
2019-11-08T20:50:17-9e914f5/linux-amd64-noopt
2019-11-08T20:24:43-bababde/linux-amd64-noopt
2019-11-08T20:05:25-7a5e0fe/linux-amd64-noopt
2019-11-08T19:28:57-904e113/linux-amd64-noopt
2019-11-08T19:28:49-f6ff806/linux-amd64-noopt
2019-11-08T19:24:30-e6c12c3/linux-amd64-noopt
2019-11-08T18:55:44-b2b0992/linux-amd64-noopt
2019-11-08T18:00:31-c444ec3/linux-amd64-noopt
2019-11-08T17:56:35-a84ac18/linux-amd64-noopt
2019-11-08T17:01:32-4517c02/linux-amd64-noopt
2019-11-08T17:01:05-a5a6f61/linux-amd64-noopt
2019-11-08T17:00:57-dac936a/linux-amd64-noopt
2019-11-08T16:46:24-47232f0/linux-amd64-noopt
2019-11-08T16:44:48-374c284/linux-amd64-noopt
2019-11-08T16:35:48-ffb5646/linux-amd64-noopt
2019-11-08T15:10:39-52aebe8/linux-amd64-noopt
2019-09-19T20:26:22-1c50fcf/netbsd-arm-bsiegert

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 labels Jun 1, 2021
@bcmills bcmills added this to the Go1.17 milestone Jun 1, 2021
@bcmills
Contributor Author

bcmills commented Jun 2, 2021

CC @cherrymui for runtime/pprof.

@heschi heschi removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2021
@toothrot
Contributor

@cherrymui Following up, as we're approaching RC1 and this is a release-blocker.

@prattmic
Member

All of these failures are because we expect an entry like:

0: 0 [1: 2097152] @ 0x52cb88 0x49ed54 0x49e485 0x52cca7 0x52cec5 0x4c5e22 0x466121
            #	0x52cb87	runtime/pprof.allocateReflectTransient+0x27	/workdir/go/src/runtime/pprof/mprof_test.go:56

but get one like

1: 2097152 [1: 2097152] @ 0x52cb88 0x49ed54 0x49e485 0x52cca7 0x52cec5 0x4c5e22 0x466121
            #	0x52cb87	runtime/pprof.allocateReflectTransient+0x27	/workdir/go/src/runtime/pprof/mprof_test.go:56

The only mismatches are the first two numbers: we want 0 and 0, but get 1 and 2097152.

The first number is InUseObjects() => AllocObjects - FreeObjects. The second is InUseBytes() => AllocBytes - FreeBytes.
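
For reference, here's what those two helpers boil down to, as a minimal runnable sketch using the real runtime.MemProfileRecord type (the numbers are lifted from the output above, purely for illustration):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// A record for the allocateReflectTransient stack once its allocation has
	// been swept and its free recorded.
	r := runtime.MemProfileRecord{
		AllocBytes:   2097152,
		FreeBytes:    2097152,
		AllocObjects: 1,
		FreeObjects:  1,
	}
	// InUseObjects() is AllocObjects-FreeObjects; InUseBytes() is
	// AllocBytes-FreeBytes. This prints "0: 0 [1: 2097152]", the entry the
	// test expects. If the free has not been recorded yet, the first two
	// numbers are still 1 and 2097152, which is exactly the failure above.
	fmt.Printf("%d: %d [%d: %d]\n",
		r.InUseObjects(), r.InUseBytes(), r.AllocObjects, r.AllocBytes)
}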

Since these are "transient" allocations and the test runs GC, we are expecting to see the allocations freed. The frees are recorded during sweep, so at first glance this would look like another case of #45315. However, this started failing after that was fixed, so I suspect that something in http://golang.org/cl/307915 or http://golang.org/cl/307916 is subtly broken and triggering this.

cc @mknyszek

@prattmic
Member

FWIW, I've been unable to reproduce this locally so far.

@mknyszek
Contributor

I think @prattmic is right and that this is probably a regression. Looking into it.

@mknyszek
Contributor

I ran the full suite of runtime/pprof tests overnight and I reproduced it. Unfortunately, because of limited terminal scrollback, I don't actually know how many executions it took...

Specifically, I ran:

CGO_ENABLED=0 go test -short -count=1 runtime/pprof

On the hunch that CGO_ENABLED=0 somehow makes the regression reproduce more easily.

@mknyszek
Contributor

OK, I can probably shorten the time to reproduce: TestMemoryProfiler is the third test to be executed. The only tests that could be causally influencing the failure are the two that run before it, and together they take much less time to run.

@toothrot
Contributor

@mknyszek We're close to the RC date for Go 1.17. Just a friendly ping.

@mknyszek
Contributor

I was able to reproduce again. I added a check for some potential issues with how reflect changed in Go 1.17, and I think I've successfully ruled that out.

I captured the output, so I can say now that it took about 2 hours of continuously running the full runtime/pprof test suite to hit it. I think I can actually stress test this into something useful, but now I need to figure out what I want to learn.

@mknyszek
Contributor

FWIW, I'm not 100% sure if this should be an RC blocker. What this test failure means is that there's a very rare chance that a heap profile ends up stale, specifically in the case of calling runtime.GC. That's not great, and I will continue to try to stress test it and fix it (builders should be green!), but it does not critically impact users of Go.

@mknyszek
Contributor

Got another reproducer, while running only the first 3 tests in the package in a loop. 206267 executions at ~0.077s per execution... about 4 hours to reproduce.

But hey, this time, I got a GC trace! And there's something very peculiar about this. The GC trace for the failing test is the only one that actually has a forced GC! You'd think that every single execution would have a forced GC, but that's not true at all, as it turns out.

I wonder if I'll be able to reproduce this more easily by adding a sleep, to make sure another GC cycle doesn't stomp on the forced GC.

@mknyszek
Contributor

Added a 1 second sleep before runtime.GC and now I have an instant reproducer. :)

@mknyszek
Contributor

Oh, wait. It occurs to me that because time.Sleep allocates, it might just be breaking the test.

@mknyszek
Contributor

Yeah, the time import changes the line numbers. Sigh.

@mknyszek
Contributor

False alarm. STDERR output for the test itself was hidden, so while I have a GC trace, the forced GC is not unique.

@dmitshur
Contributor

dmitshur commented Jul 1, 2021

Thanks for investigating this, Michael. In a release meeting, we discussed that we're primarily looking to understand the failure mode before RC 1, if possible. Then we can make a better decision about it.

@mknyszek
Contributor

mknyszek commented Jul 1, 2021

I'm slowly gathering information, but given that it takes hours to reproduce, this is going to take a while.

@mknyszek
Contributor

mknyszek commented Jul 1, 2021

Alrighty, new update: I've got a ~3 minute feedback loop now. I'm using debuglog to slowly whittle down the possibilities. Hopefully should have something soon. If I start floundering, I'll start bisecting.

@mknyszek
Contributor

mknyszek commented Jul 1, 2021

I think I've confirmed this is a subtle bug in the isSweepDone condition.

I've got the following output, annotated for clarity:

/tmp/go-stress-20210701T205000-498548926
runtime/pprof.allocateReflectTransient <--- the profile stack entry that was incorrect.
>> begin log 1 <<
[0.032741835 P 2] PostSweep 6 3 3 2 <--- heap profile published in runtime.GC. GC cycle is #3, forced GC was triggered during GC cycle #2.
>> begin log 0 <<
[0.033512719 P 0] freed 6 3 <--- the allocation created by runtime/pprof.allocateReflectTransient is actually freed
[0.033651934 P 2] MemProfile 6 3 <--- the test grabs proflock and copies the profile out

I think I might know what the problem is.

Looking at sweepone as an example of sweepLocker usage, it appears that sweepDrained can be set by sweepone before another, concurrent, instance of sweepone (or similar) calls tryAcquire. That tryAcquire is what actually blocks completion. Consider the following events:

  1. Some sweeper pulls the last span for sweeping out of the list of spans that need sweeping. It has not yet acquired ownership of the span for sweeping.
  2. Some other sweeper notices there's no more work. Hooray! They mark sweepDrained. However note that at this very point, mheap_.sweepers == 0, because the last span hasn't actually been acquired for sweeping.
  3. runtime.GC is looping, checking isSweepDone. It happens to fire right at this moment, so it continues on to publish the heap profile.
  4. Then the last span actually gets swept.

Lo and behold, we've missed a free in the published heap profile.
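
To make the window concrete, here's a simplified, self-contained model of the flawed condition (illustrative only; these are not the real runtime fields or functions): the "drained" flag and the in-flight sweeper count live in two separate words, and nothing orders popping the last span against marking the list drained.

package main

import (
	"fmt"
	"sync/atomic"
)

type sweepState struct {
	sweepDrained uint32 // set once the unswept list is observed empty
	sweepers     uint32 // number of sweeps currently in flight
}

// isSweepDone mirrors the flawed check: each load is atomic on its own, but a
// sweeper can pop the last span (step 1 above) before registering itself in
// sweepers, while someone else marks sweepDrained (step 2).
func (s *sweepState) isSweepDone() bool {
	return atomic.LoadUint32(&s.sweepDrained) != 0 &&
		atomic.LoadUint32(&s.sweepers) == 0
}

func main() {
	var s sweepState
	// Step 2: a sweeper finds the unswept list empty and marks it drained,
	// even though the sweeper from step 1 still holds an unswept span.
	atomic.StoreUint32(&s.sweepDrained, 1)
	// Step 3: runtime.GC's wait loop happens to check right now and moves on
	// to publish the heap profile, missing the free from step 4.
	fmt.Println(s.isSweepDone()) // true, prematurely
}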

Assuming this is actually the problem (I will continue to try to confirm this), I don't think it indicates a larger issue. The problem of isSweepDone conflating "no more spans to sweep" with "no more outstanding sweeps" already existed in earlier releases. Austin's CLs were trying to fix this, and I believe they succeeded in making these races less likely, eliminating the flakiness of another test Austin wrote in the 1.17 cycle. However, it appears that this test still manages to hit the remaining race window.

When Austin was working on fixing this, we discussed how this condition wasn't actually problematic for GC correctness: the GC ensures all outstanding sweeps are complete because it needs to stop the world to begin the next mark phase. Sweeping always prevents preemption -- this is necessary for a much broader sense of correctness -- so a new GC will only start once all outstanding sweeps have completed.

@dmitshur As a result, I don't think this should block the RC, but I think this should be fixed prior to release. Ultimately, the worst it can do is make some tests (particularly ones that rely on runtime.GC) flaky.

@mknyszek
Contributor

mknyszek commented Jul 1, 2021

I'm currently testing my theory by adding an extra blockCompletion call in sweepone, just after newSweepLocker is created. There's a risk that this could cause the isSweepDone condition to flap, but because sweepone bails out early if isSweepDone is already true, it can't actually happen, for this case anyway. AFAICT, this is the only path that matters for this test... a real fix may need more, more on that later...

The good news is that it's been 30 minutes and nothing has failed yet.

This ensures that mheap_.sweepers is always >0 before we grab that last span out of the list (i.e. before mheap_.nextSpanForSweep()). If someone else then marks sweepDrained, the sweeper holding that last span is still accounted for until it actually finishes. As a result, anyone checking isSweepDone will never observe it switch to true prematurely.

Unfortunately, I think this is a hacky fix. Even with the isSweepDone check, there's still a small window where the condition could flap. Consider this sequence of events:

  1. A sweeper sees that sweepDrained is not set, so isSweepDone is false. There are no other outstanding sweepers.
  2. Another sweeper notices that there's nothing left to sweep, so it sets sweepDrained. isSweepDone is now observable as true.
  3. The first sweeper increments mheap_.sweepers because it passed the check. isSweepDone is now temporarily observable as false.
  4. The first sweeper decrements mheap_.sweepers and now isSweepDone is true again.

Basically, what we need to guarantee is that:

  1. mheap_.sweepers is incremented before any pop operation from the unswept lists.
  2. mheap_.sweepers is incremented only if there are more spans to be swept.

I think that these might be two contradictory conditions. I need to think about this more, though I'm certain there's a clean resolution to all this.

@mknyszek
Contributor

mknyszek commented Jul 2, 2021

2.5 hours later and this narrower window has prevented a test failure. I think this is our culprit.

UPDATE: 20.5 hours later, and still no failure.

@mknyszek
Contributor

mknyszek commented Jul 8, 2021

I've been thinking about this more. I think the right fix is to make mheap_.sweepers and sweepDrained update together. Consider a design where they're packed into the same uint32, with the top bit reserved as the sweepDrained boolean and the rest of the bits for mheap_.sweepers (2 billion concurrent sweepers should be enough for anyone :P).

Then it gets manipulated in the following way:

  1. A potential sweeper is considering doing some sweeping. They read the combined value, and check sweepDrained. If it's set, then don't sweep. There's nothing to do.
  2. Otherwise, they enter a CAS loop where they increment sweepers, backing out if they ever observe sweepDrained as set.
  3. If they successfully CAS, they can go on and pop a span and sweep. If they notice that there's nothing to sweep, they CAS-loop on setting sweepDrained and decrementing sweepers, then return.
  4. Otherwise, acquire the span they popped for sweeping, sweep the span and whatever else, then CAS-loop to decrement sweepers.

isSweepDone is then just an atomic load, then a comparison with 0x80000000. The condition will never flap, because a sweeper is guaranteed to never increment sweepers once sweepDrained is set. Existing sweepers or CAS-loopers will simply decrement and drain sweepers as intended.
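
Here's a rough, runnable sketch of that packed-value scheme (names and details are illustrative, not the actual runtime code or the eventual CL; the design above folds "set sweepDrained" and "decrement sweepers" into one CAS for the sweeper that drains the list, which I've split into two calls here for brevity -- still flap-free, because beginSweep refuses to increment once the bit is set):

package main

import (
	"fmt"
	"sync/atomic"
)

const sweepDrainedMask = 1 << 31 // top bit: no more spans to pop

// state packs the sweepDrained flag (top bit) and the count of in-flight
// sweepers (remaining bits) into a single word.
var state uint32

// beginSweep registers a new sweeper. It fails once sweepDrained is set, so
// the sweeper count can never grow again after draining.
func beginSweep() bool {
	for {
		s := atomic.LoadUint32(&state)
		if s&sweepDrainedMask != 0 {
			return false // nothing left to sweep
		}
		if atomic.CompareAndSwapUint32(&state, s, s+1) {
			return true
		}
	}
}

// endSweep retires a sweeper.
func endSweep() {
	atomic.AddUint32(&state, ^uint32(0)) // decrement the count
}

// markDrained records that the unswept lists are empty.
func markDrained() {
	for {
		s := atomic.LoadUint32(&state)
		if atomic.CompareAndSwapUint32(&state, s, s|sweepDrainedMask) {
			return
		}
	}
}

// isSweepDone is a single atomic load and comparison: drained, zero sweepers.
func isSweepDone() bool {
	return atomic.LoadUint32(&state) == sweepDrainedMask
}

func main() {
	fmt.Println(isSweepDone()) // false: not drained yet
	beginSweep()               // a sweeper acquires the last span
	markDrained()              // the unswept list is now empty
	fmt.Println(isSweepDone()) // still false: one sweeper is outstanding
	endSweep()
	fmt.Println(isSweepDone()) // true, and it can never flap back to false
}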

There's one more caveat here with reclaimers that don't pop from the list but do acquire spans for sweeping. They need only be accounted for in sweepers, so I think they can just do the same as above.

@aclements does this sound right to you?

My one concern here is contention due to CAS-looping. I think it should be relatively OK because this happens on the allocation slow path (the first slow path, refilling spans), though there are a number of other potential sweepers (reclaimers, proportional sweepers, or the background sweeper). I guess we'll just have to benchmark it.

@mknyszek
Contributor

mknyszek commented Jul 8, 2021

I've sketched out a fix at https://golang.org/cl/333389 (be warned: it may not even compile, I didn't try) and that seems way too big for this release.

I think we should just add an extra runtime.GC call and a TODO to prevent this test from flaking for the release.
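
Roughly, the stop-gap amounts to the following in the test body (a sketch of the idea only, not the actual diff; the helper name is made up):

package main

import "runtime"

// flushFrees is a hypothetical helper showing the shape of the workaround:
// run two GC cycles back to back. The second cycle cannot begin until every
// sweeper from the first has finished (starting a cycle stops the world), so
// once it returns, the frees from the first cycle are guaranteed to be
// reflected in the heap profile.
func flushFrees() {
	runtime.GC()
	// TODO: remove the second call once the sweep termination condition is
	// fixed (see the discussion above).
	runtime.GC()
}

func main() {
	flushFrees()
}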

@gopherbot

Change https://golang.org/cl/333389 mentions this issue: runtime: fix sweep termination condition

@gopherbot

Change https://golang.org/cl/333549 mentions this issue: runtime/pprof: call runtime.GC twice in memory profile test

gopherbot pushed a commit that referenced this issue Oct 29, 2021
Currently, there is a chance that the sweep termination condition could
flap, causing e.g. runtime.GC to return before all sweep work has not
only been drained, but also completed. CL 307915 and CL 307916 attempted
to fix this problem, but it is still possible that mheap_.sweepDrained is
marked before any outstanding sweepers are accounted for in
mheap_.sweepers, leaving a window in which a thread could observe
isSweepDone as true before it actually was (and after some time it would
revert to false, then true again, depending on the number of outstanding
sweepers at that point).

This change fixes the sweep termination condition by merging
mheap_.sweepers and mheap_.sweepDrained into a single atomic value.

This value is updated such that a new potential sweeper will increment
the outstanding sweeper count iff there are still outstanding spans to be
swept without an outstanding sweeper to pick them up. This design
simplifies the sweep termination condition into a single atomic load and
comparison and ensures the condition never flaps.

Updates #46500.
Fixes #45315.

Change-Id: I6d69aff156b8d48428c4cc8cfdbf28be346dbf04
Reviewed-on: https://go-review.googlesource.com/c/go/+/333389
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@golang golang locked and limited conversation to collaborators Jun 23, 2023