runtime/pprof: regression in TestMemoryProfiler/debug=1 starting in April 2021 #46500
CC @cherrymui for
@cherrymui Following up, as we're approaching RC1 and this is a release-blocker.
All of these failures are because we expect an entry like:
but get one like:
The only mismatches are the first two numbers (the in-use object and byte counts): want 0, got 1 and 2097152. Since these are "transient" allocations and the test runs GC, we expect to see the allocations freed. The frees are recorded during sweep, so at first glance this would look like another case of #45315. However, this started failing after that was fixed, so I suspect that something in http://golang.org/cl/307915 or http://golang.org/cl/307916 is subtly broken and triggering this. cc @mknyszek
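For reference, a minimal sketch (not the actual test code) of producing a debug=1 heap profile the way the test does, forcing a GC first; the first two numbers of each sample are the in-use object and byte counts, which is why the expected entry has them at 0:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

var sink []byte

func main() {
	// Make a "transient" allocation large enough to be sampled by the
	// memory profiler, then drop the reference.
	sink = make([]byte, 2<<20)
	sink = nil

	// Force a GC so the allocation should be swept and its free recorded
	// in the profile before we read it.
	runtime.GC()

	// With debug=1, each sample is printed as
	//   inuseObjects: inuseBytes [allocObjects: allocBytes] @ call stack
	// so the first two numbers should drop to 0 once the free lands.
	pprof.Lookup("heap").WriteTo(os.Stdout, 1)
}
```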
FWIW, I've been unable to reproduce this locally so far.
I think @prattmic is right and that this is probably a regression. Looking into it.
I ran the full suite of runtime/pprof tests. Specifically, I ran:
On the hunch that
OK, I can probably shorten the time to reproduce: TestMemoryProfiler is the third test to be executed. The only tests that could be causally influencing the failure are the two tests before it. Together, they take much less time to run.
@mknyszek We're close to the RC date for Go 1.17. Just a friendly ping.
I was able to reproduce again. I added a check for some potential issues with how reflect changed in Go 1.17, and I think I've successfully ruled that out. I captured the output, so I can say now that it took about 2 hours of continuously running the full runtime/pprof test suite in a loop to reproduce.
FWIW, I'm not 100% sure if this should be an RC blocker. What this test failure means is that there's a very rare chance that a heap profile ends up stale, specifically in the case of calling runtime.GC and then immediately reading the heap profile.
Got another reproducer, while running only the first 3 tests in the package in a loop. 206267 executions at ~0.077s per execution... about 4 hours to reproduce. But hey, this time, I got a GC trace! And there's something very peculiar about this. The GC trace for the failing test is the only one that actually has a forced GC! You'd think that every single execution would have a forced GC, but that's not true at all, as it turns out. I wonder if I'll be able to reproduce this more easily by adding a sleep, to make sure another GC cycle doesn't stomp on the forced GC.
Added a 1 second sleep before
Oh, wait. It occurs to me that because
Yeah, the
False alarm. STDERR output for the test itself was hidden, so while I have a GC trace, the forced GC is not unique.
Thanks for investigating this, Michael. In a release meeting, we discussed that we're primarily looking to understand the failure mode before RC 1, if possible. Then we can make a better decision about it.
I'm slowly gathering information, but given that it takes hours to reproduce, this is going to take a while.
Alrighty, new update: I've got a ~3 minute feedback loop now. I'm using debuglog to slowly whittle down the possibilities. I should hopefully have something soon. If I start floundering, I'll start bisecting.
I think I've confirmed this is a subtle bug in the sweep termination logic. I've got the following output, annotated for clarity:
I think I might know what the problem is. Looking at
Lo and behold, we've missed a free in the published heap profile. Assuming this is actually the problem (I will continue to try to confirm it), I don't think it indicates a larger issue. This possibility came up when Austin was working on the earlier fix: we discussed how the condition isn't actually problematic for GC correctness, because the GC ensures all outstanding sweeps are complete anyway, since it needs to stop the world to begin the next mark phase. Sweeping always prevents preemption -- this is necessary for a much broader sense of correctness -- so a new GC will only start once all outstanding sweeps have completed. @dmitshur As a result, I don't think this should block the RC, but it should be fixed prior to release. Ultimately, the worst it can do is make some tests flaky, particularly ones that, like this one, rely on runtime.GC flushing frees into the heap profile before it is read.
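To make the window concrete, here's a self-contained toy model of the pre-fix bookkeeping (hypothetical names, and a deliberate simplification of the runtime code): with the drained flag and the sweeper count stored in two separate atomics, a reader can observe "drained" before an in-flight sweeper has registered itself, so the sweep-done check can momentarily report true:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Toy model of the pre-fix bookkeeping (hypothetical names, not the real
// runtime source): the drained flag and the sweeper count are two separate
// atomics, so they can be observed out of order.
var (
	sweepDrained uint32 // set once the last unswept span has been handed out
	sweepers     uint32 // number of sweepers still working on a span
)

// isSweepDone mirrors the shape of the real check: "drained and no sweepers".
// The window: a sweeper can take the last span (setting sweepDrained) before
// it has incremented sweepers, so this briefly reports true even though that
// span's frees have not been published to the heap profile yet.
func isSweepDone() bool {
	return atomic.LoadUint32(&sweepDrained) != 0 &&
		atomic.LoadUint32(&sweepers) == 0
}

func main() {
	// Simulate the bad interleaving: the sweep list is drained, but the
	// sweeper that took the last span has not registered itself yet.
	atomic.StoreUint32(&sweepDrained, 1)
	fmt.Println("done?", isSweepDone()) // true -- too early

	atomic.AddUint32(&sweepers, 1)      // the in-flight sweeper registers
	fmt.Println("done?", isSweepDone()) // false again: the condition flapped
}
```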
I'm currently testing my theory by adding an extra The good news is that it's been 30 minutes and nothing has failed yet. This ensures that Unfortunately I think this is a hacky fix. I think even if with
Basically, what we need to guarantee is that:
I think these might be two contradictory conditions. I need to think about this more, though I'm certain there's a clean resolution to all this.
2.5 hours later, and this narrower window has prevented a test failure. I think this is our culprit. UPDATE: 20.5 hours later, and still no failure.
I've been thinking about this more. I think the right fix is to make the process of updating the sweeper count and the sweep-drained flag atomic, by merging mheap_.sweepers and mheap_.sweepDrained into a single value. Then it gets manipulated in the following way:
There's one more caveat here with reclaimers that don't pop from the list but do acquire spans for sweeping. They need only be accounted for in the sweeper count. @aclements, does this sound right to you? My one concern here is contention due to CAS-looping. I think it should be relatively OK because this happens on the allocation slow path (the first slow path, refilling spans), though there are a number of other potential sweepers (reclaimers, proportional sweepers, or the background sweeper). I guess we'll just have to benchmark it.
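As an illustration of that idea, here is a toy sketch (hypothetical names; the real change is CL 333389 below) in which the drained flag and the sweeper count share one atomic word, so a sweeper can only register itself while work remains and the done check becomes a single load:

```go
package main

import "sync/atomic"

// One atomic word: the top bit means "no more unswept spans to hand out",
// the low bits count sweepers that are still working.
const sweepDrainedBit = 1 << 31

var sweepState uint32

// beginSweep registers a new sweeper iff there is still sweep work to pick
// up, and reports whether the caller may sweep. Registration and the drained
// check happen in one CAS, so they cannot be observed out of order.
func beginSweep() bool {
	for {
		state := atomic.LoadUint32(&sweepState)
		if state&sweepDrainedBit != 0 {
			return false // nothing left to pick up
		}
		if atomic.CompareAndSwapUint32(&sweepState, state, state+1) {
			return true
		}
	}
}

// endSweep deregisters a sweeper once its span is fully swept and its frees
// are published.
func endSweep() {
	atomic.AddUint32(&sweepState, ^uint32(0)) // decrement by one
}

// markDrained records that the last unswept span has been handed out.
func markDrained() {
	for {
		state := atomic.LoadUint32(&sweepState)
		if atomic.CompareAndSwapUint32(&sweepState, state, state|sweepDrainedBit) {
			return
		}
	}
}

// isSweepDone is now a single load and comparison, and once it reports true
// it stays true: no new sweeper can register after the list is drained.
func isSweepDone() bool {
	return atomic.LoadUint32(&sweepState) == sweepDrainedBit
}

func main() {
	if beginSweep() {
		// ... sweep a span ...
		endSweep()
	}
	markDrained()
	_ = isSweepDone()
}
```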
I've sketched out a fix at https://golang.org/cl/333389 (be warned: it may not even compile, I didn't try) and that seems way too big for this release. I think we should just add an extra
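A minimal sketch of that stopgap, assuming the suggested extra step is a second runtime.GC call before the profile is read (dumpHeapProfile is a hypothetical helper, not the actual test code):

```go
package main

import (
	"io"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile forces a second GC before reading the heap profile. A new
// GC cycle cannot begin until the previous cycle's sweepers have finished and
// published their frees, so the profile read afterwards is no longer stale.
func dumpHeapProfile(w io.Writer) error {
	runtime.GC()
	runtime.GC()
	return pprof.Lookup("heap").WriteTo(w, 1)
}

func main() {
	if err := dumpHeapProfile(os.Stdout); err != nil {
		panic(err)
	}
}
```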
Change https://golang.org/cl/333389 mentions this issue:
Change https://golang.org/cl/333549 mentions this issue:
Currently, there is a chance that the sweep termination condition could flap, causing e.g. runtime.GC to return before all sweep work has been not only drained, but also completed. CL 307915 and CL 307916 attempted to fix this problem, but it is still possible that mheap_.sweepDrained is marked before any outstanding sweepers are accounted for in mheap_.sweepers, leaving a window in which a thread could observe isSweepDone as true before it actually was (and after some time it would revert to false, then true again, depending on the number of outstanding sweepers at that point).

This change fixes the sweep termination condition by merging mheap_.sweepers and mheap_.sweepDrained into a single atomic value. This value is updated such that a new potential sweeper will increment the outstanding sweeper count iff there are still outstanding spans to be swept without an outstanding sweeper to pick them up. This design simplifies the sweep termination condition into a single atomic load and comparison and ensures the condition never flaps.

Updates #46500.
Fixes #45315.

Change-Id: I6d69aff156b8d48428c4cc8cfdbf28be346dbf04
Reviewed-on: https://go-review.googlesource.com/c/go/+/333389
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2021-05-30T02:37:38-1607c28/linux-amd64-sid
2021-04-27T21:55:07-214c8dd/linux-amd64-nocgo
2021-04-27T21:44:16-645cb62/linux-amd64-nocgo
This test has otherwise been passing consistently since it was last fixed in November 2019, so this looks like a 1.17 regression (CC @golang/release).
2019 failures
2019-11-11T12:40:04-f07059d/linux-amd64-noopt
2019-11-10T20:36:44-47bc240/linux-amd64-noopt
2019-11-10T17:13:25-298be61/linux-amd64-noopt
2019-11-10T17:12:35-46c9fd0/linux-amd64-noopt
2019-11-10T17:12:15-40ebcfa/linux-amd64-noopt
2019-11-10T17:11:34-cd53fdd/linux-amd64-noopt
2019-11-10T13:41:45-4d4ddd8/linux-amd64-noopt
2019-11-10T12:12:46-9eb9c7b/linux-amd64-noopt
2019-11-10T04:23:22-e6fb39a/linux-amd64-noopt
2019-11-09T22:14:01-78d4560/linux-amd64-noopt
2019-11-09T20:08:06-29cfb4d/linux-amd64-noopt
2019-11-09T19:31:32-7148478/linux-amd64-noopt
2019-11-09T19:25:46-6e11195/linux-amd64-noopt
2019-11-09T05:51:04-a0262b2/linux-amd64-noopt
2019-11-09T00:36:15-bde1968/linux-amd64-noopt
2019-11-08T23:22:06-11da2b2/linux-amd64-noopt
2019-11-08T22:44:29-42db1da/linux-amd64-noopt
2019-11-08T21:32:23-0bbcce9/linux-amd64-noopt
2019-11-08T21:27:51-b7d097a/linux-amd64-noopt
2019-11-08T21:05:17-9ee6ba0/linux-amd64-noopt
2019-11-08T20:50:17-9e914f5/linux-amd64-noopt
2019-11-08T20:24:43-bababde/linux-amd64-noopt
2019-11-08T20:05:25-7a5e0fe/linux-amd64-noopt
2019-11-08T19:28:57-904e113/linux-amd64-noopt
2019-11-08T19:28:49-f6ff806/linux-amd64-noopt
2019-11-08T19:24:30-e6c12c3/linux-amd64-noopt
2019-11-08T18:55:44-b2b0992/linux-amd64-noopt
2019-11-08T18:00:31-c444ec3/linux-amd64-noopt
2019-11-08T17:56:35-a84ac18/linux-amd64-noopt
2019-11-08T17:01:32-4517c02/linux-amd64-noopt
2019-11-08T17:01:05-a5a6f61/linux-amd64-noopt
2019-11-08T17:00:57-dac936a/linux-amd64-noopt
2019-11-08T16:46:24-47232f0/linux-amd64-noopt
2019-11-08T16:44:48-374c284/linux-amd64-noopt
2019-11-08T16:35:48-ffb5646/linux-amd64-noopt
2019-11-08T15:10:39-52aebe8/linux-amd64-noopt
2019-09-19T20:26:22-1c50fcf/netbsd-arm-bsiegert