image/gif: TestDecodeMemoryConsumption flakes #35166

Closed
bcmills opened this issue Oct 25, 2019 · 11 comments
bcmills (Contributor) commented Oct 25, 2019

Seen on the dragonfly-amd64 builder (https://build.golang.org/log/40aa63372648fd03bc608538deaf94c00c314369):

--- FAIL: TestDecodeMemoryConsumption (0.14s)
    reader_test.go:422: Decode of 3000 frames increased heap by 77MB
FAIL
FAIL	image/gif	0.214s

Possibly a GC bug? (CC @aclements @mknyszek @danscales)
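For context, the failing check measures how much the heap grows across a decode of the 3000-frame test GIF. A minimal sketch of a check with that shape, assuming the standard runtime.MemStats API (the testDecodeHeapGrowth name and the exact structure are illustrative, not the real reader_test.go source):

    package gif // sketch: assumed to sit alongside reader_test.go so Decode is in scope

    import (
        "bytes"
        "runtime"
        "testing"
    )

    // testDecodeHeapGrowth measures HeapAlloc before and after decoding and
    // fails if the delta exceeds a threshold.
    func testDecodeHeapGrowth(t *testing.T, encoded []byte, frames int) {
        var before, after runtime.MemStats
        runtime.GC() // start from a settled heap
        runtime.ReadMemStats(&before)
        if _, err := Decode(bytes.NewReader(encoded)); err != nil {
            t.Fatal("Decode:", err)
        }
        runtime.ReadMemStats(&after)
        growth := int64(after.HeapAlloc) - int64(before.HeapAlloc)
        if growth > 30<<20 { // 30MB threshold, as mentioned later in this thread
            t.Errorf("Decode of %d frames increased heap by %dMB", frames, growth/(1<<20))
        }
    }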

@bcmills added the NeedsInvestigation label ("Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.") Oct 25, 2019
@bcmills added this to the Backlog milestone Oct 25, 2019
danscales (Contributor) commented:
I reproduced this using a dragonfly gomote, repeatedly calling:

gomote run user-danscales-dragonfly-amd64-0 go/bin/go test -test.count=100 image/gif

Reproduced about 4 times in 50 runs.

I changed the test so that, when a failure happens (because the heap is more than 30MB larger at the end of the decode than at the beginning), it does a runtime.GC() and then measures the heap difference again. This new code shows that the GC fully recovers the 77MB, and actually 4MB more. So I'm guessing this is not a bug, just a case where memory is sometimes not scanned/freed in quite the same way by the initial GC call.
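Roughly, the diagnostic looks like this (continuing the hypothetical heap-growth sketch from earlier in this thread; this is an illustration, not the actual patch):

    // On failure, force a collection and re-measure to see whether the extra
    // heap is actually reclaimable (transient growth) or a real leak.
    if growth > 30<<20 {
        runtime.GC()
        var retry runtime.MemStats
        runtime.ReadMemStats(&retry)
        regrowth := int64(retry.HeapAlloc) - int64(before.HeapAlloc)
        t.Logf("heap growth: %dMB before forced GC, %dMB after",
            growth/(1<<20), regrowth/(1<<20))
    }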

If this happens a lot, we should probably just change the test threshold from 30MB to 100MB, or something like that.

mknyszek (Contributor) commented Nov 6, 2019

There was a GC pacing bug floating around that I fixed in the 1.14 cycle (#28574) wherein assists didn't kick in often enough in GOMAXPROCS=1 situations and you ended up with accidental, unnecessary heap growth. There's a chance this is something in that vein?

danscales (Contributor) commented:
Hmmm, I should have said that I reproduced it at the change where the bug was originally reported, so that is at 316fb95 (Around Oct 25th). So the pacing fix for #28574 (submitted Sept 4th) would have been included, but the problem still occurred.

But I just tried with the most recent master (today), and I can't seem to reproduce the problem. So, was there any other GC pacing change made recently (between Oct. 25th and Sept 4th)?

@mknyszek

mknyszek (Contributor) commented Nov 6, 2019

@danscales I should clarify: I figured this is unrelated to the fix from a while ago given timelines, but I did recently land a GC pacing change as part of the allocator work, which may account for the difference you're seeing here?

In particular, we prevent the mark phase from starting too early, which leads to more assists but prevents a pacing-related heap size explosion if you're allocating rapidly. I figured this wouldn't have much of an effect without my other patches, but maybe it's doing something here.

bcmills (Contributor, Author) commented May 13, 2022

greplogs -l -e 'FAIL: TestDecodeMemoryConsumption .*\n .* increased heap by' --since=2019-10-26
2022-05-12T19:04:52-f269d90/netbsd-amd64-9_0

--- FAIL: TestDecodeMemoryConsumption (1.23s)
    reader_test.go:422: Decode of 3000 frames increased heap by 44MB
FAIL
FAIL	image/gif	8.691s

@golang/runtime: this looks like possibly a GC pacing or heap-measurement bug?

@bcmills changed the title from "image/gif: TestDecodeMemoryConsumption flake on dragonfly-amd64" to "image/gif: TestDecodeMemoryConsumption flakes" May 13, 2022
mknyszek (Contributor) commented:
If I recall correctly, this is one of the tests I looked into for #52433 as well. It doesn't quite allocate in a loop but I think the same thing can happen because the generated gif used for testing is trivial, so a lot of time is spent in allocation anyway. I'll confirm this.
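For reference, a many-frame GIF like the test input can be built with the standard image and image/gif APIs; a minimal sketch (buildTrivialGIF is a hypothetical helper, not the actual test code):

    package gifdemo // hypothetical package name for this sketch

    import (
        "bytes"
        "image"
        "image/color/palette"
        "image/gif"
    )

    // buildTrivialGIF encodes `frames` copies of the same 1x1 paletted image
    // into one animated GIF, so decoding does almost no per-frame work beyond
    // allocation.
    func buildTrivialGIF(frames int) (*bytes.Buffer, error) {
        frame := image.NewPaletted(image.Rect(0, 0, 1, 1), palette.WebSafe)
        anim := &gif.GIF{}
        for i := 0; i < frames; i++ {
            anim.Image = append(anim.Image, frame)
            anim.Delay = append(anim.Delay, 0)
        }
        buf := new(bytes.Buffer)
        if err := gif.EncodeAll(buf, anim); err != nil {
            return nil, err
        }
        return buf, nil
    }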

paulzhol (Member) commented:
Also got one on the (very slow) freebsd-paulzhol-arm builder:
https://build.golang.org/log/dc7a44dc62de269d10e452baa513db32b48fba0e

--- FAIL: TestDecodeMemoryConsumption (19.89s)
    reader_test.go:422: Decode of 3000 frames increased heap by 48MB

gopherbot commented:
Change https://go.dev/cl/407139 mentions this issue: runtime: use osyield in runqgrab on netbsd

mknyszek (Contributor) commented:
Well, the GC trace for netbsd-amd64 doesn't look great. Heap goal overrun (as per #52433) is fairly common, the CPU timings are totally strange (as if there's also a clock issue, like on OpenBSD), and there are a lot of oddities. For example:

gc 1919 @56.370s 44%: 18+0.47+0.002 ms clock, 74+0.20/0/0+0.011 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 1920 @56.389s 44%: 19+10+0.079 ms clock, 76+0.10/0.12/0+0.31 ms cpu, 3->15->12 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
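For reference, traces in this format come from running the test with GC tracing enabled, e.g. (assuming the standard GODEBUG knob rather than the exact invocation used here):

GODEBUG=gctrace=1 go test -count=100 image/gif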

(There are only so many GC cycles because I was running with -count=100.)

For instance, note the 10 ms wall-clock time on GC 1920, which is not reflected at all in the CPU time. That strongly suggests to me a ton of time spent trying to finish a GC cycle in mark termination, just like #52433. I'm not sure this is enough evidence to confirm 100% that that's the issue (I'd need to get a kernel-level scheduler trace to be sure), but this looks very platform-specific, at least for NetBSD. That's not to mention the really, really high amount of time spent in sweep termination, which is reminiscent of what was going wrong on OpenBSD when I was trying to land #48409. It looks like there's potentially a lot going wrong on NetBSD, performance-wise.

FWIW, it looks like the same fix for OpenBSD also helps NetBSD a lot (#52475). I sent a CL for that; it might help a little, but I still see some pretty large heap goal overruns.

The easiest thing to do would be to skip this test on NetBSD (and maybe FreeBSD+ARM).
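A skip like that would be a few lines at the top of the test; a hypothetical sketch using runtime.GOOS/GOARCH (not a change that was actually made here):

    // Hypothetical skip, limited to the platforms where the flake has been observed.
    if runtime.GOOS == "netbsd" || (runtime.GOOS == "freebsd" && runtime.GOARCH == "arm") {
        t.Skipf("skipping flaky heap-growth check on %s/%s; see go.dev/issue/35166",
            runtime.GOOS, runtime.GOARCH)
    }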

@mknyszek self-assigned this May 18, 2022
gopherbot pushed a commit that referenced this issue May 19, 2022
NetBSD appears to have the same issue OpenBSD had in runqgrab. See
issue #52475 for more details.

For #35166.

Change-Id: Ie53192d26919b4717bc0d61cadd88d688ff38bb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/407139
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
mknyszek (Contributor) commented:
OK actually, looking back over the logs for my NetBSD runs, the wall-clock times of the GC cycles are much less egregious, and there's a much lower chance that something goes wrong here. I'm inclined to call this fixed? Don't know about freebsd/arm, but we haven't seen similar issues on FreeBSD in general, I don't think. Would it be reasonable to blame the builder? I can add a skip for that platform if it happens again.

bcmills (Contributor, Author) commented Jun 16, 2023

> I'm inclined to call this fixed? Don't know about freebsd/arm, but we haven't seen similar issues on FreeBSD in general, I don't think.

Seems ok to close it as maybe-fixed and let watchflakes open a new issue if it crops up on the builders again.
