image/gif: TestDecodeMemoryConsumption flakes #35166

Closed
bcmills opened this issue Oct 25, 2019 · 11 comments
bcmills (Contributor) commented Oct 25, 2019

Seen on the dragonfly-amd64 builder (https://build.golang.org/log/40aa63372648fd03bc608538deaf94c00c314369):

--- FAIL: TestDecodeMemoryConsumption (0.14s)
    reader_test.go:422: Decode of 3000 frames increased heap by 77MB
FAIL
FAIL	image/gif	0.214s

Possibly a GC bug? (CC @aclements @mknyszek @danscales)
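For context, the failing check measures how much the heap grows across a decode of the 3000-frame test GIF. A minimal sketch of a check with that shape, assuming the standard runtime.MemStats API (the testDecodeHeapGrowth name and the exact structure are illustrative, not the real reader_test.go source):

    package gif // sketch: assumed to sit alongside reader_test.go so Decode is in scope

    import (
        "bytes"
        "runtime"
        "testing"
    )

    // testDecodeHeapGrowth measures HeapAlloc before and after decoding and
    // fails if the delta exceeds a threshold.
    func testDecodeHeapGrowth(t *testing.T, encoded []byte, frames int) {
        var before, after runtime.MemStats
        runtime.GC() // start from a settled heap
        runtime.ReadMemStats(&before)
        if _, err := Decode(bytes.NewReader(encoded)); err != nil {
            t.Fatal("Decode:", err)
        }
        runtime.ReadMemStats(&after)
        growth := int64(after.HeapAlloc) - int64(before.HeapAlloc)
        if growth > 30<<20 { // 30MB threshold, as mentioned later in this thread
            t.Errorf("Decode of %d frames increased heap by %dMB", frames, growth/(1<<20))
        }
    }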

@bcmills added the NeedsInvestigation label ("Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.") Oct 25, 2019
@bcmills added this to the Backlog milestone Oct 25, 2019
danscales (Contributor) commented:
I reproduced this using a dragonfly gomote, repeatedly calling:

gomote run user-danscales-dragonfly-amd64-0 go/bin/go test -test.count=100 image/gif

Reproduced about 4 times in 50 runs.

I changed the test so that, when a failure happens (because the heap is more than 30MB larger at the end of the decode than at the beginning), it does a runtime.GC() and then measures the heap difference again. This new code shows that the GC fully recovers the 77MB, and actually 4MB more. So I'm guessing this is not a bug, just a case where memory is sometimes not scanned/freed in quite the same way by the initial GC call.
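Roughly, the diagnostic looks like this (continuing the hypothetical heap-growth sketch from earlier in this thread; this is an illustration, not the actual patch):

    // On failure, force a collection and re-measure to see whether the extra
    // heap is actually reclaimable (transient growth) or a real leak.
    if growth > 30<<20 {
        runtime.GC()
        var retry runtime.MemStats
        runtime.ReadMemStats(&retry)
        regrowth := int64(retry.HeapAlloc) - int64(before.HeapAlloc)
        t.Logf("heap growth: %dMB before forced GC, %dMB after",
            growth/(1<<20), regrowth/(1<<20))
    }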

If this happens a lot, we should probably just change the test threshold from 30MB to 100MB, or something like that.

mknyszek (Contributor) commented Nov 6, 2019

There was a GC pacing bug floating around that I fixed in the 1.14 cycle (#28574) wherein assists didn't kick in often enough in GOMAXPROCS=1 situations and you ended up with accidental, unnecessary heap growth. There's a chance this is something in that vein?

danscales (Contributor) commented:
Hmmm, I should have said that I reproduced it at the change where the bug was originally reported, so that is at 316fb95 (Around Oct 25th). So the pacing fix for #28574 (submitted Sept 4th) would have been included, but the problem still occurred.

But I just tried with the most recent master (today), and I can't seem to reproduce the problem. So, was there any other GC pacing change made recently (between Oct. 25th and Sept 4th)?

@mknyszek

mknyszek (Contributor) commented Nov 6, 2019

@danscales I should clarify: I figured this is unrelated to the fix from a while ago given timelines, but I did recently land a GC pacing change as part of the allocator work, which may account for the difference you're seeing here?

In particular, we prevent the mark phase from starting too early, which leads to more assists but prevents a pacing-related heap size explosion if you're allocating rapidly. I figured this wouldn't have much of an effect without my other patches, but maybe it's doing something here.

bcmills (Contributor, Author) commented May 13, 2022

greplogs -l -e 'FAIL: TestDecodeMemoryConsumption .*\n .* increased heap by' --since=2019-10-26
2022-05-12T19:04:52-f269d90/netbsd-amd64-9_0

--- FAIL: TestDecodeMemoryConsumption (1.23s)
    reader_test.go:422: Decode of 3000 frames increased heap by 44MB
FAIL
FAIL	image/gif	8.691s

@golang/runtime: this looks like possibly a GC pacing or heap-measurement bug?

@bcmills changed the title from "image/gif: TestDecodeMemoryConsumption flake on dragonfly-amd64" to "image/gif: TestDecodeMemoryConsumption flakes" May 13, 2022
mknyszek (Contributor) commented:
If I recall correctly, this is one of the tests I looked into for #52433 as well. It doesn't quite allocate in a loop but I think the same thing can happen because the generated gif used for testing is trivial, so a lot of time is spent in allocation anyway. I'll confirm this.
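For reference, a many-frame GIF like the test input can be built with the standard image and image/gif APIs; a minimal sketch (buildTrivialGIF is a hypothetical helper, not the actual test code):

    package gifdemo // hypothetical package name for this sketch

    import (
        "bytes"
        "image"
        "image/color/palette"
        "image/gif"
    )

    // buildTrivialGIF encodes `frames` copies of the same 1x1 paletted image
    // into one animated GIF, so decoding does almost no per-frame work beyond
    // allocation.
    func buildTrivialGIF(frames int) (*bytes.Buffer, error) {
        frame := image.NewPaletted(image.Rect(0, 0, 1, 1), palette.WebSafe)
        anim := &gif.GIF{}
        for i := 0; i < frames; i++ {
            anim.Image = append(anim.Image, frame)
            anim.Delay = append(anim.Delay, 0)
        }
        buf := new(bytes.Buffer)
        if err := gif.EncodeAll(buf, anim); err != nil {
            return nil, err
        }
        return buf, nil
    }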

paulzhol (Member) commented:
Also got one on the (very slow) freebsd-paulzhol-arm builder:
https://build.golang.org/log/dc7a44dc62de269d10e452baa513db32b48fba0e

--- FAIL: TestDecodeMemoryConsumption (19.89s)
    reader_test.go:422: Decode of 3000 frames increased heap by 48MB

gopherbot commented:
Change https://go.dev/cl/407139 mentions this issue: runtime: use osyield in runqgrab on netbsd

mknyszek (Contributor) commented:
Well, the GC trace for netbsd-amd64 doesn't look great. Heap goal overrun (as per #52433) is fairly common, the CPU timings are totally strange (as if there's also a clock issue, like on OpenBSD), and there are a lot of oddities. For example:

gc 1919 @56.370s 44%: 18+0.47+0.002 ms clock, 74+0.20/0/0+0.011 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 1920 @56.389s 44%: 19+10+0.079 ms clock, 76+0.10/0.12/0+0.31 ms cpu, 3->15->12 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
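For reference, traces in this format come from running the test with GC tracing enabled, e.g. (assuming the standard GODEBUG knob rather than the exact invocation used here):

GODEBUG=gctrace=1 go test -count=100 image/gif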

(There are only so many GC cycles because I was running with -count=100.)

For instance, note the 10 ms wall-clock time on GC 1920, which is not reflected at all in the CPU time. That strongly suggests to me a ton of time spent trying to finish a GC cycle in mark termination, just like #52433. I'm not sure this is enough evidence to confirm 100% that that's the issue (I'd need to get a kernel-level scheduler trace to be sure), but this looks very platform-specific, at least for NetBSD. That's not to mention the really, really high amount of time spent in sweep termination, which is reminiscent of what was going wrong on OpenBSD when I was trying to land #48409. It looks like there's potentially a lot going wrong on NetBSD, performance-wise.

FWIW, it looks like the same fix for OpenBSD also helps NetBSD a lot (#52475). I sent a CL for that; it might help a little, but I still see some pretty large heap goal overruns.

The easiest thing to do would be to skip this test on NetBSD (and maybe FreeBSD+ARM).
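A skip like that would be a few lines at the top of the test; a hypothetical sketch using runtime.GOOS/GOARCH (not a change that was actually made here):

    // Hypothetical skip, limited to the platforms where the flake has been observed.
    if runtime.GOOS == "netbsd" || (runtime.GOOS == "freebsd" && runtime.GOARCH == "arm") {
        t.Skipf("skipping flaky heap-growth check on %s/%s; see go.dev/issue/35166",
            runtime.GOOS, runtime.GOARCH)
    }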

@mknyszek self-assigned this May 18, 2022
gopherbot pushed a commit that referenced this issue May 19, 2022
NetBSD appears to have the same issue OpenBSD had in runqgrab. See
issue #52475 for more details.

For #35166.

Change-Id: Ie53192d26919b4717bc0d61cadd88d688ff38bb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/407139
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
mknyszek (Contributor) commented:
OK actually, looking back over the logs for my NetBSD runs, the wall-clock times of the GC cycles are much less egregious, and there's a much lower chance that something goes wrong here. I'm inclined to call this fixed? Don't know about freebsd/arm, but we haven't seen similar issues on FreeBSD in general, I don't think. Would it be reasonable to blame the builder? I can add a skip for that platform if it happens again.

bcmills (Contributor, Author) commented Jun 16, 2023

> I'm inclined to call this fixed? Don't know about freebsd/arm, but we haven't seen similar issues on FreeBSD in general, I don't think.

Seems ok to close it as maybe-fixed and let watchflakes open a new issue if it crops up on the builders again.
