
runtime: scavenger pacing fails to account for fragmentation [1.13 backport] #34149

Closed
gopherbot opened this issue Sep 6, 2019 · 7 comments
Labels
CherryPickApproved (used during the release process for point releases), FrozenDueToAge, Performance
Milestone

Comments

@gopherbot

@mknyszek requested issue #34048 to be considered for backport to the next 1.13 minor release.

Since this has the potential to cause a severe performance loss, I think we should backport this to Go 1.13.

@gopherbot Please open a backport issue for 1.13.

@gopherbot gopherbot added the CherryPickCandidate Used during the release process for point releases label Sep 6, 2019
@gopherbot gopherbot added this to the Go1.13.1 milestone Sep 6, 2019
@mknyszek mknyszek self-assigned this Sep 6, 2019
@bcmills bcmills modified the milestones: Go1.13.1, Go1.13.2 Sep 25, 2019
@bcmills
Contributor

bcmills commented Sep 25, 2019

this has the potential to cause a severe performance loss

@mknyszek, @aclements: could you describe the conditions under which this performance degradation occurs? (Is it predictable? Does it show up during testing? Is there a workaround available? Do we have a rough estimate for the fraction of users affected?)

@mknyszek
Contributor

mknyszek commented Sep 25, 2019

Is it predictable?

Yes. If the application's span fragmentation (the fraction of heap_inuse not occupied by live objects, i.e. 1 - heap_alloc/heap_inuse) reaches 10% or more, the scavenger can spiral into a state in which every free span is scavenged, leading to a large number of page faults and a large number of syscalls.
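To make the condition above concrete, here is a minimal Go sketch that computes span fragmentation from runtime.MemStats, assuming the definition 1 - HeapAlloc/HeapInuse; the function name and the threshold check are illustrative, not part of the runtime fix:

```go
package main

import (
	"fmt"
	"runtime"
)

// spanFragmentation estimates span fragmentation as the fraction of
// memory in in-use spans (HeapInuse) not occupied by live heap objects
// (HeapAlloc). HeapInuse is always >= HeapAlloc, so the result is in [0, 1].
func spanFragmentation() float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.HeapInuse == 0 {
		return 0
	}
	return 1 - float64(ms.HeapAlloc)/float64(ms.HeapInuse)
}

func main() {
	frag := spanFragmentation()
	fmt.Printf("span fragmentation: %.1f%%\n", frag*100)
	if frag >= 0.10 {
		fmt.Println("at or above the 10% threshold where the Go 1.13 scavenger could misbehave")
	}
}
```

ReadMemStats stops the world briefly, so this is a diagnostic to run occasionally, not in a hot path.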

Does it show up during testing?

I don't have a specific test, but I can reproduce the problem with both large and small programs. I'd have to write a short program that reproduces it, but that shouldn't be difficult to do.

Is there a workaround available?

No. There is no workaround because the behavior is embedded in the runtime's pacing. You'd have to somehow keep your application's span fragmentation under 10%, which isn't reasonable to ask of anyone.

Do we have a rough estimate for the fraction of users affected?

No, and I'm not sure exactly how we'd be able to get that. We know for a fact that it severely affects Kubernetes' API latency and is one reason why they're blocked on moving to Go 1.13 (see #32828). Another (medium-sized, I suppose?) Google application consistently took 80% longer to run on Go 1.13 as a direct consequence of this issue. The patch in the non-backport bug fixes this behavior and brings performance back to Go 1.12 levels.

@bcmills
Contributor

bcmills commented Sep 25, 2019

By “Is it predictable?” and “Does it show up during testing?”, I meant more, “will users be able to predict and/or identify that they are affected?”

(That is: this seems more important to backport if it can crop up suddenly after a user has already tested and deployed a new build.)

@mknyszek
Contributor

Ah, I think I understand. Then it's not predictable. It can certainly show up suddenly in production, without anyone noticing while running tests, running microbenchmarks, or even a smaller version of the full application. It's as difficult to predict as how much heap fragmentation your application produces, which may be a function of input in some cases.

@aclements
Member

I agree that this should be backported for the reasons @mknyszek laid out.

Specifically, the CL to backport is https://golang.org/cl/193040.

@andybons andybons added CherryPickApproved Used during the release process for point releases and removed CherryPickCandidate Used during the release process for point releases labels Oct 2, 2019
@gopherbot
Author

Change https://golang.org/cl/198487 mentions this issue: [release-branch.go1.13] runtime: redefine scavenge goal in terms of heap_inuse

gopherbot pushed a commit that referenced this issue Oct 4, 2019
[release-branch.go1.13] runtime: redefine scavenge goal in terms of heap_inuse

This change makes it so that the scavenge goal is defined primarily in
terms of heap_inuse at the end of the last GC rather than next_gc. The
reason behind this change is that next_gc doesn't take into account
fragmentation, and we can fall into a situation where the scavenger thinks
it should have work to do but there's no free and unscavenged memory
available.

In order to ensure the scavenge goal still tracks next_gc, we multiply
heap_inuse by the ratio between the current heap goal and the last heap
goal, which describes whether the heap is growing or shrinking, and by
how much.

Finally, this change updates the documentation for scavenging and
elaborates on why the scavenge goal is defined the way it is.

Fixes #34149

Change-Id: I8deaf87620b5dc12a40ab8a90bf27932868610da
Reviewed-on: https://go-review.googlesource.com/c/go/+/193040
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
(cherry picked from commit 9b30811)
Reviewed-on: https://go-review.googlesource.com/c/go/+/198487
Run-TryBot: Andrew Bonventre <andybons@golang.org>
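The pacing rule described in the commit message above can be sketched as follows; the function and parameter names are illustrative and do not match the runtime's internal variables:

```go
package main

import "fmt"

// scavengeGoal sketches the retained-heap goal from the commit message:
// heap_inuse at the end of the last GC, scaled by the ratio of the current
// heap goal to the last heap goal, so the goal tracks whether the heap is
// growing or shrinking, and by how much. Integer math is exact for these
// inputs; a real implementation would need to guard against overflow.
func scavengeGoal(lastHeapInuse, lastHeapGoal, heapGoal uint64) uint64 {
	return lastHeapInuse * heapGoal / lastHeapGoal
}

func main() {
	// 100 MiB in use at the last GC; the heap goal grew from 200 MiB to
	// 220 MiB, so the scavenge goal scales up by 10% to 110 MiB.
	fmt.Println(scavengeGoal(100<<20, 200<<20, 220<<20)) // 115343360
}
```

Defining the goal this way means that even a heavily fragmented heap (heap_inuse well above heap_alloc) yields a goal the scavenger can actually meet, which is the failure mode the original pacing based on next_gc did not account for.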
@gopherbot
Author

Closed by merging cd951ae to release-branch.go1.13.

@katiehockman katiehockman modified the milestones: Go1.13.2, Go1.13.3 Oct 17, 2019
@golang golang locked and limited conversation to collaborators Oct 16, 2020

6 participants