proposal: runtime/pprof: add “heaptime” bytes*GCs memory profile #55900

Open

aclements opened this issue Sep 28, 2022 · 11 comments

@aclements
Member

I propose we add a new “heaptime” profile type to runtime/pprof. This is equivalent to the existing allocs profile, except that allocations are scaled by the number of GC cycles they survive. Hence, the profile would measure “space” in bytes*GCs and “objects” in objects*GCs. This can also be viewed as the “inuse” profile integrated over time.

This type of profile would help with understanding applications where a CPU profile shows significant time spent in GC scanning/marking by attributing the cause of the time. GC marking time scales with the number of GC cycles an object survives and its size, so an object that survives no cycles costs essentially nothing to GC, while an object that survives many cycles costs a lot in total. A heaptime profile would show users which allocations to reduce in order to reduce marking time.

For example, this came up in the context of improving compiler performance. The compiler spends a significant amount of CPU time in marking, but looking at the existing allocs profile is a poor proxy for attributing mark time.

One downside is that this profile would not be totally representative of marking costs, and hence may be misleading. For one, marking is more closely proportional to the “scan size” of an object, which is the size up to an object’s last pointer. Pointer-free objects are very cheap to mark, even if they are very large. It also wouldn’t capture time spent scanning non-heap objects like stacks and globals. Capturing these would require complicating the definition (and implementation) of the profile, but may be doable.

Taking this downside to the extreme, if we were to substantially change the GC algorithm, this type of profile may no longer reflect marking time at all. For example, in a generational garbage collector, it’s not entirely clear what this profile would mean (perhaps you would count the number of minor cycles an object survived while young plus the number of major cycles it survived while old, but that’s certainly harder to explain).

I believe this would be straightforward to implement. We would add the current GC cycle count to runtime.specialprofile and new counters to runtime.memRecordCycle. runtime.mProf_Free would subtract the allocating GC cycle counter stored in the special from the current cycle counter, use this to scale the allocation count and size, and accumulate these into the new fields of memRecordCycle. One complication is that I don’t think we want to add these to runtime.MemProfileRecord, so we would need some other way to get them into the runtime/pprof package. We already use private APIs between runtime and runtime/pprof for the CPU profile, and I think we would do something similar for this.
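
To make the shape of that change concrete, here is a minimal, self-contained sketch of the bookkeeping described above. The type and field names (profSpecial, allocCycle, heaptimeObjects, heaptimeBytes) are simplified stand-ins for runtime.specialprofile, runtime.memRecordCycle, and runtime.mProf_Free, not actual runtime code:

```go
package main

import "fmt"

// profSpecial stands in for runtime.specialprofile; allocCycle is the
// hypothetical new field recording the GC cycle count at allocation time.
type profSpecial struct {
	size       int64
	allocCycle uint64
}

// memRecordCycle stands in for runtime.memRecordCycle, with hypothetical
// heaptime accumulators alongside the existing free counters.
type memRecordCycle struct {
	frees, freeBytes               int64
	heaptimeObjects, heaptimeBytes int64 // objects*GCs and bytes*GCs
}

// mProfFree models what runtime.mProf_Free would do when a sampled object
// is swept: subtract the allocating cycle from the current cycle and use
// the difference to scale the freed object's count and size.
func mProfFree(rec *memRecordCycle, sp profSpecial, currentCycle uint64) {
	survived := int64(currentCycle - sp.allocCycle)
	rec.frees++
	rec.freeBytes += sp.size
	rec.heaptimeObjects += survived
	rec.heaptimeBytes += survived * sp.size
}

func main() {
	var rec memRecordCycle
	// A 64-byte sampled object allocated in cycle 3 and swept in cycle 13
	// contributes 10 object*GCs and 640 byte*GCs.
	mProfFree(&rec, profSpecial{size: 64, allocCycle: 3}, 13)
	fmt.Println(rec.heaptimeObjects, rec.heaptimeBytes) // 10 640
}
```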

/cc @golang/runtime @mknyszek @dr2chase

gopherbot added this to the Proposal milestone Sep 28, 2022
@mknyszek
Contributor

mknyszek commented Sep 28, 2022

While bytes*GC is much more direct, bytes*seconds is more general (and generalizes to e.g. reference counting). This is straightforward to implement today if you round out the lifetimes to the respective GC cycles (the "start" time is rounded down to the GC cycle in which the allocation was made, the "end" time is rounded down to the GC cycle in which the allocation was swept, so you just track the GC cycle number). It's less straightforward to implement in other garbage collection schemes, but at least it would (maybe) generalize! I'd probably call this a "heap lifetime" profile or something.
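
As a sketch of that rounding scheme (the cycleStart table and approxLifetime function below are hypothetical; the runtime would only need to record the two cycle numbers per sampled object, plus a start time per GC cycle):

```go
package main

import (
	"fmt"
	"time"
)

// approxLifetime rounds an object's lifetime out to GC cycle boundaries:
// the start is rounded down to the cycle the allocation was made in, and
// the end down to the cycle the object was swept in. cycleStart is a
// hypothetical table of GC cycle start times.
func approxLifetime(cycleStart []time.Time, allocCycle, sweepCycle int) time.Duration {
	return cycleStart[sweepCycle].Sub(cycleStart[allocCycle])
}

func main() {
	base := time.Now()
	cycleStart := []time.Time{base, base.Add(50 * time.Millisecond), base.Add(120 * time.Millisecond)}
	// An object allocated during cycle 0 and swept during cycle 2 gets an
	// approximate lifetime of 120ms; bytes*seconds would scale its size by that.
	fmt.Println(approxLifetime(cycleStart, 0, 2))
}
```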

Another alternative is an opaque "GC impact" profile that has no meaningful unit; it's just a relative impact. The unit stays stable within a Go version, but not necessarily across Go versions. That way we can actually capture all the different costs and track them as they shift over time.

One additional complication is objects that live forever. To get this right, I think we might need to count them separately. Otherwise, I'm not sure how the pprof package would know how long they were alive (or how to count them, exactly). I imagine that will look something like stopping the world and collecting every object that's sampled but still alive?

On the topic of stacks and globals, I think we can model both as just large allocations (at least stacks come from the page heap anyway). They may have an outsized impact in the profile (especially globals, since they're always there), so we may want a way to filter them out.

@felixge
Contributor

felixge commented Sep 28, 2022

+1 on the usefulness of understanding the lifetime of heap allocations.

Another approach to estimate this is to apply Little's Law (avg(inuse_age) = inuse_objects / (alloc_objects/seconds)) to delta heap profiles (?seconds= params in net/http/pprof). This idea was proposed by my colleague @alfa07 and I've implemented it as a heapage command in my pprofutils project.
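
For reference, here is a minimal sketch of that estimate applied to a single allocation site from a delta profile (the function and numbers are illustrative, not the actual heapage implementation):

```go
package main

import (
	"fmt"
	"time"
)

// avgInuseAge applies Little's Law to one allocation site from a delta
// heap profile: average age ≈ inuse_objects / (alloc_objects per second
// observed during the delta window).
func avgInuseAge(inuseObjects, deltaAllocObjects int64, window time.Duration) time.Duration {
	if deltaAllocObjects == 0 {
		// No allocations observed in the window: the estimate is undefined;
		// we only know the objects are older than the window.
		return -1
	}
	allocRate := float64(deltaAllocObjects) / window.Seconds()
	return time.Duration(float64(inuseObjects) / allocRate * float64(time.Second))
}

func main() {
	// 500 live objects and 2000 allocations observed over a 30s window
	// give an allocation rate of ~66.7/s and an average age of ~7.5s.
	fmt.Println(avgInuseAge(500, 2000, 30*time.Second))
}
```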

Below is a screenshot from one of our biggest production services at Datadog (frames anonymized) that shows the estimated average age as a virtual leaf node in the flame graph:

[screenshot omitted]

As far as I can tell, the results seem to match my intuition for what the code is doing. But the limitation of this approach is that the estimate will be poor if the number of allocations observed in the delta period is small. Additionally, it's impossible to compute an estimate (other than lifetime > period) for objects with no allocations during the period.

Anyway, I figured it's worth sharing this approach in case you hadn't seen or considered it. The feature could be built directly into the pprof UI using the data available today. I like the idea of a GC Impact view that scales the flame graph, but objects that had no allocation in the period will be problematic, as explained above.

Perhaps the only issue we face as a profiling vendor is that the delta heap profiles are only available via net/http/pprof. Having a public API for it would be nice. In fact, we have no use case for the absolute lifetime allocation profile that Go currently supports and computing the delta profiles is a source of CPU and memory overhead for us. But that's mostly orthogonal to this issue.

@prattmic
Member

The intuition behind this profile is that, e.g., a 64-byte object that lives for 10 GC cycles should have the same weight as ten 64-byte objects that each live for 1 GC cycle.

I was initially confused about how this new profile is fundamentally different from the existing inuse profile. An inuse profile is instantaneous rather than integral, but it does give equal weight to a long-lived 64-byte object and a short-lived 64-byte object (that is re-allocated every GC cycle). If the short-lived objects are not allocated every cycle, then a sample for those is probabilistically less likely to appear in a profile, thus giving them less weight.

Since the inuse profile is instantaneous, the thought above depends on capturing the profile at a "representative" moment (or doing Monte Carlo integration over several profiles). The key benefit of heaptime is that it doesn't require finding a "representative" moment (many applications may not even have a single representative steady state, and just before exit is almost certainly not representative). Instead, much like a CPU profile, heaptime can be collected for the duration of execution, making it much easier to use properly.
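
As a worked version of the weight equivalence described at the top of this comment (illustrative arithmetic only, not runtime code):

```go
package main

import "fmt"

// Under the proposed accounting, one object surviving many cycles and
// many objects each surviving one cycle contribute the same total byte*GCs.
func main() {
	const objSize int64 = 64

	longLived := objSize * 10      // one object alive for 10 GC cycles
	shortLived := 10 * objSize * 1 // ten objects, each alive for 1 GC cycle

	fmt.Println(longLived, shortLived) // 640 640
}
```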

@mknyszek
Contributor

Another idea: building on my previous comment about filtering, we could tag allocations as either having pointers or not with a label. We don't currently support user-provided labels for memory profiles, but the main issue seems to be that we can't ever throw away labels for monotonic profiles (#23458). In this case, however, the runtime could add just a few labels of its own that are always relevant, like "has pointers" or "goroutine stack" or something. i.e. a small number of categories that are always meaningful.

@matttproud
Contributor

On an initial read of the proposal, I would be a big fan of it, given past concerns with heap fragmentation that motivated me to file this feature request:

#11890

Heap age as proposed may provide the tenure information I was requesting in bullet no. six in the aforementioned link.

@aclements
Member Author

@mknyszek:

> While bytes*GC is much more direct, bytes*seconds is more general (and generalizes to e.g. reference counting).

While that may be more general, I think it tells you less about actual costs.

> Another alternative is an opaque "GC impact" profile that has no meaningful unit; it's just a relative impact.

That's interesting. This could still be "bytes * GCs", but it would be bytes scanned, not just total bytes. I would want to call that something more specific than "heaptime" (maybe "GC impact" as you suggested), but I think the unit could be concrete. Then we could count just the scan size of objects, including 0 for pointer-free objects.

> One additional complication is objects that live forever. To get this right, I think we might need to count them separately.

That's a good point that I totally overlooked. They don't necessarily have to "live forever"; the same issue applies to any object that is still live when the profile snapshot is taken. I think those should just count for all of the GC cycles that have happened since they were allocated. That does mess up my proposed implementation of this profile, though... We could do the accounting during sweep. Sweep walks the specials anyway, so when it sees an mprof special for a marked object, it could bump the appropriate profile counter.
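
A rough, self-contained sketch of that sweep-time accounting (sweepSampledObject and bucketCounters here are made-up stand-ins for the sweeper and the profile bucket, not real runtime code):

```go
package main

import "fmt"

// bucketCounters stands in for a profile bucket's hypothetical heaptime
// accumulators.
type bucketCounters struct {
	heaptimeObjects, heaptimeBytes int64
}

// sweepSampledObject models the alternative accounting described above:
// each time the sweeper encounters an mprof special on an object that was
// marked (i.e. survived this cycle), it credits one more cycle of heaptime
// to the object's bucket, so still-live objects are always up to date when
// a profile snapshot is taken.
func sweepSampledObject(b *bucketCounters, size int64, marked bool) {
	if !marked {
		return // the object died this cycle; it has already accrued its cycles
	}
	b.heaptimeObjects++
	b.heaptimeBytes += size
}

func main() {
	var b bucketCounters
	// A 64-byte sampled object that survives 3 consecutive GC cycles.
	for i := 0; i < 3; i++ {
		sweepSampledObject(&b, 64, true)
	}
	fmt.Println(b.heaptimeObjects, b.heaptimeBytes) // 3 192
}
```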

> On the topic of stacks and globals, I think we can model both as just large allocations (at least stacks come from the page heap anyway). They may have an outsized impact in the profile (especially globals, since they're always there), so we may want a way to filter them out.

If we have a more general "GC impact" profile, I think you're right that we could just account these like any other allocation. Globals do have an outsize impact on marking time since they are by definition marked on every cycle, so I don't feel bad about indicating that. :)

@aclements
Member Author

@felixge , that's clever. :) As you say, it has some limitations, and I would love to just build this in.

@aclements
Member Author

@prattmic:

> I was initially confused about how this new profile is fundamentally different from the existing inuse profile.

You gave a good summary of the differences, and in particular any advice we give should probably distinguish long-running steady-state systems from bounded executions, the way you did.

If we went with a profile definition that more closely modeled the costs to marking, then the inuse profile and the heaptime (or whatever we would call it) profiles would be more different.

@felixge
Contributor

felixge commented Oct 9, 2022

@prattmic I agree with your analysis that the existing inuse profile is already scaled by heapage. Short-lived objects have a lower chance of appearing than long-lived ones if they are allocated at the same rate. As far as individual inuse snapshots not being representative, I'd think aggregating multiple such profiles solves the issue. Various proprietary (Google Cloud Profiler, Datadog, etc.) as well as OSS solutions (e.g. parca-dev/parca) already exist for automating that. That being said, explicitly showing the heapage to developers is probably still useful, as they can use this information to fix issues where objects are being kept alive longer than expected or necessary.

> @felixge , that's clever. :) As you say, it has some limitations, and I would love to just build this in.

@aclements should I create a proposal with google/pprof or what did you have in mind? I'd probably suggest adding it as a hover tooltip in the flame graph or something like that.

@rhysh
Contributor

rhysh commented Oct 9, 2022

For when to do the calculation: mProf_FlushLocked visits every sampled stack at the end of each mark phase, when it moves the alloc/free counts from "GC cycle N-2" to the all-time accumulator. That could be a good opportunity to update a "lifetime" or "cost" counter associated with that stack without building a new mechanism, so long as we're targeting a "sum of total lifetime" and not a "histogram of individual object lifetimes".
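
A small sketch of that idea (bucketTotals and flushBucket are simplified stand-ins, not the real memRecord layout):

```go
package main

import "fmt"

// bucketTotals is a simplified stand-in for a memory-profile bucket's
// counters; the lifetime fields are hypothetical additions.
type bucketTotals struct {
	allocObjects, allocBytes       int64
	freeObjects, freeBytes         int64
	lifetimeObjects, lifetimeBytes int64 // object*GCs and byte*GCs
}

// flushBucket models the idea above: at each flush (end of mark phase),
// whatever is still live in the bucket has survived one more GC cycle, so
// add the live totals to the lifetime accumulators.
func flushBucket(b *bucketTotals) {
	b.lifetimeObjects += b.allocObjects - b.freeObjects
	b.lifetimeBytes += b.allocBytes - b.freeBytes
}

func main() {
	b := bucketTotals{allocObjects: 5, allocBytes: 320, freeObjects: 2, freeBytes: 128}
	flushBucket(&b) // 3 live objects (192 bytes) each gain one cycle
	fmt.Println(b.lifetimeObjects, b.lifetimeBytes) // 3 192
}
```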

If a user collects a "heap" profile from a running system, they don't need to also collect an "allocs" profile (and vice versa); they can instead pass an explicit sample type when they run go tool pprof. Will the "heaptime" data be part of that same profile family (as new sample types and corresponding DefaultSampleType), or will obtaining it require a second visit to every mProf entry, and a second serialization, transfer, and storage of the call stacks?

@aclements
Member Author

> ... I'd think aggregating multiple such profiles solves the issue

That works fine for long-running services that are generally in a steady-state, which is what Google Cloud Profiler and Datadog target, but it's not a good fit for single-shot, phased applications. For example, profiling the compiler was the original motivation for this.

> @aclements should I create a proposal with google/pprof or what did you have in mind?

Sorry, by "I would love to just build this in" I meant add built-in support to runtime/pprof for generating this data, rather than a statistical approximation. We could certainly explore both approaches simultaneously, though, so you're welcome to file a proposal with pprof.

> Will the "heaptime" data be part of that same profile family (as new sample types and corresponding DefaultSampleType), or will obtaining it require a second visit to every mProf entry, and a second serialization, transfer, and storage of the call stacks?

That's an interesting question. I'd originally thought of this as a new profile type, but it could be added as two new sample types to the existing heap profile. You have an interesting point about the added costs if you are already collecting heap profiles.
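
For illustration, here are the existing heap profile sample types with two hypothetical heaptime entries added; the heaptime_* names and units are made up for this sketch, not part of the proposal:

```go
package main

import "fmt"

func main() {
	// The existing heap profile sample types, plus two hypothetical
	// heaptime entries (names and units are illustrative only).
	sampleTypes := []struct{ Type, Unit string }{
		{"alloc_objects", "count"},
		{"alloc_space", "bytes"},
		{"inuse_objects", "count"},
		{"inuse_space", "bytes"},
		{"heaptime_objects", "object*GCs"}, // hypothetical
		{"heaptime_space", "byte*GCs"},     // hypothetical
	}
	for _, st := range sampleTypes {
		fmt.Println(st.Type, st.Unit)
	}
}
```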
