
proposal: runtime/debug: add GCStats field to track count of marked pointers during last GC #52692

Closed
sumerc opened this issue May 3, 2022 · 9 comments


sumerc commented May 3, 2022

Hi,

I would like to propose an idea (along with a few questions). I am still learning the internals of Go (especially the GC subsystem), so I hope my mental model is correct and clear.

Consider the following code:

type ss struct {
	x int // changing this to `x *int` is what makes the GC mark cost jump
}

a := make([]ss, ntimes)
for i := 0; i < ntimes; i++ {
	a[i] = ss{x: i}
}

Now, if I change x int to x *int in the above code, the total GC latency jumps from ~106µs to ~1.6ms, roughly 15x more overhead. IIUC, this is because the GC now needs to mark the pointer field in the structure.

I think I am not the only one who has faced this: https://blog.gopheracademy.com/advent-2018/avoid-gc-overhead-large-heaps/. When there are lots of pointers held in memory, GC latency increases.

I understand that this kind of behavior is completely natural. I am more interested in ways to anticipate it during implementation, via some kind of counter, for observability reasons. The behavior becomes somewhat more subtle if, for example, x is a string (a string itself holds an implicit pointer) or x is a struct that contains a pointer somewhere deep in its definition.
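
To illustrate what I mean (a rough sketch; the type names are made up), each of these layouts introduces a pointer the GC has to scan even though no *T field is visible at the top level:

type withString struct {
	s string // the string header itself contains a pointer to the backing bytes
}

type inner struct {
	p *int
}

type withNested struct {
	in inner // the pointer is buried one level down, but the GC still has to scan it
}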

Let me propose an idea:

// run same code as above

runtime.GC()

gcs := new(debug.GCStats)
debug.ReadGCStats(gcs)
fmt.Printf("Last GC total marked pointers: %d\n", gcs.NumPointers) // NumPointers is the proposed new field

This would give the programmer a deterministic way to benchmark and, where possible, optimize allocations. The approach I am currently using to debug this is to manually invoke the GC and measure the elapsed time. This is somewhat OK, but IMHO a simple counter would be better and more deterministic. WDYT?
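
Concretely, the measurement I do today looks roughly like this (just a sketch):

import (
	"fmt"
	"runtime"
	"time"
)

func timeForcedGC() {
	start := time.Now()
	runtime.GC() // blocks until a full GC cycle has finished
	fmt.Printf("forced GC took %v\n", time.Since(start))
}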

And one more unrelated question, just for learning: why don't we have a field for total GC latency (i.e., wall time)? IIUC we currently have PauseTotal, which shows the STW duration, but wouldn't it be nice to see the whole running time (total latency)? Perhaps this is not possible given that the GC runs concurrently in different goroutines, or maybe it is just not necessary?

Thanks,

@sumerc sumerc added the Proposal label May 3, 2022
@gopherbot gopherbot added this to the Proposal milestone May 3, 2022
@sumerc sumerc changed the title from "proposal: affected/package: debug GCStats to have a NumPointers field for tracking number of marked Pointers" to "debug.GCStats to have a NumPointers field for tracking number of marked Pointers" May 3, 2022
@sumerc sumerc changed the title from "debug.GCStats to have a NumPointers field for tracking number of marked Pointers" to "debug.GCStats to have a field to track count of marked Pointers during last GC" May 3, 2022
@ianlancetaylor ianlancetaylor changed the title from "debug.GCStats to have a field to track count of marked Pointers during last GC" to "proposal: runtime/debug: add GCStats field to track count of marked pointers during last GC" May 4, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) May 4, 2022
ianlancetaylor (Contributor) commented:

CC @mknyszek @golang/runtime

mknyszek (Contributor) commented May 4, 2022

As a general note, please consider that every internal statistic we expose basically has to be tracked forever, and either places restrictions on the GC implementation or means the statistic's meaning changes over time. In general, we have to be careful with what we expose. That is true even for runtime/metrics, which is meant for these kinds of internal metrics: a deprecation path exists there, but it is long, so we still want to be judicious.
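
For reference, metrics exposed that way are read through the runtime/metrics sampling API. Here is a rough sketch using an existing GC metric; a hypothetical pointer-count metric would be consumed the same way:

import (
	"fmt"
	"runtime/metrics"
)

func printGCCycles() {
	// "/gc/cycles/total:gc-cycles" is an existing metric name.
	samples := []metrics.Sample{{Name: "/gc/cycles/total:gc-cycles"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() == metrics.KindUint64 {
		fmt.Println("completed GC cycles:", samples[0].Value.Uint64())
	}
}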

For this particular metric, however, there's a problem with tracking it: it's going to be pretty expensive. The scan loop in the GC is a very critical path, and is the only path with knowledge of how many pointers there actually are. It's written to be very tight.

I think that an already-approved proposal might be what you want, however. FTR, measuring the end-to-end latency of runtime.GC is not a great idea: multiple concurrent runtime.GC calls may coalesce, and the measurement also includes the sweep phase. The total length of the GC cycle also doesn't map directly to latency overheads, because the GC's mark phase runs concurrently with the application. Pause latency does map directly, because the whole application stops for that period of time. Latencies do increase during the mark phase, since goroutines occasionally have to synchronize with the concurrently running GC, but the effect is far from easy to predict. Only a detailed trace can really get at that information (e.g. via the runtime/trace package).
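
For completeness, collecting such a trace looks roughly like this (a sketch; the file name is arbitrary):

import (
	"os"
	"runtime/trace"
)

func runTraced(workload func()) error {
	f, err := os.Create("trace.out") // inspect afterwards with `go tool trace trace.out`
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop()
	workload()
	return nil
}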

It sounds like (and correct me if I'm wrong) what you want is just to identify when the GC is doing more work than before. I think one metric that would capture this without digging into implementation details is some estimate of total GC CPU time. On that note, there already is an accepted proposal to do that, #47216.

Unfortunately it's a toss-up as to whether I'll actually get around to implementing that before the 1.19 development cycle ends (this Sunday).

sumerc (Author) commented May 5, 2022

Multiple concurrent runtime.GC calls may coalesce, and it also includes the sweep phase.

Oh, I did not think about that case.

For this particular metric, however, there's a problem with tracking it: it's going to be pretty expensive. The scan loop in the GC is a very critical path, and is the only path with knowledge of how many pointers there actually are. It's written to be very tight.

TIL something, thanks! Just out of curiosity: can a single increment of a variable (or maybe we can shard this variable) really cause that much overhead? Of course, my assumption is that we increment a single counter for every marked pointer; maybe that is wrong as well?

Pause latency does map directly

Nice to know. This means there is no way that GC latency grows but pause does not, right?

It sounds like (and correct me if I'm wrong) what you want is just to identify when the GC is doing more work than before. I think one metric that would capture this without digging into implementation details is some estimate of total GC CPU time.

Yes. Just like we have alloc/s in a benchmark, it would be nice to see how much GC pressure the code will have. I agree GC CPU time is a good indication for my case. Do you think it would make sense to show this information in a benchmark?

mknyszek (Contributor) commented May 5, 2022

For this particular metric, however, there's a problem with tracking it: it's going to be pretty expensive. The scan loop in the GC is a very critical path, and is the only path with knowledge of how many pointers there actually are. It's written to be very tight.

TIL something, thanks! Just out of curiosity: can a single increment of a variable (or maybe we can shard this variable) really cause that much overhead? Of course, my assumption is that we increment a single counter for every marked pointer; maybe that is wrong as well?

If it's a single global counter, it needs to be atomic. Marking is parallel, so that's going to be a contended global counter, which (IIRC, as a rule of thumb) has an overhead of around 50 ns on modern hardware due to contention in the cache coherency protocol. Compared to how quickly local, unsynchronized CPU instructions finish, that's very high.

There are ways around it, like maintaining a local counter that is flushed occasionally. However, in a tight loop, even a few extra instructions can make a significant difference. Our scan loop probably isn't the best in the world, and maybe it would even be a tolerable performance hit to do this today, but once we expose this API we have to live with the design choice forever. If one day (and it probably will happen) we make the scan loop extremely tight, this will be a significant hindrance.
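
A rough sketch of that flush-occasionally pattern (not what the runtime actually does; the worker type and threshold are made up):

import "sync/atomic"

var totalPointers atomic.Int64 // global counter, touched only on flush

type scanWorker struct {
	localPointers int64 // per-worker counter, no synchronization needed
}

func (w *scanWorker) notePointer() {
	w.localPointers++
	if w.localPointers >= 1024 { // flush occasionally to amortize the contended atomic add
		totalPointers.Add(w.localPointers)
		w.localPointers = 0
	}
}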

Pause latency does map directly

Nice to know. This means there is no way that GC latency grows but pause does not, right?

It depends on what you mean by "GC latency." While the concurrent mark phase is active, there are potential increases in latency due to write barriers and GC assists by goroutines. In aggregate this looks like whole-program latency, but the effects are mostly local (unlike pauses, which are global). For example, goroutines assist the GC proportionally to how much they allocate.

It sounds like (and correct me if I'm wrong) what you want is just to identify when the GC is doing more work than before. I think one metric that would capture this without digging into implementation details is some estimate of total GC CPU time.

Yes. Just like we have alloc/s in a benchmark, it would be nice to see how much GC pressure the code will have. I agree GC CPU time is a good indication for my case. Do you think it would make sense to show this information in a benchmark?

I think there are lots of good reasons to surface more GC information by default. See for example #52466. Exposing a GC CPU fraction over the course of the benchmark run might achieve what you want, and could also surface the same noise that #52466 is trying to root out. I think it's worth considering.
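
As a rough sketch of what surfacing it could look like today, one could report runtime.MemStats.GCCPUFraction from a benchmark (this is process-wide rather than per-benchmark, so it is only a crude signal; the benchmark itself is hypothetical):

import (
	"runtime"
	"testing"
)

func BenchmarkAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = make([]*int, 1024)
	}
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// Fraction of this process's CPU time spent in the GC since startup.
	b.ReportMetric(ms.GCCPUFraction, "gc-cpu-fraction")
}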

sumerc (Author) commented May 5, 2022

That is a whole lot of valuable information! Thanks for the pointers @mknyszek; #52466 in particular seems really relevant.

I think we can close this if you like.

rsc (Contributor) commented May 11, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) May 11, 2022
@rsc rsc moved this from Active to Declined in Proposals (old) May 18, 2022
rsc (Contributor) commented May 18, 2022

This proposal has been declined as retracted.
— rsc for the proposal review group

@rsc rsc closed this as completed May 18, 2022
sumerc (Author) commented May 24, 2022

@mknyszek, I was just looking through the GC code and found the following: heapScanWork atomic.Int64 in the gcControllerState struct. Isn't this what I asked for, an atomic counter for marked pointers? I am just asking to learn. There seems to be a lot of useful stuff in that struct as well.

mknyszek (Contributor) commented:

Not quite; heapMarked is the amount of heap memory reachable via pointers. This is a little easier to manage than "number of pointers" because you only have to increase the counter when you mark an object for the first time, and an object may have more than one reference. That being said, we also take a lot of care to update that efficiently. The counter is actually distributed: there's roughly one counter per CPU. I'm not sure it ever has to get flushed during a GC cycle, which makes it fairly efficient to track. That would be true for the number of pointers encountered as well, I suppose, but it's an additional complication.

@golang golang locked and limited conversation to collaborators May 24, 2023