
runtime: bgsweep's use of Gosched bloats runtime traces and wastes CPU time #54767

Closed
dominikh opened this issue Aug 30, 2022 · 19 comments

@dominikh
Member

runtime.bgsweep has a loop that calls Gosched after doing a small amount of work. The intention is to only use a
fraction of the available CPU time. When there are more runnable Gs than Ps, this works well: bgsweep relinquishes most
of its time and gets scheduled again after some time has passed. However, when we have excess Ps, bgsweep gets scheduled
again almost immediately after calling Gosched. This continues until bgsweep runs out of work entirely, at which point
it will stop looping. On large systems, with code that produces a disproportionate amount of garbage, bgsweep may only
rarely run out of work.
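For reference, the loop in question looks roughly like this (a simplified sketch of bgsweep in runtime/mgcsweep.go; the parking logic between GC cycles is omitted):

	// Sweep one span at a time, yielding after each one so that bgsweep
	// only soaks up CPU time nobody else wants.
	for sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		Gosched() // with idle Ps, bgsweep is rescheduled almost immediately
	}
	// ...park until the next GC cycle wakes the sweeper...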

By itself, this might not be a major problem, although it's probably a waste of CPU time to constantly call Gosched just
to be scheduled again almost immediately. However, in combination with runtime/trace, this causes a bad user experience.
Each time bgsweep calls Gosched, and each time it gets scheduled again, an event gets emitted. This turns out to be a
lot of events. I have a trace captured from a real-world deployment of Prometheus. It is 60 seconds long and has the
following statistics:

  • 66 million events in total, spread across 163,678 goroutines
  • 33 million of those events are caused by the bgsweep goroutine
  • bgsweep spends 32s being properly inactive, having finished all its work
  • the trace contains two full GC cycles and 3 periods of activity in bgsweep (one per observed GC cycle and one
    continuing from the previous cycle when the trace begins).
  • according to go tool trace's goroutine analysis, bgsweep was responsible for 1.31% of total program execution time.

This means that during 28 seconds of actual activity, spread over 3 periods of work and amounting to just 1.31% of all
work done, bgsweep was responsible for 50% of all events in the trace. During these periods, bgsweep spends about 500 ns
running and 600 ns inactive, repeated millions of times.
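(As a rough sanity check: at ~500 ns running plus ~600 ns inactive, each yield cycle takes on the order of a microsecond
and emits at least a stop and a start event, so tens of seconds of this behavior works out to tens of millions of events,
consistent with the 33 million attributed to bgsweep.)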

In the end, this results in a 300 MB trace file that takes 2.5 minutes to be parsed and split by go tool trace,
during which time it consumes over 23 GB of system memory. The resulting JSON for the browser frontend (for all split
traces combined) is 11 GB in size. Admittedly, halving all of these metrics would still not be ideal, but it'd be half
as bad. And 1.31% of all execution wouldn't be responsible for 50% of the trace data. Furthermore, the remaining 33
million events are much more evenly distributed.

I don't have a proposed solution nor deep knowledge of the runtime. However, I wonder if it would make sense and be
feasible for Gosched to be a no-op when the system is under-utilized and we're not trying to stop the world. In this
particular example that should make bgsweep finish almost twice as fast and emit virtually no scheduling events, wasting
less time in the scheduler without impacting other goroutines.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 30, 2022
@prattmic
Member

Interestingly, I also just ran into this while collecting a histogram of how long goroutines remain running, which had a big chunk running for absurdly short periods (e.g. ~30 ns). Upon digging, I found it was this same loop.

cc @mknyszek

@prattmic
Member

One somewhat hacky workaround would be to check for idle Ps prior to Gosched. If there are idle Ps, then Gosched is very unlikely to find work, so we could just skip it.
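For illustration, such a check might look roughly like this inside the runtime (a sketch only; it assumes internals like getg, sched.npidle, and mcall, and mirrors the shape of the goschedIfBusy helper that shows up later in this thread):

	// Sketch: yield like Gosched, but skip the trip through the scheduler
	// when a P is idle, since the scheduler would almost certainly hand
	// this goroutine right back.
	func goschedIfBusy() {
		gp := getg()
		// Still honor a pending preemption request (e.g. for a
		// stop-the-world), so a tight loop can't dodge preemption.
		if !gp.preempt && sched.npidle.Load() > 0 {
			return
		}
		mcall(gosched_m)
	}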

@heschi heschi added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Aug 30, 2022
@heschi heschi added this to the Go1.20 milestone Aug 30, 2022
@mknyszek mknyszek modified the milestones: Go1.20, Backlog Aug 31, 2022
@mknyszek
Contributor

I agree it would be great to do something about this. Putting it in the backlog since it's not a new issue. At the very least we can do what Michael suggested and just patch it for bgsweep. I like your idea of making Gosched a little smarter about emitting events in general (or about calling into the scheduler).

@mknyszek mknyszek self-assigned this Aug 31, 2022
@beoran

beoran commented Sep 5, 2022

This could bear a higher priority. Wasting 1% of CPU is expensive these days, certainly on cloud platforms! But thanks for looking into it.

@mknyszek
Contributor

mknyszek commented Sep 5, 2022

Wasting 1% of CPU is expensive these days

Thanks for the bump. I agree this issue is important to resolve, but I don't think resolving it is going to result in a 1% improvement in CPU utilization (or an improvement to throughput for that matter). Something has to do that sweep work, and bgsweep typically only takes a non-trivial amount of CPU time if the application has idle time to spare anyway, precisely because it does call Gosched so often. There's certainly overhead to calling Gosched each time, but I suspect a large part of that overhead is just eating up available idle time anyway. That was the original intent behind how bgsweep is written.

With that said, bgsweep could be more efficient by not calling into Gosched as much (to your point), and the most immediate benefit is to use less of that available idle CPU (nice from a power consumption perspective). But again, I don't see this improving 1% of anything. If I'm wrong though, that'd be cool. :)

I think @dominikh's point is more that 1.31% of CPU time is responsible for 50% of the trace data, which I agree seems wildly disproportionate.

In any case, I've already put this on my list for 1.20 (EDIT: just realized it was still in the Backlog milestone; moved), so I plan to work on it. There are a number of ways to resolve this issue.

  1. Treat bgsweep like an idle mark worker.
    • The benefit of this is we never end up in bgsweep at all if the application is firing on all cylinders. I suspect going through the scheduler just to sweep a single span is probably less effective than just letting things that allocate memory do it.
    • The cost is it would require additional complexity to how things are scheduled.
  2. Have bgsweep check for idle Ps before calling Gosched. If there are idle Ps, keep going anyway.
    • The benefit of this is it fixes the immediate issue without too much additional complexity. You still have random bgsweep schedules going on to sweep a single span, but it's probably already not a big deal today.
  3. (2) but push that check down into Gosched.
    • The benefit of this is that now anything that calls Gosched won't unnecessarily go into the scheduler (though due to races it could be wrong about idleness, so it makes Gosched a tiny bit weaker).

I think (2) -> (3) -> (1) is a reasonable path. I'll do (2) first which should solve the immediate issue, (3) might be nice in general (though I need to double-check that it's not breaking any documented semantics), and (1) just to see how gross it is and if it makes any meaningful performance difference.

@mknyszek
Contributor

mknyszek commented Sep 5, 2022

And to be clear, (2) is the hacky workaround @prattmic mentioned. I personally don't think it's that hacky, and the effects of such a change are local to just bgsweep which is nice.

@mknyszek mknyszek modified the milestones: Backlog, Go1.20 Sep 5, 2022
@prattmic
Member

prattmic commented Sep 6, 2022

I'm worried about (3). Gosched is a full call to schedule, so it can do lots of stuff: check the global runq, run timers, run GC workers, etc. If we vastly simplify Gosched with an idle P short-circuit, those behaviors are lost, and applications that want them (likely for perf/latency reasons) have no way of requesting them without fully blocking.

@mknyszek
Contributor

mknyszek commented Sep 6, 2022

those behaviors are lost

While this is true and makes (3) more risky, I'm not positive we actually guarantee any of that (and in theory, it could change out from underneath users). Though, I don't feel strongly about this. We can skip (3), and I'll just try (2).

@mknyszek
Contributor

mknyszek commented Sep 6, 2022

Also, (3) will probably break a whole bunch of our own tests...

@mknyszek
Contributor

mknyszek commented Sep 8, 2022

Working on (3), and while I think npidle is an easy signal to check for when GOMAXPROCS>1, relying on only that means that when GOMAXPROCS=1 we still have the same issue, basically.

A mitigation for that case could just be to do more work than sweeping one span before yielding. For instance, we could sweep 10 spans at once. Since we have a pretty good idea that this goroutine runs for only 10s of nanoseconds at a time, I think we can easily do 10x the amount of work before trying to yield, which will produce 10% of the events it produces today.

For something slightly more robust, we can more explicitly bound the latency the same way the scavenger does. However, while that conceptually is more robust, it has some issues that we need to work around (e.g. coarse-grained timers) which add complexity. I'm inclined to just sweep 10 spans at a time, write a comment about it, and call it good (plus the npidle check, which is easy).
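For concreteness, a scavenger-style bound might look roughly like the following (purely an illustrative sketch, not proposed code; the budget value is arbitrary, and this is exactly where the coarse-grained timer issues would bite):

	// Illustrative only: yield after a small time budget instead of a
	// fixed span count. sweepone, nanotime, and goschedIfBusy are
	// runtime internals.
	const sweepBudgetNS = 10_000 // ~10µs, an arbitrary figure
	start := nanotime()
	for sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		if nanotime()-start >= sweepBudgetNS {
			goschedIfBusy()
			start = nanotime()
		}
	}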

Any thoughts?

@dominikh
Member Author

dominikh commented Sep 8, 2022

GOMAXPROCS=1 doesn't seem common enough to warrant a lot of extra complexity. It also seems less likely to run into this issue, as it requires both idle Ps and a lot of garbage being produced (otherwise bgsweep doesn't have much work to do in the first place), and I don't see that happening with a single, idle P.

@mknyszek
Contributor

mknyszek commented Sep 8, 2022

I tend to agree, but I also don't see much of a downside to sweeping 10 spans at a time, and it would be unfortunate to miss this one case (for the sake of robustness). The bgsweep goroutine is still preemptible in between each span swept, so it's not like it adds any latency to say, a STW or something.

@rhysh
Contributor

rhysh commented Sep 8, 2022

I'm inclined to just sweep 10 spans at a time, write a comment about it, and call it good

Fewer calls into the scheduler sounds good to me. I think that doing even 10µs of work at a time would keep this responsive enough to hide in the noise of the other latency sources we have today.

Avoiding coarse-grained timer problems and bumping by a constant 10x (and confirming that in practice the loop typically takes 10µs or less on modern hardware) seems like a good plan. Maybe the next release goes up to 32 iterations, and the one after increases it to 100, if we're not yet happy with the balance of responsiveness vs efficiency vs simplicity.

@gopherbot

Change https://go.dev/cl/429615 mentions this issue: runtime: tweak bgsweep "low-priority" heuristic

@dominikh
Member Author

dominikh commented Sep 9, 2022

I tend to agree, but I also don't see much of a downside to sweeping 10 spans at a time, and it would be unfortunate to miss this one case (for the sake of robustness). The bgsweep goroutine is still preemptible in between each span swept, so it's not like it adds any latency to say, a STW or something.

Sorry, yes, sweeping multiple spans SGTM. I was thinking of the complexity of timer-based latency bounds.

@ZekeLu
Contributor

ZekeLu commented Sep 17, 2022

go/src/runtime/mgcsweep.go

Lines 298 to 306 in cc1b20e

	const sweepBatchSize = 10
	nSwept := 0
	for sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		nSwept++
		if nSwept%sweepBatchSize == 0 {
			goschedIfBusy()
		}
	}

@mknyszek It seems that nSwept will keep increasing until it overflows. It's fine in this case, but it looks weird. How about setting sweepBatchSize to a power of two (2^n) and implementing it like this:

	const sweepBatchSize = 1 << 3 // or 1<<4
	const mask = sweepBatchSize - 1
	nSwept := 0
	for sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		nSwept = (nSwept + 1) & mask
		if nSwept == 0 {
			goschedIfBusy()
		}
	}

https://go.dev/play/p/-sxVGAPj0XZ?v=gotip

@mknyszek
Contributor

Thanks, but I don't think this is necessary.

It's not actually possible for nSwept to overflow because it refers to swept spans in a single GC cycle (because it gets reset on each loop), and at most there's one per 8 KiB of heap memory. Overflowing would require the heap to be 8 KiB * (size of address space) which is by definition impossible. In other words, addresses would start to overflow and we'd be out of memory before then.
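(Back of the envelope: on a 64-bit platform nSwept would have to exceed 2^63 before overflowing, and with at least 8 KiB of heap behind every swept span that implies a heap on the order of 2^63 × 8 KiB = 2^76 bytes, far more than even a full 64-bit address space can hold.)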

However, I don't think it needs a comment or anything; we have plenty of counters in the runtime that are proportional to the size of the heap that won't overflow for exactly the same reason. IMO the logic of the existing code demonstrates intent more clearly, and is also (I believe) correct.

@ZekeLu
Contributor

ZekeLu commented Sep 19, 2022

Sorry for the noise! I thought nSwept was a global variable when I wrote the comment. I must have been out of my mind. Thank you for your patience!

@mknyszek
Contributor

It's no worries. :) You may have been thinking of nbgsweep which is a global that has a chance of overflowing if a program just runs long enough, but I think it's pretty unlikely. The amount of time that would need to pass, even with an aggressively allocating application, is pretty high.

@golang golang locked and limited conversation to collaborators Sep 19, 2023