
runtime: average stack size is not a fit for every application #61052

Open
cdvr1993 opened this issue Jun 28, 2023 · 9 comments

cdvr1993 commented Jun 28, 2023

Go 1.19 introduced goroutine stacks whose initial size is based on the historical average stack size. Even though this has improved the landscape for our applications, we still experience issues with some of them.

To give one real example, consider an application running on ~15k cores. Originally the profiles showed close to 10% of CPU being consumed by stack growth; translated to cores, that's ~1.5k cores.


The stack usage is very low (we never exceed 16MB).


After publishing the runtime/metrics value that exposes the initial stack size, we observed that 99% of the time the stack size is set to 2KB.
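
For reference, a minimal sketch of how that metric can be read, assuming the /gc/stack/starting-size:bytes metric available in recent Go releases:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /gc/stack/starting-size:bytes reports the stack size that new
	// goroutines are created with.
	sample := []metrics.Sample{{Name: "/gc/stack/starting-size:bytes"}}
	metrics.Read(sample)
	if sample[0].Value.Kind() == metrics.KindUint64 {
		fmt.Printf("initial goroutine stack size: %d bytes\n", sample[0].Value.Uint64())
	}
}
```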

By using go:linkname we were able to expose the initial-stack-size global variable and overwrite it. We also enabled "gcshrinkstackoff" and disabled "adaptivestackstart", so that stacks are not shrunk and the runtime does not override our value.
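
A minimal sketch of that trick, assuming the runtime's internal variable is still named runtime.startingStackSize (it is unexported, an implementation detail, and may change or stop being linkable in any release):

```go
package main

import (
	"fmt"
	_ "unsafe" // required for go:linkname
)

// startingStackSize is the runtime's initial goroutine stack size.
// Linking against an unexported runtime variable is unsupported and
// may break on any Go release.
//
//go:linkname startingStackSize runtime.startingStackSize
var startingStackSize uint32

func main() {
	startingStackSize = 16 << 10 // start new goroutine stacks at 16KB
	fmt.Println("startingStackSize:", startingStackSize)
}
```

Run with GODEBUG=gcshrinkstackoff=1,adaptivestackstart=0 so the GC does not shrink stacks back down and the adaptive logic does not overwrite the injected value.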


Finally we ran several experiments injecting different stack sizes:

  • 4KB (profile screenshot omitted)
  • 8KB (profile screenshot omitted)
  • 16KB (profile screenshot omitted)

We decided to stop at 16KB because the gains were approaching zero, but overall we were able to reduce copystack by 7%.

Memory went up, but it is still reasonable: from 16MB to 50MB.

With all that said, would it be possible for the Go runtime to expose:

  • A safe way to inject our own stack size (even if it is static).
  • A histogram of all the stack sizes it has seen; this would help us decide what size gives the best results without using too much memory.

Or do you have any ideas on how to implement this more safely (i.e., without requiring private linking)?

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jun 28, 2023
randall77 (Contributor) commented

A safe way to inject our own stack size (even if it is static).

The Go project has explicitly avoided providing knobs like this. It adds complexity that we have to live with forever. I don't think we'd want to change that stance to get just a few percent performance improvement.

A histogram of all the stack sizes it has seen; this would help us decide what size gives the best results without using too much memory.

This might be possible with the new runtime/metrics API, as it supports returning histograms. I don't think the runtime currently tracks goroutine sizes, so this might incur some additional cost to track. We'd also have to decide what "size" means here, exactly. Peak use? Average use? Randomly sampled use? Sampled at GC time use?

And given the above, I don't see a compelling need to provide this data when we don't intend to provide a way to use it.
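
(For context, runtime/metrics histograms are consumed as Float64Histogram values; below is a sketch using the existing /sched/latencies:seconds metric. A stack-size histogram, if one were ever added, would presumably be read the same way.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /sched/latencies:seconds is an existing histogram-valued metric; a
	// goroutine stack-size histogram (hypothetical) would look the same.
	sample := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64Histogram {
		return
	}
	h := sample[0].Value.Float64Histogram()
	for i, count := range h.Counts {
		fmt.Printf("[%g, %g): %d\n", h.Buckets[i], h.Buckets[i+1], count)
	}
}
```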

mpx (Contributor) commented Jul 4, 2023

The assumption that a good global average stack size exists appears to be false for many applications. Allowing the developer to control the global value doesn't improve the situation.

It would be better to automatically track and set the initial stack size per go statement - I suspect that assumption is likely to hold in many more cases. The assumption could still break for goroutine pools running wildly different tasks; any automated solution is going to struggle there.

chabbimilind commented
Average is not a very representative metric. This deserves more exploration of other statistical measures, e.g., median, mode, ...

cespare (Contributor) commented Jul 5, 2023

You could also imagine incorporating this into PGO somehow.

rabbbit commented Jul 6, 2023

Thanks for the response @randall77!

(Disclaimer: I work at the same corporation as @cdvr1993, but in a different domain.)

The Go project has explicitly avoided providing knobs like this. It adds complexity that we have to live with forever. I don't think we'd want to change that stance to get just a few percent performance improvement.
[...]
I don't see a compelling need to provide this data given that we don't intend to provide a way to use it.

That was bad phrasing on our part. We would also strongly prefer not to have knobs and for things to (mostly) just work. The goal was to report the situation, list what we tried and thought about, and ask for advice.

To answer your points one by one:

  • knobs: we agree that less is more. But then there are GOMAXPROCS, GOGC, and GOMEMLIMIT - all modifiable at runtime (see the sketch after this list). It was not immediately clear what counts as "too much". We mentioned the extra knob because we imagined it could (theoretically) be easier to expose a knob than to change the algorithm in general, as in runtime: new goroutines can spend excessive time in morestack #18138.
  • "just a few percent" - a 7% total CPU reduction is a significant amount for us, especially compared to things like PGO, which seems to have smaller wins (?).
  • stats - well, we could at least observe this better in production and experiment in "user-space"; we could then report back what we see.
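
For reference, a sketch of the existing runtime-adjustable knobs mentioned in the first point:

```go
package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	runtime.GOMAXPROCS(8)         // runtime equivalent of GOMAXPROCS=8
	debug.SetGCPercent(200)       // runtime equivalent of GOGC=200
	debug.SetMemoryLimit(4 << 30) // runtime equivalent of GOMEMLIMIT=4GiB
}
```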

As it stands, we'll likely end up using the unsafe/linkname tricks in at least the worst affected applications.

@mknyszek mknyszek added this to the Backlog milestone Jul 12, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 12, 2023
rabbbit commented Aug 1, 2023

@randall77 could I maybe ask you to reconsider the stat emission? Or is that an absolute strong no?

Even without more knobs, we can very well start a "goroutine ballast" already today. It would be much easier to do if we had stats to base it on.
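
For illustration, a minimal sketch of what such a ballast could look like: goroutines that recursively grow their stacks and then park, so the GC observes large stacks when computing the historical average. This leans on unspecified runtime behavior (and on GODEBUG=gcshrinkstackoff=1 from earlier, so the grown stacks are not shrunk again); the goroutine count and target size here are made up:

```go
package main

import "runtime"

var park = make(chan struct{})

// grow forces the current goroutine's stack to grow to roughly n bytes by
// recursing with a ~1KB live stack frame.
func grow(n int) byte {
	var pad [1024]byte
	pad[0] = byte(n) // keep pad live so the frame isn't optimized away
	if n <= len(pad) {
		return pad[0]
	}
	return grow(n-len(pad)) + pad[len(pad)-1]
}

func main() {
	// Hypothetical ballast: 1000 goroutines grow to ~32KB stacks and stay
	// parked so they are counted when the runtime averages stack sizes.
	for i := 0; i < 1000; i++ {
		go func() {
			grow(32 << 10)
			<-park // keep the goroutine (and its grown stack) alive
		}()
	}
	runtime.GC() // give the runtime a chance to observe the new average
	select {}    // stand-in for the application's real work
}
```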

randall77 (Contributor) commented

I don't have any objection to exposing such a stat by itself. I just think it would take a fair amount of work to implement - it isn't just exporting something the runtime is already keeping track of. Given the extra work required and the limited payoff, I don't think it is worth doing.
If someone can propose / design a stat that's easy & cheap to collect, then I think it would be OK to export it.

prashantv (Contributor) commented

Instead of exposing this via runtime/metrics, would it be cheaper to expose as part of the pprof or some other explicit call? E.g., include the stack size for each goroutine in the pprof goroutines output.

cdvr1993 (Author) commented Aug 1, 2023

I was actually able to get this info from CPU profiles. I'm not sure it is the correct approach, but at least I have some data.

So basically I focus on copystack in one profile, then disassemble the binary so that I can compute the stack usage of each stack trace.

Then I just order them based on total size.
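
A sketch of that analysis as a hypothetical helper: it scrapes the per-function frame sizes the compiler prints with go build -gcflags=-S (TEXT lines such as `TEXT main.handler(SB), ABIInternal, $152-24`, where $152 is the frame size) and sums them over a stack trace taken from the copystack samples:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// textLine matches compiler -S output such as:
//   TEXT  main.handler(SB), ABIInternal, $152-24
var textLine = regexp.MustCompile(`TEXT\s+(\S+)\(SB\).*\$(\d+)-`)

// frameSizes rebuilds pkg with -gcflags=-S and scrapes a map from function
// name to stack frame size out of the assembly dump. The -a flag defeats
// the build cache so the dump is always printed.
func frameSizes(pkg string) (map[string]int64, error) {
	out, err := exec.Command("go", "build", "-a", "-gcflags=-S", pkg).CombinedOutput()
	if err != nil {
		return nil, fmt.Errorf("go build: %v", err)
	}
	sizes := make(map[string]int64)
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		if m := textLine.FindStringSubmatch(sc.Text()); m != nil {
			n, _ := strconv.ParseInt(m[2], 10, 64)
			sizes[m[1]] = n
		}
	}
	return sizes, sc.Err()
}

// stackUsage sums frame sizes over one stack trace (function names taken
// from the copystack samples of a CPU profile).
func stackUsage(sizes map[string]int64, trace []string) int64 {
	var total int64
	for _, fn := range trace {
		total += sizes[fn]
	}
	return total
}

func main() {
	sizes, err := frameSizes(".")
	if err != nil {
		fmt.Println(err)
		return
	}
	// hypothetical trace pulled from a pprof copystack sample
	fmt.Println(stackUsage(sizes, []string{"main.main", "main.handler"}))
}
```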

With that I was able to determine that the best starting stack size for the application I mentioned previously was 32KB, which allowed me to reduce copystack from 9.5% to 0.8%.

Reducing the remainder would require a much bigger jump in stack size, so I decided to stay at 32KB.
