
runtime: average stack size is not a fit for every application #61052

Open
cdvr1993 opened this issue Jun 28, 2023 · 12 comments

Labels: compiler/runtime, NeedsInvestigation
Milestone: Backlog

cdvr1993 commented Jun 28, 2023

Go 1.19 introduced creating goroutine stacks based on the historical average stack size. Even though this has improved the landscape for our applications, we still experience issues with some of them.

To give one real example, take an application that uses 15k cores. Originally the profiles showed close to 10% of CPU being consumed by stack growth; translated to cores, that is ~1.5k cores.


The stack usage is very low; we never exceed 16MB.

After extending our metrics to publish the runtime/metrics value that exposes the initial stack size, we observed that 99% of the time the stack size is set to 2KB.
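
For reference, polling that value looks roughly like the sketch below. It is a minimal sketch (not our exact code) using the metric names documented in the runtime/metrics package for Go 1.19+:

```go
// Minimal sketch: poll the initial goroutine stack size and the memory
// reserved for stacks via runtime/metrics (names as documented for Go 1.19+).
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/gc/stack/starting-size:bytes"},     // initial stack size for new goroutines
		{Name: "/memory/classes/heap/stacks:bytes"}, // memory reserved for goroutine stacks
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		}
	}
}
```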

By using go:linkname we were able to expose the runtime's initial stack size global variable. We then enabled "gcshrinkstackoff" and disabled "adaptivestackstart", so that stacks are not shrunk and the runtime does not override our value.
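
Roughly, that workaround looks like the sketch below. It assumes the runtime's internal variable is still named startingStackSize and holds the size in bytes as a uint32; both are unexported implementation details that can change between releases, and newer toolchains may additionally require -ldflags=-checklinkname=0 for the linkname to be accepted.

```go
// Hedged sketch of the go:linkname workaround, not a supported API: the
// variable name and type are runtime internals and may change at any time.
package main

import (
	_ "unsafe" // required for go:linkname
)

//go:linkname startingStackSize runtime.startingStackSize
var startingStackSize uint32

func init() {
	// Run with GODEBUG=gcshrinkstackoff=1,adaptivestackstart=0 so stacks are
	// not shrunk and the runtime does not recompute this value at GC time.
	startingStackSize = 16 << 10 // 16KB, the largest size we experimented with
}

func main() {
	// ... application code ...
}
```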


Finally, we ran several experiments injecting different stack sizes: 4KB, 8KB, and 16KB.

We decided to stop at 16KB because the gains were approaching zero; overall we were able to reduce copystack by 7%.

Memory went up, but it is still reasonable: from 16MB to 50MB.

With all that said, would it be possible for the Go runtime to expose:

  • A safe way to inject our own stack size (even if it is static).
  • A histogram of all the stack sizes it has seen; this would help us decide what size gives the best results without using too much memory.

Or do you have any ideas on how to implement this more safely (without requiring private linking)?

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jun 28, 2023
@randall77 (Contributor)

A safe way to inject our own stack size (even if it is static).

The Go project has explicitly avoided providing knobs like this. It adds complexity that we have to live with forever. I don't think we'd want to change that stance to get just a few percent performance improvement.

A histogram of all the stack sizes it has seen; this would help us decide what size gives the best results without using too much memory.

This might be possible with the new runtime/metrics API, as it supports returning histograms. I don't think the runtime currently tracks goroutine sizes, so this might incur some additional cost to track. We'd also have to decide what "size" means here, exactly. Peak use? Average use? Randomly sampled use? Sampled at GC time use?

And given the above, I don't see a compelling need to provide this data given that we don't intend to provide a way to use it.
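
For illustration, consuming a histogram through runtime/metrics looks like the sketch below. It reads the existing /sched/latencies:seconds metric purely to show the API shape; a stack-size histogram like the one requested here does not exist in the runtime today.

```go
// Sketch: how a runtime/metrics Float64Histogram is consumed. The metric
// read here is scheduling latency; a stack-size histogram is hypothetical.
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)

	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("histogram metric unavailable on this Go version")
		return
	}
	h := samples[0].Value.Float64Histogram()
	// Counts[i] is the number of samples in the bucket [Buckets[i], Buckets[i+1]).
	for i, c := range h.Counts {
		if c == 0 {
			continue
		}
		fmt.Printf("[%g, %g): %d\n", h.Buckets[i], h.Buckets[i+1], c)
	}
}
```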

mpx (Contributor) commented Jul 4, 2023

The assumption that a good global average size exists appears to be false for many applications. Allowing the developer to control the global value doesn't improve the situation.

It would be better to automatically track/set the initial stack size per go invocation - I suspect this assumption is likely to hold in many more cases. The assumption could still break for goroutine pools running wildly different tasks, though; any automated solution is going to struggle there.

@chabbimilind

The average is not a very representative metric. This deserves some more exploration of other statistical measures, e.g., median/mode/...

cespare (Contributor) commented Jul 5, 2023

You could also imagine incorporating this into PGO somehow.


rabbbit commented Jul 6, 2023

Thanks for the response @randall77!

(Disclaimer: I work at the same corporation as @cdvr1993, but in a different domain.)

The Go project has explicitly avoided providing knobs like this. It adds complexity that we have to live with forever. I don't think we'd want to change that stance to get just a few percent performance improvement.
[...]
I don't see a compelling need to provide this data given that we don't intend to provide a way to use it.

That was bad phrasing on our part. We would also strongly prefer not to have knobs and for things to (mostly) just work. The goal was to report the situation, list what we tried and thought about, and ask for advice.

To answer your points one by one:

  • knobs: we agree that less is more. But then there are GOMAXPROCS, GOGC, GOMEMLIMIT - all modifiable at runtime (see the sketch after this list). It was not immediately clear what counts as "too much". We mentioned the extra knob because we imagined it could (theoretically) be easier to expose a knob than to change the algorithm in general, as in runtime: new goroutines can spend excessive time in morestack #18138.
  • "just a few percent" - 7% total CPU reduction is a significant amount for us. Especially compared to things like PGO, which seem to have lower wins (?)
  • stats - well, we could at least observe this better in production and experiment in "user-space". We could then report back what we see.
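
For reference, the existing knobs mentioned above are all adjustable from within the program, as in the sketch below (values are placeholders, not recommendations); there is no analogous call for the initial goroutine stack size.

```go
// Illustration of existing runtime-adjustable knobs; the concrete values
// are placeholders, not recommendations.
package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	runtime.GOMAXPROCS(8)         // programmatic equivalent of GOMAXPROCS
	debug.SetGCPercent(200)       // programmatic equivalent of GOGC
	debug.SetMemoryLimit(4 << 30) // programmatic equivalent of GOMEMLIMIT (4 GiB here)
}
```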

As it stands, we'll likely end up using the unsafe/linkname tricks in at least the worst affected applications.

@mknyszek mknyszek added this to the Backlog milestone Jul 12, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 12, 2023

rabbbit commented Aug 1, 2023

@randall77 could I maybe ask you to reconsider the stat emission? Or is that an absolute strong no?

Even without more knobs, we can already start a "goroutine ballast" today. It would be much easier to do if we had stats to base it on.
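
For concreteness, the "goroutine ballast" we have in mind is something like the rough sketch below: grow the stack past a chosen size once at goroutine start so later deep calls don't keep paying for copystack. The 16KB figure and helper names are purely illustrative, and the effect only lasts until the GC shrinks the stack again unless gcshrinkstackoff is set.

```go
// Rough sketch of a "goroutine ballast": pre-grow the stack once so later
// deep call chains do not trigger copystack. Sizes and names are illustrative.
package main

//go:noinline
func growStack(seed byte) byte {
	// The large local array forces an immediate stack growth when this
	// function is called; using it keeps the compiler from dropping it.
	var ballast [16 << 10]byte
	ballast[0] = seed
	return ballast[len(ballast)-1]
}

// goWithBallast starts fn on a goroutine whose stack has already been grown.
func goWithBallast(fn func()) {
	go func() {
		growStack(0)
		fn()
	}()
}

func main() {
	done := make(chan struct{})
	goWithBallast(func() { close(done) })
	<-done
}
```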

@randall77 (Contributor)

I don't have any objection to exposing such a stat by itself. I just think it would take a fair amount of work to implement - it isn't just exporting something the runtime is already keeping track of. Given the extra work required and limited payoff I don't think it is worth doing.
If someone can propose / design a stat that's easy & cheap to collect, then I think it would be ok to export it.

@prashantv (Contributor)

Instead of exposing this via runtime/metrics, would it be cheaper to expose it as part of pprof or some other explicit call? E.g., include the stack size for each goroutine in the pprof goroutine profile output.

cdvr1993 (Author) commented Aug 1, 2023

I was actually able to get this info from CPU profiles. I'm not sure if it is the correct approach, but at least I have some info.

So basically I focus on copystack on one profile and then disassemble the binary so that I can get the stack usage of each stack trace.

Then I just order them based on total size.

With that I was able to determine that the best starting stack size for the application I mentioned previously was 32KB which allowed me to reduce copystack from 9.5% to 0.8%.

The remaining cases would require a somewhat big jump in stack size, so I decided to stay at 32KB.

@DanielShaulov

I don't have any objection to exposing such a stat by itself. I just think it would take a fair amount of work to implement - it isn't just exporting something the runtime is already keeping track of. Given the extra work required and limited payoff I don't think it is worth doing. If someone can propose / design a stat that's easy & cheap to collect, then I think it would be ok to export it.

A very simple stat that only keeps track of a "maxStackSize" can be implemented with some very small additions to scanstack and gcComputeStartingStackSize, saving the max alongside the sum and count that are already being kept.
This would at least give people an "upper bound" on how high startingStackSize needs to be to make sure their app never wastes CPU on copystack.
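
As a standalone toy (not actual runtime code), the bookkeeping would look something like the sketch below; in the runtime the update would sit next to the existing sum/count tracking rather than use atomics from a separate package.

```go
// Toy illustration of tracking a running max of observed stack sizes; the
// real change would live inside the runtime's existing stack-scan bookkeeping.
package main

import (
	"fmt"
	"sync/atomic"
)

var maxStackSize atomic.Uint64

// recordStackSize keeps a running maximum of every stack size it is handed.
func recordStackSize(size uint64) {
	for {
		cur := maxStackSize.Load()
		if size <= cur || maxStackSize.CompareAndSwap(cur, size) {
			return
		}
	}
}

func main() {
	for _, s := range []uint64{2 << 10, 32 << 10, 8 << 10} {
		recordStackSize(s)
	}
	fmt.Printf("max observed stack size: %d bytes\n", maxStackSize.Load())
}
```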

@randall77 - I think I can open such a PR; if I do, do you think it has a viable chance of getting merged?

@randall77 (Contributor)

Just tracking max would be pretty easy, yes.
I'm not sure it would be terribly useful though. Max has the problem that a single outlier can skew everything. avg*N for some small N is probably more robust.

I think for a proposal to be viable we would need a concrete description of what to collect (and how), and of how that information would be used. "Sizing my stack ballast" is not a very compelling use...

@DanielShaulov

I'm not sure it would be terribly useful though. Max has the problem that a single outlier can skew everything.

While that is true in general, web servers deployed at scale usually have a pretty consistent flow of requests: every "type" of request (method/route/some parameter combination) reaches roughly the same stack size for most requests, and each type occurs with some frequency.
The problem with anything that uses the average is that it can be heavily skewed towards whatever creates the majority of goroutines, while a small minority of requests that reach a larger stack (say the 10% of requests that are POSTs and go through a different flow) still cause significant wasted CPU.
For those same apps, the cost of memory is often insignificant compared to the cost of CPU, making the decision to use the max a very easy one (for the subset of apps that we know how to pinpoint).

"Sizing my stack ballast" is not a very compelling use...

This is a "means to an end" - the goal is to reduce costs. We have seen services waste up to 15% of their CPU on morestack (from CPU profiling). For large services this sums up to very significant compute cost.
