proposal: runtime/pprof: add SetMaxDumpGoroutine to limit the number of goroutines dumped in a profile #50794

Closed
doujiang24 opened this issue Jan 25, 2022 · 29 comments

Comments

@doujiang24
Contributor

We are using pprof for online profiling, and we found that dumping goroutines can take too much time when there are many goroutines.

In this test it takes ~300ms when there are 100k goroutines on my side:
https://gist.github.com/doujiang24/9d066bce3a2bdd0f1b9fe1ef49699e4e

That is too long for us, since our system commonly runs 10k-100k goroutines.
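
For reference, a minimal sketch of the kind of measurement behind that number (the goroutine count and the blocked-goroutine setup below are illustrative, not taken from the gist):

```go
package main

import (
	"fmt"
	"io"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	// Start a large number of blocked goroutines to make the dump expensive.
	const n = 100000
	var wg sync.WaitGroup
	block := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-block
		}()
	}

	// Time a full goroutine dump; debug=1 produces the aggregated text format.
	start := time.Now()
	if err := pprof.Lookup("goroutine").WriteTo(io.Discard, 1); err != nil {
		fmt.Println("dump failed:", err)
	}
	fmt.Printf("dumping %d goroutines took %v\n", n, time.Since(start))

	close(block)
	wg.Wait()
}
```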

I think the simplest approach is to introduce a new API, SetMaxDumpGoroutineNum.
I have implemented it in this PR: #50771

However, the SetMaxDumpGoroutineNum API introduces a new global variable,
which may not be a good idea.

Any feedback would be greatly appreciated. Thank you!

@aclements
Member

Could you explain your use case more? Usually "online profiling" is done using a CPU profile, not a goroutine profile.

@ianlancetaylor ianlancetaylor changed the title pprof: limit the max goroutine number for dumping. proposal: runtime/pprof: add SetMaxDumpGoroutine to limit the number of goroutines dumped in a profile Jan 25, 2022
@gopherbot gopherbot added this to the Proposal milestone Jan 25, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Jan 25, 2022
@rsc
Contributor

rsc commented Feb 2, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Feb 2, 2022
@rhysh
Contributor

rhysh commented Feb 2, 2022

Goroutine profiles stop the world, iterate over the entire list of live goroutines, and then restart the world. That whole-app pause can be hundreds of milliseconds when there are hundreds of thousands of goroutines.

@doujiang24 , is the problem for your apps the length of time that the app cannot do any other useful work? If so, it sounds similar to what's described in #33250.

Or is it the length of time that the specific goroutine calling runtime/pprof.Lookup("goroutine").WriteTo needs to wait before it gets results?


@aclements , I've found regular collection of goroutine profiles to be one of the best tools (combination of effectiveness and ease-of-use) for tracking down problems of "why did my app get slow", including ones like the net/http deadlock in issue 32388. When I encountered that one, it affected a single service instance and self-resolved after about 15 minutes, so we were only able to understand it thanks to regular and automated collection of goroutine profiles. See also fgprof. Though @doujiang24 likely has other uses.

@doujiang24
Contributor Author

Could you explain your use case more? Usually "online profiling" is done using a CPU profile, not a goroutine profile.

@aclements Sorry for the delay.
We want to know what the goroutines are doing or waiting for when there are too many of them,
so that we have a chance to reduce the number of goroutines.
Sampling is good enough for us since we don't need to see every goroutine.

@doujiang24
Contributor Author

@rhysh Yeah, it's similar to #33250.
We use Go to build a network proxy platform (MOSN), mostly used as a service mesh data plane.
So latency is very important to us.

@doujiang24
Contributor Author

I came up with a new idea for the API.
We could introduce WithMaxGoroutine instead of SetMaxDumpGoroutine to avoid introducing a global variable.

usage:

pprof.Lookup("goroutine").WithMaxGoroutine(1024).WriteTo(...)
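
For illustration only, a user-side sketch of the call shape this proposal has in mind. WithMaxGoroutine does not exist in runtime/pprof, and a wrapper outside the runtime cannot actually limit how many goroutines are collected; the names below are hypothetical:

```go
package main

import (
	"io"
	"os"
	"runtime/pprof"
)

// limitedProfile mimics the proposed chained API; the limit is only
// carried along here, since enforcing it would require runtime support.
type limitedProfile struct {
	p   *pprof.Profile
	max int
}

func withMaxGoroutine(p *pprof.Profile, max int) *limitedProfile {
	return &limitedProfile{p: p, max: max}
}

func (lp *limitedProfile) WriteTo(w io.Writer, debug int) error {
	// The real proposal would pass lp.max into the runtime so that only a
	// sample of goroutines is recorded; here we can only write everything.
	return lp.p.WriteTo(w, debug)
}

func main() {
	_ = withMaxGoroutine(pprof.Lookup("goroutine"), 1024).WriteTo(os.Stdout, 1)
}
```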

@rsc
Contributor

rsc commented Feb 9, 2022

The big question is whether the stop-the-world is the problem versus the size of the profile. If the former, then writing less data won't take appreciably less time.

@doujiang24
Contributor Author

The big question is whether the stop-the-world is the problem versus the size of the profile. If the former, then writing less data won't take appreciably less time.

@rsc Thanks for pointing that out. I did more testing, and here are the results.

Writing less data does help in the previous case that takes ~300ms total (https://gist.github.com/doujiang24/9d066bce3a2bdd0f1b9fe1ef49699e4e).

I also gathered more detailed timings by adding extra logging in pprof.go.
They show roughly a 10x speedup from writing less data.

With 100k goroutines and no goroutine limit (the standard behavior):

  1. fetching the profile (runtime_goroutineProfileWithLabels, including the STW) takes ~130ms;
  2. the whole pprof goroutine dump takes ~300ms (including printCountProfile).

With 100k goroutines and a limit of 1k goroutines, using PR #50771:

  1. fetching the profile (runtime_goroutineProfileWithLabelsAndLimit) takes ~15ms;
  2. the whole pprof goroutine dump takes ~18ms.

STW may take longer in some other cases, but I think this shows that writing less data really does take appreciably less time.

@rsc
Contributor

rsc commented Feb 16, 2022

PR 50771 seems to be doing a random sample of the goroutines. Is that helpful in your use case, even though you don't see all of them?
Then there's the question of the API.

/cc @aclements @mknyszek @prattmic for thoughts

@aclements
Member

I suspect we could eliminate the STW here at the cost of losing "snapshot consistency" across the goroutine stacks.

Just looking over the current implementation, we could certainly improve the performance of this, though I doubt we can squeeze 10x out of it short of substantially redesigning how traceback works (e.g., using frame pointers).

@mknyszek
Contributor

mknyszek commented Feb 16, 2022

I think it's still not clear to me where exactly the problem lies for your use case, @doujiang24. Is the problem the 300 ms delay between when you request the profile and when you get it, or the 130 ms (or however long the world is actually stopped), since that is a global latency hit?

If it's the latter, one thing we could do instead of limiting the number of goroutines is relax the consistency of the goroutine dump. Depending on how strict we want to be on backwards compatibility with this, this might not require a new API. If the goal is statistical goroutine sampling anyway, then I think that's fine. If a major use-case is getting a consistent snapshot that always 100% makes sense, then either this isn't really an option or it needs a knob.

I imagine that the global latency impact of this approach would be the same latency impact as stack scanning by the GC, which is definitely much, much less than 130 ms even for 100,000 goroutines. However, this would not help the end-to-end latency of the operation; it would still take 300 ms for that many goroutines, limiting how often you can sample.

@prattmic
Member

Here are some of my initial thoughts; I haven't thought too deeply about this.

Currently, the goroutine profile is a complete instantaneous (one-off) view of what every goroutine is doing. Since it involves a stop-the-world (STW), we even get a consistent view of where every goroutine is at a single point in time [1] [2].

On the other hand, this proposal effectively turns this profile into an instantaneous (one-off) sampled view of what goroutines are doing, since we don't capture every goroutine. It would still maintain the consistent view due to STW.

My question for brainstorming is whether this is the best interface for use-cases of this profile? Is sampling OK, or do we typically need all goroutines? Is a consistent snapshot time needed? If so, maybe this API is best.

But I could also imagine that we could provide a continuous sampling goroutine profile. Rather than a single instantaneous profile point, we would sample goroutine state over time, eventually capturing all of them but never all at the same time. Would such a profile be useful? I'm not sure.

[1] As opposed to collecting traces while goroutines are running, which would result in each goroutine getting traced at a slightly different time of execution. In this case it would be possible to observe what appear to be "impossible" states between 2 goroutines.

[2] Modulo that some small bits of code are not preemptible and thus cannot be observed in the profile, but I don't think this matters much.

@ianlancetaylor
Contributor

It's worth asking what the purpose of the goroutine profile is. It's not particularly helpful for analyzing performance or memory usage. Presumably it is helpful for finding leaked goroutines: goroutines that are waiting for something to happen that is never going to happen. Are there other uses for this profile?

If that is the main use, can we have a more targeted goroutine profile? Such as one that only lists goroutines that have been blocked for some period of time?
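
As a rough illustration of that idea (not a proposed API), a debug=2 goroutine dump can already be post-processed to keep only goroutines the runtime reports as blocked for a while; the threshold and the parsing below are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// Headers in a debug=2 dump look like "goroutine 42 [chan receive, 6 minutes]:".
// The runtime only annotates waits of roughly a minute or more.
var waitRE = regexp.MustCompile(`,\s*(\d+)\s*minutes?\]`)

func main() {
	const minMinutes = 5 // arbitrary threshold for "blocked for a while"

	data, err := os.ReadFile(os.Args[1]) // a dump from Lookup("goroutine").WriteTo(f, 2)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Each goroutine is one block, separated from the next by a blank line.
	for _, block := range strings.Split(strings.TrimSpace(string(data)), "\n\n") {
		m := waitRE.FindStringSubmatch(block)
		if m == nil {
			continue
		}
		if n, _ := strconv.Atoi(m[1]); n >= minMinutes {
			fmt.Println(block)
			fmt.Println()
		}
	}
}
```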

@prattmic
Member

I've found regular collection of goroutine profiles to be one of the best tools (combination of effectiveness and ease-of-use) for tracking down problems of "why did my app get slow"

@rhysh Could you expand more on this comment? How was this helpful? I'm imagining noticing some goroutines that, while not deadlocked, tend to be blocked or temporarily stopped somewhere we don't expect to be hot, but I'd like to hear if that is what you did.

@rhysh
Contributor

rhysh commented Feb 16, 2022

A common workflow is to collect hundreds or thousands of goroutine profiles (from every instance of an app, every few minutes, for some time range), to get a count in each profile of the number of goroutines with a function matching ServeHTTP on their stack, and to look at what's different in the profiles that score the highest.
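
As an illustration, a rough sketch of how that per-profile count could be scripted over debug=1 goroutine profiles (the parsing and file handling below are assumptions, not part of the workflow described above):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// countServeHTTP parses the text format written by
// pprof.Lookup("goroutine").WriteTo(w, 1), where identical stacks are
// grouped into buckets headed by "<count> @ <pcs...>", and sums the
// counts of buckets that have ServeHTTP somewhere on the stack.
func countServeHTTP(data string) int {
	total, cur, matched := 0, 0, false
	flush := func() {
		if matched {
			total += cur
		}
		cur, matched = 0, false
	}
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == "@" {
			flush() // new bucket header
			if n, err := strconv.Atoi(fields[0]); err == nil {
				cur = n
			}
			continue
		}
		if strings.HasPrefix(line, "#") && strings.Contains(line, "ServeHTTP") {
			matched = true
		}
	}
	flush()
	return total
}

func main() {
	for _, path := range os.Args[1:] {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d goroutines with ServeHTTP on the stack\n", path, countServeHTTP(string(data)))
	}
}
```

Running something like this over profiles from many instances and time windows makes the highest-scoring profiles easy to spot.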

Goroutine profiles were important in tracking down #33747, where a large number of waiters on a single lock would lead to performance collapse.

A team I work with encountered the effects of #32388 last October. We used goroutine profiles to figure out what was going on there, and then discovered that someone had already reported the issue (and they had used goroutine profiles too).

Sometimes the reason an app slows down is that its log collection daemon is running behind, and (with a blocking log library) its calls to log end up waiting on the logger's sync.Mutex for a little while.

None of those are exactly leaks, because the thing those goroutines are waiting for does eventually happen. In the first and third cases, if load on the application decreases then the goroutines will each get a turn with the sync.Mutex. In the second case the OS closes the TCP connection after 15 minutes or so (sometimes faster).

We also used them to debug the problem that led to aws/aws-sdk-go#3127, where a sync.Mutex protected a cache that was only populated on success, so consistent failures led to 100% cache misses which were then serialized.

Around Go 1.13, an app I support with a typical response time of 1ms had occasional delays of 100ms. I used goroutine profiles to learn that those requests were delayed in particular functions by calls to runtime.morestack{,_noctxt}. A very close look with the CPU profiler (with request id added as a profile label, filtering to request IDs that showed up more than once) showed contention in runtime.(*mcentral).cacheSpan, which was fixed by then in CL 221182 (thanks @mknyszek !).

It's a very general tool, very easy to deploy (vs the opt-in of block/mutex profiles), very easy to use (vs execution trace).

@doujiang24
Contributor Author

Thanks to @rhysh for the use cases.
In our system, we monitor the goroutine count online regularly and dump a goroutine profile when:

  1. the count increases too much suddenly (it may decrease again after some time), or
  2. the count grows to a very large number.
    (We haven't enabled regular dumps since we want to reduce the overhead of dumping.)

The first case is usually caused by some unexpected issue, e.g. network blocking or an unexpected lock in some corner case.
The second case usually means a leak; we can usually find those before going online, but there may still be corner cases that only happen online.
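
A minimal sketch of that kind of trigger, assuming made-up thresholds, check interval, and file naming (not MOSN's actual code):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// monitorGoroutines dumps a goroutine profile when the goroutine count
// jumps by spikeDelta between checks or grows past absLimit.
func monitorGoroutines(absLimit, spikeDelta int, interval time.Duration) {
	prev := runtime.NumGoroutine()
	for range time.Tick(interval) {
		cur := runtime.NumGoroutine()
		if cur-prev >= spikeDelta || cur >= absLimit {
			f, err := os.Create(fmt.Sprintf("goroutine-%d.pb.gz", time.Now().Unix()))
			if err == nil {
				_ = pprof.Lookup("goroutine").WriteTo(f, 0) // debug=0: gzipped proto format
				f.Close()
			}
		}
		prev = cur
	}
}

func main() {
	go monitorGoroutines(100000, 10000, 10*time.Second)
	select {} // stand-in for the real workload
}
```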

Actually, we care about two parts of the overhead:

  1. fetching the profile (mostly the STW time), ~130ms in the test case;
  2. the remaining pure formatting and dumping work (printCountProfile), the other ~170ms.

The STW time is the more important part for us,
but we also want to reduce the second part, since ~170ms of pure CPU time is also a bit heavy for an online network proxy system and not very meaningful for us. We hope for less profiling overhead.


@rsc Yeah, we mostly want to know what the goroutines are doing or waiting for. I think random sampling is a good choice, just like the sampling in the CPU or memory profiles.


I suspect we could eliminate the STW here at the cost of losing "snapshot consistency" across the goroutine stacks.

@aclements @mknyszek I think this is a good option for our use case. I'm glad to see it can be implemented, but there is another question: do we just break backward compatibility, or introduce another new API to control it?
I also think it doesn't conflict with sampling: sampling can still reduce the total overhead even when there is no STW.


If the goal is statistical goroutine sampling anyway, then I think that's fine.

@mknyszek yes, exactly.
In our previous cases there were a large number of unexpected goroutines waiting on the same backtrace, and we don't need to know the exact count; the ratio is good enough for us.


Is sampling OK, or do we typically need all goroutines? Is a consistent snapshot time needed?

@prattmic sampling is usually OK for me, but I'm not sure whether it's a good fit for others.


If that is the main use can we have a more targeted goroutine profile? Such as, only list goroutines that have been blocked for some period of time?

@ianlancetaylor yeah, in general we are interested in the blocked goroutines when taking a goroutine profile.
But I think the goroutine profile also gives a brief, high-level view of what the goroutines of a system are doing or waiting for, which helps us understand the system better and find ways to optimize it from a global view (a bit like the CPU profile, but not the same thing).

Thank you, everyone!

@gopherbot

Change https://go.dev/cl/387415 mentions this issue: runtime: decrease STW pause for goroutine profile

@prattmic prattmic modified the milestones: Proposal, Backlog Feb 22, 2022
@rsc
Contributor

rsc commented Mar 2, 2022

Assuming CL 387415 lands, do we need SetMaxDumpGoroutine at all? Or should this proposal be closed? Thanks.

@doujiang24
Contributor Author

@rsc I think it's still better to have SetMaxDumpGoroutine on my side, since it would reduce the total CPU cost.
One case on my side:
there are 40k goroutines in total, and most of them (about 30k) are unexpected and share the same backtrace.
I think SetMaxDumpGoroutine (sampling with a limit) is the better choice for this case.

@rhysh Glad to see CL 387415 started. Do you think sampling with a limit conflicts with CL 387415?

@felixge
Contributor

felixge commented Mar 3, 2022

❤️ the discussion and proposed improvements, especially CL 387415 from @rhysh.

Before deciding on a SetMaxDumpGoroutine() API, I'd like to mention a related idea for extending the goroutine profiling API: If we could manage low STW overhead AND targeted goroutine profiling, it would be possible to build a goroutine tracer similar to the JavaScript wall-clock profiler in Chrome, showing stack traces on a per-goroutine timeline (flame chart) view rather than a FlameGraph.


I'm planning to submit a separate proposal and a working proof of concept (see this video for a teaser) for this soon, but the general idea would be to use pprof labels to allow users to mark a small subset of their goroutines (e.g. every Nth goroutine handling incoming user requests) as "tracer goroutines" and then sample their stack traces at 100 Hz and record timestamps + goroutine ids + stacks. The output format could be the trace event format or the protobuf format used by perfetto.
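
For context, a subset of goroutines can already be marked with pprof labels today; a minimal sketch of tagging every Nth request handler (the label key, the sampling choice, and the handler are illustrative assumptions):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
	"sync/atomic"
)

var reqCount atomic.Int64

// markEveryNth runs every nth request under a pprof label so that its
// goroutine can be picked out of goroutine/CPU profiles later.
func markEveryNth(n int64, h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if reqCount.Add(1)%n != 0 {
			h.ServeHTTP(w, r)
			return
		}
		pprof.Do(r.Context(), pprof.Labels("traced", "true"), func(ctx context.Context) {
			h.ServeHTTP(w, r.WithContext(ctx))
		})
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok"))
	})
	http.Handle("/", markEveryNth(100, hello))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```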

Perhaps this is entirely orthogonal to SetMaxDumpGoroutine, my only direct comment for this would be that a NewGoroutineProfile + GoroutineProfile.SetMaxGoroutines + GoroutineProfile.Write API similar to what's proposed for the CPU profiler in #42502 might provide a better path for future API additions.

@rhysh
Contributor

rhysh commented Mar 3, 2022

Collecting the profile concurrently would complicate the sampling algorithm, but otherwise I think the two changes would be able to work together, @doujiang24 . (I don't have an algorithm off the top of my head for how to do that without O(samples) light work to prepare, but at least it's less than O(goroutines) of heavy work.)

I want to be clear that as far as I know, CL 387415 (at PS 3) does not compromise on the consistency of the goroutine profile (based on the test infrastructure in CL 387416). The key information that goroutine profiles provide is sometimes the presence/absence/state of a single goroutine. In #32388 it's a single goroutine holding up the http connection pool because of a slow network write (and 10k goroutines waiting). In #33747 it's the mysterious absence of any goroutine actually holding the lock (and 10k goroutines waiting). So I'm not convinced that sampling the goroutines is going to give good results for its users. An additional field on an extensible API like @felixge mentioned seems like a decent middleground.

As for the CPU overhead, I'm not sure how to tell how much is too much. Here's a possible framework: the machines I use often provide CPU and memory allocations in a particular ratio, such as 2 GB of memory per 1 CPU thread. Filling the memory with 2 kB goroutine stacks allows 1e6 goroutines. Collecting a goroutine profile for all of those takes about 2 µs each, for a total of 2 seconds of CPU time. Given that a goroutine profile every five minutes then costs less than 1% of the CPU (2 s per 300 s is roughly 0.7%), how often does an app need to collect a goroutine profile before the collection takes an unacceptable fraction of the compute resources? Faster is nice, but maybe it's already fast enough. @doujiang24 , what framework and target do you have in mind for "We hope less overhead for profiling."?

@rsc
Contributor

rsc commented Mar 9, 2022

It sounds like there are enough available performance improvements (some of which are already being implemented) that we should hold off on adding any new API right now, which we would be stuck with forever. Do I have that right?

@rhysh
Contributor

rhysh commented Mar 9, 2022

we should hold off on adding any new API right now

That's my view, yes.

@doujiang24
Contributor Author

@rhysh Sorry for the delay.
Yeah, using a global variable is not a good idea; WithMaxGoroutine should be better:
#50794 (comment)

Also, sampling is not a good idea for all cases, but it is really useful in some cases.
We just want an option, not to force sampling always.

  1. We implemented a continuous profiling library based on pprof to capture CPU/memory/goroutine spikes (at the seconds level), and we don't want the profiling itself to introduce spikes.
  2. 1% CPU time over five minutes should be okay in our case, and I think it is okay for most cases.
  3. But we care about CPU spikes: since we (MOSN) are a network proxy, a CPU spike always leads to a latency spike, which matters more to us.

Yeah, CL 387415 reducing the STW really helps, since the STW can introduce larger latency.
But a 100+ms CPU block on one P is still not friendly for us.

@rsc
Contributor

rsc commented Mar 16, 2022

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Mar 16, 2022
@doujiang24
Contributor Author

Okay, just to confirm clearly: is WithMaxGoroutine also not acceptable?

@ianlancetaylor
Contributor

@doujiang24 As @rsc says above, our current opinion is "we should hold off on adding any new API right now, which we would be stuck with forever." Let's try the new changes out for a while and see to what extent there is still a real problem here to solve. Thanks.

@doujiang24
Contributor Author

Okay, got it. Thank you all.

@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Mar 23, 2022
@rsc
Contributor

rsc commented Mar 23, 2022

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed Mar 23, 2022
gopherbot pushed a commit that referenced this issue May 3, 2022
The goroutine profile needs to stop the world to get a consistent
snapshot of all goroutines in the app. Leaving the world stopped while
iterating over allgs leads to a pause proportional to the number of
goroutines in the app (or its high-water mark).

Instead, do only a fixed amount of bookkeeping while the world is
stopped. Install a barrier so the scheduler confirms that a goroutine
appears in the profile, with its stack recorded exactly as it was during
the stop-the-world pause, before it allows that goroutine to execute.
Iterate over allgs while the app resumes normal operations, adding each
to the profile unless they've been scheduled in the meantime (and so
have profiled themselves). Stop the world a second time to remove the
barrier and do a fixed amount of cleanup work.

This increases both the fixed overhead and per-goroutine CPU-time cost
of GoroutineProfile. It also increases the wall-clock latency of the
call to GoroutineProfile, since the scheduler may interrupt it to
execute other goroutines.

    name                                  old time/op    new time/op    delta
    GoroutineProfile/small/loaded-8         1.05ms ± 5%    4.99ms ±31%   +376.85%  (p=0.000 n=10+9)
    GoroutineProfile/sparse/loaded-8        1.04ms ± 4%    3.61ms ±27%   +246.61%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8         7.69ms ±17%   20.35ms ± 4%   +164.50%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle              958µs ± 0%    1820µs ±23%    +89.91%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8          1.00ms ± 3%    1.52ms ±17%    +51.18%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8           1.01ms ± 4%    1.47ms ± 7%    +45.28%  (p=0.000 n=9+9)
    GoroutineProfile/sparse/idle             980µs ± 1%    1403µs ± 2%    +43.22%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle-8           7.19ms ± 8%    8.43ms ±21%    +17.22%  (p=0.011 n=10+10)
    PingPongHog                              511ns ± 8%     585ns ± 9%    +14.39%  (p=0.000 n=10+10)
    GoroutineProfile/large/idle             6.71ms ± 0%    7.58ms ± 3%    +13.08%  (p=0.000 n=8+10)
    PingPongHog-8                            469ns ± 8%     509ns ±12%     +8.62%  (p=0.010 n=9+10)
    WakeupParallelSyscall/5µs                216µs ± 4%     229µs ± 3%     +6.06%  (p=0.000 n=10+9)
    WakeupParallelSyscall/5µs-8              147µs ± 1%     149µs ± 2%     +1.12%  (p=0.009 n=10+10)
    WakeupParallelSyscall/2µs-8              140µs ± 0%     142µs ± 1%     +1.11%  (p=0.001 n=10+9)
    WakeupParallelSyscall/50µs-8             236µs ± 0%     238µs ± 1%     +1.08%  (p=0.000 n=9+10)
    WakeupParallelSyscall/1µs-8              138µs ± 0%     140µs ± 1%     +1.05%  (p=0.013 n=10+9)
    Matmult                                 8.52ns ± 1%    8.61ns ± 0%     +0.98%  (p=0.002 n=10+8)
    WakeupParallelSyscall/10µs-8             157µs ± 1%     158µs ± 1%     +0.58%  (p=0.003 n=10+8)
    CreateGoroutinesSingle-8                 328ns ± 0%     330ns ± 1%     +0.57%  (p=0.000 n=9+9)
    WakeupParallelSpinning/100µs-8           343µs ± 0%     344µs ± 1%     +0.30%  (p=0.015 n=8+8)
    WakeupParallelSyscall/20µs-8             178µs ± 0%     178µs ± 0%     +0.18%  (p=0.043 n=10+9)
    StackGrowthDeep-8                       22.8µs ± 0%    22.9µs ± 0%     +0.12%  (p=0.006 n=10+10)
    StackGrowth                             1.06µs ± 0%    1.06µs ± 0%     +0.09%  (p=0.000 n=8+9)
    WakeupParallelSpinning/0s               10.7µs ± 0%    10.7µs ± 0%     +0.08%  (p=0.000 n=9+9)
    WakeupParallelSpinning/5µs              30.7µs ± 0%    30.7µs ± 0%     +0.04%  (p=0.000 n=10+10)
    WakeupParallelSpinning/100µs             411µs ± 0%     411µs ± 0%     +0.03%  (p=0.000 n=10+9)
    WakeupParallelSpinning/2µs              18.7µs ± 0%    18.7µs ± 0%     +0.02%  (p=0.026 n=10+10)
    WakeupParallelSpinning/20µs-8           93.0µs ± 0%    93.0µs ± 0%     +0.01%  (p=0.021 n=9+10)
    StackGrowth-8                            216ns ± 0%     216ns ± 0%       ~     (p=0.209 n=10+10)
    CreateGoroutinesParallel-8              49.5ns ± 2%    49.3ns ± 1%       ~     (p=0.591 n=10+10)
    CreateGoroutinesSingle                   699ns ±20%     748ns ±19%       ~     (p=0.353 n=10+10)
    WakeupParallelSpinning/0s-8             15.9µs ± 2%    16.0µs ± 3%       ~     (p=0.315 n=10+10)
    WakeupParallelSpinning/1µs              14.6µs ± 0%    14.6µs ± 0%       ~     (p=0.513 n=10+10)
    WakeupParallelSpinning/2µs-8            24.2µs ± 3%    24.1µs ± 2%       ~     (p=0.971 n=10+10)
    WakeupParallelSpinning/10µs             50.7µs ± 0%    50.7µs ± 0%       ~     (p=0.101 n=10+10)
    WakeupParallelSpinning/20µs             90.7µs ± 0%    90.7µs ± 0%       ~     (p=0.898 n=10+10)
    WakeupParallelSpinning/50µs              211µs ± 0%     211µs ± 0%       ~     (p=0.382 n=10+10)
    WakeupParallelSyscall/0s-8               137µs ± 1%     138µs ± 0%       ~     (p=0.075 n=10+10)
    WakeupParallelSyscall/1µs                216µs ± 1%     219µs ± 3%       ~     (p=0.065 n=10+9)
    WakeupParallelSyscall/2µs                214µs ± 7%     219µs ± 1%       ~     (p=0.101 n=10+8)
    WakeupParallelSyscall/50µs               317µs ± 5%     326µs ± 4%       ~     (p=0.123 n=10+10)
    WakeupParallelSyscall/100µs              450µs ± 9%     459µs ± 8%       ~     (p=0.247 n=10+10)
    WakeupParallelSyscall/100µs-8            337µs ± 0%     338µs ± 1%       ~     (p=0.089 n=10+10)
    WakeupParallelSpinning/5µs-8            32.2µs ± 0%    32.2µs ± 0%     -0.05%  (p=0.026 n=9+10)
    WakeupParallelSpinning/50µs-8            216µs ± 0%     216µs ± 0%     -0.12%  (p=0.004 n=10+10)
    WakeupParallelSpinning/1µs-8            20.6µs ± 0%    20.5µs ± 0%     -0.22%  (p=0.014 n=10+10)
    WakeupParallelSpinning/10µs-8           54.5µs ± 0%    54.2µs ± 1%     -0.57%  (p=0.000 n=10+10)
    CreateGoroutines-8                       213ns ± 1%     211ns ± 1%     -0.86%  (p=0.002 n=10+10)
    CreateGoroutinesCapture                 1.03µs ± 0%    1.02µs ± 0%     -0.91%  (p=0.000 n=10+10)
    CreateGoroutinesCapture-8               1.32µs ± 1%    1.31µs ± 1%     -1.06%  (p=0.001 n=10+9)
    CreateGoroutines                         188ns ± 0%     186ns ± 0%     -1.06%  (p=0.000 n=9+10)
    CreateGoroutinesParallel                 188ns ± 0%     186ns ± 0%     -1.27%  (p=0.000 n=8+10)
    WakeupParallelSyscall/0s                 210µs ± 3%     207µs ± 3%     -1.60%  (p=0.043 n=10+10)
    StackGrowthDeep                          121µs ± 1%     119µs ± 1%     -1.70%  (p=0.000 n=9+10)
    Matmult-8                               1.82ns ± 3%    1.78ns ± 3%     -2.16%  (p=0.020 n=10+10)
    WakeupParallelSyscall/20µs               281µs ± 3%     269µs ± 4%     -4.44%  (p=0.000 n=10+10)
    WakeupParallelSyscall/10µs               239µs ± 3%     228µs ± 9%     -4.70%  (p=0.001 n=10+10)
    GoroutineProfile/sparse-nil/idle-8       485µs ± 2%      12µs ± 4%    -97.56%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle-8        484µs ± 2%      12µs ± 1%    -97.60%  (p=0.000 n=10+7)
    GoroutineProfile/small-nil/loaded-8      487µs ± 2%      11µs ± 3%    -97.68%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8     507µs ± 4%      11µs ± 6%    -97.78%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/idle-8        709µs ± 2%      11µs ± 4%    -98.38%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8      717µs ± 2%      11µs ± 3%    -98.43%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle         465µs ± 3%       1µs ± 1%    -99.84%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle          493µs ± 3%       1µs ± 0%    -99.85%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle          716µs ± 1%       1µs ± 2%    -99.89%  (p=0.000 n=7+10)

    name                                  old alloc/op   new alloc/op   delta
    CreateGoroutinesCapture                   144B ± 0%      144B ± 0%       ~     (all equal)
    CreateGoroutinesCapture-8                 144B ± 0%      144B ± 0%       ~     (all equal)

    name                                  old allocs/op  new allocs/op  delta
    CreateGoroutinesCapture                   5.00 ± 0%      5.00 ± 0%       ~     (all equal)
    CreateGoroutinesCapture-8                 5.00 ± 0%      5.00 ± 0%       ~     (all equal)

    name                                  old p50-ns     new p50-ns     delta
    GoroutineProfile/small/loaded-8          1.01M ± 3%     3.87M ±45%   +282.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/loaded-8         1.02M ± 3%     2.43M ±41%   +138.42%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8          7.43M ±16%    17.28M ± 2%   +132.43%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle               956k ± 0%     1559k ±16%    +63.03%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.01M ± 3%     1.45M ± 7%    +44.31%  (p=0.000 n=10+9)
    GoroutineProfile/sparse/idle              977k ± 1%     1399k ± 2%    +43.20%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.00M ± 3%     1.41M ± 3%    +40.47%  (p=0.000 n=10+10)
    GoroutineProfile/large/idle-8            6.97M ± 1%     8.41M ±25%    +20.54%  (p=0.003 n=8+10)
    GoroutineProfile/large/idle              6.71M ± 1%     7.46M ± 4%    +11.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        483k ± 3%       13k ± 3%    -97.41%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         483k ± 2%       12k ± 1%    -97.43%  (p=0.000 n=10+8)
    GoroutineProfile/small-nil/loaded-8       484k ± 3%       10k ± 2%    -97.93%  (p=0.000 n=10+8)
    GoroutineProfile/sparse-nil/loaded-8      492k ± 2%       10k ± 4%    -97.97%  (p=0.000 n=10+8)
    GoroutineProfile/large-nil/idle-8         708k ± 2%       12k ±15%    -98.36%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       714k ± 2%       10k ± 2%    -98.60%  (p=0.000 n=10+8)
    GoroutineProfile/sparse-nil/idle          459k ± 1%        1k ± 1%    -99.85%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle           477k ± 1%        1k ± 0%    -99.85%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle           712k ± 1%        1k ± 1%    -99.90%  (p=0.000 n=7+10)

    name                                  old p90-ns     new p90-ns     delta
    GoroutineProfile/small/loaded-8          1.13M ±10%     7.49M ±35%   +562.07%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/loaded-8         1.10M ±12%     4.58M ±31%   +318.02%  (p=0.000 n=10+9)
    GoroutineProfile/large/loaded-8          8.78M ±24%    27.83M ± 2%   +217.00%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle               967k ± 0%     2909k ±50%   +200.91%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.02M ± 3%     1.96M ±76%    +92.99%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.07M ±17%     1.55M ±12%    +45.23%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle              992k ± 1%     1417k ± 3%    +42.79%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle              6.73M ± 0%     7.99M ± 8%    +18.80%  (p=0.000 n=8+10)
    GoroutineProfile/large/idle-8            8.20M ±25%     9.18M ±25%       ~     (p=0.315 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        495k ± 3%       13k ± 1%    -97.36%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         494k ± 2%       13k ± 3%    -97.36%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/loaded-8       496k ± 2%       13k ± 1%    -97.41%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8      544k ±11%       13k ± 1%    -97.62%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle-8         724k ± 1%       13k ± 3%    -98.20%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       729k ± 3%       13k ± 2%    -98.23%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle          476k ± 4%        1k ± 1%    -99.85%  (p=0.000 n=9+10)
    GoroutineProfile/small-nil/idle           537k ±10%        1k ± 0%    -99.87%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle           729k ± 0%        1k ± 1%    -99.90%  (p=0.000 n=7+10)

    name                                  old p99-ns     new p99-ns     delta
    GoroutineProfile/sparse/loaded-8         1.27M ±33%    20.49M ±17%  +1514.61%  (p=0.000 n=10+10)
    GoroutineProfile/small/loaded-8          1.37M ±29%    20.48M ±23%  +1399.35%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8          9.76M ±23%    39.98M ±22%   +309.52%  (p=0.000 n=10+8)
    GoroutineProfile/small/idle               976k ± 1%     3367k ±55%   +244.94%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.03M ± 3%     2.50M ±65%   +142.30%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.17M ±34%     1.70M ±14%    +45.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle             1.02M ± 3%     1.45M ± 4%    +42.64%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle              6.92M ± 2%     9.00M ± 7%    +29.98%  (p=0.000 n=8+9)
    GoroutineProfile/large/idle-8            8.74M ±23%     9.90M ±24%       ~     (p=0.190 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        508k ± 4%       16k ± 2%    -96.90%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         508k ± 4%       16k ± 3%    -96.91%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/loaded-8       542k ± 5%       15k ±15%    -97.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8      649k ±16%       15k ±18%    -97.67%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/idle-8         738k ± 2%       16k ± 2%    -97.86%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       765k ± 4%       15k ±17%    -98.03%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle          539k ±26%        1k ±17%    -99.84%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle           659k ±25%        1k ± 0%    -99.84%  (p=0.000 n=10+8)
    GoroutineProfile/large-nil/idle           760k ± 2%        1k ±22%    -99.88%  (p=0.000 n=9+10)

Fixes #33250
For #50794

Change-Id: I862a2bc4e991cec485f21a6fce4fca84f2c6435b
Reviewed-on: https://go-review.googlesource.com/c/go/+/387415
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Mar 23, 2023