proposal: cmd/trace: add 'greedy goroutine diagnosis' option #20792
Comments
If you are interested in this little tool, the repository is here. |
Sorry, I might have made a mistake:
It should probably be:
Although I haven't dived into the Go scheduler's code yet, no matter which one is the right answer, the 1st one or the 2nd, it doesn't change the conclusion made before. |
/cc @aclements @RLH |
/cc @dvyukov |
Thanks a lot, rsc :) |
Nice tool.
Moving forward, the Go compiler needs to be taught how to preempt tight
loops, so while valuable, the usefulness of this tool is short-term. The
unfortunate downside is that once the tool is in the source base it would
need to be supported, even though its usefulness will be minimal once the
compiler adds loop preemption.
|
Thanks Richard for your reply :) Honestly, I don't think this tool would become useless even after our compiler adds a mature, default-on (or even always-on) loop preemption option, for several reasons.
Go already has tons of open source projects and libraries that were developed under the assumption "preempt only at function calls". Something may go wrong when the compiler performs the loop preemption transformation on all of them. That is a little dangerous and risky, as many Go applications play a very important role in their systems.
With the assumption "preempt only at function calls", users can run some high-priority batch jobs without being blocked by the scheduler, in a very high-speed and low-latency manner. But with loop preemption on, such jobs become less efficient and their latency gets worse. So the trade-off of the loop preemption overhead should be chosen and decided by the users themselves: have it or not. (As far as I know, about the current implementation of insert_resched_checks: "The simpler the loop, the worse the overhead, that is why this phase is not normally enabled." -- David Chase. It may not be an easy job at all to bring an efficient and complete loop preemption to our users in a short time, one year for instance.)
So above all, I think it would be better to provide a diagnosis tool for users to see the status of their goroutines' CPU time slices and optimize their code themselves if they really want to, no matter whether loop preemption is default-off now or default-on in the future. It is often better to provide more choices to our users than to limit them to a single one. |
I'm not sure I understand your claim, or perhaps I don't understand your definition of "cpu greedy problems". Can you give an example of such a problem that loop preemption doesn't solve? I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements. |
Hi Austin, I think you are actually underestimating the distance and the difficulty of the gap between a 'local optimization' and a 'perfect and optimal solution' to the "cpu greedy problems" in golang. The definition of "cpu greedy problems": a cpu greedy goroutine is a goroutine that occupies a processor for a very long time without yielding (this thread had discussed such influence). The 'cpu greediest goroutine' example may be something like this:

go func() {
	for {}
}()

But the loop is only one big ordinary case of such problems. How about this one:

_adder := func(c *uint32, depth uint32) uint32 {
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	// ....
	/* omitted, because there are */
	/* 100,000 of them :( */
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	return atomic.LoadUint32(c)
}
godeadloop := func() {
	var ct uint32
	for {
		ct++
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		// ....
		/* omitted, because there are */
		/* 100,000 of them :( */
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
	}
}
go godeadloop()

The intact version of the source code is here. In this example there is no call-free loop anyway, right? But it is pretty 'cpu greedy', and this kind of code could starve other goroutines or even kill them (the whole Go process would probably be stuck for several minutes, I think, if a STW happened at the same time). Of course you could solve this particular problem with a new 'insert_more_resched_checks' pass over the Go AST after our discussion, but that is not the point. The point is that the 'loop preempt' solution we currently use for the "cpu greedy problems" has already missed this one corner case by mistake; how could we know there are not more corner cases hidden there that we just haven't found yet? If you really think the 'loop preempt' solution can be the 'perfect and optimal solution', could you write a formal and strict mathematical proof of it? If you couldn't, we can only use it as a suboptimal solution. And if we use a suboptimal solution as if it were a perfect optimal one, terrible things may happen. So, this is why I still insist on adding some diagnosis tool like 'diagreedy' to golang. You could use it as a verification tool or whatever; it just gives more information to our users and lets them know more about what is happening deep inside their Go processes. It is a win-win solution, not only for the users but also for the developers. |
If you have a large computer cluster worth 100,000,000 dollars, then a 2.5% performance improvement could save you 2,500,000 dollars. And 2,500,000 dollars could do many valuable things :D |
For some people who prefer performance, please give them other choices; do not close the door on them. All they need to do is analyze the trace.out that the Go process has dumped on the fly. "Loop preempt" is a good tool; I think a good usage of it is to enable loop preemption on open source projects that we don't trust yet, and disable it on our own reliable code to get better performance and controllability. (Sorry, I have the impression that you may want to do everything for the users, but that would be a kind of over-optimization if you do too much of it :p) |
The compiler has already been taught to identify problematic loops. I'd suggest that a good next step would be to print out that information when requested, perhaps with |
A very good idea, and I agree with you 👍 The more tools and options to diagnose possibly problematic code, the better. (But one problem here is that maybe nearly all 'for loop' uses in the sources will be reported as 'problematic'. I don't know the implementation details of the current loop preemption in the compiler, but optimization is often one of the most difficult parts of compiler science, because the compiler cannot read the minds of programmers. Nearly all optimizations are suboptimal solutions; there is hardly any perfect optimization. This is why I still insist on providing some more external diagnosis tools to help programmers improve their code if they want to.) |
I also thought that some code has been written in Go assembly language, like |
/cc @rsc @aclements @RLH @josharian It has been nearly 10 days since our first discussion. Maybe we should come to a conclusion about it? |
@hnes proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made. |
Thanks Josh for letting me know :)
That is pretty reasonable. |
Certainly loop preemption in Go 1.10 should make this less necessary, but as noted not completely unnecessary. In addition to straight line code there is also the question of large memmoves, which are in non-preemptible assembly, and maybe other operations like those. The question is how to expose this in the tool. "go tool trace" is really about the web view. This new flag turns it into a command-line program with a new output format. Can we instead somehow highlight "greedy" goroutines in the web view? Then no new option is needed. Or maybe it should be a third-party separate tool anyway? @dvyukov, as owner of go tool trace, any thoughts? |
Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request, and for how long). |
But the sampling could be imprecise, and users would still "need some good luck" to find the real top N greedy goroutines. Traversing the whole |
How do you propose to detect non-cooperative goroutines? The fact that a goroutine runs on CPU for a long time does not mean anything; it can still be perfectly cooperative if/when we try to preempt it. You understand that this is the wrong way to write Go code, right?

do_cpu_staff0() // time cost 200us
runtime.Gosched()
do_cpu_staff1() // time cost 200us
runtime.Gosched()
do_cpu_staff2() // time cost 200us
runtime.Gosched()

Goroutines have been preemptive since Go 1.1, and soon they will become even more preemptive. |
You are right and I quite agree with you.
The more goroutines are cpu greedy, the worse the overall latency of the application would be.
Yes, but before you choose to preempt it, you probably have to know where exactly the problem is, and this is why we may still need the diagnosis result of
Have a look at this: although it is somewhat hacky, it still means something. With this piece of code, in Go 1.8.3 the STW gets stuck nearly forever. Preemption is only relative. So, how about we combine both of them?
|
No. What matters is only whether a goroutine is not cooperative.
No. The runtime preempts all currently running goroutines. |
Agreed. Maybe this is a literal ambiguity: the phrase "cpu greedy goroutines" I used here does imply that a goroutine is less cooperative or completely non-cooperative (an example is the go function godeadloop0 defined in this proposal's code example).
A more precise statement would be: "the Go runtime always tries to preempt all currently running goroutines." |
So, how do you propose to detect these 'completely non-cooperative' goroutines? |
In fact I had thought about this before, and here is maybe an appropriate solution:
STW:
timeout := 1ms
max_total_time := 2s
total_time := 0
set try_to_stw_flag
for _,p = range [P0,P1,P2] {
locked,timecost := mutex_lock_timeout(p,timeout)
total_time+=timecost
for locked is nil {
scan [P0,P1,P2]'s currently running user goroutines' info
dump it out // to trace struct || some log fd
if total_time >= max_total_time{
panic("stw take too long")
}
locked,timecost = mutex_lock_timeout(p,timeout)
total_time+=timecost
}
} |
A little correction. STW:
timeout := 1ms
max_total_time := 2s
total_time := 0
// ++ do not allocate the idle Processors to ready goroutines anymore
set try_to_stw_flag
for _,p = range [P0,P1,P2] {
locked,timecost := mutex_lock_timeout(p,timeout)
total_time+=timecost
for locked is nil {
// ++ maybe it is hard to get the details (including func call stack)
// ++ of a running goroutine dynamically?
scan [P0,P1,P2]'s currently running user goroutines' info
dump it out // to trace struct || some log fd
if total_time >= max_total_time{
panic("stw take too long")
}
locked,timecost = mutex_lock_timeout(p,timeout)
total_time+=timecost
}
}
// ++
unset try_to_stw_flag
// stw successfully |
It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request). |
Bravo! And it would be much more convenient to implement our new tool :D |
OK, so it sounds like we should add that trace info to the trace dump file. But I still don't see what we're going to do to expose it, unless the STW is a different clearly marked bar already. What will it look like? A special command-line mode people need to know about is not discoverable enough. |
We currently record mark termination STW, but somehow forgot about sweep termination STW. I've got a CL to add this for computing MMUs, but it would help with this analysis, too. |
Change https://golang.org/cl/55411 mentions this issue: |
OK, so after CL 55411, @aclements, is there anything else for this issue? Should we at least mark it proposal-accepted? |
In terms of infrastructure support, I believe CL 55411 is sufficient.
I certainly hope we can make this analysis simply unnecessary, but it could be useful in the interim and in flushing out bugs. Presentation questions aside, it should be relatively straightforward to implement a robust version on top of the events added by CL 55411. |
Right now we only kind of sort of trace GC STW events. We emit events around mark termination, but those start well after stopping the world and end before starting it again, and we don't emit any events for sweep termination. Fix this by generalizing EvGCScanStart/EvGCScanDone. These were already re-purposed to indicate mark termination (despite the names). This commit renames them to EvGCSTWStart/EvGCSTWDone, adds an argument to indicate the STW reason, and shuffles the runtime to generate them right before stopping the world and right after starting the world, respectively. These events will make it possible to generate precise minimum mutator utilization (MMU) graphs and could be useful in detecting non-preemptible goroutines (e.g., #20792). Change-Id: If95783f370781d8ef66addd94886028103a7c26f Reviewed-on: https://go-review.googlesource.com/55411 Reviewed-by: Rick Hudson <rlh@golang.org>
Marking Proposal-Hold until there's a clearer explanation of what the exposed UI is. |
(Maybe helpful for bypassing issue #10958 in a quite different but probably better and
cleaner way. Coming from this thread; many thanks to the participants of this discussion.)
Hi all :)
I have recently implemented a sub-option in
go tool trace
named "diagreedy", which means "diagnose and find the greediest several goroutines". This tool
has already helped us track down several deeply hidden problems in our Go applications
and achieve more stable and shorter GC pause latency by fixing them afterwards.
Terminology
Big Picture
In this diagram, the value of N is 3 (i.e. GOMAXPROCS), and the top 3 longest on-processor
time slices are t1, t2, t3, corresponding to goroutines 1, 2, and 10.
The max time cost of the pre-GC Stop-the-World is named Tmax_stw and is the
sum of t1, t2 and t3.
Tgcwork is the time cost of the actually effective garbage collecting work after
the Stop-the-World phase. Tgcwork is determined by the amount of allocated objects
and the complexity of the relationships among them.
Tmax_stw and Tgcwork together determine one typical GC pause cost. As
for Tgcwork, we could optimize it by reusing objects (via sync.Pool for example) and
by many other methods, but we do not discuss Tgcwork further here. We concentrate on
the Tmax_stw right now.
We can reach this conclusion plainly:
For some cpu intensive applications, the top N on-processor time slices are often
very big (without very careful task splitting), especially when our project includes
tons of other, heavily used open source projects, which makes it more
complex and harder to trace.
Although we already have some tools to sample the top n cpu usage by function,
the accumulated cpu usage of functions and the longest top n time slices are not
the same thing at all. For example, if we have 1000 different cpu intensive goroutines,
something like:
Our Tmax_stw would become GOMAXPROCS*1 ms. That is terrible, because sometimes
we set GOMAXPROCS to a pretty big value (16 or even 36, for instance).
With the top n cpu usage samples, we might not find the sources of the problem
easily. But with the 'top max N on-processor time slices' approach, we can track down
the sources of the long GC pause immediately and fix them like this:
Then our Tmax_stw would turn from GOMAXPROCS*1 ms to GOMAXPROCS/5 ms afterwards :)
(If your GOMAXPROCS is very big, just split do_cpu_staff even further. You could even
implement some logic that lets do_cpu_staff split its tasks automatically,
based on the value of GOMAXPROCS and another parameter, which could be named time_cost_wanted,
to solve this problem for good.)
revision1
Sorry, I might have made a mistake:
It should probably be:
Although I haven't dived into the Go scheduler's code yet,
I guess the 2nd is most likely the right answer, because the
1st algorithm is just too inefficient and unlikely to be used by the
Go scheduler.
But no matter which one is the right answer, the 1st one or the
2nd, it doesn't change the conclusion made before.
Example
And here is an example of the usage of
go tool trace -diagreedy
Sample code to diagnose:
Diagnosis for the 1st time:
And then, we can find the source of the long GC pause, godeadloop0,
and change the variable granul to the value 10000. Test again:
As you can see, the typical GC pause finally changed from 8ms to 0.4ms. Of course
the situation would be much more complex in a real big project, but the key idea
of the tracking is the same and clear.
I also implemented a 'dump' option, like this below:
I would be very glad to push it to the golang project and benefit more users if this proposal is accepted.
Many thanks for your time and patience.