proposal: runtime/trace: improve Go diagnostics / tracing #29104

Open
robaho opened this issue Dec 5, 2018 · 6 comments

robaho commented Dec 5, 2018

Overview

Go offers extensive diagnostic capabilities out of the box, but they are of limited usefulness in a larger, highly concurrent, real-world application using 100+ goroutines.

For the purposes of this proposal, I am using a real-world application. This application uses gRPC and has a single RPC stream active. It also supports the FIX financial protocol via standard TCP sockets (no active connections). It emits multicast packets as well. It has a web interface and also a WebSocket interface (both with no active connections). So there are a lot of network endpoints in use. Although it is a real-world application, it is very small compared to most enterprise applications.

The following screenshot shows a profile and the OS CPU utilization - notice the very heavy system CPU usage:

[screenshot: pprof profile alongside OS CPU utilization]

A tree sample provides a bit more context:

      10ms   runtime.libcCall
             runtime.pthread_mutex_lock
             runtime.semawakeup
             runtime.notewakeup
             runtime.startm
             runtime.wakep
             runtime.ready
             runtime.goready.func1
             runtime.systemstack
             runtime.goready
             runtime.send
             runtime.chansend
             runtime.selectnbsend
             google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut
             google.golang.org/grpc/internal/transport.(*controlBuffer).put
             google.golang.org/grpc/internal/transport.(*http2Server).handleData
             google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams
             google.golang.org/grpc.(*Server).serveStreams
             google.golang.org/grpc.(*Server).handleRawConn.func1

Here is the OS total CPU % breakdown:

[screenshot: OS total CPU % breakdown]

Here is the 'web' view for the profile:

[image: pprof 'web' graph view of the profile (profile001)]
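
For reference, a CPU profile like the one above is typically collected via net/http/pprof. The issue does not say exactly how this profile was captured, so the following is only a sketch of the usual setup (the port is the conventional default, not something stated here):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on a side port; a CPU profile can then
	// be fetched with: go tool pprof http://localhost:6060/debug/pprof/profile
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...
	select {}
}
```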

Here is a GODEBUG sampling with scheddetail=1 (together with schedtrace) for the same process:

SCHED 87116ms: gomaxprocs=4 idleprocs=1 threads=14 spinningthreads=1 idlethreads=4 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=0 schedtick=580546 syscalltick=798347 m=-1 runqsize=0 gfreecnt=0
  P1: status=2 schedtick=570346 syscalltick=805620 m=-1 runqsize=0 gfreecnt=0
  P2: status=1 schedtick=571461 syscalltick=801749 m=4 runqsize=0 gfreecnt=0
  P3: status=1 schedtick=567930 syscalltick=814616 m=3 runqsize=0 gfreecnt=0
  M13: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M12: p=-1 curg=50 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M11: p=-1 curg=19 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M10: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M9: p=1 curg=16 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M8: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M7: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M6: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M5: p=-1 curg=34 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M4: p=2 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M3: p=3 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=true blocked=false lockedg=-1
  M2: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M1: p=-1 curg=17 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=17
  M0: p=0 curg=1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  G1: status=3(chan receive) m=0 lockedm=-1
  G17: status=6() m=1 lockedm=1
  G2: status=4(force gc (idle)) m=-1 lockedm=-1
  G18: status=4(GC sweep wait) m=-1 lockedm=-1
  G3: status=4(finalizer wait) m=-1 lockedm=-1
  G5: status=4(chan receive) m=-1 lockedm=-1
  G6: status=4(IO wait) m=-1 lockedm=-1
  G7: status=4(select) m=-1 lockedm=-1
  G8: status=4(select) m=-1 lockedm=-1
  G9: status=4(select) m=-1 lockedm=-1
  G10: status=4(select) m=-1 lockedm=-1
  G11: status=4(select) m=-1 lockedm=-1
  G12: status=4(select) m=-1 lockedm=-1

... 20 entries exactly like the above, except for the goroutine number ...

  G28: status=4(GC worker (idle)) m=-1 lockedm=-1
  G71: status=4(IO wait) m=-1 lockedm=-1
  G72: status=4(IO wait) m=-1 lockedm=-1
  G73: status=4(sleep) m=-1 lockedm=-1
  G74: status=4(timer goroutine (idle)) m=-1 lockedm=-1
  G41: status=4(select) m=-1 lockedm=-1
  G42: status=4(IO wait) m=-1 lockedm=-1
  G75: status=4(GC worker (idle)) m=-1 lockedm=-1
  G29: status=4(GC worker (idle)) m=-1 lockedm=-1
  G44: status=4(GC worker (idle)) m=-1 lockedm=-1

When using the trace, the flow is a bit easier to spot, as the 'goroutine analysis' view shows:

[screenshot: trace 'goroutine analysis' view]

Going into the detail for the startMarketData function (exhibit A), we see:

[screenshot: goroutine detail for startMarketData]

and examining the sync block time, we see:

[screenshot: sync block time detail]

but examining the scheduler wait time, we see:

[screenshot: scheduler wait time detail]
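
For context, an execution trace like the one analyzed above is captured with the runtime/trace package and inspected with `go tool trace`. A minimal sketch of the capture side (the output file name is arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	// Write the execution trace to a file for later analysis with
	// `go tool trace trace.out`.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload to be traced ...
}
```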

Dilemma

Although the profile clearly shows lots of OS activity, and the OS monitor shows heavy CPU utilization, there is no easy way to analyze what is causing the heavy system CPU %. The biggest problem is that the profile is NOT partitioned by goroutine. Using the trace tool, things are better, but there is no aggregate view. Ideally the detail view (exhibit A above) would be shown on the 'analysis page' for all goroutines; currently the only way to get this information is to inspect each routine individually. Something like this (filed as issue #29103):

[screenshot: proposed aggregate goroutine view, filed as #29103]

More importantly, there needs to be a summary analysis of the "why", rather than having to review a graph. For example, if I open the detail for the startMarketData routine above, it should show something like (sample only):

startmarketdata.func1
   network wait time 102 ms:
      5% send on socket
      70% poll
      25% read on socket
   scheduler wait 1 us
      57% wait for GC cycle
      43% wait for CPU
   sync time 50 ms
      67% wait on channel
               100% sent by go routine xxxx
      23% wait on mutex filename:line#
   etc...

The user can also use the trace viewer, which is 100% manual; it is very powerful, but it is difficult to see the forest for the trees.

Specific Proposals

  1. The ability to add a "name" to a goroutine, something like:
      go "myname" func(){...}
      go stringexp func(){...}

Working with the current naming is more difficult than it needs to be: you need to be familiar with the codebases of all of the libraries involved to have even a basic understanding of what is going on. The name would allow the library creator to provide significant information about a goroutine's purpose. It should be appended to the existing "name" in the diagnostic tools. (A rough approximation is possible today with pprof labels; see the sketch after this list.)

  2. In lieu of 1, at a minimum the goroutine name as provided now should be changed to include the source file and line number. Often a single function spawns multiple goroutines, and the idiomatic way is anonymous functions, so you end up with func1, func2, etc. The source file and line would make tracking these down much easier. This would also apply if 1 is adopted but the program does not provide a name.

  3. The trace output should have an option for per-goroutine stack sampling, which would place a 'stack trace event' in the trace file at each sampling interval. It may be possible to correlate a profile capture with a trace capture based on timestamps, but having everything in a single capture file simplifies things considerably.

  4. There should be trace reporting by goroutine instance. Currently all of the routines of a single type are lumped together in the trace, which does not allow easy detection of processing bottlenecks. For example, if 4 routines consume 15%, is that 1 routine consuming 15%, or 3 consuming 1% and 1 consuming 14%?

  5. Most importantly, there is no public API for reading the trace file to produce custom analysis reports, even though there is an API for adding 'user events' to the trace. The internal/trace package should be made public. Currently the information might be obtained as JSON from the 'go tool trace' web server, but this is undocumented and probably too inefficient for certain analysis needs. The current workaround involves copying and pasting the package.
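
Regarding item 1: a rough approximation of goroutine "naming" is already possible with pprof labels, although it is opt-in at the call site rather than part of the `go` statement, and it affects profiles rather than the trace's goroutine names. A minimal sketch, using a hypothetical `serveMarketData` worker for illustration:

```go
package main

import (
	"context"
	"runtime/pprof"
)

// serveMarketData is a hypothetical worker, used here only for illustration.
func serveMarketData(ctx context.Context) {
	// ... long-running work ...
}

func main() {
	ctx := context.Background()

	// pprof.Do attaches the labels to the current goroutine while f runs,
	// and goroutines started inside f inherit them, so CPU profile samples
	// can be grouped or filtered by the "worker" label.
	go pprof.Do(ctx, pprof.Labels("worker", "marketdata"), func(ctx context.Context) {
		serveMarketData(ctx)
	})

	select {}
}
```

This does not change the name shown in the trace's goroutine analysis, which is what the proposal asks for, but it does let profile samples be partitioned by logical role.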

gopherbot added this to the Proposal milestone Dec 5, 2018

robaho commented Dec 5, 2018

This may need to be separated into multiple issues, but I didn't see an easy way to group them, other than just referencing the "parent" issue in the text?


odeke-em commented Dec 5, 2018

Thank you for this report/issue @robaho!

Kindly paging @hyangah @rakyll @dvyukov @aclements

odeke-em changed the title from "proposal: improve Go diagnostics / tracing" to "proposal: runtime/trace: improve Go diagnostics / tracing" Dec 5, 2018

robaho commented Dec 5, 2018

Issue 4 is incorrect. When you trace into a goroutine there are entries for each routine (graphs, etc.).

The problem is that things are too low-level. It could be helped by libraries and applications putting user events everywhere, but even then the analysis is manual.
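
For concreteness, the user events mentioned here are the runtime/trace annotation API (tasks, regions, and log messages) added in Go 1.11. A minimal sketch, with the task, region, and category names being arbitrary examples:

```go
package main

import (
	"context"
	"runtime/trace"
)

// handleOrder is a hypothetical request handler used only for illustration.
func handleOrder(ctx context.Context, id string) {
	// A task groups related trace events, possibly across goroutines.
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	// A free-form user event, attached to the task in the trace.
	trace.Log(ctx, "orderID", id)

	// Regions mark timed spans of work within a single goroutine.
	trace.WithRegion(ctx, "validate", func() {
		// ... validation work ...
	})
	trace.WithRegion(ctx, "publish", func() {
		// ... publish work ...
	})
}

func main() {
	handleOrder(context.Background(), "ORD-1")
}
```

These annotations show up in the 'user-defined tasks' and 'user-defined regions' pages of `go tool trace`, but as noted above, interpreting them is still a manual exercise.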

I am interested in developing something more macro that understands the event dependencies and provides an easier-to-understand performance profile.


rsc commented Dec 12, 2018

The tracer is a work in progress. Progress has been slow recently. It's not clear when we will get to runtime/trace improvements.

We do already record the place where a goroutine was created (it shows up in panic crash output, for example). If that can be added to the trace format easily, that seems worthwhile. The larger issues around good summaries and digest presentations are difficult.


robaho commented Dec 12, 2018

@rsc if the trace parsing facilities were moved out of an internal package and made public, then users could develop more advanced analysis tools. Right now you need to copy and paste the code, which is brittle as things advance. At a minimum this could be addressed pretty easily. This is especially important for the user events (annotations), since custom analysis is almost certainly required.


robaho commented Dec 12, 2018

@rsc Btw, I already added some of the aggregate work in #29103, and going through the code I can think of lots of very useful presentations for this data, even a live streaming event processor for analysis. The tracing system is very well designed.
