Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/trace: visualize time taken by syscall #57307

Closed
aktau opened this issue Dec 14, 2022 · 11 comments
Closed

cmd/trace: visualize time taken by syscall #57307

aktau opened this issue Dec 14, 2022 · 11 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@aktau
Copy link
Contributor

aktau commented Dec 14, 2022

Currently (Go 1.20-dev), syscalls as processed by go tool trace from a runtime/trace profile visualize as points in time, being about 1 pixel wide no matter how long they took (see reproducer below). It would be nicer if they were slices, to visualize how much time is spent in (conceptually blocking) syscalls for a given goroutine.

image

Somewhere in the last release cycle (1.19), I tried making the obvious patch to cmd/trace:

                case trace.EvGoSysCall:
-                       ctx.emitInstant(ev, "syscall", "")
+                       if ev.Link != nil {
+                               ctx.emitSlice(ev, "syscall")
+                       } else {
+                               ctx.emitInstant(ev, "syscall", "")
+                       }

And I remembered it didn't work: most (all?) EvGoSysCall events don't have a link to the finish event. But before filing this proposal I tried patching up cmd/trace from 1.20 and to my surprise it did work. I assume something changed in Go 1.20 and now more (most?) EvGoSysCall events have a link, which is great. Debugging output of my patched go tool trace:

...
2022/12/14 05:42:10 syscall event with link (6/12):
    101597987   GoSysCall       p=1     g=1     off=2876        stkID=22        seq=0
 -> 101613258   GoSysExit       p=1000003       g=1     off=264 stkID=0 seq=0 { g=1 seq=17 ts=0 }
 -> 101613376   GoStart p=0     g=1     off=270 stkID=0 seq=0 { g=1 seq=0 }
 -> 101619219   GoSysBlock      p=0     g=1     off=283 stkID=0 seq=0
2022/12/14 05:42:10 syscall event with link (6/13):
    101618055   GoSysCall       p=0     g=1     off=279 stkID=22        seq=0
 -> 101633491   GoSysExit       p=1000003       g=1     off=297 stkID=0 seq=0 { g=1 seq=19 ts=0 }
 -> 101633634   GoStart p=0     g=1     off=303 stkID=0 seq=0 { g=1 seq=0 }
 -> 101637671   GoSysBlock      p=0     g=1     off=315 stkID=0 seq=0
...

Resulting trace view comparison:

image

Reproducer:

The program that generated this trace was:

package main

import (
        "fmt"
        "os"
        "runtime/trace"
        "time"
)

func main() {
        if err := rmain(); err != nil {
                fmt.Fprintf(os.Stderr, "error: %v", err)
                os.Exit(1)
        }
}

func rmain() error {
        if err := trace.Start(os.Stdout); err != nil {
                return err
        }
        defer trace.Stop()
        f, err := os.Open("/dev/urandom")
        if err != nil {
                return err
        }
        var buf [1024 * 1024 * 1024]byte
        for i, last := 0, time.Now(); i < 5; i++ {
                fmt.Fprintf(os.Stderr, "start...")
                n, err := f.Read(buf[:])
                if err != nil {
                        return err
                }
                if n == 0 {
                        return fmt.Errorf("read: returned empty slice (%d, %v)", n, err)
                }
                now := time.Now()
                fmt.Fprintf(os.Stderr, "took %v\n", now.Sub(last))
                last = now
        }

        return nil
}

Running it:

$ go run longsyscall.go > longsyscall.trace
start...took 2.203378744s
start...took 2.207477581s
start...took 2.211924562s
start...took 2.202174787s
start...took 2.219617454s
@aktau aktau added the Proposal label Dec 14, 2022
@gopherbot gopherbot added this to the Proposal milestone Dec 14, 2022
@dominikh
Copy link
Member

dominikh commented Dec 14, 2022

Here are the notes I took on Go runtime tracing and syscall behavior. At the time I looked at Go 1.19's runtime, but I don't think anything of importance has changed in 1.20. I hope this helps.

When user code invokes a syscall, the runtime emits an EvGoSysCall event. If the syscall returns before sysmon finds
the P that's blocked in the syscall, we'll not see any other events. This is a fast path, avoiding a bunch of
scheduling, allowing the goroutine to resume immediately after the syscall returns, having retained its P and M.

If sysmon does find the P, it retakes the P from the M and the runtime emits GoSysBlock, followed by ProcStop. When a
blocking syscall returns, we get a EvGoSysExit event. The EvGoSysExit event gets emitted right before the G resumes
running. However, to maintain accurate timing, the event carries the original timestamp of when the syscall returned,
and the trace parser uses this to reorder the event. [...]

One implication of sysmon detecting blocked syscalls is that the duration between EvGoSysCall and EvGoSysBlock should
be attributes to the syscall, too. This is the time between starting the syscall and Go figuring out that it's
blocking. [...]

Sysmon runs at most every 20 μs, but it will run considerably less often if it thinks there's nothing to do, up to 10
ms. In testing, even in busy programs (with GOMAXPROCS restricted to low values), sysmon sometimes decides to sleep
for 5 ms. A loop calling syscall.Nanosleep, sleeping 10 ms each time, will have EvGoSysBlock spans that last from ~10
ms to 5 ms, with the other 5 ms being the time between EvGoSysCall and EvGoSysBlock.

Another issue is that when a syscall returns fast enough we'll not be provided with information on how long the
syscall took. In the worst case, a fast syscall could've run for 10 ms, and to us it'll look like 10 ms of user code.

One special case is syscall.RawSyscall, which doesn't emit any tracing events whatsoever, and AFAIU isn't handled by
sysmon. That means these syscalls can block Ps and Ms, but we have no insight into it happening.

@prattmic
Copy link
Member

cc @golang/runtime @mknyszek

@ianlancetaylor
Copy link
Contributor

I'm not sure this has to be a proposal. Are there any API changes here, or cases where this might potentially break an existing program? Thanks.

@aktau
Copy link
Contributor Author

aktau commented Dec 14, 2022

There aren't. But I didn't know which template would be appropriate. I doubted between "proposal" and "bug". While this is actually more of a feature request, between proposal and bug I'd choose the former. Which one should I have chosen?

@ianlancetaylor
Copy link
Contributor

The "Bugs" template I guess. I considered changing it to "Bugs or feature requests", but that would lead too many people away from the "Proposal" template appropriate for API changes. So not sure how to make this better.

@ianlancetaylor ianlancetaylor changed the title proposal: cmd/trace: visualize time taken by syscall cmd/trace: visualize time taken by syscall Dec 14, 2022
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. and removed Proposal labels Dec 14, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog Dec 14, 2022
@aktau
Copy link
Contributor Author

aktau commented Dec 15, 2022

Thanks @dominikh, that's very useful. Your notes explain (I think) exactly what's going on, and why my sample code/trace was not representative of the problems I experienced in production nearly a year ago.

If sysmon does find the P, it retakes the P from the M and the runtime emits GoSysBlock, followed by ProcStop. When a
blocking syscall returns, we get a EvGoSysExit event. The EvGoSysExit event gets emitted right before the G resumes
running. However, to maintain accurate timing, the event carries the original timestamp of when the syscall returned,
and the trace parser uses this to reorder the event. [...]

In the sample trace, I'm artificially issuing long-running read syscalls from /dev/urandom, because I wanted to make the "problem" clear on a human timescale. In the production trace where I couldn't get slices even with the cmd/trace patch above, the situation was different:

  • The syscall used is also read, but from a socket.
  • There is very high concurrency (GOMAXPROCS > 200 on a 100+ core machine AFAIK).
  • The syscalls are much shorter (I don't have exact numbers, as I didn't have text logging and there's no duration in the traces), I estimate it was milliseconds.

All of that likely prevented sysmon from finding those syscalls.

So this issue is not the only one preventing good observability. Ideally Go would:

  1. (this issue) Render syscalls as a slice instead of a point. This improvement can (IMHO) be made already, long-running syscalls will benefit already.
  2. (no issue yet) Syscall duration (EvGoSysExit) even for short syscalls. This would also fix the inability to see how much time is spent in syscalls in the cmd/trace (blocking) syscall profile. Currently it ignores syscalls without a link.
    if ev.Type != trace.EvGoSysCall || ev.Link == nil || ev.StkID == 0 || len(ev.Stk) == 0 {

@aktau
Copy link
Contributor Author

aktau commented Dec 16, 2022

I modified the sample program to have an additional goroutine that makes many short syscalls, alongside the main goroutine which makes long syscalls. As @dominikh said, a lot of them (but not all) don't have a link (they're instants, not slices). Comparing non-patched vs patched cmd/trace shows some issues when this happens:

EDIT: The image below is wrong, this wrongly patched version omits instant (non-linked) syscalls. That's why the trace is much smaller.

image

  • g1 is the main goroutine, it executes longsyscalls() which issues 5 syscalls that take a little more than 2s each.
  • g7 is the shortsyscalls() goroutine. It issues fast open/fnctl/read/close syscalls in quick succession.

The long syscalls (rendered as a slice in the patched version on the right) are displayed on the proc that started executing it (proc 0 on the screenshot). Trace events for the first long syscall:

2022/12/15 14:19:00 syscall event with link (48510/48640):                                                                  
                 106527657 GoSysCall      p=0       g=1              stkID=52         seq=0  
 ->  2.214s     2320112634 GoSysExit      p=1000003 g=1              stkID=0          seq=0   { g=1 seq=10 ts=0 }
 ->  0.000s     2320113061 GoStart        p=2       g=1              stkID=0          seq=0   { g=1 seq=0 }
 ->  0.006s     2326550570 GoSysBlock     p=2       g=1              stkID=0          seq=0       

Between timestamps 106527657 and 2326550570, there were many other syscalls by function shortsyscalls. For example (most syscalls by g 7 only have a GoSysCall event, as per @dominikh's explanation, but there are some that occur at the right time and take sufficiently long to manifest as a slice):

2022/12/15 14:19:00 syscall event with link (93270/93401):                                                                  
                 186017593 GoSysCall      p=0       g=7              stkID=27         seq=0                                 
 ->  0.000s      186021132 GoSysExit      p=1000003 g=7              stkID=0          seq=0   { g=7 seq=270 ts=0 }          
 ->  0.000s      186021298 GoStart        p=1       g=7              stkID=0          seq=0   { g=7 seq=0 }                 
 ->  0.015s      200578297 GoPreempt      p=1       g=7              stkID=54         seq=0                     
 ->  0.000s      200579793 GoStart        p=1       g=7              stkID=0          seq=0   { g=7 seq=0 }       
 ->  0.020s      220699932 GoSysBlock     p=1       g=7              stkID=0          seq=0  

It also started on proc 0. This indicates that the long-running syscall was moved to another proc close to when it started, or perhaps immediately. I suppose that is proc 1000003 mentioned in GoSysExit which may be some virtual ID for the proc attached (?) to the kernel thread servicing the event poller.

Given that non-blocking syscalls (glad I picked those) don't run on a dedicated thread, it's questionable how to display this in the main view. Perhaps in that view we could either generate fake procs for every multiplexed syscall, and display also procs for syscalls that can't be moved to the netpoller thread (so they need their own M). This needs more discussion IMHO.

However, what shouldn't be contentious is presenting slices instead of instants in the goroutine centric view. The patched cmd/trace does well here:

image

On a related note, the. The goroutine analysis page shows a breakdown of goroutine time spent in either: Execution, Network wait, Sync block, Blocking syscall, Scheduler wait, GC sweeping, GC pause. The longsyscalls loop spends nearly all of its time in blocking syscall:

image

We know that shortsyscalls also spends most of its time in syscalls. But the goroutine analysis summary shows that it's overwhelmingly in the executing state:

image

I'd have expected at least 20% of the time to be in syscalls (I've thought about measuring this on the Go side but calling time.Now() so often might disturb things). This further calls out that it would be nice if every syscall had a time measurement.

New program:

func main() {
        if err := rmain(); err != nil {
                fmt.Fprintf(os.Stderr, "error: %v", err)
                os.Exit(1)
        }
}

func rmain() error {
        if err := trace.Start(os.Stdout); err != nil {
                return err
        }
        defer trace.Stop()
        go shortsyscalls()
        return longsyscalls()
}

func longsyscalls() error {
        f, err := os.Open("/dev/urandom")
        if err != nil {
                return err
        }
        var buf [1024 * 1024 * 1024]byte
        for i, last := 0, time.Now(); i < 5; i++ {
                fmt.Fprintf(os.Stderr, "%d: read(%q) start\n", i, f.Name())
                n, err := f.Read(buf[:])
                if err != nil || n == 0 {
                        panic("PANIC!")
                }
                now := time.Now()
                fmt.Fprintf(os.Stderr, "%d: read(%q) end (%v)\n", i, f.Name(), now.Sub(last))
                last = now
        }
        return nil
}

// Never stops.
func shortsyscalls() {
        var buf [1024 * 1024 * 10]byte
        for i, last := 0, time.Now(); i < math.MaxInt; i++ {
                f, err := os.Open("./longsyscall.go") // After the first read, this should be very fast.
                if err != nil {
                        panic(err)
                }
                if i%1000 == 0 {
                        fmt.Fprintf(os.Stderr, "%d: read(%q) start\n", i, f.Name())
                }
                n, err := f.Read(buf[:])
                if err != nil || n == 0 {
                        panic("PANIC!")
                }
                now := time.Now()
                if i%1000 == 0 {
                        fmt.Fprintf(os.Stderr, "%d: read(%q) end (%v)\n", i, f.Name(), now.Sub(last))
                }
                last = now
                f.Close()
        }
}

@dominikh
Copy link
Member

dominikh commented Dec 16, 2022

@aktau can you explain the format of your debug output? That is, what exactly does

2022/12/15 14:19:00 syscall event with link (48510/48640):                                                                  
                 106527657 GoSysCall      p=0       g=1              stkID=52         seq=0  
 ->  2.214s     2320112634 GoSysExit      p=1000003 g=1              stkID=0          seq=0   { g=1 seq=10 ts=0 }
 ->  0.000s     2320113061 GoStart        p=2       g=1              stkID=0          seq=0   { g=1 seq=0 }
 ->  0.006s     2326550570 GoSysBlock     p=2       g=1              stkID=0          seq=0       

represent? With regard to the multiple lines and their order.

@dominikh
Copy link
Member

It also started on proc 0. This indicates that the long-running syscall was moved to another proc close to when it started, or perhaps immediately.

There's no moving between Ps here. The goroutine was simply running on p=0 all along, until it eventually invoked a syscall. (The corresponding event indicating this would be the last GoStart before the GoSysCall)

I suppose that is proc 1000003 mentioned in GoSysExit which may be some virtual ID for the proc attached (?) to the kernel thread servicing the event poller.

1000003 is a synthetic ID that gets used for "returns from syscalls". There's no reason like a kernel thread or event poller for this, just the fact that when the syscall returns, there is no P associated with that event, but we wanted some number. Other such IDs are

const (
        // Special P identifiers:
        FakeP    = 1000000 + iota
        TimerP   // depicts timer unblocks
        NetpollP // depicts network unblocks
        SyscallP // depicts returns from syscalls
        GCP      // depicts GC state
        ProfileP // depicts recording of CPU profile samples
)

@aktau
Copy link
Contributor Author

aktau commented Dec 16, 2022

@aktau can you explain the format of your debug output? That is, what exactly does

For every GoSysCall event in the trace, print metadata about it (first line, without the ->) and all of its linked events (iterating over the links). Perhaps easier to show the code I added to cmd/trace to produce this (see bottom of reply).

There's no moving between Ps here. The goroutine was simply running on p=0 all along, until it eventually invoked a syscall. (The corresponding event indicating this would be the last GoStart before the GoSysCall)

I think I've just been misusing terminology here. The main trace view is a proc-centric view, and I wanted to express that the syscall was (apart from the very beginning perhaps) not executing on the proc (0) anymore. When I read "proc" in that view my mind just thinks "thread". Stated differently: some other goroutine (like g7) could occupy proc 0 when the main goroutine (g1) was in the read syscall.

1000003 is a synthetic ID that gets used for "returns from syscalls". There's no reason like a kernel thread or event poller for this, just the fact that when the syscall returns, there is no P associated with that event, but we wanted some number. Other such IDs are

I was wondering why SyscallP was used and not NetpollP given that a cursory look at newFile from os/file_unix.go shows that under normal circumstances files resulting from Open are added to the netpoller. But thinking about it, a regular file is always ready to read and a read is always synchronous (even if the fd is O_NONBLOCK), so it has to be SyscallP.

Debug log code:

+       var sysCalls, sysCallsNoLink int
        for _, ev := range ctx.parsed.Events {
                // Handle state transitions before we filter out events.
                switch ev.Type {
@@ -817,7 +819,29 @@ func generateTrace(params *traceParams, consumer traceConsumer) error {
                case trace.EvGoUnblock:
                        ctx.emitArrow(ev, "unblock")
                case trace.EvGoSysCall:
-                       ctx.emitInstant(ev, "syscall", "")
+                       sysCalls++
+                       if ev.Link != nil {
+                               ctx.emitSlice(ev, "syscall")
+                               var sb strings.Builder
+                               fmt.Fprintf(&sb, "syscall event with link (%d/%d):\n            %+v", sysCallsNoLink, sysCalls, ev)
+                               for prev, itev := ev, ev.Link; itev != nil; prev, itev = itev, itev.Link {
+                                       sb.WriteString("\n -> ")
+                                       fmt.Fprintf(&sb, " %2.3fs ", time.Duration(itev.Ts-prev.Ts).Seconds())
+                                       sb.WriteString(itev.String())
+                               }
+                               log.Println(sb.String())
+                       } else {
+                               sysCallsNoLink++
+                               if (sysCallsNoLink % 100000) == 0 {
+                                       log.Printf("sampled syscall event without link (%d/%d):\n    %+v", sysCallsNoLink, sysCalls, ev)
+                               }
+                               ctx.emitInstant(ev, "syscall", "")
+                       }
                case trace.EvGoSysExit:
                        ctx.emitArrow(ev, "sysexit")
                case trace.EvUserLog:

diff --git a/internal/trace/parser.go b/internal/trace/parser.go
index b091a85..a17db4b 100644
--- a/internal/trace/parser.go
+++ b/internal/trace/parser.go
@@ -974,13 +974,29 @@ func PrintEvent(ev *Event) {
 func (ev *Event) String() string {
        desc := EventDescriptions[ev.Type]
        w := new(strings.Builder)
-       fmt.Fprintf(w, "%v %v p=%v g=%v off=%v", ev.Ts, desc.Name, ev.P, ev.G, ev.Off)
+       fmt.Fprintf(w, "%14d %-14s p=%-7d g=%-14d stkID=%-10d seq=%-3d", ev.Ts, desc.Name, ev.P, ev.G, ev.StkID, ev.seq)
+       if len(desc.Args) > 0 {
+               w.WriteString(" {")
+       }
        for i, a := range desc.Args {
                fmt.Fprintf(w, " %v=%v", a, ev.Args[i])
        }
+       if len(desc.Args) > 0 {
+               w.WriteString(" }")
+       }
+       if len(desc.SArgs) > 0 {
+               w.WriteString(" {")
+       }
        for i, a := range desc.SArgs {
                fmt.Fprintf(w, " %v=%v", a, ev.SArgs[i])
        }
+       if len(desc.SArgs) > 0 {
+               w.WriteString(" }")
+       }
        return w.String()
 }

@mknyszek
Copy link
Contributor

This issue is fixed in the new tracer. All system calls for Go 1.22 traces and later have a duration in the trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

7 participants