runtime/pprof: 'Total samples' less than expected #45983

Closed
shawndx opened this issue May 6, 2021 · 8 comments
Labels
FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)

Comments

@shawndx
Contributor

shawndx commented May 6, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.17-1108cbe60b Thu May 6 02:21:55 2021 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
....

What did you do?

On a machine with a 16-core CPU (i7-10700K):

% go test -c runtime

With GOMAXPROCS=4, multiple rounds of benchmarking deliver stable pprof numbers:

% GOMAXPROCS=4 ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:48am (CST)
Duration: 28.53s, Total samples = 71.09s (249.20%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

With GOMAXPROCS=8, multiple rounds of benchmarking deliver stable pprof numbers:

% GOMAXPROCS=8 ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:41am (CST)
Duration: 31.09s, Total samples = 106.05s (341.07%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

With all 16 cores (default GOMAXPROCS), multiple rounds of benchmarking deliver stable pprof numbers:

% ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:47am (CST)
Duration: 37.10s, Total samples = 92.34s (248.92%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

What did you expect to see?

I expected 'Total samples' to be close to GOMAXPROCS * 100%, since the benchmark occupies all available cores.

If my assumption ('Total samples' close to GOMAXPROCS * 100%) is wrong, I'd like to understand the logic behind the gap and the variance.

What did you see instead?

'Total samples' is far less than expected.
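
For concreteness, under the 'GOMAXPROCS * 100%' assumption the three runs above would be expected to show roughly:

GOMAXPROCS=4:  28.53s * 4  = ~114.1s expected vs. 71.09s observed
GOMAXPROCS=8:  31.09s * 8  = ~248.7s expected vs. 106.05s observed
GOMAXPROCS=16: 37.10s * 16 = ~593.6s expected vs. 92.34s observed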

@prattmic prattmic added the NeedsInvestigation label May 6, 2021
@prattmic
Member

prattmic commented May 6, 2021

There may be an obvious explanation for this that I'm forgetting, but this seems off to me. I can reproduce with a simple program:

package main

import (
        "runtime/pprof"
        "os"
        "time"
)

func main() {
        f, err := os.Create("/tmp/cpu.pprof")
        if err != nil {
                panic(err)
        }
        defer f.Close()
        if err := pprof.StartCPUProfile(f); err != nil {
                panic(err)
        }
        defer pprof.StopCPUProfile()

        // Keep 8 CPUs busy spinning for the duration of the profile.
        for i := 0; i < 8; i++ {
                go func() {
                        for {}
                }()
        }

        time.Sleep(10*time.Second)
}
$ sudo perf stat ./spin

 Performance counter stats for './spin':

         80,975.46 msec task-clock                #    7.988 CPUs utilized          
             1,757      context-switches          #    0.022 K/sec                  
                 9      cpu-migrations            #    0.000 K/sec                  
               179      page-faults               #    0.002 K/sec                  
   354,168,070,986      cycles                    #    4.374 GHz                    
   525,195,128,361      instructions              #    1.48  insn per cycle         
   262,544,698,978      branches                  # 3242.275 M/sec                  
         1,091,515      branch-misses             #    0.00% of all branches        

      10.136525690 seconds time elapsed

      80.950154000 seconds user
       0.043994000 seconds sys

$ pprof /tmp/cpu.pprof 
File: spin
Type: cpu
Time: May 6, 2021 at 4:41pm (EDT)
Duration: 10.12s, Total samples = 28.69s (283.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Linux perf's task-clock should be approximately equivalent to our ITIMER_PROF, so I'm not immediately sure where the discrepancy is.

One thing that came to mind is that we may not be keeping up and be dropping samples, but it looks like we add fake _LostExternalCode frames to account for this, which aren't in my profile.
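
As a cross-check, the 'Total samples' figure can be recomputed by summing the sample values straight out of the profile. A rough sketch, assuming the github.com/google/pprof/profile package and the /tmp/cpu.pprof path from the program above:

package main

import (
        "fmt"
        "os"
        "time"

        "github.com/google/pprof/profile"
)

func main() {
        f, err := os.Open("/tmp/cpu.pprof")
        if err != nil {
                panic(err)
        }
        defer f.Close()

        p, err := profile.Parse(f)
        if err != nil {
                panic(err)
        }

        // One running total per sample type; a CPU profile typically has
        // samples/count and cpu/nanoseconds.
        totals := make([]int64, len(p.SampleType))
        for _, s := range p.Sample {
                for i, v := range s.Value {
                        totals[i] += v
                }
        }
        for i, st := range p.SampleType {
                fmt.Printf("%s/%s: %d\n", st.Type, st.Unit, totals[i])
        }
        fmt.Printf("duration: %v\n", time.Duration(p.DurationNanos))
}

The cpu/nanoseconds total divided by the duration should match the percentage pprof prints next to 'Total samples'.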

@shawndx
Contributor Author

shawndx commented May 7, 2021

@prattmic The discrepancy seems to depend on the environment. I happen to have two environments: one shows the discrepancy (it's where the pprof data in the problem description came from), while the other one's pprof data matches what I expect, that is, 'Total samples' roughly equal to "Duration * 100 * CPU usage".

I tried your sample program in the two environments as well.

  1. Environment with the discrepancy
% perf stat ./spin
 Performance counter stats for './spin':

         80,952.08 msec task-clock:u              #    8.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               521      page-faults:u             #    0.006 K/sec
   379,377,742,934      cycles:u                  #    4.686 GHz
   758,643,371,690      instructions:u            #    2.00  insn per cycle
   379,316,691,617      branches:u                # 4685.694 M/sec
            61,415      branch-misses:u           #    0.00% of all branches

      10.118769852 seconds time elapsed

      80.935721000 seconds user
       0.020002000 seconds sys
% go tool pprof /tmp/cpu.pprof
File: spin
Type: cpu
Time: May 7, 2021 at 7:40pm (CST)
Duration: 10.12s, Total samples = 26.38s (260.74%)
Entering interactive mode (type "help" for commands, "o" for options)
  2. Environment without the discrepancy
%  perf stat ./spin

Performance counter stats for './spin':

          80006.12 msec task-clock:u              #    7.845 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               556      page-faults:u             #    0.007 K/sec
                 0      cycles:u                  #    0.000 GHz
      474456245058      instructions:u
      237220207452      branches:u                # 2965.026 M/sec
            180517      branch-misses:u           #    0.00% of all branches

      10.198484751 seconds time elapsed

      79.897995000 seconds user
       0.444254000 seconds sys
% go tool pprof /tmp/cpu.pprof
File: spin
Type: cpu
Time: May 7, 2021 at 7:39pm (CST)
Duration: 10.18s, Total samples = 78.92s (775.12%)
Entering interactive mode (type "help" for commands, "o" for options)

@shawndx
Contributor Author

shawndx commented May 7, 2021

@prattmic

One thing that came to mind is that we may not be keeping up and be dropping samples,
but it looks like we add fake _LostExternalCode frames to account for this, which aren't in my profile.

I added a counter in 'sighandler' to count '_SIGPROF', and it turned out the number always equals 'Total samples' (sample_index=samples) in both environments I mentioned before. A wild guess is that somehow some _SIGPROF signals fail to land.

@prattmic
Member

prattmic commented May 7, 2021

@shawn-xdji what do you mean by 'environment'? Different machines? Or different configurations on the same machine?

I added a counter in 'sighandler' to count '_SIGPROF', and it turned out the number always equals 'Total samples' (sample_index=samples) in both environments I mentioned before. A wild guess is that somehow some _SIGPROF signals fail to land.

I did this too, with the same result. :)

One possibility is that the kernel is dropping some of the signals.

Roughly, the way ITIMER_PROF works is:

  1. Hardware timer interrupt fires.
  2. The interrupt handler increments the itimer counter for the process currently executing on the CPU.
  3. If the itimer overflows, it is time to deliver a SIGPROF.
  4. The signal is sent with preference for the currently executing thread, but is eligible for delivery to any thread in the thread group.
  5. The signal will not be delivered to the current thread if that thread is masking the signal.
  6. If it can't go to the current thread, the kernel will look for any thread in the thread group that can take the signal.
  7. If no threads are found, then the signal will sit in the queue until a thread becomes eligible.
  8. Only one instance of a signal can be in the queue at once, so if there is already a SIGPROF in the queue, additional SIGPROFs will be dropped on the floor.

I believe SIGPROF is masked when calling sigprof, so one possibility is we are spending so much time handling SIGPROFs that the kernel is having to drop lots of additional SIGPROFs. This should be fairly easy for me to test, though it feels a bit far-fetched, as we have a 10ms period, and sigprof should be much faster than that.
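
For scale: with 8 spinning threads and a 10ms period, the process accumulates CPU time at ~8x wall time, so a SIGPROF arrives roughly every 1.25ms of wall time on average; sigprof would have to take on the order of a millisecond per sample for that alone to explain the gap.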

@prattmic
Member

prattmic commented May 7, 2021

Looking at the kernel code a bit more, there is also a similar and more likely chance for the signal to be dropped.

Note that SIGPROFs are queued on the thread-group-wide signal queue (PIDTYPE_TGID): https://elixir.bootlin.com/linux/v5.12.1/source/kernel/signal.c#L1111

This leaves a small window between queuing the signal and the current thread dequeuing it where any additional timer expirations would drop their SIGPROFs. This should still be a pretty small race, but seems a bit more reasonable than the previous case.

@prattmic
Member

prattmic commented May 7, 2021

This is not a Go-specific problem.

From the Go program above, I get ~2800 SIGPROF signals.
From the C program below, I also get ~2800 SIGPROF signals.

#include <pthread.h>                              
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>            
#include <unistd.h>

_Atomic uint64_t sigprof_count = 0;

void handler(int sig, siginfo_t* info, void* priv) {
  sigprof_count++;
}

void* spin(void* priv) {
  while (1) {
  }
  return NULL;
}

int main(int argc, char** argv) {
  struct sigaction sa = {
    .sa_sigaction = handler,
    .sa_flags = SA_SIGINFO,
  };
  int ret = sigaction(SIGPROF, &sa, NULL);
  if (ret < 0) {
    perror("sigaction");
    return 1;
  }

  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPROF);
  ret = sigprocmask(SIG_UNBLOCK, &set, NULL);
  if (ret < 0) {
    perror("sigprocmask");
    return 1;
  }

  // 100 Hz ITIMER_PROF (10ms period), the same rate the Go profiler uses.
  struct itimerval it = {
    .it_interval = {
      .tv_usec = 10000,
    },
    .it_value = {
      .tv_usec = 10000,
    },
  };
  ret = setitimer(ITIMER_PROF, &it, NULL);
  if (ret < 0) {
    perror("setitimer");
    return 1;
  }

  // Spin on 8 threads so the process keeps ~8 CPUs busy.
  for (int i = 0; i < 8; i++) {
    pthread_t id;
    ret = pthread_create(&id, NULL, spin, NULL);
    if (ret < 0) {
      perror("pthread_create");
      return 1;
    }
  }

  sleep(10);

  printf("sigprof_count = %llu\n", sigprof_count);
  return 0;
}
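
To try this, a sketch (assuming the file is saved as spin.c; the output name spin_c is arbitrary):

% gcc -O2 -pthread spin.c -o spin_c
% ./spin_c

With 8 busy threads consuming ~80s of CPU time over the 10-second run, the 10ms ITIMER_PROF would be expected to expire roughly 8000 times, so the ~2800 delivered SIGPROFs above fall well short of that.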

@prattmic
Member

prattmic commented May 7, 2021

I believe this is a duplicate of #35057.

@prattmic prattmic closed this as completed May 7, 2021
@shawndx
Contributor Author

shawndx commented May 8, 2021

@shawn-xdji what do you mean by 'environment'? Different machines? Or different configurations on the same machine?

They are different machines running different OSes (Ubuntu vs. Debian).
Thanks for the details and for pointing me to #35057.

@golang golang locked and limited conversation to collaborators May 8, 2022