runtime/pprof: 'Total samples' less than expected #45983

Closed
shawndx opened this issue May 6, 2021 · 8 comments
Labels
FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)

Comments

@shawndx
Contributor

shawndx commented May 6, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.17-1108cbe60b Thu May 6 02:21:55 2021 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
....

What did you do?

On a machine with a 16-core CPU (i7-10700K):

% go test -c runtime

With GOMAXPROCS=4, multiple rounds of benchmarking deliver stable pprof numbers:

% GOMAXPROCS=4 ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:48am (CST)
Duration: 28.53s, Total samples = 71.09s (249.20%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

With GOMAXPROCS=8, multiple rounds of benchmarking deliver stable pprof numbers:

% GOMAXPROCS=8 ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:41am (CST)
Duration: 31.09s, Total samples = 106.05s (341.07%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

With all 16 cores (default GOMAXPROCS), multiple rounds of benchmarking deliver stable pprof numbers:

% ./runtime.test -test.bench=ChanContended -test.run=none -test.benchtime=20s -test.cpuprofile=/tmp/p
% go tool pprof /tmp/p

File: runtime.test
Type: cpu
Time: May 6, 2021 at 10:47am (CST)
Duration: 37.10s, Total samples = 92.34s (248.92%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

What did you expect to see?

I expected 'Total samples' to be close to GOMAXPROCS * 100%, since the benchmark occupies all available cores.

If my assumption ('Total samples' close to GOMAXPROCS * 100%) is wrong, I'd like to understand the logic behind the gap and the variance.

What did you see instead?

'Total samples' is far less than expected.
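
For concreteness, under the 'GOMAXPROCS * 100%' assumption the three runs above would be expected to show roughly:

GOMAXPROCS=4:  28.53s * 4  = ~114.1s expected vs. 71.09s observed
GOMAXPROCS=8:  31.09s * 8  = ~248.7s expected vs. 106.05s observed
GOMAXPROCS=16: 37.10s * 16 = ~593.6s expected vs. 92.34s observed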

@prattmic prattmic added the NeedsInvestigation label May 6, 2021
@prattmic
Member

prattmic commented May 6, 2021

There may be an obvious explanation for this that I'm forgetting, but this seems off to me. I can reproduce with a simple program:

package main

import (
        "runtime/pprof"
        "os"
        "time"
)

func main() {
        f, err := os.Create("/tmp/cpu.pprof")
        if err != nil {
                panic(err)
        }
        defer f.Close()
        if err := pprof.StartCPUProfile(f); err != nil {
                panic(err)
        }
        defer pprof.StopCPUProfile()

        // Keep 8 CPUs busy spinning for the duration of the profile.
        for i := 0; i < 8; i++ {
                go func() {
                        for {}
                }()
        }

        time.Sleep(10*time.Second)
}
$ sudo perf stat ./spin

 Performance counter stats for './spin':

         80,975.46 msec task-clock                #    7.988 CPUs utilized          
             1,757      context-switches          #    0.022 K/sec                  
                 9      cpu-migrations            #    0.000 K/sec                  
               179      page-faults               #    0.002 K/sec                  
   354,168,070,986      cycles                    #    4.374 GHz                    
   525,195,128,361      instructions              #    1.48  insn per cycle         
   262,544,698,978      branches                  # 3242.275 M/sec                  
         1,091,515      branch-misses             #    0.00% of all branches        

      10.136525690 seconds time elapsed

      80.950154000 seconds user
       0.043994000 seconds sys

$ pprof /tmp/cpu.pprof 
File: spin
Type: cpu
Time: May 6, 2021 at 4:41pm (EDT)
Duration: 10.12s, Total samples = 28.69s (283.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Linux perf's task-clock should be approximately equivalent to our ITIMER_PROF, so I'm not immediately sure where the discrepancy is.

One thing that came to mind is that we may not be keeping up and be dropping samples, but it looks like we add fake _LostExternalCode frames to account for this, which aren't in my profile.
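
As a cross-check, the 'Total samples' figure can be recomputed by summing the sample values straight out of the profile. A rough sketch, assuming the github.com/google/pprof/profile package and the /tmp/cpu.pprof path from the program above:

package main

import (
        "fmt"
        "os"
        "time"

        "github.com/google/pprof/profile"
)

func main() {
        f, err := os.Open("/tmp/cpu.pprof")
        if err != nil {
                panic(err)
        }
        defer f.Close()

        p, err := profile.Parse(f)
        if err != nil {
                panic(err)
        }

        // One running total per sample type; a CPU profile typically has
        // samples/count and cpu/nanoseconds.
        totals := make([]int64, len(p.SampleType))
        for _, s := range p.Sample {
                for i, v := range s.Value {
                        totals[i] += v
                }
        }
        for i, st := range p.SampleType {
                fmt.Printf("%s/%s: %d\n", st.Type, st.Unit, totals[i])
        }
        fmt.Printf("duration: %v\n", time.Duration(p.DurationNanos))
}

The cpu/nanoseconds total divided by the duration should match the percentage pprof prints next to 'Total samples'.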

@shawndx
Contributor Author

shawndx commented May 7, 2021

@prattmic The discrepancy seems to depend on the environment. I happen to have two environments: one shows the discrepancy (it's where the pprof data in the problem description came from), while the other one's pprof data matches what I expect, that is, 'Total samples' roughly equal to "Duration * 100 * CPU usage".

I tried your sample program in the two environments as well.

  1. Environment with the discrepancy
% perf stat ./spin
 Performance counter stats for './spin':

         80,952.08 msec task-clock:u              #    8.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               521      page-faults:u             #    0.006 K/sec
   379,377,742,934      cycles:u                  #    4.686 GHz
   758,643,371,690      instructions:u            #    2.00  insn per cycle
   379,316,691,617      branches:u                # 4685.694 M/sec
            61,415      branch-misses:u           #    0.00% of all branches

      10.118769852 seconds time elapsed

      80.935721000 seconds user
       0.020002000 seconds sys
% go tool pprof /tmp/cpu.pprof
File: spin
Type: cpu
Time: May 7, 2021 at 7:40pm (CST)
Duration: 10.12s, Total samples = 26.38s (260.74%)
Entering interactive mode (type "help" for commands, "o" for options)
  2. Environment without the discrepancy
%  perf stat ./spin

Performance counter stats for './spin':

          80006.12 msec task-clock:u              #    7.845 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               556      page-faults:u             #    0.007 K/sec
                 0      cycles:u                  #    0.000 GHz
      474456245058      instructions:u
      237220207452      branches:u                # 2965.026 M/sec
            180517      branch-misses:u           #    0.00% of all branches

      10.198484751 seconds time elapsed

      79.897995000 seconds user
       0.444254000 seconds sys
% go tool pprof /tmp/cpu.pprof
File: spin
Type: cpu
Time: May 7, 2021 at 7:39pm (CST)
Duration: 10.18s, Total samples = 78.92s (775.12%)
Entering interactive mode (type "help" for commands, "o" for options)

@shawndx
Contributor Author

shawndx commented May 7, 2021

@prattmic

One thing that came to mind is that we may not be keeping up and be dropping samples,
but it looks like we add fake _LostExternalCode frames to account for this, which aren't in my profile.

I added a counter in 'sighandler' to count '_SIGPROF', and it turned out the number always equals 'Total samples' (sample_index=samples) in both environments I mentioned before. A wild guess is that somehow some _SIGPROF signals fail to land.

@prattmic
Member

prattmic commented May 7, 2021

@shawn-xdji what do you mean by 'environment'? Different machines? Or different configurations on the same machine?

I added a counter in 'sighandler' to count '_SIGPROF', and it turned out the number always equals 'Total samples' (sample_index=samples) in both environments I mentioned before. A wild guess is that somehow some _SIGPROF signals fail to land.

I did this too, with the same result. :)

One possibility is that the kernel is dropping some of the signals.

Roughly, the way ITIMER_PROF works is:

  1. Hardware timer interrupt fires.
  2. The interrupt handler increments the itimer counter for the process currently executing on the CPU.
  3. If the itimer overflows, it is time to deliver a SIGPROF.
  4. The signal is sent with preference for the currently executing thread, but is eligible for delivery to any thread in the thread group.
  5. The signal will not be delivered to the current thread if that thread is masking the signal.
  6. If it can't go to the current thread, the kernel will look for any thread in the thread group that can take the signal.
  7. If no threads are found, then the signal will sit in the queue until a thread becomes eligible.
  8. Only one instance of a signal can be in the queue at once, so if there is already a SIGPROF in the queue, additional SIGPROFs will be dropped on the floor.

I believe SIGPROF is masked when calling sigprof, so one possibility is we are spending so much time handling SIGPROFs that the kernel is having to drop lots of additional SIGPROFs. This should be fairly easy for me to test, though it feels a bit far-fetched, as we have a 10ms period, and sigprof should be much faster than that.
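
For scale: with 8 spinning threads and a 10ms period, the process accumulates CPU time at ~8x wall time, so a SIGPROF arrives roughly every 1.25ms of wall time on average; sigprof would have to take on the order of a millisecond per sample for that alone to explain the gap.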

@prattmic
Member

prattmic commented May 7, 2021

Looking at the kernel code a bit more, there is also a similar and more likely chance for the signal to be dropped.

Note that SIGPROFs are queued on the thread-group-wide signal queue (PIDTYPE_TGID): https://elixir.bootlin.com/linux/v5.12.1/source/kernel/signal.c#L1111

This leaves a small window between queuing the signal and the current thread dequeuing it where any additional timer expirations would drop their SIGPROFs. This should still be a pretty small race, but seems a bit more reasonable than the previous case.

@prattmic
Member

prattmic commented May 7, 2021

This is not a Go-specific problem.

From the Go program above, I get ~2800 SIGPROF signals.
From the C program below, I also get ~2800 SIGPROF signals.

#include <pthread.h>                              
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>            
#include <unistd.h>

_Atomic uint64_t sigprof_count = 0;

void handler(int sig, siginfo_t* info, void* priv) {
  sigprof_count++;
}

void* spin(void* priv) {
  while (1) {
  }
  return NULL;
}

int main(int argc, char** argv) {
  struct sigaction sa = {
    .sa_sigaction = handler,
    .sa_flags = SA_SIGINFO,
  };
  int ret = sigaction(SIGPROF, &sa, NULL);
  if (ret < 0) {
    perror("sigaction");
    return 1;
  }

  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPROF);
  ret = sigprocmask(SIG_UNBLOCK, &set, NULL);
  if (ret < 0) {
    perror("sigprocmask");
    return 1;
  }

  // 100 Hz ITIMER_PROF (10ms period), the same rate the Go profiler uses.
  struct itimerval it = {
    .it_interval = {
      .tv_usec = 10000,
    },
    .it_value = {
      .tv_usec = 10000,
    },
  };
  ret = setitimer(ITIMER_PROF, &it, NULL);
  if (ret < 0) {
    perror("setitimer");
    return 1;
  }

  // Spin on 8 threads so the process keeps ~8 CPUs busy.
  for (int i = 0; i < 8; i++) {
    pthread_t id;
    ret = pthread_create(&id, NULL, spin, NULL);
    if (ret < 0) {
      perror("pthread_create");
      return 1;
    }
  }

  sleep(10);

  printf("sigprof_count = %llu\n", sigprof_count);
  return 0;
}
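
To try this, a sketch (assuming the file is saved as spin.c; the output name spin_c is arbitrary):

% gcc -O2 -pthread spin.c -o spin_c
% ./spin_c

With 8 busy threads consuming ~80s of CPU time over the 10-second run, the 10ms ITIMER_PROF would be expected to expire roughly 8000 times, so the ~2800 delivered SIGPROFs above fall well short of that.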

@prattmic
Member

prattmic commented May 7, 2021

I believe this is a duplicate of #35057.

@prattmic prattmic closed this as completed May 7, 2021
@shawndx
Contributor Author

shawndx commented May 8, 2021

@shawn-xdji what do you mean by 'environment'? Different machines? Or different configurations on the same machine?

They are different machines running different OSes (Ubuntu vs. Debian).
Thanks for the details and for pointing me to #35057.

@golang golang locked and limited conversation to collaborators May 8, 2022