runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057

Closed
rhysh opened this issue Oct 21, 2019 · 43 comments
Labels: FrozenDueToAge, NeedsInvestigation


@rhysh
Contributor

rhysh commented Oct 21, 2019

What version of Go are you using (go version)?

$ go1.13 version
go version go1.13.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.

Am I holding it wrong?

Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?

The tests in runtime/pprof both now and around Go 1.6 seem to compare parts of the profile to itself, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.

What operating system and processor architecture are you using (go env)?

I'm compiling on darwin/amd64 and running on linux/amd64.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/usr/local/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/49/zmds5zsn75z1283vtzxyfr5hj7yjq4/T/go-build343144681=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I looked at a CPU profile for a Go program running on a Linux machine where top reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.

I ran a test with the -test.cpuprofile flag on a Linux machine and compared the results of the time shell built-in with go tool pprof's view of the usage. I varied the rate that the program asked the kernel to deliver SIGPROF and found that the two measurements agreed on the number of CPU cycles spent as long as there were fewer than 250 SIGPROF deliveries per second.

I ran the test under perf stat -e 'signal:*' and found that its count of signal:signal_generate events lined up with the number of SIGPROF deliveries I'd expect, that its count of signal:signal_deliver events lined up with the number of samples in the CPU profile, and that the two matched well only when the "generate" rate was less than 250 samples per second.
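(For reference, the mechanism at play here is the process-wide profiling timer that the runtime arms on Linux. A rough C sketch of the equivalent request, written from the setitimer(2) man page rather than taken from the runtime's code:)

#include <sys/time.h>

/* Rough equivalent of the runtime's profiling-timer request on Linux:
 * a single process-wide ITIMER_PROF timer that asks the kernel to send
 * SIGPROF for every 1/hz seconds of CPU time the whole process consumes. */
static void arm_process_profiling_timer(int hz) {
	struct itimerval it;
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = 1000000 / hz; /* 10000us at the default 100 Hz */
	it.it_value = it.it_interval;
	setitimer(ITIMER_PROF, &it, NULL);     /* one timer for the whole process */
}

Because that timer (and the SIGPROF it raises) is shared by every thread in the process, the per-process delivery rate is what matters in the measurements below, not a per-thread rate.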


Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime's default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The time built-in reports slightly less than 16 minutes (960 seconds) of "user" CPU. That's good.

The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9660 samples at 100 Hz. That's close to the number of signals that the kernel reports it delivered to the program, but that doesn't match the number generated by the kernel or the actual CPU time spent.

$ time sudo perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

            95,690      signal:signal_generate                                      
             9,688      signal:signal_deliver                                       

      10.211210687 seconds time elapsed


real	0m10.316s
user	15m57.840s
sys	0m0.045s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:16pm (PDT)
Duration: 10.20s, Total samples = 1.61mins (949.83%)
Showing nodes accounting for 1.61mins, 100% of 1.61mins total
Dropped 13 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
  1.61mins   100%   100%   1.61mins   100%  command-line-arguments.cpuHog1
         0     0%   100%   1.61mins   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%   1.61mins   100%  command-line-arguments.cpuHogger

Calling runtime.SetCPUProfileRate with "2 Hz" right before the testing package's CPU profile starts lets me dial the profile rate down to less than 250 Hz process-wide. (The warning message seems harmless in this case.) This leads to the kernel's measurements of "signal:signal_generate" and "signal:signal_deliver" matching each other, and to go tool pprof's measurement of "15.94mins" coming very close to what the time built-in sees at "user 15m57.048s".

$ time sudo env PROFILE_HZ=2 perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
runtime: cannot set cpu profile rate until previous profile has finished.
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

             1,913      signal:signal_generate                                      
             1,913      signal:signal_deliver                                       

      10.120272912 seconds time elapsed


real	0m10.221s
user	15m57.048s
sys	0m0.378s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:17pm (PDT)
Duration: 10.11s, Total samples = 15.94mins (9464.52%)
Showing nodes accounting for 15.93mins, 99.90% of 15.94mins total
Dropped 1 node (cum <= 0.08mins)
      flat  flat%   sum%        cum   cum%
 15.93mins 99.90% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHog1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHogger

I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.

$ uname -a
Linux ip-172-31-18-196.us-west-2.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /boot/config-$(uname -r) | grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-$(uname -r) | grep ^CONFIG_HZ
CONFIG_HZ_250=y
CONFIG_HZ=250

I've seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)

What did you expect to see?

I expected the number of seconds of CPU time reported by go tool pprof to align with the number of seconds of CPU time observed by the kernel.

When I run go tool pprof, I expect the time reported in the "Duration" line (like "Duration: 5.11s, Total samples = 8.50s (166.40%)") to match what I'd see from looking at a tool like top at the same time.

What did you see instead?

The Linux kernel seems to drop SIGPROF events when they come more than 250 times per second. I don't know if it drops them fairly—the profiles might be skewed.


Open questions

Is there a simple setting that my coworkers and I are missing? I've reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.

Is the right move for runtime.SetCPUProfileRate to limit its input to 250 / GOMAXPROCS?

Does the number "250" come from Linux's CONFIG_HZ_250=y / CONFIG_HZ=250, and is it right for that configuration to end up compiled in to Go?

Thanks!


Here's the test program:

package repro

import (
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var sink int64

func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}

func TestCPUProfile(t *testing.T) {
	workers := runtime.GOMAXPROCS(0)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			var v int
			cpuHogger(cpuHog1, &v, 10*time.Second)
			atomic.StoreInt64(&sink, int64(v))
		}()
	}

	wg.Wait()
}

// cpuHogger and cpuHog1 from src/runtime/pprof/pprof_test.go

func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}

// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}
@hyangah
Contributor

hyangah commented Oct 22, 2019

@rsc @aclements @dvyukov
Many have reported various limitations of pprof-based CPU profiling (#13841, #14434, ...).
In addition, collecting CPU profiles with runtime/pprof requires extra CPU resources from the process (and there is no special scheduling priority for profile collection, afaik), so personally I avoid trusting CPU profiles from overloaded processes.

If you are on linux, you can try linux perf. The pprof tool understands perf input as well.

@dmitshur added the NeedsInvestigation label Oct 22, 2019
@heschi
Contributor

heschi commented Oct 23, 2019

Can you get some more information from the signal_generate tracepoint? Specifically, the thread ID to see if #14434 is in play, and the result to see how often it's TRACE_SIGNAL_ALREADY_PENDING. My (ill-informed) guess is that the delivery is hugely biased toward one thread, and a lot of the signals are getting dropped as a result of the overload.

@rhysh
Contributor Author

rhysh commented Oct 23, 2019

I was able to get a lot of info from perf trace --no-syscalls -e 'signal:*'. It looks like each thread is getting roughly its fair share of both signal_generate and signal_deliver events, but that often the signal_generate events have res=2 (which I take to mean TRACE_SIGNAL_ALREADY_PENDING).

I don't see significant bias towards particular threads.

I see a few dozen signal_generate events being generated in a very short time window (tens of microseconds). This repeats every 4ms (at 250 Hz). The first in the burst has res=0, and subsequent ones have res=2 until a signal_deliver event shows up. The signal_generate event following that will once again have res=0. It looks like the generation of SIGPROF events is racing against their delivery.


The test ran for 10 seconds on 96 vCPUs, with 10108 samples in the resulting CPU profile.

$ sudo perf trace --no-syscalls -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96 &> ./trace-100-96
$ go tool pprof -raw ./test-n ./prof-n | sed -n -e '/^Samples/,/^Locations/ p' | awk '{if ($1 ~ /^[0-9]+$/) {s += $1}} END {print s}'
10108

Looking at a small sample of threads, each got about 100 signal_generate events with res=0 and about 850 with res=2.

$ cat ./trace-100-96 | grep generate | awk '{$1="" ; print $0}' | sort | uniq -c | tail
  95  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=0)
 887  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=2)
 125  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37087 grp=1 res=0)
 812  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37087 grp=1 res=2)
 100  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37088 grp=1 res=0)
 850  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37088 grp=1 res=2)
 117  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37089 grp=1 res=0)
 851  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37089 grp=1 res=2)
 106  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37090 grp=1 res=0)
 888  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37090 grp=1 res=2)

The 95782 signal_generate events were split across 97 threads, with between 846 and 1122 events for each thread (only modest skew).

$ cat ./trace-100-96 | grep generate | awk '{print $6}' | sort | uniq -c | awk '{print $1}' | dist
N 97  sum 95782  mean 987.443  gmean 986.417  std dev 45.2544  variance 2047.96

     min 846
   1%ile 850.293
   5%ile 912.8
  25%ile 962.667
  median 984
  75%ile 1008
  95%ile 1075.6
  99%ile 1117.09
     max 1122

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠖⠒⠒⠦⠤⠤⠤⠖⠒⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.012
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠚⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⣀⣠⠤⠤⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠖⠒⠒⠒⠒⠒⠒⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⠲⠤⠤⠤⠤⠖⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                900                 1000                 1100

The signal_generate events with res=0 are also split across 97 threads with between 68 and 136 events per thread. There are 10108 of them, an exact match for the count of samples in the generated CPU profile.

$ cat ./trace-100-96 | grep generate | grep 'res=0' | awk '{print $6}' | sort | uniq -c | awk '{print $1}' | dist
N 97  sum 10108  mean 104.206  gmean 103.117  std dev 14.9276  variance 222.832

     min 68
   1%ile 69.84
   5%ile 80.4
  25%ile 93
  median 103
  75%ile 117
  95%ile 128.8
  99%ile 134.467
     max 136

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⠤⠤⠤⠖⠒⠒⠒⠒⠲⠤⢄⣀⣀⣀⣀⣀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.026
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣠⠤⠔⠒⠋⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠓⠒⠒⠦⠤⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠴⠒⠒⠒⠒⠒⠚⠉⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠒⠒⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                                100

Here's the distribution of the inter-arrival times of signal_generate events for a particular thread:

$ cat ./trace-100-96 | grep generate | grep '37090' | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 994  sum 10092  mean 10.1529  gmean 7.99014  std dev 8.56788  variance 73.4086

     min 3.838
   1%ile 3.89428
   5%ile 3.93505
  25%ile 4.02
  median 7.993
  75%ile 12.057
  95%ile 24.095
  99%ile 38.8371
     max 136.01

⠀⠀⠀⠀⠀⠀⢀⠖⠲⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.185
⠀⠀⠀⠀⠀⠀⡸⠀⠀⠣⠀⢠⠞⠙⣆⠀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠇⠀⠀⠘⠤⠎⠀⠀⠈⠦⠴⠉⠀⠙⠤⠴⠊⠉⠓⠤⠤⠤⠖⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠁
  0                                                               50

And the inter-arrival times of signal_generate events with res=0:

$ cat ./trace-100-96 | grep generate | grep res=0 | grep '37090' | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 106  sum 10080  mean 95.0942  gmean 53.0591  std dev 99.8375  variance 9967.52

     min 3.943
   1%ile 3.98505
   5%ile 7.9314
  25%ile 24.0063
  median 61.9045
  75%ile 151.981
  95%ile 307.972
  99%ile 477.82
     max 500.062

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠔⠒⠒⠲⢤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.008
⠀⠀⠀⠀⠀⠀⠀⣠⠞⠁⠀⠀⠀⠀⠀⠀⠉⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠊⠉⠉⠉⠉⠉⠉⠙⠒⠒⠒⠤⠤⠤⠤⠤⠤⠤⠴⠒⠒⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠁
        0                                                     500

The signal_generate events come in bursts. The median time between them is 2µs. The bursts are 4ms apart.

$ cat ./trace-100-96 | sort -k 1 -n | grep generate | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 95782  sum 10128  mean 0.10574  std dev 0.627422  variance 0.393658

     min 0
   1%ile 0.001
   5%ile 0.001
  25%ile 0.002
  median 0.002
  75%ile 0.003
  95%ile 0.005
  99%ile 3.93
     max 4.006

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⡖ 2.578
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠴⠥⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
     -0

The bursts of signal_generate events from the kernel seem to be racing against the application's signal handler. After a signal_deliver event, the next signal_generate will have res=0 but all others will have res=2.

$ cat ./trace-100-96 | sort -k 1 -n | head -n 10000 | tail -n 30
  1016.072 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=2)
  1016.074 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37061 grp=1 res=2)
  1016.076 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37017 grp=1 res=2)
  1016.078 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37055 grp=1 res=2)
  1016.080 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37019 grp=1 res=2)
  1016.082 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37068 grp=1 res=2)
  1016.084 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37008 grp=1 res=2)
  1016.086 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37036 grp=1 res=2)
  1016.088 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37046 grp=1 res=2)
  1016.090 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37034 grp=1 res=2)
  1016.093 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37029 grp=1 res=2)
  1016.094 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37059 grp=1 res=2)
  1016.095 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1016.097 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37056 grp=1 res=0)
  1016.098 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37014 grp=1 res=2)
  1016.099 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37083 grp=1 res=2)
  1016.101 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37013 grp=1 res=2)
  1016.102 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37002 grp=1 res=2)
  1016.103 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37032 grp=1 res=2)
  1016.104 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=36996 grp=1 res=2)
  1016.106 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1016.107 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37079 grp=1 res=0)
  1016.109 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=36999 grp=1 res=2)
  1016.111 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1019.995 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37013 grp=1 res=0)
  1020.002 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37014 grp=1 res=2)
  1020.006 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37005 grp=1 res=2)
  1020.009 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37051 grp=1 res=2)
  1020.011 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37059 grp=1 res=2)
  1020.012 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37008 grp=1 res=2)

After a signal_generate event with res=0, the median delay before the next signal_deliver event is 10µs:

$ cat ./trace-100-96 | sort -k 1 -n | grep signal | grep -v 'res=2' | awk '{if (/deliver/) {print $1-t} else {t=$1}}' | dist
N 10108  sum 218.332  mean 0.0215999  gmean 0.0114421  std dev 0.0351222  variance 0.00123357

     min 0.001
   1%ile 0.002
   5%ile 0.002
  25%ile 0.005
  median 0.01
  75%ile 0.026
  95%ile 0.079
  99%ile 0.122
     max 1

⠀⠀⠀⠀⠀⠀⠀⡔⢢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 73.804
⠀⠀⠀⠀⠀⠀⢰⠁⠀⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠏⠀⠀⠀⠀⠈⠉⠓⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                                      0.1

@heschi
Contributor

heschi commented Oct 23, 2019

It may be time for me to let more knowledgeable people handle this, but I'm confused by the raw trace text. It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads? Can that be right?

@rhysh
Contributor Author

rhysh commented Oct 23, 2019

It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads?

That's what it looks like to me too. Maybe there's a way to have that limit apply to each thread rather than the process as a whole?

CC @ianlancetaylor for expertise on signals.

@hyangah
Contributor

hyangah commented Oct 24, 2019

From http://man7.org/linux/man-pages/man7/signal.7.html I read

Standard signals do not queue. If multiple instances of a standard
signal are generated while that signal is blocked, then only one
instance of the signal is marked as pending (and the signal will be
delivered just once when it is unblocked). In the case where a
standard signal is already pending, the siginfo_t structure (see
sigaction(2)) associated with that signal is not overwritten on
arrival of subsequent instances of the same signal. Thus, the
process will receive the information associated with the first
instance of the signal.

I don't know how SIGPROF-based profiling is supposed to work reliably on multi-core, multi-threaded systems. If this is true and I'm not misreading it, it sounds to me like we need to adjust our sampling frequency based on the available cores and the number of threads.

@hyangah
Contributor

hyangah commented Oct 24, 2019

cc @aalexand and @rauls5382 for advice on how sigprof handling works in c++ and other languages.

@rhysh
Contributor Author

rhysh commented Oct 24, 2019

In #14434, @aclements pointed out that applications can use the timer_create syscall to track time spent in individual threads with the CLOCK_THREAD_CPUTIME_ID clockid. With the Linux-specific SIGEV_THREAD_ID event, it looks like they can request that the signal be delivered to a particular thread. If each thread/M uses that to request individual cost accounting and individual signal delivery, does that result in a separate 250 Hz limit for each thread?

Put another way, does each thread have its own bitmask / size-one-queue of pending signals?

Here are some options I see that build on each other, in increasing risk/complexity:

  1. Limit the per-process delivery rate in setitimer to no more than a constant divided by GOMAXPROCS. We might always guess "250 Hz", so would allow the usual 100 Hz rate for GOMAXPROCS of 1 or 2, but would limit it to 50 Hz when GOMAXPROCS=5.
  2. Instead of using setitimer to request process-wide SIGPROF delivery, use timer_create with CLOCK_PROCESS_CPUTIME_ID to request that same process-wide delivery. Then, use timer_getoverrun to count the number of missed signals and attribute them to a placeholder function in the final profile.
  3. Enhance the use of timer_create to request per-thread accounting and delivery with CLOCK_THREAD_CPUTIME_ID. Return to the default interval of 100 Hz. (If each Linux thread has an independent bitmask for pending signals.)
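For concreteness, here's roughly what option 3's per-thread setup could look like, based on my reading of the timer_create(2) man page rather than on any runtime code (error handling omitted; link with -lrt):

#define _GNU_SOURCE
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid /* older glibc doesn't expose this name */
#endif

/* Arm a SIGPROF timer on this thread's own CPU-time clock, and ask the
 * kernel (via the Linux-specific SIGEV_THREAD_ID) to deliver the signal
 * back to this same thread when the timer expires. */
static timer_t arm_per_thread_timer(int hz) {
	struct sigevent sev = {0};
	sev.sigev_notify = SIGEV_THREAD_ID;               /* Linux extension */
	sev.sigev_signo = SIGPROF;
	sev.sigev_notify_thread_id = syscall(SYS_gettid); /* deliver to this thread */

	timer_t t;
	timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &t);  /* this thread's CPU clock */

	struct itimerspec its = {
		.it_value    = {.tv_nsec = 1000000000 / hz},
		.it_interval = {.tv_nsec = 1000000000 / hz},
	};
	timer_settime(t, 0, &its, NULL);
	return t;
}

Each thread would run something like this for itself once (SIGEV_THREAD_ID needs the kernel tid of the target thread), and disarming at profile stop would be a matching timer_delete.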

Is that the right path forward?

@aalexand
Contributor

This is a known and unfortunate shortcoming of SIGPROF. See https://elinux.org/Kernel_Timer_Systems. Also http://hpctoolkit.org/man/hpcrun.html#section_15:

"On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page."

It's a bit strange that this extends to per-core timers, but apparently it's the case.

There is a Google-internal bug (b/69814231) that tracks some investigations around this. I think the options are:

  1. Use getrusage() to compute the total CPU time consumed by the process during the profiling session. Use that to pro-rate the sample weights. This will get the overall CPU usage right, but won't change the number of samples so the accuracy may suffer. Easy to implement though.

  2. Use the CLOCK_THREAD_CPUTIME_ID mode. That's what we do (not by default though) in Stackdriver Profiler's Java agent (link).

  3. I think it should also be possible to use the Linux perf subsystem as the source of the timer events, but I'm not 100% sure how accurate that would be or whether there are permission implications inside of Docker containers.
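To sketch what option 1's bookkeeping might look like (illustrative only, not an existing API; the scaling step at the end is the whole idea):

#include <sys/resource.h>

/* Option 1 sketch: measure the process's CPU time with getrusage() around the
 * profiling session, then scale sample weights so the profile's total matches
 * what the kernel observed. The number of samples (and so the statistical
 * accuracy) is unchanged; only the reported total is corrected. */
static double process_cpu_seconds(void) {
	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	return (double)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) +
	       (double)(ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}

/* At profile start:  before = process_cpu_seconds();
 * At profile stop:   scale = (process_cpu_seconds() - before)
 *                            / (num_samples * sample_period_seconds);
 * then multiply each sample's CPU-time weight by scale before writing out. */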

@rhysh
Contributor Author

rhysh commented Oct 25, 2019

Thanks for the context, @aalexand .

I tried out the three options I proposed yesterday:

  • Scaling down the requested profile interval works well, at a cost of lower sampling frequency.
  • Using timer_create for process-wide CPU accounting probably won't work. The signals it delivers seem to get picked up by arbitrary threads—including the sleeping sysmon and template threads—rather than only those spending CPU cycles. In addition, timer_getoverrun underreports the missed samples. When I run the test program with GOMAXPROCS=96 (on a machine with 96 vCPUs) and request a 10ms CLOCK_PROCESS_CPUTIME_ID interval, the profile includes around 4,400 samples in a ten-second window (rather than 96,000) and timer_getoverrun reports around 20 missed events.
  • Using timer_create and CLOCK_THREAD_CPUTIME_ID on each thread (with SIGEV_THREAD_ID to request the thread itself be notified) gives good results. Each thread is able to get 100 signals per second, even when there are 96 active threads. To make this work, each M would need to call timer_create itself, but I'd expect that any thread within the process would be able to arm/disarm all of the timers via timer_settime in runtime.setProcessCPUProfiler.

@aalexand
Contributor

CLOCK_THREAD_CPUTIME_ID is probably most promising. The only caveat is handling threads that the runtime is not aware of. On Linux this could be solved by iterating over all of the process's threads using the /proc file system (though this would miss short-lived threads created after profiling has started, so it might not always work). Not sure about non-Linux systems.

@gopherbot

Change https://golang.org/cl/204279 mentions this issue: runtime: limit SIGPROF rate to 250 Hz on Linux

@rhysh
Contributor Author

rhysh commented Oct 31, 2019

Work that comes in bursts—causing the process to spend more than 10ms of CPU time in one kernel tick—is systematically under-sampled. I've updated the reproducer to do work in a single goroutine, and then to do the same amount of work across GOMAXPROCS goroutines. In the resulting profile with go1.13.3 on a 96-vCPU machine, samples from when the process used only one goroutine make up 90.5% of the profile. The work done in parallel appears in only 9.3% of the samples. I'd expect both to be close to 50% / 1:1 rather than that 9.7:1 skew ratio.

It seems like as long as profiling on Linux uses setitimer, the sample rate has to balance low resolution against that bias.

The workaround for low-resolution profiles is to collect profiles over a longer time. I don't know of a practical workaround for user programs. (List threads, get the clock for each, change calls to runtime/pprof.StartCPUProfile to create and enable each thread's timer and then dial down the setitimer rate to 1 Hz, then fix up any disagreement in nanoseconds-per-sample in the resulting profile, then deactivate the per-thread timers.)

Is the project willing to trade away resolution to get lower bias? If so, how much?


updated reproducer showing 9.7:1 skew
package repro

import (
	"flag"
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var (
	sink      int64
	taskCount = flag.Int("task-count", 100, "")
)

func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}

func TestCPUProfile(t *testing.T) {
	parallelism := runtime.GOMAXPROCS(0)
	tasks := make(chan func())

	// force creation of GOMAXPROCS threads
	var gate sync.WaitGroup
	gate.Add(1)

	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		gate.Add(1)
		wg.Add(1)
		go func() {
			defer wg.Done()

			runtime.LockOSThread()
			gate.Done()
			gate.Wait()
			runtime.UnlockOSThread()

			for task := range tasks {
				task()
			}
		}()
	}

	gate.Done()

	var taskWg sync.WaitGroup
	enqueueTask := func(fn func(x int) int) {
		taskWg.Add(1)
		tasks <- func() {
			defer taskWg.Done()
			var v int
			cpuHogger(fn, &v, 0)
			atomic.StoreInt64(&sink, int64(v))
		}
	}

	// do work sequentially
	for i := 0; i < *taskCount; i++ {
		enqueueTask(cpuHog1)
		taskWg.Wait()
	}

	// do work in parallel
	for i := 0; i < *taskCount; i++ {
		enqueueTask(cpuHog2)
	}
	taskWg.Wait()

	close(tasks)
	wg.Wait()
}

// cpuHogger, cpuHog1, and cpuHog2 from src/runtime/pprof/pprof_test.go

func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}

// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}

func cpuHog2(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 2
		}
	}
	return foo
}
$ time ./test-n -test.count=1 -test.v -test.cpuprofile=./prof-n -test.cpu=96 -task-count=1000
=== RUN   TestCPUProfile
--- PASS: TestCPUProfile (52.84s)
PASS

real	0m52.973s
user	1m50.164s
sys	0m0.096s
$ go tool pprof -text /tmp/test-n /tmp/prof-n
File: test-n
Type: cpu
Time: Oct 31, 2019 at 11:06am (PDT)
Duration: 52.97s, Total samples = 57.66s (108.86%)
Showing nodes accounting for 57.54s, 99.79% of 57.66s total
Dropped 18 nodes (cum <= 0.29s)
      flat  flat%   sum%        cum   cum%
    52.19s 90.51% 90.51%     52.19s 90.51%  command-line-arguments.cpuHog1
     5.35s  9.28% 99.79%      5.35s  9.28%  command-line-arguments.cpuHog2
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.TestCPUProfile.func2.1
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.cpuHogger
$ time env PROFILE_HZ=2 ./test-n -test.count=1 -test.v -test.cpuprofile=./prof-n -test.cpu=96 -task-count=1000
=== RUN   TestCPUProfile
runtime: cannot set cpu profile rate until previous profile has finished.
--- PASS: TestCPUProfile (52.78s)
PASS

real	0m52.927s
user	1m49.887s
sys	0m0.104s
$ go tool pprof -text /tmp/test-n /tmp/prof-n
File: test-n
Type: cpu
Time: Oct 31, 2019 at 11:04am (PDT)
Duration: 52.92s, Total samples = 109.50s (206.92%)
Showing nodes accounting for 109.50s, 100% of 109.50s total
      flat  flat%   sum%        cum   cum%
    57.50s 52.51% 52.51%     57.50s 52.51%  command-line-arguments.cpuHog2
       52s 47.49%   100%        52s 47.49%  command-line-arguments.cpuHog1
         0     0%   100%    109.50s   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%    109.50s   100%  command-line-arguments.TestCPUProfile.func2.1
         0     0%   100%    109.50s   100%  command-line-arguments.cpuHogger

@rhysh
Contributor Author

rhysh commented Aug 11, 2020

Work that comes in bursts—causing the process to spend more than 10ms of CPU time in one kernel tick—is systematically under-sampled.

I have a real-world example of this type of work: the garbage collection done by the runtime. One of the apps I work with runs on a machine with a large number of hyperthreads, but typically does only a small amount of application-specific work. A CPU profile (from runtime/pprof.StartCPUProfile) that covers several GC runs shows about 4% of the process's CPU time is spent within runtime.gcBgMarkWorker. An execution trace (from runtime/trace.Start) that covers several GC runs shows (in the "Goroutine analysis" view) that the runtime.gcBgMarkWorker goroutines accounted for about 20% of the total program execution time.

The 4% vs 20% figures don't account for edge effects (the trace covers 3 GC runs and 4 long periods of application work), or for cases where the Go runtime believed it had scheduled a goroutine to a thread but the kernel had in fact suspended the thread. But a 5x difference in the two reports is significant, it aligns with the behavior I've described in this issue, and it's in a workload that affects any Go program that 1) uses the GC, 2) has many hyperthreads available for use, 3) is provisioned to use less than 100% CPU.

@rhysh
Contributor Author

rhysh commented Aug 26, 2020

It looks like timer_create with CLOCK_THREAD_CPUTIME_ID can work for pure-Go programs:
• The timers seem to correctly track the CPU usage of single threads.
• An extension specific to Linux (SIGEV_THREAD_ID for the sigev_notify field) can steer the signal to the thread that caused the timer to expire.
• The kernel will successfully deliver these signals, even when many of them trigger in a 4ms window.

There's more:
• The clockid for a thread's CPU usage can be calculated from its pid/tid. That means a single goroutine/thread in the program can create a timer to track the CPU usage of each other thread in the program (and ask that those timers deliver SIGPROF to the respective threads).
• Go registers its signal handlers with SA_SIGINFO, and the value of si_code shows whether this SIGPROF delivery is from setitimer (SI_KERNEL) or from timer_create (SI_TIMER). That means it's possible to use setitimer and timer_create at the same time, and to de-conflict the results. (I've tested this on Linux 4.14, and have not confirmed on older kernels.)

CLOCK_THREAD_CPUTIME_ID is probably most promising. The only caveat is handling threads that the runtime is not aware of.

If we use setitimer for the whole process, and use timer_create for the threads the runtime can discover (either through runtime.allm or /proc/self/task), the signal handler can choose to no-op on SIGPROF signals that are both 1) caused by setitimer and 2) arrive on threads that have a per-thread timer active.
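In signal-handler terms, that de-conflict rule would look roughly like this (a sketch, not the runtime's actual handler; the per-thread flag is a stand-in for however the runtime would track which threads have their own timer):

#define _GNU_SOURCE
#include <signal.h>

/* Hypothetical per-thread flag, set once this thread has armed its own
 * timer_create()-based SIGPROF timer. */
static __thread int has_thread_timer;

/* Registered via sigaction() with SA_SIGINFO so si_code is available. */
static void sigprof_handler(int sig, siginfo_t *info, void *ctx) {
	(void)sig;
	if (info->si_code == SI_KERNEL && has_thread_timer) {
		/* Delivery came from the process-wide setitimer, but this thread
		 * is already sampled by its own timer_create() timer: drop it so
		 * the work isn't counted twice. */
		return;
	}
	/* Otherwise (SI_TIMER from timer_create, or a setitimer signal on a
	 * thread without a per-thread timer): take the sample, i.e. unwind
	 * the stack from ctx and record it. */
	(void)ctx;
}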

The result of this would be profiles of work done on threads created by the Go runtime that are more accurate (not cutting off at 250% CPU usage) and precise (not under-sampling GC work on large, mostly-idle machines by a factor of 5), while keeping the current behavior for non-Go-runtime-created threads.

If there's interest in improving profiles for work done on threads that the Go runtime did not create (or are otherwise not in runtime.allm), and the project has appetite for the corresponding complexity, there are a few possible paths after that:
• When setitimer causes a SIGPROF delivery on a previously-unknown thread, the Go code that samples the stack could also request a per-thread timer for that thread.
• The program could poll /proc/self/task, looking for threads that were created without going through runtime.newm or similar.
• The program could use a mechanism to subscribe to new thread creation (it looks like perf_event_open can do this, if the runtime can call it when there's still only one thread in the process ... so maybe it doesn't help for the applications that would need it).


I'm working on code to make this happen, but it's resulting in a lot of new branching between Linux and the rest of the Unix support; there's a lot that was shared in src/runtime/signal_unix.go that's now different. If you're reading this and can think of other caveats or blockers that might keep that contribution out of the tree, or reasons that branching wouldn't be allowed, I'd be interested to hear and discuss them. Thanks!

@gopherbot

Change https://golang.org/cl/324129 mentions this issue: runtime: profile with per-thread timers on Linux

@rhysh
Contributor Author

rhysh commented Jun 5, 2021

I did the work to split Linux profiling out from the generic Unix profiling code, and added the syscalls to all the architectures that Go+Linux supports, so CL324129 PS3 now (1) compiles and (2) works well enough to run my reproducer test from the top of this issue. Setting GODEBUG=pproftimercreate=1 enables the timer_create/CLOCK_THREAD_CPUTIME_ID-based profiler. It seems to work for GOARCH of 386, amd64, arm, and arm64.

Here are some results I observed on a 64-core arm64 / Aarch64 machine (with 64 real cores and no hyperthreading / SMT). Recall that "cpuHog1" would be better named "work done serially on one thread", that "cpuHog2" is "work done across all available threads in parallel", and that those functions each do the same amount of total work. The CL needs some refinement (what to do when a new M starts during profiling, etc), but I think this shows that (1) the skew in setitimer can be severe on readily-available hardware (here, a c6g.metal on EC2), and (2) timer_create seems to fix it.

CC @felixge

First with PS3 but GODEBUG=pproftimercreate=0 to get old setitimer profiling:

$ time GODEBUG=pproftimercreate=0 ./35057_arm64.test -test.cpuprofile=./35057_arm64_0.profile
PASS

real	0m10.421s
user	0m20.148s
sys	0m0.000s

$ go tool pprof -top ./35057_arm64_0.profile
File: 35057_arm64.test
Type: cpu
Time: Jun 4, 2021 at 3:41pm (PDT)
Duration: 10.42s, Total samples = 10330ms (99.15%)
Showing nodes accounting for 10330ms, 100% of 10330ms total
      flat  flat%   sum%        cum   cum%
   10050ms 97.29% 97.29%    10050ms 97.29%  issue/golang/go/35057.cpuHog1
     280ms  2.71%   100%      280ms  2.71%  issue/golang/go/35057.cpuHog2
         0     0%   100%    10330ms   100%  issue/golang/go/35057.TestCPUProfile.func1
         0     0%   100%    10330ms   100%  issue/golang/go/35057.TestCPUProfile.func2.1
         0     0%   100%    10330ms   100%  issue/golang/go/35057.cpuHogger

Then with the same executable with the new timer_create profiler via GODEBUG=pproftimercreate=1:

$ time GODEBUG=pproftimercreate=1 ./35057_arm64.test -test.cpuprofile=./35057_arm64_1.profile
PASS

real	0m10.436s
user	0m20.255s
sys	0m0.010s

$ go tool pprof -top ./35057_arm64_1.profile
File: 35057_arm64.test
Type: cpu
Time: Jun 4, 2021 at 3:41pm (PDT)
Duration: 10.43s, Total samples = 20.06s (192.28%)
Showing nodes accounting for 20.05s, 100% of 20.06s total
Dropped 1 node (cum <= 0.10s)
      flat  flat%   sum%        cum   cum%
    10.06s 50.15% 50.15%     10.06s 50.15%  issue/golang/go/35057.cpuHog1
     9.99s 49.80%   100%        10s 49.85%  issue/golang/go/35057.cpuHog2
         0     0%   100%     20.06s   100%  issue/golang/go/35057.TestCPUProfile.func1
         0     0%   100%     20.06s   100%  issue/golang/go/35057.TestCPUProfile.func2.1
         0     0%   100%     20.06s   100%  issue/golang/go/35057.cpuHogger

@felixge
Contributor

felixge commented Jun 5, 2021

@rhysh this is awesome 👏. As mentioned on twitter, I'd be happy to help with this.

So on my end I put together a small standalone C project called proftest that makes it easy to compare signal delivery bias as well as max signal delivery frequencies between setitimer(2) and timer_create(2). The results confirm (might have to scroll down in the thread) the issues seen in the Go profiler and that timer_create(2) should be able to fix them.

@rhysh I know your patch still says "do not review", but if you'd like I'd be happy to do some testing and reviewing off it on my end. Additionally I could try to integrate your test case from above into the patch as well. Unless there is a reason you haven't done so yourself yet?

@rhysh
Contributor Author

rhysh commented Jun 5, 2021

Thanks @felixge — the results from proftest are very encouraging!

I think it may be easiest to collaborate on review, on testing, and on problem-solving. My impression of Gerrit is that it doesn't do much to enable work from multiple people on the same patch—do you have ideas on how to collaborate on the code itself without getting bogged down in mechanics?

I have a bit of test code in CL 204279 to compare the magnitude of CPU time reported by the app (pprof) vs the OS (getrusage). I should update the timer_create CL to include that test, and a test of the relative magnitude of serial/parallel execution like the one at the top of this issue. No good reason I haven't done it yet, will do.

CL 324129 isn't ready for detailed code review, but I'd appreciate your review of it for "ideas" and "structure", and to get problem-solving advice on some parts of it:

  • Is the way I split GOOS=linux out from the other Unix-like OSes correct, or is there a less-verbose way?
  • Is adding struct definitions and assembly code for every architecture the right way to support new syscalls from the runtime package (at least for GOOS=linux)?
  • How does profiling work today in programs that aren't solely Go code? How does it work for Go programs that use cgo, how does it work for Go code that ends up in a DSO that a C-like process dynamically loads?
  • When a new M starts while profiling is active, what's the right way to create a new timer for it? I have a bad hack in place right now in runtime/proc.go. It looks like runtime·minit is a place for GOOS-dependent code, so might be a good way forward on that.
  • When profiling stops, what's the right way to delete all of the timers in a way that resolves races between threads that launched while profiling was active?

And maybe also:

  • Should GOOS=linux profiling also use setitimer, record samples for unknown threads and ignore samples for runtime-managed threads (which would have their own timer_create timer)?
  • When setting the period and offset for a timer, it's clear that the period should be what the user requested (usually the default of 100Hz / 10ms). But should that also be the initial offset? Or should that be randomly distributed with an average of the (10ms) target? (Maybe uniformly over the interval 1ns–20ms.)
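(In timer_settime terms the question is what to put in it_value versus it_interval. A sketch of the "random initial offset, steady period" variant, with the 1ns–20ms range picked only as an example:)

#include <stdlib.h>
#include <time.h>

/* Sketch: keep the requested period (it_interval) at 10ms / 100 Hz, but
 * randomize the first expiry (it_value) over 1ns..20ms so the per-thread
 * timers don't all fire at the same instant. Assumes a timer_t created as
 * in the earlier timer_create sketch. */
static void arm_with_random_offset(timer_t t) {
	struct itimerspec its = {0};
	its.it_interval.tv_nsec = 10 * 1000 * 1000;             /* steady 10ms period */
	its.it_value.tv_nsec = 1 + rand() % (20 * 1000 * 1000); /* random 1ns..20ms first tick */
	timer_settime(t, 0, &its, NULL);
}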

@felixge
Contributor

felixge commented Jun 9, 2021

do you have ideas on how to collaborate on the code itself without getting bogged down in mechanics?

We could just collaborate in a GitHub PR in your fork until we're ready to submit upstream. This way you could even give me access to push commits into the branch (of course we'd discuss things before).

I should update the timer_create CL to include that test, and a test of the relative magnitude of serial/parallel execution like the one at the top of this issue. No good reason I haven't done it yet, will do.

SGTM. This is also something I could try to do on my end if it makes your life easier.

CL 324129 isn't ready for detailed code review, but I'd appreciate your review of it for "ideas" and "structure", and to get problem-solving advice on some parts of it:

I'm not an runtime internals expert, but I'll try to add some thoughts.

Is the way I split GOOS=linux out from the other Unix-like OSes correct, or is there a less-verbose way?
Is adding struct definitions and assembly code for every architecture the right way to support new syscalls from the runtime package (at least for GOOS=linux)?

LGTM, but not my area of expertise.

How does profiling work today in programs that aren't solely Go code? How does it work for Go programs that use cgo, how does it work for Go code that ends up in a DSO that a C-like process dynamically loads?

My understanding is that non-Go functions execute on their own stack, and when a signal arrives the various SetCgoTraceback callbacks get invoked to do the unwinding and symbolization of those stacks. cgocall.go also has some great comments outlining the details.

When a new M starts while profiling is active, what's the right way to create a new timer for it? I have a bad hack in place right now in runtime/proc.go. It looks like runtime·minit is a place for GOOS-dependent code, so might be a good way forward on that.

Not sure about the right way, but you hinted at using setitimer() to discover non-Go threads. Perhaps such a mechanism could be used for all threads, including those created by the runtime. But of course this leads to new problems, such as receiving a potentially undesirable excess of signals unless we could somehow distinguish the setitimer signals and turn them off on "discovered" threads using sigprocmask(). Definitely tricky, but perhaps worth exploring further? Especially if it would allow a unified approach for all threads instead of having to treat C/Go threads differently?

When profiling stops, what's the right way to delete all of the timers in a way that resolves races between threads that launched while profiling was active?

Tricky. My naive intuition would be to not attempt to disable the timers proactively, and instead check the process-wide profiling state every time a thread receives a signal, and disable timers inside of the signal handler as needed?

Should GOOS=linux profiling also use setitimer, record samples for unknown threads and ignore samples for runtime-managed threads (which would have their own timer_create timer)?

I think so. But ideally setitimer() would only be used to discover threads and then configure new timer_create() instances for them to make sure they are sampled with the same accuracy as Go threads.

When setting the period and offset for a timer, it's clear that the period should be what the user requested (usually the default of 100Hz / 10ms). But should that also be the initial offset? Or should that be randomly distributed with an average of the (10ms) target? (Maybe uniformly over the interval 1ns–20ms.)

Yeah, adding some jitter to the timers is probably a good idea to avoid signals being generated while a signal handler is active, which would cause them to be dropped like we're seeing with setitimer. That being said, I can try to do some testing on this in proftest to investigate further. I'm actually quite surprised that I haven't seen timer_create running into similar race conditions in my testing so far, but I'll have to simulate a "slow" signal handler to get a better picture. From my analysis, the current signal handler code usually takes ~1usec to run, so there is a chance that the "natural" jitter of the timers not being created at precisely the same moment is already "good enough" to steer clear of this issue.

Next steps for me:

  • Leave some comments on your CL 324129
  • Do some proftest experiments with slow signal handlers
  • Play around with combining setitimer + timer_create for thread discovery

Let me know what else I can do to help. Potential ideas:

  • Integrate your test case from CL 204279
  • Write CPU profiler test cases for cgo in case the runtime doesn't have good coverage already.
  • Write test cases for cgo code that spawns its own threads

Thanks for all the work you've already put into this. I'm looking forward to collaborating!

@felixge
Contributor

felixge commented Jun 9, 2021

As promised, I updated proftest so it can simulate signal handlers that spend non-trivial amounts of time (via the -s flag). The result is surprising: no matter how slow I make the signal handler, it seems like timer_create() doesn't end up dropping any signals on the floor.

Below is an example of a signal handler that spends an average of 221 usec inside of a busy loop without dropping signals. Since generated signals don't queue, I suspect timer_create() is implemented in a way that prevents it from generating more signals while a signal handler for the process is executing.

Maybe my analysis is wrong, but if it isn't, I think we don't need to worry about introducing artificial jitter when setting up timer_create() timers.

make && bpftrace -c './proftest -c -w2 -s10' ./trace.bt
gcc -g -Werror proftest.c -lpthread -lrt -o proftest
Attaching 5 probes...
threads: 12
work_scale: 2
sig_work_scale: 10
setitimer: 0
timer_create: 1
hz: 100
os: Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64

thread_id,signals,cpu_seconds,hz
0,373,3.073,121
1,372,3.073,121
2,372,3.073,121
3,373,3.074,121
4,370,3.070,121
5,373,3.073,121
6,370,3.071,120
7,371,3.071,121
8,372,3.073,121
9,372,3.073,121
10,374,3.075,122
11,372,3.073,121

process hz (total): 1453
thread hz (min): 120
thread hz (avg): 121
thread hz (max): 122
signal handler time usec (avg): 205


@signal[deliver]: 4477
@signal[generate]: 4478

@signal_handler_usecs: 
[8, 16)                1 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               1 |                                                    |
[64, 128)              1 |                                                    |
[128, 256)          4246 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           165 |@@                                                  |
[512, 1K)             34 |                                                    |
[1K, 2K)              12 |                                                    |
[2K, 4K)               3 |                                                    |
[4K, 8K)               1 |                                                    |

@signal_handler_usecs_avg: 221

Edit: My initial results showed the signal handler time as 10x longer than it really is. So as a sanity check, I've now also instrumented it via bpftrace and corrected the data above. The conclusion is still the same.

@rhysh
Contributor Author

rhysh commented Jun 18, 2021

Thanks @felixge , this is all very helpful. Your diagram of the signal handling flow control in PS4 clarifies how cgo programs are (or may be) different.

You're right to call out "Write test cases for cgo code that spawns its own threads" and "Write CPU profiler test cases for cgo in case the runtime doesn't have good coverage already." Getting cgo "right" (good results, easy to understand and explain the behavior) is a big remaining hurdle. I've done some work (below) to show the current behavior; I'd especially appreciate any help you can offer on contributing test cases upstream, which doesn't seem to have any yet for cgo profiling. Thank you!


I put together a rough demo (the code and results are in the expandable "Details" sections at the bottom) to compare how the profiler measures busy loops in Go and in cgo. The test matrix I have so far compares 1/ CPU spent in Go code vs C code, 2/ in threads created by the Go runtime vs in C by pthreads, 3/ when profiling starts before vs during the work, 4/ when there's 1 thread of work vs "many" (4 threads, on my 4-core/8-hyperthread test machine), 5/ in a stable release (go1.16.3 was most convenient for me) vs with PS4, and 6/ with the new code paths in PS4 active vs inactive (via GODEBUG=pproftimercreate=1).

Usually I see 1 thread of work measured as "100%". The exception to that is when the work happens in C code on a thread that the Go runtime created, and the profiler started after the call into C. I'm not especially worried about this.

When there are 4 threads of work and the profiler is using setitimer (either go1.16.3 or GODEBUG=pproftimercreate=0), the profiler consistently measures around "250%" when it would have measured "100%" for a single thread worth of work. This is the main thing I'm trying to fix.

When there are 4 threads of work and the profiler is using timer_create (GODEBUG=pproftimercreate=1), the profiler measures around "400%" (when it would have measured "100%" for a single thread worth of work) 1/ when the work happens in normal Go code, 2/ when the work happens in C calls made from threads that the Go runtime created, and 3/ when the work happens in Go code that is called by threads that C created using pthreads. In the first two cases, the Go runtime created the thread so the thread has an associated M structure. In the third case, although the Go runtime didn't create the thread it still associates an M structure with it via needm (if I understand correctly). The final case, where work happens in C code by threads created in C code, shows up as "250%". That's the case where there is no M for the thread, there is no timer_create for the thread, and the source of the SIGPROF is the process-wide setitimer. I'm concerned that poor or confusing behavior for that last case may keep this change from being merged, or from being enabled by default. @aalexand has called that out as a risk, and it looks like that's why it's off by default in GCP's Java profiler.


Looking at the (assembly) code for cgoSigtramp, it looks like the behavior is different per architecture. There isn't much on linux/arm64, but there is on linux/amd64. And sigprofNoGoWrapper looks like it only exists on linux/ppc64le. I haven't done an exhaustive search for tests of profiling with cgo, but the variation in code between big platforms like amd64 vs arm64 makes me suspicious.

It also looks like sigprofNonGo is a Go function that's called with the C calling convention: it has arguments (including the signal attributes to differentiate between setitimer and timer_create sources), but they're invisible to Go because they're in registers instead of on the stack. Maybe access to that would be helpful, but only for dealing with double-counting.

I haven't created a case yet where a thread's work would be double-counted. Getting signals from two sources means the thread has called timer_create, which means it has an M. But hitting any code path aside from the normal sigtramp (which in PS4 checks for this double-count condition) requires not having a G (or, "not knowing about" a G). How would a thread have an M, but not be able to pull a G out of thread local storage? Would that come from the TLS getting clobbered somehow? Is that common, and when it happens are the other effects benign?


Code

./cgo.go

// Copyright 2021 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

/*
void burn_cgo() ;
static void *burn_pthread(void *arg);
static void *burn_pthread_go(void *arg);
void prep_cgo();
void prep_pthread();
void prep_pthread_go();
*/
import "C"

import (
	"flag"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	goThread := flag.Int("go-threads", 0, "Number of Go-created threads to run Go code")
	cgoThread := flag.Int("cgo-threads", 0, "Number of Go-created threads to run cgo code")
	cgoPthread := flag.Int("cgo-pthreads", 0, "Number of cgo-created threads to run cgo code")
	goPthread := flag.Int("go-pthreads", 0, "Number of cgo-created threads to run Go code")
	sleep := flag.Duration("sleep", 2*time.Second, "Duration of test")
	presleep := flag.Duration("presleep", 0, "How long to wait for threads to start before profiling begins")
	cpuprofile := flag.String("cpuprofile", "", "Write a CPU profile to the specified file")
	flag.Parse()

	startProfile := func() {}
	if name := *cpuprofile; name != "" {
		f, err := os.Create(name)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()

		startProfile = func() {
			err := pprof.StartCPUProfile(f)
			if err != nil {
				log.Fatal(err)
			}
		}
		defer pprof.StopCPUProfile()
	}

	if *presleep <= 0 {
		startProfile()
	}

	for i := 0; i < *goThread; i++ {
		go in_go()
	}
	for i := 0; i < *cgoThread; i++ {
		go in_cgo()
	}
	for i := 0; i < *cgoPthread; i++ {
		go in_pthread()
	}
	for i := 0; i < *goPthread; i++ {
		go in_pthread_go()
	}

	if *presleep > 0 {
		time.Sleep(*presleep)
		startProfile()
	}

	time.Sleep(*sleep)
}

func in_go() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for {
	}
}

func in_cgo() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	C.prep_cgo()
}

func in_pthread() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	C.prep_pthread()
}

func in_pthread_go() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	C.prep_pthread_go()
}

//export burn_callback
func burn_callback() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for {
	}
}

./cgo.c

// Copyright 2021 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void burn_cgo() {
	while (1) asm("");
}

static void *burn_pthread(void *arg) {
	while (1) asm("");
}

static void *burn_pthread_go(void *arg) {
	extern void burn_callback(void);
	burn_callback();
}

void prep_cgo() {
	burn_cgo();
}

void prep_pthread() {
	int res;
	pthread_t tid;

	res = pthread_create(&tid, NULL, burn_pthread, NULL);
	if (res != 0) {
		fprintf(stderr, "pthread_create: %s\n", strerror(res));
		exit(EXIT_FAILURE);
	}
}

void prep_pthread_go() {
	int res;
	pthread_t tid;

	res = pthread_create(&tid, NULL, burn_pthread_go, NULL);
	if (res != 0) {
		fprintf(stderr, "pthread_create: %s\n", strerror(res));
		exit(EXIT_FAILURE);
	}
}

Results

$ go1.16 version
go version go1.16.3 linux/amd64
$ go1.16 build -o /tmp/c116 .

# CL324129 PS 4
$ go version
go version devel go1.17-57f155839f Tue Jun 8 16:15:30 2021 -0700 linux/amd64
$ go build -o /tmp/ccl4 .
/tmp/c116 -cgo-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.c116.cp1.9ms.pb.gz
/tmp/c116 -cgo-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.cp1.0ms.pb.gz
/tmp/c116 -cgo-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.cp4.0ms.pb.gz
/tmp/c116  -go-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.c116.gt1.9ms.pb.gz
/tmp/c116  -go-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.gt1.0ms.pb.gz
/tmp/c116  -go-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.gt4.0ms.pb.gz
/tmp/c116 -cgo-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.c116.ct1.9ms.pb.gz
/tmp/c116 -cgo-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.ct1.0ms.pb.gz
/tmp/c116 -cgo-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.ct4.0ms.pb.gz
/tmp/c116  -go-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.c116.gp1.9ms.pb.gz
/tmp/c116  -go-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.gp1.0ms.pb.gz
/tmp/c116  -go-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.c116.gp4.0ms.pb.gz

GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc0.cp1.9ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.cp1.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.cp4.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc0.gt1.9ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.gt1.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.gt4.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc0.ct1.9ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.ct1.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4 -cgo-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.ct4.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc0.gp1.9ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.gp1.0ms.pb.gz
GODEBUG=pproftimercreate=0 /tmp/ccl4  -go-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc0.gp4.0ms.pb.gz

GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc1.cp1.9ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.cp1.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.cp4.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc1.gt1.9ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.gt1.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.gt4.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-threads=1  -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc1.ct1.9ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-threads=1  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.ct1.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4 -cgo-threads=4  -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.ct4.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-pthreads=1 -sleep=1s -presleep=9ms -cpuprofile=/tmp/prof.ccl4.tc1.gp1.9ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-pthreads=1 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.gp1.0ms.pb.gz
GODEBUG=pproftimercreate=1 /tmp/ccl4  -go-pthreads=4 -sleep=1s -presleep=0ms -cpuprofile=/tmp/prof.ccl4.tc1.gp4.0ms.pb.gz
$ ls -c -r /tmp/prof.*.pb.gz | xargs -n 1 bash -c 'go tool pprof -raw $1 2>/dev/null | sed -n -e /^Sam/,/^Loc/p | awk -F: '\''{if (int($1)>0) {s+=$1}} END {print int(s)}'\'' | tr "\n" " " ; echo $1' -- | awk '{printf("%3d %s\n", $1, $2)}'
 99 /tmp/prof.c116.cp1.9ms.pb.gz
100 /tmp/prof.c116.cp1.0ms.pb.gz
251 /tmp/prof.c116.cp4.0ms.pb.gz
 99 /tmp/prof.c116.gt1.9ms.pb.gz
 99 /tmp/prof.c116.gt1.0ms.pb.gz
250 /tmp/prof.c116.gt4.0ms.pb.gz
  0 /tmp/prof.c116.ct1.9ms.pb.gz
100 /tmp/prof.c116.ct1.0ms.pb.gz
250 /tmp/prof.c116.ct4.0ms.pb.gz
 99 /tmp/prof.c116.gp1.9ms.pb.gz
 99 /tmp/prof.c116.gp1.0ms.pb.gz
253 /tmp/prof.c116.gp4.0ms.pb.gz
 99 /tmp/prof.ccl4.tc0.cp1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc0.cp1.0ms.pb.gz
267 /tmp/prof.ccl4.tc0.cp4.0ms.pb.gz
 99 /tmp/prof.ccl4.tc0.gt1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc0.gt1.0ms.pb.gz
251 /tmp/prof.ccl4.tc0.gt4.0ms.pb.gz
  0 /tmp/prof.ccl4.tc0.ct1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc0.ct1.0ms.pb.gz
251 /tmp/prof.ccl4.tc0.ct4.0ms.pb.gz
 99 /tmp/prof.ccl4.tc0.gp1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc0.gp1.0ms.pb.gz
251 /tmp/prof.ccl4.tc0.gp4.0ms.pb.gz
100 /tmp/prof.ccl4.tc1.cp1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc1.cp1.0ms.pb.gz
251 /tmp/prof.ccl4.tc1.cp4.0ms.pb.gz
 99 /tmp/prof.ccl4.tc1.gt1.9ms.pb.gz
100 /tmp/prof.ccl4.tc1.gt1.0ms.pb.gz
398 /tmp/prof.ccl4.tc1.gt4.0ms.pb.gz
  0 /tmp/prof.ccl4.tc1.ct1.9ms.pb.gz
 99 /tmp/prof.ccl4.tc1.ct1.0ms.pb.gz
400 /tmp/prof.ccl4.tc1.ct4.0ms.pb.gz
100 /tmp/prof.ccl4.tc1.gp1.9ms.pb.gz
 98 /tmp/prof.ccl4.tc1.gp1.0ms.pb.gz
397 /tmp/prof.ccl4.tc1.gp4.0ms.pb.gz

@ianlancetaylor
Contributor

For background: a complete implementation of cgoSigtramp will call the cgoTraceback function if one has been set by runtime.SetCgoTraceback. The cgoTraceback function is a C function that is called with the C calling convention. This full support has only been implemented on linux-amd64, linux-ppc64le, darwin-amd64, freebsd-amd64. Other platforms do not support calling cgoTraceback when a signal occurs, which means that tests like TestCgoCrashTraceback will fail. The intention is that every system will call cgoTraceback from the signal handler, but this is pretty esoteric, only matters to programs that call runtime.SetCgoTraceback, and hasn't been done.

I'm not sure quite what it means for a thread to have an M but not a G, since in general threads get their M by looking at getg().m. But note that when a thread created by C calls a Go function, it will be temporarily assigned an M, and there are brief periods of time where the M has been assigned but there is no G.

@rhysh
Contributor Author

rhysh commented Jun 18, 2021

Thanks @ianlancetaylor . What I mean for a thread to "have" an M but not be able to find its G is 1/ there is a runtime.m value that has its procid field set to the value we'd get from a gettid syscall on that thread, and 2/ a call to getg() (or use of the g register, depending on the platform) would return nil.

On linux-ppc64le, the pointer to the g struct is stored in R30. If a Go program on that platform uses runtime.SetCgoTraceback, enables profiling (with runtime.execute calling setThreadCPUProfiler which would, after CL324129 PS4, do a timer_create syscall to enable profiling for that thread), then calls a C function which calls some assembly code which zeroes out R30, what is the result?

It looks like when SIGPROF arrives on that thread, we'd get a call to cgoSigtramp. That uses runtime·load_g to fix the g register / R30 before checking if g == nil. So clobbering R30 itself in C / assembly is fine, so long as the load_g is able to fix it via TLSBSS. A working g register means the thread executes sigtramp—the same code that normal Go programs would use—rather than sigtrampnog. But a C / assembly function that broke the TLSBSS mechanism would lead the thread to execute sigtrampnog. (There's probably an equivalent way to break getg on linux-amd64, but I don't know enough about TLS to engineer it.)

Is that common? If a program does that, does it break in many ways or would only the profiler be affected?

@ianlancetaylor
Contributor

I guess I'm not sure what you mean by suggesting that a C/assembly function might break the TLSBSS mechanism. Why do we care about that case? Such a program would presumably break any C signal handler that uses thread-local variables.

That said, the g == nil case can easily arise if the signal is delivered to a thread created by C that is running C code. Such a thread would naturally have g == nil.

As a separate matter, with the current runtime code, when a thread created by C calls into Go, an M is assigned to that thread. That M will have its procid field set to the thread ID, by the minit function. And it will have a g assigned, and the value g in TLS will point to the M. Then when the thread returns from Go back to C, g will be cleared to nil, and the M will be put on a list for future calls from C-created threads to Go. However, as it happens, in today's code the procid field will never be cleared. So that is an example where there can be an M with a procid field that refers to a C created thread, but the TLS entry for g on that thread will be nil.

@felixge
Contributor

felixge commented Jun 28, 2021

I'd especially appreciate any help you can offer on contributing test cases upstream, which doesn't seem to have any yet for cgo profiling. Thank you!

I'll try to create some this week and submit them separately. Will drop the review links here.

Looking at the (assembly) code for cgoSigtramp, it looks like the behavior is different per architecture.

Yeah. I should have clarified that my diagram was for linux/amd64.

I put together a rough demo (the code and results are in the expandable "Details" sections at the bottom) to compare how the profiler measures busy loops in Go and in cgo.

Fantastic, I'll take a closer look at this as well.

Sorry for the late follow-up.

@felixge
Contributor

felixge commented Jul 6, 2021

Edit: On second thought, we might not need SetCgoTraceback() see follow-up comment.

I'm starting to work on a patch to add cgo cpu profiling test cases now. It seems that in order to make statistical assertions on the amount of time spent in different cgo threads, we'll need a SetCgoTraceback() implementation. Otherwise we won't be able to tell which functions the cgo code is spending time in (see screenshot below):

[screenshot: CPU profile in which the cgo samples appear only as unresolved addresses, without function names]

I'm thinking that vendoring @ianlancetaylor's cgosymbolizer for the test suite might be the easiest option. @ianlancetaylor does this seem reasonable?

If taking on this dependency for the test suite is not an option we'll need something else. AFAIK debug/dwarf doesn't implement unwinding, but symbolization might work. If that's the case I could probably do a simple frame pointer unwinding implementation in C.

So for now I'll try @ianlancetaylor's symbolizer and see if I can turn @rhysh's test cases from above into something that fits into the current test suite. @rhysh what do you think?
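
For what it's worth, wiring in that symbolizer would just be a blank import (a sketch assuming the external module can be vendored; its init function registers the traceback and symbolizer via runtime.SetCgoTraceback):

package main

import _ "github.com/ianlancetaylor/cgosymbolizer"

func main() {
	// ... start the cgo busy loops and collect the CPU profile as in the demo above ...
}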

@felixge
Contributor

felixge commented Jul 6, 2021

On second thought, we might be able to get away without a SetCgoTraceback implementation. Our profiles should still contain the rip pc, so we might just need to symbolize that using DWARF. That might be good enough if the cgo code we're interested in is running a hot loop that doesn't call out to other functions ...
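
A rough sketch of that kind of PC-to-line lookup with debug/elf and debug/dwarf (a hypothetical helper; it assumes a non-PIE binary so the sampled PC matches the addresses in the DWARF line table, and a real tool would also need to account for the text-segment mapping offset):

// Command pc2line (hypothetical) maps a program counter in an ELF binary to a
// source file and line using the binary's DWARF line tables.
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	f, err := elf.Open(os.Args[1]) // path to the profiled binary
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	d, err := f.DWARF()
	if err != nil {
		log.Fatal(err)
	}

	pc, err := strconv.ParseUint(os.Args[2], 0, 64) // e.g. "0x46a3f0" from the profile
	if err != nil {
		log.Fatal(err)
	}

	// Find the compile unit containing the PC, then look it up in that unit's line table.
	cu, err := d.Reader().SeekPC(pc)
	if err != nil {
		log.Fatalf("no compile unit for %#x: %v", pc, err)
	}
	lr, err := d.LineReader(cu)
	if err != nil || lr == nil {
		log.Fatalf("no line table for %#x: %v", pc, err)
	}
	var le dwarf.LineEntry
	if err := lr.SeekPC(pc, &le); err != nil {
		log.Fatalf("no line entry for %#x: %v", pc, err)
	}
	fmt.Printf("%#x -> %s:%d\n", pc, le.File.Name, le.Line)
}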

felixge added a commit to felixge/dump that referenced this issue Jul 7, 2021
@felixge
Contributor

felixge commented Jul 15, 2021

@rhysh I created a CL for the test cases we discussed. PTAL https://go-review.googlesource.com/c/go/+/334769

@gopherbot

Change https://golang.org/cl/342052 mentions this issue: runtime: add timer_create syscalls for Linux

@gopherbot

Change https://golang.org/cl/342053 mentions this issue: runtime: allow per-OS changes to unix profiler

@gopherbot

Change https://golang.org/cl/342054 mentions this issue: runtime: use per-thread profiler for linux/amd64

@gopherbot

Change https://golang.org/cl/342055 mentions this issue: runtime: use per-thread profiler for linux/ppc64le

@prattmic
Member

prattmic commented Sep 7, 2021

@rhysh thanks for all the detailed analysis on this. I think there is quite a compelling argument, and that we probably want this. I plan to review your CLs this week (sorry for the delay! Just got back from vacation.).

A few notes/thoughts I have:

  • This problem should also be solved by proposal: runtime/pprof: add PMU-based profiles #36821, if accepted. However, PMUs won't ever be available on all systems, so we'll need a fallback.
  • I'm curious about the performance cost of enabling timers on thread creation/discovery as well as the kernel overhead of managing all these extra timers. I don't expect it to be severe, but it would be good to have some measurements to verify.

@prattmic
Member

prattmic commented Sep 8, 2021

  • I'm curious about the performance cost of enabling timers on thread creation/discovery as well as the kernel overhead of managing all these extra timers. I don't expect it to be severe, but it would be good to have some measurements to verify.

For what it's worth, I took a look at the implementation for these, and I am not too concerned.

For expiration and delivery, handling is nearly identical to ITIMER_PROF, the timer just lives on (struct task_struct).cpu_timers[0] instead of (struct task_struct).signal.cpu_timers[0], both of which are checked in the same path: https://elixir.bootlin.com/linux/v5.14.1/source/kernel/time/posix-cpu-timers.c#L1242.

Creation of a timer is a bit more involved, but it is primarily just an allocation and installation on the proper timer list.

Of course, we should still measure.

@rhysh
Contributor Author

rhysh commented Sep 21, 2021

Thanks for the reviews, @prattmic , and for checking on the implementation in the kernel.

I wrote a benchmark for creating, setting, and deleting a timer and found that those steps take about 1–2µs total on my linux/amd64 test machine with 8 hyperthreads / 4 physical cores and go1.17.1 and Linux in the v5.4 series. That should cover the cost of enabling the timers, and (for the few moments that each timer exists in the benchmark code) some of the cost of managing them.

To measure the performance cost of the timers while they're active, I ran a benchmark from a core package (BenchmarkCodeEncoder from encoding/json, because of its use of b.RunParallel) while collecting (and discarding) a CPU profile. I compared the results of my most recent changes (CL 342055 PS 5, go version devel go1.18-bf0bbf764c Mon Sep 20 12:01:33 2021 -0700 linux/amd64) against the merge-base on the main branch (go version devel go1.18-3fa7dbeff5 Fri Sep 17 21:29:32 2021 +0000 linux/amd64). I disabled cgo so the new timers would cover as much of the work as possible. I collected results with GOMAXPROCS at 1, 2, 4, and 8; with 1 and 2, the number of SIGPROF deliveries should be the same, but with 4 and 8 the new code will result in more SIGPROF deliveries and profiles that are larger and more complete.

The results claim that the new timer_create profiler leads to slightly faster benchmark performance when GOMAXPROCS=1 (weird!), and slightly slower performance when GOMAXPROCS=8 (maybe due to more signals being successfully delivered?), and equivalent performance with GOMAXPROCS of 2 or 4.

It also claims more memory allocated when GOMAXPROCS is 4 or 8, even though there are 0 allocs per operation. It looks like that's describing the memory used for the CPU profile, since alloc/op for that benchmark is about 1 byte when CPU profiling is disabled.

$ for i in {1..10}; do CGO_ENABLED=0 ./bin/go test -run='^$' -bench=BenchmarkCodeEncoder -cpuprofile=/dev/null -cpu=1,2,4,8 encoding/json; done

$ benchstat /tmp/tip /tmp/cl
name           old time/op    new time/op    delta
CodeEncoder      7.12ms ± 0%    7.10ms ± 0%   -0.30%  (p=0.000 n=8+10)
CodeEncoder-2    3.56ms ± 1%    3.56ms ± 1%     ~     (p=0.315 n=9+10)
CodeEncoder-4    1.80ms ± 1%    1.79ms ± 1%     ~     (p=0.549 n=10+9)
CodeEncoder-8    1.87ms ± 0%    1.89ms ± 0%   +0.72%  (p=0.000 n=10+9)

name           old speed      new speed      delta
CodeEncoder     272MB/s ± 0%   273MB/s ± 0%   +0.30%  (p=0.000 n=8+10)
CodeEncoder-2   544MB/s ± 1%   545MB/s ± 1%     ~     (p=0.305 n=9+10)
CodeEncoder-4  1.08GB/s ± 1%  1.08GB/s ± 1%     ~     (p=0.549 n=10+9)
CodeEncoder-8  1.04GB/s ± 0%  1.03GB/s ± 0%   -0.72%  (p=0.000 n=10+9)

name           old alloc/op   new alloc/op   delta
CodeEncoder        407B ± 8%      400B ± 6%     ~     (p=0.165 n=10+10)
CodeEncoder-2      351B ±22%      395B ± 7%     ~     (p=0.197 n=10+10)
CodeEncoder-4      172B ± 9%      269B ±11%  +56.17%  (p=0.000 n=8+9)
CodeEncoder-8      230B ±10%      435B ± 5%  +88.86%  (p=0.000 n=10+9)

name           old allocs/op  new allocs/op  delta
CodeEncoder        0.00           0.00          ~     (all equal)
CodeEncoder-2      0.00           0.00          ~     (all equal)
CodeEncoder-4      0.00           0.00          ~     (all equal)
CodeEncoder-8      0.00           0.00          ~     (all equal)

Here are the results and code for the create/settime/delete benchmark.

name           time/op
TimerCreate    1.86µs ± 0%
TimerCreate-2  1.29µs ± 1%
TimerCreate-4  1.05µs ± 3%
TimerCreate-8  1.35µs ± 0%
Benchmark code
// Copyright 2021 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

//go:build linux
// +build linux

package repro

import (
	"runtime"
	"syscall"
	"testing"
	"unsafe"
)

func BenchmarkTimerCreate(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {

		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		procid := syscall.Gettid()

		for pb.Next() {
			var timerid int32
			sevp := &sigevent{
				notify:                 _SIGEV_THREAD_ID,
				signo:                  int32(syscall.SIGPROF),
				sigev_notify_thread_id: int32(procid),
			}
			err := timer_create(_CLOCK_THREAD_CPUTIME_ID, sevp, &timerid)
			if err != nil {
				b.Errorf("timer_create; err = %v", err)
				return
			}

			spec := &itimerspec{
				it_interval: timespec{tv_nsec: 10 * 1e6},
				it_value:    timespec{tv_nsec: 10 * 1e6},
			}
			err = timer_settime(timerid, 0, spec, nil)
			if err != nil {
				b.Errorf("timer_settime; err = %v", err)
				return
			}

			err = timer_delete(timerid)
			if err != nil {
				b.Errorf("timer_delete; err = %v", err)
				return
			}
		}
	})
}

const (
	_SIGEV_THREAD_ID = 0x4

	_CLOCK_THREAD_CPUTIME_ID = 0x3
)

func timer_create(clockid int32, sevp *sigevent, timerid *int32) error {
	_, _, errno := syscall.Syscall(syscall.SYS_TIMER_CREATE,
		uintptr(clockid),
		uintptr(unsafe.Pointer(sevp)),
		uintptr(unsafe.Pointer(timerid)))
	if errno != 0 {
		return errno
	}
	return nil
}

func timer_settime(timerid int32, flags int32, new, old *itimerspec) error {
	_, _, errno := syscall.Syscall6(syscall.SYS_TIMER_SETTIME,
		uintptr(timerid),
		uintptr(flags),
		uintptr(unsafe.Pointer(new)),
		uintptr(unsafe.Pointer(old)),
		0,
		0)
	if errno != 0 {
		return errno
	}
	return nil
}

func timer_delete(timerid int32) error {
	_, _, errno := syscall.Syscall(syscall.SYS_TIMER_DELETE,
		uintptr(timerid),
		0,
		0)
	if errno != 0 {
		return errno
	}
	return nil
}

type timespec struct {
	tv_sec  int64
	tv_nsec int64
}

type itimerspec struct {
	it_interval timespec
	it_value    timespec
}

type sigevent struct {
	value  uintptr
	signo  int32
	notify int32
	// below here is a union; sigev_notify_thread_id is the only field we use
	sigev_notify_thread_id int32
}
Benchmark results
$ go version                                                 
go version go1.17.1 linux/amd64

$ go test ./timercreate_test.go -bench=. -cpu=1,2,4,8 -count=10
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
BenchmarkTimerCreate              624936              1870 ns/op
BenchmarkTimerCreate              641752              1870 ns/op
BenchmarkTimerCreate              648594              1858 ns/op
BenchmarkTimerCreate              657189              1860 ns/op
BenchmarkTimerCreate              646204              1870 ns/op
BenchmarkTimerCreate              645511              1859 ns/op
BenchmarkTimerCreate              654400              1857 ns/op
BenchmarkTimerCreate              644583              1862 ns/op
BenchmarkTimerCreate              630817              1862 ns/op
BenchmarkTimerCreate              638829              1859 ns/op
BenchmarkTimerCreate-2            936886              1295 ns/op
BenchmarkTimerCreate-2            927549              1309 ns/op
BenchmarkTimerCreate-2            911044              1285 ns/op
BenchmarkTimerCreate-2            929598              1294 ns/op
BenchmarkTimerCreate-2            912998              1293 ns/op
BenchmarkTimerCreate-2            935090              1294 ns/op
BenchmarkTimerCreate-2            952288              1289 ns/op
BenchmarkTimerCreate-2            943212              1301 ns/op
BenchmarkTimerCreate-2            927578              1293 ns/op
BenchmarkTimerCreate-2            926503              1299 ns/op
BenchmarkTimerCreate-4           1000000              1053 ns/op
BenchmarkTimerCreate-4           1000000              1061 ns/op
BenchmarkTimerCreate-4           1000000              1047 ns/op
BenchmarkTimerCreate-4           1000000              1034 ns/op
BenchmarkTimerCreate-4           1000000              1046 ns/op
BenchmarkTimerCreate-4           1000000              1033 ns/op
BenchmarkTimerCreate-4           1000000              1070 ns/op
BenchmarkTimerCreate-4           1000000              1083 ns/op
BenchmarkTimerCreate-4            988370              1115 ns/op
BenchmarkTimerCreate-4           1226841              1057 ns/op
BenchmarkTimerCreate-8            888840              1351 ns/op
BenchmarkTimerCreate-8            910783              1351 ns/op
BenchmarkTimerCreate-8            909226              1348 ns/op
BenchmarkTimerCreate-8            884605              1348 ns/op
BenchmarkTimerCreate-8            894314              1352 ns/op
BenchmarkTimerCreate-8            889054              1349 ns/op
BenchmarkTimerCreate-8            891530              1350 ns/op
BenchmarkTimerCreate-8            913797              1353 ns/op
BenchmarkTimerCreate-8            893120              1350 ns/op
BenchmarkTimerCreate-8            917433              1349 ns/op
PASS
ok      command-line-arguments  48.576s

@prattmic
Member

Thanks @rhysh for the detailed benchmarks. They all look good to me. Just one question: for CodeEncoder-4 and CodeEncoder-8, I'm assuming that before your CL those benchmarks were hitting the 250% limit in profiles, and after they show accurate usage? If so, that would explain the additional memory usage, as they are (correctly) receiving more profiling signals.

@rhysh
Contributor Author

rhysh commented Sep 23, 2021

That's right, before my CL the -cpu=4,8 tests hit the 250% limit.

Before:

$ ./bin/go version
go version devel go1.18-3fa7dbeff5 Fri Sep 17 21:29:32 2021 +0000 linux/amd64
$ for cpu in 1 2 4 8 ; do CGO_ENABLED=0 ./bin/go test -run='^$' -bench=BenchmarkCodeEncoder -cpuprofile=/tmp/p -cpu=$cpu -benchtime=10s encoding/json | grep '^Benchmark' && ./bin/go tool pprof -top /tmp/p | grep '^Duration' ; done
BenchmarkCodeEncoder        1684           7123800 ns/op         272.39 MB/s         315 B/op          0 allocs/op
Duration: 12.97s, Total samples = 12.79s (98.59%)
BenchmarkCodeEncoder-2              3333           3573002 ns/op         543.09 MB/s         251 B/op          0 allocs/op
Duration: 12.52s, Total samples = 24.57s (196.31%)
BenchmarkCodeEncoder-4              5978           1782120 ns/op        1088.86 MB/s         155 B/op          0 allocs/op
Duration: 11.06s, Total samples = 27.38s (247.64%)
BenchmarkCodeEncoder-8              6231           1869153 ns/op        1038.16 MB/s         169 B/op          0 allocs/op
Duration: 12.07s, Total samples = 29.68s (245.88%)

After:

$ ./bin/go version
go version devel go1.18-bf0bbf764c Mon Sep 20 12:01:33 2021 -0700 linux/amd64
$ for cpu in 1 2 4 8 ; do CGO_ENABLED=0 ./bin/go test -run='^$' -bench=BenchmarkCodeEncoder -cpuprofile=/tmp/p -cpu=$cpu -benchtime=10s encoding/json | grep '^Benchmark' && ./bin/go tool pprof -top /tmp/p | grep '^Duration' ; done
BenchmarkCodeEncoder        1686           7100152 ns/op         273.30 MB/s         329 B/op          0 allocs/op
Duration: 12.95s, Total samples = 12.76s (98.51%)
BenchmarkCodeEncoder-2              3344           3557388 ns/op         545.48 MB/s         252 B/op          0 allocs/op
Duration: 12.42s, Total samples = 24.56s (197.78%)
BenchmarkCodeEncoder-4              6700           1781134 ns/op        1089.46 MB/s         125 B/op          0 allocs/op
Duration: 21.84s, Total samples = 86.60s (396.51%)
BenchmarkCodeEncoder-8              5984           1880590 ns/op        1031.84 MB/s         351 B/op          0 allocs/op
Duration: 11.71s, Total samples = 91.48s (781.06%)

@gopherbot

Change https://golang.org/cl/351790 mentions this issue: runtime: move sigprofNonGo

gopherbot pushed a commit that referenced this issue Sep 27, 2021
Updates #35057

Change-Id: Id702b502fa4e4005ba1e450a945bc4420a8a8b8c
Reviewed-on: https://go-review.googlesource.com/c/go/+/342052
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Than McIntosh <thanm@google.com>
gopherbot pushed a commit that referenced this issue Sep 27, 2021
Updates #35057

Change-Id: I56ea8f4750022847f0866c85e237a2cea40e0ff7
Reviewed-on: https://go-review.googlesource.com/c/go/+/342053
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Sep 27, 2021
Using setitimer on Linux to request SIGPROF signal deliveries in
proportion to the process's on-CPU time results in under-reporting when
the program uses several goroutines in parallel. Linux calculates the
process's total CPU spend on a regular basis (often every 4ms); if the
process has spent enough CPU time since the last calculation to warrant
more than one SIGPROF (usually 10ms for the default sample rate of 100
Hz), the kernel is often able to deliver only one of them. With these
common settings, that results in Go CPU profiles being attenuated for
programs that use more than 2.5 goroutines in parallel.

To avoid in effect overflowing the kernel's process-wide CPU counter,
and relying on Linux's typical behavior of having the active thread
handle the resulting process-targeted signal, use timer_create to
request a timer for each OS thread that the Go runtime manages. Have
each timer track the CPU time of a single thread, with the resulting
SIGPROF going directly to that thread.

To continue tracking CPU time spent on threads that don't interact with
the Go runtime (such as those created and used in cgo), keep using
setitimer in addition to the new mechanism. When a SIGPROF signal
arrives, check whether it's due to setitimer or timer_create and filter
as appropriate: If the thread is known to Go (has an M) and has a
timer_create timer, ignore SIGPROF signals from setitimer. If the thread
is not known to Go (does not have an M), ignore SIGPROF signals that are
not from setitimer.

Counteract the new bias that per-thread profiling adds against
short-lived threads (or those that are only active on occasion for a
short time, such as garbage collection workers on mostly-idle systems)
by configuring the timers' initial trigger to be from a uniform random
distribution between "immediate trigger" and the full requested sample
period.

Updates #35057

Change-Id: Iab753c4e5101bdc09ef9132eec84a75478e05579
Reviewed-on: https://go-review.googlesource.com/c/go/+/324129
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
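
As an aside, the randomized first expiration described in the commit message above can be illustrated with the timespec/itimerspec types and the timer_settime wrapper from the benchmark code earlier in this thread (a sketch only, not the runtime's actual code; it also uses math/rand):

// armProfTimer arms an already-created per-thread timer so that it fires
// every period nanoseconds, with the first expiration drawn uniformly from
// (0, period] to avoid biasing against short-lived threads.
func armProfTimer(timerid int32, period int64) error {
	// An it_value of zero would disarm the timer, so the low end is 1ns rather than 0.
	first := 1 + rand.Int63n(period)
	spec := &itimerspec{
		it_value:    timespec{tv_nsec: first},
		it_interval: timespec{tv_nsec: period},
	}
	return timer_settime(timerid, 0, spec, nil)
}
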
gopherbot pushed a commit that referenced this issue Sep 27, 2021
The sigprofNonGo and sigprofNonGoPC functions are only used on unix-like
platforms. In preparation for unix-specific changes to sigprofNonGo,
move it (plus its close relative) to a unix-specific file.

Updates #35057

Change-Id: I9c814127c58612ea9a9fbd28a992b04ace5c604d
Reviewed-on: https://go-review.googlesource.com/c/go/+/351790
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: David Chase <drchase@google.com>
gopherbot pushed a commit that referenced this issue Sep 27, 2021
Updates #35057

Change-Id: I61d772a2cbfb27540fb70c14676c68593076ca94
Reviewed-on: https://go-review.googlesource.com/c/go/+/342054
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
@prattmic
Member

@rhysh the new TestCPUProfileMultithreadMagnitude seems to be immediately flaky on linux-amd64-longtest builders (where the 10% max difference is enforced):

2021-09-27T18:58:36-5b90958/linux-amd64-longtest
2021-09-27T18:58:29-8cfd8c3/linux-amd64-longtest

I haven't yet been able to reproduce these failures.

@prattmic
Member

prattmic commented Sep 27, 2021

Actually, the new timers aren't in use on amd64 until http://golang.org/cl/342054, so it makes sense that the test was failing (i.e., it was added a bit too early). Since the CL is in, we don't expect any more failures.

While we're at it, I believe this issue is now complete?

@rhysh
Contributor Author

rhysh commented Sep 27, 2021

Yes, with CL 324129 (to fix on Linux for most architectures) and CL 342054 (to enable the fix on the remaining architectures), this is now complete. Thank you (and @felixge , and everyone both up-thread and elsewhere) for your help to get it done!

You're right @prattmic , I put the test too early in the CL stack. Is it worth a follow-up to move the go118UseTimerCreateProfiler flag to gate the test too? I'm happy to send one, either now or once the Go team is back from break. But the harm for bisects is already done, and if we have to turn the feature off it'll be obvious enough that the test needs to come out too.

@prattmic
Member

Is it worth a follow-up to move the go118UseTimerCreateProfiler flag to gate the test too?

I don't think we need to bother. It will be pretty obvious in tests if the flag is disabled.
