Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: under-samples work on short-lived threads #40435

Closed
rhysh opened this issue Jul 27, 2020 · 5 comments
Closed

runtime/pprof: under-samples work on short-lived threads #40435

rhysh opened this issue Jul 27, 2020 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Jul 27, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15rc1 darwin/amd64

$ uname -a
Linux ip-172-31-18-196.us-west-2.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Does this issue reproduce with the latest release?

Yes this is present in go1.13.14 and go1.14.6, though with slightly different shapes. When I ran with Go 1.13, the profile contained zero samples. When I ran with Go 1.14, the process took about 3 seconds longer to run (7.8s vs 5s) and included 2.2s worth of samples for the function of interest. (Though if this worked as I expect, the profiles would cover the same amount of time as the "time" shell built-in reported.)

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/rhys/go/pkg/mod"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/usr/local/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/49/zmds5zsn75z1283vtzxyfr5hj7yjq4/T/go-build464588215=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

On a linux/amd64 machine, I ran a program that does 5ms of work at a time on each of 1000 worker goroutines, and uses runtime.LockOSThread and runtime.Goexit to force each worker goroutine to be on a different OS thread.

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	profileName := flag.String("profile", "./hops.pb.gz", "File name for CPU profile")
	tickTime := flag.Duration("tick", 5*time.Millisecond, "How much work to run on each thread")
	iterations := flag.Int("iterations", 1000, "Number of threads to use for work")
	flag.Parse()

	pf, err := os.Create(*profileName)
	if err != nil {
		log.Fatalf("Create; err = %v", err)
	}
	defer func() {
		err := pf.Close()
		if err != nil {
			log.Fatalf("Close; err = %v", err)
		}
	}()
	err = pprof.StartCPUProfile(pf)
	if err != nil {
		log.Fatalf("StartCPUProfile; err = %v", err)
	}
	defer pprof.StopCPUProfile()

	tick := time.Tick(*tickTime)

	var p int64 = 1
	var wg sync.WaitGroup
	for i := 0; i < *iterations; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			// Have this code run on its own thread, force the runtime to
			// terminate the thread when this goroutine completes (so the next
			// run will be on a different thread).
			runtime.LockOSThread()
			defer runtime.Goexit()

			for {
				select {
				case <-tick:
					return
				default:
				}

				work(&p)
			}
		}()
		wg.Wait()
	}

	fmt.Printf("%d\n", p)
}

func work(p *int64) {
	for i := 0; i < 1e5; i++ {
		*p *= 3
	}
}

What did you expect to see?

I expected the CPU time reported by the shell's time built-in and the CPU time reported by go tool pprof to roughly match.

What did you see instead?

When doing 5ms of work on each of 1000 unique threads, the shell's time built-in reports 5103ms of user-space CPU time, and go tool pprof reports only 50ms of CPU time.

$ time ./hops.linux_amd64 -profile=./hops.pb.gz -tick=5ms -iterations=1000
-6036671839377369855

real	0m5.105s
user	0m5.103s
sys	0m0.085s

$ go tool pprof -top ./hops.pb.gz
File: hops.linux_amd64
Type: cpu
Time: Jul 27, 2020 at 2:04pm (PDT)
Duration: 5.10s, Total samples = 50ms ( 0.98%)
Showing nodes accounting for 50ms, 100% of 50ms total
      flat  flat%   sum%        cum   cum%
      30ms 60.00% 60.00%       30ms 60.00%  main.work
      10ms 20.00% 80.00%       10ms 20.00%  runtime.(*randomEnum).next (inline)
      10ms 20.00%   100%       10ms 20.00%  runtime.futex
         0     0%   100%       10ms 20.00%  main.main
         0     0%   100%       30ms 60.00%  main.main.func2
         0     0%   100%       10ms 20.00%  runtime.convT64
         0     0%   100%       10ms 20.00%  runtime.findrunnable
         0     0%   100%       10ms 20.00%  runtime.futexsleep
         0     0%   100%       10ms 20.00%  runtime.gcBgMarkStartWorkers
         0     0%   100%       10ms 20.00%  runtime.gcStart
         0     0%   100%       10ms 20.00%  runtime.main
         0     0%   100%       10ms 20.00%  runtime.mallocgc
         0     0%   100%       10ms 20.00%  runtime.mstart
         0     0%   100%       10ms 20.00%  runtime.mstart1
         0     0%   100%       10ms 20.00%  runtime.notetsleep_internal
         0     0%   100%       10ms 20.00%  runtime.notetsleepg
         0     0%   100%       10ms 20.00%  runtime.schedule

@cagedmantis cagedmantis added this to the Backlog milestone Jul 28, 2020
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 28, 2020
@cagedmantis
Copy link
Contributor

/cc @hyangah

@cagedmantis
Copy link
Contributor

I did not add this to the Go1.15 milestone since you've reported that the issue is reproducible in earlier versions of Go. This can change though.

@aalexand
Copy link
Contributor

aalexand commented Aug 9, 2020

I'm curious if we know why this issue happens? I could see this happening with per-thread CLOCK_THREAD_CPUTIME_ID-style timers (like done here), but Go runtime uses plain good gold setitimer with SIGPROF as far as I know? Does SIGPROF sampling only count in full scheduler ticks and doesn't carry the remainder forward to the next thread scheduled on the same CPU (which would be not entirely correct from the attribution point of view but would be more correct statistically).

@rhysh
Copy link
Contributor Author

rhysh commented Aug 21, 2020

I'm curious if we know why this issue happens?

I think it's the call from runtime.execute to runtime.setThreadCPUProfiler to setitimer(2); each M makes that call, and this test program ensures a steady supply of new Ms. It looks like the result of the setitimer syscall is to reset the count for the whole process. I filed #40963 for that flavor of the problem.

@rhysh
Copy link
Contributor Author

rhysh commented Aug 21, 2020

It looks like this is fixed in tip, likely thanks to CL 240003 / commit bd519d0. The pprof output claims "Total samples = 5.11s", which is a close match to the "user 0m5.106s" that the time shell built-in reports.

$ go-tip version ./hops
./hops: devel +9679b30733 Fri Aug 21 16:52:08 2020 +0000

$ time ./hops -tick=5ms -iterations=1000
-8662753420010608383

real	0m5.132s
user	0m5.106s
sys	0m0.097s

$ go tool pprof -top /tmp/hops.pb.gz 
File: hops
Type: cpu
Time: Aug 21, 2020 at 1:11pm (PDT)
Duration: 5.13s, Total samples = 5.11s (99.65%)
Showing nodes accounting for 4.99s, 97.65% of 5.11s total
Dropped 47 nodes (cum <= 0.03s)
      flat  flat%   sum%        cum   cum%
     4.87s 95.30% 95.30%      4.87s 95.30%  main.work
     0.06s  1.17% 96.48%      0.06s  1.17%  runtime.rtsigprocmask
     0.04s  0.78% 97.26%      0.04s  0.78%  runtime.runqgrab
     0.01s   0.2% 97.46%      0.05s  0.98%  runtime.runqsteal
     0.01s   0.2% 97.65%      0.12s  2.35%  runtime.schedule
         0     0% 97.65%      4.88s 95.50%  main.main.func2
         0     0% 97.65%      0.10s  1.96%  runtime.findrunnable
         0     0% 97.65%      0.05s  0.98%  runtime.minit
         0     0% 97.65%      0.04s  0.78%  runtime.minitSignalMask
         0     0% 97.65%      0.04s  0.78%  runtime.minitSignals
         0     0% 97.65%      0.20s  3.91%  runtime.mstart
         0     0% 97.65%      0.18s  3.52%  runtime.mstart1
         0     0% 97.65%      0.06s  1.17%  runtime.sigprocmask (inline)

@rhysh rhysh closed this as completed Aug 21, 2020
@golang golang locked and limited conversation to collaborators Aug 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants