cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

mpx · 2020-09-11T12:27:35Z

Using CPU profiling and go tool pprof attributes CPU to the instruction following the one that consumed the time. This can cause CPU time to be attributed to an incorrect line number. For comparison, the Linux perf command works correctly. The example below shows the time is spent in the for loop, when it is actually spent during expensive memory accesses.

AFAICT this issue has existed for a long time (at least since 1.4). I can't see this issue has been logged yet which is surprising. It occurs with recent toolchains as well:

go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64

Tested on: Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Here is a script which demonstrates the fault:

#!/bin/sh

cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof

import "testing"

var buf [1024 * 1024 * 1024]byte

func Benchmark(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Expensive memory accesses.
        buf[(2097169*i)%len(buf)] = 42
    }
}
EOF

go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat

And the output:

$ ./run.sh 
goos: linux
goarch: amd64
pkg: prof
Benchmark-8   	314821470	        11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]

=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
     6.34s      6.34s (flat, cum)   100% of Total
         .          .      4:import "testing"
         .          .      5:
         .          .      6:var buf [1024 * 1024 * 1024]byte
         .          .      7:
         .          .      8:func Benchmark(b *testing.B) {
     5.52s      5.52s      9:    for i := 0; i < b.N; i++ {
         .          .     10:        // Expensive memory accesses.
     820ms      820ms     11:        buf[(2097169*i)%len(buf)] = 42
         .          .     12:    }
         .          .     13:}

=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
     6.34s      6.34s (flat, cum)   100% of Total
         .          .     509500: SUBQ $0x18, SP                          ;prof_test.go:8
         .          .     509504: MOVQ BP, 0x10(SP)
         .          .     509509: LEAQ 0x10(SP), BP
         .          .     50950e: MOVQ 0x20(SP), DX                       ;prof_test.go:9
         .          .     509513: XORL AX, AX
         .          .     509515: JMP 0x509525
      30ms       30ms     509517: LEAQ prof.buf(SB), BX                   ;prof.Benchmark prof_test.go:11
     540ms      540ms     50951e: MOVB $0x2a, 0(BX)(SI*1)
     5.49s      5.49s     509522: INCQ AX                                 ;prof.Benchmark prof_test.go:9
      30ms       30ms     509525: CMPQ AX, 0x170(DX)
         .          .     50952c: JLE 0x509559                            ;prof_test.go:9
      10ms       10ms     50952e: IMULQ $0x200011, AX, BX                 ;prof.Benchmark prof_test.go:11
      20ms       20ms     509535: MOVQ BX, SI
      80ms       80ms     509538: SARQ $0x3f, BX
         .          .     50953c: SHRQ $0x22, BX                          ;prof_test.go:11
         .          .     509540: ADDQ SI, BX
      30ms       30ms     509543: SARQ $0x1e, BX                          ;prof.Benchmark prof_test.go:11
      90ms       90ms     509547: SHLQ $0x1e, BX
         .          .     50954b: SUBQ BX, SI                             ;prof_test.go:11
      20ms       20ms     50954e: CMPQ $prof.buf+1067102432(SB), SI       ;prof.Benchmark prof_test.go:11
         .          .     509555: JB 0x509517                             ;prof_test.go:11
         .          .     509557: JMP 0x509563
         .          .     509559: MOVQ 0x10(SP), BP                       ;prof_test.go:9
         .          .     50955e: ADDQ $0x18, SP
         .          .     509562: RET
         .          .     509563: MOVQ SI, AX                             ;prof_test.go:11
         .          .     509566: MOVL $prof.buf+1067102432(SB), CX
         .          .     50956b: CALL runtime.panicIndex(SB)

=== Linux perf correctly shows the "MOV" instruction is expensive. ===
 Percent |	Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
         :
         :
         :
         :            Disassembly of section .text:
         :
         :            0000000000509500 <prof.Benchmark>:
         :            prof.Benchmark():
         :
         :            import "testing"
         :
         :            var buf [1024 * 1024 * 1024]byte
         :
         :            func Benchmark(b *testing.B) {
    0.00 :   509500: sub    $0x18,%rsp
    0.00 :   509504: mov    %rbp,0x10(%rsp)
    0.00 :   509509: lea    0x10(%rsp),%rbp
         :            for i := 0; i < b.N; i++ {
    0.00 :   50950e: mov    0x20(%rsp),%rdx
    0.00 :   509513: xor    %eax,%eax
    0.00 :   509515: jmp    509525 <prof.Benchmark+0x25>
         :            // Expensive memory accesses.
         :            buf[(2097169*i)%len(buf)] = 42
    1.82 :   509517: lea    0x14ba02(%rip),%rbx        # 654f20 <prof.buf>
   92.32 :   50951e: movb   $0x2a,(%rbx,%rsi,1)
         :            for i := 0; i < b.N; i++ {
    0.08 :   509522: inc    %rax
    0.25 :   509525: cmp    %rax,0x170(%rdx)
    0.00 :   50952c: jle    509559 <prof.Benchmark+0x59>
         :            buf[(2097169*i)%len(buf)] = 42
    0.53 :   50952e: imul   $0x200011,%rax,%rbx
    1.74 :   509535: mov    %rbx,%rsi
    0.02 :   509538: sar    $0x3f,%rbx
    0.11 :   50953c: shr    $0x22,%rbx
    0.46 :   509540: add    %rsi,%rbx
    1.93 :   509543: sar    $0x1e,%rbx
    0.03 :   509547: shl    $0x1e,%rbx
    0.12 :   50954b: sub    %rbx,%rsi
    0.00 :   50954e: cmp    $0x40000000,%rsi
    0.61 :   509555: jb     509517 <prof.Benchmark+0x17>
    0.00 :   509557: jmp    509563 <prof.Benchmark+0x63>
         :            for i := 0; i < b.N; i++ {
    0.00 :   509559: mov    0x10(%rsp),%rbp
    0.00 :   50955e: add    $0x18,%rsp
    0.00 :   509562: retq
         :            buf[(2097169*i)%len(buf)] = 42
    0.00 :   509563: mov    %rsi,%rax
    0.00 :   509566: mov    $0x40000000,%ecx
    0.00 :   50956b: callq  46aae0 <runtime.panicIndex>
         :            ./<autogenerated>:1
    0.00 :   509570: nop

Cc @hyangah @randall77

The text was updated successfully, but these errors were encountered:

prattmic · 2020-09-11T15:09:36Z

This sounds like it is likely due to the proposal in #36821. Go's pprof profiling uses ITIMER_PROF, which can have skid of several instructions from the actual expensive instruction.

Linux's perf, on the other hand, can use Intel PEBS (/AMD equivalent) for precise sampling.

I don't think perf enables PEBS by default (-e cycles:pp is required), but even the standard hardware counter likely has less skid than ITIMER_PROF.

odeke-em · 2020-09-11T22:49:53Z

/cc @rhysh too.

prattmic · 2021-07-27T18:56:50Z

As I noted in #30708, we cannot correct the skid in Go except to use precise hardware counters, which #36821 would enable.

Regardless of the outcome of that proposal (not all systems will have precise counters), we could use better documentation about this limitation.

odeke-em changed the title ~~cmd/pprof: Pprof provides wrong address/line for CPU profiles~~ cmd/pprof: pprof provides wrong address/line for CPU profiles Sep 11, 2020

toothrot added the NeedsInvestigation label Sep 14, 2020

toothrot added this to the Backlog milestone Sep 14, 2020

prattmic mentioned this issue Jul 27, 2021

runtime: CPU instruction "NOP" utilizes ~50% CPU and pprof doesn't explain that #30708

Closed

prattmic added Documentation NeedsFix and removed NeedsInvestigation labels Jul 27, 2021

lars-t-hansen mentioned this issue Sep 9, 2024

Implement concurrent filtering and postprocessing NAICNO/Jobanalyzer#526

Closed

6 tasks

gabyhelp mentioned this issue Nov 28, 2024

runtime/pprof: shows invalid instruction addresses #70609

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

mpx commented Sep 11, 2020

prattmic commented Sep 11, 2020

odeke-em commented Sep 11, 2020

prattmic commented Jul 27, 2021

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

Comments

mpx commented Sep 11, 2020

prattmic commented Sep 11, 2020

odeke-em commented Sep 11, 2020

prattmic commented Jul 27, 2021