Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: pprof tests failing with "missing cpuHog in pprof output" #26369

Closed
jcajka opened this issue Jul 13, 2018 · 20 comments
Closed

runtime: pprof tests failing with "missing cpuHog in pprof output" #26369

jcajka opened this issue Jul 13, 2018 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@jcajka
Copy link
Contributor

jcajka commented Jul 13, 2018

What version of Go are you using (go version)?

master branch
devel +8a330454dc Fri Jul 13 03:53:00 2018 +0000

Does this issue reproduce with the latest release?

Not observed with go1.10, but observed with 1.11beta1

What operating system and processor architecture are you using (go env)?

amd64/linux

What did you do?

./all.bash build of Go

What did you expect to see?

All tests passing

### What did you see instead?
--- FAIL: TestCgoPprofThread (4.90s)
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/go-build359347223/testprogcgo_.exe /tmp/prof112190479]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.47ms, Total samples = 40ms (19.95%)
        Showing nodes accounting for 30ms, 75.00% of 40ms total
        Showing top 1 nodes out of 4
              flat  flat%   sum%        cum   cum%
              30ms 75.00% 75.00%       30ms 75.00%  main.CrashTraceback
    crash_cgo_test.go:315: missing cpuHog in pprof output
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/prof112190479]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.47ms, Total samples = 40ms (19.95%)
        Showing nodes accounting for 30ms, 75.00% of 40ms total
        Showing top 1 nodes out of 4
              flat  flat%   sum%        cum   cum%
              30ms 75.00% 75.00%       30ms 75.00%  main.CrashTraceback
    crash_cgo_test.go:315: missing cpuHog in pprof output
--- FAIL: TestCgoPprofThreadNoTraceback (5.58s)
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/go-build359347223/testprogcgo_.exe /tmp/prof122578037]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 1.11s, Total samples = 440ms (39.65%)
        Showing nodes accounting for 440ms, 100% of 440ms total
        Showing top 1 nodes out of 2
              flat  flat%   sum%        cum   cum%
             440ms   100%   100%      440ms   100%  main.CrashTraceback
    crash_cgo_test.go:315: missing cpuHog in pprof output
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/prof122578037]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 1.11s, Total samples = 440ms (39.65%)
        Showing nodes accounting for 440ms, 100% of 440ms total
        Showing top 1 nodes out of 2
              flat  flat%   sum%        cum   cum%
             440ms   100%   100%      440ms   100%  main.CrashTraceback
    crash_cgo_test.go:315: missing cpuHog in pprof output
--- FAIL: TestCgoPprof (20.98s)
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/go-build359347223/testprogcgo_.exe /tmp/prof381107200]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.53ms, Total samples = 20ms ( 9.97%)
        Showing nodes accounting for 20ms, 100% of 20ms total
        Showing top 1 nodes out of 6
              flat  flat%   sum%        cum   cum%
              20ms   100%   100%       20ms   100%  main.init.16
    crash_cgo_test.go:315: missing cpuHog in pprof output
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/prof381107200]:
        File: testprogcgo_.exe
        Build ID: d80f67ea93448e16a62f5c8998bf883aca7a8779
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.53ms, Total samples = 20ms ( 9.97%)
        Showing nodes accounting for 20ms, 100% of 20ms total
        Showing top 1 nodes out of 6
              flat  flat%   sum%        cum   cum%
              20ms   100%   100%       20ms   100%  main.init.16
    crash_cgo_test.go:315: missing cpuHog in pprof output
--- FAIL: TestCgoPprofPIE (21.07s)
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/go-build359347223/testprogcgo_-buildmode=pie.exe /tmp/prof942946892]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 486e881cc024130f0bb2c163ef38dceb8d66fd9b
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.70ms, Total samples = 20ms ( 9.97%)
        Showing nodes accounting for 20ms, 100% of 20ms total
        Showing top 1 nodes out of 6
              flat  flat%   sum%        cum   cum%
              20ms   100%   100%       20ms   100%  runtime.reentersyscall
    crash_cgo_test.go:315: missing cpuHog in pprof output
    crash_cgo_test.go:311: [/builddir/build/BUILD/go-8a330454dc1502091dba2c890f5f6d0c095034de/bin/go tool pprof -top -nodecount=1 /tmp/prof942946892]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 486e881cc024130f0bb2c163ef38dceb8d66fd9b
        Type: cpu
        Time: Jul 13, 2018 at 3:08pm (CEST)
        Duration: 200.70ms, Total samples = 20ms ( 9.97%)
        Showing nodes accounting for 20ms, 100% of 20ms total
        Showing top 1 nodes out of 6
              flat  flat%   sum%        cum   cum%
              20ms   100%   100%       20ms   100%  runtime.reentersyscall
    crash_cgo_test.go:315: missing cpuHog in pprof output
FAIL
FAIL	runtime	46.604s

in KVM VM(Fedora, host laptop/Fedora, also observed in the Fedora build system, x86_64 and armv7 VMs), although it seems that all test pass fine while being run on bare metal machine(laptop/Fedora). I'm still investigating what might be different and will do the bisect, etc....

Hm... this looks like possible dup of #18856

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 13, 2018
@ianlancetaylor ianlancetaylor modified the milestones: Go1.12, Go1.11 Jul 13, 2018
@ianlancetaylor
Copy link
Contributor

Looks like PC values that are in C code are somehow being misattributed to Go code. Perhaps the C symbols are simply missing? But I don't understand why it would matter whether this is running in a VM or not. Is something about the VM environment causing the C symbol table to be stripped?

@jcajka
Copy link
Contributor Author

jcajka commented Jul 16, 2018

I have bit missed the version of Fedora I have been running. I have thought it has been F27 in the VM same as on my host, but it has been rawhide/F29. Sorry for the confusion.

It seems that this started happening with latest binutils in F29, that is 2.30.90(AFAIK pre-release 2.31). Still digging deeper.

@jcajka
Copy link
Contributor Author

jcajka commented Jul 17, 2018

Poking the runtime/testdata/testprogcgo binary with objectdmp I haven noted any obvious issue with it. Only notable difference that stands out to me is that with the latest binutils there are more segments otherwise i don't see any notable difference(C symbols are there). Could it be issue in (revers) function name lookup?
With 2.29:

Program Header:
    PHDR off    0x0000000000000040 vaddr 0x0000000000400040 paddr 0x0000000000400040 align 2**3
         filesz 0x00000000000002a0 memsz 0x00000000000002a0 flags r-x
  INTERP off    0x00000000000002e0 vaddr 0x00000000004002e0 paddr 0x00000000004002e0 align 2**0
         filesz 0x000000000000001c memsz 0x000000000000001c flags r--
    LOAD off    0x0000000000000000 vaddr 0x0000000000400000 paddr 0x0000000000400000 align 2**21
         filesz 0x0000000000188270 memsz 0x0000000000188270 flags r-x
    LOAD off    0x0000000000188de8 vaddr 0x0000000000788de8 paddr 0x0000000000788de8 align 2**21
         filesz 0x00000000000165b8 memsz 0x0000000000035ad0 flags rw-
 DYNAMIC off    0x0000000000188e08 vaddr 0x0000000000788e08 paddr 0x0000000000788e08 align 2**3
         filesz 0x00000000000001e0 memsz 0x00000000000001e0 flags rw-
    NOTE off    0x00000000000002fc vaddr 0x00000000004002fc paddr 0x00000000004002fc align 2**2
         filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
    NOTE off    0x0000000000000320 vaddr 0x0000000000400320 paddr 0x0000000000400320 align 2**3
         filesz 0x0000000000000030 memsz 0x0000000000000030 flags r--
    NOTE off    0x0000000000000350 vaddr 0x0000000000400350 paddr 0x0000000000400350 align 2**2
         filesz 0x0000000000000088 memsz 0x0000000000000088 flags r--
     TLS off    0x0000000000188de8 vaddr 0x0000000000788de8 paddr 0x0000000000788de8 align 2**3
         filesz 0x0000000000000000 memsz 0x0000000000000008 flags r--
EH_FRAME off    0x0000000000186ab8 vaddr 0x0000000000586ab8 paddr 0x0000000000586ab8 align 2**2
         filesz 0x00000000000004ac memsz 0x00000000000004ac flags r--
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
   RELRO off    0x0000000000188de8 vaddr 0x0000000000788de8 paddr 0x0000000000788de8 align 2**0
         filesz 0x0000000000000218 memsz 0x0000000000000218 flags r--

With 2.31:

Program Header:
    PHDR off    0x0000000000000040 vaddr 0x0000000000400040 paddr 0x0000000000400040 align 2**3
         filesz 0x0000000000000310 memsz 0x0000000000000310 flags r--
  INTERP off    0x0000000000000350 vaddr 0x0000000000400350 paddr 0x0000000000400350 align 2**0
         filesz 0x000000000000001c memsz 0x000000000000001c flags r--
    LOAD off    0x0000000000000000 vaddr 0x0000000000400000 paddr 0x0000000000400000 align 2**12
         filesz 0x00000000000025e8 memsz 0x00000000000025e8 flags r--
    LOAD off    0x0000000000003000 vaddr 0x0000000000403000 paddr 0x0000000000403000 align 2**12
         filesz 0x00000000000bf4c1 memsz 0x00000000000bf4c1 flags r-x
    LOAD off    0x00000000000c3000 vaddr 0x00000000004c3000 paddr 0x00000000004c3000 align 2**12
         filesz 0x00000000000c6830 memsz 0x00000000000c6830 flags r--
    LOAD off    0x0000000000189de8 vaddr 0x000000000058ade8 paddr 0x000000000058ade8 align 2**12
         filesz 0x00000000000165b8 memsz 0x0000000000035ad0 flags rw-
 DYNAMIC off    0x0000000000189e08 vaddr 0x000000000058ae08 paddr 0x000000000058ae08 align 2**3
         filesz 0x00000000000001e0 memsz 0x00000000000001e0 flags rw-
    NOTE off    0x000000000000036c vaddr 0x000000000040036c paddr 0x000000000040036c align 2**2
         filesz 0x0000000000000020 memsz 0x0000000000000020 flags r--
    NOTE off    0x0000000000000390 vaddr 0x0000000000400390 paddr 0x0000000000400390 align 2**3
         filesz 0x0000000000000030 memsz 0x0000000000000030 flags r--
    NOTE off    0x00000000000003c0 vaddr 0x00000000004003c0 paddr 0x00000000004003c0 align 2**2
         filesz 0x0000000000000088 memsz 0x0000000000000088 flags r--
     TLS off    0x0000000000189de8 vaddr 0x000000000058ade8 paddr 0x000000000058ade8 align 2**3
         filesz 0x0000000000000000 memsz 0x0000000000000008 flags r--
EH_FRAME off    0x0000000000188078 vaddr 0x0000000000588078 paddr 0x0000000000588078 align 2**2
         filesz 0x00000000000004ac memsz 0x00000000000004ac flags r--
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
   RELRO off    0x0000000000189de8 vaddr 0x000000000058ade8 paddr 0x000000000058ade8 align 2**0
         filesz 0x0000000000000218 memsz 0x0000000000000218 flags r--

@ianlancetaylor This already exceeds my current knowledge (of pprof), do you have any pointers in which direction to look?

@ianlancetaylor
Copy link
Contributor

Having multiple PT_LOAD segments is very strange and may be a bug, but I can't see why it would have this effect.

Is it still true that this problem only arises when using a VM?

@ianlancetaylor
Copy link
Contributor

For the record, I tried using GNU binutils tip on my Debian system, and everything passed.

@jcajka
Copy link
Contributor Author

jcajka commented Jul 18, 2018

No it happens every time I use the latest(2.31/2.30.90 in Fedora) binutils. I will try to bisect when it started binutils version wise.

@jcajka
Copy link
Contributor Author

jcajka commented Jul 18, 2018

Once more confirmed that this started with pre-release 2.31 binutils.

@ianlancetaylor
Copy link
Contributor

I'm sorry, I have no idea at all. I guess the next step would be to try to find out whether the problem is in profile created by the Go program or whether it's in the symbolization done by pprof. I would tentatively guess the latter because it seems more likely to be affected by a binutils change, but I don't actually understand how that could be.

@ALTree
Copy link
Member

ALTree commented Jul 23, 2018

To add another data point, I'm seeing this when running all.bash on my Debian machine (not a VM).

Linux 4.16.0-2-amd64 #1 SMP Debian 4.16.12-1 (2018-05-27) x86_64 GNU/Linux

Binutils version is GNU Binutils for Debian 2.30.90.20180710.

@ianlancetaylor
Copy link
Contributor

Just a note that I still can't recreate this. It would help a great deal if someone who can recreate the problem can find some pointer to a possible problem.

If go test runtime fails, it should be possible to recreate the problem by doing

cd runtime/testdata/testprogcgo
go build
./testprogcgo CgoPprof
# that will print the name of a file name in $TMPDIR
go tool pprof -top -nodecount=1 TMPDIR_FILE_NAME

That go tool pprof line should show that the "top 1 node" is cpuHog. We need to find out why it doesn't.

Thanks.

@ALTree
Copy link
Member

ALTree commented Jul 24, 2018

Seems that 100% of the exec time is assigned to main.init.16

$ gotip tool pprof -top prof713363986 
File: testprogcgo
Build ID: 214e80bf3237fa07675130b15b0c38529236cd2c
Type: cpu
Time: Jul 24, 2018 at 9:53am (CEST)
Duration: 201.20ms, Total samples = 20ms ( 9.94%)
Showing nodes accounting for 20ms, 100% of 20ms total
      flat  flat%   sum%        cum   cum%
      20ms   100%   100%       20ms   100%  main.init.16
         0     0%   100%       20ms   100%  _cgoexp_0c3b74faa2ea_GoCheckM
         0     0%   100%       20ms   100%  os/exec.(*Cmd).Run
         0     0%   100%       20ms   100%  os/exec.(*Cmd).Start
         0     0%   100%       20ms   100%  runtime.cgocall
         0     0%   100%       20ms   100%  runtime.panicindex

Also only 20ms of samples? cpuHog is not there at all.

@ianlancetaylor
Copy link
Contributor

Thanks.

I also see 20ms of samples. The problem is that pprof is reporting main.init.16 when it should be reporting cpuHog.

Look at the output of readelf -s --wide testprogcgo. You should see the symbols cpuHog and main.init.16 among many others. What are the lines output by readelf for those two symbols?

@ALTree
Copy link
Member

ALTree commented Jul 24, 2018

$ readelf -s --wide testprogcgo | grep main.init.16
  4040: 00000000004c2060   616 FUNC    LOCAL  DEFAULT   14 main.init.16

$ readelf -s --wide testprogcgo | grep cpuHog
   136: 00000000004c5097    86 FUNC    GLOBAL DEFAULT   14 cpuHog
   140: 00000000004c5824    86 FUNC    GLOBAL DEFAULT   14 cpuHogThread
   147: 00000000004c5147    33 FUNC    GLOBAL DEFAULT   14 _cgo_0c3b74faa2ea_Cfunc_cpuHog
    59: 00000000005be278     8 OBJECT  LOCAL  DEFAULT   28 main._cgo_0c3b74faa2ea_Cfunc_cpuHog
  3973: 00000000004bdc30    82 FUNC    LOCAL  DEFAULT   14 main._Cfunc_cpuHog
  4135: 00000000005f11c0     4 OBJECT  LOCAL  DEFAULT   30 cpuHogCount
  4162: 00000000005f11c8     4 OBJECT  LOCAL  DEFAULT   30 cpuHogThreadCount
  4163: 00000000004c58d7    24 FUNC    LOCAL  DEFAULT   14 cpuHogDriver
  4247: 00000000004c5147    33 FUNC    GLOBAL DEFAULT   14 _cgo_0c3b74faa2ea_Cfunc_cpuHog
  4294: 00000000004c5824    86 FUNC    GLOBAL DEFAULT   14 cpuHogThread
  4297: 00000000004c5097    86 FUNC    GLOBAL DEFAULT   14 cpuHog

@ianlancetaylor
Copy link
Contributor

Thanks. My only guess is that somehow pprof is unable to read the C symbol table, although that doesn't make any sense. Could you attach the testprogcgo and /tmp/profNNNN files here?

@ALTree
Copy link
Member

ALTree commented Aug 2, 2018

Binary and prof archive here: issue-26369.zip

@hyangah
Copy link
Contributor

hyangah commented Aug 3, 2018

Not only C symbol table. See the location list in the 'raw' command output. All symbol lookups are done incorrectly.

(pprof) raw
PeriodType: cpu nanoseconds
Period: 10000000
Time: 2018-08-02 04:31:31.975312051 -0400 EDT
Duration: 201.
Samples:
samples/count cpu/nanoseconds
          2   20000000: 1 2 3 4 5 6 
Locations
     1: 0x4c50d1 M=1 main.init.16 /home/alberto/go/src/runtime/testdata/testprogcgo/threadpprof.go:81 s=0
     2: 0x40612d M=1 runtime.cgocall /home/alberto/go/src/runtime/cgocall.go:128 s=0
     3: 0x4bdcd0 M=1 os/exec.(*Cmd).Start /home/alberto/go/src/os/exec/exec.go:383 s=0
     4: 0x4c15fd M=1 _cgoexp_0c3b74faa2ea_GoCheckM :-1 s=0
     5: 0x4bd928 M=1 os/exec.(*Cmd).Run /home/alberto/go/src/os/exec/exec.go:309 s=0
     6: 0x42dc96 M=1 runtime.panicindex /home/alberto/go/src/runtime/panic.go:44 s=0
Mappings
1: 0x403000/0x4c7000/0x3000 /tmp/testprogcgo 3218b6a46c85f75188b2c89579e4cff936288bfa [FN][FL][LN][IN]

pprof's symbol lookup is done by

func (f *file) SourceLine(addr uint64) ([]driver.Frame, error) {

Mappings section of the above output indicates non-zero offset. (Start:0x403000, Limit:0x4c7000, Offset:0x3000). The symbol lookup code tries to adjust the requested address with the offset

addr -= f.offset

But it looks like the addresses in the profile are already adjusted, so we ended up with garbage.

$ addr2line -e /tmp/testprogcgo -f 0x4c50d1
cpuHog
/home/alberto/go/src/runtime/testdata/testprogcgo/pprof.go:19

# 0x4c50d1 - 0x3000 = 0x4c20d1
$ addr2line -e /tmp/testprogcgo -f 0x4c20d1
main.init.16
/home/alberto/go/src/runtime/testdata/testprogcgo/threadpprof.go:81

@aalexand @rauls5382
Upstream pprof seems to have the same issue even though upstream pprof doesn't use this code.

@ianlancetaylor
Copy link
Contributor

@hyangah Thanks for the analysis. It is indeed due to the additional PT_LOAD segments seen in this executable, though I still don't know where they are coming from. I have a patch.

@gopherbot
Copy link

Change https://golang.org/cl/127895 mentions this issue: cmd/internal/objfile: only consider executable segments for load address

@ianlancetaylor
Copy link
Contributor

@ALTree Thanks for sending the files.

@aalexand
Copy link
Contributor

aalexand commented Aug 4, 2018

though I still don't know where they are coming from

@ianlancetaylor It's probably related to -z separate-code / --enable-separate-code:

...
    3 Changes in 2.31:
    4 
...
    7 * Add a configure option --enable-separate-code to decide whether
    8   -z separate-code should be enabled in ELF linker by default.  Default
    9   to yes for Linux/x86 targets.  Note that -z separate-code can increase
   10   disk and memory size.
   11 
   12 Changes in 2.30:
   13 
   14 * Add -z separate-code to generate separate code PT_LOAD segment.
   15 
...

, see https://fossies.org/linux/binutils/ld/NEWS. From this, sounds like the motivation for that change is "to avoid mixing code pages with data to improve cache performance as well as security".

metux pushed a commit to oss-qm/go that referenced this issue Nov 8, 2018
Reportedly on some new Fedora systems the linker is producing extra
load segments, basically making the dynamic section non-executable.
We were assuming that the first load segment could be used to
determine the program's load offset, but that is no longer true.
Use the first executable load segment instead.

Fixes golang#26369

Change-Id: I5ee31ddeef2e8caeed3112edc5149065a6448456
Reviewed-on: https://go-review.googlesource.com/127895
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@golang golang locked and limited conversation to collaborators Aug 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants