Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "invalid traceback origin" in TestCgoPprofThread on linux/amd64 #43174

Closed
bcmills opened this issue Dec 14, 2020 · 7 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Dec 14, 2020

2020-12-11T17:26:14-1341a3d/linux-amd64-wsl

--- FAIL: TestCgoPprofThread (0.76s)
    crash_cgo_test.go:335: [/tmp/workdir-host-linux-amd64-wsl/go/bin/go tool pprof -traces /tmp/workdir-host-linux-amd64-wsl/tmp/go-build1643386062/testprogcgo_.exe /tmp/workdir-host-linux-amd64-wsl/tmp/prof2131107081]:
        File: testprogcgo_.exe
        Build ID: 5bebc0f29d333075b2d2cb6ad739468825910989
        Type: cpu
        Time: Dec 12, 2020 at 1:35am (CST)
        Duration: 200.36ms, Total samples = 110ms (54.90%)
        -----------+-------------------------------------------------------
              10ms   cpuHogThread
                     cpuHogThread2
                     runtime.cgocall
                     main._Cfunc_runCPUHogThread
                     main.pprofThread
                     main.CgoPprofThread
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
             100ms   cpuHogThread
                     cpuHogThread2
        -----------+-------------------------------------------------------
    crash_cgo_test.go:347: invalid traceback origin: got=[cpuHogThread cpuHogThread2 runtime.cgocall main._Cfunc_runCPUHogThread main.pprofThread main.CgoPprofThread main.main runtime.main]; want=[cpuHogThread ... cpuHogThread2]
    crash_cgo_test.go:335: [/tmp/workdir-host-linux-amd64-wsl/go/bin/go tool pprof -traces /tmp/workdir-host-linux-amd64-wsl/tmp/prof2131107081]:
        File: testprogcgo_.exe
        Build ID: 5bebc0f29d333075b2d2cb6ad739468825910989
        Type: cpu
        Time: Dec 12, 2020 at 1:35am (CST)
        Duration: 200.36ms, Total samples = 110ms (54.90%)
        -----------+-------------------------------------------------------
              10ms   cpuHogThread
                     cpuHogThread2
                     runtime.cgocall
                     main._Cfunc_runCPUHogThread
                     main.pprofThread
                     main.CgoPprofThread
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
             100ms   cpuHogThread
                     cpuHogThread2
        -----------+-------------------------------------------------------
    crash_cgo_test.go:347: invalid traceback origin: got=[cpuHogThread cpuHogThread2 runtime.cgocall main._Cfunc_runCPUHogThread main.pprofThread main.CgoPprofThread main.main runtime.main]; want=[cpuHogThread ... cpuHogThread2]
FAIL
FAIL	runtime	21.528s

CC @hyangah @aclements @ianlancetaylor @mengzhuo

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 14, 2020
@bcmills bcmills added this to the Backlog milestone Dec 14, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Sep 2, 2021

@bcmills
Copy link
Contributor Author

bcmills commented Dec 2, 2021

The failure rate seems to be way up since the 1.18 cycle opened, and the failures are also now affecting other linux-amd64 builders. Marking as release-blocker for Go 1.18 as a likely 1.18 regression.

greplogs --dashboard -md -l -e 'FAIL: TestCgoPprofThread .*\n(?:\s+.+\n)*.*invalid traceback origin' --since=2021-09-03

2021-11-29T00:57:09-78af02e/linux-amd64-longtest
2021-11-16T17:13:33-29ec902/linux-amd64-jessie
2021-11-11T11:00:33-a01a6d6/linux-amd64
2021-10-28T22:38:00-af05d8b/linux-amd64-fedora
2021-10-11T15:28:50-702e337/linux-amd64-wsl
2021-10-09T01:04:29-e1c294a/linux-amd64-fedora
2021-10-06T19:45:35-e38ec96/linux-amd64-longtest
2021-10-05T19:15:03-ae83301/linux-amd64-staticlockranking
2021-09-15T03:29:39-9fc2889/linux-amd64-wsl
2021-09-13T18:02:42-e74e363/linux-amd64-wsl
2021-09-04T19:33:54-28dae3d/linux-amd64-wsl

@bcmills bcmills changed the title runtime: "invalid traceback origin" in TestCgoPprofThread on linux-amd64-wsl builder runtime: "invalid traceback origin" in TestCgoPprofThread on linux/amd64 Dec 2, 2021
@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker labels Dec 2, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Dec 2, 2021
@mknyszek
Copy link
Contributor

mknyszek commented Dec 6, 2021

The uptick may be related to recent changes to how CPU profiles are obtained. CC @rhysh @prattmic ?

@rhysh
Copy link
Contributor

rhysh commented Dec 7, 2021

The timer_create changes landed on 2021-09-27. It looks like the uptick began earlier, near the start of September.

From reading the test, its exact goal isn't clear to me. It looks for stacks in the profile with cpuHogThread at the top. From that set of matching stacks, it considers whichever appears in the text-format profile first. On that single stack, it then checks whether cpuHogThread2 is at the bottom.

The failures I've spot-checked have had two unique stacks with cpuHogThread at the top: one has cpuHogThread2 at the bottom, the other has additional frames below. Is the goal of the test to confirm that [cpuHogThread ... cpuHogThread2] appears somewhere in the profile (probably yes, with the current design)? Or to confirm that whenever cpuHogThread is at the top, that nothing other than cpuHogThread2 is at the bottom?

My first guess was that there should be only one unique stack. (If the test looked for that, it might fail more reliably.) But the cgo traceback is faked:

// pprofCgoThreadTraceback is passed to runtime.SetCgoTraceback.
// For testing purposes it pretends that all CPU hits in C code are in cpuHog.
void pprofCgoThreadTraceback(void* parg) {
struct cgoTracebackArg* arg = (struct cgoTracebackArg*)(parg);
arg->buf[0] = (uintptr_t)(cpuHogThread) + 0x10;
arg->buf[1] = (uintptr_t)(cpuHogThread2) + 0x4;
arg->buf[2] = 0;
}

The goroutine that starts the profile makes a cgo call to create a C-owned thread to burn CPU time. If the Go-owned thread gets a SIGPROF while it's in C code (in the process of creating the C-owned thread), that probably results in the longer stack seen in the failing tests. From what I see, the failures don't indicate a problem for production code.

if err := pprof.StartCPUProfile(f); err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(2)
}
C.runCPUHogThread()

So it looks to me like either:

  • (probably) the test should change to pass if any of the stacks match [cpuHogThread ... cpuHogThread2]
  • or (maybe) the test should change to add a label to the goroutine that makes the first call into cgo, filter the profile to remove any samples that include that label, and then look for all of the stacks with cpuHogThread at the top to have cpuHogThread2 at the bottom.

@gopherbot
Copy link

Change https://golang.org/cl/370135 mentions this issue: runtime: fix flake in TestCgoPprofThread

@mknyszek
Copy link
Contributor

mknyszek commented Dec 7, 2021

@rhysh Oops. You're totally right. Sorry for pinging you on this, I should've checked the timeline of the pprof changes you made. In any case, thank you so much for the fix!

@ianlancetaylor
Copy link
Contributor

The cpuHogThread2 stuff was added in https://golang.org/cl/151917 for #29034. I think that ensuring that both cpuHogThread and cpuHogThread2 appear together will suffice for that issue.

@golang golang locked and limited conversation to collaborators Dec 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker
Projects
None yet
Development

No branches or pull requests

5 participants