Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing: writeProfiles is not called after panic #65129

Open
felixge opened this issue Jan 17, 2024 · 8 comments
Open

testing: writeProfiles is not called after panic #65129

felixge opened this issue Jan 17, 2024 · 8 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@felixge
Copy link
Contributor

felixge commented Jan 17, 2024

Go version

go1.21.4

Output of go env in your module/workspace:

GO111MODULE='auto'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/user/Library/Caches/go-build'
GOENV='/Users/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS='-count=1'
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/felixge/go/pkg/mod'
GONOPROXY='github.com/Org'
GONOSUMDB='github.com/Org,go.orgbuild.io'
GOOS='darwin'
GOPATH='/Users/user/go'
GOPRIVATE='github.com/Org'
GOPROXY='binaries.orgbuild.io,proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.21.4/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.21.4/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/user/go/src/github.com/felixge/dump/trace-panic/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/0t/nwrdnfnd2mjdgr0z_p_f6xww0000gq/T/go-build2519625148=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Debug a test that hit a panic using go test -trace. Below is a greatly simplified example that reproduces the problem. Run it with go test -trace go.trace.

package main

import (
	"context"
	"fmt"
	"runtime/trace"
	"testing"
)

func TestTracePanic(t *testing.T) {
    // This simulates some trace buffers getting flushed before the test panics.
	for i := 0; i < 100000; i++ {
		trace.Log(context.Background(), "log", fmt.Sprintf("%d", i))
	}
	panic("panic")
}

What did you see happen?

$ go tool trace go.trace 
2024/01/17 10:56:02 Parsing trace...
failed to parse trace: no EvFrequency event

What did you expect to see?

A valid trace file that I can open.

Additional Thoughts

This reproduces with tip.

The problem seems to be that the after() func that is supposed to call writeProfiles doesn't get called when a test panics because tRunner() runs in a different goroutine.

@gopherbot
Copy link

Change https://go.dev/cl/556255 mentions this issue: testing: call writeProfile after panic

@felixge
Copy link
Contributor Author

felixge commented Jan 17, 2024

cc @prattmic

@cagedmantis
Copy link
Contributor

@bcmills @golang/runtime

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 17, 2024
@cagedmantis cagedmantis added this to the Backlog milestone Jan 17, 2024
@bcmills bcmills changed the title cmd/test: writeProfiles is not called after panic testing: writeProfiles is not called after panic Jan 17, 2024
@bcmills
Copy link
Contributor

bcmills commented Jan 18, 2024

I suspect that this could be fixed by calling m.after within tRunner when it is exiting by panic, more-or-less here:
https://cs.opensource.google/go/go/+/master:src/testing/testing.go;l=1601-1606;drc=97daa6e94296980b4aa2dac93a938a5edd95ce93

But then the question is: how is tRunner supposed to get the M? It is normally at a higher level than a T or even a common. I suppose we could stick it in a func() field in one of those structs, but the need for that suggests a deeper dependency-inversion problem that we ought to think about how to solve.

@felixge
Copy link
Contributor Author

felixge commented Jan 18, 2024

@bcmills i implemented your suggestion (not at exactly at the same line, but that was a mistake done in haste) and it seems to work well. PTAL and let me know if you think the func() field is good enough or if the dependency inversion problem has you concerned.

@felixge
Copy link
Contributor Author

felixge commented Feb 25, 2024

I just noticed that the fix for #65319 has fixed the problem reported in this issue for execution traces . In fact, it even fixed the problem for the case where an arbitrary goroutine panics. My approach in CL 556255 would have only worked for panics in the goroutine calling the TestXXX function.

This raises the question if CL 556255 should proceed as is. I see three options:

  1. Modify the CL to take a similar approach where the runtime flushes the profiles on panics.
  2. Abandon it.
  3. Continue it as-is.

I don't think option 1 is workable. At the moment of a panic, the runtime is only aware of the fact that the CPU profile could be flushed. It has no idea that the testing package would also like to take snapshots of the alloc/heap, mutex, block and goroutine profiles.

Option 2 is worth considering. Execution traces are by far the most valuable data source you'd like to have when a Go program crashes. And arguably they also contain almost everything one would hope to find in a CPU, block, mutex or goroutine profile. However, they don't have any information about memory beyond heap size and heap target. So IMO there is still value in getting the alloc/heap profile written out when a test crashes.

Based on the above, I will go with option 3. However, I would not be heart broken if we decide to go with option 2 in order to keep the testing package simpler and b/c my CL doesn't work for arbitrary goroutines panics.

@bcmills
Copy link
Contributor

bcmills commented Feb 26, 2024

In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.

Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.

So I think in the interest of avoiding unnecessary complexity in the testing package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.

@felixge
Copy link
Contributor Author

felixge commented Feb 27, 2024

In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.

Linux uses the KILL signal to implement OOM killing. It gives the process no chance to dump any kind of diagnostics data. It's not even possible to get a core dump. It's a sad state of affairs 😞.

Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.

Funny you mention this. The original problem that caused me to file this issue was a test that was timing out. This test didn't properly clean up after itself ... causing another test in the package to panic further down the road. This was all happening in CI and not locally reproducible. Hence the need for execution traces and other diagnostics data from the runtime.

So I think in the interest of avoiding unnecessary complexity in the testing package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.

My original issue was solved using execution tracing. We modified CI to only run the test that was timing out, which allowed us to capture a trace. This workaround is no longer needed, thanks to #65319. The remaining problems here are hypothetical for me, so I don't feel strongly enough about them to lobby you into accepting more complexity in the testing package.

That being said, if we close this, let's make it clear that this is only a partial duplicate of #65319. Writing out cpu/heap/mutex/goroutine profiles on test panics is still unique to this issue and doesn't overlap with #65319.

@bcmills bcmills modified the milestones: Backlog, Unplanned Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants