
runtime/pprof: TestProfilerStackDepth/mutex failures #68562

Closed
gopherbot opened this issue Jul 24, 2024 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

@gopherbot
Contributor

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
	running tests:
		TestProfilerStackDepth (2m44s)
		TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
	C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

gopherbot added the NeedsInvestigation label Jul 24, 2024
@gopherbot
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"
2024-07-23 21:29 gotip-windows-386 go@c18ff292 runtime/pprof.TestProfilerStackDepth/mutex [ABORT] (log)
=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
	running tests:
		TestProfilerStackDepth (2m44s)
		TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
	C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

gopherbot added the compiler/runtime label Jul 24, 2024
mknyszek self-assigned this Jul 31, 2024
mknyszek added this to the Go1.24 milestone Jul 31, 2024
@mknyszek
Contributor

mknyszek commented Jul 31, 2024

I spent some time looking into this failure, and it's clearly some kind of GC deadlock. Notably, the goroutine trying to write out the mutex profile calls into malloc, then into GC assists, and gets stuck somewhere in assists. Unfortunately it's unclear where. Some candidates include:

  • Trying to scan a goroutine that is not preemptible (and isn't exiting from a non-preemptible section, so possibly some kind of mutual deadlock). I checked to make sure self-scan is safe, and AFAICT it is: assisting goroutines are marked as waiting after switching to the system stack. The self-scan logic in markroot is also correct, AFAICT. There's also no other running goroutine in the stack dump (and certainly not anything that's non-preemptible).
  • The goroutine is blocked on a lock, like the assist queue lock, which somebody else is holding. This seems unlikely since there aren't any other goroutines blocked on a lock in the stack dump.

The lack of any smoking guns in the stack dump suggests to me that sysmon may be involved. Or, it's some kind of memory corruption. I also wonder if this is a platform-specific failure; windows-386 is a bit funky.

A final note: all the GC mark workers in the stack dump are parked. This makes some sense -- the GC is not done, but they can't find any work to do, because in 2 minutes they almost certainly found everything they could aside from wherever that one goroutine is stuck.

@mknyszek
Contributor

Actually, the assist queue lock theory is not possible. We know for a fact we're in gcAssistAlloc1, since we're on the system stack and the line numbers match what we'd see if we entered the system stack there.

So, the likely explanation is difficulty preempting some goroutine (I can't think of any other reason we'd block so long...).

@mknyszek
Contributor

I have a new possible lead/theory. This test is somewhat special in that the mutex profile rate is 1. What if, somewhere deep in gcDrainN, we acquire and/or release a runtime lock, end up recording that lock in the mutex profile, and block in there? I don't immediately see why that would cause a deadlock, but since this is a new kind of failure, recent changes (such as those in the Go 1.23 cycle) may be relevant.

Though, this doesn't really line up with the evidence, since there should then be some other goroutine waiting on a lock. Maybe it's a self-deadlock? The write-mutex-profile code interacts with the same locks as the runtime unlock path, though I don't immediately see how we could self-deadlock in exactly the place we're stuck. Even if we were holding a lock (which it doesn't look like we are), that should prevent recording further locks.

@mknyszek
Contributor

mknyszek commented Aug 8, 2024

In the last triage meeting, I said we'd just wait for more failures. So, setting the WaitingForInfo label.

mknyszek added the WaitingForInfo label Aug 8, 2024
@gopherbot
Contributor Author

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

gopherbot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 8, 2024
github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Sep 8, 2024