
runtime/pprof: TestProfilerStackDepth/mutex failures #68562

Closed
gopherbot opened this issue Jul 24, 2024 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

@gopherbot
Contributor

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
	running tests:
		TestProfilerStackDepth (2m44s)
		TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
	C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

gopherbot added the NeedsInvestigation label Jul 24, 2024
@gopherbot
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"
2024-07-23 21:29 gotip-windows-386 go@c18ff292 runtime/pprof.TestProfilerStackDepth/mutex [ABORT] (log)
=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
	running tests:
		TestProfilerStackDepth (2m44s)
		TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
	C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
	C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

gopherbot added the compiler/runtime label Jul 24, 2024
mknyszek self-assigned this Jul 31, 2024
mknyszek added this to the Go1.24 milestone Jul 31, 2024
@mknyszek
Contributor

mknyszek commented Jul 31, 2024

I spent some time looking into this failure, and it's clearly some kind of GC deadlock. Notably, the goroutine trying to write out the mutex profile calls into malloc, then into GC assists, and gets stuck somewhere in assists. Unfortunately it's unclear where. Some candidates include:

  • Trying to scan a goroutine that is not preemptible (and isn't exiting from a non-preemptible section, so possibly some kind of mutual deadlock). I checked to make sure self-scan is safe, and AFAICT it is: assisting goroutines are marked as waiting after switching to the system stack. The self-scan logic in markroot is also correct, AFAICT. There's also no other running goroutine in the stack dump (and certainly not anything that's non-preemptible).
  • The goroutine is blocked on a lock, like the assist queue lock, which somebody else is holding. This seems unlikely since there aren't any other goroutines blocked on a lock in the stack dump.

The lack of any smoking guns in the stack dump suggests to me that sysmon may be involved. Or, it's some kind of memory corruption. I also wonder if this is a platform-specific failure; windows-386 is a bit funky.

A final note: all the GC mark workers in the stack dump are parked. This makes some sense -- the GC is not done, but they can't find any work to do, because in 2 minutes they almost certainly found everything they could aside from wherever that one goroutine is stuck.

@mknyszek
Contributor

Actually, the assist queue lock theory is not possible. We know for a fact we're in gcAssistAlloc1, since we're on the system stack and the line numbers match what we'd see if we entered the system stack there.

So, the likely explanation is difficulty preempting some goroutine (I can't think of any other reason we'd block so long...).

@mknyszek
Contributor

I have a new possible lead/theory. This test is somewhat special in that the mutex profile rate is 1. What if, somewhere deep in gcDrainN, we acquire and/or release a runtime lock, end up recording that lock in the mutex profile, and block in there? I don't immediately see why that would cause a deadlock, but since this is a new kind of failure, recent changes (such as those in the Go 1.23 cycle) may be relevant.

Though, this doesn't really line up with the evidence, since there should then be some other goroutine waiting on a lock. Maybe it's a self-deadlock? The write-mutex-profile code interacts with the same locks as the runtime unlock path, though I don't immediately see how we could self-deadlock in exactly the place we're stuck. Even if we were holding a lock (which it doesn't look like we are), that should prevent recording further locks.

@mknyszek
Contributor

mknyszek commented Aug 8, 2024

In the last triage meeting, I said we'd just wait for more failures. So, setting the WaitingForInfo label.

mknyszek added the WaitingForInfo label Aug 8, 2024
@gopherbot
Contributor Author

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

gopherbot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 8, 2024
github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Sep 8, 2024