Skip to content

internal/trace: TestTraceCPUProfile/Stress failures #70883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gopherbot opened this issue Dec 17, 2024 · 7 comments
Closed

internal/trace: TestTraceCPUProfile/Stress failures #70883

gopherbot opened this issue Dec 17, 2024 · 7 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gopherbot
Copy link
Contributor

#!watchflakes
default <- pkg == "internal/trace" && test == "TestTraceCPUProfile/Stress"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestTraceCPUProfile/Stress
    reader_test.go:112: unexpected error while reading the trace: inconsistent status for goroutine 4: old Runnable vs. new Waiting
    trace_test.go:627: found bad trace; dumping to test log...
    trace_test.go:638: Trace Go1.23
        EventBatch gen=1 m=18446744073709551615 time=4827905708 size=5
        Frequency freq=15625000
        EventBatch gen=1 m=76763 time=4827904561 size=144
        ProcStatus dt=69 p=0 pstatus=1
        GoStatus dt=3 g=1 m=76763 gstatus=2
        ProcsChange dt=52 procs_value=8 stack=1
...
        String id=74
        	data="runtime.chanrecv1"
        String id=75
        	data="runtime.(*wakeableSleep).sleep"
        String id=76
        	data="runtime.traceStartReadCPU.func1"
        String id=77
        	data="/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/tracecpu.go"
        
--- FAIL: TestTraceCPUProfile/Stress (3.61s)

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 17, 2024
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace" && test == "TestTraceCPUProfile/Stress"
2024-12-17 16:05 gotip-linux-arm64-longtest go@31e50af5 internal/trace.TestTraceCPUProfile/Stress (log)
=== RUN   TestTraceCPUProfile/Stress
    reader_test.go:112: unexpected error while reading the trace: inconsistent status for goroutine 4: old Runnable vs. new Waiting
    trace_test.go:627: found bad trace; dumping to test log...
    trace_test.go:638: Trace Go1.23
        EventBatch gen=1 m=18446744073709551615 time=4827905708 size=5
        Frequency freq=15625000
        EventBatch gen=1 m=76763 time=4827904561 size=144
        ProcStatus dt=69 p=0 pstatus=1
        GoStatus dt=3 g=1 m=76763 gstatus=2
        ProcsChange dt=52 procs_value=8 stack=1
...
        String id=74
        	data="runtime.chanrecv1"
        String id=75
        	data="runtime.(*wakeableSleep).sleep"
        String id=76
        	data="runtime.traceStartReadCPU.func1"
        String id=77
        	data="/home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/tracecpu.go"
        
--- FAIL: TestTraceCPUProfile/Stress (3.61s)

watchflakes

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 17, 2024
@mknyszek mknyszek self-assigned this Dec 18, 2024
@mknyszek mknyszek added this to the Go1.24 milestone Dec 18, 2024
@mknyszek
Copy link
Contributor

I'm fairly certain I've found the issue here, and it's an old bug. TL;DR: traceAcquire/traceRelease is not performed around casgstatus in injectglist.

What's happening in this trace is the following sequence of events.

  • G4 (background scavenger) is about to be unblocked by sysmon in injectglist during trace gen N. Emits an event but does not yet casgstatus, and crucially, traceAdvance can continue, more than once, before we reach casgstatus.
  • traceAdvance executes and completes. We are now in gen N+1.
  • traceAdvance begins executing again and observes G4 in Gwaiting, scribbling it down.
  • traceAdvance advances the generation counter. We are now in gen N+2.
  • traceAdvance looks to see if G4 emitted any events in gen N+1, since it scribbled down its status. Nope! It's still in that window between emitting the event and casgstatus.
  • traceAdvance emits a Gwaiting status for G4.
  • Later, the parser sees a conflict, because in gen N the goroutine appears to be in Grunnable, but in gen N+1, it's reported as Gwaiting.

This is a "stress" test, so traceAdvance is basically executing in a loop. This would be extremely rare in practice where it executes every ~1 second.

The fix is easy: wrap the casgstatus calls in injectglist with traceAcquire/traceRelease and emit the events on the spot.

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/638558 mentions this issue: runtime: hold traceAcquire across casgstatus in injectglist

@mknyszek
Copy link
Contributor

mknyszek commented Jan 7, 2025

@gopherbot Please open backport issues for Go 1.22 and Go 1.23.

This bug means a rare chance that the tracer will emit broken traces with no workaround. The fix is also small and concerns only tracer-related code. (It touches the scheduler, but the scheduler logic is left as-is.)

@gopherbot
Copy link
Contributor Author

Backport issue(s) opened: #71146 (for 1.22), #71147 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. FixPending Issues that have a fix which has not yet been reviewed or submitted. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 8, 2025
@dmitshur dmitshur moved this to Active in Test Flakes Jan 8, 2025
@dmitshur dmitshur moved this from Todo to In Progress in Go Compiler / Runtime Jan 8, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Jan 8, 2025
@github-project-automation github-project-automation bot moved this from Active to Done in Test Flakes Jan 8, 2025
@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/641378 mentions this issue: [release-branch.go1.23] runtime: hold traceAcquire across casgstatus in injectglist

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/641356 mentions this issue: [release-branch.go1.22] runtime: hold traceAcquire across casgstatus in injectglist

gopherbot pushed a commit that referenced this issue Jan 8, 2025

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
…in injectglist

Currently injectglist emits all the trace events before actually calling
casgstatus on each goroutine. This is a problem, since tracing can
observe an inconsistent state (gstatus does not match tracer's 'emitted
an event' state).

This change fixes the problem by having injectglist do what every other
scheduler function does, and that's wrap each call to casgstatus in
traceAcquire/traceRelease.

For #70883.
Fixes #71146.

Change-Id: I857e96cec01688013597e8efc0c4c3d0b72d3a70
Reviewed-on: https://go-review.googlesource.com/c/go/+/638558
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit f025d19)
Reviewed-on: https://go-review.googlesource.com/c/go/+/641356
Auto-Submit: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Jan 8, 2025

Verified

This commit was signed with the committer’s verified signature. The key has expired.
Ma27 Maximilian Bosch
…in injectglist

Currently injectglist emits all the trace events before actually calling
casgstatus on each goroutine. This is a problem, since tracing can
observe an inconsistent state (gstatus does not match tracer's 'emitted
an event' state).

This change fixes the problem by having injectglist do what every other
scheduler function does, and that's wrap each call to casgstatus in
traceAcquire/traceRelease.

For #70883.
Fixes #71147.

Change-Id: I857e96cec01688013597e8efc0c4c3d0b72d3a70
Reviewed-on: https://go-review.googlesource.com/c/go/+/638558
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit f025d19)
Reviewed-on: https://go-review.googlesource.com/c/go/+/641378
Auto-Submit: Michael Pratt <mpratt@google.com>
wyf9661 pushed a commit to wyf9661/go that referenced this issue Jan 21, 2025
Currently injectglist emits all the trace events before actually calling
casgstatus on each goroutine. This is a problem, since tracing can
observe an inconsistent state (gstatus does not match tracer's 'emitted
an event' state).

This change fixes the problem by having injectglist do what every other
scheduler function does, and that's wrap each call to casgstatus in
traceAcquire/traceRelease.

Fixes golang#70883.

Change-Id: I857e96cec01688013597e8efc0c4c3d0b72d3a70
Reviewed-on: https://go-review.googlesource.com/c/go/+/638558
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
mpminardi pushed a commit to tailscale/go that referenced this issue Jan 28, 2025
…in injectglist

Currently injectglist emits all the trace events before actually calling
casgstatus on each goroutine. This is a problem, since tracing can
observe an inconsistent state (gstatus does not match tracer's 'emitted
an event' state).

This change fixes the problem by having injectglist do what every other
scheduler function does, and that's wrap each call to casgstatus in
traceAcquire/traceRelease.

For golang#70883.
Fixes golang#71147.

Change-Id: I857e96cec01688013597e8efc0c4c3d0b72d3a70
Reviewed-on: https://go-review.googlesource.com/c/go/+/638558
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit f025d19)
Reviewed-on: https://go-review.googlesource.com/c/go/+/641378
Auto-Submit: Michael Pratt <mpratt@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Projects
Archived in project
Development

No branches or pull requests

3 participants