Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestCrashDumpsAllThreads flakes #35356

Closed
FiloSottile opened this issue Nov 4, 2019 · 21 comments
Closed

runtime: TestCrashDumpsAllThreads flakes #35356

FiloSottile opened this issue Nov 4, 2019 · 21 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@FiloSottile FiloSottile added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 4, 2019
@FiloSottile FiloSottile added this to the Go1.14 milestone Nov 4, 2019
@bcmills
Copy link
Contributor

bcmills commented Nov 6, 2019

@ianlancetaylor
Copy link
Contributor

My current guess is that this has changed due to signal preemption. The test only works if each running M is hit with a SIGQUIT while it is running a tight loop. The test is checking that the stack trace for each thread shows the loop. If the thread is preempted when it receives the SIGQUIT, it will instead show a track trace in the runtime with the thread stopped at gcstopm.

@ianlancetaylor
Copy link
Contributor

CC @aclements @cherrymui

@gopherbot
Copy link

Change https://golang.org/cl/205720 mentions this issue: runtime: set GODEBUG=asyncpreemptoff=1 in TestCrashDumpsAllThreads

@aclements
Copy link
Member

aclements commented Nov 7, 2019 via email

@cherrymui
Copy link
Member

If the SIGQUIT lands right after asyncPreempt2 switches to the system stack, but before CASing the G status out of _Grunning, I guess this can result in goroutine running on other thread; stack unavailable on the G and also not being trace backed on the M (as it is running on its G0)? (But the failure stack traces don't look like this.)

@ianlancetaylor
Copy link
Contributor

The test is expecting to see that theoretically unpreemptible loop appears the expected number of times in the stack trace. But if the loop is preempted, then the stack trace of that M will show something else, not the loop. So the test will fail.

Non-cooperative preemption will make real code more robust, but it doesn't make this specific test more robust.

@cherrymui It's normal to see goroutine running on other thread in the first set of stack traces. The expectation of the test is that you will then see one more stack trace per thread. The test expects those additional stack traces to fill in the references to the loop. But if the M is preempted when the program dies, then we don't see a reference to the loop in its stack.

@cherrymui
Copy link
Member

@ianlancetaylor when the goroutine is preempted, it will show something like

        goroutine 6 [runnable]:
        runtime.asyncPreempt2()
        	/workdir/go/src/runtime/preempt.go:289 +0x62 fp=0xc000030630 sp=0xc000030610 pc=0x42ba62
        runtime.asyncPreempt()
        	/workdir/go/src/runtime/preempt_amd64.s:45 +0xdb fp=0xc0000307b8 sp=0xc000030630 pc=0x4589fb
        main.loop(0x0, 0xc00005e060)
        	/workdir/tmp/go-build869718673/main.go:36 +0x2d fp=0xc0000307d0 sp=0xc0000307b8 pc=0x48e3cd
        runtime.goexit()
        	/workdir/go/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc0000307d8 sp=0xc0000307d0 pc=0x4570d1
        created by main.main
        	/workdir/tmp/go-build869718673/main.go:17 +0xac

The test counts the number of "main.loop" appearances, so this is counted correctly. On that thread, the additional stack trace (what I meant by saying "being trace backed on the M") will show that M's G0 stack, which does not include main.loop.

@cherrymui
Copy link
Member

So, I think,

  • if the goroutine is not preempted, the first set of stack traces will have goroutine running on other thread, and the additional stack trace will show one with main.loop.
  • if the goroutine is preempted, the first set of stack traces will show main.loop (by tracing back through runtime.asyncPreempt), and the additional stack trace will show its G0 stack, which does not include main.loop.

So either way, it should be counted as 1 appearance (except in the case that the SIGQUIT lands in the small window I mentioned in my previous comment).

@ianlancetaylor
Copy link
Contributor

I think that what is happening when the test fails is that 1) the signal is sent; 2) the goroutine is reported as goroutine running on other thread; 3) the goroutine is preempted and the M enters schedule; 4) the M that was running the goroutine is signaled; 5) the M reports a stack trace up through schedule; 6) nothing reports main.loop because that goroutine was already reported in step 2 and the M is now doing something else.

@cherrymui
Copy link
Member

@ianlancetaylor I thing you're right. This can result in a missing main.loop.

@bcmills
Copy link
Contributor

bcmills commented May 12, 2020

Still flaky (with the same failure mode) after CL 205720, albeit with what seems to be a lower failure rate:
2020-05-12T15:01:56-a0698a6/darwin-amd64-10_15
2019-12-06T23:44:28-76d2f6c/freebsd-amd64-11_2

@4a6f656c
Copy link
Contributor

4a6f656c commented Nov 6, 2020

Not sure if directly related, however this test currently fails 100% of the time on the openbsd/mips64 builder:

https://build.golang.org/log/b88cd54aa3ccb3e2bf03f5958dd593291803b00e

@cherrymui
Copy link
Member

The runtime waits for 5 seconds for the SIGQUIT signal to bounce between all threads. All failure cases above seem to take longer than 5 seconds to finish, and it doesn't include all Ms. For those cases, I guess it is just that 5 seconds is not long enough to have the signal bounce between all threads (maybe the system is quite busy, because all.bash launches many processes?).

@odeke-em
Copy link
Member

odeke-em commented Feb 7, 2021

Still flaky, but shouldn’t hold up a release. Punting to Go1.17. Thank you everyone.

@odeke-em odeke-em modified the milestones: Go1.16, Go1.17 Feb 7, 2021
@cherrymui
Copy link
Member

I think my analysis above gives a possible reason. But I don't know what we want to do to fix. Do we want to increase the 5 seconds waiting time?

@ianlancetaylor
Copy link
Contributor

Let's try dropping the t.Parallel so that there is hopefully less going on during this test. I'll send you a CL that does that and also tries to fold the test into testprog.

@gopherbot
Copy link

Change https://golang.org/cl/312510 mentions this issue: runtime: don't run TestCrashDumpsAllThreads in parallel

@golang golang locked and limited conversation to collaborators Apr 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants