runtime: TestCrashDumpsAllThreads flakes #35356
Comments
Still flaky after CL 204800.
|
My current guess is that this has changed due to signal preemption. The test only works if each running M is hit with a |
Change https://golang.org/cl/205720 mentions this issue: |
In principle, non-cooperative preemption should make this test more robust,
since we should be able to get all the stacks even without
GOTRACEBACK=crash. That's obviously not quite working.
|
If the SIGQUIT lands right after asyncPreempt2 switches to the system stack, but before CASing the G status out of _Grunning, I guess this can result in |
The test expects the theoretically unpreemptible loop to appear the expected number of times in the stack trace. But if the loop is preempted, the stack trace of that M will show something else, not the loop, so the test will fail. Non-cooperative preemption will make real code more robust, but it doesn't make this specific test more robust. @cherrymui It's normal to see |
@ianlancetaylor when the goroutine is preempted, it will show something like
The test counts the number of "main.loop" appearances, so this is counted correctly. On that thread, the additional stack trace (what I meant by saying "being trace backed on the M") will show that M's G0 stack, which does not include main.loop. |
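The counting check described in this comment can be sketched as follows. This is only a minimal illustration, not the actual test code in the runtime package: the helper name `countFrames` and the sample traceback text are hypothetical, and the check is reduced to a plain substring count.

```go
package main

import "strings"

// countFrames reports how many times a frame name (for example
// "main.loop") occurs in a traceback dump.
// Hypothetical helper, not the runtime test's actual code.
func countFrames(dump, frame string) int {
	return strings.Count(dump, frame)
}

// sampleDump is made-up traceback text used only for illustration.
// Two goroutines show the loop; the third entry stands in for a
// thread whose extra traceback shows only its g0 stack.
const sampleDump = `goroutine 5 [running]:
main.loop(...)
goroutine 6 [running]:
main.loop(...)
goroutine 0 [idle]:
runtime.futex(...)
`
```

With this sample, `countFrames(sampleDump, "main.loop")` returns 2: the g0-only entry contributes nothing, which is why a thread that reports only its g0 stack does not change the count.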
So, I think,
So either way, it should be counted as 1 appearance (except in the case that the SIGQUIT lands in the small window I mentioned in my previous comment). |
I think that what is happening when the test fails is that 1) the signal is sent; 2) the goroutine is reported as |
@ianlancetaylor I think you're right. This can result in a missing |
Still flaky (with the same failure mode) after CL 205720, albeit with what seems to be a lower failure rate: |
Not sure if directly related, however this test currently fails 100% of the time on the openbsd/mips64 builder: https://build.golang.org/log/b88cd54aa3ccb3e2bf03f5958dd593291803b00e |
The runtime waits 5 seconds for the SIGQUIT signal to bounce between all threads. All the failure cases above seem to take longer than 5 seconds to finish, and their output doesn't include all Ms. For those cases, I guess 5 seconds is simply not long enough for the signal to bounce between all threads (maybe the system is quite busy, because all.bash launches many processes?). |
Still flaky, but shouldn’t hold up a release. Punting to Go1.17. Thank you everyone. |
I think my analysis above gives a possible reason, but I don't know what we want to do to fix it. Do we want to increase the 5-second waiting time? |
Let's try dropping the |
Change https://golang.org/cl/312510 mentions this issue: |
https://storage.googleapis.com/go-build-log/e4992e62/linux-amd64_2c1df239.log
2019-11-04T18:53:43-210e367/darwin-amd64-race
2019-11-04T05:27:25-fb29e22/linux-386
2019-11-04T05:27:25-fb29e22/netbsd-amd64-8_0
2019-11-03T05:01:00-4497d7e/darwin-amd64-10_12
2019-11-03T01:44:46-d2c039f/linux-386-clang
2019-11-02T21:51:23-7955ece/linux-amd64-sid
2019-11-01T16:05:22-8d45e61/freebsd-386-11_2
Maybe yet another flake fixed by https://go-review.googlesource.com/c/go/+/204800?