New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: mayMoreStackPreempt failures #55160
Comments
Found new matching flaky dashboard failures. 2022-09-02 19:08 linux-386-longtest go@dbf442b1 runtime.TestVDSO (log)
2022-09-02 19:08 linux-386-longtest go@b91e3737 runtime.TestVDSO (log)
2022-09-02 19:09 linux-386-longtest go@55ca6a20 runtime.TestVDSO (log)
2022-09-02 19:22 linux-386-longtest go@0fda8b19 runtime.TestVDSO (log)
2022-09-02 19:22 linux-386-longtest go@0fda8b19 runtime.TestCgoPprofCallback (log)
2022-09-03 15:45 linux-386-longtest go@f798dc68 runtime.TestVDSO (log)
2022-09-03 18:21 linux-386-longtest go@a0f05823 runtime.TestVDSO (log)
2022-09-03 18:21 linux-386-longtest go@a0f05823 runtime.TestCgoPprofCallback (log)
2022-09-04 04:17 linux-386-longtest go@535fe2b2 runtime.TestVDSO (log)
2022-09-05 07:14 linux-386-longtest go@3fbcf05d runtime.TestVDSO (log)
2022-09-05 07:17 linux-386-longtest go@4e7e7ae1 runtime.TestVDSO (log)
2022-09-05 08:08 linux-386-longtest go@4ad55cd9 runtime.TestVDSO (log)
2022-09-05 08:08 linux-386-longtest go@af7f4176 runtime.TestVDSO (log)
2022-09-05 08:28 linux-386-longtest go@bd5595d7 runtime.TestVDSO (log)
2022-09-05 08:28 linux-386-longtest go@bd5595d7 runtime.TestCgoPprofCallback (log)
2022-09-05 21:39 linux-386-longtest go@4c1ca42a runtime.TestVDSO (log)
2022-09-06 14:44 linux-386-longtest go@a60a3dc5 runtime.TestCgoPprofCallback (log)
2022-09-06 15:44 linux-386-longtest go@32f68b5a runtime.TestVDSO (log)
2022-09-06 15:44 linux-386-longtest go@32f68b5a runtime.TestCgoPprofCallback (log)
2022-09-06 15:48 linux-386-longtest go@07b19bf5 runtime.TestVDSO (log)
2022-09-02 19:08 linux-amd64-longtest go@dbf442b1 runtime.TestCgoPprofCallback (log)
2022-09-02 19:08 linux-amd64-longtest go@dbf442b1 runtime.TestCgoCCodeSIGPROF (log)
2022-09-02 19:08 linux-amd64-longtest go@b91e3737 runtime.TestCgoPprofCallback (log)
2022-09-03 15:45 linux-amd64-longtest go@f798dc68 runtime.TestCgoPprofCallback (log)
2022-09-03 18:21 linux-amd64-longtest go@a0f05823 runtime.TestCgoPprofCallback (log)
2022-09-04 04:17 linux-amd64-longtest go@535fe2b2 runtime.TestCgoPprofCallback (log)
2022-09-05 08:12 linux-amd64-longtest go@67e65424 runtime.TestCgoPprofCallback (log)
2022-09-05 08:12 linux-amd64-longtest go@67e65424 runtime.TestCgoCCodeSIGPROF (log)
2022-09-05 21:39 linux-amd64-longtest go@4c1ca42a runtime.TestCgoPprofCallback (log)
2022-09-06 11:14 linux-amd64-longtest go@1c504843 runtime.TestCgoPprofCallback (log)
|
These were coming in multiple times per day and have now stopped for almost two weeks. Closing. |
Found new matching flaky dashboard failures for:
2022-08-23 03:09 windows-amd64-longtest go@0a52d806 (log)
|
Found new dashboard test flakes for:
2022-09-06 15:48 linux-amd64-longtest go@07b19bf5 runtime.TestCgoPprofCallback (log)
|
Found new dashboard test flakes for:
2022-10-31 21:00 linux-386-longtest go@ec0b5402 runtime.TestCgoTracebackContextPreemption (log)
2022-11-02 18:19 linux-386-longtest go@07a70bca runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 15:17 linux-386-longtest go@1bfb51f8 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 15:30 linux-386-longtest go@e81263c7 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 17:01 linux-386-longtest go@667c53e1 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 17:40 linux-386-longtest go@7abc8a2e runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 18:33 linux-386-longtest go@44cabb80 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 19:34 linux-386-longtest go@3511c822 runtime.TestCgoTracebackContextPreemption (log)
2022-11-03 19:59 linux-386-longtest go@d031e9e0 runtime.TestCgoTracebackContextPreemption (log)
|
No more failure for a while. Probably fixed by CL https://golang.org/cl/447495 , which is submitted at Nov. 4, so the time seems to match. |
Found new dashboard test flakes for:
2023-01-25 16:38 linux-arm64-longtest go@1d3088ef runtime.TestCrashDumpsAllThreads (log)
2023-02-01 21:30 linux-amd64-longtest go@cda461bb runtime.TestCrashDumpsAllThreads (log)
|
cc @aclements These timeouts blocked release branch CLs as flakes; we should get them fixed or skipped. |
This reproduces pretty quickly on my linux/amd64 laptop with: cd runtime
go test -c
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" stress2 ./runtime.test -test.run=CrashDumpsAllThreads -test.timeout=30m Of course, my usual first step would be to send it a SIGQUIT and see what's happening, but that's exactly what fails in this test. 😛 I haven't debugged yet. gdb may hold answers. For reference, I initially tried running the testprog directly with the below script before I realized there are various pipes and signals involved in running this test. We might still need to reproduce those, but I'm hoping we can debug this through the test driver. cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
# Expects to write to FD 3 and then get a SIGQUIT
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 ./testprog CrashDumpsAllThreads |
All of the Ms are just parked in |
Ah hah, it's getting stuck before the SIGQUIT is involved. It's actually getting stuck in this loop, before the subprocess even indicates to the parent that it's ready for the SIGQUIT. |
Amusingly, this means you can send the subprocess a SIGQUIT. I've pasted the interesting part of the traceback below. All other goroutines and threads are uninteresting (this is running with
|
Faster repro. Apply the following diff --- a/src/runtime/testdata/testprog/crashdump.go
+++ b/src/runtime/testdata/testprog/crashdump.go
@@ -29,6 +29,8 @@ func CrashDumpsAllThreads() {
<-c
}
+ return
+
// Tell our parent that all the goroutines are executing.
if _, err := os.NewFile(3, "pipe").WriteString("x"); err != nil {
fmt.Fprintf(os.Stderr, "write to pipe failed: %v\n", err) Then cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 stress2 ./testprog CrashDumpsAllThreads |
Here's a version that works without a patch: cd runtime/testdata/testprog
GOFLAGS="-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackPreempt -gcflags=runtime/testdata/...=" go build
GOTRACEBACK=crash GOGC=off GODEBUG=asyncpreemptoff=1 stress2 -timeouts-fail -max-fails=1 -timeout=10s -pass="write to pipe" -max-runs=50000 ./testprog CrashDumpsAllThreads |
This isn't new, either. I reproduced in Go 1.18, which was the first release to have mayMoreStack. |
Found new dashboard test flakes for:
2023-02-23 06:07 linux-arm64-longtest go@71c02bed runtime.TestCrashDumpsAllThreads (log)
|
Found new dashboard test flakes for:
2024-03-08 22:14 linux-arm64-longtest go@c220fbab runtime.TestNetpollDeadlock (log)
|
Found new dashboard test flakes for:
2024-03-09 14:19 windows-amd64-longtest go@74726def runtime.TestNetpollDeadlock (log)
|
Sorry, but there were parse errors in the watch flakes script.
And the problems were:
See https://go.dev/wiki/Watchflakes for details. |
The |
Found new dashboard test flakes for:
2024-02-16 20:25 gotip-windows-amd64-longtest go@cdd0ddaf runtime:mayMoreStackPreempt.TestReadMetricsCumulative (log)
|
Found new dashboard test flakes for:
2024-04-05 20:51 linux-amd64-longtest go@5ec7395a runtime.TestCrashWhileTracing (log)
2024-04-05 20:51 windows-amd64-longtest go@5ec7395a runtime.TestCrashWhileTracing (log)
2024-04-05 22:09 linux-amd64-longtest go@d186dde8 runtime.TestCrashWhileTracing (log)
2024-04-05 22:09 windows-amd64-longtest go@d186dde8 runtime.TestCrashWhileTracing (log)
2024-04-05 22:18 windows-amd64-longtest go@5d18f889 runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-07 03:32 windows-amd64-longtest go@20f052c8 runtime.TestCrashWhileTracing (log)
2024-04-08 15:39 linux-amd64-longtest go@ae0a08de runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-08 15:39 windows-amd64-longtest go@ae0a08de runtime.TestCrashWhileTracing (log)
2024-04-08 15:53 windows-amd64-longtest go@da732dd1 runtime.TestCrashWhileTracing (log)
2024-04-08 16:36 windows-amd64-longtest go@e8f5c04c runtime.TestCrashWhileTracing (log)
|
I have a revert CL out to fix this. |
I think I understand what went wrong here. Looks like the fact that this CL made the tracer more preemptible also made it much more likely that the trace would have a broken tail. The test seems to succeed overall though, because other invariants it checks are completely fine. I think I know how to fix this to re-land it. |
Revert landed. |
Found new dashboard test flakes for:
2024-04-08 20:51 windows-amd64-longtest go@8008998b runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-08 21:03 linux-amd64-longtest go@017478a9 runtime.TestCrashWhileTracing (log)
2024-04-08 21:43 linux-amd64-longtest go@e995aa95 runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-08 21:43 windows-amd64-longtest go@d4395ebc runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-08 21:43 darwin-amd64-longtest go@e995aa95 runtime.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-08 15:39 gotip-windows-amd64-longtest go@ae0a08de runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-08 15:53 gotip-windows-amd64-longtest go@da732dd1 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-08 20:33 gotip-windows-amd64-longtest go@50a50590 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-08 21:03 gotip-windows-amd64-longtest go@017478a9 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-10 17:23 gotip-windows-amd64-longtest go@73981695 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-10 19:41 gotip-windows-amd64-longtest go@5b5d6f87 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-11 16:15 gotip-windows-amd64-longtest go@48e00ab7 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-11 20:22 gotip-windows-amd64-longtest go@1843464f runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-11 21:23 gotip-windows-amd64-longtest go@bfb1f563 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-12 14:07 gotip-windows-amd64-longtest go@5c20d5fa runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-12 19:15 gotip-windows-amd64-longtest go@2f3ff1fc runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-14 18:17 gotip-windows-amd64-longtest go@37f48222 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-15 17:03 gotip-windows-amd64-longtest go@7b10c49e runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-15 17:38 gotip-windows-amd64-longtest go@59209c4b runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-17 19:54 gotip-windows-amd64-longtest go@2073b35e runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
|
The non-Windows failures are likely just stragglers from the fix. The Windows failures are likely because something is failing to flush (a pipe, in this case...?). |
Found new dashboard test flakes for:
2024-04-19 17:15 gotip-windows-amd64-longtest go@2ff89341 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-24 18:15 gotip-windows-amd64-longtest go@e6891188 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-25 00:43 gotip-windows-amd64-longtest go@799968df runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
|
Found new dashboard test flakes for:
2024-04-25 01:05 gotip-windows-amd64-longtest go@db5f2b41 runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-25 11:28 gotip-windows-amd64-longtest go@8960925a runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
2024-04-25 17:07 gotip-windows-amd64-longtest go@06478e4b runtime:mayMoreStackPreempt.TestCrashWhileTracing (log)
|
Intermittent failures in the new mayMoreStackPreempt. This bug will replace #54778 and #55073.
The text was updated successfully, but these errors were encountered: