runtime: crashes on linux/riscv64 during runtime/pprof and os/signal tests #49709
Comments
This looks like a problem with linux/riscv64 signal handling, not specific to the cpu profiler: the tests for the os/signal package see similar crashes (though because of the test structure, it's in a subprocess rather than the test runner itself).
One of the failures is "fatal error: unexpected signal during runtime execution"
"fatal error: runtime: split stack overflow"
|
kindly cc @cherrymui @4a6f656c @prattmic |
Yeah, it does look signal-related. Is it reproducible at tip? Thanks. |
Yes, the runtime/pprof crashes reproduce at tip. It's very fast, I got 10 failures in under 4 minutes. The crashes in os/signal are slower, less than one per hour on go1.17.3. I'll leave it running.
|
I haven't seen any os/signal failures yet at tip, and with the go1.17.3 failure rate I'd expect "about 6" by now:
|
The os/signal failures reproduce at tip, but very slowly (10 hours):
|
I'm not able to reproduce this here on a SiFive HiFive Unleashed:
That ran for several hours without failing - @mengzhuo can you test on your SiFive HiFive Unmatched? |
@4a6f656c Joel, I ran this test for about 1 hour and no failure occurred. Update: it has now run for 18 hours with no failures.
|
It's a crash, not a test failure, so I don't think that disabling the tests is the right thing to do. The I ran the stress test for several hours with async preemption disabled and saw zero failures:
Some of the failures I see, which complain of The other two failure modes, What, if anything, would make it safe for the signal handlers to be re-entrant, and/or to share the M's gsignal stack? I haven't seen these failures on linux/amd64 (though I don't typically use single-core machines). I added some debuglog to the signal handler; this is what I see. Atop go1.17.3:
|
@rhysh thanks for the logs! If the signal handler is reentrant, or the signal stack is somehow shared, I think it is very wrong. Could you print the stack pointer at entry of signal handler to see if it is running on the same stack? |
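For readers unfamiliar with the term, the sketch below shows what "re-entrant" means here: a second entry into a handler before the first has returned. It is a user-level illustration with hypothetical names (`entryGuard`, `handle`), not the runtime's mechanism; a real signal handler could not safely call `fmt`, and the runtime investigation above used debuglog and the stack pointer instead.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// entryGuard counts how many times a handler is currently active.
type entryGuard struct {
	depth int32
}

// enter records an entry and reports whether another entry was already active.
func (g *entryGuard) enter() bool {
	return atomic.AddInt32(&g.depth, 1) > 1
}

// leave records that one active entry has finished.
func (g *entryGuard) leave() {
	atomic.AddInt32(&g.depth, -1)
}

// handle stands in for a handler body and reports nested entry.
func handle(g *entryGuard, name string) {
	if g.enter() {
		fmt.Println("nested entry while handling", name)
	}
	defer g.leave()
}

func main() {
	var g entryGuard
	g.enter()             // simulate a handler that is still in progress
	handle(&g, "SIGPROF") // a second delivery arrives before the first returns
	g.leave()
}
```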
I think that condition should never be true. This could be either because we're getting nested signals or because the G is somehow wrong. As we got signal PC in What kernel version are you running? @4a6f656c and @mengzhuo, what kernel version are the builders running? |
Yes, the variation in the result of
In a crash from "fatal error: unexpected signal during runtime execution", it looks like M0 processes SIGURG and SIGPROF always with SP 0x3fa80098d8. On M3, it usually processes those with SP 0x3fa805b8d8. But then it gets a SIGPROF/27 on PC 0x16c34 ( We don't see a log line from a competing SIGURG/23 delivery, but we do see that the SIGPROF/27 signal came when the PC was inside a function that only the signal handler would call. And we see a split in the debuglog output (">> begin log 0 <<"), which seems to mean that something else (the signal handler) was using the debuglog that's usually in the global pool (between calls to
Here's the current diff from go1.17.3
And the crash log from stress (including debuglog output)
|
On linux/amd64, adding On linux/riscv64, the strace output shows It looks from this like the definition of go/src/runtime/defs_linux_riscv64.go Lines 124 to 129 in f598e29
|
Good finding. From the kernel C header
So it does look wrong if SA_RESTORER is not defined. And it does look like it is not defined on riscv64. |
Change https://golang.org/cl/367635 mentions this issue: |
What version of Go are you using (go version)?
(Cross-compiling to linux/riscv64)
Does this issue reproduce with the latest release?
Yes, this problem appears with go1.17.3, the latest stable release.
What operating system and processor architecture are you using (go env)?
I only have a working Go installation on my darwin/amd64 machine (below). I'm cross-compiling from there to linux/riscv64, where I see the problem.
go env Output
What did you do?
I've got a small RISCV64 computer: Nezha with the Allwinner D1 SOC. I'm running the runtime/pprof tests at version go1.17.3 and seeing several types of "fatal error" failures.
What did you expect to see?
Near-zero failure rate of runtime/pprof's TestCPUProfileMultithreaded test in short mode when using go1.17.3, and zero crashes.
What did you see instead?
Currently about 1.5% failure rate, all of which are crashes ("fatal error"). (On the plus side, I haven't seen any test failures for that test.)
In the last 42 minutes, the stress run has collected:
Here's an example of each:
"fatal error: unexpected signal during runtime execution" / "[signal SIGSEGV: segmentation violation code=0x1 addr=0x2 pc=0x2]"
"runtime: unexpected return pc for runtime.sigpanic called from 0x0" / "fatal error: unknown caller pc"
"fatal error: runtime: split stack overflow"