runtime: execution halts with goroutines stuck in runtime.gopark #61768
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

What did you do?

When running continuous profiling on my binary, the entire program halted with all goroutines stuck in runtime.gopark. When disabling profiling, this problem went away. I posted a similar issue a few months ago ("halt when profiling") but don't believe it to be related (#58798).

What did you expect to see?

I expected profiling to allow the binary to run without issue.

What did you see instead?

Profiling put the binary in a "deadlocked"/"unrecoverable" state.

Comments
Here are some interesting backtraces I captured; the goroutines are parked in runtime.gopark, and the debugger hits a protocol error trying to read memory:

```
runtime.gopark
runtime.gopark
(protocol error E08 during memory read for packet …)
```

@mknyszek @cherrymui @bcmills
I can reliably reproduce this within a few minutes on my machine if you want me to grab anything else 👍.
Is this https://github.com/ava-labs/avalanchego? Can the test run on just a laptop, or does it need additional setup?
There is some additional setup. I've been reproducing on my Mac M2 Max with ./scripts/run.sh, which spawns a local network of p2p processes. Within a few minutes, different spawned processes will start to halt (continuous profiling is enabled by default and runs once per minute).
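For readers without the avalanchego setup, here is a minimal sketch (my illustration, not avalanchego's actual profiler code) of the kind of continuous-profiling loop being described: it starts and stops the SIGPROF-driven CPU profiler on a timer while the process stays busy.

```go
package main

import (
	"bytes"
	"log"
	"runtime/pprof"
	"time"
)

// busyWork keeps the CPU hot so SIGPROF has samples to take.
func busyWork() {
	x := 0
	for {
		x++
	}
}

func main() {
	for i := 0; i < 4; i++ {
		go busyWork()
	}
	// Start and stop the CPU profiler roughly once per minute,
	// discarding the output; the hang being chased here is in the
	// runtime, not in what is done with the profile bytes.
	for {
		var buf bytes.Buffer
		if err := pprof.StartCPUProfile(&buf); err != nil {
			log.Fatal(err)
		}
		time.Sleep(10 * time.Second) // profiling window
		pprof.StopCPUProfile()
		time.Sleep(50 * time.Second) // idle until the next cycle
	}
}
```

The profiler itself is standard runtime/pprof; on an affected toolchain and machine, a loop of this shape is the path that eventually wedges.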
Can confirm that ./scripts/run.sh does run and spew messages. Is there anything else I should be doing after that? I see messages about "disconnecting peer". One avalanchego process is chewing up 300% of CPU; is that expected? (Laptop is also an M2 Max, all the cores, a truckload of memory too.) I.e., what am I looking for? (And is there a better way than killall to get rid of the processes?) Now there are two avalanchego processes, each consuming well over 300% of CPU.
This is what happens when the process halts with this bug. Run …
This means a peer is no longer responsive (because it has halted/is stuck).
Yeah, they all eventually halt if you wait long enough.
You can run …
A-ha. Any way to prevent the other ones from getting further stuck and eating all the CPU? Maybe just keep attaching to them to shut them down? Or do I debug one, kill the rest?
I'd suggest killing the rest after attaching to one that is stuck. Otherwise the runaway CPU will happen as more get stuck (no way to prevent that, because of the bug this issue is open for ^). Glad you were able to repro so quickly 👍.
Any update on this or anything else I can collect?
Actually got it open just now; would not want to claim it is anything like "progress".
I'm going to test if this occurs on go1.19 to see if I can bisect the regression. |
Thank you, and good luck. |
I ran a few tests. I have not been able to reproduce it in v1.19.12 but was able to reproduce it immediately in v1.21.0. I'm going to keep trying on v1.19.12. |
I just reproduced the issue on v1.19.12 😢. It took longer, but it eventually hit (same exact symptoms as shared above).
I know "what" but not "why": the profiling signal lock stays held for at least 100,000 consecutive osyield calls, i.e., about 1 second.
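A rough user-level illustration of that check (assuming the lock in question is a CAS-guarded signal lock; runtime.Gosched stands in here for the runtime-internal osyield, and the threshold is the one quoted above):

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// signalLock is an illustrative stand-in for the runtime's profiling
// signal lock; the real one is runtime-internal state, not user code.
var signalLock uint32

// acquire spins on the lock, yielding on every failed CAS, and reports
// when the lock has stayed held for ~100,000 consecutive yields (about
// one second), mirroring the diagnostic described in the comment above.
func acquire() {
	spins := 0
	for !atomic.CompareAndSwapUint32(&signalLock, 0, 1) {
		runtime.Gosched() // user-level stand-in for the runtime's osyield
		spins++
		if spins >= 100000 {
			fmt.Println("profiling signal lock held across ~1s of yields; stuck?")
			spins = 0
		}
	}
}

func release() {
	atomic.StoreUint32(&signalLock, 0)
}

func main() {
	acquire()
	release()
}
```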
I can look for that, but I don't think I've seen it. I'm also seeing behavior that makes me think this might perhaps be a symptom, not a cause -- after adding checks, two processes jammed at 99.9% CPU (but not more, which was what I previously saw), without "Idle Wake Ups" counting up at a wicked rate, and without crashing out on the checks I added. Edit -- it might not be a symptom; it might be a consequence of a throw from within a signal handler. Will the runtime's "does this arm64 support CAS?" test work properly within a signal handler?
Note: so far, the 3 failures that I have seen have all been in runtime/cpuprof.go; in particular, the failure was detected there, and the stuck lock was also created there.
Definitely have a fix; maybe it's not the right fix.
Change https://go.dev/cl/518836 mentions this issue: |
@gopherbot please open the backport tracking issues. This is an annoying bug, possible cause of flakes, with a targeted fix. |
Backport issue(s) opened: #62018 (for 1.20), #62019 (for 1.21). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/519275 mentions this issue: |
Thanks @dr2chase!! |
Change https://go.dev/cl/519375 mentions this issue: |
Change https://go.dev/cl/518677 mentions this issue: |
These operations misbehave and cause hangs and flakes. Fail hard if they are attempted. Tested by backing out the Darwin-profiling-hang fix CL 518836 and running run.bash; the guard panicked in runtime/pprof tests, as expected/hoped.

Updates #61768

Change-Id: I89b6f85745fbaa2245141ea98f584afc5d6b133e
Reviewed-on: https://go-review.googlesource.com/c/go/+/519275
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
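In spirit, the guard amounts to refusing to enter a blocking note wait from a signal-handling context. A hedged sketch follows; the flag name and the panic are illustrative, since the real guard is runtime-internal and throws rather than panicking:

```go
package main

// inSignalHandler is an illustrative flag; the runtime instead knows
// whether it is currently running on the gsignal stack.
var inSignalHandler bool

// notetsleepg stands in for the runtime's blocking note wait. Per the
// commit message above, attempting it in a signal-handling context
// should fail hard rather than hang.
func notetsleepg() {
	if inSignalHandler {
		panic("notetsleepg from signal-handling context") // runtime would throw
	}
	// (the real implementation parks here until notewakeup; omitted)
}

// sigprofHandler stands in for the SIGPROF handler, where blocking
// operations are forbidden.
func sigprofHandler() {
	inSignalHandler = true
	defer func() { inSignalHandler = false }()
	// take a profile sample; a notetsleepg call here would now fail fast
}

func main() {
	sigprofHandler()
	notetsleepg() // allowed: ordinary goroutine context
}
```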
…ing reads

On Darwin (and assume also on iOS, but not sure), notetsleepg cannot be called in a signal-handling context. Avoid this by disabling block reads on Darwin. An alternate approach was to add "sigNote" with a pipe-based implementation on Darwin, but that ultimately would have required at least one more linkname between runtime and syscall to avoid racing with fork and opening the pipe, so, not.

Fixes #62018.
Updates #61768.

Change-Id: I0e8dd4abf9a606a3ff73fc37c3bd75f55924e07e
Reviewed-on: https://go-review.googlesource.com/c/go/+/518836
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit c6ee8e3)
Reviewed-on: https://go-review.googlesource.com/c/go/+/518677
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Austin Clements <austin@google.com>
…ing reads

On Darwin (and assume also on iOS, but not sure), notetsleepg cannot be called in a signal-handling context. Avoid this by disabling block reads on Darwin. An alternate approach was to add "sigNote" with a pipe-based implementation on Darwin, but that ultimately would have required at least one more linkname between runtime and syscall to avoid racing with fork and opening the pipe, so, not.

Fixes #62019.
Updates #61768.

Change-Id: I0e8dd4abf9a606a3ff73fc37c3bd75f55924e07e
Reviewed-on: https://go-review.googlesource.com/c/go/+/518836
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit c6ee8e3)
Reviewed-on: https://go-review.googlesource.com/c/go/+/519375
Reviewed-by: Austin Clements <austin@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
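Taken together, the shape of the fix described in these commit messages is: keep blocking profile-buffer reads everywhere except Darwin/iOS, and poll there instead. A hedged sketch (profBufBlocking and profBufNonBlocking echo the runtime's profile-buffer read modes; the helper and the polling interval are illustrative, not the actual CL):

```go
package main

import (
	"runtime"
	"time"
)

type profBufReadMode int

const (
	profBufBlocking profBufReadMode = iota
	profBufNonBlocking
)

// readChunk stands in for the runtime's profile-buffer read; it would
// return accumulated samples, or nothing if none are pending yet.
func readChunk(mode profBufReadMode) (data []uint64, eof bool) {
	return nil, true // placeholder
}

// readProfile picks the read mode the way the fix describes: blocking
// reads are fine elsewhere, but on Darwin (and iOS) the wakeup would
// have to come from the SIGPROF handler via a note, which is unsafe
// there, so the reader polls instead.
func readProfile() ([]uint64, bool) {
	mode := profBufBlocking
	if runtime.GOOS == "darwin" || runtime.GOOS == "ios" {
		mode = profBufNonBlocking
	}
	for {
		data, eof := readChunk(mode)
		if len(data) > 0 || eof || mode == profBufBlocking {
			return data, eof
		}
		time.Sleep(100 * time.Millisecond) // poll when non-blocking
	}
}

func main() { _, _ = readProfile() }
```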