Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: failures in TestCtrlHandler with "could not read stdout: EOF" on windows-arm64 #49458

Closed
bcmills opened this issue Nov 8, 2021 · 25 comments
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor Author

bcmills commented Nov 8, 2021

Marking as release-blocker since this appears to be a very recent regression (CC @aclements @mknyszek).

@bcmills bcmills added arch-arm64 NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows labels Nov 8, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 8, 2021
@mknyszek
Copy link
Contributor

mknyszek commented Nov 8, 2021

CC @bufflig

@bufflig bufflig self-assigned this Nov 8, 2021
@bufflig
Copy link
Contributor

bufflig commented Nov 8, 2021

Yea, this looks new and bad. I'll take it.

@bufflig
Copy link
Contributor

bufflig commented Nov 11, 2021

@mknyszek - I wonder if you have thoughts on this?

It's reproducible beginning with the enabling of the new Pacer. It is fairly uncommon, but if you e.g. add a runtime.Gosched() after signal.Notify in the test program (which, of course should change nothing), it happens reliably but only on arm64.

Anyway, even when it happens, the signal walks through our handlers and we end up thinking we delivered the signal (sigsend in sigqueque.go), the sig.state is sigReceiving both when it works and when it doesn't reach the channel in the test program, and notewakeup is ending up in semawakeup. So everything happens as expected, but we never get anything on the channel (or the m does not get woken up). This happens with a low probability if the program is untouched, and a much higher probability if i yield after calling signal.Notify. If I disable the new pacer, I cannot get it to fail at all.

Could it be that the m does not get woken up in time? The process sits idle for a while and then it gets forcefully killed, which is also a little mysterious, but it does not seem to crash. It's possible taskkill eventually teminates it. Then the actual test fails as it gets EOF on the pipe.

@mknyszek
Copy link
Contributor

Hm... The only thing the new pacer changes is when GC's happen, basically (in theory). Does GODEBUG=gctrace=1 indicate a GC happening where it doesn't with the pacer off? The GC in general perturbs the scheduling of goroutines quite a bit.

@bufflig
Copy link
Contributor

bufflig commented Nov 12, 2021

You're spot on. That seems to be the problem. What happens is, I'm pretty sure, this:

  • When a "signal" arrives to our ConsoleCtrlHandler, and it is a signal that would terminate the process if we returned (which is exactly what we're sending here), we post the signal (through another p) and then the (arbitrary) thread that handled the signal goes to sleep (real sleep, like stdcall1(_Sleep, uintptr(_INFINITE)))
  • If a GC starts here and decides to preempt alll p's it will try to preempt the m that is now blocked by calling preemptM from the preeptone function.
  • But on arm64, preemptMSupported is not true, and there's a TODO: Implement call injection in the beginning of preemptM(). So, nothing happens, we continue to wait for non-preempted p:s to proclaim themselves idle forever
  • Meanwhile, the world is stopped, so the program is essentially one big idle-loop.

This happens when a GC is needed between the signal has arrived and the signal is processed by the little test program, and would not happen on amd64 as preemptM is implemented there. For unknown reasons, GC sometimes happen early and there'n no neeed for GC between the signal arrives and the program is terminated in a normal way, sometimes a GC is needed in the "bad" situation.

I see three solutions:

  1. just disable the test on arm64/windows until the TODO can be addressed
  2. Both disable the test and also make sure to not try to handle terminate-signals on arm64, as it will inevitably be flaky
  3. implement preemptM on arm64, which I'm unsure of how hard it would be

I suggest 2 for 1.18, 3 for 1.19. Any input is welcome.

@cherrymui
Copy link
Member

Implementing preemptM shouldn't be hard. But I agree we probably don't do it in the freeze.

@aclements
Copy link
Member

Somewhat related, I have this CL for preemption on windows/arm, which I apparently failed to get in again and which probably needs a non-trivial rebase at this point.

@bufflig
Copy link
Contributor

bufflig commented Nov 15, 2021

That is for 32bit arm though? I think the adding of arm64 conflicts with it, but as I just looked at the code, I can confirm that it would be a simple rebase. I could try to extend it to arm64. Still, I feel it would be best done in 1.19, but I could change my mind if you think it's better to do it now.

@aclements
Copy link
Member

Correct, that is for 32-bit ARM, but the code is probably quite similar for 64-bit ARM. I'm not advocating for implementing preemption for 1.19, since that's something that really benefits from some soak time. That said, I'm somewhat skeptical of things that really depend on asynchronous preemption for correctness (versus latency).

@bufflig
Copy link
Contributor

bufflig commented Nov 15, 2021

Yea, the whole SetConsoleCtrlHandler thing is somewhat sketchy as it needs to completely hang its thread to not immediately terminate the program. To handle i.e. ctrl-C, nothing like that is needed, but if we chose to handle these "termination" signals, we will get into trouble. I don't know if it's even a valuable functionality, but what do I know :)

@cherrymui
Copy link
Member

Would it be possible to just block the thread, instead of doing an infinite sleep? Or, drop the P before going to sleep?

@cherrymui
Copy link
Member

diff --git a/src/runtime/os_windows.go b/src/runtime/os_windows.go
index 648239fb36..5e50394d8b 100644
--- a/src/runtime/os_windows.go
+++ b/src/runtime/os_windows.go
@@ -1195,7 +1195,8 @@ func ctrlHandler(_type uint32) uintptr {
                if s == _SIGTERM {
                        // Windows terminates the process after this handler returns.
                        // Block indefinitely to give signal handlers a chance to clean up.
-                       stdcall1(_Sleep, uintptr(_INFINITE))
+                       block()
+                       //stdcall1(_Sleep, uintptr(_INFINITE))
                }
                return 1
        }

This seems to make the test pass reliably, without depending on async preemption.

But I'm not sure how safe it is. What if the handler runs when we are in some interesting state (e.g. no P, non-Go thread, or some critical section of the runtime)? Can it happen?

@mknyszek
Copy link
Contributor

mknyszek commented Nov 16, 2021

@cherrymui The ctrlHandler is always called on a new (non-Go) thread created by the OS https://docs.microsoft.com/en-us/windows/console/handlerroutine

@cherrymui
Copy link
Member

Thanks @mknyszek . Does it have a P? If so, why? If not, why we need to preempt it?

@bufflig
Copy link
Contributor

bufflig commented Nov 16, 2021

I made some printouts in the handler and in some other parts of the code, and it certainly appears to run on the same OS thread as some other stuff. The OS thread ID of the m it's signalling is sometimes the same as the thread executing the handler. Unless I messed it up in some way, of course. The thread is also often (but not always) associated with m0.

I'll try the block call, if my theories are correct, that should do it.

@mknyszek
Copy link
Contributor

Ah, I see. It's because we treat it like a Windows callback (via compileCallback). So it's going to call through callbackasm1 which calls into cgocallback, which I think will create a G and an M and acquire a P.

@mknyszek
Copy link
Contributor

So, that's my bad, in effect ctrlHandler should be executing as if in a completely normal Go context, I suppose.

@cherrymui
Copy link
Member

Okay, thanks! Yeah, if it is called via cgocallback, it would look like a regular Go function, so blocking the function probably will work.

@mknyszek
Copy link
Contributor

If I were to make a guess, when we switched to compileCallback here for 1.17 and the register ABI work, I think that we left this infinite sleep on a thread because it was once executing on a bare thread, so the call into the OS to sleep was really the only way. But now it's fully integrated with the scheduler.

I can dig through the history to confirm this is the case.

@bufflig
Copy link
Contributor

bufflig commented Nov 16, 2021

It appears that Cherry's suggestion fixes it, at least I can no longer reproduce. And it fits all the symptoms I've seen. The preemption on amd64 makes it work anyway, right?

@cherrymui
Copy link
Member

Yeah, async preemption probably makes it work on AMD64. It fails occasionally (with the original code) if I set GODEBUG=asyncpreemptoff=1.

@bufflig
Copy link
Contributor

bufflig commented Nov 16, 2021

Cool, thanks! I can make the cl, if that's fine with you (I need the practice :)).

@cherrymui
Copy link
Member

Sure, no problem :)

@gopherbot
Copy link

Change https://golang.org/cl/364556 mentions this issue: runtime: Make sure to properly park before going to sleep in Windows ConsoleControllHandler

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants