New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: TestLibraryCtrlHandler fails occasionally with register ABI enabled #45638
Comments
I was able to reproduce with |
I did rewrite the ctrlHandler runtime code recently in preparation for regabi, though not specifically in a regabi way. If you can repro, it would be worth doing a check around that commit to see if that was the cause. |
Looking at the build dashboard logs, it doesn't seem like this is regabi related. There's one other failure:
The fact that there are no other failures suggests that the CL @aclements mentioned is probably the culprit (https://golang.org/cl/309632). By the way, it's harder to reproduce than I thought. Background on the test:
Here's my working theory: As a result, the test now flakes a little bit. I will try to confirm this by adding further delay in initialization to see if the test failure reproduces more easily. |
I don't think my working theory is far off from the problem, but I don't fully understand anymore why the race pertains to this test. This test just makes sure that the C process receives the signal on its handler. It doesn't care about the Go handler at all. |
Print debugging isn't working because the initialization code doesn't seem to have access to stdout (or the equivalent) on Windows. I just tried switching the timeout in the test's C code from 5 seconds to Unfortunately when I did that I came back to my terminal filled with binary garbage followed by "fail" reported by the parent process. I suspect this problem goes a little deeper than I realize... |
For a taste, my terminal looks something like this:
except the garbage extends far beyond what my scrollback buffer can carry. I can reproduce this with the same frequency as the original failure. |
Something is very broken here. This kind of reads like something is trying to print (maybe the |
I tried to reproduce this issue on my computer, but I could not. Building programs (Go and especially C) is slow on Windows, so I modified this test to build everything once and then run executable in a loop. diff --git a/src/runtime/signal_windows_test.go b/src/runtime/signal_windows_test.go
index d0f61fb6c2..ecacafd9d2 100644
--- a/src/runtime/signal_windows_test.go
+++ b/src/runtime/signal_windows_test.go
@@ -170,8 +170,14 @@ func TestLibraryCtrlHandler(t *testing.T) {
t.Fatalf("failed to build c exe: %s\n%s", err, out)
}
+ for i := 0; i < 10000; i++ {
+ testLibraryCtrlHandler(t, exe)
+ }
+}
+
+func testLibraryCtrlHandler(t *testing.T, exe string) {
// run test program
- cmd = exec.Command(exe)
+ cmd := exec.Command(exe)
var stderr bytes.Buffer
cmd.Stderr = &stderr
outPipe, err := cmd.StdoutPipe() And my test succeeds
Alex |
Change https://golang.org/cl/313350 mentions this issue: |
Change https://golang.org/cl/313351 mentions this issue: |
@alexbrainman |
Ah, nevermind, I did not look closely enough at your code. My bad! Huh, what version of Windows are you running? Maybe it's a bug specific to the |
3 consecutive runs with @alexbrainman's patch from #45638 (comment) came up clean. That's... interesting. |
Wow, I spoke too soon. It literally JUST failed. Once out 30000 runs, roughly. |
Interestingly, this is how it failed:
Again with the terminal garbage, but it's very interesting to see |
I have a suspicion that The |
Er, actually I don't see why this would be a problem. The Windows docs claim (https://docs.microsoft.com/en-us/windows/console/handlerroutine) that a new thread is created to handle these signals, so actually |
Here's a theory about how the above messages are getting produced: something causes us to get into an exception that lands on a thread that has GS set (where we put TLS on Windows). That exception tries to print the |
If that theory is true, then it raises three questions:
|
Good news: in the current failure mode I can actually print out things when the DLL is getting initialized. |
CL 211139 added TestLibraryCtrlHandler. But the CL left out import "C" line in the test file that is supposed to be build with Cgo. While debugging issue #45638, I discovered that the DLL built during TestLibraryCtrlHandler does not have Dummy function. Adding import "C" makes Dummy function appear in DLL function list. TestLibraryCtrlHandler does not actually calls Dummy function. So I don't see how this change affects issue #45638, but still let's make this code correct. Updates #45638 Change-Id: Ibab8fed29ef2ae446d0815842cf0bd040a5fb943 Reviewed-on: https://go-review.googlesource.com/c/go/+/313350 Trust: Alex Brainman <alex.brainman@gmail.com> Run-TryBot: Alex Brainman <alex.brainman@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
While debugging issue #45638, I discovered that some tests were using --buildmode command line parameter instead of -buildmode. The --buildmode parameter is handled properly by the flag package - it is read as -buildmode. But we should correct code anyway. Updates #45638 Change-Id: I75cf95c7d11dcdf4aeccf568b2dea77bd8942352 Reviewed-on: https://go-review.googlesource.com/c/go/+/313351 Trust: Alex Brainman <alex.brainman@gmail.com> Run-TryBot: Alex Brainman <alex.brainman@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
I just run my test from #45638 (comment) on current tip. It failed once:
and succeeded 3 times.
The test still succeeds. But that gives me an idea. If you modify the test to use Unfortunately I am not familiar with debuggers, so I cannot help you, but surly we can find help with debugging. You can use Delve debugger on Windows - I am not sure how good it is with Cgo program. Or you can use WindDbg - it is used by real Windows developers (not like me) - you should be able to do anything with it, except it does not understand Go program symbols. Or you can use gdb.
It certainly can be. Do not forget that we are testing C program with Go DLL (built with Cgo) here. And you need gcc to build these. Windows does not come with gcc installed. so whatever gcc you have installed on the builder is what you use. So you would have to account for the fact that this bug could be in gcc tools or runtime. And this builder, I suspect, have very old version of gcc. For example, see #35006 and #43195. I use
here. I suspect this gcc is different from what is installed on builder. And the fact that test failed once here (see above), probably means it is not gcc bug.
This was, probably, generated by gcc linker, not Go linker. But otherwise, I have not comment.
Unfortunately am not familiar enough with Go runtime and gcc runtime to speculate. But that sounds possible. There a 2 control handlers installed by SetConsoleCtrlHandler - one by gcc runtime and another by Go runtime. And there are 2 exception handlers installed - one by gcc runtime and another by Go runtime. So it can get messy. Alex |
The test failed again just as I submitted previous message. Unfortunately this time the output is huge, so I am providing the end of it.
Alex |
Another avenue to explore here (I don't have time anymore) is to try to run If Go control handler code is broken, then this test might break too. Alex |
/cc @bufflig |
Ah, this may be a PE vs. ELF thing? I don't see Windows runtime code relying on a variable being initialized anywhere. |
Yeah... If I print the length of |
Coming back to this... I think my theory is subtly wrong. According to the MSDN, control handlers are always called on a new thread (https://docs.microsoft.com/en-us/windows/console/console-control-handlers). That thread should be distinct from the runtime initialization thread, and so shouldn't impede progress in the runtime, presumably. That thread will just spin in The only thing preventing progress though, is that check of |
Well, hey I learned one new thing. There should be a I'm going to try to see where that signal is coming from. I have access to the instruction pointer in that context, it's just a question of how to actually surface it. |
The PC that triggers the failure isn't one that's known to Go (and it's consistently the same one). If I were to guess, it's probably a Windows DLL (win32?). |
Change https://golang.org/cl/316809 mentions this issue: |
The saga continues. I fixed the garbage-getting-printed issue, but there's still that pesky signal firing (FWIW it's an access violation signal, so the equivalent of a segfault, and to catch you up, the PC that triggered it is NOT in the Go binary). The good news is, I've narrowed the source a little further. Namely, if I eliminate the call to Now what I did was I kept the print and the I'm thinking now that there's, very simply (back to the very very beginning of this thread), a race between At this point I have no idea how to proceed. My CL above is a reasonable workaround. I can also make it so that the test binary calls @ianlancetaylor @alexbrainman WDYT? |
Currently if a signal lands on a non-Go thread that's handled by the Go handler, Go will emit a message. However, unlike everywhere else in the runtime, Go will not abort the process after, and the signal handler will try to continue executing. This leads to cascading failures and possibly even memory corruption. For #45638. Change-Id: I546f4e82f339d555bed295528d819ac883b92bc6 Reviewed-on: https://go-review.googlesource.com/c/go/+/316809 Trust: Michael Knyszek <mknyszek@google.com> Run-TryBot: Michael Knyszek <mknyszek@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>
I tried running test on top of 1108cbe and test failed once and I still see garbage printed:
Is this garbage expected? I also noticed that this message would only be printed, if runtime.sigtramp called and it could not find g. The fact that runtime.sigtramp is called means we received an exception. It would be nice to see stack trace printed. Unfortunately runtime.sigtramp end-up calling runtime.badsignal2 instead. Perhaps we can disable runtime.badsignal2 check and hope stack trace is printed. Another option would be run test program inside a debugger. Debugger will stop when exception. Maybe we can see the program state then. It would be nice to see exactly where exception occurs. Perhaps your https://go-review.googlesource.com/c/go/+/315830/ fixes this error, and is good enough.
That means that your CL 315830 could not be a solution here. Correct?
What particular race are you thinking about? You are allowed to call SetConsoleCtrlHandler API many times - that is how the API is designed. In fact I am pretty sure our test calls SetConsoleCtrlHandler twice - Go calls it, and C runtime also calls it. Perhaps the problem is that Go DLL does not remove SetConsoleCtrlHandler handler before it exits? And Windows calls the handler address when Go runtime is already off. Similarly SetConsoleCtrlHandler is supposed to handle CTRL_CLOSE_EVENT. Perhaps we get that message (I don't see how that happens) after Go DLL runtime is off, but Windows SetConsoleCtrlHandler handler still points to Go code. I plan to play with this more on the weekend. Let's delay the decision. Also, perhaps @zx2c4 can help debug this, if you have time. Thank you. Alex |
It's not, that's very surprising. It's also surprising that you landed in badsignal2 more than once -- all the failures I got after that CL were clean. I'll try to reproduce this again.
Unfortunately I don't think that'll work. There's a lot of stuff that depends on having some kind of G (including a
I've been putting this aside because I don't have an environment where I can easily attach a debugger and follow subprocesses. The Windows world feels alien to me. I don't have high confidence in Delve handling this very specific situation very well, but I can try.
What the commit message of that CL describes as the problem is incorrect, yes. However, it's clear that calling
I don't have a specific one in mind, because I don't have any documentation on
Interesting, I hadn't thought about a situation where runtime initialization is complete. What do you mean by when the Go runtime is already "off"? Do you mean the rest of the process is already exiting?
Thank you so much for taking the time to look at this! |
badsignal2 calls abort at the end now. And abort starts with I managed to reproduce this error just twice. But I did not have debugger ready at that time. I will try again next week. Alex |
Huh. I figured abort meant something closer to Should this just |
I don't know enough about
I think we should do that. We should also print more exception details, if possible. See Alex |
I found a computer (with Windows 7) where this issue reproduces much more easily than on my Windows 10 computer. Unfortunately I can reproduce this issue even on top of https://go-review.googlesource.com/c/go/+/315830/ I will try to understand where the issue occurs, as promised. No success guraneteed. Alex |
@mknyszek Checking in from a release meeting, are there updates here? Thanks. |
@dmitshur I have been working on this issue. I think the problem here is with the For example, the test was introduced together with the fix in CL 211139. But the test still passes, if I revert to commit before CL 211139.
The go/src/runtime/testdata/testwinlibsignal/main.c Lines 35 to 42 in 831573c
But there is nothing in the test to verify that So we need to replace the test first. And then we should be able to tell, if we have a problem or not. It is pretty hard to build proper test here. But I think I am making progress. I will report more when I am ready with the test. I hope it helps. Alex |
This should be very easy to fix. It's true that Go initialization is done asynchronously on another thread, so this test races. (Side note: even if it races, it shouldn't spew garbage like above.) But every cgo entry point calls |
Change https://golang.org/cl/321769 mentions this issue: |
@alexbrainman https://go-review.googlesource.com/c/go/+/321769 passes tests and should now fix the races with the tests you mentioned. I haven't read all of this thread in detail, so I'm not sure if this is all that's required to close the issue. But at the very least, you can +2 that and we'll at least have a fixed test. However, the garbage text you're getting is a bit disturbing, and maybe we should look more deeply into that after you review my test fix. Specifically: if a signal arrives before the Go runtime is finished initializing, why are we spewing garbage rather than doing something polite? Do we have better options? |
I added the call to Maybe the path forward here is to fix the test, and then file a separate issue for the garbage being spewed due to the race between the signal landing and initialization completing. Then again, the garbage spewing might just be fixed by calling |
Adding a call to
Or you could just re-title this issue if you wanted? AFAICT, https://go-review.googlesource.com/c/go/+/321769 is the correct fix for the actual test failing, but indeed, I agree with you, that the garbage spewing is the lingering issue that definitely deserves attention. |
The test harness waits for "ready" as a sign that the Go runtime has installed its signal handler and is ready to be tested. But actually, while LoadLibrary starts the loading of the Go runtime, it does so asynchronously, so the "ready" sign is potentially premature and certainly racy. However, all exported cgo entry points make a call to _cgo_wait_runtime_init_done which waits for that asynchronous initialization to complete. Therefore, this commit fixes the test to call into the exported "Dummy" cgo function before emitting the "ready" sign, so that we're sure the Go runtime is actually loaded. Updates #45638. Change-Id: I9b12b172d45bdcc09d54dd301de3a3e499544834 Reviewed-on: https://go-review.googlesource.com/c/go/+/321769 Trust: Jason A. Donenfeld <Jason@zx2c4.com> Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
Change https://golang.org/cl/321789 mentions this issue: |
I think we can close this issue because https://go-review.googlesource.com/c/go/+/321769 fixes the The test fails if I revert https://go-review.googlesource.com/c/go/+/211139/ But on current tip I could not make test fail in any way on few different systems. Alex |
@alexbrainman Sounds good. I'll file a new issue for the garbage we saw getting generated. |
Found a failure on the
windows-amd64-regabi
bot:Seems like a flake, but we should look into it.
The text was updated successfully, but these errors were encountered: