-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: Windows binaries built with -race occasionally deadlock #35775
Comments
I suspect that the same root cause is responsible for apparent deadlocks in the |
These failures are all either on the race builder or for tests with the race detector enabled. Seems like something makes deadlocks possible with the race detector. The findflakes programs reports that this might be due to https://golang.org/cl/204340 "runtime: support preemption on windows/{386,amd64}". |
That could be related to #18717 — does the runtime assume that its preemption signals are eventually delivered? |
It does not. It will always try both a cooperative and a non-cooperative preemption and take whichever one it can get. If the signal goes missing, it will wait for a cooperative preemption just like before. |
I keep seeing these as trybot flakes too: It'd be nice to see stacks, though. Maybe we could work on that first and it might yield a clue. |
I've had a tough time trying to reproduce this in a gomote to get a stack trace, but I was able to reproduce at least once (unfortunately, I didn't get a stack trace that time...). If it's not too hard, it might be easier to reproduce in CI and the trybots @bradfitz, though I'm not sure how best to do that. In particular, I've been targeting the "Testing race detector" tests that tend to be the source of failure as @ianlancetaylor mentioned and haven't been able to reproduce this failure on a gomote. Meanwhile, I'll keep trying. |
Alrighty thanks to @aclements and a Windows laptop we have a reproducer, a theory, and a partial fix. The problem is a race between Thread 1: Suspend (asynchronously) Thread 2. This race is already handled in the runtime for the usual exits by putting a lock around suspending a thread (and effectively disallowing it in certain cases, like exit), but in race mode Unfortunately this raises a bigger issue: what if C code, called from Go, calls |
Put a lock on each M? Lock the lock when the M is entering cgo call, and before SuspendThread on that M. So we won't suspend a thread running C code. This is fine as we won't preempt C code anyway. In fact, if we want to SuspendThread on an M for which the lock is locked, meaning that it is running C code, we don't need to suspend it anyway, so we don't need to wait on the lock. If the M is about to enter C, and the lock is locked, meaning that there is a pending suspend, it can just preempt itself. |
@cherrymui that solution makes sense, and would solve other issues such as sending signals to threads running C code resulting in surprising behavior like in #36281 (IIUC) if we used it on all platforms, not just Windows. |
I took a stab at fixing this using @cherrymui's suggestion (varied slightly after chatting in person so there's no actual blocking behavior, just a one-shot CAS). Unfortunately, in Windows, any "system call" can potentially call ExitProcess, so we need to defend against this in |
I also have make.bat reliably hanging on a Windows 7 computer here. I run make.bat (I use GOARCH=amd64) on top of
and this is how far it gets As you can see, there is stuck go_bootstrap.exe there. I used Delve to see what the process does, and this is what I see:
I also tried reproducing this on another Windows 7 (386 this time) running in VM, and I could not reproduce it. I am not sure what is special about this computer that it fails. But, I guess, if it is a race, then anything is possible. Anything can push it over the age. Also I can make.bat pass, if I set runtime.preemptMSupported to false. Let me know, if you want me to debug something. I can reproduce it pretty much every time. Sorry all this details are vague. That is only free time I had. Alex |
@alexbrainman, thanks for sharing the details you had time to collect. From the Delve output it looks like the process has a lot of threads, which would mean this is a different issue. Could you confirm that through some other process explorer and paste your report into a new issue if there's more than one thread in the hung process? |
For the record, this is what @mknyszek was using as a repro (based on dist's test):
We whittled this down to the following, which fails after a few minutes
|
Change https://golang.org/cl/213837 mentions this issue: |
I think we can unpin this now. |
Done. See #36492. Alex |
Starting in November, there appears to be a dramatic uptick in the number of test timeouts on the
windows-*
builders.Many of these are for tests that normally run nearly instantaneously, such as
archive/tar
andbufio
.2019-11-22T03:06:22-0e02cfb/windows-amd64-race
2019-11-21T22:20:17-94e9a5e/windows-amd64-race
2019-11-21T19:27:16-f4a8bf1/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-race
2019-11-21T16:56:47-37715cc/windows-amd64-longtest
2019-11-21T16:56:47-37715cc/windows-amd64-race
2019-11-21T16:01:14-c7e73ef/windows-amd64-race
2019-11-20T20:51:13-9852b4b/windows-amd64-race
2019-11-13T19:15:27-7ad2748/windows-amd64-longtest
2019-11-12T22:09:05-a56d755/windows-amd64-2016
2019-11-12T01:07:15-ec73263/windows-amd64-2012
2019-11-08T17:01:05-a5a6f61/windows-amd64-2012
2019-11-07T19:18:12-1b0b980/windows-amd64-2012
2019-11-07T05:52:34-3eabdd2/windows-amd64-longtest
2019-11-06T09:09:59-0c5d545/windows-amd64-2008
2019-11-06T02:52:51-f71bd51/windows-amd64-2016
2019-11-05T16:31:48-414c1d4/windows-amd64-2016
2019-11-05T14:44:56-e457cc3/windows-amd64-race
2019-11-05T05:19:08-d51f7f3/windows-amd64-longtest
2019-11-05T03:50:54-979d65d/windows-amd64-2016
2019-11-05T00:19:10-6cbd737/windows-amd64-race
2019-11-01T14:48:28-a570fcf/windows-amd64-2012
2019-03-19T08:30:50-451a2eb/windows-amd64-2008
2018-12-05T21:54:54-6454a09/windows-amd64-race
CC @ianlancetaylor @aclements @alexbrainman @zx2c4
The text was updated successfully, but these errors were encountered: