Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: timeout tests hang on windows #5971

Closed
dvyukov opened this issue Jul 27, 2013 · 15 comments
Closed

net: timeout tests hang on windows #5971

dvyukov opened this issue Jul 27, 2013 · 15 comments
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Jul 27, 2013

It happens regularly after switching to integrated network poller on windows:
http://build.golang.org/log/6bf0e82c91268d1e8c899a6ba83a7d414b24783c
@dvyukov
Copy link
Member Author

dvyukov commented Jul 27, 2013

Comment 1:

I can reliably reproduce it on my 64-bit Windows7 machine if I set process affinity to 1
core in runtime.osinit. But can not reproduce on another 64-bit Windows7.
The bad news is that looks like kernel bug.
Why I think so (aside from I do not see a bug in our code):
- Whenever it happens, when the process exits it hangs in corrupted
state. I can not kill it by any means. It remains semi-alive with 1
thread blocked in kernel.
- When I tried to look at the semi-alive process with ProcExplorer it
caused BSOD several times.
Then a process exists Windows cancels all its outstanding IO requests.
So the hypothesis is that the IO request becomes corrupted and causes
some kind of kernel deadlock. When the kernel tries to cancel it
during process exit, that hangs as well.
I see that the timeout does fire -- the goroutine blocks in
runtime_pollWaitCanceled.
And it never unblocks, even if I increase the timeout.
The CPU is idle, so it's not a busy loop (and the busy loop would not
cause process exit failure).
If I add any logging into interesting paths the problem goes away.
Alex, you've said that the request eventually unblocks on your machine
if you increase the timeout. The hypothesis that your version of OS
may have some kind of internal timeout for request cancellation or
something like this. I.e. the OS eventually resolves the situation.
But this does not happen on my version of OS.
Another bad news is that I do not see any reasonable workaround. If my
hypothesis is correct, then the lockup is caused by concurrent IO
rediness notification and cancellation request. There is no way to
avoid that.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 28, 2013

Comment 2:

Now I strongly believe that this is a kernel bug.
Any attempts to log detailed info eliminate the bug. I've implemented fast in-memory
binary logging system, and the results confirm that the IO completion notification is
not sent for a cancelled IO operation.
I've also collected winsock kernel traces and they show the same.
Finally I've reproduced it in a small C windows program.
I've filed the bug with Microsoft Connect:
https://connect.microsoft.com/VisualStudio/feedback/details/795203/cancelioex-hangs-program#tabs
and posted to MSDN:
http://social.msdn.microsoft.com/Forums/en-US/bbc1e9eb-1f70-4064-ae6a-6fa470c80c27/cancelioex-hangs-program

@alexbrainman
Copy link
Member

Comment 3:

Dmitriy,
I tried this on our windows-amd64 builder:
c:\go\src\pkg\net>hg id
87fd884c846b+ tip
c:\go\src\pkg\net>hg diff
diff -r 87fd884c846b src/pkg/net/timeout_test.go
--- a/src/pkg/net/timeout_test.go       Mon Jul 29 01:25:51 2013 +0200
+++ b/src/pkg/net/timeout_test.go       Sun Jul 28 02:03:47 2013 +0000
@@ -423,8 +423,6 @@
        switch runtime.GOOS {
        case "plan9":
                t.Skipf("skipping test on %q", runtime.GOOS)
-       case "windows":
-               t.Skipf("skipping test on %q, see issue #5971", runtime.GOOS)
        }
        defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(maxProcs))
@@ -498,7 +496,7 @@
                                clientc <- copyRes{n, err, d}
                        }()
-                       const tooLong = 2000 * time.Millisecond
+                       const tooLong = 10000 * time.Millisecond
                        select {
                        case res := <-clientc:
                                if isTimeout(res.err) {
and test works fine (I tried it many times)
c:\go\src\pkg\net>go test -run=1
PASS
ok      net     13.250s
I can see CPU goes 100% during the test, so my explanation as "busy wait" and "timer
does not get fired" is still possible.
I can observe same behaviour on my pc that used to fail this test with wait=0 in
runtime.netpoll - and it still does if I set wait to 0. It also PASS if I extend tooLong
deadline in our test.
I could not reproduce the failure anywhere else, cpu affinity or not. Could you send me
your diff of "set cpu affinity" to see if I can use it.
I didn't see BSOD anywhere.
I also noticed our windows-amd64 builder OS has not been updated for a while (probably
never). Should we upgrade it to the latest version available?
I doubt you will get much help from Microsoft, but fingers crossed. Perhaps you know
someone who knows someone who might push your issue along.
Thank you for all you work.
Alex

@alexbrainman
Copy link
Member

Comment 4:

Something I found http://msdn.microsoft.com/en-us/windows/hardware/gg487388.aspx. So it
looks to me that driver could be at fault here.
Also http://msdn.microsoft.com/en-us/library/aa480216.aspx talks about "Driver Hang
Verifier" tool that you can run to test your driver.
Alex

@dvyukov
Copy link
Member Author

dvyukov commented Jul 29, 2013

Comment 5:

Hummm...
I've seen these articles. But loopback must be short circuited on TCP/IP level, driver
must not be involved.
I need to try to turn off my antivirus...
What you describe looks different from what I see...
Can you try to figure out where exactly it busy loops? A profiler can help, or a stack
dump, or logging. I don't know what will be simpler to do on windows.
Here is how I set affinity:
--- a/src/pkg/runtime/os_windows.c  Mon Jul 29 19:43:08 2013 +0400
+++ b/src/pkg/runtime/os_windows.c  Mon Jul 29 19:50:53 2013 +0400
@@ -36,6 +36,9 @@
 #pragma dynimport runtime·WriteFile WriteFile "kernel32.dll"
 #pragma dynimport runtime·NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll"
 
+#pragma dynimport runtime·SetProcessAffinityMask SetProcessAffinityMask "kernel32.dll"
+extern void *runtime·SetProcessAffinityMask;
+
 extern void *runtime·NtWaitForSingleObject;
 
 extern void *runtime·CloseHandle;
@@ -78,6 +81,7 @@
 void
 runtime·osinit(void)
 {
+   runtime·stdcall(runtime·SetProcessAffinityMask, 2, (uintptr)-1, (uintptr)1);
    // -1 = current process, -2 = current thread
    runtime·stdcall(runtime·DuplicateHandle, 7,
        (uintptr)-1, (uintptr)-2, (uintptr)-1, &m->thread,

@alexbrainman
Copy link
Member

Comment 6:

Dmitriy,
I have done some debugging here (not on windows-amd64 builder), and I see "timers are
late on firing" as I have seen before. If you apply attached diff to 620fea2e0220, you
will see some debug output as well as required source changes. Perhaps you can give me
pointers at what I could do next.
Alternatively you could try and debug the problem on our windows-amd64 builder, if you
like. Just let me know.
Thank you.
Alex

Attachments:

  1. a.diff (18183 bytes)

@dvyukov
Copy link
Member Author

dvyukov commented Jul 30, 2013

Comment 7:

Issues on my machine were caused by AVG antivirus.
Once I've removed it, the C program works w/o any problems.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 30, 2013

Comment 8:

Why do you say that timers do not fire, when the last log says that the timeout had
fired and the thread called CancelIoEx:
+             0: SetDeadline(1375158649929599400)
+    2105549600: WSARecv(fd=300): Wait: 5 err=i/o timeout
+    2105549600: WSARecv(fd=300): CancelIO: 1
+    2105549600: WSARecv(fd=300): CancelIO: 2
+    2105549600: WSARecv(fd=300): CancelIO: 3 err=Element not found.
+    2105549600: WSARecv(fd=300): CancelIO: 4
+    2105549600: WSARecv(fd=300): WaitCanceled: 1
Also in the trace the thread waits in runtime_pollWaitCanceled.
It means that the timeout had fired, but IO completion notification is not sent.

@alexbrainman
Copy link
Member

Comment 9:

The timer fired all right, but it is too late now. The test will only wait for 2s. Look
at the time it took to fire, it is way more then it was asked to wait.
Alex

@dvyukov
Copy link
Member Author

dvyukov commented Jul 30, 2013

Comment 10:

Ah, I see, the timeout has fired after 1.4 seconds in one test and after 2.1 seconds in
another.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 30, 2013

Comment 11:

So on what machines are you able to reproduce it? I am interested in OS version,
bitness, number of cores.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 30, 2013

Comment 12:

I can not reproduce the timeout on my machine with wait=0 and with/without affinity.
It would be great if you can give me temporal ssh access to the builder or any other
machine on which it reproduces.

@alexbrainman
Copy link
Member

Comment 13:

The pc I use here is "Windows 7 Professional" 64-bit running in vmware single CPU.
Alex

@dvyukov
Copy link
Member Author

dvyukov commented Aug 3, 2013

Comment 14:

I've got to the bottom of this.
Windows is so carried away watching how 2 goroutines enthusiastically exchange with
network packets, so it rejects to schedule timer thread for 2+ seconds. It's not that
the thread can not exitsyscall, it's not scheduled on CPU at all.
There are 4 preconditions:
1. Goroutines do heavy IO.
2. Dynamic priority boosting is turned on.
3. Server version of windows.
4. Windows is configured to do server-type scheduling.
When I turn off dynamic priority boosting, delay drops from 2+ seconds (thread
starvation prevention threshold) to 400ms (4 100ms quantums). If reconfigure scheduler
to do client-type scheduling, the delay further drops to 60ms (4 15ms quantums).
We can not change scheduler type in Go process, because it's system-wide setting.
However, we can change dynamic priority boosting, 400ms delay should be enough for this
test to pass. If not, we can make the timeout on windows somewhat longer. This should
not be a problem in real applications, because they usually do not consume 100% CPU
(it's mostly possible only on localhost); the problem is also tolerated by multicore
systems; and the timeout does not fire iff the application is doing useful work, in
which case delayed timeout is most likely not a problem (you want timeout iff there is
no data on the connection).
The fix is coming.

@dvyukov
Copy link
Member Author

dvyukov commented Aug 4, 2013

Comment 15:

This issue was closed by revision a574822.

Status changed to Fixed.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants