Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime,x/sys/windows: panic accessing (possibly) nil field of net_op #58870

Closed
hawkinsw opened this issue Mar 5, 2023 · 44 comments
Closed

runtime,x/sys/windows: panic accessing (possibly) nil field of net_op #58870

hawkinsw opened this issue Mar 5, 2023 · 44 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. OS-Windows
Milestone

Comments

@hawkinsw
Copy link
Contributor

hawkinsw commented Mar 5, 2023

First of all, thank you all for everything that you do!

What version of Go are you using (go version)?

$ go version
go version go1.20.1 windows/amd64

Does this issue reproduce with the latest release?

No, but reproducing the issue is timing dependent so my inability to reproduce seems to be the result of a different order of scheduling timeouts, etc.

What operating system and processor architecture are you using (go env)?

On Windows on amd64 hardware.

What did you do?

Using windows.WSAIoctl on an overlapped socket with a "zero" values in the windows.Overlapped struct will cause a panic in netpoll_windows.go at line 134 (as of the version of the code tagged at version 1.20.1) where the runtime attempts to dereference op.pd without checking whether it is nil.

The problem is that the runtime does not expect that something other than itself (i.e., the runtime) will add items to the list of operations that could be returned by a call to GetQueuedCompletionStatusEx. The comment

	op       *net_op // In reality it's *overlapped, but we cast it to *net_op anyway.

is telling.

Unfortunately, we cannot guarantee that is the case.

If an application added something to that list (e.g., through a use of windows.WSAIoctl) that does not complete before the scheduler does another "round" of netpoll, it is possible that op is not actually a net_op. If the application user does "the right thing" by passing a pointer to a "zerod" windows.Overlapped structure, the runtime will break.

There was a related issue from several years earlier that @bradfitz found and fixed: #21172

This issue seems to be related.

I have two one-line patches to the sys and runtime packages that (I believe) provide a simple fix for the issue. I wanted to write this issue before posting them to gerrit so that there is something to reference. I will update this report with those patchsets.

See

sys/windows: Pad windows.Overlapped to protect the runtime | https://go-review.googlesource.com/c/sys/+/473163

and

473163: sys/windows: Pad windows.Overlapped to protect the runtime | https://go-review.googlesource.com/c/sys/+/473163

@panjf2000 panjf2000 added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Mar 5, 2023
@panjf2000 panjf2000 added this to the Backlog milestone Mar 5, 2023
@panjf2000 panjf2000 added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 5, 2023
@seankhliao seankhliao changed the title runtime and sys/windows: panic accessing (possibly) nil field of net_op runtime,x/sys/windows: panic accessing (possibly) nil field of net_op Mar 6, 2023
@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 6, 2023

Per @ianlancetaylor 's request, here is information about how to reproduce this error:

Using the code at

https://github.com/hawkinsw/windowpane
...

here is the panic that results

[signal 0xc0000005 code=0x0 addr=0x8 pc=0x39fde8]

runtime stack:
runtime.throw({0x487045?, 0x0?})
        C:/Users/whh8b/Code/Code/goroot/src/runtime/panic.go:1047 +0x65 fp=0xc90e9ff168 sp=0xc90e9ff138 pc=0x3a5aa5
runtime.sigpanic()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/signal_windows.go:251 +0x265 fp=0xc90e9ff1b0 sp=0xc90e9ff168 pc=0x3b9325
runtime.netpoll(0xdae957c5783c?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/netpoll_windows.go:134 +0x2a8 fp=0xc90e9ffa38 sp=0xc90e9ff1b0 pc=0x39fde8        
runtime.findRunnable()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:2971 +0x76a fp=0xc90e9ffb40 sp=0xc90e9ffa38 pc=0x3ad6ea
runtime.schedule()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:3360 +0xb1 fp=0xc90e9ffb78 sp=0xc90e9ffb40 pc=0x3ae871
runtime.goexit0(0xc0000571e0)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:3687 +0x293 fp=0xc90e9ffbb8 sp=0xc90e9ffb78 pc=0x3af813
runtime.mcall()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:452 +0x48 fp=0xc90e9ffbc8 sp=0xc90e9ffbb8 pc=0x3d0fc8

goroutine 1 [select]:
runtime.gopark(0xc00066fe58?, 0x2?, 0x0?, 0x0?, 0xc00066fe1c?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00066fca0 sp=0xc00066fc80 pc=0x3a8636
runtime.selectgo(0xc00066fe58, 0xc00066fe18, 0x480db7?, 0x0, 0xc00066fe48?, 0x1)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/select.go:327 +0x7be fp=0xc00066fde0 sp=0xc00066fca0 pc=0x3b6b9e
main.(*TransparentClientSender).Write(0xc0000080d8, {0xc000180000, 0x4c4b40, 0x4c4b40})
        C:/Users/whh8b/Code/Code/windowpane/cmd/client/client.go:42 +0x23e fp=0xc00066fe88 sp=0xc00066fde0 pc=0x44de9e
main.main()
        C:/Users/whh8b/Code/Code/windowpane/cmd/client/client.go:81 +0x305 fp=0xc00066ff80 sp=0xc00066fe88 pc=0x44e425
runtime.main()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:250 +0x1f7 fp=0xc00066ffe0 sp=0xc00066ff80 pc=0x3a8217
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00066ffe8 sp=0xc00066ffe0 pc=0x3d32c1

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000059fb0 sp=0xc000059f90 pc=0x3a8636
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:387
runtime.forcegchelper()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:305 +0xb2 fp=0xc000059fe0 sp=0xc000059fb0 pc=0x3a8452
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000059fe8 sp=0xc000059fe0 pc=0x3d32c1
created by runtime.init.6
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:293 +0x25

goroutine 3 [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00005bf80 sp=0xc00005bf60 pc=0x3a8636
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:387
runtime.bgsweep(0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgcsweep.go:319 +0xde fp=0xc00005bfc8 sp=0xc00005bf80 pc=0x392e1e
runtime.gcenable.func1()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:178 +0x26 fp=0xc00005bfe0 sp=0xc00005bfc8 pc=0x388206
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x3d32c1
created by runtime.gcenable
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:178 +0x6b

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00002c1c0?, 0x4a6380?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00006bf70 sp=0xc00006bf50 pc=0x3a8636
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:387
runtime.(*scavengerState).park(0x55e9e0)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgcscavenge.go:400 +0x53 fp=0xc00006bfa0 sp=0xc00006bf70 pc=0x390cf3
runtime.bgscavenge(0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgcscavenge.go:633 +0x65 fp=0xc00006bfc8 sp=0xc00006bfa0 pc=0x391305
runtime.gcenable.func2()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:179 +0x26 fp=0xc00006bfe0 sp=0xc00006bfc8 pc=0x3881a6
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x3d32c1
created by runtime.gcenable
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:179 +0xaa

goroutine 5 [finalizer wait]:
runtime.gopark(0x3a89b2?, 0x3a8385?, 0x0?, 0x0?, 0xc00005df70?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00005de28 sp=0xc00005de08 pc=0x3a8636
runtime.runfinq()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mfinal.go:193 +0x107 fp=0xc00005dfe0 sp=0xc00005de28 pc=0x387267
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005dfe8 sp=0xc00005dfe0 pc=0x3d32c1
created by runtime.createfing
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mfinal.go:163 +0x45

goroutine 34 [GC worker (idle)]:
runtime.gopark(0xdae957d59d0c?, 0xc00006dfd0?, 0xb1?, 0x3e?, 0xc00006dfb0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00006df50 sp=0xc00006df30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc00006dfe0 sp=0xc00006df50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 7 [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000067f50 sp=0xc000067f30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc000067fe0 sp=0xc000067f50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 18 [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000653f50 sp=0xc000653f30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc000653fe0 sp=0xc000653f50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000653fe8 sp=0xc000653fe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 19 [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000655f50 sp=0xc000655f30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc000655fe0 sp=0xc000655f50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000655fe8 sp=0xc000655fe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 20 [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00064ff50 sp=0xc00064ff30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc00064ffe0 sp=0xc00064ff50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00064ffe8 sp=0xc00064ffe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 21 [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000651f50 sp=0xc000651f30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc000651fe0 sp=0xc000651f50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000651fe8 sp=0xc000651fe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 8 [GC worker (idle)]:
runtime.gopark(0xdae957d59d0c?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc000069f50 sp=0xc000069f30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc000069fe0 sp=0xc000069f50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 35 [GC worker (idle)]:
runtime.gopark(0xdae957d59d0c?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:381 +0xd6 fp=0xc00008ff50 sp=0xc00008ff30 pc=0x3a8636
runtime.gcBgMarkWorker()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1275 +0xf1 fp=0xc00008ffe0 sp=0xc00008ff50 pc=0x389db1
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00008ffe8 sp=0xc00008ffe0 pc=0x3d32c1
created by runtime.gcBgMarkStartWorkers
        C:/Users/whh8b/Code/Code/goroot/src/runtime/mgc.go:1199 +0x25

goroutine 22 [runnable]:
runtime.Gosched(...)
        C:/Users/whh8b/Code/Code/goroot/src/runtime/proc.go:321
github.com/hawkinsw/windowpane/window.getTCPInfoRaw.func1.1()
        C:/Users/whh8b/Code/Code/windowpane/window/window_windows.go:103 +0x25 fp=0xc00008bfe0 sp=0xc00008bfc8 pc=0x44db45
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x3d32c1
created by github.com/hawkinsw/windowpane/window.getTCPInfoRaw.func1
        C:/Users/whh8b/Code/Code/windowpane/window/window_windows.go:101 +0xdf

goroutine 9 [runnable]:
main.(*TransparentClientSender).Write.func1()
        C:/Users/whh8b/Code/Code/windowpane/cmd/client/client.go:28 fp=0xc00008dfe0 sp=0xc00008dfd8 pc=0x44dfc0
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00008dfe8 sp=0xc00008dfe0 pc=0x3d32c1
created by main.(*TransparentClientSender).Write
        C:/Users/whh8b/Code/Code/windowpane/cmd/client/client.go:28 +0x1f2

I am more than happy to create something that is more compact as a reproducer but I wanted to send you something quickly to demonstrate that it is a real, actual problem and not hypothetical.

I hope that this information is helpful!
Will

@ianlancetaylor
Copy link
Contributor

Thanks. Evidently you can reproduce this problem, but I don't understand it. The docs for GetQueuedCompletionStatusEx say that it returns entries for operations associated with the completion port handle that is passed in. Your call to WASIoctl is not using that completion port. So what is being returned by GetQueuedCompletionStatusEx?

CC @golang/windows

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 6, 2023

I hope that what I am about to say is helpful and not something that is obvious to wizards like you. If it is obvious and does not address your point, I apologize!

It's a tortuous path, but I will try to lay it out:

  1. For a call to net.Dial attempting to establish a TCP connection, we eventually reach
  2. sysDialer.dialTCP, which eventually leads to
  3. internetSocket which calls
  4. socket which calls
  5. netFD.dial where

we do

	err = fd.pd.init(fd)

permalink

which (eventually) leads to netpollopen in netpoll_windows.go where the fd is added to iocphandle.

I know that's circuitous, but that's the way that I see it. I hope that helps!

More importantly, I hope that I was answering your question and not telling you something that you already knew (you know this code base far better than this dodo).

Thanks again for everything!
Will

@ianlancetaylor
Copy link
Contributor

Thanks, that part I knew. But maybe I see what I am missing. I think you are suggesting that if a file descriptor has been associated with the netpoll completion port, then any other operation on that file can cause that completion port to return the descriptor, even if the operation in question does not refer to the completion port. That seems strange to me, but I don't know much about Windows.

Anyhow, I think it's best that this CL be reviewed by someone familiar with Windows, so I'll simply say that I have no further objections to it. Thanks.

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 6, 2023

Thanks, that part I knew. But maybe I see what I am missing. I think you are suggesting that if a file descriptor has been associated with the netpoll completion port, then any other operation on that file can cause that completion port to return the descriptor, even if the operation in question does not refer to the completion port. That seems strange to me, but I don't know much about Windows.

That is my understanding as well! It does seem odd to me, but it seems like any operation that completes on an fd (even if it was not started in a way that is associated with that completion port) can be noticed by anyone with a completion port for that fd.

Anyhow, I think it's best that this CL be reviewed by someone familiar with Windows, so I'll simply say that I have no further objections to it. Thanks.

No, thank you! Is there someone else that I should add to the reviewers on the CLs to take the load off you?

@ianlancetaylor
Copy link
Contributor

We've already CC'ed @golang/windows here so we'll see if someone picks it up.

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 7, 2023

From the description of CreateIoCompletionPort

Associating an instance of an opened file handle with an I/O completion port allows a process to receive notification of the completion of asynchronous I/O operations involving that file handle.

and

It is best not to share a file handle associated with an I/O completion port by using either handle inheritance or a call to the DuplicateHandle function. Operations performed with such duplicate handles generate completion notifications. Careful consideration is advised.

To me that indicates the behavior we are seeing is entirely rational and to be expected. In other words, it seems like we will have to provide a fix somehow.

In addition to the proposed patch(es), I found some additional information that might give us another opportunity for solid fix without impacting how users interact with the runtime:

From the documentation for GetQueuedCompletionStatus(Ex):

Even if you have passed the function a file handle associated with a completion port and a valid OVERLAPPED structure, an application can prevent completion port notification. This is done by specifying a valid event handle for the hEvent member of the OVERLAPPED structure, and setting its low-order bit. A valid event handle whose low-order bit is set keeps I/O completion from being queued to the completion port.

Given that information, we could provide the user with a function to create a quote-unquote windows.Overlapped that wraps a "real" windows.Overlapped. The wrapped structure would meet the specifications above (so that there is no overlap -- pun intended -- with the events that the runtime thinks are completing and the ones that the user thinks are completing). The wrapped structure would then dispatch to the fake windows.Overlapped that the user actually provided in a way that meets the specifications in the documentation.

It seems like either of these two options would work.

I hope that the documentation above is helpful! Please let me know how else I can help and how I can fixup the CL to be more useful!
Will

(Edited to include a second quote from the CreateIoCompletionPort documentation.)

hawkinsw added a commit to hawkinsw/windowpane that referenced this issue Mar 10, 2023
@hawkinsw
Copy link
Contributor Author

FYI: I have created a POC for this bug and to demonstrate how file can have "spooky action at a distance": https://github.com/hawkinsw/windowpane/tree/main/bug_poc.

I hope that this helps. Please, please let me know if/how I can help!

@mknyszek
Copy link
Contributor

In triage now, it would be really useful to get some input from someone in @golang/windows on this.

@qmuntal
Copy link
Contributor

qmuntal commented Mar 15, 2023

I'll take a look at this. Nothing to say for the moment other than the first step is to get a reproducer than can fit into the Go repo. I can see how this is difficult to reproduce from outside the Go runtime, I'll try to prepare for runtime or internal/poll.

@hawkinsw
Copy link
Contributor Author

Please let me know if/how I can help! The repository that I posted has "reproducing" code written directly using the win32 APIs. It at least demonstrates that the behavior is real. I hope that the patch is helpful, too. It seems to do everything that needs to be done.

@qmuntal
Copy link
Contributor

qmuntal commented Mar 17, 2023

I've been able to reproduce some interference between a user-space WASIoctl and the GetQueuedCompletionStatusEx called from the Go runtime, leading to the same crash described above.

I still don't understand which is the root cause. CL 473515 seems to help in some cases but the runtime still crashes occasionally, I think it just hides the issue and move the crash even farther.

Test code:

package poll_test

import (
	"errors"
	"internal/poll"
	"internal/syscall/windows"
	"syscall"
	"testing"
	"unsafe"
)

func TestWSASocketConflict(t *testing.T) {
	s, err := windows.WSASocket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP, nil, 0, windows.WSA_FLAG_OVERLAPPED)
	if err != nil {
		t.Fatal(err)
	}
	fd := poll.FD{Sysfd: s, IsStream: true, ZeroReadIsEOF: true}
	_, err = fd.Init("tcp", true)
	if err != nil {
		syscall.CloseHandle(s)
		t.Fatal(err)
	}
	defer fd.Close()

	const SIO_TCP_INFO = syscall.IOC_INOUT | syscall.IOC_VENDOR | 39
	inbuf := uint32(0)
	var outbuf _TCP_INFO_v0
	cbbr := uint32(0)
	err = fd.WSAIoctl(
		SIO_TCP_INFO,
		(*byte)(unsafe.Pointer(&inbuf)),
		uint32(unsafe.Sizeof(inbuf)),
		(*byte)(unsafe.Pointer(&outbuf)),
		uint32(unsafe.Sizeof(outbuf)),
		&cbbr,
		&syscall.Overlapped{},
		0,
	)
	if err != nil && !errors.Is(err, syscall.ERROR_IO_PENDING) {
		t.Fatal(err)
	}
}

type _TCP_INFO_v0 struct {
	State             uint32
	Mss               uint32
	ConnectionTimeMs  uint64
	TimestampsEnabled bool
	RttUs             uint32
	MinRttUs          uint32
	BytesInFlight     uint32
	Cwnd              uint32
	SndWnd            uint32
	RcvWnd            uint32
	RcvBuf            uint32
	BytesOut          uint64
	BytesIn           uint64
	BytesReordered    uint32
	BytesRetrans      uint32
	FastRetrans       uint32
	DupAcksIn         uint32
	TimeoutEpisodes   uint32
	SynRetrans        uint8
}

Panic:

=== RUN   TestWSASocketConflict
fatal error: unexpected signal during runtime execution
--- PASS: TestWSASocketConflict (0.00s)
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x12171e]

runtime stack:
runtime.throw({0x24033c?, 0x0?})
        runtime/panic.go:1076 +0x65 fp=0x543bdff148 sp=0x543bdff118 pc=0x127645
runtime.sigpanic()
        runtime/signal_windows.go:345 +0x265 fp=0x543bdff190 sp=0x543bdff148 pc=0x13c065
runtime.netpoll(0x8ff77537ce84?)
        runtime/netpoll_windows.go:138 +0x29e fp=0x543bdffa18 sp=0x543bdff190 pc=0x12171e
runtime.findRunnable()
        runtime/proc.go:2960 +0x76a fp=0x543bdffb20 sp=0x543bdffa18 pc=0x12ee8a
runtime.schedule()
        runtime/proc.go:3349 +0xb1 fp=0x543bdffb58 sp=0x543bdffb20 pc=0x12ff31
runtime.park_m(0xc0000589c0?)
        runtime/proc.go:3500 +0x119 fp=0x543bdffb88 sp=0x543bdffb58 pc=0x1303f9
runtime.mcall()
        runtime/asm_amd64.s:455 +0x4f fp=0x543bdffba0 sp=0x543bdffb88 pc=0x1568ef

goroutine 1 [runnable]:
runtime.gopark(0xc000119a28?, 0xfca68?, 0x20?, 0x81?, 0x18?)
        runtime/proc.go:381 +0xce fp=0xc0001199c0 sp=0xc0001199a0 pc=0x12a02e
runtime.chanrecv(0xc0000242a0, 0xc000119aa7, 0x1)
        runtime/chan.go:583 +0x3d0 fp=0xc000119a38 sp=0xc0001199c0 pc=0xf5d70
runtime.chanrecv1(0x33abc0?, 0x20ee80?)
        runtime/chan.go:442 +0x12 fp=0xc000119a60 sp=0xc000119a38 pc=0xf5992
testing.(*T).Run(0xc000059380, {0x239ebc?, 0x15d76d?}, 0x243ab8)
        testing/testing.go:1649 +0x3c8 fp=0xc000119b20 sp=0xc000119a60 pc=0x1bf828
testing.runTests.func1(0x33c280?)
        testing/testing.go:2055 +0x3e fp=0xc000119b70 sp=0xc000119b20 pc=0x1c189e
testing.tRunner(0xc000059380, 0xc000119c88)
        testing/testing.go:1595 +0xff fp=0xc000119bc0 sp=0xc000119b70 pc=0x1be9bf
testing.runTests(0xc00007c320?, {0x3355a0, 0xd, 0xd}, {0xc00000db40?, 0x100c000119d10?, 0x33ad20?})
        testing/testing.go:2053 +0x445 fp=0xc000119cb8 sp=0xc000119bc0 pc=0x1c1785
testing.(*M).Run(0xc00007c320)
        testing/testing.go:1925 +0x636 fp=0xc000119f00 sp=0xc000119cb8 pc=0x1c01b6
main.main()
        _testmain.go:73 +0x19c fp=0xc000119f80 sp=0xc000119f00 pc=0x1ffcdc
runtime.main()
        runtime/proc.go:250 +0x1f2 fp=0xc000119fe0 sp=0xc000119f80 pc=0x129c52
runtime.goexit()
        runtime/asm_amd64.s:1619 +0x1 fp=0xc000119fe8 sp=0xc000119fe0 pc=0x1588e1

exit status 2
FAIL    internal/poll   0.212s

@hawkinsw
Copy link
Contributor Author

Thank you! If you could help me by telling me how to integrate your test into the build framework and how to run it (sorry that I am not well-versed in that area of the code base despite having contributed code to the runtime before), I would love to help confirm a fix and continue to debug.

I love working on this stuff and hope that I am being helpful!
Will

@qmuntal
Copy link
Contributor

qmuntal commented Mar 17, 2023

Thank you! If you could help me by telling me how to integrate your test into the build framework and how to run it

That's a little tricky, as it is using internal/poll only available from within the Go source code. Let's see if I can list the steps you should follow:

  1. git clone https://github.com/golang/go
  2. cd go/src
  3. ./make.bat
  4. set GOROOT=$(wherever you just cloned the Go repo)
  5. copy my test script into a new test file in ./internal/poll, for example ./internal/poll/foo_test.go
  6. ..\bin\go.exe test -run TestWSASocketConflict ./internal/poll

@hawkinsw
Copy link
Contributor Author

Thank you!!

I sincerely appreciate that. I hope that you are okay with me helping. Again, I love this!!

@hawkinsw
Copy link
Contributor Author

@qmuntal FYI: I can reproduce the error locally!

@hawkinsw
Copy link
Contributor Author

I built a version of the compiler/runtime with CL473515 applied and was able to successfully trigger the bug with a slightly modified version of the patch that, I think, shows that the CL is useful.

func TestWSASocketConflict(t *testing.T) {
	s, err := windows.WSASocket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP, nil, 0, windows.WSA_FLAG_OVERLAPPED)
	if err != nil {
		t.Fatal(err)
	}
	fd := poll.FD{Sysfd: s, IsStream: true, ZeroReadIsEOF: true}
	_, err = fd.Init("tcp", true)
	if err != nil {
		syscall.CloseHandle(s)
		t.Fatal(err)
	}
	defer fd.Close()

	const SIO_TCP_INFO = syscall.IOC_INOUT | syscall.IOC_VENDOR | 39
	inbuf := uint32(0)
	var outbuf _TCP_INFO_v0
	cbbr := uint32(0)

	// Set up an array of 2 syscall.Overlapped structs so that we
	// can play with triggering bad behavior even in the patched version
	// of the code. The net_op is bigger than the syscall.Overlapped
	// and the field in the net_op that is dangerously dereferenced (alliteration)
	// (before the CL is applied) would fall immediately after the end
	// of the syscall.Overlapped. So ...
	ovs := make([]syscall.Overlapped, 2)
	// ... if the first field of the adjacent syscall.Overlapped is 0 then
	// even the patched code will fail -- it sees *something* in the pd field
	// of the net_op. So, let's test that case and make sure that it fails.
	ovs[1].Internal -= 1
	// However, if the adjacent in-memory element is 0 then
	// the patched code will properly detect that what it found during
	// completion was *not* something that it manages. Whew.
	// ovs[1].Internal = 0

	err = fd.WSAIoctl(
		SIO_TCP_INFO,
		(*byte)(unsafe.Pointer(&inbuf)),
		uint32(unsafe.Sizeof(inbuf)),
		(*byte)(unsafe.Pointer(&outbuf)),
		uint32(unsafe.Sizeof(outbuf)),
		&cbbr,
		&ovs[0],
		0,
	)
	if err != nil && !errors.Is(err, syscall.ERROR_IO_PENDING) {
		t.Fatal(err)
	}
}

I hope that the commentary in the test helps explain what I am trying to demonstrate. As I ran the test in the configuration where CL473515 properly shielded a dangerous access (according to the above criteria), I was not able to reproduce.

I hope that this information helps! Thanks again for all that you are doing!

Will

@hawkinsw
Copy link
Contributor Author

After running the test repeatedly for approximately 45 minutes, I was able to trigger an error:

I think that the source of the issue is that the test is closing the socket (which leads to a decref) and the runtime does, too. That's just a supposition and I will dig in more tomorrow.

I have included the dump below. Thanks again!

Will

=== RUN   TestWSASocketConflict
Exception 0xc0000005 0x8 0x0 0x0
PC=0x0

runtime: g 6: unknown pc 0xc000059520
stack: frame={sp:0xc00006dea8, fp:0x0} stack=[0xc00006c000,0xc00006e000)
0x000000c00006dda8:  0x000000c00006ddc8  0x00000000009a2033 <internal/poll.(*FD).decref+0x0000000000000053>
0x000000c00006ddb8:  0x000000c000136000  0x000002caf7406860
0x000000c00006ddc8:  0x000000c00006ddf8  0x00000000009a3067 <internal/poll.(*FD).Close+0x0000000000000087>
0x000000c00006ddd8:  0x000002caf7406838  0x0000000000000000
0x000000c00006dde8:  0x0000000000000000  0x0000000000000000
0x000000c00006ddf8:  0x000000c00006de10  0x0000000000a1f2a5 <internal/poll_test.TestWSASocketConflict.func1+0x0000000000000025>
0x000000c00006de08:  0x000000c000136000  0x000000c00006df60
0x000000c00006de18:  0x0000000000a1f24e <internal/poll_test.TestWSASocketConflict+0x000000000000030e>  0x0000000000a8aea0
0x000000c00006de28:  0x0000000000b40388  0x0000000000a8aea0
0x000000c00006de38:  0x0000000000a8a0b8  0x000000000000059d
0x000000c00006de48:  0x000000000000059c  0x00000000009df500 <testing.tRunner+0x0000000000000000>
0x000000c00006de58:  0x0000000000b36238  0x0000000000b46600
0x000000c00006de68:  0x000000c0001320e0  0x0000000000000000
0x000000c00006de78:  0x0000000000000194  0x0000000000b36238
0x000000c00006de88:  0x0000000000a8aea0  0x000000c0c00000bb
0x000000c00006de98:  0x0000000000000000  0x000000c000059520
0x000000c00006dea8: <0x0000000000000000  0x0000000000000000
0x000000c00006deb8:  0x000000c000059520  0x000000c00006df40
0x000000c00006dec8:  0x00000000009df14a <testing.(*common).runCleanup+0x000000000000014a>  0x000000c000059520
0x000000c00006ded8:  0x0101000000000000  0x0000000000000000
0x000000c00006dee8:  0x0000000000000000  0x0000000000000000
0x000000c00006def8:  0x00000000009df1e0 <testing.(*common).runCleanup.func2+0x0000000000000000>  0x000000c000059520
0x000000c00006df08:  0x0000000000000000  0x0000000000000000
0x000000c00006df18:  0x00000000009df2a0 <testing.(*common).runCleanup.func3+0x0000000000000000>  0x000000c0000595d0
0x000000c00006df28:  0x000000c00006def8  0x0000000000000000
0x000000c00006df38:  0x000000c00006df18  0x000000c00006df60
0x000000c00006df48:  0x00000000009df6a5 <testing.tRunner.func2+0x0000000000000025>  0x000000c000059520
0x000000c00006df58:  0x000000c00006df38  0x000000c00006dfb0
0x000000c00006df68:  0x00000000009df629 <testing.tRunner+0x0000000000000129>  0x000000c000059520
0x000000c00006df78:  0x0100000000000000  0x00000000009df680 <testing.tRunner.func2+0x0000000000000000>
0x000000c00006df88:  0x000000c000059520  0x00000000009df6c0 <testing.tRunner.func1+0x0000000000000000>
0x000000c00006df98:  0x000000c000059520  0x000000c00006df80
runtime: g 6: unknown pc 0xc000059520
stack: frame={sp:0xc00006dea8, fp:0x0} stack=[0xc00006c000,0xc00006e000)
0x000000c00006dda8:  0x000000c00006ddc8  0x00000000009a2033 <internal/poll.(*FD).decref+0x0000000000000053>
0x000000c00006ddb8:  0x000000c000136000  0x000002caf7406860
0x000000c00006ddc8:  0x000000c00006ddf8  0x00000000009a3067 <internal/poll.(*FD).Close+0x0000000000000087>
0x000000c00006ddd8:  0x000002caf7406838  0x0000000000000000
0x000000c00006dde8:  0x0000000000000000  0x0000000000000000
0x000000c00006ddf8:  0x000000c00006de10  0x0000000000a1f2a5 <internal/poll_test.TestWSASocketConflict.func1+0x0000000000000025>
0x000000c00006de08:  0x000000c000136000  0x000000c00006df60
0x000000c00006de18:  0x0000000000a1f24e <internal/poll_test.TestWSASocketConflict+0x000000000000030e>  0x0000000000a8aea0
0x000000c00006de28:  0x0000000000b40388  0x0000000000a8aea0
0x000000c00006de38:  0x0000000000a8a0b8  0x000000000000059d
0x000000c00006de48:  0x000000000000059c  0x00000000009df500 <testing.tRunner+0x0000000000000000>
0x000000c00006de58:  0x0000000000b36238  0x0000000000b46600
0x000000c00006de68:  0x000000c0001320e0  0x0000000000000000
0x000000c00006de78:  0x0000000000000194  0x0000000000b36238
0x000000c00006de88:  0x0000000000a8aea0  0x000000c0c00000bb
0x000000c00006de98:  0x0000000000000000  0x000000c000059520
0x000000c00006dea8: <0x0000000000000000  0x0000000000000000
0x000000c00006deb8:  0x000000c000059520  0x000000c00006df40
0x000000c00006dec8:  0x00000000009df14a <testing.(*common).runCleanup+0x000000000000014a>  0x000000c000059520
0x000000c00006ded8:  0x0101000000000000  0x0000000000000000
0x000000c00006dee8:  0x0000000000000000  0x0000000000000000
0x000000c00006def8:  0x00000000009df1e0 <testing.(*common).runCleanup.func2+0x0000000000000000>  0x000000c000059520
0x000000c00006df08:  0x0000000000000000  0x0000000000000000
0x000000c00006df18:  0x00000000009df2a0 <testing.(*common).runCleanup.func3+0x0000000000000000>  0x000000c0000595d0
0x000000c00006df28:  0x000000c00006def8  0x0000000000000000
0x000000c00006df38:  0x000000c00006df18  0x000000c00006df60
0x000000c00006df48:  0x00000000009df6a5 <testing.tRunner.func2+0x0000000000000025>  0x000000c000059520
0x000000c00006df58:  0x000000c00006df38  0x000000c00006dfb0
0x000000c00006df68:  0x00000000009df629 <testing.tRunner+0x0000000000000129>  0x000000c000059520
0x000000c00006df78:  0x0100000000000000  0x00000000009df680 <testing.tRunner.func2+0x0000000000000000>
0x000000c00006df88:  0x000000c000059520  0x00000000009df6c0 <testing.tRunner.func1+0x0000000000000000>
0x000000c00006df98:  0x000000c000059520  0x000000c00006df80
created by testing.(*T).Run in goroutine 1
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:1632 +0x3ad  

goroutine 1 [chan receive]:
runtime.gopark(0xc000117a28?, 0x91cb68?, 0x20?, 0x81?, 0x18?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:381 +0xce fp=0xc0001179c0 sp=0xc0001179a0 pc=0x94a2ee
runtime.chanrecv(0xc000028460, 0xc000117aa7, 0x1)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/chan.go:583 +0x3d0 fp=0xc000117a38 sp=0xc0001179c0 pc=0x915d90
runtime.chanrecv1(0xb5eba0?, 0xa2fe80?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/chan.go:442 +0x12 fp=0xc000117a60 sp=0xc000117a38 pc=0x9159b2
testing.(*T).Run(0xc000059380, {0xa5a980?, 0x97e52d?}, 0xa645b0)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:1633 +0x3c8 fp=0xc000117b20 sp=0xc000117a60 pc=0x9e0468
testing.runTests.func1(0xb60260?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:2039 +0x3e fp=0xc000117b70 sp=0xc000117b20 pc=0x9e24de
testing.tRunner(0xc000059380, 0xc000117c88)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:1579 +0xff fp=0xc000117bc0 sp=0xc000117b70 pc=0x9df5ff
testing.runTests(0xc00007a320?, {0xb595a0, 0xd, 0xd}, {0xc00000db40?, 0x100c000117d10?, 0xb5ed00?})
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:2037 +0x445 fp=0xc000117cb8 sp=0xc000117bc0 pc=0x9e23c5
testing.(*M).Run(0xc00007a320)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/testing/testing.go:1909 +0x636 fp=0xc000117f00 sp=0xc000117cb8 pc=0x9e0df6
main.main()
        _testmain.go:73 +0x19c fp=0xc000117f80 sp=0xc000117f00 pc=0xa2095c        
runtime.main()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:250 +0x1f2 fp=0xc000117fe0 sp=0xc000117f80 pc=0x949f12
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000117fe8 sp=0xc000117fe0 pc=0x979601

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:381 +0xce fp=0xc00005bfa8 sp=0xc00005bf88 pc=0x94a2ee
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:387
runtime.forcegchelper()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:305 +0xb8 fp=0xc00005bfe0 sp=0xc00005bfa8 pc=0x94a158
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x979601
created by runtime.init.6 in goroutine 1
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:293 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:381 +0xce fp=0xc00005df78 sp=0xc00005df58 pc=0x94a2ee
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:387
runtime.bgsweep(0x0?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgcsweep.go:278 +0x94 fp=0xc00005dfc8 sp=0xc00005df78 pc=0x933a14
runtime.gcenable.func1()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgc.go:178 +0x25 fp=0xc00005dfe0 sp=0xc00005dfc8 pc=0x929225
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005dfe8 sp=0xc00005dfe0 pc=0x979601
created by runtime.gcenable in goroutine 1
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgc.go:178 +0x66        

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc0000281c0?, 0xa88f58?, 0x1?, 0x0?, 0xc000058b60?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:381 +0xce fp=0xc00006bf70 sp=0xc00006bf50 pc=0x94a2ee
runtime.goparkunlock(...)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:387
runtime.(*scavengerState).park(0xb5ed80)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgcscavenge.go:399 +0x49 fp=0xc00006bfa0 sp=0xc00006bf70 pc=0x9319e9
runtime.bgscavenge(0x0?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgcscavenge.go:627 +0x3c fp=0xc00006bfc8 sp=0xc00006bfa0 pc=0x931f7c
runtime.gcenable.func2()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgc.go:179 +0x25 fp=0xc00006bfe0 sp=0xc00006bfc8 pc=0x9291c5
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x979601
created by runtime.gcenable in goroutine 1
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mgc.go:179 +0xa5        

goroutine 5 [finalizer wait]:
runtime.gopark(0x58?, 0xa4f620?, 0x70?, 0xff?, 0x9569fe?)
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/proc.go:381 +0xce fp=0xc00005fe28 sp=0xc00005fe08 pc=0x94a2ee
runtime.runfinq()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mfinal.go:193 +0x107 fp=0xc00005ffe0 sp=0xc00005fe28 pc=0x928307
runtime.goexit()
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00005ffe8 sp=0xc00005ffe0 pc=0x979601
created by runtime.createfing in goroutine 1
        C:/Users/whh8b/Code/Code/goroot-fixed/src/runtime/mfinal.go:163 +0x3d     
rax     0xc000059520
rbx     0x0
rcx     0x0
rdi     0x0
rsi     0x0
rbp     0xc0c00000bb
rsp     0xc00006dea0
r8      0x0
r9      0xc000136218
r10     0x3
r11     0xef28fff600
r12     0x0
r13     0x0
r14     0xc0000596c0
r15     0x333ff
rip     0x0
rflags  0x10216
cs      0x33
fs      0x53
gs      0x2b
FAIL    internal/poll   0.565s

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 19, 2023

After further investigation, it appears that the problem above had a relatively simple cause: The test is using syscall.Overlapped which is not padded the way that windows.Overlapped is when CL473515 is applied. In other words, the protection was not quite complete. After padding system.Overlapped I have been running the test for approximately 120 minutes (2 hours) and do not see any failures. I will update the CL appropriately.

I still think that there may be another issue with which we have to contend:

In the case where the caller of WSAIoctl does not wait for completion (in the case where it does not complete immediately) and program execution leaves the scope where the Overlapped object is initially allocated, the overlapped entry generated by GetQueuedCompletionStatusEx will contain a reference to memory that has been (potentially) released.

Does that seem like a reasonable statement? If it is, we can come up with a workaround.

As usual, thank you for letting me help!
Will

@hawkinsw
Copy link
Contributor Author

Sorry for the delay -- the CL is now updated. Let me know if/how I can help further!

Will

@qmuntal
Copy link
Contributor

qmuntal commented Mar 21, 2023

Thanks for the new insights @hawkinsw. I still think that when calling GetQueuedCompletionStatusEx we shouldn't be receiving entries that are not enqueued from the Go runtime. You are focusing your effort on fixing a situation that shouldn't occur (but it does). I would like to have a better understanding about the root cause before partially patching the problem.

For example, CreateIoCompletionPort accepts a per-handle unique key (CompletionKey parameter). I'm wondering if it would be possible to set this unique key for handles owned by the Go runtime, and then update (c *rawConn) Control on Windows to pass a duplicated handle to the callback, which won't be associated to the unique key previously set.

@hawkinsw
Copy link
Contributor Author

Thanks for the feedback! I think that

  1. I see no reason why the runtime shouldn't get those entries (sorry for the double negative). The same handle is given back to the caller when they create a connection that is kept and managed by the runtime. In other words, because ...
  2. ... the completion events are done based on the HANDLE (referring to the way that MSFT writes types and not yelling at you!) and not on the overlapped entry, the situation seems to make sense to me.
  3. I think that the POC code I wrote https://github.com/hawkinsw/windowpane/blob/main/bug_poc/bug_poc.cpp) and the documentation make that clear ...
  4. ... because the overlapped only plays a part in canceling (or otherwise manipulating) outstanding pending (aka overlapped) entries (https://learn.microsoft.com/en-us/windows/win32/fileio/cancelioex-func) any thread waiting on a completion port that is associated with that handle could be notified.
  5. The paragraph I quoted above to @ianlancetaylor seems to indicate that what we are seeing here is problematic and (perhaps too subtly) warns against duplicating file handles:

It is best not to share a file handle associated with an I/O completion port by using either handle inheritance or a call to the DuplicateHandle function. Operations performed with such duplicate handles generate completion notifications. Careful consideration is advised.

I am convinced that you know more than I, so please help me understand what you are seeing that I am not! Thanks for letting me help!
Will

@hawkinsw
Copy link
Contributor Author

I think that I found a little more evidence that what we are seeing is "expected". I augmented the runtime and the test to log the value of Sysfd in two places:

  1. In the test
func TestWSASocketConflict(t *testing.T) {
	s, err := windows.WSASocket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP, nil, 0, windows.WSA_FLAG_OVERLAPPED)
	if err != nil {
		t.Fatal(err)
	}
	fd := poll.FD{Sysfd: s, IsStream: true, ZeroReadIsEOF: true}
	_, err = fd.Init("tcp", true)
	if err != nil {
		syscall.CloseHandle(s)
		t.Fatal(err)
	}
	defer fd.Close()

	println("fd.Sysfd: ", fd.Sysfd)
...
  1. When it is associated with iocphandle (i.e., netpollopen):
func netpollopen(fd uintptr, pd *pollDesc) int32 {
	println("Adding ", fd, " to the iocphandle.")
	if stdcall4(_CreateIoCompletionPort, fd, iocphandle, 0, 0) == 0 {
		return int32(getlasterror())
	}
	return 0
}

When running the test we what we expect -- that the IDs of the FD are identical:

=== RUN   TestWSASocketConflict
Adding  404  to the iocphandle.
fd.Sysfd:  404
--- PASS: TestWSASocketConflict (0.00s)
PASS
ok      internal/poll   0.223s

To me this seems like conclusive proof that we have what we expected: The runtime is adding the same FD to the completion port which, when combined with what we learned from the documentation, seems to make it clear that the behavior we are seeing is exactly what we should see.

I did have a thought about a better solution than the one I proposed in the CL and I will update the CL to demonstrate it. Looking forward to your feedback.

Thanks again for everything!
Will

@qmuntal
Copy link
Contributor

qmuntal commented Mar 22, 2023

The evidences are clear, you explain it very well @hawkinsw, the current behavior is in accordance with what's documented. My point is that we should try to improve how the Go runtime manages the completion ports and also make sure that the handler passed to the user via rawConn.Control cannot interfere with the Go runtime GetQueuedCompletionStatusEx. If that's not possible, then we can look at how defense against these conflicts, which probably will be very similar to CL 473515.

@hawkinsw
Copy link
Contributor Author

@qmuntal I am relieved -- I was definitely starting to question my sanity! I really have enjoyed our co-investigation! Thank you, again, for letting me participate.

As I said, I had a thought about how to be proactive rather than reactive managing what completion entries (overlappedEntrys) are retrieved from GetQueuedCompletionStatusEx. I will let you know as soon as I have had a chance to do that. I apologize for any delay -- my teaching load is a little high this semester!

@qmuntal
Copy link
Contributor

qmuntal commented Mar 23, 2023

I've tried to materialize my previous thoughts about using a completion key to filter out entries not originated from the runtime (without having to update rawConn.Control). @hawkinsw can you try CL 478935 and check if the original issue is gone with that? Regardless of the outcome, I'm still not 100% this is the right answer as I haven't exercised it much.

@hawkinsw
Copy link
Contributor Author

Thanks @qmuntal -- I will do that as soon as possible!

@hawkinsw
Copy link
Contributor Author

@qmuntal -- I have applied the patch and am running my test! Everything looks great ... until it doesn't!

I see an issue that might be specific to the that would not be visible to users. It would appear that under certain circumstances the decref method of the FD type (I think that I got the go-specific terminology correct!) is called with (what appears to be) a nil receiver. It is on the tip of my brain why that happens but I working on writing out the specifics for why it would occur. I will keep you posted!

Thanks (again!) for letting me work with you!
Will

@hawkinsw

This comment was marked as off-topic.

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Mar 27, 2023

I believe that I have isolated the source of the failure. Judging by the fact that the PC is 0x0 and that internal/poll.(*FD).decref+0x0000000000000053 is frequently on the stack in the trace, I (finally) just went to the object code (what I should have done all along). In the compiled objects on my local host, internal/poll.(*FD).decref+0x0000000000000053 is the address after a call to internal/poll.(*FD).destroy (consistent with the hypothesis that internal/poll.(*FD).decref+0x0000000000000053 is a return address pushed on the stack.

So, we go to internal/poll.(*FD).destroy and look for any place that has a indirect call (the likely, if not only, way to get a 0x0 PC). Fortunately, there is only one:

  492487:       48 89 c8                mov    rax,rcx
  49248a:       80 b8 20 02 00 00 00    cmp    BYTE PTR [rax+0x220],0x0
  492491:       75 15                   jne    4924a8 <internal/poll.(*FD).destroy+0x68>
  492493:       48 8b 15 2e 2e 1b 00    mov    rdx,QWORD PTR [rip+0x1b2e2e]        # 6452c8 <internal/poll.CloseFunc>
  49249a:       48 8b 0a                mov    rcx,QWORD PTR [rdx]
  49249d:       48 8b 58 10             mov    rbx,QWORD PTR [rax+0x10]
  4924a1:       48 89 d8                mov    rax,rbx
  4924a4:       ff d1                   call   rcx

Note that this is not the entirety of the function and, again, this object code is specific to my host, obviously.

For better or worse, this does seem consistent with a timing or memory corruption error: We read the contents of rcx from memory before calling through it.

@qmuntal, I think that seems like a reasonable analysis and will continue down this path. I am sorry it has taken me this long to finally come up with a reasonable hypothesis but I am happy that I finally have something to go on. If you have thoughts, please let me know -- I would love to get your insight!

Thanks again and hope you had a great weekend!

UPDATE:
The precise hypothesis that I formulated is not correct. However, I believe that I am still barking up the right tree. Just need to figure out the precise limb! I will keep you posted!

UPDATE 2: (22:39:28GMT 27 March 2023 Monday)
While this note does not make me happy, it is somewhat relieving! I have narrowed the problem down to something in the test harness.

@qmuntal
Copy link
Contributor

qmuntal commented Mar 28, 2023

@qmuntal, I think that seems like a reasonable analysis and will continue down this path. I am sorry it has taken me this long to finally come up with a reasonable hypothesis but I am happy that I finally have something to go on. If you have thoughts, please let me know -- I would love to get your insight!

UPDATE 2: (22:39:28GMT 27 March 2023 Monday)
While this note does not make me happy, it is somewhat relieving! I have narrowed the problem down to something in the test harness.

Take my test with caution, I've used interacted few times with internal/poll.FD and it might have misused parts of its API.

@hawkinsw
Copy link
Contributor Author

Sorry for the delay between updates. While I do not know why what I am about to describe is happening, I am confident that I have some (what I think is) definitive proof of what is going on.

While running a test, c (the receiver) of testing.(*common).runCleanup in testing/testing.go is nil.

Because the error is nondeterministic it is not always at this place that I see the panic. However, with logging I have demonstrated that it is at least one place where the problem occurs. That is great news because now I have at least something to go on.

Given the data accumulated so far, it would appear that there is something happening asynchronously that is scribbling over the value of c.

In one common failure scenario, the instantiation of testing.(*common).runCleanup with the nil receiver is from testing/testing.go line 1331:

			c.runCleanup(normalPanic)

In other words, we were literally just executing in a place where we know that c was valid!

I will keep you posted as I look for more.

cc @qmuntal

@hawkinsw
Copy link
Contributor Author

I still think that there may be another issue with which we have to contend:

In the case where the caller of WSAIoctl does not wait for completion (in the case where it does not complete immediately) and program execution leaves the scope where the Overlapped object is initially allocated, the overlapped entry generated by GetQueuedCompletionStatusEx will contain a reference to memory that has been (potentially) released.

UPDATE: It would appear that this is exactly the issue. I am going to modify your patch @qmuntal to work around the issue. I will keep you posted!

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 3, 2023

@qmuntal I believe that I know fully understand the problem and have a workable solution. I am going to spend time today writing a complete description of the problem and the proposed solution so that we have it for posterity and to use as a way to discuss whether to go forward with that solution. Let me know if that sounds okay!

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 4, 2023

I am so sorry that it took me as long as it did to produce this document. I hope that you find it valuable and perhaps there is something worthwhile inside. I look forward to hearing what you think. Thanks again for everything!

https://github.com/hawkinsw/windowpane/blob/main/description.md

@qmuntal
Copy link
Contributor

qmuntal commented Apr 4, 2023

That was a good reading. Some thoughts:

  • Looks like CL 478935 does protect the runtime against some conflicts with the completion port. Would you mind sending a new CL that incorporates CL 478935 and your version of TestWSASocketConflict that don't make the runtime crash?
  • If I got it right, your proposal of using CreateEvent to efficiently wait for the completion of a potentially long-running I/O operation is achievable without modifying the Go runtime nor the standard library. Nevertheless, it would be good to add a test somewhere that verifies that that pattern works well.
  • I'm still not sold about adding a helper function that returns or modifies a handler to set its lower bit, as this is easily doable on the caller side. Seeing that pattern in action (for example in the test I requested in the previous bullet) may change my mind.

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 4, 2023

That was a good reading. Some thoughts:

* Looks like [CL 478935](https://go-review.googlesource.com/c/go/+/478935) does protect the runtime against some conflicts with the completion port. Would you mind sending a new CL that incorporates CL 478935 and your version of `TestWSASocketConflict` that don't make the runtime crash?

Absolutely! I will do it as soon as possible. It will be my first time attempting to modify someone else's CL on Gerrit so please forgive me (in advance) if I somehow screw up!

* If I got it right, your proposal of using `CreateEvent` to efficiently wait for the completion of a potentially long-running I/O operation is achievable without modifying the Go runtime nor the standard library. Nevertheless, it would be good to add a test somewhere that verifies that that pattern works well.

That is absolutely correct. The only problem is ... to use CreateEvent from inside the compiler/runtime source code we will need to make a modification. You will see that in the updated CL. From the user perspective, however, those knobs are already available in the standard library -- pretty cool!

* I'm still not sold about adding a helper function that returns or modifies a handler to set its lower bit, as this is easily doable on the caller side. Seeing that pattern in action (for example in the test I requested in the previous bullet) may change my mind.

Nor am I! I was just suggesting that it might be something we would consider. We can reconsider after you see the updated CL. Thank you for the feedback!
Will

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 5, 2023

@qmuntal Sorry if I am being dense. I just attempted to git codereview mail the updated CL and I failed -- I lack "forge author" permission. Did you want me to create a separate CL or is that a permission that you can provide me so that we can collaborate on the CL?

Thanks!

@qmuntal
Copy link
Contributor

qmuntal commented Apr 5, 2023

@qmuntal Sorry if I am being dense. I just attempted to git codereview mail the updated CL and I failed -- I lack "forge author" permission. Did you want me to create a separate CL or is that a permission that you can provide me so that we can collaborate on the CL?

No problem. Submit a new CL copying whatever relevant code from my CL and set my as co-author. I'll abandone myh CL latter.

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 5, 2023

Will do!!

@hawkinsw
Copy link
Contributor Author

hawkinsw commented Apr 5, 2023

Tried to set you! Let me know if I did it correctly:

482495: runtime: filter i/o async entries using completion key on windows
https://go-review.googlesource.com/c/go/+/482495

@gopherbot
Copy link

Change https://go.dev/cl/482495 mentions this issue: runtime: filter i/o async entries using completion key on windows

@gopherbot
Copy link

Change https://go.dev/cl/558895 mentions this issue: runtime: make netpollBreak entries identifiable on Windows

gopherbot pushed a commit that referenced this issue Feb 1, 2024
It is currently not possible to distinguish between a netpollBreak
entry and an entry initiated by external WSA operations (as in #58870).

This CL sets a unique completion key when posting the
netpollBreak entry so that it can be identified as such.

Change-Id: I8e74a7ddc607dc215d6ed8c59d5c3cf47ec8dc62
Reviewed-on: https://go-review.googlesource.com/c/go/+/558895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
@gopherbot
Copy link

Change https://go.dev/cl/561895 mentions this issue: runtime: make netpoll events sources identifiable on Windows

gopherbot pushed a commit that referenced this issue Feb 14, 2024
This is another attempt at CL 558895, but without adding stale pollDescs
protection, which deviates from the original goal of the CL and adds
complexity without proper testing.

It is currently not possible to distinguish between a netpollBreak,
an internal/poll WSA operation, and an external WSA operation (as
in #58870). This can cause spurious wakeups when external WSA operations
are retrieved from the queue, as they are treated as netpollBreak
events.

This CL makes use of completion keys to identify the source of the
event.

While here, fix TestWSASocketConflict, which was not properly
exercising the "external WSA operation" case.

Change-Id: I91f746d300d32eb7fed3c8f27266fef379360d98
Reviewed-on: https://go-review.googlesource.com/c/go/+/561895
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
It is currently not possible to distinguish between a netpollBreak
entry and an entry initiated by external WSA operations (as in golang#58870).

This CL sets a unique completion key when posting the
netpollBreak entry so that it can be identified as such.

Change-Id: I8e74a7ddc607dc215d6ed8c59d5c3cf47ec8dc62
Reviewed-on: https://go-review.googlesource.com/c/go/+/558895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
This is another attempt at CL 558895, but without adding stale pollDescs
protection, which deviates from the original goal of the CL and adds
complexity without proper testing.

It is currently not possible to distinguish between a netpollBreak,
an internal/poll WSA operation, and an external WSA operation (as
in golang#58870). This can cause spurious wakeups when external WSA operations
are retrieved from the queue, as they are treated as netpollBreak
events.

This CL makes use of completion keys to identify the source of the
event.

While here, fix TestWSASocketConflict, which was not properly
exercising the "external WSA operation" case.

Change-Id: I91f746d300d32eb7fed3c8f27266fef379360d98
Reviewed-on: https://go-review.googlesource.com/c/go/+/561895
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. OS-Windows
Projects
None yet
Development

No branches or pull requests

7 participants