runtime: system stack and heap corruption when interacting with cgo on Windows #59724
Comments
You need to use LockOSThread on Windows since you can only render on the main thread. |
The linked code does no rendering. Additionally, as I mention in the description above, that isn't true of vulkan. |
cc @golang/runtime |
This repro has 100 goroutines that concurrently call |
They do not call it concurrently |
I use a channel to ensure that only one runs at a time |
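For readers following along, the "only one runs at a time" arrangement can be sketched roughly as below. This is a minimal illustration, not the code from the linked repro: the helper name `runSerialized`, the token channel, and the counting callback are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// runSerialized (hypothetical helper) starts n goroutines but lets only one
// of them execute work at a time by passing a single token through a
// buffered channel. It returns the largest number of goroutines that were
// ever inside work simultaneously.
func runSerialized(n int, work func(id int)) int {
	token := make(chan struct{}, 1)
	token <- struct{}{} // exactly one token exists, so at most one holder

	var (
		wg     sync.WaitGroup
		active int // only touched while holding the token
		peak   int // only touched while holding the token
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-token // wait for this goroutine's turn
			active++
			if active > peak {
				peak = active
			}
			work(id)
			active--
			token <- struct{}{} // hand the turn to the next goroutine
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	calls := 0
	peak := runSerialized(100, func(id int) { calls++ })
	fmt.Println("calls:", calls, "peak concurrency:", peak)
}
```

The goroutines all exist at once, but the calls themselves never overlap, which is the distinction being drawn in the comment above.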
Doh, apologies. |
SUCCEED - Manjaro 6.2.9.1 - GeForce RTX 3060 mobile v530.410 |
In triage, we think that trying to reproduce with |
CC @qmuntal for some Windows expertise, maybe? The difficulty for us in reproducing it is we don't have a readily-available Windows box that has Vulkan library installed on it. (Is that necessary? I assume so, but maybe I'm missing something.) |
It cannot be reproduced with LockOSThread, as I mention in the bug description, and I turn the GC off in the linked code using |
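As a point of reference for the LockOSThread discussion, pinning a sequence of calls to one OS thread can be sketched as follows. The helper `onLockedThread` and the string "steps" are hypothetical stand-ins for the cgo create/use/destroy calls; they are not taken from the linked repro.

```go
package main

import (
	"fmt"
	"runtime"
)

// onLockedThread (hypothetical helper) runs fn on a fresh goroutine that is
// wired to a single OS thread for the whole duration of fn, then waits for
// it to finish.
func onLockedThread(fn func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		fn()
	}()
	<-done
}

func main() {
	var steps []string
	onLockedThread(func() {
		// Stand-ins for calls into C; every step runs on the same OS thread.
		steps = append(steps, "create")
		steps = append(steps, "use")
		steps = append(steps, "destroy")
	})
	fmt.Println(steps)
}
```

Without the lock, the Go scheduler is free to resume the goroutine on a different OS thread between any two of these steps.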
I don't think that Vulkan is a vital part of the issue, but I don't know what about Vulkan is causing it to trigger, the couple of simple things I tried to repro without Vulkan didn't work. I have walked other users through setting up Vulkan to repro this issue (I had to, in order to get the AMD repro) so I can do so with others who have a windows machine if they have the time. It involves installing mingw. |
Can't reproduce this issue using an NVIDIA Quadro T1000 and the latest Vulkan SDK. @CannibalVox could you share the output of vulkaninfo --summary
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.224
Instance Extensions: count = 17
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_win32_surface : extension revision 6
VK_NV_external_memory_capabilities : extension revision 1
Instance Layers: count = 8
--------------------------
VK_LAYER_KHRONOS_profiles Khronos Profiles layer 1.3.243 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.3.243 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.3.243 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.3.243 version 2
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 0.9.19 1.3.243 version 36883
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.3.243 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.3.243 version 1
VK_LAYER_NV_optimus NVIDIA Optimus layer 1.3.224 version 1
Devices:
========
GPU0:
apiVersion = 4206816 (1.3.224)
driverVersion = 2216148992 (0x8417c000)
vendorID = 0x10de
deviceID = 0x1fb9
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = Quadro T1000 with Max-Q Design
driverID = DRIVER_ID_NVIDIA_PROPRIETARY
driverName = NVIDIA
driverInfo = 528.95
conformanceVersion = 1.3.3.1
deviceUUID = 4b0115d4-e3bd-e983-c3b0-92f9a06fd48c
driverUUID = fcf93a5f-db54-5d85-a8b0-b89bd3de837b |
Here's mine:
Attached is the full vulkaninfo for the AMD machine that repro'd the issue |
You may have to run it 4 or 5 times to reproduce, if you haven't done that |
Hm, both of us have steam & epic overlay layers, it's possible they're at fault. |
One of the linux users who tested for us had the steam overlay, but none had the epic overlay so this may not be an OS thing. |
OK, uninstalled the epic launcher and I can no longer repro, so the layers it installs are a piece of the puzzle here. I still think go is part of the formula, because of the number of things you can do in go that aren't visible to EOS that don't repro. However, I'm definitely more willing to believe that epic is doing something fundamentally wrong than vulkan. There's a strong possibility that windows isn't part of the formula, too, I'm going to see what my options are for getting EOS installed on a non-windows OS for testing. |
OK the layers are windows-only so this seems like it's probably not actually a problem with go's windows support, if it is a problem with go. I'm going to do additional investigation to see if I can tease out the various possibilities:
|
Thanks. It sounds like Epic might do something that requires (possibly undocumented) things to run on the same thread. And you may want to actually use |
Well, no, that's not it either. As I mentioned in the task description, there's all kinds of ways you can interact on different threads that don't repro the issue. That's why this is so confusing. Epic can't know about some of the things that are required to repro the issue. If it's not an issue with go, then epic is modifying some go memory, or maybe the system stack, in a way that causes issues but only when you do this exact set of things. |
🤔 Does spinning up a new goroutine interact with the system stack at all? I'm kind of being drawn toward the "epic modifies the system stack" theory but I don't understand why you have to spin up goroutines after calling into epic for it to repro. Other question: is there a good way to compare the system stack before vs. after a call into epic to verify whether epic is modifying it? |
Sorry, you did indeed already answer both of my questions in the original bug. 😅 Do you have any example crashes to share? Just throwing something out there: I wonder if maybe some Go-specific TLS data is being clobbered causing really weird crashes. (I think if such state were broken, you'd still eventually fail even with LockOSThread. And maybe that does happen, it just takes a lot longer to fail?) |
It might only be getting clobbered when you call from different threads or system stacks. But we end up with the same situation of "why do you have to spin up new goroutines, and only after the create call, for it to fail". It might be like you said, it just takes a whole lot longer? |
Yep, I had it running several hours without failing. Well, now it's clear why, I don't have the offending extension. @CannibalVox do you know where I can get it? |
If you install the epic launcher from https://store.epicgames.com/en-US/download it will install it as part of the update process. When you see a login modal, that should mean it's installed. |
I'm afraid I can't reproduce the issue even though Epic overlay seems to be correctly installed. @CannibalVox you have a bunch of other layers that I don't have. Could it be that Epic is interacting wrongly with one of those? |
Well, I uninstalled everything but epic and it's still reproducing for me:
|
Also I was previously asked for an example crash, so here it is- this happens when the program is exiting and is the most common form this issue takes: it tries to switch to the system stack to call the exitcode syscall and it explodes with 0xc0000005 (access violation)
|
@qmuntal I'm running via mingw, maybe that's part of the equation, I assume you're using powershell? |
Yes, I do use powershell. My mingw version is pretty new (see below), which one do you use?
$ gcc --version
gcc.exe (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. |
|
Does your program register any callbacks that are called when the program exits? |
My code does not, I can't be certain that epic overlay isn't doing that. |
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
Why was this in "waiting for info"? I provided everything I was asked for. @qmuntal @cherrymui |
Apologies, that looks like a mistake to me. |
What version of Go are you using (go version)?
I also tried this with 1.8, 1.12, and 1.18
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Ran this program a few times in a row: https://github.com/CannibalVox/heapcorruptrepro
What did you expect to see?
Successful completion
What did you see instead?
It often succeeds, but it fails maybe a third to half of the time on Windows, and no other operating system. The nature of the failure is different at different times. Often, I'll see crashes on exit when attempting to run the exit syscall because the system stack has been corrupted. Sometimes the program will exit prematurely with exit code 0xc0000374 (corrupted heap). At other times, I will see access violation panics when calling C methods.
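Because the failure is intermittent and shows up under different exit codes, it can help to tally outcomes over repeated runs. The sketch below only classifies exit codes; `classifyExit` and `tally` are hypothetical helpers, and the sample codes in `main` are made up for illustration rather than recorded from the repro. The two constants are the Windows NTSTATUS values for an access violation and for heap corruption.

```go
package main

import "fmt"

// Windows NTSTATUS values that appear in this report.
const (
	statusAccessViolation uint32 = 0xC0000005
	statusHeapCorruption  uint32 = 0xC0000374
)

// classifyExit (hypothetical helper) labels a process exit code.
func classifyExit(code uint32) string {
	switch code {
	case 0:
		return "success"
	case statusAccessViolation:
		return "access violation"
	case statusHeapCorruption:
		return "heap corruption"
	default:
		return fmt.Sprintf("other (0x%08x)", code)
	}
}

// tally (hypothetical helper) counts how often each class occurs.
func tally(codes []uint32) map[string]int {
	counts := make(map[string]int)
	for _, c := range codes {
		counts[classifyExit(c)]++
	}
	return counts
}

func main() {
	// Illustrative exit codes only; not measurements from the repro.
	codes := []uint32{0, 0xC0000005, 0, 0xC0000374, 0}
	for class, n := range tally(codes) {
		fmt.Printf("%s: %d\n", class, n)
	}
}
```

In a real harness the codes would come from running the repro binary several times, for example via `os/exec` and `ProcessState.ExitCode()`, converted to `uint32`.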
Because of the involvement of vulkan (and I can't tell exactly what aspect of vulkan is triggering the issue to reproduce it with a different library), it's easy to point the finger at vulkan. However, I do not believe vulkan per se is responsible:
This issue will only repro if the following four things all happen on the same goroutine. Moving any of them elsewhere or doing them in a different order works properly.
This is not simply a case of vulkan using thread context (for one thing, it doesn't)- we can perform create and destroy operations on any arbitrary goroutines all day long if we want to, as long as we don't follow the above instructions to the letter. Likewise, we can do the above on linux without difficulty.
Here are the 5 scenarios that I was able to try:
SUCCEED - Ubuntu 22.04.2 LTS - GeForce RTX 4090 v525.105
SUCCEED - Ubuntu 22.10 - Intel(R) UHD Graphics v22.2.5
SUCCEED - Ubuntu 22.10 - GeForce RTX 3070 v525.105.17
FAIL - Windows 10 - GeForce RTX 3070 v531.61
FAIL - Windows 10 - Radeon 6800M v21.20.01.24
It's difficult to reproduce (given the fact that I can only figure out how to get vulkan to trigger it), but I believe that this is an issue with the go runtime. Vulkan would not be able to tell the difference between one goroutine on thread A creating objects and one goroutine on thread B destroying them, and one goroutine creating objects, switching to thread B, and then destroying them. And it certainly can't tell whether go has spun up goroutines performing unrelated tasks between the two points. I'm concerned that this may indicate deeper issues with cgo on windows in the go runtime.