Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal errors in TestGo2C2Go on windows-amd64 since 2021-11-05 #49457

Closed
bcmills opened this issue Nov 8, 2021 · 16 comments
Closed

runtime: fatal errors in TestGo2C2Go on windows-amd64 since 2021-11-05 #49457

bcmills opened this issue Nov 8, 2021 · 16 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 8, 2021

greplogs --dashboard -md -l -e 'FAIL: TestGo2C2Go'

2021-11-08T18:54:21-830b393/windows-amd64-2008
2021-11-08T16:44:14-7ee3f14/windows-amd64-2008
2021-11-05T22:30:17-b07c41d/windows-amd64-longtest
2021-11-05T19:48:29-c353f1b/windows-amd64-2016
2021-11-05T17:39:43-0bc98b3/windows-amd64-longtest

Not obvious to me when this started, since the first failure CL seems unrelated.
(@aclements, @mknyszek: possibly related to your changes around then?)

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 8, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 8, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Nov 8, 2021

(CC @bufflig)

@bufflig
Copy link
Contributor

bufflig commented Nov 12, 2021

Looks bad, I'll take it.

@bufflig bufflig self-assigned this Nov 12, 2021
@toothrot
Copy link
Contributor

This issue is currently blocking the first beta release of Go 1.18. Are there any updates?

@bufflig
Copy link
Contributor

bufflig commented Nov 19, 2021

I've been trying to reproduce, but have been unsuccessful in getting this to happen in an environment where I can debug it. If anyone has ideas on how to provoke this, I'm all ears. It would not be surprising to me if it's somewhat related to the gcc toolchain being used, but it could also be completely unrelated.

@bufflig
Copy link
Contributor

bufflig commented Nov 19, 2021

Fumbling in the dark, as I cannot repro, I wonder if the changes regarding how cgo calls work with regabi (that caused the hanging in the signal handler) could be related to this?
@cherrymui. @mknyszek - I know you're both busy, but does the stack dump tell you anything about where I should start looking?

@cherrymui
Copy link
Member

It is possible that this is regabi related.

I spent some time on this the other day. I did't reproduce the crash, but sometimes it hangs. I haven't figured out anything from that. I'll take another look next week.

@bufflig
Copy link
Contributor

bufflig commented Nov 19, 2021

I had one timeout in the beginning of my testing, but now it's stubbornly working and I haven't been able to get that to happen again either :(. I'll continue trying different things to provoke it.

@cherrymui
Copy link
Member

The stack traces are interestingly confusing. It essentially has two sets of stack traces. Taking 2021-11-08T18:54:21-830b393/windows-amd64-2008 as an example, starting with a fault on the system stack, followed by 4 goroutine stacks (goroutine 1-4), then

        Exception 0x39f33c 0x3 0x74b41551 0x7fefd7ea06d
        PC=0x7fefd7ea06d
        signal arrived during external code execution

then a few more goroutine stacks. For goroutine 1, at the top it complains an unknown PC at 0xf4392a, whereas in the second set of tracebacks it starts with exactly that PC

        runtime.cgocall(0xfcd2e6, 0x1c00009bf18)
        	C:/workdir/go/src/runtime/cgocall.go:157 +0x4a fp=0x1c00009bef0 sp=0x1c00009beb8 pc=0xf4392a

There are two Go runtimes in the process, one from the main executable (m1.exe), the other from the shared object (libtestgo2c2go.dll). I wonder what happens is that both runtimes does a traceback. The one in the shared object does it first, print the first 4 stacks. Then the main executable does another traceback with a very different set of goroutines (except goroutine 1).

For goroutine 1, since in the Go->C->Go case we run the callback on the same G stack, so it makes sense that we see the PC of runtime.cgocall from cgocallback. But the runtime from the shared object doesn't seem to understand the PC of runtime.cgocall from the host.

The direct failure is sched.safePointFn being nil. It shouldn't be if it gets there (p.runSafePointFn != 0). sched is a global variable. This, plus the fact that there are two tracebacks with essentially completely different set of Gs, make me wonder if the two copies of the runtime doesn't share global variables correctly. That would explain the inconsistency: one runtime sets sched.safePointFn to non-nil and also sets p.runSafePointFn for its allp. The other runtime observe the P (through the G) but its own sched.safePointFn is not set. It also explains the two runtimes see different allgs.

(If that's the case, I wonder how it ever works, then?)

@cherrymui
Copy link
Member

@ianlancetaylor how Go2C2Go test is supposed to work? Do the two runtimes are supposed to be deduplicated, at least for the writable part (global variables), or they are separated? I don't think "separated" could work unless they don't share anything, but the Go2C2Go case definitely shares a G.

(I know for plugins runtimes are supposed to be deduplicated, with a single shared writable state. (Read-only symbols may not be deduplicated and that is fine.))

@ianlancetaylor
Copy link
Contributor

This most likely started failing due to @mknyszek 's changes leading to a GC occurring where it did not occur before.

This test is written assuming ELF shared library semantics. It's not going to work in general. It's not important to test it on Windows, as the case we are testing for (#27019) is only going to work on ELF. We should just skip the test on windows. I'll send a CL.

Sorry for not noticing this earlier.

@gopherbot
Copy link

Change https://golang.org/cl/365994 mentions this issue: misc/cgo/testcshared: skip TestGo2C2Go on Windows

@cherrymui
Copy link
Member

Yeah, I think it is due to a GC occurring at an interesting time.

On ELF I don't think it properly shares global variables either. It looks to me it also has two sets of globals. Somehow it doesn't crash. I think the c-shared build mode assumes it has only one copy of the runtime (unlike plugins, which has special supports for multiple copies). Maybe the test is problematic on ELF as well.

@ianlancetaylor
Copy link
Contributor

On ELF the symbols are all exported, and the shared library is (of course) compiled PIC which means that all variable references go through the GOT and all functions go through the PLT, so the dynamic linker should resolve all the duplicated symbols to the same address. The complete executable image will contain two copies of the globals, but the code should consistently use only one of them. At least, that is how it is supposed to work.

@cherrymui
Copy link
Member

On ELF it looks to me most of the Go symbols are not exported and are not accessed through the GOT, only the C symbols are. Also, in plugin mode we have extra logic to handle multiple pcln tables, type tables, etc., which are (read-only) global variables but should not be deduplicated. We don't have those in c-shared mode.

I looked into how it works on Linux now. With our current TLS model, it seems the main executable and the shared object get different TLS slots for the G. So when the Go callback is called, runtime.cgocallback from the shared object uses its own TLS slot and finds no G there, so it starts the callback from a new G (with its own M and P, none of them is known to the main executable). This effectively keeps the two runtimes separated, at least for this simple test case.

Maybe it works if it carefully keeps the two runtime separated. But it seems delicate. (If we allow passing pointers from Go to C to Go, bad things might happen.)

@ianlancetaylor
Copy link
Contributor

Oy vey, thanks for looking into it. If we have trouble again we should probably just delete the test.

@cherrymui
Copy link
Member

SGTM. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants