runtime: "morestack on g0" in x/perf/storage/app on windows/arm64 #47557

Closed
bcmills opened this issue Aug 5, 2021 · 27 comments
Labels
arch-arm64 FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Windows release-blocker

Comments

@bcmills
Contributor

bcmills commented Aug 5, 2021

$ greplogs --dashboard -l -md -e (?ms)morestack on g0.*FAIL\s+golang\.org/x/perf/storage/app
2021-02-20T03:31:36-40a54f1/6e73886/windows-arm64-10
2021-02-20T03:31:36-40a54f1/8a7ee4c/windows-arm64-10
2021-02-20T03:31:36-40a54f1/b8ca6e5/windows-arm64-10

fatal: morestack on g0
runtime: signal received on thread not created by Go.
…
FAIL	golang.org/x/perf/storage/app	0.274s

CC @prattmic @cherrymui @ianlancetaylor

@bcmills bcmills added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-arm64 labels Aug 5, 2021
@bcmills bcmills added this to the Go1.17 milestone Aug 5, 2021
@mknyszek mknyszek modified the milestones: Go1.17, Backlog Aug 18, 2021
@bcmills
Contributor Author

bcmills commented Sep 22, 2021

This is a release-blocker via #11811.

windows/arm64 is not a first-class port, so in theory this can be addressed by either fixing the underlying bug or adding skips to the relevant test(s).

(However, it looks like a pretty severe runtime bug to me.)

@bcmills bcmills modified the milestones: Backlog, Go1.18 Sep 22, 2021
@heschi
Contributor

heschi commented Sep 29, 2021

cc @mknyszek

@toothrot
Contributor

This is delightfully reproducible and fails very reliably.

@toothrot
Contributor

cc @aclements

@mknyszek
Contributor

I'll take a look.

@mknyszek
Contributor

After a few hours of holding gomote wrong, I've reproduced it. Now I'm trying to get it into a debuggable state, but go test -c is having issues...

The following runs fine:

gomote run -dir=perf user-mknyszek-windows-arm64-10-0 go/bin/go test ./storage/app

And the following fails:

$ gomote run -dir=perf user-mknyszek-windows-arm64-10-0 go/bin/go test -c ./storage/app
# golang.org/x/perf/storage/app.test
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136463
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136490
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x13649e
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136463
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136490
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x13649e
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136463
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136490
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x13649e
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136463
sym 3263: unsupported obj reloc R_DWARFSECREF/4 to go.info.net/http.(*ServeMux).HandleFunc$abstract
sym 3263: invalid relocation: R_DWARFSECREF .debug_info+0x136490
C:\workdir\go\pkg\tool\windows_arm64\link.exe: too many errors
Error running run: exit status 2

CC @thanm maybe?

@bcmills
Contributor Author

bcmills commented Oct 14, 2021

Oh, interesting! By coincidence, that difference (between go test and go test -c) came up recently in a code review.
(https://go-review.googlesource.com/c/go/+/348991/8..15/src/cmd/go/testdata/script/build_issue48319.txt#b15)

@bcmills
Contributor Author

bcmills commented Oct 14, 2021

I would say that you could add the -s and -w LDFLAGS manually, but I'm guessing you actually need that debug info “to get it into a debuggable state”. 😅

@thanm
Contributor

thanm commented Oct 14, 2021

Interesting problem. It looks like one of these tests is failing:

https://go.googlesource.com/go/+/24e798e2876f05d628f1e9a32ce8c7f4a3ed3268/src/cmd/link/internal/arm64/asm.go#610
https://go.googlesource.com/go/+/24e798e2876f05d628f1e9a32ce8c7f4a3ed3268/src/cmd/link/internal/arm64/asm.go#618

meaning that the relocation won't reach, but we can't find the linker-introduced label symbol. Why it is happening with only DWARF relocations is a mystery though.

I think this would be better off as another bug. Do you want to file it or should I?
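
The "won't reach" condition mentioned above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the linker's code: the 1 MiB reach limit (`relocLimit`) and the helper names `splitAddend` and `resolve` are hypothetical, chosen for illustration. It shows how an offset such as `.debug_info+0x136463` would need a label symbol at a chunk boundary, and how a missing label turns into an "invalid relocation" error.

```go
package main

import "fmt"

// relocLimit is an ASSUMED reach limit for a relocation addend (1 MiB here);
// the real linker constant may differ.
const relocLimit = 1 << 20

// splitAddend rounds a large offset down to the nearest relocLimit boundary.
// It returns the offset of the hypothetical label symbol expected at that
// boundary, plus the small residual addend relative to it.
func splitAddend(off int64) (labelOff, residual int64) {
	labelOff = off / relocLimit * relocLimit
	return labelOff, off - labelOff
}

// resolve mimics the failing path: if the addend does not fit and no label
// exists for the containing chunk, the relocation is reported as invalid.
func resolve(labels map[int64]bool, off int64) (string, error) {
	if off < relocLimit {
		return fmt.Sprintf("sym+0x%x", off), nil
	}
	labelOff, residual := splitAddend(off)
	if !labels[labelOff] {
		return "", fmt.Errorf("invalid relocation: sym+0x%x", off)
	}
	return fmt.Sprintf("label@0x%x+0x%x", labelOff, residual), nil
}

func main() {
	off := int64(0x136463) // offset taken from the linker output above

	// No label was emitted for the chunk: the lookup fails.
	_, err := resolve(map[int64]bool{}, off)
	fmt.Println(err)

	// With a label at the 1 MiB boundary, the residual addend fits.
	s, _ := resolve(map[int64]bool{0x100000: true}, off)
	fmt.Println(s)
}
```

In this model the relocation only fails when the target lies past the limit and nothing emitted a label for that chunk, which would be consistent with the errors appearing only for the large `.debug_info` section.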

@thanm
Contributor

thanm commented Oct 14, 2021

I am kind of curious about how you are going to debug the test once it's built properly with DWARF. Delve doesn't support windows+arm64, so I assume gdb... does the builder actually have a gdb that works?

@mknyszek
Contributor

@thanm 🤦 yeah, you're right. I don't think it has gdb. It might have a Windows debugger. I guess it's just down to print debugging, anyway.

I'll file the bug.

@mknyszek
Contributor

OK actually, I know why we're getting a morestack on g0 failure and why we see all these signal received on thread not created by Go messages.

Something causes a signal to land on a thread not created by Go the first time (or so the runtime thinks). This calls into badsignal2, which in turn calls runtime.abort. Unfortunately, runtime.abort just raises a signal. Since we're still not on a Go thread, that signal lands back in badsignal2, and this repeats until we exhaust the stack, hence the morestack on g0 failure that finally kills the process.

I'm not yet sure what causes the original signal, though.
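
The loop described above can be modeled in a few lines. This is a toy simulation, not runtime code: `simulate`, `frames`, and `exitHard` are hypothetical names, and the "exit directly" branch is only an assumption about what a fix might do.

```go
package main

import "fmt"

// simulate models a signal arriving on a thread the runtime does not
// recognize. frames is the number of handler frames the stack can hold.
// If exitHard is false, the handler "aborts" by raising another signal,
// which re-enters the handler. If exitHard is true (an ASSUMED fix), the
// handler terminates the process directly after the first message.
// It returns how many "signal received" messages were printed and the
// final outcome.
func simulate(frames int, exitHard bool) (messages int, outcome string) {
	for depth := 0; ; depth++ {
		if depth == frames {
			// No room for another handler frame.
			return messages, "fatal: morestack on g0"
		}
		messages++ // "runtime: signal received on thread not created by Go."
		if exitHard {
			return messages, "exit"
		}
		// The abort raises a new signal, so the loop re-enters the
		// handler one frame deeper.
	}
}

func main() {
	n, out := simulate(5, false)
	fmt.Println(n, out)

	n, out = simulate(5, true)
	fmt.Println(n, out)
}
```

In this model the repeated message and the final stack-exhaustion failure both come from the same re-entry, and the original signal's cause stays hidden behind them.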

@mknyszek
Contributor

Coincidentally, I have a CL that fixes the recursive runtime.abort issue: https://go-review.googlesource.com/c/go/+/321789/. I should really land it, I totally forgot about it. After that, we should be able to get more info.

@gopherbot

Change https://golang.org/cl/321789 mentions this issue: runtime: exit harder in badsignal2

@mknyszek
Contributor

That's very strange. I've confirmed that binaries built with https://golang.org/cl/321789 on windows/arm64 do actually have the right code in badsignal2, but I'm still getting the same failure mode. As far as I can tell, there's no other way such a message gets printed...

@mknyszek
Contributor

Furthermore, the failure appears before any tests actually get executed. It turns out that having go test -c work here would actually be very helpful, since I'm not sure at what point it's failing now (in the compiler? early in the runtime for the test?)

@thanm
Contributor

thanm commented Oct 14, 2021

You might try working around the DWARF problem with

go test -ldflags=-w -c

@mknyszek
Contributor

Thanks @thanm! That worked. OK, it's definitely not the compiler crashing, it's the binary. But before any tests execute, I'm afraid.

@mknyszek
Contributor

Looks like it's failing very early in runtime initialization. This explains the failure; there's no g set up yet or anything! Whatever the failure really is, it could be masked by this bad signal stuff, I think.

I've narrowed down the failure to this loop on the first module data encountered in moduledataverify.

@mknyszek
Contributor

I've further confirmed that on the second iteration of that loop, this check passes, so there's already something wrong. However, then the runtime crashes on the following line, specifically, the datap.pclntable indexing.

This suggests to me that something is broken about the binary. It's worth noting that this is a cgo binary; all the tests in this package that produce the failing binary are build-tagged with cgo. I have a copy of the bad binary and also steps to reproduce; this isn't my area of expertise so any help would be appreciated.
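
To illustrate the kind of failure described above, here is a toy model of a function-table check. It is a sketch only: the types `funcTab` and `moduleData` and the function `verify` are hypothetical and only loosely modeled on the names mentioned in this thread. It shows how an ordering check can pass while an offset into `pclntable` is still out of range, which a real runtime would hit as a crash rather than a returned error.

```go
package main

import "fmt"

// funcTab is a HYPOTHETICAL function-table entry: an entry address and an
// offset into the pclntable byte slice.
type funcTab struct {
	entry   uintptr
	funcoff uintptr
}

// moduleData is a HYPOTHETICAL, heavily simplified module record.
type moduleData struct {
	ftab      []funcTab
	pclntable []byte
}

// verify walks the table. The last ftab element is treated as an end
// marker, so only the first len(ftab)-1 entries are checked.
func verify(md *moduleData) error {
	nftab := len(md.ftab) - 1
	for i := 0; i < nftab; i++ {
		// Ordering check: entries must be sorted by address.
		if md.ftab[i].entry > md.ftab[i+1].entry {
			return fmt.Errorf("ftab entry %d out of order", i)
		}
		// Bounds check on the offset that would be used to index pclntable.
		if off := md.ftab[i].funcoff; off >= uintptr(len(md.pclntable)) {
			return fmt.Errorf("ftab entry %d: funcoff 0x%x outside pclntable", i, off)
		}
	}
	return nil
}

func main() {
	good := &moduleData{
		ftab:      []funcTab{{0x1000, 0}, {0x2000, 8}, {0x3000, 0}},
		pclntable: make([]byte, 16),
	}
	fmt.Println(verify(good))

	// Sorted addresses, but the second entry's offset is corrupted.
	bad := &moduleData{
		ftab:      []funcTab{{0x1000, 0}, {0x2000, 0xdeadbeef}, {0x3000, 0}},
		pclntable: make([]byte, 16),
	}
	fmt.Println(verify(bad))
}
```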

@TopperDEL

I have a similar issue, though I'm not sure if that is the exact same problem. I convert a go-library with CGO into a DLL for windows ARM64. That crashes during load/initialization with an "AccessViolation during read" followed by plenty of "AccessViolation during write" until the program quits with a "StackOverflow". The root-error seems to be around "morestack", too, and the runtime seems to try to raise a badsignal. So it kind of seems to fit.

I could provide two dlls - one is working and the other is not. If one adds those to a UWP-app and tries to PInvoke into "uplink_internal_UniverseIsEmpty" on e.g. a Hololens 2, it crashes with the above described error-chain.

storj_uplink_dlls.zip

This is the root-cause-assembly-code:
image

@dmitshur
Contributor

Friendly ping on this issue as it's currently marked as a release-blocker.

Also CC @bufflig in case you're able to take a look.

@mknyszek
Contributor

Help from someone familiar with the compiler and/or linker would probably be best. Some parts of the binary being generated from these tests appear to be very broken.

@cherrymui
Member

I'll take a look and see if I can understand anything...

@gopherbot

Change https://golang.org/cl/360895 mentions this issue: cmd/link: don't use label symbol for absolute address relocations on ARM64 PE

@aclements aclements added NeedsFix The path to resolution is known, but the work has not been done. and removed help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 3, 2021
@bcmills
Contributor Author

bcmills commented Nov 9, 2021

@gopherbot, please backport to Go 1.17. This failure mode is still occurring consistently on the go1.17 builder.

@gopherbot

Backport issue(s) opened: #49479 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@golang golang locked and limited conversation to collaborators Nov 9, 2022

10 participants