Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debug/pe: TestDefaultLinkerDWARF and TestExternalLinkerDWARF fail on windows/arm64 builder #46406

Closed
dmitshur opened this issue May 26, 2021 · 25 comments
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@dmitshur
Copy link
Contributor

Spotted on the new windows-arm64-aws builder on a recent Go tip commit a62c087:

--- FAIL: TestDefaultLinkerDWARF (4.77s)
    file_test.go:395: Testprog output:
        base=0x7ff786f40000
        main=0x7ff786fd6890
        offset=0x96890
    file_test.go:426: decoding dwarf section info at offset 0x13b1: underflow
--- FAIL: TestExternalLinkerDWARF (7.73s)
    file_test.go:395: Testprog output:
        base=0x7ff6aa9b0000
        main=0x7ff6aaa46890
        offset=0x96890
    file_test.go:426: decoding dwarf section info at offset 0x13b1: underflow
FAIL
FAIL	debug/pe	15.328s

(Source.)

That builder has an open known issue (#42604) so there's a possibility of problem with the builder, but the test failure looks like it may be a problem in debug/pe on windows/arm64 specifically. It may be relevant that testing of internal linking on windows/arm64 was intentionally skipped in CL 312045 (CC @rsc).

CC @alexbrainman via owners. Also CC @rsc and @golang/release.

@dmitshur dmitshur added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-arm64 labels May 26, 2021
@dmitshur dmitshur added this to the Go1.17 milestone May 26, 2021
@alexbrainman
Copy link
Member

file_test.go:426: decoding dwarf section info at offset 0x13b1: underflow

It appears the problem here is decoding Dwarf section. I am not familiar with the code that decodes Dwarf.

I also don't have windows-arm64 hardware so I cannot even try to reproduce it here.

I also don't use windows-arm64 myself. Perhaps @zx2c4 will be interested investigating this.

It may be relevant that testing of internal linking on windows/arm64 was intentionally skipped in CL 312045 (CC @rsc).

This test uses external linker, not internal linker. And Windows does not come with C compiler and linker installed. So whatever C tools you installed on that builder could be relevant to this bug.

I hope it helps.

Alex

@thanm
Copy link
Contributor

thanm commented May 27, 2021

I poked at this a bit. I changed the error in question to a panic; here's the stack trace at the point of the error:

panic: buf.bytes underflow [recovered]
	panic: buf.bytes underflow

goroutine 35 [running]:
testing.tRunner.func1.2({0x7ff75268e800, 0x7ff7527007d8})
	c:/workdir/go/src/testing/testing.go:1192 +0x258
testing.tRunner.func1(0x4000145040)
	c:/workdir/go/src/testing/testing.go:1195 +0x278
panic({0x7ff75268e800, 0x7ff7527007d8})
	c:/workdir/go/src/runtime/panic.go:1038 +0x224
debug/dwarf.(*buf).bytes(...)
	c:/workdir/go/src/debug/dwarf/buf.go:72
debug/dwarf.(*buf).skip(...)
	c:/workdir/go/src/debug/dwarf/buf.go:82
debug/dwarf.(*Data).parseUnits(0x4000288000)
	c:/workdir/go/src/debug/dwarf/unit.go:50 +0x720
debug/dwarf.New({0x4000282000, 0x1ade, 0x1c00}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x4000300000, ...}, ...)
	c:/workdir/go/src/debug/dwarf/open.go:96 +0x454
debug/pe.(*File).DWARF(0x40000be140)
	c:/workdir/go/src/debug/pe/file.go:270 +0x634
debug/pe.testDWARF(0x4000145040, 0x1)
	c:/workdir/go/src/debug/pe/file_test.go:424 +0xd3c
debug/pe.testCgoDWARF(0x4000145040, 0x1)
	c:/workdir/go/src/debug/pe/file_cgo_test.go:20 +0x90
debug/pe.TestDefaultLinkerDWARF(0x4000145040)
	c:/workdir/go/src/debug/pe/file_cgo_test.go:24 +0x30
testing.tRunner(0x4000145040, 0x7ff7526d0380)
	c:/workdir/go/src/testing/testing.go:1242 +0x100
created by testing.(*T).Run
	c:/workdir/go/src/testing/testing.go:1289 +0x350
FAIL	debug/pe	12.234s
FAIL

What this says is that things are going bad at a very early stage in the process of reading the DWARF. The reader is counting the number of compilation units in the .debug_info section; it does this by reading the unit length, then skipping over to the next unit using that value.

This suggests some sort of very basic problem in the DWARF generation, like getting the unit lengths wrong, or incorrectly adding padding between the units.

@thanm
Copy link
Contributor

thanm commented May 27, 2021

I did a bit more hand testing with the gomote. I think this is mainly a problem with external linking. When I force internal linkage for my small testcase, the DWARF looks ok (at least it is not hitting the problem in this issue).

@cherrymui
Copy link
Member

Is R_DWARFSECREF handled in the right way in pereloc1 at https://tip.golang.org/src/cmd/link/internal/arm64/asm.go#L633 ?
On elfreloc1 R_DWARFSECREF and R_ADDR are handled in the same way (which I think is correct). Does it work if we let pereloc1 do the same? @thanm could you try that? Thanks.

@thanm
Copy link
Contributor

thanm commented May 27, 2021

Thanks -- I'll give that a try. I note that when I compile this program:

package main

import "C"
import _ "net/http/pprof"

func main() {
	println("himom")
}

I get:

C:\workdir\xxx>go build -o testcase.exe -ldflags="-v" testcase.go 
# command-line-arguments
HEADER = -H10 -T0xffffffffffffffff -R0xffffffff
sym 53881: invalid relocation: R_DWARFSECREF .debug_info+0x107a99
sym 53881: unsupported obj reloc R_DWARFSECREF/4 to go.info.strings.HasSuffix$abstract
sym 53881: invalid relocation: R_DWARFSECREF .debug_info+0x107aae
sym 53881: unsupported obj reloc R_DWARFSECREF/4 to go.info.strings.HasSuffix$abstract
sym 53881: invalid relocation: R_DWARFSECREF .debug_info+0x107ab6
sym 53881: unsupported obj reloc R_DWARFSECREF/4 to go.info.strings.HasSuffix$abstract
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x10003d
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x100070
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x1000db
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x100146
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x10018e
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x1001f9
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x10027b
sym 17544: unsupported obj reloc R_DWARFSECREF/4 to 
sym 17544: invalid relocation: R_DWARFSECREF .debug_loc+0x1002d8
C:\workdir\go\pkg\tool\windows_arm64\link.exe: too many errors

which also seems to point the finger at R_DWARFSECREF.

@thanm
Copy link
Contributor

thanm commented May 27, 2021

I tried hacking arm64/asm.go to treat R_DWARFSECREF the same as R_ADDR, e.g.

diff --git a/src/cmd/link/internal/arm64/asm.go b/src/cmd/link/internal/arm64/asm.go
index c10bdc4120..1584e82abb 100644
--- a/src/cmd/link/internal/arm64/asm.go
+++ b/src/cmd/link/internal/arm64/asm.go
@@ -630,12 +630,7 @@ func pereloc1(arch *sys.Arch, out *ld.OutBuf, ldr *loader.Loader, s loader.Sym,
        default:
                return false
 
-       case objabi.R_DWARFSECREF:
-               out.Write32(uint32(sectoff))
-               out.Write32(uint32(symdynid))
-               out.Write16(ld.IMAGE_REL_ARM64_SECREL)
-
-       case objabi.R_ADDR:
+       case objabi.R_ADDR, objabi.R_DWARFSECREF:
                out.Write32(uint32(sectoff))
                out.Write32(uint32(symdynid))
                if r.Size == 8 {

but this did not seem to do the trick (same error from the debug/pe test).

For the errors I posted in the previous comment, I think they are coming from

https://go.googlesource.com/go/+/cdcd02842da7c004efd023881e3719105209c908/src/cmd/link/internal/arm64/asm.go#610

which suggests some sort of problem with gensymslate / label symbols (e.g. to get around the 21 bit limit). More investigation needed. Not sure if the two problems are related though.

@dmitshur
Copy link
Contributor Author

dmitshur commented Jun 1, 2021

Marking with release-blocker because passing tests are generally needed for binary releases (unless test are skipped), issue #46355, which is a release blocker for 1.17.

CC @rsc since you worked on adding this port for 1.17 (issue #36439).

@rsc
Copy link
Contributor

rsc commented Jun 1, 2021

What cgo toolchain is the builder using?
I must have used https://github.com/mstorsjo/llvm-mingw/releases/tag/20201020.
If the builder was configured with the newer (since my changes)
https://github.com/mstorsjo/llvm-mingw/releases/tag/20210423
then maybe there is a bug related to that toolchain or maybe that toolchain itself is buggy.

Honestly the simplest thing to do is probably to make sure the builder uses 20201020.

I am trying to confirm that that's the version I used but I am having trouble getting the tablet to turn on. (!)

@dmitshur
Copy link
Contributor Author

dmitshur commented Jun 1, 2021

The commit message in CL 322653 suggests the builder uses llvm-arm64 mingw, and this line confirms it's a newer version than the suggested one.

@dmitshur
Copy link
Contributor Author

dmitshur commented Jun 1, 2021

From above, we know the test is passing with 20201020, but failing with the newer 20210423. It should work with a newer stable llvm-mingw release, so this will need investigation anyway, it can just happen later.

We can add a skip for the test and keep the builder as is, both to avoid the relatively slow process of rebuilding the image and to have the existing builder with latest stable llvm-mingw available for testing. (Thanks @heschi for the suggestion.)

@gopherbot
Copy link

Change https://golang.org/cl/323990 mentions this issue: debug/pe: skip tests on windows/arm64 that fail with newer llvm-mingw

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2021

@mstorsjo - seems like there might be a regression in your latest toolchain. Interested in taking a look?

@mstorsjo
Copy link

mstorsjo commented Jun 1, 2021

@mstorsjo - seems like there might be a regression in your latest toolchain. Interested in taking a look?

Always interested in resolving regressions, but I'm not really familiar with what I'm looking at and I'm (in a 2 minute readthrough) seeing various issues being discussed here.

Given llvm-mingw as is, how do I minimally reproduce the issue at hand? (Is it a new build time warning that wasn't there before?) I'd love to bisect down the upstream llvm change that caused it.

@dmitshur
Copy link
Contributor Author

dmitshur commented Jun 1, 2021

@mstorsjo If you're able to build Go tip from source for windows/arm64, it should be possible to reproduce this with go test -short debug/pe. Otherwise you may need to wait for someone to create a more standalone, smaller repro.

@rsc
Copy link
Contributor

rsc commented Jun 1, 2021

I don't think we have conclusive evidence at all that it's a regression in LLVM-Mingw, only an incompatibility between Go and LLVM-Mingw with those particular versions. It could just as easily be a problem on Go's side, although our DWARF parser does handle regular LLVM and GCC just fine.

@mstorsjo
Copy link

mstorsjo commented Jun 2, 2021

@mstorsjo If you're able to build Go tip from source for windows/arm64, it should be possible to reproduce this with go test -short debug/pe. Otherwise you may need to wait for someone to create a more standalone, smaller repro.

A smaller standalone repro would be ideal yeah - but if someone can provide instructions for how to build go from source with llvm-mingw I could try the full case, at least for pinpointing where the regression came from. Can I cross build Go from unix, transfer the build to a windows/arm64 machine and then use that to run tests, or do I need to build it natively (and how do I do that)?

@alexbrainman
Copy link
Member

Can I cross build Go from unix, transfer the build to a windows/arm64 machine and then use that to run tests, or do I need to build it natively (and how do I do that)?

Go programs can be build on any OS for any OS. So you can build windows-arm64 binaries on UNIX (including test binaries).

Unfortunately this test uses mix of Go and C - called Cgo. So you also need to have your C compiler on UNIX to be able to cross-compile to Windows. And you would need to run test on Windows. And that test executes GCC on Windows.

I would say it is easier to install Go from source and C compiler on Windows. And then you could run go test -short debug/pe and you will be able to reproduce the issue.

This is

https://golang.org/doc/install/source

how to install Go from source.

All Go tools are written in Go. So before getting Go source, you should download another version of Go with already pre-built binaries. Most users would use this section

https://golang.org/doc/install/source#bootstrapFromSource

Unfortunately this will not work for you because windows-arm64 is very new and pre-built binaries are not available for download.

So you should use a UNIX system with any version of Go installed, and build windows-arm64 toolchain

https://golang.org/doc/install/source#bootstrapFromCrosscompiledSource

and then copy the files onto Windows and use that toolchain to built Go from source there.

I hope it helps.

Alex

@rsc
Copy link
Contributor

rsc commented Jun 2, 2021

@mstorsjo Apologies for you getting dragged in here. I think you should wait until we have a clearer repro case to hand you. Chances are decent still that the problem is on our side.

@mstorsjo
Copy link

mstorsjo commented Jun 2, 2021

@mstorsjo Apologies for you getting dragged in here. I think you should wait until we have a clearer repro case to hand you. Chances are decent still that the problem is on our side.

Ok, no problem! Holding off of it for now.

@rsc
Copy link
Contributor

rsc commented Jun 2, 2021

I have awakened my Surface Pro X from its slumber, but despite valiant efforts I have not succeeded in convincing it that today is not April 26, 2021. It insists that all the other computers on the internet serving certificates from the future must be the ones who are insecurely configured. After a reboot, even less works on the machine, not even trying to use the start menu and certainly not the "Adjust date/time" menu entry. I assume it is doing background network connections all of which are failing during TLS. Perhaps if I let it sit here for a few hours it will reconsider. A remarkably secure operating system.

@thanm
Copy link
Contributor

thanm commented Jun 3, 2021

I spent a bit more time debugging. I think I have a fix for part of the problem-- the DWARF sections aren't being given the correct alignment. I'll send a CL shortly (will also re-enable the test).

The other problem (mentioned above in #46406 (comment)) is still there however so I don't think we are completely out of the woods on DWARF.

@gopherbot
Copy link

Change https://golang.org/cl/324763 mentions this issue: cmd/link: use correct alignment in PE DWARF sections

gopherbot pushed a commit that referenced this issue Jun 3, 2021
Set the correct section flags to insure that .debug_* sections are
using 1-byte alignment instead of the default. This seems to be
important for later versions of LLVM-mingw on windows (shows up on the
windows/arm64 builder).

Updates #46406.

Change-Id: I023d5208374f867552ba68b45011f7990159868f
Reviewed-on: https://go-review.googlesource.com/c/go/+/324763
Trust: Than McIntosh <thanm@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
@thanm
Copy link
Contributor

thanm commented Jun 3, 2021

I've submitted https://golang.org/cl/324763 and manually verified that the test runs correctly on the windows-arm64-aws builder. @dmitshur , ok to close this issue?

@dmitshur
Copy link
Contributor Author

dmitshur commented Jun 3, 2021

Yes, thank you for fixing this problem and verifying @thanm. Let's close the tracking issue.

I'll also close CL 323990 (test skip) since it's no longer necessary to skip the test.

@gopherbot
Copy link

Change https://golang.org/cl/345129 mentions this issue: cmd/release: start running release tests for windows-arm64

gopherbot pushed a commit to golang/build that referenced this issue Aug 25, 2021
By now, all outstanding issues preventing release tests for
windows-arm64 from being turned on by default are resolved:

- Issues golang/go#46406 and golang/go#46502 are fixed.
- The other remaining blocker golang/go#47017 is determined to be a
  builder-specific issue (not reproducible on physical Windows ARM64
  hardware) that by now happens infrequently enough that automated
  retries should be able to cover for future occurrences.
  Issue golang/go#47965 is opened to track progress on this.
- The builder performance is good and allows release tests to complete
  quickly, not adding any bottlenecks to the release process.

If something changes, we can revisit this, but information available
so far suggests it's a good time to start running release tests for
windows/arm64 by default.

Fixes golang/go#47017.
Updates golang/go#47965.

Change-Id: Ie96164821a2f8e795a22ac2d36b7292587a3e117
Reviewed-on: https://go-review.googlesource.com/c/build/+/345129
Reviewed-by: Alexander Rakoczy <alex@golang.org>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
@golang golang locked and limited conversation to collaborators Aug 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants