Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: windows/arm64 binary output invalid with linkmode=external #51923

Closed
jeremyd2019 opened this issue Mar 24, 2022 · 15 comments
Closed

cmd/link: windows/arm64 binary output invalid with linkmode=external #51923

jeremyd2019 opened this issue Mar 24, 2022 · 15 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jeremyd2019
Copy link

jeremyd2019 commented Mar 24, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18 windows/arm64

With fix from #51903 applied

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=arm64
set GOBIN=
set GOCACHE=C:\Users\Administrator\AppData\Local\go-build
set GOENV=C:\Users\Administrator\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=arm64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\Administrator\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\Administrator\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\msys64\clangarm64\lib\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\msys64\clangarm64\lib\go\pkg\tool\windows_arm64
set GOVCS=
set GOVERSION=go1.18
set GCCGO=gccgo
set AR=ar
set CC=clang
set CXX=clang++
set CGO_ENABLED=1
set GOMOD=NUL
set GOWORK=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\msys64\tmp\go-build1929041516=/tmp/go-build -gno-record-gcc-switches

Windows 11 ARM64, running MSYS2's package of clang/llvm/lld 13.0.1.

What did you do?

Download and extract https://downloads.rclone.org/v1.58.0/rclone-v1.58.0.tar.gz

  go build -ldflags -s
  (
    cd librclone
    go build --buildmode=c-shared -ldflags -s -o librclone.dll
    cat > test.c <<END
#define WIN32_LEAN_AND_MEAN
#include <stdio.h>
#include <windows.h>

int main()
{
        HMODULE hmod = LoadLibrary("librclone.dll");
        if (!hmod || hmod == INVALID_HANDLE_VALUE)
                printf("%lu\n", GetLastError());
        return 0;
}
END
    clang -o test.exe test.c
    ./test.exe
  )

What did you expect to see?

Successful run.

What did you see instead?

193

(this is ERROR_BAD_EXE_FORMAT)

I haven't been able to figure out what's wrong with the DLL that Windows won't load it. I'm attaching it in case somebody has some idea of what to look for.

librclone.zip

Split from #51903 since this seems to be a different issue.

/cc @cherrymui and @mstorsjo in case he has some idea what's invalid about this DLL

@jeremyd2019 jeremyd2019 changed the title windows/arm64 c-shared DLL output is invalid cmd/link: windows/arm64 c-shared DLL output is invalid Mar 24, 2022
@cherrymui
Copy link
Member

Does it work with other DLL built from Go (e.g. just an empty main function)?

@jeremyd2019
Copy link
Author

jeremyd2019 commented Mar 24, 2022

I have absolutely no Go experience, so I apologize in advance. In a new directory, copying test.exe from earlier steps.

cat > test.go <<END
package main
function main() {}
END

go mod init test
go build --buildmode=c-shared -ldflags -s -o librclone.dll
./test.exe

no error reported.

@cherrymui
Copy link
Member

Thanks. So the error seems specific to rclone? What if you build rclone with -ldflags="-s -w", or just -ldflags="-w"?

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 24, 2022
@mknyszek mknyszek added this to the Backlog milestone Mar 24, 2022
@jeremyd2019
Copy link
Author

jeremyd2019 commented Mar 24, 2022

Same result with -ldflags -w, no additional message. (interestingly, it seems to have tried to link in my test.c that was sitting there at first, and I got a duplicate symbol main error from ld.lld until I deleted test.c)

@cherrymui
Copy link
Member

cherrymui commented Mar 24, 2022

Same result with -ldflags -w, no additional message

Sorry, could you clarify if it still fails with error 193? Or no error reported?

it seems to have tried to link in my test.c that was sitting there at first

go build includes all Go and C (and a few others) source files in the current directory. You can do go build <flags> test.go to build only the Go file.

@jeremyd2019
Copy link
Author

On a hunch, I tried rebuilding rclone.exe with -ldflags=-linkmode=external, and rclone.exe build in that way also fails to run:

---------------------------
WinDbg:10.0.22000.194 ARM64 
---------------------------
Could not create process 'C:\M\mingw-w64-rclone\src\rclone-v1.58.0\rclone.exe', Win32 error 0n193

%1 is not a valid Win32 application.
---------------------------
OK   
---------------------------

@jeremyd2019 jeremyd2019 changed the title cmd/link: windows/arm64 c-shared DLL output is invalid cmd/link: windows/arm64 binary output invalid with linkmode=external Mar 24, 2022
@jeremyd2019
Copy link
Author

jeremyd2019 commented Mar 24, 2022

Same result with -ldflags -w, no additional message

Sorry, could you clarify if it still fails with error 193? Or no error reported?

Linking the 'real' librclone.dll had no error or warning. Attempting to load the DLL gives error 193

@jeremyd2019
Copy link
Author

It occurred to me that an lld "repro.tar" of linking the DLL might be helpful (especially for @mstorsjo)
go build --buildmode=c-shared "-ldflags=-extldflags -Wl,-Xlink=-linkrepro:$(pwd -W)" -o librclone.dll
repro.tar.gz

@mstorsjo
Copy link

It occurred to me that an lld "repro.tar" of linking the DLL might be helpful (especially for @mstorsjo)
go build --buildmode=c-shared "-ldflags=-extldflags -Wl,-Xlink=-linkrepro:$(pwd -W)" -o librclone.dll
repro.tar.gz

Awesome! Thanks, yes that did help me pinpoint the issue further. I think I know the root cause of what's happening here.

TL;DR; the go.o object file contains relocations in nonmonotonic order at one place. If inspecting llvm-objdump -r C/Users/ADMINI~1/AppData/Local/Temp/go-link-173614113/go.o with the files that @jeremyd2019 so kindly provided, I find this:

RELOCATION RECORDS FOR [.rdata]:
[...]
000000000081cf70 IMAGE_REL_ARM64_ADDR64 go.string.*
000000000081cf80 IMAGE_REL_ARM64_ADDR64 go.string.*
000000000081cfa0 IMAGE_REL_ARM64_ADDR64 go.string.*
000000000081cf90 IMAGE_REL_ARM64_ADDR64 go.string.* <---- Note, this address is smaller than the previous one
000000000081cfc8 IMAGE_REL_ARM64_ADDR64 go.string.*
000000000081cfd8 IMAGE_REL_ARM64_ADDR64 go.string.*

What leads me to believe this is the root cause (and how I found it):

Trying to load this DLL with a small test program like in the original post here, I get a dialog box pop up saying that it was unable load it (with error status 0xc000007b). If I try to do the same on Wine on linux on aarch64, it loads successfully. So the DLL isn't entirely broken, but some more strict check in Windows bails out here.

One of the most common issues in failing to load an executable like this has in the past been if there are gaps in the section layout, which often can happen after stripping (but wouldn't happen if the linker omits debug info directly, unless the linker is broken). But that isn't an issue here, all the sections are correctly laid out. All the DLL flags and similar are set correctly. (On arm64, the dynamicbase flag is a must, contrary to x86.)

Next, I tried to look at what parts of the executable that the Windows loader might be interacting with. I tried modifying the executable to zero out some fields, to have less things for the Windows loader to interact with. (To modify the executable, I used llvm-objcopy where I manually edited the code to do the exact tweaks on the PE header that I wanted; there's probably other tools for tweaking the PE header that would have worked too.)

Looking at the file headers, it has a couple data directories defined; export table, import table, import address table, exception table, base relocation table, TLS table.

I tried zooming in on the base relocation table. (See https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#the-reloc-section-image-only for some docs on it.) As the executable has got dynamicbase set, having no base relocation table at all doesn't seem to fly, but we can truncate it to see if it contains bad data somewhere. For inspecting the relocation table, I used MSVC's dumpbin.exe; dumpbin -relocations librclone.dll. That produces a printout like this:

BASE RELOCATIONS #7
  EDB000 RVA,       1C SizeOfBlock
       0  DIR64      00000001821705D8
       8  DIR64      000000018228C178
      10  DIR64      0000000182209768
      18  DIR64      000000018228C160
      20  DIR64      000000018228C158
      28  DIR64      00000001821707B0
      30  DIR64      00000001821707B8
      38  DIR64      00000001821707D0
      40  DIR64      00000001821707E0
       0  ABS                        
  EDD000 RVA,      224 SizeOfBlock
     790  DIR64      000000018006FDB0
     798  DIR64      0000000180C8E960
...

This output contained 174k lines, so we can't sit down and inspect it by hand. We can't truncate it to zero. But we can truncate it to only contain entries for some pages. Here, the size 1C and 224 are sizes of individual pieces of the relocation table. If we truncate the relocation table size (in the data directory) to 0x1C bytes, we suddenly get a different error, ERROR_NOACCESS (998) instead of the previous dialog box and error 193. (I have no idea why this produces the error ERROR_NOACCESS here, but it's at least a difference from before, so I zoomed in on that.)

So somewhere, in the relocation table, there's something that makes the Windows loader bail out. So I bisected the table, leaving out more or less of it, until I managed to pinpoint the offending piece:

 16F7000 RVA,      190 SizeOfBlock
       8  DIR64      0000000181358C74
      28  DIR64      0000000181343004
      18  DIR64      0000000181342167
      50  DIR64      0000000181341FB7
      60  DIR64      0000000181343AF8
      70  DIR64      000000018134B456
      90  DIR64      0000000181343004
      80  DIR64      0000000181342167
[...]
     F78  DIR64      0000000181358977
     F98  DIR64      0000000181343004
     F88  DIR64      0000000181342167
     FC0  DIR64      0000000181341FB7
     FD0  DIR64      0000000181347341
     FE0  DIR64      0000000181370BF6
 16F8000 RVA,        C SizeOfBlock
       0  DIR64      0000000181343004
       0  ABS                        
 16F7000 RVA,        C SizeOfBlock
     FF0  DIR64      0000000181342167
       0  ABS                        

So here, there's a discontinuity in the table (or whatever you'd call it - it's not correctly sorted). With some more debugging in lld, I was able to pinpoint the source of this discontinutiy down to the .rdata section in the go.o object file, and to the list of relocations I referenced above.

@cherrymui
Copy link
Member

@mstorsjo thanks for the detailed investigation! So it sounds like some relocations are not sorted and the PE loader doesn't like them?

Could you, or @jeremyd2019 , try if this patch makes any difference? Thanks!

diff --git a/src/cmd/internal/obj/objfile.go b/src/cmd/internal/obj/objfile.go
index e7bc45ccdf..39838da4ad 100644
--- a/src/cmd/internal/obj/objfile.go
+++ b/src/cmd/internal/obj/objfile.go
@@ -173,6 +173,7 @@ func WriteObjFile(ctxt *Link, b *bio.Writer) {
        h.Offsets[goobj.BlkReloc] = w.Offset()
        for _, list := range lists {
                for _, s := range list {
+                       sort.Sort(relocByOff(s.R)) // XXX sort relocations
                        for i := range s.R {
                                w.Reloc(&s.R[i])
                        }

@mstorsjo
Copy link

@mstorsjo thanks for the detailed investigation! So it sounds like some relocations are not sorted and the PE loader doesn't like them?

Yes; object file level relocations aren't sorted. For normal linking this should be mostly harmless, but the non-sortedness carries over to the DLL/EXE base relocations, which seem to need to be in strictly increasing order, and lld assumes the object file relocations are correctly ordered to begin with - which they seem to be so far - I haven't checked the spec if this is a strict requirement or just something that all toolchains practically do.

Technically it would be possible to add a check and warning for it in lld, but as lld generally is quite performance focused and sometimes quite blindly assume that the input files are correct, I'm not sure if a check for it (or even worse, a sorting pass) would be welcomed there.

I could check and see what MS link.exe does in such a situation - if it does resolve it nicely, one could argue that lld should get some fix too.

Could you, or @jeremyd2019 , try if this patch makes any difference?

I don't have any setup available for testing it unfortunately, but let's hope @jeremyd2019 can.

@jeremyd2019
Copy link
Author

jeremyd2019 commented Mar 25, 2022

What leads me to believe this is the root cause (and how I found it):

Trying to load this DLL with a small test program like in the original post here, I get a dialog box pop up saying that it was unable load it (with error status 0xc000007b). If I try to do the same on Wine on linux on aarch64, it loads successfully. So the DLL isn't entirely broken, but some more strict check in Windows bails out here.

One of the most common issues in failing to load an executable like this has in the past been if there are gaps in the section layout, which often can happen after stripping (but wouldn't happen if the linker omits debug info directly, unless the linker is broken). But that isn't an issue here, all the sections are correctly laid out. All the DLL flags and similar are set correctly. (On arm64, the dynamicbase flag is a must, contrary to x86.)

This is as far as I was able to get with my knowledge. I wish there were some sort of 'PE linter' that could detect issues that would make Windows reject an image. I guess I could start one at least with what I've learned so far...

Off-topic update: I started one: https://github.com/jeremyd2019/pelint/blob/main/pelint.py

Could you, or @jeremyd2019 , try if this patch makes any difference?

I don't have any setup available for testing it unfortunately, but let's hope @jeremyd2019 can.

msys2/MINGW-packages@master...jeremyd2019:test-go-patch 😁

Rebuilt go with those patches, and then rebuilt rclone & librclone (from its mingw-w64-rclone package). The DLL loaded in my test exe, and librclone's included python/ctypes wrapper test also worked. I also managed to get their C test for librclone working with minimal modifications for Windows.

@gopherbot
Copy link

Change https://go.dev/cl/396195 mentions this issue: cmd/internal/obj: sort relocations

@cherrymui
Copy link
Member

Thanks for confirming.

@jeremyd2019
Copy link
Author

It makes me wonder why Windows i686/x86_64 doesn't have this issue (ie, is there some earlier step sorting relocations for them?)

@dmitshur dmitshur modified the milestones: Backlog, Go1.19 Mar 28, 2022
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 28, 2022
@golang golang locked and limited conversation to collaborators Mar 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants