Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: ntdll.dll errors in rtlGetNtVersionNumbers via os.StartProcess #49731

Closed
bcmills opened this issue Nov 22, 2021 · 26 comments
Closed

syscall: ntdll.dll errors in rtlGetNtVersionNumbers via os.StartProcess #49731

bcmills opened this issue Nov 22, 2021 · 26 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 22, 2021

# go run run.go -G=0 fixedbugs/issue26411.go
exit status 2
panic: Failed to find RtlGetNtVersionNumbers procedure in ntdll.dll: The specified procedure could not be found.

goroutine 165 [running]:
panic({0x163b3a0, 0xc0003027e0})
	C:/workdir/go/src/runtime/panic.go:941 +0x397 fp=0xc0002b68c8 sp=0xc0002b6808 pc=0x1116437
syscall.(*LazyProc).mustFind(...)
	C:/workdir/go/src/syscall/dll_windows.go:293
syscall.(*LazyProc).Addr(...)
	C:/workdir/go/src/syscall/dll_windows.go:300
syscall.rtlGetNtVersionNumbers(0xc000546480?, 0x2?, 0x1bff6b55d00?)
	C:/workdir/go/src/syscall/zsyscall_windows.go:1219 +0x7a fp=0xc0002b6920 sp=0xc0002b68c8 pc=0x115d33a
syscall.StartProcess({0xc0004061e0?, 0xc0005004c8?}, {0xc00029b000, 0x2, 0x2}, 0xc0000ba000?)
	C:/workdir/go/src/syscall/exec_windows.go:317 +0x26b fp=0xc0002b6b88 sp=0xc0002b6920 pc=0x115424b
os.startProcess({0xc0004061e0, 0x30}, {0xc00029b000, 0x2, 0x2}, 0xc0002b6e60)
	C:/workdir/go/src/os/exec_posix.go:54 +0x305 fp=0xc0002b6c68 sp=0xc0002b6b88 pc=0x11be445
os.StartProcess({0xc0004061e0, 0x30}, {0xc00029b000, 0x2, 0x2}, 0xf6df31f8?)
	C:/workdir/go/src/os/exec.go:109 +0x5a fp=0xc0002b6cb0 sp=0xc0002b6c68 pc=0x11bde9a
os/exec.(*Cmd).Start(0xc00037ec60)
	C:/workdir/go/src/os/exec/exec.go:422 +0x611 fp=0xc0002b6eb8 sp=0xc0002b6cb0 pc=0x11ef631
os/exec.(*Cmd).Run(0xc0004061e0?)
	C:/workdir/go/src/os/exec/exec.go:338 +0x1e fp=0xc0002b6ed8 sp=0xc0002b6eb8 pc=0x11eed3e
cmd/go/internal/work.(*Builder).toolID(0xc00007d220, {0x16b450c, 0x7})
	C:/workdir/go/src/cmd/go/internal/work/buildid.go:167 +0x425 fp=0xc0002b70b8 sp=0xc0002b6ed8 pc=0x15599c5
cmd/go/internal/work.(*Builder).buildActionID(0xc00007d220, 0xc0001c9180)
	C:/workdir/go/src/cmd/go/internal/work/exec.go:312 +0xef5 fp=0xc0002b74c0 sp=0xc0002b70b8 pc=0x155faf5
cmd/go/internal/work.(*Builder).build(0xc00007d220, {0xc000018130?, 0xc000406180?}, 0xc0001c9180)
	C:/workdir/go/src/cmd/go/internal/work/exec.go:472 +0x199 fp=0xc0002b7db8 sp=0xc0002b74c0 pc=0x1560f99
cmd/go/internal/work.(*Builder).Do.func2({0x17b24b8, 0xc000018130}, 0xc0001c9180)
	C:/workdir/go/src/cmd/go/internal/work/exec.go:139 +0x5ec fp=0xc0002b7f20 sp=0xc0002b7db8 pc=0x155e8ec
cmd/go/internal/work.(*Builder).Do.func3()
	C:/workdir/go/src/cmd/go/internal/work/exec.go:201 +0xb9 fp=0xc0002b7fe0 sp=0xc0002b7f20 pc=0x155e119
runtime.goexit()
	C:/workdir/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0002b7fe8 sp=0xc0002b7fe0 pc=0x1143761
created by cmd/go/internal/work.(*Builder).Do
	C:/workdir/go/src/cmd/go/internal/work/exec.go:187 +0x3da

greplogs --dashboard -md -l -e 'Failed to find RtlGetNtVersionNumbers'

2021-11-22T18:57:33-5a3d871/windows-amd64-longtest

CC @bufflig

@bcmills
Copy link
Contributor Author

bcmills commented Nov 22, 2021

Given the failure mode here I suspect some kind of race. No idea beyond that. 🤷‍♂️

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows labels Nov 22, 2021
@bcmills bcmills added this to the Backlog milestone Nov 22, 2021
@bufflig
Copy link
Contributor

bufflig commented Nov 23, 2021

I have no idea, the dll is loaded but the basic function is not found. The only reasonable cause for this would be that some other dll was found under this name, but it's a system dll, so that also seems improbable. I don't know of any reasonable event in the OS that would do this to a program, so I feel it has to be internal. I could imagine something getting garbage collected that messes up the call to fetch the function pointer, but I can't see anything obvious in the code - or something messing with the syscall structure of the m, but I can't see anything that doesn't first lockOsThread that touches that.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 23, 2021

Another one, same symbol but different error:
https://build.golang.org/log/aab177b601b0996b06fcb82738a63407e213c2fc

--- FAIL: TestPErsrcLLVM (0.21s)
    link_test.go:858: building failed: exit status 2, output:
        panic: Failed to load ntdll.dll: The specified module could not be found.
        
        goroutine 13 [running]:
        panic({0x7fb3a0, 0xc0003860f0})
        	C:/workdir/go/src/runtime/panic.go:941 +0x397 fp=0xc00044a8c8 sp=0xc00044a808 pc=0x2d6437
        syscall.(*LazyProc).mustFind(...)
        	C:/workdir/go/src/syscall/dll_windows.go:293
        syscall.(*LazyProc).Addr(...)
        	C:/workdir/go/src/syscall/dll_windows.go:300
        syscall.rtlGetNtVersionNumbers(0xc000380640?, 0x2?, 0x1c2344f8a80?)
        	C:/workdir/go/src/syscall/zsyscall_windows.go:1219 +0x7a fp=0xc00044a920 sp=0xc00044a8c8 pc=0x31d33a
        syscall.StartProcess({0xc000218060?, 0xc000362df8?}, {0xc0000963a0, 0x2, 0x2}, 0xc000315000?)
        	C:/workdir/go/src/syscall/exec_windows.go:317 +0x26b fp=0xc00044ab88 sp=0xc00044a920 pc=0x31424b
        os.startProcess({0xc000218060, 0x30}, {0xc0000963a0, 0x2, 0x2}, 0xc00044ae60)
        	C:/workdir/go/src/os/exec_posix.go:54 +0x305 fp=0xc00044ac68 sp=0xc00044ab88 pc=0x37e445
        os.StartProcess({0xc000218060, 0x30}, {0xc0000963a0, 0x2, 0x2}, 0x34611190?)
        	C:/workdir/go/src/os/exec.go:109 +0x5a fp=0xc00044acb0 sp=0xc00044ac68 pc=0x37de9a
        os/exec.(*Cmd).Start(0xc000254b00)
        	C:/workdir/go/src/os/exec/exec.go:422 +0x611 fp=0xc00044aeb8 sp=0xc00044acb0 pc=0x3af631
        os/exec.(*Cmd).Run(0xc000218060?)
        	C:/workdir/go/src/os/exec/exec.go:338 +0x1e fp=0xc00044aed8 sp=0xc00044aeb8 pc=0x3aed3e
        cmd/go/internal/work.(*Builder).toolID(0xc0000952c0, {0x87450c, 0x7})
        	C:/workdir/go/src/cmd/go/internal/work/buildid.go:167 +0x425 fp=0xc00044b0b8 sp=0xc00044aed8 pc=0x7199c5
        cmd/go/internal/work.(*Builder).buildActionID(0xc0000952c0, 0xc00035c780)
        	C:/workdir/go/src/cmd/go/internal/work/exec.go:312 +0xef5 fp=0xc00044b4c0 sp=0xc00044b0b8 pc=0x71faf5
        cmd/go/internal/work.(*Builder).build(0xc0000952c0, {0xc000018110?, 0xc000218000?}, 0xc00035c780)
        	C:/workdir/go/src/cmd/go/internal/work/exec.go:472 +0x199 fp=0xc00044bdb8 sp=0xc00044b4c0 pc=0x720f99
        cmd/go/internal/work.(*Builder).Do.func2({0x9724b8, 0xc000018110}, 0xc00035c780)
        	C:/workdir/go/src/cmd/go/internal/work/exec.go:139 +0x5ec fp=0xc00044bf20 sp=0xc00044bdb8 pc=0x71e8ec
        cmd/go/internal/work.(*Builder).Do.func3()
        	C:/workdir/go/src/cmd/go/internal/work/exec.go:201 +0xb9 fp=0xc00044bfe0 sp=0xc00044bf20 pc=0x71e119
        runtime.goexit()
        	C:/workdir/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00044bfe8 sp=0xc00044bfe0 pc=0x303761
        created by cmd/go/internal/work.(*Builder).Do
        	C:/workdir/go/src/cmd/go/internal/work/exec.go:187 +0x3da

@bcmills
Copy link
Contributor Author

bcmills commented Nov 23, 2021

Broadening the search, looks like it's still just those two, and both very recently. (This looks like a 1.18 regression to me.)

greplogs --dashboard -md -l -e 'panic.*ntdll\.dll.*\n\s*\n\s*goroutine \d+.*:\n(?:\s*.+\n\s*\t.+\n)*\s*syscall\.' --since=2021-01-01

2021-11-22T21:52:20-100d7ea/windows-amd64-longtest
2021-11-22T18:57:33-5a3d871/windows-amd64-longtest

@bcmills bcmills added release-blocker okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 labels Nov 23, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Nov 23, 2021
@bcmills bcmills changed the title syscall: "Failed to find RtlGetNtVersionNumbers procedure in ntdll.dll" on windows-amd64-longtest builder syscall: ntdll.dll errors in rtlGetNtVersionNumbers via os.StartProcess Nov 23, 2021
@bufflig
Copy link
Contributor

bufflig commented Nov 23, 2021

I did run a test over night to try to provoke this on 32bit+64bit amd and arm64, it would not happen (surprise). I'll try to see if I can provoke a gomote, so far it has only happened on amd64-longtest, so maybe I need that specific setup to get the timing right...

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 29, 2021

RtlGetNtVersionNumbers is just reading the fields right out of the PEB. It's absolutely trivial. Need be, we could reimplement this in Go instead.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 29, 2021

Is it possible that this is some GC jujitsu?

func (d *DLL) FindProc(name string) (proc *Proc, err error) {
        namep, err := BytePtrFromString(name)
        if err != nil {
                return nil, err
        }
        a, e := getprocaddress(uintptr(d.Handle), namep)

And then:

//go:linkname syscall_getprocaddress syscall.getprocaddress
//go:nosplit
//go:cgo_unsafe_args
func syscall_getprocaddress(handle uintptr, procname *byte) (outhandle, err uintptr) {
        lockOSThread()
        defer unlockOSThread()
        c := &getg().m.syscall
        c.fn = getGetProcAddress()
        c.n = 2
        c.args = uintptr(noescape(unsafe.Pointer(&handle)))
        cgocall(asmstdcallAddr, unsafe.Pointer(c))
        outhandle = c.r1
        if outhandle == 0 {
                err = c.err
        }
        return
}

Could something weird be happening where procname doesn't appear visible/used by the GC, so those bytes get cleaned up before GetProcAddr is actually called, so it looks up garbage?

@gopherbot
Copy link

Change https://golang.org/cl/367654 mentions this issue: runtime: keep procname alive so it's not GC'd

@ianlancetaylor
Copy link
Contributor

If you're right (I don't know whether you are or not) that may indicate a bug in the implementation of //go:cgo_unsafe_args. That directive needs to imply that all the arguments remain live as long as the first argument is live. CC @randall77 @mdempsky

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 29, 2021

That directive needs to imply that all the arguments remain live as long as the first argument is live.

Sort of I think. The majority of uses cases for it pass around integers of one form or another, which means for the uintptr case, the caller is responsible for keeping alive its casted arguments. The Windows example above violates that. But most others don't.

Except for one OpenBSD case I found, maybe:

diff --git a/src/runtime/sys_openbsd1.go b/src/runtime/sys_openbsd1.go
index 4b80f60226..d852e3c58a 100644
--- a/src/runtime/sys_openbsd1.go
+++ b/src/runtime/sys_openbsd1.go
@@ -14,7 +14,10 @@ import (
 //go:nosplit
 //go:cgo_unsafe_args
 func thrsleep(ident uintptr, clock_id int32, tsp *timespec, lock uintptr, abort *uint32) int32 {
-	return libcCall(unsafe.Pointer(abi.FuncPCABI0(thrsleep_trampoline)), unsafe.Pointer(&ident))
+	ret := libcCall(unsafe.Pointer(abi.FuncPCABI0(thrsleep_trampoline)), unsafe.Pointer(&ident))
+	KeepAlive(tsp)
+	KeepAlive(abort)
+	return ret
 }
 func thrsleep_trampoline()
 

My CL handles the three Windows ones, and depending on how certain the OpenBSD one smells, I'll roll that into the same patch.

@mdempsky
Copy link
Member

That directive needs to imply that all the arguments remain live as long as the first argument is live.

At least in the cases addressed by https://go-review.googlesource.com/c/go/+/367654/3/src/runtime/syscall_windows.go, even the first argument is already dead by time cgocall is called. So if we have to add KeepAlive calls anyway, it seems simpler to just add them for all of the parameters.

But I'm also not opposed to making //go:cgo_unsafe_args imply that all of the parameter variables have shared liveness. Not sure if it would complicate stack objects though? We'd have to represent them as a single, shared object.

@mdempsky
Copy link
Member

As far as I can tell, the only effect of //go:cgo_unsafe_args is to force the ABI0 calling convention so that parameters are laid out in memory predictably. I don't see any cmd/compile code for changing liveness analysis or stack objects.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 29, 2021

This specific bug is sort of interesting from a security persecptive, as I wonder if it could be leveraged for some clever code execution arising from the UaF. Code was going to call write(attacker_controlled), but UaF is exploited to turn that write into a system.

CC @FiloSottile

@ianlancetaylor
Copy link
Contributor

In syscall_getprocaddress we are passing &handle to cgocall. The meaning of cgo_unsafe_args is that a pointer to the first argument is a pointer to all the arguments. So in principle passing &handle to cgocall ought to keep all the arguments and everything they point to alive. In practice this doesn't work since &handle is by definition on the stack, not the heap, and cgocall doesn't know that keeping that pointer alive should keep the other arguments alive.

Still, I think that we should make the compiler do something sensible here, rather than depending on adding KeepAlive calls as needed. The options I see are to change the stack map liveness information so that a pointer to any argument is a pointer to all arguments, or to simply keep all arguments alive.

@mdempsky
Copy link
Member

In syscall_getprocaddress we are passing &handle to cgocall.

The code in question is:

	c := &getg().m.syscall
	c.fn = getGetProcAddress()
	c.n = 2
	c.args = uintptr(noescape(unsafe.Pointer(&handle)))
	cgocall(asmstdcallAddr, unsafe.Pointer(c))

This code is storing &handle as a uintptr into memory, and then it's not referenced again. As far as the compiler and runtime are concerned, handle is now dead, unless we add a KeepAlive(handle).

Maybe we can change args from uintptr to unsafe.Pointer, and the KeepAlive(handle) is unnecessary. But we'd still have to add extra compiler logic to tweak the liveness rules within the compiler so that keeping the first parameter alive implies keeping all of them alive.

I agree the latter seems in principle more appealing. But as //go:cgo_unsafe_args is a stdlib-only pragma, I'm inclined to say we should just stick to the former solution (i.e., adding KeepAlive calls) because it seems overall simpler at the moment. Liveness and stack objects are already complex and subtle enough, whereas adding KeepAlives is relatively straight forward. And from @zx2c4 's analysis, it sounds like there aren't many cases that need it.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 29, 2021

My analysis and CL was looking at pointers, such as procname *byte that are passed, looking for the case where the memory it points to is kept alive. I didn't exhaustively examine the prevalence of taking the address of the argument variable and casting it to uintptr because it seems like tons of things do that. If that's problematic -- it sounds like it could be -- I'll have to rework my patch and grep around again.

@ianlancetaylor
Copy link
Contributor

We also use //go:cgo_unsafe_args in code generated by cgo. But actually I see now that that code is OK because cgo also emits code to keep all the parameters alive (that code precedes runtime.KeepAlive and should perhaps change to call it).

So, OK, let's try to remember to fix the uses in the standard library. Thanks.

@mdempsky
Copy link
Member

@zx2c4 Thanks. I'd recommend we look at all //go:cgo_unsafe_args functions that have a pointer-y parameter (in practice, probably just simple pointers and unsafe.Pointer, but maps/channels/etc or structs/arrays containing them would be interesting too).

E.g., looking briefly, this function in runtime/sys_darwin.go would be suspect:

//go:nosplit
//go:cgo_unsafe_args
func pthread_create(attr *pthreadattr, start uintptr, arg unsafe.Pointer) int32 {
        return libcCall(unsafe.Pointer(abi.FuncPCABI0(pthread_create_trampoline)), unsafe.Pointer(&attr))
}

because arg is immediately dead. (But in practice for this particular function, I expect arg always refers to explicitly managed memory, which would make this a non-issue.)

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 30, 2021

Right, yea. I've done a big pass through and caught a lot of cases on a bunch of platforms. The CL is updated now.

@ianlancetaylor / @randall77 - I'll wait for your 👍 on that CL before I submit it.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 30, 2021

By the way, the patch's diffstat is now +124/-27, which is quite a bit more than the original +1/-0 when this all began. That makes me wonder whether we'd be better off changing the meaning of //go:cgo_unsafe_args in the compiler instead?

@mdempsky
Copy link
Member

That makes me wonder whether we'd be better off changing the meaning of //go:cgo_unsafe_args in the compiler instead?

I was wondering that too. I think it would be pretty easy for liveness analysis to just mark all parameters of //go:cgo_unsafe_args as live at exit, which would be comparable to your CL.

I'm inclined though to say lets land your CL since it's already prepared, and we can always roll it back if we implement the cmd/compile change.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 30, 2021

I'm inclined though to say lets land your CL since it's already prepared, and we can always roll it back if we implement the cmd/compile change.

That works. Hopefully we can get all the changes, then, into one CL now, so that the eventual revert is more straight forward.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 30, 2021

@gopherbot please backport this because it seems kind of security sensitive.

@gopherbot
Copy link

Backport issue(s) opened: #49867 (for 1.16), #49868 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/368355 mentions this issue: [release-branch.go1.17] runtime: keep //go:cgo_unsafe_args arguments alive to prevent GC

@gopherbot
Copy link

Change https://golang.org/cl/368356 mentions this issue: [release-branch.go1.16] runtime: keep //go:cgo_unsafe_args arguments alive to prevent GC

gopherbot pushed a commit that referenced this issue Dec 1, 2021
…alive to prevent GC

When syscall's DLL.FindProc calls into syscall_getprocaddress with a
byte slice pointer, we need to keep those bytes alive. Otherwise the GC
will collect the allocation, and we wind up calling `GetProcAddress` on
garbage, which showed up as various flakes in the builders. It turns out
that this problem extends to many uses of //go:cgo_unsafe_args
throughout, on all platforms. So this patch fixes the issue by keeping
non-integer pointer arguments alive through their invocation in
//go:cgo_unsafe_args functions.

Fixes #49868.
Updates #49731.

Change-Id: I93e4fbc2e8e210cb3fc53149708758bb33f2f9c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/368355
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Dec 1, 2021
…alive to prevent GC

When syscall's DLL.FindProc calls into syscall_getprocaddress with a
byte slice pointer, we need to keep those bytes alive. Otherwise the GC
will collect the allocation, and we wind up calling `GetProcAddress` on
garbage, which showed up as various flakes in the builders. It turns out
that this problem extends to many uses of //go:cgo_unsafe_args
throughout, on all platforms. So this patch fixes the issue by keeping
non-integer pointer arguments alive through their invocation in
//go:cgo_unsafe_args functions.

Fixes #49867.
Updates #49731.

Change-Id: I93e4fbc2e8e210cb3fc53149708758bb33f2f9c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/368356
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@golang golang locked and limited conversation to collaborators Dec 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants