Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: infinite recursion on windows triggered by morestack #21382

Closed
kjk opened this issue Aug 10, 2017 · 35 comments
Closed

runtime: infinite recursion on windows triggered by morestack #21382

kjk opened this issue Aug 10, 2017 · 35 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Milestone

Comments

@kjk
Copy link

kjk commented Aug 10, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.9rc2 windows/amd64

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\kjk\src\go
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config

What did you do?

This is a continuation of #20975 so the same repro program (https://github.com/kjk/go20975) built in 64bit mode.

What did you expect to see?

No infinite recursion.

What did you see instead?

This time I used https://github.com/kjk/cv2pdb to convert dwarf to pdb so that I can get symbols in windbg.

I ran repro program under windbg.

The crash is:

 # RetAddr           : Args to Child                                                           : Call Site
00 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x10 [C:\Go\src\runtime\asm_amd64.s @ 377] 
01 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
02 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
03 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
04 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
05 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
06 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
07 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
08 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0045104a : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
09 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0045104a 00000000`004519ee : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
0a 00000000`0043cc0b : 00000000`00451a56 00000000`0045104a 00000000`004519ee 00000000`004304f0 : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
0b 00000000`00451a56 : 00000000`0045104a 00000000`004519ee 00000000`004304f0 00000000`00b9fef0 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152] 
0c 00000000`0045104a : 00000000`004519ee 00000000`004304f0 00000000`00b9fef0 00000000`00000000 : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
0d 00000000`004519ee : 00000000`004304f0 00000000`00b9fef0 00000000`00000000 00007ffa`59b6e618 : go20975!runtime.exitsyscallfast.func1+0xaa [C:\Go\src\runtime\proc.go @ 2717] 
0e 00000000`004304f0 : 00000000`00b9fef0 00000000`00000000 00007ffa`59b6e618 00000000`00455804 : go20975!runtime.systemstack+0x7e [C:\Go\src\runtime\asm_amd64.s @ 347] 
0f 00000000`00b9fef0 : 00000000`00000000 00007ffa`59b6e618 00000000`00455804 00000000`006307d8 : go20975!runtime.mstart [C:\Go\src\runtime\proc.go @ 1125] 
10 00000000`00000000 : 00007ffa`59b6e618 00000000`00455804 00000000`006307d8 00000000`00b90e00 : 0xb9fef0
TEXT runtime·morestack(SB),NOSPLIT,$0-0
	// Cannot grow scheduler stack (m->g0).
	get_tls(CX)
	MOVQ	g(CX), BX
	MOVQ	g_m(BX), BX
	MOVQ	m_g0(BX), SI
	CMPQ	g(CX), SI
	JNE	3(PC)
	CALL	runtime·badmorestackg0(SB)
	INT	$3

INT $3 is executed which triggers runtime.sigpanic. I assume sigpanic does stack check, calls morestack and that does INT $3 again. Infite loop happens and eventually crash will happen.

@kjk
Copy link
Author

kjk commented Aug 10, 2017

So I'm stepping through the assembly and there's more fishy stuff.

After int 3 we end up in:

00000000`00451a56 03488b          add     ecx,dword ptr [rax-75h] ds:ffffffff`ffffffa2=????????
00000000`00451a59 7350            jae     go20975!runtime.morestack+0x7b (00000000`00451aab)
00000000`00451a5b 4839b100000000  cmp     qword ptr [rcx],rsi

However, at that point rax is 0x17, so trying to de-reference [rax-75h] throws an exception:

0:000> t
(1830.205c): Access violation - code c0000005 (first chance)

That doesn't make sense to me unless this is a trick to just trigger an exception.

Here's a what gets executed, according to windbg, when single-stepping from int 3 to calling morestack again:

go20975!runtime.morestack+0x25:
00000000`00451a55 cd03            int     3
0:000> p
WARNING: This break is not a step/trace completion.
The last command has been cleared to prevent
accidental continuation of this unrelated event.
Check the event, location and thread before resuming.
(1830.205c): Break instruction exception - code 80000003 (first chance)
go20975!runtime.morestack+0x26:
00000000`00451a56 03488b          add     ecx,dword ptr [rax-75h] ds:ffffffff`ffffffa2=????????
0:000> t
(1830.205c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
go20975!runtime.morestack+0x26:
00000000`00451a56 03488b          add     ecx,dword ptr [rax-75h] ds:ffffffff`ffffffa2=????????
0:000> t
go20975!runtime.sigpanic+0x9:
00000000`0043ca89 488b8900000000  mov     rcx,qword ptr [rcx] ds:00000000`0077b438={go20975!runtimeg0 (00000000`0077b0a0)}
0:000> t
go20975!runtime.morestack_noctxt:
00000000`00451ad0 31d2            xor     edx,edx
0:000> t
go20975!runtime.morestack_noctxt+0x2:
00000000`00451ad2 e959ffffff      jmp     go20975!runtime.morestack (00000000`00451a30)
0:000> t
go20975!runtime.morestack:
00000000`00451a30 65488b0c2528000000 mov   rcx,qword ptr gs:[28h] gs:00000000`00000028=????????????????

I don't get how executing:

go20975!runtime.sigpanic+0x9:
00000000`0043ca89 488b8900000000  mov     rcx,qword ptr [rcx] ds:00000000`0077b438={go20975!runtimeg0 (00000000`0077b0a0)}

ends up going to go20975!runtime.morestack_noctxt == 0000000000451ad0`.

@mvdan
Copy link
Member

mvdan commented Aug 10, 2017

Just to clarify, is this when building the program, or when running it?

Does this happen with 1.8?

CC @aclements

@mvdan
Copy link
Member

mvdan commented Aug 10, 2017

Also, if this was an infinite recursion, wouldn't you end up with a panic or crash of some sort? I don't know what windbg is, so perhaps there's something I'm missing.

@kjk
Copy link
Author

kjk commented Aug 10, 2017

Eventually the process will go away due to stack overflow exception. In this scenario runtime is incapable of handling it and generating a proper panic.

@alexbrainman
Copy link
Member

In this scenario runtime is incapable of handling it and generating a proper panic.

I would not expect Go to generate proper panic after executing INT $3. I will let Austin decide if something needs to be done here.

Alex

@aclements
Copy link
Member

I'd like to understand how we wound up in morestack without any remaining system stack space in the first place. Once we hit the INT $3, it would be nice to fail more gracefully, but things are toast anyway.

If there any way MSHTML could be calling back into Go code while deep in the stack?

If not, and I'm grasping at straws here, but my guess is that the C "syscall" code (which runs on the system stack) is running out of stack space, which invokes a Windows exception handler registered by the runtime, which also attempts to run on the system stack and fails when it sees there's no stack left. @alexbrainman, I know very little about how Windows exception handlers work; does this seem like a plausible explanation?

(Notably, on UNIX platforms, the signal handler runs on yet another stack that's only for signal handling, so even if we run out of space on the system stack, we have a little more backup room in which to fail gracefully.)

@aclements
Copy link
Member

Actually, this is sort of interesting, though I'm not sure what to make of it:

0c 00000000`0045104a : 00000000`004519ee 00000000`004304f0 00000000`00b9fef0 00000000`00000000 : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382] 
0d 00000000`004519ee : 00000000`004304f0 00000000`00b9fef0 00000000`00000000 00007ffa`59b6e618 : go20975!runtime.exitsyscallfast.func1+0xaa [C:\Go\src\runtime\proc.go @ 2717] 

exitsyscallfast.func1 is specifically the closure that does throw("exitsyscall: syscall frame is no longer valid"). This indicates that we tried to return from the system call (though why that would be, I'm not sure), but the stack got unwound or the SP just changed completely. Then, we tried to switch to the system stack to report this, but it was full, leading to a cascade of other problems.

@kjk, can you put a breakpoint in exitsyscall at the first call to systemstack (inside the if getcallersp(unsafe.Pointer(&dummy)) > g.syscallsp`) and see what the call stack there is?

@kjk
Copy link
Author

kjk commented Sep 5, 2017

@aclements Please also read comments #20975 (comment) and below as this is the same issue and there is more detail there.

To summarize my guesses at this point:

It's not caused by running out of stack space.

morestack is called unconditionally (i.e. regardless of how much stack is left) by the closure passed to systemstack in exitsyscallfast.func1.

When morestack is called there's plenty of stack but it detects that it's being called on scheduler stack (g.m.g0 == g) which shouldn't happen because systemstack is supposed to ensure that it's, well, system stack. There seems to be a missed case in that logic.

When morestack detects this invariant being violated, it does int 3 to trigger debugger and make debugging easy.

It seems to be 64-bit only so I assume it's some of the arch-specific runtime assembly routines.

mshtml per se doesn't call Go but there are plenty of C->Go->C transitions because of how Windows message processing works.

Each window has a callback (called wndproc) responsible for handling message for that window. In Windows every control (a button, listview, browser view etc.) is a window.

To add custom handling of messages we need to provide our own wndproc callback, which must be called via C->Go trampoline. When that callback is not interested in the message, we need to call the original wndproc for that message, which is Go->C transition.

So every GUI windows program has a high rate of C->Go and Go->C transitions, especially those using https://github.com/lxn/walk/ library, as it hoooks wndproc for all windows it creates.

This also makes debugging with breakpoints impossible. I've spent several hours setting breakpoints at various points and stepping through the code but the same code works correctly the first 1000 times and then fails.

To summarize my beliefs:

  • not caused by running out of stack
  • caused by systemstack failing to switch to system stack before calling its closure and remaining on scheduler stack
  • 64-bit only
  • not deterministic but correlated to high rate of C->Go and Go->C transitions
  • most likely a bug in runtime.systemstack in asm_arm64.s

The most promising approach would be to instrument systemstack to add the same check that morestack does but when exiting systemstack, to catch bad condition (remaining on scheduler stack) earlier.

@aclements
Copy link
Member

@kjk, systemstack is extremely well-trodden code. Obviously it's not impossible that it contains a bug, but that's way down on my list of candidates.

morestack is called unconditionally (i.e. regardless of how much stack is left) by the closure passed to systemstack in exitsyscallfast.func1.

Why do you say that? It never makes sense to call morestack unconditionally, and, looking at the disassembly of exitsyscallfast.func1, it clearly does check the stack bound before calling morestack, as it's supposed to.

When morestack is called there's plenty of stack but it detects that it's being called on scheduler stack (g.m.g0 == g) which shouldn't happen because systemstack is supposed to ensure that it's, well, system stack.

This isn't quite right. There is no separate "scheduler stack", there's just the user stack and the system stack (and the signal stack on UNIX). If g.m.g0 == g, then you're on the system stack. So, systemstack is supposed to put you on the system stack, at which point g.m.g0 == g, and any call to morestack should panic.

What makes you say there's plenty of stack when it calls morestack? I didn't see evidence for that here or on the other issue (I may have just missed it; there are a lot of posts).

To add custom handling of messages we need to provide our own wndproc callback, which must be called via C->Go trampoline.

Can you point me to where your code is doing this? Normally this would go through the cgo callback paths, but since your application isn't using cgo, I'm curious how this is being done.

Given the C->Go callbacks, this is all precisely the behavior I would expect if C code were using up the system stack and then calling back into Go code.

(From my earlier post:)

exitsyscallfast.func1 is specifically the closure that does throw("exitsyscall: syscall frame is no longer valid").

Oops, I'd missed the fast in there, so I was looking at the wrong closure. Unfortunately, I would expect exitsyscallfast.func1 to be called quite frequently in normal operation, so setting a breakpoint there isn't useful. (But it does mean the SP probably isn't getting totally trashed like I thought.)

@kjk
Copy link
Author

kjk commented Sep 5, 2017

Like I said, those are guesses, you're more likely to be right than me.

There is no separate "scheduler stack"

I'm just parroting back terminology used by the code e.g.

// Cannot grow scheduler stack (m->g0).

What makes you say there's plenty of stack when it calls morestack?

I've tried the repro with ridiculously large (16 MB) stack and got the same thing.

In the debugger, I printed the callstack and it was relatively short from main().

Either way, this particular issue is due to morestack detecting an internal inconsistency (and not being able to handle it via somewhat controlled panic which eventually triggering windows exception that silently kills the process).

Given the C->Go callbacks, this is all precisely the behavior I would expect if C code were using up the system stack and then calling back into Go code.

It's also consistent with being confused about which stack the code is on.

If the code is confused about which stack it is on, then we might be on a thread with plenty of stack but "needs to grow stack" check is done on the wrong stack, wrongly detects need to expand stack, calls morestack which detects it's the wrong stack and does int 3.

Can you point me to where your code is doing this?

On windows syscall.Syscal does Go->C call and syscall.NewCallback creates C->Go callback.

This is all done in the lnx/walk library:

Windows GUI code is roughly this (https://github.com/lxn/walk/blob/2d327b4a1aba7cda2a365bc566fd60ea6bd4c8bf/form.go#L365):

  • there's an infinite loop calling (Windows OS functions) GetMessage()/DispatchMessage() (until getting a message indicating the app has been closed)
  • DispatchMessage() is a win32 OS function so that's Go -> C transition. It determines which window is target of the message and calls its wndproc. In our case it causes C -> Go transition as wndproc is a trampoline to a Go code
  • often Go callback doesn't process the message and calls the original C wndproc, which is Go -> C transition
  • then they all unwind, go back to Go code that repeats GetMessage()/DispatchMessage() until the user closes the window, triggering the exit

It's unavoidable to get Go -> C -> Go -> C in Windows GUI programs. Using cgo is not necessary for that.

@alexbrainman
Copy link
Member

is running out of stack space, which invokes a Windows exception handler registered by the runtime, which also attempts to run on the system stack and fails when it sees there's no stack left.

Windows exception handler calls runtime.sigtramp. The runtime.sigtramp will run on scheduler stack. Also see _StackSystem is used to make sure we always have enough room to run exception handler.

Normally this would go through the cgo callback paths, but since your application isn't using cgo, I'm curious how this is being done.

If you are interested to see simple Windows GUI app, you can download d8b239ff60a62c3f50f7eb5994221b50ba055cf2 commit (initial commit) of https://github.com/alexbrainman/gowingui

Alex

@mvdan mvdan changed the title Runtime infinite recursion on windows triggered by morestack runtime: infinite recursion on windows triggered by morestack Jan 9, 2018
@mvdan mvdan added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 9, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone Mar 30, 2018
@aclements
Copy link
Member

To distinguish this from #20975, let's keep this issue about the infinite recursion caused by aborting in morestack (I was muddying this together with #20975 as much as anyone).

The problem here is that after the runtime exceeds the g0 stack bounds (which happens because of #20975), it calls morestack, which invokes INT $3, which goes into the exception handler, which runs on the g0 stack, so it immediately calls morestack, and we get an infinite recursion.

@alexbrainman suggested some solutions in #20975 (comment).

@aclements
Copy link
Member

Following up to #20975 (comment):

It does sounds reasonable (you assume that runtime.morestack is called on runtime.exceptionhandler entry).

I assume that because it's true. :)

I suppose we could call runtime.exit instead of executing INT 3.

It's nice to have badmorestack fail in a "something went wrong, please debug me" way.

We do handle debug exceptions (search for _EXCEPTION_BREAKPOINT), but it is too late for this scenario. So, sure, we could modify runtime.sigtramp to handle this situation. Maybe we could mark runtime.exceptionhandler or at least some part of it as "no to grow stack" instead?

It would probably work fine to avoid trying to grow the stack until we know we're past the _EXCEPTION_BREAKPOINT check. I'll give that a go.

@gopherbot
Copy link

Change https://golang.org/cl/120857 mentions this issue: runtime: mark Windows exception handling entry nosplit

@aclements
Copy link
Member

I attempted to fix this with https://golang.org/cl/120857 but that didn't work. Unfortunately, I really have no way to debug this on Windows (I spent most of this afternoon trying to bring up a Windows VM to the point where I could debug this and didn't even get close). If someone else can apply that CL, set a breakpoint on runtime.badmorestackg0, and then single step to figure out where things are going wrong after that, I might be able to make some more progress.

@kjk
Copy link
Author

kjk commented Jun 29, 2018

I got it to the point of being able to run all.bat but I'm not familiar with working on Go compiler.

I know that all.bat eventually runs go tool dist test to run test but that doesn't allow to set breakpoints.

If you were on linux, how would you run just that one test under gdb?

For what it's worth, here's how I did setup on Windows: https://www.notion.so/Building-Go-from-source-on-Windows-bf2363a561864fa58c08a3c6d2305f97

@kjk
Copy link
Author

kjk commented Jun 29, 2018

BTW: I tried:

kjk@LAPTOP-HVPJGI3T C:\Users\kjk\src\go-dev\src
> ..\bin\go.exe test runtime

Which would work in normal circumstances but I get a lengthy error message:

go tool vet: exit status 2
flag provided but not defined: -V
Usage of vet:
...

Why does it even talk about vet when I'm invoking test ?

@ianlancetaylor
Copy link
Contributor

@kjk go test runtime is indeed the way to run the runtime tests. To get something you can run under gdb, go test -c runtime and gdb ./runtime.test. Some of the runtime tests involve running a test program, typically testprog or testprogcgo. You can run those directly by changing to the directory runtime/testdata/testprog (or testprogcgo) and running go build and ./testprog.exe TEST, where TEST is the name of the test to run. You will have look at the code in runtime/crash_test.go or whatever to see what the expected output is.

Running go test does normally invoke go vet, as documented at https://golang.org/cmd/go/#hdr-Test_packages . The error about -V doesn't make a lot of sense, though. You may want to open a separate issue about this. Provide the output of go test -x runtime.

@aclements
Copy link
Member

@kjk, for this specific problem, run go test -c runtime first to get a binary you can debug, then gdb --args runtime.txt -test.run TestG0StackOverflow. If all that works, you should be able to br 'runtime.badmorestackg0' and run, and then stepi repeatedly to single step. Presumably you can also do this all in windbg.

@kjk
Copy link
Author

kjk commented Jun 29, 2018

Those are notes in progress.

I got to a point I can get run the test under gdb with (in powershell):

  • $env:GOROOT = "C:\Users\kjk\src\go" : this I was missing previously
  • . "${env:GOROOT}\bin\go" test -c runtime -ldflags=-linkmode=external : linkmode external is necessary to work-around gdb bug parsing Go-generated DWARF info

So far this is with TestG0StackOverflow added but no fixes (to test before / after).

To run tests I do:

  • $env:TEST_G0_STACK_OVERFLOW = "1" : this tells TestG0StackOverflow to do the actual crash
  • gdb --args .\runtime.test.exe "-test.run" TestG0StackOverflow
  • br runtime.badmorestackg0
  • r

First anomaly: the initial signal is sigsev i.e. writing to unaccessible memory. That is unexpected as I would expect stack check to happen first. It seems like stack check doesn't work:

Thread 1 received signal SIGSEGV, Segmentation fault.
0x000000000045b797 in runtime.stackOverflow (x=<optimized out>) at C:/Users/kjk/src/go/src/runtime/export_test.go:462
462             var buf [16]byte

(gdb) disass 0x000000000045b797
Dump of assembler code for function runtime.stackOverflow:
   0x000000000045b770 <+0>:     mov    %gs:0x28,%rcx
   0x000000000045b779 <+9>:     mov    0x0(%rcx),%rcx
   0x000000000045b780 <+16>:    cmp    0x10(%rcx),%rsp
   0x000000000045b784 <+20>:    jbe    0x45b7b4 <runtime.stackOverflow+68>
   0x000000000045b786 <+22>:    sub    $0x20,%rsp
   0x000000000045b78a <+26>:    mov    %rbp,0x18(%rsp)
   0x000000000045b78f <+31>:    lea    0x18(%rsp),%rbp
   0x000000000045b794 <+36>:    xorps  %xmm0,%xmm0
=> 0x000000000045b797 <+39>:    movups %xmm0,0x8(%rsp)
   0x000000000045b79c <+44>:    lea    0x8(%rsp),%rax
   0x000000000045b7a1 <+49>:    mov    %rax,(%rsp)
   0x000000000045b7a5 <+53>:    callq  0x45b770 <runtime.stackOverflow>
   0x000000000045b7aa <+58>:    mov    0x18(%rsp),%rbp
   0x000000000045b7af <+63>:    add    $0x20,%rsp
   0x000000000045b7b3 <+67>:    retq
   0x000000000045b7b4 <+68>:    callq  0x45f350 <runtime.morestack_noctxt>
   0x000000000045b7b9 <+73>:    jmp    0x45b770 <runtime.stackOverflow>
End of assembler dump.

(gdb) info reg
rax            0xd64018 14041112
rbx            0xf5fdc8 16121288
rcx            0x8faec0 9416384
rdx            0x711cf0 7412976
rsi            0x45f1f0 4583920
rdi            0x45ed50 4582736
rbp            0xd64000 0xd64000
rsp            0xd63fe8 0xd63fe8
r8             0xc000010448     824633787464
r9             0x1      1
r10            0x1      1
r11            0x1      1
r12            0x31     49
r13            0x6      6
r14            0x5      5
r15            0x27     39
rip            0x45b797 0x45b797 <runtime.stackOverflow+39>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x53     83
gs             0x2b     43

My assembly is rusty but: the line that crashes is movups %xmm0,0x8(%rsp) which seems to me is zero-ing the stack.

It should never get there because the stack check should detect lack of stack and jump to 0x45b7b4 to call runtime.morestack_noctxt.

If my reading is correct then putting anything on stack is likely to cause another crash (i.e. sigsev).

@kjk
Copy link
Author

kjk commented Jun 29, 2018

Interestingly, adding nosplit makes things worse.

The code that is in tree only prints "fatal: morestack on g0" twice.
Adding nosplit as in the PR makes it print a lot of them.

Here's the difference in the debugger when runtime.badmorestackg0 is hit:

Code in tree:

Thread 1 hit Breakpoint 1, runtime.badmorestackg0 () at C:/Users/kjk/src/go/src/runtime/proc.go:434
434     func badmorestackg0() {
(gdb) bt
#0  runtime.badmorestackg0 () at C:/Users/kjk/src/go/src/runtime/proc.go:434
#1  0x000000000045f2d5 in runtime.morestack () at C:/Users/kjk/src/go/src/runtime/asm_amd64.s:396
#2  0x0000000000443ff6 in runtime.exceptionhandler (gp=0x8faec0 <runtime.g0>, info=0xd63db0, r=0xd638c0, ~r3=<optimized out>)
    at C:/Users/kjk/src/go/src/runtime/signal_windows.go:74
#3  0x0000000000462d58 in runtime.sigtramp () at C:/Users/kjk/src/go/src/runtime/sys_windows_amd64.s:175
#4  0x0000000000d638c0 in ?? ()

(gdb) c
Continuing.
fatal: morestack on g0

Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
runtime.abort () at C:/Users/kjk/src/go/src/runtime/asm_amd64.s:840
840             JMP     loop
(gdb) bt
#0  runtime.abort () at C:/Users/kjk/src/go/src/runtime/asm_amd64.s:840
#1  0x000000000045f2da in runtime.morestack () at C:/Users/kjk/src/go/src/runtime/asm_amd64.s:397
#2  0x0000000000443ff6 in runtime.exceptionhandler (gp=0x8faec0 <runtime.g0>, info=0xd63db0, r=0xd638c0, ~r3=<optimized out>)
    at C:/Users/kjk/src/go/src/runtime/signal_windows.go:74
#3  0x0000000000462d58 in runtime.sigtramp () at C:/Users/kjk/src/go/src/runtime/sys_windows_amd64.s:175
#4  0x0000000000d638c0 in ?? ()

Adding more nosplit:

Thread 1 hit Breakpoint 1, runtime.badmorestackg0 () at C:/Users/kjk/src/go/src/runtime/proc.go:434
434     func badmorestackg0() {
(gdb) bt
#0  runtime.badmorestackg0 () at C:/Users/kjk/src/go/src/runtime/proc.go:434
#1  0x000000000045f265 in runtime.morestack () at C:/Users/kjk/src/go/src/runtime/asm_amd64.s:396
#2  0x0000000000430d74 in runtime.printlock () at C:/Users/kjk/src/go/src/runtime/print.go:66
#3  0x00000000004440be in runtime.lastcontinuehandler (gp=0x8faec0 <runtime.g0>, info=0xd63db0, r=0xd638c0, ~r3=<optimized out>)
    at C:/Users/kjk/src/go/src/runtime/defs_windows_amd64.go:119
#4  0x0000000000462ce8 in runtime.sigtramp () at C:/Users/kjk/src/go/src/runtime/sys_windows_amd64.s:175
#5  0x0000000000d638c0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) c
Continuing.

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffd22cea073 in ?? ()

This seems consistent with my theory that the root of the problem is that we allow sigsev due to accessing outside of stack.

With more nosplit more is on the stack so we get into loop: sigsev -> call exception handler -> call badmorestack -> sigsev because stack is broken -> call exception handler etec.

@kjk
Copy link
Author

kjk commented Jun 29, 2018

So this might be the same in spirit as #26061 and require https://go-review.googlesource.com/c/go/+/120195 to land first.

@alexbrainman
Copy link
Member

@kjk are you sure you are debugging correct test? The test needs to have TEST_G0_STACK_OVERFLOW environment variable set:

set TEST_G0_STACK_OVERFLOW=1

Unfortunetly, my gdb does not work properly when I do stepi command at critical moment.

c:\Users\Alex\dev\go\src>set TEST_G0_STACK_OVERFLOW=1

c:\Users\Alex\dev\go\src>go test -gcflags=all="-N -l" -c -v -o %TMP%\a.exe runtime && gdb --args %TMP%\a.exe -test.run=TestG0StackOverflow
GNU gdb (GDB) 7.8
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from C:\Users\Alex\AppData\Local\Temp\a.exe...done.
(gdb) handle SIGTRAP stop print pass
SIGTRAP is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop      Print   Pass to program Description
SIGTRAP       Yes       Yes     Yes             Trace/breakpoint trap
(gdb) br 'runtime.exceptionhandler'
Breakpoint 1 at 0x448e1e: file c:/users/alex/dev/go/src/runtime/signal_windows.go, line 88.
(gdb) display /10i $pc
1: x/10i $pc
<error: No registers.>
(gdb) r
Starting program: C:\Users\Alex\AppData\Local\Temp\a.exe "-test.run=TestG0StackOverflow"
[New Thread 4092.0x1c88]
[New Thread 4092.0x1fec]
[New Thread 4092.0x1af8]
[New Thread 4092.0x25f4]
warning: Can not parse XML library list; XML support was disabled at compile time
[New Thread 4092.0x207c]
[New Thread 4092.0xe80]
[New Thread 4092.0xae0]
[New Thread 4092.0x1e34]
fatal: morestack on g0

Program received signal SIGTRAP, Trace/breakpoint trap.
runtime.abort () at c:/users/alex/dev/go/src/runtime/asm_amd64.s:840
840             JMP     loop
1: x/10i $pc
=> 0x466dc2 <runtime.abort+2>:  jmp    0x466dc2 <runtime.abort+2>
   0x466dc4 <runtime.abort+4>:  int3
   0x466dc5 <runtime.abort+5>:  int3
   0x466dc6 <runtime.abort+6>:  int3
   0x466dc7 <runtime.abort+7>:  int3
   0x466dc8 <runtime.abort+8>:  int3
   0x466dc9 <runtime.abort+9>:  int3
   0x466dca <runtime.abort+10>: int3
   0x466dcb <runtime.abort+11>: int3
   0x466dcc <runtime.abort+12>: int3
(gdb) c
Continuing.

Breakpoint 1, runtime.exceptionhandler (gp=<optimized out>, info=<optimized out>,
    r=<optimized out>, ~r3=<optimized out>) at c:/users/alex/dev/go/src/runtime/signal_windows.go:88
88              if !isgoexception(info, r) {
1: x/10i $pc
=> 0x448e1e <runtime.exceptionhandler+14>:      mov    0x30(%rsp),%rax
   0x448e23 <runtime.exceptionhandler+19>:      mov    %rax,(%rsp)
   0x448e27 <runtime.exceptionhandler+23>:      mov    0x38(%rsp),%rcx
   0x448e2c <runtime.exceptionhandler+28>:      mov    %rcx,0x8(%rsp)
   0x448e31 <runtime.exceptionhandler+33>:      callq  0x448d00 <runtime.isgoexception>
   0x448e36 <runtime.exceptionhandler+38>:      lea    0x10(%rsp),%rax
   0x448e3b <runtime.exceptionhandler+43>:      cmpb   $0x0,(%rax)
   0x448e3e <runtime.exceptionhandler+46>:      je     0x448f6d <runtime.exceptionhandler+349>
   0x448e44 <runtime.exceptionhandler+52>:      mov    0x40(%rsp),%rax
   0x448e49 <runtime.exceptionhandler+57>:      test   %al,(%rax)
(gdb) si
fatal: morestack on g0
runtime.abort () at c:/users/alex/dev/go/src/runtime/asm_amd64.s:840
840             JMP     loop
1: x/10i $pc
=> 0x466dc2 <runtime.abort+2>:  jmp    0x466dc2 <runtime.abort+2>
   0x466dc4 <runtime.abort+4>:  int3
   0x466dc5 <runtime.abort+5>:  int3
   0x466dc6 <runtime.abort+6>:  int3
   0x466dc7 <runtime.abort+7>:  int3
   0x466dc8 <runtime.abort+8>:  int3
   0x466dc9 <runtime.abort+9>:  int3
   0x466dca <runtime.abort+10>: int3
   0x466dcb <runtime.abort+11>: int3
   0x466dcc <runtime.abort+12>: int3
(gdb)

@kjk perhaps you will have better luck with your gdb version. Or maybe you can use different debugger here.

@aclements let me know what else I can try here.

Thank you.

Alex

@kjk
Copy link
Author

kjk commented Jul 1, 2018

@alexbrainman I'm pretty sure

You can see the exact commands I ran: https://www.notion.so/Building-Go-from-source-on-Windows-bf2363a561864fa58c08a3c6d2305f97

I did set TEST_G0_STACK_OVERFLOW=1

And more notes from inside gdb: https://gist.github.com/kjk/962a3ac824ef5492921dad4ad2619080

@kjk
Copy link
Author

kjk commented Jul 4, 2018

I retested with https://go-review.googlesource.com/c/go/+/120195 i.e. windows stack check fix.

The root cause is still that stack zeroing causes SIGSEV aka. ACCESS VIOLATION. I now understand the code better, so here's what I found.

First, annotated disassembly of stackOverflow function:

Dump of assembler code for function runtime.stackOverflow:
   0x000000000045bb30 <+0>:     mov    %gs:0x28,%rcx
   0x000000000045bb39 <+9>:     mov    0x0(%rcx),%rcx             : %rcx = g
   0x000000000045bb40 <+16>:    cmp    0x10(%rcx),%rsp            : cmp g.stackguard0, %rsp
   0x000000000045bb44 <+20>:    jbe    0x45bb74 <runtime.stackOverflow+68>
   0x000000000045bb46 <+22>:    sub    $0x20,%rsp                 : %rsp -= 0x20
   0x000000000045bb4a <+26>:    mov    %rbp,0x18(%rsp)            : *(%rsp + 0x18) = %rbp
   0x000000000045bb4f <+31>:    lea    0x18(%rsp),%rbp            : %rbp = %rsp + 0x18
   0x000000000045bb54 <+36>:    xorps  %xmm0,%xmm0                : %xmm0 = 0
=> 0x000000000045bb57 <+39>:    movups %xmm0,0x8(%rsp)            : *(%rsp + 0x8) = %xmm0
   0x000000000045bb5c <+44>:    lea    0x8(%rsp),%rax

At the time of the crash, %rcx is g:

(gdb) x/8 $rcx
0x8fdec0 <runtime.g0>:              0x00d72000      0x00000000      0x00f6fdf0      0x00000000
0x8fded0 <runtime.g0+16>:       0x00d73370      0x00000000      0x00d73370      0x00000000

g is:

type stack struct {
	lo uintptr
	hi uintptr
}
type g struct {
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to libli

So 0x00d72000 is stack.lo, 0x00f6fdf0 is stack.hi.

(gdb) p 0x00f6fdf0 - 0x00d72000
$55 = 2088432

i.e. stack is ~2MB, which checks out.

0x00d73370 is stack.lo + StackGuard which is 4976 == 4096 + 880 which seems to check out.

At the time of the crash:

(gdb) p/x $rsp
$58 = 0xd73fe8

We crash trying to write to address $rsp + 0x8 i.e. 0xd73ff0.

Here's the weird thing: 0xd73ff0 seems to be a legit address within stack bounds:

(gdb) p/x 0xd73ff0 - 0x00d72000
$60 = 0x1ff0
(gdb) p 0x1ff0
$61 = 8176

Furthermore, I can access this memory under gdb:

(gdb) x/8 $rsp + 0x8
0xd73ff0:       0x00000000      0x00000000      0x00000000      0x00000000

If this address was really in-accessible, I would get something like:

(gdb) x/4 0x0
0x0:    Cannot access memory at address 0x0

So here's my wild theory: somehow memory mapping when extending stack is racy. So when the code is just running, it hits Access Violation because the memory is not yet accessible. But when we land in the debugger, the memory mapping is finalized so we can access the memory.

Unfortunately gdb on Windows doesn't support $_siginfo so I can't tell which address triggered AV (under gdb).

@kjk
Copy link
Author

kjk commented Jul 4, 2018

I was able to use WinDBG so I narrowed down the issue.

The exception that happens is actually stack overflow. All other info I got from gdb is correct.

I also narrowed it down to delayed stack expansion.

I changed the repro to use 1048 bytes of stack (instead of 16) to speed up the repro. This changes the stack clearing code a bit.

I managed to figure out a conditional breakpoint that triggers right before crash: bp 0045bb55 "j (rsp == d73dc0) '!vadump; !teb'; 'gc'"

00000000`0045bb30 65488b0c2528000000 mov     rcx, qword ptr gs:[28h]
00000000`0045bb39 488b8900000000     mov     rcx, qword ptr [rcx]
00000000`0045bb40 488d842458fcffff   lea     rax, [rsp-3A8h]
00000000`0045bb48 483b4110           cmp     rax, qword ptr [rcx+10h]
00000000`0045bb4c 764d               jbe     runtime_test+0x5bb9b (00000000`0045bb9b)
00000000`0045bb4e 4881ec28040000     sub     rsp, 428h
00000000`0045bb55 4889ac2420040000   mov     qword ptr [rsp+420h], rbp
<= breakpoint here
00000000`0045bb5d 488dac2420040000   lea     rbp, [rsp+420h]
00000000`0045bb65 48c744240800000000 mov     qword ptr [rsp+8], 0 ss:00000000`00d73dc8=0000000000000000
00000000`0045bb6e 488d7c2410         lea     rdi, [rsp+10h]

When breakpoint happens:

TEB at 00000000002bd000
    ExceptionList:        0000000000000000
    StackBase:            0000000000f70000
    StackLimit:           0000000000d74000

BaseAddress:       0000000000d71000
RegionSize:        0000000000003000
State:             00001000  MEM_COMMIT
Protect:           00000104  PAGE_READWRITE + PAGE_GUARD
Type:              00020000  MEM_PRIVATE

BaseAddress:       0000000000d74000
RegionSize:        00000000001fc000
State:             00001000  MEM_COMMIT
Protect:           00000004  PAGE_READWRITE
Type:              00020000  MEM_PRIVAT

0:000> ? rsp
Evaluate expression: 14106048 = 00000000`00d73dc0
0:000> dd rcx
00000000`008fdec0  00d72000 00000000 00f6fdf0 00000000

Notice that at this point rsp is invalid. Rsp is d73dc0. The stack base in g is d7200 which is legit but stack base in TEB is d74000. Memory mapping shows that area in d71000-d74000 has PAGE_GUARD bit set.

Stepping through next 2 instructions is still fine:

0:000> p
runtime_test+0x5bb5d:
00000000`0045bb5d 488dac2420040000 lea     rbp,[rsp+420h]
0:000>  !teb
TEB at 00000000002bd000
    ExceptionList:        0000000000000000
    StackBase:            0000000000f70000
    StackLimit:           0000000000d74000

0:000> p
runtime_test+0x5bb65:
00000000`0045bb65 48c744240800000000 mov   qword ptr [rsp+8],0 ss:00000000`00d73dc8=????????????????
0:000>  !teb
TEB at 00000000002bd000
    ExceptionList:        0000000000000000
    StackBase:            0000000000f70000
    StackLimit:           0000000000d74000

Next instructions will cause stack overflow by trying to access 00d73dc8:

0:000> p
(2830.3eb0): Stack overflow - code c00000fd (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
runtime_test+0x5bb65:
00000000`0045bb65 48c744240800000000 mov   qword ptr [rsp+8],0 ss:00000000`00d73dc8=0000000000000000

At this point the stack is properly mapped:

0:000>  !teb
TEB at 00000000002bd000
    ExceptionList:        0000000000000000
    StackBase:            0000000000f70000
    StackLimit:           0000000000d71000

!vadump
BaseAddress:       0000000000d71000
RegionSize:        00000000001ff000
State:             00001000  MEM_COMMIT
Protect:           00000004  PAGE_READWRITE
Type:              00020000  MEM_PRIVATE

Continuing will eventually crash with AV somewhere in runtime but without symbols can't tell where.

@kjk
Copy link
Author

kjk commented Jul 4, 2018

Another theory: there is one-off error in the code that maps system stack (which I cannot find) because this happens reliably when accessing what looks like the lowest page of the stack.

When we have the following mapping:

BaseAddress:       0019a000
RegionSize:        00002000
State:             00001000  MEM_COMMIT
Protect:           00000104  PAGE_READWRITE + PAGE_GUARD
Type:              00020000  MEM_PRIVATE

BaseAddress:       0019c000
RegionSize:        00004000
State:             00001000  MEM_COMMIT
Protect:           00000004  PAGE_READWRITE
Type:              00020000  MEM_PRIVATE

I assume the part with PAGE_GUARD is there by default. At some point we detect we need to commit more stack and we call mmap() to extend stack.

It looks like we fail to do that for the lowest page.

Another theory: this might be caused by mismatch of g.stack.lo, which is 0x00d72000 and TEB.StackLimit which is 0d71000 which seems to be 4096 bytes which looks like 1 page.

@kjk
Copy link
Author

kjk commented Jul 4, 2018

BTW: the same problem happens on 386.

@kjk
Copy link
Author

kjk commented Jul 4, 2018

Another observation: memoryBasicInformation.baseAddress and memoryBasicInformation.allocationBase is shifted by 0x1000 from TEB StackBase and StackLimit (i.e. StackLimit is 0xa1000 and allocationBase is 0xa0000).

g0.stack.lo is allocationBase + 0x2000 which explains 0x1000 difference (0x2000 - 0x1000) difference between g0.stack.lo and TEB.StackBase.

TEB is https://en.wikipedia.org/wiki/Win32_Thread_Information_Block

But not sure if that's relevant. I patched minit() with:

mbi.allocationBase += 0x1000
mbi.baseAddress += 0x1000

to make them match and that didn't change anything.

@kjk
Copy link
Author

kjk commented Jul 4, 2018

Changing slack value from 8*1024 to 16*1024 (base := mbi.allocationBase + 8*1024) fixes the stack overflow.

runtime.morestack gets called, prints fatal: morestack on g0 message, does int 3 which invokes exception handler.

However, things then get recursive i.e. code in exception handler will call runtime.morestack etc.

I've added //go:nosplit from @aclements PR, then I've added some more that are called within exception handler (e.g. traceback(), gettraceback(), findfunc()) and then I've hit the limit on those:

 runtime.test
runtime.systemstack: nosplit stack overflow
        748     assumed on entry to runtime.traceback (nosplit)

@kjk
Copy link
Author

kjk commented Jul 4, 2018

And here's a fix:

--- a/src/runtime/os_windows.go                                                                                         
+++ b/src/runtime/os_windows.go                                                                                         
@@ -698,10 +698,12 @@ func minit() {                                                                                    
                print("runtime: VirtualQuery failed; errno=", getlasterror(), "\n")                                     
                throw("VirtualQuery for stack base failed")                                                             
        }                                                                                                               
+                                                                                                                       
        // Add 8K of slop for calling C functions that don't have                                                       
        // stack checks. We shouldn't be anywhere near this bound                                                       
        // anyway.                                                                                                      
-       base := mbi.allocationBase + 8*1024                                                                             
+       base := mbi.allocationBase + 16*1024                                                                            
+                                                                                                                       
        // Sanity check the stack bounds.                                                                               
        g0 := getg()                                                                                                    
        if base > g0.stack.hi || g0.stack.hi-base > 64<<20 {                                                            
diff --git a/src/runtime/proc.go b/src/runtime/proc.go                                                                  
index f82014eb92..288324be30 100644                                                                                     
--- a/src/runtime/proc.go                                                                                               
+++ b/src/runtime/proc.go                                                                                               
@@ -436,6 +436,15 @@ func badmorestackg0() {                                                                            
        write(2, sp.str, int32(sp.len))                                                                                 
 }                                                                                                                      
                                                                                                                        
+var exceptionhandlerMsg = "exceptionhandler\n"                                                                         
+                                                                                                                       
+//go:nosplit                                                                                                           
+//go:nowritebarrierrec                                                                                                 
+func exceptionhandlerPrint() {                                                                                         
+       sp := stringStructOf(&exceptionhandlerMsg)                                                                      
+       write(2, sp.str, int32(sp.len))                                                                                 
+}                                                                                                                      
+                                                                                                                       
 var badmorestackgsignalMsg = "fatal: morestack on gsignal\n"                                                           
                                                                                                                        
 //go:nosplit                                                                                                           
diff --git a/src/runtime/signal_windows.go b/src/runtime/signal_windows.go                                              
index 500b02880d..8101931ffe 100644                                                                                     
--- a/src/runtime/signal_windows.go                                                                                     
+++ b/src/runtime/signal_windows.go                                                                                     
@@ -71,7 +71,14 @@ func isgoexception(info *exceptionrecord, r *context) bool {                                         
 // Called by sigtramp from Windows VEH handler.                                                                        
 // Return value signals whether the exception has been handled (EXCEPTION_CONTINUE_EXECUTION)                          
 // or should be made available to other handlers in the chain (EXCEPTION_CONTINUE_SEARCH).                             
+//go:nosplit                                                                                                           
 func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {                                                
+       exceptionhandlerPrint()                                                                                         
+                                                                                                                       
+       g := getg()                                                                                                     
+       g.stack.lo -= 16 * 1024                                                                                         
+       g.stackguard0 -= 16 * 1024                                                                                      
+                                                                                                                       
        if !isgoexception(info, r) {                                                                                    
                return _EXCEPTION_CONTINUE_SEARCH                                                                       
        }                                                                                                               

Instead of trying to remove implicit calls to morestack by adding //go:nosplit I just made it believe that everything is ok by using the slack we've added in minit().

With this change I get the proper clean exit with callstacks printed:

PS C:\Users\kjk\src\go-dev\src> .\runtime.test.exe "-test.run" TestG0StackOverflow
fatal: morestack on g0
exceptionhandler
fatal error: unexpected signal during runtime execution
[signal 0x80000003 code=0x0 addr=0x0 pc=0x4557b1]

runtime stack:
runtime.throw(0x6684ed, 0x2a)
        C:/Users/kjk/src/go-dev/src/runtime/panic.go:608 +0x54 fp=0xa4cf8 sp=0xa4ce4 pc=0x42a714
runtime.sigpanic()
        C:/Users/kjk/src/go-dev/src/runtime/signal_windows.go:173 +0x14d fp=0xa4d0c sp=0xa4cf8 pc=0x43cd4d
runtime.abort()
        C:/Users/kjk/src/go-dev/src/runtime/asm_386.s:866 +0x1 fp=0xa4d10 sp=0xa4d0c pc=0x4557b1
runtime.morestack()
        C:/Users/kjk/src/go-dev/src/runtime/asm_386.s:442 +0x24 fp=0xa4d14 sp=0xa4d10 pc=0x454334

@aclements
Copy link
Member

Thanks for the great debugging @kjk! You've definitely found the root of the problem: the initial INT3 traps fine and we detect that there's a problem, but since the stack bounds aren't quite right we wind up walking off the edge of the actual stack and the subsequent failures are a different exception, which we don't handle so carefully.

It seems like we should perhaps use the TIB instead of VirtualQuery to get the stack bounds (I'm not sure why I didn't come across that when originally figuring out how to get the stack bounds), and do something to make room for handling the exception (like the slack you added in exceptionhandler).

To answer some of your other questions, which you may have already found the answers to:

Another theory: there is one-off error in the code that maps system stack (which I cannot find) because this happens reliably when accessing what looks like the lowest page of the stack.

Unlike goroutine stacks, this stack is allocated by Windows itself when we create the thread.

I assume the part with PAGE_GUARD is there by default. At some point we detect we need to commit more stack and we call mmap() to extend stack.

The PAGE_GUARD is definitely a problem. Apparently the VirtualQuery call we use to find the bounds of the stack considers that to be part of the mapping, even though we can't actually use that memory. That's what causes the runtime to set up the wrong stack bounds.

There's nothing in the Go runtime that commits more stack or in any way extends a system stack. The OS commits more stack memory as we touch it, but that's transparent to Go.

Instead of trying to remove implicit calls to morestack by adding //go:nosplit I just made it believe that everything is ok by using the slack we've added in minit().

That's not a bad idea, though it needs to be done a little more carefully. :)

I've been trying to figure out what stack the vectored exception handler runs on when it's handling a stack overflow exception without much luck. The closest I've come is https://stackoverflow.com/questions/1897301/vectored-exception-handling-during-stackoverflowexception, but that could mean the OS reserves some small dedicated stack for this purpose, or that it lets the stack grow into the PAGE_GUARD region for this purpose. Either way, it's probably better if we just completely avoid overrunning the stack.

@aclements
Copy link
Member

It seems like we should perhaps use the TIB instead of VirtualQuery to get the stack bounds

Sigh. Apparently the StackLimit field in the TIB gives the limit of the committed stack, not the reserved stack, so that's not useful. There's a later field with the "Address of memory allocated for stack" but that returns the same base address as VirtualQuery.

Apparently the VirtualQuery call we use to find the bounds of the stack considers that to be part of the mapping, even though we can't actually use that memory.

Based on https://docs.microsoft.com/en-us/windows/desktop/Memory/creating-guard-pages, this is bit a different from the guard pages I'm used to. Apparently Windows will let you use that memory, but only after the process has handled a STATUS_GUARD_PAGE_VIOLATION exception.

@gopherbot
Copy link

Change https://golang.org/cl/122515 mentions this issue: runtime: fix abort handling on Windows

@gopherbot
Copy link

Change https://golang.org/cl/122516 mentions this issue: runtime: account for guard zone in Windows stack size

gopherbot pushed a commit that referenced this issue Jul 7, 2018
On Windows, the IP recorded in the breakpoint exception caused by
runtime.abort is actually one byte after the INT3, unlike on UNIX
OSes. Account for this in isgoexception.

It turns out TestAbort was "passing" on Windows anyway because abort
still caused a fatal panic, just not the one we were expecting. This
CL tightens this test to check that the runtime specifically reports a
breakpoint exception.

Fixing this is related to #21382, since we use runtime.abort in
reporting g0 stack overflows, and it's important that we detect this
and not try to handle it.

Change-Id: I66120944d138eb80f839346b157a3759c1019e34
Reviewed-on: https://go-review.googlesource.com/122515
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
gopherbot pushed a commit that referenced this issue Jul 7, 2018
Windows includes an 8K guard in system-allocated thread stacks, which
we currently don't account for when setting the g0 stack bounds. As a
result, if we do overflow the g0 stack bounds, we'll get a
STATUS_GUARD_PAGE_VIOLATION exception, which we're not expecting.

Fix the g0 stack bounds to include a total of 16K of slop to account
for this 8K guard.

Updates #21382.

Change-Id: Ia89b741b1413328e4681a237f5a7ee645531fe16
Reviewed-on: https://go-review.googlesource.com/122516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
@golang golang locked and limited conversation to collaborators Jul 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Projects
None yet
Development

No branches or pull requests

6 participants