Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: failure in TestRaiseException on windows-amd64-2012 #49681

Open
bcmills opened this issue Nov 19, 2021 · 43 comments
Open

runtime: failure in TestRaiseException on windows-amd64-2012 #49681

bcmills opened this issue Nov 19, 2021 · 43 comments
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 19, 2021

--- FAIL: TestRaiseException (0.02s)
    crash_test.go:106: C:\Users\gopher\AppData\Local\Temp\1\go-build1560626842\testprog.exe RaiseException exit status: exit status 2989
    syscall_windows_test.go:636: No stack trace: 
FAIL
FAIL	runtime	68.703s

greplogs --dashboard -md -l -e 'FAIL: TestRaiseException'

2021-11-18T02:19:50-f1cc529/windows-amd64-2012

The test dates from 2014. I can find only one such faillure in the logs, and it is very recent — marking as release-blocker until we can determine whether this is a regression (CC @bufflig).

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker labels Nov 19, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 19, 2021
@toothrot toothrot added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Nov 19, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Nov 23, 2021

One more this week. Looking like a regression of some sort.

greplogs --dashboard -md -l -e 'FAIL: TestRaiseException' --since=2021-11-19

2021-11-22T22:36:03-7456b94/windows-amd64-2012

@bufflig
Copy link
Contributor

bufflig commented Nov 24, 2021

Hmmm, this might have been flaky for a while. I didn't realize it then, as I thought it was related to the crashdump change and it "went away" after a sync, but it was probably a flakiness I saw. My guess is that the recent regression is caused by changed timing due to the scavenger changes, but that's just a guess. The flakiness should be fixed anyway, I'll put it on the list.

@bufflig bufflig self-assigned this Nov 24, 2021
@alexbrainman
Copy link
Member

crash_test.go:106: C:\Users\gopher\AppData\Local\Temp\1\go-build1560626842\testprog.exe RaiseException exit status: exit status 2989
syscall_windows_test.go:636: No stack trace:

I spent some time staring at this. 2989 is 0xbad from

proc.Call(0xbad, EXCEPTION_NONCONTINUABLE, 0, 0)

I also run runtime.TestRaiseException and it would not fail on me. I also added t.Logf in the test to print stacktrace, and I can see that when test succeeds, the child process always returns 2.

Go runtime calls runtime.exit(2) from our Windows exception handler.

I suspect in this failure runtime.exit(2) call was not reached. Either runtime crashed in exception handler, or Windows did not call our exception handler for some reason.

Alex.

@bufflig
Copy link
Contributor

bufflig commented Nov 29, 2021

I was suspecting that something was spooky here
The exception technically comes from non-go code, so I was wondering if that's simply the reason we don't get a stack trace and - for some reason I don't understand - the test for non-go-code did not come to the same conclusion before? But I could completely misunderstand it too. When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.

@alexbrainman
Copy link
Member

I was suspecting that something was spooky here
The exception technically comes from non-go code, so I was wondering if that's simply the reason we don't get a stack trace and - for some reason I don't understand - the test for non-go-code did not come to the same conclusion before? But I could completely misunderstand it too.

I don't understand what you are saying. runtime.isgoexception should always return false in our scenario. Because the function only returns true if exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO and other consts in that case statement. But we pass 0xbad value to RaiseException, and 0xbad does not match any of the consts in the case.

When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.

I guess, your testing confirms that we do have new problem here. How do I reproduce your "flaky" testing?

Thank you.

Alex

@bufflig
Copy link
Contributor

bufflig commented Dec 2, 2021

I was suspecting that something was spooky here
The exception technically comes from non-go code, so I was wondering if that's simply the reason we don't get a stack trace and - for some reason I don't understand - the test for non-go-code did not come to the same conclusion before? But I could completely misunderstand it too.

I don't understand what you are saying. runtime.isgoexception should always return false in our scenario. Because the function only returns true if exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO and other consts in that case statement. But we pass 0xbad value to RaiseException, and 0xbad does not match any of the consts in the case.

You're totally right and I'm clearly wrong. I think you didn't understand what I was saying because what I was saying made no sense :) I was specifically throwing EXCEPTION_ACCESS_VIOLATION in my test program, why my theory was potentially a little less out there back then...

When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.

I guess, your testing confirms that we do have new problem here. How do I reproduce your "flaky" testing?

I just used a small program calling RaiseException through cgo:

package main

// #include <windows.h>
// #include <stdio.h>
// #include <stdlib.h>
// #include <errhandlingapi.h>
//
// static void raise() {
//   RaiseException(EXCEPTION_ACCESS_VIOLATION, 0,0,0);
// }
import "C"

import (
	"time"
)

func main() {
	time.Sleep(time.Second)
	C.raise()
	return
}

It became much less probable to succeed in dumping a stack trace during 1.18 development, but as you can see, it already needed a sleep to work to begin with, so it wasn't water-proof from the start.

However, after trying to reproduce now, I think it's a red herring, as after @randall77 's https://go-review.googlesource.com/c/go/+/364174, the program above consistently works, even without the sleep. In fact, after that change, none of my sometimes flaky tests programs trying to create exceptions in different ways ever fail to stack dump. That makes no sense to me, looking at the change itself. And also, it's completely opposite to what shows up in the builders, it rather started happening in the builders after that change, or a few days after.

So now I have no idea, not even a bad one.

Thank you.

Alex

Thanks!

Patrik

@alexbrainman
Copy link
Member

I just used a small program calling RaiseException through cgo:

runtime.TestRaiseException does not uses cgo. I expect cgo and no-cgo runtimes are very different from each other. If your small cgo program works fine (dumps stack trace every time), that does not prove anything.

Alex

@bufflig
Copy link
Contributor

bufflig commented Dec 7, 2021

Yea, probably not. I'll see if I can make it happen on a gomote with the testcase itself, to see if there's any way to reproduce.

@bufflig
Copy link
Contributor

bufflig commented Dec 8, 2021

I managed to get some repros with some logging. Not conclusive in any way, but, as you mostly had concluded already, it does reach exceptionhandler (in signal_windows.go), which does return _EXCEPTION_CONTINUE_SEARCH, but then neither firstcontinuehandler nor lastcontinuehandler are called, as far as I can see they are not crashing, they are not called at all.

@bufflig
Copy link
Contributor

bufflig commented Dec 8, 2021

I don't know if it's really relevant, but it appears the change that I mentioned above, 364174, also is makes this error much more probable. I've not been able to reproduce it once before that change, but by spamming the gomotes, I can reproduce within a couple of minutes after that change. I still have no clue why that change would affect exception handling, of course. It might just change the timing or something.

@cherrymui
Copy link
Member

It seems that in a successful run, exceptionhandler is called first, returning as isgoexception is false, then firstcontinuehandler is called, then lastcontinuehandler is called which prints the stack trace. In a failed run, though, exceptionhandler is called and returned as usual, but firstcontinuehandler and lastcontinuehandler are not called. Instead, it exits with error code 2989 (0xbad, instead of the usual 2). I don't know why the continue handlers are not called some time.

@bufflig
Copy link
Contributor

bufflig commented Dec 8, 2021

Yes. I don't know if the trampoline functions in asm are called though. There is some code there beyond my comprehension in sys_windows_amd64:sigtramp (no surprise there :)), also using tls, which was very vaguely related to the change 364174. It does do some stuff to make sure it's running on g0 stack, I don't know if that could be where the flakiness comes from?

@cherrymui
Copy link
Member

I think sigtramp is also not called (for firstcontinuehandler and lastcontinuehandler). I added some code to print in sigtramp and I didn't see it print.

Yeah, I also cannot reproduce before CL 364174 or at tip just reverting CL 364174. Weird. That code runs very early at startup. I wonder how it could cause nondeterministic failure.

It also seems to fail more likely if I set GOGC=1. I tried disabling async preemption but it doesn't seem to have any effect.

@bufflig
Copy link
Contributor

bufflig commented Dec 8, 2021

Ah, interesting. I tried (somewhat arbitrarily) to re-register the exception handlers before returning _EXCEPTION_CONTINUE_SERACH from exceptionhandler, that did not affect what happened at all (not sure what I expected though). It almost seems like the return value seen by windows exception handling mechanism is not seeing _EXCEPTION_CONTINUE_SEARCH, but something else?

@cherrymui
Copy link
Member

It seems if I increase the 5*8 to 5*8+7 at https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=152 (it was 4*8+7 before that CL), I can no longer reproduce the failure. Reading the code I don't see where it clobbers anything on the stack, though. Maybe there was something right below SP at the entry of rt0_go and should not be clobbered?

@bufflig
Copy link
Contributor

bufflig commented Dec 8, 2021

Heh, there's something going on there, definitely. If I make that specific change, the cgo program that started emitting stack dumps after 364174 stopped dumping stack dumps again. The change on that line clearly affects exception handlers, but I have no idea in which way or why. Can it just be some arbitrary memory corruption going on? The stack being slightly different just makes one thing more probable than another, but in fact, something else is broken?

I'm completely confused by the switching between ABIs in sigtramp, and I don't really get what ABI is used to call the Go code. Is it called using ABI0? What makes it not being called using ABIInternal? Wrappers? I might need an ABI for dummies lesson :)

@cherrymui
Copy link
Member

firstcontinuetramp is called using the system ABI. firstcontinuetramp passes the PC of the ABI0 symbol (as there is no <ABIInternal> marker) of firstcontinuehandler to sigtramp, so the ABI0 function is called. That function is actually a wrapper, which calls the actual ABIInternal Go function firstcontinuehandler, and returns its value back in ABI0.

@cherrymui
Copy link
Member

I added some print and it doesn't seem we clobbered anything. For both success run and failed run the word right below SP at entry of rt0_go is 0, and in both case we stored the same value there (this store https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=155), and the first a few words above that are never changed and the same for both cases. So it is not that we clobbered something on that stack. Still weird.

@bufflig
Copy link
Contributor

bufflig commented Dec 9, 2021

Thanks for checking and explaining the ABI and the wrapper! At least now I understand sigtramp, which seems to do what's expected. This is indeed a weird one...

@cherrymui
Copy link
Member

I also tried to change sigtramp to always return 0 (which is not always correct but is all what we need for this test) and it can still fail. So it is not the return value.

@bufflig
Copy link
Contributor

bufflig commented Dec 10, 2021

Yea, I printed out AX at the end of sigtramp and it showed the same thing. So I can't see anyway we're returning the wrong value. Unfortunately...

I don't know what's up here. One thing is that the ones that are not run are "ContinueHandlers" and the one that's run is an "ExceptionHandlers". They are just run in sequence according to the docs I could find, but there might be some criteria for not running the continuehandlers that we hit. I couldn't find anything and the docs are sparse. MS docs I found basically say exceptionhandlers run, then if the last one returns EXCEPTION_CONTINUE_SEARCH, continuehandlers are run. But maybe there's some other criteria, like the stack has to look in a certain way (which would be kind of strange as they allegedly get their own stack frame). Maybe someone with access to the source could shed some light? @jstarks - do you by any chance have time to look at if there's some criteria that will skip continuehandlers?

Also I've had problems reproducing with the latest dev version lately, there's absolutely zero stuff that should affect this checked in lately as I can see. I will keep digging.

@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Jan 12, 2022

even though I think it's something that should be fixed, I'm not sure it's a release blocker.

I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky on the windows-amd64-2012 builder to reduce noise on the dashboard in the meantime?

Better still, if they test only happened to work before is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)

@bufflig
Copy link
Contributor

bufflig commented Jan 13, 2022

even though I think it's something that should be fixed, I'm not sure it's a release blocker.

I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky on the windows-amd64-2012 builder to reduce noise on the dashboard in the meantime?

Sounds reasonable.

Better still, if they test only happened to work before is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)

I can't come up with a way to increase the probability either, unfortunately. Maybe someone else has some idea?

@gopherbot
Copy link

Change https://golang.org/cl/378254 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012

@hanchaoqun
Copy link
Contributor

hanchaoqun commented Jan 13, 2022

@bufflig, @bcmills , yes, i agree this is not a new problem, I think it may be caused by the fact that the specific version of windows is not friendly enough to the go stack. So I think it is not stable to throw the exception of c in lastcontinuehandler, maybe we should throw the exception and print the stack immediately when we find the exception of c in the exceptionhandler, no longer rely on VCH, do you agree?
This problem can be stably reproduced by manually adjusting the stack (position of rip and rsp)

func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
	if !isgoexception(info, r) {
		sp := unsafe.Pointer(uintptr(r.rsp))
		for true {
			vsp := uint64(*((*uintptr)(sp)))
			if  vsp < getStackBase() && vsp >= getStackLimit() {
				r.rip = *((*uint64)(unsafe.Pointer(uintptr(sp) - 8)))
				r.rsp = *((*uint64)(unsafe.Pointer(sp)))
			}
			delta := uintptr(sys.StackAlign)
			sp = add(sp, delta)
			if uint64(uintptr(sp)) > uint64(uintptr(add(unsafe.Pointer(r.sp()), 8*50))) ||
				uint64(uintptr(sp)) >= getStackBase() {
				sp = add(sp, -delta)
				break
			}
		}
		return _EXCEPTION_CONTINUE_SEARCH
	}
        ...
D:\gobuild\golang\go1.18\src\runtime>go test -run=^TestRaiseException$ -v
=== RUN   TestRaiseException
    crash_test.go:124: running go build -o C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe
    testenv.go:361: [C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe RaiseException] exit status: exit status 2989
    syscall_windows_test.go:636: No stack trace:
--- FAIL: TestRaiseException (0.94s)
FAIL
exit status 1
FAIL    runtime 0.991s

gopherbot pushed a commit that referenced this issue Jan 13, 2022
This is in relation to #49681

Change-Id: I32ad8b506cf8fb0a94b15c3cc8b1eaf5af728c59
Reviewed-on: https://go-review.googlesource.com/c/go/+/378254
Run-TryBot: Patrik Nyblom <pnyb@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Trust: Patrik Nyblom <pnyb@google.com>
@bufflig
Copy link
Contributor

bufflig commented Jan 14, 2022

@hanchaoqun Yea, I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exceptionhandlers and continuehandlers, but I assume there's a reason. Anyone has any knowledge of the history behind that?

@bcmills - I did add skipping the test case as flaky, should we remove the release blocker tag?

@bcmills bcmills modified the milestones: Go1.18, Backlog Jan 14, 2022
@alexbrainman
Copy link
Member

@bufflig

... I was wondering if dumping the stack trace directly when !isgoexception was possible. ​I don't really know the reason we're using both exceptionhandlers and continuehandlers, but I assume there's a reason. Anyone has any knowledge of the history behind that?

Do you mean printing stack trace and exiting in exceptionhandler function? I don't see how that is possible.

exceptionhandler receive ALL process exceptions. Some of these exceptions are raised and handled in external to Go code.

For example, DLL raises some exception, and then DLL exception handling code handles this exception and stops passing that exception further up the exception chain. Currently exceptions like that are ignored by Go runtime. Do you propose we crash the Go process instead?

I hope that explains well enough.

Alex

@bufflig
Copy link
Contributor

bufflig commented Jan 24, 2022

Ah, right - I see how my idea/thought would not work. Thanks!

jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
This is in relation to golang#49681

Change-Id: I32ad8b506cf8fb0a94b15c3cc8b1eaf5af728c59
Reviewed-on: https://go-review.googlesource.com/c/go/+/378254
Run-TryBot: Patrik Nyblom <pnyb@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Trust: Patrik Nyblom <pnyb@google.com>
@bufflig bufflig removed their assignment Jun 28, 2022
@gopherbot
Copy link

Change https://go.dev/cl/444616 mentions this issue: runtime: skip TestVectoredHandlerExceptionInNonGoThread

gopherbot pushed a commit that referenced this issue Oct 21, 2022
windows-amd-2012 builder seems to have some problems handling
exception thrown in external C code which is affecting
TestVectoredHandlerExceptionInNonGoThread.
The issue is known and discussed in #49681.

This Cl skips the offending test on windows-amd-2012.

Change-Id: I7ca4353c9e531f0d75ac6a8dbd809acfa1f15bf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/444616
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
windows-amd-2012 builder seems to have some problems handling
exception thrown in external C code which is affecting
TestVectoredHandlerExceptionInNonGoThread.
The issue is known and discussed in golang#49681.

This Cl skips the offending test on windows-amd-2012.

Change-Id: I7ca4353c9e531f0d75ac6a8dbd809acfa1f15bf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/444616
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/451294 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*

gopherbot pushed a commit that referenced this issue Nov 21, 2022
Modify skip rule for TestRaiseException to trigger on both the base
builder (windows-amd64-2012) and the newcc canary builder
(windows-amd64-2012-newcc).

Updates #49681.

Change-Id: I132f9ddd102666b68ad04cc661fdcc2cd841051a
Reviewed-on: https://go-review.googlesource.com/c/go/+/451294
Auto-Submit: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/452436 mentions this issue: testing: skip TestVectoredHandlerExceptionInNonGoThread on windows-amd64-2012-*

gopherbot pushed a commit that referenced this issue Nov 21, 2022
…d64-2012-*

Modify skip rule for TestVectoredHandlerExceptionInNonGoThread to
trigger on both the base builder (windows-amd64-2012) and the newcc
canary builder (windows-amd64-2012-newcc).

Updates #49681.

Change-Id: I58109fc2e861b943cb66be0feec348671be84ab3
Reviewed-on: https://go-review.googlesource.com/c/go/+/452436
Run-TryBot: Than McIntosh <thanm@google.com>
Auto-Submit: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
felixge pushed a commit to felixge/go that referenced this issue Nov 21, 2022
Modify skip rule for TestRaiseException to trigger on both the base
builder (windows-amd64-2012) and the newcc canary builder
(windows-amd64-2012-newcc).

Updates golang#49681.

Change-Id: I132f9ddd102666b68ad04cc661fdcc2cd841051a
Reviewed-on: https://go-review.googlesource.com/c/go/+/451294
Auto-Submit: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
felixge pushed a commit to felixge/go that referenced this issue Nov 21, 2022
…d64-2012-*

Modify skip rule for TestVectoredHandlerExceptionInNonGoThread to
trigger on both the base builder (windows-amd64-2012) and the newcc
canary builder (windows-amd64-2012-newcc).

Updates golang#49681.

Change-Id: I58109fc2e861b943cb66be0feec348671be84ab3
Reviewed-on: https://go-review.googlesource.com/c/go/+/452436
Run-TryBot: Than McIntosh <thanm@google.com>
Auto-Submit: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@thanm
Copy link
Contributor

thanm commented Nov 29, 2022

FYI, this looks like it is happening on the 1.19 release branch as well. Should we back-port 452436?

@heschi
Copy link
Contributor

heschi commented Nov 29, 2022

Yes, absolutely, we won't be able to cut a release otherwise.

@thanm
Copy link
Contributor

thanm commented Nov 29, 2022

@gopherbot please consider for 1.19 backport

@gopherbot
Copy link

Backport issue(s) opened: #56983 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link

Change https://go.dev/cl/454035 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*

@thanm
Copy link
Contributor

thanm commented Nov 29, 2022

OK, backport CL is ready to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Projects
None yet
Development

No branches or pull requests

9 participants