-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc #37664
Comments
You can try |
It still crashes the same way. |
Also /cc @aclements @ianlancetaylor from owners. |
@apmckinlay I'm happy to work on debugging this if you can create a code example that you are able to share. Even though it is not specifically related to the defer changes, I also have recently worked with the panic/recover implementation. One change that also went into Go 1.14 relating to panic/recover is https://go-review.googlesource.com/c/go/+/200081 . It should only affect behavior if you did a panic/recover after initiating a Goexit(). I assuming that was not your scenario, but the change could have had some other unintended side effect. |
@danscales Thank you very much! It's open source so no problem sharing. There shouldn't be an exit involved from my code. It appears to be a two step process - a first panic/defer/recover works properly, but then a second one crashes. Running the second one by itself is fine. It's as if the first one leaves something behind that affects the second one. The second can be quite simple, but the first has to be more complex to cause the later crash. Another point that may or may not be relevant is that the panic can be from the same frame as the defer/recover/re-panic (perhaps not typical?) Call stack depth also appears to be relevant. Possibly stack movement is a factor? |
@danscales Here is a set of files that should allow you to recreate the crash. |
@apmckinlay Thanks for setting up the repro case! It actually reproduces on Linux, though I had to fix the sys_nix.go file (syscall.Sysctl no longer exists on Linux). The bug is actually related to the new open-coded defers and their interaction with panic/recover. Confusingly, 'go build' doesn't recompile all the sub-packages with the -N option unless you do:
The bug goes away if you do that, since the problem is a defer in interp.go. As a more targeted work-around, you can put a 1-iteration for loop around the defer statement in interp() (and no -N option needed) and the problem goes away:
The bug does require several sequences of panics and recovers with a further re-panic, before doing the final recover. I think that I have the actual fix in the Go runtime, which is fairly simple, but I'm still working to verify it is the full fix, do more testing, etc. |
@danscales That's great, thanks! I'll add the work around so I can move to 1.14 Do you think the fix will get cherrypicked to 1.14.1 ? The build issue crossed my mind, but I'm pretty sure I did go build -a -v and saw the package listed so I thought that covered it. Maybe cleaning the build cache would have been safer? PS. Just listened to the GoTime podcast you were on. Including, coincidentally, the challenge of testing these kinds of changes. |
Currently, it seems like it could be a good option for cherrypicking for 1.14.1, but will have to confirm the fix and check with release folks, etc. I'll update as I learn more. |
@danscales Once you know more, if you believe this meets the criteria for backporting documented at https://golang.org/wiki/MinorReleases, feel free to follow the process described there to open backport issues. Thanks. |
I added the workaround to the defer's that re-panic. Note: compiling with -gcflags="all=-N" does eliminate the problems |
I would have expected that you would only need the workaround for every defer that has a recover, then possible re-panic, or is likely to be on the stack when such panic-recover-re-panics are happening. I don't expect you would have any trouble with the defer in the go Runtime (really just in runtime.main, I think, at the very start of the program). However, it may be a little hard to catch all such defers. It would be helpful if you happen to try a bit more to apply the workaround to the various defers that seem to be related to the panic-recover-repanic loop, and let me know through the bug if you are successful (and how many defers you tried fixing, whether successful or not). I have the fix and a sample test that I'm about to put out for review. |
Change https://golang.org/cl/222420 mentions this issue: |
@danscales I made another pass over the code and added the workaround to all the defers with recover that I thought might end up on the call stack. (now 14 places) It's tricky because it's a language implementation so the behavior is very dynamic and it's hard to statically determine what might be nested. With the additional workarounds, everything seems to be working fine, although I haven't done extensive testing. I will roll it out to some beta users and see what happens. |
@gopherbot please consider this for backport to 1.14, it's a regression (and the fix is quite simple). |
Backport issue(s) opened: #37782 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Change https://golang.org/cl/222818 mentions this issue: |
…ver/re-panics and open-coded defers In the open-code defer implementation, we add defer struct entries to the defer chain on-the-fly at panic time to represent stack frames that contain open-coded defers. This allows us to process non-open-coded and open-coded defers in the correct order. Also, we need somewhere to be able to store the 'started' state of open-coded defers. However, if a recover succeeds, defers will now be processed inline again (unless another panic happens). Any defer entry representing a frame with open-coded defers will become stale once we run the corresponding defers inline and exit the associated stack frame. So, we need to remove all entries for open-coded defers at recover time. The current code was only removing the top-most open-coded defer from the defer chain during recovery. However, with recursive functions that do repeated panic-recover-repanic, multiple stale entries can accumulate on the chain. So, we just adjust the loop to process the entire chain. Since this is at panic/recover case, it is fine to scan through the entire chain (which should usually have few elements in it, since most defers are open-coded). The added test fails with a SEGV without the fix, because it tries to run a stale open-code defer entry (and the stack has changed). Updates #37664. Fixes #37782. Change-Id: I8e3da5d610b5e607411451b66881dea887f7484d Reviewed-on: https://go-review.googlesource.com/c/go/+/222420 Run-TryBot: Dan Scales <danscales@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> (cherry picked from commit fae87a2) Reviewed-on: https://go-review.googlesource.com/c/go/+/222818 Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
@danscales I don't know if it's related, and currently have no way to reproduce it, but a user had a crash: "fatal error: found bad pointer in Go heap". The program uses unsafe and cgo so it's entirely possible it's unrelated. But one of the go routine stack traces looked similar, with a panic inside a panic, and open defers. The object with the bad pointer is at 0xc0000e6548 and runOpenDeferFrame has an argument of 0xc0000e6500 This is with Go 1.15.6 amd64 windows link to the entire crash output
|
@apmckinlay As far as I can tell, the panic inside the panic is happening in user code. That is, we have a panic in interp, and then the deferred function interp.func6 (closure in interp) run by the open defer code is causing another panic (maybe because it is accessing something that is nil related to the cause of the original panic?). So, unless we have further information, it looks like your user's crash is unrelated to this bug. Also, the other "bad pointer in Go heap" panic is unrelated to defers, and is possibly because of a bad use of unsafe or cgo. It just means that the garbage collector ran into an invalid pointer (an address in the range of the Go heap, but doesn't refer to a valid object) while doing GC marking (scanning). |
What version of Go are you using (
go version
)?Also happens on darwin
Works with 1.13.8
Does this issue reproduce with the latest release?
Yes, and also with tip as of Mar 4, 2020
What operating system and processor architecture are you using (
go env
)?Happens on Windows and Mac OS X (darwin)
I haven't tried Linux
This is showing 1.13.8 since it's my "main" installation. I am testing with go1.14 and gotipgo env
OutputWhat did you do?
I built my program with 1.14 and now it crashes. Works fine with 1.13
There is no cgo or assembler involved.
(There is some cgo in the project, but I'm building without it.)
There is minor (read-only) use of unsafe, but it does not appear to be involved.
It is running a single go routine, no concurrency (other than internal Go stuff).
Still crashes with GOGC=off
It is consistent and repeatable - running the same thing crashes the same way every time.
Crashes the same way on darwin, so presumably it's a cross platform issue.
However it is "touchy". Making slight changes to what I'm running can eliminate or change the crash.
It appears to be related to panic/defer/recover
Possibly something to do with re-panic the result of recover
The function doing the panic/defer/recover is recursive if that makes any difference.
Possibly something to do with the defer changes in 1.14
This code makes heavier than normal use of panic/defer/recover which may be why I'm running into this and other people are not.
Unfortunately, it's a large complex system and so far I have not been able to come up with a small Go example that reproduces it. (I could provide the necessary files and configuration if desired.)
I searched the issues but couldn't find anything that looked related.
I assume that nothing I do in normal single-threaded Go code should cause this?
I would welcome any suggestions on how to debug this issue.
e.g. Is there any way to control the compilation of defer handling?
What did you expect to see?
no crash
What did you see instead?
crash
example crash output
The text was updated successfully, but these errors were encountered: