Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: libunwind is unable to unwind CGo to Go's stack #40044

Open
steeve opened this issue Jul 4, 2020 · 22 comments
Open

runtime: libunwind is unable to unwind CGo to Go's stack #40044

steeve opened this issue Jul 4, 2020 · 22 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@steeve
Copy link
Contributor

steeve commented Jul 4, 2020

What version of Go are you using (go version)?

master as of the build

$ go version
devel +dd150176c3 Fri Jul 3 03:31:29 2020 +0000

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/steeve/Library/Caches/go-build"
GOENV="/Users/steeve/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/steeve/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/steeve/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/steeve/code/github.com/znly/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/steeve/code/github.com/znly/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/steeve/code/github.com/znly/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/bs/51dlb_nn5k35xq9qfsxv9wc00000gr/T/go-build842228435=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Following @cherrymui's comment on #39524, I figured I tried to check why lots of our backtraces on iOS stop at runtime.asmcgocall.
Since I wanted to reproduce it on my computer and lldb manges to properly backtrace, I figured I'd give libunwind a try, since this is was iOS uses when a program crashes.

Unfortunately libunwind didn't manage to walk the stack past CGo generated _Cfunc_ functions.

Given this program:

package main

/*
#cgo CFLAGS: -O0

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
	unw_cursor_t cursor;
	unw_context_t context;

	// Initialize cursor to current frame for local unwinding.
	unw_getcontext(&context);
	unw_init_local(&cursor, &context);

	// Unwind frames one by one, going up the frame stack.
	while (unw_step(&cursor) > 0) {
		unw_word_t offset, pc;
		unw_get_reg(&cursor, UNW_REG_IP, &pc);
		if (pc == 0) {
			break;
		}
	    printf("0x%llx:", pc);
		char sym[256];
		if (unw_get_proc_name(&cursor, sym, sizeof(sym), &offset) == 0) {
			printf(" (%s+0x%llx)\n", sym, offset);
		} else {
			printf(" -- error: unable to obtain symbol name for this frame\n");
		}
  	}
}

void two() {
	printf("two\n");
	backtrace();
}

void one() {
	printf("one\n");
	two();
}
*/
import "C"

//go:noinline
func goone() {
	C.one()
}

func main() {
	goone()
}

It prints:

one1
two2
0x40617fe: (two+0x1e)
0x406182e: (one+0x1e)
0x406168b: (_cgo_7c45d1c2feef_Cfunc_one+0x1b)

I tried doing Go(1) -> C(1) -> Go(2) -> C(2) and backtrace, and it only unwinds C(2).

Also, I tried to make set asmcgocall to have a 16 bytes stack, hoping that the generated frame pointer would help, but it didn't.

What did you expect to see?

The complete backtrace.

What did you see instead?

A backtrace for C functions only.

@cherrymui
Copy link
Member

This is different from #39524 . We switch stacks at Go/C boundaries. Go code runs on goroutine stacks (typically small), whereas C code runs on system stacks (typically large). Since they are not on the same stack, I would not expect any stack unwinding tool to work. Not sure if there is anything we could do.

Maybe we could use the frame pointer to "fake" it? Not sure this is a good idea...

@steeve
Copy link
Contributor Author

steeve commented Jul 4, 2020

Indeed. Thinking about it however, it doesn't feel that hard to do. Either by locally modifying asmcgocall, or, in a more ambitious way, via go:systemstack. Do you think that could work, at least in theory?

@steeve
Copy link
Contributor Author

steeve commented Jul 4, 2020

Also, weirdly enough, lldb is able to do it, without dwarf.

@steeve
Copy link
Contributor Author

steeve commented Jul 5, 2020

I am also realizing this is very different from #39524 indeed. But in some way, since runtime.libCCall uses asmcgocall, it could also allow for unwinding of gourtines blocked in things like pthread functions.

@ianlancetaylor
Copy link
Contributor

If you want to unwind from C++ back into Go you may want to try github.com/ianlancetaylor/cgosymbolizer. Although that will only help from the Go side, not the C++ side.

In principle we could hand write unwind information for asmcgocall. The unwind information is basically DWARF, and it should be powerful enough to express what asmcgocall does.

@steeve
Copy link
Contributor Author

steeve commented Jul 5, 2020

@ianlancetaylor thank you. The issue, on the iOS side, is that unwinding is done locally, on the device (presumably with libunwind), without DWARF. DWARF is only added later to symbolicate the crashes.

That said, it could be useful for Android (which uses breakpad with minidumps)

@cherrymui I tried that forsaken piece of code to, in order to call the backtrace method without cgo, and alas, the unwinding still stops before it somehow. This is based on the rustgo article:

TEXT ·backtracetrampoline(SB),0,$2048
    MOVQ SP, BX        // Save SP in a callee-saved registry
    ADDQ $2048, SP     // Rollback SP to reuse this function's frame
    ANDQ $~15, SP      // Align the stack to 16-bytes
    CALL backtrace(SB)
    MOVQ BX, SP        // Restore SP
    RET

@cherrymui
Copy link
Member

@steeve Sorry, I'm not sure exactly what you're planning to do, and why go:systemstack is relevant here. go:systemstack only enforces the marked function must run on the system stack (i.e. cannot run on a goroutine stack). It doesn't change how stack switches work.

Also, on what architecture? You mentioned iOS (presumably ARM64), but also AMD64 in your go env.

That said, does CL https://go-review.googlesource.com/c/go/+/241080 makes any difference (on ARM64)? Thanks.

@steeve
Copy link
Contributor Author

steeve commented Jul 6, 2020

@cherrymui Thank you for the CL, I wasn't hoping as much. Will definitely try and let you know.

My ultimate target is indeed iOS (and Android, to an extent).
Before trying to fix it on iOS though, I figure it'd be easier to reproduce on my computer (darwin/amd64), and since iOS uses libunwind, try to investigate it myself.
My other experiment in which I tried to call the method directly in the Go stack, is to try and narrow down if the frame pointer was trashed because of the stack switch itself.

@gopherbot
Copy link

Change https://golang.org/cl/241158 mentions this issue: runtime: adjust frame pointer on stack copy on ARM64

@cagedmantis cagedmantis added this to the Backlog milestone Jul 6, 2020
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 6, 2020
@ianlancetaylor
Copy link
Contributor

@steeve libunwind unwinds the stack using the unwind information, which is not DWARF but is approximately the same format as a subset of DWARF. That's what I was referring to when I suggested that we could write unwind information for asmcgocall. (You can see the horrible details at https://www.airs.com/blog/archives/460).

@steeve
Copy link
Contributor Author

steeve commented Jul 9, 2020

I just tried your CL @cherrymui on a real device, and unfortunately, when I pause inside XCode's, I only see the stack up to asmcgocall.

In my case I did put a time.Sleep() and the backtrace in XCode's lldb was:

    frame #0: 0x00000001889f2bc0 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001889151e4 libsystem_pthread.dylib`_pthread_cond_wait + 680
    frame #2: 0x000000010785d5e8 Zenly`runtime.pthread_cond_wait_trampoline + 24
    frame #3: 0x000000010785c4fc Zenly`runtime.asmcgocall + 204
    frame #4: 0x000000010785c4fc Zenly`runtime.asmcgocall + 204
  * frame #5: 0x000000010785c4fc Zenly`runtime.asmcgocall + 204

Note that on amd64, lldb manages to properly unwind.

@qmuntal
Copy link
Contributor

qmuntal commented Dec 23, 2022

The lack of frame pointer in asmcgocall is also breaking stack unwinding on WinDbg (using the prototype from #57302).

I've tried with gdb, and it's also broken:

Thread 1 hit Breakpoint 1, runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:835
835             MOVQ    (g_sched+gobuf_sp)(SI), SP
(gdb) backtrace
#0  runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:835
#1  0x0000000000402f49 in runtime.cgocall (fn=0x45a320 <runtime.asmstdcall>, arg=0x4daae0, ~r0=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/cgocall.go:167
#2  0x0000000000455314 in syscall.loadsystemlibrary (filename=0xc00010c000, absoluteFilepath=<optimized out>, handle=<optimized out>, err=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/syscall_windows.go:443
#3  0x000000000045b39c in syscall.loadsystemlibrary (filename=0x45a320 <runtime.asmstdcall>, absoluteFilepath=0x4daae0, handle=<optimized out>, err=<optimized out>) at <autogenerated>:1
... omitted
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) si
runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:840
840             SUBQ    $64, SP
(gdb) backtrace
#0  runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:840
#1  0x00000000007afee8 in ?? ()

Also, I tried to make set asmcgocall to have a 16 bytes stack, hoping that the generated frame pointer would help, but it didn't.

@steeve curiously, WinDbg and gdb can unwind the stack if asmcgocall stack is incremented to 16 bytes, which makes the assembler to introduce a frame pointer:

#0  runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:811
#1  0x000000000042d3a5 in runtime.stdcall (fn=<optimized out>, ~r0=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:1074
#2  0x000000000042d4bc in runtime.stdcall1 (fn=0x7ffca80d95d0 <LoadLibraryA>, a0=<optimized out>, ~r0=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:1095
#3  0x000000000042ab85 in runtime.loadOptionalSyscalls () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:249
#4  0x000000000042b9b5 in runtime.osinit () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:551
#5  0x0000000000456637 in runtime.rt0_go () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:348
#6  0x00000000a80d26bd in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) si
812             MOVQ    arg+8(FP), BX
(gdb) backtrace
#0  runtime.asmcgocall () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:812
#1  0x000000000042d3a5 in runtime.stdcall (fn=<optimized out>, ~r0=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:1074
#2  0x000000000042d4bc in runtime.stdcall1 (fn=0x7ffca80d95d0 <LoadLibraryA>, a0=<optimized out>, ~r0=<optimized out>) at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:1095
#3  0x000000000042ab85 in runtime.loadOptionalSyscalls () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:249
#4  0x000000000042b9b5 in runtime.osinit () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/os_windows.go:551
#5  0x0000000000456637 in runtime.rt0_go () at C:/Users/qmuntaldiaz/code/golang-go/src/runtime/asm_amd64.s:348
#6  0x00000000a80d26bd in ?? ()

@qmuntal
Copy link
Contributor

qmuntal commented Dec 23, 2022

@ianlancetaylor @cherrymui asmcgocall does not have a frame pointer even it being a non-leaf function due to condition 2 in this code:

if ctxt.Arch.Family == sys.AMD64 &&
!p.From.Sym.NoFrame() && // (1) below
!(autoffset == 0 && p.From.Sym.NoSplit()) && // (2) below
!(autoffset == 0 && !hasCall) { // (3) below
// Make room to save a base pointer.
// There are 2 cases we must avoid:
// 1) If noframe is set (which we do for functions which tail call).
// 2) Scary runtime internals which would be all messed up by frame pointers.
// We detect these using a heuristic: frameless nosplit functions.
// TODO: Maybe someday we label them all with NOFRAME and get rid of this heuristic.
// For performance, we also want to avoid:
// 3) Frameless leaf functions
bpsize = ctxt.Arch.PtrSize
autoffset += int32(bpsize)
p.To.Offset += int64(bpsize)
} else {
bpsize = 0
}

asmcgocall seems to be doing well with a frame pointer, so it does not fit in the scary-runtime-internals category (at least on Windows, have to try other OSes). IMO we should get rid of that heuristic and just use NOFRAME if a runtime function would be messed up be frame pointer, just as it is happening on arm and arm64.

@gopherbot
Copy link

Change https://go.dev/cl/459395 mentions this issue: runtime: use explicit NOFRAME on windows/amd64

@cherrymui
Copy link
Member

IMO we should get rid of that heuristic and just use NOFRAME if a runtime function would be messed up be frame pointer

I think that is a good direction. Thanks for looking into it. Are we sure what matters are all assembly functions? We don't have explicit NOFRAME control for compiled functions.

I'm still not sure about stack transition in asmcgocall being "broken". Technically, it is running on two stacks -- the C functions run on a different stack. So if we are unwinding the physical stack, you shouldn't see both Go and C functions. You could argue we're expecting to unwind the logical stack. That is a reasonable argument. But I don't think a decision has been made for whether the unwinding should be the physical stack or the logical stack. Further, at least some debugger is not happy if the stack pointer suddenly changes direction. I don't think it is a good idea if the unwinding only works when the C stack is at a higher address than the Go stack, given that we don't generally control where the stacks are in the address space.

For a (not quite accurate) analogy, what does the debugger do for C longjmp or setcontext or other stack transitions? Does it unwind through it? What machinery does it use?

Thanks.

@ianlancetaylor
Copy link
Contributor

The C longjmp and setcontext functions change the stack entirely and irrevocably (modulo another call to longjmp or setcontext, of course). They are unlike asmcgocall, which switches to a temporary stack for the duration of a function call and then returns to the original stack. So I don't think the same issues arise. For the C functions you get either the original call stack or the new call stack.

Note that we do support unwinding the cgo stack to the Go stack via the runtime.SetCgoTraceback function. However, that only supports tracebacks that use Go functions (runtime.Callers, runtime.Stack). It does not help with libunwind.

What would work for libunwind is for us to write unwind information for asmcgocall that tells libunwind how to load the stack and PC of the calling frame. This is feasible, as the unwind information can use arbitrary expressions, but complicated.

@cherrymui
Copy link
Member

I agree that longjmp is not really a good analogy because it could be an actual context switch (although it could be used to implement a temporary stack transition like asmcgocall, but the tools never know). I don't think there is anything in C that does a temporary stack switch?

I think the Go traceback API is mostly showing only the "user frames", e.g. we hide compiler-generated wrappers. So hiding the asmcgocall stack transition when SetCgoTraceback is set seems also reasonable. But I'm not sure we should blindly hide stack transitions for other low-level unwinders.

@qmuntal
Copy link
Contributor

qmuntal commented Dec 27, 2022

But I'm not sure we should blindly hide stack transitions for other low-level unwinders.

I don't expect Go to hide wrappers and stack transitions to external unwinders, if it is possible at all.

I do expect Go to facilitate unwinding stack transitions, even from C to Go, and vice versa. This is certainly doable using Windows' SEH if a frame pointer is set in the function prologue and the linker emits the appropriate metadata.

gopherbot pushed a commit that referenced this issue Jan 24, 2023
This CL marks non-leaf nosplit assembly functions as NOFRAME to avoid
relying on the implicit amd64 NOFRAME heuristic, where NOSPLIT functions
without stack were also marked as NOFRAME.

Updates #57302
Updates #40044

Change-Id: Ia4d26f8420dcf2b54528969ffbf40a73f1315d61
Reviewed-on: https://go-review.googlesource.com/c/go/+/459395
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
@qmuntal

This comment was marked as off-topic.

gopherbot pushed a commit that referenced this issue Apr 18, 2023
Frame pointer is enabled on ARM64. When copying stacks, the
saved frame pointers need to be adjusted.

Updates #39524, #40044.
Fixes #58432.

Change-Id: I73651fdfd1a6cccae26a5ce02e7e86f6c2fb9bf7
Reviewed-on: https://go-review.googlesource.com/c/go/+/241158
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests

7 participants
@steeve @cagedmantis @ianlancetaylor @qmuntal @gopherbot @cherrymui and others