Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: in core dump generated for a signal while executing C code, gdb reports a corrupt stack #57698

Open
MariappanBalraj opened this issue Jan 9, 2023 · 64 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@MariappanBalraj
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.19.4 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ubuntu/.cache/go-build"
GOENV="/home/ubuntu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ubuntu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ubuntu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.4"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1839558467=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Following is the working source code of GO which uses CGO. From GO code, the c function test3() is called, which calls test2(), which calls test1(). In test1(), there is a NULL pointer assignment, which will cause segmentation fault. To get the core dump, "ulimit -c unlimited" and "echo "core" > /proc/sys/kernel/core_pattern" are run. The code is compiled without optimization by setting env variable CGO_CFLAGS to -g. The program is run by "GOTRACEBACK=crash ./test", which produced core dump. From the core dump, I am not getting complete C stack and it reports corrupt stack. Please note that when I run the program by using gdb, I am getting the complete stack.

package main

/*
#include <stdio.h>

void test1(void) {
int p = (int)NULL;
*p = 30;
}

void test2(void) {
int val = 2;
test1();
}

void test3(void) {
int val = 3;
test2();
}
*/
import "C"

func main() {
C.test3()
}

What did you expect to see?

I am expecting the same C stack which is displayed when the program is run directly by gdb.

gdb ./test
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...
warning: Unsupported auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/ubuntu/mbalraj/GO/TEST/test.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) run
Starting program: /home/ubuntu/mbalraj/GO/TEST/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffd08f7640 (LWP 193717)]
[New Thread 0x7fffcbfff640 (LWP 193718)]
[New Thread 0x7fffcb7fe640 (LWP 193719)]
[New Thread 0x7fffcaffd640 (LWP 193720)]
[New Thread 0x7fffca7fc640 (LWP 193721)]

Thread 1 "test" received signal SIGSEGV, Segmentation fault.
0x000000000045abe6 in test1 () at /home/ubuntu/GO/TEST/test.go:8
8 *p = 30;
(gdb) bt
#0 0x000000000045abe6 in test1 () at /home/ubuntu/GO/TEST/test.go:8
#1 0x000000000045ac07 in test2 () at /home/ubuntu/GO/TEST/test.go:13
#2 0x000000000045ac22 in test3 () at /home/ubuntu/GO/TEST/test.go:18
#3 0x000000000045ac42 in _cgo_6ac9acc88c92_Cfunc_test3 (v=0xc000056f70) at /tmp/go-build/cgo-gcc-prolog:49
#4 0x00000000004567e4 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:844
#5 0x00000000004e15e0 in ?? ()
#6 0x0000000000000001 in ?? ()
#7 0x000000c000100a00 in ?? ()
#8 0x00007fffffffe2d8 in ?? ()
#9 0x0000000000439fa5 in runtime.malg.func1 () at /usr/local/go/src/runtime/proc.go:4080
#10 0x0000000000456629 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:492
#11 0x0000000000458fe5 in runtime.newproc (fn=0x1) at :1
#12 0x00000000004c9780 in runtime[scavenger] ()
#13 0x0000000000000001 in ?? ()
#14 0x0000000000456525 in runtime.mstart () at /usr/local/go/src/runtime/asm_amd64.s:390
#15 0x00000000004564af in runtime.rt0_go () at /usr/local/go/src/runtime/asm_amd64.s:354
#16 0x0000000000000001 in ?? ()
#17 0x00007fffffffe458 in ?? ()
#18 0x0000000000000000 in ?? ()

What did you see instead?

gdb ./test core.193685
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...
[New LWP 193685]
[New LWP 193687]
[New LWP 193686]
[New LWP 193689]
[New LWP 193688]
[New LWP 193690]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by ./test'. Program terminated with signal SIGABRT, Aborted. #0 runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159 159 RET [Current thread is 1 (Thread 0x7fdf0cfb7740 (LWP 193685))] warning: Unsupported auto-load script at offset 0 in section .debug_gdb_scripts of file /home/ubuntu/mbalraj/GO/TEST/test. Use info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0 runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#1 0x0000000000443345 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:870
#2 0x00000000004438de in runtime.sigfwdgo (sig=6, info=, ctx=, ~r0=)
at /usr/local/go/src/runtime/signal_unix.go:1086
#3 0x0000000000442027 in runtime.sigtrampgo (sig=0, info=0x0, ctx=0x4583c1 <runtime.raise+33>) at /usr/local/go/src/runtime/signal_unix.go:432
#4 0x00000000004586a6 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:359
#5
#6 runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#7 0x0000000000443345 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:870
#8 0x0000000000443558 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:962
#9 0x000000000042f891 in runtime.fatalthrow.func1 () at /usr/local/go/src/runtime/panic.go:1129
#10 0x000000000042f80c in runtime.fatalthrow (t=) at /usr/local/go/src/runtime/panic.go:1122
#11 0x000000000042f4bd in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:1047
#12 0x0000000000443289 in runtime.sigpanic () at /usr/local/go/src/runtime/signal_unix.go:819
#13 0x000000000045abe6 in test1 () at /home/ubuntu/GO/TEST/test.go:8
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

@bcmills bcmills added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 9, 2023
@cagedmantis cagedmantis changed the title affected/package: cmd/gco: core dump reports a corrupt stack Jan 9, 2023
@cagedmantis cagedmantis added this to the Backlog milestone Jan 9, 2023
@ianlancetaylor ianlancetaylor changed the title cmd/gco: core dump reports a corrupt stack runtime: in core dump generated for a signal while executing C code, gdb reports a corrupt stack Jan 9, 2023
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 9, 2023
@ianlancetaylor
Copy link
Contributor

CC @golang/runtime

Looks like gdb can't unwind past the call to sigpanic. It gets to test1 but can't get any farther. This might be difficult to solve, as the call to sigpanic is dynamically generated.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 10, 2023 via email

@ianlancetaylor
Copy link
Contributor

I'm suggesting that the problem is that gdb can't unwind past a call to sigpanic. When you call the C function assert, no call to sigpanic is involved.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 11, 2023 via email

@cherrymui
Copy link
Member

Maybe we could consider not injecting a sigpanic if a faulting signal lands on a non-user G. It would be that either we're in the runtime, or we're running some C code. Either way I think we cannot recover.

@cherrymui cherrymui self-assigned this Jan 11, 2023
@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 12, 2023 via email

@ianlancetaylor
Copy link
Contributor

There is no API to avoid calling sigpanic, and we are not going to add one.

@cherrymui has pointed out that it is currently impossible to recover a panic that occurs while executing C code, so it may be reasonable to avoid injecting a sigpanic call in C code. We could, instead, simply reraise the signal, as we do for other signals that occur while executing C code. See the badsignal function.

However, making that change is not going to be a high priority for us. I don't know of anybody else who has reported problems getting a stack backtrace in a core dump generated while executing C code called by Go code. Go is an open source project, and anybody is welcome to contribute. See https://go.dev/doc/contribute.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 12, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 12, 2023 via email

@ianlancetaylor
Copy link
Contributor

Just removing the _Sigpanic flag will mean that dereferencing a nil pointer in Go will not cause a panic as it should. I expect that if you do that some tests will fail. You can run "all.bash" to build the entire tree and run all the tests, as documented at https://go.dev/doc/contribute#testing .

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 13, 2023 via email

@cherrymui
Copy link
Member

My thinking is to set it to _SigThrow if the signal is not arrived on a user Go stack. I plan to send a CL, but I need to check a few things to make sure it works as expected. Thanks.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 16, 2023 via email

@gopherbot
Copy link

Change https://go.dev/cl/462437 mentions this issue: runtime: don't inject a sigpanic if not on user G stack

@cherrymui
Copy link
Member

https://go.dev/cl/462437 is the CL. I'd still need to do more testing. You're welcome to try if the CL fixes your case. Thanks.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 18, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 18, 2023 via email

@cherrymui
Copy link
Member

The Go functions and C functions run on different stacks. When test3 calls Test4, internally there is a stack switch, and GDB cannot unwind through that. You may want to try Delve (https://github.com/go-delve/delve), which knows Go's stack switch convention and probably handles this better.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 19, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Jan 23, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Feb 13, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Feb 16, 2023 via email

@ianlancetaylor
Copy link
Contributor

The use of libgo.so indicates that Tom was using gccgo. You are using gc. I commented on the gdb bug.

@aarzilli
Copy link
Contributor

aarzilli commented Mar 2, 2023

The stack switching code in delve has always been a pain to maintain, and it doesn't work correctly in every possible situation. I think there's an ongoing change in the runtime to keep frame pointers correct all the time, I'd rather rewrite delve's stack switching to make use of that (on future versions of go that have it) at this point.

It does appear that stack traces on core dumps are broken, they used to work, we probably didn't notice because of a problem with the CI setup.

PS. stack traces on core dumps actually do work, but not if the signal happened on a C stack

@cherrymui
Copy link
Member

We don't use frame pointer on all platforms. Long term we probably want to add DWARF directives to express the stack transition (on frame pointer platforms we can make it point to the FP).

@aarzilli
Copy link
Contributor

aarzilli commented Mar 2, 2023

We don't use frame pointer on all platforms

Delve supports less than all platforms anyway, if amd64 and arm64 have them it would only leave out i386, which is ok.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 3, 2023 via email

@qmuntal
Copy link
Contributor

qmuntal commented Mar 3, 2023

@MariappanBalraj solution 2 is already done as part of #58378. Feel free to review the amd64 assembly code in case I missed any spurious NOFRAME.

Just last week I was prototyping the solution 3 at CL 471715. I don't plan to continue that work on the near future, just wanted to prove feasibility, and the results were good. You can take that as reference and work on a cleaner implementation. AFAIK it hasn't been decided if Go should switch to emitting FP-based unwind information instead of stack deltas, so it would be good to create an issue to discuss it and track the effort.

@derekparker
Copy link
Contributor

derekparker commented Mar 6, 2023

To add onto what @aarzilli mentioned above, it would be great to eventually phase out the custom and complex stack switching code within Delve. Improving frame pointer correctness would be a huge step forward, and thanks to those actively working on that.

I'd be happy to work on solution 3 for the long term, building on top of the work @qmuntal started (if he doesn't plan on continuing that work himself).

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 8, 2023 via email

@qmuntal
Copy link
Contributor

qmuntal commented Mar 8, 2023

Hi Quim, I want to try your changes for solution 2. How to fetch your changeset and build?

Run this commands:

go install golang.org/dl/gotip@latest
gotip download 472195

Then use gotip instead of go for building, testing, etc.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 8, 2023 via email

@qmuntal
Copy link
Contributor

qmuntal commented Mar 8, 2023

I have tried the steps. But still I am seeing the same behavior. Do I need
to pull 466316 also?

That CL (and other before it) make sure that runtime functions store the frame pointer on the stack. That per-se is not enough to enable cgo stack unwinding on gdb, as gdb uses DWARF CFI information contained on the binary to unwind the stack.

Go is not currently emitting DWARF CFI information that makes use of the frame pointer, but uses stack deltas instead, which don't work when changing the stack pointer. Implementing this is part of solution 3.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 8, 2023 via email

@derekparker
Copy link
Contributor

Thanks. Got it. I tried with dlv(delve) also. It is not showing the stack trace. Do we need to enhance dlv code to make stack unwinding work correctly? Somehow I am looking for a solution to decode core dump. I am not specific to GDB or DLV. But I prefer dlv. Kindly please help with this.

Hello,

Delve also uses the DWARF information to unwind the stack, so it would also need the fix outlined in solution 3 as well.

@qmuntal do you plan to continue your work on that or would you mind if I picked that up?

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 8, 2023 via email

@derekparker
Copy link
Contributor

Thanks got it. Then how storing function pointer helps?

I somewhat misspoke (mistyped?) in that Delve does use frame pointers in certain circumstances but it isn't the main mode of stack unwinding, we also do rely on the DWARF unwind information when present. Solution 2 and 3 somewhat go hand in hand because if we can keep frame pointers correct then we can utilize them in the DWARF unwind information for easier, less error prone stack unwinding when switching stacks.

@aarzilli
Copy link
Contributor

aarzilli commented Mar 8, 2023

I've looked a bit more at this and what happens with delve is that we see that the current goroutine is g0, executing a runtime function, and we switch to the goroutine stack. There are two problems with this:

  1. because the signal happened in cgo code we should have kept unwinding the current stack, but how would we know?
  2. we don't know how to unwind past runtime.sigtramp anyway

@qmuntal
Copy link
Contributor

qmuntal commented Mar 8, 2023

@qmuntal do you plan to continue your work on that or would you mind if I picked that up?

@derekparker feel free to pick that CL and evolve it. I'll probably update it once https://go-review.googlesource.com/c/go/+/472195 is merged.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 8, 2023 via email

@derekparker
Copy link
Contributor

@qmuntal do you plan to continue your work on that or would you mind if I picked that up?

@derekparker feel free to pick that CL and evolve it. I'll probably update it once https://go-review.googlesource.com/c/go/+/472195 is merged.

No worries, I'll hold off for now while you're still working on it, I was just going to work on it if you weren't planning on doing any more work on it yourself.

@qmuntal
Copy link
Contributor

qmuntal commented Mar 13, 2023

No worries, I'll hold off for now while you're still working on it, I was just going to work on it if you weren't planning on doing any more work on it yourself.

I've just updated a couple of things and now I'm 100% done with that prototype. You can take it from here 😄.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 16, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 20, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 21, 2023 via email

@qmuntal
Copy link
Contributor

qmuntal commented Mar 21, 2023

have gone through the code changes done as part of CL 471715. It looks
like DWARF CFA information is added. But I am not getting full context.

About the test case I added in CL 471715, it also fails form time to time on my machine, we might be hitting the same issue.

I am seeking strong support from you people.

As I mentioned earlier, I don't have the knowledge now the time to fix cgo stack unwinding on gdb. I submitted CL 471715 in case it is useful for someone else. Can't give more support than that.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 21, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 24, 2023 via email

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 27, 2023 via email

gopherbot pushed a commit that referenced this issue Mar 27, 2023
If a panicking signal (e.g. SIGSEGV) happens on a g0 stack, we're
either in the runtime or running C code. Either way we cannot
recover and sigpanic will immediately throw. Further, injecting a
sigpanic could make the C stack unwinder and the debugger fail to
unwind the stack. So don't inject a sigpanic.

If we have cgo traceback and symbolizer attached, if it panics in
a C function ("CF" for the example below), previously it shows
something like

	fatal error: unexpected signal during runtime execution
	[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45f1ef]

	runtime stack:
	runtime.throw({0x485460?, 0x0?})
		.../runtime/panic.go:1076 +0x5c fp=0x7ffd77f60f58 sp=0x7ffd77f60f28 pc=0x42e39c
	runtime.sigpanic()
		.../runtime/signal_unix.go:821 +0x3e9 fp=0x7ffd77f60fb8 sp=0x7ffd77f60f58 pc=0x442229

	goroutine 1 [syscall]:
	CF
		/tmp/pp/c.c:6 pc=0x45f1ef
	runtime.asmcgocall
		.../runtime/asm_amd64.s:869 pc=0x458007
	runtime.cgocall(0x45f1d0, 0xc000053f70)
		.../runtime/cgocall.go:158 +0x51 fp=0xc000053f48 sp=0xc000053f10 pc=0x404551
	main._Cfunc_CF()
		_cgo_gotypes.go:39 +0x3f fp=0xc000053f70 sp=0xc000053f48 pc=0x45f0bf

Now it shows

	SIGSEGV: segmentation violation
	PC=0x45f1ef m=0 sigcode=1
	signal arrived during cgo execution

	goroutine 1 [syscall]:
	CF
		/tmp/pp/c.c:6 pc=0x45f1ef
	runtime.asmcgocall
		.../runtime/asm_amd64.s:869 pc=0x458007
	runtime.cgocall(0x45f1d0, 0xc00004ef70)
		.../runtime/cgocall.go:158 +0x51 fp=0xc00004ef48 sp=0xc00004ef10 pc=0x404551
	main._Cfunc_CF()
		_cgo_gotypes.go:39 +0x3f fp=0xc00004ef70 sp=0xc00004ef48 pc=0x45f0bf

I think the new one is reasonable.

For #57698.

Change-Id: I4f7af91761374e9b569dce4c7587499d4799137e
Reviewed-on: https://go-review.googlesource.com/c/go/+/462437
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
@qmuntal
Copy link
Contributor

qmuntal commented Mar 28, 2023

I am just
trying to understand exactly what you have done for CL 471715. How will it
be helpful for others to proceed?

The key improvement of CL 471715 is that it emits the DW_CFA_def_cfa_register instruction when a function prologue sets the frame pointer. This instruction can be used by debuggers to know in which register the current frame pointer is stored, making stack unwinding easier and more robust. Without that CL, the Go toolchain only emits DW_CFA_def_cfa_offset and DW_CFA_advance_loc instructions, which can't correctly model the stack switching that happens in cgo calls.

Unfortunately, doing so requires more knowledge about the assembly instructions and functions stack layout than available to the Go linker, so I had to move part of the DWARF processing to the Go compiler. That's why CL 471715 has so many changes for just adding a single DWARF instruction.

@MariappanBalraj
Copy link
Author

MariappanBalraj commented Mar 28, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

9 participants