Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: bad backtraces in gdb on s390x #24385

Closed
josharian opened this issue Mar 14, 2018 · 13 comments
Closed

runtime: bad backtraces in gdb on s390x #24385

josharian opened this issue Mar 14, 2018 · 13 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@josharian
Copy link
Contributor

Reproduce:

$ gomote create linux-s390x-ibm
$ export GM=YOUR_GOMOTE_INSTANCE
$ git rev-parse HEAD  # tip as of Mar 13, 2018
ca9abbb731a6d1f824dd755afde83e5727dab3c6
$ gomote push $GM && gomote run $GM go/src/make.bash && gomote run $GM go/bin/go test -run=PythonCgo -v runtime
<gomote noise>
Building Go cmd/dist using /var/buildlet/go-linux-s390x-bootstrap.
Building Go toolchain1 using /var/buildlet/go-linux-s390x-bootstrap.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/s390x.
---
Installed Go for linux/s390x in /data/golang/workdir/go
Installed commands in /data/golang/workdir/go/bin
=== RUN   TestGdbPythonCgo
=== PAUSE TestGdbPythonCgo
=== CONT  TestGdbPythonCgo
--- PASS: TestGdbPythonCgo (0.74s)
	runtime-gdb_test.go:59: gdb version 8.0
	runtime-gdb_test.go:193: gdb output: Loading Go Runtime support.
		Loaded  Script                                                                 
		Yes     /data/golang/workdir/go/src/runtime/runtime-gdb.py                     
		Breakpoint 1 at 0x10899a8: file /data/golang/workdir/go/src/fmt/print.go, line 264.
		[Thread debugging using libthread_db enabled]
		Using host libthread_db library "/lib64/libthread_db.so.1".
		[New Thread 0x3fffba6f910 (LWP 53020)]
		[New Thread 0x3fffb26f910 (LWP 53021)]
		[New Thread 0x3fffaa6f910 (LWP 53022)]
		[New Thread 0x3fffa26f910 (LWP 53023)]
		
		Thread 1 "a.exe" hit Breakpoint 1, fmt.Println (a= []interface {} = {...}, err=..., n=3) at /data/golang/workdir/go/src/fmt/print.go:264
		264		return Fprintln(os.Stdout, a...)
		BEGIN info goroutines
		* 1 running  runtime.systemstack_switch
		  2 waiting  runtime.gopark
		  3 waiting  runtime.gopark
		  18 waiting  runtime.gopark
		END
		#1  0x00000000010907aa in main.main () at /tmp/go-build508044115/main.go:15
		15		fmt.Println("hi")
		BEGIN print mapvar
		$1 = map[string]string = {["ghi"] = "jkl", ["abc"] = "def"}
		END
		BEGIN print strvar
		$2 = "abc"
		END
		BEGIN info locals
		mapvar = map[string]string = {["ghi"] = "jkl", ["abc"] = "def"}
		slicevar.cap = 16
		slicevar.len = 1
		slicevar.ptr = "def"
		strvar = "abc"
		END
		#0  fmt.Println (a= []interface {} = {...}, err=..., n=3) at /data/golang/workdir/go/src/fmt/print.go:264
		264		return Fprintln(os.Stdout, a...)
		BEGIN goroutine 1 bt
		#0  fmt.Println (a= []interface {} = {...}, err=..., n=3) at /data/golang/workdir/go/src/fmt/print.go:264
		#1  0x00000000010907aa in main.main () at /tmp/go-build508044115/main.go:15
		END
		BEGIN goroutine 2 bt
		#0  runtime.gopark (lock=0x1137b80 <runtime.forcegc>, reason="force gc (idle)", traceEv=20 '\024', traceskip=1, unlockf={void (struct runtime.g *, void *, bool *)} 0xc000032778) at /data/golang/workdir/go/src/runtime/proc.go:292
		#1  0x000000000102f234 in runtime.goparkunlock (lock=0x1137b80 <runtime.forcegc>, reason="force gc (idle)", traceEv=20 '\024', traceskip=1) at /data/golang/workdir/go/src/runtime/proc.go:297
		#2  0x000000000102ef96 in runtime.forcegchelper () at /data/golang/workdir/go/src/runtime/proc.go:248
		#3  0x00000000010598a2 in runtime.goexit () at /data/golang/workdir/go/src/runtime/asm_s390x.s:787
		Backtrace stopped: previous frame identical to this frame (corrupt stack?)
		END
		Breakpoint 2 at 0x10907dc: file /tmp/go-build508044115/main.go, line 19.
		hi
		
		Thread 1 "a.exe" hit Breakpoint 2, main.main () at /tmp/go-build508044115/main.go:20
		20	}  // END_OF_PROGRAM
		BEGIN goroutine 1 bt at the end
		#0  main.main () at /tmp/go-build508044115/main.go:20
		END
		
PASS
ok  	runtime	0.744s

Though the test passes, observe this line:

Backtrace stopped: previous frame identical to this frame (corrupt stack?)

So something is unhealthy about s390x's backtrace of non-running goroutines.

This ended up being surfaced more visibly in build failures after CL 99078, e.g. https://build.golang.org/log/17e01b4804101ee50ca882c4520a3337818e42da. The CL was then reverted to fix the s390x build. Once this issue is fixed, let's re-apply CL 99078.

I don't plan to look into this further. I'm hoping someone with more s390x expertise can take over from here.

cc @mundaym @aclements

@josharian josharian added NeedsFix The path to resolution is known, but the work has not been done. release-blocker labels Mar 14, 2018
@josharian josharian added this to the Go1.11 milestone Mar 14, 2018
@cherrymui
Copy link
Member

I think this is ok. runtime.goexit is the top of a goroutine stack. Backtrace should stop there. But gdb doesn't know it.

@josharian
Copy link
Contributor Author

Hmm. I can say that gdb doesn’t complain about the equivalent bt call on linux/amd64. And something broke more seriously with the should-be-unrelated CL 99078. (I spent quite a while poking at it, to no avail. Then I noticed this complaint on master and guess I hoped it was related.)

@aclements
Copy link
Member

My hunch is that we should be arranging for the saved LR of runtime.goexit to be 0 on s390x and are simply failing to do that.

@aclements
Copy link
Member

Okay, it's pretty clear that we're zeroing the saved LR. That's pretty simple and it's the same on most arches. My hunch now is that the .debug_frame information is wrong. In particular, since the compiler thinks of goexit as a leaf function, .debug_frame claims that LR is unchanged from the callee frame, rather than telling the debugger to fetch the LR from the saved LR slot. The consequence, I believe, is that the goroutine's entry point function will save its LR, which will point to goexit (assuming the entry point isn't itself a leaf), and when the debugger unwinds from the entry point function it will use the same LR for goexit, and see that it's in a loop at goexit.

This appears to be the case for .debug_frame on all LR machines. I'm not sure why it only appears to affect s390x. Maybe it affects others, too, and we just haven't noticed.

@cherrymui
Copy link
Member

cherrymui commented Mar 16, 2018

Are you saying that we should let .debug_frame think goexit have a frame, so the debugger will fetch LR from its frame's LR slot, which we zeroed? What is the condition that gdb decides that it has reached the top of the stack? LR==0?

This appears to be the case for .debug_frame on all LR machines. I'm not sure why it only appears to affect s390x. Maybe it affects others, too, and we just haven't noticed.

I may have seen that on other LR architectures, and thought that was ok.

@aclements
Copy link
Member

Are you saying that we should let .debug_frame think goexit have a frame, so the debugger will fetch LR from its frame's LR slot, which we zeroed?

That's one possibility.

A totally different approach would be to rearrange the way we enter goroutines. We could, for example, actually enter at goexit and have it actually call the goroutine entry point, rather than hand-constructing a frame that makes it look like goexit called the entry point. I wouldn't do this just to fix up the debug info, but it may simplify things overall and have the side-effect of making the debug info right.

What is the condition that gdb decides that it has reached the top of the stack? LR==0?

I believe so, though I haven't specifically checked.

@aclements
Copy link
Member

Having thought about this more, perhaps the tricky part is that we don't know how big goexit's frame actually is. Logically, goexit's frame includes the arguments to the goroutine entry function, but those aren't fixed size and there's no way to represent variable sized arguments in Go's calling convention. I tried setting things up so we entered at goexit and it called the entry function, but ran head-first into this problem. And it would be a problem for any sort of generic DWARF unwinding.

This works in Go's own tracebacks only because we stop when we see goexit, so it doesn't matter that we can't walk over its frame.

So, maybe we need to write out a special FDE for goexit that explicitly sets the LR register to 0 to stop the debugger, similar to how we explicitly stop internal tracebacks.

(Now I'm curious why this seems to work on x86 machines...)

@cherrymui
Copy link
Member

(Now I'm curious why this seems to work on x86 machines...)

On AMD64:

(gdb) goroutine 2 bt
#0  runtime.gopark (lock=0x4a8810 <runtime.forcegc>, reason="force gc (idle)", traceEv=20 '\024', traceskip=1, unlockf=<optimized out>) at /usr/local/google/home/cherryyz/src/go/src/runtime/proc.go:292
#1  0x0000000000423bfe in runtime.goparkunlock (lock=<optimized out>, reason=..., traceEv=<optimized out>, traceskip=<optimized out>) at /usr/local/google/home/cherryyz/src/go/src/runtime/proc.go:297
#2  0x000000000042398a in runtime.forcegchelper () at /usr/local/google/home/cherryyz/src/go/src/runtime/proc.go:248
#3  0x0000000000446df1 in runtime.goexit () at /usr/local/google/home/cherryyz/src/go/src/runtime/asm_amd64.s:1385
#4  0x0000000000000000 in ?? ()

On AMD64, even goexit has no frame, its return PC is still fetched from stack, since if it were CALL'd, the CALL instruction would save return PC on stack. And we zeroed the stack space.

Apparently returnPC=0 doesn't stop gdb from tracing back, but once it gets to PC=0 it cannot proceed.

@cherrymui
Copy link
Member

On LR machines, I think making goexit appear to have a frame in DWARF would work. Or maybe just make it have a frame (but still hand-construct the frame when making new goroutine, to avoid variable sized arg problem).

@mundaym
Copy link
Member

mundaym commented Mar 19, 2018

On LR machines, I think making goexit appear to have a frame in DWARF would work.

If we do this with DWARF then I think the normal way to mark entry points is to insert a DW_CFA_undefined op with the link register (or rip on amd64) as its parameter. In gas on s390x this would be inserted using the directive .cfi_undefined r14. I think that should work fine with a zero-sized frame.

@cherrymui
Copy link
Member

Thanks @mundaym. This sounds the right thing to do.

@ianlancetaylor
Copy link
Contributor

I'm kicking the milestone forward to 1.12.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 9, 2018
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@gopherbot
Copy link

Change https://golang.org/cl/169726 mentions this issue: cmd/link, runtime: mark goexit as the top of the call stack

@golang golang locked and limited conversation to collaborators Apr 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants