Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: ppc64le broken by encourage inlining of functions with single-call bodies #28679

Closed
bradfitz opened this issue Nov 8, 2018 · 19 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Nov 8, 2018

https://go-review.googlesource.com/c/go/+/147361 broke ppc64le.

https://build.golang.org/log/cee992f2c28faffaa1c9526349f4923f9cb16c83

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
--- FAIL: TestGdbPython (0.58s)
    runtime-gdb_test.go:66: gdb version 7.7
    runtime-gdb_test.go:213: gdb output: Loading Go Runtime support.
        Loaded  Script                                                                 
        Yes     /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/runtime-gdb.py      
        Breakpoint 1 at 0x92cac: file /tmp/go-build132537636/main.go, line 15.
        hi
        
        Breakpoint 1, main.main () at /tmp/go-build132537636/main.go:17
        17		gslice = slicevar
        BEGIN info goroutines
        * 1 running  syscall.Syscall
          2 runnable runtime.forcegchelper
          17 waiting  runtime.gopark
          18 runnable runtime.runfinq
        END
        BEGIN print mapvar
        $1 = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        END
        BEGIN print strvar
        $2 = "abc"
        END
        BEGIN info locals
        mapvar = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        strvar = "abc"
        slicevar =  []string
        END
        BEGIN goroutine 1 bt
        #0  main.main () at /tmp/go-build132537636/main.go:17
        END
        BEGIN goroutine 2 bt
        #0  runtime.forcegchelper () at /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242
        #1  0x0000000000092cac in fmt.Println (a=..., n=<optimized out>, err=...) at /tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        #2  main.main () at /tmp/go-build132537636/main.go:14
        Traceback (most recent call last):
          File "/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/runtime-gdb.py", line 536, in invoke
            gdb.parse_and_eval('$sp = $save_sp')
        gdb.error: Attempt to assign to an unmodifiable value.
        Error occurred in Python command: Attempt to assign to an unmodifiable value.
        END
        Breakpoint 2 at 0x92ce0: file /tmp/go-build132537636/main.go, line 18.
        runtime: newstack sp=0xc000032fc0 stack=[0xc000064000, 0xc000065000]
        	morebuf={pc:0x92cac sp:0xc000032fc0 lr:0x0}
        	sched={pc:0x38bb4 sp:0xc000032fc0 lr:0x92cac ctxt:0x0}
        runtime: gp=0xc000000180, goid=1, gp->status=0x2
         runtime: split stack overflow: 0xc000032fc0 < 0xc000064000
        fatal error: runtime: split stack overflow
        
        runtime stack:
        runtime.throw(0xd4b34, 0x1d)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/panic.go:608 +0x5c
        runtime.newstack()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/stack.go:995 +0x7a4
        runtime.morestack()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/asm_ppc64x.s:333 +0x60
        
        goroutine 1 [running]:
        runtime.forcegchelper()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242 +0x14 fp=0xc000032fc0 sp=0xc000032fc0 pc=0x38bb4
        runtime: unexpected return pc for main.main called from 0x0
        stack: frame={sp:0xc000032fc0, fp:0xc000033080} stack=[0xc000064000,0xc000065000)
        
        fmt.Println(...)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        main.main()
        	/tmp/go-build132537636/main.go:14 +0x1ec fp=0xc000033080 sp=0xc000032fc0 pc=0x92cac
        [Inferior 1 (process 11715) exited with code 02]
        BEGIN goroutine 1 bt at the end
        No such goroutine:  1
        END
        
    runtime-gdb_test.go:286: goroutine 1 bt at the end failed: No such goroutine:  1
FAIL
FAIL	runtime	36.282s
@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. release-blocker labels Nov 8, 2018
@bradfitz bradfitz added this to the Go1.12 milestone Nov 8, 2018
@randall77
Copy link
Contributor

Odd.
The backtrace for the background GC thread is all wrong:

BEGIN goroutine 2 bt
        #0  runtime.forcegchelper () at /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242
        #1  0x0000000000092cac in fmt.Println (a=..., n=<optimized out>, err=...) at /tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        #2  main.main () at /tmp/go-build132537636/main.go:14

The backtrace starts at a GC helper thread, then somehow jumps to the user's thread, at a nonsensical location.

Here's goroutine 2's stack trace from a correct run (on linux):

        BEGIN goroutine 2 bt
        #0  runtime.gopark (unlockf=<optimized out>, lock=0x752d20 <runtime.forcegc>, reason=16 '\020', traceEv=20 '\024', traceskip=1) at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:302
        #1  0x0000000000429766 in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:307
        #2  runtime.forcegchelper () at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:250
        #3  0x0000000000450671 in runtime.goexit () at /usr/local/google/home/khr/sandbox/inl/src/runtime/asm_amd64.s:1340
        #4  0x0000000000000000 in ?? ()
        END

The error isn't deterministic; sometimes it happens and sometimes it doesn't.
I think this issue is probably a bug in the underlying mid-stack inlining, not anything to do with my inlining change directly.
I will investigate further.
@davidlazar @aclements

@laboger
Copy link
Contributor

laboger commented Nov 15, 2018

I haven't been able to reproduce this error on any of our systems, using various distros.
Is there something different about the way this run or the builder system?

I tried the full all.bash or just by doing this:
go tool dist test runtime:cpu124

@bradfitz
Copy link
Contributor Author

Our ppc64le builders run Debian jessie with kernel:

Linux go-le-1 3.16.0-4-powerpc64le #1 SMP Debian 3.16.7-ckt25-1 (2016-03-06) ppc64le GNU/Linux

@randall77
Copy link
Contributor

Check gdb versions as well, that is likely to be part of the issue.

@bradfitz
Copy link
Contributor Author

$ gdb --version
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1

@laboger
Copy link
Contributor

laboger commented Nov 15, 2018

I ran it on a comparable system but with a more recent kernel:
Linux lorentz 3.16.0-6-powerpc64le #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08) ppc64le GNU/Linux

gdb --version
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1

I don't understand the point of setting GOMAXPROCS=2 and then -test.cpu=1,2,4? Setting GOMAXPROCS=2 but -test.cpu=4 seems wrong. Could something be getting messed up because of that?

What is the ppc64 system? Seems strange that is not failing in this way.

@laboger
Copy link
Contributor

laboger commented Nov 26, 2018

Could we get a newer distro and kernel on the ppc64le builder. This is the second time we've seen failures in the builder that don't happen on other distros. Debian 8 seems pretty old.

I don't work with OSU much so I don't know who your contact is for this system or how to make this request.

@bradfitz
Copy link
Contributor Author

Could we get a newer distro and kernel on the ppc64le builder.

Sorry, but that won't happen anytime soon. We have a backlog of builder work and this wouldn't be near the top of the priority list. I recommend just skipping GDB tests if the GDB or kernel is too old.

/cc @dmitshur

@laboger
Copy link
Contributor

laboger commented Nov 27, 2018

It would be good if we could upgrade at some point. It's not a certainty that's the problem in this case, just a guess since it seems to work right so many places.

I built the programs for main.go, the program being run by gdb when it fails, before and after the bad commit. One thing I saw that looks suspicious is that starting in the commit where it started failing, Fprintln is now inlined from Println, so main doesn't directly call Println anymore. And fmt.Println is what shows up in the backtrace in the output when it fails. So I'm wondering if that is significant? I don't know what has to happen to the dwarf or whatever information gdb has to look at to figure out the backtrace in the case where inlining happened to get it right, but maybe there is something that causes gdb to look at uninitalized data causing random behavior? I also see that Println and Fprintln have variable argument lists, not common.

@laboger
Copy link
Contributor

laboger commented Nov 30, 2018

I'm not sure what's happening on the ppc64le builder, but I can reproduce the exact same error output on any machine by doing this:

GOMAXPROCS=1 ./runtime.test -test.run=GdbPython$

I can also use gdb with the source for the program being run in the testcase and step through and see what's wrong. I don't think the problem is related to the compiler doing something wrong but because the testcase is different and behaves differently. I can try this test with an older compiler and get the same failure.

The problem occurs if forcegchelper is at the top of the stack for one of the goroutines and the goroutine's pc is at the start of the function. At this point the value for sp has not been set yet for forcegchelper, and since this is the initial function for a goroutine I'm not sure what sp would be at that point.

// start forcegc helper goroutine
func init() {
        go forcegchelper()
}

So when we have this situation, and 'goroutine 2 bt' is invoked for a goroutine with this at the top of its stack, things go wrong after that, probably because the sp does not make sense at this point. If I break at the end and do goroutine 2 bt no errors occur.

@gopherbot
Copy link

Change https://golang.org/cl/152540 mentions this issue: runtime: fix runtime-gdb.py when switching sp value

@laboger
Copy link
Contributor

laboger commented Dec 5, 2018

To see the reason for the above change, here is a reference in the gdb documentation. https://sourceware.org/gdb/onlinedocs/gdb/Registers.html.
See the footnote where it is providing information on another way to set $sp.

This is a way of removing one word from the stack, on machines where stacks grow downward in memory (most machines, nowadays). This assumes that the innermost stack frame is selected; setting $sp is not allowed when other stack frames are selected.

@milanknezevic
Copy link
Contributor

@laboger
The same issue can be reproduced on mipsle, but not on mips, I'm not sure why.
In our case, goroutine $num bt happens at point in main where $pc = $ra(LR register). When goroutine command sets the $pc it points at the start of runtime.runfinq (before the $ra is saved on stack), so the gdb uses $ra to unwind. At this point we have bad backtrace:

(gdb) goroutine 17 bt
#0  runtime.runfinq () 
#1  0x000a4ea0 in fmt.Println 
#2  main.main ()

and gdb frame command returns that main.main frame is current frame.

(gdb) frame
#2  main.main () 

Your fix helps, thank you.
I fixed it locally by saving/restoring $ra along with $pc and $sp. Not sure if that should be done for all non-x86 targets, regardless of this issue.

gopherbot pushed a commit that referenced this issue Dec 7, 2018
After a recent change to runtime-gdb_test.go the ppc64le builder
has had intermittent failures. The failures occur when trying to
invoke the goroutineCmd function to display the backtrace for
a selected goroutine. There is nothing wrong with the testcase
but it seems to intermittently leave goroutines in a state
where an error can occur.

The error message indicates that the problem occurs when trying
to change the sp back to the original after displaying the
stacktrace for the goroutine.

gdb.error: Attempt to assign to an unmodifiable value.

After some searching I found that this error message can happen
if the sp register is changed when on a frame that is not the
top-most frame. To fix the problem, frame 0 is selected before
changing the value of sp. This fixes the problem in my
reproducer environment, and hopefully will fix the problem on
the builder.

Updates #28679

Change-Id: I329bc95b30f8c95acfb161b0d9cfdcbd917a1954
Reviewed-on: https://go-review.googlesource.com/c/152540
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@katiehockman
Copy link
Contributor

This is still failing freebsd-arm-paulzhol builds.
Latest failure: https://build.golang.org/log/d9278491818714d74a9190557d6107f737d48612

Is it feasible to disable this test, as @bradfitz suggested earlier? Or is there something else that can be tried?

@gopherbot
Copy link

Change https://golang.org/cl/155932 mentions this issue: runtime: disable GDB tests on freebsd/arm for now

@bradfitz
Copy link
Contributor Author

bradfitz commented Jan 2, 2019

@katiehockman, done, in https://golang.org/cl/155932 and tracked in new bug #29508

gopherbot pushed a commit that referenced this issue Jan 2, 2019
Updates #29508
Updates #28679

Change-Id: I19bc9f88aeb2b1f3e69856173a00c5a4d5ed3613
Reviewed-on: https://go-review.googlesource.com/c/155932
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Katie Hockman <katie@golang.org>
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@andybons andybons removed this from the Go1.13 milestone Jul 8, 2019
@andybons andybons added this to the Go1.14 milestone Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bradfitz
Copy link
Contributor Author

@laboger, is this still a problem or can this be closed?

@laboger
Copy link
Contributor

laboger commented Oct 29, 2019

I haven't seen this problem in a while on ppc64x. I think it was left open due to similar issues on other platforms. I think it can be closed.

@bradfitz
Copy link
Contributor Author

Thanks. Closing.

@golang golang locked and limited conversation to collaborators Oct 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants