libgo: SEGV in runtime test TestChan on ppc64le #36697

Open
laboger opened this issue Jan 22, 2020 · 20 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@laboger
Contributor

laboger commented Jan 22, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14beta1 gccgo (GCC) 10.0.1 20200122 (experimental) linux/ppc64le

Does this issue reproduce with the latest release?

This started happening in Go 1.13 and was reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92564. It continues to happen in Go 1.14beta1.

What operating system and processor architecture are you using (go env)?

$ go env
linux/ppc64le

What did you do?

Run the libgo tests

What did you expect to see?

No failures

What did you see instead?

=== RUN TestChan
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x7214a61c0000 pc=0x1008eb24]

More details, stacks, and gdb information can be found in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92564

@ianlancetaylor ianlancetaylor added this to the Gccgo milestone Jan 22, 2020
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 22, 2020
@laboger
Contributor Author

laboger commented Sep 15, 2020

@ianlancetaylor This continues to happen. I would appreciate some direction on how to fix this. We just found out this error has been hiding at least one other error in runtime that shows up intermittently.

The SEGV happens consistently now when testing the Chan tests in runtime. I build the a.out file and run it with -test.run=Chan. Here is the relevant stack:

runtime stack:
runtime.dopanic_m
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/panic.go:1211
runtime.fatalthrow
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/panic.go:1071
runtime.throw
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/panic.go:1042
runtime.sigpanic
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/signal_unix.go:642
runtime.scanstackblock
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgcmark.go:1207
doscanstack1
	../../../gcc/libgo/runtime/stack.c:93
runtime.doscanstack
	../../../gcc/libgo/runtime/stack.c:40
runtime.scanstack
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgcmark.go:767
runtime.markroot..func1
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgcmark.go:230
runtime.systemstack
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/stubs.go:60
runtime.markroot
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgcmark.go:203
runtime.gcDrain
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgcmark.go:911
runtime.gcBgMarkWorker..func2
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/mgc.go:1960
runtime.systemstack..func1
	/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest171796/test/stubs.go:63
runtime_mstart
	../../../gcc/libgo/runtime/proc.c:593

I did some debugging with gdb and found that the SEGV happens because a bad value is being passed for the length spsize in r4 from doscanstack1 when calling scanstackblock.

The call is done from this code:

        if(sp != nil) {
                scanstackblock((uintptr)(sp), (uintptr)(spsize), gcw);
                while((sp = __splitstack_find(next_segment, next_sp,
                                              &spsize, &next_segment,
                                              &next_sp, &initial_sp)) != nil)
                        scanstackblock((uintptr)(sp), (uintptr)(spsize), gcw);
        }

The bad r4 looks like an address, not a length. It causes the loop below to iterate far too many times, until it finally dereferences an invalid pointer.

func scanstackblock(b, n uintptr, gcw *gcWork) {
        if usestackmaps {
                throw("scanstackblock: conservative scan but stack map is used")
        }

        for i := uintptr(0); i < n; i += sys.PtrSize {
                // Same work as in scanobject; see comments there.
                obj := *(*uintptr)(unsafe.Pointer(b + i))
                if obj, span, objIndex := findObject(obj, b, i, true); obj != 0 {
                        greyobject(obj, b, i, span, gcw, objIndex, true)
                }
        }
}
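
The print added later in this thread ("scanstackblock called with size too large") amounts to a plausibility check on n. A minimal sketch of such a check follows, written as a standalone program rather than runtime code. `maxPlausibleStack` and `checkScanLen` are hypothetical names, and the 1 GiB bound is an arbitrary assumption, not a libgo constant.

```go
package main

import "fmt"

// maxPlausibleStack is an assumed upper bound on one stack block (1 GiB).
// It is illustrative only and is not a constant from libgo.
const maxPlausibleStack = uint64(1) << 30

// checkScanLen reports whether n is believable as a byte length for a
// conservative stack scan. A value above the bound is more likely an
// address that ended up in the length argument.
func checkScanLen(n uint64) bool {
	return n <= maxPlausibleStack
}

func main() {
	// 196552 is a segment size and 0x7fffcf30c338 is a bad r4 value,
	// both taken from gdb output reported later in this thread.
	for _, n := range []uint64{196552, 0x7fffcf30c338} {
		fmt.Printf("n=%#x plausible=%v\n", n, checkScanLen(n))
	}
}
```

A check like this only detects the symptom; it does not say where the bad length came from.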

I've tried to get more detail using gdb, but unfortunately setting breakpoints around some of the calls in the while loop in doscanstack1 changes the behavior. The SEGV always happens at the same location with a bad r4, but the number of times doscanstack1 and scanstackblock have been called before that can vary.
This is happening on master for gcc.

@ianlancetaylor
Contributor

My reading of what you are saying is that in the loop in doscanstack the value of spsize is wrong: it looks like an address rather than a size. That implies, of course, that __splitstack_find is somehow doing the wrong thing. __splitstack_find is part of GCC, and the source code is in libgcc/generic-morestack.c. In that function the address of spsize is known as len. There are a couple of different places where *len is set. I think the next step would be to find which one is triggering. In gdb it should be possible to set a breakpoint on each of the places that set *len, with a condition on the breakpoint that triggers when the value is unreasonably large.

@ianlancetaylor
Contributor

That said, it's hard to see why this would happen so specifically for the runtime Chan tests and not in other cases. I can't think of why a stack overrun would lead to this kind of error. What are the bad values for spsize that you see?

One simple thing to try would be to increase the value of BACKOFF in libgcc/config/rs6000/morestack.S. But it seems large enough already.

@laboger
Contributor Author

laboger commented Sep 16, 2020

That said, it's hard to see why this would happen so specifically for the runtime Chan tests and not in other cases. I can't think of why a stack overrun would lead to this kind of error. What are the bad values for spsize that you see?

It is not only happening in the Chan tests. When running all the runtime tests, the run stops after the first panic, and the Chan tests happen to come first. The failure is now consistent with -test.run=Chan, whereas early on it was intermittent.

gdb output:

(gdb) info reg $r4
r4             0x7fffcf30c338	140736669467448
(gdb) x/2x $r1+56
0x7fffcd9afb18:	0xcf30c338	0x00007fff
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007ffff6f57488 in doscanstack1 at ../../../gcc/libgo/runtime/stack.c:44
	breakpoint already hit 4 times
2       breakpoint     keep y   <MULTIPLE>         
	breakpoint already hit 4 times
2.1                         y     0x0000000010092018 in runtime.scanstackblock at mgcmark.go:1199
2.2                         y     0x00007ffff750e888 in runtime.scanstackblock at ../../../gcc/libgo/go/runtime/mgcmark.go:1199
3       breakpoint     keep y   0x00007ffff6f575c0 in doscanstack1 at ../../../gcc/libgo/runtime/stack.c:93
	breakpoint already hit 2 times

I can set breaks on doscanstack1 and runtime.scanstackblock, and those breaks are hit 4 times each. On the 4th call to runtime.scanstackblock the value in r4 is shown above and matches what is at r1+56, which is where it should have been loaded from before the call and which I believe is spsize. I'm continuing to try to figure out where the bad value is coming from; according to the source code it could have been set on multiple paths before the call to scanstackblock. It is curious to me that there are two copies of runtime.scanstackblock according to gdb: one in the binary and one in libgo.so. My a.out file was built using 'make runtime/check'. Not sure if that could be part of the problem.
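
To make the r4 versus r1+56 comparison explicit: on a little-endian target, gdb's x/2x prints the low 32-bit word first and the high word second. A small sketch of joining them follows; `joinWords` is a hypothetical helper for illustration only.

```go
package main

import "fmt"

// joinWords rebuilds a 64-bit value from two 32-bit words in the order
// gdb's x/2x prints them on a little-endian target: low word, then high.
func joinWords(lo, hi uint32) uint64 {
	return uint64(hi)<<32 | uint64(lo)
}

func main() {
	// Words from "x/2x $r1+56" in the gdb output above.
	v := joinWords(0xcf30c338, 0x00007fff)
	fmt.Printf("%#x %d\n", v, v) // prints: 0x7fffcf30c338 140736669467448
}
```

The result equals the r4 value gdb reports, so the argument was loaded from that stack slot unchanged.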

I've been trying to set breaks in doscanstack1 to where the bad spsize happens but setting breaks in this code affects the behavior and doesn't seem to stop where I expect it to. I'll keep trying.

@laboger
Contributor Author

laboger commented Sep 16, 2020

I added some prints to the code that display information when the spsize that is returned is a very large number, and also the value of n at the beginning of scanstackblock. In both cases it is __splitstack_find_context that returns the large value.
./a.out -test.run=Chan

bad spsize from __splitstack_find_context 0xb914c338
scanstackblock called with size too large: 0x7441b914c338
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x7441bb520000 pc=0x100920b4]

runtime stack:
runtime.dopanic_m
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1211
runtime.fatalthrow
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1071
runtime.throw
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1042
runtime.sigpanic
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/signal_unix.go:642
runtime.scanstackblock
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:1210
doscanstack1
        ../../../gcc/libgo/runtime/stack.c:98
runtime.doscanstack
        ../../../gcc/libgo/runtime/stack.c:40
runtime.scanstack
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:767
runtime.markroot..func1
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:230
runtime.systemstack
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/stubs.go:60
runtime.markroot
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:203
runtime.gcDrain
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:911
runtime.gcBgMarkWorker..func2
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgc.go:1960
runtime.systemstack..func1
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/stubs.go:63
runtime_mstart
        ../../../gcc/libgo/runtime/proc.c:593

./a.out -test.run=Concurrent

bad spsize from __splitstack_find_context 0x3a46bfd8
scanstackblock called with size too large: 0x78da3a46bfd8
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x78da3c840000 pc=0x100920b4]

runtime stack:
runtime.dopanic_m
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1211
runtime.fatalthrow
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1071
runtime.throw
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/panic.go:1042
runtime.sigpanic
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/signal_unix.go:642
runtime.scanstackblock
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:1210
doscanstack1
        ../../../gcc/libgo/runtime/stack.c:98
runtime.doscanstack
        ../../../gcc/libgo/runtime/stack.c:40
runtime.scanstack
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:767
runtime.markroot..func1
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:230
runtime.systemstack
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/stubs.go:60
runtime.markroot
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:203
runtime.gcDrain
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgcmark.go:911
runtime.gcBgMarkWorker..func2
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/mgc.go:1960
runtime.systemstack..func1
        /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/gotest80373/test/stubs.go:63
runtime_mstart
        ../../../gcc/libgo/runtime/proc.c:593

@ianlancetaylor
Contributor

It's true that in the runtime tests there will be two copies of runtime.scanstackblock. I can't think of any reason why that would be a problem on GNU/Linux, though.

Can you get the contents of gp->stackcontext at the start of doscanstack1?

@laboger
Contributor Author

laboger commented Sep 17, 2020

Here is the output from some prints at the start of doscanstack1:

boger@pike:~/gccgo-git/bld/powerpc64le-linux/libgo/gotest129267/test$ ./a.out -test.run=Chan -test.cpu=1
In doscanstack1 gp: 0xc000000d80 gp->stackcontext[0]: addr: 0xc000001a50 val: 0x7bfbd03e0000 
bad spsize: 0x7bfbd03dc338 from __splitstack_find_context with gp->stackcontext[0]: addr 0xc000001a50 val: 0x7bfbd03e0000
scanstackblock called with size too large: 136321460716344
In doscanstack1 gp: 0xc00024ed80 gp->stackcontext[0]: addr: 0xc00024fa50 val: 0x7bfbcc350000 
In doscanstack1 gp: 0xc000005100 gp->stackcontext[0]: addr: 0xc000005dd0 val: 0x7bfbcf310000 
In doscanstack1 gp: 0xc00024e000 gp->stackcontext[0]: addr: 0xc00024ecd0 val: 0x7bfbcea80000 
In doscanstack1 gp: 0xc00024fb00 gp->stackcontext[0]: addr: 0xc0002507d0 val: 0x7bfbcc320000 
In doscanstack1 gp: 0xc000250880 gp->stackcontext[0]: addr: 0xc000251550 val: 0x7bfbcc070000 
In doscanstack1 gp: 0xc000287600 gp->stackcontext[0]: addr: 0xc0002882d0 val: 0x7bfbbf7c0000 
In doscanstack1 gp: 0xc000251600 gp->stackcontext[0]: addr: 0xc0002522d0 val: 0x7bfbcc040000 
In doscanstack1 gp: 0xc000274d80 gp->stackcontext[0]: addr: 0xc000275a50 val: 0x7bfbbf4b0000 
In doscanstack1 gp: 0xc000276880 gp->stackcontext[0]: addr: 0xc000277550 val: 0x7bfbbf450000 
In doscanstack1 gp: 0xc000277600 gp->stackcontext[0]: addr: 0xc0002782d0 val: 0x7bfbbf420000 
In doscanstack1 gp: 0xc0002cc000 gp->stackcontext[0]: addr: 0xc0002cccd0 val: 0x7bfbbf2c0000 
In doscanstack1 gp: 0xc0002cdb00 gp->stackcontext[0]: addr: 0xc0002ce7d0 val: 0x7bfbbf260000 
In doscanstack1 gp: 0xc0002ccd80 gp->stackcontext[0]: addr: 0xc0002cda50 val: 0x7bfbbf290000 
In doscanstack1 gp: 0xc000307b00 gp->stackcontext[0]: addr: 0xc0003087d0 val: 0x7bfbbf200000 
In doscanstack1 gp: 0xc000308880 gp->stackcontext[0]: addr: 0xc000309550 val: 0x7bfbbf1d0000 
In doscanstack1 gp: 0xc000306d80 gp->stackcontext[0]: addr: 0xc000307a50 val: 0x7bfbbf230000 
In doscanstack1 gp: 0xc000309600 gp->stackcontext[0]: addr: 0xc00030a2d0 val: 0x7bfbbf1a0000 
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x7bfbd27b0000 pc=0x100920b4]
.....

@laboger
Contributor Author

laboger commented Sep 25, 2020

The problem happens because the first stack_segment structure assigned from __morestack_current_segment gets corrupted.

I used gdb to stop at __splitstack_find and displayed __morestack_current_segment and its contents. Then I set a watchpoint on the field containing the size.

Thread 1 "a.out" hit Breakpoint 2, __splitstack_find (segment_arg=0x0, sp=0x0, len=0x7fffcf32c560, next_segment=0xc000000e98, next_sp=0xc000000ea0, 
    initial_sp=0xc000000ea8) at ../../../gcc/libgcc/generic-morestack.c:915
915	  if (segment_arg == (void *) (uintptr_type) 1)
(gdb) display __morestack_current_segment
1: __morestack_current_segment = (struct stack_segment *) 0x7fffcf300000
(gdb) display *__morestack_current_segment
2: *__morestack_current_segment = {prev = 0x0, next = 0x0, size = 196552, old_stack = 0x0, dynamic_allocation = 0x0, free_dynamic_allocation = 0x0, extra = 0x0}
(gdb) x/2x 0x7fffcf300000
0x7fffcf300000:	0x00000000	0x00000000
(gdb) x/2x 0x7fffcf300000+16
0x7fffcf300010:	0x0002ffc8	0x00000000
...
(gdb) watch *0x7fffcf300010
Hardware watchpoint 3: *0x7fffcf300010
(gdb) c
Continuing.

Thread 1 "a.out" hit Breakpoint 2, __splitstack_find (segment_arg=0x0, sp=0x0, len=0x7fffcf32c560, next_segment=0xc000000e98, next_sp=0xc000000ea0, 
    initial_sp=0xc000000ea8) at ../../../gcc/libgcc/generic-morestack.c:907
907	__splitstack_find (void *segment_arg, void *sp, size_t *len,
1: __morestack_current_segment = (struct stack_segment *) 0x7fffcf300000
2: *__morestack_current_segment = {prev = 0x0, next = 0x0, size = 196552, old_stack = 0x0, dynamic_allocation = 0x0, free_dynamic_allocation = 0x0, extra = 0x0}
(gdb) c
Continuing.
setting segment: 0x7fffcf300000 from morestack_current_segment
DOWN len: 15104 segment: 0x7fffcf300000 segment->size: 196552 sp: 0x7fffcf32c500 morestack seg: 0x7fffcf300000

Thread 1 "a.out" hit Hardware watchpoint 3: *0x7fffcf300010

Old value = 196552
New value = -818937760
0x00007ffff763dc4c in backtrace_alloc (state=0x7fffcd940000, size=384, error_callback=0x7ffff6f40df0 <error_callback>, data=0x7fffcf32be10)
    at ../../../gcc/libbacktrace/mmap.c:132
132	  if (!state->threaded)
1: __morestack_current_segment = (struct stack_segment *) 0x7fffcf300000
2: *__morestack_current_segment = {prev = 0x0, next = 0x0, size = 140736669417568, old_stack = 0x7fffcd943308, dynamic_allocation = 0x7fffcf32ac50, 
  free_dynamic_allocation = 0x7, extra = 0x7fffcf32be10}
(gdb) bt
#0  0x00007ffff763dc4c in backtrace_alloc (state=0x7fffcd940000, size=384, error_callback=0x7ffff6f40df0 <error_callback>, data=0x7fffcf32be10)
    at ../../../gcc/libbacktrace/mmap.c:132
#1  0x00007ffff763df30 in backtrace_vector_grow (state=0x7fffcd940000, size=24, error_callback=0x7ffff6f40df0 <error_callback>, data=0x7fffcf32be10, 
    vec=0x7fffcf300780) at ../../../gcc/libbacktrace/mmap.c:270
#2  0x00007ffff762ef1c in add_function_range (state=<optimized out>, rdata=0x7fffcc538d10, lowpc=270108644, highpc=270108648, error_callback=<optimized out>, 
    data=<optimized out>, pvec=0x7fffcf300780) at ../../../gcc/libbacktrace/dwarf.c:3187
#3  0x00007ffff76310e0 in add_ranges_from_ranges (dwarf_sections=0x0, pcrange=<optimized out>, dwarf_sections=0x0, pcrange=<optimized out>, vec=0x7fffcf300780, 
    data=0x7fffcf32be10, error_callback=0x7ffff6f40df0 <error_callback>, rdata=0x7fffcc538d10, add_range=0x7ffff762eea0 <add_function_range>, base=269472992, 
    u=0x7fffcd7d99f8, is_bigendian=<optimized out>, base_address=0, state=0x7fffcd940000) at ../../../gcc/libbacktrace/dwarf.c:1705
#4  add_ranges (state=state@entry=0x7fffcd940000, dwarf_sections=dwarf_sections@entry=0x7fffcd943340, base_address=0, is_bigendian=<optimized out>, 
    u=u@entry=0x7fffcd7d99f8, base=<optimized out>, pcrange=pcrange@entry=0x7fffcf300388, add_range=add_range@entry=0x7ffff762eea0 <add_function_range>, 
    rdata=rdata@entry=0x7fffcc538d10, error_callback=error_callback@entry=0x7ffff6f40df0 <error_callback>, data=data@entry=0x7fffcf32be10, 
    vec=vec@entry=0x7fffcf300780) at ../../../gcc/libbacktrace/dwarf.c:1924
#5  0x00007ffff76331ec in read_function_entry (state=state@entry=0x7fffcd940000, ddata=ddata@entry=0x7fffcd943308, u=u@entry=0x7fffcd7d99f8, 
    base=<optimized out>, unit_buf=unit_buf@entry=0x7fffcf32ac50, lhdr=lhdr@entry=0x7fffcf32ad18, 
    error_callback=error_callback@entry=0x7ffff6f40df0 <error_callback>, data=data@entry=0x7fffcf32be10, vec_function=vec_function@entry=0x7fffcf32ad60, 
    vec_inlined=vec_inlined@entry=0x7fffcf300780) at ../../../gcc/libbacktrace/dwarf.c:3383
#6  0x00007ffff7632f20 in read_function_entry (state=state@entry=0x7fffcd940000, ddata=ddata@entry=0x7fffcd943308, u=u@entry=0x7fffcd7d99f8,
.... read_function_entry appears over and over until the end:
#292 0x00007ffff763450c in read_function_info (ret_addrs_count=<synthetic pointer>, ret_addrs=<synthetic pointer>, fvec=<optimized out>, u=0x7fffcd7d99f8, 
    data=<optimized out>, error_callback=<optimized out>, lhdr=0x7fffcf32ad18, ddata=<optimized out>, state=<optimized out>)
    at ../../../gcc/libbacktrace/dwarf.c:3497
#293 dwarf_lookup_pc (state=0x7fffcd940000, ddata=0x7fffcd943308, pc=<optimized out>, callback=0x7ffff6f41700 <callback>, 
    error_callback=0x7ffff6f40df0 <error_callback>, data=<optimized out>, found=0x7fffcf32af10) at ../../../gcc/libbacktrace/dwarf.c:3743
#294 0x00007ffff7635004 in dwarf_fileline (state=0x7fffcd940000, pc=270235995, callback=0x7ffff6f41700 <callback>, 
    error_callback=0x7ffff6f40df0 <error_callback>, data=0x7fffcf32be10) at ../../../gcc/libbacktrace/dwarf.c:3913
#295 0x00007ffff7636154 in backtrace_pcinfo (state=0x7fffcd940000, pc=270235995, callback=0x7ffff6f41700 <callback>, 
    error_callback=0x7ffff6f40df0 <error_callback>, data=0x7fffcf32be10) at ../../../gcc/libbacktrace/fileline.c:301
---Type <return> to continue, or q <return> to quit---
#296 0x00007ffff7636b80 in unwind (context=<optimized out>, vdata=0x7fffcf32bdb0) at ../../../gcc/libbacktrace/backtrace.c:91
#297 0x00007ffff629d5b8 in _Unwind_Backtrace () from /lib/powerpc64le-linux-gnu/libgcc_s.so.1
#298 0x00007ffff7636c4c in backtrace_full (state=0x7fffcd940000, skip=<optimized out>, callback=<optimized out>, error_callback=<optimized out>, 
    data=<optimized out>) at ../../../gcc/libbacktrace/backtrace.c:127
#299 0x00007ffff6f421ec in runtime_callers (skip=<optimized out>, locbuf=<optimized out>, m=<optimized out>, keep_thunks=<optimized out>)
    at ../../../gcc/libgo/runtime/go-callers.c:255
#300 0x00007ffff6f40994 in runtime.Caller (skip=<optimized out>) at ../../../gcc/libgo/runtime/go-caller.c:238
#301 0x00000000101b795c in runtime_test.lineNumber () at symtab_test.go:60
#302 runtime_test..import () at symtab_test.go:65
#303 0x00000000100f9bc0 in main.init () at _testmain.go:1
#304 0x00000000100b6098 in runtime.main (p.0=<optimized out>) at proc.go:219
#305 0x00000000100b99c8 in runtime.kickoff () at proc.go:1128
#306 0x00007ffff60c56ec in makecontext () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/makecontext.S:136
#307 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) 
(gdb) 
(gdb) x/i $pc
=> 0x7ffff763dc4c <backtrace_alloc+60>:	cmpdi   cr4,r9,0
(gdb) x/i $pc-4
   0x7ffff763dc48 <backtrace_alloc+56>:	stdu    r1,-80(r1)

So the stack is overflowing into the space used to hold the stack_segment structure for the first __morestack_current_segment.
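
The numbers in the gdb session fit this reading. `stdu r1,-80(r1)` stores the old r1 (the back chain) at the new r1. If the new r1 landed exactly on the size field at segment base + 16, the stored value would be that address plus 80. The sketch below checks the arithmetic against the values gdb printed. The constant and function names are made up for illustration, and the claim that the new r1 equals the watch address is an inference from the output, not something gdb printed directly.

```go
package main

import "fmt"

const (
	segBase   = uint64(0x7fffcf300000) // __morestack_current_segment from gdb
	sizeOff   = uint64(16)             // offset where x/2x showed the size field
	frameSize = uint64(80)             // from "stdu r1,-80(r1)"
)

// backChain returns the value stdu stores at the new stack pointer:
// the caller's stack pointer, which is newSP + frame.
func backChain(newSP, frame uint64) uint64 {
	return newSP + frame
}

func main() {
	newSP := segBase + sizeOff // assumption: new r1 landed on the size field
	stored := backChain(newSP, frameSize)
	fmt.Printf("stored=%#x (%d)\n", stored, stored)
	// The 4-byte watchpoint sees only the low word, as a signed int.
	fmt.Println("low word as int32:", int32(uint32(stored)))
}
```

The computed value, 140736669417568, is the corrupted size gdb displays, and its low word as a signed 32-bit integer is the watchpoint's "New value" of -818937760.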

@ianlancetaylor
Contributor

Thanks. This is Go code that is calling into C code. The C code is (presumably) not compiled with -fsplit-stack. The intent for this kind of code is that the linker will detect the call from split-stack code to non-split-stack code and convert the split-stack function prologue sequence to call __morestack_non_split rather than __morestack. In this case that should be happening in runtime_callers. After linking, the runtime_callers function should be unconditionally calling __morestack_non_split. So disassemble runtime_callers to see what it does. (This assumes that you are using the gold linker.)

@laboger
Contributor Author

laboger commented Sep 28, 2020

I see the specs file has this rule for the link command:
%{fsplit-stack: -fuse-ld=gold --wrap=pthread_create}
Which I recall is intended to force the use of the gold linker when using -fsplit-stack. There is also something in the configure to verify that a valid gold version exists on the system before allowing split stack and that is true for the build system.
If I do an objdump of runtime_callers in libgo.so.17.0.0 in the build directory:

0000000000af2130 <runtime_callers>:
  af2130:       fd 00 4c 3c     addis   r2,r12,253
  af2134:       d0 77 42 38     addi    r2,r2,30672
  af2138:       c0 8f 0d e8     ld      r0,-28736(r13)
  af213c:       80 bf 81 39     addi    r12,r1,-16512
  af2140:       00 00 00 60     nop
  af2144:       40 00 ac 7f     cmpld   cr7,r12,r0
  af2148:       d8 02 9c 41     blt     cr7,af2420 <runtime_callers+0x2f0>
  af214c:       a6 02 08 7c     mflr    r0
  af2150:       e8 ff a1 fb     std     r29,-24(r1)
  af2154:       f0 ff c1 fb     std     r30,-16(r1)
  af2158:       00 00 20 39     li      r9,0
  af215c:       f8 ff e1 fb     std     r31,-8(r1)
  af2160:       c0 ff 01 fb     std     r24,-64(r1)
  af2164:       01 00 63 38     addi    r3,r3,1
  af2168:       78 23 9f 7c     mr      r31,r4
  af216c:       c8 ff 21 fb     std     r25,-56(r1)
  af2170:       d0 ff 41 fb     std     r26,-48(r1)
  af2174:       78 33 dd 7c     mr      r29,r6
  af2178:       d8 ff 61 fb     std     r27,-40(r1)
  af217c:       e0 ff 81 fb     std     r28,-32(r1)
  af2180:       00 00 00 60     nop
  af2184:       08 81 c2 eb     ld      r30,-32504(r2)
  af2188:       10 00 01 f8     std     r0,16(r1)
  af218c:       81 ff 21 f8     stdu    r1,-128(r1)

....

I don't see any morestack function? The go-callers.o file has the same objdump. I can see in the build log that -fsplit-stack was used. Not sure if this is the way it works with a shared libgo.

Another thing that seems odd is that -fsplit-stack is on the link command when building libgo.so which should force -fuse-ld=gold but if I run the link command from the log by hand with the -v option, it looks like -fsplit-stack was removed before invoking the gold linker.
I found some code in go/gospec.c that might possibly be omitting -fsplit-stack when invoking the linker. Not sure if this is the reason for the unexpected code in runtime_callers. I don't see anywhere that morestack functions would be called.

@ianlancetaylor
Contributor

This is the branch to where __morestack is called:

 af2148:       d8 02 9c 41     blt     cr7,af2420 <runtime_callers+0x2f0>

Since calls to __morestack are relatively unlikely, the compiler moves them out of line, typically to near the end of the function. You need to disassemble that part of the function.

@ianlancetaylor
Contributor

OK, I just looked at the code in gold and I'm not sure I fully understand it. Unlike the x86_64 version which I wrote, the PowerPC version never changes the function to call __morestack_non_split. Instead, it asks for an additional amount of stack space, as set by the --split-stack-adjust-size option, whose default is 0x4000. In this case that does not appear to be large enough. Try building the test with -Wl,--split-stack-adjust-size=0x8000 and see what happens.

@amodra

amodra commented Oct 7, 2020

I don't understand the x86_64 code fully. If do_calls_non_split finds a "cmpq %fs:112,%rsp", it changes that comparison to always fail. Then there are comparisons against "lea offset(%rsp),%r10" or %r11, and in that case gold makes the offset more negative by split_stack_adjust_size. In all cases the call is changed from __morestack to __morestack_non_split. I think that would mean the lea case would allow a call to a non-split-stack function with stack space of split_stack_adjust_size (plus the tcb guard amount of 256 bytes), if the available stack is more than split_stack_adjust_size, ie. there would not necessarily be a call to __morestack_non_split. However, the cmp case, which gcc emits for functions that require fewer than 256 bytes for their own stack frame, would always call __morestack_non_split. __morestack_non_split ensures 0x100000 bytes of stack! So we have a rather weird effect that small stack frame functions calling non-split-stack functions get much more space, at least with the default --split-stack-adjust-size.

In contrast, the ppc64 code always adds split_stack_adjust_size to the current function's requested stack frame size, compares that against the tcb limit, and calls __morestack if not enough stack is available. ppc64 doesn't need a special __morestack_non_split that repeats the stack limit comparison because unlike x86 we don't nop out the comparison. (With a "cmp %fs:112,%rsp" style prologue x86 loses that compare so possible recursion might lead to excessive stacks without the __morestack_non_split comparison.)
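
As a rough model of the ppc64 behaviour described above (not the actual gold or libgcc code), the prologue check can be sketched as follows. `needsMoreStack` and its parameters are hypothetical names, and the limit and stack pointer values are made up.

```go
package main

import "fmt"

// needsMoreStack models the ppc64 prologue test as described in this thread:
// the requested frame plus the linker's adjustment is subtracted from the
// stack pointer and compared with the limit kept in the TCB.
// All names are illustrative, not identifiers from gold or libgcc.
// The model assumes sp is larger than frame+adjust (no wraparound).
func needsMoreStack(sp, frame, adjust, limit uint64) bool {
	return sp-(frame+adjust) < limit
}

func main() {
	const limit = 0x100000       // made-up stack limit
	sp := uint64(limit + 0x6000) // 24 KiB of stack left above the limit
	fmt.Println(needsMoreStack(sp, 128, 0, limit))      // no adjustment: false
	fmt.Println(needsMoreStack(sp, 128, 0x4000, limit)) // default adjustment: false
	fmt.Println(needsMoreStack(sp, 128, 0x8000, limit)) // larger adjustment: true
}
```

In this model the check only guarantees roughly `adjust` bytes for the non-split-stack callee. A callee that needs more than that can still run past the end of the segment.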

@laboger
Contributor Author

laboger commented Oct 7, 2020

Try building the test with -Wl,--split-stack-adjust-size=0x8000 and see what happens.

I tried this and it didn't help, assuming it just needs to be added to the link of the binary:

/home/boger/gccgo-git/bld/./gcc/gccgo -B/home/boger/gccgo-git/bld/./gcc/ -B/usr/local/gccgo.trunk/powerpc64le-linux/bin/ -B/usr/local/gccgo.trunk/powerpc64le-linux/lib/ -isystem /usr/local/gccgo.trunk/powerpc64le-linux/include -isystem /usr/local/gccgo.trunk/powerpc64le-linux/sys-include -g -O2 -fgo-compiling-runtime -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs -g -fgo-pkgpath=runtime -c -I . -fno-toplevel-reorder -o _gotest_.o export_debuglog_test.go export_linux_test.go export_mmap_test.go export_test.go export_unix_test.go proc_runtime_test.go alg.go atomic_pointer.go cgo_gccgo.go cgocall.go cgocheck.go chan.go compiler.go cpuprof.go cputicks.go debug.go debuglog.go debuglog_off.go env_posix.go eqtype.go error.go extern.go fastlog2.go fastlog2table.go ffi.go float.go hash64.go heapdump.go iface.go lfstack.go lfstack_64bit.go lock_futex.go lockrank.go lockrank_off.go malloc.go map.go map_fast32.go map_fast64.go map_faststr.go mbarrier.go mbitmap.go mcache.go mcentral.go mem_gccgo.go mfinal.go mfixalloc.go mgc.go mgc_gccgo.go mgcmark.go mgcscavenge.go mgcsweep.go mgcsweepbuf.go mgcwork.go mheap.go mpagealloc.go mpagealloc_64bit.go mpagecache.go mpallocbits.go mprof.go mranges.go msan0.go msize.go mspanset.go mstats.go mwbbuf.go nbpipe_pipe2.go netpoll.go netpoll_epoll.go os_gccgo.go os_linux.go os_linux_ppc64x.go panic.go panic32.go preempt.go preempt_nonwindows.go print.go proc.go profbuf.go proflabel.go race0.go rdebug.go relax_stub.go runtime.go runtime1.go runtime2.go rwmutex.go select.go sema.go signal_gccgo.go signal_unix.go sigqueue.go sigqueue_note.go sizeclasses.go slice.go string.go stubs.go stubs2.go stubs3.go stubs_linux.go symtab.go time.go time_nofake.go timestub.go timestub2.go trace.go traceback_gccgo.go type.go typekind.go utf8.go write_err.go runtime_sysinfo.go sigtab.go
/home/boger/gccgo-git/bld/./gcc/gccgo -B/home/boger/gccgo-git/bld/./gcc/ -B/usr/local/gccgo.trunk/powerpc64le-linux/bin/ -B/usr/local/gccgo.trunk/powerpc64le-linux/lib/ -isystem /usr/local/gccgo.trunk/powerpc64le-linux/include -isystem /usr/local/gccgo.trunk/powerpc64le-linux/sys-include -g -O2 -fgo-compiling-runtime -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs -g -fgo-pkgpath=runtime_test -c -I . -fno-toplevel-reorder -o _xtest_.o _first_test.go callers_test.go chan_test.go chanbarrier_test.go checkptr_test.go closure_test.go complex_test.go crash_cgo_test.go crash_gccgo_test.go crash_test.go crash_unix_test.go debuglog_test.go defer_test.go env_test.go example_test.go fastlog2_test.go gc_test.go hash_test.go iface_test.go lfstack_test.go malloc_test.go map_benchmark_test.go map_test.go memmove_test.go mfinal_test.go mgcscavenge_test.go mpagealloc_test.go mpagecache_test.go mpallocbits_test.go nbpipe_test.go netpoll_os_test.go norace_test.go panic_test.go proc_test.go profbuf_test.go rand_test.go runtime-lldb_test.go runtime_mmap_test.go runtime_test.go runtime_unix_test.go rwmutex_test.go sema_test.go semasleep_test.go sizeof_test.go slice_test.go stack_test.go string_test.go symtab_test.go time_test.go
/home/boger/gccgo-git/bld/./gcc/gccgo -B/home/boger/gccgo-git/bld/./gcc/ -B/usr/local/gccgo.trunk/powerpc64le-linux/bin/ -B/usr/local/gccgo.trunk/powerpc64le-linux/lib/ -isystem /usr/local/gccgo.trunk/powerpc64le-linux/include -isystem /usr/local/gccgo.trunk/powerpc64le-linux/sys-include -g -O2 -fgo-compiling-runtime -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs -g -c _testmain.go
/home/boger/gccgo-git/bld/./gcc/gccgo -B/home/boger/gccgo-git/bld/./gcc/ -B/usr/local/gccgo.trunk/powerpc64le-linux/bin/ -B/usr/local/gccgo.trunk/powerpc64le-linux/lib/ -isystem /usr/local/gccgo.trunk/powerpc64le-linux/include -isystem /usr/local/gccgo.trunk/powerpc64le-linux/sys-include -g -O2 -fgo-compiling-runtime -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs _gotest_.o _testmain.o _xtest_.o -Wl,--split-stack-adjust-size=0x8000 -lm
./a.out -test.short -test.timeout=600s
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x7ad224900000 pc=0x10092124]

@laboger
Contributor Author

laboger commented Oct 7, 2020

I added an option to get verbose output on the build of the test:

/home/boger/gccgo-git/bld/./gcc/gccgo -B/home/boger/gccgo-git/bld/./gcc/ -B/usr/local/gccgo.trunk/powerpc64le-linux/bin/ -B/usr/local/gccgo.trunk/powerpc64le-linux/lib/ -isystem /usr/local/gccgo.trunk/powerpc64le-linux/include -isystem /usr/local/gccgo.trunk/powerpc64le-linux/sys-include -g -O2 -fgo-compiling-runtime -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L /home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs _gotest_.o _testmain.o _xtest_.o -v -Wl,--split-stack-adjust-size=0x10000 -lm
Reading specs from /home/boger/gccgo-git/bld/./gcc/specs
COLLECT_GCC=/home/boger/gccgo-git/bld/./gcc/gccgo
COLLECT_LTO_WRAPPER=/home/boger/gccgo-git/bld/./gcc/lto-wrapper
Target: powerpc64le-linux
Configured with: ../gcc/configure --target=powerpc64le-linux --host=powerpc64le-linux --build=powerpc64le-linux --disable-bootstrap --prefix=/usr/local/gccgo.trunk --enable-languages=c,c++,go
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20201005 (experimental) (GCC) 
COMPILER_PATH=/home/boger/gccgo-git/bld/./gcc/
LIBRARY_PATH=/home/boger/gccgo-git/bld/./gcc/:/lib/powerpc64le-linux-gnu/:/lib/../lib64/:/usr/lib/powerpc64le-linux-gnu/:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-fsplit-stack' '-B' '/home/boger/gccgo-git/bld/./gcc/' '-B' '/usr/local/gccgo.trunk/powerpc64le-linux/bin/' '-B' '/usr/local/gccgo.trunk/powerpc64le-linux/lib/' '-isystem' '/usr/local/gccgo.trunk/powerpc64le-linux/include' '-isystem' '/usr/local/gccgo.trunk/powerpc64le-linux/sys-include' '-g' '-O2' '-fgo-compiling-runtime' '-L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo' '-L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs' '-v' '-shared-libgcc' '-dumpdir' 'a.'
 /home/boger/gccgo-git/bld/./gcc/collect2 -plugin /home/boger/gccgo-git/bld/./gcc/liblto_plugin.so -plugin-opt=/home/boger/gccgo-git/bld/./gcc/lto-wrapper -plugin-opt=-fresolution=/tmp/ccN1pRM1.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --eh-frame-hdr -V -m elf64lppc -dynamic-linker /lib64/ld64.so.2 /usr/lib/powerpc64le-linux-gnu/crt1.o /usr/lib/powerpc64le-linux-gnu/crti.o /home/boger/gccgo-git/bld/./gcc/crtbegin.o -L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo -L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs -L/home/boger/gccgo-git/bld/./gcc -L/lib/powerpc64le-linux-gnu -L/lib/../lib64 -L/usr/lib/powerpc64le-linux-gnu _gotest_.o _testmain.o _xtest_.o --split-stack-adjust-size=0x10000 -lgobegin -lgo -lm -fuse-ld=gold --wrap=pthread_create -lgcc_s -lgcc -lc -lgcc_s -lgcc /home/boger/gccgo-git/bld/./gcc/crtend.o /usr/lib/powerpc64le-linux-gnu/crtn.o
GNU gold (GNU Binutils for Ubuntu 2.30) 1.15
  Supported targets:
   elf64-powerpcle
   elf64-powerpc
   elf32-powerpcle
   elf32-powerpc
  Supported emulations:
   elf64lppc
   elf64ppc
   elf32lppc
   elf32ppc
COLLECT_GCC_OPTIONS='-fsplit-stack' '-B' '/home/boger/gccgo-git/bld/./gcc/' '-B' '/usr/local/gccgo.trunk/powerpc64le-linux/bin/' '-B' '/usr/local/gccgo.trunk/powerpc64le-linux/lib/' '-isystem' '/usr/local/gccgo.trunk/powerpc64le-linux/include' '-isystem' '/usr/local/gccgo.trunk/powerpc64le-linux/sys-include' '-g' '-O2' '-fgo-compiling-runtime' '-L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo' '-L/home/boger/gccgo-git/bld/powerpc64le-linux/libgo/.libs' '-v' '-shared-libgcc' '-dumpdir' 'a.'
./a.out -test.short -test.timeout=600s
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x70e6ba930000 pc=0x10092124]

I'm not sure whether this is a problem, but I don't see -fsplit-stack being passed to gold. The -fsplit-stack option appears in COLLECT_GCC_OPTIONS, and it must be there for -fuse-ld=gold to be added.

@ianlancetaylor
Contributor

@amodra You're right, the approaches used on x86_64 and ppc64 are consistent. The always-fail case is used on x86 because there isn't room to add in adjust_split_stack_size.

Originally I thought that the default value for --split-stack-adjust-size, 0x4000, would normally be sufficient. That turned out to be incorrect, and I bumped the value in libgcc up to 0x100000 in 2012. I think that in practice this has been working because most Go functions that call C functions are small wrappers with small stack frames.

@laboger It's normal that -fsplit-stack is not passed to gold itself. It's not required. What's interesting is whether --split-stack-adjust-size is being passed, and, in your example, it is.

Can you try increasing the value even more? You are using 0x10000, but in effect x86_64 is using 0x100000.
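
For comparison, the adjust sizes mentioned in this thread can be converted to KiB with plain shell arithmetic. This is only an illustrative sketch; it does not involve gccgo or gold, and the list of values is simply collected from the comments and logs above.

```shell
# Convert each --split-stack-adjust-size value quoted in this thread to KiB.
# Shell arithmetic accepts hexadecimal constants written with a 0x prefix.
for size in 0x4000 0x8000 0x10000 0x80000 0x100000; do
  echo "$size = $(( $size / 1024 )) KiB"
done
```

The value 0x10000 used in the verbose link above is one sixteenth of 0x100000.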

@laboger
Contributor Author

laboger commented Oct 7, 2020

I used --split-stack-adjust-size=0x80000 and that worked. Now I just get the MemmoveAtomicity errors reported in #41428. I'm running the full Go testsuite now with this split-stack-adjust-size value.

@Emegua

Emegua commented Oct 27, 2020

Where can I set --split-stack-adjust-size=0x80000? At compile time?

@amodra

amodra commented Oct 27, 2020

--split-stack-adjust-size is a gold linker option, so it is set at link time.
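
As a rough sketch of what that looks like, the link commands earlier in this thread hand the option to the linker through the gccgo driver with -Wl. The snippet below only assembles and prints such a command rather than running it; the object file names are the ones from the test build above, and the size is the value reported to work. Adapt both to your own build.

```shell
# Size reported to work in this thread; treat it as a starting point, not a rule.
ADJUST=0x80000

# -Wl,<option> tells the compiler driver to pass <option> through to the linker.
LINK_FLAG="-Wl,--split-stack-adjust-size=${ADJUST}"

# Print the link command instead of executing it, so this runs without gccgo installed.
LINK_CMD="gccgo _gotest_.o _testmain.o _xtest_.o ${LINK_FLAG} -lm"
echo "$LINK_CMD"
```

The flag belongs on the final link step only; the earlier compile steps that use -c do not invoke the linker.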

@laboger
Contributor Author

laboger commented Nov 17, 2020

@amodra submitted a change to gold to increase the default split stack size. I'm trying to verify it now, and it seems to fix TestChan, but I'm seeing other errors in the runtime test. Most of them look something like this:

=== RUN   TestNoShrinkStackWhileParking
runtime: marked free object in span 0x7fff61213c98, elemsize=16 freeindex=0 (bad use of unsafe.Pointer? try -d=checkptr)
0xc000480000 free  unmarked
0xc000480010 free  unmarked
0xc000480020 free  unmarked
0xc000480030 free  unmarked
0xc000480040 free  unmarked
0xc000480050 free  unmarked
0xc000480060 free  unmarked
0xc000480070 alloc marked  
0xc000480080 alloc marked  
0xc000480090 alloc marked  
0xc0004800a0 alloc marked  
0xc0004800b0 alloc marked  
0xc0004800c0 alloc marked  
0xc0004800d0 alloc marked  
0xc0004800e0 alloc marked  
... lots more like this

I'm not sure whether -d=checkptr works in gccgo or, if it does, how I would set it.
