Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: crash in sigtramp #10534

Closed
dvyukov opened this issue Apr 22, 2015 · 5 comments
Closed

runtime: crash in sigtramp #10534

dvyukov opened this issue Apr 22, 2015 · 5 comments
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Apr 22, 2015

go version devel +87054c4 Wed Apr 22 02:50:48 2015 +0000 linux/amd64

A program silently crashed here (from a core file):

(gdb) bt
#0  runtime.sigtramp () at /ssd/src/go10/src/runtime/sys_linux_amd64.s:232
#1  0x0000000000469690 in runtime.sigtramp () at /ssd/src/go10/src/runtime/sys_linux_amd64.s:246
#2  0x0000000000000001 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) disass
Dump of assembler code for function runtime.sigtramp:
   0x0000000000469620 <+0>: sub    $0x40,%rsp
   0x0000000000469624 <+4>: mov    %fs:0xfffffffffffffff8,%r10
   0x000000000046962d <+13>:    cmp    $0x0,%r10
   0x0000000000469631 <+17>:    jne    0x469645 <runtime.sigtramp+37>
   0x0000000000469633 <+19>:    mov    %rdi,(%rsp)
   0x0000000000469637 <+23>:    lea    -0x1d60e(%rip),%rax        # 0x44c030 <runtime.badsignal>
   0x000000000046963e <+30>:    callq  *%rax
   0x0000000000469640 <+32>:    add    $0x40,%rsp
   0x0000000000469644 <+36>:    retq   
   0x0000000000469645 <+37>:    mov    %r10,0x28(%rsp)
   0x000000000046964a <+42>:    mov    0xb8(%r10),%rax
=> 0x0000000000469651 <+49>:    mov    0x48(%rax),%rax
   0x0000000000469655 <+53>:    mov    %rax,%fs:0xfffffffffffffff8
   0x000000000046965e <+62>:    mov    %rdi,(%rsp)
   0x0000000000469662 <+66>:    mov    %rsi,0x8(%rsp)
   0x0000000000469667 <+71>:    mov    %rdx,0x10(%rsp)
   0x000000000046966c <+76>:    mov    %r10,0x18(%rsp)
   0x0000000000469671 <+81>:    callq  0x44abd0 <runtime.sighandler>
   0x0000000000469676 <+86>:    mov    0x28(%rsp),%r10
   0x000000000046967b <+91>:    mov    %r10,%fs:0xfffffffffffffff8
   0x0000000000469684 <+100>:   add    $0x40,%rsp
   0x0000000000469688 <+104>:   retq   
   0x0000000000469689 <+105>:   add    %al,(%rax)
   0x000000000046968b <+107>:   add    %al,(%rax)
   0x000000000046968d <+109>:   add    %al,(%rax)
   0x000000000046968f <+111>:   add    %bh,0xf(%rax)
End of assembler dump.
(gdb) info registers
rax            0x0  0
rbx            0x0  0
rcx            0xc20807ab40 833358375744
rdx            0xc2097ebac0 833382955712
rsi            0xc2097ebbf0 833382956016
rdi            0xb  11
rbp            0x8  0x8
rsp            0xc2097eba78 0xc2097eba78
r8             0x10 16
r9             0x0  0
r10            0xc20807ab40 833358375744
r11            0x216    534
r12            0x1  1
r13            0xeccc98a8a  63565302410
r14            0xeccc98a8b  63565302411
r15            0xa3b0c0 10727616
rip            0x469651 0x469651 <runtime.sigtramp+49>
eflags         0x10202  [ IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0

It means that g != nil but g.m == nil. The goroutine descriptor looks valid:

(gdb) p /x $r10
$1 = 0xc20807ab40
(gdb) x/100g 0xc20807ab40
0xc20807ab40:   0x000000c208953000  0x000000c208954000
0xc20807ab50:   0x000000c208953280  0xffffffffffffffff
0xc20807ab60:   0x0000000000000000  0x0000000000000000
0xc20807ab70:   0x0000000000000000  0x0000000000465f90
0xc20807ab80:   0x000000c20807ab40  0x0000000000000000
0xc20807ab90:   0x0000000000000000  0x0000000000000000
0xc20807aba0:   0x000000000045ace0  0x0000000000000000
0xc20807abb0:   0x000000000051c0f5  0x0000000000000000
0xc20807abc0:   0x0000000000000002  0x0000000000000017
0xc20807abd0:   0x0000000000000000  0x000000000084ddb0
0xc20807abe0:   0x000000000000000c  0x000000c20807a8c0
0xc20807abf0:   0x0000000001000000  0x0000000000000000
0xc20807ac00:   0x0000000000000000  0x0000000000000000
0xc20807ac10:   0x0000000000000000  0x0000000000000000
0xc20807ac20:   0x0000000000000000  0x0000000000000000
0xc20807ac30:   0x0000000000000000  0x0000000000000000
0xc20807ac40:   0x000000000040a1b6  0x000000000040a1f0
0xc20807ac50:   0x0000000000000000  0x0000000000000000
0xc20807ac60:   0x0000000000000000  0x0000000000002d40
0xc20807ac70:   0x000000000000015f  0x0000000000000000
0xc20807ac80:   0x000000c208e1f000  0x000000c208e20000
0xc20807ac90:   0x000000c208e1f280  0xffffffffffffffff
0xc20807aca0:   0x0000000000000000  0x0000000000000000
0xc20807acb0:   0x000000c208e1f950  0x0000000000465f90
0xc20807acc0:   0x000000c20807ac80  0x0000000000000000
0xc20807acd0:   0x0000000000000000  0x0000000000000000
0xc20807ace0:   0x000000000045ad20  0x0000000000000000
0xc20807acf0:   0x000000000051c0f5  0x0000000000000000
0xc20807ad00:   0x0000000000000002  0x0000000000000018
0xc20807ad10:   0x0000000000000000  0x000000000084ddb0
0xc20807ad20:   0x000000000000000c  0x000000c20807bb80
0xc20807ad30:   0x0000000001000000  0x000000c2099c4000
0xc20807ad40:   0x0000000000000000  0x0000000000000000
0xc20807ad50:   0x0000000000000000  0x0000000000000000
0xc20807ad60:   0x0000000000000000  0x0000000000000000
0xc20807ad70:   0x0000000000000000  0x0000000000000000
0xc20807ad80:   0x000000000040a1b6  0x000000000040a1f0
0xc20807ad90:   0x0000000000000000  0x0000000000000000
0xc20807ada0:   0x0000000000000000  0x0000000000006620
0xc20807adb0:   0x0000000000000322  0x0000000000000000
0xc20807adc0:   0x000000c208aae000  0x000000c208aaf000
0xc20807add0:   0x000000c208aae280  0xffffffffffffffff
0xc20807ade0:   0x0000000000000000  0x0000000000000000
0xc20807adf0:   0x000000c208aae640  0x000000000051c0f5
0xc20807ae00:   0x000000c20807adc0  0x0000000000000000
0xc20807ae10:   0x0000000000000000  0x0000000000000000

This is goroutine 17:

(gdb) info goroutines
* 17 syscall  runtime.goexit

I don't any obvious place where we setg(g) with g.m==nil.

Core file and the binary are here:
https://drive.google.com/file/d/0B20Uwp8Hs1oCdVpieWdfZy1PN3M/view?usp=sharing

@dvyukov dvyukov added this to the Go1.5 milestone Apr 22, 2015
@dvyukov
Copy link
Member Author

dvyukov commented Apr 22, 2015

/cc @rsc @randall77 @RLH @aclements

@aclements
Copy link
Member

Could you say a bit about what the program was doing? Is it reproducable?

I wonder if this is the same silent crash that we're seeing sometimes in the runtime/pprof test. E.g., http://build.golang.org/log/10670960d0c768b1c68b429984beaaa004281db7 (linux-386-sid on commit 58c1c01).

@dvyukov
Copy link
Member Author

dvyukov commented Apr 22, 2015

The program is fuzzer:
https://github.com/dvyukov/go-fuzz/tree/master/go-fuzz
It starts subprocesses and communicates with then over pipes.

It is well possible that it is the same crash we are seeing on bots, since pprof tests sends signal. Mine crash is not reproducible.

@dvyukov
Copy link
Member Author

dvyukov commented Apr 22, 2015

Another similar crash:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  runtime.sigtramp () at /ssd/src/go10/src/runtime/sys_linux_amd64.s:232
232     MOVQ    m_gsignal(AX), AX
Loading Go Runtime support.
(gdb) bt
#0  runtime.sigtramp () at /ssd/src/go10/src/runtime/sys_linux_amd64.s:232
#1  0x0000000000469690 in runtime.sigtramp () at /ssd/src/go10/src/runtime/sys_linux_amd64.s:246
#2  0x0000000000000001 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) disass
Dump of assembler code for function runtime.sigtramp:
   0x0000000000469620 <+0>: sub    $0x40,%rsp
   0x0000000000469624 <+4>: mov    %fs:0xfffffffffffffff8,%r10
   0x000000000046962d <+13>:    cmp    $0x0,%r10
   0x0000000000469631 <+17>:    jne    0x469645 <runtime.sigtramp+37>
   0x0000000000469633 <+19>:    mov    %rdi,(%rsp)
   0x0000000000469637 <+23>:    lea    -0x1d60e(%rip),%rax        # 0x44c030 <runtime.badsignal>
   0x000000000046963e <+30>:    callq  *%rax
   0x0000000000469640 <+32>:    add    $0x40,%rsp
   0x0000000000469644 <+36>:    retq   
   0x0000000000469645 <+37>:    mov    %r10,0x28(%rsp)
   0x000000000046964a <+42>:    mov    0xb8(%r10),%rax
=> 0x0000000000469651 <+49>:    mov    0x48(%rax),%rax
   0x0000000000469655 <+53>:    mov    %rax,%fs:0xfffffffffffffff8
   0x000000000046965e <+62>:    mov    %rdi,(%rsp)
   0x0000000000469662 <+66>:    mov    %rsi,0x8(%rsp)
   0x0000000000469667 <+71>:    mov    %rdx,0x10(%rsp)
   0x000000000046966c <+76>:    mov    %r10,0x18(%rsp)
   0x0000000000469671 <+81>:    callq  0x44abd0 <runtime.sighandler>
   0x0000000000469676 <+86>:    mov    0x28(%rsp),%r10
   0x000000000046967b <+91>:    mov    %r10,%fs:0xfffffffffffffff8
   0x0000000000469684 <+100>:   add    $0x40,%rsp
   0x0000000000469688 <+104>:   retq   
   0x0000000000469689 <+105>:   add    %al,(%rax)
   0x000000000046968b <+107>:   add    %al,(%rax)
   0x000000000046968d <+109>:   add    %al,(%rax)
   0x000000000046968f <+111>:   add    %bh,0xf(%rax)
End of assembler dump.
(gdb) info registers
rax            0x0  0
rbx            0x0  0
rcx            0xc2083ac640 833361724992
rdx            0xc209347ac0 833378089664
rsi            0xc209347bf0 833378089968
rdi            0xb  11
rbp            0x8  0x8
rsp            0xc209347a78 0xc209347a78
r8             0x5537bef7   1429716727
r9             0x112e0be826d694b3   1237940039285380275
r10            0xc2083ac640 833361724992
r11            0x7b 123
r12            0x18251674   405083764
r13            0xeccc9b5f6  63565313526
r14            0xeccc9b5f7  63565313527
r15            0xa3b0c0 10727616
rip            0x469651 0x469651 <runtime.sigtramp+49>
eflags         0x10202  [ IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0
(gdb) x/100g 0xc2083ac640
0xc2083ac640:   833373249536    833373257728
0xc2083ac650:   -1314   -1
0xc2083ac660:   0   0
0xc2083ac670:   0   4439798
0xc2083ac680:   833361724992    0
0xc2083ac690:   0   0
0xc2083ac6a0:   0   0
0xc2083ac6b0:   5357813 0
0xc2083ac6c0:   2   28
0xc2083ac6d0:   0   8707504
0xc2083ac6e0:   12  833361723712
0xc2083ac6f0:   16777217    0
0xc2083ac700:   0   0
0xc2083ac710:   0   0
0xc2083ac720:   0   0
0xc2083ac730:   0   0
0xc2083ac740:   4235702 4235760
0xc2083ac750:   0   0
0xc2083ac760:   0   886464
0xc2083ac770:   6445    0
0xc2083ac780:   833379041280    833379074048
0xc2083ac790:   833379041920    833379041920
0xc2083ac7a0:   0   0
0xc2083ac7b0:   0   0
0xc2083ac7c0:   0   0
0xc2083ac7d0:   0   0
0xc2083ac7e0:   0   0
0xc2083ac7f0:   0   0
0xc2083ac800:   0   0
0xc2083ac810:   0   0
0xc2083ac820:   0   0
0xc2083ac830:   0   833360652288
0xc2083ac840:   0   0
0xc2083ac850:   0   0
0xc2083ac860:   0   0
0xc2083ac870:   0   0
0xc2083ac880:   0   0
0xc2083ac890:   0   0
0xc2083ac8a0:   0   0
0xc2083ac8b0:   0   0
0xc2083ac8c0:   833361612800    833361614848
0xc2083ac8d0:   833361613440    -1
0xc2083ac8e0:   0   0
0xc2083ac8f0:   833361614008    4434515
0xc2083ac900:   833361725632    0
0xc2083ac910:   0   0
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) info goroutines
  28 running  runtime.readyExecute

@aclements
Copy link
Member

We've fixed several memory corruption problems in the runtime since this was reported and also rewritten this part of the signal handling code. Since this is most likely fixed, I'm closing the issue, but please reopen if it is still happening.

I tried, unsuccessfully, to map this to a specific issue that was fixed. Below is a log of what I figured out from the core file and the version you were running.

In your first trace, it was actually goroutine 23 (0x17):

(gdb) info goroutines
  23 running  runtime.systemstack_switch

Not very enlightening, unfortunately. And since it's running, probably a lie.

The signal itself reveals some of what was happening at the time. rdi is 11, so this is a SIGSEGV. rdx is the signal context (type *ucontext), so we can find out where caused the original segfault:

(gdb) x/xg $rdx + 40 + 120
0xc2097ebb60:   0x000000c208953410  # rsp
0xc2097ebb68:   0x000000000041ebb9  # rip
(gdb) list *0x000000000041ebb9
0x41ebb9 is in runtime.mallocgc (/ssd/src/go10/src/runtime/malloc.go:511).
511   mp := acquirem()

So the original cause of the SIGSEGV was also that g.m==nil.

Annoyingly, gdb won't let me trick it in to unwinding the pre-signal stack on a core file, but groveling through the stack for plausible return RIPs yields:

(gdb) list *0x000000000041f549
0x41f549 is in runtime.newobject (/ssd/src/go10/src/runtime/malloc.go:725).
725 return mallocgc(uintptr(typ.size), typ, flags)
(gdb) list *0x00000000005b655d
0x5b655d is in reflect.(*structType).Field (/ssd/src/go10/src/reflect/type.go:850).
850 f.Index = []int{i}
(gdb) list *0x00000000005b51cd
0x5b51cd is in reflect.(*rtype).Field (/ssd/src/go10/src/reflect/type.go:606).
606 return tt.Field(i)
(gdb) list *0x00000000005624eb
0x5624eb is in encoding/binary.sizeof (/ssd/src/go10/src/encoding/binary/binary.go:353).
353 s := sizeof(t.Field(i).Type)

So we're not in anything strange. We're just running regular user code and it's allocating a regular object.

Taking a different track (suggested by your original post), I found only two places where we can legitimately set g.m to nil. There are other places where we change g.m, but they would all clearly crash immediately if g.m were set to nil. We set g.m to nil in dropg, but it also sets m.curg to nil. M 31 has curg set to 0xc20807ab40, the g that crashed, so unless two Ms were running that G, g.m wasn't set to nil in dropg. We set g.m to nil in goexit0, but we also cas the status to _Gdead and the g that crashed is in status _Grunning.

At this point, there seems to be no plausible way g.m could be nil other than memory corruption.

@golang golang locked and limited conversation to collaborators Jun 25, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants