runtime: SIGSEGV while handling SIGSEGV #13978

Closed
keyneston opened this issue Jan 15, 2016 · 44 comments

@keyneston

It appears that my program is segfaulting (receiving a SIGSEGV) while handling a SIGSEGV.

I was having some GC issues and a coworker recommended trying go1.6beta1 to see
if it would fix them. It didn't (I've since resolved them in other ways), but it
did start causing my program to segfault.

Sadly I can't share the actual binary or core dump but I can share other
information which should hopefully assist in tracking it down.

The segfault occurs after 12-24 hours of running. The program is network-I/O
intensive: it takes UDP packets in on one side and streams TCP out on the other.

CC @rsc and @ianlancetaylor; I was told this would be of interest to you two.

Version

$ go version
go version devel +505fa7b Tue Jan 12 17:31:01 2016 +0000 darwin/amd64

This also occurred with go1.6beta1. I'm trying beta2 now, but I won't know
until tomorrow at the earliest.

Do you use cgo or unsafe?

I don't believe we use cgo, and there's no unsafe.

tarrant@box-name:~$ ldd canal
not a dynamic executable

GDB StackTrace

#0  0x000000000044930e in runtime.sigtrampgo (sig=11, info=0xc820d3dbb0, ctx=0xc820d3da80) at /usr/local/Cellar/go/HEAD/libexec/src/runtime/signal_sigtramp.go:27
#1  0x0000000000469995 in runtime.sigtramp () at /usr/local/Cellar/go/HEAD/libexec/src/runtime/sys_linux_amd64.s:234
#2  0x00000000004699a0 in runtime.sigtramp () at /usr/local/Cellar/go/HEAD/libexec/src/runtime/sys_linux_amd64.s:235
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()

Go stacktrace

INFO[0000] normal-program logging here
Segmentation fault (core dumped)

Memory info from gdb:

(gdb) print #2
Invalid character '#' in expression.
(gdb) print g
$1 = (runtime.g *) 0x0
(gdb) print g.m
Cannot access memory at address 0x30

Registers

$ info registers
rax            0xc820001380    859530335104
rbx            0xc820d3da58    859544214104
rcx            0xc820d3da58    859544214104
rdx            0xc820d3da80    859544214144
rsi            0x128    296
rdi            0xc820d3da80    859544214144
rbp            0xc820d3da48    0xc820d3da48
rsp            0xc820d3d9c8    0xc820d3d9c8
r8             0x0    0
r9             0xc820d3dbb0    859544214448
r10            0x20    32
r11            0xc820802024    859538726948
r12            0x0    0
r13            0x2    2
r14            0xa74ec4    10964676
r15            0x1    1
rip            0x44930e    0x44930e <runtime.sigtrampgo+190>
eflags         0x10202    [ IF RF ]
cs             0x33    51
ss             0x2b    43
ds             0x0    0
es             0x0    0
fs             0x0    0
gs             0x0    0

Object Dump

go tool objdump -s '^runtime\.sigtrampgo$'
...
    signal_sigtramp.go:22   0x4492f9    c3          RET
    signal_sigtramp.go:26   0x4492fa    488d9c2490000000    LEAQ 0x90(SP), BX
    signal_sigtramp.go:26   0x449302    4889d9          MOVQ BX, CX
    signal_sigtramp.go:27   0x449305    4889442430      MOVQ AX, 0x30(SP)
    signal_sigtramp.go:27   0x44930a    4c8b4030        MOVQ 0x30(AX), R8
    signal_sigtramp.go:27   0x44930e    4d8b4050        MOVQ 0x50(R8), R8
    signal_sigtramp.go:27   0x449312    498b18          MOVQ 0(R8), BX
    signal_sigtramp.go:27   0x449315    48894c2428      MOVQ CX, 0x28(SP)
    signal_sigtramp.go:27   0x44931a    4839cb          CMPQ CX, BX
    signal_sigtramp.go:27   0x44931d    7773            JA 0x449392
...
@ianlancetaylor
Contributor

Thanks for the report. Unfortunately I don't see enough here to know what to fix. Are you saying that the Go program simply crashes with "Segmentation fault (core dumped)" and doesn't print anything else? That should (of course) never happen.

There is something odd here. Your go version output says darwin/amd64, but your backtrace says sys_linux_amd64.s. Which is it?

The backtrace line suggests that the crash is because g != nil but g.m == nil or g.m.gsignal == nil. Not sure how that could happen.

If you can figure out the arguments in the core dump, you can determine where the first SIGSEGV occurred by looking at the ctx argument to sigtrampgo. It will be a pointer to a sigcontext struct, defined in runtime/defs_linux_amd64.go. The rip field should point to the instruction that got the error.

ianlancetaylor added this to the Go1.6 milestone Jan 15, 2016
@keyneston
Author

I'm cross-compiling from darwin to linux, so the compiler is darwin/amd64 and the runtime is sys_linux_amd64. I should have mentioned that.

env GOOS=linux go build -o bin/canal example.com/path/canal

Yes, the Go program simply crashes with "Segmentation fault (core dumped)". No additional info.

I'll poke around and see if I can figure out where the first SIGSEGV is occurring.

@davecheney
Contributor

Does your program crash if you compile it natively on the target Linux
machine?

@keyneston
Author

I've reproduced the issue with a Linux-built binary. That run also served to test it with go1.6beta2.

@davecheney
Contributor

Can you please paste the full stack trace from the crash?

@keyneston
Author

The Go stack trace doesn't really exist, but here is what it prints out:

INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
Segmentation fault (core dumped)

gdb:

tarrant@rep-testing-826198:~$ gdb canal-1453060415 core-1453060415
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/users/tarrant/canal-1453060415...done.

warning: core file may not match specified executable file.
[New LWP 34971]
[New LWP 34975]
[New LWP 35010]
[New LWP 35005]
[New LWP 34974]
[New LWP 34977]
[New LWP 34979]
[New LWP 34998]
[New LWP 34968]
[New LWP 34978]
[New LWP 34976]
[New LWP 34981]
[New LWP 34985]
[New LWP 34980]
[New LWP 34990]
[New LWP 34987]
[New LWP 34982]
[New LWP 34984]
[New LWP 34995]
[New LWP 34986]
[New LWP 34989]
[New LWP 34992]
[New LWP 34999]
[New LWP 35000]
[New LWP 34997]
[New LWP 35007]
[New LWP 34996]
[New LWP 35006]
[New LWP 35008]
[New LWP 36191]
[New LWP 35018]
[New LWP 34993]
[New LWP 34969]
[New LWP 35009]
[New LWP 34970]
[New LWP 35516]
[New LWP 34972]
[New LWP 34991]
[New LWP 34983]
[New LWP 34973]
[New LWP 34994]
[New LWP 35001]
[New LWP 34988]
[New LWP 35003]
[New LWP 35221]
[New LWP 35024]
[New LWP 35002]
[New LWP 35004]
[New LWP 35011]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./canal -log debug
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80)
    at /usr/local/go/src/runtime/signal_sigtramp.go:27
27      /usr/local/go/src/runtime/signal_sigtramp.go: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /home/users/tarrant/canal-1453060415
Use `info auto-load-scripts [REGEXP]' to list them.
(gdb) bt
#0  0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80)
    at /usr/local/go/src/runtime/signal_sigtramp.go:27
#1  0x000000000046533b in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:234
#2  0x0000000000465340 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:235
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb)

@davecheney
Contributor

I don't understand. There should be a panic from Go, especially for SEGV.

Are you sure there is no cgo in your project? If you're doing packet
decoding, are you using a library which links against libpcap? Is the
source of your project available?

@rhysh
Contributor

rhysh commented Jan 18, 2016

We're unfortunately not able to share the source code for the program. There's no cgo in this program outside of the core net and runtime/race packages. The packet processing is done on UDP packets in pure Go, not on SOCK_RAW with libpcap. The race detector has not found any data races in the program.

The program crashes with SIGSEGV in runtime.sigtrampgo, apparently because either g.m or g.m.gsignal is nil. The sig argument to sigtrampgo is 11, indicating that it was called in response to a SIGSEGV from elsewhere in the program. The ctx argument should point to a runtime.sigcontext struct, so we look at the memory it points to. The rip field in the core dumps is indeed a pointer, but not to a valid instruction. We have an example of it pointing into the .rodata section (to the string "GC worker (idle)"), and an example of it pointing to a runtime.g value (with 0xfffffffffffffade showing up after 16 bytes, where we'd expect the runtime.g's stackguard0 field to be).

I suspect that the instruction pointer for the first SIGSEGV is being accurately reported; that is, the original SIGSEGV is caused by an attempt to execute non-executable memory. I think this is likely due to memory corruption in the scheduler's data structures.

How can we hunt down corruption in the scheduler's memory? Does corruption there sound like a plausible explanation for the crashes we see?

@davecheney
Contributor

Can you reproduce the problem on previous versions of Go, 1.5.3 or 1.4.2?

How can we hunt down corruption in the scheduler's memory? Does corruption there sound like a plausible explanation for the crashes we see?

Possible, but unlikely; corruption of this sort usually comes from data races in the program.

If you suspect that your tests do not exercise possibly racy paths, could you build a race-enabled version of your program and deploy it? go build -race $PKG will do it.

@mwhudson
Contributor

Can you run the process under gdb? That should stop at the point of the first segfault rather than when Go's handler breaks, and it may be easier to find out what's going on. Alternatively, I guess, you could disable Go's handling of SIGSEGV entirely.

As to figuring out why you're jumping into the weeds, I guess if you're lucky you're jumping via CALL and the address you jumped from is on the top of the stack. If not, you might still be able to figure out where you were in the code by looking through the stack, though of course if the jump is from runtime·gogo or something like that this won't help.

@rhysh
Contributor

rhysh commented Jan 18, 2016

The program has been running with the race detector for the past hour with no races reported. It hasn't encountered the SIGSEGV yet, but all other periodic code paths should have been exercised a few hundred times.

We can run the process under gdb, though it may take a day before we see the crash again. Disabling Go's SIGSEGV handling to force an immediate core dump sounds attractive. Can this be done via signal.Ignore(syscall.SIGSEGV), or does it require calling sigaction(2) directly?

@mwhudson
Contributor

We can run the process under gdb, though it may take a day before we see the crash again. Disabling Go's SIGSEGV handling to force an immediate core dump sounds attractive. Can this be done via signal.Ignore(syscall.SIGSEGV), or does it require calling sigaction(2) directly?

I don't actually know the answer to this. I was thinking of hacking out the runtime code that installs the SIGSEGV handler...

@rhysh
Contributor

rhysh commented Jan 18, 2016

The goroutine that's segfaulting is one of the GC's background mark workers. Both usable core dumps we have show the same thing (below). runtime.sigtrampgo is called with the sig argument set to 11 for SIGSEGV, and the contents of the info argument indicate signal 11 with si_code SEGV_MAPERR, meaning the fault is due to an unmapped address; the fault address is 0xd8.

go tool objdump indicates that the result of the call to getg is stored in the AX register.

$ gdb {program,core}-1452891329
[snip]
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044930e in runtime.sigtrampgo (sig=11, info=0xc820d3dbb0, ctx=0xc820d3da80) at /usr/local/Cellar/go/HEAD/libexec/src/runtime/signal_sigtramp.go:27
[snip]
(gdb) p info.si_signo
$1 = 11
(gdb) p info.si_code
$2 = 1
(gdb) p/x info.si_addr
$3 = 0xd8
(gdb) info symbol (('runtime.g'*)$rax).gopc
runtime.gcBgMarkStartWorkers + 159 in section .text of /path/to/program-1452891329
(gdb) info symbol (('runtime.g'*)$rax).startpc
runtime.gcBgMarkWorker in section .text of /path/to/program-1452891329
$ gdb {program,core}-1453060415
[snip]
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80) at /usr/local/go/src/runtime/signal_sigtramp.go:27
[snip]
(gdb) p info.si_signo
$1 = 11
(gdb) p info.si_code
$2 = 1
(gdb) p/x info.si_addr
$3 = 0xd8
(gdb) info symbol (('runtime.g'*)$rax).gopc
runtime.gcBgMarkStartWorkers + 147 in section .text of /path/to/program-1453060415
(gdb) info symbol (('runtime.g'*)$rax).startpc
runtime.gcBgMarkWorker in section .text of /path/to/program-1453060415

@rhysh
Contributor

rhysh commented Jan 18, 2016

I think I've found the cause of our issues. The G that runs the GC background mark worker is sometimes running without an M. When that goroutine calls runtime.acquirem, it's unable to dereference _g_.m, which leads to a SIGSEGV at address 0xd8 when running the following instructions:

FS MOVQ FS:0xfffffff8, AX
MOVQ 0x30(AX), CX
MOVL 0xd8(CX), R8
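
To make the failure mode concrete, here is a small self-contained Go sketch of my own (the padded stand-in structs are not runtime code; they only mimic the offsets in the disassembly above, with m at 0x30 inside g and the counter that acquirem bumps at 0xd8 inside m):

package main

// Stand-ins for the runtime's g and m, padded only so the interesting fields
// land at the offsets seen in the disassembly: g.m at 0x30, the counter at 0xd8.
type fakeM struct {
    _     [0xd8]byte
    locks int32
}

type fakeG struct {
    _ [0x30]byte
    m *fakeM
}

func main() {
    gp := &fakeG{} // the g itself is valid, but its m is nil, as in the core dumps
    gp.m.locks++   // roughly what acquirem does (_g_.m.locks++)
}

Running this crashes with a nil pointer dereference, and on current toolchains the SIGSEGV banner reports a fault address of 0xd8, matching si_addr in both cores.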

The ctx argument to sigtrampgo doesn't point directly at a runtime.sigcontext; it points 40 bytes before one, so all of the field names were wrong. Adding 0x28 to the pointer results in much more believable register contents, such as 1) an instruction pointer that points to an instruction in the text segment, and 2) a stack pointer that points to an address between stack.lo and stack.hi:

Crash 1452891329, built on darwin/amd64 to run on linux/amd64 at go version devel +505fa7b Tue Jan 12 17:31:01 2016 +0000

$ gdb {program,core}-1452891329
[snip]
(gdb) p/x (('runtime.g'*)$rax).stack.lo
$1 = 0xc82024c000
(gdb) p/x (('runtime.g'*)$rax).stack.hi
$2 = 0xc82024c7e0
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rsp
$3 = 0xc82024c740
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
$4 = 0x41f9f8
(gdb) info symbol (('runtime.sigcontext'*)(ctx+0x28)).rip
runtime.gcBgMarkWorker + 376 in section .text of /path/to/program-1452891329
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rcx
$5 = 0x0

$ go tool objdump /path/to/program-1452891329 | grep -C 3 '0x41f9f8'
    mgc.go:1469 0x41f9ea    c3          RET
    mgc.go:1390 0x41f9eb    64488b0425f8ffffff  FS MOVQ FS:0xfffffff8, AX
    mgc.go:1390 0x41f9f4    488b4830        MOVQ 0x30(AX), CX
    mgc.go:1390 0x41f9f8    448b81d8000000      MOVL 0xd8(CX), R8
    mgc.go:1390 0x41f9ff    41ffc0          INCL R8
    mgc.go:1390 0x41fa02    448981d8000000      MOVL R8, 0xd8(CX)
    mgc.go:1390 0x41fa09    4c8b4030        MOVQ 0x30(AX), R8

Crash 1453060415, built on linux/amd64 to run on linux/amd64 at go version go1.6beta2

$ gdb {canal,core}-1453060415
[snip]
(gdb) p/x (('runtime.g'*)$rax).stack.lo
$1 = 0xc8202a7800
(gdb) p/x (('runtime.g'*)$rax).stack.hi
$2 = 0xc8202a7fe0
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rsp
$3 = 0xc8202a7f20
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
$4 = 0x434514
(gdb) info symbol (('runtime.sigcontext'*)(ctx+0x28)).rip
runtime.gopark + 36 in section .text of /path/to/program-1453060415
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rax
$5 = 0x0

$ go tool objdump /path/to/program-1453060415 | grep -C 3 '0x434514'
    proc.go:248 0x434503    4883ec20        SUBQ $0x20, SP
    proc.go:249 0x434507    64488b0c25f8ffffff  FS MOVQ FS:0xfffffff8, CX
    proc.go:249 0x434510    488b4130        MOVQ 0x30(CX), AX
    proc.go:249 0x434514    8ba8d8000000        MOVL 0xd8(AX), BP
    proc.go:249 0x43451a    ffc5            INCL BP
    proc.go:249 0x43451c    89a8d8000000        MOVL BP, 0xd8(AX)
    proc.go:249 0x434522    488b6930        MOVQ 0x30(CX), BP

So to summarize, our program is crashing because a GC background mark worker is running with g.m set to nil. This causes a segfault, which causes the signal handler to run, which in turn encounters a segfault when it tries to dereference g.m, resulting in an immediate and unceremonious crash of the program.

@rhysh
Contributor

rhysh commented Jan 18, 2016

The 40-byte offset is because I'd been misinterpreting the meaning of the ctx argument to runtime.sigtrampgo. It's an unsafe.Pointer to a runtime.ucontext value, which on amd64 includes a runtime.sigcontext value at a 40-byte offset (rather than a pointer directly to a runtime.sigcontext value). runtime/signal_linux_amd64.go shows how to correctly interpret the contents.
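
As a rough sketch (my own trimmed stand-ins, mirroring my reading of the layouts in runtime/defs_linux_amd64.go rather than the runtime's actual declarations), the 0x28 used in the gdb sessions above is simply the offset of the embedded sigcontext within the ucontext:

package main

import (
    "fmt"
    "unsafe"
)

// Trimmed stand-ins for the linux/amd64 ucontext and sigcontext layouts.
// Only the leading fields matter for the offset; the tails are omitted.
type sigcontext struct {
    r8, r9, r10, r11, r12, r13, r14, r15 uint64
    rdi, rsi, rbp, rbx, rdx, rax, rcx    uint64
    rsp, rip                             uint64
    // remaining fields omitted
}

type ucontext struct {
    uc_flags    uint64
    uc_link     uint64    // really a pointer to another ucontext
    uc_stack    [3]uint64 // sigaltstack: ss_sp, ss_flags (padded), ss_size
    uc_mcontext sigcontext
    // remaining fields omitted
}

func main() {
    var uc ucontext
    // Prints 0x28: ctx points at the ucontext, and the saved registers
    // (including rip and rsp) live 40 bytes in, hence ctx+0x28 in gdb.
    fmt.Printf("uc_mcontext offset: %#x\n", unsafe.Offsetof(uc.uc_mcontext))
}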

I don't have any further answers for why the GC worker is running without an M.

@rhysh
Contributor

rhysh commented Jan 18, 2016

It looks like this is the same bug as #13363, although the CPU profiler was not involved with these crashes. /cc @aclements

@ianlancetaylor
Contributor

@rhysh thanks for the good investigative work.

@RLH @aclements it appears that in some (uncommon) case gcBgMarkWorker can be started up without an m.

@rhysh
Contributor

rhysh commented Jan 18, 2016

Crash 1452891329 on go version 505fa7b comes in the middle of runtime.gcBgMarkWorker during a call to runtime.acquirem. This isn't the first call to runtime.acquirem in that function, which means that the worker must have had an m when it started and then lost it (otherwise we'd see a segfault at the first call).

Crash 1453060415 on go version go1.6beta2 comes at the call to runtime.acquirem at the start of runtime.gopark. I looked through the stack of the crashed goroutine for instruction addresses and found a 64-bit aligned value of 0x000000000041f0fc at address 0xc8202a7f40, between the stack.hi value of 0xc8202a7fe0 and the rsp value of 0xc8202a7f20. The instruction address corresponds to the return from runtime.gopark at the top of runtime.gcBgMarkWorker's for loop. If I've unwound the stack correctly, this means that gcBgMarkWorker calling gopark calling acquirem led to the second crash.

$ go tool objdump /path/to/program-1453060415 | grep -C 3 '0x41f0fc'
    mgc.go:1378 0x41f0e9    c644242014      MOVL $0x14, 0x20(SP)
    mgc.go:1378 0x41f0ee    48c744242800000000  MOVQ $0x0, 0x28(SP)
    mgc.go:1378 0x41f0f7    e8f4530100      CALL runtime.gopark(SB)
    mgc.go:1383 0x41f0fc    488b6c2470      MOVQ 0x70(SP), BP
    mgc.go:1383 0x41f101    4883fd00        CMPQ $0x0, BP
    mgc.go:1383 0x41f105    0f845d050000        JE 0x41f668
    mgc.go:1383 0x41f10b    488b9d10120000      MOVQ 0x1210(BP), BX

The gcBgMarkWorker goroutines had an m when they were created, but had their execution resumed at some point without an m.

@aclements
Member

Nice debugging! I agree that this is almost certainly the same failure as #13363. You're positive that CPU profiling wasn't enabled? Is it possible the application is receiving other asynchronous signals? (I would love to rule out async signals as a possible cause.)

@keyneston
Author

pprof is installed, but I was not actually calling any of the pprof endpoints for most of the crashes. For many of them I'd start the process, close the screen session, and then go home for the day, only to have it crash the next morning.

I don't think any async signals are being delivered, though they may be under the hood. Most of the syscalls are reads and writes on network fds.

I'll poke around and see if I can find any others, but I seriously doubt it.

@aclements
Member

I can think of a few hypotheses:

  1. The background mark worker goroutine was resumed from preemption with g.m == nil.
  2. The background mark worker was running and something it did set g.m to nil (without stopping the G).
  3. The background mark worker was running and another thread concurrently set g.m to nil.
  4. The background mark worker was running and a signal (running synchronously) returned with g.m set to nil.
  5. Memory corruption.

We may be able to disprove 1 by looking at where the SIGSEGVs happen relative to possible preemption points. The background mark worker has fairly limited preemption points and since the goroutine can only resume at a preemption point, if there are any successful accesses to g.m between the crash and the previous preemption point, we can rule out 1.

My case from #13363 does not disprove 1. In fact, since we resume from an involuntary preemption at the call to morestack, the crash happened just a few instructions after a possible resume point. This lends support to theory 1, but does not prove it.

@rhysh, in 1452891329, do you know which acquirem crashed?

Crash 1453060415 would seem to disprove 1. As of the acquirem in gopark, the goroutine hasn't been preemptible for quite a while. On all of the code paths that lead from disabling preemption to gopark (of which I think there are three: function entry, regular looping around, and looping around from the gcMarkDone condition), we've done our own acquirem, which would indicate g.m != nil since the last preemption point.

Further evidence against 1 is that execute itself would crash if g.m were nil. However, there are other calls to gogo that wouldn't crash if g.m were nil.

For 2 and 3, we need to consider the places that modify g.m. Ignoring places where we assign g0.m or gsignal.m, there are four: newextram, execute, dropg, and goexit0. newextram always sets it to non-nil (and immediately uses it), so we can discount it. Likewise, execute would crash if g.m were nil. dropg and goexit0 do set g.m to nil.

@rhysh or @tarrant, could you have GDB dump the whole current G?

For 2, it's almost certainly not goexit0. For 3, it could conceivably be goexit0. However, in #13363, it's definitely not goexit0 because the status is _Grunning and the waitstatus is set. Likewise, all of the calls to dropg are immediately preceded by setting the status to something other than _Grunning. Most paths back to _Grunning would crash if g.m were nil, but I'm not sure all of them would. This would be a useful place to dig deeper.

For 4, I'm pretty sure there's nothing specific in the signal code that modifies g.m and I don't think it can call any of the functions I mentioned above. So 4 seems unlikely.

I hope it's not 5.

@keyneston
Author

gdb {canal,core}-1453060415
...
(gdb) print/x *g
$1 = {stack = {lo = 0xc8202a7800, hi = 0xc8202a7fe0}, stackguard0 = 0xc8202a7ad0, stackguard1 = 0xffffffffffffffff, _panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x800,
  sched = {sp = 0x0, pc = 0x434653, g = 0xc8203dc300, ctxt = 0x0, ret = 0x0, lr = 0x0, bp = 0x0}, syscallsp = 0x0, syscallpc = 0x0, stkbar = {array = 0xc8202a7fe0, len = 0x0,
    cap = 0x2}, stkbarPos = 0x0, stktopsp = 0xc8202a7fb8, param = 0x0, atomicstatus = 0x2, stackLock = 0x0, goid = 0x232, waitsince = 0x0, waitreason = {str = 0x9bf1e0,
    len = 0x10}, schedlink = 0x0, preempt = 0x0, paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0, throwsplit = 0x0, raceignore = 0x0,
  sysblocktraced = 0x0, sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0, writebuf = {array = 0x0, len = 0x0, cap = 0x0}, sigcode0 = 0x0, sigcode1 = 0x0,
  sigpc = 0x0, gopc = 0x41ef63, startpc = 0x41efd0, racectx = 0x0, waiting = 0x0, gcAssistBytes = 0x0}

@rhysh
Contributor

rhysh commented Jan 20, 2016

Thanks for looking into this, Austin.

The acquirem call in 1452891329 is the one towards the top of gcBgMarkWorker's for loop: https://github.com/golang/go/blob/505fa7b/src/runtime/mgc.go#L1390

I've collected quite a few core dumps from reproducing #13363, which show five different call stacks at the time of the crash. I'm building binaries of the runtime/pprof test for linux/amd64 with go version devel +01b8640 Sat Jan 16 03:35:23 2016 +0000 darwin/amd64, with a toolchain built with GOEXPERIMENT=framepointer and with the pprof frequency set to 500Hz. I have a gdb script that uses the frame pointer to unwind the stacks.

$ for core in ./core-* ; do gdb --batch --command unwind.gdb ./pprof.test $core 2>/dev/null | grep -A 2 '^\$' | sed -n -e 's/ in section .*//p' | sed -e 's/ //g' | tac | tr '\n' ';' | sed -e 's/;$/\n/' ; done | sort | uniq -c | sort -n -r
     38 runtime.goexit+1;runtime.gcBgMarkWorker+1273;runtime.gcMarkDone+62;runtime.semacquire+46
     30 runtime.goexit+1;runtime.gcBgMarkWorker+1273;runtime.morestack+13
     24 runtime.goexit+1;runtime.gcBgMarkWorker+321;runtime.gopark+46
      1 runtime.goexit+1;runtime.gcBgMarkWorker+718;runtime.gcDrain+277;runtime.(*gcWork).balance+148;runtime.handoff+162
      1 runtime.goexit+1;runtime.gcBgMarkWorker+321;runtime.gopark+432;runtime.writebarrierptr+129;runtime.writebarrierptr_nostore1+27

stress command:

./stress sh -c "GODEBUG=gcstackbarrierall=1,gcstackbarrierall=0 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"

gdb script:

$ cat unwind.gdb 
printf "\n"
p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
i sym (('runtime.sigcontext'*)(ctx+0x28)).rip
set $ptr = (long *)(('runtime.sigcontext'*)(ctx+0x28)).rbp

while $ptr != 0
  printf "\n"
  set $instr = *($ptr+1)
  p/x $instr
  i sym $instr
  set $ptr = *(long **)($ptr)
end

changes to core packages to make reproducing easier, per #13363:

diff --git a/src/runtime/pprof/pprof.go b/src/runtime/pprof/pprof.go
index d32b315..f1d8cbb 100644
--- a/src/runtime/pprof/pprof.go
+++ b/src/runtime/pprof/pprof.go
@@ -577,7 +577,7 @@ func StartCPUProfile(w io.Writer) error {
        // system, and a nice round number to make it easy to
        // convert sample counts to seconds.  Instead of requiring
        // each client to specify the frequency, we hard code it.
-       const hz = 100
+       const hz = 500

        cpu.Lock()
        defer cpu.Unlock()
diff --git a/src/runtime/pprof/pprof_test.go b/src/runtime/pprof/pprof_test.go
index c241b54..6d83236 100644
--- a/src/runtime/pprof/pprof_test.go
+++ b/src/runtime/pprof/pprof_test.go
@@ -103,7 +103,7 @@ func parseProfile(t *testing.T, bytes []byte, f func(uintptr, []uintptr)) {
        }

        hd, val, tl := val[:5], val[5:l-3], val[l-3:]
-       if hd[0] != 0 || hd[1] != 3 || hd[2] != 0 || hd[3] != 1e6/100 || hd[4] != 0 {
+       if hd[0] != 0 || hd[1] != 3 || hd[2] != 0 /*|| hd[3] != 1e6/100*/ || hd[4] != 0 {
                t.Fatalf("unexpected header %#x", hd)
        }

@rhysh
Contributor

rhysh commented Jan 20, 2016

That is, 38 core dumps where the currently running goroutine had a stack of runtime.goexit+1 -> runtime.gcBgMarkWorker+1273 -> runtime.gcMarkDone+62 -> runtime.semacquire+46, which corresponds to a stack of:

runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1459
runtime.gcMarkDone https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1042
runtime.semacquire https://github.com/golang/go/blob/01b8640/src/runtime/sema.go#L68

30 core dumps with a stack of runtime.goexit+1 -> runtime.gcBgMarkWorker+1273 -> runtime.morestack+13, meaning:

runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1459
runtime.morestack https://github.com/golang/go/blob/01b8640/src/runtime/asm_amd64.s#L324

24 of runtime.goexit+1 -> runtime.gcBgMarkWorker+321 -> runtime.gopark+46:

runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1382
runtime.gopark https://github.com/golang/go/blob/01b8640/src/runtime/proc.go#L249

One of runtime.goexit+1 -> runtime.gcBgMarkWorker+718 -> runtime.gcDrain+277 -> runtime.(*gcWork).balance+148 -> runtime.handoff+162:

runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1412
runtime.gcDrain https://github.com/golang/go/blob/01b8640/src/runtime/mgcmark.go#L826
runtime.(*gcWork).balance https://github.com/golang/go/blob/01b8640/src/runtime/mgcwork.go#L245
runtime.handoff https://github.com/golang/go/blob/01b8640/src/runtime/mgcwork.go#L461

One of runtime.goexit+1 -> runtime.gcBgMarkWorker+321 -> runtime.gopark+432 -> runtime.writebarrierptr+129 -> runtime.writebarrierptr_nostore1+27:

runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1382
runtime.gopark https://github.com/golang/go/blob/01b8640/src/runtime/proc.go#L256
runtime.writebarrierptr https://github.com/golang/go/blob/01b8640/src/runtime/mbarrier.go#L143
runtime.writebarrierptr_nostore1 https://github.com/golang/go/blob/01b8640/src/runtime/mbarrier.go#L110

@rhysh
Contributor

rhysh commented Jan 20, 2016

I think that last crash may be the most interesting, as it happens within runtime.gopark after a successful call to acquirem.

The write to mp.waitunlockf invokes a write barrier which calls acquirem and hits a SIGSEGV.

func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason string, traceEv byte, traceskip int) {
    mp := acquirem() // getg().m is non-nil when this line executes
    gp := mp.curg
    status := readgstatus(gp)
    if status != _Grunning && status != _Gscanrunning {
        throw("gopark: bad g status")
    }
    mp.waitlock = lock
    mp.waitunlockf = *(*unsafe.Pointer)(unsafe.Pointer(&unlockf)) // SIGSEGV here during write barrier 
    gp.waitreason = reason
    mp.waittraceev = traceEv
    mp.waittraceskip = traceskip
    releasem(mp)
    // can't do anything that might move the G between Ms here.
    mcall(park_m)
}

@aclements
Member

Thanks @tarrant, it looks like that G is very similar to the one from #13363. It's in state _Grunning and the wait reason is set, so the same conclusions apply.

@rhysh, you're right that that last crash is really interesting. What makes it even more interesting is that there's also a write barrier on the previous line that would have crashed if g.m were nil at that point.

That strongly suggests that either the write barrier itself can somehow clear g.m, or something is concurrently clearing it. It's hard to say for sure, but I don't think any of the other stack traces are obviously incompatible with the theory that the write barrier itself is doing it. Though my hunch is that something is concurrently clearing it since there are so many different stacks and the write barrier itself is fairly deterministic.

@aclements
Member

I may know how this is happening. In the gcMarkDone case of the mark worker, the worker disassociates with the P and makes itself preemptible, and then later attempts to reassociate with the P. If the scheduler moved the worker to a different P while it was preemptible (and a new worker hasn't been started for the original P), then as soon as the worker reassociates with the original P, the scheduler on that P can run the worker. It may do so before the worker has fully parked itself, at which point the worker G is effectively running on two Ms simultaneously.

The sequence of steps is something like:

  1. The worker on P1/M1 makes itself preemptible and enters gcMarkDone.
  2. The scheduler moves the worker to P2/M2.
  3. On P2, the worker reassociates itself as the worker for P1.
  4. On P2, the worker calls gopark, which calls park_m, which cas's the G's status to _Gwaiting.
  5. On P1, the scheduler selects the worker to run, cas's its status from _Gwaiting to _Grunnable, calls execute, which cas's its status from _Grunnable to _Grunning, and begins executing at whatever point in the worker was left in the sched buffer (which should be wherever it got preempted during gcMarkDone).
  6. On P2, park_m calls dropg, which clears the worker G's m.
  7. On P1, the worker crashes at whatever next touches the m.
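
Not runtime code, but a toy sketch of that ordering (names, types, and timing are mine) showing why publishing the worker before it has finished parking lets it be resumed with a nil m:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// published stands in for step 3 (the worker reassociating itself with P1);
// m stands in for the worker G's m pointer, cleared by dropg in step 6.
type toyWorker struct {
    published atomic.Bool
    m         atomic.Pointer[string]
}

func main() {
    m2 := "M2"
    w := &toyWorker{}
    w.m.Store(&m2)

    // P1's scheduler: once the worker is published it may pick it up and
    // run it (step 5), before the other side has finished parking.
    go func() {
        for !w.published.Load() {
        }
        time.Sleep(2 * time.Millisecond) // scheduler work before touching g.m
        if w.m.Load() == nil {
            fmt.Println("resumed with nil m (step 7): the crash seen in the cores")
        }
    }()

    w.published.Store(true) // step 3: visible to P1 too early
    w.m.Store(nil)          // step 6: dropg clears the m afterwards

    time.Sleep(10 * time.Millisecond)
}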

@aclements
Member

@rhysh or @tarrant, can you give https://go-review.googlesource.com/18741 a try? It's not a good fix, but it should test if I'm right about the cause.

@aclements
Member

Hmm. The trybots are indicating that maybe you don't want to try CL 18741. It seems my "XXX This is definitely not ideal." is actually "XXX This causes random deadlocks all over the place." However, this does confirm that the workers really do migrate between Ps sometimes in this code path.

@aclements
Member

This time for real. @rhysh or @tarrant, can you give https://go-review.googlesource.com/18761 a try? I think that's actually the right fix (and it passes the trybots).

@gopherbot

CL https://golang.org/cl/18761 mentions this issue.

@keyneston
Author

Sweet, I've got a patched version of Go set up, and the binary is built and running. I'll let it run for a few days, and if nothing happens we can be reasonably sure this is fixed.

@aclements
Member

aclements commented Jan 20, 2016 via email

@keyneston
Author

Yeah, it's looking like a deadlock.

$ sudo perf top -p `pgrep canal`
Events: 2M cycles
 29.33%  canal             [.] runtime.findrunnable
 16.41%  [kernel]          [k] __ticket_spin_lock
  7.40%  canal             [.] runtime.runqgrab
  6.34%  canal             [.] runtime/internal/atomic.Load
  3.34%  canal             [.] runtime/internal/atomic.Xadd64
  3.29%  canal             [.] runtime/internal/atomic.Xadd
  2.80%  canal             [.] runtime.runqsteal
  2.73%  canal             [.] runtime.lock
  2.50%  canal             [.] runtime/internal/atomic.Xchg
  2.34%  canal             [.] runtime.fastrand1
  2.26%  canal             [.] runtime.gcFlushBgCredit
  1.54%  canal             [.] runtime.procyield
  1.42%  canal             [.] runtime/internal/atomic.Cas
  0.96%  canal             [.] runtime.(*gcControllerState).findRunnableGCWorker.func1
  0.87%  canal             [.] runtime.gcmarkwb_m
  0.66%  canal             [.] runtime.schedule
  0.65%  canal             [.] runtime.gcBgMarkWorker
  0.59%  canal             [.] runtime.writebarrierptr_nostore1
  0.57%  canal             [.] runtime.systemstack
  0.56%  canal             [.] runtime.mallocgc
  0.55%  canal             [.] runtime.execute
  0.50%  [kernel]          [k] __wake_up_common
  0.49%  [vdso]            [.] 0x7ffdbd1d870c
  0.48%  canal             [.] runtime.heapBitsForObject
  0.39%  [kernel]          [k] gup_pte_range

It doesn't look like any time is being spent in my code; instead it's just spinning trying to find a thread to run.

@aclements
Member

Actually, it was spinning waiting for a dead thread to enter the "waiting" state, which, needless to say, was never going to happen. :)

But I think I've fixed the problem and it's passing my stress tests. Could you try the new version of CL 18761?

@keyneston
Author

This one seems to run better, but my program is still spending most of its time spinning.

Events: 273K cycles
 18.41%  [kernel]          [k] __ticket_spin_lock
 18.12%  canal             [.] runtime.findrunnable
  8.89%  canal             [.] runtime.runqgrab
  7.41%  canal             [.] runtime/internal/atomic.Load
  3.74%  canal             [.] runtime/internal/atomic.Xadd
  3.56%  canal             [.] runtime/internal/atomic.Xadd64
  3.36%  canal             [.] runtime.runqsteal
  3.01%  canal             [.] runtime.lock
  2.85%  canal             [.] runtime.fastrand1
  2.71%  canal             [.] runtime/internal/atomic.Xchg
  2.52%  canal             [.] runtime.gcFlushBgCredit
  1.69%  canal             [.] runtime.procyield
  1.53%  canal             [.] runtime/internal/atomic.Cas
  1.30%  canal             [.] runtime.gcBgMarkWorker
  1.13%  canal             [.] runtime.(*gcControllerState).findRunnableGCWorker.func1
  0.80%  canal             [.] runtime.gcmarkwb_m
  0.70%  canal             [.] runtime.mallocgc
  0.66%  canal             [.] runtime.writebarrierptr_nostore1
  0.63%  canal             [.] runtime.systemstack
  0.63%  canal             [.] runtime.gcDrain
  0.57%  [vdso]            [.] 0x7ffd809b370c

@aclements
Member

@tarrant, interesting. What does the profile look like without the CL?

@keyneston
Author

It would look a lot better if I had frame pointers enabled. But:

Events: 295K cycles
 73.83%  canal (deleted)   [.] 0x3e9d4
 14.63%  [kernel]          [k] __ticket_spin_lock
  0.74%  [kernel]          [k] gup_pte_range
  0.69%  [vdso]            [.] 0x7ffccd34f70c
  0.49%  [kernel]          [k] get_futex_key
  0.46%  [kernel]          [k] __wake_up_common
  0.41%  [kernel]          [k] __wait_on_bit_lock
  0.40%  [kernel]          [k] put_page
  0.38%  [kernel]          [k] prepare_to_wait_exclusive
  0.30%  [kernel]          [k] __schedule
  0.29%  [kernel]          [k] get_user_pages_fast
  0.28%  [kernel]          [k] __ticket_spin_unlock
  0.25%  [kernel]          [k] get_futex_key_refs.isra.11
  0.25%  [kernel]          [k] __wake_up_bit
  0.25%  [kernel]          [k] native_write_msr_safe
  0.24%  [kernel]          [k] wake_bit_function
  0.24%  [kernel]          [k] try_to_wake_up
  0.21%  [kernel]          [k] update_curr
  0.21%  [kernel]          [k] copy_user_generic_string
  0.20%  [kernel]          [k] unlock_page
  0.18%  [kernel]          [k] wake_affine
  0.18%  [kernel]          [k] effective_load.isra.57
  0.18%  [kernel]          [k] page_waitqueue
  0.16%  [kernel]          [k] drop_futex_key_refs.isra.14
  0.15%  [kernel]          [k] io_schedule

@aclements
Member

Thanks @tarrant. Unfortunately, I can't compare these profiles because there aren't any Go symbols in the second one. It looks like it's spending roughly the same amount of time in __ticket_spin_lock, but, for instance, I don't know if runtime.findrunnable has become more expensive with my CL.

@keyneston
Author

Hopefully this will help: this is a perf top from a binary built with a straight go1.6beta2. If there is a better way to let you see what is going on, let me know and I'll do my best.

Events: 267K cycles
  4.18%  [kernel]          [k] __ticket_spin_lock
  3.48%  canal             [.] runtime.mallocgc
  3.16%  canal             [.] runtime.scanobject
  2.03%  canal             [.] runtime/internal/atomic.Xchg
  1.71%  canal             [.] runtime.pcvalue
  1.71%  canal             [.] runtime.greyobject
  1.59%  canal             [.] runtime.heapBitsForObject
  1.54%  canal             [.] runtime.memmove
  1.38%  canal             [.] runtime/internal/atomic.Cas64
  1.38%  canal             [.] runtime.readvarint
  1.32%  canal             [.] runtime.gcDrain
  1.28%  canal             [.] runtime.findrunnable
  1.23%  canal             [.] runtime.heapBitsSetType
  1.20%  canal             [.] runtime/internal/atomic.Load
  1.17%  canal             [.] runtime.lock
  1.15%  canal             [.] runtime.procyield
  1.10%  canal             [.] runtime.selectgoImpl
  1.06%  canal             [.] sync/atomic.CompareAndSwapUint32
  1.05%  canal             [.] runtime/internal/atomic.Load64
  1.05%  canal             [.] runtime.gentraceback
  1.03%  canal             [.] runtime/internal/atomic.Xadd
  1.02%  canal             [.] runtime.step
  0.90%  canal             [.] sync/atomic.AddUint32
  0.88%  canal             [.] runtime.systemstack

@aclements
Member

Hi @tarrant. I haven't been able to reproduce any performance problems with this CL with either the go1 benchmarks or the x/benchmarks garbage benchmark. Are you sure those profiles are comparable? Do you have end-to-end benchmark numbers that show a problem?

$ benchstat log.b203f88 log.fa839a1
name                     old time/op    new time/op    delta
BinaryTree17-4              2.98s ± 2%     3.00s ± 3%    ~     (p=0.690 n=5+5)
Fannkuch11-4                2.81s ± 3%     2.67s ± 3%  -5.03%  (p=0.016 n=5+5)
FmtFprintfEmpty-4          56.5ns ±14%    53.8ns ± 4%    ~     (p=0.310 n=5+5)
FmtFprintfString-4          176ns ±21%     165ns ± 3%    ~     (p=0.365 n=5+5)
FmtFprintfInt-4             162ns ± 1%     160ns ± 2%    ~     (p=0.556 n=4+5)
FmtFprintfIntInt-4          264ns ± 6%     253ns ± 2%  -4.02%  (p=0.032 n=5+5)
FmtFprintfPrefixedInt-4     241ns ± 2%     239ns ± 1%    ~     (p=0.325 n=5+5)
FmtFprintfFloat-4           339ns ± 5%     336ns ± 1%    ~     (p=0.913 n=5+5)
FmtManyArgs-4              1.07µs ± 5%    1.03µs ± 3%    ~     (p=0.056 n=5+5)
GobDecode-4                8.23ms ± 5%    8.08ms ± 2%    ~     (p=0.222 n=5+5)
GobEncode-4                6.97ms ± 2%    6.90ms ± 1%    ~     (p=0.151 n=5+5)
Gzip-4                      346ms ± 1%     356ms ±11%    ~     (p=0.690 n=5+5)
Gunzip-4                   41.3ms ± 3%    40.9ms ± 4%    ~     (p=0.421 n=5+5)
HTTPClientServer-4         66.3µs ± 6%    64.2µs ± 0%    ~     (p=0.151 n=5+5)
JSONEncode-4               18.0ms ± 2%    17.8ms ± 2%    ~     (p=0.095 n=5+5)
JSONDecode-4               52.2ms ± 1%    52.8ms ± 1%    ~     (p=0.063 n=5+4)
Mandelbrot200-4            4.19ms ± 0%    4.21ms ± 2%    ~     (p=0.286 n=4+5)
GoParse-4                  3.65ms ± 1%    3.64ms ± 3%    ~     (p=0.310 n=5+5)
RegexpMatchEasy0_32-4       113ns ± 3%     113ns ± 4%    ~     (p=0.460 n=5+5)
RegexpMatchEasy0_1K-4       204ns ± 3%     205ns ± 4%    ~     (p=1.000 n=5+5)
RegexpMatchEasy1_32-4      91.3ns ± 2%    91.1ns ± 2%    ~     (p=0.524 n=5+5)
RegexpMatchEasy1_1K-4       386ns ± 5%     381ns ± 1%    ~     (p=0.484 n=5+5)
RegexpMatchMedium_32-4      139ns ± 1%     139ns ± 0%    ~     (p=1.000 n=5+5)
RegexpMatchMedium_1K-4     42.5µs ± 3%    41.9µs ± 0%    ~     (p=0.111 n=5+4)
RegexpMatchHard_32-4       2.26µs ± 8%    2.19µs ± 2%    ~     (p=0.548 n=5+5)
RegexpMatchHard_1K-4       66.3µs ± 1%    66.3µs ± 5%    ~     (p=0.365 n=4+5)
Revcomp-4                   628ms ±28%     580ms ±20%    ~     (p=0.421 n=5+5)
Template-4                 59.2ms ± 1%    59.9ms ± 1%    ~     (p=0.413 n=4+5)
TimeParse-4                 343ns ± 1%     337ns ± 1%    ~     (p=0.063 n=4+5)
TimeFormat-4                348ns ± 0%     349ns ± 0%    ~     (p=0.333 n=4+5)

@aclements
Member

I'm going to go ahead and push this CL so we can get it out there, but please open a new issue if it causes performance problems for your workload.

@keyneston
Author

Just tried go1.6rc1 and am having zero issues. I'm going to chalk up my issues to some oddity in how I was patching or compiling.

Thanks a ton for your help in getting this all sorted out and fixed.
-T

@aclements
Member

Thanks to you and @rhysh for all the help debugging!

golang locked and limited conversation to collaborators Feb 3, 2017