Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: GC crash on linux-amd64-noopt #17785

Closed
bradfitz opened this issue Nov 3, 2016 · 13 comments
Closed

runtime: GC crash on linux-amd64-noopt #17785

bradfitz opened this issue Nov 3, 2016 · 13 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Nov 3, 2016

https://build.golang.org/log/05d53c7b89a4539c468ebec21fdfd59ee7ac52b3
linux-amd64-noopt at d1e9104

##### Testing race detector
ok  	runtime/race	8.162s
ok  	flag	1.037s
ok  	os/exec	3.050s
runtime: pointer 0xc42035a7c8 to unallocated span idx=0x1ad span.base()=0xc420354000 span.limit=0xc420361f80 span.state=3
runtime: found in object at *(0xc42039df98+0x8)
object=0xc42039df98 k=0x62101ce s.base()=0xc42039a000 s.limit=0xc42039c000 s.sizeclass=2 s.elemsize=16 s.state=_MSpanStack
 *(object+0) = 0x5de470
 *(object+8) = 0xc42035a7c8 <==
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0x661319, 0x3e)
	/tmp/workdir/go/src/runtime/panic.go:596 +0x95 fp=0x7ff12c9f6570 sp=0x7ff12c9f6550
runtime.heapBitsForObject(0xc42035a7c8, 0xc42039df98, 0x8, 0xc41fffc27f, 0xc400000000, 0x7ff1339a4f70, 0xc420021228, 0x4)
	/tmp/workdir/go/src/runtime/mbitmap.go:433 +0x2bf fp=0x7ff12c9f65c8 sp=0x7ff12c9f6570
runtime.scanblock(0xc42039df98, 0x18, 0x665b15, 0xc420021228)
	/tmp/workdir/go/src/runtime/mgcmark.go:1186 +0x12b fp=0x7ff12c9f6640 sp=0x7ff12c9f65c8
runtime.scanframeworker(0x7ff12c9f6958, 0x7ff12c9f6a48, 0xc420021228)
	/tmp/workdir/go/src/runtime/mgcmark.go:911 +0xfc fp=0x7ff12c9f66d0 sp=0x7ff12c9f6640
runtime.scanstack.func1(0x7ff12c9f6958, 0x0, 0x7ff12c9f6701)
	/tmp/workdir/go/src/runtime/mgcmark.go:805 +0x6e fp=0x7ff12c9f6720 sp=0x7ff12c9f66d0
runtime.gentraceback(0xffffffffffffffff, 0xc42039df60, 0x0, 0xc42023a680, 0x0, 0x0, 0x7fffffff, 0x7ff12c9f6b48, 0x0, 0x0, ...)
	/tmp/workdir/go/src/runtime/traceback.go:378 +0x13a2 fp=0x7ff12c9f69b8 sp=0x7ff12c9f6720
runtime.scanstack(0xc42023a680, 0xc420021228)
	/tmp/workdir/go/src/runtime/mgcmark.go:828 +0x265 fp=0x7ff12c9f6b98 sp=0x7ff12c9f69b8
runtime.scang(0xc42023a680, 0xc420021228)
	/tmp/workdir/go/src/runtime/proc.go:851 +0x209 fp=0x7ff12c9f6bd0 sp=0x7ff12c9f6b98
runtime.markroot.func1()
	/tmp/workdir/go/src/runtime/mgcmark.go:250 +0x6d fp=0x7ff12c9f6c10 sp=0x7ff12c9f6bd0
runtime.systemstack(0x7ff12c9f6c78)
	/tmp/workdir/go/src/runtime/asm_amd64.s:330 +0xab fp=0x7ff12c9f6c18 sp=0x7ff12c9f6c10
runtime.markroot(0xc420021228, 0x5d)
	/tmp/workdir/go/src/runtime/mgcmark.go:255 +0x1c7 fp=0x7ff12c9f6ca0 sp=0x7ff12c9f6c18
runtime.gcDrain(0xc420021228, 0x0)
	/tmp/workdir/go/src/runtime/mgcmark.go:1027 +0xad fp=0x7ff12c9f6cd0 sp=0x7ff12c9f6ca0
runtime.gchelper()
	/tmp/workdir/go/src/runtime/mgc.go:1877 +0xfd fp=0x7ff12c9f6d00 sp=0x7ff12c9f6cd0
runtime.stopm()
	/tmp/workdir/go/src/runtime/proc.go:1623 +0xe0 fp=0x7ff12c9f6d28 sp=0x7ff12c9f6d00
runtime.findrunnable(0xc420021300, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:2052 +0x24f fp=0x7ff12c9f6dc0 sp=0x7ff12c9f6d28
runtime.schedule()
	/tmp/workdir/go/src/runtime/proc.go:2151 +0x153 fp=0x7ff12c9f6e00 sp=0x7ff12c9f6dc0
runtime.park_m(0xc4204e2ea0)
	/tmp/workdir/go/src/runtime/proc.go:2214 +0x7a fp=0x7ff12c9f6e38 sp=0x7ff12c9f6e00
runtime.mcall(0x0)
	/tmp/workdir/go/src/runtime/asm_amd64.s:256 +0x5b fp=0x7ff12c9f6e48 sp=0x7ff12c9f6e38

@bradfitz bradfitz added this to the Go1.8 milestone Nov 3, 2016
@randall77
Copy link
Contributor

Weird. Something is definitely confused here, the state of the span is _MSpanStack but the GC is scanning it as if it were a span of 16-byte objects.

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 7, 2016
@mwhudson
Copy link
Contributor

Four more on the first page of build.golang.org:

https://build.golang.org/log/b3ae09ee6ba8718e621f8548694c317e425057fe
https://build.golang.org/log/6a92b4f7ed27db03b3c2063486ecf7357fac642a
https://build.golang.org/log/b83479cfd0eaea91f4ac7939e002eab5cdd8cab3
https://build.golang.org/log/b83479cfd0eaea91f4ac7939e002eab5cdd8cab3

All on the noopt builder so far. That's usually something that's usually
inlined triggering a stack split or something along those lines I think?

On 8 November 2016 at 06:17, Keith Randall notifications@github.com wrote:

Weird. Something is definitely confused here, the state of the span is
_MSpanStack but the GC is scanning it as if it were a span of 16-byte
objects.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#17785 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AApBFl9dC0aIQZqQbkh1qEMxShJBk7_zks5q710QgaJpZM4KpCaT
.

@mwhudson mwhudson changed the title runtime: GC crash runtime: GC crash on linux-amd64-noopt Nov 15, 2016
@bradfitz
Copy link
Contributor Author

This crash continues again, now that an unrelated noopt test failure is no longer hiding it.

@aclements, any ideas?

@aclements
Copy link
Member

Some statistical observations.

I had assumed this was caused by the hybrid barrier, but it actually started a little before the hybrid barrier went in. The failure probability did significantly increase after the hybrid barrier---from ~1.5% to 18%---but the timing isn't quite right to say the hybrid barrier itself increased the failure probability.

Here are the first few instances of this failure:

2016-10-19T07:09:08-9aed16e/linux-amd64-noopt
2016-10-27T14:56:06-4c18204/linux-amd64-noopt
Hybrid barrier started going in at aa581f5. Mostly no-op transformations as of the next failure, however. fixalloc now zeros by default, but not for mspan, which would be the obvious culprit.
2016-10-28T19:13:23-8a7f0ad/linux-amd64-noopt
Hybrid barrier is fully in as of bd640c8. However, next failure doesn't happen for another 81 commits, so maybe it wasn't the hybrid barrier that increased its failure probability.
2016-11-01T20:33:58-e22c796/linux-amd64-noopt
2016-11-02T21:33:03-688995d/linux-amd64-noopt
2016-11-03T18:10:35-18f0e88/linux-amd64-noopt
2016-11-03T20:18:52-2b59b15/linux-amd64-noopt
2016-11-03T22:43:50-b241a06/linux-amd64-noopt

All failures are on linux-amd64-noopt.

We've observed 68 failures so far. All of these are when running misc/cgo/test. 43 of them are when running misc/cgo/test in the "Testing race detector" section. The rest are when running misc/cgo/test regularly (without the race detector).

There are some slight variations on the failure. They're all roughly:

runtime: pointer 0xc4202cc7a8 to unallocated span idx=0x166 span.base()=0xc4202c6000 span.limit=0xc4202d0000 span.state=2
runtime: found in object at *(0xc42038ff68+0x8)
object=0xc42038ff68 k=0x62101c7 s.base()=0xc420388000 s.limit=0xc42038a000 s.sizeclass=2 s.elemsize=16 s.state=_MSpanStack
 *(object+0) = 0x5d4320
 *(object+8) = 0xc4202cc7a8 <==
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

18 failures have s.sizeclass=0 s.elemsize=0. Of these, 5 also have s.limit=0x0. 35 failures have s.sizeclass=2 s.elemsize=16. 15 failures have s.sizeclass=10 s.elemsize=144. In the sizeclass=10 failures, the first two words have the same pattern as all of the other failures and it's still the second word that's bad. In all cases, s.state=_MSpanStack.

@aclements
Copy link
Member

The following reproduces it with high probability:

cd $GOROOT/src
GO_GCFLAGS="-N -l" ./make.bash
cd ../misc/cgo/test
go test -gcflags '-N -l' -c
GOTRACEBACK=2 GOGC=1 ./test.test -test.short

You can also reproduce it with -test.run=Test7978, though the probability is lower (though not super-low).

All failures are specifically in misc/cgo/test/issue7978.go and they all have the same stack that's being scanned. Here's a sample from 2016-11-29T15:27:49-d39b7b5/linux-amd64-noopt:

runtime: pointer 0xc4205dc7c8 to unallocated span idx=0x2ee span.base()=0xc4205d6000 span.limit=0xc4205c0000 span.state=3
runtime: found in object at *(0xc420388f98+0x8)
object=0xc420388f98 k=0x62101c4 s.base()=0xc420382000 s.limit=0xc42038df80 s.sizeclass=10 s.elemsize=144 s.state=_MSpanStack
 *(object+0) = 0x60f010
 *(object+8) = 0xc4205dc7c8 <==
...
runtime stack:
runtime.throw(0x6a1c75, 0x3e)
        /tmp/workdir/go/src/runtime/panic.go:596 +0x95 fp=0x7f765bff6618 sp=0x7f765bff65f8
runtime.heapBitsForObject(0xc4205dc7c8, 0xc420388f98, 0x8, 0xc41fffbfbf, 0xc400000000, 0x7f7669aa8f70, 0xc42001b228, 0x2)
        /tmp/workdir/go/src/runtime/mbitmap.go:433 +0x2bf fp=0x7f765bff6670 sp=0x7f765bff6618
runtime.scanblock(0xc420388f98, 0x18, 0x6a68d1, 0xc42001b228)
        /tmp/workdir/go/src/runtime/mgcmark.go:1220 +0x12b fp=0x7f765bff66e8 sp=0x7f765bff6670
runtime.scanframeworker(0x7f765bff6a00, 0x7f765bff6af0, 0xc42001b228)
        /tmp/workdir/go/src/runtime/mgcmark.go:925 +0xfc fp=0x7f765bff6778 sp=0x7f765bff66e8
...

goroutine 177 [syscall (scan), locked to thread]:
runtime.cgocall(0x60f010, 0xc4205dc7c8, 0xc4205dc798)
        /tmp/workdir/go/src/runtime/cgocall.go:130 +0x8b fp=0xc420388f98 sp=0xc420388f60
_/tmp/workdir/go/misc/cgo/test._Cfunc_issue7978c(0x13be988)
        _/tmp/workdir/go/misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294 +0x63 fp=0xc420388fc8 sp=0xc420388f98
_/tmp/workdir/go/misc/cgo/test.issue7978go()
        /tmp/workdir/go/misc/cgo/test/issue7978.go:95 +0x3b fp=0xc420388fe0 sp=0xc420388fc8
runtime.goexit()
        /tmp/workdir/go/src/runtime/asm_amd64.s:2184 +0x1 fp=0xc420388fe8 sp=0xc420388fe0
created by _/tmp/workdir/go/misc/cgo/test.test7978
        /tmp/workdir/go/misc/cgo/test/issue7978.go:113 +0xd1

The block that's being scanned is the arguments to runtime.cgocall in the frame of test._Cfunc_issue7978c at sp 0xc420388f98.

The caller probably doesn't matter since it's the arguments map, but here's its declaration and compilation with -N -l:

_Cfunc_issue7978c
//go:cgo_import_static _cgo_4d9b5135caa8_Cfunc_issue7978c
//go:linkname __cgofn__cgo_4d9b5135caa8_Cfunc_issue7978c _cgo_4d9b5135caa8_Cfunc
_issue7978c
var __cgofn__cgo_4d9b5135caa8_Cfunc_issue7978c byte
var _cgo_4d9b5135caa8_Cfunc_issue7978c = unsafe.Pointer(&__cgofn__cgo_4d9b5135ca
a8_Cfunc_issue7978c)

//go:cgo_unsafe_args
func _Cfunc_issue7978c(p0 *_Ctype_uint32_t) (r1 _Ctype_void) {
        _cgo_runtime_cgocall(_cgo_4d9b5135caa8_Cfunc_issue7978c, uintptr(unsafe.Pointer(&p0)))
        if _Cgo_always_false {
                _Cgo_use(p0)
        }
        return
}
""._Cfunc_issue7978c t=1 size=133 args=0x8 locals=0x30
        0x0000 00000 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   TEXT    ""._Cfunc_issue7978c(SB), $48-8
        0x0000 00000 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   MOVQ    (TLS), CX
        0x0009 00009 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   CMPQ    SP, 16(CX)
        0x000d 00013 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   JLS     123
        0x000f 00015 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   SUBQ    $48, SP
        0x0013 00019 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   MOVQ    BP, 40(SP)
        0x0018 00024 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   LEAQ    40(SP), BP
        0x001d 00029 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   FUNCDATA        $0, gclocals·09b80ec389a9e6ac09cfa1cd3c45263d(SB)
        0x001d 00029 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   FUNCDATA        $1, gclocals·568470801006e5c0dc3947ea998fe279(SB)
        0x001d 00029 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   NOP
        0x001d 00029 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   NOP
        0x001d 00029 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   LEAQ    "".p0+56(FP), AX
        0x0022 00034 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   MOVQ    AX, ""..autotmp_249+32(SP)
        0x0027 00039 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   MOVQ    ""._cgo_4d9b5135caa8_Cfunc_issue7978c(SB), AX
        0x002e 00046 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   MOVQ    AX, (SP)
        0x0032 00050 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   MOVQ    ""..autotmp_249+32(SP), AX
        0x0037 00055 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   MOVQ    AX, 8(SP)
        0x003c 00060 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   PCDATA  $0, $1
        0x003c 00060 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1294)   CALL    runtime.cgocall(SB)
        0x0041 00065 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   MOVBLZX runtime.cgoAlwaysFalse(SB), AX
        0x0048 00072 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   TESTB   AL, AL
        0x004a 00074 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   JNE     78
        0x004c 00076 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   JMP     121
        0x004e 00078 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   MOVQ    "".p0+56(FP), AX
        0x0053 00083 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   MOVQ    AX, ""..autotmp_250+24(SP)
        0x0058 00088 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   LEAQ    type.*"".C.uint32_t(SB), CX
        0x005f 00095 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   MOVQ    CX, (SP)
        0x0063 00099 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   MOVQ    AX, 8(SP)
        0x0068 00104 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   PCDATA  $0, $0
        0x0068 00104 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1296)   CALL    runtime.cgoUse(SB)
        0x006d 00109 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   JMP     111
        0x006f 00111 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1298)   MOVQ    40(SP), BP
        0x0074 00116 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1298)   ADDQ    $48, SP
        0x0078 00120 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1298)   RET
        0x0079 00121 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   JMP     111
        0x007b 00123 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1295)   NOP
        0x007b 00123 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   PCDATA  $0, $-1
        0x007b 00123 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   CALL    runtime.morestack_noctxt(SB)
        0x0080 00128 (.../misc/cgo/test/_test/_obj_test/_cgo_gotypes.go:1293)   JMP     0
        0x0000 64 48 8b 0c 25 00 00 00 00 48 3b 61 10 76 6c 48  dH..%....H;a.vlH
        0x0010 83 ec 30 48 89 6c 24 28 48 8d 6c 24 28 48 8d 44  ..0H.l$(H.l$(H.D
        0x0020 24 38 48 89 44 24 20 48 8b 05 00 00 00 00 48 89  $8H.D$ H......H.
        0x0030 04 24 48 8b 44 24 20 48 89 44 24 08 e8 00 00 00  .$H.D$ H.D$.....
        0x0040 00 0f b6 05 00 00 00 00 84 c0 75 02 eb 2b 48 8b  ..........u..+H.
        0x0050 44 24 38 48 89 44 24 18 48 8d 0d 00 00 00 00 48  D$8H.D$.H......H
        0x0060 89 0c 24 48 89 44 24 08 e8 00 00 00 00 eb 00 48  ..$H.D$........H
        0x0070 8b 6c 24 28 48 83 c4 30 c3 eb f4 e8 00 00 00 00  .l$(H..0........
        0x0080 e9 7b ff ff ff                                   .{...
        rel 5+4 t=16 TLS+0
        rel 42+4 t=15 ""._cgo_4d9b5135caa8_Cfunc_issue7978c+0
        rel 61+4 t=8 runtime.cgocall+0
        rel 68+4 t=15 runtime.cgoAlwaysFalse+0
        rel 91+4 t=15 type.*"".C.uint32_t+0
        rel 105+4 t=8 runtime.cgoUse+0
        rel 124+4 t=8 runtime.morestack_noctxt+0
gclocals·09b80ec389a9e6ac09cfa1cd3c45263d t=8 dupok size=10
        0x0000 02 00 00 00 01 00 00 00 01 01                    ..........
gclocals·568470801006e5c0dc3947ea998fe279 t=8 dupok size=10
        0x0000 02 00 00 00 02 00 00 00 00 02                    ..........

From this we can see that the bad pointer is the arg argument, which should be a stack pointer (&p0). However, based on the hex dump in the panic, the arg word clearly doesn't point to the stack, which suggests that the stack may have been moved without adjusting this pointer.

Here's the compilation of runtime.cgocall:

"".cgocall t=1 nosplit size=310 args=0x18 locals=0x38
        0x0000 00000 (/home/austin/go.dev/src/runtime/cgocall.go:94)    TEXT    "".cgocall(SB), $56-24
        0x0000 00000 (/home/austin/go.dev/src/runtime/cgocall.go:94)    SUBQ    $56, SP
        0x0004 00004 (/home/austin/go.dev/src/runtime/cgocall.go:94)    MOVQ    BP, 48(SP)
        0x0009 00009 (/home/austin/go.dev/src/runtime/cgocall.go:94)    LEAQ    48(SP), BP
        0x000e 00014 (/home/austin/go.dev/src/runtime/cgocall.go:94)    FUNCDATA        $0, gclocals·6fc0d9879f56cfd09f1808073517724d(SB)
        0x000e 00014 (/home/austin/go.dev/src/runtime/cgocall.go:94)    FUNCDATA        $1, gclocals·bd4e7836c4f1ce903f7da6501847074d(SB)
        0x000e 00014 (/home/austin/go.dev/src/runtime/cgocall.go:94)    MOVL    $0, "".~r2+80(FP)
        0x0016 00022 (/home/austin/go.dev/src/runtime/cgocall.go:95)    MOVBLZX "".iscgo(SB), AX
        0x001d 00029 (/home/austin/go.dev/src/runtime/cgocall.go:95)    TESTB   AL, AL
        0x001f 00031 (/home/austin/go.dev/src/runtime/cgocall.go:95)    JNE     38
        0x0021 00033 (/home/austin/go.dev/src/runtime/cgocall.go:95)    JMP     279
        0x0026 00038 (/home/austin/go.dev/src/runtime/cgocall.go:99)    MOVQ    "".fn+64(FP), AX
        0x002b 00043 (/home/austin/go.dev/src/runtime/cgocall.go:99)    TESTQ   AX, AX
        0x002e 00046 (/home/austin/go.dev/src/runtime/cgocall.go:99)    JNE     53
        0x0030 00048 (/home/austin/go.dev/src/runtime/cgocall.go:99)    JMP     252
        0x0035 00053 (/home/austin/go.dev/src/runtime/cgocall.go:103)   JMP     55
        0x0037 00055 (/home/austin/go.dev/src/runtime/cgocall.go:111)   PCDATA  $0, $0
        0x0037 00055 (/home/austin/go.dev/src/runtime/cgocall.go:111)   CALL    "".lockOSThread(SB)
        0x003c 00060 (/home/austin/go.dev/src/runtime/cgocall.go:112)   MOVQ    (TLS), AX
        0x0045 00069 (/home/austin/go.dev/src/runtime/cgocall.go:112)   MOVQ    48(AX), AX
        0x0049 00073 (/home/austin/go.dev/src/runtime/cgocall.go:112)   MOVQ    AX, "".mp+40(SP)
        0x004e 00078 (/home/austin/go.dev/src/runtime/cgocall.go:113)   TESTB   AL, (AX)
        0x0050 00080 (/home/austin/go.dev/src/runtime/cgocall.go:113)   MOVQ    248(AX), AX
        0x0057 00087 (/home/austin/go.dev/src/runtime/cgocall.go:113)   MOVQ    AX, ""..autotmp_77+32(SP)
        0x005c 00092 (/home/austin/go.dev/src/runtime/cgocall.go:113)   MOVQ    "".mp+40(SP), CX
        0x0061 00097 (/home/austin/go.dev/src/runtime/cgocall.go:113)   TESTB   AL, (CX)
        0x0063 00099 (/home/austin/go.dev/src/runtime/cgocall.go:113)   INCQ    AX
        0x0066 00102 (/home/austin/go.dev/src/runtime/cgocall.go:113)   MOVQ    AX, 248(CX)
        0x006d 00109 (/home/austin/go.dev/src/runtime/cgocall.go:114)   MOVQ    "".mp+40(SP), AX
        0x0072 00114 (/home/austin/go.dev/src/runtime/cgocall.go:114)   TESTB   AL, (AX)
        0x0074 00116 (/home/austin/go.dev/src/runtime/cgocall.go:114)   MOVL    256(AX), AX
        0x007a 00122 (/home/austin/go.dev/src/runtime/cgocall.go:114)   MOVL    AX, ""..autotmp_78+28(SP)
        0x007e 00126 (/home/austin/go.dev/src/runtime/cgocall.go:114)   MOVQ    "".mp+40(SP), CX
        0x0083 00131 (/home/austin/go.dev/src/runtime/cgocall.go:114)   TESTB   AL, (CX)
        0x0085 00133 (/home/austin/go.dev/src/runtime/cgocall.go:114)   INCL    AX
        0x0087 00135 (/home/austin/go.dev/src/runtime/cgocall.go:114)   MOVL    AX, 256(CX)
        0x008d 00141 (/home/austin/go.dev/src/runtime/cgocall.go:117)   MOVQ    "".mp+40(SP), AX
        0x0092 00146 (/home/austin/go.dev/src/runtime/cgocall.go:117)   TESTB   AL, (AX)
        0x0094 00148 (/home/austin/go.dev/src/runtime/cgocall.go:117)   MOVQ    264(AX), AX
        0x009b 00155 (/home/austin/go.dev/src/runtime/cgocall.go:117)   TESTB   AL, (AX)
        0x009d 00157 (/home/austin/go.dev/src/runtime/cgocall.go:117)   MOVQ    $0, (AX)
        0x00a4 00164 (/home/austin/go.dev/src/runtime/cgocall.go:130)   MOVL    $0, (SP)
        0x00ab 00171 (/home/austin/go.dev/src/runtime/cgocall.go:130)   PCDATA  $0, $2
        0x00ab 00171 (/home/austin/go.dev/src/runtime/cgocall.go:130)   CALL    "".entersyscall(SB)
        0x00b0 00176 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVQ    "".fn+64(FP), AX
        0x00b5 00181 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVQ    AX, (SP)
        0x00b9 00185 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVQ    "".arg+72(FP), AX
        0x00be 00190 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVQ    AX, 8(SP)
        0x00c3 00195 (/home/austin/go.dev/src/runtime/cgocall.go:131)   PCDATA  $0, $3
        0x00c3 00195 (/home/austin/go.dev/src/runtime/cgocall.go:131)   CALL    "".asmcgocall(SB)
        0x00c8 00200 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVL    16(SP), AX
        0x00cc 00204 (/home/austin/go.dev/src/runtime/cgocall.go:131)   MOVL    AX, "".errno+24(SP)
        0x00d0 00208 (/home/austin/go.dev/src/runtime/cgocall.go:132)   MOVL    $0, (SP)
        0x00d7 00215 (/home/austin/go.dev/src/runtime/cgocall.go:132)   PCDATA  $0, $3
        0x00d7 00215 (/home/austin/go.dev/src/runtime/cgocall.go:132)   CALL    "".exitsyscall(SB)
        0x00dc 00220 (/home/austin/go.dev/src/runtime/cgocall.go:134)   MOVQ    "".mp+40(SP), AX
        0x00e1 00225 (/home/austin/go.dev/src/runtime/cgocall.go:134)   MOVQ    AX, (SP)
        0x00e5 00229 (/home/austin/go.dev/src/runtime/cgocall.go:134)   PCDATA  $0, $1
        0x00e5 00229 (/home/austin/go.dev/src/runtime/cgocall.go:134)   CALL    "".endcgo(SB)
        0x00ea 00234 (/home/austin/go.dev/src/runtime/cgocall.go:135)   MOVL    "".errno+24(SP), AX
        0x00ee 00238 (/home/austin/go.dev/src/runtime/cgocall.go:135)   MOVL    AX, "".~r2+80(FP)
        0x00f2 00242 (/home/austin/go.dev/src/runtime/cgocall.go:135)   MOVQ    48(SP), BP
        0x00f7 00247 (/home/austin/go.dev/src/runtime/cgocall.go:135)   ADDQ    $56, SP
        0x00fb 00251 (/home/austin/go.dev/src/runtime/cgocall.go:135)   RET
        0x00fc 00252 (/home/austin/go.dev/src/runtime/cgocall.go:100)   LEAQ    go.string."cgocall nil"(SB), AX
        0x0103 00259 (/home/austin/go.dev/src/runtime/cgocall.go:100)   MOVQ    AX, (SP)
        0x0107 00263 (/home/austin/go.dev/src/runtime/cgocall.go:100)   MOVQ    $11, 8(SP)
        0x0110 00272 (/home/austin/go.dev/src/runtime/cgocall.go:100)   PCDATA  $0, $1
        0x0110 00272 (/home/austin/go.dev/src/runtime/cgocall.go:100)   CALL    "".throw(SB)
        0x0115 00277 (/home/austin/go.dev/src/runtime/cgocall.go:100)   UNDEF
        0x0117 00279 (/home/austin/go.dev/src/runtime/cgocall.go:95)    JMP     281
        0x0119 00281 (/home/austin/go.dev/src/runtime/cgocall.go:95)    JMP     283
        0x011b 00283 (/home/austin/go.dev/src/runtime/cgocall.go:96)    LEAQ    go.string."cgocall unavailable"(SB), AX
        0x0122 00290 (/home/austin/go.dev/src/runtime/cgocall.go:96)    MOVQ    AX, (SP)
        0x0126 00294 (/home/austin/go.dev/src/runtime/cgocall.go:96)    MOVQ    $19, 8(SP)
        0x012f 00303 (/home/austin/go.dev/src/runtime/cgocall.go:96)    PCDATA  $0, $1
        0x012f 00303 (/home/austin/go.dev/src/runtime/cgocall.go:96)    CALL    "".throw(SB)
        0x0134 00308 (/home/austin/go.dev/src/runtime/cgocall.go:96)    UNDEF
gclocals·6fc0d9879f56cfd09f1808073517724d t=8 dupok size=12
        0x0000 04 00 00 00 03 00 00 00 03 00 03 00              ............
gclocals·bd4e7836c4f1ce903f7da6501847074d t=8 dupok size=12
        0x0000 04 00 00 00 01 00 00 00 00 00 01 01              ............

The args map for runtime.cgocall is {0b011, 0b000, 0b011, 0b000}. In this compilation, the failure happened at cgocall+0xb0, and the code path to there uses either map index 0 or 2, both of which have both arguments marked live.

@aclements
Copy link
Member

I think I understand what's happening:

  1. cgocall calls entersyscall, which saves the PC and SP of its call site in cgocall. Call this PC/SP "X". This PC has arg map 0b011, so both pointer arguments are live.
  2. cgocall calls asmcgocall. Call the PC/SP of this call "Y". This PC has arg map 0b000 (index 3), so the arguments to cgocall are not live.
  3. asmcgocall goes into the C code, which eventually calls back into the Go code.
  4. cgocallbackg remembers the saved PC/SP "X" in some local variables, calls exitsyscall, and then calls cgocallbackg1.
  5. cgocallbackg1 causes a stack growth. This stack unwind sees PC/SP "Y". Since the arguments are dead at "Y", they are not adjusted.
  6. The callback into Go happens and returns.
  7. cgocallbackg1 calls reentersyscall with the recorded saved PC/SP "X", so "X" gets stashed back into gp.syscallpc/sp.
  8. GC scans the stack. It sees there's a saved syscall PC/SP, so it starts the traceback at PC/SP "X". At "X" the arguments are considered live, so it scans them, but since they weren't adjusted, the pointers are bad, so it panics.

@khr or @ianlancetaylor, any thoughts on the right way to fix this? It's sufficient to add KeepAlive(fn); KeepAlive(arg) to the end of cgocall so the arguments stay live through this little backwards execution game, but that feels like a horrible, horrible hack.

@randall77
Copy link
Contributor

One might argue that all of cgo callbacks feel like a horrible, horrible hack.

I see nothing in your description that mentions noopt. How is that relevant?

Would it help to give asmcgocall the correct args map (that its two args are pointers)?

@ianlancetaylor
Copy link
Contributor

If I understand this correctly, it seems like the key failure is that in asmcgocall the arguments are not considered to be live. But why aren't they? The function has a prototype in stubs.go, so according to funcdata.h it should have an implied GO_ARGS. Why doesn't the stack traceback see that. consider the arguments to be live, and adjust any pointers they contain?

@aclements
Copy link
Member

I see nothing in your description that mentions noopt. How is that relevant?

I can easily trigger this is on a regular opt build by forcing stack growth in the cgo callback of this test, so I think noopt just enlarged the existing stacks enough to cause a stack growth.

Would it help to give asmcgocall the correct args map (that its two args are pointers)?

No, I don't think so. The problem is the arguments to cgocall, not asmcgocall. When the GC stack trace runs, it doesn't even see asmcgocall on the stack (since it's starting from the entersyscall-saved PC/SP).

@ianlancetaylor
Copy link
Contributor

So if I understand this correctly, the nature of cgo calls is that we only see cgocall on the stack, we don't see the function that it called. And we see cgocall at a point where the arguments are dead, because the compiler expects that if they are live the callee is going to mark them as live. But in the stack traceback we never see the callee, so that doesn't help.

If that all sounds true I don't see why your earlier suggestion of using KeepAlive is a hack. It sounds like exactly the right thing to do to get around the fact that the stack backtrace can not do the right thing.

@aclements
Copy link
Member

Essentially, yes. We see cgocall twice: in the first stack trace (for stack growth) the arguments are dead, and in the second trace (for GC) they've risen from the dead and are live again. The undead arguments cause the crash.

If that all sounds true I don't see why your earlier suggestion of using KeepAlive is a hack.

Okay.

I checked all assignments to syscallsp/pc and this appears to be the only situation where we ever "roll them back".

@gopherbot
Copy link

CL https://golang.org/cl/33710 mentions this issue.

@golang golang locked and limited conversation to collaborators Nov 30, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

7 participants