
runtime: "runtime·lock: lock count" fatal error when cgo is enabled #56243

Closed
corhere opened this issue Oct 15, 2022 · 21 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.), FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.), release-blocker

@corhere

corhere commented Oct 15, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.2 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ubuntu/.cache/go-build"
GOENV="/home/ubuntu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ubuntu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ubuntu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.2"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1867075057=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Ran tests for containerd/containerd#7513

~/containerd$ go test -c ./snapshots/overlay
~/containerd$ sudo ./overlay.test -test.run -test.root TestOverlay/no_opt/128LayersMount

(Unfortunately, root is required as the test issues many mount syscalls. I have not had success creating a more minimal reproducer, but this test case takes only a few seconds to run to completion and reproduces the runtime errors fairly reliably.)

What did you expect to see?

The test either passes or fails.

What did you see instead?

fatal error: runtime·lock: lock count followed by hundreds (thousands?) of lines of fatal error: runtime·unlock: lock count. Sometimes these are followed by other runtime errors, such as:

fatal: morestack on g0

fatal: systemstack called from unexpected goroutine
Trace/breakpoint trap

An "impossible" segfault in perfectly ordinary Go code.
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x784dcf]
(gdb) disass 0x784dcf
Dump of assembler code for function github.com/containerd/continuity.(*resource).Path:
   0x0000000000784dc0 <+0>:	mov    (%rax),%rcx
   0x0000000000784dc3 <+3>:	cmpq   $0x0,0x8(%rax)
   0x0000000000784dc8 <+8>:	jne    0x784dcf <github.com/containerd/continuity.(*resource).Path+15>
   0x0000000000784dca <+10>:	xor    %eax,%eax
   0x0000000000784dcc <+12>:	xor    %ebx,%ebx
   0x0000000000784dce <+14>:	ret
   0x0000000000784dcf <+15>:	mov    (%rcx),%rax
   0x0000000000784dd2 <+18>:	mov    0x8(%rcx),%rbx
   0x0000000000784dd6 <+22>:	ret
End of assembler dump.

https://github.com/containerd/continuity/blob/5ad51c7aca47b8e742f5e6e7dc841d50f5f6affd/resource.go#L270

A slice with length > 0 somehow had a nil data pointer... or rcx got clobbered in the middle of the function. No unsafe type-punning is used to construct the slice and go test -race does not complain.
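
For reference, the disassembly corresponds to a method of roughly this shape (a sketch inferred from the instruction sequence and the linked continuity source, not a verbatim copy):

// Sketch only: the guard means the faulting load at +15 runs only when
// len(r.paths) > 0, so in safe Go the slice data pointer in rcx should never
// be nil at that point. Field and type names here are illustrative.
type resource struct {
	paths []string
}

func (r *resource) Path() string {
	if len(r.paths) == 0 {
		return "" // +10..+14: return the zero string
	}
	return r.paths[0] // +15..+22: load the string header through the slice data pointer
}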

fatal error: malloc deadlock / panic during panic followed by what appeared to be two interleaved stack dumps
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: malloc deadlock
panic during panic

runtime stack:
runtime.throw({0x87177d?, 0x7f7d35498848?})
/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7f7d35498820 sp=0x7f7d354987f0 pc=0x4399fd
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498840 sp=0x7f7d35498820 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498858 sp=0x7f7d35498840 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498888 sp=0x7f7d35498858 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d354988e0?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d354988b8 sp=0x7f7d35498888 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d354988d8 sp=0x7f7d354988b8 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d354988f0 sp=0x7f7d354988d8 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498920 sp=0x7f7d354988f0 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498978?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498950 sp=0x7f7d35498920 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498970 sp=0x7f7d35498950 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498988 sp=0x7f7d35498970 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d354989b8 sp=0x7f7d35498988 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498a10?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d354989e8 sp=0x7f7d354989b8 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498a08 sp=0x7f7d354989e8 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498a20 sp=0x7f7d35498a08 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498a50 sp=0x7f7d35498a20 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498aa8?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498a80 sp=0x7f7d35498a50 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498aa0 sp=0x7f7d35498a80 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498ab8 sp=0x7f7d35498aa0 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498ae8 sp=0x7f7d35498ab8 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498b40?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498b18 sp=0x7f7d35498ae8 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498b38 sp=0x7f7d35498b18 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498b50 sp=0x7f7d35498b38 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498b80 sp=0x7f7d35498b50 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498bd8?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498bb0 sp=0x7f7d35498b80 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498bd0 sp=0x7f7d35498bb0 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498be8 sp=0x7f7d35498bd0 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498c18 sp=0x7f7d35498be8 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498c70?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498c48 sp=0x7f7d35498c18 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498c68 sp=0x7f7d35498c48 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498c80 sp=0x7f7d35498c68 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498cb0 sp=0x7f7d35498c80 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498d08?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498ce0 sp=0x7f7d35498cb0 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498d00 sp=0x7f7d35498ce0 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498d18 sp=0x7f7d35498d00 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498d48 sp=0x7f7d35498d18 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498da0?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498d78 sp=0x7f7d35498d48 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498d98 sp=0x7f7d35498d78 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498db0 sp=0x7f7d35498d98 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498de0 sp=0x7f7d35498db0 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498e38?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498e10 sp=0x7f7d35498de0 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498e30 sp=0x7f7d35498e10 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498e48 sp=0x7f7d35498e30 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498e78 sp=0x7f7d35498e48 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498ed0?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498ea8 sp=0x7f7d35498e78 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498ec8 sp=0x7f7d35498ea8 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498ee0 sp=0x7f7d35498ec8 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498f10 sp=0x7f7d35498ee0 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35498f68?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498f40 sp=0x7f7d35498f10 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498f60 sp=0x7f7d35498f40 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35498f78 sp=0x7f7d35498f60 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35498fa8 sp=0x7f7d35498f78 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35499000?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35498fd8 sp=0x7f7d35498fa8 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35498ff8 sp=0x7f7d35498fd8 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35499010 sp=0x7f7d35498ff8 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35499040 sp=0x7f7d35499010 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35499098?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499070 sp=0x7f7d35499040 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499090 sp=0x7f7d35499070 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d354990a8 sp=0x7f7d35499090 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d354990d8 sp=0x7f7d354990a8 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35499130?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499108 sp=0x7f7d354990d8 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499128 sp=0x7f7d35499108 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35499140 sp=0x7f7d35499128 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35499170 sp=0x7f7d35499140 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d354991c8?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d354991a0 sp=0x7f7d35499170 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d354991c0 sp=0x7f7d354991a0 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d354991d8 sp=0x7f7d354991c0 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35499208 sp=0x7f7d354991d8 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d35499260?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499238 sp=0x7f7d35499208 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499258 sp=0x7f7d35499238 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32
runtime.unlock(...)
/usr/local/go/src/runtime/lock_futex.go:112
runtime.printunlock()
/usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35499270 sp=0x7f7d35499258 pc=0x43b41b
runtime.throw.func1()
/usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d354992a0 sp=0x7f7d35499270 pc=0x439a75
runtime.throw({0x87177d?, 0x7f7d354992f8?})
/usr/local/go/src/runtime/panic.go:
goroutine 8 [running]:
runtime.throw({0x86b044?, 0xc0000b0f30?})
/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc0000b0ee8 sp=0xc0000b0eb8 pc=0x4399fd
runtime.mallocgc(0x78, 0x83a4a0, 0x1)
/usr/local/go/src/runtime/malloc.go:913 +0x8ac fp=0xc0000b0f60 sp=0xc0000b0ee8 pc=0x40f70c
runtime.newobject(0x136e7fad0?)
/usr/local/go/src/runtime/malloc.go:1192 +0x27 fp=0xc0000b0f88 sp=0xc0000b0f60 pc=0x40f847
crypto/sha256.New()
/usr/local/go/src/crypto/sha256/sha256.go:166 +0x25 fp=0xc0000b0fb0 sp=0xc0000b0f88 pc=0x53df45
crypto.Hash.New(0x7f6820?)
/usr/local/go/src/crypto/crypto.go:131 +0x4a fp=0xc0000b0ff8 sp=0xc0000b0fb0 pc=0x53bb2a
github.com/opencontainers/go-digest.Algorithm.Hash({0x867270, 0x6})
/home/ubuntu/containerd/vendor/github.com/opencontainers/go-digest/algorithm.go:135 +0x97 fp=0xc0000b1040 sp=0xc0000b0ff8 pc=0x77a197
github.com/opencontainers/go-digest.Algorithm.Digester(...)
/home/ubuntu/containerd/vendor/github.com/opencontainers/go-digest/algorithm.go:112
github.com/containerd/continuity.simpleDigester.Digest({{0x867270?, 0x800da0?}}, {0x90bbe0?, 0xc0000143b0?})
/home/ubuntu/containerd/vendor/github.com/containerd/continuity/digests.go:42 +0x3f fp=0xc0000b10c0 sp=0xc0000b1040 pc=0x781c5f
github.com/containerd/continuity.(*simpleDigester).Digest(0x40d45d?, {0x90bbe0?, 0xc0000143b01043 +0x46? fp=}0x7f7d354992d0)
sp=0x7f7d354992a0 pc=0x4399e6:
1runtime.unlock2 +0x45 fp=(0xc0000b10f00x1b sp=?0xc0000b10c0)
pc= 0x7862e5/usr/local/go/src/runtime/lock_futex.go
:github.com/containerd/continuity.(*context).digest127 +(0x7a0xc0001a2a50 fp=, 0x7f7d354992f0{ sp=0xc0002faa000x7f7d354992d0, pc=0xf0x40db9a}
)
runtime.unlockWithRank(...)
/home/ubuntu/containerd/vendor/github.com/containerd/continuity/context.go: 634/usr/local/go/src/runtime/lockrank_off.go +:0x18f32 fp=
0xc0000b1170runtime.unlock sp=(...)
0xc0000b10f0 pc=/usr/local/go/src/runtime/lock_futex.go0x78190f:
112github.com/containerd/continuity.(*context).Resource
runtime.printunlock(0xc0001a2a50(, )
{0xc0002faa00/usr/local/go/src/runtime/print.go, :0xf80} +, 0x3b{ fp=0x90ef680x7f7d35499308, sp=0xc0000e66800x7f7d354992f0} pc=)
0x43b41b
/home/ubuntu/containerd/vendor/github.com/containerd/continuity/context.goruntime.throw.func1:(161)

  • 0x1fc/usr/local/go/src/runtime/panic.go fp=:0xc0000b13c81044 sp= +0xc0000b11700x55 pc= fp=0x77d6bc0x7f7d35499338
    sp=github.com/containerd/continuity.BuildManifest.func10x7f7d35499308( pc={0x439a750xc0002faa00
    , runtime.throw0xf(}{, 0x87177d{?0x90ef68, , 0x7f7d354993900xc0000e6680?}}, )
    { 0x0/usr/local/go/src/runtime/panic.go?:, 10430x0 +?0x46} fp=)
    0x7f7d35499368 sp=/home/ubuntu/containerd/vendor/github.com/containerd/continuity/manifest.go0x7f7d35499338: pc=950x4399e6 +
    0xc7runtime.unlock2 fp=(0xc0000b14580x1b sp=0xc0000b13c8?)
    pc=0x783267
    /usr/local/go/src/runtime/lock_futex.gogithub.com/containerd/continuity.(*context).Walk.func1:127( +{0x7a0xc00011d740 fp=?0x7f7d35499388, sp=0xc0000e66800x7f7d35499368? pc=}0x40db9a,
    {runtime.unlockWithRank0x90ef68, (...)
    0xc0000e6680}/usr/local/go/src/runtime/lockrank_off.go, :{320xc0000b14e8
    ?runtime.unlock, (...)
    0x46d747 ?}/usr/local/go/src/runtime/lock_futex.go)
    :112
    /home/ubuntu/containerd/vendor/github.com/containerd/continuity/context.goruntime.printunlock:(596)
    +0x70 fp=/usr/local/go/src/runtime/print.go0xc0000b14a0: sp=800xc0000b1458 + pc=0x3b0x781470 fp=
    0x7f7d354993a0path/filepath.walk sp=0x7f7d35499388( pc={0x43b41b0xc00011d740
    , runtime.throw.func10x3f(})
    , { 0x90ef68/usr/local/go/src/runtime/panic.go, :0xc0000e66801044} +, 0x550xc000183b90 fp=)
    /usr/local/go/src/path/filepath/path.go:433 +0x123 fp=0xc0000b1568 sp=0xc0000b14a0 pc=0x500e03
    path/filepath.walk({0xc0001dbb80, 0x38}, {0x90ef68, 0xc0000cd380}, 0xc000183b90)
    /usr/local/go/src/path/filepath/path.go:457 +0x285 fp=0xc0000b1630 sp=0xc0000b1568 pc=0x500f65
    path/filepath.walk({0x7f7d354993d00xc00002eb10 sp=, 0x7f7d354993a00x30 pc=}0x439a75,
    {0x90ef68runtime.throw, (0xc0000cd2b0{}0x87177d, ?0xc000183b90, )
    0x7f7d35499428 ?/usr/local/go/src/path/filepath/path.go}:)
    457 +/usr/local/go/src/runtime/panic.go0x285: fp=10430xc0000b16f8 + sp=0xc0000b1630 pc=0x500f65
    path/filepath.Walk({0xc00002eb10, 0x300x46 fp=0x7f7d35499400 sp=}0x7f7d354993d0, pc=0xc000183b900x4399e6)

    runtime.unlock2/usr/local/go/src/path/filepath/path.go:520 +(0x1b?)
    /usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499420 sp=0x7f7d35499400 pc=0x40db9a
    runtime.unlockWithRank(...)
    /usr/local/go/src/runtime/lockrank_off.go:32
    runtime.unlock(...)
    /usr/local/go/src/runtime/lock_futex.go:112
    runtime.printunlock()
    /usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35499438 sp=0x7f7d35499420 pc=0x6c0x43b41b fp=
    runtime.throw.func10xc0000b1748( sp=)
    0xc0000b16f8 pc=/usr/local/go/src/runtime/panic.go0x5010cc:
    1044github.com/containerd/continuity/pathdriver.(*pathDriver).Walk +0x55( fp=0x84be400x7f7d35499468, sp={0x7f7d354994380xc00002eb10 pc=0x439a75
    runtime.throw({0x87177d?, 0x7f7d354994c0?})
    /usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499498 sp=0x7f7d35499468 pc=0x4399e6
    runtime.unlock2(0x1b?)
    /usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d354994b8 sp=0x7f7d35499498 pc=0x40db9a
    runtime.unlockWithRank?, 0x40f847?}, 0x28?)
    /home/ubuntu/containerd/vendor/github.com/containerd/continuity/pathdriver/path_driver.go:88 +0x27 fp=0xc0000b1770 sp=0xc0000b1748 pc=0x779c47
    github.com/containerd/continuity.(*context).Walk(0xc0001a2a50, 0xc000183b60)
    (...)
    /home/ubuntu/containerd/vendor/github.com/containerd/continuity/context.go :/usr/local/go/src/runtime/lockrank_off.go594: +320x12b
    fp=runtime.unlock0xc0000b17b0 sp=(...)
    0xc0000b1770 pc=/usr/local/go/src/runtime/lock_futex.go0x7813ab:
    112github.com/containerd/continuity.BuildManifest
    runtime.printunlock({(0x90e248)
    /usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d354994d0 sp=0x7f7d354994b8 pc=0x43b41b
    runtime.throw.func1()
    /usr/local/go/src/runtime/panic.go:1044 +0x55 fp=0x7f7d35499500 sp=0x7f7d354994d0 pc=0x439a75
    runtime.throw({0x87177d?, 0x7f7d35499558?})
    /usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499530 sp=0x7f7d35499500 pc=0x4399e6
    runtime.unlock2(0x1b?)
    /usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499550 sp=0x7f7d35499530 pc=0x40db9a
    runtime.unlockWithRank(...)
    /usr/local/go/src/runtime/lockrank_off.go:32
    runtime.unlock(...)
    /usr/local/go/src/runtime/lock_futex.go:112
    runtime.printunlock()
    /usr/local/go/src/runtime/print.go:80 +0x3b fp=0x7f7d35499568 sp=0x7f7d35499550 pc=0x43b41b?
    , runtime.throw.func10xc0001a2a50(})
    )
    /usr/local/go/src/runtime/panic.go/home/ubuntu/containerd/vendor/github.com/containerd/continuity/manifest.go::104485 + +0x550x111 fp= fp=0x7f7d354995980xc0000b18d8 sp= sp=0x7f7d354995680xc0000b17b0 pc= pc=0x439a750x782ed1

runtime.throwgithub.com/containerd/continuity/fs/fstest.CheckDirectoryEqual({0x87177d?, 0x7f7d354995f0?})
({ 0xc00011d280/usr/local/go/src/runtime/panic.go, :0x3c1043} +, 0x46{ fp=0xc00002eb100x7f7d354995c8, sp=0x300x7f7d35499598} pc=)
0x4399e6
runtime.unlock2/home/ubuntu/containerd/vendor/github.com/containerd/continuity/fs/fstest/compare.go:(440x1b +?0x1ce)
fp= 0xc0000b1a38/usr/local/go/src/runtime/lock_futex.go sp=:0xc0000b18d8127 pc= +0x786dee0x7a
fp=github.com/containerd/containerd/snapshots/testsuite.check128LayersMount.func10x7f7d354995e8 sp=0x7f7d354995c8( pc={0x40db9a0x90e1d8
, runtime.unlockWithRank0xc00017b350(...)
} , /usr/local/go/src/runtime/lockrank_off.go0xc00012eea0, :{320x90ff20
, runtime.unlock0xc0000690e0(...)
} , /usr/local/go/src/runtime/lock_futex.go{:0xc00002ea50112,
0x2bruntime.printunlock})
()
/home/ubuntu/containerd/snapshots/testsuite/testsuite.go :/usr/local/go/src/runtime/print.go942: +800x14d4 + fp=0x3b0xc0000b1df0 fp= sp=0x7f7d354996000xc0000b1a38 sp= pc=0x7f7d354995e80x79fbb4 pc=
0x43b41bgithub.com/containerd/containerd/snapshots/testsuite.makeTest.func1
runtime.throw.func1(()
0xc00012eea0)
/usr/local/go/src/runtime/panic.go/home/ubuntu/containerd/snapshots/testsuite/testsuite.go:1044: +1170x55 + fp=0x4740x7f7d35499630 fp=0xc0000b1f70 sp=0x7f7d35499600 pc=0x439a75
sp=runtime.throw0xc0000b1df0 pc=(0x794fd4{0x87177d?, 0x7f7d35499688?})
/usr/local/go/src/runtime/panic.go:1043 +0x46 fp=0x7f7d35499660 sp=0x7f7d35499630 pc=0x4399e6
runtime.unlock2(0x1b?)
/usr/local/go/src/runtime/lock_futex.go:127 +0x7a fp=0x7f7d35499680 sp=0x7f7d35499660 pc=0x40db9a
runtime.unlockWithRank(...)
/usr/local/go/src/runtime/lockrank_off.go:32

runtime.unlock(...)
testing.tRunner (/usr/local/go/src/runtime/lock_futex.go0xc00012eea0:, 1120xc00017a9c0
)
runtime.printunlock (/usr/local/go/src/testing/testing.go)
/usr/local/go/src/runtime/print.go::80 +0x3b fp=0x7f7d35499698 sp=14460x7f7d35499680 pc= +0x43b41b0x10b
fp=runtime.throw.func10xc0000b1fc0 sp=(0xc0000b1f70)
pc=0x5137cb
/usr/local/go/src/runtime/panic.gotesting.(*T).Run.func1:(1044 +)
0x55 fp=/usr/local/go/src/testing/testing.go0x7f7d354996c8: sp=14930x7f7d35499698 + pc=0x2a0x439a75 fp=
0xc0000b1fe0
sp=0xc0000b1fc0goroutine pc=80x51466a [
runningruntime.goexit]:
(runtime.systemstack_switch)
(/usr/local/go/src/runtime/asm_amd64.s)
: 1594/usr/local/go/src/runtime/asm_amd64.s +:0x1459 fp= fp=0xc0000b1fe80xc0000b0e78 sp= sp=0xc0000b1fe00xc0000b0e70 pc= pc=0x46d6010x46b3e0

created by runtime.fatalthrowtesting.(*T).Run(
0xb0ec0 ?)
/usr/local/go/src/testing/testing.go :/usr/local/go/src/runtime/panic.go1493: +0x35f1122

The crashes also occur consistently on GitHub Actions CI runners, which rules out faulty hardware on my machine as the cause.

Compiling with cgo enabled is a necessary condition for reproducing the issue. There is no user cgo code in the built test binary, only the runtime and standard library.

~/containerd$ go test -tags osusergo,netgo ./snapshots/overlay
~/containerd$ ldd overlay.test
	not a dynamic executable

I could not reproduce the issue on a pure-Go build.

I loaded some core dumps into gdb and noticed a consistent pattern in the state of the process at the time of each crash.

  • Most threads were blocked on a futex, epollwait or usleep
  • One thread was blocked on a syscall
  • One thread was a freshly clone3()'d child, without having executed a single instruction (pc pointed to the instruction following the syscall, rax = 0 and rsp was set to exactly .stack + .stack_size of the clone_args struct pointed to by rdi.)
  • One thread was getting into trouble while in the process of exiting

I saw no evidence of heap corruption when examining the core dumps. I found that curg().m.locks was always -1 when the fatal runtime.lock call was made. On a hunch, I patched one of the few unguarded, unbalanced decrements of m.locks, runtime.releasem():

--- a/runtime/runtime1.go
+++ b/runtime/runtime1.go
@@ -482,6 +482,9 @@ func acquirem() *m {
 //go:nosplit
 func releasem(mp *m) {
        _g_ := getg()
+       if mp.locks == 0 {
+               crash()
+       }
        mp.locks--
        if mp.locks == 0 && _g_.preempt {
                // restore the preemption request in case we've cleared it in newstack

and was able to get clean stack traces without the recursive panicking.

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#1  0x0000000000450945 in runtime.dieFromSignal (sig=6)
    at /usr/local/go/src/runtime/signal_unix.go:870
#2  0x000000000045127e in runtime.sigfwdgo (sig=6, info=<optimized out>,
    ctx=<optimized out>, ~r0=<optimized out>)
    at /usr/local/go/src/runtime/signal_unix.go:1086
#3  0x000000000044f5e7 in runtime.sigtrampgo (sig=0, info=0x0,
    ctx=0x46f521 <runtime.raise+33>)
    at /usr/local/go/src/runtime/signal_unix.go:432
#4  0x000000000046f826 in runtime.sigtramp ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:359
#5  <signal handler called>
#6  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#7  0x0000000000450945 in runtime.dieFromSignal (sig=6)
    at /usr/local/go/src/runtime/signal_unix.go:870
#8  0x000000000044b9ac in runtime.crash ()
    at /usr/local/go/src/runtime/signal_unix.go:962
#9  runtime.releasem (mp=0xc000154400)
    at /usr/local/go/src/runtime/runtime1.go:486
#10 0x0000000000440985 in runtime.startm (_p_=0xc000034000, spinning=false)
    at /usr/local/go/src/runtime/proc.go:2339
#11 0x0000000000440cee in runtime.handoffp (_p_=0x0)
    at /usr/local/go/src/runtime/proc.go:2352
#12 0x000000000043f597 in runtime.mexit (osStack=true)
    at /usr/local/go/src/runtime/proc.go:1537
#13 0x000000000043f1e9 in runtime.mstart0 ()
    at /usr/local/go/src/runtime/proc.go:1391
#14 0x000000000046b905 in runtime.mstart ()
    at /usr/local/go/src/runtime/asm_amd64.s:390
#15 0x0000000000401888 in runtime/cgo(.text) ()
#16 0x00007f94950c1920 in ?? ()
#17 0x00007f94bc5eb850 in ?? () at ./nptl/pthread_create.c:321
   from /lib/x86_64-linux-gnu/libc.so.6
#18 0x0000000000000000 in ?? ()
(gdb) info threads
  Id   Target Id                          Frame
* 1    Thread 0x7f9492565640 (LWP 186374) runtime.raise ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:159
  2    Thread 0x7f94bc554740 (LWP 186359) runtime.futex ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:560
  3    Thread 0x7f949371f640 (LWP 186363) runtime.futex ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:560
  4    Thread 0x7f9494721640 (LWP 186389) runtime.epollwait ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:706
  5    Thread 0x7f94950c2640 (LWP 186360) runtime.usleep ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:140
  6    Thread 0x7f9492d66640 (LWP 186390) runtime/internal/syscall.Syscall6
    () at /usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36
  7    Thread 0x7f9493f20640 (LWP 186391) clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62

Every core dump I examined has the same traceback in the crashing thread. It's always a pthreads thread in the process of cleaning up and exiting, calling releasem() while its curg().m.locks == -1.

The garbage collector also appears to be necessary to trigger the crashes: setting GOGC=0 makes them more reliable, while I have yet to see a crash with GOGC=off. Timing seems to play a role as well; turning test verbosity on or off changes the probability of a crash, and I have yet to see a crash from a race-enabled build or under strace.

(cc @cpuguy83)

@gopherbot gopherbot added the compiler/runtime (Issues related to the Go compiler and/or runtime.) label Oct 15, 2022
@fuweid
Contributor

fuweid commented Oct 15, 2022

I am also trying to reproduce this in GitHub Actions.
I ran 10 jobs with the same test at the same time.

runtime: newstack at runtime.checkdead+0x2f5 sp=0x7fb781e8ae38 stack=[0xc00004c800, 0xc00004d000]
	morebuf={pc:0x4745df sp:0x7fb781e8ae40 lr:0x0}
	sched={pc:0x47c975 sp:0x7fb781e8ae38 lr:0x0 ctxt:0x0}
runtime.mexit(0x1)
	/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/proc.go:1545 +0x17f fp=0x7fb781e8ae70 sp=0x7fb781e8ae40 pc=0x4745df
runtime.mstart0()
	/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/proc.go:1391 +0x89 fp=0x7fb781e8aea0 sp=0x7fb781e8ae70 pc=0x474289
runtime.mstart()
	/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/asm_amd64.s:390 +0x5 fp=0x7fb781e8aea8 sp=0x7fb781e8aea0 pc=0x4a2725
created by github.com/fuweid/containerd-pr-7513.mountAt
	/home/runner/work/containerd-pr-7513/containerd-pr-7513/mount.go:126 +0x2ac
fatal error: runtime: stack split at bad time

I added GOTRACEBACK=all to dump the stack. Hope it helps.

@dr2chase
Contributor

@golang/runtime this looks like "fun", and a nice start on the debugging already.

@dr2chase dr2chase added the NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) label Oct 17, 2022
@mknyszek
Contributor

Thanks for the detailed analysis.

I think there are two things going on here since you don't see evidence of memory corruption at this point:

  1. There's an acquirem or lock call (or something that increments m.locks) missing on a cgo-related path somewhere.
  2. The runtime handles crashing on lock issues poorly (trying to acquire locks while crashing from the same m when the lock count is broken is a recipe for disaster), getting into the recursive crashing state. AFAICT the crashes you mention aren't totally unexpected, except maybe that segfault in C code (though it's plausible depending on what kind of weird state the runtime gets into in this recursive crashing case).
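
As context for point 1: m.locks is a per-M counter that acquirem increments and releasem (quoted in the patch above) decrements, so a negative value means some path released without a matching acquire. Below is a self-contained toy model of that invariant; it is not the runtime source.

package main

import "fmt"

// Toy model of the m.locks invariant; the real acquirem/releasem live in
// runtime/runtime1.go and operate on the current thread's M.
type m struct{ locks int32 }

func acquirem(mp *m) { mp.locks++ }

func releasem(mp *m) {
	mp.locks--
	if mp.locks < 0 {
		// Mirrors the state seen in the core dumps: curg().m.locks == -1.
		panic(fmt.Sprintf("unbalanced releasem: locks=%d", mp.locks))
	}
}

func main() {
	mp := &m{}
	acquirem(mp)
	releasem(mp) // balanced: fine
	releasem(mp) // unbalanced: panics with locks=-1
}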

@mknyszek mknyszek added this to the Go1.20 milestone Oct 17, 2022
@corhere
Author

corhere commented Oct 17, 2022

I have been able to get the test to crash fairly consistently on a single-core VM, or when running it under taskset -a 1 to limit it to just a single core. I have had no luck reproducing it when the process runs with any parallelism. I have been testing on Ubuntu 22.04 LTS with glibc 2.35-0ubuntu3 and 2.35-0ubuntu3.1.

@prattmic
Member

One thread was a freshly clone3()'d child, without having executed a single instruction (pc pointed to the instruction following the syscall, rax = 0 and rsp was set to exactly .stack + .stack_size of the clone_args struct pointed to by rdi.)

Just to clarify, you mean clone3 from libc, right? We only use clone3 for fork if SysProcAttr.UseCgroupFD is set, which is new (https://go.dev/cl/417695), but that isn't even in 1.19.

@corhere
Author

corhere commented Oct 17, 2022

Yes, clone3 from libc.

@prattmic
Member

I can reproduce this very consistently.

$ git clone https://github.com/containerd/containerd
$ cd containerd
$ git remote add cpuguy83 https://github.com/cpuguy83/containerd
$ git fetch cpuguy83
$ git checkout nix_mount_fork
$ go test -c ./snapshots/overlay
$ sudo taskset -a 1 ./overlay.test -test.root -test.run TestOverlay/no_opt/128LayersMount

@prattmic prattmic self-assigned this Oct 17, 2022
@prattmic
Member

prattmic commented Oct 17, 2022

The immediate problem here is that two threads (35 and 38) are sharing the same M structure:

(gdb) thread apply all p (('runtime.g'*)$r14)->m

Thread 38 (Thread 0x7fffcde81640 (LWP 245313)):
$4 = (runtime.m *) 0xc000160400

Thread 35 (Thread 0x7fffce6c2640 (LWP 245299)):
$5 = (runtime.m *) 0xc000160400

Thread 32 (Thread 0x7fffd0225640 (LWP 245296)):
$6 = (runtime.m *) 0xc000161000

Thread 5 (Thread 0x7fffcf223640 (LWP 245266)):
$7 = (runtime.m *) 0xc00004d000

Thread 4 (Thread 0x7fffcfa24640 (LWP 245265)):
$8 = (runtime.m *) 0xc00004cc00

Thread 2 (Thread 0x7fffd0a26640 (LWP 245263)):
$9 = (runtime.m *) 0xc00004c400

Thread 1 (Thread 0x7ffff7dac740 (LWP 245259)):
$10 = (runtime.m *) 0xb86720 <runtime.m0>

This easily explains why the lock count could get out of sync, as well as other oddities like the malloc deadlock and "panic during panic" across multiple threads.

I'm not sure how two threads end up with the same M.

Edit: this doesn't actually seem to be true at the moment we detect a bad lock value, but may happen later? However, I do often see concurrent clone calls at the throw.

@prattmic
Member

The M structure is allocated on the heap just like any typical object. I believe the problem is that the M is "dying" too early, allowing the GC to free it and potentially reallocate its memory.

Ms are all reachable by the GC via runtime.allm. mexit removes the M from allm quite a while before it is done with it. I'm not seeing another way that the M would be reachable (or the GC blocked) for the remainder of mexit, so I'm pretty sure this is unsafe.
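
To illustrate the lifetime hazard in ordinary Go (this is not runtime code): if the only GC-visible reference to an object is a global list, removing it from that list while the remaining user holds it only from memory the GC does not scan leaves nothing to keep the object alive. The sketch below stays safe only because the local variable sits on a scanned goroutine stack; as discussed further down, the rest of mexit runs on an OS thread stack the GC never scans, which is the difference that matters here.

package main

import (
	"fmt"
	"runtime"
)

type mlike struct{ id int }

var allm []*mlike // global list: the only intentional GC-visible reference

func exitM(mp *mlike) {
	// Drop mp from allm early, as runtime.mexit does...
	for i, x := range allm {
		if x == mp {
			allm = append(allm[:i], allm[i+1:]...)
			break
		}
	}
	runtime.GC()
	// ...then keep using it. Safe here only because mp lives on a scanned
	// goroutine stack; from an unscanned OS stack this would be a use-after-free.
	fmt.Println("still using m", mp.id)
}

func main() {
	mp := &mlike{id: 1}
	allm = append(allm, mp)
	exitM(mp)
}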

This isn't the patch we'd want, but the following diff keeps the Ms alive forever, and I cannot reproduce failures with it applied:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 629f1f8d8f..feb974db8e 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1578,6 +1578,8 @@ func mexit(osStack bool) {
 
        // Remove m from allm.
        lock(&sched.lock)
+       mp.deadlink = deadm
+       deadm = mp // keepalive
        for pprev := &allm; *pprev != nil; pprev = &(*pprev).alllink {
                if *pprev == mp {
                        *pprev = mp.alllink
diff --git a/src/runtime/runtime2.go b/src/runtime/runtime2.go
index 5b55b55ce1..115960c4e4 100644
--- a/src/runtime/runtime2.go
+++ b/src/runtime/runtime2.go
@@ -557,6 +557,7 @@ type m struct {
        cgoCallers    *cgoCallers   // cgo traceback if crashing in cgo call
        park          note
        alllink       *m // on allm
+       deadlink      *m // on deadm
        schedlink     muintptr
        lockedg       guintptr
        createstack   [32]uintptr // stack that created this thread.
@@ -1124,6 +1125,7 @@ func (w waitReason) isMutexWait() bool {
 
 var (
        allm       *m
+       deadm      *m
        gomaxprocs int32
        ncpu       int32
        forcegc    forcegcstate

@ianlancetaylor
Contributor

Nice find. But why doesn't the call to exitThread keep the M alive at least until that point? Would it fix the problem if we added runtime.KeepAlive(mp) at the bottom of mexit?

@prattmic
Member

prattmic commented Oct 17, 2022

I need to double check my assumption, but I don’t think the stack used in mexit is reachable for scanning except via m.g0.stack (and m isn’t reachable because it isn’t in allm). It is an OS stack, so I’m not sure where else it would be found.

Since the stack isn’t reachable, keeping values alive on the stack would have no effect.

Edit: perhaps it should still be reachable via allgs?

@fuweid
Contributor

fuweid commented Oct 17, 2022

Is it possible to remove the M from allm after checkdead?

@gopherbot

Change https://go.dev/cl/443716 mentions this issue: runtime: always keep global reference to mp until mexit completes

@prattmic
Member

https://go.dev/cl/443716 should fix this. This is not a new bug, but the containerd PR seems to trigger it so well because it creates and exits threads very quickly (each mount operation is a new thread).

@prattmic
Member

@fuweid

Is it possible to remove the M from allm after checkdead?

That is a different kind of "dead". :) I'm referring to the M structure being live from the perspective of the GC (see runtime.KeepAlive). checkdead is looking for deadlocks in the scheduler.

@prattmic
Member

prattmic commented Oct 18, 2022

@gopherbot Please backport to 1.18 and 1.19

This issue is not new but can cause random memory corruption in any program that has a goroutine exit with LockOSThread set.
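
For illustration, the pattern described here (a goroutine exiting with LockOSThread set, which makes the runtime terminate the thread via mexit) looks roughly like the hedged sketch below; this is illustrative thread churn, not the containerd code path and not a guaranteed reproducer.

package main

import (
	"runtime"
	"sync"
)

func main() {
	for i := 0; i < 10000; i++ {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			// LockOSThread with no matching UnlockOSThread: the runtime
			// throws the thread away (mexit) when this goroutine returns,
			// churning threads the way rapid mount operations would.
			runtime.LockOSThread()
		}()
		wg.Wait()
	}
}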

@gopherbot

Backport issue(s) opened: #56308 (for 1.18), #56309 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot

Change https://go.dev/cl/443815 mentions this issue: [release-branch.go1.19] runtime: always keep global reference to mp until mexit completes

@gopherbot

Change https://go.dev/cl/443816 mentions this issue: [release-branch.go1.18] runtime: always keep global reference to mp until mexit completes

@fuweid
Contributor

fuweid commented Oct 18, 2022

@prattmic

That is a different kind of "dead". :) I'm referring to the M structure being live from the perspective of the GC (see runtime.KeepAlive). checkdead is looking for deadlocks in the scheduler.

Thanks for the quick fix.

@gopherbot

Change https://go.dev/cl/444095 mentions this issue: runtime: use freeMStack named constant in assembly

gopherbot pushed a commit that referenced this issue Oct 24, 2022
[release-branch.go1.19] runtime: always keep global reference to mp until mexit completes

Ms are allocated via standard heap allocation (`new(m)`), which means we
must keep them alive (i.e., reachable by the GC) until we are completely
done using them.

Ms are primarily reachable through runtime.allm. However, runtime.mexit
drops the M from allm fairly early, long before it is done using the M
structure. If that was the last reference to the M, it is now at risk of
being freed by the GC and used for some other allocation, leading to
memory corruption.

Ms with a Go-allocated stack coincidentally already keep a reference to
the M in sched.freem, so that the stack can be freed lazily. This
reference has the side effect of keeping this Ms reachable. However, Ms
with an OS stack skip this and are at risk of corruption.

Fix this lifetime by extending sched.freem use to all Ms, with the value
of mp.freeWait determining whether the stack needs to be freed or not.

For #56243.
Fixes #56309.

Change-Id: Ic0c01684775f5646970df507111c9abaac0ba52e
Reviewed-on: https://go-review.googlesource.com/c/go/+/443716
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit e252dcf)
Reviewed-on: https://go-review.googlesource.com/c/go/+/443815
Reviewed-by: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Oct 24, 2022
[release-branch.go1.18] runtime: always keep global reference to mp until mexit completes

Ms are allocated via standard heap allocation (`new(m)`), which means we
must keep them alive (i.e., reachable by the GC) until we are completely
done using them.

Ms are primarily reachable through runtime.allm. However, runtime.mexit
drops the M from allm fairly early, long before it is done using the M
structure. If that was the last reference to the M, it is now at risk of
being freed by the GC and used for some other allocation, leading to
memory corruption.

Ms with a Go-allocated stack coincidentally already keep a reference to
the M in sched.freem, so that the stack can be freed lazily. This
reference has the side effect of keeping this Ms reachable. However, Ms
with an OS stack skip this and are at risk of corruption.

Fix this lifetime by extending sched.freem use to all Ms, with the value
of mp.freeWait determining whether the stack needs to be freed or not.

For #56243.
Fixes #56308.

Change-Id: Ic0c01684775f5646970df507111c9abaac0ba52e
Reviewed-on: https://go-review.googlesource.com/c/go/+/443716
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit e252dcf)
Reviewed-on: https://go-review.googlesource.com/c/go/+/443816
Reviewed-by: Austin Clements <austin@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Ms are allocated via standard heap allocation (`new(m)`), which means we
must keep them alive (i.e., reachable by the GC) until we are completely
done using them.

Ms are primarily reachable through runtime.allm. However, runtime.mexit
drops the M from allm fairly early, long before it is done using the M
structure. If that was the last reference to the M, it is now at risk of
being freed by the GC and used for some other allocation, leading to
memory corruption.

Ms with a Go-allocated stack coincidentally already keep a reference to
the M in sched.freem, so that the stack can be freed lazily. This
reference has the side effect of keeping this Ms reachable. However, Ms
with an OS stack skip this and are at risk of corruption.

Fix this lifetime by extending sched.freem use to all Ms, with the value
of mp.freeWait determining whether the stack needs to be freed or not.

Fixes golang#56243.

Change-Id: Ic0c01684775f5646970df507111c9abaac0ba52e
Reviewed-on: https://go-review.googlesource.com/c/go/+/443716
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
andrew-d pushed a commit to tailscale/go that referenced this issue Dec 7, 2022
[release-branch.go1.19] runtime: always keep global reference to mp until mexit completes

Ms are allocated via standard heap allocation (`new(m)`), which means we
must keep them alive (i.e., reachable by the GC) until we are completely
done using them.

Ms are primarily reachable through runtime.allm. However, runtime.mexit
drops the M from allm fairly early, long before it is done using the M
structure. If that was the last reference to the M, it is now at risk of
being freed by the GC and used for some other allocation, leading to
memory corruption.

Ms with a Go-allocated stack coincidentally already keep a reference to
the M in sched.freem, so that the stack can be freed lazily. This
reference has the side effect of keeping this Ms reachable. However, Ms
with an OS stack skip this and are at risk of corruption.

Fix this lifetime by extending sched.freem use to all Ms, with the value
of mp.freeWait determining whether the stack needs to be freed or not.

For golang#56243.
Fixes golang#56309.

Change-Id: Ic0c01684775f5646970df507111c9abaac0ba52e
Reviewed-on: https://go-review.googlesource.com/c/go/+/443716
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit e252dcf)
Reviewed-on: https://go-review.googlesource.com/c/go/+/443815
Reviewed-by: Austin Clements <austin@google.com>
@golang golang locked and limited conversation to collaborators Oct 19, 2023