runtime: "runtime·lock: lock count" fatal error when cgo is enabled #56243
Comments
I am also trying to reproduce this in GitHub Actions.
I add |
@golang/runtime this looks like "fun", and a nice start on the debugging already. |
Thanks for the detailed analysis. I think there are two things going on here since you don't see evidence of memory corruption at this point:
|
I have been able to get the test to crash fairly consistently on a single-core VM, or when running it under |
Just to clarify, you mean |
Yes, |
I can reproduce this very consistently.
|
The immediate problem here is that two threads (35 and 38) are sharing the same M structure:
This easily explains why the lock count could get out of sync, as well as other oddities, like the malloc deadlock and "panic during panic" across multiple threads. I'm not sure how two threads end up with the same M. Edit: this doesn't actually seem to be true at the moment we detect a bad lock value, but may happen later? However, I do often see concurrent clone calls at the throw. |
The M structure is allocated on the heap just like any typical object. I believe the problem is that the M is "dying" too early, allowing the GC to free it and potentially reallocate it for something else. Ms are all reachable by the GC via allm. This isn't the patch we'd want, but the following diff keeps the Ms alive forever, and I cannot reproduce failures with it applied:
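The diff itself is not included in this capture. As a rough illustration of the same idea, any heap object reachable from a package-level variable can never be freed or reused by the GC. A minimal user-level sketch, not the runtime's real code (the m stand-in type and keepForever slice are made up for illustration):

```go
package main

import "fmt"

// m is a stand-in for the runtime's M structure; it is an ordinary heap object.
type m struct{ id int64 }

// keepForever is a global GC root: anything appended here stays reachable,
// so the GC can never free its memory or hand it out to another allocation.
var keepForever []*m

func newM(id int64) *m {
	mp := &m{id: id}
	keepForever = append(keepForever, mp) // deliberately leak for the experiment
	return mp
}

func main() {
	mp := newM(1)
	fmt.Println("allocated M", mp.id)
}
```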
|
Nice find. But why doesn't the call to |
I need to double check my assumption, but I don’t think the stack used in mexit is reachable for scanning except via m.g0.stack (and m isn’t reachable because it isn’t in allm). It is an OS stack, so I’m not sure where else it would be found. Since the stack isn’t reachable, keeping values alive on the stack would have no effect. Edit: perhaps it should still be reachable via allgs? |
Is it possible to remove the M from allm after checkdead? |
Change https://go.dev/cl/443716 mentions this issue: |
https://go.dev/cl/443716 should fix this. This is not a new bug, but the containerd PR seems to trigger it so well because it creates and exits threads very quickly (each mount operation is a new thread). |
That is a different kind of "dead". :) I'm referring to the M structure being live from the perspective of the GC (see runtime.KeepAlive). |
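For readers unfamiliar with it, runtime.KeepAlive marks a value as still in use up to the point of the call, so the GC cannot treat it as dead earlier. A small self-contained example of that idea (the fileHandle type and rawFD field are made up for illustration):

```go
package main

import (
	"fmt"
	"runtime"
)

// fileHandle wraps a raw descriptor and releases it via a finalizer.
type fileHandle struct{ rawFD int }

func main() {
	h := &fileHandle{rawFD: 3}
	runtime.SetFinalizer(h, func(fh *fileHandle) {
		fmt.Println("finalizer: closing fd", fh.rawFD)
	})

	fd := h.rawFD // after this line, nothing else reads h itself

	// Without the KeepAlive below, the GC would be allowed to run h's
	// finalizer here, even though we are still about to use fd.
	fmt.Println("using fd", fd)

	// Marks h as live up to this point, so the finalizer cannot run earlier.
	runtime.KeepAlive(h)
}
```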
@gopherbot Please backport to 1.18 and 1.19. This issue is not new, but it can cause random memory corruption in any program that has a goroutine exit with LockOSThread set. |
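As context for that note, a hedged sketch of the general pattern that exercises this path (not the containerd test itself, and not guaranteed to reproduce the crash): a goroutine that calls runtime.LockOSThread and returns without unlocking forces its OS thread to exit, so a tight loop of these churns through thread creation and mexit.

```go
package main

import (
	"runtime"
	"sync"
)

func main() {
	for i := 0; i < 100000; i++ {
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Returning while still locked to the OS thread makes the
			// runtime exit this thread (mexit) instead of reusing it.
			runtime.LockOSThread()
		}()
		wg.Wait()
	}
}
```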
Backport issue(s) opened: #56308 (for 1.18), #56309 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/443815 mentions this issue: |
Change https://go.dev/cl/443816 mentions this issue: |
Thanks for the quick fix. |
Change https://go.dev/cl/444095 mentions this issue: |
The fix landed as CL 443716 (commit e252dcf), "…ntil mexit completes", and was cherry-picked to the release branches as CL 443815 (for 1.19) and CL 443816 (for 1.18). The commit message reads:

Ms are allocated via standard heap allocation (`new(m)`), which means we must keep them alive (i.e., reachable by the GC) until we are completely done using them.

Ms are primarily reachable through runtime.allm. However, runtime.mexit drops the M from allm fairly early, long before it is done using the M structure. If that was the last reference to the M, it is now at risk of being freed by the GC and used for some other allocation, leading to memory corruption.

Ms with a Go-allocated stack coincidentally already keep a reference to the M in sched.freem, so that the stack can be freed lazily. This reference has the side effect of keeping these Ms reachable. However, Ms with an OS stack skip this and are at risk of corruption.

Fix this lifetime by extending sched.freem use to all Ms, with the value of mp.freeWait determining whether the stack needs to be freed or not.

Fixes #56243 on master; the cherry-picks carry "Fixes #56308" (1.18) and "Fixes #56309" (1.19).

Change-Id: Ic0c01684775f5646970df507111c9abaac0ba52e
Reviewed-on: https://go-review.googlesource.com/c/go/+/443716 (cherry-picks: /443815, /443816)
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by (cherry-picks): Austin Clements <austin@google.com>
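A simplified, self-contained sketch of the shape of that fix (stand-in types, and the freeMRef/freeMStack names are assumptions for illustration, not the literal CL): every exiting M is linked onto a scheduler-owned free list so the GC always sees it, and a per-M flag records whether its g0 stack still needs to be freed.

```go
package main

// m is a stand-in for the runtime's M structure.
type m struct {
	freeWait uint32 // what to do with this M once it is safe to release it
	freelink *m     // next M on sched.freem
}

const (
	freeMRef   uint32 = iota // M has an OS stack: only keep the M reachable
	freeMStack               // M has a Go-allocated g0 stack: free it lazily
)

type scheduler struct{ freem *m }

var sched scheduler

// retireM stands in for the tail of runtime.mexit: instead of dropping the
// last reference to mp, always push it onto sched.freem (locking omitted).
func retireM(mp *m, osStack bool) {
	if osStack {
		mp.freeWait = freeMRef
	} else {
		mp.freeWait = freeMStack
	}
	mp.freelink = sched.freem
	sched.freem = mp // this reference keeps mp reachable until it is truly done
}

func main() {
	retireM(&m{}, true)  // OS-stack M: kept alive, nothing to free later
	retireM(&m{}, false) // Go-stack M: kept alive, stack freed lazily
}
```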
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?

Ran tests for containerd/containerd#7513
(Unfortunately, root is required as the test issues many mount syscalls. I have not had success creating a more minimal reproducer, but this test case takes only a few seconds to run to completion and reproduces the runtime errors fairly reliably.)

What did you expect to see?

The test either passes or fails.

What did you see instead?
fatal error: runtime·lock: lock count followed by hundreds (thousands?) of lines of fatal error: runtime·unlock: lock count. Sometimes these are followed by other runtime errors, such as:

- fatal: morestack on g0
- fatal: systemstack called from unexpected goroutine
- Trace/breakpoint trap
- An "impossible" segfault in perfectly ordinary Go code:
  https://github.com/containerd/continuity/blob/5ad51c7aca47b8e742f5e6e7dc841d50f5f6affd/resource.go#L270
  A slice with length > 0 somehow had a nil data pointer... or rcx got clobbered in the middle of the function. No unsafe type-punning is used to construct the slice and go test -race does not complain.
- fatal error: malloc deadlock / panic during panic, followed by what appeared to be two interleaved stack dumps:
[The two interleaved stack dumps are garbled in this capture. One traceback belongs to the throwing thread, which loops through runtime.throw.func1 (panic.go:1044), runtime.throw (panic.go:1043), runtime.unlock2 (lock_futex.go:127), runtime.unlockWithRank (lockrank_off.go:32), runtime.unlock (lock_futex.go:112), and runtime.printunlock (print.go:80) before reaching runtime.fatalthrow, suggesting that printing the error itself keeps tripping the unlock check. The other traceback is an ordinary test goroutine, created by testing.(*T).Run (testing.go:1493) and running under testing.tRunner (testing.go:1446), inside github.com/containerd/containerd/snapshots/testsuite.check128LayersMount.func1 (testsuite.go:942) via makeTest.func1 (testsuite.go:117), calling github.com/containerd/continuity/fs/fstest.CheckDirectoryEqual (compare.go:44), continuity.BuildManifest (manifest.go:85), (*context).Walk (context.go:594), pathdriver.(*pathDriver).Walk (path_driver.go:88), and path/filepath.Walk/walk (path.go:520, 457, 433).]
The crashes also consistently occur on GitHub Actions CI runners, which rules out hardware as a candidate.
Compiling with Cgo is a necessary condition to reproduce the issue. There is no user Cgo code in the built test binary, only runtime and std.
I could not reproduce the issue on a pure-Go build.
I loaded up some core dumps into gdb and noticed a consistent pattern to the state of the process at the time of the crash. There was always a freshly clone3()'d child thread that had not yet executed a single instruction (pc pointed to the instruction following the syscall, rax = 0, and rsp was set to exactly .stack + .stack_size of the clone_args struct pointed to by rdi). I saw no evidence suggesting heap corruption when examining the core dumps. I learned that curg().m.locks was always set to -1 when the fatal runtime.lock call was made. On a hunch I patched one of the few unguarded and unbalanced decrements of an m.locks, runtime.releasem(), and was able to get clean stack traces without the recursive panicking.
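The reporter's actual patch is not included in this capture. A hypothetical guard of the same flavor, as a self-contained sketch (the names here are made up; the point is that the runtime throws "runtime·lock: lock count" / "runtime·unlock: lock count" once the counter has already gone negative, so failing fast at the unbalanced decrement yields one clean traceback instead of a cascade):

```go
package main

import "fmt"

// m is a stand-in for the runtime's M; locks counts nested runtime locks.
type m struct{ locks int32 }

// releasemGuarded decrements the lock count but fails fast, once, instead of
// letting the counter go negative and triggering repeated throws while the
// error-printing path itself takes and releases more locks.
func releasemGuarded(mp *m) {
	if mp.locks <= 0 {
		panic(fmt.Sprintf("releasem: locks already %d", mp.locks))
	}
	mp.locks--
}

func main() {
	mp := &m{locks: 1}
	releasemGuarded(mp) // balanced release: fine
	// releasemGuarded(mp) // an unbalanced release would now panic cleanly
}
```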
Every core dump I examined has the same traceback in the crashing thread. It's always a pthreads thread in the process of cleaning up and exiting, calling releasem() while its curg().m.locks == -1.

The garbage collector also appears to be necessary to cause crashes. Setting GOGC=0 produces more reliable crashes, while I have yet to get a crash with GOGC=off. There seems to be some aspect of timing as well: turning the test verbosity on or off affects the probability of a crash, and I have yet to get a crash when running a race-enabled build or under strace.

(cc @cpuguy83)