
runtime: lock cycle between malloc and execution tracer #53979

Closed
aclements opened this issue Jul 21, 2022 · 7 comments
Labels: compiler/runtime, FrozenDueToAge, NeedsFix
Milestone: Go1.20

Comments

@aclements (Member)

What version of Go are you using (go version)?

Current HEAD (2aa473c)

Does this issue reproduce with the latest release?

Yes, though via a different code path.

What did you do?

If the execution tracer is enabled, there's a potential, though rare, deadlock via a rank cycle on mheap_.lock and trace.lock:

A: setGCPercent or setMemoryLimit acquires mheap_.lock -> gcControllerCommit -> traceHeapGoal -> traceEvent -> traceEventLocked -> traceFlush -> acquires trace.lock
B: traceFlush acquires trace.lock -> triggers stack growth -> stack allocator calls mheap.allocManual -> mheap.allocSpan -> acquires mheap_.lock

Path "A" violates the current lock ranking. I discovered this when I added a "may acquire" annotation on traceEvent. But I think path "B" may be the real problem. Because stack growth can happen while holding trace.lock, it's pretty high in the ranking (has a low rank value). But this means that anything that holds any locks further down in the ranking, like the memory allocator, can't safely create trace events.

I wonder if, like mheap.lock, we should say that trace.lock can only be acquired on the system stack so stack growth can never happen. I think that would push tracing down to the leaves of the rank graph, rather than it being smack in the middle.
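
For reference, the mheap_.lock discipline being proposed looks roughly like the following inside the runtime. This is illustrative pseudocode using the runtime-internal systemstack, lock, and unlock helpers; it does not compile outside package runtime:

```go
// Illustrative runtime-internal pseudocode, not real trace code.
// systemstack runs fn on the g0 stack, which never grows, so no
// stack allocation (and no mheap lock) can be needed while fn runs.
systemstack(func() {
	lock(&trace.lock)
	// ... flush the trace buffer ...
	unlock(&trace.lock)
})
```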

/cc @mknyszek @golang/runtime

@aclements aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Jul 21, 2022
@aclements aclements added this to the Go1.20 milestone Jul 21, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 21, 2022
@gopherbot

Change https://go.dev/cl/418716 mentions this issue: runtime: add missing trace lock edges

@gopherbot

Change https://go.dev/cl/418720 mentions this issue: runtime: add mayAcquire annotation for trace.lock

@gopherbot

Change https://go.dev/cl/418957 mentions this issue: runtime: move trace locks to the leaf of the lock graph

@gopherbot

Change https://go.dev/cl/418955 mentions this issue: runtime: don't use trace.lock for trace reader parking

@gopherbot

Change https://go.dev/cl/418956 mentions this issue: runtime: only acquire trace.lock on the system stack

gopherbot pushed a commit that referenced this issue Aug 4, 2022
We're missing lock edges to trace.lock that happen only rarely. Any
trace event can potentially fill up a trace buffer and acquire
trace.lock in order to flush the buffer, but this happens relatively
rarely, so we simply haven't seen some of these lock edges that could
happen.

With this change, we promote "fin, notifyList < traceStackTab" to
"fin, notifyList < trace" and now everything that emits trace events
with a P enters the tracer lock ranks via "trace", rather than some
things entering at "trace" and others at "traceStackTab".

This was found by inspecting the rank graph for things that didn't
make sense.

Ideally we would add a mayAcquire annotation that any trace event can
potentially acquire trace.lock, but there are actually cases that
violate this ranking right now. This is #53979. The chance of a lock
cycle is extremely low given the number of conditions that have to
happen simultaneously.

For #53789.

Change-Id: Ic65947d27dee88d2daf639b21b2c9d37552f0ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418716
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
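
For context, the runtime encodes this ranking as a partial-order table indexed by lock rank (see runtime/lockrank.go). An illustrative fragment, not the actual table:

```go
// For each rank, the ranks that may already be held when a lock of
// that rank is acquired. Promoting "fin, notifyList < traceStackTab"
// to "fin, notifyList < trace" means listing those ranks under
// lockRankTrace, so all trace-event paths enter via "trace".
var lockPartialOrder = [][]lockRank{
	lockRankTrace:         {lockRankFin, lockRankNotifyList /* ... */},
	lockRankTraceStackTab: {lockRankTrace /* ... */},
}
```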
@gopherbot

Change https://go.dev/cl/422955 mentions this issue: runtime: avoid large object stack copy in traceStackTable.dump

@gopherbot

Change https://go.dev/cl/422954 mentions this issue: runtime: write trace stack tab directly to trace buffer

gopherbot pushed a commit that referenced this issue Aug 11, 2022
Currently, the stack frame of (*traceStackTable).dump is 68KiB. We're
about to move (*traceStackTable).dump to the system stack, where we
often don't have this much room.

5140 bytes of this is an on-stack temporary buffer for constructing
potentially large trace events before copying these out to the actual
trace buffer.

Reduce the stack frame size by writing these events directly to the
trace buffer rather than to temporary space. This introduces a couple
of complications:

- The trace event starts with a varint encoding the event payload's
  length in bytes. These events are large and somewhat complicated, so
  it's hard to know the size ahead of time. That's not a problem with
  the temporary buffer because we can just construct the event and see
  how long it is. In order to support writing directly to the trace
  buffer, we reserve enough bytes for a maximum size varint and add
  support for populating a reserved space after the fact.

- Emitting a stack event calls traceFrameForPC, which can itself emit
  string events. If these were emitted in the middle of the stack
  event, it would corrupt the stream. We already allocate a []Frame to
  convert the PC slice to frames, and then convert each Frame into a
  traceFrame with trace string IDs, so we address this by combining
  these two steps into one so that all trace string events are emitted
  before we start constructing the stack event.

For #53979.

Change-Id: Ie60704be95199559c426b551f8e119b14e06ddac
Reviewed-on: https://go-review.googlesource.com/c/go/+/422954
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
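
A standalone sketch of the reserve-then-backfill varint technique from the first complication above (plain byte slices here; the runtime does the equivalent inside its trace buffers, and putUvarintFixed is an illustrative name):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUvarintFixed encodes v at buf[pos:pos+binary.MaxVarintLen64]
// using exactly MaxVarintLen64 bytes, padding with continuation
// bytes. The space can therefore be reserved before v is known and
// backfilled later, and binary.Uvarint still decodes it normally.
func putUvarintFixed(buf []byte, pos int, v uint64) {
	for i := 0; i < binary.MaxVarintLen64; i++ {
		b := byte(v & 0x7f)
		v >>= 7
		if i < binary.MaxVarintLen64-1 {
			b |= 0x80 // continuation bit keeps the decoder reading
		}
		buf[pos+i] = b
	}
}

func main() {
	buf := make([]byte, 0, 64)

	// Reserve space for the length varint before the payload exists.
	lenPos := len(buf)
	buf = append(buf, make([]byte, binary.MaxVarintLen64)...)

	// Write a payload whose size we didn't know up front.
	payload := []byte("stack event payload")
	buf = append(buf, payload...)

	// Backfill the reserved space with the now-known length.
	putUvarintFixed(buf, lenPos, uint64(len(payload)))

	n, read := binary.Uvarint(buf[lenPos:])
	fmt.Println(n, read) // 19 10
}
```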
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Following up on the previous CL, this CL removes an unnecessary stack
copy of a large object in a range loop. This drops another 64 KiB from
(*traceStackTable).dump's stack frame, so it is now roughly 80 bytes
depending on architecture, which will easily fit on the system stack.

For #53979.

Change-Id: I16f642f6f1982d0ed0a62371bf2e19379e5870eb
Reviewed-on: https://go-review.googlesource.com/c/go/+/422955
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
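
A standalone illustration of the kind of copy removed here (sizes are hypothetical; the point is that ranging by value copies each large element into the frame):

```go
package main

import "fmt"

type bigEntry struct {
	frames [8192]uintptr // large value type, 64 KiB on 64-bit
}

func sum(table []bigEntry) uintptr {
	var total uintptr

	// Ranging by value copies each 64 KiB entry into the
	// stack-allocated variable e on every iteration, bloating the frame:
	for _, e := range table {
		total += e.frames[0]
	}

	// Indexing instead reads through the slice with no large copy:
	for i := range table {
		total += table[i].frames[0]
	}
	return total
}

func main() {
	fmt.Println(sum(make([]bigEntry, 2)))
}
```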
gopherbot pushed a commit that referenced this issue Aug 11, 2022
We're about to require that all uses of trace.lock be on the system
stack. That's mostly easy, except that it involves parking the trace
reader. Fix this by changing the parking protocol so that it instead
synchronizes through an atomic.

For #53979.

Change-Id: Icd6db8678dd01094029d7ad1c612029f571b4cbb
Reviewed-on: https://go-review.googlesource.com/c/go/+/418955
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
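
A standalone sketch of the shape of such a protocol (a buffered channel stands in for the runtime's goroutine park/unpark, and all names are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// readerParked is the atomic the two sides synchronize through,
// replacing a mutex-guarded flag.
var readerParked atomic.Bool

// wake is a one-token semaphore standing in for the runtime's unpark.
var wake = make(chan struct{}, 1)

// park is called by the trace reader when it finds no work. Real code
// would re-check for work between the Store and the sleep to avoid
// missing a wakeup that raced with the Store.
func park() {
	readerParked.Store(true)
	<-wake
}

// unparkIfParked is called by whoever produces work. The CAS ensures
// exactly one waker sends the token for each park.
func unparkIfParked() {
	if readerParked.CompareAndSwap(true, false) {
		wake <- struct{}{}
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		park()
		fmt.Println("reader woken")
		close(done)
	}()
	// Wait until the reader publishes that it is parked, then wake it.
	for !readerParked.Load() {
		runtime.Gosched()
	}
	unparkIfParked()
	<-done
}
```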
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Currently, trace.lock can be acquired while on a user G and stack
splits can happen while holding trace.lock. That means every lock used
by the stack allocator must be okay to acquire while holding
trace.lock, including various locks related to span allocation. In
turn, we cannot safely emit trace events while holding any
allocation-related locks because this would cause a cycle in the lock
rank graph.

To fix this, require that trace.lock only be acquired on the system
stack, like mheap.lock. This pushes it into the "bottom half" and
eliminates the lock rank relationship between tracing and stack
allocation, making it safe to emit trace events in many more places.

One subtlety is that the trace code has race annotations and uses
maps, which have race annotations. By default, we can't have race
annotations on the system stack, so we borrow the user race context
for these situations.

We'll update the lock graph itself in the next CL.

For #53979. This CL technically fixes the problem, but the lock rank
checker doesn't know that yet.

Change-Id: I9f5187a9c52a67bee4f7064db124b1ad53e5178f
Reviewed-on: https://go-review.googlesource.com/c/go/+/418956
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Now that we've moved the trace locks to the leaf of the lock graph, we
can safely annotate that any trace event may acquire trace.lock even
if dynamically it turns out a particular event doesn't need to flush
and acquire this lock.

This reveals a new edge where we can trace while holding the mheap
lock, so we add this to the lock graph.

For #53789.
Updates #53979.

Change-Id: I13e2f6cd1b621cca4bed0cc13ef12e64d05c89a7
Reviewed-on: https://go-review.googlesource.com/c/go/+/418720
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
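
The annotation referred to above is the lock rank checker's "may acquire" assertion. Schematically (illustrative pseudocode built on the runtime-internal lockWithRankMayAcquire helper, which records that a lock could be taken at this point even when it isn't):

```go
// Illustrative runtime-internal pseudocode. Recording up front that
// any trace event might need trace.lock makes the rank checker flag
// would-be cycles on every event, not only on the rare flush paths
// that dynamically take the lock.
func traceEvent( /* ... */ ) {
	lockWithRankMayAcquire(&trace.lock, lockRankTrace)
	// ... write the event; trace.lock is only actually acquired
	// if the buffer fills and must be flushed ...
}
```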
@golang golang locked and limited conversation to collaborators Aug 11, 2023