
runtime: lock cycle between malloc and execution tracer #53979

Closed
aclements opened this issue Jul 21, 2022 · 7 comments
Labels: compiler/runtime, FrozenDueToAge, NeedsFix
Milestone: Go1.20

Comments

@aclements (Member)

What version of Go are you using (go version)?

Current HEAD (2aa473c)

Does this issue reproduce with the latest release?

Yes, though via a different code path.

What did you do?

If the execution tracer is enabled, there's a potential, though rare, deadlock via a rank cycle on mheap_.lock and trace.lock:

A: setGCPercent or setMemoryLimit acquires mheap_.lock -> gcControllerCommit -> traceHeapGoal -> traceEvent -> traceEventLocked -> traceFlush -> acquires trace.lock
B: traceFlush acquires trace.lock -> triggers stack growth -> stack allocator calls mheap.allocManual -> mheap.allocSpan -> acquires mheap_.lock

Path "A" violates the current lock ranking. I discovered this when I added a "may acquire" annotation on traceEvent. But I think path "B" may be the real problem. Because stack growth can happen while holding trace.lock, it's pretty high in the ranking (has a low rank value). But this means that anything that holds any locks further down in the ranking, like the memory allocator, can't safely create trace events.

I wonder if, like mheap.lock, we should say that trace.lock can only be acquired on the system stack so stack growth can never happen. I think that would push tracing down to the leaves of the rank graph, rather than it being smack in the middle.
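
For reference, the mheap_.lock discipline being proposed looks roughly like the following inside the runtime. This is illustrative pseudocode using the runtime-internal systemstack, lock, and unlock helpers; it does not compile outside package runtime:

```go
// Illustrative runtime-internal pseudocode, not real trace code.
// systemstack runs fn on the g0 stack, which never grows, so no
// stack allocation (and no mheap lock) can be needed while fn runs.
systemstack(func() {
	lock(&trace.lock)
	// ... flush the trace buffer ...
	unlock(&trace.lock)
})
```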

/cc @mknyszek @golang/runtime

@aclements aclements added the NeedsFix The path to resolution is known, but the work has not been done. label Jul 21, 2022
@aclements aclements added this to the Go1.20 milestone Jul 21, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 21, 2022
@gopherbot

Change https://go.dev/cl/418716 mentions this issue: runtime: add missing trace lock edges

@gopherbot

Change https://go.dev/cl/418720 mentions this issue: runtime: add mayAcquire annotation for trace.lock

@gopherbot

Change https://go.dev/cl/418957 mentions this issue: runtime: move trace locks to the leaf of the lock graph

@gopherbot

Change https://go.dev/cl/418955 mentions this issue: runtime: don't use trace.lock for trace reader parking

@gopherbot

Change https://go.dev/cl/418956 mentions this issue: runtime: only acquire trace.lock on the system stack

gopherbot pushed a commit that referenced this issue Aug 4, 2022
We're missing lock edges to trace.lock that happen only rarely. Any
trace event can potentially fill up a trace buffer and acquire
trace.lock in order to flush the buffer, but this happens relatively
rarely, so we simply haven't seen some of these lock edges that could
happen.

With this change, we promote "fin, notifyList < traceStackTab" to
"fin, notifyList < trace" and now everything that emits trace events
with a P enters the tracer lock ranks via "trace", rather than some
things entering at "trace" and others at "traceStackTab".

This was found by inspecting the rank graph for things that didn't
make sense.

Ideally we would add a mayAcquire annotation that any trace event can
potentially acquire trace.lock, but there are actually cases that
violate this ranking right now. This is #53979. The chance of a lock
cycle is extremely low given the number of conditions that have to
happen simultaneously.

For #53789.

Change-Id: Ic65947d27dee88d2daf639b21b2c9d37552f0ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418716
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
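
For context, the runtime encodes this ranking as a partial-order table indexed by lock rank (see runtime/lockrank.go). An illustrative fragment, not the actual table:

```go
// For each rank, the ranks that may already be held when a lock of
// that rank is acquired. Promoting "fin, notifyList < traceStackTab"
// to "fin, notifyList < trace" means listing those ranks under
// lockRankTrace, so all trace-event paths enter via "trace".
var lockPartialOrder = [][]lockRank{
	lockRankTrace:         {lockRankFin, lockRankNotifyList /* ... */},
	lockRankTraceStackTab: {lockRankTrace /* ... */},
}
```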
@gopherbot

Change https://go.dev/cl/422955 mentions this issue: runtime: avoid large object stack copy in traceStackTable.dump

@gopherbot

Change https://go.dev/cl/422954 mentions this issue: runtime: write trace stack tab directly to trace buffer

gopherbot pushed a commit that referenced this issue Aug 11, 2022
Currently, the stack frame of (*traceStackTable).dump is 68KiB. We're
about to move (*traceStackTable).dump to the system stack, where we
often don't have this much room.

5140 bytes of this is an on-stack temporary buffer for constructing
potentially large trace events before copying these out to the actual
trace buffer.

Reduce the stack frame size by writing these events directly to the
trace buffer rather than to temporary space. This introduces a couple
of complications:

- The trace event starts with a varint encoding the event payload's
  length in bytes. These events are large and somewhat complicated, so
  it's hard to know the size ahead of time. That's not a problem with
  the temporary buffer because we can just construct the event and see
  how long it is. In order to support writing directly to the trace
  buffer, we reserve enough bytes for a maximum size varint and add
  support for populating a reserved space after the fact.

- Emitting a stack event calls traceFrameForPC, which can itself emit
  string events. If these were emitted in the middle of the stack
  event, it would corrupt the stream. We already allocate a []Frame to
  convert the PC slice to frames, and then convert each Frame into a
  traceFrame with trace string IDs, so we address this by combining
  these two steps into one so that all trace string events are emitted
  before we start constructing the stack event.

For #53979.

Change-Id: Ie60704be95199559c426b551f8e119b14e06ddac
Reviewed-on: https://go-review.googlesource.com/c/go/+/422954
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
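
A standalone sketch of the reserve-then-backfill varint technique from the first complication above (plain byte slices here; the runtime does the equivalent inside its trace buffers, and putUvarintFixed is an illustrative name):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUvarintFixed encodes v at buf[pos:pos+binary.MaxVarintLen64]
// using exactly MaxVarintLen64 bytes, padding with continuation
// bytes. The space can therefore be reserved before v is known and
// backfilled later, and binary.Uvarint still decodes it normally.
func putUvarintFixed(buf []byte, pos int, v uint64) {
	for i := 0; i < binary.MaxVarintLen64; i++ {
		b := byte(v & 0x7f)
		v >>= 7
		if i < binary.MaxVarintLen64-1 {
			b |= 0x80 // continuation bit keeps the decoder reading
		}
		buf[pos+i] = b
	}
}

func main() {
	buf := make([]byte, 0, 64)

	// Reserve space for the length varint before the payload exists.
	lenPos := len(buf)
	buf = append(buf, make([]byte, binary.MaxVarintLen64)...)

	// Write a payload whose size we didn't know up front.
	payload := []byte("stack event payload")
	buf = append(buf, payload...)

	// Backfill the reserved space with the now-known length.
	putUvarintFixed(buf, lenPos, uint64(len(payload)))

	n, read := binary.Uvarint(buf[lenPos:])
	fmt.Println(n, read) // 19 10
}
```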
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Following up on the previous CL, this CL removes an unnecessary stack
copy of a large object in a range loop. This drops another 64 KiB from
(*traceStackTable).dump's stack frame, so it is now roughly 80 bytes
depending on architecture, which will easily fit on the system stack.

For #53979.

Change-Id: I16f642f6f1982d0ed0a62371bf2e19379e5870eb
Reviewed-on: https://go-review.googlesource.com/c/go/+/422955
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
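
A standalone illustration of the kind of copy removed here (sizes are hypothetical; the point is that ranging by value copies each large element into the frame):

```go
package main

import "fmt"

type bigEntry struct {
	frames [8192]uintptr // large value type, 64 KiB on 64-bit
}

func sum(table []bigEntry) uintptr {
	var total uintptr

	// Ranging by value copies each 64 KiB entry into the
	// stack-allocated variable e on every iteration, bloating the frame:
	for _, e := range table {
		total += e.frames[0]
	}

	// Indexing instead reads through the slice with no large copy:
	for i := range table {
		total += table[i].frames[0]
	}
	return total
}

func main() {
	fmt.Println(sum(make([]bigEntry, 2)))
}
```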
gopherbot pushed a commit that referenced this issue Aug 11, 2022
We're about to require that all uses of trace.lock be on the system
stack. That's mostly easy, except that it involves parking the trace
reader. Fix this by changing the parking protocol so that it instead
synchronizes through an atomic.

For #53979.

Change-Id: Icd6db8678dd01094029d7ad1c612029f571b4cbb
Reviewed-on: https://go-review.googlesource.com/c/go/+/418955
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
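
A standalone sketch of the shape of such a protocol (a buffered channel stands in for the runtime's goroutine park/unpark, and all names are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// readerParked is the atomic the two sides synchronize through,
// replacing a mutex-guarded flag.
var readerParked atomic.Bool

// wake is a one-token semaphore standing in for the runtime's unpark.
var wake = make(chan struct{}, 1)

// park is called by the trace reader when it finds no work. Real code
// would re-check for work between the Store and the sleep to avoid
// missing a wakeup that raced with the Store.
func park() {
	readerParked.Store(true)
	<-wake
}

// unparkIfParked is called by whoever produces work. The CAS ensures
// exactly one waker sends the token for each park.
func unparkIfParked() {
	if readerParked.CompareAndSwap(true, false) {
		wake <- struct{}{}
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		park()
		fmt.Println("reader woken")
		close(done)
	}()
	// Wait until the reader publishes that it is parked, then wake it.
	for !readerParked.Load() {
		runtime.Gosched()
	}
	unparkIfParked()
	<-done
}
```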
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Currently, trace.lock can be acquired while on a user G and stack
splits can happen while holding trace.lock. That means every lock used
by the stack allocator must be okay to acquire while holding
trace.lock, including various locks related to span allocation. In
turn, we cannot safely emit trace events while holding any
allocation-related locks because this would cause a cycle in the lock
rank graph.

To fix this, require that trace.lock only be acquired on the system
stack, like mheap.lock. This pushes it into the "bottom half" and
eliminates the lock rank relationship between tracing and stack
allocation, making it safe to emit trace events in many more places.

One subtlety is that the trace code has race annotations and uses
maps, which have race annotations. By default, we can't have race
annotations on the system stack, so we borrow the user race context
for these situations.

We'll update the lock graph itself in the next CL.

For #53979. This CL technically fixes the problem, but the lock rank
checker doesn't know that yet.

Change-Id: I9f5187a9c52a67bee4f7064db124b1ad53e5178f
Reviewed-on: https://go-review.googlesource.com/c/go/+/418956
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Now that we've moved the trace locks to the leaf of the lock graph, we
can safely annotate that any trace event may acquire trace.lock even
if dynamically it turns out a particular event doesn't need to flush
and acquire this lock.

This reveals a new edge where we can trace while holding the mheap
lock, so we add this to the lock graph.

For #53789.
Updates #53979.

Change-Id: I13e2f6cd1b621cca4bed0cc13ef12e64d05c89a7
Reviewed-on: https://go-review.googlesource.com/c/go/+/418720
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
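
The annotation referred to above is the lock rank checker's "may acquire" assertion. Schematically (illustrative pseudocode built on the runtime-internal lockWithRankMayAcquire helper, which records that a lock could be taken at this point even when it isn't):

```go
// Illustrative runtime-internal pseudocode. Recording up front that
// any trace event might need trace.lock makes the rank checker flag
// would-be cycles on every event, not only on the rare flush paths
// that dynamically take the lock.
func traceEvent( /* ... */ ) {
	lockWithRankMayAcquire(&trace.lock, lockRankTrace)
	// ... write the event; trace.lock is only actually acquired
	// if the buffer fills and must be flushed ...
}
```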
@golang golang locked and limited conversation to collaborators Aug 11, 2023