runtime: support resuming a single goroutine under debuggers #25578

aclements · 2018-05-25T20:06:03Z

CL 109699 added support for debugger function call injection, but has an annoying limitation: it requires that the debugger resume the entire Go process after injecting the function call (and, to inject into a runnable but not running goroutine, it requires resuming the entire process even before injecting the call).

@heschik argued that this is a pretty bad experience. E.g., all the user wants to do is call String() to format something, and the entire process moves under them in the meantime. It's also different from what other debuggers do, which could surprise users.

This is tricky to solve. Simply resuming only the thread where the call was injected doesn't work because 1) it could easily lead to runtime-level deadlocks if any other thread is in the runtime, 2) the runtime could just switch to a different goroutine on that thread, and 3) if the GC kicks in it will try to cooperatively stop the other threads and deadlock.

I think solving this requires at least a little help from the runtime to pause all other user goroutines during the injected call. I'm not sure what exact form this should take, but I'm imagining the debugger could use call injection to first inject a runtime call to stop user goroutines, and then inject the real call.

However, even this gets tricky with non-cooperative safe points (e.g., the runtime would need to use the register maps to immediately preempt the other goroutines rather than waiting for them to reach a cooperative safe point) and with goroutines that are at unsafe points (particularly goroutines that are in the runtime). One possibility would be to have the debugger inject this "stop" call on every running thread. Using the call injection mechanism takes care of stopping at non-cooperative points, and would give the debugger the opportunity to step other goroutines past unsafe points and out of the runtime before injecting the stop. This puts some complexity into the debugger, but it should already have most of the core mechanisms necessary to do this (such as single stepping ability). Specifically, I'm picturing a protocol like:

For each thread, attempt to inject a runtime.debugStop call. Let all threads resume.
These calls will notify the debugger when the goroutine is successfully stopped, or the debug call injection will fail.
For injection that failed because the thread is in the runtime, unwind the stack and insert a breakpoint at the first return to user code. At that breakpoint attempt another debugStop. For injection that failed because the thread is at an unsafe point, single step the thread, attempting to inject debugStop at each step.
Let the remaining threads continue running. Repeat steps 2 through 4 until all threads are stopped.

This is basically a debugger-assisted non-cooperative stop-the-world. For Go 1.12, I plan to implement non-cooperative preemption directly in the runtime, which may move much of this logic into the runtime itself.

/cc @aarzilli

The text was updated successfully, but these errors were encountered:

aarzilli · 2018-05-28T06:25:25Z

For Go 1.12, I plan to implement non-cooperative preemption directly in the runtime, which may move much of this logic into the runtime itself.

That being the case, personally, I'd rather wait for 1.12's cycle for this. I think it's unlikely that we are going to finish implementing the debugger side of call injection before we go is deep in 1.12's cycle, even as it is now.

aarzilli · 2018-05-28T06:25:50Z

cc @derekparker

derekparker · 2018-05-29T16:24:36Z

That being the case, personally, I'd rather wait for 1.12's cycle for this.

Fair enough, but if we do end up making more movement on the implementation of call injection on our end reliably, I'd rather get this feature into users hands sooner rather than later, this is a particularly exciting one.

aclements · 2018-05-29T20:46:06Z

I'd rather get this feature into users hands sooner rather than later

Note that debugger function call injection is supported by the runtime now. It's just that the debugger has to let the whole process resume execution during the injected call. You probably have a better sense of the impact that has on the user experience than I do.

As a data point, GDB by default continues the entire process during things like call injection and stepping (though set scheduler-locking can turn that off on Linux at least).

gopherbot · 2018-09-11T21:44:29Z

Change https://golang.org/cl/134779 mentions this issue: runtime: support disabling goroutine scheduling by class

This adds support for disabling the scheduling of user goroutines while allowing system goroutines like the garbage collector to continue running. User goroutines pass through the usual state transitions, but if we attempt to actually schedule one, it will get put on a deferred scheduling list. Updates #26903. This is preparation for unifying STW GC and concurrent GC. Updates #25578. This same mechanism can form the basis for disabling all but a single user goroutine for the purposes of debugger function call injection. Change-Id: Ib72a808e00c25613fe6982f5528160d3de3dbbc6 Reviewed-on: https://go-review.googlesource.com/c/134779 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>

derekparker · 2019-02-27T20:33:00Z

I wanted to ping for an update on this issue as @aarzilli @heschik @hyangah and I spoke about it today during our meeting.

It seems like some of the foundation is there to prevent the runtime from scheduling goroutines that aren't already in the running state. However if I understand correctly were still a bit short on this feature. We would still need cooperation from the runtime to "unschedule" any running goroutine aside from the one Delve wants to be allowed to execute correct? As I understand it, even if we inject schedEnableUser(false) into every thread, that thread will still be _Grunning meaning it will still execute, correct? I assume the above CL prevents any non-running goroutine from being scheduled, but does not deschedule any running goroutine.

If that is the case, then what is the proposed path forward in order to finish this feature? From the original description @aclements suggested injecting a call on each thread which would stop the goroutine. From the standpoint of Delve that makes sense as it would allow us to selectively inject this call onto any thread running a goroutine besides the one we wish to continue execution.

We would really like to get this feature completed as I think that having other side effects during a user-injected function call is less than ideal. Is there any bandwidth to get this done during the 1.13 cycle?

prattmic · 2022-09-06T19:49:13Z

We have asynchronous preemption now, which makes things a bit simpler, but there are still several edge cases to consider.

We really want direct Go scheduler cooperation here. Bypassing cooperation and just continuing a single OS thread would work for a little while, but would fall apart when that thread enters the scheduler, or tries to allocate if another thread is holding a runtime lock.
The previous point likely means we want to do this by preempting all goroutines and putting them in some stopped state. However, asynchronous preemption is asynchronous (as advertised!), so doing this naively would allow a little execution skew on resume, as we try to stop all goroutines.
We need to be careful about coordinating with other scheduling events. e.g., if the resumed goroutines spawns another goroutine, that goroutine shouldn't start running. A STW should continue to work, but also not resume frozen goroutines when continued.
If the goroutine that the debugger wishes to resume isn't currently running, we need to transition it to running.

A scheme similar to what @aclements originally proposed is a good start, but there are still a lot of holes here:

Inject a debugger preempt request on all threads. On success, this puts the running goroutine in _Grunnable (or _Gdebugstop?) and then returns to the debugger.
1. This fails at unsafe points, and the debugger needs to continue execution until at a safe point.
2. In user code, the debugger could look at _PCDATA_UnsafePoint and continue / single step until the next safe point.
3. In runtime code, single stepping may be too slow, but setting a breakpoint on function return may be good enough?
  1. I am vaguely concerned we cause deadlocks here especially if one of the threads in the runtime is trying to do something like STW.
Once all goroutines are preempted, tell the runtime to freeze all execution except requested goroutines.
1. How to do this is a bit unclear. We can't inject a call anymore because all threads will be in the runtime.
Tell the runtime to run specified goroutines, which get _Grunning|Gdebugrun.
Continue execution on all threads. The runtime will only run _Grunning|Gdebugrun threads.

prattmic · 2022-09-06T19:56:05Z

I think a lot of this would be simplified if we could allow more skew. i.e., set a global debugPreempt and then periodically send sigPreempt to all threads until all goroutines stop. But this is much more asynchronous and would allow more code to run on goroutines that you want to freeze. I don't know if that would be an acceptable user experience in a debugger?

aarzilli · 2022-09-12T13:35:47Z

   2. In user code, the debugger could look at `_PCDATA_UnsafePoint` and continue / single step until the next safe point.

At the moment delve doesn't read any of the pcdata/funcdata structures of the runtime. This is nice because we don't have to worry about keeping up with changes made to them. If we decide that delve needs to start reading those we'll also have to figure out a way to at least detect that the format has changed and we need to update delve.

I think a lot of this would be simplified if we could allow more skew. i.e., set a global debugPreempt and then periodically send sigPreempt to all threads until all goroutines stop. But this is much more asynchronous and would allow more code to run on goroutines that you want to freeze. I don't know if that would be an acceptable user experience in a debugger?

Probably not. Much of the motivation behind this feature is to be used in conjunction with either call injection or next/step/stepout, if we allow other goroutines to execute a lot of code before actually stopping it wouldn't be much different from the status quo.

However, if we allow delve to read _PCDATA_UnsafePoint we could do something like the debugPreempt with minimal skew:

set debugPreempt = true, also "somehow" communicate to the runtime which goroutine(s) we want to keep running
for every thread running a goroutine that needs to be stopped:
- if it is running non-runtime code
  - if it's already at a safe point do nothing, otherwise:
  - set a breakpoint on every safepoint of the current function
  - set a breakpoint on the return address (I'm assuming the return address is always a safe point, if it isn't then a breakpoint on every safe point of the caller function)
- if it is running runtime code:
  - set a breakpoint on the first non-runtime function in the stack
resume all threads, sending a SIGURG to threads already at a safepoint
every time one of the breakpoints set above is hit resume the thread converting SIGTRAP into SIGURG
eventually every goroutine but the ones we want to run will be stopped

This should work fine with other STW in progress (according to my limited understanding of GC).
But what about windows? The asyncpreempt mechanism used on windows is incompatible with debuggers, I'm guessing there's no solution here.

prattmic · 2022-09-12T14:57:35Z

   2. In user code, the debugger could look at `_PCDATA_UnsafePoint` and continue / single step until the next safe point.
At the moment delve doesn't read any of the pcdata/funcdata structures of the runtime. This is nice because we don't have to worry about keeping up with changes made to them. If we decide that delve needs to start reading those we'll also have to figure out a way to at least detect that the format has changed and we need to update delve.

FWIW, I recently added a test to detect these failures in x/debug, which does some pcdata/funcdata parsing. If we revamped x/debug to provide more public API, then perhaps delve could use that.

prattmic · 2022-09-12T15:03:57Z

But what about windows? The asyncpreempt mechanism used on windows is incompatible with debuggers, I'm guessing there's no solution here.

I am not familiar with debugging on Windows. Am I correct that this is because we use SuspendThread, which is what I assume debuggers use as well?

If so, then in theory I think delve itself could implement preempM. Though that of course makes it very sensitive to runtime internals.

aarzilli · 2022-09-13T10:24:49Z

I am not familiar with debugging on Windows. Am I correct that this is because we use SuspendThread, which is what I assume debuggers use as well?

Yes, see: #36494

Also, I forgot to mention one thing. Because one possible use of this is in combination with call injection I think we should plan for it to play nice with it, but call injection invoves spawning a new goroutine to execute the call, which means that that goroutine should run, should this apply to all new goroutines, or no? weird.

aclements self-assigned this May 25, 2018

hyangah added the Debugging label May 30, 2018

ianlancetaylor added this to the Unplanned milestone May 31, 2018

ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 31, 2018

dlsniper mentioned this issue Mar 29, 2019

runtime: stop single goroutine #31132

Open

mangalaman93 mentioned this issue Jul 14, 2019

Support stopping a single goroutine go-delve/delve#1529

Open

rsc unassigned aclements Jun 23, 2022

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022

istvanballok mentioned this issue Apr 6, 2023

Add make gardener-debug as a skaffold-based debugging experience gardener/gardener#7755

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: support resuming a single goroutine under debuggers #25578

runtime: support resuming a single goroutine under debuggers #25578

aclements commented May 25, 2018

aarzilli commented May 28, 2018

aarzilli commented May 28, 2018

derekparker commented May 29, 2018

aclements commented May 29, 2018

gopherbot commented Sep 11, 2018

derekparker commented Feb 27, 2019

prattmic commented Sep 6, 2022 •

edited

prattmic commented Sep 6, 2022

aarzilli commented Sep 12, 2022

prattmic commented Sep 12, 2022

prattmic commented Sep 12, 2022 •

edited

aarzilli commented Sep 13, 2022

runtime: support resuming a single goroutine under debuggers #25578

runtime: support resuming a single goroutine under debuggers #25578

Comments

aclements commented May 25, 2018

aarzilli commented May 28, 2018

aarzilli commented May 28, 2018

derekparker commented May 29, 2018

aclements commented May 29, 2018

gopherbot commented Sep 11, 2018

derekparker commented Feb 27, 2019

prattmic commented Sep 6, 2022 • edited

prattmic commented Sep 6, 2022

aarzilli commented Sep 12, 2022

prattmic commented Sep 12, 2022

prattmic commented Sep 12, 2022 • edited

aarzilli commented Sep 13, 2022

prattmic commented Sep 6, 2022 •

edited

prattmic commented Sep 12, 2022 •

edited