
runtime: eliminate the notion of a "syscall state" #58492

Open
CannibalVox opened this issue Feb 13, 2023 · 18 comments

@CannibalVox commented Feb 13, 2023

Abstract

Prevent longer-than-microsecond syscalls from causing excessive context-switch churn by eliminating the syscall state altogether. Goroutines will no longer enter a special syscall state when making syscalls or cgo calls. Instead, the syscall will be executed by a separate syscall thread while the original goroutine sits in an ordinary parked state. All Go Ms will now consist of a primary thread and a syscall thread.

Background

There are several ongoing issues with scheduler performance related to decisions to scale up or down the number of OS threads (Ms) used for executing goroutines. In #54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, that issue identifies that the largest extant performance problems in the scheduler today are related to unnecessary thread creation and destruction. However, spinning up threads as a result of syscalls can have even more serious performance implications than those identified there:

  • Moving the P the syscall originally came from to a different M can cause a context switch.
  • If there is not enough work to sustain an additional P when the syscall returns (which is almost certainly the case), then the scheduler is extremely likely to return Go to the state it was in before the offending syscall. This means spinning down the M that performed the syscall, in addition to any intermediate states that might be traversed before Go eventually decides to return to a single P.
  • If longer syscalls are made repeatedly in a Go program, a very high percentage of system CPU can end up dedicated to context switches and thread orchestration.

This usage pattern was recently revealed to be an issue in #58336, in which it appears that Windows network calls via WSARecv/WSASend are blocking rather than nonblocking. A simple Go network proxy running on Windows will perform thousands of context switches per second due to long calls repeatedly changing which M the program's two Gs run on. It does not do this on other operating systems, where those network calls are nonblocking, which allows the G to return to the P it came from without a new M being provisioned.

Generally speaking, spinning up a new thread for the syscall state is always a problem; the Go team has previously chosen to address it by exempting short stints in the syscall state from this behavior. By doing so, they have separated syscall behavior into three classes:

  • In the most common case, nonblocking syscalls return within nanoseconds and cause no problems, because no new threads are created.
  • In the second-most-common case, blocking syscalls are made rarely and the unnecessary context switch goes mostly unnoticed.
  • In the last case, frequent blocking syscalls, Go performance becomes untenable.

Proposal

I propose that every M be created with two threads instead of one: a thread for executing Go code and a thread for executing syscalls. When a goroutine attempts to execute a syscall, the call will be carried out on the syscall thread while the original goroutine stays in a completely ordinary parked state. Other goroutines that attempt to carry out syscalls during this time will park while waiting for the syscall thread to become available. Additionally, if there are other Ps whose syscall threads have less traffic, they could choose to steal Gs that have syscall work.
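
To make the shape concrete, here is a minimal sketch of the dispatch model. All names (syscallRequest, syscallThread, doSyscall) are hypothetical, and it uses channels where a real implementation would use the runtime's own parking primitives:

```go
// Minimal sketch of the per-M syscall-thread model described above.
// All names are hypothetical; a real implementation would park the
// goroutine with the runtime's internal primitives, not channels.
package schedsketch

type syscallRequest struct {
	fn   func() uintptr // the syscall to execute
	done chan uintptr   // result delivery; receiving is the "park"
}

// syscallThread is the loop run by each M's second OS thread. Long
// syscalls execute here without holding a P.
func syscallThread(work chan syscallRequest) {
	for req := range work {
		req.done <- req.fn()
	}
}

// doSyscall is what entersyscall would become: hand the call to the
// syscall thread and wait in an ordinary parked state. Goroutines that
// arrive while the thread is busy queue up behind the channel send.
func doSyscall(work chan syscallRequest, fn func() uintptr) uintptr {
	req := syscallRequest{fn: fn, done: make(chan uintptr)}
	work <- req
	return <-req.done
}
```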

This will ensure that while longer syscalls will occupy shared syscall resources, which may become saturated, they will not cause M flapping or context switching. In an advanced case, syscall thread contention could be used as a metric for P scaling, and that would be much easier to measure and respond to than the situation right now, where long syscalls spin up additional Ms that don’t easily fit into the existing scheduler architecture and must be dealt with after the fact.

Rationale

The biggest problem with syscall and cgo performance today is that the threads created by long syscalls have no place within the Go scheduler's understanding of itself. The scheduler has a very tightly tuned understanding of how many Ms should be running, and there is no way for it to respond appropriately to a new M suddenly being dumped into its midst, which is what long syscalls do.

Additionally, while moving the P to a new M after a syscall passes the threshold allows the 90% case to perform very well, it also guarantees a context switch in the 10% case, which is often unacceptable. To have a guaranteed context-switch-free path, syscalls must be handled without pulling the existing M away from the P, and that means there must be some sort of dedicated thread for syscalls, somewhere.

Alternatives

Also considered was the idea of a thread pool that lives outside of the M/P/G scheduler architecture and is used to process syscalls. The pool would consist of a stack of threads, scaling between 1 and GOMAXPROCS, and a queue of syscall requests. New threads would be added when wait times on the queue passed a certain threshold, and threads would be removed on the garbage-collector cadence in the same way sync.Pool entries are, using a victim list to retire unused threads and eventually spin them down.

While idle threads would make up a much smaller share of total program resources, and this design is more flexible under syscall contention, it would require much more complicated orchestration. It also has a problem with OS-locked threads, since there is no way to guarantee that the same thread services syscalls for a particular P. That problem could be solved by having syscalls on OS-locked threads execute inline instead of via the pool (OS-locked threads technically never needed the syscall state, since there are no other waiting Gs while a locked goroutine is running a syscall), but this would require a much larger scope of changes within the scheduler.
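
For illustration, a rough sketch of that pooled shape, with the queue-wait scaling heuristic and GC-cadence victim list elided (all names hypothetical):

```go
// Rough sketch of the pooled alternative: a queue of syscall requests
// serviced by between 1 and GOMAXPROCS dedicated OS threads. The
// queue-wait-time scaling heuristic and victim-list retirement are
// elided; all names are hypothetical.
package schedsketch

import (
	"runtime"
	"sync/atomic"
)

type syscallPool struct {
	queue   chan func()  // pending syscall requests
	workers atomic.Int32 // current worker-thread count
}

func newSyscallPool() *syscallPool {
	p := &syscallPool{queue: make(chan func(), 128)}
	p.spawn() // start with a single worker thread
	return p
}

// spawn adds a worker thread, up to a cap of GOMAXPROCS.
func (p *syscallPool) spawn() {
	if int(p.workers.Add(1)) > runtime.GOMAXPROCS(0) {
		p.workers.Add(-1)
		return
	}
	go func() {
		runtime.LockOSThread() // dedicate a real OS thread to syscalls
		for fn := range p.queue {
			fn()
		}
	}()
}

// submit enqueues a syscall; if the queue is backed up, it grows the pool.
func (p *syscallPool) submit(fn func()) {
	select {
	case p.queue <- fn:
	default:
		p.spawn()     // contention: add a worker (no-op at the cap)
		p.queue <- fn // then block until a worker is free
	}
}
```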

Another alternative would be to tune the scheduler to prefer to place goroutines that have recently made long-running syscalls onto their own P and avoid spinning it down until some time has passed since the last long syscall. We would then choose not to create a new M during long syscalls in cases where the origin P has no additional Gs to serve, even if the syscall extended past the threshold. This has the following downsides:

  • Today, programs have GOMAXPROCS Ps available for running Go at all times, because long syscalls are removed from the scheduler while they run. This change would remove available Ps equal to the number of goroutines that interact with long syscalls. If there are more such goroutines than GOMAXPROCS, we would be back where we started in terms of context switching and M thrashing.
  • It seems to me that "time since last long syscall" is not a very good measure of long-syscall contention or throughput, and running long syscalls on a dedicated resource would make it easier to measure whether there is contention and whether it can be reduced by Ps stealing work.

Compatibility

Because this is a change to an internal system, it would not cause language compatibility issues. Additionally, while performance characteristics would change slightly for large programs without long-running syscalls (which is most Go programs), adding even a few dozen idle threads would not make a measurable difference in Go performance. On the other hand, an entire class of Go applications would suddenly perform much better, including network-heavy applications on Windows.

Late edit: It just occurred to me that another class of Go programs would perform much worse unless #21827 is addressed: parking goroutines on OS-locked threads tends to create context switches itself. Alternatively, the very inflammatory title of this issue could be changed, and the syscall state could be kept to indicate "I am currently waiting on the syscall thread to work".

@gopherbot gopherbot added this to the Proposal milestone Feb 13, 2023
@seankhliao seankhliao changed the title from "proposal: runtime: Eliminate The Syscall State" to "proposal: runtime: eliminate the syscall state" Feb 13, 2023
@seankhliao (Member)

cc @golang/runtime

@mknyszek (Contributor)

Moving this out of proposal. (In the past we have phrased these kinds of internal changes as proposals but I think we've stopped doing that as the proposal process became more of an actual process. And given that all the changes here would be internal, I don't see a reason as to why this needs to go through the proposal review process. This is more about the merits of the implementation anyway.)

@mknyszek mknyszek modified the milestones: Proposal, Unplanned Feb 13, 2023
@mknyszek mknyszek added the compiler/runtime label and removed the Proposal label Feb 13, 2023
@mknyszek mknyszek changed the title from "proposal: runtime: eliminate the syscall state" to "runtime: eliminate the notion of a 'syscall state'" Feb 13, 2023
@prattmic (Member) commented Feb 13, 2023

In #54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, that issue identifies that the largest extant performance problems in the scheduler today are related to unnecessary thread creation and destruction.

For clarification, the issue is not unnecessary thread creation and destruction, but unnecessary thread wake and sleep. Most programs reach a steady state of thread count fairly quickly (we ~never destroy threads). It is the wakeup of an idle thread and subsequent sleep when that thread has nothing else to do that is expensive.

IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.

In #54622, thread sleep is particularly expensive because the Go runtime does so much work trying to find something to do prior to sleep. This proposal wouldn't have that problem; the conditions for the syscall thread to sleep would be much simpler. But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.

@prattmic (Member)

Compatibility

Users often get/set various bits of thread-specific state via syscall.Syscall, and having those calls execute on a different thread would break those use cases.

That said, the scheduler can migrate goroutines between threads at any time, so I think we could argue this only matters for goroutines that called runtime.LockOSThread. Those would need to make syscalls directly on the calling thread.
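
To make the thread-specific-state concern concrete, a Linux-only sketch: gettid(2) reports the calling thread's ID, so executing it on a separate syscall thread would observe the wrong thread:

```go
// Linux-only illustration of thread-affine syscalls. gettid(2) returns
// the ID of the thread that makes the call, so running it on a separate
// syscall thread would report that thread, not the goroutine's own.
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

func main() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Only meaningful if executed on this goroutine's locked thread.
	fmt.Println("locked goroutine is on thread", syscall.Gettid())
}
```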

@CannibalVox (Author)

IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed

Users often get/set various bits of thread-specific state via syscall.Syscall and having them fetch from a different thread would break those use cases.

I would expect syscall.Syscall to be executed on the syscall thread, so for OS-locked threads, thread context would all be present in the same place: the syscall thread.

@CannibalVox (Author)

But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.

Additionally, not to put too fine a point on it, but this is already the plan for syscalls that take longer than a microsecond.

@prattmic (Member)

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed

Could you be more specific about what you mean here? The main options I can think of here are:

  1. Busy loop
  2. Busy loop with PAUSE instruction
  3. Loop calling sched_yield (or equivalent syscall)
  4. Block in futex (or other wake-able syscall)

1 and 2 burn CPU continuously (2 slightly more efficiently), 3 burns CPU unless the system is fully loaded, and 4 requires a wake-up (and is what I was referring to).
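
As a rough way to measure the cost of option 4, here is a sketch that ping-pongs between two goroutines locked to different OS threads; each round trip can include futex-style wake-ups, though the runtime's spinning can hide part of the cost on an idle machine:

```go
// Rough microbenchmark of cross-thread wakeup cost (option 4): a channel
// handoff between goroutines pinned to different OS threads. When the
// receiver's thread is asleep, the send must wake it. Numbers are
// machine-dependent, and runtime spinning can absorb some of the cost.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	const n = 100000
	ping := make(chan struct{})
	pong := make(chan struct{})

	go func() {
		runtime.LockOSThread() // pin the echo side to its own thread
		for i := 0; i < n; i++ {
			<-ping
			pong <- struct{}{}
		}
	}()

	runtime.LockOSThread() // pin this side too
	start := time.Now()
	for i := 0; i < n; i++ {
		ping <- struct{}{}
		<-pong
	}
	fmt.Println("per round trip:", time.Since(start)/n)
}
```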

@CannibalVox (Author) commented Feb 13, 2023

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed

Could you be more specific about what you mean here? The main options I can think of here are:

  1. Busy loop
  2. Busy loop with PAUSE instruction
  3. Loop calling sched_yield (or equivalent syscall)
  4. Block in futex (or other wake-able syscall)

1 and 2 burn CPU continuously (2 slightly more efficiently), 3 burns CPU unless the system is fully loaded, and 4 requires a wake-up (and is what I was referring to).

I understand now. I guess the faster wakeups in Go primitives are due to the fact that the P stays in motion continuously.

It's safe to say that the design as written won't work, then, but that mainly pushes me toward the alternatives. As you identified, waking and sleeping a thread with every syscall is fairly untenable. Go is in a state right now where network communication on Windows has massive performance issues because it does exactly that. Having one thread burn CPU per P is unacceptable, but having one thread total do it, plus others for short periods at the tail end of a burst, is not. The current situation is fairly dire.

@prattmic (Member)

I certainly agree that the bad cases of syscall churn could use improvement. I haven't had a chance to look closely at #58336, but it seems like that provides a good example case.

@prattmic prattmic added the NeedsInvestigation label Feb 14, 2023
@nitrix commented Apr 11, 2024

Sorry to chime in, I'll try to be the voice of others:

Problem statement

My use case is a cgo call to glfw.SwapBuffers(), a wrapper for the C function glfwSwapBuffers() from a common graphics API. When VSync is enabled, it internally blocks to synchronize with the user's window compositor / monitor refresh rate. That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS and not starve the goroutines causes a brutal stutter that can skip 1 to 5 frames, which is very noticeable in a soft real-time application.
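
For context, the render loop has the usual main-thread-locked shape; this sketch assumes go-gl/glfw-style bindings (import path and method names approximate):

```go
// Sketch of the render loop in question, assuming go-gl/glfw-style
// bindings. With vsync on, the SwapBuffers cgo call blocks, and that
// block is what trips the scheduler's syscall slow path.
package main

import (
	"runtime"

	"github.com/go-gl/glfw/v3.3/glfw"
)

func init() {
	// GLFW must run on the main OS thread, so the blocking call below
	// always happens on a locked thread.
	runtime.LockOSThread()
}

func main() {
	if err := glfw.Init(); err != nil {
		panic(err)
	}
	defer glfw.Terminate()

	window, err := glfw.CreateWindow(800, 600, "demo", nil, nil)
	if err != nil {
		panic(err)
	}
	window.MakeContextCurrent()
	glfw.SwapInterval(1) // enable vsync: SwapBuffers now blocks

	for !window.ShouldClose() {
		// ... draw the frame ...
		window.SwapBuffers() // blocks up to a frame; this is the stall
		glfw.PollEvents()
	}
}
```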

Trying to understand

If I'm understanding the problem correctly, then I'm even more confused, because this is happening on the main thread in an OS-locked goroutine via runtime.LockOSThread(), so nothing else can run on that thread anyway. There are no other goroutines on that thread that can be starved of work, so what are we yielding to here?

Even then, I think a special case tolerating one core being blocked momentarily is fairly reasonable when you have other cores available; they'll just do a bit more work. I'd understand if that happened to all cores, but one should be acceptable, and it's so common that I'm surprised it's not handled differently. The workers are work-stealing, aren't they?

Maybe another option is to be able to mark the function as "blocking", and that'd be fine. That's actually desirable for some people. We just need an escape hatch somewhere; there's zero control currently.

Closing words

Anyway, I know soft real-time isn't a priority for the Go team. You'd think garbage collection would be the primary blocker for soft real-time, but it isn't; the GC is great. This single issue with cgo and scheduling is what has actually plagued so many before me: Docker, SQLite, CockroachDB, Dqlite [1], etc.

[1] https://dqlite.io/docs/explanation/faq#why-c-7

@mknyszek (Contributor) commented Apr 11, 2024

That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS and not starve the goroutines causes a brutal stutter that can skip 1 to 5 frames, which is very noticeable in a soft real-time application.

Oof. That sounds frustrating. (I assume by 20ns you meant 20µs?) I encourage you to file a new issue so that your specific case can be discussed in more detail. Having a separate issue filed for this will be useful when looking at scheduler issues holistically.

I will say that I don't think this is going to be a very easy issue to resolve (happy to be wrong, though). There's a fundamental mismatch between the model expected by graphics libraries and the model of execution Go presents. In Go, all goroutines (locked to an OS thread or not) are treated equally and are anonymous. This interacts poorly with graphics libraries that care a lot about which thread does what. LockOSThread makes calling into graphics libraries possible, but it doesn't resolve the mismatch.

FWIW, releasing the P isn't just about maintaining GOMAXPROCS (in fact, it kind of doesn't, if the thread ends up doing a whole bunch of CPU-bound work for a long time). It's about being able to schedule goroutines cooperatively. If the P were never released from a goroutine that called into C, the Go runtime couldn't do a whole bunch of important things (for example, stop all goroutines), because it can't preempt or cooperatively interact with C code. It must be the case that the C code, upon returning to Go, blocks until the Go code is allowed to run again.

If I'm understanding the problem correctly, then I'm even more confused, because this is happening on the main thread in an OS-locked goroutine via runtime.LockOSThread(), so nothing else can run on that thread anyway. There are no other goroutines on that thread that can be starved of work, so what are we yielding to here?

Even when a goroutine is locked to an OS thread, it can still yield back into the scheduler. When it does, it puts itself on its P's run queue, starts up another thread, hands its P to that thread to run some other goroutine, and then puts its own thread to sleep. This is necessary because LockOSThread introduces a 1:1 relationship between a goroutine and an OS thread: if a goroutine locked to a thread blocks, the whole thread must block.

@nitrix commented Apr 12, 2024

I assume by 20ns you meant 20µs?

My mistake, 20µs yes.

There's a fundamental mismatch between the model expected by graphics libraries and the model of execution Go presents.
This interacts poorly with graphics libraries that care a lot about which thread does what.

Well, the largest mismatch I see is that we usually have a single thread with reliable timing for rendering, and then we off-load the heavier computational tasks (fluid simulation, sound, networking, file I/O, etc.) asynchronously onto the remaining cores. Which one does what often doesn't much matter. Go's scheduler would actually improve on a lot of the homebrewed schedulers you see in engines, fully utilising the remaining cores and keeping their workload evenly distributed.

I think that's the fundamental mismatch: Go insists on messing with the main thread. Specifically, a locked thread.

Beyond the 1:1 mapping for the integrity of thread-local storage, a locked thread comes with the guarantee that nothing else runs on it. Go should be able to leverage this. The thread is dedicated to that one and only task; if it wants to block, that's fine, let it block. The asynchronous workload is elsewhere, and there are spare Ps for it.

If the P was never released off a goroutine that called into C, then the Go runtime couldn't do a whole bunch of important things (for example, stop all goroutines)

Would it help Go's scheduler if we could hint that a given Cgo call will not mutate Go's memory, nor callback from C to Go?

Because in this case if it knew that the C call was "safe", the already blocked goroutine could stay blocked, the remaining goroutines could be stopped and the GC can happily STW without being worried about mutators. No need for the strange G/M/P dance.

I'm assuming some check is needed in case C returns prior to the GC finishing, but that seems somewhat doable. The conservative approach that C and the GC can't execute concurrently seems overly restrictive here.

I'll add that games are also written with great care to avoid generating garbage in the hot path. They pace themselves pretty nicely, and I haven't personally seen (with GODEBUG=gctrace=1) any forced GC due to outpacing the collector and running out of memory. Maybe Go could delay running the GC just a bit until we've returned from C land.

I will say that I don't think this is going to be a very easy issue to resolve

I support the idea of a "simple on the surface, complex underneath" language, but having something like cgo with no mechanism for C and Go to express what's safe and what isn't makes it hard for them to co-exist. I want Go and C to play nicely together.

"All Go Ms will now consist of a primary thread and a syscall thread."

This used to be my biggest gripe: the FFI cost. I'm hoping Vox's proposal will make it a lot cheaper to call into C without all the Go baggage (thanks to the dedicated syscall threads), and I'm hoping that somewhere in that process, someone finds a way to let C (inside locked OS threads) block without compromising Go.

Then the performance problem goes away entirely.

@ianlancetaylor (Contributor)

Just a note that we should soon have #cgo nocallback support. See #56378. I don't know how much it will help this case.
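
For reference, a sketch of how the annotation is written per that proposal (the directive shipped in a later Go release; treat the exact spelling as per the final cgo docs):

```go
// Sketch of the #cgo nocallback annotation from #56378: it promises the
// named C function never calls back into Go, letting the runtime skip
// some cgo re-entry bookkeeping. Spelling per the proposal; consult the
// cgo documentation for the released form.
package main

/*
#cgo nocallback blockBriefly
static void blockBriefly(void) {
	// e.g. wait for vsync here; never calls into Go
}
*/
import "C"

func main() {
	C.blockBriefly()
}
```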

@CannibalVox (Author) commented Apr 12, 2024

My use case is a cgo call to glfw.SwapBuffers(), a wrapper for the C function glfwSwapBuffers() from a common graphics API. When VSync is enabled, it internally blocks to synchronize with the user's window compositor / monitor refresh rate. That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS and not starve the goroutines causes a brutal stutter that can skip 1 to 5 frames, which is very noticeable in a soft real-time application.

This seems unexpected to me. Context switches are slow, but they're microseconds slow, not milliseconds slow. If you're on Windows, be aware that Windows launches processes with the default timer granularity of roughly 15.6ms, which applies to native code as well; that could be the issue you're encountering, if SwapBuffers is waiting on a timer. You can work around this by making the traditional DLL calls to reduce it to 1ms, as sketched below.
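
A sketch of those calls via winmm.dll (Windows-only; the setting is process-wide and raises power usage, so restore it when done):

```go
//go:build windows

// Sketch of the "traditional DLL calls": winmm.dll's timeBeginPeriod and
// timeEndPeriod, which lower the Windows timer granularity from the
// default (~15.6ms) to 1ms for the whole process.
package main

import "syscall"

var (
	winmm           = syscall.NewLazyDLL("winmm.dll")
	timeBeginPeriod = winmm.NewProc("timeBeginPeriod")
	timeEndPeriod   = winmm.NewProc("timeEndPeriod")
)

func main() {
	timeBeginPeriod.Call(1)     // request 1ms timer resolution
	defer timeEndPeriod.Call(1) // restore the default on exit

	// ... run the latency-sensitive loop ...
}
```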

@mknyszek (Contributor) commented Apr 15, 2024

@CannibalVox is right, and I think I was overly pessimistic in my previous message. The fact that you get multiple frame drops is really significant, and you might be running into a performance bug or corner case. I wouldn't expect that from the syscall/cgo slow path for reentering Go, unless the scheduler is really overloaded with CPU-bound goroutines.

@nitrix Please do file a new issue so we can track it. Please also include the following information:

Also, I think I may have introduced some misunderstandings as to how the runtime currently works. I made the mistake of assuming the root cause of your issue, and didn't properly consider its magnitude, which doesn't match up with my expectations of how the runtime should behave. I've tried to clarify below.

Would it help Go's scheduler if we could hint that a given Cgo call will not mutate Go's memory, nor callback from C to Go?

Because in this case if it knew that the C call was "safe", the already blocked goroutine could stay blocked, the remaining goroutines could be stopped and the GC can happily STW without being worried about mutators. No need for the strange G/M/P dance.

It does help some things for sure, but keep in mind that even if the GC stops the world, it still has to force the C call returning to Go to give up its P so that it blocks before reentering Go code.

I'm assuming some check is needed in case C returns prior to the GC finishing, but that seems somewhat doable. The conservative approach that C and the GC can't execute concurrently seems overly restrictive here.

This is already how it works today: the C code keeps executing until it needs to return back to Go. At that point, the thread checks if it's allowed to run. C code is definitely allowed to execute concurrently with a STW.

I support the idea of a "simple on the surface, complex underneath" language, but having something like cgo with no mechanism for C and Go to express what's safe and what isn't makes it hard for them to co-exist. I want Go and C to play nicely together.

Agreed. Like I said at the start of this reply, I think the excerpt of mine that you quoted was a bit too pessimistic. In principle, I don't see a reason why the latencies you're seeing should be so high.

This used to be my biggest gripe: the FFI cost. I'm hoping Vox's proposal will make it a lot cheaper to call into C without all the Go baggage (thanks to the dedicated syscall threads), and I'm hoping that somewhere in that process, someone finds a way to let C (inside locked OS threads) block without compromising Go.

I think earlier in this issue @prattmic and @CannibalVox came to the conclusion that a separate syscall/C thread isn't quite the right approach to improving C/Go interop.

To quote @CannibalVox (emphasis mine):

It's safe to say that the design as written won't work, then, but that mainly pushes me toward the alternatives. As you identified, waking and sleeping a thread with every syscall is fairly untenable.

Most of the cost of cgo comes from the fact that Go code wants to be able to stop C code from returning to Go so it can maintain its own invariants. This requires synchronization on both syscall/cgo enter and exit. By having a second thread to switch to, that forces an OS-level context switch on each syscall/call to C, with one thread going to sleep so the other one can run. Currently, goroutines have their own stack, and the runtime directly switches from running on the goroutine stack to the thread stack to perform the C call or syscall (it has to anyway because Go stacks can be really small since they're growable). Go context switches are orders of magnitude cheaper than OS context switches. Also, having a second thread doesn't change the fact that upon switching back to the "Go" thread it may need to block until it's OK for Go code to run again.

(Hope is not lost; there are probably ways to make the synchronization cheaper, but it'll take time and effort to explore and implement.)
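
For a rough sense of that enter/exit bookkeeping cost in the common, non-blocking case, here is a sketch that times an empty cgo call against an empty Go call (numbers are machine-dependent; tens of nanoseconds versus about a nanosecond is typical):

```go
// Rough measurement of per-call cgo overhead: an empty C function versus
// an empty Go function. The gap is the enter/exit synchronization and
// stack switch described above; no OS context switch is involved.
package main

/*
static void empty(void) {}
*/
import "C"

import (
	"fmt"
	"time"
)

//go:noinline
func emptyGo() {}

func main() {
	const n = 1_000_000

	start := time.Now()
	for i := 0; i < n; i++ {
		C.empty()
	}
	fmt.Println("cgo call:", time.Since(start)/n)

	start = time.Now()
	for i := 0; i < n; i++ {
		emptyGo()
	}
	fmt.Println("Go call: ", time.Since(start)/n)
}
```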

But again, I think what you're experiencing may not be quite so fundamental to the design of Go, but actually just a bug or some case that isn't handled well.

Lastly, I also want to address this part of your comment:

... a locked thread comes with the guarantee that nothing else runs on it. Go should be able to leverage this. The thread is dedicated to that one and only task; if it wants to block, that's fine, let it block. The asynchronous workload is elsewhere, and there are spare Ps for it.

To be totally clear, this is true even of non-locked goroutines calling into C or making a syscall. Before a goroutine makes a syscall or enters C code, it binds itself to the thread it's currently running on for the duration of the call. Then the aforementioned switch to the thread stack occurs. Nothing kicks the goroutine off the thread; in fact, the runtime may have to spin up a new thread to run more Go code.

@nitrix commented Apr 15, 2024

Great clarification. It's on Windows, and it's a proof-of-concept 3D game engine for a fancy non-Euclidean game that has both a C and a Go implementation, to compare ease of implementation and test performance. Go is doing incredibly well except for this small stutter every other minute.

I'll follow the advice and branch off into its own tracked issue + collect a trace.

Btw, using the dangerous it-shall-not-be-named fastcgo [1] makes whatever stutter is happening with the scheduling/cgo issue go away completely. I think we all know the downsides of using it, but it is a workable (albeit unportable and not easily maintainable) solution.

[1] https://github.com/petermattis/fastcgo

@CannibalVox (Author) commented Apr 15, 2024

There have been a couple of improvements to Windows committed for Go 1.23 that may apply: improved timer granularity on Windows (might not apply to native code), and some context-switch reduction for cgo on unlocked threads (probably won't apply at all in this case). For the sake of thoroughness, consider running the code on tip to see if there are improvements.
