New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
context: clarify guidance on whether a canceled context can cause a graceful termination vs. immediate termination #54775
Comments
The sentence following the one you've quoted is:
That is the guidance we give, a context is a signal to stop, but you don't care about when it actually happens. |
If you don't care when a cancellation happens, then a context is a relatively unhelpful concept: taking this mantra to the extreme, I could completely ignore contexts, because the expectation for all non-infinite-loops/waits is that all code will eventually stop. A call to ctx, cancel := context.WithCancel(context.Background())
done := make(chan struct{})
go func() {
defer close(done)
select {
case <-ctx.Done():
case <-time.After(time.Hour):
}
}()
cancel()
<-done I expect cancel to return immediately. Importantly, I also expect done to be closed relatively immediately. If I swap the The example I commonly use is |
I agree with @twmb's feedback regarding the official guidance being rather loose. If the expectation and idiomatic approach is an immediate cancellation, as it can be seem throughout the standard library, I guess the only thing to do here is to make that more explicit in the documentation. |
No, where does your citation express or even imply an immediate return? What is unclear about "tell[ing] an operation to abandon its work" not giving any guarantees whatsoever about when a "function" will return? If you look at various examples you will notice that "immediate" often isn't even achievable, as so many times there are only certain discrete points while processing something where an operation can be aborted after it has been told so. After pondering this issue the only clarification I personally could imagine would be to add that there is no guarantee if and when an operation finally abandon its work after its context having been cancelled. It's completely dependent on a particular implementation using the context. |
I've ended up considering graceful shutdown itself to be a new workload, and this implies it requires an independently constructed context. Programmatically, the implementation is more elaborate but conceptually it's an explanation that makes sense to me, I've had success with it. |
Compare #50436. |
My mental model is: a canceled If the caller is no longer interested in the result, a function should avoid leaking work in much the same way that it would avoid leaking memory or otherwise avoid doing unnecessary work. A function should take the time to clean up after itself — close and delete temporary files and open file handles, close network connections, etc. — but should not continue to do work to compute results, which are (by definition) no longer relevant. |
Compare #52805 (comment). |
Agreed.
How does this recommendation fit in the context of a graceful shutdown vs a non-graceful shutdown? Do I have any way to reason about what chunk of time is reasonable? Per #52805, "the fact that the server needs to run for a grace period after shutdown has begun argues that a context isn't the right way to bound a server's lifetime". Am I reading this correctly that a context is not to be used for graceful shutdowns? If so, does this imply immediate shutdown? To me, "take the time to clean up after itself" implies taking the time for a graceful shutdown. Alternatively, cutting anything that is blocking is an immediate shutdown. Taking the time is graceful. Say I have a contract to satisfy and return a result within 30s. For rough safety (barring clock problems), I give myself 2s of overhead to return. I call a function in an external package with To satisfy my own 28s goal, do I have to call all context functions in a goroutine so that I am not dependent on any external API's potentially slow, but required, cleanup? Or, do I call these context-accepting functions inline, and any function that has a slow cleanup is not obeying the purpose of a context? Put another way: What is the purpose of If I as a caller cannot rely on anything related to when the sub-function will return, and I have my own timeout contract to satisfy, then what is the purpose of a context? I cancel a context, I cannot rely on when it will return, I must return myself. To satisfy this contract, I must run the sub-function in a goroutine and spin it off to die whenever, even after I return as the callee. Say I'm shutting down the program: I cancel a context. If a context is meant to return immediately, then I know to wait for the sub-function and receive a status on whether the shutdown was successful. If a context can delay for an arbitrary amount of time for cleanup, then I should not wait, otherwise I block shutdown. |
Because the contexts-derived-from-contexts scheme generates a tree:
From this perspective, |
My two cents. Once the context is canceled, all blocking operations that leverage that context will immediately fail anyway (if they are implemented in the expected way). Consequently, the only things you should do once the context is canceled are operations that are "fast" (i.e., don't require a context chan) but still necessary to avoid leaking resources, such as closing a file. This does not mean that it is "graceful" though. For example, if you have a TLS connection, you might just directly sever the underlying TCP connection instead of potentially blocking on close notify. The other thing to note is I think context is intended to be used as a per-request concept. But, say, running an HTTP server eternally in the background isn't really a request per se, so the notion of cancelling it does not translate very well. For this reason you really need explicit graceful and immediate shutdown methods. The graceful shutdown itself can be context bound, and upon cancellation it turns into an immediate shutdown. (This is what http.Server does.) |
It's not clear to me what the goal of this issue is. Do you have suggestions for different wording in the context package? It seems to me that the way that any given task task responds to context cancellation is going to be, well, context-dependent. In the mostly cooperative system that Go uses, I don't see how it could be any other way. |
Something along the lines of one of the following options: // When a context's Done channel is closed, code should exit any blocking
// function, perform any necessary non-blocking cleanup, and return
// immediately. Cancelling a context should not begin a graceful cleanup
// process. Functions should not take measurable wall time to return; callers
// rely on callees to exit promptly when a context is cancelled. // A cancelled context is a hint to functions to abandon work and exit.
// Callers should not rely on functions to return promptly. Callers that rely
// on a fast return when a context is cancelled should run the callee function
// in a goroutine and manually manage prompt exits. |
Note that in the issue proposing merging x/net/context to the stdlib, @bradfitz mentions "semantically, canceling a context is much more abrupt than that". This is the idea I would like to formalize (my first suggestion of the prior comment). I have personally refactored many chunks of code that began graceful shutdowns when a context is canceled. This is the idea that is specifically dismissed in the linked comment. There is no guidance in the existing context documentation that a cancelation should lead to abrupt exits (what I call immediate termination). |
Thanks. Both paragraphs seem to doctrinaire to me, but happy to hear other opinions. CC @Sajmani |
Pinging perhaps @rsc to provide some clarity or get this unstuck (or close as irrelevant) |
I think this is the way to go. I currently work on a web service. I started with propagating the To use the quote above: The muxer may not be interested in my result anymore – but the client at the other end of the HTTP connection still probably is. Now, I replace ctx in the handler function with If you really need a predictable timeout, you must get it from elsewhere in my opinion. (In my case, Kubernetes’ “grace period“ when sending TERM to a pod.) |
It seems the sentiment of the above comments is that the language is not interested in providing guidance on how a context cancellation should be handled. I'll close the issue. Thanks for the consideration. |
Since
context.Context
has been introduced, I've understood it such that a canceled context should cause an immediate return from a function. I see this same expectation in almost all source code I read, stdlib and otherwise. Some exceptions exist -- I think some of thego mod
code has a hardcoded 1s upper bound for some graceful termination -- but this can be seen as relatively immediate.In the context docs, the main line that hints at an immediate return is
A CancelFunc tells an operation to abandon its work
. However, nothing in the context documentation actually discusses how quickly a cancelation should be reacted to, nor in the linked context blog post, nor in that post's linked pipelines post.I've seen a few times where a canceled context is used to begin a graceful shutdown, which itself could take 30s. I've restructured this existing code to instead have a separate "BeginShutdown" method and switch any usage of contexts to immediate-return semantics. But, there remains confusion on why we shouldn't use contexts in the graceful shutdown way. At the end of the day, my recommendation usually boils down to
This isn't really great nor official guidance, though.
This issue is also a bit more important in the context of
signal.NotifyContext
, which I've seen people use to begin a graceful shutdown of their applications (rather than an immediate shutdown), and inexec.CommandContext
, where people are surprised by the underlying Kill rather than Interrupt.Is it possible to clarify whether a canceled context should cause immediate returns? Or, is it fine to begin a long graceful shutdown?
The text was updated successfully, but these errors were encountered: