Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

context: clarify guidance on whether a canceled context can cause a graceful termination vs. immediate termination #54775

Closed
twmb opened this issue Aug 30, 2022 · 19 comments
Labels
Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@twmb
Copy link
Contributor

twmb commented Aug 30, 2022

Since context.Context has been introduced, I've understood it such that a canceled context should cause an immediate return from a function. I see this same expectation in almost all source code I read, stdlib and otherwise. Some exceptions exist -- I think some of the go mod code has a hardcoded 1s upper bound for some graceful termination -- but this can be seen as relatively immediate.

In the context docs, the main line that hints at an immediate return is A CancelFunc tells an operation to abandon its work. However, nothing in the context documentation actually discusses how quickly a cancelation should be reacted to, nor in the linked context blog post, nor in that post's linked pipelines post.

I've seen a few times where a canceled context is used to begin a graceful shutdown, which itself could take 30s. I've restructured this existing code to instead have a separate "BeginShutdown" method and switch any usage of contexts to immediate-return semantics. But, there remains confusion on why we shouldn't use contexts in the graceful shutdown way. At the end of the day, my recommendation usually boils down to

  • The whole ecosystem is programmed around relatively immediate returns when a context is canceled
  • If people assume context cancelation begins a shutdown, this removes the possibility of causing a forceful shutdown

This isn't really great nor official guidance, though.

This issue is also a bit more important in the context of signal.NotifyContext, which I've seen people use to begin a graceful shutdown of their applications (rather than an immediate shutdown), and in exec.CommandContext, where people are surprised by the underlying Kill rather than Interrupt.

Is it possible to clarify whether a canceled context should cause immediate returns? Or, is it fine to begin a long graceful shutdown?

@seankhliao seankhliao added Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 30, 2022
@seankhliao
Copy link
Member

The sentence following the one you've quoted is:

A CancelFunc does not wait for the work to stop.

That is the guidance we give, a context is a signal to stop, but you don't care about when it actually happens.
If you do, you'll need something else (eg, WaitGroup, errgroup)

@seankhliao
Copy link
Member

cc @neild @Sajmani @bcmills

@twmb
Copy link
Contributor Author

twmb commented Aug 30, 2022

If you don't care when a cancellation happens, then a context is a relatively unhelpful concept: taking this mantra to the extreme, I could completely ignore contexts, because the expectation for all non-infinite-loops/waits is that all code will eventually stop.

A call to cancel() does not wait, that's true, that's the expectation. In this code:

ctx, cancel := context.WithCancel(context.Background())
done := make(chan struct{})
go func() {
    defer close(done)
    select {
    case <-ctx.Done():
    case <-time.After(time.Hour):
    }
}()
cancel()
<-done

I expect cancel to return immediately. Importantly, I also expect done to be closed relatively immediately. If I swap the select block for a function that accepts the context, the question about usage is whether I should expect done to be received from immediately or not.

The example I commonly use is net/http.Server's Shutdown method. The docs only indicate that the context's error is returned if the context is canceled. The docs do not describe how canceling the context causes an immediate return. We expect the immediate return due to how context is used in the ecosystem, and the immediate return is indeed the implementation. In this loop, srv.closeIdleConns is non-io-blocking (it checks internal state quickly under a mutex), and then lines 2813 and 2814 return immediately if the context is canceled.

@c4milo
Copy link
Member

c4milo commented Aug 30, 2022

I agree with @twmb's feedback regarding the official guidance being rather loose. If the expectation and idiomatic approach is an immediate cancellation, as it can be seem throughout the standard library, I guess the only thing to do here is to make that more explicit in the documentation.

@thediveo
Copy link

thediveo commented Aug 30, 2022

In the context docs, the main line that hints at an immediate return is "A CancelFunc tells an operation to abandon its work."

No, where does your citation express or even imply an immediate return? What is unclear about "tell[ing] an operation to abandon its work" not giving any guarantees whatsoever about when a "function" will return? If you look at various examples you will notice that "immediate" often isn't even achievable, as so many times there are only certain discrete points while processing something where an operation can be aborted after it has been told so.

After pondering this issue the only clarification I personally could imagine would be to add that there is no guarantee if and when an operation finally abandon its work after its context having been cancelled. It's completely dependent on a particular implementation using the context.

@AndrewHarrisSPU
Copy link

If people assume context cancelation begins a shutdown, this removes the possibility of causing a forceful shutdown

I've ended up considering graceful shutdown itself to be a new workload, and this implies it requires an independently constructed context. Programmatically, the implementation is more elaborate but conceptually it's an explanation that makes sense to me, I've had success with it.

@bcmills
Copy link
Contributor

bcmills commented Aug 31, 2022

This issue is also a bit more important in the context of … exec.CommandContext, where people are surprised by the underlying Kill rather than Interrupt.

Compare #50436.

@bcmills
Copy link
Contributor

bcmills commented Aug 31, 2022

My mental model is: a canceled Context indicates that the caller is no longer interested in the result of the associated computation.

If the caller is no longer interested in the result, a function should avoid leaking work in much the same way that it would avoid leaking memory or otherwise avoid doing unnecessary work. A function should take the time to clean up after itself — close and delete temporary files and open file handles, close network connections, etc. — but should not continue to do work to compute results, which are (by definition) no longer relevant.

@bcmills
Copy link
Contributor

bcmills commented Aug 31, 2022

I've seen a few times where a canceled context is used to begin a graceful shutdown, which itself could take 30s.

The example I commonly use is net/http.Server's Shutdown method.

Compare #52805 (comment).

@twmb
Copy link
Contributor Author

twmb commented Aug 31, 2022

the caller is no longer interested in the result

Agreed.

avoid leaking work ... A function should take the time to clean up after itself

How does this recommendation fit in the context of a graceful shutdown vs a non-graceful shutdown? Do I have any way to reason about what chunk of time is reasonable?

Per #52805, "the fact that the server needs to run for a grace period after shutdown has begun argues that a context isn't the right way to bound a server's lifetime".

Am I reading this correctly that a context is not to be used for graceful shutdowns? If so, does this imply immediate shutdown? To me, "take the time to clean up after itself" implies taking the time for a graceful shutdown. Alternatively, cutting anything that is blocking is an immediate shutdown. Taking the time is graceful.


Say I have a contract to satisfy and return a result within 30s. For rough safety (barring clock problems), I give myself 2s of overhead to return. I call a function in an external package with context.WithDeadline(25*time.Second). The timeout expires.

To satisfy my own 28s goal, do I have to call all context functions in a goroutine so that I am not dependent on any external API's potentially slow, but required, cleanup? Or, do I call these context-accepting functions inline, and any function that has a slow cleanup is not obeying the purpose of a context?

Put another way: What is the purpose of context.WithDeadline if I cannot rely on the deadline to actually return from the sub-function? Is the purpose strictly to indicate to the sub-function that it should abandon work, and I as a caller cannot rely on anything related to when the function will return? Or, is the purpose a little broader, and a canceled context does indicate that I can rely on some semblance of a relatively immediate return?

If I as a caller cannot rely on anything related to when the sub-function will return, and I have my own timeout contract to satisfy, then what is the purpose of a context? I cancel a context, I cannot rely on when it will return, I must return myself. To satisfy this contract, I must run the sub-function in a goroutine and spin it off to die whenever, even after I return as the callee.

Say I'm shutting down the program: I cancel a context. If a context is meant to return immediately, then I know to wait for the sub-function and receive a status on whether the shutdown was successful. If a context can delay for an arbitrary amount of time for cleanup, then I should not wait, otherwise I block shutdown.

@AndrewHarrisSPU
Copy link

AndrewHarrisSPU commented Aug 31, 2022

Do I have any way to reason about what chunk of time is reasonable?

Because the contexts-derived-from-contexts scheme generates a tree:

  • If any two context listeners listen to a shared root, they may behave as if they listen to a logical clock that ticks once
  • some arranged orders of context cancellations translate into a reliably ordered series of logical clock ticks.

From this perspective, WithTimeout or WithDeadline turn a physical clock observation into a logical one; they create checkpoints. For graceful shutdown, "graceful" might mean some fault tolerance: if a client connection is unresponsive from the beginning of shutdown out to some further checkpoint, assume the client is unreachable; if all clients are unreachable, assume instead I'm unreachable. I don't know that it's anything other engineering to work out what the sensible duration of physical time here is; the diagrammatic sense of how something is programmed IMHO is far more clear with logical checkpoints, and context is sneakily excellent for this.

@rittneje
Copy link

rittneje commented Sep 1, 2022

My two cents. Once the context is canceled, all blocking operations that leverage that context will immediately fail anyway (if they are implemented in the expected way). Consequently, the only things you should do once the context is canceled are operations that are "fast" (i.e., don't require a context chan) but still necessary to avoid leaking resources, such as closing a file. This does not mean that it is "graceful" though. For example, if you have a TLS connection, you might just directly sever the underlying TCP connection instead of potentially blocking on close notify.

The other thing to note is I think context is intended to be used as a per-request concept. But, say, running an HTTP server eternally in the background isn't really a request per se, so the notion of cancelling it does not translate very well. For this reason you really need explicit graceful and immediate shutdown methods. The graceful shutdown itself can be context bound, and upon cancellation it turns into an immediate shutdown. (This is what http.Server does.)

@ianlancetaylor
Copy link
Contributor

It's not clear to me what the goal of this issue is. Do you have suggestions for different wording in the context package?

It seems to me that the way that any given task task responds to context cancellation is going to be, well, context-dependent. In the mostly cooperative system that Go uses, I don't see how it could be any other way.

@twmb
Copy link
Contributor Author

twmb commented Sep 2, 2022

Something along the lines of one of the following options:

// When a context's Done channel is closed, code should exit any blocking
// function, perform any necessary non-blocking cleanup, and return
// immediately. Cancelling a context should not begin a graceful cleanup
// process. Functions should not take measurable wall time to return; callers
// rely on callees to exit promptly when a context is cancelled.
// A cancelled context is a hint to functions to abandon work and exit.
// Callers should not rely on functions to return promptly. Callers that rely
// on a fast return when a context is cancelled should run the callee function
// in a goroutine and manually manage prompt exits.

@seankhliao seankhliao added this to the Unplanned milestone Sep 3, 2022
@twmb
Copy link
Contributor Author

twmb commented Sep 6, 2022

Note that in the issue proposing merging x/net/context to the stdlib, @bradfitz mentions "semantically, canceling a context is much more abrupt than that". This is the idea I would like to formalize (my first suggestion of the prior comment). I have personally refactored many chunks of code that began graceful shutdowns when a context is canceled. This is the idea that is specifically dismissed in the linked comment. There is no guidance in the existing context documentation that a cancelation should lead to abrupt exits (what I call immediate termination).

@ianlancetaylor
Copy link
Contributor

Thanks. Both paragraphs seem to doctrinaire to me, but happy to hear other opinions.

CC @Sajmani

@twmb
Copy link
Contributor Author

twmb commented Nov 22, 2022

Pinging perhaps @rsc to provide some clarity or get this unstuck (or close as irrelevant)

@bronger
Copy link

bronger commented Jan 19, 2023

My mental model is: a canceled Context indicates that the caller is no longer interested in the result of the associated computation.

I think this is the way to go. I currently work on a web service. I started with propagating the ctx parameter to all non-trivial functions. The result is that when I make a rolling update in Kubernetes, client connections are forcefully severed. This is nice if you only want to get the updated service quickly into production but I think everyone will agree that one has the 2 or 3 additional seconds to make the client happy.

To use the quote above: The muxer may not be interested in my result anymore – but the client at the other end of the HTTP connection still probably is.

Now, I replace ctx in the handler function with context.Background(). This way, a term signal avoids any new connection to be made, and the client does’t even notice the update, which is very nice.

If you really need a predictable timeout, you must get it from elsewhere in my opinion. (In my case, Kubernetes’ “grace period“ when sending TERM to a pod.)

@twmb
Copy link
Contributor Author

twmb commented May 25, 2023

It seems the sentiment of the above comments is that the language is not interested in providing guidance on how a context cancellation should be handled. I'll close the issue. Thanks for the consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants