New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/http: Transport.CancelRequest is inherently racy #11013
Comments
CC @bradfitz |
This came up on the list again. https://groups.google.com/d/topic/golang-nuts/m-Ocj7YVahU/discussion I'm starting to think that it's impossible to do cancellation without races with the current interface. We might have to make a new RoundTrip interface which takes a context object and store the cancellation on there, like what http://blog.golang.org/context describes. |
As pointed out in that thread, the documentation is correct. You can't cancel it until it's been started. This code is also "racy": var mu sync.Mutex
m := map[string]string{}
go func() { mu.Lock(); m["foo"] = "bar"; mu.Unlock() }()
go func() { mu.Lock(); delete(m, "foo"); mu.Unlock() }() ... sometimes the map entry will be there and sometimes it won't, because you're not waiting to delete it until it's been added. That doesn't mean there's a bug with maps or with We could add more words if that would help, but we can't start adding contexts to APIs for cancellation because the standard library a) doesn't have them, and b) is frozen. |
Having the outcome of the map example be non-deterministic is fine, but for the Transport, the stakes are a bit higher. If you cancel a request before it has been started, RoundTrip might block indefinitely when the request does issue, leading to a leaked goroutine and a leaked net.Conn since you never end up calling resp.Body.Close() on the response. Clearly this is a punt to go1.6 issue, but still think we need to address this. The current API makes it too easy to write code that is subtly wrong. |
@bradfitz: Of course there's no bug, in the sense that the API does what is promises. But I'm saying I think it's a less useful API than it could be. Your example is not fair: the user can synchronize on the completion of an insertion into a map they own, but it's impossible for them to synchronize on RoundTrip inserting into its internal map. So it's impossible for them to ensure that a cancellation will not be lost. You can't change the existing CancelRequest function because of the API freeze, I understand. But surely there is another solution that is backwards compatible, by adding a new method? context.Context would be the "best" thing to use if designing from scratch since the community has standardized on it, but that doesn't mean the standard library can't have its own bespoke version (e.g. pass in a cancellation channel). |
@jacobsa, the use cases this was originally designed for was calling And for those concerned about If you really want to know what phase of its lifecycle a Transport's In addition to lack of contexts, it's also unfortunate that But unless there's something concrete to do here, I'm tempted to close this bug. So please propose something if you have an idea. Maybe Transport could keep track of the set of requests which have been canceled but not yet started (and then fail the request if it does come in later), and start a timer to remove that map entry in N seconds? But what is N? |
@bradfitz: Thanks for the background. I see, it makes a lot more sense when you view it from the lens of wanting to cancel once you've already received the response headers and are still reading the body. I came at it from "wanting to cancel at any time". I'm sure you're aware that setting a timeout up-front is less satisfying than being able to cancel at an arbitrary time, especially when you're trying to bridge a context.Context API with net/http. But it's still helpful of course, since it bounds your latency from "context is cancelled" to "I actually return from my function that calls into net/http". Still, it has the same trouble you mentioned: how do I choose the timeout? Here are two concrete proposals:
I personally like (2). The cost is that it would require net/http to maintain a background goroutine with a list of cancellation channels for in-flight requests (if any had ever been made), calling reflect.Select to wait for one of them to be closed or for a request to be finished. But this is what users have to do today if they want to support cancellation in the context.Context style, and they would still need to do it if (1) were chosen. |
I kinda like (2) also. But: what is its type? a) And if (b), who can close it? You wouldn't want multiple goroutines calling close on it. One might panic then. What about |
I think Regarding
Odd to be using the deprecated API to implement the new one internally, but it's safe because you can synchronize on the insert into the map. Put another way: if the old API weren't already promised, it could just be an implementation detail. |
@bradfitz: What do you think of my previous comment? If you're good with the idea, I'm happy to send a CL. |
I don't care much about the implementation. What's easiest or most efficient. I think both the easiest and most efficient is selecting from the *Request.Cancel Please send a CL, ideally soon. The tree is increasingly frozen. |
CL https://golang.org/cl/11601 mentions this issue. |
Due to golang/go#11013 race conditions in the remote testcases mean that HTTP requests aren't actually canceled. In client.go::httpDo() the goroutine that actually runs the http.Do() may not execute until after that routine's select{} loop, at which point the Context may have already been canceled. Unfortunatel with Go < 1.5 there is a race in the http.Transport package that drops cancellations on the floor. Work around the Go http.Transport cancellation issue by checking a private cancellation variable, since context.Context has no mechanism to check whether it has already been canceled. This can be removed once Flannel can depend on Go 1.5, at which point the code should start to use http.Request.Cancel instead of calling http.transport.CancelRequest(). Second, use a WaitGroup in doTestWatch() to assert that the outstanding watch is actually canceled.
Due to golang/go#11013 race conditions in the remote testcases mean that HTTP requests aren't actually canceled. In client.go::httpDo() the goroutine that actually runs the http.Do() may not execute until after that routine's select{} loop, at which point the Context may have already been canceled. Unfortunatel with Go < 1.5 there is a race in the http.Transport package that drops cancellations on the floor. Work around the Go http.Transport cancellation issue by checking a private cancellation variable, since context.Context has no mechanism to check whether it has already been canceled. This can be removed once Flannel can depend on Go 1.5, at which point the code should start to use http.Request.Cancel instead of calling http.transport.CancelRequest(). Second, use a WaitGroup in doTestWatch() to assert that the outstanding watch is actually canceled.
Due to golang/go#11013 race conditions in the remote testcases mean that HTTP requests aren't actually canceled. In client.go::httpDo() the goroutine that actually runs the http.Do() may not execute until after that routine's select{} loop, at which point the Context may have already been canceled. Unfortunatel with Go < 1.5 there is a race in the http.Transport package that drops cancellations on the floor. Work around the Go http.Transport cancellation issue by checking a private cancellation variable, since context.Context has no mechanism to check whether it has already been canceled. This can be removed once Flannel can depend on Go 1.5, at which point the code should start to use http.Request.Cancel instead of calling http.transport.CancelRequest(). Second, use a WaitGroup in doTestWatch() to assert that the outstanding watch is actually canceled. This issue can be reproduced by using the remote_test.go hunks but not the client.go hunks. Specifically, doTestWatch() calls cancel() before watch.go::WatchLeases() calls sm.WatchLeases(), which in this case is RemoteManager.WatchLeases(), which then calls RemoteManager.httpGet() -> httpDo(), which then calls http.Client.Do() which has the aformentioned cancellation race.
Due to golang/go#11013 race conditions in the remote testcases mean that HTTP requests aren't actually canceled. In client.go::httpDo() the goroutine that actually runs the http.Do() may not execute until after that routine's select{} loop, at which point the Context may have already been canceled. Unfortunately with Go < 1.5 there is a race in the http.Transport package that drops cancellations on the floor. This issue can be reproduced by using the remote_test.go hunks that use a WaitGroup to wait for subnet.WatchLeases() to exit after the Context is canceled, but not the client.go hunks. Specifically, doTestWatch() calls cancel() before watch.go::WatchLeases() calls sm.WatchLeases(), which in this case is RemoteManager.WatchLeases(), which then calls RemoteManager.httpGet() -> httpDo(), which then calls http.Client.Do() which has the aformentioned cancellation race. Work around the Go http.Transport cancellation issue by checking a private cancellation variable, since context.Context has no mechanism to check whether it has already been canceled. This can be removed once Flannel can depend on Go 1.5, at which point the code should start to use http.Request.Cancel instead of calling http.transport.CancelRequest(). This obviously doesn't fix the race, but I think that has to wait for Go 1.5. Second, use a WaitGroup in doTestWatch() to assert that the outstanding watch is actually canceled.
Due to golang/go#11013 race conditions in the remote testcases mean that HTTP requests aren't actually canceled. In client.go::httpDo() the goroutine that actually runs the http.Do() may not execute until after that routine's select{} loop, at which point the Context may have already been canceled. Unfortunately with Go < 1.5 there is a race in the http.Transport package that drops cancellations on the floor. This issue can be reproduced by using the remote_test.go hunks that use a WaitGroup to wait for subnet.WatchLeases() to exit after the Context is canceled, but not the client.go hunks. Specifically, doTestWatch() calls cancel() before watch.go::WatchLeases() calls sm.WatchLeases(), which in this case is RemoteManager.WatchLeases(), which then calls RemoteManager.httpGet() -> httpDo(), which then calls http.Client.Do() which has the aformentioned cancellation race. To fix the races more completely (though not as completely as Go 1.5 will do) we have to copy most of http.Transport into the tree and implement some cancellation fixups ourselves. A few parts of the new Transport are notable: 1) CancelRequest() - if the request is not yet being tracked, the cancelation is saved in another map for later 2) getConn() - the real work is done here; when adding a new request to the internal table, check if it the client has asked to cancel it, and actually cancel it. Second if we are going to use an idle connection for the request, and the request has already been canceled, then cancel the idle connection too. Second, use a WaitGroup in doTestWatch() to assert that the outstanding watch is actually canceled.
The implementation of http.Transport.CancelRequest looks for the supplied request struct in a map of in-flight requests, doing nothing if it doesn't exist. If I understand correctly, that means that this snippet is racy, in the sense that cancellation of the context may be lost if it is seen before RoundTrip gets around to sticking the request in the map:
This is made more likely to happen by RoundTripper implementations like oauth2.Transport, which may make network calls before calling through to RoundTrip. In the end I'm forced to cancel in a hacky loop to ensure that the HTTP package pays attention.
Is there anything that can be done to eliminate this race?
The text was updated successfully, but these errors were encountered: